Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.

LLM Construction

Plan-then-Generate

Beyond strict left-to-right token generation: plan what to say before producing tokens. Outline-then-fill, hierarchical generation, and multi-token prediction. Planning reduces incoherence in long outputs and enables structural editing.

Research · Tier 3 · Frontier · ~45 min

Why This Matters

Standard autoregressive language models generate text one token at a time, left to right, with no lookahead. Each token is conditioned on everything before it and nothing after it. This is a strong architectural constraint that creates a fundamental problem: the model must commit to early tokens before knowing what will come later.

For short outputs, this is fine. For long outputs (essays, code, proofs, stories), the lack of planning leads to incoherence, contradictions, and structural problems that cannot be fixed by making the model larger. Humans do not write this way. Humans plan outlines, draft sections, revise, and restructure. Plan-then-generate methods attempt to give language models similar capabilities.

Mental Model

Think of autoregressive generation as writing a novel by starting at the first word and never looking back. You might produce locally fluent text, but the plot will wander, characters will be forgotten, and the ending will not connect to the beginning. Plan-then-generate is more like writing with an outline: first decide the structure (sections, key points, argument flow), then fill in each section with awareness of the whole plan.

The fundamental tension: autoregressive models are trained to maximize P(x_t \mid x_{<t}), which is a local objective. Coherence is a global property. Planning bridges this gap by introducing an intermediate representation that captures global structure.

Formal Setup and Notation

Let x = (x_1, \ldots, x_T) be a token sequence. Standard autoregressive generation factors P(x) = \prod_{t=1}^T P(x_t \mid x_{<t}). Plan-then-generate introduces a latent plan z and factors as:

P(x) = \sum_z P(z) P(x \mid z) = \sum_z P(z) \prod_{t=1}^T P(x_t \mid x_{<t}, z)

The plan z can be an outline, a set of key points, a tree structure, or a sequence of future token predictions.
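To make the factorization concrete, here is a toy discrete instance of the marginal P(x) = \sum_z P(z) \prod_t P(x_t \mid x_{<t}, z): two latent plans, a two-token vocabulary, and made-up conditional probabilities. All numbers are illustrative, not from any trained model.

```python
import itertools

P_z = {"A": 0.6, "B": 0.4}  # prior over plans

def p_token(x_t, history, z):
    # Plan A favors token 0, plan B favors token 1 (history unused in this toy).
    p0 = 0.9 if z == "A" else 0.2
    return p0 if x_t == 0 else 1.0 - p0

def p_x(x):
    # Marginalize out the latent plan z.
    total = 0.0
    for z, pz in P_z.items():
        p = pz
        for t, x_t in enumerate(x):
            p *= p_token(x_t, x[:t], z)
        total += p
    return total

# A valid distribution: probabilities over all length-2 sequences sum to 1.
mass = sum(p_x(x) for x in itertools.product([0, 1], repeat=2))
```

Note that even this tiny example shows the cost of the latent variable: computing P(x) exactly requires summing over every plan, which is why practical systems commit to a single sampled plan instead of marginalizing.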

Definition

Planning as Latent Variable

A plan is a latent variable z that captures the high-level structure of the output before token-level generation begins. The generative process is:

  1. Sample or construct a plan: z \sim P(z \mid \text{prompt})
  2. Generate tokens conditioned on the plan: x \sim P(x \mid z, \text{prompt})

The plan can be discrete (an outline with bullet points), continuous (a sequence of embeddings), or hierarchical (a tree of increasingly detailed specifications).

Key Approaches

Definition

Outline-then-Fill

Outline-then-fill generates a structured outline first, then fills in each section independently (or with cross-attention to other sections). The outline acts as a discrete plan that constrains the global structure.

Example: for a 5-paragraph essay, first generate 5 topic sentences, then expand each into a full paragraph conditioned on all topic sentences. This ensures the essay has coherent global structure even though each paragraph is generated locally.
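The essay example above can be sketched in a few lines. This is a minimal sketch assuming a hypothetical `llm(prompt) -> str` completion function; `stub_llm` below is a placeholder so the code runs without a real model.

```python
def outline_then_fill(llm, topic, n_sections=5):
    # Stage 1: the discrete plan -- one topic sentence per section.
    outline = [
        llm(f"Topic sentence {i + 1} of {n_sections} for an essay on {topic}")
        for i in range(n_sections)
    ]
    # Stage 2: expand each point, conditioning on the WHOLE outline so every
    # paragraph is generated with awareness of the global structure.
    plan = "\n".join(outline)
    paragraphs = [
        llm(f"Outline:\n{plan}\nExpand point {i + 1} into a full paragraph.")
        for i in range(n_sections)
    ]
    return "\n\n".join(paragraphs)

def stub_llm(prompt):
    # Placeholder: echoes the last line of the prompt instead of generating.
    return prompt.splitlines()[-1]

essay = outline_then_fill(stub_llm, "planning", n_sections=3)
```

The key design choice is in stage 2: each expansion call sees the entire outline, not just its own topic sentence, which is what lets paragraphs reference and avoid contradicting each other.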

Definition

Hierarchical Generation

Hierarchical generation produces output at multiple levels of granularity. Level 0: generate a high-level summary or skeleton. Level 1: expand each element of the skeleton into a detailed outline. Level 2: expand each outline item into full text. Each level conditions on the entire output of the previous level.

This mirrors how humans write complex documents: thesis statement, then section headers, then paragraphs, then sentences.
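The level-by-level expansion can be sketched as a simple loop. As before, `llm(prompt) -> str` is a hypothetical completion function and `stub_llm` is a placeholder; the level names are illustrative.

```python
def hierarchical_generate(llm, topic,
                          levels=("one-sentence thesis",
                                  "section outline",
                                  "full text")):
    # Each level expands the ENTIRE output of the previous level, so finer
    # levels always condition on the complete coarser plan.
    current = topic
    trace = []
    for level in levels:
        current = llm(f"Expand the following into a {level}:\n{current}")
        trace.append(current)
    return current, trace

def stub_llm(prompt):
    # Placeholder: tags the input with a marker instead of generating.
    return prompt.splitlines()[-1] + " +"

final, trace = hierarchical_generate(stub_llm, "planning")
```

Because every pass rewrites the whole document, cost grows with the number of levels; real systems typically expand each skeleton element separately while still attending to the full previous level.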

Definition

Multi-Token Prediction

Multi-token prediction trains the model to predict not just the next token x_{t+1} but the next k tokens (x_{t+1}, \ldots, x_{t+k}) simultaneously. The training loss is:

\mathcal{L}_k = -\sum_{t=1}^{T-k} \sum_{j=1}^{k} \log P_j(x_{t+j} \mid x_{\leq t})

where P_j is the j-th prediction head. This forces the model to build internal representations that capture future context, acting as an implicit planning mechanism.
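A toy implementation of this loss makes the indexing explicit. Here `logits[t, j-1]` stands in for head j's output at position t; a real model would produce these from a shared trunk rather than take them as input.

```python
import numpy as np

def multi_token_loss(logits, tokens, k):
    # L_k = -sum_t sum_{j=1..k} log P_j(x_{t+j} | x_{<=t}).
    # logits: [T, k, V] array, tokens: [T] array of token ids.
    T, heads, V = logits.shape
    assert heads == k
    # Softmax over the vocabulary axis (shift by max for stability).
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    loss = 0.0
    for t in range(T - k):          # positions with k future tokens available
        for j in range(1, k + 1):   # head j predicts the token at t + j
            loss -= np.log(probs[t, j - 1, tokens[t + j]])
    return loss

# Sanity check: with uniform logits every prediction has probability 1/V,
# so the loss is (T - k) * k * log(V).
T, k, V = 6, 2, 4
loss = multi_token_loss(np.zeros((T, k, V)), np.arange(T) % V, k)
```

Only the summation structure is the point here; in training, the j = 1 head is the standard next-token loss and the extra heads supply the future-aware gradient signal.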

Main Theorems

Proposition

Planning Reduces Coherence Error

Statement

Consider generating a sequence of T tokens where coherence requires maintaining consistency between tokens at distance d > w (the effective context window for coherence, which may be shorter than the model's context length due to attention dilution). Without planning, the expected number of coherence violations grows as \Omega(T / w). With a plan that summarizes global structure in O(\log T) tokens, coherence violations reduce to O(1) when the plan is attended to throughout generation.

Intuition

Without a plan, the model must maintain all global constraints in its hidden state, which degrades with distance. With a plan, global constraints are explicitly encoded in a compact representation that the model can attend to at every step. The plan acts as a "global memory" that prevents the model from contradicting earlier decisions.

Proof Sketch

Model the generation process as a sequence of decisions. Each decision at position t must be consistent with decisions at positions t - d for d > w. Without a plan, consistency depends on information propagating through d/w intermediate steps, each with probability p < 1 of preserving the constraint. Total success probability decays as p^{d/w}. With a plan, each decision directly accesses the global constraint, so consistency holds with probability independent of d.
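The decay in the sketch is easy to see numerically. The values of p and w below are illustrative, not measured from any model.

```python
p, w = 0.9, 16  # per-hop preservation probability, coherence window (made up)

def consistency_without_plan(d):
    # Constraint must survive d/w noisy propagation steps.
    return p ** (d / w)

def consistency_with_plan(d):
    # One direct read of the plan, independent of distance.
    return p

near = consistency_without_plan(64)    # 0.9^4, still fairly high
far = consistency_without_plan(1024)   # 0.9^64, essentially zero
```

At distance 64 the constraint survives roughly two thirds of the time; at distance 1024 it is almost certainly lost, while the planned variant stays at p regardless of distance.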

Why It Matters

This formalizes the intuition that autoregressive models degrade for long outputs. It also explains why chain-of-thought prompting helps: the "chain" acts as a lightweight plan that keeps global context accessible.

Failure Mode

The plan itself must be correct. If the plan is incoherent or underspecified, conditioned generation inherits its flaws. Planning shifts the problem from "generate coherent text" to "generate a coherent plan," which may be easier but is not trivially solved.

Proposition

Multi-Token Prediction Improves Representation Quality

Statement

A model trained with k-token prediction must learn representations at layer l that are sufficient for predicting tokens x_{t+1}, \ldots, x_{t+k} given x_{\leq t}. By the data processing inequality, these representations capture at least as much information about the future as representations trained with 1-token prediction:

I(h_t^{(k)}; x_{t+1:t+k}) \geq I(h_t^{(1)}; x_{t+1:t+k})

where h_t^{(k)} is the hidden state of the k-token model and h_t^{(1)} is the hidden state of the 1-token model, assuming both achieve their respective training optima.

Intuition

If you train a model to predict 4 tokens ahead, it must encode information about longer-range structure in its hidden states. A model that only predicts 1 token ahead has no gradient signal pushing it to capture what comes 4 tokens later. Multi-token prediction is a form of self-supervised planning: the model learns to plan implicitly because planning helps predict farther ahead.

Proof Sketch

The k-token prediction loss includes the 1-token loss as a component. At the optimum, the hidden state h_t^{(k)} must be sufficient for each of the k prediction heads. Since mutual information is monotone under sufficient statistics, the hidden state captures weakly more information about the future than the 1-token hidden state.

Why It Matters

Multi-token prediction has been shown empirically to improve coding performance disproportionately (Gloeckle et al., 2024). This makes theoretical sense: code has strict long-range dependencies (matching braces, variable references, type constraints) that benefit from representations encoding future structure.

Failure Mode

Larger k is not always better. For very large k, predicting far-ahead tokens from current context becomes noisy and may degrade learning of local patterns. There is an optimal k that balances local precision with global awareness, and it depends on the data distribution.

Current Research Directions

Planning in language models is an active research area with several threads:

Explicit planning via search. Systems like Tree-of-Thought and Graph-of-Thought use the model itself to generate and evaluate multiple candidate plans before committing to generation. This is expensive but effective for reasoning tasks.

Learned planning tokens. Some approaches train the model to produce special "planning tokens" that are not part of the output but influence the hidden state. These tokens act as a learned, continuous plan that the model constructs before generating visible output.

Diffusion-based text generation. Instead of generating left-to-right, diffusion models generate all tokens simultaneously and refine them iteratively. This naturally allows global planning because all positions are updated together.

Revision and editing. Rather than getting the output right in one pass, allow the model to generate a draft and then revise. This decomposes planning into initial generation (fast, possibly incoherent) and revision (fixing global structure).

Common Confusions

Watch Out

Chain-of-thought is not the same as planning

Chain-of-thought prompting produces intermediate reasoning steps that are part of the output. Planning produces a structure that guides generation but may not appear in the final output. Chain-of-thought is a special case where the plan is exposed, but true planning can use latent representations that are never shown to the user.

Watch Out

Multi-token prediction is not speculative decoding

Multi-token prediction trains multiple prediction heads during training to improve representation quality. Speculative decoding uses a draft model at inference time to speed up generation. They are complementary: a model trained with multi-token prediction can also use speculative decoding for faster inference.

Summary

  • Autoregressive generation has no lookahead, causing incoherence for long outputs
  • Planning introduces a latent structure that guides token-level generation
  • Outline-then-fill: explicit discrete plans, then conditioned generation
  • Multi-token prediction: implicit planning via future-aware representations
  • Planning reduces coherence violations from O(T/w) to O(1)
  • This is still a research-stage area; no dominant paradigm has emerged yet

Exercises

ExerciseCore

Problem

Explain why a standard autoregressive model can produce locally fluent but globally incoherent text. Give a concrete example where planning would help.

ExerciseAdvanced

Problem

Multi-token prediction with k = 4 heads produces 4 logit distributions at each position. At inference time, you can only emit tokens one at a time. How would you use the extra heads during inference, and what advantage does this provide over a model trained with k = 1?

ExerciseResearch

Problem

Design a plan-then-generate training procedure for code generation. What should the plan contain? How would you obtain plan-code pairs for training? How would you evaluate whether planning improves over standard autoregressive generation?

References

Canonical:

  • Yao et al., "Tree of Thoughts: Deliberate Problem Solving with Large Language Models" (2023)
  • Gloeckle et al., "Better & Faster Large Language Models via Multi-Token Prediction" (2024)

Current:

  • Meta AI, "Multi-Token Prediction" (2024), internal representations and code performance
  • Li et al., "Pre-Writing and Pre-Planning for Long-Form Generation" (2023)


Last reviewed: April 2026
