LLM Construction
Plan-then-Generate
Beyond strict left-to-right token generation: plan what to say before producing tokens. Outline-then-fill, hierarchical generation, and multi-token prediction. Planning reduces incoherence in long outputs and enables structural editing.
Why This Matters
Standard autoregressive language models generate text one token at a time, left to right, with no lookahead. Each token is conditioned on everything before it and nothing after it. This is a strong architectural constraint that creates a fundamental problem: the model must commit to early tokens before knowing what will come later.
For short outputs, this is fine. For long outputs (essays, code, proofs, stories), the lack of planning leads to incoherence, contradictions, and structural problems that cannot be fixed by making the model larger. Humans do not write this way. Humans plan outlines, draft sections, revise, and restructure. Plan-then-generate methods attempt to give language models similar capabilities.
Mental Model
Think of autoregressive generation as writing a novel by starting at the first word and never looking back. You might produce locally fluent text, but the plot will wander, characters will be forgotten, and the ending will not connect to the beginning. Plan-then-generate is more like writing with an outline: first decide the structure (sections, key points, argument flow), then fill in each section with awareness of the whole plan.
The fundamental tension: autoregressive models are trained to maximize $\sum_t \log p(x_t \mid x_{<t})$, which is a local objective. Coherence is a global property. Planning bridges this gap by introducing an intermediate representation that captures global structure.
Formal Setup and Notation
Let $x = (x_1, \dots, x_T)$ be a token sequence. Standard autoregressive generation factors $p(x) = \prod_{t=1}^{T} p(x_t \mid x_{<t})$. Plan-then-generate introduces a latent plan $z$ and factors as:

$$p(x) = \sum_{z} p(z) \prod_{t=1}^{T} p(x_t \mid x_{<t}, z)$$

The plan $z$ can be an outline, a set of key points, a tree structure, or a sequence of future token predictions.
Planning as Latent Variable
A plan $z$ is a latent variable that captures the high-level structure of the output before token-level generation begins. The generative process is:
- Sample or construct a plan: $z \sim p(z \mid c)$, where $c$ is the prompt or context
- Generate tokens conditioned on the plan: $x_t \sim p(x_t \mid x_{<t}, z)$
The plan can be discrete (an outline with bullet points), continuous (a sequence of embeddings), or hierarchical (a tree of increasingly detailed specifications).
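This two-step generative process can be made concrete with a small runnable sketch. A hand-built toy "model" stands in for a real LLM here: the plan is a discrete outline (a list of section topics), and every token is sampled conditioned on the full plan, not only on the tokens emitted so far. All names (`PLANS`, `TOPIC_WORDS`, `sample_plan`) are illustrative, not a real API.

```python
import random

# Toy plan-then-generate sketch. A "plan" z is a list of section topics;
# generation conditions every token on the whole plan.

PLANS = {
    "essay": ["intro", "argument", "conclusion"],
}

TOPIC_WORDS = {
    "intro": ["first", "we", "begin"],
    "argument": ["because", "evidence", "shows"],
    "conclusion": ["therefore", "we", "conclude"],
}

def sample_plan(prompt: str) -> list[str]:
    """z ~ p(z | c): pick a discrete outline for the prompt."""
    return PLANS[prompt]

def generate(prompt: str, rng: random.Random) -> list[str]:
    """x_t ~ p(x_t | x_<t, z): every token sees the whole plan z."""
    plan = sample_plan(prompt)
    tokens = []
    for section in plan:          # global structure is fixed by the plan
        for _ in range(3):        # local, left-to-right filling
            tokens.append(rng.choice(TOPIC_WORDS[section]))
    return tokens

print(generate("essay", random.Random(0)))
```

The point of the toy: the global ordering (intro before conclusion) is guaranteed by the plan, while the token-level choices stay local and stochastic.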
Key Approaches
Outline-then-Fill
Outline-then-fill generates a structured outline first, then fills in each section independently (or with cross-attention to other sections). The outline acts as a discrete plan that constrains the global structure.
Example: for a 5-paragraph essay, first generate 5 topic sentences, then expand each into a full paragraph conditioned on all topic sentences. This ensures the essay has coherent global structure even though each paragraph is generated locally.
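The essay example above can be sketched as a two-pass procedure. `llm` below is a hypothetical text-completion call, stubbed out so the control flow is runnable; in practice it would be a real model invocation.

```python
# Outline-then-fill sketch. `llm` is a stand-in for a real model call.

def llm(prompt: str) -> str:
    # Stub: a real implementation would query a language model.
    return f"<completion for: {prompt[:40]}...>"

def outline_then_fill(topic: str, n_sections: int = 5) -> list[str]:
    # Pass 1: generate the discrete plan (one topic sentence per section).
    outline = [
        llm(f"Topic sentence {i + 1} of {n_sections} for an essay on {topic}")
        for i in range(n_sections)
    ]
    # Pass 2: expand each section conditioned on the WHOLE outline,
    # so every paragraph is written with the global structure in view.
    plan_text = " | ".join(outline)
    return [
        llm(f"Expand topic sentence '{sent}' given the full outline: {plan_text}")
        for sent in outline
    ]

paragraphs = outline_then_fill("planning in LLMs")
print(len(paragraphs))
```

Note the design choice in pass 2: each expansion prompt includes all topic sentences, which is what lets a locally generated paragraph avoid contradicting sections it would otherwise never see.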
Hierarchical Generation
Hierarchical generation produces output at multiple levels of granularity. Level 0: generate a high-level summary or skeleton. Level 1: expand each element of the skeleton into a detailed outline. Level 2: expand each outline item into full text. Each level conditions on the entire output of the previous level.
This mirrors how humans write complex documents: thesis statement, then section headers, then paragraphs, then sentences.
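The level-by-level expansion can be sketched as a loop in which each level is produced by expanding every element of the previous one. `expand` is a hypothetical model call; the stub below ignores its `context` argument, whereas a real implementation would condition on it.

```python
# Hierarchical generation sketch: each level expands every element of the
# previous level, conditioning on the previous level's full output.

def expand(item: str, context: list[str]) -> list[str]:
    # Stub: a real call would refine `item` into sub-items conditioned
    # on the whole previous level (`context`); the stub ignores context.
    return [f"{item}.{k}" for k in range(2)]

def hierarchical_generate(thesis: str, levels: int = 3) -> list[str]:
    level = [thesis]                  # level 0: the skeleton
    for _ in range(levels):
        nxt = []
        for item in level:
            nxt.extend(expand(item, context=level))
        level = nxt                   # level i+1 conditions on all of level i
    return level

print(hierarchical_generate("T"))     # 2^3 = 8 leaf items
```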
Multi-Token Prediction
Multi-token prediction trains the model to predict not just the next token but the next $n$ tokens simultaneously. The training loss is:

$$\mathcal{L} = -\sum_{t} \sum_{k=1}^{n} \log p_{\theta_k}(x_{t+k} \mid x_{\le t})$$

where $p_{\theta_k}$ is the $k$-th prediction head. This forces the model to build internal representations that capture future context, acting as an implicit planning mechanism.
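The loss above can be computed directly for toy data. The NumPy sketch below uses $n$ linear output heads over a shared trunk hidden state; all shapes and names are illustrative.

```python
import numpy as np

# Multi-token prediction loss sketch: n prediction heads share a trunk
# hidden state h_t, and head k predicts the token at offset k.

rng = np.random.default_rng(0)
V, T, n, d = 10, 8, 4, 16          # vocab size, seq len, heads, hidden dim

tokens = rng.integers(0, V, size=T)
h = rng.normal(size=(T, d))        # trunk hidden states h_1..h_T (toy values)
W = rng.normal(size=(n, d, V))     # one linear output head per offset k

def mtp_loss(h, tokens, W):
    """L = -sum_t sum_k log p_k(x_{t+k} | h_t), averaged over terms."""
    total, count = 0.0, 0
    for t in range(len(tokens)):
        for k in range(1, n + 1):
            if t + k >= len(tokens):
                continue                       # no target past sequence end
            logits = h[t] @ W[k - 1]
            logp = logits - np.log(np.sum(np.exp(logits)))  # log-softmax
            total += -logp[tokens[t + k]]
            count += 1
    return total / count

print(round(mtp_loss(h, tokens, W), 3))
```

The key structural point: every head reads the *same* hidden state `h[t]`, so the gradient from far-ahead targets flows into the shared representation.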
Main Theorems
Planning Reduces Coherence Error
Statement
Consider generating a sequence of $T$ tokens where coherence requires maintaining consistency between tokens at distance up to $d$ (the effective context window for coherence, which may be shorter than the model's context length due to attention dilution). Without planning, the expected number of coherence violations grows as $\Theta(T)$: nearly every sufficiently long-range constraint is eventually violated. With a plan that summarizes global structure in $m \ll T$ tokens, coherence violations reduce to $O(\epsilon T)$, where $\epsilon$ is the per-step error rate, when the plan is attended to throughout generation.
Intuition
Without a plan, the model must maintain all global constraints in its hidden state, which degrades with distance. With a plan, global constraints are explicitly encoded in a compact representation that the model can attend to at every step. The plan acts as a "global memory" that prevents the model from contradicting earlier decisions.
Proof Sketch
Model the generation process as a sequence of $T$ decisions. Each decision at position $t$ must be consistent with decisions at positions $t - k$ for $k \le d$. Without a plan, consistency depends on information propagating through $k$ intermediate steps, each with probability $1 - \epsilon$ of preserving the constraint. Total success probability decays as $(1 - \epsilon)^k$. With a plan, each decision directly accesses the global constraint, so consistency holds with probability $1 - \epsilon$ independent of $k$.
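The decay in the sketch is easy to check numerically. The snippet below compares constraint survival probabilities with and without a plan for a toy per-step error rate (the value of the error rate is an arbitrary illustration, not from the source).

```python
import numpy as np

# Constraint at distance k survives k propagation steps without a plan
# (probability (1 - eps)^k), but only one "lookup" step with a plan
# (probability 1 - eps, independent of k). eps is a toy value.

eps = 0.02
ks = np.array([1, 10, 50, 200])

p_no_plan = (1 - eps) ** ks                  # decays toward 0 with distance
p_plan = np.full_like(p_no_plan, 1 - eps)    # constant in distance

for k, a, b in zip(ks, p_no_plan, p_plan):
    print(f"k={k:4d}  no-plan={a:.3f}  plan={b:.3f}")
```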
Why It Matters
This formalizes the intuition that autoregressive models degrade for long outputs. It also explains why chain-of-thought prompting helps: the "chain" acts as a lightweight plan that keeps global context accessible.
Failure Mode
The plan itself must be correct. If the plan is incoherent or underspecified, conditioned generation inherits its flaws. Planning shifts the problem from "generate coherent text" to "generate a coherent plan," which may be easier but is not trivially solved.
Multi-Token Prediction Improves Representation Quality
Statement
A model trained with $n$-token prediction must learn representations at each layer that are sufficient for predicting tokens $x_{t+1}, \dots, x_{t+n}$ given $x_{\le t}$. By the data processing inequality, these representations capture at least as much information about the future as representations trained with 1-token prediction:

$$I\big(h_t^{(n)};\, x_{t+1:t+n}\big) \ge I\big(h_t^{(1)};\, x_{t+1:t+n}\big)$$

where $h_t^{(n)}$ is the hidden state of the $n$-token model and $h_t^{(1)}$ is the hidden state of the 1-token model, assuming both achieve their respective training optima.
Intuition
If you train a model to predict 4 tokens ahead, it must encode information about longer-range structure in its hidden states. A model that only predicts 1 token ahead has no gradient signal pushing it to capture what comes 4 tokens later. Multi-token prediction is a form of self-supervised planning: the model learns to plan implicitly because planning helps predict farther ahead.
Proof Sketch
The $n$-token prediction loss includes the 1-token loss as a component. At the optimum, the hidden state $h_t^{(n)}$ must be sufficient for each of the $n$ prediction heads. Since mutual information is monotone under sufficient statistics, the hidden state $h_t^{(n)}$ captures weakly more information about the future than the 1-token hidden state $h_t^{(1)}$.
Why It Matters
Multi-token prediction has been shown empirically to improve coding performance disproportionately (Gloeckle et al., 2024). This makes theoretical sense: code has strict long-range dependencies (matching braces, variable references, type constraints) that benefit from representations encoding future structure.
Failure Mode
Larger $n$ is not always better. For very large $n$, predicting far-ahead tokens from current context becomes noisy and may degrade learning of local patterns. There is an optimal $n$ that balances local precision with global awareness, and it depends on the data distribution.
Current Research Directions
Planning in language models is an active research area with several threads:
Explicit planning via search. Systems like Tree-of-Thought and Graph-of-Thought use the model itself to generate and evaluate multiple candidate plans before committing to generation. This is expensive but effective for reasoning tasks.
Learned planning tokens. Some approaches train the model to produce special "planning tokens" that are not part of the output but influence the hidden state. These tokens act as a learned, continuous plan that the model constructs before generating visible output.
Diffusion-based text generation. Instead of generating left-to-right, diffusion models generate all tokens simultaneously and refine them iteratively. This naturally allows global planning because all positions are updated together.
Revision and editing. Rather than getting the output right in one pass, allow the model to generate a draft and then revise. This decomposes planning into initial generation (fast, possibly incoherent) and revision (fixing global structure).
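The draft-then-revise decomposition can be sketched as a simple loop. `draft` and `revise` are hypothetical model calls, stubbed here so the control flow is runnable; a real reviser would edit the draft for global coherence.

```python
# Draft-then-revise sketch: generate a fast (possibly incoherent) draft,
# then iteratively revise it for global structure.

def draft(prompt: str) -> str:
    # Stub for a fast first-pass generation call.
    return f"draft({prompt})"

def revise(text: str) -> str:
    # Stub for a revision call; here it just marks one revision pass.
    return f"revised({text})"

def generate_with_revision(prompt: str, passes: int = 2) -> str:
    text = draft(prompt)           # fast first pass
    for _ in range(passes):        # each pass can fix global structure
        text = revise(text)
    return text

print(generate_with_revision("essay on planning"))
# → revised(revised(draft(essay on planning)))
```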
Common Confusions
Chain-of-thought is not the same as planning
Chain-of-thought prompting produces intermediate reasoning steps that are part of the output. Planning produces a structure that guides generation but may not appear in the final output. Chain-of-thought is a special case where the plan is exposed, but true planning can use latent representations that are never shown to the user.
Multi-token prediction is not speculative decoding
Multi-token prediction trains multiple prediction heads during training to improve representation quality. Speculative decoding uses a draft model at inference time to speed up generation. They are complementary: a model trained with multi-token prediction can also use speculative decoding for faster inference.
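The complementarity can be sketched with a toy self-speculative step: the $n$ heads propose a block of tokens, and the 1-step head verifies each proposal in turn, keeping the accepted prefix. Everything below is an illustrative stand-in (the "model" is a deterministic toy function), not a real decoding implementation.

```python
# Toy self-speculative decoding with multi-token heads: head k proposes
# the token k steps ahead; the 1-step head verifies proposals in order
# and acceptance stops at the first mismatch.

def head(prefix: tuple, k: int) -> int:
    # Stub model: a deterministic function of the prefix; head k looks
    # k steps ahead.
    return (sum(prefix) + k) % 7

def speculative_step(prefix: tuple, n: int = 4) -> tuple:
    proposals = [head(prefix, k) for k in range(1, n + 1)]
    accepted = []
    for tok in proposals:
        # Verify with the 1-step head on the growing prefix.
        if head(prefix + tuple(accepted), 1) == tok:
            accepted.append(tok)
        else:
            break                  # first mismatch ends the accepted run
    if not accepted:
        accepted = [head(prefix, 1)]   # always make one step of progress
    return prefix + tuple(accepted)

print(speculative_step((1, 2, 3)))
```

In a real system the verifier is the model's own next-token distribution, so multiple tokens can be accepted per forward pass while preserving the 1-step model's output distribution.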
Summary
- Autoregressive generation has no lookahead, causing incoherence for long outputs
- Planning introduces a latent structure that guides token-level generation
- Outline-then-fill: explicit discrete plans, then conditioned generation
- Multi-token prediction: implicit planning via future-aware representations
- Planning reduces coherence violations from $\Theta(T)$ to $O(\epsilon T)$
- This is mostly a research-stage area with no dominant paradigm yet
Exercises
Problem
Explain why a standard autoregressive model can produce locally fluent but globally incoherent text. Give a concrete example where planning would help.
Problem
Multi-token prediction with $n = 4$ heads produces 4 logit distributions at each position. At inference time, you can only emit tokens one at a time. How would you use the extra heads during inference, and what advantage does this provide over a model trained with $n = 1$?
Problem
Design a plan-then-generate training procedure for code generation. What should the plan contain? How would you obtain plan-code pairs for training? How would you evaluate whether planning improves over standard autoregressive generation?
References
Canonical:
- Yao et al., "Tree of Thoughts: Deliberate Problem Solving with Large Language Models" (2023)
- Gloeckle et al., "Better & Faster Large Language Models via Multi-Token Prediction" (2024)
Current:
- Meta AI, "Multi-Token Prediction" (2024), internal representations and code performance
- Li et al., "Pre-Writing and Pre-Planning for Long-Form Generation" (2023)
Next Topics
Natural extensions from plan-then-generate:
- Inference systems overview: how planning fits into broader LLM deployment
- Context engineering: managing context windows for effective planning
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Transformer Architecture (Layer 4)
- Attention Mechanism Theory (Layer 4)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Softmax and Numerical Stability (Layer 1)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in R^n (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)