Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.

LLM Construction

Plan-then-Generate

Beyond strict left-to-right token generation: plan what to say before producing tokens. Outline-then-fill, hierarchical generation, and multi-token prediction. Planning reduces incoherence in long outputs and enables structural editing.

Research · Tier 3 · Frontier · ~45 min

Why This Matters

Standard autoregressive language models generate text one token at a time, left to right, with no lookahead. Each token is conditioned on everything before it and nothing after it. This is a strong architectural constraint that creates a fundamental problem: the model must commit to early tokens before knowing what will come later.

For short outputs, this is fine. For long outputs (essays, code, proofs, stories), the lack of planning leads to incoherence, contradictions, and structural problems that cannot be fixed by making the model larger. Humans do not write this way. Humans plan outlines, draft sections, revise, and restructure. Plan-then-generate methods attempt to give language models similar capabilities.

Mental Model

Think of autoregressive generation as writing a novel by starting at the first word and never looking back. You might produce locally fluent text, but the plot will wander, characters will be forgotten, and the ending will not connect to the beginning. Plan-then-generate is more like writing with an outline: first decide the structure (sections, key points, argument flow), then fill in each section with awareness of the whole plan.

The fundamental tension: autoregressive models are trained to maximize P(x_t \mid x_{<t}), which is a local objective. Coherence is a global property. Planning bridges this gap by introducing an intermediate representation that captures global structure.

Formal Setup and Notation

Let x = (x_1, \ldots, x_T) be a token sequence. Standard autoregressive generation factors P(x) = \prod_{t=1}^T P(x_t \mid x_{<t}). Plan-then-generate introduces a latent plan z and factors as:

P(x) = \sum_z P(z) P(x \mid z) = \sum_z P(z) \prod_{t=1}^T P(x_t \mid x_{<t}, z)

The plan z can be an outline, a set of key points, a tree structure, or a sequence of future token predictions.
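To make the factorization concrete, here is a toy discrete instance of the marginal P(x) = \sum_z P(z) \prod_t P(x_t \mid x_{<t}, z): two latent plans, a two-token vocabulary, and made-up conditional probabilities. All numbers are illustrative, not from any trained model.

```python
import itertools

P_z = {"A": 0.6, "B": 0.4}  # prior over plans

def p_token(x_t, history, z):
    # Plan A favors token 0, plan B favors token 1 (history unused in this toy).
    p0 = 0.9 if z == "A" else 0.2
    return p0 if x_t == 0 else 1.0 - p0

def p_x(x):
    # Marginalize out the latent plan z.
    total = 0.0
    for z, pz in P_z.items():
        p = pz
        for t, x_t in enumerate(x):
            p *= p_token(x_t, x[:t], z)
        total += p
    return total

# A valid distribution: probabilities over all length-2 sequences sum to 1.
mass = sum(p_x(x) for x in itertools.product([0, 1], repeat=2))
```

Note that even this tiny example shows the cost of the latent variable: computing P(x) exactly requires summing over every plan, which is why practical systems commit to a single sampled plan instead of marginalizing.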

Definition

Planning as Latent Variable

A plan is a latent variable z that captures the high-level structure of the output before token-level generation begins. The generative process is:

  1. Sample or construct a plan: z \sim P(z \mid \text{prompt})
  2. Generate tokens conditioned on the plan: x \sim P(x \mid z, \text{prompt})

The plan can be discrete (an outline with bullet points), continuous (a sequence of embeddings), or hierarchical (a tree of increasingly detailed specifications).

Key Approaches

Definition

Outline-then-Fill

Outline-then-fill generates a structured outline first, then fills in each section independently (or with cross-attention to other sections). The outline acts as a discrete plan that constrains the global structure.

Example: for a 5-paragraph essay, first generate 5 topic sentences, then expand each into a full paragraph conditioned on all topic sentences. This ensures the essay has coherent global structure even though each paragraph is generated locally.
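The essay example above can be sketched in a few lines. This is a minimal sketch assuming a hypothetical `llm(prompt) -> str` completion function; `stub_llm` below is a placeholder so the code runs without a real model.

```python
def outline_then_fill(llm, topic, n_sections=5):
    # Stage 1: the discrete plan -- one topic sentence per section.
    outline = [
        llm(f"Topic sentence {i + 1} of {n_sections} for an essay on {topic}")
        for i in range(n_sections)
    ]
    # Stage 2: expand each point, conditioning on the WHOLE outline so every
    # paragraph is generated with awareness of the global structure.
    plan = "\n".join(outline)
    paragraphs = [
        llm(f"Outline:\n{plan}\nExpand point {i + 1} into a full paragraph.")
        for i in range(n_sections)
    ]
    return "\n\n".join(paragraphs)

def stub_llm(prompt):
    # Placeholder: echoes the last line of the prompt instead of generating.
    return prompt.splitlines()[-1]

essay = outline_then_fill(stub_llm, "planning", n_sections=3)
```

The key design choice is in stage 2: each expansion call sees the entire outline, not just its own topic sentence, which is what lets paragraphs reference and avoid contradicting each other.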

Definition

Hierarchical Generation

Hierarchical generation produces output at multiple levels of granularity. Level 0: generate a high-level summary or skeleton. Level 1: expand each element of the skeleton into a detailed outline. Level 2: expand each outline item into full text. Each level conditions on the entire output of the previous level.

This mirrors how humans write complex documents: thesis statement, then section headers, then paragraphs, then sentences.
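The level-by-level expansion can be sketched as a simple loop. As before, `llm(prompt) -> str` is a hypothetical completion function and `stub_llm` is a placeholder; the level names are illustrative.

```python
def hierarchical_generate(llm, topic,
                          levels=("one-sentence thesis",
                                  "section outline",
                                  "full text")):
    # Each level expands the ENTIRE output of the previous level, so finer
    # levels always condition on the complete coarser plan.
    current = topic
    trace = []
    for level in levels:
        current = llm(f"Expand the following into a {level}:\n{current}")
        trace.append(current)
    return current, trace

def stub_llm(prompt):
    # Placeholder: tags the input with a marker instead of generating.
    return prompt.splitlines()[-1] + " +"

final, trace = hierarchical_generate(stub_llm, "planning")
```

Because every pass rewrites the whole document, cost grows with the number of levels; real systems typically expand each skeleton element separately while still attending to the full previous level.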

Definition

Multi-Token Prediction

Multi-token prediction trains the model to predict not just the next token x_{t+1} but the next k tokens (x_{t+1}, \ldots, x_{t+k}) simultaneously. The training loss is:

\mathcal{L}_k = -\sum_{t=1}^{T-k} \sum_{j=1}^{k} \log P_j(x_{t+j} \mid x_{\leq t})

where P_j is the j-th prediction head. This forces the model to build internal representations that capture future context, acting as an implicit planning mechanism.
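A toy implementation of this loss makes the indexing explicit. Here `logits[t, j-1]` stands in for head j's output at position t; a real model would produce these from a shared trunk rather than take them as input.

```python
import numpy as np

def multi_token_loss(logits, tokens, k):
    # L_k = -sum_t sum_{j=1..k} log P_j(x_{t+j} | x_{<=t}).
    # logits: [T, k, V] array, tokens: [T] array of token ids.
    T, heads, V = logits.shape
    assert heads == k
    # Softmax over the vocabulary axis (shift by max for stability).
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    loss = 0.0
    for t in range(T - k):          # positions with k future tokens available
        for j in range(1, k + 1):   # head j predicts the token at t + j
            loss -= np.log(probs[t, j - 1, tokens[t + j]])
    return loss

# Sanity check: with uniform logits every prediction has probability 1/V,
# so the loss is (T - k) * k * log(V).
T, k, V = 6, 2, 4
loss = multi_token_loss(np.zeros((T, k, V)), np.arange(T) % V, k)
```

Only the summation structure is the point here; in training, the j = 1 head is the standard next-token loss and the extra heads supply the future-aware gradient signal.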

Main Theorems

Proposition

Planning Reduces Coherence Error

Statement

Consider generating a sequence of T tokens where coherence requires maintaining consistency between tokens at distance d > w (the effective context window for coherence, which may be shorter than the model's context length due to attention dilution). Without planning, the expected number of coherence violations grows as \Omega(T / w). With a plan that summarizes global structure in O(\log T) tokens, coherence violations reduce to O(1) when the plan is attended to throughout generation.

Intuition

Without a plan, the model must maintain all global constraints in its hidden state, which degrades with distance. With a plan, global constraints are explicitly encoded in a compact representation that the model can attend to at every step. The plan acts as a "global memory" that prevents the model from contradicting earlier decisions.

Proof Sketch

Model the generation process as a sequence of decisions. Each decision at position t must be consistent with decisions at positions t - d for d > w. Without a plan, consistency depends on information propagating through d/w intermediate steps, each with probability p < 1 of preserving the constraint. Total success probability decays as p^{d/w}. With a plan, each decision directly accesses the global constraint, so consistency holds with probability independent of d.
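The decay in the sketch is easy to see numerically. The values of p and w below are illustrative, not measured from any model.

```python
p, w = 0.9, 16  # per-hop preservation probability, coherence window (made up)

def consistency_without_plan(d):
    # Constraint must survive d/w noisy propagation steps.
    return p ** (d / w)

def consistency_with_plan(d):
    # One direct read of the plan, independent of distance.
    return p

near = consistency_without_plan(64)    # 0.9^4, still fairly high
far = consistency_without_plan(1024)   # 0.9^64, essentially zero
```

At distance 64 the constraint survives roughly two thirds of the time; at distance 1024 it is almost certainly lost, while the planned variant stays at p regardless of distance.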

Why It Matters

This formalizes the intuition that autoregressive models degrade for long outputs. It also explains why chain-of-thought prompting helps: the "chain" acts as a lightweight plan that keeps global context accessible.

Failure Mode

The plan itself must be correct. If the plan is incoherent or underspecified, conditioned generation inherits its flaws. Planning shifts the problem from "generate coherent text" to "generate a coherent plan," which may be easier but is not trivially solved.

Proposition

Multi-Token Prediction Improves Representation Quality

Statement

A model trained with k-token prediction must learn representations at layer l that are sufficient for predicting tokens x_{t+1}, \ldots, x_{t+k} given x_{\leq t}. By the data processing inequality, these representations capture at least as much information about the future as representations trained with 1-token prediction:

I(h_t^{(k)}; x_{t+1:t+k}) \geq I(h_t^{(1)}; x_{t+1:t+k})

where h_t^{(k)} is the hidden state of the k-token model and h_t^{(1)} is the hidden state of the 1-token model, assuming both achieve their respective training optima.

Intuition

If you train a model to predict 4 tokens ahead, it must encode information about longer-range structure in its hidden states. A model that only predicts 1 token ahead has no gradient signal pushing it to capture what comes 4 tokens later. Multi-token prediction is a form of self-supervised planning: the model learns to plan implicitly because planning helps predict farther ahead.

Proof Sketch

The k-token prediction loss includes the 1-token loss as a component. At the optimum, the hidden state h_t^{(k)} must be sufficient for each of the k prediction heads. Since mutual information is monotone under sufficient statistics, the hidden state captures weakly more information about the future than the 1-token hidden state.

Why It Matters

Multi-token prediction has been shown empirically to improve coding performance disproportionately (Gloeckle et al., 2024). This makes theoretical sense: code has strict long-range dependencies (matching braces, variable references, type constraints) that benefit from representations encoding future structure.

Failure Mode

Larger k is not always better. For very large k, predicting far-ahead tokens from current context becomes noisy and may degrade learning of local patterns. There is an optimal k that balances local precision with global awareness, and it depends on the data distribution.

Current Research Directions

Planning in language models is an active research area with several threads:

Explicit planning via search. Systems like Tree-of-Thought and Graph-of-Thought use the model itself to generate and evaluate multiple candidate plans before committing to generation. This is expensive but effective for reasoning tasks.

Learned planning tokens. Some approaches train the model to produce special "planning tokens" that are not part of the output but influence the hidden state. These tokens act as a learned, continuous plan that the model constructs before generating visible output.

Diffusion-based text generation. Instead of generating left-to-right, diffusion models generate all tokens simultaneously and refine them iteratively. This naturally allows global planning because all positions are updated together.

Revision and editing. Rather than getting the output right in one pass, allow the model to generate a draft and then revise. This decomposes planning into initial generation (fast, possibly incoherent) and revision (fixing global structure).

Common Confusions

Watch Out

Chain-of-thought is not the same as planning

Chain-of-thought prompting produces intermediate reasoning steps that are part of the output. Planning produces a structure that guides generation but may not appear in the final output. Chain-of-thought is a special case where the plan is exposed, but true planning can use latent representations that are never shown to the user.

Watch Out

Multi-token prediction is not speculative decoding

Multi-token prediction trains multiple prediction heads during training to improve representation quality. Speculative decoding uses a draft model at inference time to speed up generation. They are complementary: a model trained with multi-token prediction can also use speculative decoding for faster inference.

Summary

  • Autoregressive generation has no lookahead, causing incoherence for long outputs
  • Planning introduces a latent structure that guides token-level generation
  • Outline-then-fill: explicit discrete plans, then conditioned generation
  • Multi-token prediction: implicit planning via future-aware representations
  • Planning reduces coherence violations from O(T/w) to O(1)
  • This is still a research-stage area; no dominant paradigm has emerged yet

Exercises

ExerciseCore

Problem

Explain why a standard autoregressive model can produce locally fluent but globally incoherent text. Give a concrete example where planning would help.

ExerciseAdvanced

Problem

Multi-token prediction with k = 4 heads produces 4 logit distributions at each position. At inference time, you can only emit tokens one at a time. How would you use the extra heads during inference, and what advantage does this provide over a model trained with k = 1?

ExerciseResearch

Problem

Design a plan-then-generate training procedure for code generation. What should the plan contain? How would you obtain plan-code pairs for training? How would you evaluate whether planning improves over standard autoregressive generation?

References

Canonical:

  • Yao et al., "Tree of Thoughts: Deliberate Problem Solving with Large Language Models" (2023)
  • Gloeckle et al., "Better & Faster Large Language Models via Multi-Token Prediction" (2024)

Current:

  • Meta AI, "Multi-Token Prediction" (2024), internal representations and code performance
  • Li et al., "Pre-Writing and Pre-Planning for Long-Form Generation" (2023)


Last reviewed: April 2026
