LLM Construction
Decoding Strategies
How language models select output tokens: greedy decoding, beam search, temperature scaling, top-k sampling, and nucleus (top-p) sampling. The tradeoffs between coherence, diversity, and quality.
Why This Matters
A language model outputs a probability distribution over the vocabulary at each step, computed via the softmax function. The decoding strategy determines how to select the next token from this distribution. This choice has a large effect on output quality. Greedy decoding produces repetitive, degenerate text. Pure random sampling produces incoherent text. Every deployed LLM uses a carefully tuned decoding strategy between these extremes.
Understanding decoding strategies requires understanding how temperature, top-k, and top-p reshape the output distribution, and why these reshapings improve generation quality.
Formal Setup
Let $V$ be the vocabulary with $|V|$ tokens. At each time step $t$, the model produces logits $z_t \in \mathbb{R}^{|V|}$. The probability of token $i$ is:

$$P(x_t = i \mid x_{<t}) = \frac{\exp(z_{t,i})}{\sum_{j=1}^{|V|} \exp(z_{t,j})}$$
A decoding strategy is a rule that maps this distribution to a token selection.
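The softmax step above can be sketched in plain Python (a minimal illustration of the vocabulary distribution, not a production implementation):

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    # Subtract the max logit for numerical stability; this does not
    # change the resulting probabilities.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.0, -1.0])
# Entries sum to 1, and the largest logit gets the largest probability.
```

Every decoding strategy in this section starts from a distribution like `probs` and differs only in how it selects a token from it.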
Core Definitions
Greedy Decoding
Select the highest-probability token at each step:

$$x_t = \arg\max_{i \in V} P(x_t = i \mid x_{<t})$$
This is a deterministic procedure. The same input always produces the same output.
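A greedy step reduces to an argmax over the distribution, as in this small sketch:

```python
def greedy_step(probs):
    """Greedy decoding step: return the index of the highest-probability token."""
    return max(range(len(probs)), key=lambda i: probs[i])

# Deterministic: the same distribution always yields the same token.
choice = greedy_step([0.1, 0.6, 0.3])
```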
Beam Search
Maintain a set of $B$ partial sequences (the beam). At each step, extend each sequence by all $|V|$ tokens, score the resulting $B \cdot |V|$ candidates by their cumulative log-probabilities, and keep only the top $B$:

$$\text{score}(x_{1:t}) = \sum_{s=1}^{t} \log P(x_s \mid x_{<s})$$

Greedy decoding is beam search with $B = 1$.
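The procedure can be sketched as follows; `step_probs` is a hypothetical stand-in for the model, returning next-token probabilities for a partial sequence:

```python
import math

def beam_search(step_probs, beam_width, length):
    """Beam search sketch over a toy next-token distribution function."""
    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(length):
        candidates = []
        for seq, score in beams:
            for tok, p in enumerate(step_probs(seq)):
                if p > 0:
                    candidates.append((seq + [tok], score + math.log(p)))
        # Keep only the top-scoring beam_width candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

# Toy "model": a fixed distribution regardless of context.
toy = lambda seq: [0.5, 0.3, 0.2]
best_seq, best_score = beam_search(toy, beam_width=2, length=3)[0]
```

With `beam_width=1` this reduces to greedy decoding, matching the statement above.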
Temperature Scaling
Divide logits by a temperature $T > 0$ before applying softmax:

$$P_T(i) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

$T = 1$: the original distribution. $T < 1$: sharpens the distribution (higher peaks, lower tails). $T > 1$: flattens the distribution (more uniform). $T \to 0$: converges to greedy decoding. $T \to \infty$: converges to uniform sampling.
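A minimal sketch of temperature scaling (plain Python, illustrative logits):

```python
import math

def softmax_with_temperature(logits, T):
    """Softmax after dividing every logit by the temperature T."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.0, 0.0, -1.0]
sharp = softmax_with_temperature(logits, 0.5)  # peakier than T = 1
flat = softmax_with_temperature(logits, 5.0)   # closer to uniform
```

Note that scaling changes the gaps between probabilities but never their ordering.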
Top-k Sampling
Sample from only the $k$ most probable tokens. Zero out all other probabilities and renormalize:

$$P'(i) = \begin{cases} P(i) \,/\, \sum_{j \in V_k} P(j) & \text{if } i \in V_k \\ 0 & \text{otherwise} \end{cases}$$

where $V_k$ is the set of $k$ tokens with highest probability.
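The truncate-and-renormalize step can be sketched directly (a minimal illustration over an example distribution):

```python
def top_k_filter(probs, k):
    """Keep only the k most probable tokens, zero the rest, renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    filtered = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]

restricted = top_k_filter([0.5, 0.2, 0.15, 0.1, 0.05], k=2)
# Only the two most probable tokens survive, rescaled to sum to 1.
```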
Nucleus (Top-p) Sampling
Sample from the smallest set of tokens whose cumulative probability exceeds $p$:

$$V_p = \text{the smallest set with } \sum_{i \in V_p} P(i) \geq p$$

where tokens are added in decreasing order of probability. Renormalize and sample from $V_p$. Unlike top-k, the number of candidate tokens adapts to the shape of the distribution.
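A sketch of the nucleus truncation step (illustrative distributions, not model outputs):

```python
def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p (added in decreasing probability order), then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    filtered = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(filtered)
    return [q / total for q in filtered]

# Confident model: one token carries almost all the mass,
# so the nucleus collapses to a single candidate.
peaked = top_p_filter([0.95, 0.02, 0.02, 0.01], p=0.9)
```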
Main Theorems
Temperature Limits of the Softmax Distribution
Statement
Let $z \in \mathbb{R}^n$ be a logit vector with a unique maximum at index $i^*$. Define $P_T(i) = \exp(z_i/T) \,/\, \sum_j \exp(z_j/T)$. Then:
- $\lim_{T \to 0^+} P_T(i^*) = 1$ (point mass on the argmax)
- $\lim_{T \to \infty} P_T(i) = 1/n$ for every $i$ (uniform distribution)
- The entropy $H(P_T)$ (as defined in information theory) is strictly increasing in $T$.
Intuition
Temperature controls the peakedness of the distribution. As $T \to 0$, only the most likely token has nonzero probability: this is greedy decoding. As $T \to \infty$, all tokens are equally likely: this is uniform random sampling. The entropy monotonicity means temperature smoothly interpolates between deterministic and maximally random behavior.
Proof Sketch
For (1): divide numerator and denominator by $\exp(z_{i^*}/T)$. Each remaining ratio $\exp((z_j - z_{i^*})/T) \to 0$ for $j \neq i^*$, since $z_j - z_{i^*} < 0$ and $T \to 0^+$. For (2): $z_i / T \to 0$ for all $i$, so $\exp(z_i/T) \to 1$ and $P_T(i) \to 1/n$. For (3): the scaled logits $z/T$ have strictly decreasing variance as $T$ increases, and the entropy of a softmax distribution is strictly increasing when the logit variance decreases.
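The limiting behavior can be checked numerically on a small logit vector (an illustrative sanity check, not a proof):

```python
import math

def entropy_at_temperature(logits, T):
    """Shannon entropy of the temperature-scaled softmax distribution."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    probs = [e / s for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.0, -1.0]
# Entropy grows with T: near 0 as T -> 0, near log(4) as T -> infinity.
values = [entropy_at_temperature(logits, T) for T in (0.1, 1.0, 10.0, 1000.0)]
```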
Why It Matters
This result formalizes the temperature-quality tradeoff. Low temperature produces high-confidence, repetitive outputs. High temperature produces diverse but potentially incoherent outputs. Practitioners tune $T$ to find the sweet spot. This same temperature mechanism appears in knowledge distillation, where soft targets at elevated temperature transfer information between models.
Failure Mode
If multiple tokens share the maximum logit value, the limit is a uniform distribution over the tied maxima, not a point mass. In practice, exact ties are rare with floating-point logits. The result also assumes a finite vocabulary; it does not apply to continuous distributions.
Why Greedy Decoding Fails
Greedy decoding selects the locally optimal token at each step. This does not yield the globally most probable sequence. The most probable sequence under the model is:

$$x_{1:N}^* = \arg\max_{x_{1:N}} \prod_{t=1}^{N} P(x_t \mid x_{<t})$$

Greedy decoding solves a different problem: it picks $\arg\max_i P(x_t = i \mid x_{<t})$ at each step independently.
The key failure mode is repetition. Once a greedy decoder begins repeating a phrase, the conditional probability of continuing the repetition is often higher than the probability of breaking out. The model assigns high probability to "The cat sat on the mat. The cat sat on the mat. The cat..." because each repetition is locally plausible given the context.
Beam search mitigates this somewhat (the globally best sequence often avoids loops), but beam search with large $B$ tends to produce generic, high-frequency text rather than diverse or interesting text.
Why Sampling Helps
Sampling introduces randomness that breaks repetition loops. But pure sampling from the full distribution includes low-probability tokens that produce incoherent output. The tail of a language model distribution often contains tokens that are syntactically or semantically incompatible with the context.
Top-k and top-p truncate this tail:
- Top-k sets a fixed cutoff. The problem: when the model is confident (one token has probability 0.95), $k = 50$ still samples from 49 unlikely tokens. When the model is uncertain (flat distribution), a fixed $k$ may cut off reasonable alternatives.
- Top-p adapts the cutoff to the distribution shape. If one token has probability 0.95 and $p = 0.9$, only that one token is sampled. If the top 100 tokens each have probability 0.009 and $p = 0.9$, all 100 are candidates. This adaptive behavior is why top-p generally outperforms top-k in practice.
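The adaptive behavior is easy to demonstrate: count how many tokens the nucleus contains for a peaked versus a flat distribution (the distributions below are illustrative, chosen for exact arithmetic):

```python
def nucleus_count(probs, p):
    """Number of tokens nucleus sampling keeps from a distribution."""
    cumulative, n = 0.0, 0
    for q in sorted(probs, reverse=True):
        cumulative += q
        n += 1
        if cumulative >= p:
            break
    return n

peaked = [0.95] + [0.05 / 49] * 49  # confident model: one dominant token
flat = [1 / 32] * 32                # uncertain model: uniform over 32 tokens
confident_n = nucleus_count(peaked, 0.9)  # nucleus collapses to 1 token
uncertain_n = nucleus_count(flat, 0.9)    # nucleus keeps most of the tokens
```

A fixed top-k would keep the same number of candidates in both cases; the nucleus shrinks and grows with the model's confidence.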
Canonical Examples
Temperature effect on a 4-token vocabulary
Suppose the logits are $(2, 1, 0, -1)$. The softmax probabilities at different temperatures:
At $T = 0.5$: scaled logits become $(4, 2, 0, -2)$, giving probabilities $(0.865, 0.117, 0.016, 0.002)$. The top token dominates.
At $T = 1$: probabilities $(0.644, 0.237, 0.087, 0.032)$. More spread out.
At $T = 5$: scaled logits become $(0.4, 0.2, 0, -0.2)$, giving probabilities $(0.329, 0.270, 0.221, 0.181)$. Nearly uniform.
Lower temperature concentrates probability on the top token. Higher temperature spreads probability across all tokens.
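A worked example like this can be reproduced with a short script (assuming illustrative logits $(2, 1, 0, -1)$):

```python
import math

def softmax_t(logits, T):
    """Temperature-scaled softmax, rounded to 3 decimals for display."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [round(e / s, 3) for e in exps]

logits = [2.0, 1.0, 0.0, -1.0]
cold = softmax_t(logits, 0.5)   # top token dominates
base = softmax_t(logits, 1.0)
warm = softmax_t(logits, 5.0)   # nearly uniform
```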
Common Confusions
Top-p is not the same as top-k with adaptive k
While top-p can be described as choosing an adaptive number of tokens, the selection criterion is structurally different. Top-k always selects exactly $k$ tokens regardless of the distribution. Top-p selects the minimum set exceeding a cumulative probability threshold. In a near-uniform distribution over 50,000 tokens, top-p with $p = 0.9$ might include 45,000 tokens. Top-k with $k = 50$ would include exactly 50.
Temperature does not change the ranking of tokens
Dividing all logits by the same $T > 0$ preserves their ordering. The most probable token under $P_T$ is the same as under $P_1$. Temperature only changes the gaps between probabilities. Greedy decoding gives the same result regardless of temperature.
Beam search does not find the most probable sequence
Beam search with beam width $B$ is an approximation. The exact most probable sequence requires searching over $|V|^N$ possibilities (exponential in the sequence length $N$). Beam search prunes this to $B$ candidates per step. For large $B$, beam search tends toward the mode of the distribution, which often produces bland, generic text rather than the most informative or useful text.
Key Takeaways
- Greedy decoding is deterministic and produces repetitive, degenerate text
- Beam search improves over greedy but tends toward generic outputs
- Temperature controls distribution sharpness: low is conservative, high is random
- Top-k sampling truncates the tail but uses a fixed cutoff regardless of confidence
- Top-p (nucleus) sampling adapts the cutoff to the distribution shape
- In practice, deployed LLMs typically combine temperature scaling with top-p sampling
- The temperature-quality tradeoff is the central design decision in text generation
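The combined pipeline from the takeaways can be sketched end to end; the values $T = 0.7$ and $p = 0.9$ below are illustrative choices, not universal defaults:

```python
import math
import random

def sample_next_token(logits, T=0.7, p=0.9, rng=random):
    """Sketch of a typical pipeline: temperature scaling followed by
    nucleus (top-p) sampling."""
    # 1. Temperature-scaled softmax.
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    probs = [e / s for e in exps]
    # 2. Nucleus truncation: smallest set reaching cumulative probability p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    # 3. Sample a token id from the truncated, reweighted candidates.
    weights = [probs[i] for i in keep]
    return rng.choices(keep, weights=weights)[0]

token = sample_next_token([2.0, 1.0, 0.0, -1.0])
```

With these logits the nucleus contains only the top two tokens, so the low-probability tail is never sampled.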
Exercises
Problem
A language model produces logits $(2, 1, 0)$ over a 3-token vocabulary. Compute the softmax probabilities at $T = 0.5$ and $T = 2$. Which token does greedy decoding select in each case?
Problem
With the distribution $(0.4, 0.3, 0.15, 0.1, 0.05)$, which tokens are included in top-k sampling with $k = 2$? Which tokens are included in top-p sampling with $p = 0.8$?
Problem
Prove that for a softmax distribution with finite logits (not all equal), the entropy $H(P_T)$ is strictly increasing in $T$ for $T > 0$.
References
Canonical:
- Holtzman et al., "The Curious Case of Neural Text Degeneration" (2020), Sections 1-4
- Fan, Lewis, Dauphin, "Hierarchical Neural Story Generation" (2018)
Current:
- Meister, Cotterell, Vieira, "Locally Typical Sampling" (2023)
- Welleck et al., "Neural Text Generation with Unlikelihood Training" (2020)
- Renze & Guven, "The Effect of Sampling Temperature on Problem Solving in Large Language Models" (2024)
- Jurafsky & Martin, Speech and Language Processing (3rd ed., draft), Chapters 7-12
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Transformer Architecture (Layer 4)
- Attention Mechanism Theory (Layer 4)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Softmax and Numerical Stability (Layer 1)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in R^n (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)