
LLM Construction

Decoding Strategies

How language models select output tokens: greedy decoding, beam search, temperature scaling, top-k sampling, and nucleus (top-p) sampling. The tradeoffs between coherence, diversity, and quality.


Why This Matters

[Figure: a sample next-token distribution ("the" 22%, "a" 15%, "one" 11%, "his" 9%, "my" 8%, "that" 7%, "this" 6%, "her" 5%, "no" 4%, "an" 4%, "its" 3%, "every" 2%, "which" 2%, "their" 1%), with each token marked as selected by top-k = 5 only, by top-p = 0.65 only, by both, or excluded.]

A language model outputs a probability distribution over the vocabulary at each step, computed via the softmax function. The decoding strategy determines how to select the next token from this distribution. This choice has a large effect on output quality. Greedy decoding produces repetitive, degenerate text. Pure random sampling produces incoherent text. Every deployed LLM uses a carefully tuned decoding strategy between these extremes.

Understanding decoding strategies requires understanding how temperature, top-k, and top-p reshape the output distribution, and why these reshapings improve generation quality.

Formal Setup

Let $V$ be the vocabulary with $|V| = v$ tokens. At each time step $t$, the model produces logits $z_t \in \mathbb{R}^v$. The probability of token $i$ is:

$$p(x_t = i \mid x_{<t}) = \text{softmax}(z_t)_i = \frac{e^{z_{t,i}}}{\sum_{j=1}^{v} e^{z_{t,j}}}$$

A decoding strategy is a rule that maps this distribution to a token selection.
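As a minimal sketch, the softmax step can be computed directly in plain Python (the `softmax` helper is illustrative; it uses the standard max-subtraction trick for numerical stability):

```python
import math

def softmax(logits):
    """Convert a logit vector into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability; result is unchanged
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.5, 0.1])
print([round(p, 2) for p in probs])  # [0.57, 0.21, 0.13, 0.09]
```

Every decoding strategy below starts from a distribution like `probs` and differs only in how it turns that distribution into a token choice.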

Core Definitions

Definition

Greedy Decoding

Select the highest-probability token at each step:

$$x_t = \arg\max_{i \in V} \, p(x_t = i \mid x_{<t})$$

This is a deterministic procedure. The same input always produces the same output.
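The selection rule is a one-liner; this illustrative helper assumes it is handed the model's probability vector at each step:

```python
def argmax_token(probs):
    """Greedy decoding step: return the index of the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])

# Deterministic: the same distribution always yields the same choice.
print(argmax_token([0.1, 0.6, 0.3]))  # 1
```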

Definition

Beam Search

Maintain a set of $k$ partial sequences (the beam). At each step, extend each sequence by all $v$ tokens, score the resulting $kv$ candidates by their log-probabilities, and keep only the top $k$:

$$\text{score}(x_{1:t}) = \sum_{s=1}^{t} \log p(x_s \mid x_{<s})$$

Greedy decoding is beam search with $k = 1$.
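The procedure can be sketched as follows. Here `step_logprobs` is a hypothetical stand-in for the model: it returns log-probabilities over the vocabulary given a prefix, and the toy example fixes a 3-token distribution regardless of context:

```python
import math

def beam_search(step_logprobs, k, length):
    """Beam search sketch: keep the k best partial sequences at each step."""
    beams = [([], 0.0)]  # (partial sequence, cumulative log-probability)
    for _ in range(length):
        candidates = []
        for seq, score in beams:
            for tok, lp in enumerate(step_logprobs(seq)):
                candidates.append((seq + [tok], score + lp))
        # Keep the k highest-scoring of the k*v candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:k]
    return beams

# Toy 3-token model: the same distribution at every step.
toy = lambda seq: [math.log(0.5), math.log(0.3), math.log(0.2)]
best_seq, best_score = beam_search(toy, k=2, length=2)[0]
print(best_seq)  # [0, 0]: here the locally best token is also globally best
```

Setting `k=1` reduces this loop to greedy decoding, matching the remark above.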

Definition

Temperature Scaling

Divide logits by temperature $T > 0$ before applying softmax:

$$p_T(x_t = i \mid x_{<t}) = \frac{e^{z_{t,i}/T}}{\sum_{j=1}^{v} e^{z_{t,j}/T}}$$

$T = 1$: the original distribution. $T < 1$: sharpens the distribution (higher peaks, lower tails). $T > 1$: flattens the distribution (more uniform). $T \to 0$: converges to greedy decoding. $T \to \infty$: converges to uniform sampling.
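Temperature scaling is a one-line change to the softmax. This sketch prints the reshaped distribution for an arbitrary 4-token logit vector at three temperatures:

```python
import math

def softmax_with_temperature(logits, T):
    """Divide logits by T, then softmax; T < 1 sharpens, T > 1 flattens."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

z = [2.0, 1.0, 0.5, 0.1]
for T in (0.5, 1.0, 2.0):
    print(T, [round(p, 2) for p in softmax_with_temperature(z, T)])
# 0.5 [0.83, 0.11, 0.04, 0.02]   <- sharpened
# 1.0 [0.57, 0.21, 0.13, 0.09]   <- original distribution
# 2.0 [0.41, 0.25, 0.19, 0.16]   <- flattened
```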

Definition

Top-k Sampling

Sample from only the $k$ most probable tokens. Zero out all other probabilities and renormalize:

$$p_{\text{top-k}}(i) = \begin{cases} p(i) / \sum_{j \in V_k} p(j) & \text{if } i \in V_k \\ 0 & \text{otherwise} \end{cases}$$

where $V_k$ is the set of $k$ tokens with highest probability.
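A direct sketch of the truncate-and-renormalize step (the helper name is illustrative):

```python
import random

def top_k_filter(probs, k):
    """Zero out all but the k most probable tokens and renormalize."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return [probs[i] / mass if i in top else 0.0 for i in range(len(probs))]

p = [0.5, 0.2, 0.15, 0.1, 0.05]
filtered = top_k_filter(p, k=3)
print([round(x, 3) for x in filtered])  # [0.588, 0.235, 0.176, 0.0, 0.0]
token = random.choices(range(len(filtered)), weights=filtered)[0]  # sample a token
```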

Definition

Nucleus (Top-p) Sampling

Sample from the smallest set of tokens whose cumulative probability meets or exceeds $p$:

$$V_p = \text{smallest } S \subseteq V \text{ such that } \sum_{i \in S} p(i) \geq p$$

where tokens are added in decreasing order of probability. Renormalize and sample from $V_p$. Unlike top-k, the number of candidate tokens adapts to the shape of the distribution.

Main Theorems

Proposition

Temperature Limits of the Softmax Distribution

Statement

Let $z \in \mathbb{R}^v$ with a unique maximum at index $i^* = \arg\max_j z_j$. Define $p_T(i) = e^{z_i/T} / \sum_j e^{z_j/T}$. Then:

  1. $\lim_{T \to 0^+} p_T(i) = \mathbf{1}[i = i^*]$ (point mass on the argmax)
  2. $\lim_{T \to \infty} p_T(i) = 1/v$ (uniform distribution)
  3. The entropy $H(p_T) = -\sum_i p_T(i) \log p_T(i)$ (as defined in information theory) is strictly increasing in $T$.

Intuition

Temperature controls the peakedness of the distribution. At $T \to 0$, only the most likely token has nonzero probability: this is greedy decoding. At $T \to \infty$, all tokens are equally likely: this is uniform random sampling. The entropy monotonicity means temperature smoothly interpolates between deterministic and maximally random behavior.

Proof Sketch

For (1): divide numerator and denominator by $e^{z_{i^*}/T}$. Each ratio $e^{(z_j - z_{i^*})/T}$ goes to $0$ for $j \neq i^*$ since $z_j - z_{i^*} < 0$ and $1/T \to \infty$. For (2): $z_j/T \to 0$ for all $j$, so $e^{z_j/T} \to 1$ and $p_T(j) \to 1/v$. For (3): the scaled logits $z/T$ have strictly decreasing variance as $T$ increases, and the entropy of a softmax distribution is strictly increasing when the logit variance decreases.
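The entropy claim in (3) is straightforward to check numerically; this sketch evaluates $H(p_T)$ over a range of temperatures for an arbitrary logit vector:

```python
import math

def softmax_T(z, T):
    """Softmax of z / T with max-subtraction for stability."""
    scaled = [x / T for x in z]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy in nats, skipping zero-probability terms."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

z = [2.0, 1.0, 0.5, 0.1]
hs = [entropy(softmax_T(z, T)) for T in (0.1, 0.5, 1.0, 5.0, 100.0)]
print(hs)  # strictly increasing, approaching log(4) ~ 1.386 from below
```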

Why It Matters

This result formalizes the temperature-quality tradeoff. Low temperature produces high-confidence, repetitive outputs. High temperature produces diverse but potentially incoherent outputs. Practitioners tune $T$ to find the sweet spot. This same temperature mechanism appears in knowledge distillation, where soft targets at elevated temperature transfer information between models.

Failure Mode

If multiple tokens share the maximum logit value, the $T \to 0$ limit is a uniform distribution over the tied maxima, not a point mass. In practice, exact ties are rare with floating-point logits. The result also assumes a finite vocabulary; it does not apply to continuous distributions.

Why Greedy Decoding Fails

Greedy decoding selects the locally optimal token at each step. This does not yield the globally most probable sequence. The most probable sequence under the model is:

$$x^* = \arg\max_{x_{1:T}} \prod_{t=1}^{T} p(x_t \mid x_{<t})$$

Greedy decoding solves a different problem: it picks $\arg\max_{x_t} p(x_t \mid x_{<t})$ at each step independently.

The key failure mode is repetition. Once a greedy decoder begins repeating a phrase, the conditional probability of continuing the repetition is often higher than the probability of breaking out. The model assigns high probability to "The cat sat on the mat. The cat sat on the mat. The cat..." because each repetition is locally plausible given the context.

Beam search mitigates this somewhat (the globally best sequence often avoids loops), but beam search with large $k$ tends to produce generic, high-frequency text rather than diverse or interesting text.

Why Sampling Helps

Sampling introduces randomness that breaks repetition loops. But pure sampling from the full distribution includes low-probability tokens that produce incoherent output. The tail of a language model distribution often contains tokens that are syntactically or semantically incompatible with the context.

Top-k and top-p truncate this tail:

  • Top-k sets a fixed cutoff. The problem: when the model is confident (one token has probability 0.95), $k = 50$ still samples from 49 unlikely tokens. When the model is uncertain (flat distribution), $k = 50$ may cut off reasonable alternatives.

  • Top-p adapts the cutoff to the distribution shape. If one token has probability 0.95 and $p = 0.9$, only that one token is sampled. If the top 100 tokens each have probability 0.009 and $p = 0.9$, all 100 are candidates. This adaptive behavior is why top-p generally outperforms top-k in practice.
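Deployed systems typically chain these steps: temperature first, then truncation, then sampling. A sketch of such a pipeline follows; the default `T` and `top_p` values are illustrative, not canonical settings:

```python
import math
import random

def sample_next(logits, T=0.7, top_p=0.9):
    """Sketch pipeline: temperature scaling, nucleus truncation, then sampling."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Build the smallest token set with cumulative mass >= top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cum = [], 0.0
    for i in order:
        nucleus.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    weights = [probs[i] for i in nucleus]
    return random.choices(nucleus, weights=weights)[0]

random.seed(0)
tokens = [sample_next([2.0, 1.5, 1.0, -3.0]) for _ in range(20)]
print(tokens)  # only tokens 0-2 appear; token 3's tiny tail mass is truncated away
```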

Canonical Examples

Example

Temperature effect on a 4-token vocabulary

Suppose the logits are $z = [2.0, 1.0, 0.5, 0.1]$. The softmax probabilities at different temperatures:

At $T = 0.5$: logits become $[4.0, 2.0, 1.0, 0.2]$, giving probabilities $\approx [0.83, 0.11, 0.04, 0.02]$. The top token dominates.

At $T = 1.0$: probabilities $\approx [0.57, 0.21, 0.13, 0.09]$. More spread out.

At $T = 2.0$: logits become $[1.0, 0.5, 0.25, 0.05]$, giving probabilities $\approx [0.41, 0.25, 0.19, 0.16]$. Noticeably flatter.

Lower temperature concentrates probability on the top token. Higher temperature spreads probability across all tokens.

Common Confusions

Watch Out

Top-p is not the same as top-k with adaptive k

While top-p can be described as choosing an adaptive number of tokens, the selection criterion is structurally different. Top-k always selects exactly $k$ tokens regardless of the distribution. Top-p selects the minimum set meeting a cumulative probability threshold. In a near-uniform distribution over 50,000 tokens, top-p with $p = 0.9$ might include 45,000 tokens. Top-k with $k = 50$ would include exactly 50.
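A quick numerical check of this contrast, counting nucleus sizes for a uniform and a peaked distribution (the `nucleus_size` helper is illustrative):

```python
def nucleus_size(probs, p):
    """Number of tokens in the top-p nucleus of a distribution."""
    cum, n = 0.0, 0
    for q in sorted(probs, reverse=True):
        cum += q
        n += 1
        if cum >= p:
            break
    return n

v = 50_000
uniform = [1.0 / v] * v
peaked = [0.95] + [0.05 / (v - 1)] * (v - 1)
print(nucleus_size(uniform, 0.9))  # ~45000: the nucleus expands with the distribution
print(nucleus_size(peaked, 0.9))   # 1: and collapses when the model is confident
```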

Watch Out

Temperature does not change the ranking of tokens

Dividing all logits by the same $T > 0$ preserves their ordering. The most probable token under $T = 0.5$ is the same as under $T = 2.0$. Temperature only changes the gaps between probabilities. Greedy decoding gives the same result regardless of temperature.
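This invariance can be verified directly by ranking the tokens of an arbitrary logit vector at several temperatures:

```python
import math

def softmax_T(z, T):
    """Softmax of z / T with max-subtraction for stability."""
    scaled = [x / T for x in z]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

z = [1.2, 3.4, 0.7, 2.0]
rankings = []
for T in (0.5, 1.0, 2.0):
    p = softmax_T(z, T)
    rankings.append(sorted(range(len(p)), key=lambda i: p[i], reverse=True))
print(rankings)  # the ordering [1, 3, 0, 2] is identical at every temperature
```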

Watch Out

Beam search does not find the most probable sequence

Beam search with beam width $k$ is an approximation. Finding the exact most probable sequence requires searching over $v^T$ possibilities (exponential in sequence length). Beam search prunes this to $O(kvT)$ candidates. For large $k$, beam search tends toward the mode of the distribution, which often produces bland, generic text rather than the most informative or useful text.

Key Takeaways

  • Greedy decoding is deterministic and produces repetitive, degenerate text
  • Beam search improves over greedy but tends toward generic outputs
  • Temperature controls distribution sharpness: low $T$ is conservative, high $T$ is random
  • Top-k sampling truncates the tail but uses a fixed cutoff regardless of confidence
  • Top-p (nucleus) sampling adapts the cutoff to the distribution shape
  • In practice, deployed LLMs typically combine temperature scaling with top-p sampling
  • The temperature-quality tradeoff is the central design decision in text generation

Exercises

ExerciseCore

Problem

A language model produces logits $z = [3.0, 1.0, 0.5]$ over a 3-token vocabulary. Compute the softmax probabilities at $T = 1$ and $T = 0.5$. Which token does greedy decoding select in each case?

ExerciseCore

Problem

With the distribution $p = [0.7, 0.15, 0.08, 0.04, 0.03]$, which tokens are included in top-k sampling with $k = 3$? Which tokens are included in top-p sampling with $p = 0.9$?

ExerciseAdvanced

Problem

Prove that for a softmax distribution with finite logits $z_1, \ldots, z_v$ (not all equal), the entropy $H(p_T)$ is strictly increasing in $T$ for $T > 0$.

References

Canonical:

  • Holtzman et al., "The Curious Case of Neural Text Degeneration" (2020), Sections 1-4
  • Fan, Lewis, Dauphin, "Hierarchical Neural Story Generation" (2018)

Current:

  • Meister, Cotterell, Vieira, "Locally Typical Sampling" (2023)
  • Welleck et al., "Neural Text Generation with Unlikelihood Training" (2020)
  • Renze & Guven, "The Effect of Sampling Temperature on Problem Solving in Large Language Models" (2024)
  • Jurafsky & Martin, Speech and Language Processing (3rd ed., draft), Chapters 7-12

Last reviewed: April 2026
