LLM Construction
Decoding Strategies
How language models select output tokens: greedy decoding, beam search, temperature scaling, top-k sampling, and nucleus (top-p) sampling. The tradeoffs between coherence, diversity, and quality.
Why This Matters
A language model outputs a probability distribution over the vocabulary at each step, computed via the softmax function. The decoding strategy determines how to select the next token from this distribution. This choice has a large effect on output quality. Greedy decoding produces repetitive, degenerate text. Pure random sampling produces incoherent text. Every deployed LLM uses a carefully tuned decoding strategy between these extremes.
Understanding decoding strategies requires understanding how temperature, top-k, and top-p reshape the output distribution, and why these reshapings improve generation quality.
Formal Setup
Let $V$ be the vocabulary with $|V|$ tokens. At each time step $t$, the model produces logits $z_t \in \mathbb{R}^{|V|}$. The probability of token $i$ is:

$$P(x_t = i \mid x_{<t}) = \frac{\exp(z_{t,i})}{\sum_{j=1}^{|V|} \exp(z_{t,j})}$$
A decoding strategy is a rule that maps this distribution to a token selection.
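The softmax step above can be sketched in plain Python (a minimal illustration of the vocabulary distribution, not a production implementation):

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    # Subtract the max logit for numerical stability; this does not
    # change the resulting probabilities.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.0, -1.0])
# Entries sum to 1, and the largest logit gets the largest probability.
```

Every decoding strategy in this section starts from a distribution like `probs` and differs only in how it selects a token from it.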
Core Definitions
Greedy Decoding
Select the highest-probability token at each step:

$$x_t = \arg\max_{i \in V} P(x_t = i \mid x_{<t})$$
This is a deterministic procedure. The same input always produces the same output.
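A greedy step reduces to an argmax over the distribution, as in this small sketch:

```python
def greedy_step(probs):
    """Greedy decoding step: return the index of the highest-probability token."""
    return max(range(len(probs)), key=lambda i: probs[i])

# Deterministic: the same distribution always yields the same token.
choice = greedy_step([0.1, 0.6, 0.3])
```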
Beam Search
Maintain a set of $B$ partial sequences (the beam). At each step, extend each sequence by all $|V|$ tokens, score the resulting $B \cdot |V|$ candidates by their cumulative log-probabilities, and keep only the top $B$:

$$\text{score}(x_{1:t}) = \sum_{s=1}^{t} \log P(x_s \mid x_{<s})$$

Greedy decoding is beam search with $B = 1$.
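The procedure can be sketched as follows; `step_probs` is a hypothetical stand-in for the model, returning next-token probabilities for a partial sequence:

```python
import math

def beam_search(step_probs, beam_width, length):
    """Beam search sketch over a toy next-token distribution function."""
    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(length):
        candidates = []
        for seq, score in beams:
            for tok, p in enumerate(step_probs(seq)):
                if p > 0:
                    candidates.append((seq + [tok], score + math.log(p)))
        # Keep only the top-scoring beam_width candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

# Toy "model": a fixed distribution regardless of context.
toy = lambda seq: [0.5, 0.3, 0.2]
best_seq, best_score = beam_search(toy, beam_width=2, length=3)[0]
```

With `beam_width=1` this reduces to greedy decoding, matching the statement above.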
Temperature Scaling
Divide logits by a temperature $T > 0$ before applying softmax:

$$P_T(i) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

$T = 1$: the original distribution. $T < 1$: sharpens the distribution (higher peaks, lower tails). $T > 1$: flattens the distribution (more uniform). $T \to 0$: converges to greedy decoding. $T \to \infty$: converges to uniform sampling.
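A minimal sketch of temperature scaling (plain Python, illustrative logits):

```python
import math

def softmax_with_temperature(logits, T):
    """Softmax after dividing every logit by the temperature T."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.0, 0.0, -1.0]
sharp = softmax_with_temperature(logits, 0.5)  # peakier than T = 1
flat = softmax_with_temperature(logits, 5.0)   # closer to uniform
```

Note that scaling changes the gaps between probabilities but never their ordering.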
Top-k Sampling
Sample from only the $k$ most probable tokens. Zero out all other probabilities and renormalize:

$$P'(i) = \begin{cases} P(i) \,/\, \sum_{j \in V_k} P(j) & \text{if } i \in V_k \\ 0 & \text{otherwise} \end{cases}$$

where $V_k$ is the set of $k$ tokens with highest probability.
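The truncate-and-renormalize step can be sketched directly (a minimal illustration over an example distribution):

```python
def top_k_filter(probs, k):
    """Keep only the k most probable tokens, zero the rest, renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    filtered = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]

restricted = top_k_filter([0.5, 0.2, 0.15, 0.1, 0.05], k=2)
# Only the two most probable tokens survive, rescaled to sum to 1.
```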
Nucleus (Top-p) Sampling
Sample from the smallest set of tokens whose cumulative probability exceeds $p$:

$$V_p = \text{the smallest set with } \sum_{i \in V_p} P(i) \geq p$$

where tokens are added in decreasing order of probability. Renormalize and sample from $V_p$. Unlike top-k, the number of candidate tokens adapts to the shape of the distribution.
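A sketch of the nucleus truncation step (illustrative distributions, not model outputs):

```python
def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p (added in decreasing probability order), then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    filtered = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(filtered)
    return [q / total for q in filtered]

# Confident model: one token carries almost all the mass,
# so the nucleus collapses to a single candidate.
peaked = top_p_filter([0.95, 0.02, 0.02, 0.01], p=0.9)
```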
Main Theorems
Temperature Limits of the Softmax Distribution
Statement
Let $z \in \mathbb{R}^n$ be a logit vector with a unique maximum at index $i^*$. Define $P_T(i) = \exp(z_i/T) \,/\, \sum_j \exp(z_j/T)$. Then:
- $\lim_{T \to 0^+} P_T(i^*) = 1$ (point mass on the argmax)
- $\lim_{T \to \infty} P_T(i) = 1/n$ for every $i$ (uniform distribution)
- The entropy $H(P_T)$ (as defined in information theory) is strictly increasing in $T$.
Intuition
Temperature controls the peakedness of the distribution. As $T \to 0$, only the most likely token has nonzero probability: this is greedy decoding. As $T \to \infty$, all tokens are equally likely: this is uniform random sampling. The entropy monotonicity means temperature smoothly interpolates between deterministic and maximally random behavior.
Proof Sketch
For (1): divide numerator and denominator by $\exp(z_{i^*}/T)$. Each remaining ratio $\exp((z_j - z_{i^*})/T) \to 0$ for $j \neq i^*$, since $z_j - z_{i^*} < 0$ and $T \to 0^+$. For (2): $z_i / T \to 0$ for all $i$, so $\exp(z_i/T) \to 1$ and $P_T(i) \to 1/n$. For (3): the scaled logits $z/T$ have strictly decreasing variance as $T$ increases, and the entropy of a softmax distribution is strictly increasing when the logit variance decreases.
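The limiting behavior can be checked numerically on a small logit vector (an illustrative sanity check, not a proof):

```python
import math

def entropy_at_temperature(logits, T):
    """Shannon entropy of the temperature-scaled softmax distribution."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    probs = [e / s for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.0, -1.0]
# Entropy grows with T: near 0 as T -> 0, near log(4) as T -> infinity.
values = [entropy_at_temperature(logits, T) for T in (0.1, 1.0, 10.0, 1000.0)]
```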
Why It Matters
This result formalizes the temperature-quality tradeoff. Low temperature produces high-confidence, repetitive outputs. High temperature produces diverse but potentially incoherent outputs. Practitioners tune $T$ to find the sweet spot. This same temperature mechanism appears in knowledge distillation, where soft targets at elevated temperature transfer information between models.
Failure Mode
If multiple tokens share the maximum logit value, the limit is a uniform distribution over the tied maxima, not a point mass. In practice, exact ties are rare with floating-point logits. The result also assumes a finite vocabulary; it does not apply to continuous distributions.
Why Greedy Decoding Fails
Greedy decoding selects the locally optimal token at each step. This does not yield the globally most probable sequence. The most probable sequence under the model is:

$$x_{1:N}^* = \arg\max_{x_{1:N}} \prod_{t=1}^{N} P(x_t \mid x_{<t})$$

Greedy decoding solves a different problem: it picks $\arg\max_i P(x_t = i \mid x_{<t})$ at each step independently.
The key failure mode is repetition. Once a greedy decoder begins repeating a phrase, the conditional probability of continuing the repetition is often higher than the probability of breaking out. The model assigns high probability to "The cat sat on the mat. The cat sat on the mat. The cat..." because each repetition is locally plausible given the context.
Beam search mitigates this somewhat (the globally best sequence often avoids loops), but beam search with large $B$ tends to produce generic, high-frequency text rather than diverse or interesting text.
Why Sampling Helps
Sampling introduces randomness that breaks repetition loops. But pure sampling from the full distribution includes low-probability tokens that produce incoherent output. The tail of a language model distribution often contains tokens that are syntactically or semantically incompatible with the context.
Top-k and top-p truncate this tail:
- Top-k sets a fixed cutoff. The problem: when the model is confident (one token has probability 0.95), $k = 50$ still samples from 49 unlikely tokens. When the model is uncertain (flat distribution), a fixed $k$ may cut off reasonable alternatives.
- Top-p adapts the cutoff to the distribution shape. If one token has probability 0.95 and $p = 0.9$, only that one token is sampled. If the top 100 tokens each have probability 0.009 and $p = 0.9$, all 100 are candidates. This adaptive behavior is why top-p generally outperforms top-k in practice.
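The adaptive behavior is easy to demonstrate: count how many tokens the nucleus contains for a peaked versus a flat distribution (the distributions below are illustrative, chosen for exact arithmetic):

```python
def nucleus_count(probs, p):
    """Number of tokens nucleus sampling keeps from a distribution."""
    cumulative, n = 0.0, 0
    for q in sorted(probs, reverse=True):
        cumulative += q
        n += 1
        if cumulative >= p:
            break
    return n

peaked = [0.95] + [0.05 / 49] * 49  # confident model: one dominant token
flat = [1 / 32] * 32                # uncertain model: uniform over 32 tokens
confident_n = nucleus_count(peaked, 0.9)  # nucleus collapses to 1 token
uncertain_n = nucleus_count(flat, 0.9)    # nucleus keeps most of the tokens
```

A fixed top-k would keep the same number of candidates in both cases; the nucleus shrinks and grows with the model's confidence.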
Canonical Examples
Temperature effect on a 4-token vocabulary
Suppose the logits are $(2, 1, 0, -1)$. The softmax probabilities at different temperatures:
At $T = 0.5$: scaled logits become $(4, 2, 0, -2)$, giving probabilities $(0.865, 0.117, 0.016, 0.002)$. The top token dominates.
At $T = 1$: probabilities $(0.644, 0.237, 0.087, 0.032)$. More spread out.
At $T = 5$: scaled logits become $(0.4, 0.2, 0, -0.2)$, giving probabilities $(0.329, 0.270, 0.221, 0.181)$. Nearly uniform.
Lower temperature concentrates probability on the top token. Higher temperature spreads probability across all tokens.
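A worked example like this can be reproduced with a short script (assuming illustrative logits $(2, 1, 0, -1)$):

```python
import math

def softmax_t(logits, T):
    """Temperature-scaled softmax, rounded to 3 decimals for display."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [round(e / s, 3) for e in exps]

logits = [2.0, 1.0, 0.0, -1.0]
cold = softmax_t(logits, 0.5)   # top token dominates
base = softmax_t(logits, 1.0)
warm = softmax_t(logits, 5.0)   # nearly uniform
```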
Common Confusions
Top-p is not the same as top-k with adaptive k
While top-p can be described as choosing an adaptive number of tokens, the selection criterion is structurally different. Top-k always selects exactly $k$ tokens regardless of the distribution. Top-p selects the minimum set exceeding a cumulative probability threshold. In a near-uniform distribution over 50,000 tokens, top-p with $p = 0.9$ might include 45,000 tokens. Top-k with $k = 50$ would include exactly 50.
Temperature does not change the ranking of tokens
Dividing all logits by the same $T > 0$ preserves their ordering. The most probable token under $P_T$ is the same as under $P_1$. Temperature only changes the gaps between probabilities. Greedy decoding gives the same result regardless of temperature.
Beam search does not find the most probable sequence
Beam search with beam width $B$ is an approximation. The exact most probable sequence requires searching over $|V|^N$ possibilities (exponential in the sequence length $N$). Beam search prunes this to $B$ candidates per step. For large $B$, beam search tends toward the mode of the distribution, which often produces bland, generic text rather than the most informative or useful text.
Key Takeaways
- Greedy decoding is deterministic and produces repetitive, degenerate text
- Beam search improves over greedy but tends toward generic outputs
- Temperature controls distribution sharpness: low is conservative, high is random
- Top-k sampling truncates the tail but uses a fixed cutoff regardless of confidence
- Top-p (nucleus) sampling adapts the cutoff to the distribution shape
- In practice, deployed LLMs typically combine temperature scaling with top-p sampling
- The temperature-quality tradeoff is the central design decision in text generation
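The combined pipeline from the takeaways can be sketched end to end; the values $T = 0.7$ and $p = 0.9$ below are illustrative choices, not universal defaults:

```python
import math
import random

def sample_next_token(logits, T=0.7, p=0.9, rng=random):
    """Sketch of a typical pipeline: temperature scaling followed by
    nucleus (top-p) sampling."""
    # 1. Temperature-scaled softmax.
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    probs = [e / s for e in exps]
    # 2. Nucleus truncation: smallest set reaching cumulative probability p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    # 3. Sample a token id from the truncated, reweighted candidates.
    weights = [probs[i] for i in keep]
    return rng.choices(keep, weights=weights)[0]

token = sample_next_token([2.0, 1.0, 0.0, -1.0])
```

With these logits the nucleus contains only the top two tokens, so the low-probability tail is never sampled.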
Exercises
Problem
A language model produces logits $(2, 1, 0)$ over a 3-token vocabulary. Compute the softmax probabilities at $T = 0.5$ and $T = 2$. Which token does greedy decoding select in each case?
Problem
With the distribution $(0.4, 0.3, 0.15, 0.1, 0.05)$, which tokens are included in top-k sampling with $k = 2$? Which tokens are included in top-p sampling with $p = 0.8$?
Problem
Prove that for a softmax distribution with finite logits (not all equal), the entropy $H(P_T)$ is strictly increasing in $T$ for $T > 0$.
References
Canonical:
- Holtzman et al., "The Curious Case of Neural Text Degeneration" (2020), Sections 1-4
- Fan, Lewis, Dauphin, "Hierarchical Neural Story Generation" (2018)
Current:
- Meister, Cotterell, Vieira, "Locally Typical Sampling" (2023)
- Welleck et al., "Neural Text Generation with Unlikelihood Training" (2020)
- Renze & Guven, "The Effect of Sampling Temperature on Problem Solving in Large Language Models" (2024)
- Jurafsky & Martin, Speech and Language Processing (3rd ed., draft), Chapters 7-12
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Transformer Architecture (Layer 4)
- Attention Mechanism Theory (Layer 4)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Softmax and Numerical Stability (Layer 1)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in R^n (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)