
Numerical Stability

Log-Probability Computation

Working in log space prevents underflow when multiplying many small probabilities. The log-sum-exp trick provides a numerically stable way to compute log(sum(exp(x_i))), and it underlies stable softmax, log-likelihoods, and the forward algorithm for HMMs.

Core · Tier 1 · Stable · ~35 min

Why This Matters

A single probability can be small. A product of 100 probabilities is astronomically small. In a language model, the probability of a 100-token sequence might be 10^{-150}. In 64-bit floating point, the smallest representable positive number is roughly 5 \times 10^{-324}. Multiply enough probabilities and you hit zero exactly. This is underflow, and it silently corrupts every downstream computation.

The fix is simple: work in log space. Replace products with sums, and never exponentiate until the very end (if ever). Every ML framework implements this pattern. Understanding why and how is necessary for writing correct numerical code. This connects directly to softmax stability, cross-entropy loss, information theory, and floating-point arithmetic.

The Core Problem

Definition

Underflow in Probability Products

Given probabilities p_1, \ldots, p_n with each p_i \in (0, 1):

P = \prod_{i=1}^n p_i

In log space: \log P = \sum_{i=1}^n \log p_i. Each \log p_i is a moderate negative number (e.g., -3 to -10). The sum is a large negative number (e.g., -500), which is perfectly representable in floating point. The product P itself (\approx e^{-500} \approx 10^{-217}) would underflow to zero in many contexts.
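The contrast is easy to demonstrate. A minimal Python sketch (the 1000-factor product is an illustrative example, not taken from any particular model):

```python
import math

# A product of 1000 probabilities of 0.01 is 10^-2000: far below the
# smallest positive float64, so the running product collapses to 0.0.
probs = [0.01] * 1000
product = 1.0
for p in probs:
    product *= p
print(product)                              # 0.0 -- silent underflow

# The same quantity in log space is an ordinary, exactly representable float.
log_prob = sum(math.log(p) for p in probs)
print(log_prob)                             # -4605.17... = 1000 * log(0.01)
```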

The Log-Sum-Exp Trick

The difficulty arises when you need to compute \log \sum_i \exp(x_i). This appears in softmax, partition functions, marginal likelihoods, and the forward algorithm. Naive computation fails because \exp(x_i) overflows for large x_i or underflows for very negative x_i.

Definition

Log-Sum-Exp (Stable Form)

For values x_1, \ldots, x_n, define c = \max_i x_i. Then:

\text{LSE}(x_1, \ldots, x_n) = c + \log \sum_{i=1}^n \exp(x_i - c)

This is algebraically identical to \log \sum_i \exp(x_i), but each \exp(x_i - c) \leq 1 (since x_i - c \leq 0), so no overflow occurs. At least one term (i = \arg\max) equals \exp(0) = 1, so the sum is at least 1 and \log does not receive a near-zero argument.
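A direct implementation of the stable form, as a sketch in pure Python (production code would reach for torch.logsumexp or jax.scipy.special.logsumexp instead):

```python
import math

def logsumexp(xs):
    """Stable log(sum(exp(x) for x in xs)) via max-subtraction."""
    c = max(xs)
    if c == float("-inf"):                  # every term is exp(-inf) = 0
        return float("-inf")
    return c + math.log(sum(math.exp(x - c) for x in xs))

# The naive math.log(sum(math.exp(x) for x in xs)) would overflow here:
print(logsumexp([1000.0, 1001.0]))          # 1001.3132... = 1001 + log(1 + e^-1)
```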

Main Theorem

Proposition

Numerical Stability of Log-Sum-Exp

Statement

Let x_1, \ldots, x_n be finite floating-point numbers and c = \max_i x_i. The computation:

\hat{y} = c + \log \sum_{i=1}^n \exp(x_i - c)

satisfies the following properties:

  1. No overflow: Each \exp(x_i - c) \in (0, 1], so no exponential overflows.
  2. No catastrophic underflow: The sum \sum_i \exp(x_i - c) \geq 1, so the logarithm receives an argument \geq 1.
  3. Relative error: The result has relative error O(n \epsilon_{\text{mach}}), where \epsilon_{\text{mach}} \approx 10^{-16} for float64.

The naive computation \log \sum_i \exp(x_i) can produce +\infty (overflow) or -\infty (underflow) for the same inputs.

Intuition

Subtracting c = \max_i x_i shifts all exponents to \leq 0, capping \exp(x_i - c) at 1. The largest term contributes exactly 1, guaranteeing the sum is well-behaved. The shift is then added back as c outside the logarithm, restoring the correct value.

Failure Mode

If x_i - c is very negative (say, < -750 for float64), then \exp(x_i - c) rounds to zero. This is benign: such terms contribute negligibly to the sum. The result is still accurate because the dominant terms are preserved. The only true failure is when all inputs are -\infty, producing -\infty.

Applications

Stable log-softmax

The softmax of logits z_1, \ldots, z_K is \text{softmax}(z)_k = \exp(z_k) / \sum_j \exp(z_j). In log space:

\log \text{softmax}(z)_k = z_k - \text{LSE}(z_1, \ldots, z_K)

This is how PyTorch's F.log_softmax and JAX's jax.nn.log_softmax are implemented. Computing log(softmax(z)) by first computing softmax and then taking the log is numerically inferior: softmax can produce values so close to zero that log returns -\infty. See numerical stability for the broader context.
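The identity translates directly into code. A pure-Python sketch of the same pattern the library implementations follow:

```python
import math

def log_softmax(z):
    """z_k - LSE(z), without ever materializing softmax probabilities."""
    c = max(z)
    lse = c + math.log(sum(math.exp(v - c) for v in z))
    return [v - lse for v in z]

# Extreme logits that would break log(softmax(z)): softmax underflows the
# last entry to 0.0, so taking its log afterwards would yield -inf.
logits = [1000.0, 0.0, -1000.0]
print(log_softmax(logits))                  # [0.0, -1000.0, -2000.0]
```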

Log-likelihood computation

For a model with parameters \theta and i.i.d. data x_1, \ldots, x_n:

\ell(\theta) = \log \prod_{i=1}^n p_\theta(x_i) = \sum_{i=1}^n \log p_\theta(x_i)

Always compute the sum of log-probabilities, never the log of the product. ML frameworks compute \log p_\theta(x_i) directly (via log-softmax for categorical distributions, or closed-form log-densities for continuous ones) without ever materializing p_\theta(x_i) as a floating-point number.
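As an illustration of the closed-form log-density pattern for a Gaussian model (the data and parameters below are made up for the example):

```python
import math

def gaussian_logpdf(x, mu, sigma):
    """Closed-form log-density of N(mu, sigma^2); the density itself is never formed."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

# 500 observations deep in the tail of N(0, 1): each density is ~ e^-5000,
# which would underflow to 0.0, yet the log-likelihood is an ordinary float.
data = [100.0] * 500
loglik = sum(gaussian_logpdf(x, 0.0, 1.0) for x in data)
print(loglik)                               # about -2.5e6, finite and usable
```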

Log-domain forward algorithm for HMMs

The forward algorithm computes \alpha_t(j) = p(\text{observations } 1{:}t, \text{state}_t = j). The recurrence is:

\alpha_t(j) = b_j(o_t) \sum_i \alpha_{t-1}(i) \cdot a_{ij}

where a_{ij} are transition probabilities and b_j(o_t) are emission probabilities. For long sequences, \alpha_t(j) underflows to zero because it is a product of many small probabilities.

In log space, define \bar{\alpha}_t(j) = \log \alpha_t(j):

\bar{\alpha}_t(j) = \log b_j(o_t) + \text{LSE}_i\left(\bar{\alpha}_{t-1}(i) + \log a_{ij}\right)

This replaces multiplication with addition and summation with LSE. No underflow occurs regardless of sequence length.
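A compact sketch of the log-domain recurrence in pure Python. The uniform two-state HMM at the bottom is a made-up sanity check, chosen so that p(\text{obs}) = 0.5^T exactly:

```python
import math

def logsumexp(xs):
    c = max(xs)
    if c == float("-inf"):
        return c
    return c + math.log(sum(math.exp(x - c) for x in xs))

def forward_log(log_pi, log_A, log_B, obs):
    """Log-domain forward algorithm; returns log p(obs_1, ..., obs_T).

    log_pi[i]   -- log initial probability of state i
    log_A[i][j] -- log transition probability i -> j
    log_B[j][o] -- log emission probability of symbol o in state j
    """
    K = len(log_pi)
    # Initialization: log alpha_1(j) = log pi_j + log b_j(o_1)
    alpha = [log_pi[j] + log_B[j][obs[0]] for j in range(K)]
    # Recurrence: log alpha_t(j) = log b_j(o_t) + LSE_i(log alpha_{t-1}(i) + log a_ij)
    for o in obs[1:]:
        alpha = [log_B[j][o] + logsumexp([alpha[i] + log_A[i][j] for i in range(K)])
                 for j in range(K)]
    return logsumexp(alpha)                 # log p(obs) = LSE over final states

# Uniform 2-state, 2-symbol HMM: p(obs) = 0.5^T, which underflows for T beyond ~1100
lp = math.log(0.5)
T = 2000
logp = forward_log([lp, lp], [[lp, lp], [lp, lp]], [[lp, lp], [lp, lp]],
                   [0, 1] * (T // 2))
print(logp)                                 # -1386.29... = 2000 * log(0.5)
```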

Canonical Examples

Example

Why naive probability multiplication fails

Consider a bigram language model assigning p(\text{word}_t \mid \text{word}_{t-1}) \approx 0.01 to each word in a 200-word document. The sequence probability is 0.01^{200} = 10^{-400}, which is below the float64 minimum (\approx 10^{-308}) and rounds to zero.

In log space: \log P = 200 \times \log(0.01) = 200 \times (-4.605) = -921.0. This is a perfectly representable float64 value.

Example

Log-sum-exp in practice

Suppose you need \log(e^{1000} + e^{1001}). Naive computation: e^{1000} overflows to +\infty in float64. With LSE: c = 1001, so:

\text{LSE} = 1001 + \log(e^{-1} + e^{0}) = 1001 + \log(1 + e^{-1}) \approx 1001.313

No overflow, correct result.

Common Log-Probability Operations

| Operation | Naive form | Log-space form | Numerical trick |
| --- | --- | --- | --- |
| Product of probabilities | \prod_i p_i | \sum_i \log p_i | Replace multiply with add |
| Sum of probabilities | \sum_i p_i | \text{LSE}(\log p_1, \ldots, \log p_n) | Max-subtract for stability |
| Softmax | e^{z_k} / \sum_j e^{z_j} | z_k - \text{LSE}(z_1, \ldots, z_K) | Never compute softmax then log |
| Weighted average | \sum_i w_i x_i | \text{LSE}_i(\log w_i + \log x_i) | Keep weights in log space |
| Bayes update | p(A \mid B) \propto p(B \mid A) \, p(A) | \log p(B \mid A) + \log p(A), normalized via LSE | Normalize in log space |
| Mixture probability | \sum_k \pi_k p_k(x) | \text{LSE}_k(\log \pi_k + \log p_k(x)) | LSE over mixture components |
| Ratio of probabilities | p/q | \log p - \log q | Subtraction in log space |
| Power of probability | p^{\alpha} | \alpha \log p | Scalar multiply in log space |
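For instance, the mixture-probability row can be evaluated entirely in log space. A sketch with made-up component log-densities, chosen to be far too small to exponentiate:

```python
import math

def logsumexp(xs):
    c = max(xs)
    return c + math.log(sum(math.exp(x - c) for x in xs))

# log p(x) = LSE_k(log pi_k + log p_k(x)) for a 2-component mixture
log_pi = [math.log(0.3), math.log(0.7)]     # mixture weights
log_px = [-800.0, -805.0]                   # component log-densities log p_k(x)
log_mix = logsumexp([w + d for w, d in zip(log_pi, log_px)])
print(log_mix)                              # about -801.19; e^-800 itself is not representable
```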

Log-Probabilities in LLM Inference

Modern language models operate almost entirely in log space during inference. Understanding this is necessary for working with LLM outputs.

Token-level scoring

A language model outputs logits z_1, \ldots, z_V over a vocabulary of size V (typically 32K-128K tokens). The log-probability of a token is:

\log p(t_k \mid t_{<k}) = z_k - \text{LSE}(z_1, \ldots, z_V)

This is one log-softmax operation per generation step. The result is a vector of V log-probabilities, all negative, whose exponentials sum to 1 in probability space (equivalently, their LSE is \log(1) = 0).
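A toy version of this step, with a 5-token vocabulary and made-up logits (real vocabularies are far larger):

```python
import math

def log_softmax(z):
    c = max(z)
    lse = c + math.log(sum(math.exp(v - c) for v in z))
    return [v - lse for v in z]

logits = [2.0, 1.0, 0.5, -1.0, -3.0]        # one generation step's logits
logprobs = log_softmax(logits)

# Every entry is negative, and the probabilities they encode sum to 1
assert all(v < 0 for v in logprobs)
print(sum(math.exp(v) for v in logprobs))   # 1.0 (up to rounding)
```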

Sequence scoring and perplexity

The log-probability of a full sequence t_1, \ldots, t_n is:

\log p(t_1, \ldots, t_n) = \sum_{k=1}^n \log p(t_k \mid t_{<k})

Perplexity is the exponentiated average negative log-probability:

\text{PPL} = \exp\left(-\frac{1}{n} \sum_{k=1}^n \log p(t_k \mid t_{<k})\right)

Lower perplexity means the model assigns higher probability to the observed sequence. The connection to cross-entropy is direct: perplexity equals 2^H where H is the cross-entropy in bits, or equivalently e^H when using nats. See bits, nats, perplexity, BPB for unit conversions.
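Numerically, with made-up per-token log-probabilities in nats:

```python
import math

token_logprobs = [-2.0, -1.5, -3.0, -1.5]   # log p(t_k | t_<k) for a 4-token sequence

n = len(token_logprobs)
cross_entropy = -sum(token_logprobs) / n    # average negative log-prob, in nats
ppl = math.exp(cross_entropy)
print(ppl)                                  # e^2 = 7.389...: on average each token was
                                            # about as surprising as a 1-in-7.4 guess
```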

Decoding with log-probs

Decoding strategies operate on log-probabilities. Temperature scaling divides logits by T before log-softmax. Top-k and top-p (nucleus) sampling threshold on cumulative probability, which requires sorting log-probs and accumulating probability mass (via LSE if staying in log space). Beam search maintains k partial sequences scored by summed log-probabilities.
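A sketch of two of these operations in log space, on toy logits (real decoders also sample from the filtered distribution, which is omitted here):

```python
import math

def log_softmax(z):
    c = max(z)
    lse = c + math.log(sum(math.exp(v - c) for v in z))
    return [v - lse for v in z]

logits = [3.0, 2.0, 1.0, -5.0]

# Temperature scaling: divide logits by T *before* the log-softmax
T = 0.5
log_probs = log_softmax([z / T for z in logits])

# Top-k filtering (k = 2): keep the k largest log-probs, renormalize with LSE
k = 2
kept = sorted(log_probs, reverse=True)[:k]
c = max(kept)
lse = c + math.log(sum(math.exp(v - c) for v in kept))
renorm = [v - lse for v in kept]
print(sum(math.exp(v) for v in renorm))     # 1.0 (up to rounding): a valid distribution again
```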

Common Confusions

Watch Out

Log space does not solve all numerical problems

Log space prevents underflow in probability products and overflow in exponentials. It does not help with cancellation errors (subtracting two nearly equal numbers) or with computations that are inherently ill-conditioned. For example, computing \log(1 + x) for small x requires log1p(x), not log(1 + x), to avoid cancellation.
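A two-line demonstration of that last point:

```python
import math

x = 1e-18
print(math.log(1 + x))    # 0.0 -- x vanishes: 1 + x rounds to exactly 1.0
print(math.log1p(x))      # 1e-18 -- computed accurately, without forming 1 + x
```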

Watch Out

You should almost never exponentiate log-probabilities

If you are computing log-probabilities only to exponentiate them later, reconsider your pipeline. Most downstream operations (comparison, argmax, addition of independent log-probs) can stay in log space. Exponentiating reintroduces the underflow problem you solved.

Watch Out

The max subtraction is not a heuristic

Subtracting c = \max_i x_i in LSE is not an approximation or heuristic. It is an algebraic identity: \log \sum_i e^{x_i} = c + \log \sum_i e^{x_i - c} for any c. Choosing c = \max_i x_i is the standard choice because it guarantees no overflow and at least one unit-magnitude term in the sum.

Exercises

Exercise (Core)

Problem

Compute \text{LSE}(100, 200, 300) by hand using the stable formula. What would happen if you tried to compute e^{100} + e^{200} + e^{300} directly in float64?

Exercise (Advanced)

Problem

In the log-domain forward algorithm for an HMM with K hidden states and sequence length T, what is the time complexity? How does it compare to the standard (non-log) forward algorithm?

References

Canonical:

  • Murphy, Machine Learning: A Probabilistic Perspective (2012), Section 3.5.3
  • Rabiner, A Tutorial on Hidden Markov Models and Selected Applications (1989)
  • Higham, Accuracy and Stability of Numerical Algorithms (2002), Chapters 1-2. The definitive reference on floating-point error analysis.

For LLM context:

  • Holtzman et al., "The Curious Case of Neural Text Degeneration" (ICLR 2020). Nucleus sampling operates on log-probabilities.
  • Meister et al., "Language Model Evaluation Beyond Perplexity" (ACL 2021). Uses log-probs for evaluation metrics.

Implementation:

  • PyTorch documentation: torch.logsumexp, F.log_softmax, F.cross_entropy
  • Goodfellow, Bengio, Courville, Deep Learning (2016), Section 4.1
  • JAX documentation: jax.nn.log_softmax, jax.scipy.special.logsumexp

Last reviewed: April 2026
