Numerical Stability
Log-Probability Computation
Working in log space prevents underflow when multiplying many small probabilities. The log-sum-exp trick provides a numerically stable way to compute log(sum(exp(x_i))), and it underlies stable softmax, log-likelihoods, and the forward algorithm for HMMs.
Prerequisites
Why This Matters
A single probability can be small. A product of 100 probabilities is astronomically small. In a language model, the probability of a 100-token sequence might be on the order of $10^{-200}$. In 64-bit floating point, the smallest representable positive normal number is roughly $10^{-308}$. Multiply enough probabilities and you hit zero exactly. This is underflow, and it silently corrupts every downstream computation.
The fix is simple: work in log space. Replace products with sums, and never exponentiate until the very end (if ever). Every ML framework implements this pattern. Understanding why and how is necessary for writing correct numerical code. This connects directly to softmax stability, cross-entropy loss, information theory, and floating-point arithmetic.
The Core Problem
Underflow in Probability Products
Given probabilities $p_1, \dots, p_n$ with each $p_i \in (0, 1)$, the product $\prod_{i=1}^n p_i$ shrinks exponentially in $n$ and eventually underflows.
In log space: $\log \prod_i p_i = \sum_i \log p_i$. Each $\log p_i$ is a moderate negative number (e.g., $-1$ to $-20$). The sum is a large negative number (e.g., $-2000$), which is perfectly representable in floating point. The product itself ($e^{-2000} \approx 10^{-869}$) would underflow to zero in many contexts.
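A minimal demonstration in plain Python (the per-token probability of 0.01 is an illustrative value):

```python
import math

# 1000 per-token probabilities of 0.01 each: the product is 10^-2000,
# far below anything float64 can represent, so it underflows to exactly 0.0.
probs = [0.01] * 1000
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0

# The same quantity in log space is an ordinary negative float.
log_prob = sum(math.log(p) for p in probs)
print(log_prob)  # about -4605.2
```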
The Log-Sum-Exp Trick
The difficulty arises when you need to compute $\log \sum_i \exp(x_i)$. This appears in softmax, partition functions, marginal likelihoods, and the forward algorithm. Naive computation fails because $\exp(x_i)$ overflows for large $x_i$ (roughly $x_i > 709$ in float64) or underflows for very negative $x_i$.
Log-Sum-Exp (Stable Form)
For values $x_1, \dots, x_n$, define $m = \max_i x_i$. Then:

$$\mathrm{LSE}(x_1, \dots, x_n) = m + \log \sum_{i=1}^n \exp(x_i - m)$$

This is algebraically identical to $\log \sum_i \exp(x_i)$, but each $\exp(x_i - m) \le 1$ (since $x_i - m \le 0$), so no overflow occurs. At least one term (the one with $x_i = m$) equals $\exp(0) = 1$, so the sum is at least 1 and the logarithm does not receive a near-zero argument.
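A sketch of the stable form in plain Python (the `logsumexp` helper and its `-inf` guard are this sketch's own, not a library API; production code would use `torch.logsumexp` or `jax.scipy.special.logsumexp`):

```python
import math

def logsumexp(xs):
    """Stable log(sum(exp(x))) via max subtraction."""
    m = max(xs)
    if m == float("-inf"):  # all inputs are log(0): the sum is 0, result is log(0)
        return float("-inf")
    # Each x - m <= 0, so exp never overflows; the max term contributes exactly 1.
    return m + math.log(sum(math.exp(x - m) for x in xs))

# Arguments this large would overflow exp() directly.
print(logsumexp([1000.0, 999.0]))  # ~1000.3133, i.e. 1000 + log(1 + e^-1)
```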
Main Theorem
Numerical Stability of Log-Sum-Exp
Statement
Let $x_1, \dots, x_n$ be finite floating-point numbers and $m = \max_i x_i$. The computation:

$$\mathrm{LSE}(x) = m + \log \sum_{i=1}^n \exp(x_i - m)$$

satisfies the following properties:
- No overflow: Each $x_i - m \le 0$, so no exponential overflows.
- No catastrophic underflow: The sum $\sum_i \exp(x_i - m) \ge 1$, so the logarithm receives an argument in $[1, n]$.
- Relative error: The result has relative error bounded by $O(n\varepsilon)$, where $\varepsilon \approx 2.2 \times 10^{-16}$ is the unit roundoff for float64.

The naive computation $\log \sum_i \exp(x_i)$ can produce $\infty$ (overflow) or $-\infty$ (underflow) for the same inputs.
Intuition
Subtracting $m$ shifts all exponents to $\le 0$, capping each $\exp(x_i - m)$ at 1. The largest term contributes exactly 1, guaranteeing the sum is well-behaved. The shift is then added back as $m$ outside the logarithm, restoring the correct value.
Failure Mode
If $x_i - m$ is very negative (below about $-745$ for float64), then $\exp(x_i - m)$ rounds to zero. This is benign: such terms contribute negligibly to the sum. The result is still accurate because the dominant terms are preserved. The only true failure is when all inputs are $-\infty$, producing $-\infty$.
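The benign nature of this underflow can be checked directly in plain Python (the input values are illustrative):

```python
import math

# One term sits 800 nats below the max: exp(-800) underflows to 0.0,
# but the result is still accurate because the dominant term survives.
xs = [0.0, -800.0]
m = max(xs)
print(math.exp(-800.0))  # 0.0 -- harmless underflow of a negligible term
result = m + math.log(sum(math.exp(x - m) for x in xs))
# True value is log(1 + e^-800), which differs from 0 by ~1e-348:
# far below float64 resolution, so 0.0 is the correctly rounded answer.
print(result)  # 0.0
```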
Applications
Stable log-softmax
The softmax of logits $z_1, \dots, z_K$ is $\mathrm{softmax}(z)_i = \exp(z_i) / \sum_j \exp(z_j)$. In log space:

$$\log \mathrm{softmax}(z)_i = z_i - \mathrm{LSE}(z_1, \dots, z_K)$$

This is how PyTorch's `F.log_softmax` and JAX's `jax.nn.log_softmax` are implemented. Computing `log(softmax(z))` by first computing softmax and then taking the log is numerically inferior: softmax can produce values so close to zero that log returns $-\infty$. See numerical stability for the broader context.
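A plain-Python sketch of the identity; production code should use the framework functions named above rather than this hand-rolled version:

```python
import math

def log_softmax(z):
    """log softmax(z)_i = z_i - LSE(z), computed stably via max subtraction."""
    m = max(z)
    lse = m + math.log(sum(math.exp(x - m) for x in z))
    return [x - lse for x in z]

# Logits this large would overflow a naive softmax (exp(1000) = inf),
# yet the log-softmax values come out finite and correct.
logits = [1000.0, 998.0, -500.0]
lp = log_softmax(logits)
print(lp)  # all negative; their exponentials sum to 1
```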
Log-likelihood computation
For a model with parameters $\theta$ and i.i.d. data $x_1, \dots, x_n$:

$$\log p(x_{1:n} \mid \theta) = \sum_{i=1}^n \log p(x_i \mid \theta)$$

Always compute the sum of log-probabilities, never the log of the product. ML frameworks compute $\log p(x_i \mid \theta)$ directly (via log-softmax for categorical distributions, or closed-form log-densities for continuous ones) without ever materializing $p(x_i \mid \theta)$ as a floating-point number.
Log-domain forward algorithm for HMMs
The forward algorithm computes $\alpha_t(j) = p(o_1, \dots, o_t,\, q_t = j)$. The recurrence is:

$$\alpha_t(j) = \Big[ \sum_i \alpha_{t-1}(i)\, a_{ij} \Big]\, b_j(o_t)$$

where $a_{ij}$ are transition probabilities and $b_j(o_t)$ are emission probabilities. For long sequences, $\alpha_t(j)$ underflows to zero because it is a product of many small probabilities.
In log space, define $\ell_t(j) = \log \alpha_t(j)$:

$$\ell_t(j) = \mathrm{LSE}_i\big(\ell_{t-1}(i) + \log a_{ij}\big) + \log b_j(o_t)$$

This replaces multiplication with addition and summation with LSE. No underflow occurs regardless of sequence length.
Canonical Examples
Why naive probability multiplication fails
Consider a bigram language model assigning probability $10^{-2}$ to each word in a 200-word document. The sequence probability is $10^{-400}$, which is below the float64 minimum (roughly $10^{-308}$) and rounds to zero.
In log space: $\sum_{i=1}^{200} \ln(10^{-2}) = 200 \ln(10^{-2}) \approx -921$. This is a perfectly representable float64 value.
Log-sum-exp in practice
Suppose you need $\log(e^{1000} + e^{999})$. Naive computation: $e^{1000}$ overflows to $\infty$ in float64. With LSE: $m = 1000$, so:

$$\mathrm{LSE} = 1000 + \log(e^{0} + e^{-1}) = 1000 + \log(1.3679) \approx 1000.313$$

No overflow, correct result.
Common Log-Probability Operations
| Operation | Naive form | Log-space form | Numerical trick |
|---|---|---|---|
| Product of probabilities | $\prod_i p_i$ | $\sum_i \log p_i$ | Replace multiply with add |
| Sum of probabilities | $\sum_i p_i$ | $\mathrm{LSE}_i(\log p_i)$ | Max-subtract for stability |
| Softmax | $\exp(z_i) / \sum_j \exp(z_j)$ | $z_i - \mathrm{LSE}(z)$ | Never compute softmax then log |
| Weighted average | $\sum_i w_i p_i$ | $\mathrm{LSE}_i(\log w_i + \log p_i)$ | Keep weights in log space |
| Bayes update | $p(A \mid B) \propto p(B \mid A)\, p(A)$ | $\log p(B \mid A) + \log p(A)$ | Normalize via LSE |
| Mixture probability | $\sum_k \pi_k\, p_k(x)$ | $\mathrm{LSE}_k(\log \pi_k + \log p_k(x))$ | LSE over mixture components |
| Ratio of probabilities | $p / q$ | $\log p - \log q$ | Subtraction in log space |
| Power of probability | $p^\alpha$ | $\alpha \log p$ | Scalar multiply in log space |
Log-Probabilities in LLM Inference
Modern language models operate almost entirely in log space during inference. Understanding this is necessary for working with LLM outputs.
Token-level scoring
A language model outputs logits $z \in \mathbb{R}^V$ over a vocabulary of size $V$ (typically 32K-128K tokens). The log-probability of token $t$ is:

$$\log p(t) = z_t - \mathrm{LSE}(z_1, \dots, z_V)$$

This is one log-softmax operation per generation step. The result is a vector of log-probabilities, all negative, whose exponentials sum to 1 in probability space (equivalently, the LSE of the log-probabilities is 0).
Sequence scoring and perplexity
The log-probability of a full sequence $w_1, \dots, w_T$ is:

$$\log p(w_{1:T}) = \sum_{t=1}^T \log p(w_t \mid w_{<t})$$

Perplexity is the exponentiated average negative log-probability:

$$\mathrm{PPL} = \exp\left( -\frac{1}{T} \sum_{t=1}^T \log p(w_t \mid w_{<t}) \right)$$

Lower perplexity means the model assigns higher probability to the observed sequence. The connection to cross-entropy is direct: perplexity equals $2^H$ where $H$ is the cross-entropy in bits, or equivalently $e^H$ when $H$ is measured in nats. See bits, nats, perplexity, BPB for unit conversions.
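The perplexity computation can be sketched directly from per-token log-probabilities. The numbers below are made-up values standing in for what a model would return:

```python
import math

# Hypothetical per-token log-probabilities (natural log) for a 5-token sequence.
token_logprobs = [-2.1, -0.4, -3.7, -1.2, -0.9]

# Average negative log-likelihood in nats, then exponentiate.
avg_nll = -sum(token_logprobs) / len(token_logprobs)
perplexity = math.exp(avg_nll)                    # e^H with H in nats
perplexity_bits = 2 ** (avg_nll / math.log(2))    # same value via 2^H with H in bits
print(perplexity)
```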
Decoding with log-probs
Decoding strategies operate on log-probabilities. Temperature scaling divides logits by $T$ before log-softmax. Top-$k$ and top-$p$ (nucleus) sampling threshold on cumulative probability, which requires sorting log-probs and computing cumulative sums via LSE. Beam search maintains partial sequences scored by summed log-probabilities.
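A minimal sketch of temperature scaling followed by top-$k$ filtering, entirely in log space. The temperature value, logits, and $k$ are illustrative, and renormalization after filtering uses LSE:

```python
import math

def log_softmax(z):
    """Stable log-softmax via max subtraction."""
    m = max(z)
    lse = m + math.log(sum(math.exp(x - m) for x in z))
    return [x - lse for x in z]

logits = [3.2, 1.1, 0.4, -2.0]      # hypothetical next-token logits
temperature = 0.7                    # illustrative value; T < 1 sharpens
lp = log_softmax([z / temperature for z in logits])

# Top-k filtering (k = 2): keep the 2 highest log-probs,
# then renormalize in log space by subtracting their LSE.
k = 2
top = sorted(lp, reverse=True)[:k]
m = max(top)
lse = m + math.log(sum(math.exp(x - m) for x in top))
top_renorm = [x - lse for x in top]
print(top_renorm)  # a valid 2-way log-distribution to sample from
```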
Common Confusions
Log space does not solve all numerical problems
Log space prevents underflow in probability products and overflow in exponentials. It does not help with cancellation errors (subtracting two nearly equal numbers) or with computations that are inherently ill-conditioned. For example, computing $\log(1 + x)$ for small $x$ requires `log1p(x)`, not `log(1 + x)`, to avoid cancellation.
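The `log1p` point is easy to verify with Python's standard `math` module:

```python
import math

x = 1e-16
# 1 + 1e-16 rounds to exactly 1.0 in float64 (machine epsilon is ~2.2e-16),
# so the naive form loses x entirely before log is even called.
print(math.log(1 + x))   # 0.0 -- the information in x is gone
print(math.log1p(x))     # ~1e-16, correct: log1p avoids forming 1 + x
```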
You should almost never exponentiate log-probabilities
If you are computing log-probabilities only to exponentiate them later, reconsider your pipeline. Most downstream operations (comparison, argmax, addition of independent log-probs) can stay in log space. Exponentiating reintroduces the underflow problem you solved.
The max subtraction is not a heuristic
Subtracting $m = \max_i x_i$ in LSE is not an approximation or heuristic. It is an algebraic identity: $\log \sum_i e^{x_i} = c + \log \sum_i e^{x_i - c}$ for any finite $c$. Choosing $c = m$ is the standard choice because it guarantees no overflow and at least one unit-magnitude term in the sum.
Exercises
Problem
Compute $\mathrm{LSE}(1000, 999, 998)$ by hand using the stable formula. What would happen if you tried to compute $e^{1000}$ directly in float64?
Problem
In the log-domain forward algorithm for an HMM with $N$ hidden states and sequence length $T$, what is the time complexity? How does it compare to the standard (non-log) forward algorithm?
References
Canonical:
- Murphy, Machine Learning: A Probabilistic Perspective (2012), Section 3.5.3
- Rabiner, A Tutorial on Hidden Markov Models and Selected Applications (1989)
- Higham, Accuracy and Stability of Numerical Algorithms (2002), Chapters 1-2. The definitive reference on floating-point error analysis.
For LLM context:
- Holtzman et al., "The Curious Case of Neural Text Degeneration" (ICLR 2020). Nucleus sampling operates on log-probabilities.
- Meister et al., "Language Model Evaluation Beyond Perplexity" (ACL 2021). Uses log-probs for evaluation metrics.
Implementation:
- PyTorch documentation: `torch.logsumexp`, `F.log_softmax`, `F.cross_entropy`
- Goodfellow, Bengio, Courville, Deep Learning (2016), Section 4.1
- JAX documentation: `jax.nn.log_softmax`, `jax.scipy.special.logsumexp`
Last reviewed: April 2026