
Numerical Stability

Log-Probability Computation

Working in log space prevents underflow when multiplying many small probabilities. The log-sum-exp trick provides a numerically stable way to compute log(sum(exp(x_i))), and it underlies stable softmax, log-likelihoods, and the forward algorithm for HMMs.

Core · Tier 1 · Stable · ~35 min

Why This Matters

A single probability can be small. A product of 100 probabilities is astronomically small. In a language model, the probability of a 100-token sequence might be 10^{-150}. In 64-bit floating point, the smallest representable positive number is roughly 5 \times 10^{-324}. Multiply enough probabilities and you hit zero exactly. This is underflow, and it silently corrupts every downstream computation.

The fix is simple: work in log space. Replace products with sums, and never exponentiate until the very end (if ever). Every ML framework implements this pattern. Understanding why and how is necessary for writing correct numerical code. This connects directly to softmax stability, cross-entropy loss, information theory, and floating-point arithmetic.

The Core Problem

Definition

Underflow in Probability Products

Given probabilities p_1, \ldots, p_n with each p_i \in (0, 1):

P = \prod_{i=1}^n p_i

In log space: \log P = \sum_{i=1}^n \log p_i. Each \log p_i is a moderate negative number (e.g., -3 to -10). The sum is a large negative number (e.g., -500), which is perfectly representable in floating point. The product P itself (\approx e^{-500} \approx 10^{-217}) would underflow to zero in many contexts.
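The contrast is easy to demonstrate. A minimal Python sketch (the 1000-factor product is an illustrative example, not taken from any particular model):

```python
import math

# A product of 1000 probabilities of 0.01 is 10^-2000: far below the
# smallest positive float64, so the running product collapses to 0.0.
probs = [0.01] * 1000
product = 1.0
for p in probs:
    product *= p
print(product)                              # 0.0 -- silent underflow

# The same quantity in log space is an ordinary, exactly representable float.
log_prob = sum(math.log(p) for p in probs)
print(log_prob)                             # -4605.17... = 1000 * log(0.01)
```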

The Log-Sum-Exp Trick

The difficulty arises when you need to compute \log \sum_i \exp(x_i). This appears in softmax, partition functions, marginal likelihoods, and the forward algorithm. Naive computation fails because \exp(x_i) overflows for large x_i or underflows for very negative x_i.

Definition

Log-Sum-Exp (Stable Form)

For values x_1, \ldots, x_n, define c = \max_i x_i. Then:

\text{LSE}(x_1, \ldots, x_n) = c + \log \sum_{i=1}^n \exp(x_i - c)

This is algebraically identical to \log \sum_i \exp(x_i), but each \exp(x_i - c) \leq 1 (since x_i - c \leq 0), so no overflow occurs. At least one term (i = \arg\max) equals \exp(0) = 1, so the sum is at least 1 and \log does not receive a near-zero argument.
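A direct implementation of the stable form, as a sketch in pure Python (production code would reach for torch.logsumexp or jax.scipy.special.logsumexp instead):

```python
import math

def logsumexp(xs):
    """Stable log(sum(exp(x) for x in xs)) via max-subtraction."""
    c = max(xs)
    if c == float("-inf"):                  # every term is exp(-inf) = 0
        return float("-inf")
    return c + math.log(sum(math.exp(x - c) for x in xs))

# The naive math.log(sum(math.exp(x) for x in xs)) would overflow here:
print(logsumexp([1000.0, 1001.0]))          # 1001.3132... = 1001 + log(1 + e^-1)
```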

Main Theorem

Proposition

Numerical Stability of Log-Sum-Exp

Statement

Let x_1, \ldots, x_n be finite floating-point numbers and c = \max_i x_i. The computation:

\hat{y} = c + \log \sum_{i=1}^n \exp(x_i - c)

satisfies the following properties:

  1. No overflow: Each \exp(x_i - c) \in (0, 1], so no exponential overflows.
  2. No catastrophic underflow: The sum \sum_i \exp(x_i - c) \geq 1, so the logarithm receives an argument \geq 1.
  3. Relative error: The result has relative error O(n \epsilon_{\text{mach}}), where \epsilon_{\text{mach}} \approx 10^{-16} for float64.

The naive computation \log \sum_i \exp(x_i) can produce +\infty (overflow) or -\infty (underflow) for the same inputs.

Intuition

Subtracting c = \max_i x_i shifts all exponents to \leq 0, capping \exp(x_i - c) at 1. The largest term contributes exactly 1, guaranteeing the sum is well-behaved. The shift is then added back as c outside the logarithm, restoring the correct value.

Failure Mode

If x_i - c is very negative (say, < -750 for float64), then \exp(x_i - c) rounds to zero. This is benign: such terms contribute negligibly to the sum. The result is still accurate because the dominant terms are preserved. The only true failure is when all inputs are -\infty, producing -\infty.

Applications

Stable log-softmax

The softmax of logits z_1, \ldots, z_K is \text{softmax}(z)_k = \exp(z_k) / \sum_j \exp(z_j). In log space:

\log \text{softmax}(z)_k = z_k - \text{LSE}(z_1, \ldots, z_K)

This is how PyTorch's F.log_softmax and JAX's jax.nn.log_softmax are implemented. Computing log(softmax(z)) by first computing softmax and then taking the log is numerically inferior: softmax can produce values so close to zero that log returns -\infty. See numerical stability for the broader context.
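The identity translates directly into code. A pure-Python sketch of the same pattern the library implementations follow:

```python
import math

def log_softmax(z):
    """z_k - LSE(z), without ever materializing softmax probabilities."""
    c = max(z)
    lse = c + math.log(sum(math.exp(v - c) for v in z))
    return [v - lse for v in z]

# Extreme logits that would break log(softmax(z)): softmax underflows the
# last entry to 0.0, so taking its log afterwards would yield -inf.
logits = [1000.0, 0.0, -1000.0]
print(log_softmax(logits))                  # [0.0, -1000.0, -2000.0]
```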

Log-likelihood computation

For a model with parameters \theta and i.i.d. data x_1, \ldots, x_n:

\ell(\theta) = \log \prod_{i=1}^n p_\theta(x_i) = \sum_{i=1}^n \log p_\theta(x_i)

Always compute the sum of log-probabilities, never the log of the product. ML frameworks compute \log p_\theta(x_i) directly (via log-softmax for categorical distributions, or closed-form log-densities for continuous ones) without ever materializing p_\theta(x_i) as a floating-point number.
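As an illustration of the closed-form log-density pattern for a Gaussian model (the data and parameters below are made up for the example):

```python
import math

def gaussian_logpdf(x, mu, sigma):
    """Closed-form log-density of N(mu, sigma^2); the density itself is never formed."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

# 500 observations deep in the tail of N(0, 1): each density is ~ e^-5000,
# which would underflow to 0.0, yet the log-likelihood is an ordinary float.
data = [100.0] * 500
loglik = sum(gaussian_logpdf(x, 0.0, 1.0) for x in data)
print(loglik)                               # about -2.5e6, finite and usable
```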

Log-domain forward algorithm for HMMs

The forward algorithm computes \alpha_t(j) = p(\text{observations } 1{:}t, \text{state}_t = j). The recurrence is:

\alpha_t(j) = b_j(o_t) \sum_i \alpha_{t-1}(i) \cdot a_{ij}

where a_{ij} are transition probabilities and b_j(o_t) are emission probabilities. For long sequences, \alpha_t(j) underflows to zero because it is a product of many small probabilities.

In log space, define \bar{\alpha}_t(j) = \log \alpha_t(j):

\bar{\alpha}_t(j) = \log b_j(o_t) + \text{LSE}_i\left(\bar{\alpha}_{t-1}(i) + \log a_{ij}\right)

This replaces multiplication with addition and summation with LSE. No underflow occurs regardless of sequence length.
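A compact sketch of the log-domain recurrence in pure Python. The uniform two-state HMM at the bottom is a made-up sanity check, chosen so that p(\text{obs}) = 0.5^T exactly:

```python
import math

def logsumexp(xs):
    c = max(xs)
    if c == float("-inf"):
        return c
    return c + math.log(sum(math.exp(x - c) for x in xs))

def forward_log(log_pi, log_A, log_B, obs):
    """Log-domain forward algorithm; returns log p(obs_1, ..., obs_T).

    log_pi[i]   -- log initial probability of state i
    log_A[i][j] -- log transition probability i -> j
    log_B[j][o] -- log emission probability of symbol o in state j
    """
    K = len(log_pi)
    # Initialization: log alpha_1(j) = log pi_j + log b_j(o_1)
    alpha = [log_pi[j] + log_B[j][obs[0]] for j in range(K)]
    # Recurrence: log alpha_t(j) = log b_j(o_t) + LSE_i(log alpha_{t-1}(i) + log a_ij)
    for o in obs[1:]:
        alpha = [log_B[j][o] + logsumexp([alpha[i] + log_A[i][j] for i in range(K)])
                 for j in range(K)]
    return logsumexp(alpha)                 # log p(obs) = LSE over final states

# Uniform 2-state, 2-symbol HMM: p(obs) = 0.5^T, which underflows for T beyond ~1100
lp = math.log(0.5)
T = 2000
logp = forward_log([lp, lp], [[lp, lp], [lp, lp]], [[lp, lp], [lp, lp]],
                   [0, 1] * (T // 2))
print(logp)                                 # -1386.29... = 2000 * log(0.5)
```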

Canonical Examples

Example

Why naive probability multiplication fails

Consider a bigram language model assigning p(\text{word}_t \mid \text{word}_{t-1}) \approx 0.01 to each word in a 200-word document. The sequence probability is 0.01^{200} = 10^{-400}, which is below the float64 minimum (\approx 10^{-308}) and rounds to zero.

In log space: \log P = 200 \times \log(0.01) = 200 \times (-4.605) = -921.0. This is a perfectly representable float64 value.

Example

Log-sum-exp in practice

Suppose you need \log(e^{1000} + e^{1001}). Naive computation: e^{1000} overflows to +\infty in float64. With LSE: c = 1001, so:

\text{LSE} = 1001 + \log(e^{-1} + e^{0}) = 1001 + \log(1 + e^{-1}) \approx 1001.313

No overflow, correct result.

Common Log-Probability Operations

| Operation | Naive form | Log-space form | Numerical trick |
| --- | --- | --- | --- |
| Product of probabilities | \prod_i p_i | \sum_i \log p_i | Replace multiply with add |
| Sum of probabilities | \sum_i p_i | \text{LSE}(\log p_1, \ldots, \log p_n) | Max-subtract for stability |
| Softmax | e^{z_k} / \sum_j e^{z_j} | z_k - \text{LSE}(z_1, \ldots, z_K) | Never compute softmax then log |
| Weighted average | \sum_i w_i x_i | \text{LSE}_i(\log w_i + \log x_i) | Keep weights in log space |
| Bayes update | p(A \mid B) \propto p(B \mid A) \, p(A) | \log p(B \mid A) + \log p(A), normalized via LSE | Normalize in log space |
| Mixture probability | \sum_k \pi_k p_k(x) | \text{LSE}_k(\log \pi_k + \log p_k(x)) | LSE over mixture components |
| Ratio of probabilities | p/q | \log p - \log q | Subtraction in log space |
| Power of probability | p^{\alpha} | \alpha \log p | Scalar multiply in log space |
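For instance, the mixture-probability row can be evaluated entirely in log space. A sketch with made-up component log-densities, chosen to be far too small to exponentiate:

```python
import math

def logsumexp(xs):
    c = max(xs)
    return c + math.log(sum(math.exp(x - c) for x in xs))

# log p(x) = LSE_k(log pi_k + log p_k(x)) for a 2-component mixture
log_pi = [math.log(0.3), math.log(0.7)]     # mixture weights
log_px = [-800.0, -805.0]                   # component log-densities log p_k(x)
log_mix = logsumexp([w + d for w, d in zip(log_pi, log_px)])
print(log_mix)                              # about -801.19; e^-800 itself is not representable
```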

Log-Probabilities in LLM Inference

Modern language models operate almost entirely in log space during inference. Understanding this is necessary for working with LLM outputs.

Token-level scoring

A language model outputs logits z_1, \ldots, z_V over a vocabulary of size V (typically 32K-128K tokens). The log-probability of a token is:

\log p(t_k \mid t_{<k}) = z_k - \text{LSE}(z_1, \ldots, z_V)

This is one log-softmax operation per generation step. The result is a vector of V log-probabilities, all negative, whose exponentials sum to 1 in probability space (equivalently, their LSE is \log(1) = 0).
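A toy version of this step, with a 5-token vocabulary and made-up logits (real vocabularies are far larger):

```python
import math

def log_softmax(z):
    c = max(z)
    lse = c + math.log(sum(math.exp(v - c) for v in z))
    return [v - lse for v in z]

logits = [2.0, 1.0, 0.5, -1.0, -3.0]        # one generation step's logits
logprobs = log_softmax(logits)

# Every entry is negative, and the probabilities they encode sum to 1
assert all(v < 0 for v in logprobs)
print(sum(math.exp(v) for v in logprobs))   # 1.0 (up to rounding)
```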

Sequence scoring and perplexity

The log-probability of a full sequence t_1, \ldots, t_n is:

\log p(t_1, \ldots, t_n) = \sum_{k=1}^n \log p(t_k \mid t_{<k})

Perplexity is the exponentiated average negative log-probability:

\text{PPL} = \exp\left(-\frac{1}{n} \sum_{k=1}^n \log p(t_k \mid t_{<k})\right)

Lower perplexity means the model assigns higher probability to the observed sequence. The connection to cross-entropy is direct: perplexity equals 2^H where H is the cross-entropy in bits, or equivalently e^H when using nats. See bits, nats, perplexity, BPB for unit conversions.
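Numerically, with made-up per-token log-probabilities in nats:

```python
import math

token_logprobs = [-2.0, -1.5, -3.0, -1.5]   # log p(t_k | t_<k) for a 4-token sequence

n = len(token_logprobs)
cross_entropy = -sum(token_logprobs) / n    # average negative log-prob, in nats
ppl = math.exp(cross_entropy)
print(ppl)                                  # e^2 = 7.389...: on average each token was
                                            # about as surprising as a 1-in-7.4 guess
```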

Decoding with log-probs

Decoding strategies operate on log-probabilities. Temperature scaling divides logits by T before log-softmax. Top-k and top-p (nucleus) sampling threshold on cumulative probability, which requires sorting log-probs and accumulating probability mass (via LSE if staying in log space). Beam search maintains k partial sequences scored by summed log-probabilities.
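A sketch of two of these operations in log space, on toy logits (real decoders also sample from the filtered distribution, which is omitted here):

```python
import math

def log_softmax(z):
    c = max(z)
    lse = c + math.log(sum(math.exp(v - c) for v in z))
    return [v - lse for v in z]

logits = [3.0, 2.0, 1.0, -5.0]

# Temperature scaling: divide logits by T *before* the log-softmax
T = 0.5
log_probs = log_softmax([z / T for z in logits])

# Top-k filtering (k = 2): keep the k largest log-probs, renormalize with LSE
k = 2
kept = sorted(log_probs, reverse=True)[:k]
c = max(kept)
lse = c + math.log(sum(math.exp(v - c) for v in kept))
renorm = [v - lse for v in kept]
print(sum(math.exp(v) for v in renorm))     # 1.0 (up to rounding): a valid distribution again
```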

Common Confusions

Watch Out

Log space does not solve all numerical problems

Log space prevents underflow in probability products and overflow in exponentials. It does not help with cancellation errors (subtracting two nearly equal numbers) or with computations that are inherently ill-conditioned. For example, computing \log(1 + x) for small x requires log1p(x), not log(1 + x), to avoid cancellation.
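A two-line demonstration of that last point:

```python
import math

x = 1e-18
print(math.log(1 + x))    # 0.0 -- x vanishes: 1 + x rounds to exactly 1.0
print(math.log1p(x))      # 1e-18 -- computed accurately, without forming 1 + x
```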

Watch Out

You should almost never exponentiate log-probabilities

If you are computing log-probabilities only to exponentiate them later, reconsider your pipeline. Most downstream operations (comparison, argmax, addition of independent log-probs) can stay in log space. Exponentiating reintroduces the underflow problem you solved.

Watch Out

The max subtraction is not a heuristic

Subtracting c = \max_i x_i in LSE is not an approximation or heuristic. It is an algebraic identity: \log \sum_i e^{x_i} = c + \log \sum_i e^{x_i - c} for any c. Choosing c = \max_i x_i is the standard choice because it guarantees no overflow and at least one unit-magnitude term in the sum.

Exercises

Exercise (Core)

Problem

Compute \text{LSE}(100, 200, 300) by hand using the stable formula. What would happen if you tried to compute e^{100} + e^{200} + e^{300} directly in float64?

Exercise (Advanced)

Problem

In the log-domain forward algorithm for an HMM with K hidden states and sequence length T, what is the time complexity? How does it compare to the standard (non-log) forward algorithm?

References

Canonical:

  • Murphy, Machine Learning: A Probabilistic Perspective (2012), Section 3.5.3
  • Rabiner, A Tutorial on Hidden Markov Models and Selected Applications (1989)
  • Higham, Accuracy and Stability of Numerical Algorithms (2002), Chapters 1-2. The definitive reference on floating-point error analysis.

For LLM context:

  • Holtzman et al., "The Curious Case of Neural Text Degeneration" (ICLR 2020). Nucleus sampling operates on log-probabilities.
  • Meister et al., "Language Model Evaluation Beyond Perplexity" (ACL 2021). Uses log-probs for evaluation metrics.

Implementation:

  • PyTorch documentation: torch.logsumexp, F.log_softmax, F.cross_entropy
  • Goodfellow, Bengio, Courville, Deep Learning (2016), Section 4.1
  • JAX documentation: jax.nn.log_softmax, jax.scipy.special.logsumexp

Last reviewed: April 2026
