Numerical Stability
Softmax and Numerical Stability
The softmax function maps arbitrary reals to a probability distribution. Getting it right numerically, by avoiding overflow and underflow, is the first lesson in writing ML code that actually works.
Why This Matters
The softmax function appears in virtually every classification neural
network, every attention mechanism, every language model, and every
reinforcement learning policy. It converts a vector of real numbers (logits)
into a probability distribution. This sounds simple, but a naive
implementation will produce NaN or Inf on perfectly reasonable inputs.
Understanding why softmax breaks numerically, and how the log-sum-exp
trick fixes it, is the first real lesson in numerical computing for ML.
If you have ever seen NaN losses during training, there is a good chance
a softmax-related instability was the cause.
Mental Model
You have a vector of scores called logits. You want to convert them into probabilities that sum to 1. Softmax exponentiates each score and normalizes:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$$

The problem: if some $z_i$ is large (on the order of a few hundred), then $e^{z_i}$ overflows to Inf in floating point. If every $z_i$ is very negative, then each $e^{z_i}$ underflows to $0$ and the denominator becomes $0$. Both are fatal for the computation.
The Softmax Function
Softmax Function
The softmax function maps $z \in \mathbb{R}^n$ to a probability vector:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$$

Properties:
- Output is a valid probability distribution: $\mathrm{softmax}(z)_i > 0$ and $\sum_i \mathrm{softmax}(z)_i = 1$
- Monotone: if $z_i > z_j$ then $\mathrm{softmax}(z)_i > \mathrm{softmax}(z)_j$
- Translation invariant: $\mathrm{softmax}(z + c\mathbf{1}) = \mathrm{softmax}(z)$ for any scalar $c$
The translation invariance property is the key to numerical stability. Shifting all logits by the same constant does not change the output. We will exploit this shortly.
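The invariance is easy to check numerically. A minimal NumPy sketch (the function name `naive_softmax` is just for illustration):

```python
import numpy as np

def naive_softmax(z):
    """Unstabilized softmax; fine for small logits."""
    e = np.exp(np.asarray(z, dtype=np.float64))
    return e / e.sum()

z = np.array([1.0, 2.0, 3.0])
# Shifting every logit by the same constant leaves the output unchanged.
print(np.allclose(naive_softmax(z), naive_softmax(z - 100.0)))  # True
```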
Why It Overflows and Underflows
In IEEE 754 double precision floating point:
- $e^{709} \approx 8.2 \times 10^{307}$ (near DBL_MAX)
- $e^{710}$ overflows to Inf
- $e^{-746}$ underflows to 0

In single precision (float32, the ML default):
- $e^{89}$ overflows to Inf
- $e^{-104}$ underflows to 0

A logit vector like $z = (100, 200, 300)$ in float32 produces $e^{z_i} = \text{Inf}$ for every entry, and $\text{Inf}/\text{Inf} = \text{NaN}$.
Even with more modest logits, the ratio of a very small exponential to a very large sum gives underflow in the numerator, producing probabilities of exactly $0$ when they should be small but nonzero. This matters for cross-entropy loss, where $\log(0) = -\infty$.
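The failure is easy to reproduce. A minimal sketch (the specific logit values are illustrative):

```python
import numpy as np

def naive_softmax(z):
    """Textbook softmax with no stabilization, for demonstration only."""
    e = np.exp(z)
    return e / e.sum()

z = np.array([100.0, 200.0, 300.0], dtype=np.float32)
with np.errstate(over="ignore", invalid="ignore"):
    p = naive_softmax(z)
# exp(300) overflows float32 to Inf, and Inf/Inf is NaN:
print(p)  # [nan nan nan]
```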
The Log-Sum-Exp Trick
Numerical Stability of the Shifted Log-Sum-Exp
Statement
For any $z \in \mathbb{R}^n$ and any $c \in \mathbb{R}$:

$$\log \sum_{i=1}^{n} e^{z_i} = c + \log \sum_{i=1}^{n} e^{z_i - c}$$

Setting $c = \max_i z_i$ ensures that the largest exponent is $0$, preventing overflow. The other exponents are $\le 0$, so they cannot overflow either. The sum is in $[1, n]$, so the log is in $[0, \log n]$.
Intuition
By factoring out from the sum, we pull the "dangerously large" exponential outside as an additive constant in log space. What remains inside the log-sum-exp are manageable numbers between 0 and 1.
Proof Sketch
With , every , so . The largest term equals , so the sum is at least and the log is nonneg. No overflow occurs.
Why It Matters
This trick is used everywhere in numerical computing for ML. Cross-entropy
loss, KL divergence, log-likelihood of categorical distributions, attention
scores: all involve log-sum-exp internally. Every major ML framework
(PyTorch, JAX, TensorFlow) implements this automatically in functions like
log_softmax, cross_entropy, and logsumexp.
Failure Mode
Some of the terms $e^{z_i - c}$ may still underflow to $0$ when $z_i \ll c$.
This is usually acceptable: those terms contribute negligibly to the sum.
The stability guarantee is against overflow (which produces Inf and then NaN) rather
than underflow (which produces a small approximation error).
Stable Softmax and Log-Softmax
Using the trick, with $c = \max_j z_j$, the stable softmax computation is:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i - c}}{\sum_{j} e^{z_j - c}}$$

Even more important is log-softmax, because cross-entropy loss uses $\log \mathrm{softmax}(z)_y$ directly:

$$\log \mathrm{softmax}(z)_i = z_i - c - \log \sum_{j} e^{z_j - c}$$

Computing $\log \mathrm{softmax}(z)_i$ directly (without first computing softmax and then taking $\log$) avoids the catastrophic loss of precision that occurs when softmax outputs a number very close to $1$ and $\log$ maps it to nearly $0$.
This is why PyTorch has separate F.softmax and F.log_softmax functions,
and why F.cross_entropy takes raw logits rather than probabilities.
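Putting the two formulas into code, here is a minimal NumPy sketch (PyTorch's F.log_softmax and F.softmax implement the equivalent stabilized computation):

```python
import numpy as np

def log_softmax(z):
    """log softmax(z)_i = z_i - c - log(sum_j exp(z_j - c)), c = max(z)."""
    z = np.asarray(z, dtype=np.float64)
    shifted = z - z.max()
    return shifted - np.log(np.exp(shifted).sum())

def softmax(z):
    """Stable softmax, via exponentiating the stable log-softmax."""
    return np.exp(log_softmax(z))

z = [1000.0, 1001.0, 1002.0]   # naive exp(z) would overflow to Inf
p = softmax(z)
print(p.sum())                  # 1.0 (up to rounding)
print(log_softmax(z))           # all finite, no NaN/Inf
```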
Temperature Scaling
Temperature-Scaled Softmax
The temperature parameter $T > 0$ controls the "sharpness" of the distribution:

$$\mathrm{softmax}_T(z)_i = \frac{e^{z_i/T}}{\sum_{j} e^{z_j/T}}$$

- As $T \to \infty$: the distribution approaches uniform (maximum entropy)
- As $T \to 0^+$: the distribution concentrates on the argmax (minimum entropy)
- $T = 1$: standard softmax
Temperature is used in:
- Knowledge distillation: high $T$ softens teacher outputs to reveal inter-class relationships
- Language model sampling: $T < 1$ makes generation more deterministic, $T > 1$ makes it more random
- Reinforcement learning: Boltzmann exploration policies use temperature to trade off exploitation vs. exploration
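The two limits are easy to see numerically. A sketch with illustrative temperatures:

```python
import numpy as np

def softmax_t(z, temperature):
    """Temperature-scaled softmax: softmax(z / T), computed stably."""
    z = np.asarray(z, dtype=np.float64) / temperature
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

z = [1.0, 2.0, 3.0]
print(softmax_t(z, 100.0))  # nearly uniform: every entry close to 1/3
print(softmax_t(z, 0.01))   # nearly one-hot on the argmax (index 2)
print(softmax_t(z, 1.0))    # standard softmax
```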
Connection to Probability Distributions
Softmax arises naturally as the canonical link function for the categorical (multinoulli) distribution in the exponential family. If $X \sim \mathrm{Categorical}(\pi_1, \dots, \pi_n)$ with natural parameters $\eta_i = \log \pi_i$, then the mean parameters are recovered by:

$$\pi_i = \mathrm{softmax}(\eta)_i = \frac{e^{\eta_i}}{\sum_j e^{\eta_j}}$$

In physics, this is the Boltzmann distribution (or Gibbs distribution): a system at temperature $T$ with energy levels $E_i$ occupies state $i$ with probability:

$$p_i = \frac{e^{-E_i/T}}{Z}, \qquad Z = \sum_j e^{-E_j/T}$$

The partition function $Z$ is precisely the denominator of softmax (with logits $z_i = -E_i/T$). Computing $\log Z$ is the log-sum-exp.
Gumbel-Softmax for Differentiable Sampling
A key problem: sampling from a categorical distribution is not differentiable, so you cannot backpropagate through it. The Gumbel-softmax trick provides a differentiable approximation.
Gumbel-Softmax (Concrete Distribution)
To approximately sample from $\mathrm{Categorical}(\mathrm{softmax}(z))$:
- Draw i.i.d. Gumbel noise: $g_i \sim \mathrm{Gumbel}(0, 1)$ (equivalently, $g_i = -\log(-\log u_i)$ where $u_i \sim \mathrm{Uniform}(0, 1)$)
- Compute the relaxed sample:

$$y_i = \frac{e^{(z_i + g_i)/\tau}}{\sum_j e^{(z_j + g_j)/\tau}}$$

where $\tau > 0$ is a temperature parameter.
As $\tau \to 0$, $y$ approaches a one-hot vector (exact sample). For $\tau > 0$, $y$ is a "soft" sample on the simplex that admits gradients via backpropagation.
This is used extensively in VAEs with discrete latent variables, neural architecture search, and any setting where you need to "differentiate through a categorical choice."
The Gumbel-max theorem guarantees that $\arg\max_i (z_i + g_i)$ is an exact sample from $\mathrm{Categorical}(\mathrm{softmax}(z))$. Gumbel-softmax relaxes the $\arg\max$ to a $\mathrm{softmax}$, making it differentiable at the cost of approximation.
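A minimal sketch of the sampler (function and variable names are illustrative). Note that softmax is monotone, so for any $\tau$ the argmax of the relaxed sample equals $\arg\max_i (z_i + g_i)$; hard decisions read off this way follow the target categorical exactly:

```python
import numpy as np

def gumbel_softmax_sample(logits, tau, rng):
    """One relaxed sample: softmax((z + g) / tau) with g ~ Gumbel(0, 1).

    As tau -> 0 the sample approaches one-hot (Gumbel-max); for tau > 0
    it lies in the simplex interior and admits gradients.
    """
    logits = np.asarray(logits, dtype=np.float64)
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1)
    y = (logits + g) / tau
    e = np.exp(y - y.max())   # stabilize before exponentiating
    return e / e.sum()

rng = np.random.default_rng(0)
logits = np.log([0.1, 0.3, 0.6])   # target class probabilities
soft = gumbel_softmax_sample(logits, tau=1.0, rng=rng)   # interior point
hard = gumbel_softmax_sample(logits, tau=0.01, rng=rng)  # close to one-hot
print(soft, hard)
```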
Common Confusions
Softmax IS argmax as temperature goes to zero
A common description is that softmax is a "soft version of argmax." This is correct, but the implication often drawn is wrong. People sometimes think softmax approximates argmax but is structurally different. In fact, $\lim_{T \to 0^+} \mathrm{softmax}(z/T)$ is exactly the argmax (as a one-hot vector), assuming a unique maximum. Softmax is a one-parameter family that interpolates continuously between the uniform distribution ($T \to \infty$) and the argmax ($T \to 0^+$). It is not an approximation. It is a temperature-parameterized generalization.
Never compute softmax then log; use log-softmax directly
If $\mathrm{softmax}(z)_i$ is very close to $1$ (say $1 - 10^{-17}$), then in float64 it rounds to exactly $1.0$, so $\log$ returns $0$, losing all the information in the small correction. Computing $\log \mathrm{softmax}(z)_i$ directly as $z_i - c - \log \sum_j e^{z_j - c}$ avoids this entirely, because the subtraction happens before any precision is lost. This is not a minor optimization. It is the difference between working and broken training.
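A concrete sketch of the failure, shown on the small-probability side where it is most visible (cross-entropy evaluates the log probability of the true class, which is exactly what breaks):

```python
import numpy as np

z = np.array([0.0, 1000.0])
shifted = z - z.max()

# Naive route: softmax first, then log. The first probability
# underflows to 0, and log(0) = -Inf destroys the loss.
with np.errstate(divide="ignore"):
    p = np.exp(shifted) / np.exp(shifted).sum()
    naive = np.log(p)

# Direct route: z_i - c - log(sum_j exp(z_j - c)). Finite and exact.
direct = shifted - np.log(np.exp(shifted).sum())

print(naive)   # first entry is -inf
print(direct)  # [-1000., 0.]: finite
```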
Summary
- Softmax: $\mathrm{softmax}(z)_i = e^{z_i} / \sum_j e^{z_j}$ maps logits to probabilities
- Naive implementation overflows for large logits, underflows for very negative logits
- Log-sum-exp trick: subtract $c = \max_i z_i$ before exponentiating
- Always compute log_softmax directly, never log(softmax(z))
- Temperature $T$: high $T$ = uniform, low $T$ = peaked at argmax
- Softmax is the canonical link for the categorical distribution and the Boltzmann distribution
- Gumbel-softmax provides differentiable approximate sampling from categoricals
Exercises
Problem
Implement a numerically stable log_softmax function. Given a vector
$z \in \mathbb{R}^n$, compute $\log \mathrm{softmax}(z)_i$ for all $i$
without overflow or unnecessary precision loss.
Write the formula and explain each step.
Problem
Show that for the Gumbel-max trick, $\arg\max_i (z_i + g_i)$ with $g_i \sim \mathrm{Gumbel}(0, 1)$ i.i.d. is distributed as $\mathrm{Categorical}(\mathrm{softmax}(z))$. That is, prove:

$$\Pr\left[\arg\max_i (z_i + g_i) = k\right] = \frac{e^{z_k}}{\sum_j e^{z_j}}$$
References
Canonical:
- Goodfellow, Bengio, Courville, Deep Learning (2016), Section 4.1 (numerical stability)
- Bishop, Pattern Recognition and Machine Learning (2006), Section 4.3.4
Current:
- Jang, Gu, Poole, "Categorical Reparameterization with Gumbel-Softmax" (2017)
- Maddison, Mnih, Teh, "The Concrete Distribution" (2017)
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009)
Next Topics
The natural next step from numerical stability:
- Conditioning and condition number: formalizing when a computation is inherently sensitive to perturbations, beyond specific tricks
Last reviewed: April 2026
Builds on This
- Attention Mechanism Theory (Layer 4)
- Decoding Strategies (Layer 3)
- Flash Attention (Layer 5)
- Log-Probability Computation (Layer 1)
- Quantization Theory (Layer 5)
- Transformer Architecture (Layer 4)