
Foundations

Numerical Stability and Conditioning

Continuous math becomes real only through finite-precision approximation. Condition numbers, backward stability, catastrophic cancellation, and why theorems about reals do not transfer cleanly to floating-point.


Why This Matters

Every computation in ML runs on finite-precision hardware. Gradients are computed in float32 or float16. Matrix multiplications accumulate rounding errors. Softmax overflows without careful implementation. Loss functions produce NaN when the logarithm receives a zero.

The theorems you learn in analysis assume exact real arithmetic. Numerical stability theory tells you what happens when those assumptions fail. A numerically unstable algorithm can produce garbage output even when the underlying mathematical problem is well-posed. A poorly conditioned problem amplifies input perturbations regardless of the algorithm.

Understanding the distinction between conditioning (a property of the problem) and stability (a property of the algorithm) is required for debugging training failures, choosing implementations, and understanding why tricks like log-sum-exp, BatchNorm, residual connections, and attention scaling exist.

Prerequisites

This page assumes familiarity with floating-point arithmetic (IEEE 754, machine epsilon, rounding) and matrix operations (norms, eigenvalues, singular values).

Core Definitions

Definition

Forward Error

The forward error of a computation is the difference between the computed result $\hat{y}$ and the true result $y$:

$$\text{forward error} = \|\hat{y} - y\|$$

The relative forward error is $\|\hat{y} - y\| / \|y\|$.

Definition

Backward Error

The backward error of a computation asks: for what perturbed input $\hat{x}$ does the exact algorithm $f$ produce the computed output $\hat{y}$? That is, find the smallest $\Delta x$ such that:

$$\hat{y} = f(x + \Delta x)$$

The backward error is $\|\Delta x\| / \|x\|$. A small backward error means the computed answer is the exact answer to a nearby problem.

Definition

Condition Number

The condition number of a problem measures how sensitive the output is to small perturbations in the input. For a differentiable function $f$ at input $x$:

$$\kappa = \sup_{\|\Delta x\| \leq \epsilon} \frac{\|f(x + \Delta x) - f(x)\| / \|f(x)\|}{\|\Delta x\| / \|x\|}$$

For a matrix $A$, the condition number with respect to solving $Ax = b$ is:

$$\kappa(A) = \|A\| \cdot \|A^{-1}\| = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}$$

where $\sigma_{\max}$ and $\sigma_{\min}$ are the largest and smallest singular values. A large condition number means the problem is ill-conditioned: small input changes produce large output changes.
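As a quick illustration (not from the original text), the 2-norm condition number falls directly out of the singular values; `cond_2` below is a hypothetical helper name:

```python
import numpy as np

# Sketch: 2-norm condition number as the ratio of extreme singular
# values. `cond_2` is an illustrative helper, not a library function.
def cond_2(A):
    s = np.linalg.svd(A, compute_uv=False)  # singular values, descending
    return s[0] / s[-1]

A = np.array([[1.0, 2.0],
              [2.0, 4.001]])   # rows nearly parallel: ill-conditioned
print(cond_2(A))               # large kappa
print(np.linalg.cond(A))       # NumPy's built-in agrees
```

`np.linalg.cond` defaults to the same 2-norm definition, so the two printed values match.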

Definition

Numerical Stability

An algorithm is numerically stable if it produces results with small backward error for all inputs. An algorithm is backward stable if the computed result is the exact result for a slightly perturbed input:

$$\hat{y} = f(x + \Delta x), \quad \|\Delta x\| / \|x\| = O(\epsilon_{\text{mach}})$$

where $\epsilon_{\text{mach}}$ is machine epsilon. Backward stability is the gold standard for numerical algorithms.

The Central Relationship

The forward error is bounded by the product of condition number and backward error:

$$\frac{\|\hat{y} - y\|}{\|y\|} \leq \kappa \cdot \frac{\|\Delta x\|}{\|x\|}$$

This means:

  • Well-conditioned + backward stable = accurate result. Small backward error, small amplification.
  • Ill-conditioned + backward stable = large forward error, but the best you can do. The problem itself is sensitive; no algorithm can avoid the amplification.
  • Well-conditioned + unstable = avoidable inaccuracy. Switch to a better algorithm.
  • Ill-conditioned + unstable = disaster. Both the problem and the algorithm contribute to error.
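The second case can be seen numerically (an illustrative sketch with a synthetic matrix): a backward stable solver on a deliberately ill-conditioned system produces a forward error far above machine epsilon, yet consistent with the $\kappa \cdot u$ bound:

```python
import numpy as np

# Sketch: build A with kappa(A) = 1e10 via its SVD, solve with a
# backward stable routine, and compare the forward error to kappa*eps.
rng = np.random.default_rng(0)
n = 50
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = np.logspace(0, -10, n)            # singular values from 1 down to 1e-10
A = U @ np.diag(s) @ V.T              # kappa(A) = 1e10 by construction
x_true = rng.standard_normal(n)
b = A @ x_true
x_hat = np.linalg.solve(A, b)         # backward stable (LAPACK pivoted LU)
rel_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
eps = np.finfo(float).eps
print(rel_err)       # far above eps...
print(1e10 * eps)    # ...but within the kappa * u bound (~2e-6)
```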

Main Theorems

Theorem

Backward Stability of Gaussian Elimination with Partial Pivoting

Statement

Gaussian elimination with partial pivoting computes the solution $\hat{x}$ of $Ax = b$ such that:

$$(A + \Delta A)\hat{x} = b, \quad \|\Delta A\| \leq g(n) \cdot u \cdot \|A\|$$

where $u$ is the unit roundoff and $g(n)$ is a growth factor. For partial pivoting, $g(n) \leq 2^{n-1}$ in the worst case, but in practice $g(n) = O(n^{2/3})$ for almost all matrices.

The forward error satisfies:

$$\frac{\|\hat{x} - x\|}{\|x\|} \leq \kappa(A) \cdot g(n) \cdot u$$

Intuition

Gaussian elimination with partial pivoting is backward stable: the computed solution $\hat{x}$ is the exact solution of a slightly perturbed system $(A + \Delta A)\hat{x} = b$. The perturbation is small (on the order of machine epsilon times the matrix norm). The forward error is this backward error amplified by the condition number $\kappa(A)$. If $A$ is well-conditioned, the result is accurate. If $A$ is ill-conditioned, the error can be large, but no direct method can do better.

Proof Sketch

Track the rounding errors through the elimination process. Each elementary row operation introduces a relative error of $O(u)$. Partial pivoting ensures that the multipliers are bounded by 1, preventing individual steps from amplifying errors excessively. The error accumulated over the elimination is bounded by $g(n) u \|A\|$, where the growth factor $g(n)$ accounts for the worst-case accumulation through the pivot sequence.

The full proof appears in Higham, Accuracy and Stability of Numerical Algorithms, Chapter 9.

Why It Matters

This theorem explains why Gaussian elimination works well in practice despite theoretical worst-case growth factors of $2^{n-1}$. It also explains why condition number matters: even with a perfect algorithm, an ill-conditioned matrix produces large forward error. In ML, condition numbers arise everywhere: the Hessian condition number determines gradient descent convergence speed, the condition number of the data matrix affects linear regression stability, and the conditioning of attention logits affects softmax precision.

Failure Mode

Without pivoting, Gaussian elimination can be catastrophically unstable even for well-conditioned matrices. The growth factor can be exponential. Example: the matrix $\begin{pmatrix} \epsilon & 1 \\ 1 & 1 \end{pmatrix}$ for small $\epsilon$ produces an enormous multiplier ($1/\epsilon$) without pivoting, leading to complete loss of precision.

Partial pivoting fails in rare pathological cases (Wilkinson matrices) where $g(n) = 2^{n-1}$. Complete pivoting reduces $g(n)$ further but is more expensive.
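The $2 \times 2$ example above can be checked in a few lines (an illustrative sketch, not from the original text): without pivoting the huge multiplier $1/\epsilon$ wipes out the second row, and back substitution returns a completely wrong $x_1$:

```python
import numpy as np

# Sketch of the failure above: eliminating [[eps, 1], [1, 1]] without
# pivoting uses the multiplier 1/eps and destroys the second equation.
eps = 1e-20
A = np.array([[eps, 1.0], [1.0, 1.0]])
b = np.array([1.0, 2.0])          # true solution is approximately [1, 1]

m = A[1, 0] / A[0, 0]             # multiplier 1/eps = 1e20
u22 = A[1, 1] - m * A[0, 1]       # 1 - 1e20 rounds to -1e20: the "1" is lost
y2 = b[1] - m * b[0]              # 2 - 1e20 rounds to -1e20
x2 = y2 / u22                     # 1.0
x1 = (b[0] - A[0, 1] * x2) / A[0, 0]  # (1 - 1) / eps = 0.0: completely wrong
print(x1, x2)
print(np.linalg.solve(A, b))      # pivoted solve returns ~[1, 1]
```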

Proposition

Catastrophic Cancellation

Statement

Let $a$ and $b$ be floating-point numbers with $a \approx b$. The relative error in computing $\hat{c} = \text{fl}(a - b)$ can be as large as:

$$\frac{|\hat{c} - (a - b)|}{|a - b|} \leq \frac{2u \cdot \max(|a|, |b|)}{|a - b|}$$

When $|a - b| \ll |a|$, this relative error is much larger than $u$. In the extreme, when $a$ and $b$ agree in $k$ leading digits, the result $a - b$ has only $p - k$ significant digits (where $p$ is the total precision).

Intuition

Subtraction of nearly equal numbers cancels the leading significant digits, leaving only the noisy trailing digits. The absolute error in $\hat{c}$ is no worse than before (bounded by $u \cdot \max(|a|, |b|)$), but the result $|a - b|$ is tiny, so the relative error explodes.

Example: in 7-digit arithmetic, $1.234567 - 1.234566 = 0.000001$. But if the last digits were rounded, the result might be $0.000002$ or $0.000000$, giving 100% relative error.
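A concrete double-precision instance (illustrative; not from the text): $1 - \cos x$ for tiny $x$ loses every significant digit, while the algebraically identical $2\sin^2(x/2)$ keeps full precision:

```python
import math

# Sketch: cos(1e-8) rounds to exactly 1.0 in float64 (the true value
# 1 - 5e-17 is within half an ulp of 1), so the naive subtraction
# cancels to 0. The rearranged form never subtracts nearby numbers.
x = 1e-8
naive = 1.0 - math.cos(x)            # catastrophic cancellation: 0.0
stable = 2.0 * math.sin(x / 2) ** 2  # mathematically identical, ~5e-17
print(naive, stable)
```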

Proof Sketch

The absolute error in $a - b$ is bounded by $u(|a| + |b|) \leq 2u \max(|a|, |b|)$ (from the rounding errors already present in $a$ and $b$). Dividing by $|a - b|$ gives the relative error bound. When $|a - b|$ is small compared to $|a|$, the ratio is large.

Why It Matters

Catastrophic cancellation, together with the closely related overflow and underflow failures, explains many numerical bugs in ML:

  • Computing $\text{Var}(X) = E[X^2] - (E[X])^2$ fails when the variance is small relative to the mean, because both terms are large and nearly equal. The stable alternative: compute the variance with the two-pass algorithm or Welford's online algorithm.
  • Softmax: $\exp(x_i) / \sum_j \exp(x_j)$ overflows when $x_i$ is large. The fix: subtract $\max_j x_j$ from all entries. This is mathematically equivalent but avoids overflow.
  • LogSumExp: $\log \sum_j \exp(x_j)$ requires the same trick to avoid overflow.

Key Stability Tricks in ML

Log-Sum-Exp

The naive computation of $\log \sum_{j=1}^n \exp(x_j)$ overflows when any $x_j > 709$ (for float64) or $x_j > 88$ (for float32). The stable version:

$$\text{LogSumExp}(x) = m + \log \sum_{j=1}^n \exp(x_j - m), \quad m = \max_j x_j$$

Subtracting $m$ ensures all exponents are $\leq 0$, so $\exp(x_j - m) \in (0, 1]$. This is mathematically identical but numerically safe.
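A minimal implementation of this shifted formula (a sketch; SciPy ships a production version as `scipy.special.logsumexp`):

```python
import numpy as np

# Stable log-sum-exp via the max-shift described above.
def logsumexp(x):
    x = np.asarray(x, dtype=float)
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1000.0, 1000.0])
print(logsumexp(x))                    # 1000 + log 2: finite and accurate
with np.errstate(over="ignore"):
    print(np.log(np.sum(np.exp(x))))   # naive version overflows to inf
```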

Softmax Stability

Similarly, $\text{softmax}(x)_i = \exp(x_i) / \sum_j \exp(x_j)$ is computed as:

$$\text{softmax}(x)_i = \exp(x_i - m) / \sum_j \exp(x_j - m)$$

This prevents overflow in both the numerator and the denominator.
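The same shift in code (an illustrative sketch):

```python
import numpy as np

# Stable softmax: shift by the max before exponentiating.
def softmax(x):
    x = np.asarray(x, dtype=float)
    z = np.exp(x - np.max(x))   # all exponents <= 0, so no overflow
    return z / z.sum()

p = softmax(np.array([1000.0, 1001.0, 1002.0]))
print(p)   # finite probabilities summing to 1, despite huge logits
```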

Variance Computation

The textbook formula $\text{Var}(X) = E[X^2] - (E[X])^2$ suffers from catastrophic cancellation when the mean is large and the variance is small. Welford's algorithm computes the variance in a single pass with $O(1)$ memory and avoids this:

Update the running mean $\bar{x}_n$ and the running sum of squared differences $M_n$:

  • $\delta = x_n - \bar{x}_{n-1}$
  • $\bar{x}_n = \bar{x}_{n-1} + \delta / n$
  • $M_n = M_{n-1} + \delta (x_n - \bar{x}_n)$
  • $\text{Var} = M_n / (n - 1)$
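The updates above translate directly into code (a sketch; `welford` is an illustrative name):

```python
# Welford's single-pass variance, following the updates listed above.
def welford(xs):
    mean, M = 0.0, 0.0
    for n, x in enumerate(xs, start=1):
        delta = x - mean
        mean += delta / n            # update the running mean
        M += delta * (x - mean)      # accumulate squared differences
    return mean, M / (len(xs) - 1)   # sample variance

# Large mean, small variance: the regime where E[X^2] - (E[X])^2 fails.
data = [1e8 + v for v in (4.0, 7.0, 13.0, 16.0)]
mean, var = welford(data)
print(mean, var)   # variance 30.0 despite the 1e8 offset
```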

Attention Scaling

In the transformer, attention logits are $QK^T / \sqrt{d_k}$ where $d_k$ is the head dimension. Without the $\sqrt{d_k}$ scaling, the dot products have variance $O(d_k)$ (if the entries are independent with zero mean and unit variance), pushing the softmax into its saturated regime where gradients vanish. The scaling keeps the logits in a range where softmax has meaningful gradients.
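A quick empirical check (illustrative): dot products of unit-variance random vectors have variance close to $d_k$, and the $1/\sqrt{d_k}$ scaling restores unit variance:

```python
import numpy as np

# Sketch: variance of query-key dot products before and after the
# 1/sqrt(d_k) scaling, estimated over many random vector pairs.
rng = np.random.default_rng(0)
d_k = 64
q = rng.standard_normal((10000, d_k))
k = rng.standard_normal((10000, d_k))
logits = np.sum(q * k, axis=1)         # 10000 unscaled dot products
print(np.var(logits))                  # close to d_k = 64
print(np.var(logits / np.sqrt(d_k)))   # close to 1 after scaling
```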

Mixed-Precision Training

Modern ML training uses float16 (or bfloat16) for forward and backward passes and float32 for weight updates. The accumulation in float32 prevents gradient underflow: gradients in float16 can underflow to zero when they are smaller than about $2^{-24} \approx 6 \times 10^{-8}$. Loss scaling (multiplying the loss by a large constant, computing gradients, then dividing by the same constant) shifts the gradient distribution into the representable range.
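A toy demonstration of the underflow and the loss-scaling fix (illustrative values, not from the text):

```python
import numpy as np

# Sketch: a gradient of 1e-8 underflows to zero in float16, but a
# scaled copy survives and can be unscaled in higher precision.
g = 1e-8                               # below half of float16's smallest subnormal
print(np.float16(g))                   # 0.0: the gradient is lost
scale = 65536.0                        # a typical power-of-two loss scale
g_scaled = np.float16(g * scale)       # ~6.55e-4: representable in float16
g_recovered = float(g_scaled) / scale  # unscale after the float16 stage
print(g_recovered)                     # close to the true 1e-8
```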

Residual Connections and BatchNorm

Residual connections ($y = x + f(x)$) add a skip path that preserves signal magnitude. Without them, signals in deep networks can shrink exponentially (vanishing gradients) or grow exponentially (exploding gradients). BatchNorm normalizes intermediate activations to zero mean and unit variance, which was originally motivated as a remedy for internal covariate shift and in practice stabilizes training. Both are engineering responses to numerical conditioning problems in deep networks.

Common Confusions

Watch Out

Conditioning is a property of the problem, not the algorithm

A matrix with condition number $10^{15}$ produces large forward errors under ANY algorithm, not just unstable ones. No amount of algorithmic cleverness can compensate for ill-conditioning. The only solution is to reformulate the problem (regularization, preconditioning) or accept the limited precision.

Watch Out

Small residual does not mean accurate solution

For the linear system $Ax = b$, the residual $r = b - A\hat{x}$ can be small even when $\hat{x}$ is far from the true $x$, if $A$ is ill-conditioned. The bound is $\|x - \hat{x}\| / \|x\| \leq \kappa(A) \cdot \|r\| / \|b\|$. With $\kappa(A) = 10^{10}$ and $\|r\|/\|b\| = 10^{-16}$ (machine epsilon), the solution error can be $10^{-6}$: only about 6 digits of accuracy despite a tiny residual.
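A sketch with the notoriously ill-conditioned Hilbert matrix makes the gap concrete (illustrative, not from the text):

```python
import numpy as np

# Sketch: tiny residual but much larger solution error on an
# ill-conditioned system (10x10 Hilbert matrix, kappa ~ 1e13).
n = 10
A = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])
x_true = np.ones(n)
b = A @ x_true
x_hat = np.linalg.solve(A, b)
res = np.linalg.norm(b - A @ x_hat) / np.linalg.norm(b)
err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
print(res)   # near machine epsilon
print(err)   # many orders of magnitude larger
```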

Watch Out

float16 is not just float32 with less precision

float16 has a much smaller dynamic range (about $10^{-8}$ to $65504$, counting subnormals) than float32 ($\sim 10^{-38}$ to $\sim 10^{38}$). Underflow and overflow are not just precision issues; they produce zeros and infinities. bfloat16 has the same dynamic range as float32 (same exponent bits) but less precision (8 vs 24 significand bits, including the implicit leading bit). The choice between float16 and bfloat16 depends on whether precision or dynamic range is more important for the application.
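NumPy's `finfo` shows the ranges directly (NumPy has no bfloat16 dtype, so the bfloat16 figures are stated only in the text above):

```python
import numpy as np

# Sketch: float16's narrow range versus float32, and the hard
# failure modes at its edges (inf and exact zero, not rounding).
f16, f32 = np.finfo(np.float16), np.finfo(np.float32)
print(f16.max, f16.tiny)     # 65504 and ~6.1e-5 (smallest positive normal)
print(f32.max, f32.tiny)     # ~3.4e38 and ~1.2e-38
print(np.float16(70000.0))   # overflow: inf, not a rounded large value
print(np.float16(1e-9))      # underflow: exactly 0.0
```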

Watch Out

Numerical stability and algorithmic stability are different concepts

Numerical stability concerns rounding errors in finite-precision arithmetic. Algorithmic stability (in the ML sense) concerns how much the learned model changes when one training example is added or removed. Both use the word "stability" but they are different properties. A learning algorithm can be algorithmically stable (small model change per sample) while being numerically unstable (rounding errors corrupt the gradient), or vice versa.

Key Takeaways

  • Forward error ≤ condition number × backward error. This decomposition separates the problem's sensitivity from the algorithm's accuracy.
  • Gaussian elimination with partial pivoting is backward stable; the forward error is controlled by the condition number.
  • Catastrophic cancellation occurs when subtracting nearly equal numbers, destroying relative precision.
  • The log-sum-exp trick, softmax centering, and Welford variance are standard defenses against numerical instability.
  • Attention scaling by $1/\sqrt{d_k}$ prevents softmax saturation.
  • Mixed-precision training uses float16 for speed and float32 for accumulation to avoid gradient underflow.
  • Conditioning is a problem property; stability is an algorithm property. Know which one is causing your trouble.

Exercises

ExerciseCore

Problem

The matrix $A = \begin{pmatrix} 1 & 1 \\ 1 & 1.0001 \end{pmatrix}$ has eigenvalues approximately $2.0001$ and $0.0001$. Compute its condition number $\kappa_2(A) = \sigma_{\max}/\sigma_{\min}$. If you solve $Ax = b$ with a backward stable algorithm in float64 ($u \approx 10^{-16}$), what is the approximate worst-case relative error in $x$?

ExerciseCore

Problem

Show that computing $f(x) = \sqrt{x+1} - \sqrt{x}$ for large $x$ suffers from catastrophic cancellation. Derive a mathematically equivalent form that is numerically stable.

ExerciseAdvanced

Problem

In mixed-precision training, gradients are computed in float16 and accumulated in float32. Explain why loss scaling is necessary. If the typical gradient magnitude is $10^{-5}$ and the smallest positive (subnormal) float16 value is $\approx 6 \times 10^{-8}$, what loss scale factor prevents underflow?

ExerciseAdvanced

Problem

The two-pass variance algorithm computes $\bar{x} = \frac{1}{n}\sum x_i$ first, then $s^2 = \frac{1}{n-1}\sum (x_i - \bar{x})^2$. The one-pass formula $s^2 = \frac{1}{n-1}\left(\sum x_i^2 - n\bar{x}^2\right)$ suffers from catastrophic cancellation. Construct a dataset of 4 numbers where the one-pass formula gives a negative variance in 4-digit arithmetic but the two-pass formula gives a positive result.

References

Canonical:

  • Higham, Accuracy and Stability of Numerical Algorithms (2nd ed., 2002), Chapters 1-3, 9
  • Trefethen & Bau, Numerical Linear Algebra (1997), Lectures 12-16

ML Connections:

  • Micikevicius et al., "Mixed Precision Training" (ICLR 2018)
  • Vaswani et al., "Attention Is All You Need" (2017), Section 3.2.1 (scaling by $\sqrt{d_k}$)

Accessible:

  • Goldberg, "What Every Computer Scientist Should Know About Floating-Point Arithmetic" (1991)
  • Ioffe & Szegedy, "Batch Normalization: Accelerating Deep Network Training" (2015)

Last reviewed: April 2026
