Foundations
Numerical Stability and Conditioning
Continuous math becomes real only through finite-precision approximation. Condition numbers, backward stability, catastrophic cancellation, and why theorems about reals do not transfer cleanly to floating-point.
Why This Matters
Every computation in ML runs on finite-precision hardware. Gradients are computed in float32 or float16. Matrix multiplications accumulate rounding errors. Softmax overflows without careful implementation. Loss functions blow up (to $-\infty$, and then NaN in subsequent operations) when the logarithm receives a zero.
The theorems you learn in analysis assume exact real arithmetic. Numerical stability theory tells you what happens when those assumptions fail. A numerically unstable algorithm can produce garbage output even when the underlying mathematical problem is well-posed. A poorly conditioned problem amplifies input perturbations regardless of the algorithm.
Understanding the distinction between conditioning (a property of the problem) and stability (a property of the algorithm) is required for debugging training failures, choosing implementations, and understanding why tricks like log-sum-exp, BatchNorm, residual connections, and attention scaling exist.
Prerequisites
This page assumes familiarity with floating-point arithmetic (IEEE 754, machine epsilon, rounding) and matrix operations (norms, eigenvalues, singular values).
Core Definitions
Forward Error
The forward error of a computation $\hat{y} = \tilde{f}(x)$ is the difference between the computed result $\hat{y}$ and the true result $y = f(x)$: $\hat{y} - y$.
Relative forward error: $\dfrac{|\hat{y} - y|}{|y|}$.
Backward Error
The backward error of a computation asks: for what perturbed input $x + \Delta x$ does the exact function produce the computed output $\hat{y}$? That is, find the smallest $\Delta x$ such that:
$$\hat{y} = f(x + \Delta x)$$
The backward error is $\|\Delta x\| / \|x\|$. A small backward error means the computed answer is the exact answer to a nearby problem.
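To make the two notions concrete, here is a small sketch in plain Python. The deliberately crude square-root routine (name and iteration count are our illustrative choices, not from the text) plays the role of an inexact algorithm; its backward error is found exactly because $f^{-1}$ is easy to apply:

```python
import math

def crude_sqrt(x, iters=3):
    """Deliberately low-accuracy sqrt: a few Newton steps from a poor guess."""
    y = x  # poor initial guess
    for _ in range(iters):
        y = 0.5 * (y + x / y)
    return y

x = 2.0
y_hat = crude_sqrt(x)
y_true = math.sqrt(x)

# Forward error: how far is the computed result from the true result?
forward_err = abs(y_hat - y_true) / abs(y_true)
# Backward error: y_hat is the EXACT square root of x + dx, dx = y_hat**2 - x.
backward_err = abs(y_hat**2 - x) / abs(x)
print(f"forward error  = {forward_err:.2e}")
print(f"backward error = {backward_err:.2e}")
```

Both errors come out around $10^{-6}$ here: the computed value is the exact answer to a nearby problem, and since $\sqrt{\cdot}$ is well-conditioned at $x=2$, the forward error is comparable.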
Condition Number
The condition number of a problem measures how sensitive the output is to small perturbations in the input. For a differentiable function $f$ at input $x$:
$$\kappa(f, x) = \frac{\|f'(x)\|\,\|x\|}{\|f(x)\|}$$
For a matrix $A$, the condition number with respect to solving $Ax = b$ is:
$$\kappa(A) = \|A\|\,\|A^{-1}\| = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}$$
where $\sigma_{\max}$ and $\sigma_{\min}$ are the largest and smallest singular values. A large condition number means the problem is ill-conditioned: small input changes produce large output changes.
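The singular-value formula translates directly into code. A minimal sketch (assumes NumPy; the random test matrix is illustrative):

```python
import numpy as np

# Condition number of a random matrix from its singular values.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))

s = np.linalg.svd(A, compute_uv=False)  # singular values, in descending order
kappa = s[0] / s[-1]                    # sigma_max / sigma_min

# Cross-check against NumPy's built-in 2-norm condition number.
assert np.isclose(kappa, np.linalg.cond(A, 2))
print(f"kappa_2(A) = {kappa:.2e}")
```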
Numerical Stability
An algorithm is numerically stable if it produces results with small backward error for all inputs. An algorithm is backward stable if the computed result is the exact result for a slightly perturbed input:
$$\tilde{f}(x) = f(x + \Delta x), \qquad \frac{\|\Delta x\|}{\|x\|} = O(\epsilon_{\text{mach}})$$
where $\epsilon_{\text{mach}}$ is machine epsilon. Backward stability is the gold standard for numerical algorithms.
The Central Relationship
The forward error is bounded by the product of condition number and backward error:
$$\text{forward error} \lesssim \kappa \times \text{backward error}$$
This means:
- Well-conditioned + backward stable = accurate result. Small backward error, small amplification.
- Ill-conditioned + backward stable = large forward error, but the best you can do. The problem itself is sensitive; no algorithm can avoid the amplification.
- Well-conditioned + unstable = avoidable inaccuracy. Switch to a better algorithm.
- Ill-conditioned + unstable = disaster. Both the problem and the algorithm contribute to error.
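The four cases above can be seen numerically. This sketch (assumes NumPy; the SVD-style construction and the specific $\kappa$ values are illustrative choices) builds matrices with chosen condition numbers and solves with a backward-stable LAPACK routine, so the forward error tracks $\kappa \cdot \epsilon_{\text{mach}}$:

```python
import numpy as np

# Build A = U diag(s) V^T with prescribed singular values, so kappa is chosen
# exactly; then watch the forward error of a backward-stable solve grow ~ kappa.
rng = np.random.default_rng(0)
n = 50
U, _ = np.linalg.qr(rng.standard_normal((n, n)))  # random orthogonal factor
V, _ = np.linalg.qr(rng.standard_normal((n, n)))

errs = []
for kappa in (1e2, 1e6, 1e10):
    s = np.logspace(0, -np.log10(kappa), n)       # singular values 1 .. 1/kappa
    A = U @ np.diag(s) @ V.T
    x_true = np.ones(n)
    x_hat = np.linalg.solve(A, A @ x_true)        # LAPACK solve: backward stable
    errs.append(np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
    print(f"kappa={kappa:.0e}  relative forward error={errs[-1]:.1e}")
```

The algorithm is the same in every iteration; only the conditioning of the problem changes, and the forward error changes with it.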
Main Theorems
Backward Stability of Gaussian Elimination with Partial Pivoting
Statement
Gaussian elimination with partial pivoting computes a solution $\hat{x}$ of $Ax = b$ such that:
$$(A + \Delta A)\hat{x} = b, \qquad \frac{\|\Delta A\|}{\|A\|} = O(n^2 \rho_n\, u)$$
where $u$ is the unit roundoff and $\rho_n$ is a growth factor. For partial pivoting, $\rho_n \le 2^{n-1}$ in the worst case, but $\rho_n = O(1)$ in practice for almost all matrices.
The forward error satisfies:
$$\frac{\|\hat{x} - x\|}{\|x\|} = O\!\left(\kappa(A)\, n^2 \rho_n\, u\right)$$
Intuition
Gaussian elimination with partial pivoting is backward stable: the computed solution $\hat{x}$ is the exact solution of a slightly perturbed system $(A + \Delta A)\hat{x} = b$. The perturbation is small (on the order of machine epsilon times the matrix norm). The forward error is this backward error amplified by the condition number $\kappa(A)$. If $A$ is well-conditioned, the result is accurate. If $A$ is ill-conditioned, the error can be large, but no direct method can do better.
Proof Sketch
Track the rounding errors through the elimination process. Each elementary row operation introduces a relative error of $O(u)$. Partial pivoting ensures that the multipliers are bounded by 1, preventing individual steps from amplifying errors excessively. The accumulated error after the $O(n^3)$ operations is bounded by $O(n^2 \rho_n\, u\, \|A\|)$, where the growth factor $\rho_n$ accounts for the worst-case accumulation through the pivot sequence.
The full proof appears in Higham, Accuracy and Stability of Numerical Algorithms, Chapter 9.
Why It Matters
This theorem explains why Gaussian elimination works well in practice despite theoretical worst-case growth factors of $2^{n-1}$. It also explains why condition number matters: even with a perfect algorithm, an ill-conditioned matrix produces large forward error. In ML, condition numbers arise everywhere: the Hessian condition number determines gradient descent convergence speed, the condition number of the data matrix affects linear regression stability, and the condition number of attention logits affects softmax precision.
Failure Mode
Without pivoting, Gaussian elimination can be catastrophically unstable even for well-conditioned matrices. The growth factor can be exponential. Example: the matrix $\begin{pmatrix} \epsilon & 1 \\ 1 & 1 \end{pmatrix}$ for small $\epsilon$ produces an enormous multiplier ($1/\epsilon$) without pivoting, leading to complete loss of precision.
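The $2 \times 2$ example can be run directly. A minimal sketch (NumPy; $\epsilon = 10^{-20}$ chosen for illustration, elimination written out by hand to expose the multiplier):

```python
import numpy as np

# Solve [[eps, 1], [1, 1]] x = [1, 2] without and with pivoting.
# For tiny eps, the exact solution is close to x = [1, 1].
eps = 1e-20
A = np.array([[eps, 1.0], [1.0, 1.0]])
b = np.array([1.0, 2.0])

# No pivoting: eliminate row 2 using the huge multiplier m = 1/eps.
m = A[1, 0] / A[0, 0]        # 1e20
u22 = A[1, 1] - m * A[0, 1]  # 1 - 1e20 rounds to -1e20: the "1" is lost
y2 = b[1] - m * b[0]         # 2 - 1e20 rounds to -1e20
x2 = y2 / u22                # 1.0
x1 = (b[0] - A[0, 1] * x2) / A[0, 0]  # (1 - 1) / eps = 0: wrong (true ~1)
x_nopivot = np.array([x1, x2])

x_pivot = np.linalg.solve(A, b)  # LAPACK uses partial pivoting
print(x_nopivot)  # first component catastrophically wrong
print(x_pivot)    # accurate
```

The matrix itself is well-conditioned; the failure is entirely the algorithm's.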
Partial pivoting fails in rare pathological cases (Wilkinson matrices) where $\rho_n = 2^{n-1}$. Complete pivoting reduces $\rho_n$ further but is more expensive.
Catastrophic Cancellation
Statement
Let $x$ and $y$ be floating-point numbers with $x \approx y$. The relative error in computing $\mathrm{fl}(x - y)$ can be as large as:
$$\frac{(|x| + |y|)\,u}{|x - y|}$$
When $|x - y| \ll |x| + |y|$, this relative error is much larger than $u$. In the extreme, when $x$ and $y$ agree in $k$ leading digits, the result has only $t - k$ significant digits (where $t$ is the total precision).
Intuition
Subtraction of nearly equal numbers cancels the leading significant digits, leaving only the noisy trailing digits. The absolute error in $x - y$ is no worse than before (bounded by $(|x| + |y|)u$), but the result is tiny, so the relative error explodes.
Example: in 7-digit arithmetic, $1.234567 - 1.234566 = 0.000001$. But if the last digits were already rounded, the result might be $0.000000$ or $0.000002$, giving 100% relative error.
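The 7-digit story can be reproduced with float32, which carries roughly seven decimal digits. A sketch (NumPy; the two literals are our illustrative choices):

```python
import numpy as np

# Subtracting nearly equal float32 values destroys relative accuracy: these
# two literals differ by 1e-7, below float32's spacing (~1.19e-7) near 1.23,
# so they land on the same or adjacent representable values.
a = np.float32(1.2345678)
b = np.float32(1.2345677)
diff = float(a - b)               # 0.0 or one float32 ulp (~1.19e-7)
exact = 1.2345678 - 1.2345677     # ~1e-7, computed in float64

rel_err = abs(diff - exact) / exact
print(f"computed {diff:.3e}, exact {exact:.3e}, relative error {rel_err:.0%}")
```

The absolute error stays at the rounding level of the inputs; it is the tiny denominator that makes the relative error huge.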
Proof Sketch
The absolute error in $\mathrm{fl}(x) - \mathrm{fl}(y)$ is bounded by $(|x| + |y|)u$ (from the rounding errors already present in $x$ and $y$). Dividing by $|x - y|$ gives the relative error bound. When $|x - y|$ is small compared to $|x| + |y|$, the ratio is large.
Why It Matters
Catastrophic cancellation explains many numerical failures in ML:
- Computing $\operatorname{Var}(X) = E[X^2] - (E[X])^2$ fails when the variance is small relative to the mean, because both terms are large and nearly equal. The stable alternative: compute the variance using the two-pass algorithm or Welford's online algorithm.
- Softmax: $e^{x_i}$ overflows when $x_i$ is large. The fix: subtract $\max_j x_j$ from all entries. This is mathematically equivalent but numerically safe.
- LogSumExp: $\log \sum_i e^{x_i}$ requires the same trick to avoid overflow.
Key Stability Tricks in ML
Log-Sum-Exp
The naive computation of $\log \sum_i e^{x_i}$ overflows when any $x_i > 709$ (for float64) or $x_i > 88$ (for float32). The stable version:
$$\log \sum_i e^{x_i} = m + \log \sum_i e^{x_i - m}, \qquad m = \max_i x_i$$
Subtracting $m$ ensures all exponents are $\le 0$, so $e^{x_i - m} \in (0, 1]$. This is mathematically identical but numerically safe.
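A minimal implementation sketch (NumPy; the helper name `logsumexp` and the test values are ours):

```python
import numpy as np

def logsumexp(x):
    """Stable log(sum(exp(x))): shift by the max so every exponent is <= 0."""
    x = np.asarray(x, dtype=np.float64)
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1000.0, 1001.0, 1002.0])
with np.errstate(over="ignore"):
    print(np.log(np.sum(np.exp(x))))  # naive: exp overflows, result is inf
print(logsumexp(x))                   # stable: ~1002.41
```

SciPy ships the same idea as `scipy.special.logsumexp`; the shift by $m$ changes nothing mathematically but keeps every intermediate value representable.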
Softmax Stability
Similarly, $\operatorname{softmax}(x)_i = e^{x_i} / \sum_j e^{x_j}$ is computed as:
$$\operatorname{softmax}(x)_i = \frac{e^{x_i - m}}{\sum_j e^{x_j - m}}, \qquad m = \max_j x_j$$
This prevents overflow in both the numerator and the denominator.
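In code (a sketch; NumPy, with illustrative test logits):

```python
import numpy as np

def softmax(x):
    """Stable softmax: subtract the max before exponentiating."""
    x = np.asarray(x, dtype=np.float64)
    z = np.exp(x - np.max(x))  # all entries in (0, 1]: no overflow possible
    return z / np.sum(z)

p = softmax(np.array([1000.0, 1001.0, 1002.0]))
print(p)           # finite probabilities despite huge logits
print(np.sum(p))   # 1.0
```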
Variance Computation
The textbook formula $\sigma^2 = \frac{1}{n}\sum_i x_i^2 - \bar{x}^2$ suffers from catastrophic cancellation when the mean is large and the variance is small. Welford's algorithm computes the variance in a single pass with $O(1)$ memory and avoids this.
Update the running mean $\mu_k$ and sum of squared differences $M_k$:
$$\mu_k = \mu_{k-1} + \frac{x_k - \mu_{k-1}}{k}, \qquad M_k = M_{k-1} + (x_k - \mu_{k-1})(x_k - \mu_k)$$
The variance is $M_n / n$ (or $M_n / (n-1)$ for the sample variance).
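The update rule above translates directly into a loop. A sketch (helper name `welford_variance` and the shifted test data are ours):

```python
import numpy as np

def welford_variance(xs):
    """One-pass population variance via Welford's update: O(1) memory."""
    mean, m2, n = 0.0, 0.0, 0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # uses both the old and the new mean
    return m2 / n

data = 1e8 + np.array([1.0, 2.0, 3.0, 4.0])   # large mean, tiny variance
naive = np.mean(data**2) - np.mean(data)**2   # one-pass textbook formula
print(welford_variance(data))  # 1.25, correct
print(naive)                   # corrupted by cancellation
```

With a mean of $10^8$, the squares sit near $10^{16}$, beyond the $2^{53}$ range where float64 can represent integers exactly, so the naive difference loses the answer entirely while Welford's update never forms the large intermediate terms.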
Attention Scaling
In the transformer, attention logits are $q^\top k / \sqrt{d_k}$, where $d_k$ is the head dimension. Without the scaling, the dot products grow as $\sqrt{d_k}$ (by the CLT, if entries have unit variance), pushing the softmax into its saturated regime where gradients vanish. The $1/\sqrt{d_k}$ scaling keeps the logits in a range where softmax has meaningful gradients.
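The $\sqrt{d_k}$ growth is easy to observe empirically. A sketch (NumPy; the dimension and sample count are illustrative):

```python
import numpy as np

# Dot products of d-dimensional unit-variance vectors have std ~ sqrt(d);
# the 1/sqrt(d) scaling restores std ~ 1.
rng = np.random.default_rng(0)
d = 256
q = rng.standard_normal((10000, d))
k = rng.standard_normal((10000, d))

logits = np.einsum("nd,nd->n", q, k)  # 10000 independent dot products
print(np.std(logits))                 # ~ sqrt(256) = 16
print(np.std(logits / np.sqrt(d)))    # ~ 1 after scaling
```

Logits with standard deviation 16 would drive softmax almost entirely to one-hot outputs, which is exactly the saturated, vanishing-gradient regime the scaling avoids.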
Mixed-Precision Training
Modern ML training uses float16 (or bfloat16) for forward and backward passes and float32 for weight updates. The accumulation in float32 prevents gradient underflow: gradients in float16 can underflow to zero when they are smaller than about $6 \times 10^{-8}$ (float16's smallest subnormal, $2^{-24}$). Loss scaling (multiplying the loss by a large constant, computing gradients, then dividing by the same constant) shifts the gradient distribution into the representable range.
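The underflow-and-rescue mechanism can be sketched directly (NumPy; the gradient magnitude and the scale factor $2^{16}$ are illustrative choices):

```python
import numpy as np

# A gradient below float16's representable range flushes to zero;
# scaling by a power of two (exact in binary) moves it back into range.
grad = 1e-8                        # tiny gradient, perfectly fine in float64
print(np.float16(grad))            # 0.0: underflow

scale = 2.0 ** 16                  # loss-scale factor (illustrative)
scaled = np.float16(grad * scale)  # ~6.55e-4: comfortably representable
recovered = float(scaled) / scale  # unscale in float64
print(recovered)                   # ~1e-8 again, up to float16 rounding
```

Power-of-two scales are preferred in practice because multiplying and dividing by them changes only the exponent, introducing no additional rounding.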
Residual Connections and BatchNorm
Residual connections () add a skip path that preserves signal magnitude. Without them, signals in deep networks can shrink exponentially (vanishing gradients) or grow exponentially (exploding gradients). BatchNorm normalizes intermediate activations to zero mean and unit variance, preventing the internal covariate shift that causes training instability. Both are engineering responses to numerical conditioning problems in deep networks.
Common Confusions
Conditioning is a property of the problem, not the algorithm
A matrix with condition number $\kappa(A)$ produces forward errors on the order of $\kappa(A)\,\epsilon_{\text{mach}}$ under ANY algorithm, not just unstable ones. No amount of algorithmic cleverness can compensate for ill-conditioning. The only solution is to reformulate the problem (regularization, preconditioning) or accept the limited precision.
Small residual does not mean accurate solution
For the linear system $Ax = b$, the residual $r = b - A\hat{x}$ can be small even when $\hat{x}$ is far from the true $x$, if $A$ is ill-conditioned. The bound is $\frac{\|\hat{x} - x\|}{\|x\|} \le \kappa(A)\,\frac{\|r\|}{\|A\|\,\|x\|}$. With $\kappa(A) = 10^{10}$ and a relative residual of $10^{-16}$ (machine epsilon), the solution error can be $10^{-6}$, which is only 6 digits of accuracy despite a tiny residual.
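A concrete sketch of this gap (NumPy; the Hilbert matrix of order 12 is our illustrative ill-conditioned example, with $\kappa \approx 10^{16}$):

```python
import numpy as np

# Backward-stable solve on an ill-conditioned (Hilbert) matrix: the residual
# is tiny even though the computed solution is far from the true one.
n = 12
A = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])
x_true = np.ones(n)
b = A @ x_true
x_hat = np.linalg.solve(A, b)

residual = np.linalg.norm(b - A @ x_hat) / np.linalg.norm(b)
error = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
print(f"residual {residual:.1e}  vs  solution error {error:.1e}")
```

Checking the residual alone would suggest the solve succeeded; only comparing against the known true solution reveals the loss of accuracy.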
float16 is not just float32 with less precision
float16 has a much smaller dynamic range (about $6.1 \times 10^{-5}$ to $6.5 \times 10^{4}$ for normal numbers) compared to float32 (about $1.2 \times 10^{-38}$ to $3.4 \times 10^{38}$). Underflow and overflow are not just precision issues; they produce zeros and infinities. bfloat16 has the same dynamic range as float32 (same 8 exponent bits) but less precision (8 vs 24 significand bits, counting the implicit leading bit). The choice between float16 and bfloat16 depends on whether precision or dynamic range is more important for the application.
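These limits are queryable at runtime. A sketch (NumPy, which has no native bfloat16; bfloat16's range matches float32's since both use 8 exponent bits):

```python
import numpy as np

# Range and precision of float16 vs float32 via np.finfo.
for dt in (np.float16, np.float32):
    fi = np.finfo(dt)
    print(f"{dt.__name__}: min normal {fi.tiny:.1e}, "
          f"max {fi.max:.1e}, eps {fi.eps:.1e}")

print(np.float16(1e5))   # inf: overflow (float16 max is 65504)
print(np.float16(1e-9))  # 0.0: underflow
```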
Numerical stability and algorithmic stability are different concepts
Numerical stability concerns rounding errors in finite-precision arithmetic. Algorithmic stability (in the ML sense) concerns how much the learned model changes when one training example is added or removed. Both use the word "stability" but they are different properties. A learning algorithm can be algorithmically stable (small model change per sample) while being numerically unstable (rounding errors corrupt the gradient), or vice versa.
Key Takeaways
- Forward error = condition number times backward error. This decomposition separates the problem's sensitivity from the algorithm's accuracy.
- Gaussian elimination with partial pivoting is backward stable; the forward error is controlled by the condition number.
- Catastrophic cancellation occurs when subtracting nearly equal numbers, destroying relative precision.
- The log-sum-exp trick, softmax centering, and Welford variance are standard defenses against numerical instability.
- Attention scaling by $1/\sqrt{d_k}$ prevents softmax saturation.
- Mixed-precision training uses float16 for speed and float32 for accumulation to avoid gradient underflow.
- Conditioning is a problem property; stability is an algorithm property. Know which one is causing your trouble.
Exercises
Problem
A symmetric matrix $A$ has eigenvalues approximately $10^{4}$ and $10^{-4}$. Compute its condition number $\kappa_2(A)$. If you solve $Ax = b$ with a backward stable algorithm in float64 ($u \approx 1.1 \times 10^{-16}$), what is the approximate worst-case relative error in $\hat{x}$?
Problem
Show that computing $\sqrt{x + 1} - \sqrt{x}$ for large $x$ suffers from catastrophic cancellation. Derive a mathematically equivalent form that is numerically stable.
Problem
In mixed-precision training, gradients are computed in float16 and accumulated in float32. Explain why loss scaling is necessary. If the typical gradient magnitude is $10^{-7}$ and float16 has minimum positive normal $\approx 6.1 \times 10^{-5}$, what loss scale factor prevents underflow?
Problem
The two-pass variance algorithm computes $\bar{x}$ first, then $\frac{1}{n}\sum_i (x_i - \bar{x})^2$. The one-pass formula $\frac{1}{n}\sum_i x_i^2 - \bar{x}^2$ suffers from catastrophic cancellation. Construct a dataset of 4 numbers where the one-pass formula gives a negative variance in 4-digit arithmetic but the two-pass formula gives a positive result.
References
Canonical:
- Higham, Accuracy and Stability of Numerical Algorithms (2nd ed., 2002), Chapters 1-3, 9
- Trefethen & Bau, Numerical Linear Algebra (1997), Lectures 12-16
ML Connections:
- Micikevicius et al., "Mixed Precision Training" (ICLR 2018)
- Vaswani et al., "Attention Is All You Need" (2017), Section 3.2.1 (scaling by $1/\sqrt{d_k}$)
Accessible:
- Goldberg, "What Every Computer Scientist Should Know About Floating-Point Arithmetic" (1991)
- Ioffe & Szegedy, "Batch Normalization: Accelerating Deep Network Training" (2015)
Next Topics
- Conditioning and condition number: deeper treatment of condition numbers for specific problems in optimization and linear algebra
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Floating-Point Arithmetic (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)