Numerical Stability
Floating-Point Arithmetic
How computers represent real numbers, why they get it wrong, and why ML uses float32, float16, bfloat16, and int8. IEEE 754, machine epsilon, overflow, underflow, and catastrophic cancellation.
Why This Matters
Every number in your neural network (every weight, gradient, activation, and loss value) is a floating-point number with finite precision. When you train a model and the loss becomes NaN, when gradients explode or vanish, or when two numbers that should be equal are not, floating-point arithmetic is the root cause.
Understanding floating-point is not optional for ML practitioners. It explains why we use log-sum-exp instead of summing exponentials, why bfloat16 works for training but float16 sometimes does not, and why numerical stability is a real engineering constraint.
Mental Model
A floating-point number is scientific notation for computers. Just as $6.022 \times 10^{23}$ has a significand (6.022) and an exponent (23), a floating-point number has a mantissa and an exponent, but in base 2. The mantissa gives you precision (how many significant digits), and the exponent gives you range (how large or small the number can be).
The fundamental limitation: you have a fixed number of bits for the mantissa, so most real numbers cannot be represented exactly. They get rounded to the nearest representable number.
Formal Setup and Notation
IEEE 754 Floating-Point Representation
A floating-point number in IEEE 754 format is stored as three fields:

$$x = (-1)^s \times 1.m \times 2^{e - \text{bias}}$$

where $s$ is the sign bit (0 for positive, 1 for negative), $m$ is the mantissa (also called the significand or fraction), and $e$ is the biased exponent. The leading 1 in "$1.m$" is implicit (not stored), giving one extra bit of precision for free.
For float32: 1 sign bit, 8 exponent bits, 23 mantissa bits (32 total). For float64: 1 sign bit, 11 exponent bits, 52 mantissa bits (64 total).
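These bit fields can be inspected directly. A minimal sketch using only the Python standard library (the helper name `float32_fields` is illustrative, not a library API):

```python
import struct

def float32_fields(x: float) -> tuple[int, int, int]:
    """Unpack a float32 into its (sign, biased exponent, mantissa) bit fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # 8 exponent bits, biased by 127
    mantissa = bits & 0x7FFFFF       # 23 explicit mantissa bits
    return sign, exponent, mantissa

# 1.0 = (+1) * 1.0 * 2^(127-127): sign 0, biased exponent 127, mantissa 0
fields_of_one = float32_fields(1.0)
```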
Machine Epsilon
Machine epsilon $\varepsilon$ is the smallest number such that $\mathrm{fl}(1 + \varepsilon) > 1$, where $\mathrm{fl}(\cdot)$ denotes rounding to the nearest floating-point number.
Equivalently, it is $\varepsilon = 2^{-(p-1)}$, where $p$ is the number of mantissa bits (including the implicit leading 1). It represents the worst-case relative rounding error for a single operation.
For float32: $\varepsilon = 2^{-23} \approx 1.2 \times 10^{-7}$ (about 7 decimal digits). For float64: $\varepsilon = 2^{-52} \approx 2.2 \times 10^{-16}$ (about 16 decimal digits).
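A quick sketch of epsilon in action, assuming NumPy is available:

```python
import numpy as np  # assumed available

eps32 = np.finfo(np.float32).eps   # 2**-23
eps64 = np.finfo(np.float64).eps   # 2**-52

one = np.float32(1.0)
lost = one + np.float32(1e-8)      # 1e-8 is well below eps32: rounds back to 1.0
kept = one + np.float32(1e-6)      # 1e-6 is above eps32: the change survives
```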
ULP (Unit in the Last Place)
The ULP of a floating-point number $x$ is the spacing between $x$ and the next representable floating-point number. For $x = 1.0$ in float32, the ULP is $2^{-23} \approx 1.2 \times 10^{-7}$. ULP grows with the magnitude of $x$: large numbers have larger gaps between representable values.
Core Definitions
Overflow occurs when the result of a computation exceeds the largest representable number. In float32, the maximum is approximately $3.4 \times 10^{38}$. Overflow produces infinity ($+\infty$ or $-\infty$). Computing $\exp(100)$ in float32 overflows.
Underflow occurs when the result is closer to zero than the smallest representable positive number. In float32, the smallest normal number is approximately $1.2 \times 10^{-38}$. Underflow produces zero or a denormalized (subnormal) number. Computing $10^{-30} \times 10^{-30}$ in float32 underflows to zero.
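Both failure modes are easy to reproduce, assuming NumPy:

```python
import numpy as np

with np.errstate(over="ignore", under="ignore"):
    blown_up = np.exp(np.float32(100.0))   # e^100 ≈ 2.7e43 > 3.4e38: infinity
    tiny = np.float32(1e-30)
    vanished = tiny * tiny                 # 1e-60 is below even the denormals: zero
    denormal = np.float32(1e-40)           # between ~1.4e-45 and ~1.2e-38: subnormal
# blown_up is inf; vanished is 0.0; denormal is a small positive subnormal
```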
Catastrophic cancellation occurs when subtracting two nearly equal numbers. If $a = 1.000001$ and $b = 1.000000$ in float32, then $a - b = 10^{-6}$. But if $a$ and $b$ each carry 7 significant digits of precision, $a - b$ has only 1 digit of precision. The relative error explodes.
This is why computing $\operatorname{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$ is numerically unstable: if the mean is large relative to the standard deviation, you subtract two large, nearly equal numbers.
Why ML Uses Reduced Precision
| Format | Sign | Exponent | Mantissa | Total bits | Precision | Range |
|---|---|---|---|---|---|---|
| float32 | 1 | 8 | 23 | 32 | ~7 digits | ~$\pm 3.4 \times 10^{38}$ |
| float16 | 1 | 5 | 10 | 16 | ~3 digits | ~$\pm 6.6 \times 10^{4}$ |
| bfloat16 | 1 | 8 | 7 | 16 | ~2 digits | ~$\pm 3.4 \times 10^{38}$ |
| int8 | - | - | - | 8 | 256 values | $[-128, 127]$ |
float16 has limited range (max ~65504) but decent precision. Gradients and activations in deep networks can exceed 65504, causing overflow.
bfloat16 sacrifices precision for range. It has the same exponent range as float32, so overflow is rare. This is why bfloat16 is preferred for training: gradients rarely overflow.
int8 quantization maps continuous values to 256 discrete levels. Used for inference (not training) to reduce memory and compute by 4x vs float32.
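A minimal sketch of one common scheme, symmetric per-tensor linear quantization (the helper names are illustrative; real libraries add per-channel scales and calibration):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric linear quantization: map floats onto int8 levels in [-127, 127]."""
    scale = np.abs(x).max() / 127.0          # one scale factor for the whole tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.0, 0.03, 0.999], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)     # each entry is off by at most about scale/2
```

The round trip loses at most half a quantization step per value, which is the price paid for the 4x memory reduction versus float32.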
Main Theorems
Fundamental Axiom of Floating-Point Arithmetic
Statement
For any real number $x$ in the representable range, the floating-point representation satisfies:
$$\mathrm{fl}(x) = x(1 + \delta), \qquad |\delta| \le \varepsilon$$
For any arithmetic operation $\circ \in \{+, -, \times, \div\}$:
$$\mathrm{fl}(a \circ b) = (a \circ b)(1 + \delta), \qquad |\delta| \le \varepsilon$$
Each floating-point operation introduces a relative error of at most $\varepsilon$.
Intuition
Every floating-point operation is "almost right": the result is within a relative factor of $1 + \varepsilon$ of the true answer. The problem is that errors accumulate over many operations. After $n$ operations, the relative error can be as large as $n\varepsilon$ in the worst case (and $O(\sqrt{n}\,\varepsilon)$ on average for random, uncorrelated errors).
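One consequence is that floating-point addition is not even associative: the rounding error depends on the order of operations. A small deterministic sketch, assuming NumPy:

```python
import numpy as np

a = np.float32(1e8)    # the ULP (gap to the next float32) at 1e8 is 8.0
b = np.float32(4.0)    # exactly half an ULP of a

left = (a + b) + b     # each +4 is a half-ULP tie that rounds back to 1e8
right = a + (b + b)    # 4 + 4 = 8 is a full ULP: the sum actually moves
# Same inputs, different results: left == 1e8, right == 1e8 + 8
```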
Proof Sketch
The nearest floating-point number to $x$ differs from $x$ by at most half a ULP. The ULP of $x$ is at most $2^{-(p-1)}|x|$, where $p$ is the mantissa precision. So $|\mathrm{fl}(x) - x| \le \tfrac{1}{2} \cdot 2^{-(p-1)}|x|$. Dividing by $|x|$ gives the relative error bound.
Why It Matters
This axiom is the foundation of numerical analysis. Every error analysis in scientific computing starts from this bound. It tells you that single operations are accurate, and the challenge is controlling error accumulation over long computations (like training a neural network for millions of steps).
Failure Mode
The bound assumes no overflow or underflow. In the overflow regime (result exceeds max representable), you get infinity. In the underflow regime (result is too small), you lose relative accuracy because denormalized numbers have fewer significant bits.
Worked Examples of Precision Loss
Catastrophic cancellation in variance computation
Compute the variance of $x = (10000, 10001, 10002)$ in float32.
Naive formula: $\operatorname{Var}(x) = \frac{1}{n}\sum_i x_i^2 - \bar{x}^2$.
$\bar{x} = 10001$ (exact in float32). $\bar{x}^2 = 100{,}020{,}001$.
In float32, $\bar{x}$ is exact, but $\bar{x}^2 = 100{,}020{,}001$ has 9 significant digits while float32 provides only 7; the stored value loses the last digits. Then $\frac{1}{n}\sum_i x_i^2 \approx 100{,}020{,}001.67$, which is rounded the same way. Both numbers agree in their first 7 significant digits, so the subtraction produces a result with approximately 0 reliable digits. The computed variance (true value $2/3$) can come out as $0.0$, as $8.0$ (one float32 gap at this magnitude), or even negative, depending on rounding.
The stable one-pass algorithm (Welford's method) computes the mean and the sum of squared deviations $\sum_i (x_i - \bar{x})^2$ incrementally, avoiding the subtraction of two large numbers. The differences $x_i - \bar{x}$ are small, so no cancellation occurs.
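A sketch of Welford's method in plain Python (float64), using the data from the example above:

```python
def welford_variance(xs):
    """One-pass mean and population variance with no large-number subtraction."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)   # accumulates sum of squared deviations
    return mean, m2 / n

data = [10000.0, 10001.0, 10002.0]
mean, var = welford_variance(data)   # mean = 10001, var = 2/3, no cancellation
```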
Gradient accumulation in mixed precision training
In mixed-precision training, gradients are computed in float16 or bfloat16, then accumulated in float32. Consider adding a gradient of magnitude $10^{-8}$ to a running sum that has reached $1.0$. In float16, the smallest representable change to $1.0$ is about $10^{-3}$ ($2^{-10}$). Since $10^{-8} < 10^{-3}$, the addition rounds back to $1.0$. The small gradient is lost entirely.
In float32, the smallest representable change to $1.0$ is about $1.2 \times 10^{-7}$ ($2^{-23}$), which is still larger than $10^{-8}$, so even float32 loses this gradient. The fix: use Kahan summation or accumulate in float64 for critical quantities like loss values.
This is why loss scaling is used in mixed-precision training: multiply the loss by a large constant (e.g., $2^{16} = 65{,}536$) before backpropagation, which scales all gradients by the same factor, then divide by the scale factor when updating weights in float32. This shifts small gradients into the representable range of float16.
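A toy NumPy illustration of the idea (the scale $2^{16}$ is just an example; production systems adjust it dynamically):

```python
import numpy as np

grad = 1e-8                           # a small true gradient value
# Stored directly, it underflows float16's range entirely: np.float16(grad) == 0

scale = 2.0 ** 16
scaled = np.float16(grad * scale)     # 6.5536e-4: comfortably representable
recovered = float(scaled) / scale     # unscale in higher precision
# recovered ≈ 1e-8, within float16's ~0.05% relative error
```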
Canonical Examples
Why 0.1 + 0.2 does not equal 0.3
The decimal number 0.1 has no exact binary representation (it is a repeating fraction in base 2, like 1/3 in base 10). In float64, 0.1 is stored as approximately $0.100000000000000005551\ldots$, and similarly for 0.2. Their sum rounds to $0.30000000000000004$, which is a different float64 from the one closest to 0.3 ($0.29999999999999999\ldots$).
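This is directly observable in Python, whose floats are float64:

```python
import math

a = 0.1 + 0.2
assert a != 0.3                          # two distinct float64 values
assert repr(a) == "0.30000000000000004"
assert abs(a - 0.3) < 1e-15              # they differ by roughly one ULP

# The practical fix: compare with a tolerance, never with ==
assert math.isclose(a, 0.3, rel_tol=1e-9)
```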
Log-sum-exp trick
Computing $\log \sum_i e^{x_i}$ directly overflows if any $x_i$ is large. The log-sum-exp trick: let $M = \max_i x_i$, then $\log \sum_i e^{x_i} = M + \log \sum_i e^{x_i - M}$. Now the largest exponent is $e^0 = 1$, so no overflow. This is how softmax is computed in every deep learning framework.
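A minimal sketch of the trick in plain Python; note that the naive `math.exp(1000.0)` alone would raise an overflow error in float64:

```python
import math

def logsumexp(xs):
    """Stable log(sum(exp(x_i))): subtract the max before exponentiating."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

val = logsumexp([1000.0, 1000.0])   # exact answer: 1000 + log(2)
```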
Common Confusions
Floating-point numbers are not uniformly spaced
Floating-point numbers are not evenly spread over the real line. Near zero they are dense; near the maximum they are sparse. The gap between $1.0$ and the next float32 is $2^{-23} \approx 1.2 \times 10^{-7}$; the gap between $10^6$ and the next float32 is $0.0625$. Relative error is roughly constant, but absolute error grows with magnitude.
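NumPy exposes this gap directly via `np.spacing`:

```python
import numpy as np

# np.spacing(x) is the ULP of x: the distance to the next representable float.
gap_at_one = np.spacing(np.float32(1.0))   # 2**-23 ≈ 1.2e-7
gap_at_1e6 = np.spacing(np.float32(1e6))   # 0.0625
# Absolute gaps differ by a factor of ~500,000; relative gaps are comparable.
```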
Double precision does not fix all problems
Switching from float32 to float64 gives more precision but does not fix unstable algorithms. If an algorithm amplifies errors by a large condition number $\kappa$ (it is ill-conditioned), float64 just delays the problem. Fix the algorithm, not the precision.
Summary
- IEEE 754: a float is $(-1)^s \times 1.m \times 2^{e - \text{bias}}$; the mantissa sets precision, the exponent sets range
- Machine epsilon: worst-case relative error per operation; $\approx 10^{-7}$ for float32, $\approx 10^{-16}$ for float64
- Overflow: number too large, becomes infinity
- Underflow: number too small, becomes zero
- Catastrophic cancellation: subtracting nearly equal numbers destroys precision
- bfloat16 has float32's range but lower precision; preferred for training
- Always use log-space arithmetic for products of probabilities
Exercises
Problem
In float32 ($\varepsilon \approx 10^{-7}$), you compute $a = 1.234567$ and $b = 1.234566$. How many significant digits does $a - b$ have?
Problem
Explain why bfloat16 (8 exponent bits, 7 mantissa bits) is preferred over float16 (5 exponent bits, 10 mantissa bits) for training neural networks, despite having less precision.
References
Canonical:
- Goldberg, "What Every Computer Scientist Should Know About Floating-Point Arithmetic" (1991)
- IEEE Standard 754-2019 for Floating-Point Arithmetic
Current:
- Micikevicius et al., "Mixed Precision Training" (2018)
- Dettmers et al., "8-bit Optimizers via Block-wise Quantization" (2022)
Next Topics
The natural next steps from floating-point arithmetic:
- Whitening and decorrelation: improving numerical conditioning of data
- Numerical linear algebra: stable algorithms for solving linear systems
Last reviewed: April 2026