
Numerical Stability

Floating-Point Arithmetic

How computers represent real numbers, why they get it wrong, and why ML uses float32, float16, bfloat16, and int8. IEEE 754, machine epsilon, overflow, underflow, and catastrophic cancellation.


Why This Matters

Every number in your neural network (every weight, gradient, activation, and loss value) is a floating-point number with finite precision. When you train a model and the loss becomes NaN, when gradients explode or vanish, when two numbers that should be equal are not, floating-point arithmetic is the root cause.

Understanding floating-point is not optional for ML practitioners. It explains why we use log-sum-exp instead of summing exponentials, why bfloat16 works for training but float16 sometimes does not, and why numerical stability is a real engineering constraint.

Mental Model

A floating-point number is scientific notation for computers. Just as $6.022 \times 10^{23}$ has a significand (6.022) and an exponent (23), a floating-point number has a mantissa and an exponent, but in base 2. The mantissa gives you precision (how many significant digits), and the exponent gives you range (how large or small the number can be).

The fundamental limitation: you have a fixed number of bits for the mantissa, so most real numbers cannot be represented exactly. They get rounded to the nearest representable number.

Formal Setup and Notation

Definition

IEEE 754 Floating-Point Representation

A floating-point number in IEEE 754 format is stored as three fields:

$(-1)^s \times 1.m \times 2^{e - \text{bias}}$

where $s$ is the sign bit (0 for positive, 1 for negative), $m$ is the mantissa (also called significand or fraction), and $e$ is the exponent. The leading 1 in "$1.m$" is implicit (not stored), giving one extra bit of precision for free.

For float32: 1 sign bit, 8 exponent bits, 23 mantissa bits (32 total). For float64: 1 sign bit, 11 exponent bits, 52 mantissa bits (64 total).
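These fields can be inspected directly. A minimal sketch using only the standard library (the helper name `float32_fields` is illustrative), which reinterprets a float32's four bytes as an unsigned integer and slices out the three fields:

```python
import struct

def float32_fields(x):
    """Decompose a float32 into its IEEE 754 sign, exponent, and mantissa fields."""
    # Reinterpret the 4 bytes of the float32 as a big-endian unsigned 32-bit int.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                  # 1 bit
    exponent = (bits >> 23) & 0xFF     # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF         # 23 bits; the leading 1 is implicit
    return sign, exponent, mantissa

# 1.0 = (-1)^0 * 1.0 * 2^(127-127): sign 0, biased exponent 127, mantissa 0
print(float32_fields(1.0))   # (0, 127, 0)
# -2.0 = (-1)^1 * 1.0 * 2^(128-127): sign 1, biased exponent 128
print(float32_fields(-2.0))  # (1, 128, 0)
```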

Definition

Machine Epsilon

Machine epsilon is the smallest $\epsilon > 0$ such that $\mathrm{fl}(1 + \epsilon) > 1$, where $\mathrm{fl}(\cdot)$ denotes rounding to the nearest floating-point number.

Equivalently, it is $2^{-p}$ where $p$ is the number of mantissa bits (including the implicit leading 1). It represents the worst-case relative rounding error for a single operation.

For float32: $\epsilon_{\text{mach}} = 2^{-24} \approx 5.96 \times 10^{-8}$ (about 7 decimal digits). For float64: $\epsilon_{\text{mach}} = 2^{-53} \approx 1.11 \times 10^{-16}$ (about 16 decimal digits).
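A quick check of the definition, assuming NumPy. Note a convention difference: NumPy's `finfo.eps` reports the ULP of 1.0 ($2^{-23}$ for float32), which is twice the rounding unit $\epsilon_{\text{mach}} = 2^{-24}$ used here.

```python
import numpy as np

# NumPy reports the gap between 1.0 and the next representable number,
# i.e. the ULP of 1.0, which is twice the worst-case rounding error.
print(np.finfo(np.float32).eps)  # 1.1920929e-07 (2**-23)
print(np.finfo(np.float64).eps)  # 2.220446e-16  (2**-52)

# Adding exactly 2**-24 to 1.0 is a tie between 1.0 and 1 + 2**-23;
# round-to-nearest, ties-to-even picks 1.0. Anything slightly larger rounds up.
x = np.float32(1.0) + np.float32(2 ** -24)
print(x == 1.0)  # True: the tie rounds back down to 1.0
y = np.float32(1.0) + np.float32(1.1 * 2 ** -24)
print(y > 1.0)   # True: just above the tie, rounds up to 1 + 2**-23
```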

Definition

ULP (Unit in the Last Place)

The ULP of a floating-point number $x$ is the spacing between $x$ and the next representable floating-point number. For $x = 1.0$ in float32, the ULP is $2^{-23} \approx 1.19 \times 10^{-7}$. ULP grows with the magnitude of $x$: large numbers have larger gaps between representable values.
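Both the standard library and NumPy expose ULP directly (a small sketch; `math.ulp` requires Python 3.9+):

```python
import math
import numpy as np

# math.ulp gives the ULP of a float64; np.spacing works for any float dtype.
print(math.ulp(1.0))                 # 2**-52 ≈ 2.22e-16
print(np.spacing(np.float32(1.0)))   # 2**-23 ≈ 1.19e-07

# ULP grows with magnitude: adjacent float32 values near 1e30 are
# separated by an enormous gap.
print(np.spacing(np.float32(1e30)))  # ~1e23
```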

Core Definitions

Overflow occurs when the result of a computation exceeds the largest representable number. In float32, the maximum is approximately $3.4 \times 10^{38}$. Overflow produces infinity ($+\infty$ or $-\infty$). Computing $e^{100}$ in float32 overflows.

Underflow occurs when the result is closer to zero than the smallest representable positive number. In float32, the smallest normal number is approximately $1.18 \times 10^{-38}$. Underflow produces zero or a denormalized number. Computing $e^{-100}$ in float32 falls below the normal range and survives only as a denormal with few significant bits; $e^{-200}$ underflows all the way to zero.

Catastrophic cancellation occurs when subtracting two nearly equal numbers. If $a = 1.000001$ and $b = 1.000000$ in float32, then $a - b = 0.000001$. But if $a$ and $b$ each have 7 digits of precision, $a - b$ has only 1 digit of precision. The relative error explodes.

This is why computing $\mathrm{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$ is numerically unstable: if the mean is large relative to the standard deviation, you subtract two large, nearly equal numbers.
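The digit loss is visible directly (a sketch assuming NumPy). Note that 1.000001 is not exactly representable in float32, so even the "true" difference carries rounding noise:

```python
import numpy as np

# Subtracting two nearly equal float32 values: each input carries ~7
# significant digits, but the difference retains only ~1.
a = np.float32(1.000001)
b = np.float32(1.000000)

# a is stored as 1 + 8 * 2**-23 ≈ 1.00000095, so the difference is not 1e-6.
print(a - b)  # ~9.537e-07: the low-order digits are pure rounding noise
```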

Why ML Uses Reduced Precision

| Format   | Sign | Exponent | Mantissa | Total bits | Precision  | Range               |
|----------|------|----------|----------|------------|------------|---------------------|
| float32  | 1    | 8        | 23       | 32         | ~7 digits  | ~$10^{38}$          |
| float16  | 1    | 5        | 10       | 16         | ~3 digits  | ~$6.5 \times 10^4$  |
| bfloat16 | 1    | 8        | 7        | 16         | ~2 digits  | ~$10^{38}$          |
| int8     | -    | -        | -        | 8          | 256 values | $[-128, 127]$       |

float16 has limited range (max ~65504) but decent precision. Gradients and activations in deep networks can exceed 65504, causing overflow.

bfloat16 sacrifices precision for range. It has the same exponent range as float32, so overflow is rare. This is why bfloat16 is preferred for training: gradients rarely overflow.
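The range difference is easy to demonstrate. NumPy has no native bfloat16 dtype, so this sketch contrasts float16 with float32 (whose 8-bit exponent field bfloat16 shares):

```python
import numpy as np

# float16's largest finite value is 65504; anything sufficiently above it
# rounds to infinity.
print(np.float16(65504.0))  # 65504.0: still finite
print(np.float16(70000.0))  # inf: overflow

# float32 (and bfloat16, same exponent width) handles the magnitude easily.
print(np.float32(70000.0))  # 70000.0
```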

int8 quantization maps continuous values to 256 discrete levels. Used for inference (not training) to reduce memory and compute by 4x vs float32.
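A minimal symmetric quantization sketch (the helpers `quantize_int8` and `dequantize` are illustrative names, not any particular library's API): scale so the largest magnitude maps to 127, then round to the nearest integer code.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric int8 quantization: map [-max|x|, max|x|] onto [-127, 127]."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from int8 codes."""
    return q.astype(np.float32) * scale

x = np.array([0.1, -0.5, 0.25, 1.0], dtype=np.float32)
q, scale = quantize_int8(x)
print(q)                     # int8 codes, e.g. largest magnitude -> 127
print(dequantize(q, scale))  # close to x, within half a quantization step
```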

Main Theorems

Proposition

Fundamental Axiom of Floating-Point Arithmetic

Statement

For any real number $x$ in the representable range, the floating-point representation $\mathrm{fl}(x)$ satisfies:

$\mathrm{fl}(x) = x(1 + \delta), \quad |\delta| \leq \epsilon_{\text{mach}}$

For any arithmetic operation $\circ \in \{+, -, \times, /\}$:

$\mathrm{fl}(a \circ b) = (a \circ b)(1 + \delta), \quad |\delta| \leq \epsilon_{\text{mach}}$

Each floating-point operation introduces a relative error of at most $\epsilon_{\text{mach}}$.

Intuition

Every floating-point operation is "almost right": the result is within a factor of $(1 \pm \epsilon_{\text{mach}})$ of the true answer. The problem is that errors accumulate over many operations. After $n$ operations, the relative error can be as large as $n \epsilon_{\text{mach}}$ in the worst case (and $\sqrt{n}\, \epsilon_{\text{mach}}$ on average for random errors).
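Accumulation is easy to reproduce (a sketch assuming NumPy): sum one million float32 copies of 0.1 sequentially (`np.cumsum` applies the additions in order, rounding every partial sum) and compare with a float64 reference.

```python
import numpy as np

# One million sequential float32 additions of 0.1.
x = np.full(1_000_000, np.float32(0.1), dtype=np.float32)
seq32 = np.cumsum(x, dtype=np.float32)[-1]  # sequential, rounds each step
ref = np.sum(x, dtype=np.float64)           # float64 reference, near-exact

print(float(seq32))  # drifts visibly away from 100000
print(float(ref))    # ~100000.0015 (float32(0.1) is slightly above 0.1)
# Relative error is orders of magnitude above eps_mach ≈ 6e-8.
print(abs(float(seq32) - float(ref)) / float(ref))
```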

Proof Sketch

The nearest floating-point number to $x$ differs from $x$ by at most half a ULP. The ULP of $x$ is $|x| \cdot 2^{1-p}$ where $p$ is the mantissa precision. So $|\mathrm{fl}(x) - x| \leq \frac{1}{2}|x| \cdot 2^{1-p} = |x| \cdot 2^{-p} = |x|\, \epsilon_{\text{mach}}$. Dividing by $|x|$ gives the relative error bound.

Why It Matters

This axiom is the foundation of numerical analysis. Every error analysis in scientific computing starts from this bound. It tells you that single operations are accurate, and the challenge is controlling error accumulation over long computations (like training a neural network for millions of steps).

Failure Mode

The bound assumes no overflow or underflow. In the overflow regime (result exceeds max representable), you get infinity. In the underflow regime (result is too small), you lose relative accuracy because denormalized numbers have fewer significant bits.

Worked Examples of Precision Loss

Example

Catastrophic cancellation in variance computation

Compute the variance of $x = [10000.0, 10001.0, 10002.0]$ in float32.

Naive formula: $\mathrm{Var}(x) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$.

$\mathbb{E}[X] = 10001.0$ (exact in float32). $\mathbb{E}[X^2] = (10000^2 + 10001^2 + 10002^2)/3 = (10^8 + 100020001 + 100040004)/3 = 100020001.67$.

In float32, $10000^2 = 10^8$ is exact, but $10001^2 = 100020001$ has 9 significant digits while float32 provides only 7. The stored value is approximately $1.00020000 \times 10^8$, losing the last digit. Then $\mathbb{E}[X^2] - (\mathbb{E}[X])^2 \approx 100020001.67 - 100020001.0$. Both numbers agree in their first 7 significant digits, so the subtraction produces a result with approximately 0 reliable digits. The computed variance might be 0.0, 1.0, or any value in between, depending on rounding.

The stable one-pass algorithm (Welford's method) computes $\mathrm{Var}(x) = \frac{1}{n-1}\sum_i (x_i - \bar{x})^2$ incrementally, avoiding the subtraction of two large numbers. The differences $x_i - \bar{x}$ are small, so no cancellation occurs.
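A sketch of Welford's update (assuming NumPy for the naive float32 contrast). The naive formula compares population moments, while this Welford variant returns the sample variance, so the two are commented separately:

```python
import numpy as np

def welford_variance(xs):
    """Welford's one-pass sample variance: never subtracts two large,
    nearly equal quantities."""
    mean, m2, n = 0.0, 0.0, 0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # uses the *updated* mean
    return m2 / (n - 1)

data = [10000.0, 10001.0, 10002.0]
print(welford_variance(data))  # 1.0, the exact sample variance

# Naive population variance E[X^2] - E[X]^2 in float32: the true value is
# 2/3, but cancellation destroys it (often producing exactly 0.0).
x32 = np.array(data, dtype=np.float32)
naive = np.mean(x32 * x32) - np.mean(x32) ** 2
print(naive)
```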

Example

Gradient accumulation in mixed precision training

In mixed-precision training, gradients are computed in float16 or bfloat16, then accumulated in float32. Consider adding a gradient of magnitude $10^{-4}$ to a running sum that has reached $10^4$. In float16, the smallest representable change to $10^4$ is about $10^4 \times 2^{-10} \approx 9.8$. Since $10^{-4} \ll 9.8$, the addition $10^4 + 10^{-4}$ rounds back to $10^4$. The small gradient is lost entirely.

In float32, the smallest representable change to $10^4$ is about $10^4 \times 2^{-23} \approx 1.2 \times 10^{-3}$, which is still larger than $10^{-4}$, so even float32 loses this gradient. The fix: use Kahan summation or accumulate in float64 for critical quantities like loss values.
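Kahan (compensated) summation carries the rounding error of each addition in a separate compensation term. A sketch in float32, assuming NumPy:

```python
import numpy as np

def kahan_sum(values, dtype=np.float32):
    """Compensated summation: recover the low-order bits lost at each add."""
    s = dtype(0.0)
    c = dtype(0.0)                   # running compensation
    for v in values:
        y = dtype(v) - c             # apply the correction to the new term
        t = dtype(s + y)
        c = dtype(dtype(t - s) - y)  # (t - s) is what was actually added
        s = t
    return s

vals = [1e4] + [1e-4] * 10_000       # large head, many tiny contributions

# Naive sequential float32 sum: every 1e-4 rounds away against 1e4.
seq_naive = np.float32(0.0)
for v in vals:
    seq_naive = np.float32(seq_naive + np.float32(v))
print(seq_naive)          # 10000.0: the tiny terms vanished
print(kahan_sum(vals))    # ~10001.0: the tiny terms are preserved
```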

This is why loss scaling is used in mixed-precision training: multiply the loss by a large constant (e.g., $2^{16}$) before backpropagation, which scales all gradients by the same factor, then divide by the scale factor when updating weights in float32. This shifts small gradients into the representable range of float16.
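The mechanism in miniature (a sketch with a single scalar "gradient", assuming NumPy): a value below float16's range underflows to zero, but scaling by $2^{16}$ keeps it representable, and dividing by the scale in float32 recovers it.

```python
import numpy as np

grad = 1e-8                        # true gradient value
print(np.float16(grad))            # 0.0: below even float16's subnormal range

scale = 2.0 ** 16
scaled = np.float16(grad * scale)  # ~6.554e-4: comfortably representable
print(scaled)

# Unscale in float32, as a mixed-precision optimizer would.
unscaled = np.float32(scaled) / np.float32(scale)
print(unscaled)                    # ~1e-8, recovered
```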

Canonical Examples

Example

Why 0.1 + 0.2 does not equal 0.3

The decimal number 0.1 has no exact binary representation (it is a repeating fraction in base 2, like 1/3 in base 10). In float64, $\mathrm{fl}(0.1) \approx 0.1000000000000000055511\ldots$, and similarly for 0.2. Their sum is $0.3000000000000000444\ldots$, which rounds to a different float64 than $\mathrm{fl}(0.3) = 0.2999999999999999888\ldots$
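The standard library's `decimal` module can display the exact binary value each literal was rounded to:

```python
from decimal import Decimal

print(0.1 + 0.2 == 0.3)  # False
print(0.1 + 0.2)         # 0.30000000000000004

# Decimal(float) shows the exact value of the stored binary number.
print(Decimal(0.1))  # 0.1000000000000000055511...
print(Decimal(0.3))  # 0.2999999999999999888977...
```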

Example

Log-sum-exp trick

Computing $\log(\sum_i e^{x_i})$ directly overflows if any $x_i$ is large. The log-sum-exp trick: let $c = \max_i x_i$, then $\log(\sum_i e^{x_i}) = c + \log(\sum_i e^{x_i - c})$. Now the largest exponent is $e^0 = 1$, so no overflow. This is how softmax is computed in every deep learning framework.
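A direct translation of the trick (assuming NumPy; SciPy ships an equivalent as `scipy.special.logsumexp`):

```python
import numpy as np

def logsumexp(x):
    """log(sum(exp(x))) with the max subtracted to prevent overflow."""
    c = np.max(x)
    return c + np.log(np.sum(np.exp(x - c)))

x = np.array([1000.0, 1001.0, 1002.0])
# Naive np.log(np.sum(np.exp(x))) overflows: exp(1000) = inf even in float64.
print(logsumexp(x))  # ~1002.4076, computed without overflow
```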

Common Confusions

Watch Out

Floating-point numbers are not uniformly spaced

Floating-point numbers are not uniformly spaced. Near zero they are dense; near the maximum they are sparse. The gap between $1.0$ and the next float32 is $\sim 10^{-7}$; the gap between $10^{30}$ and the next float32 is $\sim 10^{23}$. Relative error is constant, but absolute error grows with magnitude.

Watch Out

Double precision does not fix all problems

Switching from float32 to float64 gives more precision but does not fix unstable algorithms. If an algorithm amplifies errors by $10^{10}$ (ill-conditioned), float64 just delays the problem. Fix the algorithm, not the precision.

Summary

  • IEEE 754: $(-1)^s \times 1.m \times 2^{e-\text{bias}}$
  • Machine epsilon: worst-case relative error per operation; $\sim 10^{-7}$ for float32, $\sim 10^{-16}$ for float64
  • Overflow: number too large, becomes infinity
  • Underflow: number too small, becomes zero
  • Catastrophic cancellation: subtracting nearly equal numbers destroys precision
  • bfloat16 has float32's range but lower precision; preferred for training
  • Always use log-space arithmetic for products of probabilities

Exercises

ExerciseCore

Problem

In float32 ($\epsilon_{\text{mach}} \approx 6 \times 10^{-8}$), you compute $a = 1.0000001$ and $b = 1.0000000$. How many significant digits does $a - b = 0.0000001$ have?

ExerciseAdvanced

Problem

Explain why bfloat16 (8 exponent bits, 7 mantissa bits) is preferred over float16 (5 exponent bits, 10 mantissa bits) for training neural networks, despite having less precision.

References

Canonical:

  • Goldberg, "What Every Computer Scientist Should Know About Floating-Point Arithmetic" (1991)
  • IEEE Standard 754-2019 for Floating-Point Arithmetic

Current:

  • Micikevicius et al., "Mixed Precision Training" (2018)
  • Dettmers et al., "8-bit Optimizers via Block-wise Quantization" (2022)

Next Topics

The natural next steps from floating-point arithmetic:

  • Whitening and decorrelation: improving numerical conditioning of data
  • Numerical linear algebra: stable algorithms for solving linear systems

Last reviewed: April 2026
