Bits and Codes · 35 min

Floating-Point Representation (IEEE 754)

How real numbers fit in 32 or 64 bits: sign, exponent, mantissa, the implicit leading 1, NaN, infinity, denormals, and the precision tradeoffs that bite ML.

Why This Matters

A float is not a real number. It is a 32-bit integer pattern that a hardware unit interprets as (-1)^s \cdot 1.m \cdot 2^{e-127}. Roughly 2^{32} \approx 4.3 \times 10^9 patterns must cover a range from about 10^{-38} to 10^{38}, so most reals get rounded. The constant 0.1 has no finite binary expansion; what you store is closer to 0.10000000149011612. When you sum 10^7 such values in a training loop, the drift is visible, and naive accumulation in float32 can lose the bottom 4-5 decimal digits.

ML systems live inside these limits. Gradient underflow, NaN explosions, and the reason bfloat16 keeps 8 exponent bits but only 7 mantissa bits all trace back to IEEE 754. Knowing the bit layout tells you why mixed precision needs a loss scale and why Kahan summation exists.

Core Definitions

Definition

Floating-point number

A number of the form (-1)^s \cdot M \cdot \beta^E where s \in \{0,1\} is the sign, M is the significand (mantissa), E is an integer exponent, and \beta is the radix. IEEE 754 binary formats use \beta = 2 and a normalized significand M = 1.m_1 m_2 \ldots m_p in [1, 2).

Definition

Binary32 (IEEE 754 single precision)

A 32-bit format: 1 sign bit, 8 exponent bits with bias 127, 23 fraction bits. For a normal value with stored exponent e \in [1, 254] and stored fraction bits m, the value is (-1)^s \cdot (1 + m / 2^{23}) \cdot 2^{e - 127}. The leading 1 is implicit and not stored.

Definition

Machine epsilon

The gap between 1.0 and the next representable number. For binary32, \varepsilon = 2^{-23} \approx 1.192 \times 10^{-7}. For binary64, \varepsilon = 2^{-52} \approx 2.220 \times 10^{-16}.

Definition

ULP (Unit in the Last Place)

The distance between two consecutive representable floats near a given value x. For a normal x in [2^e, 2^{e+1}), \mathrm{ulp}(x) = 2^{e-p} where p is the mantissa width (23 for binary32, 52 for binary64).

The Bit Layout

Binary32 packs three fields into one 32-bit word, MSB first:

 31  30        23 22                              0
+--+-----------+--------------------------------+
| s|  exponent |          fraction              |
+--+-----------+--------------------------------+
  1     8                    23 bits

Binary64 widens both fields: 1 sign, 11 exponent (bias 1023), 52 fraction. The exponent encoding uses biased (excess-K) representation: stored field e means true exponent e - 127 (or e - 1023). Bias makes float comparison work as unsigned integer comparison on the magnitude bits, which simplifies sort networks.

The stored exponent encodes three regimes:

  • e = 0: zero (if m = 0) or subnormal (if m \neq 0), value (-1)^s \cdot 0.m \cdot 2^{-126}.
  • 1 \leq e \leq 254: normal, value (-1)^s \cdot 1.m \cdot 2^{e-127}.
  • e = 255: infinity (if m = 0) or NaN (if m \neq 0).

The implicit leading 1 buys one extra bit of precision: a 23-bit fraction field encodes a 24-bit significand.

Worked Example: Encoding 1.5

Write 1.5 in binary: 1.1_2. Normalized form: 1.1 \cdot 2^0. So s = 0, true exponent 0, stored exponent 0 + 127 = 127 = 01111111_2, fraction = 10000000000000000000000.

Concatenate:

0 01111111 10000000000000000000000

As hex: 0x3FC00000. You can verify in C:

#include <stdio.h>
#include <string.h>

int main(void) {
    float x = 1.5f;
    unsigned int bits;
    memcpy(&bits, &x, 4);
    printf("%08x\n", bits);   // prints 3fc00000
    return 0;
}

Worked Example: Why 0.1 Is Not Exact

The decimal 0.10.1 in binary is the repeating fraction

0.0\overline{0011}_2 = 0.00011001100110011\ldots_2.

To normalize, shift left 4 places: 1.10011001100\ldots \cdot 2^{-4}. Stored exponent = -4 + 127 = 123 = 01111011_2. The 23-bit fraction is the first 23 bits after the implicit leading 1; the 24th bit governs round-to-nearest-even:

fraction bits: 10011001100110011001101
                                      ^ rounded up

Final binary32 encoding of 0.1: 0x3DCCCCCD. Decoded exactly, this equals

0.100000001490116119384765625.

The error is \approx 1.49 \times 10^{-9}, about 0.2 ULP at that magnitude (one ULP near 0.1 is 2^{-27} \approx 7.45 \times 10^{-9}).

For -0.1, flip the sign bit: 0xBDCCCCCD. The magnitude error is identical.

Special Values

Zero. Stored exponent 0 and fraction 0. There are two zeros: +0 (0x00000000) and -0 (0x80000000). They compare equal but are bit-distinct; 1/(+0) = +\infty, 1/(-0) = -\infty.

Infinity. Stored exponent all ones, fraction zero. +\infty is 0x7F800000. Produced by overflow (e.g. FLT_MAX \cdot 2) and by 1/0.

NaN. Stored exponent all ones, fraction nonzero. Two flavors:

  • quiet NaN (qNaN), MSB of fraction set, propagates silently
  • signaling NaN (sNaN), MSB of fraction clear, raises an FP exception on use

NaN has the property \text{NaN} \neq \text{NaN}. This is how x != x works as a NaN check.

Subnormals (denormals). When stored exponent is 0 and fraction nonzero, the leading bit is implicitly 0, not 1, and the exponent is fixed at -126. Subnormals fill the gap between 0 and the smallest normal, 2^{-126} \approx 1.18 \times 10^{-38}. The smallest positive binary32 is 2^{-149} \approx 1.40 \times 10^{-45}. Subnormals preserve "gradual underflow" but often run on a slow microcode path; many GPUs and SIMD modes set FTZ (flush-to-zero) and DAZ (denormals-are-zero) flags to skip them.

Precision and ULPs

Floats are densest near zero and sparser at large magnitudes. Between 2^e and 2^{e+1} there are exactly 2^{23} uniformly spaced binary32 values. Consequences:

  • Near 1.0, \mathrm{ulp} = 2^{-23} \approx 1.19 \times 10^{-7}.
  • Near 10^6 (which lies in [2^{19}, 2^{20})), \mathrm{ulp} = 2^{19-23} = 0.0625. You cannot represent 10^6 + 0.01 in binary32.
  • Near 2^{24} = 16{,}777{,}216, consecutive floats are 1 apart, and 2^{24} + 1 rounds to 2^{24}.

Round-to-nearest-even (the default) gives a worst-case relative error of \varepsilon / 2 per operation. For binary32, that bounds one operation at \approx 6 \times 10^{-8} relative. Errors compound: summing N values naively has worst-case error growing like N \varepsilon and expected error like \sqrt{N} \varepsilon.

Catastrophic Cancellation and Kahan Summation

When two close numbers are subtracted, leading bits cancel and the result is dominated by rounding noise.

float a = 1.0000001f;
float b = 1.0000000f;
float d = a - b;   // d = 1.1920929e-07, not 1e-7

The relative error in d is on the order of 1, even though a and b were each accurate to a fraction of a ULP. This is why the variance formula

\sigma^2 = \overline{x^2} - \bar{x}^2

is numerically dangerous when \bar{x}^2 \approx \overline{x^2}. Use the two-pass formulation \sum (x_i - \bar{x})^2 / n instead.

For long sums, Kahan summation tracks the low-order error in a separate compensator:

float kahan_sum(const float *x, size_t n) {
    float s = 0.0f;
    float c = 0.0f;            // running compensation
    for (size_t i = 0; i < n; i++) {
        float y = x[i] - c;    // y = next term, minus prior loss
        float t = s + y;       // s is large, y small; low bits drop
        c = (t - s) - y;       // recover the dropped low bits
        s = t;
    }
    return s;
}

This reduces the error bound from O(N \varepsilon) to O(\varepsilon) independent of N, at the cost of 4x arithmetic. ML training loops summing 10^8 gradient terms in float32 will see meaningful drift without compensation; this is one reason gradient accumulation buffers are often kept in float32 even when activations are float16.

Main Theorem

Theorem

Standard Model of Floating-Point Arithmetic

Statement

For any two representable floats a, b and any operation \circ \in \{+, -, \cdot, /\}, the computed result \mathrm{fl}(a \circ b) satisfies \mathrm{fl}(a \circ b) = (a \circ b)(1 + \delta), \quad |\delta| \leq u, where u = 2^{-p} is the unit roundoff (u = 2^{-24} for binary32, u = 2^{-53} for binary64).

Intuition

Each operation rounds the exact mathematical result to the nearest representable float. The maximum relative error is half a ULP of the result, which is bounded by u.

Proof Sketch

IEEE 754 mandates correctly rounded results for the four basic operations (and square root). Round-to-nearest places the exact result within \frac{1}{2}\mathrm{ulp} of a representable value. For a normalized result in [2^e, 2^{e+1}), \mathrm{ulp} = 2^{e-p+1} (here p counts significand bits, 24 for binary32, so this matches the earlier 2^{e-23}) and the value is at least 2^e, giving relative error at most 2^{-p}.

Why It Matters

This is the foundation of every floating-point error analysis. Wilkinson-style backward error proofs string together (1 + \delta_i) factors and bound their product. It tells you that one float32 multiply costs at most 6 \times 10^{-8} relative error, and lets you reason about whether 10000 of them will stay within tolerance.

Failure Mode

The bound fails on underflow to subnormals, where relative error can be 100 percent (the result rounds to zero). It also fails for non-elementary functions like exp or sin; libm implementations typically guarantee 1-2 ULPs, not correctly rounded.

Common Confusions

Watch Out

Float comparison with ==

0.1 + 0.2 == 0.3 is false in C: each literal rounds independently, and the computed sum carries a different rounding error than the direct encoding of 0.3. (The outcome depends on precision: in binary32 the roundings happen to cancel and 0.1f + 0.2f == 0.3f is actually true, which is exactly why equality tests on floats are untrustworthy.) Use fabs(a - b) < tol with a tolerance tuned to the magnitudes involved, or compare ULP distances by reinterpreting the bit pattern as a signed integer (int32_t for float, int64_t for double).

Watch Out

Mantissa width versus significand width

A binary32 has a 23-bit fraction field but a 24-bit significand because of the implicit leading 1. When converting between hand-derived and machine-printed bit patterns, this off-by-one is the most common source of confusion.

Exercises

ExerciseCore

Problem

Encode the decimal value 0.1 in IEEE 754 binary32. Show the sign, biased exponent, and 23-bit fraction field. Why is the encoded value not exactly equal to 0.1?

ExerciseCore

Problem

What are the smallest positive normal and smallest positive subnormal binary32 values? What is the largest finite binary32 value? Express each as a decimal power-of-two expression and an approximate decimal magnitude.

ExerciseAdvanced

Problem

Find three binary32 values a, b, c such that (a + b) + c \neq a + (b + c), demonstrating that floating-point addition is not associative. Compute both sides and show the bit difference.

References

Canonical:

  • Petzold, Code: The Hidden Language of Computer Hardware and Software (2nd ed., 2022), ch. 7–9 — builds binary arithmetic and signed/unsigned representations from first principles before arriving at floating-point encoding
  • Tanenbaum & Austin, Structured Computer Organization (6th ed., 2013), ch. 2 — covers IEEE 754 binary32/binary64 layouts, normalization, and arithmetic at the instruction-set level
  • Goldberg, "What Every Computer Scientist Should Know About Floating-Point Arithmetic," ACM Computing Surveys 23(1), 1991 — definitive reference for ULP analysis, rounding modes, and the fundamental theorem of floating-point arithmetic
  • Overton, Numerical Computing with IEEE Floating Point Arithmetic (SIAM, 2001), ch. 2–4 — rigorous treatment of the IEEE 754 standard, rounding error models, and exception handling
  • Muller et al., Handbook of Floating-Point Arithmetic (2nd ed., Birkhäuser, 2018), ch. 1–3 — detailed reference covering representation, rounding, and elementary function error bounds

Accessible:

  • Harris & Harris, Digital Design and Computer Architecture (2nd ed., 2012), ch. 5 — accessible hardware-oriented walkthrough of IEEE 754 with datapath diagrams
  • Bryant & O'Hallaron, Computer Systems: A Programmer's Perspective (3rd ed., 2016), ch. 2 — programmer-focused treatment of float bit patterns, special values, and rounding with C examples
