Why This Matters
A 64-bit server memory word often travels with 8 extra check bits, making a 72-bit physical word. That 12.5 percent overhead lets ECC RAM correct one flipped bit in the word and detect two. Without it, a bit flipped by a charged particle, thermal noise, RF interference, or DRAM cell wear can turn a tensor value, a page-table entry, or a model checkpoint byte into a different value with no software exception.
Networks and disks face longer error patterns. Ethernet appends a 32-bit CRC to each frame. ZIP stores a CRC-32 per file entry. TCP, by contrast, uses a 16-bit one's-complement checksum rather than a CRC. The design question is quantitative: how many extra bits buy detection, correction, or both?
Core Definitions
Block code
A block code maps information symbols to an -symbol codeword. For a binary code, the rate is , the redundancy is , and the code contains at most valid codewords inside the possible bit strings.
Hamming distance
The Hamming distance between two equal-length strings is the count of positions where they differ. The minimum distance of a code is .
Parity bit
A parity bit is one check bit chosen so the total number of 1 bits is even or odd. Even parity detects every odd number of flips in the protected group and corrects no bit by itself.
Cyclic redundancy check
A CRC treats a bit string as a polynomial over , divides by a fixed generator polynomial, and appends the remainder. Receiver and sender agree on the generator, bit order, initial value, and final xor.
Bits Flip for Physical Reasons
A stored 1 and a stored 0 are voltage, charge, magnetization, or phase states. Noise moves the measured value. Cosmic rays and alpha particles deposit charge in silicon. Thermal noise perturbs small signals. RF interference couples into wires. Flash cells wear out because erase cycles damage the oxide; reads and retention time also move thresholds.
A single bit flip can have very different meanings by location. Suppose the byte 0x41 stores ASCII A.
0x41 = 0100 0001
flip bit 5 from 0 to 1
0x61 = 0110 0001 = ASCII 'a'
For a 32-bit IEEE 754 float, flipping an exponent bit can be much larger.
1.0f = 0x3f800000
bits = 00111111 10000000 00000000 00000000
flip exponent bit 30
0x7f800000 = +infinity
bits = 01111111 10000000 00000000 00000000
This is why storage and transmission formats carry check information. They do not make hardware immune to errors; they make many errors visible and some errors correctable.
Parity Detects Odd Counts, Not Locations
For even parity, append . The receiver recomputes the xor over data and parity. A nonzero result means an odd number of protected bits changed.
Worked byte example:
data byte: 10110010
number of 1s: 4
even parity p: 0
stored bits: 10110010 0
one flip: 10110000 0
number of 1s: 3
parity check: fail
two flips: 10100000 0
number of 1s: 2
parity check: pass
Parity gives a one-bit syndrome, so it cannot name which of the 9 stored positions flipped. It catches all 1, 3, 5, 7, and 9 bit errors in this group. It misses some 2, 4, 6, and 8 bit errors.
The same xor appears in machine code because addition in is xor. A compact even-parity check for one byte is:
#include <stdint.h>
int even_parity_ok(uint8_t data, unsigned parity_bit) {
uint8_t x = data;
x ^= x >> 4;
x ^= x >> 2;
x ^= x >> 1;
return ((x ^ parity_bit) & 1u) == 0u;
}
Hamming(7,4) Encoding and Decoding
Hamming(7,4) uses positions 1, 2, and 4 for parity, and positions 3, 5, 6, and 7 for data. With even parity:
Encode the 4-bit message 1011 as , , , placed at positions 3, 5, 6, 7.
position: 1 2 3 4 5 6 7
role: p1 p2 d1 p4 d2 d3 d4
data: 1 0 1 1
p1 = pos3 xor pos5 xor pos7 = 1 xor 0 xor 1 = 0
p2 = pos3 xor pos6 xor pos7 = 1 xor 1 xor 1 = 1
p4 = pos5 xor pos6 xor pos7 = 0 xor 1 xor 1 = 0
codeword: 0 1 1 0 0 1 1 = 0110011
Now flip position 6 during storage or transmission.
sent: 0 1 1 0 0 1 1
received: 0 1 1 0 0 0 1
position: 1 2 3 4 5 6 7
The syndrome bits recompute the parity checks:
s1 = r1 xor r3 xor r5 xor r7 = 0 xor 1 xor 0 xor 1 = 0
s2 = r2 xor r3 xor r6 xor r7 = 1 xor 1 xor 0 xor 1 = 1
s4 = r4 xor r5 xor r6 xor r7 = 0 xor 0 xor 0 xor 1 = 1
syndrome as binary s4 s2 s1 = 110 = 6
The syndrome names the flipped position. Flip received position 6 back to 1, yielding 0110011, then read data positions 3, 5, 6, 7 as 1011.
Plain Hamming(7,4) has minimum distance 3. It corrects one error. It detects two errors if the receiver only asks whether the syndrome is nonzero, but a correcting decoder will miscorrect a two-bit error to some third position. SECDED in memory is usually the extended Hamming code: Hamming(7,4) plus one overall parity bit, making an (8,4) code with minimum distance 4.
Extended decoding uses this table:
syndrome = 0, overall parity ok: no error
syndrome = 0, overall parity bad: overall parity bit flipped
syndrome != 0, overall parity bad: correct bit named by syndrome
syndrome != 0, overall parity ok: double error detected, do not correct
Example double error in positions 6 and 7 of 0110011 gives syndrome 001, but the overall parity remains even. The extended decoder reports a double error instead of flipping position 1.
CRCs Use Polynomial Division Mod 2
A CRC replaces integer division with polynomial division over . Addition and subtraction are both xor. For generator , the binary divisor is 1011. For message bits 110100, append three zeros, divide by 1011, and append the 3-bit remainder.
message: 110100
append 3 zeros: 110100000
generator: 1011
remainder: 100
transmitted: 110100100
A tiny bit-oriented implementation shows the operation. It is slow, but it matches the arithmetic.
#include <stdint.h>
uint32_t crc3_1011(uint32_t bits, int len) {
uint32_t reg = bits << 3;
uint32_t poly = 0b1011u;
for (int i = len + 2; i >= 3; --i) {
if ((reg >> i) & 1u) {
reg ^= poly << (i - 3);
}
}
return reg & 0b111u;
}
If the receiver divides 110100100 by 1011, the remainder is zero. If a burst flips the contiguous pattern 001110000, the received word is 111010100; its remainder is nonzero for this generator, so the error is detected.
Real CRCs use wider polynomials and fixed conventions. Ethernet uses a 32-bit CRC over the frame check sequence, ZIP records CRC-32 per entry, and many storage protocols use CRC variants. TCP is different: its 16-bit checksum catches many random corruptions but is not polynomial CRC division.
ECC RAM and Reed-Solomon Storage Codes
ECC DRAM commonly protects each 64-bit data word with 8 check bits. That is 72 physical bits per 64 data bits:
On a corrected single-bit event, the memory controller can return the corrected data and log the address. On an uncorrectable multi-bit event, the machine can raise a machine-check exception or poison the cache line, depending on platform behavior. Servers care because a long-running job touches terabytes of memory over hours or days, and silent corruption is worse than a visible crash for databases, training runs, and checkpointed simulations.
Reed-Solomon codes work over larger symbols, often bytes, rather than individual bits. An RS code has minimum distance , so it can correct unknown symbol errors or recover up to known erasures. This symbol view is why Reed-Solomon fits CDs, DVDs, QR codes, and RAID-like storage. A scratch may corrupt many adjacent bits, but it may occupy a smaller number of byte symbols after interleaving. RAID-6 uses two parity blocks per stripe and can survive two disk erasures because the failed disk locations are known.
A numeric storage sketch:
RS(10,8) over bytes
data symbols: 8
check symbols: 2
minimum distance: d = 10 - 8 + 1 = 3
unknown errors: floor((3 - 1) / 2) = 1
known erasures: d - 1 = 2
Full Reed-Solomon decoding uses finite-field algebra beyond this page. The design lesson is still simple: coding by symbols converts long bit bursts into a smaller count of symbol errors or erasures.
Key Result
Distance, Hamming Bound, and Singleton Bound
Statement
A code with minimum distance detects up to symbol errors and corrects up to symbol errors. If , then any such code satisfies the Hamming sphere-packing bound
and the Singleton bound
Intuition
Detection fails only when an error moves one valid codeword exactly onto another valid codeword. Correction fails when one received word is equally near, or nearer, to two different codewords. Balls of radius around codewords must not overlap.
Proof Sketch
If fewer than positions change, a codeword cannot become a different codeword, so detection works through errors. If , the radius- balls around distinct codewords are disjoint, so nearest-neighbor decoding is unique. Counting all words in these disjoint balls gives the Hamming bound. For Singleton, delete coordinates from every codeword. Distinct codewords must remain distinct, or two original codewords would have differed in at most positions. Thus .
Why It Matters
The bounds price redundancy. Hamming(7,4) has , , . The Singleton bound gives , so it is not maximum-distance separable. The binary Hamming bound with gives , so the single-error correction spheres exactly fill the 7-bit space.
Failure Mode
A distance-3 code cannot be SECDED under a correcting decoder. A two-bit error may produce the same syndrome as a one-bit error. SECDED needs distance 4, which extended Hamming provides by adding the overall parity bit.
Common Confusions
Hamming(7,4) versus SECDED
Hamming(7,4) corrects one error and has . It can notice that some double errors occurred if you only check for a nonzero syndrome, but a normal correcting decoder will map a double error to a wrong single-bit correction. SECDED uses the extended (8,4) code with an overall parity bit, giving .
CRC versus checksum
Ethernet and ZIP use CRC-32 variants. TCP uses a 16-bit one's-complement checksum over a pseudo-header, TCP header, and payload. Calling TCP's checksum a CRC is incorrect and leads to wrong assumptions about burst-error detection.
Detection is not correction
A parity bit may tell you that a byte plus parity is wrong, but it gives no location. To correct a single-bit error among 7 positions, Hamming(7,4) needs 3 syndrome bits because outcomes cover no-error plus 7 possible error locations.
Exercises
Problem
Encode the 4-bit message 0110 using even-parity Hamming(7,4) with parity positions 1, 2, and 4. Then suppose position 5 flips. Compute the received word, syndrome, corrected word, and decoded message.
Problem
For a binary code with and , use the Hamming bound to test whether single-error correction with is possible without violating the counting bound. Then compute the Singleton upper bound on .
Problem
Using generator 1011, compute the CRC remainder for message 101100. Give the transmitted bit string.
References
Canonical:
- Charles Petzold, Code: The Hidden Language of Computer Hardware and Software (2nd ed., 2022), ch. 7-9, binary codes, telegraphy, and error-checking ideas
- Andrew S. Tanenbaum and Todd Austin, Structured Computer Organization (6th ed., 2013), ch. 2, data representation, memory organization, and error-correcting codes
- Shu Lin and Daniel J. Costello Jr., Error Control Coding (2nd ed., 2004), ch. 1-3, block codes, bounds, Hamming codes, and cyclic codes
- F. J. MacWilliams and N. J. A. Sloane, The Theory of Error-Correcting Codes (1977), ch. 1, Hamming distance, bounds, and introductory code constructions
- W. Wesley Peterson and E. J. Weldon Jr., Error-Correcting Codes (2nd ed., 1972), ch. 6-7, cyclic codes and BCH/Reed-Solomon background
Accessible:
- Ross N. Williams, A Painless Guide to CRC Error Detection Algorithms
- Richard W. Hamming, “Error Detecting and Error Correcting Codes,” Bell System Technical Journal 29(2), 1950
- Chris Heegard and Stephen B. Wicker, Turbo Coding, ch. 1, historical error-control coding overview
Next Topics
/computationpath/finite-fields/computationpath/network-packets-checksums/computationpath/storage-systems-raid/computationpath/channel-coding-overview