Why This Matters
A 1 MiB English text file often gzip-compresses to about 250 KiB, near 2 bits per character if the original uses 8-bit bytes. A 1 MiB file filled with uniform random bytes usually stays near 1 MiB and may grow by a header plus block overhead.
The difference is not implementation luck. English has biased symbols and repeated substrings. Random bytes have neither. Shannon entropy gives the average bits-per-symbol floor for a source model, Huffman coding constructs prefix codes near that floor, and LZ77 turns repeated byte strings into length-distance references.
Core Definitions
Lossless Compression
A lossless compressor maps an input byte string to a shorter byte string plus enough structure for exact reconstruction. If the decompressor returns even one bit different from the input, the scheme is not lossless for that input.
Lossy Compression
A lossy compressor returns an approximation chosen to preserve a task-specific notion of quality. JPEG may change pixel values. MP3 may remove masked audio components. Neural-network weight quantization may change floating-point weights while preserving validation accuracy within a tolerance.
Entropy of a Discrete Source
For a discrete random variable $X$ taking values with probabilities $p_1, \dots, p_n$, the Shannon entropy is $H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i$. It is measured in bits per symbol when the logarithm has base 2.
Prefix Code
A prefix code assigns each symbol a bit string such that no codeword is a prefix of another. This lets a decoder read bits left to right without a separator between symbols.
Entropy as the Bits-Per-Symbol Floor
For a source with $n$ symbols and probabilities $p_1, \dots, p_n$, entropy is

$$H = -\sum_{i=1}^{n} p_i \log_2 p_i$$

A fair bit has entropy 1 bit because both outcomes have probability $1/2$. A byte uniformly sampled from 256 values has entropy 8 bits. No lossless compressor can reduce an independent stream of such bytes on average under the true distribution.
Biased symbols are different. Suppose a source emits A with probability $1/2$, B with probability $1/4$, C with probability $1/8$, and D with probability $1/8$. Its entropy is

$$H = \tfrac{1}{2}\cdot 1 + \tfrac{1}{4}\cdot 2 + \tfrac{1}{8}\cdot 3 + \tfrac{1}{8}\cdot 3 = 1.75 \text{ bits per symbol}$$
A fixed-width encoding for four symbols uses 2 bits per symbol. A variable-length code can do better by assigning shorter codewords to frequent symbols.
Here is a minimal C calculation for entropy from counts.
```c
#include <math.h>
#include <stdio.h>

int main(void) {
    /* Symbol counts approximating probabilities 1/2, 1/4, 1/8, 1/8. */
    int count[] = {50, 25, 12, 13};
    int n = 100;
    double H = 0.0;
    for (int i = 0; i < 4; i++) {
        double p = (double)count[i] / n;
        H -= p * log2(p);  /* each symbol contributes -p log2 p */
    }
    printf("%.3f bits per symbol\n", H);
    return 0;
}
```
This prints 1.750 bits per symbol; the integer counts 12 and 13 approximate the two $1/8$ probabilities.
For English text, the true number depends on the source model. Character frequencies alone give roughly 4 bits per character. Models using context, words, and long-range constraints reduce the estimated entropy. Shannon's classic experiments placed English near 1 bit per character for skilled human prediction; a practical rule for modern gzip on ordinary English prose is about 2 bits per byte-sized character, with entropy estimates often quoted around 1.3 bits per character for strong language models.
No Compressor Wins on Every File
Lossless decompression must be one-to-one on the set of compressed representations it accepts. If two different inputs map to the same compressed bytes, the decompressor cannot know which original to return.
For all $n$-bit files, there are $2^n$ possible inputs. The set of bit strings shorter than $n$ bits has size

$$\sum_{k=0}^{n-1} 2^k = 2^n - 1$$

That is one slot too few. So no injective lossless compressor can map every $n$-bit input to a shorter output. If it shortens some files, it must leave some unchanged or make some longer.
A practical compressor often adds a magic number, method byte, checksum, and length fields. For a short input like hi, gzip metadata dominates the two data bytes. This is not a defect in the entropy theorem; it is fixed overhead plus the fact that tiny samples do not expose much redundancy.
Huffman Coding by Hand
Huffman coding constructs an optimal prefix code for a known set of symbol frequencies when each symbol is encoded separately. It repeatedly merges the two lowest-frequency nodes into a binary tree. Left and right edges become 0 and 1.
Take the string BANANA_BANDANA. Its 14 characters have these counts.
| Symbol | Count | Probability |
|---|---|---|
| A | 6 | 0.4286 |
| N | 4 | 0.2857 |
| B | 2 | 0.1429 |
| _ | 1 | 0.0714 |
| D | 1 | 0.0714 |
The entropy is about 1.985 bits per character. A fixed-width code for five symbols needs 3 bits per character, so the 14-character string takes 42 bits before storing any table.
Huffman merges the two count-1 symbols, then merges count-2 B with that node, then merges count-4 N with that node, then merges count-6 A with the rest. One valid code is
| Symbol | Code | Length |
|---|---|---|
| A | 0 | 1 |
| N | 10 | 2 |
| B | 110 | 3 |
| _ | 1110 | 4 |
| D | 1111 | 4 |
The encoded payload is

```
B   A N  A N  A _    B   A N  D    A N  A
110 0 10 0 10 0 1110 110 0 10 1111 0 10 0
```

Concatenated, that is 28 bits.

```
1100100100111011001011110100
```

Packed into bytes with four zero pad bits at the end, the payload bytes are

```
11001001 00111011 00101111 01000000
0xC9     0x3B     0x2F     0x40
```
The average code length is $(6\cdot1 + 4\cdot2 + 2\cdot3 + 1\cdot4 + 1\cdot4)/14 = 28/14 = 2$ bits per character, very close to the 1.985-bit entropy for this empirical distribution. Real file formats must also store or reconstruct the codebook. For small inputs, that overhead can erase the gain.
Arithmetic Coding and the Epsilon Gap
Huffman coding assigns an integer number of bits to each symbol. If a symbol has probability 0.9, the ideal length is $-\log_2 0.9 \approx 0.152$ bits, but a per-symbol prefix code cannot spend a fractional bit on one occurrence.
Arithmetic coding encodes a whole message as a subinterval of $[0, 1)$. Each symbol narrows the interval according to its probability. A final binary fraction inside the interval identifies the entire message.
If a message has probability $p$, then its ideal code length is $-\log_2 p$ bits. With careful renormalization and finite-precision arithmetic, arithmetic coding can get within $\varepsilon$ bits per symbol of the entropy rate for long messages, for any chosen $\varepsilon > 0$. The price is more complex encoder and decoder state than a Huffman tree, plus historical patent and implementation concerns. Many modern formats use range coding or tabled variants for similar reasons.
LZ77 and the Sliding Window
Huffman and arithmetic coding exploit symbol bias. LZ77 exploits repeated substrings. It keeps a sliding window of previous bytes. When the next input bytes match earlier bytes, the encoder emits a pair containing a distance backward and a match length. Otherwise it emits a literal byte.
For abcabcabcX, the ASCII bytes are

```
61 62 63 61 62 63 61 62 63 58
a  b  c  a  b  c  a  b  c  X
```
After outputting literals a, b, and c, the encoder sees abcabc repeated starting 3 bytes back. An illustrative LZSS token stream is
```
literal 0x61
literal 0x62
literal 0x63
match distance 3 length 6
literal 0x58
```

One toy byte layout could use a flag byte followed by payload bytes, with flag bit 0 for a literal and 1 for a match.

```
flags  payload
0x08   61 62 63 03 06 58
```
This toy encoding uses 7 bytes instead of 10, before any file header. DEFLATE, the format inside gzip, does more. It emits literals, length codes, and distance codes, then Huffman-codes those token streams. So gzip combines dictionary substitution from LZ77 with entropy coding over the resulting symbols.
LZSS is a common variant that omits matches unless they save space. A match of length 2 may cost more than two literal bytes after flags and distance fields. Real encoders spend CPU time finding good matches inside a bounded window, commonly 32 KiB for DEFLATE.
When Lossy Beats Lossless
Lossless compression preserves every bit, so it cannot discard measurement noise, perceptually irrelevant detail, or numerical precision that a downstream task does not need. Lossy coding changes the object being represented.
JPEG converts image blocks into frequency coefficients and quantizes them, often removing high-frequency detail that viewers tolerate. MP3 models auditory masking and allocates fewer bits to components that are hard to hear. Neural-network weight quantization stores weights in 8-bit integers, 4-bit integers, or mixed formats instead of 32-bit floats. That is not lossless, but inference accuracy can remain acceptable if the quantization error is controlled.
A lossless compressor applied after lossy coding is still useful. JPEG and MP3 both include entropy coding stages after quantization. For ML models, packed low-bit weights still need byte layouts, alignment, and sometimes entropy coding if distribution skew is high.
Key Result
Source Coding Limit for Lossless Compression
Statement
For a discrete memoryless source with entropy $H$, no lossless code can have expected length below $H$ bits per symbol in the limit without losing injectivity. Prefix codes such as Huffman satisfy $H \le \bar{L} < H + 1$ for single-symbol coding. Block and arithmetic codes can make the overhead per symbol smaller than any chosen $\varepsilon > 0$ for long enough blocks.
Intuition
A source with entropy $H$ has about $2^{nH}$ typical length-$n$ messages. Distinguishing that many likely messages needs about $nH$ bits. Assigning shorter names to common messages works only because uncommon messages receive longer names.
Proof Sketch
The lower bound follows from Kraft's inequality for prefix codes and Gibbs' inequality. If codeword lengths are $\ell_i$, Kraft's inequality requires $\sum_i 2^{-\ell_i} \le 1$. Minimizing expected length under this constraint gives ideal lengths near $-\log_2 p_i$, whose expectation is $H$. Huffman's merge rule constructs the optimal integer-length prefix tree for known single-symbol frequencies. Arithmetic coding avoids per-symbol integer rounding by coding long sequences as intervals.
Why It Matters
The theorem separates modeling from coding. Better compression usually means a better probability model or a better dictionary parse, not a magic backend that defeats entropy.
Failure Mode
The theorem is distributional. It does not say every file shrinks. The pigeonhole count proves the opposite for fixed-length input sets.
Common Confusions
Compression ratio is not a property of the algorithm alone
A gzip ratio measured on English text says little about the same program on encrypted data, JPEG files, or random bytes. Already-compressed formats often have near-uniform byte distributions and few repeated substrings, so gzip has little to exploit.
Entropy of bytes is not always entropy of the data source
If UTF-8 English text is modeled as raw bytes, common ASCII letters dominate. If it is modeled as words with context, the estimated entropy changes. Entropy is computed under a model; a bad model gives a weak bound.
Huffman codes are not self-describing
The decoder must know the code tree or a canonical representation of code lengths. The 28-bit BANANA_BANDANA payload is not decodable by itself unless the codebook is known.
Exercises
Problem
Build a Huffman code for counts A: 8, B: 4, C: 2, D: 2. Compute the entropy and the average Huffman length.
Problem
Decode the LZSS-style token stream literal 0x61, literal 0x62, literal 0x63, match distance 3 length 6, literal 0x58.
Problem
For 8-bit files, prove by counting that at least one 8-bit input must fail to compress if a lossless compressor maps every input to a bit string of length at most 8.
References
Canonical:
- Charles Petzold, Code: The Hidden Language of Computer Hardware and Software, 2nd ed. (2022), ch. 7-9, binary codes, bytes, and structured representations
- Andrew S. Tanenbaum and Todd Austin, Structured Computer Organization, 6th ed. (2013), ch. 2, data representation at the machine level
- Thomas M. Cover and Joy A. Thomas, Elements of Information Theory, 2nd ed. (2006), ch. 5, entropy and lossless source coding
- David Salomon and Giovanni Motta, Handbook of Data Compression, 5th ed. (2010), ch. 2-3, statistical coding and dictionary methods
- Khalid Sayood, Introduction to Data Compression, 5th ed. (2017), ch. 3-5, Huffman coding, arithmetic coding, and dictionary coding
- L. Peter Deutsch, DEFLATE Compressed Data Format Specification version 1.3, RFC 1951 (1996), §§3.2.2-3.2.7, LZ77 tokens and Huffman coding in DEFLATE
Accessible:
- David MacKay, Information Theory, Inference, and Learning Algorithms, ch. 4-6, open textbook treatment of entropy and coding
- Mark Nelson, The Data Compression Book, 2nd ed., practical explanations of Huffman, LZ77, and related file formats
- Matt Mahoney, Data Compression Explained, online notes covering entropy, dictionary coding, and practical compressors