LLM Construction
Model Compression and Pruning
Reducing model size without proportional accuracy loss: unstructured and structured pruning, magnitude pruning, the lottery ticket hypothesis, entropy coding for compressed weights, and knowledge distillation as compression.
Why This Matters
Large neural networks are expensive to store, transfer, and run inference on. A 70B parameter model in float16 requires 140 GB of memory. Compression techniques reduce this cost. Pruning removes weights. Distillation transfers knowledge to a smaller model. These are complementary to quantization (reducing precision per weight). Understanding compression theory clarifies which weights matter, why overparameterization helps training but not inference, and what information a network actually needs.
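The 140 GB figure is just parameters times bytes per parameter. A minimal sketch of that arithmetic (the helper function name is made up for illustration):

```python
# Back-of-envelope weight storage cost at different precisions.
# Counts raw weights only; ignores activations, optimizer state, KV cache.

def model_bytes(n_params: float, bits_per_weight: float) -> float:
    """Raw weight storage in bytes."""
    return n_params * bits_per_weight / 8

n = 70e9                                  # 70B parameters
fp16_gb = model_bytes(n, 16) / 1e9
int4_gb = model_bytes(n, 4) / 1e9

print(f"float16: {fp16_gb:.0f} GB")       # 140 GB
print(f"4-bit:   {int4_gb:.0f} GB")       # 35 GB
```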
Mental Model
A trained neural network has many redundant parameters. Most weights are small and contribute little to the output. Pruning removes them. The surprising finding (lottery ticket hypothesis) is that sparse subnetworks exist at initialization that can match the full network's accuracy when trained in isolation. The network is overparameterized for learning but underparameterized for inference.
Pruning: Unstructured vs Structured
Unstructured pruning
Remove individual weights by setting them to zero. The weight tensor becomes sparse. A binary mask $M \in \{0,1\}^{m \times n}$ selects which weights survive: $\tilde{W} = W \odot M$. Unstructured pruning achieves high sparsity (90%+ zeros), but the irregular sparsity pattern is hard to accelerate on standard hardware (GPUs prefer dense operations).
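A minimal sketch of the mask construction, assuming magnitude as the pruning criterion (the function name is illustrative):

```python
import numpy as np

# Unstructured magnitude pruning: build a binary mask that zeros out the
# smallest-|w| fraction of an individual weight tensor.

def magnitude_mask(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Binary mask keeping the largest-magnitude (1 - sparsity) weights."""
    k = int(sparsity * w.size)            # number of weights to remove
    if k == 0:
        return np.ones_like(w)
    # k-th smallest |w| is the cut: everything at or below it is pruned
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return (np.abs(w) > threshold).astype(w.dtype)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
mask = magnitude_mask(w, sparsity=0.9)
w_pruned = w * mask                       # sparse tensor, same shape
print(f"sparsity: {1 - mask.mean():.2f}")   # ~0.90
```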
Structured pruning
Remove entire structural units: channels in CNNs, attention heads in transformers, or rows/columns of weight matrices. The result is a smaller dense model that runs efficiently on standard hardware without sparse computation support. Structured pruning typically achieves lower sparsity than unstructured pruning for the same accuracy.
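A sketch of the structured case, assuming whole rows of a weight matrix (output channels) are ranked by L2 norm; note the result is a genuinely smaller dense matrix, not a masked one:

```python
import numpy as np

# Structured pruning: drop entire rows (output channels) with the
# smallest L2 norms, producing a smaller dense weight matrix.

def prune_rows(w: np.ndarray, keep_fraction: float) -> np.ndarray:
    """Keep the rows with the largest L2 norms; result stays dense."""
    n_keep = int(round(keep_fraction * w.shape[0]))
    norms = np.linalg.norm(w, axis=1)
    keep = np.sort(np.argsort(norms)[-n_keep:])   # preserve row order
    return w[keep]

rng = np.random.default_rng(0)
w = rng.normal(size=(128, 256))
w_small = prune_rows(w, keep_fraction=0.5)
print(w_small.shape)   # (64, 256)
```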
Magnitude Pruning
The simplest pruning criterion: remove weights with the smallest absolute values. Given a target sparsity $s$ (fraction of weights to remove), sort all weights by $|w|$ and zero out the smallest fraction $s$.
Magnitude pruning is typically done iteratively:
- Train to convergence
- Prune the smallest-magnitude $p\%$ of the remaining weights
- Retrain the surviving weights (fine-tune)
- Repeat until target sparsity is reached
This iterative approach (gradual magnitude pruning) significantly outperforms one-shot pruning at high sparsity levels.
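The prune-retrain loop can be sketched end to end on a toy least-squares model; the model, schedule, and halving rate here are all illustrative, not from the text:

```python
import numpy as np

# Gradual magnitude pruning: train, prune 50% of the surviving weights,
# fine-tune the survivors (gradient masked), repeat.

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
true_w = np.zeros(50)
true_w[:5] = rng.normal(size=5)           # sparse ground truth
y = X @ true_w

def train(w, mask, steps=200, lr=0.01):
    """Gradient descent on masked least squares; pruned weights stay zero."""
    for _ in range(steps):
        grad = X.T @ (X @ (w * mask) - y) / len(y)
        w = w - lr * grad * mask
    return w

mask = np.ones(50)
w = train(rng.normal(size=50), mask)      # 1. train to convergence
for _ in range(4):
    alive = np.flatnonzero(mask)
    k = len(alive) // 2
    drop = alive[np.argsort(np.abs(w[alive]))[:k]]
    mask[drop] = 0.0                      # 2. prune smallest half
    w = train(w, mask)                    # 3. fine-tune survivors
print(f"final sparsity: {1 - mask.mean():.2f}")   # 0.92 after 4 rounds
```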
The Lottery Ticket Hypothesis
Lottery Ticket Hypothesis (Frankle & Carbin 2019, empirical)
The lottery ticket hypothesis is an empirical claim, not a theorem. It states: a randomly initialized dense network contains a sparse subnetwork (the "winning ticket") that, when trained in isolation from the same initialization, reaches test accuracy comparable to the full network in a comparable number of training iterations. The subnetwork is identified by iterative magnitude pruning with weight rewinding: train, prune the smallest-magnitude weights, rewind the surviving weights to their initial values (or to an early-training checkpoint), and repeat.
Empirical evidence and scope
Frankle and Carbin (2019) demonstrated the effect on small vision networks (MNIST, CIFAR-10) at sparsity levels up to roughly 90%. Follow-up work documented limits and extensions:
- For larger networks (ResNet-50 on ImageNet), rewinding to initialization fails. Frankle et al. (2020) showed that rewinding to an early training iterate (e.g., a few thousand steps in) recovers the effect.
- At extreme sparsity (above roughly 99%), winning tickets become hard to find reliably and iterative pruning degrades accuracy.
- Transferability across tasks and architectures is partial and not guaranteed.
- Finding a winning ticket requires training the full network first, so the procedure does not save training compute. Its value is in understanding why pruned networks can match dense ones at inference time.
Treat the hypothesis as an empirical regularity with documented scope, not as a mathematical theorem. The practical takeaway for compression is narrower: trained networks are routinely prunable to 50-90% sparsity with small accuracy loss, and the sparse subnetwork captures most of the dense network's function.
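The iterative-magnitude-pruning-with-rewinding procedure can be sketched on the same kind of toy model; the key difference from gradual pruning is the rewind step, which restarts surviving weights from their original initialization each round (setup and schedule are illustrative):

```python
import numpy as np

# Lottery-ticket search: train, prune by magnitude, REWIND survivors to
# their initial values, and repeat. The final (w0, mask) pair is the
# candidate "winning ticket".

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 30))
y = X @ rng.normal(size=30)

def train(w, mask, steps=300, lr=0.01):
    for _ in range(steps):
        w = w - lr * (X.T @ (X @ (w * mask) - y) / len(y)) * mask
    return w

w0 = rng.normal(size=30)                  # the ticket's initialization
mask = np.ones(30)
for _ in range(3):
    w = train(w0.copy(), mask)            # rewind: restart from w0
    alive = np.flatnonzero(mask)
    drop = alive[np.argsort(np.abs(w[alive]))[: len(alive) // 3]]
    mask[drop] = 0.0                      # prune a third of survivors
ticket = w0 * mask                        # sparse subnetwork at init
print(f"ticket sparsity: {1 - mask.mean():.2f}")
```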
Entropy Coding for Weight Compression
After pruning and quantization, weight values cluster at specific discrete values. This clustering makes the weight distribution highly non-uniform, which is exactly what entropy coding exploits.
If quantized weights take values in $\{v_1, \dots, v_K\}$ with probabilities $p_1, \dots, p_K$, the entropy is:
$H = -\sum_{k=1}^{K} p_k \log_2 p_k$
Huffman or arithmetic coding achieves close to $H$ bits per weight. For a network with 4-bit quantization (16 values) but where 80% of weights are zero, the entropy is much less than 4 bits. Typical compression: 4-bit quantized weights compress to an effective 1-2 bits using entropy coding.
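Plugging the 80%-zeros example into the entropy formula directly (assuming, for illustration, that the remaining 20% of mass is spread uniformly over the other 15 levels):

```python
import math

# Entropy of a quantized weight distribution: 16 levels (4-bit), but
# with 80% of the probability mass at zero.

def entropy_bits(probs):
    """Shannon entropy in bits, skipping zero-probability symbols."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

probs = [0.8] + [0.2 / 15] * 15           # zero level + 15 nonzero levels
H = entropy_bits(probs)
print(f"entropy: {H:.2f} bits/weight vs 4.00 for a uniform code")  # 1.50
```

This matches the "1-2 bits effective" figure quoted above: an entropy coder needs only about 1.5 bits per weight for this distribution instead of the nominal 4.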
Knowledge Distillation
Knowledge distillation
Train a small student network $f_s$ to mimic the outputs of a large teacher network $f_t$. The distillation loss is:
$\mathcal{L} = \alpha \, T^2 \, \mathrm{KL}\!\left(p_T \,\|\, q_T\right) + (1 - \alpha) \, \mathcal{L}_{\mathrm{CE}}(y, q_1)$
where $p_T = \mathrm{softmax}(z_t / T)$ and $q_T = \mathrm{softmax}(z_s / T)$ are the teacher and student softmax outputs at temperature $T$. High temperature smooths the output distribution, exposing the teacher's "dark knowledge" (relative probabilities of incorrect classes).
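The KL term of the loss can be sketched directly from its definition; the logits below are made up, and the $T^2$ factor follows the standard practice of keeping gradient magnitudes comparable across temperatures:

```python
import numpy as np

# Temperature-scaled distillation loss: KL divergence between teacher
# and student softmax distributions, scaled by T^2.

def softmax(z, T):
    e = np.exp((z - z.max()) / T)         # shift for numerical stability
    return e / e.sum()

def distill_loss(student_logits, teacher_logits, T=4.0):
    p = softmax(teacher_logits, T)        # teacher soft targets
    q = softmax(student_logits, T)        # student predictions
    kl = np.sum(p * (np.log(p) - np.log(q)))
    return T * T * kl

teacher = np.array([5.0, 2.0, 0.5])
student = np.array([4.0, 1.0, 1.0])
print(f"loss: {distill_loss(student, teacher):.4f}")
```

Note how at high $T$ the softened teacher distribution assigns visible probability to the "wrong" classes, which is exactly the dark knowledge the student learns from.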
Distillation Generalization Bound
Statement
Let $R(f_t)$ be the teacher's population risk. If the student achieves $\mathrm{KL}(p_t \,\|\, p_s) \le \epsilon$ on the training distribution and the student class $\mathcal{F}_s$ has Rademacher complexity $\mathfrak{R}_n(\mathcal{F}_s)$, then the student's population risk satisfies:
$R(f_s) \le R(f_t) + O(\sqrt{\epsilon}) + O(\mathfrak{R}_n(\mathcal{F}_s))$
Intuition
The student's error is bounded by the teacher's error plus the approximation gap plus the student's generalization gap. Good distillation requires: (1) a good teacher, (2) a student class rich enough to approximate the teacher, and (3) enough data for the student to generalize.
Proof Sketch
Decompose the student's risk as $R(f_s) \le R(f_t) + |R(f_s) - R(f_t)|$. Bound the teacher-student gap using Pinsker's inequality ($\|p - q\|_1 \le \sqrt{2 \, \mathrm{KL}(p \,\|\, q)}$), which turns the training KL of $\epsilon$ into an $O(\sqrt{\epsilon})$ term, and use standard Rademacher complexity arguments for the student's generalization gap.
Why It Matters
Distillation is how you deploy large models in resource-constrained settings. DistilBERT (6 layers, 66M parameters) retains about 97% of the accuracy of BERT-base (12 layers, 110M parameters). The theoretical bound explains why: the teacher's soft labels provide a richer training signal than hard labels.
Failure Mode
If the student class is too small to approximate the teacher, the approximation term $O(\sqrt{\epsilon})$ dominates and distillation fails regardless of how much data is available. Also, if the teacher is overfit, the student inherits the teacher's errors. Distillation from an ensemble of teachers is more robust.
Complementary Techniques
Pruning, quantization, and distillation are complementary:
- Pruning reduces the number of nonzero weights
- Quantization reduces the bits per weight (see quantization topic)
- Distillation transfers the model's function to a smaller architecture, reducing the parameter count itself
They compose: prune a model, quantize the surviving weights, and entropy code the result. Or distill to a smaller architecture, then quantize. Modern deployment pipelines use multiple techniques simultaneously.
Common Confusions
Pruning does not always speed up inference
Unstructured pruning produces sparse tensors. Most GPU kernels are optimized for dense operations, so 90% sparsity does not yield 10x speedup. You need structured sparsity (2:4 patterns on NVIDIA Ampere, or full channel pruning) for hardware acceleration. Unstructured pruning mainly saves memory and communication, not compute.
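The 2:4 pattern mentioned above is easy to state in code: in every group of 4 consecutive weights, exactly 2 are zeroed. A minimal sketch (the function name is illustrative, and real kernels also store compressed indices):

```python
import numpy as np

# Enforce 2:4 structured sparsity: in each group of 4 consecutive
# weights, keep only the 2 with the largest magnitudes.

def prune_2_4(w: np.ndarray) -> np.ndarray:
    groups = w.reshape(-1, 4)
    # indices of the 2 smallest |w| in each group of 4
    idx = np.argsort(np.abs(groups), axis=1)[:, :2]
    out = groups.copy()
    np.put_along_axis(out, idx, 0.0, axis=1)
    return out.reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8))
w24 = prune_2_4(w)
print(f"sparsity: {(w24 == 0).mean():.2f}")   # exactly 0.50
```

Because the pattern guarantees the same sparsity in every group, the hardware can skip half the multiply-accumulates with a fixed-size metadata overhead.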
The lottery ticket is found after training, not before
You cannot identify the winning ticket before training the full network. The lottery ticket hypothesis says winning tickets exist at initialization, not that you can find them cheaply. Finding them requires training and pruning the full network, which is as expensive as standard training.
Distillation does not just copy the teacher
The student learns the teacher's output distribution, not its weights. A small student trained on soft labels can outperform the same architecture trained on hard labels because soft labels carry more information per example (which classes the teacher considers similar).
Summary
- Unstructured pruning achieves high sparsity but needs sparse hardware support; structured pruning produces smaller dense models
- Magnitude pruning with iterative prune-retrain is the simplest effective method
- Lottery ticket hypothesis: overparameterization helps search, not representation
- Entropy coding exploits the non-uniform weight distribution after pruning/quantization
- Knowledge distillation transfers teacher knowledge through soft labels at high temperature
- Pruning, quantization, and distillation compose for maximum compression
Exercises
Problem
A model has 100M parameters in float32 (400 MB). You prune 90% of weights (unstructured), quantize surviving weights to 4 bits, then apply entropy coding that achieves 1.5 bits per nonzero weight. What is the final model size, ignoring overhead for storing the sparsity pattern?
Problem
Why does distillation at high temperature provide more information per training example than hard labels? Express the answer in terms of the entropy of the teacher's output distribution.
References
Canonical:
- Frankle & Carbin, "The Lottery Ticket Hypothesis," ICLR 2019
- Hinton, Vinyals, Dean, "Distilling the Knowledge in a Neural Network," NeurIPS Workshop 2015
Current:
- Sanh et al., "DistilBERT," NeurIPS EMC2 Workshop 2019
- Frantar & Alistarh, "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot," ICML 2023
Next Topics
From model compression, the natural continuations are:
- Inference systems overview: how compressed models are served at scale
- Mixture of experts: conditional computation as an alternative to compression
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)