
LLM Construction

Model Compression and Pruning

Reducing model size without proportional accuracy loss: unstructured and structured pruning, magnitude pruning, the lottery ticket hypothesis, entropy coding for compressed weights, and knowledge distillation as compression.


Why This Matters

Large neural networks are expensive to store, transfer, and run inference on. A 70B parameter model in float16 requires 140 GB of memory. Compression techniques reduce this cost. Pruning removes weights. Distillation transfers knowledge to a smaller model. These are complementary to quantization (reducing precision per weight). Understanding compression theory clarifies which weights matter, why overparameterization helps training but not inference, and what information a network actually needs.

Mental Model

A trained neural network has many redundant parameters. Most weights are small and contribute little to the output. Pruning removes them. The surprising finding (lottery ticket hypothesis) is that sparse subnetworks exist at initialization that can match the full network's accuracy when trained in isolation. The network is overparameterized for learning but underparameterized for inference.

Pruning: Unstructured vs Structured

Definition

Unstructured pruning

Remove individual weights by setting them to zero. The weight tensor becomes sparse. A mask $m \in \{0, 1\}^{|\theta|}$ selects which weights survive: $\theta_{\text{pruned}} = m \odot \theta$. Unstructured pruning achieves high sparsity (90%+ zeros), but the irregular sparsity pattern is hard to accelerate on standard hardware (GPUs prefer dense operations).
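As a minimal illustration, the mask definition above can be applied with NumPy. The threshold rule used to build the mask here (keep the largest-magnitude half) is just one illustrative choice:

```python
import numpy as np

# Illustrative: apply a binary pruning mask to a weight tensor.
rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 4)).astype(np.float32)

# One way to construct a mask: keep the largest-magnitude 50% of weights.
threshold = np.median(np.abs(theta))
m = (np.abs(theta) >= threshold).astype(np.float32)

theta_pruned = m * theta  # elementwise product m ⊙ θ
print(f"sparsity: {1 - m.mean():.2f}")  # → sparsity: 0.50
```

The pruned tensor has the same shape as the original; the savings come from storing only the nonzero entries plus the sparsity pattern.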

Definition

Structured pruning

Remove entire structural units: channels in CNNs, attention heads in transformers, or rows/columns of weight matrices. The result is a smaller dense model that runs efficiently on standard hardware without sparse computation support. Structured pruning typically achieves lower sparsity than unstructured pruning for the same accuracy.

Magnitude Pruning

The simplest pruning criterion: remove the weights with the smallest absolute values. Given a target sparsity $s$ (the fraction of weights to remove), sort all weights by $|w_i|$ and zero out the smallest $s$ fraction.

Magnitude pruning is typically done iteratively:

  1. Train to convergence
  2. Prune the smallest $p\%$ of remaining weights
  3. Retrain the surviving weights (fine-tune)
  4. Repeat until target sparsity is reached

This iterative approach (gradual magnitude pruning) significantly outperforms one-shot pruning at high sparsity levels.
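The prune-retrain loop above can be sketched as follows. The retraining step is elided (a real implementation would fine-tune the surviving weights between rounds; here we only iterate the mask update):

```python
import numpy as np

def prune_smallest(theta, mask, frac):
    """Zero out the smallest `frac` fraction of currently surviving weights."""
    alive = np.abs(theta[mask])
    cutoff = np.quantile(alive, frac)
    return mask & (np.abs(theta) > cutoff)

rng = np.random.default_rng(1)
theta = rng.normal(size=1000)
mask = np.ones_like(theta, dtype=bool)

target_sparsity = 0.9
per_round = 0.2  # prune 20% of the remaining weights each round
while 1 - mask.mean() < target_sparsity:
    # (in practice: retrain the surviving weights here before pruning again)
    mask = prune_smallest(theta, mask, per_round)

theta *= mask
print(f"final sparsity: {1 - mask.mean():.3f}")
```

Because each round removes a fraction of the *survivors*, sparsity approaches the target geometrically: after $k$ rounds, roughly $0.8^k$ of the weights remain.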

The Lottery Ticket Hypothesis

Definition

Lottery Ticket Hypothesis (Frankle & Carbin 2019, empirical)

The lottery ticket hypothesis is an empirical claim, not a theorem. It states: a randomly initialized dense network contains a sparse subnetwork (the "winning ticket") that, when trained in isolation from the same initialization, reaches test accuracy comparable to the full network in a comparable number of training iterations. The subnetwork is identified by iterative magnitude pruning with weight rewinding: train, prune the smallest-magnitude weights, rewind the surviving weights to their initial values (or to an early-training checkpoint), and repeat.

Example

Empirical evidence and scope

Frankle and Carbin (2019) demonstrated the effect on small vision networks (MNIST, CIFAR-10) at sparsity levels up to roughly 90%. Follow-up work documented limits and extensions:

  • For larger networks (ResNet-50 on ImageNet), rewinding to initialization fails. Frankle et al. (2020) showed that rewinding to an early training iterate (e.g., a few thousand steps in) recovers the effect.
  • At extreme sparsity (above roughly 99%), winning tickets become hard to find reliably and iterative pruning degrades accuracy.
  • Transferability across tasks and architectures is partial and not guaranteed.
  • Finding a winning ticket requires training the full network first, so the procedure does not save training compute. Its value is in understanding why pruned networks can match dense ones at inference time.

Treat the hypothesis as an empirical regularity with documented scope, not as a mathematical theorem. The practical takeaway for compression is narrower: trained networks are routinely prunable to 50-90% sparsity with small accuracy loss, and the sparse subnetwork captures most of the dense network's function.
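A toy sketch of the train-prune-rewind procedure, using a stand-in `train` function (a real run would perform SGD on the task loss; here training is simulated as a deterministic perturbation, purely to show the control flow):

```python
import numpy as np

def train(theta, mask):
    """Stand-in for SGD training. A real implementation would minimize a
    task loss; this deterministic perturbation is illustrative only."""
    rng = np.random.default_rng(42)
    return (theta + 0.1 * rng.normal(size=theta.shape)) * mask

def imp_with_rewinding(theta_init, rounds=3, frac_per_round=0.2):
    """Iterative magnitude pruning with rewinding to the initial weights."""
    mask = np.ones_like(theta_init, dtype=bool)
    for _ in range(rounds):
        theta = train(theta_init * mask, mask)       # 1. train
        alive = np.abs(theta[mask])
        cutoff = np.quantile(alive, frac_per_round)
        mask = mask & (np.abs(theta) > cutoff)       # 2. prune smallest
        # 3. rewind: survivors return to their initial values (theta_init)
    return theta_init * mask, mask

rng = np.random.default_rng(0)
theta0 = rng.normal(size=200)
ticket, mask = imp_with_rewinding(theta0)
print(f"ticket sparsity: {1 - mask.mean():.2f}")  # ≈ 1 - 0.8³ ≈ 0.49
```

The key point is step 3: the returned "ticket" keeps the surviving weights at their *initialization* values, not their trained values. For larger networks, rewinding to an early checkpoint replaces `theta_init` here.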

Entropy Coding for Weight Compression

After pruning and quantization, weight values cluster at specific discrete values. This clustering makes the weight distribution highly non-uniform, which is exactly what entropy coding exploits.

If quantized weights take values in $\{v_1, \ldots, v_K\}$ with probabilities $p_1, \ldots, p_K$, the entropy is:

$$H = -\sum_{k=1}^{K} p_k \log_2 p_k$$

Huffman or arithmetic coding achieves close to $H$ bits per weight. For a network with 4-bit quantization (16 values) where 80% of weights are zero, the entropy is much less than 4 bits. Typical compression: 4-bit quantized weights compress to an effective 1-2 bits using entropy coding.
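A quick numerical check of this claim, under an assumed distribution (16 quantization levels, 80% of the mass on zero, the rest spread evenly):

```python
import math

def entropy_bits(probs):
    """Shannon entropy H = -sum p_k log2(p_k), in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Assumed distribution: 16 quantization levels, 80% of weights are zero,
# remaining probability mass spread evenly over the other 15 levels.
probs = [0.8] + [0.2 / 15] * 15
H = entropy_bits(probs)
print(f"entropy: {H:.2f} bits/weight (vs 4 bits uncompressed)")  # → 1.50
```

So for this distribution an ideal entropy coder needs about 1.5 bits per weight, consistent with the 1-2 bit range quoted above.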

Knowledge Distillation

Definition

Knowledge distillation

Train a small student network $f_S$ to mimic the outputs of a large teacher network $f_T$. The distillation loss is:

$$\mathcal{L}_{\text{distill}} = \mathrm{KL}\big(p_T(y|x; \tau) \,\|\, p_S(y|x; \tau)\big)$$

where $p_T$ and $p_S$ are softmax outputs at temperature $\tau > 1$. A high temperature smooths the output distribution, exposing the teacher's "dark knowledge" (the relative probabilities of incorrect classes).
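A minimal NumPy sketch of this loss (the logits below are made up for illustration; note that Hinton et al. additionally scale the loss by $\tau^2$ when combining it with a hard-label loss):

```python
import numpy as np

def softmax(z, tau):
    """Temperature-scaled softmax; higher tau smooths the distribution."""
    z = z / tau
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(teacher_logits, student_logits, tau=4.0):
    """KL(p_T || p_S) at temperature tau, averaged over the batch."""
    p_t = softmax(teacher_logits, tau)
    p_s = softmax(student_logits, tau)
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)
    return kl.mean()

# Hypothetical logits for one example with 4 classes.
teacher = np.array([[5.0, 2.0, 1.0, 0.5]])
student = np.array([[4.0, 2.5, 0.5, 1.0]])
print(f"distillation loss: {distill_loss(teacher, student):.4f}")
```

The loss is zero exactly when the student reproduces the teacher's tempered distribution, and the gradient pushes the student's logits toward the teacher's relative class probabilities.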

Proposition

Distillation Generalization Bound

Statement

Let $R_T$ be the teacher's population risk. If the student achieves $\mathrm{KL}(p_T \| p_S) \leq \epsilon$ on the training distribution and the student class has Rademacher complexity $\mathcal{R}_n$, then the student's population risk satisfies:

$$R_S \leq R_T + \sqrt{2\epsilon} + O\!\left(\mathcal{R}_n + \sqrt{\log(1/\delta)/n}\right)$$

Intuition

The student's error is bounded by the teacher's error plus the approximation gap plus the student's generalization gap. Good distillation requires: (1) a good teacher, (2) the student class must be rich enough to approximate the teacher, and (3) enough data for the student to generalize.

Proof Sketch

Decompose the student's risk as $R_S \leq R_T + |R_S - R_T|$. Bound $|R_S - R_T|$ using Pinsker's inequality ($\mathrm{TV} \leq \sqrt{\mathrm{KL}/2}$) and standard Rademacher complexity arguments for the student's generalization gap.

Why It Matters

Distillation is how you deploy large models in resource-constrained settings. DistilBERT (6 layers, 66M parameters) retains 97% of the accuracy of BERT-base (12 layers, 110M parameters). The theoretical bound explains why: the teacher's soft labels provide a richer training signal than hard labels.

Failure Mode

If the student class is too small to approximate the teacher, the $\sqrt{2\epsilon}$ term dominates and distillation fails regardless of data. Also, if the teacher is overfit, the student inherits the teacher's errors. Distillation from an ensemble of teachers is more robust.

Complementary Techniques

Pruning, quantization, and distillation are complementary:

  • Pruning reduces the number of nonzero weights
  • Quantization reduces the bits per weight (see quantization topic)
  • Distillation reduces the number of parameters entirely

They compose: prune a model, quantize the surviving weights, and entropy code the result. Or distill to a smaller architecture, then quantize. Modern deployment pipelines use multiple techniques simultaneously.

Common Confusions

Watch Out

Pruning does not always speed up inference

Unstructured pruning produces sparse tensors. Most GPU kernels are optimized for dense operations, so 90% sparsity does not yield 10x speedup. You need structured sparsity (2:4 patterns on NVIDIA Ampere, or full channel pruning) for hardware acceleration. Unstructured pruning mainly saves memory and communication, not compute.

Watch Out

The lottery ticket is found after training, not before

You cannot identify the winning ticket before training the full network. The lottery ticket hypothesis says winning tickets exist at initialization, not that you can find them cheaply. Finding them requires training and pruning the full network, which is as expensive as standard training.

Watch Out

Distillation does not just copy the teacher

The student learns the teacher's output distribution, not its weights. A small student trained on soft labels can outperform the same architecture trained on hard labels because soft labels carry more information per example (which classes the teacher considers similar).

Summary

  • Unstructured pruning achieves high sparsity but needs sparse hardware support; structured pruning produces smaller dense models
  • Magnitude pruning with iterative prune-retrain is the simplest effective method
  • Lottery ticket hypothesis: overparameterization helps search, not representation
  • Entropy coding exploits the non-uniform weight distribution after pruning/quantization
  • Knowledge distillation transfers teacher knowledge through soft labels at high temperature
  • Pruning, quantization, and distillation compose for maximum compression

Exercises

ExerciseCore

Problem

A model has 100M parameters in float32 (400 MB). You prune 90% of weights (unstructured), quantize surviving weights to 4 bits, then apply entropy coding that achieves 1.5 bits per nonzero weight. What is the final model size, ignoring overhead for storing the sparsity pattern?

ExerciseAdvanced

Problem

Why does distillation at high temperature $\tau$ provide more information per training example than hard labels? Express the answer in terms of the entropy of the teacher's output distribution.


References

Canonical:

  • Frankle & Carbin, "The Lottery Ticket Hypothesis," ICLR 2019
  • Hinton, Vinyals, Dean, "Distilling the Knowledge in a Neural Network," NeurIPS Workshop 2015

Current:

  • Sanh et al., "DistilBERT," NeurIPS EMC2 Workshop 2019
  • Frantar & Alistarh, "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot," ICML 2023

Next Topics

From model compression, the natural continuations are:

Last reviewed: April 2026
