LLM Construction
Model Compression and Pruning
Reducing model size without proportional accuracy loss: unstructured and structured pruning, magnitude pruning, the lottery ticket hypothesis, entropy coding for compressed weights, and knowledge distillation as compression.
Why This Matters
Large neural networks are expensive to store, transfer, and run inference on. A 70B parameter model in float16 requires 140 GB of memory. Compression techniques reduce this cost. Pruning removes weights. Distillation transfers knowledge to a smaller model. These are complementary to quantization (reducing precision per weight). Understanding compression theory clarifies which weights matter, why overparameterization helps training but not inference, and what information a network actually needs.
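The 140 GB figure is just parameters times bytes per parameter. A minimal sketch of that arithmetic (the helper function name is made up for illustration):

```python
# Back-of-envelope weight storage cost at different precisions.
# Counts raw weights only; ignores activations, optimizer state, KV cache.

def model_bytes(n_params: float, bits_per_weight: float) -> float:
    """Raw weight storage in bytes."""
    return n_params * bits_per_weight / 8

n = 70e9                                  # 70B parameters
fp16_gb = model_bytes(n, 16) / 1e9
int4_gb = model_bytes(n, 4) / 1e9

print(f"float16: {fp16_gb:.0f} GB")       # 140 GB
print(f"4-bit:   {int4_gb:.0f} GB")       # 35 GB
```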
Mental Model
A trained neural network has many redundant parameters. Most weights are small and contribute little to the output. Pruning removes them. The surprising finding (lottery ticket hypothesis) is that sparse subnetworks exist at initialization that can match the full network's accuracy when trained in isolation. The network is overparameterized for learning but underparameterized for inference.
Pruning: Unstructured vs Structured
Unstructured pruning
Remove individual weights by setting them to zero. The weight tensor becomes sparse. A binary mask $M \in \{0,1\}^{m \times n}$ selects which weights survive: $\tilde{W} = W \odot M$. Unstructured pruning achieves high sparsity (90%+ zeros), but the irregular sparsity pattern is hard to accelerate on standard hardware (GPUs prefer dense operations).
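A minimal sketch of the mask construction, assuming magnitude as the pruning criterion (the function name is illustrative):

```python
import numpy as np

# Unstructured magnitude pruning: build a binary mask that zeros out the
# smallest-|w| fraction of an individual weight tensor.

def magnitude_mask(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Binary mask keeping the largest-magnitude (1 - sparsity) weights."""
    k = int(sparsity * w.size)            # number of weights to remove
    if k == 0:
        return np.ones_like(w)
    # k-th smallest |w| is the cut: everything at or below it is pruned
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return (np.abs(w) > threshold).astype(w.dtype)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
mask = magnitude_mask(w, sparsity=0.9)
w_pruned = w * mask                       # sparse tensor, same shape
print(f"sparsity: {1 - mask.mean():.2f}")   # ~0.90
```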
Structured pruning
Remove entire structural units: channels in CNNs, attention heads in transformers, or rows/columns of weight matrices. The result is a smaller dense model that runs efficiently on standard hardware without sparse computation support. Structured pruning typically achieves lower sparsity than unstructured pruning for the same accuracy.
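A sketch of the structured case, assuming whole rows of a weight matrix (output channels) are ranked by L2 norm; note the result is a genuinely smaller dense matrix, not a masked one:

```python
import numpy as np

# Structured pruning: drop entire rows (output channels) with the
# smallest L2 norms, producing a smaller dense weight matrix.

def prune_rows(w: np.ndarray, keep_fraction: float) -> np.ndarray:
    """Keep the rows with the largest L2 norms; result stays dense."""
    n_keep = int(round(keep_fraction * w.shape[0]))
    norms = np.linalg.norm(w, axis=1)
    keep = np.sort(np.argsort(norms)[-n_keep:])   # preserve row order
    return w[keep]

rng = np.random.default_rng(0)
w = rng.normal(size=(128, 256))
w_small = prune_rows(w, keep_fraction=0.5)
print(w_small.shape)   # (64, 256)
```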
Magnitude Pruning
The simplest pruning criterion: remove weights with the smallest absolute values. Given a target sparsity $s$ (fraction of weights to remove), sort all weights by $|w|$ and zero out the smallest fraction $s$.
Magnitude pruning is typically done iteratively:
- Train to convergence
- Prune the smallest-magnitude $p\%$ of the remaining weights
- Retrain the surviving weights (fine-tune)
- Repeat until target sparsity is reached
This iterative approach (gradual magnitude pruning) significantly outperforms one-shot pruning at high sparsity levels.
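The prune-retrain loop can be sketched end to end on a toy least-squares model; the model, schedule, and halving rate here are all illustrative, not from the text:

```python
import numpy as np

# Gradual magnitude pruning: train, prune 50% of the surviving weights,
# fine-tune the survivors (gradient masked), repeat.

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
true_w = np.zeros(50)
true_w[:5] = rng.normal(size=5)           # sparse ground truth
y = X @ true_w

def train(w, mask, steps=200, lr=0.01):
    """Gradient descent on masked least squares; pruned weights stay zero."""
    for _ in range(steps):
        grad = X.T @ (X @ (w * mask) - y) / len(y)
        w = w - lr * grad * mask
    return w

mask = np.ones(50)
w = train(rng.normal(size=50), mask)      # 1. train to convergence
for _ in range(4):
    alive = np.flatnonzero(mask)
    k = len(alive) // 2
    drop = alive[np.argsort(np.abs(w[alive]))[:k]]
    mask[drop] = 0.0                      # 2. prune smallest half
    w = train(w, mask)                    # 3. fine-tune survivors
print(f"final sparsity: {1 - mask.mean():.2f}")   # 0.92 after 4 rounds
```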
The Lottery Ticket Hypothesis
Lottery Ticket Hypothesis (Frankle & Carbin 2019, empirical)
The lottery ticket hypothesis is an empirical claim, not a theorem. It states: a randomly initialized dense network contains a sparse subnetwork (the "winning ticket") that, when trained in isolation from the same initialization, reaches test accuracy comparable to the full network in a comparable number of training iterations. The subnetwork is identified by iterative magnitude pruning with weight rewinding: train, prune the smallest-magnitude weights, rewind the surviving weights to their initial values (or to an early-training checkpoint), and repeat.
Empirical evidence and scope
Frankle and Carbin (2019) demonstrated the effect on small vision networks (MNIST, CIFAR-10) at sparsity levels up to roughly 90%. Follow-up work documented limits and extensions:
- For larger networks (ResNet-50 on ImageNet), rewinding to initialization fails. Frankle et al. (2020) showed that rewinding to an early training iterate (e.g., a few thousand steps in) recovers the effect.
- At extreme sparsity (above roughly 99%), winning tickets become hard to find reliably and iterative pruning degrades accuracy.
- Transferability across tasks and architectures is partial and not guaranteed.
- Finding a winning ticket requires training the full network first, so the procedure does not save training compute. Its value is in understanding why pruned networks can match dense ones at inference time.
Treat the hypothesis as an empirical regularity with documented scope, not as a mathematical theorem. The practical takeaway for compression is narrower: trained networks are routinely prunable to 50-90% sparsity with small accuracy loss, and the sparse subnetwork captures most of the dense network's function.
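The iterative-magnitude-pruning-with-rewinding procedure can be sketched on the same kind of toy model; the key difference from gradual pruning is the rewind step, which restarts surviving weights from their original initialization each round (setup and schedule are illustrative):

```python
import numpy as np

# Lottery-ticket search: train, prune by magnitude, REWIND survivors to
# their initial values, and repeat. The final (w0, mask) pair is the
# candidate "winning ticket".

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 30))
y = X @ rng.normal(size=30)

def train(w, mask, steps=300, lr=0.01):
    for _ in range(steps):
        w = w - lr * (X.T @ (X @ (w * mask) - y) / len(y)) * mask
    return w

w0 = rng.normal(size=30)                  # the ticket's initialization
mask = np.ones(30)
for _ in range(3):
    w = train(w0.copy(), mask)            # rewind: restart from w0
    alive = np.flatnonzero(mask)
    drop = alive[np.argsort(np.abs(w[alive]))[: len(alive) // 3]]
    mask[drop] = 0.0                      # prune a third of survivors
ticket = w0 * mask                        # sparse subnetwork at init
print(f"ticket sparsity: {1 - mask.mean():.2f}")
```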
Entropy Coding for Weight Compression
After pruning and quantization, weight values cluster at specific discrete values. This clustering makes the weight distribution highly non-uniform, which is exactly what entropy coding exploits.
If quantized weights take values in $\{v_1, \dots, v_K\}$ with probabilities $p_1, \dots, p_K$, the entropy is:
$H = -\sum_{k=1}^{K} p_k \log_2 p_k$
Huffman or arithmetic coding achieves close to $H$ bits per weight. For a network with 4-bit quantization (16 values) but where 80% of weights are zero, the entropy is much less than 4 bits. Typical compression: 4-bit quantized weights compress to an effective 1-2 bits using entropy coding.
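Plugging the 80%-zeros example into the entropy formula directly (assuming, for illustration, that the remaining 20% of mass is spread uniformly over the other 15 levels):

```python
import math

# Entropy of a quantized weight distribution: 16 levels (4-bit), but
# with 80% of the probability mass at zero.

def entropy_bits(probs):
    """Shannon entropy in bits, skipping zero-probability symbols."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

probs = [0.8] + [0.2 / 15] * 15           # zero level + 15 nonzero levels
H = entropy_bits(probs)
print(f"entropy: {H:.2f} bits/weight vs 4.00 for a uniform code")  # 1.50
```

This matches the "1-2 bits effective" figure quoted above: an entropy coder needs only about 1.5 bits per weight for this distribution instead of the nominal 4.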
Knowledge Distillation
Knowledge distillation
Train a small student network $f_s$ to mimic the outputs of a large teacher network $f_t$. The distillation loss is:
$\mathcal{L} = \alpha \, T^2 \, \mathrm{KL}\!\left(p_T \,\|\, q_T\right) + (1 - \alpha) \, \mathcal{L}_{\mathrm{CE}}(y, q_1)$
where $p_T = \mathrm{softmax}(z_t / T)$ and $q_T = \mathrm{softmax}(z_s / T)$ are the teacher and student softmax outputs at temperature $T$. High temperature smooths the output distribution, exposing the teacher's "dark knowledge" (relative probabilities of incorrect classes).
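The KL term of the loss can be sketched directly from its definition; the logits below are made up, and the $T^2$ factor follows the standard practice of keeping gradient magnitudes comparable across temperatures:

```python
import numpy as np

# Temperature-scaled distillation loss: KL divergence between teacher
# and student softmax distributions, scaled by T^2.

def softmax(z, T):
    e = np.exp((z - z.max()) / T)         # shift for numerical stability
    return e / e.sum()

def distill_loss(student_logits, teacher_logits, T=4.0):
    p = softmax(teacher_logits, T)        # teacher soft targets
    q = softmax(student_logits, T)        # student predictions
    kl = np.sum(p * (np.log(p) - np.log(q)))
    return T * T * kl

teacher = np.array([5.0, 2.0, 0.5])
student = np.array([4.0, 1.0, 1.0])
print(f"loss: {distill_loss(student, teacher):.4f}")
```

Note how at high $T$ the softened teacher distribution assigns visible probability to the "wrong" classes, which is exactly the dark knowledge the student learns from.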
Distillation Generalization Bound
Statement
Let $R(f_t)$ be the teacher's population risk. If the student achieves $\mathrm{KL}(p_t \,\|\, p_s) \le \epsilon$ on the training distribution and the student class $\mathcal{F}_s$ has Rademacher complexity $\mathfrak{R}_n(\mathcal{F}_s)$, then the student's population risk satisfies:
$R(f_s) \le R(f_t) + O(\sqrt{\epsilon}) + O(\mathfrak{R}_n(\mathcal{F}_s))$
Intuition
The student's error is bounded by the teacher's error plus the approximation gap plus the student's generalization gap. Good distillation requires: (1) a good teacher, (2) a student class rich enough to approximate the teacher, and (3) enough data for the student to generalize.
Proof Sketch
Decompose the student's risk as $R(f_s) \le R(f_t) + |R(f_s) - R(f_t)|$. Bound the teacher-student gap using Pinsker's inequality ($\|p - q\|_1 \le \sqrt{2 \, \mathrm{KL}(p \,\|\, q)}$), which turns the training KL of $\epsilon$ into an $O(\sqrt{\epsilon})$ term, and use standard Rademacher complexity arguments for the student's generalization gap.
Why It Matters
Distillation is how you deploy large models in resource-constrained settings. DistilBERT (6 layers, 66M parameters) retains about 97% of the accuracy of BERT-base (12 layers, 110M parameters). The theoretical bound explains why: the teacher's soft labels provide a richer training signal than hard labels.
Failure Mode
If the student class is too small to approximate the teacher, the approximation term $O(\sqrt{\epsilon})$ dominates and distillation fails regardless of how much data is available. Also, if the teacher is overfit, the student inherits the teacher's errors. Distillation from an ensemble of teachers is more robust.
Complementary Techniques
Pruning, quantization, and distillation are complementary:
- Pruning reduces the number of nonzero weights
- Quantization reduces the bits per weight (see quantization topic)
- Distillation transfers the model's function to a smaller architecture, reducing the parameter count itself
They compose: prune a model, quantize the surviving weights, and entropy code the result. Or distill to a smaller architecture, then quantize. Modern deployment pipelines use multiple techniques simultaneously.
Common Confusions
Pruning does not always speed up inference
Unstructured pruning produces sparse tensors. Most GPU kernels are optimized for dense operations, so 90% sparsity does not yield 10x speedup. You need structured sparsity (2:4 patterns on NVIDIA Ampere, or full channel pruning) for hardware acceleration. Unstructured pruning mainly saves memory and communication, not compute.
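The 2:4 pattern mentioned above is easy to state in code: in every group of 4 consecutive weights, exactly 2 are zeroed. A minimal sketch (the function name is illustrative, and real kernels also store compressed indices):

```python
import numpy as np

# Enforce 2:4 structured sparsity: in each group of 4 consecutive
# weights, keep only the 2 with the largest magnitudes.

def prune_2_4(w: np.ndarray) -> np.ndarray:
    groups = w.reshape(-1, 4)
    # indices of the 2 smallest |w| in each group of 4
    idx = np.argsort(np.abs(groups), axis=1)[:, :2]
    out = groups.copy()
    np.put_along_axis(out, idx, 0.0, axis=1)
    return out.reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8))
w24 = prune_2_4(w)
print(f"sparsity: {(w24 == 0).mean():.2f}")   # exactly 0.50
```

Because the pattern guarantees the same sparsity in every group, the hardware can skip half the multiply-accumulates with a fixed-size metadata overhead.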
The lottery ticket is found after training, not before
You cannot identify the winning ticket before training the full network. The lottery ticket hypothesis says winning tickets exist at initialization, not that you can find them cheaply. Finding them requires training and pruning the full network, which is as expensive as standard training.
Distillation does not just copy the teacher
The student learns the teacher's output distribution, not its weights. A small student trained on soft labels can outperform the same architecture trained on hard labels because soft labels carry more information per example (which classes the teacher considers similar).
Summary
- Unstructured pruning achieves high sparsity but needs sparse hardware support; structured pruning produces smaller dense models
- Magnitude pruning with iterative prune-retrain is the simplest effective method
- Lottery ticket hypothesis: overparameterization helps search, not representation
- Entropy coding exploits the non-uniform weight distribution after pruning/quantization
- Knowledge distillation transfers teacher knowledge through soft labels at high temperature
- Pruning, quantization, and distillation compose for maximum compression
Exercises
Problem
A model has 100M parameters in float32 (400 MB). You prune 90% of weights (unstructured), quantize surviving weights to 4 bits, then apply entropy coding that achieves 1.5 bits per nonzero weight. What is the final model size, ignoring overhead for storing the sparsity pattern?
Problem
Why does distillation at high temperature provide more information per training example than hard labels? Express the answer in terms of the entropy of the teacher's output distribution.
References
Canonical:
- Frankle & Carbin, "The Lottery Ticket Hypothesis," ICLR 2019
- Hinton, Vinyals, Dean, "Distilling the Knowledge in a Neural Network," NeurIPS Workshop 2015
Current:
- Sanh et al., "DistilBERT," NeurIPS EMC2 Workshop 2019
- Frantar & Alistarh, "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot," ICML 2023
Next Topics
From model compression, the natural continuations are:
- Inference systems overview: how compressed models are served at scale
- Mixture of experts: conditional computation as an alternative to compression
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)