
LLM Construction

Iterative Magnitude Pruning and the Lottery Ticket Hypothesis

Iterative magnitude pruning repeatedly trains, prunes, rewinds, and retrains a network to search for sparse subnetworks that still learn well. The point is not cheap training; the point is understanding trainable sparsity, rewind stability, and when a sparse mask still preserves optimization geometry.

Advanced · Tier 2 · Stable · ~60 min

Why This Matters

There are two very different questions in sparsity research:

  1. Can we make an already-trained model smaller for inference?
  2. Does a dense random initialization already contain a sparse subnetwork that could have trained well on its own?

The first question is ordinary model compression. The second is the lottery ticket question, and it is what iterative magnitude pruning (IMP) was built to study.

That distinction matters because people often hear "winning ticket" and assume this means sparse training has been solved. It has not. IMP is scientifically important because it isolates the geometry of trainable sparse masks, rewind points, and optimization stability. But it is computationally expensive: you usually have to train the dense model first and then repeat prune-and-retrain rounds.

If we want a future live artifact on pruning, lottery tickets, or sparse training, this page is the conceptual spine. It tells us what the classical result actually says, what it does not say, and why the rewind point is the interesting part.

IMP is a mask-search loop, not a free training shortcut

Iterative magnitude pruning alternates between dense training and sparse rewinding because the scientific question is stronger than compression: can a masked subnetwork still learn from an early checkpoint?

train dense model → prune small magnitudes → rewind kept weights → retrain sparse subnetwork

  1. Train dense model: learn useful weights and identify candidate coordinates to prune.
  2. Prune small magnitudes: keep the surviving binary mask and discard the weakest coordinates.
  3. Rewind kept weights: reset the survivors to an early checkpoint instead of keeping late values.
  4. Retrain sparse subnetwork: ask whether the rewound ticket can still match the dense baseline.

Each round only matters if the rewound sparse ticket can still stay near the dense baseline; once that breaks, the mask-search loop has gone too far.
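The loop above can be sketched end to end on a toy least-squares problem. Everything here is illustrative: the data, learning rate, and the choice to prune one weight per round are arbitrary stand-ins for a real training pipeline.

```python
import numpy as np

def train(theta, mask, steps=500, lr=0.02):
    """Gradient descent on a toy least-squares loss, restricted by the mask."""
    X = np.array([[1.0, 2.0, 0.1], [0.5, 1.0, 0.2], [2.0, 0.3, 0.05]])
    y = np.array([3.0, 1.5, 2.3])
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ (mask * theta) - y)
        theta = theta - lr * mask * grad      # pruned coordinates never move
    return theta

rng = np.random.default_rng(0)
theta_init = rng.normal(size=3)   # early checkpoint used as the rewind point
mask = np.ones(3)
theta = theta_init.copy()

for _ in range(2):                # two prune-rewind rounds
    theta = train(theta, mask)
    survivors = np.flatnonzero(mask)
    weakest = survivors[np.argmin(np.abs(theta[survivors]))]
    mask[weakest] = 0.0           # prune the smallest-magnitude survivor
    theta = mask * theta_init     # rewind survivors to the early checkpoint
```

After the loop, `mask` keeps a single coordinate and `theta` is ready for one more sparse retraining pass; a real ticket experiment would compare that run against the dense baseline.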

What the rewind point means

At each round we keep the binary mask m, but reset the surviving weights to an early checkpoint rather than keeping the late dense values.

Mask quality at high sparsity

  • post-training pruning: good for smaller inference models
  • IMP winning ticket: stronger claim, sparse trainability after rewind
  • one-shot random mask: usually collapses long before the ticket does

The practical warning

IMP is educational and scientifically interesting, but its training cost is at least the cost of the dense run plus every rewind round. The win is understanding trainable sparsity, not free optimization.

Mental Model

Think of IMP as a mask search loop. Start with a dense network, train it until it has learned useful structure, prune the smallest-magnitude weights, then rewind the surviving weights to an earlier checkpoint and ask whether that masked subnetwork can still learn.

The key question is not "are these weights small?" The key question is:

  • does the mask preserve the optimization path the network needed,
  • does the rewind point land before training becomes too specialized,
  • and can the surviving coordinates still move toward a good basin?

That is why IMP feels closer to an optimization experiment than to ordinary compression.

Formal Setup

Definition

Binary pruning mask

A binary pruning mask is a vector or tensor

m \in \{0,1\}^{|\theta|}

with the same shape as the parameter collection \theta. The masked network is

f(x; m \odot \theta),

where \odot is elementwise multiplication. Coordinates with m_i = 0 are permanently removed from the model.
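In code, the mask is just an elementwise product; the values below are arbitrary examples:

```python
import numpy as np

theta = np.array([0.8, -0.05, 1.2, 0.02])   # dense parameters
m = np.array([1.0, 0.0, 1.0, 0.0])          # binary mask: keep, drop, keep, drop

masked = m * theta   # m ⊙ θ: the network only ever sees these values
```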

Definition

Iterative magnitude pruning (IMP)

IMP is the classical procedure introduced by Frankle and Carbin:

  1. initialize a dense model at \theta_0,
  2. train to step T to obtain \theta_T,
  3. prune the smallest-magnitude surviving weights to update the mask m,
  4. rewind the surviving coordinates to an earlier checkpoint \theta_{t_0},
  5. retrain the masked model,
  6. repeat until the desired sparsity is reached.

When t_0 = 0, the rewind point is the original initialization. Later work showed that for larger networks, rewinding to a small positive step t_0 > 0 is often more stable.
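Step 3 of the procedure can be sketched as a standalone function. Pruning 20% of survivors per round is a common choice in the lottery ticket literature, but the fraction here is just an example:

```python
import numpy as np

def prune_step(theta, mask, frac=0.2):
    """Zero out the smallest-magnitude `frac` of the currently surviving weights."""
    survivors = np.flatnonzero(mask)
    k = int(np.ceil(frac * survivors.size))       # weights removed this round
    order = np.argsort(np.abs(theta[survivors]))  # smallest magnitudes first
    new_mask = mask.copy()
    new_mask[survivors[order[:k]]] = 0.0
    return new_mask

theta = np.array([0.9, -0.01, 0.4, 0.03, -1.1])
mask = prune_step(theta, np.ones(5))   # removes the weakest survivor, index 1
```

Because the function ranks only the surviving coordinates, repeated calls compound: each round removes a fraction of what is left, which is how IMP reaches high sparsity gradually rather than in one shot.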

Definition

Winning ticket

A winning ticket is an empirical object: a sparse mask m together with a rewind point \theta_{t_0} such that training the masked network f(x; m \odot \theta_{t_0}) reaches accuracy comparable to the dense baseline in a comparable number of updates.

This is not a theorem about all architectures. It is an observed regularity with scope, failure cases, and strong dependence on architecture and scale.

Main Propositions

Proposition

Masked training stays inside the sparse subspace

Statement

If the training objective is

\mathcal L(\theta) = L(m \odot \theta),

then the gradient with respect to the unconstrained parameter vector \theta satisfies

\nabla_\theta \mathcal L(\theta) = m \odot \nabla L(m \odot \theta).

Hence every gradient-based update leaves the pruned coordinates unchanged. Training happens entirely inside the affine subspace selected by the mask.

Intuition

Once a coordinate has been masked out, it is gone from the computational graph. The optimizer can move only the surviving weights. So the sparse model is not a "smaller dense model" in a vague sense; it is literally a constrained optimization problem in a lower-dimensional subspace.

Proof Sketch

Apply the chain rule coordinatewise. Since (m \odot \theta)_i = m_i \theta_i, the derivative with respect to \theta_i is multiplied by m_i. When m_i = 0, the gradient is zero and that coordinate cannot move.
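The coordinatewise chain rule is easy to check numerically; the vectors below are arbitrary:

```python
import numpy as np

m = np.array([1.0, 0.0, 1.0])          # coordinate 1 is pruned
theta = np.array([0.5, 2.0, -0.3])
target = np.array([1.0, 1.0, 1.0])

# For L(theta) = ||m*theta - target||^2 the chain rule gives
# grad = m * 2*(m*theta - target): the mask multiplies the gradient.
grad = m * 2.0 * (m * theta - target)
```

`grad[1]` is exactly zero, so no number of gradient steps can revive the pruned coordinate; its stored value of 2.0 is simply invisible to the loss.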

Why It Matters

This is the clean mathematical reason sparse retraining can fail: the mask may cut off coordinates the dense model needed to reach a good basin. IMP is searching for masks whose subspace still contains a trainable route.

Proposition

IMP is a search procedure, not a free training shortcut

Statement

The total optimization cost of IMP over r rounds is at least

C_{\mathrm{IMP}} = C_{\mathrm{dense}} + \sum_{j=1}^{r} C_j,

where C_{\mathrm{dense}} is the cost of the initial dense run and C_j is the cost of retraining in round j. Therefore IMP does not reduce training cost relative to a single dense run. It trades additional compute for information about which sparse masks remain trainable after rewinding.

Intuition

Every IMP round spends real optimization budget: train, prune, rewind, retrain. So the payoff is understanding and extracting trainable sparsity, not getting a cheaper first-pass training algorithm.
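With made-up unit costs, the accounting is immediate. Treating each rewind round as roughly as expensive as the dense run is a simplification, though sparse retraining on dense hardware is often not much cheaper:

```python
C_dense = 100.0              # cost of the initial dense training run (arbitrary units)
retrain_costs = [100.0] * 5  # five rewind rounds, each retraining from the checkpoint

C_imp = C_dense + sum(retrain_costs)   # six runs' worth of compute for one ticket
```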

Why It Matters

This is where many presentations go wrong. The lottery ticket literature is not mainly a story about faster training in production. It is a story about the optimization geometry of sparse subnetworks and why some masks preserve trainability while others do not.

Failure Mode

You can still get inference wins from the final sparse model, especially if the hardware stack exploits sparsity or the mask is structured. But those are inference wins after an expensive search, not free training wins from IMP itself.

What The Classical Result Actually Says

The original lottery ticket result was narrow and important:

  • small vision networks,
  • standard magnitude pruning,
  • rewinding to the original initialization,
  • and sparse subnetworks that could match the dense model at moderate to high sparsity.

What it did not say:

  • that every architecture has a stable winning ticket,
  • that one-shot pruning is enough,
  • that sparse training is solved,
  • or that production systems should find tickets with IMP.

Later work sharpened the picture. Frankle, Dziugaite, Roy, and Carbin showed that larger-scale settings often need late rewinding rather than strict rewinding to initialization. Their linear-mode-connectivity analysis also suggested why some masks work: trainable tickets tend to remain in a geometry that is still connected to the dense solution, while bad masks break that connection.
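The linear-mode-connectivity test itself is simple to sketch: evaluate the loss along the straight line between two solutions and look for a barrier. The double-well loss below is a toy stand-in for a real network's loss surface:

```python
import numpy as np

def loss(theta):
    # toy double-well landscape: minima wherever each coordinate is +1 or -1
    return float(np.sum((theta**2 - 1.0) ** 2))

theta_a = np.array([1.0, 1.0])    # one trained solution
theta_b = np.array([1.0, -1.0])   # another, one coordinate in the opposite well

alphas = np.linspace(0.0, 1.0, 21)
path = [loss((1.0 - a) * theta_a + a * theta_b) for a in alphas]
barrier = max(path) - max(path[0], path[-1])
# barrier > 0 here: the straight line crosses a ridge, so these two
# solutions are not linearly mode connected on this toy landscape
```

A ticket that stays linearly connected to the dense solution would show a near-zero barrier along this path; a broken mask shows a bump like the one above.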

Why Rewinding Matters

Early optimization often does disproportionately important work:

  • symmetries break,
  • feature directions start to align,
  • unstable coordinates settle down,
  • and the network leaves the most fragile part of training.

If you rewind too early, the sparse network may never recover. If you rewind a little later, the same mask can become trainable. That is why the rewind point is not a bookkeeping detail; it is part of the scientific claim.

Practical Lessons

The pruning story splits into three layers:

  1. Compression for deployment. Magnitude pruning, structured pruning, and quantization are practical inference tools.
  2. Scientific sparsity. IMP studies which masks preserve trainability.
  3. Sparse training systems. These ask a harder engineering question: can we exploit sparsity during training without paying dense-run search cost?

TheoremPath should keep these layers separate. Otherwise, we end up claiming a research phenomenon solved a systems problem that it did not solve.

Common Confusions

Watch Out

A prunable network is not automatically a sparse-trainable network

Many trained networks can be pruned aggressively after training. That does not mean the same sparse mask could have been trained from scratch with equal success. IMP is studying trainability, not just compressibility.

Watch Out

Winning tickets are empirical findings, not universal theorems

The lottery ticket hypothesis is supported in many settings, but its scope depends on scale, architecture, optimizer, and rewind point. Present it as a measured phenomenon, not a law of nature.

Watch Out

Sparsity does not guarantee wall-clock speedup

Unstructured sparsity can reduce parameter count without helping hardware much. Real speedups depend on kernels, memory layout, and whether the deployment stack actually exploits the mask pattern.

Exercises

ExerciseCore

Problem

Why does a fixed pruning mask turn training into a constrained optimization problem rather than merely a smaller dense problem?

ExerciseAdvanced

Problem

Why can two masks with the same final sparsity behave very differently under rewinding?

ExerciseResearch

Problem

Suppose a future browser lab wants to show iterative magnitude pruning live. Why would a scientifically honest version need both dense-baseline tracking and a random-mask baseline?

References

  • Frankle, J. and Carbin, M. "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks." ICLR 2019.
  • Frankle, J., Dziugaite, G. K., Roy, D. M., and Carbin, M. "Linear Mode Connectivity and the Lottery Ticket Hypothesis." ICML 2020.


Last reviewed: April 25, 2026
