ML Methods
Optimal Brain Surgery and Pruning Theory
Principled weight pruning via second-order information: Optimal Brain Damage uses the Hessian diagonal, Optimal Brain Surgeon uses the full inverse Hessian, and both reveal why magnitude pruning is a crude but popular approximation.
Why This Matters
Neural networks trained via backpropagation are overparameterized. A 100M-parameter model often contains millions of weights that contribute almost nothing to the output. Removing these weights (pruning) reduces memory, speeds up inference, and can even improve generalization.
The question is: which weights should you remove? Magnitude pruning (remove the smallest weights) is the most common heuristic. But it has no theoretical justification beyond a vague appeal to "small weights don't matter." Optimal Brain Damage (OBD) and Optimal Brain Surgeon (OBS) answer this question properly using second-order information from the loss surface.
Formal Setup
Consider a trained network with weight vector $w^*$ at a local minimum of the loss $L(w)$. We want to set one weight to zero and find the resulting increase in loss.
Saliency
The saliency $S_q$ of weight $w_q$ is the increase in loss caused by removing it:

$$S_q = L(w^* + \delta w) - L(w^*)$$

where $\delta w$ is the perturbation that sets $w_q = 0$ and possibly adjusts other weights to compensate.

A second-order Taylor expansion around the trained weights, using the Hessian matrix, gives:

$$\delta L \approx \nabla L(w^*)^\top \delta w + \tfrac{1}{2}\, \delta w^\top H\, \delta w$$

At a local minimum, $\nabla L(w^*) = 0$, so:

$$\delta L \approx \tfrac{1}{2}\, \delta w^\top H\, \delta w$$

where $H = \nabla^2 L(w^*)$ is the Hessian of the loss with respect to the weights.
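For a purely quadratic loss the second-order expansion is exact, which a tiny numeric check confirms. The matrix and minimizer below are made-up illustrative values, not from the original papers:

```python
import numpy as np

# Toy quadratic loss L(w) = 0.5 * (w - c)^T A (w - c), minimized at w* = c.
rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
A = M @ M.T + np.eye(4)              # positive definite Hessian
c = np.array([0.1, 0.3, 0.5, 1.0])   # the "trained" weights w*

def loss(w):
    d = w - c
    return 0.5 * d @ A @ d

q = 2                                 # delete weight q, leave the rest unchanged
dw = np.zeros(4)
dw[q] = -c[q]
true_increase = loss(c + dw) - loss(c)
taylor = 0.5 * dw @ A @ dw            # = 0.5 * A[q, q] * c[q]**2
assert np.isclose(true_increase, taylor)
```

For non-quadratic losses the formula is only a local approximation, which is why all the criteria below degrade far from the minimum.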
Optimal Brain Damage (LeCun et al. 1989)
OBD Diagonal Approximation
OBD assumes the Hessian is diagonal: $H \approx \operatorname{diag}(h_{11}, \dots, h_{nn})$. Under this approximation, the saliency of deleting weight $w_q$ (setting $\delta w_q = -w_q$ and $\delta w_j = 0$ for $j \neq q$) is:

$$S_q = \tfrac{1}{2}\, h_{qq}\, w_q^2$$

Prune the weight with the smallest $S_q$.
The diagonal Hessian entries can be computed efficiently during backpropagation. The cost is roughly the same as computing gradients.
OBD is not magnitude pruning
Magnitude pruning removes the weight with the smallest $|w_q|$. OBD removes the weight with the smallest $\tfrac{1}{2}\, h_{qq}\, w_q^2$. A weight can be large but sit in a flat region of the loss (small $h_{qq}$), making it cheap to remove. A weight can be small but sit on a steep ridge (large $h_{qq}$), making it expensive to remove. OBD accounts for the local curvature; magnitude pruning does not.
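A minimal sketch of the divergence between the two criteria, using made-up weights and diagonal curvatures:

```python
import numpy as np

# Illustrative weights and diagonal Hessian entries (invented values).
w = np.array([0.05, 0.4, 0.7, 1.2])
h = np.array([30.0, 0.2, 5.0, 2.0])

s_obd = 0.5 * h * w**2   # OBD saliency per weight: [0.0375, 0.016, 1.225, 1.44]

# Magnitude pruning picks the smallest |w|: index 0.
# OBD picks the smallest saliency: index 1, a larger weight in a much
# flatter region of the loss.
```

Whenever curvature varies widely across weights, the two rankings can disagree like this.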
Optimal Brain Surgeon (Hassibi and Stork, 1993)
OBS removes the diagonal approximation and uses the full inverse Hessian. When deleting weight , the remaining weights are adjusted to compensate.
OBS Optimal Weight Update
Statement
The optimal perturbation for deleting weight $w_q$ is:

$$\delta w = -\frac{w_q}{[H^{-1}]_{qq}}\, H^{-1} e_q$$

and the resulting saliency is:

$$S_q = \frac{1}{2}\, \frac{w_q^2}{[H^{-1}]_{qq}}$$

where $e_q$ is the $q$-th standard basis vector and $[H^{-1}]_{qq}$ is the $q$-th diagonal element of the inverse Hessian.
Intuition
OBS asks: if I must set , what is the best way to adjust all other weights to minimize the damage? The answer involves the inverse Hessian because it captures how weights interact through the loss surface. The correction shifts other weights along the direction that most efficiently compensates for the deleted weight.
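The OBS update can be sketched numerically; the 2-weight Hessian below is an invented toy example:

```python
import numpy as np

def obs_prune_step(w, H_inv, q):
    """Delete weight q and optimally adjust the remaining weights (OBS update)."""
    dw = -(w[q] / H_inv[q, q]) * H_inv[:, q]
    saliency = 0.5 * w[q] ** 2 / H_inv[q, q]
    return w + dw, saliency

# Toy problem with an off-diagonal Hessian term (illustrative values).
H = np.array([[2.0, 0.8],
              [0.8, 1.0]])
H_inv = np.linalg.inv(H)
w = np.array([0.5, -0.3])

w_new, s = obs_prune_step(w, H_inv, q=0)
# w_new[0] is exactly zero; the surviving weight has shifted to
# absorb part of the damage, and s equals the quadratic loss increase.
```

The key property, verifiable here, is that the quadratic loss increase $\tfrac{1}{2}\,\delta w^\top H\,\delta w$ of the combined delete-and-compensate move equals the stated saliency.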
Proof Sketch
Minimize $\tfrac{1}{2}\, \delta w^\top H\, \delta w$ subject to the constraint $e_q^\top \delta w + w_q = 0$, i.e., $\delta w_q = -w_q$. This is a quadratic program with a linear equality constraint. Using a Lagrange multiplier $\lambda$, the KKT conditions give $H\, \delta w + \lambda e_q = 0$, so $\delta w = -\lambda H^{-1} e_q$. Substituting into the constraint yields $\lambda = w_q / [H^{-1}]_{qq}$, which gives the stated update and saliency.
Why It Matters
OBS gives strictly better pruning decisions than OBD because it accounts for weight correlations. A weight might look important in isolation but be redundant given other weights. The inverse Hessian captures this redundancy.
Failure Mode
Computing and storing the full Hessian (or its inverse) is infeasible for large networks. For a model with $n$ parameters, the Hessian has $n^2$ entries. OBS is therefore limited to small networks or requires approximations (block-diagonal, low-rank, Kronecker-factored).
Comparing Pruning Criteria
Hierarchy of Saliency Approximations
Statement
The three pruning criteria form a hierarchy of approximations. Magnitude pruning assumes $H = I$ (identity). OBD assumes $H$ is diagonal. OBS uses the full Hessian. Formally:
- Magnitude: $S_q = \tfrac{1}{2}\, w_q^2$
- OBD: $S_q = \tfrac{1}{2}\, h_{qq}\, w_q^2$
- OBS: $S_q = \tfrac{1}{2}\, w_q^2 / [H^{-1}]_{qq}$
Each successive criterion uses more curvature information.
Intuition
Magnitude pruning treats all directions in weight space as equally important. OBD adds per-weight curvature. OBS adds cross-weight interactions. The more structure you use, the better your pruning decisions, but the higher the computational cost.
Proof Sketch
Setting $H = I$ in the OBD formula gives $S_q = \tfrac{1}{2}\, w_q^2$, which ranks weights by magnitude. Setting $H$ to be diagonal in the OBS formula gives $[H^{-1}]_{qq} = 1/h_{qq}$, so $S_q = \tfrac{1}{2}\, h_{qq}\, w_q^2$, recovering OBD.
Why It Matters
This hierarchy explains why magnitude pruning works at all: it is the crudest approximation to the correct saliency criterion. It also explains when magnitude pruning fails: when the Hessian is far from a scaled identity.
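A small demonstration of exactly this failure mode. The Hessian below is invented so that weights 0 and 1 are strongly correlated while weight 2 is independent:

```python
import numpy as np

# Illustrative Hessian: weights 0 and 1 interact strongly, weight 2 is
# independent. All values are made up for demonstration.
H = np.array([[1.0,  0.95, 0.0],
              [0.95, 1.0,  0.0],
              [0.0,  0.0,  0.5]])
H_inv = np.linalg.inv(H)
w = np.array([0.3, 0.35, 0.2])

s_mag = 0.5 * w**2                    # assumes H = I
s_obd = 0.5 * np.diag(H) * w**2       # diagonal approximation
s_obs = 0.5 * w**2 / np.diag(H_inv)   # full inverse Hessian

# Magnitude and OBD both pick the small independent weight (index 2);
# OBS instead prunes weight 0, which is nearly redundant with weight 1.
print(np.argmin(s_mag), np.argmin(s_obd), np.argmin(s_obs))
```

The strong off-diagonal coupling makes $[H^{-1}]_{qq}$ large for the correlated pair, so OBS sees them as cheap to remove; neither magnitude nor OBD can detect this.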
Failure Mode
All three criteria assume the network is at a local minimum. If training has not converged (common with early stopping), the first-order term $\nabla L(w)^\top \delta w$ is nonzero and the Taylor approximation is less accurate.
Connection to the Lottery Ticket Hypothesis
Frankle and Carbin (2019) observed that within a randomly initialized network, there exist sparse subnetworks (lottery tickets) that can be trained from scratch to full accuracy. This raises a question that OBD and OBS do not address: whether to prune at initialization or only after training.
OBD and OBS operate on trained networks. The lottery ticket hypothesis suggests that the structure of the initialization matters. Modern pruning research bridges both views: methods like SNIP (Lee et al. 2019) use gradient information at initialization to approximate saliency before training.
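A minimal sketch of a SNIP-style score, not the paper's exact recipe: each weight is scored by the magnitude of gradient times weight on a batch, before any training. The linear-regression model and all values below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))
y = rng.normal(size=32)
w = rng.normal(size=4)                 # random initialization, no training

grad = X.T @ (X @ w - y) / len(y)      # gradient of 0.5 * mean squared error
snip_scores = np.abs(grad * w)         # first-order connection sensitivity
keep = np.argsort(snip_scores)[-2:]    # keep the 2 most sensitive weights
```

The contrast with OBD/OBS is that the score is first-order and uses a nonzero gradient, which is exactly what is available at initialization.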
Canonical Examples
Pruning a 2-layer network
Consider a network with 4 weights $w = (0.1, 0.3, 0.5, 1.0)$ and diagonal Hessian entries $h = (20, 1, 8, 10)$.
Magnitude pruning removes $w_1 = 0.1$, the smallest weight.
OBD saliency: $S_q = \tfrac{1}{2}\, h_{qq}\, w_q^2$, giving $(0.10,\ 0.045,\ 1.0,\ 5.0)$.
OBD removes $w_2$ (saliency 0.045), not $w_1$ (saliency 0.10). Weight $w_2$ has larger magnitude than $w_1$ but sits in a much flatter region of the loss, so removing it causes less damage.
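A quick numeric check, using weights $w = (0.1, 0.3, 0.5, 1.0)$ and curvatures $h = (20, 1, 8, 10)$, a set of values consistent with the saliencies 0.045 and 0.10 quoted above:

```python
import numpy as np

w = np.array([0.1, 0.3, 0.5, 1.0])
h = np.array([20.0, 1.0, 8.0, 10.0])
s = 0.5 * h * w**2
# s = [0.10, 0.045, 1.0, 5.0]: magnitude pruning removes index 0,
# OBD removes index 1.
```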
Common Confusions
Pruning is not the same as regularization
L1 regularization encourages sparsity during training. Pruning removes weights after training (or during training with iterative pruning). They are complementary: L1 can be seen as soft pruning, while OBD/OBS perform hard pruning with explicit saliency analysis.
The Hessian at convergence is not always positive definite
OBS requires a positive definite Hessian. In practice, trained neural networks often have near-zero or negative Hessian eigenvalues. Adding a damping term $H + \lambda I$ is standard to ensure invertibility, but the choice of $\lambda$ affects pruning decisions.
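A sketch of the damping fix on a deliberately singular Hessian; the damping strength `lam` is a made-up tuning choice:

```python
import numpy as np

# Rank-1 Hessian, as can arise at convergence; plain inversion would fail.
H = np.array([[1.0, 1.0],
              [1.0, 1.0]])
lam = 1e-3                                   # damping strength (a tuning choice)
H_inv = np.linalg.inv(H + lam * np.eye(2))   # damped inverse
w = np.array([0.5, -0.5])
saliencies = 0.5 * w**2 / np.diag(H_inv)
# Different lam values change these saliencies, and hence the pruning order.
```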
Exercises
Problem
A network has two weights: $w_1 = 0.1$ with $h_{11} = 100$, and $w_2 = 0.5$ with $h_{22} = 2$. Which weight does magnitude pruning remove? Which does OBD remove? Compute both saliencies.
Problem
Derive the OBS optimal weight update using Lagrange multipliers. Start from the objective $\min_{\delta w} \tfrac{1}{2}\, \delta w^\top H\, \delta w$ subject to $e_q^\top \delta w + w_q = 0$.
References
Canonical:
- LeCun, Denker, Solla, "Optimal Brain Damage", NeurIPS 1989
- Hassibi and Stork, "Second Order Derivatives for Network Pruning: Optimal Brain Surgeon", NeurIPS 1993
Current:
- Frankle and Carbin, "The Lottery Ticket Hypothesis", ICLR 2019
- Blalock et al., "What is the State of Neural Network Pruning?", MLSys 2020
- Lee et al., "SNIP: Single-shot Network Pruning based on Connection Sensitivity", ICLR 2019
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- The Hessian Matrix (Layer 0A)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)