
ML Methods

Optimal Brain Surgery and Pruning Theory

Principled weight pruning via second-order information: Optimal Brain Damage uses the Hessian diagonal, Optimal Brain Surgeon uses the full inverse Hessian, and both reveal why magnitude pruning is a crude but popular approximation.


Why This Matters

Neural networks trained via backpropagation are overparameterized. A 100M-parameter model often contains millions of weights that contribute almost nothing to the output. Removing these weights (pruning) reduces memory, speeds up inference, and can even improve generalization.

The question is: which weights should you remove? Magnitude pruning (remove the smallest weights) is the most common heuristic. But it has no theoretical justification beyond a vague appeal to "small weights don't matter." Optimal Brain Damage (OBD) and Optimal Brain Surgeon (OBS) answer this question properly using second-order information from the loss surface.

Formal Setup

Consider a trained network with weight vector $w \in \mathbb{R}^n$ at a local minimum of the loss $L(w)$. We want to set one weight $w_q$ to zero and find the resulting increase in loss.

Definition

Saliency

The saliency of weight $w_q$ is the increase in loss caused by removing it:

$$s_q = L(w + \delta w) - L(w)$$

where $\delta w$ is the perturbation that sets $w_q = 0$ and possibly adjusts other weights to compensate.

A second-order Taylor expansion around the trained weights, using the Hessian matrix, gives:

$$\delta L = \nabla L^T \delta w + \frac{1}{2} \delta w^T H \delta w + O(\|\delta w\|^3)$$

At a local minimum, $\nabla L \approx 0$, so:

$$\delta L \approx \frac{1}{2} \delta w^T H \delta w$$

where $H$ is the Hessian of the loss with respect to the weights.
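For a loss that is exactly quadratic around its minimum, the approximation above is an equality, which makes it easy to sanity-check numerically. A minimal sketch (the Hessian and minimum here are randomly generated stand-ins, not from any real network):

```python
import numpy as np

# Hypothetical quadratic loss L(w) = 1/2 (w - w*)^T H (w - w*),
# minimized at w = w*.  For this loss, the second-order Taylor
# prediction delta_L = 1/2 delta_w^T H delta_w is exact.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
H = A @ A.T + 4 * np.eye(4)        # symmetric positive definite Hessian
w_star = rng.standard_normal(4)    # the local minimum

def loss(w):
    d = w - w_star
    return 0.5 * d @ H @ d

delta_w = np.array([0.1, -0.2, 0.05, 0.0])
exact = loss(w_star + delta_w) - loss(w_star)
predicted = 0.5 * delta_w @ H @ delta_w
assert np.isclose(exact, predicted)
```

For a real network the loss is not exactly quadratic, so the equality becomes the approximation $\delta L \approx \frac{1}{2}\delta w^T H \delta w$, accurate for small perturbations.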

Optimal Brain Damage (LeCun et al. 1989)

Definition

OBD Diagonal Approximation

OBD assumes the Hessian is diagonal: $H \approx \text{diag}(h_{11}, h_{22}, \ldots, h_{nn})$. Under this approximation, the saliency of deleting weight $w_q$ (setting $\delta w_q = -w_q$ and $\delta w_i = 0$ for $i \neq q$) is:

$$s_q^{\text{OBD}} = \frac{1}{2} h_{qq} w_q^2$$

Prune the weight with the smallest $s_q^{\text{OBD}}$.

The diagonal Hessian entries $h_{qq}$ can be computed efficiently during backpropagation. The cost is roughly the same as computing gradients.
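For a toy linear least-squares model, the diagonal Hessian entries are available in closed form, which makes the OBD criterion concrete. A minimal numpy sketch (the data `X`, `y` and the model are illustrative, not from the original papers):

```python
import numpy as np

# For a linear model f(x; w) = x @ w with least-squares loss
# L(w) = 1/2 * sum_i (f(x_i; w) - y_i)^2, the Hessian is exactly X^T X,
# so the diagonal entries are h_qq = sum_i x_iq^2 -- one pass over the
# data, no second derivatives needed.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))
y = rng.standard_normal(100)
w, *_ = np.linalg.lstsq(X, y, rcond=None)   # train to the minimum

h_diag = np.sum(X**2, axis=0)               # diagonal Hessian entries
saliency_obd = 0.5 * h_diag * w**2          # OBD saliency per weight
prune_idx = int(np.argmin(saliency_obd))    # cheapest weight to remove
```

For general networks the diagonal is instead estimated during backpropagation (e.g. via a Gauss-Newton or Fisher approximation), but the criterion $\frac{1}{2} h_{qq} w_q^2$ is applied the same way.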

Watch Out

OBD is not magnitude pruning

Magnitude pruning removes the weight with the smallest $|w_q|$. OBD removes the weight with the smallest $h_{qq} w_q^2$. A weight can be large but sit in a flat region of the loss (small $h_{qq}$), making it cheap to remove. A weight can be small but sit on a steep ridge (large $h_{qq}$), making it expensive to remove. OBD accounts for the local curvature; magnitude pruning does not.

Optimal Brain Surgeon (Hassibi and Stork, 1993)

OBS removes the diagonal approximation and uses the full inverse Hessian. When deleting weight wqw_q, the remaining weights are adjusted to compensate.

Theorem

OBS Optimal Weight Update

Statement

The optimal perturbation for deleting weight $w_q$ is:

$$\delta w = -\frac{w_q}{[H^{-1}]_{qq}} H^{-1} e_q$$

and the resulting saliency is:

$$s_q^{\text{OBS}} = \frac{w_q^2}{2 [H^{-1}]_{qq}}$$

where $e_q$ is the $q$-th standard basis vector and $[H^{-1}]_{qq}$ is the $q$-th diagonal element of the inverse Hessian.

Intuition

OBS asks: if I must set $w_q = 0$, what is the best way to adjust all other weights to minimize the damage? The answer involves the inverse Hessian because it captures how weights interact through the loss surface. The correction $\delta w$ shifts other weights along the direction that most efficiently compensates for the deleted weight.

Proof Sketch

Minimize $\frac{1}{2} \delta w^T H \delta w$ subject to the constraint $e_q^T (w + \delta w) = 0$, i.e., $e_q^T \delta w = -w_q$. This is a quadratic program with a linear equality constraint. Using a Lagrange multiplier $\lambda$: the KKT conditions give $H \delta w + \lambda e_q = 0$, so $\delta w = -\lambda H^{-1} e_q$. Substituting into the constraint yields $\lambda = w_q / [H^{-1}]_{qq}$.
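The two closed-form expressions above can be checked numerically: the OBS perturbation should zero the chosen weight exactly, and the induced loss increase $\frac{1}{2}\delta w^T H \delta w$ should match the closed-form saliency. A sketch on a random positive definite Hessian (not a real network's):

```python
import numpy as np

# Verify the OBS update: delta_w zeroes weight q, and the quadratic
# loss increase equals the closed-form saliency w_q^2 / (2 [H^-1]_qq).
rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6))
H = A @ A.T + 1e-2 * np.eye(6)     # positive definite Hessian
H_inv = np.linalg.inv(H)
w = rng.standard_normal(6)

q = 3                              # weight chosen for deletion
e_q = np.zeros(6)
e_q[q] = 1.0
delta_w = -(w[q] / H_inv[q, q]) * (H_inv @ e_q)

assert np.isclose((w + delta_w)[q], 0.0)        # weight is zeroed
saliency = w[q]**2 / (2 * H_inv[q, q])
assert np.isclose(0.5 * delta_w @ H @ delta_w, saliency)
```

Note that `delta_w` generally moves every weight, not just $w_q$: that redistribution is what distinguishes OBS from OBD.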

Why It Matters

Under the local quadratic model, OBS gives better pruning decisions than OBD because it accounts for weight correlations. A weight might look important in isolation but be redundant given other weights. The inverse Hessian captures this redundancy.

Failure Mode

Computing and storing the full $n \times n$ Hessian (or its inverse) is infeasible for large networks. For a model with $10^8$ parameters, the Hessian has $10^{16}$ entries. OBS is therefore limited to small networks or requires approximations (block-diagonal, low-rank, Kronecker-factored).

Comparing Pruning Criteria

Proposition

Hierarchy of Saliency Approximations

Statement

The three pruning criteria form a hierarchy of approximations. Magnitude pruning assumes $H = I$ (identity). OBD assumes $H$ is diagonal. OBS uses the full Hessian. Formally:

  • Magnitude: $s_q = \frac{1}{2} w_q^2$
  • OBD: $s_q = \frac{1}{2} h_{qq} w_q^2$
  • OBS: $s_q = \frac{w_q^2}{2 [H^{-1}]_{qq}}$

Each successive criterion uses more curvature information.

Intuition

Magnitude pruning treats all directions in weight space as equally important. OBD adds per-weight curvature. OBS adds cross-weight interactions. The more structure you use, the better your pruning decisions, but the higher the computational cost.

Proof Sketch

Setting $H = I$ in the OBD formula gives $s_q = \frac{1}{2} w_q^2$. Setting $H$ to be diagonal in the OBS formula gives $[H^{-1}]_{qq} = 1/h_{qq}$, so $s_q = \frac{1}{2} h_{qq} w_q^2$, recovering OBD.
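The collapse of OBS to OBD for a diagonal Hessian is easy to confirm numerically (the weights and curvatures below are arbitrary illustrative values):

```python
import numpy as np

# When H is diagonal, [H^-1]_qq = 1 / h_qq, so the OBS saliency
# w_q^2 / (2 [H^-1]_qq) collapses to the OBD saliency h_qq * w_q^2 / 2.
w = np.array([0.5, 0.1, 0.3, 0.8])
h = np.array([2.0, 20.0, 1.0, 0.5])
H = np.diag(h)
H_inv_diag = np.diag(np.linalg.inv(H))

s_obd = 0.5 * h * w**2
s_obs = w**2 / (2 * H_inv_diag)
assert np.allclose(s_obd, s_obs)   # identical rankings, identical values
```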

Why It Matters

This hierarchy explains why magnitude pruning works at all: it is the correct saliency criterion in the special case $H = I$, and a crude approximation otherwise. It also explains when magnitude pruning fails: when the Hessian is far from a scaled identity.

Failure Mode

All three criteria assume the network is at a local minimum. If training has not converged (common with early stopping), the gradient term $\nabla L^T \delta w$ is nonzero and the Taylor approximation is less accurate.

Connection to the Lottery Ticket Hypothesis

Frankle and Carbin (2019) observed that within a randomly initialized network, there exist sparse subnetworks (lottery tickets) that can be trained from scratch to full accuracy. This raises a question that OBD and OBS do not answer: should one prune at initialization or only after training?

OBD and OBS operate on trained networks. The lottery ticket hypothesis suggests that the structure of the initialization matters. Modern pruning research bridges both views: methods like SNIP (Lee et al. 2019) use gradient information at initialization to approximate saliency before training.
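A SNIP-style score can be sketched in a few lines. The gradients below are random stand-ins; in the actual method they come from one backward pass on a mini-batch before any training:

```python
import numpy as np

# SNIP-style connection sensitivity: at initialization, score each
# weight by |dL/dw_q * w_q| and keep the top-k highest-scoring
# connections.  Gradients here are placeholders for a real backward pass.
rng = np.random.default_rng(3)
w_init = rng.standard_normal(10)       # random initialization
grads = rng.standard_normal(10)        # assumed mini-batch gradient

scores = np.abs(grads * w_init)        # sensitivity of loss to each connection
k = 4                                  # keep the 4 most sensitive weights
keep = np.argsort(scores)[-k:]
mask = np.zeros(10, dtype=bool)
mask[keep] = True                      # sparse mask applied before training
```

Note the structural parallel: like OBD, this is a per-weight saliency of the form "gradient information times weight", but evaluated at initialization rather than at a minimum (where first-order terms would vanish).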

Canonical Examples

Example

Pruning a 2-layer network

Consider a network with 4 weights $w = (0.5, 0.1, 0.3, 0.8)$ and diagonal Hessian entries $h = (2.0, 20.0, 1.0, 0.5)$.

Magnitude pruning removes $w_2 = 0.1$.

OBD saliency: $s = (0.25, 0.10, 0.045, 0.16)$.

OBD removes $w_3$ (saliency 0.045), not $w_2$ (saliency 0.10). Weight $w_3$ has larger magnitude than $w_2$ but sits in a much flatter region of the loss, so removing it causes less damage.
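The example can be reproduced in a few lines, confirming that the two criteria disagree on these numbers:

```python
import numpy as np

# Worked example: magnitude pruning and OBD pick different weights.
w = np.array([0.5, 0.1, 0.3, 0.8])
h = np.array([2.0, 20.0, 1.0, 0.5])

s_obd = 0.5 * h * w**2               # saliencies: 0.25, 0.10, 0.045, 0.16
assert np.argmin(np.abs(w)) == 1     # magnitude pruning removes w_2 (index 1)
assert np.argmin(s_obd) == 2         # OBD removes w_3 (index 2)
```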

Common Confusions

Watch Out

Pruning is not the same as regularization

L1 regularization encourages sparsity during training. Pruning removes weights after training (or during training with iterative pruning). They are complementary: L1 can be seen as soft pruning, while OBD/OBS perform hard pruning with explicit saliency analysis.

Watch Out

The Hessian at convergence is not always positive definite

OBS requires a positive definite Hessian. In practice, trained neural networks often have near-zero or negative Hessian eigenvalues. Adding a damping term $H + \lambda I$ is standard to ensure invertibility, but the choice of $\lambda$ affects pruning decisions.
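The damping fix is mechanical but worth seeing once. A minimal sketch with a deliberately singular $2 \times 2$ Hessian (the matrix and $\lambda$ are illustrative):

```python
import numpy as np

# A singular Hessian breaks the OBS inverse; damping H + lambda*I
# restores invertibility.  lambda is a real hyperparameter: it shifts
# the saliencies, so different choices can change which weight is pruned.
H = np.array([[1.0, 1.0],
              [1.0, 1.0]])            # rank 1: eigenvalues 2 and 0
lam = 1e-3
H_damped = H + lam * np.eye(2)
H_inv = np.linalg.inv(H_damped)       # well-defined after damping

w = np.array([0.7, -0.4])
s_obs = w**2 / (2 * np.diag(H_inv))   # OBS saliencies, now finite
```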

Exercises

ExerciseCore

Problem

A network has two weights: $w_1 = 2.0$ with $h_{11} = 0.1$, and $w_2 = 0.5$ with $h_{22} = 10.0$. Which weight does magnitude pruning remove? Which does OBD remove? Compute both saliencies.

ExerciseAdvanced

Problem

Derive the OBS optimal weight update using Lagrange multipliers. Start from the objective $\min_{\delta w} \frac{1}{2} \delta w^T H \delta w$ subject to $e_q^T(w + \delta w) = 0$.

References

Canonical:

  • LeCun, Denker, Solla, "Optimal Brain Damage", NeurIPS 1989
  • Hassibi and Stork, "Second Order Derivatives for Network Pruning: Optimal Brain Surgeon", NeurIPS 1993

Current:

  • Frankle and Carbin, "The Lottery Ticket Hypothesis", ICLR 2019
  • Blalock et al., "What is the State of Neural Network Pruning?", MLSys 2020
  • Lee et al., "SNIP: Single-shot Network Pruning based on Connection Sensitivity", ICLR 2019

Last reviewed: April 2026
