
ML Methods

Optimal Brain Surgery and Pruning Theory

Principled weight pruning via second-order information: Optimal Brain Damage uses the Hessian diagonal, Optimal Brain Surgeon uses the full inverse Hessian, and both reveal why magnitude pruning is a crude but popular approximation.


Why This Matters

Neural networks trained via backpropagation are overparameterized. A 100M-parameter model often contains millions of weights that contribute almost nothing to the output. Removing these weights (pruning) reduces memory, speeds up inference, and can even improve generalization.

The question is: which weights should you remove? Magnitude pruning (remove the smallest weights) is the most common heuristic. But it has no theoretical justification beyond a vague appeal to "small weights don't matter." Optimal Brain Damage (OBD) and Optimal Brain Surgeon (OBS) answer this question properly using second-order information from the loss surface.

Formal Setup

Consider a trained network with weight vector $w \in \mathbb{R}^n$ at a local minimum of the loss $L(w)$. We want to set one weight $w_q$ to zero and find the resulting increase in loss.

Definition

Saliency

The saliency of weight $w_q$ is the increase in loss caused by removing it:

$$s_q = L(w + \delta w) - L(w)$$

where $\delta w$ is the perturbation that sets $w_q = 0$ and possibly adjusts other weights to compensate.

A second-order Taylor expansion around the trained weights, using the Hessian matrix, gives:

$$\delta L = \nabla L^T \delta w + \frac{1}{2} \delta w^T H \delta w + O(\|\delta w\|^3)$$

At a local minimum, $\nabla L \approx 0$, so:

$$\delta L \approx \frac{1}{2} \delta w^T H \delta w$$

where $H$ is the Hessian of the loss with respect to the weights.
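For a loss that is exactly quadratic around its minimum, the approximation above is an equality, which makes it easy to sanity-check numerically. A minimal sketch (the Hessian and minimum here are randomly generated stand-ins, not from any real network):

```python
import numpy as np

# Hypothetical quadratic loss L(w) = 1/2 (w - w*)^T H (w - w*),
# minimized at w = w*.  For this loss, the second-order Taylor
# prediction delta_L = 1/2 delta_w^T H delta_w is exact.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
H = A @ A.T + 4 * np.eye(4)        # symmetric positive definite Hessian
w_star = rng.standard_normal(4)    # the local minimum

def loss(w):
    d = w - w_star
    return 0.5 * d @ H @ d

delta_w = np.array([0.1, -0.2, 0.05, 0.0])
exact = loss(w_star + delta_w) - loss(w_star)
predicted = 0.5 * delta_w @ H @ delta_w
assert np.isclose(exact, predicted)
```

For a real network the loss is not exactly quadratic, so the equality becomes the approximation $\delta L \approx \frac{1}{2}\delta w^T H \delta w$, accurate for small perturbations.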

Optimal Brain Damage (LeCun et al. 1989)

Definition

OBD Diagonal Approximation

OBD assumes the Hessian is diagonal: $H \approx \text{diag}(h_{11}, h_{22}, \ldots, h_{nn})$. Under this approximation, the saliency of deleting weight $w_q$ (setting $\delta w_q = -w_q$ and $\delta w_i = 0$ for $i \neq q$) is:

$$s_q^{\text{OBD}} = \frac{1}{2} h_{qq} w_q^2$$

Prune the weight with the smallest $s_q^{\text{OBD}}$.

The diagonal Hessian entries $h_{qq}$ can be computed efficiently during backpropagation. The cost is roughly the same as computing gradients.
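For a toy linear least-squares model, the diagonal Hessian entries are available in closed form, which makes the OBD criterion concrete. A minimal numpy sketch (the data `X`, `y` and the model are illustrative, not from the original papers):

```python
import numpy as np

# For a linear model f(x; w) = x @ w with least-squares loss
# L(w) = 1/2 * sum_i (f(x_i; w) - y_i)^2, the Hessian is exactly X^T X,
# so the diagonal entries are h_qq = sum_i x_iq^2 -- one pass over the
# data, no second derivatives needed.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))
y = rng.standard_normal(100)
w, *_ = np.linalg.lstsq(X, y, rcond=None)   # train to the minimum

h_diag = np.sum(X**2, axis=0)               # diagonal Hessian entries
saliency_obd = 0.5 * h_diag * w**2          # OBD saliency per weight
prune_idx = int(np.argmin(saliency_obd))    # cheapest weight to remove
```

For general networks the diagonal is instead estimated during backpropagation (e.g. via a Gauss-Newton or Fisher approximation), but the criterion $\frac{1}{2} h_{qq} w_q^2$ is applied the same way.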

Watch Out

OBD is not magnitude pruning

Magnitude pruning removes the weight with the smallest $|w_q|$. OBD removes the weight with the smallest $h_{qq} w_q^2$. A weight can be large but sit in a flat region of the loss (small $h_{qq}$), making it cheap to remove. A weight can be small but sit on a steep ridge (large $h_{qq}$), making it expensive to remove. OBD accounts for the local curvature; magnitude pruning does not.

Optimal Brain Surgeon (Hassibi and Stork, 1993)

OBS removes the diagonal approximation and uses the full inverse Hessian. When deleting weight wqw_q, the remaining weights are adjusted to compensate.

Theorem

OBS Optimal Weight Update

Statement

The optimal perturbation for deleting weight $w_q$ is:

$$\delta w = -\frac{w_q}{[H^{-1}]_{qq}} H^{-1} e_q$$

and the resulting saliency is:

$$s_q^{\text{OBS}} = \frac{w_q^2}{2 [H^{-1}]_{qq}}$$

where $e_q$ is the $q$-th standard basis vector and $[H^{-1}]_{qq}$ is the $q$-th diagonal element of the inverse Hessian.

Intuition

OBS asks: if I must set $w_q = 0$, what is the best way to adjust all other weights to minimize the damage? The answer involves the inverse Hessian because it captures how weights interact through the loss surface. The correction $\delta w$ shifts other weights along the direction that most efficiently compensates for the deleted weight.

Proof Sketch

Minimize $\frac{1}{2} \delta w^T H \delta w$ subject to the constraint $e_q^T (w + \delta w) = 0$, i.e., $e_q^T \delta w = -w_q$. This is a quadratic program with a linear equality constraint. Using a Lagrange multiplier $\lambda$: the KKT conditions give $H \delta w + \lambda e_q = 0$, so $\delta w = -\lambda H^{-1} e_q$. Substituting into the constraint yields $\lambda = w_q / [H^{-1}]_{qq}$.
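The two closed-form expressions above can be checked numerically: the OBS perturbation should zero the chosen weight exactly, and the induced loss increase $\frac{1}{2}\delta w^T H \delta w$ should match the closed-form saliency. A sketch on a random positive definite Hessian (not a real network's):

```python
import numpy as np

# Verify the OBS update: delta_w zeroes weight q, and the quadratic
# loss increase equals the closed-form saliency w_q^2 / (2 [H^-1]_qq).
rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6))
H = A @ A.T + 1e-2 * np.eye(6)     # positive definite Hessian
H_inv = np.linalg.inv(H)
w = rng.standard_normal(6)

q = 3                              # weight chosen for deletion
e_q = np.zeros(6)
e_q[q] = 1.0
delta_w = -(w[q] / H_inv[q, q]) * (H_inv @ e_q)

assert np.isclose((w + delta_w)[q], 0.0)        # weight is zeroed
saliency = w[q]**2 / (2 * H_inv[q, q])
assert np.isclose(0.5 * delta_w @ H @ delta_w, saliency)
```

Note that `delta_w` generally moves every weight, not just $w_q$: that redistribution is what distinguishes OBS from OBD.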

Why It Matters

Under the local quadratic model, OBS gives better pruning decisions than OBD because it accounts for weight correlations. A weight might look important in isolation but be redundant given other weights. The inverse Hessian captures this redundancy.

Failure Mode

Computing and storing the full $n \times n$ Hessian (or its inverse) is infeasible for large networks. For a model with $10^8$ parameters, the Hessian has $10^{16}$ entries. OBS is therefore limited to small networks or requires approximations (block-diagonal, low-rank, Kronecker-factored).

Comparing Pruning Criteria

Proposition

Hierarchy of Saliency Approximations

Statement

The three pruning criteria form a hierarchy of approximations. Magnitude pruning assumes $H = I$ (identity). OBD assumes $H$ is diagonal. OBS uses the full Hessian. Formally:

  • Magnitude: $s_q = \frac{1}{2} w_q^2$
  • OBD: $s_q = \frac{1}{2} h_{qq} w_q^2$
  • OBS: $s_q = \frac{w_q^2}{2 [H^{-1}]_{qq}}$

Each successive criterion uses more curvature information.

Intuition

Magnitude pruning treats all directions in weight space as equally important. OBD adds per-weight curvature. OBS adds cross-weight interactions. The more structure you use, the better your pruning decisions, but the higher the computational cost.

Proof Sketch

Setting $H = I$ in the OBD formula gives $s_q = \frac{1}{2} w_q^2$. Setting $H$ to be diagonal in the OBS formula gives $[H^{-1}]_{qq} = 1/h_{qq}$, so $s_q = \frac{1}{2} h_{qq} w_q^2$, recovering OBD.
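The collapse of OBS to OBD for a diagonal Hessian is easy to confirm numerically (the weights and curvatures below are arbitrary illustrative values):

```python
import numpy as np

# When H is diagonal, [H^-1]_qq = 1 / h_qq, so the OBS saliency
# w_q^2 / (2 [H^-1]_qq) collapses to the OBD saliency h_qq * w_q^2 / 2.
w = np.array([0.5, 0.1, 0.3, 0.8])
h = np.array([2.0, 20.0, 1.0, 0.5])
H = np.diag(h)
H_inv_diag = np.diag(np.linalg.inv(H))

s_obd = 0.5 * h * w**2
s_obs = w**2 / (2 * H_inv_diag)
assert np.allclose(s_obd, s_obs)   # identical rankings, identical values
```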

Why It Matters

This hierarchy explains why magnitude pruning works at all: it is the correct saliency criterion in the special case $H = I$, and a crude approximation otherwise. It also explains when magnitude pruning fails: when the Hessian is far from a scaled identity.

Failure Mode

All three criteria assume the network is at a local minimum. If training has not converged (common with early stopping), the gradient term $\nabla L^T \delta w$ is nonzero and the Taylor approximation is less accurate.

Connection to the Lottery Ticket Hypothesis

Frankle and Carbin (2019) observed that within a randomly initialized network, there exist sparse subnetworks (lottery tickets) that can be trained from scratch to full accuracy. This raises a question that OBD and OBS do not answer: should one prune at initialization or only after training?

OBD and OBS operate on trained networks. The lottery ticket hypothesis suggests that the structure of the initialization matters. Modern pruning research bridges both views: methods like SNIP (Lee et al. 2019) use gradient information at initialization to approximate saliency before training.
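A SNIP-style score can be sketched in a few lines. The gradients below are random stand-ins; in the actual method they come from one backward pass on a mini-batch before any training:

```python
import numpy as np

# SNIP-style connection sensitivity: at initialization, score each
# weight by |dL/dw_q * w_q| and keep the top-k highest-scoring
# connections.  Gradients here are placeholders for a real backward pass.
rng = np.random.default_rng(3)
w_init = rng.standard_normal(10)       # random initialization
grads = rng.standard_normal(10)        # assumed mini-batch gradient

scores = np.abs(grads * w_init)        # sensitivity of loss to each connection
k = 4                                  # keep the 4 most sensitive weights
keep = np.argsort(scores)[-k:]
mask = np.zeros(10, dtype=bool)
mask[keep] = True                      # sparse mask applied before training
```

Note the structural parallel: like OBD, this is a per-weight saliency of the form "gradient information times weight", but evaluated at initialization rather than at a minimum (where first-order terms would vanish).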

Canonical Examples

Example

Pruning a 2-layer network

Consider a network with 4 weights $w = (0.5, 0.1, 0.3, 0.8)$ and diagonal Hessian entries $h = (2.0, 20.0, 1.0, 0.5)$.

Magnitude pruning removes $w_2 = 0.1$.

OBD saliency: $s = (0.25, 0.10, 0.045, 0.16)$.

OBD removes $w_3$ (saliency 0.045), not $w_2$ (saliency 0.10). Weight $w_3$ has larger magnitude than $w_2$ but sits in a much flatter region of the loss, so removing it causes less damage.
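The example can be reproduced in a few lines, confirming that the two criteria disagree on these numbers:

```python
import numpy as np

# Worked example: magnitude pruning and OBD pick different weights.
w = np.array([0.5, 0.1, 0.3, 0.8])
h = np.array([2.0, 20.0, 1.0, 0.5])

s_obd = 0.5 * h * w**2               # saliencies: 0.25, 0.10, 0.045, 0.16
assert np.argmin(np.abs(w)) == 1     # magnitude pruning removes w_2 (index 1)
assert np.argmin(s_obd) == 2         # OBD removes w_3 (index 2)
```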

Common Confusions

Watch Out

Pruning is not the same as regularization

L1 regularization encourages sparsity during training. Pruning removes weights after training (or during training with iterative pruning). They are complementary: L1 can be seen as soft pruning, while OBD/OBS perform hard pruning with explicit saliency analysis.

Watch Out

The Hessian at convergence is not always positive definite

OBS requires a positive definite Hessian. In practice, trained neural networks often have near-zero or negative Hessian eigenvalues. Adding a damping term $H + \lambda I$ is standard to ensure invertibility, but the choice of $\lambda$ affects pruning decisions.
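The damping fix is mechanical but worth seeing once. A minimal sketch with a deliberately singular $2 \times 2$ Hessian (the matrix and $\lambda$ are illustrative):

```python
import numpy as np

# A singular Hessian breaks the OBS inverse; damping H + lambda*I
# restores invertibility.  lambda is a real hyperparameter: it shifts
# the saliencies, so different choices can change which weight is pruned.
H = np.array([[1.0, 1.0],
              [1.0, 1.0]])            # rank 1: eigenvalues 2 and 0
lam = 1e-3
H_damped = H + lam * np.eye(2)
H_inv = np.linalg.inv(H_damped)       # well-defined after damping

w = np.array([0.7, -0.4])
s_obs = w**2 / (2 * np.diag(H_inv))   # OBS saliencies, now finite
```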

Exercises

ExerciseCore

Problem

A network has two weights: $w_1 = 2.0$ with $h_{11} = 0.1$, and $w_2 = 0.5$ with $h_{22} = 10.0$. Which weight does magnitude pruning remove? Which does OBD remove? Compute both saliencies.

ExerciseAdvanced

Problem

Derive the OBS optimal weight update using Lagrange multipliers. Start from the objective $\min_{\delta w} \frac{1}{2} \delta w^T H \delta w$ subject to $e_q^T(w + \delta w) = 0$.

References

Canonical:

  • LeCun, Denker, Solla, "Optimal Brain Damage", NeurIPS 1989
  • Hassibi and Stork, "Second Order Derivatives for Network Pruning: Optimal Brain Surgeon", NeurIPS 1993

Current:

  • Frankle and Carbin, "The Lottery Ticket Hypothesis", ICLR 2019
  • Blalock et al., "What is the State of Neural Network Pruning?", MLSys 2020
  • Lee et al., "SNIP: Single-shot Network Pruning based on Connection Sensitivity", ICLR 2019

Last reviewed: April 2026
