
Numerical Optimization

Second-Order Optimization Methods

Newton's method, Gauss-Newton, natural gradient, and K-FAC: how curvature information accelerates convergence, why the Hessian is too expensive to compute at scale, and Hessian-free alternatives that use Hessian-vector products.


Why This Matters

First-order methods (SGD, Adam) use only gradient information. They treat the loss landscape as if all directions have the same curvature. This is rarely true: loss landscapes of neural networks have wildly different curvatures in different directions, with condition numbers often exceeding $10^6$.

Second-order methods use curvature (Hessian) information to rescale the gradient, taking large steps in flat directions and small steps in steep directions. The theoretical payoff is quadratic convergence instead of linear. The practical cost is computing and inverting the Hessian, which is $O(n^2)$ storage and $O(n^3)$ computation for $n$ parameters.

The central tension of second-order optimization: better convergence per step vs. much higher cost per step.

Newton's Method Review

The Newton update for minimizing $L(w)$ is:

$$w_{t+1} = w_t - H_t^{-1} \nabla L(w_t)$$

where $H_t = \nabla^2 L(w_t)$ is the Hessian at $w_t$.
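To see the update and its convergence behavior concretely, here is a minimal sketch. The function $f(w) = e^w - 2w$ (minimum at $w^* = \ln 2$) and the starting point are illustrative choices, not from the text; the point is that the error roughly squares at each Newton step.

```python
import math

# Minimize f(w) = exp(w) - 2w, whose unique minimum is at w* = ln 2.
# Newton's method in 1D: w <- w - f'(w) / f''(w),
# with f'(w) = exp(w) - 2 and f''(w) = exp(w) > 0.

def newton_step(w):
    return w - (math.exp(w) - 2.0) / math.exp(w)

w_star = math.log(2.0)
w = 0.0
errors = []
for _ in range(5):
    w = newton_step(w)
    errors.append(abs(w - w_star))

# Quadratic convergence: each error is roughly the square of the previous
# one (up to a constant), so the number of correct digits doubles per step.
for e in errors:
    print(f"{e:.3e}")
```

Running this shows the error dropping from about $3 \times 10^{-1}$ to machine precision in five steps, each error roughly the square of the last.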

Theorem

Quadratic Convergence of Newton's Method

Statement

Assume $L$ is twice continuously differentiable with a Lipschitz-continuous Hessian near $w^*$, and that $\nabla^2 L(w^*)$ is positive definite. Then Newton's method converges quadratically: there exists a constant $C > 0$ such that

$$\|w_{t+1} - w^*\| \leq C \|w_t - w^*\|^2$$

for all iterates sufficiently close to the optimum $w^*$.

Intuition

Newton's method fits a quadratic approximation to the loss at each step and jumps to the minimum of that quadratic. If the loss is close to quadratic near the optimum, this jump is nearly exact, so the error squares at each step. Gradient descent on the same quadratic would only reduce the error by a constant factor per step (linear convergence).

Proof Sketch

Write $w_{t+1} - w^* = w_t - w^* - H_t^{-1} \nabla L(w_t)$. Taylor expand $\nabla L(w_t)$ around $w^*$: $\nabla L(w_t) = H^* (w_t - w^*) + O(\|w_t - w^*\|^2)$, where $H^* = \nabla^2 L(w^*)$. Then $w_{t+1} - w^* = (I - H_t^{-1} H^*)(w_t - w^*) + H_t^{-1} O(\|w_t - w^*\|^2)$. By Lipschitz continuity of the Hessian, $\|I - H_t^{-1} H^*\| = O(\|w_t - w^*\|)$, so the first term is $O(\|w_t - w^*\|^2)$, giving quadratic convergence.

Why It Matters

Quadratic convergence is dramatically faster than linear. If gradient descent needs 1000 iterations to reach error $10^{-6}$, Newton's method might need 20. But each Newton step costs $O(n^3)$ vs $O(n)$ for gradient descent, so the wall-clock comparison depends on $n$.

Failure Mode

(1) The Hessian must be positive definite. For non-convex losses (neural networks), the Hessian has negative eigenvalues, and Newton's method can ascend to saddle points. (2) The starting point must be close to the optimum (local convergence only). (3) For $n = 10^8$ parameters, storing the Hessian requires $10^{16}$ entries, roughly $10^7$ GB. This is completely infeasible.

Gauss-Newton Method

For least-squares problems $L(w) = \frac{1}{2} \|r(w)\|^2$, where $r(w)$ is a residual vector, the Hessian is:

$$H = J^T J + \sum_i r_i(w) \nabla^2 r_i(w)$$

where $J = \nabla r(w)$ is the Jacobian of the residual.

Proposition

Gauss-Newton Approximation

Statement

The Gauss-Newton method approximates the Hessian by $H \approx J^T J$, dropping the second-order residual terms. The update is:

$$w_{t+1} = w_t - (J_t^T J_t)^{-1} J_t^T r(w_t)$$

This is equivalent to solving the linearized least-squares problem at each step. The matrix $J^T J$ is always positive semidefinite.

Intuition

Near the solution, where residuals are small, the dropped terms $\sum_i r_i \nabla^2 r_i$ are small. The remaining $J^T J$ is guaranteed positive semidefinite, so Gauss-Newton always produces a descent direction (unlike full Newton, which can ascend at saddle points).

Proof Sketch

Write $H = J^T J + S$ where $S = \sum_i r_i \nabla^2 r_i$. If $\|r(w^*)\| \approx 0$, then $\|S\|$ is small near $w^*$, so $H \approx J^T J$. The matrix $J^T J$ is the Gram matrix of the Jacobian columns, which is positive semidefinite by construction.
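A minimal sketch of the Gauss-Newton iteration on a small curve-fitting problem. The model $y = a e^{bx}$, the noiseless data, and the starting point are all illustrative assumptions for the demo:

```python
import numpy as np

# Gauss-Newton for the nonlinear least-squares fit y ~ a * exp(b * x).
x = np.linspace(0.0, 1.0, 20)
a_true, b_true = 2.0, -1.5
y = a_true * np.exp(b_true * x)            # noiseless targets

def residual(w):
    a, b = w
    return a * np.exp(b * x) - y

def jacobian(w):
    a, b = w
    e = np.exp(b * x)
    return np.stack([e, a * x * e], axis=1)  # columns: dr/da, dr/db

w = np.array([1.5, -1.0])                  # rough initial guess
for _ in range(10):
    r, J = residual(w), jacobian(w)
    # Gauss-Newton step: solve (J^T J) dw = -J^T r
    dw = np.linalg.solve(J.T @ J, -J.T @ r)
    w = w + dw

print(w)   # should recover [a_true, b_true]
```

Because the residual at the solution is exactly zero here, the dropped term $S$ vanishes at $w^*$ and the iteration converges rapidly; with noisy data the convergence would only be linear.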

Why It Matters

Gauss-Newton is the basis for Levenberg-Marquardt (add damping: $J^T J + \lambda I$) and is closely related to the Fisher information matrix approach used in natural gradient methods.

Failure Mode

When residuals are large at the solution (model misspecification), the dropped term $S$ is not small, and Gauss-Newton may converge slowly or diverge. Damping ($\lambda I$) partially addresses this.

Natural Gradient

The ordinary gradient $\nabla L$ depends on the parameterization: reparameterizing the model changes the gradient direction, even though the underlying function has not changed. The natural gradient removes this dependence.

Definition

Natural Gradient

For a probabilistic model $p(y \mid x; w)$, the natural gradient is:

$$\tilde{\nabla} L = F^{-1} \nabla L$$

where $F$ is the Fisher information matrix:

$$F = \mathbb{E}_{p(y \mid x; w)}\left[\nabla \log p(y \mid x; w) \, \nabla \log p(y \mid x; w)^T\right]$$
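As a sanity check on this definition, the sketch below estimates the Fisher matrix by Monte Carlo for a single-input logistic model and compares it to the closed form $p(1-p)\,xx^T$. The weights, input, and sample count are illustrative choices:

```python
import numpy as np

# Fisher matrix check for logistic regression: p(y=1 | x; w) = sigmoid(w.x).
# For fixed x the score is (y - p) x, and E_y[(y - p)^2] = p(1 - p),
# so analytically F = p(1 - p) x x^T.

rng = np.random.default_rng(0)
w = np.array([0.5, -1.0])
x = np.array([1.0, 2.0])

p = 1.0 / (1.0 + np.exp(-w @ x))
F_analytic = p * (1 - p) * np.outer(x, x)

# Empirical Fisher: average outer product of scores with y ~ p(y | x; w).
# Note the expectation is under the model's own predictions, not the labels.
y = rng.random(200_000) < p
scores = (y.astype(float) - p)[:, None] * x[None, :]
F_mc = scores.T @ scores / len(y)

print(np.max(np.abs(F_mc - F_analytic)))   # small Monte-Carlo error
```

The key detail is that $y$ is sampled from the model itself; replacing the sampled labels with observed ones gives the "empirical Fisher," a different (and generally worse) curvature estimate.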

Theorem

Parameterization Invariance of Natural Gradient

Statement

The natural gradient descent trajectory in the space of distributions is independent of the parameterization. If $w$ and $\phi = g(w)$ are two parameterizations of the same model family, then natural gradient descent in $w$-space and in $\phi$-space trace the same path through distribution space (exactly in the continuous-time limit, and to first order in the step size for finite steps).

Intuition

The Fisher matrix $F$ acts as a Riemannian metric on the parameter space. Natural gradient descent is steepest descent with respect to KL divergence between distributions, not Euclidean distance between parameters. This is the "right" metric for probability distributions.

Proof Sketch

Under the reparameterization $\phi = g(w)$ with Jacobian $J_g = \partial \phi / \partial w$, the gradient transforms as $\nabla_\phi L = J_g^{-T} \nabla_w L$ and the Fisher matrix as $F_\phi = J_g^{-T} F_w J_g^{-1}$. Then $F_\phi^{-1} \nabla_\phi L = J_g F_w^{-1} J_g^T J_g^{-T} \nabla_w L = J_g F_w^{-1} \nabla_w L$: the natural gradient in $\phi$-space is exactly the natural gradient in $w$-space pushed forward by $J_g$, so both updates move the distribution the same way.
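The invariance can be checked numerically in one dimension. The sketch below runs one natural-gradient step on a Bernoulli model in two parameterizations, the mean $p$ and the logit $\theta$; the loss, step size, and starting point are illustrative. For a finite step size $\eta$ the two updates agree up to $O(\eta^2)$:

```python
import math

# One natural-gradient step on a Bernoulli model, loss L = -log p(y=1),
# in two parameterizations of the same distribution.

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

eta = 1e-3
p = 0.3
theta = math.log(p / (1 - p))              # same starting distribution

# Mean parameterization: grad = -1/p, Fisher = 1/(p(1-p))
p_new = p - eta * (p * (1 - p)) * (-1.0 / p)

# Logit parameterization: grad = sigmoid(theta) - 1, Fisher = s(1-s)
s = sigmoid(theta)
theta_new = theta - eta * (s - 1.0) / (s * (1.0 - s))

# Map the logit update back to distribution space and compare
print(abs(p_new - sigmoid(theta_new)))     # O(eta^2) discrepancy
```

An ordinary gradient step, by contrast, would disagree between the two parameterizations already at first order in $\eta$.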

Why It Matters

First-order methods are sensitive to parameterization: a simple rescaling of weights can change convergence speed. Natural gradient methods avoid this artifact. Adam partially approximates the natural gradient by dividing by a running average of squared gradients, which crudely estimates the diagonal entries of the Fisher.

Failure Mode

The Fisher matrix is $n \times n$ and suffers the same scalability problems as the Hessian. For non-probabilistic losses, the Fisher is not defined and you must use the Hessian directly.

K-FAC: Kronecker-Factored Approximate Curvature

Martens and Grosse (2015) observed that for a neural network, the Fisher matrix of each layer has approximate Kronecker structure.

Definition

K-FAC Approximation

For a fully connected layer with input $a$ and pre-activation gradient $g$, the Fisher block for that layer is $F_l \approx \mathbb{E}[aa^T] \otimes \mathbb{E}[gg^T]$. The inverse is then $A^{-1} \otimes G^{-1}$, where $A = \mathbb{E}[aa^T]$ and $G = \mathbb{E}[gg^T]$.

This reduces inversion from $O(n_l^3)$ for the full block (where $n_l$ is the number of parameters in layer $l$) to $O(d_{\text{in}}^3 + d_{\text{out}}^3)$, where $d_{\text{in}}$ and $d_{\text{out}}$ are the layer dimensions.

K-FAC maintains running averages of AA and GG and inverts them periodically. The per-step overhead is modest compared to SGD: roughly 10-20% additional computation.
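The identity that makes this cheap, $(A \otimes G)^{-1} = A^{-1} \otimes G^{-1}$, can be verified directly. The dimensions and random SPD factors below are illustrative stand-ins for the K-FAC statistics:

```python
import numpy as np

# Kronecker inverse identity behind K-FAC: inverting the full
# (d_in * d_out)-dimensional Fisher block only requires inverting
# the two small factors A and G.

rng = np.random.default_rng(0)
d_in, d_out = 4, 3

# Random SPD matrices standing in for A = E[a a^T] and G = E[g g^T]
Ma = rng.standard_normal((d_in, d_in))
A = Ma @ Ma.T + d_in * np.eye(d_in)
Mg = rng.standard_normal((d_out, d_out))
G = Mg @ Mg.T + d_out * np.eye(d_out)

big = np.kron(A, G)                        # full block, 12 x 12 here
cheap_inverse = np.kron(np.linalg.inv(A), np.linalg.inv(G))

# (A kron G) @ (A^-1 kron G^-1) should be the identity
print(np.max(np.abs(big @ cheap_inverse - np.eye(d_in * d_out))))
```

For a realistic layer (say $d_{\text{in}} = d_{\text{out}} = 4096$), the full block has $(d_{\text{in}} d_{\text{out}})^2 \approx 2.8 \times 10^{14}$ entries, while the two factors together have only about $3.4 \times 10^7$.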

Hessian-Free Methods

Computing the full Hessian costs $O(n^2)$. But computing a single Hessian-vector product $Hv$ costs only $O(n)$ using the "Pearlmutter trick" (automatic differentiation applied twice).

Given a vector $v$, compute:

$$Hv = \nabla_w \left( \nabla_w L \cdot v \right)$$

This requires one forward pass and two backward passes. With $Hv$ available, you can solve $H d = -\nabla L$ for the step direction $d$ approximately using conjugate gradients, without ever forming $H$ explicitly. This is the Hessian-free approach (Martens, 2010).
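A minimal sketch of this idea on a quadratic loss. The Hessian, dimensions, and tolerance are illustrative; on a real network the `hvp` function would be implemented with the Pearlmutter double-backprop trick rather than an explicit matrix:

```python
import numpy as np

# Hessian-free Newton step: solve H d = -grad by conjugate gradients,
# touching H only through matrix-vector products Hv. The loss here is
# the quadratic L(w) = 0.5 w^T H w - b^T w, so Hv is exact.

rng = np.random.default_rng(0)
n = 50
M = rng.standard_normal((n, n))
H = M @ M.T + n * np.eye(n)                # SPD Hessian
b = rng.standard_normal(n)

w = np.zeros(n)
grad = H @ w - b                           # gradient of the quadratic at w

def hvp(v):
    return H @ v                           # stand-in for a Pearlmutter product

# Conjugate gradients on H d = -grad, using only hvp()
d = np.zeros(n)
r = -grad - hvp(d)
p = r.copy()
for _ in range(n):
    Hp = hvp(p)
    alpha = (r @ r) / (p @ Hp)
    d += alpha * p
    r_new = r - alpha * Hp
    if np.linalg.norm(r_new) < 1e-10:
        break
    p = r_new + ((r_new @ r_new) / (r @ r)) * p
    r = r_new

# On a quadratic, one Newton step lands at the exact minimizer H^{-1} b
print(np.max(np.abs((w + d) - np.linalg.solve(H, b))))
```

In the full Hessian-free method the inner CG loop is truncated early and damped, trading step quality for per-iteration cost.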

Why Second-Order Methods Are Rarely Used at Scale

Despite theoretical advantages, first-order methods (gradient descent variants like SGD with momentum and Adam) dominate large-scale training. The reasons:

  1. Memory: even K-FAC requires storing and inverting per-layer covariance matrices.
  2. Stochastic noise: with mini-batches, the Hessian estimate is noisy. The noise in second-order information can outweigh its benefits.
  3. Non-convexity: convergence guarantees require positive definite Hessians, which neural network losses do not have globally.
  4. Engineering: Adam is simple to implement, tune, and parallelize. K-FAC requires careful distributed implementation.

Common Confusions

Watch Out

Adam is not a second-order method

Adam divides by $\sqrt{v_t}$, where $v_t$ is a running average of squared gradients. This approximates the diagonal of the Fisher matrix, but it is still a first-order method: it uses only first derivatives. True second-order methods use the full Hessian or Fisher, not just a diagonal approximation.

Watch Out

Quadratic convergence requires being near the optimum

Newton's quadratic convergence is a local result. Far from the optimum, Newton's method can diverge or oscillate. In practice, you need globalization strategies: line search, trust regions, or damping. These add overhead and reduce the per-step advantage.

Exercises

Exercise (Core)

Problem

A quadratic loss $L(w) = \frac{1}{2} w^T A w - b^T w$ has gradient $\nabla L = Aw - b$ and Hessian $H = A$. Show that Newton's method converges in exactly one step from any starting point (assuming $A$ is invertible).

Exercise (Advanced)

Problem

Explain why the Gauss-Newton matrix $J^T J$ is positive semidefinite but may not be positive definite. Under what condition is it positive definite? What happens to the Gauss-Newton update when $J^T J$ is singular?

Exercise (Research)

Problem

K-FAC approximates the Fisher block for layer $l$ as $A_l \otimes G_l$. The exact Fisher block is $\mathbb{E}[a a^T \otimes g g^T]$. Under what conditions does the Kronecker factorization $\mathbb{E}[aa^T] \otimes \mathbb{E}[gg^T]$ equal $\mathbb{E}[aa^T \otimes gg^T]$? When does this fail?

References

Canonical:

  • Nocedal and Wright, Numerical Optimization, Chapters 7 and 10
  • Amari, "Natural Gradient Works Efficiently in Learning", Neural Computation 1998

Current:

  • Martens, "Deep Learning via Hessian-Free Optimization", ICML 2010
  • Martens and Grosse, "Optimizing Neural Networks with Kronecker-Factored Approximate Curvature", ICML 2015

Last reviewed: April 2026
