
Numerical Optimization

Second-Order Optimization Methods

Newton's method, Gauss-Newton, natural gradient, and K-FAC: how curvature information accelerates convergence, why the Hessian is too expensive to compute at scale, and Hessian-free alternatives that use Hessian-vector products.


Why This Matters

First-order methods (SGD, Adam) use only gradient information. They treat the loss landscape as if all directions have the same curvature. This is rarely true: loss landscapes of neural networks have wildly different curvatures in different directions, with condition numbers often exceeding $10^6$.

Second-order methods use curvature (Hessian) information to rescale the gradient, taking large steps in flat directions and small steps in steep directions. The theoretical payoff is quadratic convergence instead of linear. The practical cost is computing and inverting the Hessian, which is $O(n^2)$ storage and $O(n^3)$ computation for $n$ parameters.

The central tension of second-order optimization: better convergence per step vs. much higher cost per step.

Newton's Method Review

The Newton update for minimizing $L(w)$ is:

$$w_{t+1} = w_t - H_t^{-1} \nabla L(w_t)$$

where $H_t = \nabla^2 L(w_t)$ is the Hessian at $w_t$.
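To see the update and its convergence behavior concretely, here is a minimal sketch. The function $f(w) = e^w - 2w$ (minimum at $w^* = \ln 2$) and the starting point are illustrative choices, not from the text; the point is that the error roughly squares at each Newton step.

```python
import math

# Minimize f(w) = exp(w) - 2w, whose unique minimum is at w* = ln 2.
# Newton's method in 1D: w <- w - f'(w) / f''(w),
# with f'(w) = exp(w) - 2 and f''(w) = exp(w) > 0.

def newton_step(w):
    return w - (math.exp(w) - 2.0) / math.exp(w)

w_star = math.log(2.0)
w = 0.0
errors = []
for _ in range(5):
    w = newton_step(w)
    errors.append(abs(w - w_star))

# Quadratic convergence: each error is roughly the square of the previous
# one (up to a constant), so the number of correct digits doubles per step.
for e in errors:
    print(f"{e:.3e}")
```

Running this shows the error dropping from about $3 \times 10^{-1}$ to machine precision in five steps, each error roughly the square of the last.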

Theorem

Quadratic Convergence of Newton's Method

Statement

Assume $L$ is twice continuously differentiable with a Lipschitz-continuous Hessian near $w^*$, and that $\nabla^2 L(w^*)$ is positive definite. Then Newton's method converges quadratically: there exists a constant $C > 0$ such that

$$\|w_{t+1} - w^*\| \leq C \|w_t - w^*\|^2$$

for all iterates sufficiently close to the optimum $w^*$.

Intuition

Newton's method fits a quadratic approximation to the loss at each step and jumps to the minimum of that quadratic. If the loss is close to quadratic near the optimum, this jump is nearly exact, so the error squares at each step. Gradient descent on the same quadratic would only reduce the error by a constant factor per step (linear convergence).

Proof Sketch

Write $w_{t+1} - w^* = w_t - w^* - H_t^{-1} \nabla L(w_t)$. Taylor expand $\nabla L(w_t)$ around $w^*$: $\nabla L(w_t) = H^* (w_t - w^*) + O(\|w_t - w^*\|^2)$, where $H^* = \nabla^2 L(w^*)$. Then $w_{t+1} - w^* = (I - H_t^{-1} H^*)(w_t - w^*) + H_t^{-1} O(\|w_t - w^*\|^2)$. By Lipschitz continuity of the Hessian, $\|I - H_t^{-1} H^*\| = O(\|w_t - w^*\|)$, so the first term is $O(\|w_t - w^*\|^2)$, giving quadratic convergence.

Why It Matters

Quadratic convergence is dramatically faster than linear. If gradient descent needs 1000 iterations to reach error $10^{-6}$, Newton's method might need 20. But each Newton step costs $O(n^3)$ vs $O(n)$ for gradient descent, so the wall-clock comparison depends on $n$.

Failure Mode

(1) The Hessian must be positive definite. For non-convex losses (neural networks), the Hessian has negative eigenvalues, and Newton's method can ascend to saddle points. (2) The starting point must be close to the optimum (local convergence only). (3) For $n = 10^8$ parameters, storing the Hessian requires $10^{16}$ entries, roughly $10^7$ GB. This is completely infeasible.

Gauss-Newton Method

For least-squares problems $L(w) = \frac{1}{2} \|r(w)\|^2$, where $r(w)$ is a residual vector, the Hessian is:

$$H = J^T J + \sum_i r_i(w) \nabla^2 r_i(w)$$

where $J = \nabla r(w)$ is the Jacobian of the residual.

Proposition

Gauss-Newton Approximation

Statement

The Gauss-Newton method approximates the Hessian by $H \approx J^T J$, dropping the second-order residual terms. The update is:

$$w_{t+1} = w_t - (J_t^T J_t)^{-1} J_t^T r(w_t)$$

This is equivalent to solving the linearized least-squares problem at each step. The matrix $J^T J$ is always positive semidefinite.

Intuition

Near the solution, where residuals are small, the dropped terms $\sum_i r_i \nabla^2 r_i$ are small. The remaining $J^T J$ is guaranteed positive semidefinite, so Gauss-Newton always produces a descent direction (unlike full Newton, which can ascend at saddle points).

Proof Sketch

Write $H = J^T J + S$ where $S = \sum_i r_i \nabla^2 r_i$. If $\|r(w^*)\| \approx 0$, then $\|S\|$ is small near $w^*$, so $H \approx J^T J$. The matrix $J^T J$ is the Gram matrix of the Jacobian columns, which is positive semidefinite by construction.
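A minimal sketch of the Gauss-Newton iteration on a small curve-fitting problem. The model $y = a e^{bx}$, the noiseless data, and the starting point are all illustrative assumptions for the demo:

```python
import numpy as np

# Gauss-Newton for the nonlinear least-squares fit y ~ a * exp(b * x).
x = np.linspace(0.0, 1.0, 20)
a_true, b_true = 2.0, -1.5
y = a_true * np.exp(b_true * x)            # noiseless targets

def residual(w):
    a, b = w
    return a * np.exp(b * x) - y

def jacobian(w):
    a, b = w
    e = np.exp(b * x)
    return np.stack([e, a * x * e], axis=1)  # columns: dr/da, dr/db

w = np.array([1.5, -1.0])                  # rough initial guess
for _ in range(10):
    r, J = residual(w), jacobian(w)
    # Gauss-Newton step: solve (J^T J) dw = -J^T r
    dw = np.linalg.solve(J.T @ J, -J.T @ r)
    w = w + dw

print(w)   # should recover [a_true, b_true]
```

Because the residual at the solution is exactly zero here, the dropped term $S$ vanishes at $w^*$ and the iteration converges rapidly; with noisy data the convergence would only be linear.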

Why It Matters

Gauss-Newton is the basis for Levenberg-Marquardt (add damping: $J^T J + \lambda I$) and is closely related to the Fisher information matrix approach used in natural gradient methods.

Failure Mode

When residuals are large at the solution (model misspecification), the dropped term $S$ is not small, and Gauss-Newton may converge slowly or diverge. Damping ($\lambda I$) partially addresses this.

Natural Gradient

The ordinary gradient $\nabla L$ depends on the parameterization: reparameterizing the model changes the gradient direction, even though the underlying function has not changed. The natural gradient removes this dependence.

Definition

Natural Gradient

For a probabilistic model $p(y \mid x; w)$, the natural gradient is:

$$\tilde{\nabla} L = F^{-1} \nabla L$$

where $F$ is the Fisher information matrix:

$$F = \mathbb{E}_{p(y \mid x; w)}\left[\nabla \log p(y \mid x; w) \, \nabla \log p(y \mid x; w)^T\right]$$
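As a sanity check on this definition, the sketch below estimates the Fisher matrix by Monte Carlo for a single-input logistic model and compares it to the closed form $p(1-p)\,xx^T$. The weights, input, and sample count are illustrative choices:

```python
import numpy as np

# Fisher matrix check for logistic regression: p(y=1 | x; w) = sigmoid(w.x).
# For fixed x the score is (y - p) x, and E_y[(y - p)^2] = p(1 - p),
# so analytically F = p(1 - p) x x^T.

rng = np.random.default_rng(0)
w = np.array([0.5, -1.0])
x = np.array([1.0, 2.0])

p = 1.0 / (1.0 + np.exp(-w @ x))
F_analytic = p * (1 - p) * np.outer(x, x)

# Empirical Fisher: average outer product of scores with y ~ p(y | x; w).
# Note the expectation is under the model's own predictions, not the labels.
y = rng.random(200_000) < p
scores = (y.astype(float) - p)[:, None] * x[None, :]
F_mc = scores.T @ scores / len(y)

print(np.max(np.abs(F_mc - F_analytic)))   # small Monte-Carlo error
```

The key detail is that $y$ is sampled from the model itself; replacing the sampled labels with observed ones gives the "empirical Fisher," a different (and generally worse) curvature estimate.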

Theorem

Parameterization Invariance of Natural Gradient

Statement

The natural gradient descent trajectory in the space of distributions is independent of the parameterization. If $w$ and $\phi = g(w)$ are two parameterizations of the same model family, then natural gradient descent in $w$-space and in $\phi$-space trace the same path through distribution space (exactly in the continuous-time limit, and to first order in the step size for finite steps).

Intuition

The Fisher matrix $F$ acts as a Riemannian metric on the parameter space. Natural gradient descent is steepest descent with respect to KL divergence between distributions, not Euclidean distance between parameters. This is the "right" metric for probability distributions.

Proof Sketch

Under the reparameterization $\phi = g(w)$ with Jacobian $J_g = \partial \phi / \partial w$, the gradient transforms as $\nabla_\phi L = J_g^{-T} \nabla_w L$ and the Fisher matrix as $F_\phi = J_g^{-T} F_w J_g^{-1}$. Then $F_\phi^{-1} \nabla_\phi L = J_g F_w^{-1} J_g^T J_g^{-T} \nabla_w L = J_g F_w^{-1} \nabla_w L$: the natural gradient in $\phi$-space is exactly the natural gradient in $w$-space pushed forward by $J_g$, so both updates move the distribution the same way.
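The invariance can be checked numerically in one dimension. The sketch below runs one natural-gradient step on a Bernoulli model in two parameterizations, the mean $p$ and the logit $\theta$; the loss, step size, and starting point are illustrative. For a finite step size $\eta$ the two updates agree up to $O(\eta^2)$:

```python
import math

# One natural-gradient step on a Bernoulli model, loss L = -log p(y=1),
# in two parameterizations of the same distribution.

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

eta = 1e-3
p = 0.3
theta = math.log(p / (1 - p))              # same starting distribution

# Mean parameterization: grad = -1/p, Fisher = 1/(p(1-p))
p_new = p - eta * (p * (1 - p)) * (-1.0 / p)

# Logit parameterization: grad = sigmoid(theta) - 1, Fisher = s(1-s)
s = sigmoid(theta)
theta_new = theta - eta * (s - 1.0) / (s * (1.0 - s))

# Map the logit update back to distribution space and compare
print(abs(p_new - sigmoid(theta_new)))     # O(eta^2) discrepancy
```

An ordinary gradient step, by contrast, would disagree between the two parameterizations already at first order in $\eta$.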

Why It Matters

First-order methods are sensitive to parameterization: a simple rescaling of weights can change convergence speed. Natural gradient methods avoid this artifact. Adam partially approximates the natural gradient by dividing by a running average of squared gradients, which crudely estimates the diagonal entries of the Fisher.

Failure Mode

The Fisher matrix is $n \times n$ and suffers the same scalability problems as the Hessian. For non-probabilistic losses, the Fisher is not defined and you must use the Hessian directly.

K-FAC: Kronecker-Factored Approximate Curvature

Martens and Grosse (2015) observed that for a neural network, the Fisher matrix of each layer has approximate Kronecker structure.

Definition

K-FAC Approximation

For a fully connected layer with input $a$ and pre-activation gradient $g$, the Fisher block for that layer is $F_l \approx \mathbb{E}[aa^T] \otimes \mathbb{E}[gg^T]$. The inverse is then $A^{-1} \otimes G^{-1}$, where $A = \mathbb{E}[aa^T]$ and $G = \mathbb{E}[gg^T]$.

This reduces inversion from $O(n_l^3)$ for the full block (where $n_l$ is the number of parameters in layer $l$) to $O(d_{\text{in}}^3 + d_{\text{out}}^3)$, where $d_{\text{in}}$ and $d_{\text{out}}$ are the layer dimensions.

K-FAC maintains running averages of AA and GG and inverts them periodically. The per-step overhead is modest compared to SGD: roughly 10-20% additional computation.
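The identity that makes this cheap, $(A \otimes G)^{-1} = A^{-1} \otimes G^{-1}$, can be verified directly. The dimensions and random SPD factors below are illustrative stand-ins for the K-FAC statistics:

```python
import numpy as np

# Kronecker inverse identity behind K-FAC: inverting the full
# (d_in * d_out)-dimensional Fisher block only requires inverting
# the two small factors A and G.

rng = np.random.default_rng(0)
d_in, d_out = 4, 3

# Random SPD matrices standing in for A = E[a a^T] and G = E[g g^T]
Ma = rng.standard_normal((d_in, d_in))
A = Ma @ Ma.T + d_in * np.eye(d_in)
Mg = rng.standard_normal((d_out, d_out))
G = Mg @ Mg.T + d_out * np.eye(d_out)

big = np.kron(A, G)                        # full block, 12 x 12 here
cheap_inverse = np.kron(np.linalg.inv(A), np.linalg.inv(G))

# (A kron G) @ (A^-1 kron G^-1) should be the identity
print(np.max(np.abs(big @ cheap_inverse - np.eye(d_in * d_out))))
```

For a realistic layer (say $d_{\text{in}} = d_{\text{out}} = 4096$), the full block has $(d_{\text{in}} d_{\text{out}})^2 \approx 2.8 \times 10^{14}$ entries, while the two factors together have only about $3.4 \times 10^7$.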

Hessian-Free Methods

Computing the full Hessian costs $O(n^2)$. But computing a single Hessian-vector product $Hv$ costs only $O(n)$ using the "Pearlmutter trick" (automatic differentiation applied twice).

Given a vector $v$, compute:

$$Hv = \nabla_w \left( \nabla_w L \cdot v \right)$$

This requires one forward pass and two backward passes. With $Hv$ available, you can solve $H d = -\nabla L$ for the step direction $d$ approximately using conjugate gradients, without ever forming $H$ explicitly. This is the Hessian-free approach (Martens, 2010).
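A minimal sketch of this idea on a quadratic loss. The Hessian, dimensions, and tolerance are illustrative; on a real network the `hvp` function would be implemented with the Pearlmutter double-backprop trick rather than an explicit matrix:

```python
import numpy as np

# Hessian-free Newton step: solve H d = -grad by conjugate gradients,
# touching H only through matrix-vector products Hv. The loss here is
# the quadratic L(w) = 0.5 w^T H w - b^T w, so Hv is exact.

rng = np.random.default_rng(0)
n = 50
M = rng.standard_normal((n, n))
H = M @ M.T + n * np.eye(n)                # SPD Hessian
b = rng.standard_normal(n)

w = np.zeros(n)
grad = H @ w - b                           # gradient of the quadratic at w

def hvp(v):
    return H @ v                           # stand-in for a Pearlmutter product

# Conjugate gradients on H d = -grad, using only hvp()
d = np.zeros(n)
r = -grad - hvp(d)
p = r.copy()
for _ in range(n):
    Hp = hvp(p)
    alpha = (r @ r) / (p @ Hp)
    d += alpha * p
    r_new = r - alpha * Hp
    if np.linalg.norm(r_new) < 1e-10:
        break
    p = r_new + ((r_new @ r_new) / (r @ r)) * p
    r = r_new

# On a quadratic, one Newton step lands at the exact minimizer H^{-1} b
print(np.max(np.abs((w + d) - np.linalg.solve(H, b))))
```

In the full Hessian-free method the inner CG loop is truncated early and damped, trading step quality for per-iteration cost.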

Why Second-Order Methods Are Rarely Used at Scale

Despite theoretical advantages, first-order methods (gradient descent variants like SGD with momentum and Adam) dominate large-scale training. The reasons:

  1. Memory: even K-FAC requires storing and inverting per-layer covariance matrices.
  2. Stochastic noise: with mini-batches, the Hessian estimate is noisy. The noise in second-order information can outweigh its benefits.
  3. Non-convexity: convergence guarantees require positive definite Hessians, which neural network losses do not have globally.
  4. Engineering: Adam is simple to implement, tune, and parallelize. K-FAC requires careful distributed implementation.

Common Confusions

Watch Out

Adam is not a second-order method

Adam divides by $\sqrt{v_t}$, where $v_t$ is a running average of squared gradients. This approximates the diagonal of the Fisher matrix, but it is still a first-order method: it uses only first derivatives. True second-order methods use the full Hessian or Fisher, not just a diagonal approximation.

Watch Out

Quadratic convergence requires being near the optimum

Newton's quadratic convergence is a local result. Far from the optimum, Newton's method can diverge or oscillate. In practice, you need globalization strategies: line search, trust regions, or damping. These add overhead and reduce the per-step advantage.

Exercises

Exercise (Core)

Problem

A quadratic loss $L(w) = \frac{1}{2} w^T A w - b^T w$ has gradient $\nabla L = Aw - b$ and Hessian $H = A$. Show that Newton's method converges in exactly one step from any starting point (assuming $A$ is invertible).

Exercise (Advanced)

Problem

Explain why the Gauss-Newton matrix $J^T J$ is positive semidefinite but may not be positive definite. Under what condition is it positive definite? What happens to the Gauss-Newton update when $J^T J$ is singular?

Exercise (Research)

Problem

K-FAC approximates the Fisher block for layer $l$ as $A_l \otimes G_l$. The exact Fisher block is $\mathbb{E}[a a^T \otimes g g^T]$. Under what conditions does the Kronecker factorization $\mathbb{E}[aa^T] \otimes \mathbb{E}[gg^T]$ equal $\mathbb{E}[aa^T \otimes gg^T]$? When does this fail?

References

Canonical:

  • Nocedal and Wright, Numerical Optimization, Chapters 7 and 10
  • Amari, "Natural Gradient Works Efficiently in Learning", Neural Computation 1998

Current:

  • Martens, "Deep Learning via Hessian-Free Optimization", ICML 2010
  • Martens and Grosse, "Optimizing Neural Networks with Kronecker-Factored Approximate Curvature", ICML 2015

Last reviewed: April 2026
