Numerical Optimization
Second-Order Optimization Methods
Newton's method, Gauss-Newton, natural gradient, and K-FAC: how curvature information accelerates convergence, why the Hessian is too expensive to compute at scale, and Hessian-free alternatives that use Hessian-vector products.
Why This Matters
First-order methods (SGD, Adam) use only gradient information. They treat the loss landscape as if all directions have the same curvature. This is rarely true: loss landscapes of neural networks have wildly different curvatures in different directions, with condition numbers often exceeding $10^4$.
Second-order methods use curvature (Hessian) information to rescale the gradient, taking large steps in flat directions and small steps in steep directions. The theoretical payoff is quadratic convergence instead of linear. The practical cost is computing and inverting the Hessian, which is $O(n^2)$ storage and $O(n^3)$ computation for $n$ parameters.
The central tension of second-order optimization: better convergence per step vs. much higher cost per step.
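This tension shows up even on a tiny example. The sketch below (a hypothetical 2D quadratic with condition number 100, using NumPy) shows gradient descent crawling along the flat direction while a single curvature-rescaled Newton step lands on the minimum:

```python
import numpy as np

# Hypothetical 2D quadratic loss f(x) = 0.5 * x^T A x with condition number 100:
# curvature 1 along the first axis, 100 along the second.
A = np.diag([1.0, 100.0])

def grad(x):
    return A @ x

# Gradient descent: the step size is capped by the steepest direction (~1/100),
# so the flat direction shrinks only by a factor 0.99 per step.
x_gd = np.array([1.0, 1.0])
lr = 1.0 / 100.0
for _ in range(100):
    x_gd = x_gd - lr * grad(x_gd)

# Newton rescales the gradient by A^{-1} and reaches the minimum in one step.
x_newton = np.array([1.0, 1.0])
x_newton = x_newton - np.linalg.solve(A, grad(x_newton))

print(np.linalg.norm(x_gd))      # still ~0.37 from the optimum after 100 steps
print(np.linalg.norm(x_newton))  # 0.0
```

The steep direction is solved almost immediately by gradient descent; the flat direction is what costs hundreds of iterations, and it is exactly what curvature rescaling fixes.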
Newton's Method Review
The Newton update for minimizing $f(\theta)$ is:
$$\theta_{t+1} = \theta_t - H(\theta_t)^{-1} \nabla f(\theta_t)$$
where $H(\theta_t) = \nabla^2 f(\theta_t)$ is the Hessian at $\theta_t$.
Quadratic Convergence of Newton's Method
Statement
Assume $f$ is twice continuously differentiable with a Lipschitz-continuous Hessian, and that $H(\theta^*)$ is positive definite at the optimum $\theta^*$. Then Newton's method converges quadratically: there exists a constant $c > 0$ such that
$$\|\theta_{t+1} - \theta^*\| \le c\,\|\theta_t - \theta^*\|^2$$
for all iterates sufficiently close to $\theta^*$.
Intuition
Newton's method fits a quadratic approximation to the loss at each step and jumps to the minimum of that quadratic. If the loss is close to quadratic near the optimum, this jump is nearly exact, so the error squares at each step. Gradient descent on the same quadratic would only reduce the error by a constant factor per step (linear convergence).
Proof Sketch
Write $e_t = \theta_t - \theta^*$ and $H_t = H(\theta_t)$. Since $\nabla f(\theta^*) = 0$, Taylor expanding $\nabla f$ around $\theta_t$ and using the Lipschitz continuity of $H$ gives $\nabla f(\theta_t) = H_t e_t + O(\|e_t\|^2)$. Then $e_{t+1} = e_t - H_t^{-1}\nabla f(\theta_t) = H_t^{-1}\big(H_t e_t - \nabla f(\theta_t)\big) = H_t^{-1} \cdot O(\|e_t\|^2)$. Since $H(\theta^*)$ is positive definite, $\|H_t^{-1}\|$ is bounded near $\theta^*$, giving $\|e_{t+1}\| \le c\,\|e_t\|^2$: quadratic convergence.
Why It Matters
Quadratic convergence is dramatically faster than linear. If gradient descent needs 1000 iterations to reach a given error, Newton's method might need 20. But each Newton step costs $O(n^3)$ vs $O(n)$ for gradient descent, so the wall-clock comparison depends on $n$.
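The error-squaring behavior is easy to observe numerically. A minimal sketch on a hypothetical 1D problem $f(x) = e^x - 2x$, whose minimum is at $x^* = \ln 2$ (function and constants chosen purely for illustration):

```python
import math

# Minimize f(x) = exp(x) - 2x, which has its minimum at x* = ln 2.
# f'(x) = exp(x) - 2, f''(x) = exp(x); Newton step: x <- x - f'(x)/f''(x).
x, x_star = 0.0, math.log(2.0)
errors = []
for _ in range(5):
    x = x - (math.exp(x) - 2.0) / math.exp(x)
    errors.append(abs(x - x_star))

# The error roughly squares each iteration: e_{t+1} ~ c * e_t^2 (c ~ 0.5 here).
for e_prev, e_next in zip(errors, errors[1:]):
    print(f"{e_next:.2e}  vs  0.5 * prev^2 = {0.5 * e_prev ** 2:.2e}")
```

Five iterations take the error from $O(1)$ to near machine precision; gradient descent on the same problem would shrink the error by a constant factor per step.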
Failure Mode
(1) The Hessian must be positive definite. For non-convex losses (neural networks), the Hessian has negative eigenvalues, and Newton's method can ascend to saddle points. (2) The starting point must be close to the optimum (local convergence only). (3) For $n$ parameters, storing the Hessian requires $n^2$ entries; at $n = 10^8$ and 4 bytes per entry, that is roughly $4 \times 10^7$ GB. This is completely infeasible.
Gauss-Newton Method
For least-squares problems $f(\theta) = \frac{1}{2}\|r(\theta)\|^2$, where $r(\theta) \in \mathbb{R}^m$ is a residual vector, the Hessian is:
$$H = J^\top J + \sum_{i=1}^{m} r_i \nabla^2 r_i$$
where $J = \partial r / \partial \theta$ is the Jacobian of the residual.
Gauss-Newton Approximation
Statement
The Gauss-Newton method approximates the Hessian by $J^\top J$, dropping the second-order residual terms $\sum_i r_i \nabla^2 r_i$. The update is:
$$\theta_{t+1} = \theta_t - (J^\top J)^{-1} J^\top r$$
This is equivalent to solving the linearized least-squares problem at each step. The matrix $J^\top J$ is always positive semidefinite.
Intuition
Near the solution where residuals are small, the dropped terms $\sum_i r_i \nabla^2 r_i$ are small. The remaining $J^\top J$ is guaranteed positive semidefinite, so Gauss-Newton always produces a (non-ascent) direction, unlike full Newton, which can ascend at saddle points.
Proof Sketch
Write $H = J^\top J + S$ where $S = \sum_i r_i \nabla^2 r_i$. If $r(\theta^*) \approx 0$, then $S$ is small near $\theta^*$, so $H \approx J^\top J$. The matrix $J^\top J$ is the Gram matrix of the Jacobian columns, which is positive semidefinite by construction: $v^\top J^\top J v = \|Jv\|^2 \ge 0$.
Why It Matters
Gauss-Newton is the basis for Levenberg-Marquardt (add damping: solve with $J^\top J + \lambda I$) and is closely related to the Fisher information matrix used in natural gradient methods.
Failure Mode
When residuals are large at the solution (model misspecification), the dropped term $\sum_i r_i \nabla^2 r_i$ is not small, and Gauss-Newton may converge slowly or diverge. Damping ($\lambda I$, as in Levenberg-Marquardt) partially addresses this.
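A minimal damped Gauss-Newton sketch, fitting a hypothetical exponential model $y = \theta_0 e^{\theta_1 t}$ to noiseless synthetic data (model, starting point, and damping constant are all illustrative):

```python
import numpy as np

# Fit the hypothetical model y = theta0 * exp(theta1 * t) to noiseless synthetic
# data with damped Gauss-Newton (Levenberg-Marquardt-style damping lambda).
t = np.linspace(0.0, 2.0, 20)
theta_true = np.array([2.0, -1.5])
y = theta_true[0] * np.exp(theta_true[1] * t)

theta = np.array([1.5, -1.0])   # rough initial guess
lam = 1e-3                      # damping strength (illustrative)
for _ in range(30):
    m = theta[0] * np.exp(theta[1] * t)
    r = m - y                                                   # residuals
    J = np.column_stack([np.exp(theta[1] * t),                  # dr/dtheta0
                         theta[0] * t * np.exp(theta[1] * t)])  # dr/dtheta1
    # Damped Gauss-Newton step: solve (J^T J + lam*I) d = -J^T r
    d = np.linalg.solve(J.T @ J + lam * np.eye(2), -J.T @ r)
    theta = theta + d

print(theta)  # recovers theta_true = [2.0, -1.5]
```

Because the data are noiseless, the residuals vanish at the solution, so the dropped curvature term is exactly zero there and the iteration converges rapidly.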
Natural Gradient
The ordinary gradient depends on the parameterization. Reparameterizing the model changes the gradient direction, even though the function has not changed. The natural gradient removes this dependence.
Natural Gradient
For a probabilistic model $p(x \mid \theta)$, the natural gradient of a loss $L(\theta)$ is:
$$\tilde{\nabla} L = F^{-1} \nabla L$$
where $F$ is the Fisher information matrix:
$$F = \mathbb{E}_{x \sim p(x \mid \theta)}\left[\nabla_\theta \log p(x \mid \theta)\, \nabla_\theta \log p(x \mid \theta)^\top\right]$$
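As a sanity check on the definition, the Fisher matrix can be estimated by Monte Carlo as the expected squared score. A sketch for a Bernoulli model, where the analytic Fisher $F(\theta) = 1/(\theta(1-\theta))$ is known (sample size, seed, and $\theta$ are arbitrary choices):

```python
import numpy as np

# For a Bernoulli model p(x=1|theta) = theta, the score is
# d/dtheta log p(x|theta) = x/theta - (1-x)/(1-theta), and the analytic Fisher
# information is F(theta) = 1/(theta*(1-theta)). Estimate F by Monte Carlo.
rng = np.random.default_rng(0)
theta = 0.3
x = rng.random(200_000) < theta                          # samples from p(x|theta)
score = np.where(x, 1.0 / theta, -1.0 / (1.0 - theta))   # per-sample score
F_mc = float(np.mean(score ** 2))                        # E[score^2]
F_exact = 1.0 / (theta * (1.0 - theta))
print(F_mc, F_exact)                                     # close for large samples

# Natural gradient of log p(x=1|theta): the ordinary gradient 1/theta rescaled
# by 1/F gives theta*(1-theta) * (1/theta) = 1 - theta.
nat_grad = (1.0 / theta) / F_exact
print(nat_grad)
```

Note how the natural gradient $1-\theta$ stays bounded as $\theta \to 0$, whereas the ordinary gradient $1/\theta$ blows up: the Fisher rescaling accounts for how sensitive the distribution is to the parameter.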
Parameterization Invariance of Natural Gradient
Statement
The natural gradient descent trajectory in the space of distributions is independent of the parameterization. If $\theta$ and $\phi$ are two parameterizations of the same model family, then natural gradient descent in $\theta$-space and in $\phi$-space trace the same path through distribution space (exactly in the limit of small steps).
Intuition
The Fisher matrix acts as a Riemannian metric on the parameter space. Natural gradient descent is steepest descent with respect to KL divergence between distributions, not Euclidean distance between parameters. This is the "right" metric for probability distributions.
Proof Sketch
Under a reparameterization $\phi = g(\theta)$ with invertible Jacobian $J = \partial\phi/\partial\theta$, the Fisher matrix transforms as $F_\theta = J^\top F_\phi J$ and the gradient as $\nabla_\theta L = J^\top \nabla_\phi L$. Then
$$F_\theta^{-1} \nabla_\theta L = J^{-1} F_\phi^{-1} J^{-\top} J^\top \nabla_\phi L = J^{-1} F_\phi^{-1} \nabla_\phi L.$$
Mapping this step back through $J$ recovers $F_\phi^{-1} \nabla_\phi L$: the update in distribution space is the same.
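The invariance can be checked numerically on a hypothetical one-parameter example: a Bernoulli model parameterized either by its mean $\theta$ or by its logit $\eta$. One small natural gradient ascent step on $\log p(x{=}1)$ from the same starting distribution lands on (nearly) the same new distribution in both parameterizations:

```python
import math

# Bernoulli model in two parameterizations: mean theta, or logit eta with
# theta = sigmoid(eta). One natural gradient step on log p(x=1) should move
# both parameterizations to (nearly) the same new distribution.
def sigmoid(eta):
    return 1.0 / (1.0 + math.exp(-eta))

theta, lr = 0.3, 1e-4
eta = math.log(theta / (1.0 - theta))      # logit of the same distribution

# theta-space: gradient = 1/theta, Fisher F_theta = 1/(theta*(1-theta))
theta_new = theta + lr * (1.0 / theta) * (theta * (1.0 - theta))

# eta-space: gradient = 1-theta (chain rule), Fisher F_eta = theta*(1-theta)
eta_new = eta + lr * (1.0 - theta) / (theta * (1.0 - theta))

# The two updates agree up to O(lr^2); ordinary gradient steps would not.
print(abs(theta_new - sigmoid(eta_new)))
```

Repeating the experiment with plain (non-natural) gradient steps gives a discrepancy of order $lr$ rather than $lr^2$, which is the parameterization sensitivity the natural gradient removes.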
Why It Matters
First-order methods are sensitive to parameterization: a simple rescaling of weights can change convergence speed. Natural gradient methods avoid this artifact. Adam partially approximates this by dividing by running variance of gradients, which crudely estimates diagonal Fisher entries.
Failure Mode
The Fisher matrix is $n \times n$ and suffers the same scalability problems as the Hessian. For non-probabilistic losses, the Fisher is not defined and you must use the Hessian directly.
K-FAC: Kronecker-Factored Approximate Curvature
Martens and Grosse (2015) observed that for a neural network, the Fisher matrix of each layer has approximate Kronecker structure.
K-FAC Approximation
For a fully connected layer with input activations $a$ and pre-activation gradients $g$, the Fisher block for that layer is approximately $A \otimes G$, where $A = \mathbb{E}[a a^\top]$ and $G = \mathbb{E}[g g^\top]$. The inverse is then $A^{-1} \otimes G^{-1}$, by the identity $(A \otimes G)^{-1} = A^{-1} \otimes G^{-1}$.
This reduces inversion from $O(n_\ell^3)$ for the full block (where $n_\ell = d_{\text{in}} d_{\text{out}}$ is the number of parameters in layer $\ell$) to $O(d_{\text{in}}^3 + d_{\text{out}}^3)$, where $d_{\text{in}}$ and $d_{\text{out}}$ are the layer's input and output dimensions.
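The inverse identity K-FAC relies on, $(A \otimes G)^{-1} = A^{-1} \otimes G^{-1}$, is easy to verify numerically with small hypothetical SPD factors:

```python
import numpy as np

# Verify (A kron G)^{-1} = A^{-1} kron G^{-1} on small hypothetical SPD factors:
# A is d_in x d_in, G is d_out x d_out, so the full block is (d_in*d_out)^2.
rng = np.random.default_rng(0)
d_in, d_out = 4, 3
X = rng.standard_normal((d_in, 10)); A = X @ X.T / 10 + 0.1 * np.eye(d_in)
Y = rng.standard_normal((d_out, 10)); G = Y @ Y.T / 10 + 0.1 * np.eye(d_out)

full = np.linalg.inv(np.kron(A, G))                   # invert 12x12 directly
fact = np.kron(np.linalg.inv(A), np.linalg.inv(G))    # invert 4x4 and 3x3
print(np.allclose(full, fact))                        # True
```

At realistic layer sizes (say $d_{\text{in}} = d_{\text{out}} = 4096$), the full block would be $16{,}777{,}216 \times 16{,}777{,}216$; the two factors are only $4096 \times 4096$ each.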
K-FAC maintains running averages of and and inverts them periodically. The per-step overhead is modest compared to SGD: roughly 10-20% additional computation.
Hessian-Free Methods
Computing the full Hessian costs $O(n^2)$ time and memory. But computing a single Hessian-vector product costs only $O(n)$ using the "Pearlmutter trick" (automatic differentiation applied twice).
Given a vector $v$, compute:
$$Hv = \nabla_\theta \left( \nabla_\theta f(\theta)^\top v \right)$$
This requires one forward pass and two backward passes. With $Hv$ available, you can solve $H d = -\nabla f$ approximately using conjugate gradients, without ever forming $H$ explicitly. This is the Hessian-free approach (Martens, 2010).
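A sketch of the idea on a quadratic, where the Newton system is solved by conjugate gradients using only Hessian-vector products. Here the HVP is approximated by a central finite difference of the gradient, standing in for the autodiff-based Pearlmutter trick; all sizes and names are illustrative:

```python
import numpy as np

# Hessian-free sketch on f(theta) = 0.5 theta^T A theta - b^T theta. The Hessian
# A is never formed by the solver: conjugate gradients sees only a matvec.
rng = np.random.default_rng(0)
n = 50
M = rng.standard_normal((n, n)); A = M @ M.T + np.eye(n)   # SPD Hessian
b = rng.standard_normal(n)

def grad(theta):
    return A @ theta - b

def hvp(theta, v, eps=1e-6):
    # Finite-difference Hessian-vector product (autodiff would do this exactly).
    return (grad(theta + eps * v) - grad(theta - eps * v)) / (2 * eps)

def cg(matvec, rhs, iters=100, tol=1e-10):
    # Standard conjugate gradients: solves matvec(x) = rhs for SPD matvec.
    x = np.zeros_like(rhs); r = rhs.copy(); p = r.copy(); rs = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p; r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p; rs = rs_new
    return x

theta = rng.standard_normal(n)
d = cg(lambda v: hvp(theta, v), -grad(theta))   # approximate Newton direction
# On a quadratic, one Newton step reaches the minimum A^{-1} b.
print(np.linalg.norm(theta + d - np.linalg.solve(A, b)))
```

In practice, Hessian-free methods truncate CG after a few dozen iterations and add damping; each CG iteration costs one HVP, i.e., a constant number of forward/backward passes.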
Why Second-Order Methods Are Rarely Used at Scale
Despite theoretical advantages, first-order methods (gradient descent variants like SGD with momentum and Adam) dominate large-scale training. The reasons:
- Memory: even K-FAC requires storing and inverting per-layer covariance matrices.
- Stochastic noise: with mini-batches, the Hessian estimate is noisy. The noise in second-order information can outweigh its benefits.
- Non-convexity: convergence guarantees require positive definite Hessians, which neural network losses do not have globally.
- Engineering: Adam is simple to implement, tune, and parallelize. K-FAC requires careful distributed implementation.
Common Confusions
Adam is not a second-order method
Adam divides the gradient by $\sqrt{v_t}$, where $v_t$ is a running average of squared gradients. This approximates the diagonal of the Fisher matrix, but it is still a first-order method: it uses only first derivatives. True second-order methods use the full Hessian or Fisher, not just a diagonal approximation.
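The point is visible in the update rule itself. A sketch of the standard Adam step (constants are the usual defaults): only the gradient $g$ enters, and $v_t$ is an elementwise quantity, not a matrix:

```python
import numpy as np

# Standard Adam update: only the gradient g enters; v is the running average of
# g^2 elementwise (a diagonal quantity), never a Hessian or full Fisher.
def adam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)               # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.zeros(3); m = np.zeros(3); v = np.zeros(3)
g = np.array([1.0, -2.0, 0.5])
theta, m, v = adam_step(theta, g, m, v, t=1)
print(theta)  # first step is ~ -lr * sign(g): per-coordinate rescaling only
```

The rescaling is per-coordinate: Adam cannot rotate the gradient the way a full $H^{-1}$ or $F^{-1}$ can, which is why it is at best a diagonal preconditioner, not a second-order method.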
Quadratic convergence requires being near the optimum
Newton's quadratic convergence is a local result. Far from the optimum, Newton's method can diverge or oscillate. In practice, you need globalization strategies: line search, trust regions, or damping. These add overhead and reduce the per-step advantage.
Exercises
Problem
A quadratic loss $f(\theta) = \frac{1}{2}\theta^\top A \theta - b^\top \theta$ has gradient $\nabla f(\theta) = A\theta - b$ and Hessian $A$. Show that Newton's method converges in exactly one step from any starting point (assuming $A$ is invertible).
Problem
Explain why the Gauss-Newton matrix $J^\top J$ is positive semidefinite but may not be positive definite. Under what condition is it positive definite? What happens to the Gauss-Newton update when $J^\top J$ is singular?
Problem
K-FAC approximates the Fisher block for layer $\ell$ as $\mathbb{E}[a a^\top] \otimes \mathbb{E}[g g^\top]$. The exact Fisher block is $\mathbb{E}[a a^\top \otimes g g^\top]$. Under what conditions does the Kronecker factorization equal the exact expectation? When does this fail?
References
Canonical:
- Nocedal and Wright, Numerical Optimization, Chapters 7 and 10
- Amari, "Natural Gradient Works Efficiently in Learning", Neural Computation 1998
Current:
- Martens, "Deep Learning via Hessian-Free Optimization", ICML 2010
- Martens and Grosse, "Optimizing Neural Networks with Kronecker-Factored Approximate Curvature", ICML 2015
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Newton's Method (Layer 1)
- Convex Optimization Basics (Layer 1)
- Differentiation in Rⁿ (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- The Hessian Matrix (Layer 0A)