Calculus Objects
The Hessian Matrix
The matrix of second partial derivatives: encodes curvature, determines the nature of critical points, and is the central object in second-order optimization.
Why This Matters
The gradient tells you which way is downhill. The Hessian tells you how the landscape curves: is it a bowl (minimum), a dome (maximum), or a saddle? Every second-order optimization method (Newton's method, natural gradient, L-BFGS) depends on the Hessian or an approximation to it. In deep learning, the eigenvalues of the Hessian reveal the geometry of the loss landscape: sharp minima versus flat minima, the prevalence of saddle points, and why certain optimizers work better than others.
If the gradient is the first thing you learn in optimization, the Hessian is the second, literally.
Mental Model
For a function of one variable, $f''(x)$ tells you the curvature: positive means concave up (bowl), negative means concave down (dome). The Hessian is the multivariable generalization. But in multiple dimensions, the curvature can be different in different directions. The Hessian captures all of this information as a matrix. Its eigenvalues are the curvatures along the principal directions.
Core Definitions
Hessian Matrix
For a twice-differentiable function $f: \mathbb{R}^n \to \mathbb{R}$, the Hessian matrix $\nabla^2 f(x)$ at a point $x$ is the $n \times n$ matrix of second partial derivatives:

$$[\nabla^2 f(x)]_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}(x)$$

Explicitly:

$$\nabla^2 f(x) = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{pmatrix}$$
Symmetry of the Hessian (Schwarz's Theorem)
If the second partial derivatives of $f$ are continuous (i.e., $f \in C^2$), then the mixed partials are equal:

$$\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}$$

This means the Hessian is a symmetric matrix: $\nabla^2 f = (\nabla^2 f)^\top$. Consequently, the Hessian has real eigenvalues and orthogonal eigenvectors: all the machinery of symmetric matrix theory applies.
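A minimal numpy sketch of both facts: the Hessian can be approximated entry-by-entry with central finite differences, and for a smooth function the result is (numerically) symmetric. The function and evaluation point below are illustrative choices, not from the text.

```python
import numpy as np

def hessian_fd(f, x, eps=1e-5):
    """Approximate the Hessian of f at x with central finite differences."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.zeros(n); e_i[i] = eps
            e_j = np.zeros(n); e_j[j] = eps
            # Central-difference estimate of d^2 f / dx_i dx_j
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * eps**2)
    return H

# f(x, y) = x^2 y + y^3 is smooth, so Schwarz's theorem applies
f = lambda v: v[0]**2 * v[1] + v[1]**3
H = hessian_fd(f, np.array([1.0, 2.0]))
# Analytic Hessian at (1, 2): [[2y, 2x], [2x, 6y]] = [[4, 2], [2, 12]]
assert np.allclose(H, [[4, 2], [2, 12]], atol=1e-3)
assert np.allclose(H, H.T, atol=1e-3)  # symmetry
```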
Second-Order Taylor Expansion
The Hessian appears in the second-order Taylor expansion of $f$ around a point $x_0$:

$$f(x_0 + \Delta x) \approx f(x_0) + \nabla f(x_0)^\top \Delta x + \frac{1}{2} \Delta x^\top \nabla^2 f(x_0)\, \Delta x$$
The three terms have clear meanings:
- $f(x_0)$: the value at the current point
- $\nabla f(x_0)^\top \Delta x$: the linear (first-order) change; the gradient tells you the slope
- $\frac{1}{2} \Delta x^\top \nabla^2 f(x_0)\, \Delta x$: the quadratic (second-order) change; the Hessian tells you the curvature
This quadratic approximation is the basis for Newton's method and for classifying critical points.
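A short numerical check of the quadratic approximation: build the second-order Taylor model from the gradient and Hessian at a point and compare it to the true function nearby. The function $f(x, y) = e^x + xy$ and the expansion point are illustrative choices, not from the text.

```python
import numpy as np

# f(x, y) = exp(x) + x*y, with its exact gradient and Hessian
f = lambda v: np.exp(v[0]) + v[0] * v[1]
grad = lambda v: np.array([np.exp(v[0]) + v[1], v[0]])
hess = lambda v: np.array([[np.exp(v[0]), 1.0], [1.0, 0.0]])

x0 = np.array([0.0, 1.0])

def taylor2(v):
    """Second-order Taylor model of f around x0."""
    d = v - x0
    return f(x0) + grad(x0) @ d + 0.5 * d @ hess(x0) @ d

d = np.array([0.1, -0.05])
exact = f(x0 + d)
approx = taylor2(x0 + d)
# The quadratic model matches f to third order in ||d||
assert abs(exact - approx) < 1e-3
```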
The Second Derivative Test
Second Derivative Test (Multivariate)
Statement
Let $x^*$ be a critical point of $f$ (i.e., $\nabla f(x^*) = 0$). Then:
- If $\nabla^2 f(x^*)$ is positive definite (all eigenvalues $> 0$), then $x^*$ is a strict local minimum.
- If $\nabla^2 f(x^*)$ is negative definite (all eigenvalues $< 0$), then $x^*$ is a strict local maximum.
- If $\nabla^2 f(x^*)$ is indefinite (has both positive and negative eigenvalues), then $x^*$ is a saddle point.
- If $\nabla^2 f(x^*)$ is positive (or negative) semidefinite (some eigenvalue is zero), the test is inconclusive: higher-order terms are needed.
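The four cases above translate directly into an eigenvalue check. A minimal sketch (the tolerance parameter is an assumption for numerical robustness):

```python
import numpy as np

def classify_critical_point(H, tol=1e-8):
    """Second derivative test via the eigenvalues of a symmetric Hessian H."""
    eigvals = np.linalg.eigvalsh(H)
    if np.all(eigvals > tol):
        return "local minimum"
    if np.all(eigvals < -tol):
        return "local maximum"
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point"
    return "inconclusive (semidefinite)"

assert classify_critical_point(np.diag([2.0, 3.0])) == "local minimum"
assert classify_critical_point(np.diag([-1.0, -4.0])) == "local maximum"
assert classify_critical_point(np.diag([2.0, -1.0])) == "saddle point"
assert classify_critical_point(np.diag([2.0, 0.0])) == "inconclusive (semidefinite)"
```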
Intuition
At a critical point, $\nabla f(x^*) = 0$, so the Taylor expansion becomes $f(x^* + \Delta x) \approx f(x^*) + \frac{1}{2}\Delta x^\top \nabla^2 f(x^*)\, \Delta x$. If $\nabla^2 f(x^*)$ is positive definite, the quadratic form $\Delta x^\top \nabla^2 f(x^*)\, \Delta x > 0$ for all $\Delta x \neq 0$, so $f$ increases in every direction from $x^*$: it is a minimum. If $\nabla^2 f(x^*)$ is indefinite, $f$ increases in some directions and decreases in others: a saddle point.
Proof Sketch
At the critical point $x^*$, Taylor's theorem with remainder gives:

$$f(x^* + \Delta x) = f(x^*) + \frac{1}{2}\Delta x^\top \nabla^2 f(x^*)\, \Delta x + o(\|\Delta x\|^2)$$

If $\nabla^2 f(x^*)$ is positive definite with minimum eigenvalue $\lambda_{\min} > 0$, then $\Delta x^\top \nabla^2 f(x^*)\, \Delta x \geq \lambda_{\min} \|\Delta x\|^2$, so for sufficiently small $\|\Delta x\|$:

$$f(x^* + \Delta x) \geq f(x^*) + \frac{\lambda_{\min}}{2} \|\Delta x\|^2 + o(\|\Delta x\|^2) > f(x^*)$$

The $o(\|\Delta x\|^2)$ term is dominated by the quadratic term for small $\|\Delta x\|$. The indefinite case follows by choosing $\Delta x$ along eigenvectors with positive and negative eigenvalues.
Why It Matters
This test is how you verify that a critical point found by setting $\nabla f = 0$ is actually a minimum (or maximum or saddle). In optimization, you need to know that your converged solution is a local minimum, not a saddle point. In deep learning, this reveals the structure of the loss landscape: are we stuck at a saddle point, or have we found a genuine minimum?
Failure Mode
The test is inconclusive when $\nabla^2 f(x^*)$ is semidefinite (has a zero eigenvalue). Example: $f(x, y) = x^4 + y^4$ at the origin has $\nabla^2 f(0, 0) = 0$, which is positive semidefinite. The origin is a minimum, but the second derivative test cannot confirm this: you need to examine the fourth-order term. Also, the test is local: a positive definite Hessian at $x^*$ does not guarantee $x^*$ is a global minimum.
The Hessian in Optimization
Newton's Method
Newton's method for minimizing $f$ uses the Hessian directly. At each iteration:

$$x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$$

This is equivalent to minimizing the quadratic Taylor approximation at each step. When $f$ is quadratic, Newton's method converges in one step. For general smooth functions near a minimum, Newton's method converges quadratically (doubling the number of correct digits per iteration).

The cost: computing and inverting the Hessian is $O(n^2)$ storage and $O(n^3)$ computation. For deep networks with millions of parameters, this is prohibitive.
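A minimal sketch of the Newton iteration in numpy, using the Rosenbrock function (a standard non-quadratic test problem, not from the text) as the example. Note the standard practice of solving the linear system rather than forming the inverse.

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    """Newton's method: x_{k+1} = x_k - H(x_k)^{-1} grad(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # Solve H step = g instead of explicitly inverting the Hessian
        x = x - np.linalg.solve(hess(x), g)
    return x

# Rosenbrock function f(x, y) = (1 - x)^2 + 100 (y - x^2)^2, minimum at (1, 1)
grad = lambda v: np.array([-2*(1 - v[0]) - 400*v[0]*(v[1] - v[0]**2),
                           200*(v[1] - v[0]**2)])
hess = lambda v: np.array([[2 - 400*v[1] + 1200*v[0]**2, -400*v[0]],
                           [-400*v[0], 200.0]])
x_star = newton(grad, hess, [-1.2, 1.0])
assert np.allclose(x_star, [1.0, 1.0], atol=1e-6)
```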
Quasi-Newton Methods
Quasi-Newton methods (BFGS, L-BFGS) approximate the Hessian using only gradient information. L-BFGS stores a low-rank approximation using the last $m$ gradient differences, requiring only $O(mn)$ storage. These methods achieve superlinear convergence (faster than gradient descent, slower than Newton) at a fraction of the cost.
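In practice, L-BFGS is usually called through a library rather than implemented by hand. A minimal sketch using scipy (assumed available), again on the Rosenbrock function; `maxcor` is scipy's name for the number of stored gradient differences $m$:

```python
import numpy as np
from scipy.optimize import minimize

# Rosenbrock function and its gradient; no Hessian is supplied
def rosen(v):
    return (1 - v[0])**2 + 100 * (v[1] - v[0]**2)**2

def rosen_grad(v):
    return np.array([-2*(1 - v[0]) - 400*v[0]*(v[1] - v[0]**2),
                     200*(v[1] - v[0]**2)])

# L-BFGS-B builds a limited-memory Hessian approximation from the
# last `maxcor` gradient differences
res = minimize(rosen, x0=np.array([-1.2, 1.0]), jac=rosen_grad,
               method="L-BFGS-B", options={"maxcor": 10})
assert res.success
assert np.allclose(res.x, [1.0, 1.0], atol=1e-3)
```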
Hessian-Vector Products
You often do not need the full Hessian, just its action on a vector $v$: the product $\nabla^2 f(x)\, v$. This Hessian-vector product can be computed in $O(n)$ time (the same cost as a gradient computation) using the identity:

$$\nabla^2 f(x)\, v = \nabla_x \left( \nabla f(x)^\top v \right)$$

In automatic differentiation, this is computed exactly via a forward-over-reverse pass. Hessian-vector products enable Krylov methods (conjugate gradients on the Hessian) and are the basis for understanding curvature without storing the full matrix.
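Autodiff computes the product exactly; a simpler (approximate) route that conveys the same idea is a finite difference of two gradient evaluations, which also costs only two gradient calls. A minimal sketch, checked on a quadratic where the Hessian is known exactly:

```python
import numpy as np

def hvp_fd(grad, x, v, eps=1e-6):
    """Approximate the Hessian-vector product H(x) v via two gradient calls:
    H v ~ (grad(x + eps*v) - grad(x - eps*v)) / (2*eps)."""
    return (grad(x + eps * v) - grad(x - eps * v)) / (2 * eps)

# For f(x) = 0.5 x^T A x the gradient is A x and the Hessian is exactly A
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad = lambda x: A @ x
x = np.array([0.5, -1.0])
v = np.array([1.0, 1.0])
assert np.allclose(hvp_fd(grad, x, v), A @ v, atol=1e-4)
```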
Hessian Eigenvalues and the Loss Landscape
In deep learning, the Hessian of the loss with respect to the parameters reveals the geometry of the loss landscape:
- Eigenvalue spectrum: the distribution of Hessian eigenvalues tells you about curvature. A few large eigenvalues with many near zero suggests a low-dimensional structure in the loss landscape.
- Sharp vs. flat minima: a minimum with large Hessian eigenvalues is "sharp" (the loss increases quickly when you move away). A minimum with small eigenvalues is "flat." There is a conjecture (debated) that flat minima generalize better.
- Saddle points: in high dimensions, critical points are almost always saddle points (not local minima), because a random symmetric matrix is indefinite with high probability. The ratio of negative eigenvalues to total eigenvalues is called the index of the saddle point.
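The saddle-point claim can be illustrated numerically: a random symmetric matrix in high dimension has eigenvalues of both signs with overwhelming probability, and roughly half are negative. A small sketch (the Gaussian matrix model and dimension are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Symmetrize a random Gaussian matrix; its spectrum follows the semicircle
# law, so both signs of eigenvalue appear in high dimensions
M = rng.standard_normal((n, n))
H = (M + M.T) / 2
eigvals = np.linalg.eigvalsh(H)
# Fraction of negative eigenvalues: the "index" if H were a critical-point Hessian
neg_fraction = np.mean(eigvals < 0)
assert eigvals.min() < 0 < eigvals.max()  # indefinite
assert 0.3 < neg_fraction < 0.7           # roughly half negative
```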
Canonical Examples
Hessian of a quadratic form
Let $f(x) = \frac{1}{2} x^\top A x$ where $A$ is symmetric. The gradient is $\nabla f(x) = Ax$, and the Hessian is:

$$\nabla^2 f(x) = A$$

The Hessian is constant, independent of $x$. The function is convex if and only if $A$ is positive semidefinite. For quadratics, the curvature is the same everywhere, which is why Newton's method converges in one step.
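A quick numerical confirmation of the one-step claim: for a positive definite quadratic, a single Newton step from any starting point lands on the minimizer. The specific matrix and starting point are illustrative choices:

```python
import numpy as np

# f(x) = 0.5 x^T A x with A symmetric positive definite:
# gradient A x, constant Hessian A, minimizer at the origin
A = np.array([[4.0, 1.0], [1.0, 3.0]])
grad = lambda x: A @ x

x0 = np.array([10.0, -7.0])
x1 = x0 - np.linalg.solve(A, grad(x0))  # one Newton step
assert np.allclose(x1, [0.0, 0.0])      # converged in a single step
```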
Hessian of a simple two-variable function
Let $f(x, y) = x^4 + y^4$.

Gradient: $\nabla f = (4x^3,\ 4y^3)$.

Second partial derivatives:

$$\frac{\partial^2 f}{\partial x^2} = 12x^2, \qquad \frac{\partial^2 f}{\partial y^2} = 12y^2, \qquad \frac{\partial^2 f}{\partial x \partial y} = 0$$

Hessian:

$$\nabla^2 f(x, y) = \begin{pmatrix} 12x^2 & 0 \\ 0 & 12y^2 \end{pmatrix}$$

At the origin $(0, 0)$: $\nabla^2 f = 0$, the zero matrix. The second derivative test is inconclusive. (Indeed, the origin is a degenerate critical point.)

At $(1, 1)$: $\nabla^2 f = \begin{pmatrix} 12 & 0 \\ 0 & 12 \end{pmatrix}$, positive definite, so the curvature there is bowl-shaped. But this does not make $(1, 1)$ a local minimum: it is not a critical point, since $\nabla f(1, 1) = (4, 4) \neq 0$.
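A minimal numerical check of the example, assuming the function above is $f(x, y) = x^4 + y^4$ (so the Hessian is $\operatorname{diag}(12x^2, 12y^2)$):

```python
import numpy as np

# Hessian of f(x, y) = x^4 + y^4 (diagonal since the mixed partial is zero)
def hessian(x, y):
    return np.diag([12 * x**2, 12 * y**2])

# At the origin: the zero matrix, so the second derivative test is inconclusive
assert np.allclose(hessian(0.0, 0.0), np.zeros((2, 2)))

# At (1, 1): positive definite (all eigenvalues > 0), but (1, 1) is not a
# critical point since the gradient (4x^3, 4y^3) = (4, 4) != 0 there
eigvals = np.linalg.eigvalsh(hessian(1.0, 1.0))
assert np.all(eigvals > 0)
```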
Common Confusions
The Hessian is NOT the same as the outer product of gradients
In deep learning, the Fisher information matrix is sometimes confused with the Hessian. They are different objects. The Hessian involves second derivatives of a single function; the Fisher involves first derivatives averaged over data. They coincide only under specific conditions (e.g., for the negative log-likelihood of an exponential family at the true parameters).
Positive definite Hessian at a point does not mean global convexity
$\nabla^2 f(x_0) \succ 0$ means $f$ is locally convex near $x_0$. The function could be non-convex elsewhere. Global convexity requires $\nabla^2 f(x) \succeq 0$ for all $x$, which is a much stronger condition.
The Hessian exists but may be useless to compute explicitly
For a function of $n$ variables, the Hessian is an $n \times n$ matrix. For a neural network with millions of parameters, the Hessian has trillions of entries. It cannot be stored, let alone inverted. This is why Hessian-vector products and low-rank approximations (L-BFGS, Kronecker-factored approximations) are essential in practice.
Summary
- The Hessian is the matrix of second partial derivatives: $[\nabla^2 f]_{ij} = \partial^2 f / \partial x_i \partial x_j$
- Schwarz's theorem: if $f \in C^2$, the Hessian is symmetric
- Second-order Taylor: $f(x_0 + \Delta x) \approx f(x_0) + \nabla f(x_0)^\top \Delta x + \frac{1}{2} \Delta x^\top \nabla^2 f(x_0)\, \Delta x$
- Second derivative test: positive definite $\Rightarrow$ local min, negative definite $\Rightarrow$ local max, indefinite $\Rightarrow$ saddle
- Newton's method: $x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$; uses the Hessian directly, converges quadratically
- Hessian-vector products cost $O(n)$ via autodiff, the same as a gradient
- Hessian eigenvalues reveal loss landscape geometry: sharp vs. flat minima, saddle point structure
Exercises
Problem
Compute the Hessian of at the point . Determine whether the Hessian at this point is positive definite, negative definite, or indefinite.
Problem
Let $f(x) = \frac{1}{2} x^\top A x - b^\top x$ for symmetric $A \in \mathbb{R}^{n \times n}$ and $b \in \mathbb{R}^n$. Compute $\nabla f$ and $\nabla^2 f$. Under what condition on $A$ is the Hessian positive definite (guaranteeing a unique global minimum)?
References
Canonical:
- Nocedal & Wright, Numerical Optimization (2006), Chapters 2-3
- Boyd & Vandenberghe, Convex Optimization (2004), Appendix A
- Magnus & Neudecker, Matrix Differential Calculus with Applications in Statistics and Econometrics (2019), Chapters 5-6
Current:
- Ghorbani, Krishnan, Xiao, "An Investigation into Neural Net Optimization via Hessian Eigenvalue Density" (2019)
Next Topics
The natural next steps from the Hessian:
- Newton's method: using the Hessian for second-order optimization, convergence theory, and practical modifications
- Convex optimization basics: where the Hessian being positive semidefinite everywhere guarantees a global minimum
Last reviewed: April 2026
Builds on This
- Matrix Calculus (Layer 1)
- Neural Network Optimization Landscape (Layer 4)
- Optimal Brain Surgery and Pruning Theory (Layer 3)
- Preconditioned Optimizers: Shampoo, K-FAC, and Natural Gradient (Layer 3)
- Riemannian Optimization and Manifold Constraints (Layer 3)
- Second-Order Optimization Methods (Layer 3)
- Training Dynamics and Loss Landscapes (Layer 4)