
Calculus Objects

The Hessian Matrix

The matrix of second partial derivatives: encodes curvature, determines the nature of critical points, and is the central object in second-order optimization.

Core · Tier 1 · Stable · ~40 min

Why This Matters

The gradient tells you which way is downhill. The Hessian tells you how the landscape curves: is it a bowl (minimum), a dome (maximum), or a saddle? Every second-order optimization method (Newton's method, natural gradient, L-BFGS) depends on the Hessian or an approximation to it. In deep learning, the eigenvalues of the Hessian reveal the geometry of the loss landscape: sharp minima versus flat minima, the prevalence of saddle points, and why certain optimizers work better than others.

If the gradient is the first thing you learn in optimization, the Hessian is the second, literally.

[Figure: quadratic bowl for a minimum (positive definite Hessian) with eigenvalues λ₁ = 2.0 and λ₂ = 0.5 along the eigenvector directions v₁, v₂; near the minimum, f(x) ≈ f(0) + ½(λ₁x₁² + λ₂x₂²). The eigenvalues of the Hessian determine the curvature along each eigenvector direction; the condition number is κ = λ₁/λ₂ = 4.0.]

Mental Model

For a function of one variable, the second derivative f''(x) tells you the curvature: positive means concave up (bowl), negative means concave down (dome). The Hessian is the multivariable generalization. But in multiple dimensions, the curvature can be different in different directions. The Hessian captures all of this information as a matrix. Its eigenvalues are the curvatures along the principal directions.

Core Definitions

Definition

Hessian Matrix

For a twice-differentiable function f: \mathbb{R}^n \to \mathbb{R}, the Hessian matrix at a point x is the n \times n matrix of second partial derivatives:

[H_f(x)]_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}

Explicitly:

H_f(x) = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{pmatrix}
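As a sanity check, the matrix of second partials can be approximated numerically. A minimal sketch using central finite differences (the test function and the step size eps are illustrative choices, not part of the definition):

```python
import numpy as np

def hessian_fd(f, x, eps=1e-4):
    """Approximate the Hessian of f at x by central finite differences."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            # Central difference for the mixed partial d^2 f / dx_i dx_j
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps**2)
    return H

# f(x, y) = x^2 y + y^3, the worked example from later in this article
f = lambda v: v[0]**2 * v[1] + v[1]**3
print(hessian_fd(f, np.array([0.0, 1.0])))  # ≈ [[2, 0], [0, 6]]
```

The returned matrix should also be (numerically) symmetric, consistent with Schwarz's theorem below.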

Definition

Symmetry of the Hessian (Schwarz's Theorem)

If the second partial derivatives of f are continuous (i.e., f \in C^2), then the mixed partials are equal:

\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}

This means the Hessian is a symmetric matrix: H_f = H_f^T. Consequently, the Hessian has real eigenvalues and orthogonal eigenvectors: all the machinery of symmetric matrix theory applies.

Second-Order Taylor Expansion

The Hessian appears in the second-order Taylor expansion of f around a point x:

f(x + h) \approx f(x) + \nabla f(x)^T h + \frac{1}{2} h^T H_f(x) \, h

The three terms have clear meanings:

  • f(x): the value at the current point
  • \nabla f(x)^T h: the linear (first-order) change; the gradient gives the slope
  • \frac{1}{2} h^T H_f(x) \, h: the quadratic (second-order) change; the Hessian gives the curvature

This quadratic approximation is the basis for Newton's method and for classifying critical points.
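The quality of the quadratic approximation can be checked numerically: its error should shrink like o(\|h\|^2), i.e., cubically for a smooth function. A small sketch using the gradient and Hessian of f(x, y) = x^2 y + y^3 (the worked example later in this article) at the illustrative base point (1, 1):

```python
import numpy as np

# f(x, y) = x^2 y + y^3 expanded around x0 = (1, 1)
f = lambda v: v[0]**2 * v[1] + v[1]**3
x0 = np.array([1.0, 1.0])
g = np.array([2.0, 4.0])                 # gradient (2xy, x^2 + 3y^2) at (1, 1)
H = np.array([[2.0, 2.0], [2.0, 6.0]])   # Hessian [[2y, 2x], [2x, 6y]] at (1, 1)

h = np.array([0.3, -0.2])
errors = []
for t in [1.0, 0.1, 0.01]:
    s = t * h
    taylor = f(x0) + g @ s + 0.5 * s @ H @ s
    errors.append(abs(f(x0 + s) - taylor))
print(errors)  # each error ~1000x smaller than the last: cubic decay
```

Shrinking the step by 10 shrinks the quadratic-approximation error by roughly 1000, exactly the o(\|h\|^2) behavior the expansion promises.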

The Second Derivative Test

Theorem

Second Derivative Test (Multivariate)

Statement

Let x^* be a critical point of f (i.e., \nabla f(x^*) = 0). Then:

  • If H_f(x^*) is positive definite (all eigenvalues > 0), then x^* is a strict local minimum.
  • If H_f(x^*) is negative definite (all eigenvalues < 0), then x^* is a strict local maximum.
  • If H_f(x^*) is indefinite (has both positive and negative eigenvalues), then x^* is a saddle point.
  • If H_f(x^*) is positive (or negative) semidefinite (some eigenvalue is zero), the test is inconclusive: higher-order terms are needed.

Intuition

At a critical point, \nabla f = 0, so the Taylor expansion becomes f(x^* + h) \approx f(x^*) + \frac{1}{2} h^T H h. If H is positive definite, the quadratic form h^T H h > 0 for all h \neq 0, so f increases in every direction from x^*: it is a minimum. If H is indefinite, f increases in some directions and decreases in others: a saddle point.

Proof Sketch

At the critical point x^*, Taylor's theorem with remainder gives:

f(x^* + h) = f(x^*) + \frac{1}{2} h^T H_f(x^*) h + o(\|h\|^2)

If H_f(x^*) is positive definite with minimum eigenvalue \lambda_{\min} > 0, then h^T H h \geq \lambda_{\min} \|h\|^2, so for sufficiently small h:

f(x^* + h) \geq f(x^*) + \frac{\lambda_{\min}}{2}\|h\|^2 + o(\|h\|^2) > f(x^*)

The o(\|h\|^2) term is dominated by the quadratic term for small \|h\|. The indefinite case follows by choosing h along eigenvectors with positive and negative eigenvalues.

Why It Matters

This test is how you verify that a critical point found by setting \nabla f = 0 is actually a minimum (or maximum or saddle). In optimization, you need to know that your converged solution is a local minimum, not a saddle point. In deep learning, this reveals the structure of the loss landscape: are we stuck at a saddle point, or have we found a genuine minimum?

Failure Mode

The test is inconclusive when H is semidefinite (has a zero eigenvalue). Example: f(x, y) = x^4 + y^2 at the origin has H = \text{diag}(0, 2), which is positive semidefinite. The origin is a minimum, but the second derivative test cannot confirm this: you need to examine the fourth-order term. Also, the test is local: a positive definite Hessian at x^* does not guarantee x^* is a global minimum.
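The classification above reduces to inspecting the signs of the Hessian eigenvalues. A minimal sketch (the tolerance tol, used to decide when an eigenvalue counts as zero, is an illustrative choice):

```python
import numpy as np

def classify(H, tol=1e-8):
    """Classify a critical point from the eigenvalues of a symmetric Hessian."""
    lam = np.linalg.eigvalsh(H)          # eigenvalues in ascending order
    if lam[0] > tol:
        return "local minimum"           # positive definite
    if lam[-1] < -tol:
        return "local maximum"           # negative definite
    if lam[0] < -tol and lam[-1] > tol:
        return "saddle point"            # indefinite
    return "inconclusive (semidefinite)" # some eigenvalue is ~0

print(classify(np.diag([2.0, 0.5])))               # local minimum
print(classify(np.array([[1.0, 0.0],
                         [0.0, -3.0]])))           # saddle point
print(classify(np.diag([0.0, 2.0])))               # x^4 + y^2 at the origin
```

The last call reproduces the failure mode from the text: the Hessian of x^4 + y^2 at the origin is positive semidefinite, so eigenvalues alone cannot settle the question.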

The Hessian in Optimization

Newton's Method

Newton's method for minimizing ff uses the Hessian directly. At each iteration:

x_{t+1} = x_t - [H_f(x_t)]^{-1} \nabla f(x_t)

This is equivalent to minimizing the quadratic Taylor approximation at each step. When f is quadratic, Newton's method converges in one step. For general smooth functions near a minimum, Newton's method converges quadratically (doubling the number of correct digits per iteration).

The cost: computing and inverting the n \times n Hessian requires O(n^2) storage and O(n^3) computation. For deep networks with millions of parameters, this is prohibitive.
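A one-step sketch on a quadratic, where a single Newton update lands exactly on the minimizer (the matrix A and starting point are arbitrary illustrative choices); note that solving the linear system is preferred over forming the inverse explicitly:

```python
import numpy as np

# Newton's method on f(x) = 0.5 x^T A x + b^T x: gradient A x + b, Hessian A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])              # symmetric positive definite Hessian
b = np.array([1.0, -1.0])
grad = lambda x: A @ x + b

x = np.array([10.0, -7.0])              # arbitrary starting point
x = x - np.linalg.solve(A, grad(x))     # one Newton step (solve, don't invert)
print(np.linalg.norm(grad(x)))          # ~0: the exact minimizer in one step
```

For a non-quadratic function the same update would only minimize the local quadratic model, and several iterations would be needed.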

Quasi-Newton Methods

Quasi-Newton methods (BFGS, L-BFGS) approximate the Hessian using only gradient information. L-BFGS stores a low-rank approximation built from the last m gradient differences, requiring only O(mn) storage. These methods achieve superlinear convergence (faster than gradient descent, slower than Newton) at a fraction of the cost.
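As an illustration, SciPy's L-BFGS implementation can be run on the Rosenbrock test function using only function values and gradients; no Hessian is ever formed (this sketch assumes SciPy is available):

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# L-BFGS-B builds a low-rank curvature approximation from recent
# gradient differences, giving superlinear convergence without a Hessian.
x0 = np.array([-1.2, 1.0])                     # classic Rosenbrock start
res = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B")
print(res.x)   # ≈ [1, 1], the global minimizer of the Rosenbrock function
```

Passing an analytic gradient via jac is what lets the method accumulate the curvature information it needs; with numerical gradients it still works, only more slowly.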

Hessian-Vector Products

You often do not need the full Hessian, just its action on a vector v: H_f(x) \cdot v. This Hessian-vector product can be computed in O(n) time (the same cost as a gradient computation) using the identity:

H_f(x) \cdot v = \lim_{\epsilon \to 0} \frac{\nabla f(x + \epsilon v) - \nabla f(x)}{\epsilon}

In automatic differentiation, this is computed exactly via a forward-over-reverse pass. Hessian-vector products enable Krylov methods (conjugate gradients on the Hessian) and are the basis for understanding curvature without storing the full matrix.
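The limit above suggests a finite-difference approximation that needs only two gradient evaluations. A sketch on a quadratic, where the gradient is linear so the approximation is essentially exact (the matrix, vectors, and eps are illustrative choices):

```python
import numpy as np

# For f(x) = 0.5 x^T A x, the gradient is A x and the Hessian is A,
# so H v = A v gives us a ground truth to check against.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
grad = lambda x: A @ x

def hvp(grad, x, v, eps=1e-6):
    """Approximate H_f(x) @ v from two gradient evaluations."""
    return (grad(x + eps * v) - grad(x)) / eps

x = np.array([1.0, 2.0])
v = np.array([0.5, -1.0])
print(hvp(grad, x, v))   # ≈ A @ v = [1.0, -2.5]
```

The cost is two gradient calls regardless of n, which is what makes Krylov-style curvature methods feasible at scale; autodiff frameworks compute the same product exactly rather than by differencing.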

Hessian Eigenvalues and the Loss Landscape

In deep learning, the Hessian of the loss with respect to the parameters reveals the geometry of the loss landscape:

  • Eigenvalue spectrum: the distribution of Hessian eigenvalues tells you about curvature. A few large eigenvalues with many near zero suggests a low-dimensional structure in the loss landscape.
  • Sharp vs. flat minima: a minimum with large Hessian eigenvalues is "sharp" (the loss increases quickly when you move away). A minimum with small eigenvalues is "flat." There is a conjecture (debated) that flat minima generalize better.
  • Saddle points: in high dimensions, critical points are almost always saddle points (not local minima), because a random symmetric matrix is indefinite with high probability. The ratio of negative eigenvalues to total eigenvalues is called the index of the saddle point.
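The claim that a random symmetric matrix is indefinite with high probability is easy to check empirically. A small sketch (the dimension, trial count, and seed are arbitrary illustrative choices):

```python
import numpy as np

# Draw random symmetric matrices and count how many are indefinite,
# i.e., have at least one negative and one positive eigenvalue.
rng = np.random.default_rng(0)
n, trials = 10, 100
indefinite = 0
for _ in range(trials):
    M = rng.standard_normal((n, n))
    H = (M + M.T) / 2                  # symmetrize the draw
    lam = np.linalg.eigvalsh(H)        # ascending eigenvalues
    if lam[0] < 0 < lam[-1]:
        indefinite += 1
print(indefinite / trials)             # fraction indefinite; effectively 1.0
```

Already at n = 10, essentially every draw is indefinite; the probability of all eigenvalues sharing a sign decays extremely fast with dimension, which is the intuition behind the prevalence of saddle points.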

Canonical Examples

Example

Hessian of a quadratic form

Let f(x) = \frac{1}{2} x^T A x + b^T x + c where A is symmetric. The gradient is \nabla f(x) = Ax + b, and the Hessian is:

H_f(x) = A

The Hessian is constant, independent of x. The function is convex if and only if A is positive semidefinite. For quadratics, the curvature is the same everywhere, which is why Newton's method converges in one step.

Example

Hessian of a simple two-variable function

Let f(x, y) = x^2 y + y^3.

Gradient: \nabla f = (2xy, \; x^2 + 3y^2)^T.

Second partial derivatives:

  • \frac{\partial^2 f}{\partial x^2} = 2y
  • \frac{\partial^2 f}{\partial x \partial y} = 2x
  • \frac{\partial^2 f}{\partial y^2} = 6y

Hessian:

H_f(x, y) = \begin{pmatrix} 2y & 2x \\ 2x & 6y \end{pmatrix}

At the origin (0,0): H_f = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}, the zero matrix. The second derivative test is inconclusive. (Indeed, the origin is a degenerate critical point.)

At (0, 1): H_f = \begin{pmatrix} 2 & 0 \\ 0 & 6 \end{pmatrix} is positive definite, so f is locally convex there. But (0,1) is not a critical point, since \nabla f(0,1) = (0, 3)^T \neq 0, so the second derivative test does not apply and the point is not a local minimum.

Common Confusions

Watch Out

The Hessian is NOT the same as the outer product of gradients

In deep learning, the Fisher information matrix \mathbb{E}[\nabla \log p \cdot (\nabla \log p)^T] is sometimes confused with the Hessian. They are different objects. The Hessian involves second derivatives of a single function; the Fisher involves first derivatives averaged over data. They coincide only under specific conditions (e.g., for the negative log-likelihood of an exponential family at the true parameters).

Watch Out

Positive definite Hessian at a point does not mean global convexity

H_f(x^*) \succ 0 means f is locally convex near x^*. The function could be non-convex elsewhere. Global convexity requires H_f(x) \succeq 0 for all x, which is a much stronger condition.
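A one-dimensional illustration of this pitfall, using the illustrative function f(x) = x^4 - x^2: its second derivative is positive at the minimizers x = ±1/√2 but negative near the origin, so local convexity at a point does not extend globally:

```python
import numpy as np

# f(x) = x^4 - x^2 has second derivative f''(x) = 12 x^2 - 2.
f2 = lambda x: 12 * x**2 - 2

print(f2(1 / np.sqrt(2)))         # 4.0 > 0: locally convex at the minimizer
print(f2(0.0))                    # -2.0 < 0: concave near the origin
xs = np.linspace(-1.5, 1.5, 301)
print(bool(np.all(f2(xs) >= 0)))  # False: the "Hessian" is not PSD everywhere
```

The multivariate situation is the same with eigenvalue signs in place of the sign of f''.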

Watch Out

The Hessian exists but may be useless to compute explicitly

For a function of n variables, the Hessian is an n \times n matrix. For a neural network with n = 10^8 parameters, the Hessian has 10^{16} entries. It cannot be stored, let alone inverted. This is why Hessian-vector products and low-rank approximations (L-BFGS, Kronecker-factored approximations) are essential in practice.

Summary

  • The Hessian H_f(x) is the n \times n matrix of second partial derivatives: [H]_{ij} = \partial^2 f / \partial x_i \partial x_j
  • Schwarz's theorem: if f \in C^2, the Hessian is symmetric
  • Second-order Taylor: f(x+h) \approx f(x) + \nabla f^T h + \frac{1}{2} h^T H h
  • Second derivative test: positive definite \Rightarrow local min, negative definite \Rightarrow local max, indefinite \Rightarrow saddle
  • Newton's method x_{t+1} = x_t - H^{-1} \nabla f uses the Hessian directly and converges quadratically
  • Hessian-vector products cost O(n) via autodiff, the same as a gradient
  • Hessian eigenvalues reveal loss landscape geometry: sharp vs. flat minima, saddle point structure

Exercises

ExerciseCore

Problem

Compute the Hessian of f(x, y) = x^2 y + y^3 at the point (1, 2). Determine whether the Hessian at this point is positive definite, negative definite, or indefinite.

ExerciseAdvanced

Problem

Let f(x) = \|Ax - b\|^2 for A \in \mathbb{R}^{m \times n} and b \in \mathbb{R}^m. Compute \nabla f(x) and H_f(x). Under what condition on A is the Hessian positive definite (guaranteeing a unique global minimum)?

References

Canonical:

  • Nocedal & Wright, Numerical Optimization (2006), Chapters 2-3
  • Boyd & Vandenberghe, Convex Optimization (2004), Appendix A
  • Magnus & Neudecker, Matrix Differential Calculus with Applications in Statistics and Econometrics (2019), Chapters 5-6

Current:

  • Ghorbani, Krishnan, Xiao, "An Investigation into Neural Net Optimization via Hessian Eigenvalue Density" (2019)

Next Topics

The natural next steps from the Hessian:

  • Newton's method: using the Hessian for second-order optimization, convergence theory, and practical modifications
  • Convex optimization basics: where the Hessian being positive semidefinite everywhere guarantees a global minimum

Last reviewed: April 2026
