
Numerical Optimization

Quasi-Newton Methods

Approximate the Hessian instead of computing it: BFGS builds a dense approximation, L-BFGS stores only a few vectors. Superlinear convergence without second derivatives.


Why This Matters

Newton's method converges quadratically but requires computing and inverting the $d \times d$ Hessian at each step: $O(d^2)$ storage and $O(d^3)$ computation. For problems with $d > 10^4$, this is prohibitive.

Quasi-Newton methods keep the speed advantage of curvature information while avoiding the Hessian entirely. They build an approximation to the Hessian (or its inverse) using only gradient information, updated cheaply at each step. The result: superlinear convergence at the cost of first-order methods.

L-BFGS in particular is the default optimizer for medium-scale smooth optimization problems across scientific computing, and it remains competitive for many ML tasks where full-batch gradients are available.

Mental Model

At each step, you have a matrix $B_t \approx \nabla^2 f(x^{(t)})$ (Hessian approximation) or $H_t \approx [\nabla^2 f(x^{(t)})]^{-1}$ (inverse Hessian approximation). You use $H_t$ to compute a Newton-like step:

$$x^{(t+1)} = x^{(t)} - \alpha_t H_t \nabla f(x^{(t)})$$

After taking the step, you observe how the gradient changed. You use that observation to update $H_t \to H_{t+1}$, making it a better Hessian approximation. The crucial constraint: $H_{t+1}$ must satisfy the secant condition, which ensures it is consistent with the curvature you just observed.

The Secant Condition

Definition

Secant Condition

Define the step and gradient difference:

$$s_t = x^{(t+1)} - x^{(t)}, \qquad y_t = \nabla f(x^{(t+1)}) - \nabla f(x^{(t)})$$

The secant condition (or quasi-Newton equation) requires the Hessian approximation to satisfy:

$$B_{t+1} s_t = y_t$$

or equivalently, for the inverse approximation:

$$H_{t+1} y_t = s_t$$

The secant condition says: the approximate Hessian, when applied to the step direction, must reproduce the observed gradient change. This is a first-order consistency requirement. It ensures the approximation is exact along the direction you just moved.

Proposition

Derivation of the Secant Condition from Taylor Expansion

Statement

If $\nabla^2 f$ is approximately constant along the segment from $x^{(t)}$ to $x^{(t+1)}$, then $\nabla^2 f \cdot s_t \approx y_t$. The secant condition $B_{t+1} s_t = y_t$ enforces this relationship exactly for the approximation $B_{t+1}$.

Intuition

By the fundamental theorem of calculus applied to $\nabla f$ along the segment from $x^{(t)}$ to $x^{(t+1)}$:

$$y_t = \nabla f(x^{(t+1)}) - \nabla f(x^{(t)}) = \left(\int_0^1 \nabla^2 f(x^{(t)} + \tau s_t) \, d\tau\right) s_t$$

If the Hessian is roughly constant, the integral is approximately $\nabla^2 f(x^{(t)})$, giving $y_t \approx \nabla^2 f(x^{(t)}) s_t$. The secant condition forces the approximation to match this.

Proof Sketch

Taylor expand $\nabla f(x^{(t+1)})$ around $x^{(t)}$:

$$\nabla f(x^{(t+1)}) \approx \nabla f(x^{(t)}) + \nabla^2 f(x^{(t)}) s_t$$

Rearranging: $\nabla^2 f(x^{(t)}) s_t \approx y_t$. Replacing $\nabla^2 f$ with the approximation $B_{t+1}$ gives $B_{t+1} s_t = y_t$.
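The relationship is easy to check numerically. For a quadratic the Hessian is exactly constant, so $y_t = \nabla^2 f \, s_t$ holds with equality; a minimal sketch (the matrix and points below are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
A = M @ M.T + 3 * np.eye(3)    # symmetric positive definite Hessian
b = rng.standard_normal(3)

grad = lambda x: A @ x - b     # gradient of f(x) = 1/2 x^T A x - b^T x

x0 = rng.standard_normal(3)
x1 = rng.standard_normal(3)    # an arbitrary "next iterate"
s = x1 - x0                    # step
y = grad(x1) - grad(x0)        # gradient difference

# The Hessian of a quadratic is constant, so y = A s exactly:
assert np.allclose(A @ s, y)
```

For a non-quadratic function the same check holds only approximately, with error controlled by how much the Hessian varies along the step.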

Why It Matters

The secant condition is the minimal consistency requirement for a Hessian approximation. It constrains $B_{t+1}$ along one direction ($s_t$) but leaves freedom in all other directions. Different quasi-Newton methods (BFGS, DFP, SR1) differ in how they use this freedom.

Failure Mode

The secant condition requires $s_t^\top y_t > 0$ (the curvature condition) for the update to produce a positive definite $B_{t+1}$. This is guaranteed if $f$ is strongly convex but can fail for non-convex functions. When it fails, the BFGS update must be skipped or a safeguard applied.

The BFGS Update

The BFGS (Broyden-Fletcher-Goldfarb-Shanno) update is the most successful quasi-Newton method. It updates the inverse Hessian approximation $H_t$ to satisfy the secant condition while remaining positive definite, using a rank-2 update:

Definition

BFGS Inverse Hessian Update

$$H_{t+1} = \left(I - \rho_t s_t y_t^\top\right) H_t \left(I - \rho_t y_t s_t^\top\right) + \rho_t s_t s_t^\top$$

where $\rho_t = \frac{1}{y_t^\top s_t}$.

Equivalently, the direct Hessian approximation update is:

$$B_{t+1} = B_t - \frac{B_t s_t s_t^\top B_t}{s_t^\top B_t s_t} + \frac{y_t y_t^\top}{y_t^\top s_t}$$

Key properties of the BFGS update:

  • Rank-2: $H_{t+1} - H_t$ has rank at most 2, so the update is $O(d^2)$
  • Positive definiteness preserved: if $H_t \succ 0$ and $y_t^\top s_t > 0$, then $H_{t+1} \succ 0$
  • Secant condition satisfied: $H_{t+1} y_t = s_t$ by construction
  • Minimal change: BFGS is the unique update that satisfies the secant condition, preserves symmetry and positive definiteness, and is closest to $H_t$ in a weighted Frobenius norm

The Full BFGS Algorithm

  1. Initialize $H_0 = I$ (or a scaled identity) and $x^{(0)}$.
  2. Compute search direction: $d_t = -H_t \nabla f(x^{(t)})$.
  3. Line search: find $\alpha_t$ satisfying the Wolfe conditions.
  4. Update: $x^{(t+1)} = x^{(t)} + \alpha_t d_t$.
  5. Compute $s_t = x^{(t+1)} - x^{(t)}$ and $y_t = \nabla f(x^{(t+1)}) - \nabla f(x^{(t)})$.
  6. Update $H_t \to H_{t+1}$ using the BFGS formula.
  7. Go to step 2.

The Wolfe conditions in step 3 are crucial: they ensure $y_t^\top s_t > 0$, which guarantees that the BFGS update preserves positive definiteness.
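The loop above can be sketched on a strictly convex quadratic, where the line search in step 3 has the closed form $\alpha = -\frac{g^\top d}{d^\top A d}$ (an exact line search, used here in place of a Wolfe search; on a quadratic it guarantees $y_t^\top s_t = s_t^\top A s_t > 0$):

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)          # Hessian of f(x) = 1/2 x^T A x - b^T x
b = rng.standard_normal(5)
x_star = np.linalg.solve(A, b)       # true minimizer

x = np.zeros(5)
H = np.eye(5)                        # step 1: H_0 = I
g = A @ x - b
for _ in range(20):
    d = -H @ g                       # step 2: search direction
    alpha = -(g @ d) / (d @ A @ d)   # step 3: exact line search (quadratic case)
    x_new = x + alpha * d            # step 4: take the step
    g_new = A @ x_new - b
    s, y = x_new - x, g_new - g      # step 5: step and gradient difference
    rho = 1.0 / (y @ s)
    V = np.eye(5) - rho * np.outer(s, y)
    H = V @ H @ V.T + rho * np.outer(s, s)   # step 6: BFGS update
    x, g = x_new, g_new
    if np.linalg.norm(g) < 1e-10:    # step 7: loop until converged
        break

assert np.allclose(x, x_star)
```

On a quadratic in $\mathbb{R}^d$ with exact line searches, this loop terminates in at most $d$ iterations (see the second exercise below).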

Superlinear Convergence

Theorem

BFGS Superlinear Convergence

Statement

Under standard assumptions ($f$ twice continuously differentiable and strongly convex near the minimizer, with Lipschitz-continuous Hessian), BFGS with Wolfe line search converges superlinearly to the minimizer $x^*$:

$$\lim_{t \to \infty} \frac{\|x^{(t+1)} - x^*\|}{\|x^{(t)} - x^*\|} = 0$$

That is, the convergence rate eventually beats any fixed linear rate, though it does not achieve the quadratic rate of exact Newton.

Intuition

As the iterates approach $x^*$, the BFGS approximation $H_t$ converges to $[\nabla^2 f(x^*)]^{-1}$ along the directions that matter: the directions the algorithm actually moves in. The approximation becomes increasingly accurate where it counts, so the steps approach true Newton steps, and the convergence rate transitions from linear to superlinear.

Proof Sketch

The proof (Dennis and Moré, 1974) shows that superlinear convergence holds if and only if:

$$\lim_{t \to \infty} \frac{\|(H_t - [\nabla^2 f(x^*)]^{-1}) d_t\|}{\|d_t\|} = 0$$

That is, $H_t$ need not converge to the true inverse Hessian in every direction, only along the search directions $d_t$. The BFGS update, by incorporating curvature information along each step, achieves this directional convergence.

Why It Matters

Superlinear convergence means BFGS is much faster than gradient descent (which is only linearly convergent) and approaches the speed of Newton without computing second derivatives. For smooth, strongly convex problems, BFGS is the practical optimum.

Failure Mode

Superlinear convergence requires strong convexity and sufficiently accurate line searches. On non-convex problems, BFGS may converge to saddle points, and the Hessian approximation can become increasingly inaccurate. The curvature condition $y_t^\top s_t > 0$ may fail, requiring skipped updates or damping.

L-BFGS: Limited-Memory BFGS

BFGS stores a $d \times d$ matrix $H_t$: $O(d^2)$ memory. For $d = 10^6$, this is 8 TB in double precision. L-BFGS avoids this by never forming $H_t$ explicitly.

Definition

L-BFGS

L-BFGS stores only the last $m$ pairs $\{(s_i, y_i)\}_{i=t-m}^{t-1}$ (typically $m = 5$ to $20$). The matrix-vector product $H_t \nabla f$ is computed using a two-loop recursion that requires only $O(md)$ storage and $O(md)$ computation.

The two-loop recursion (Nocedal, 1980) computes $H_t g$ without forming $H_t$. Initialize $q = g$, then:

Loop 1 (backward over stored pairs): For $i = t-1, \ldots, t-m$: $\alpha_i = \rho_i s_i^\top q$; $q \leftarrow q - \alpha_i y_i$.

Set $r = H_0 q$ (where $H_0$ is a scaled identity, typically $H_0 = \frac{y_{t-1}^\top s_{t-1}}{y_{t-1}^\top y_{t-1}} I$).

Loop 2 (forward over stored pairs): For $i = t-m, \ldots, t-1$: $\beta = \rho_i y_i^\top r$; $r \leftarrow r + (\alpha_i - \beta) s_i$.

Return $r = H_t g$.

Cost: $O(md)$ time and storage, where $m \ll d$.
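A sketch of the recursion, cross-checked against an explicitly built dense approximation for a small problem (the pair data below is synthetic, constructed so each $y_i^\top s_i > 0$):

```python
import numpy as np

def two_loop(g, s_list, y_list):
    """L-BFGS two-loop recursion; s_list/y_list hold the stored pairs,
    oldest first. Returns H_t @ g without forming H_t."""
    q = g.copy()
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alphas = [0.0] * len(s_list)
    for i in reversed(range(len(s_list))):        # Loop 1: backward (newest first)
        alphas[i] = rhos[i] * (s_list[i] @ q)
        q -= alphas[i] * y_list[i]
    gamma = (y_list[-1] @ s_list[-1]) / (y_list[-1] @ y_list[-1])
    r = gamma * q                                 # r = H_0 q with scaled identity H_0
    for i in range(len(s_list)):                  # Loop 2: forward (oldest first)
        beta = rhos[i] * (y_list[i] @ r)
        r += (alphas[i] - beta) * s_list[i]
    return r

# Cross-check: apply the m BFGS updates explicitly to gamma * I.
rng = np.random.default_rng(3)
d, m = 6, 3
s_list = [rng.standard_normal(d) for _ in range(m)]
y_list = [s + 0.1 * rng.standard_normal(d) for s in s_list]  # y^T s > 0

gamma = (y_list[-1] @ s_list[-1]) / (y_list[-1] @ y_list[-1])
H = gamma * np.eye(d)
for s, y in zip(s_list, y_list):
    rho = 1.0 / (y @ s)
    V = np.eye(d) - rho * np.outer(s, y)
    H = V @ H @ V.T + rho * np.outer(s, s)

g = rng.standard_normal(d)
assert np.allclose(two_loop(g, s_list, y_list), H @ g)
```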

Why L-BFGS Is the Default

For medium-scale smooth optimization ($d$ in the thousands to low millions, full-batch gradients available), L-BFGS dominates because:

  1. $O(md)$ memory vs. $O(d^2)$ for BFGS
  2. No hyperparameter tuning (unlike SGD, which needs learning rate schedules)
  3. Superlinear convergence for strongly convex problems
  4. Built into every serious optimization library (SciPy, PyTorch's LBFGS)

L-BFGS is the standard for logistic regression, CRFs, maximum entropy models, and any smooth ML problem where you can compute full gradients.
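Typical usage through SciPy might look like the following sketch, fitting an L2-regularized logistic regression with full-batch gradients (the synthetic data and regularization strength are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
X = rng.standard_normal((200, 5))
w_true = rng.standard_normal(5)
labels = (X @ w_true > 0).astype(float)
lam = 0.1                                  # L2 strength: makes the loss strongly convex

def loss_and_grad(w):
    """Full-batch regularized log-loss and its gradient."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))     # sigmoid probabilities
    loss = -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    loss += 0.5 * lam * (w @ w)
    grad = X.T @ (p - labels) / len(labels) + lam * w
    return loss, grad

# jac=True tells minimize that loss_and_grad returns (value, gradient).
res = minimize(loss_and_grad, np.zeros(5), jac=True, method="L-BFGS-B")
assert res.success
```

No learning rate or schedule is supplied: the Wolfe line search and curvature pairs handle step-size selection internally.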

Connection to Preconditioning

The inverse Hessian approximation $H_t$ acts as a preconditioner for the gradient. Instead of descending along $-\nabla f$ (steepest descent), you descend along $-H_t \nabla f$, which approximately rescales the gradient by the inverse curvature.

In the eigenspace of the Hessian, this rescaling approximately equalizes the step sizes across all directions: directions with high curvature (large eigenvalues) get smaller steps, and directions with low curvature (small eigenvalues) get larger steps. This is why quasi-Newton methods handle ill-conditioned problems far better than gradient descent.

This is the same principle behind adaptive gradient methods in deep learning: Adam and AdaGrad use diagonal preconditioning (diagonal Hessian approximation), while L-BFGS uses a low-rank-plus-diagonal approximation.
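The effect is easy to see on a 2-D ill-conditioned quadratic: the raw gradient is dominated by the high-curvature coordinate, while the preconditioned direction points straight at the minimizer (a sketch with the exact $A^{-1}$ standing in for $H_t$):

```python
import numpy as np

A = np.diag([100.0, 1.0])        # Hessian with condition number 100
x = np.array([1.0, 1.0])         # current point; minimizer of 1/2 x^T A x is the origin
g = A @ x                        # gradient: [100, 1], skewed toward the stiff axis

# Steepest descent direction is dominated by the high-curvature coordinate:
assert abs(g[0]) / abs(g[1]) == 100.0

# The inverse-Hessian-preconditioned step equalizes curvature and
# reaches the minimizer in a single unit step:
newton_dir = -np.linalg.solve(A, g)
assert np.allclose(x + newton_dir, 0.0)
```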

Common Confusions

Watch Out

BFGS approximates the Hessian, not the gradient

BFGS does not modify the gradient direction in an ad hoc way. It builds a principled approximation to the Hessian (or its inverse) that satisfies a consistency condition (the secant equation). The resulting search direction is a quasi-Newton direction: close to the true Newton direction when the approximation is accurate.

Watch Out

L-BFGS is not just 'BFGS with less memory'

L-BFGS does not simply truncate the BFGS matrix. It implicitly represents $H_t$ as a product of rank-2 updates applied to a scaled identity, and computes the matrix-vector product $H_t g$ without ever forming $H_t$. The two-loop recursion is the key algorithmic insight. The resulting approximation is different from full BFGS (it effectively "forgets" old curvature information), but in practice this limited memory is sufficient.

Summary

  • Quasi-Newton methods approximate the Hessian using only gradient differences
  • The secant condition $B_{t+1} s_t = y_t$ is the fundamental constraint
  • BFGS uses a rank-2 update preserving positive definiteness: $O(d^2)$ per step
  • L-BFGS stores only $m$ vector pairs: $O(md)$ per step with two-loop recursion
  • Superlinear convergence for strongly convex problems with Wolfe line search
  • L-BFGS is the default for medium-scale smooth optimization
  • The Hessian approximation acts as a preconditioner, equalizing curvature

Exercises

ExerciseCore

Problem

Derive the secant condition from a first-order Taylor expansion of $\nabla f$ around $x^{(t)}$ evaluated at $x^{(t+1)}$. Specifically, show that $\nabla^2 f(x^{(t)}) s_t \approx y_t$ under the assumption that the Hessian is approximately constant.

ExerciseAdvanced

Problem

Consider BFGS initialized with $H_0 = I$ on a strictly convex quadratic $f(x) = \frac{1}{2} x^\top A x - b^\top x$ with $A \succ 0$. Show that after $d$ steps with exact line searches, $H_d = A^{-1}$ (i.e., BFGS recovers the exact inverse Hessian in at most $d$ steps on a quadratic). This is called the finite termination property.

References

Canonical:

  • Nocedal & Wright, Numerical Optimization (2006), Chapters 6-7
  • Dennis & Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations (1996)

Current:

  • Nocedal, "Updating Quasi-Newton Matrices with Limited Storage" (1980). The L-BFGS paper
  • Liu & Nocedal, "On the Limited Memory BFGS Method for Large Scale Optimization" (1989)

Next Topics

The natural next step from quasi-Newton methods:

  • Proximal gradient methods: when the objective has a non-smooth regularizer (e.g., $\ell_1$), quasi-Newton cannot be applied directly

Last reviewed: April 2026
