Numerical Optimization
Quasi-Newton Methods
Approximate the Hessian instead of computing it: BFGS builds a dense approximation, L-BFGS stores only a few vectors. Superlinear convergence without second derivatives.
Why This Matters
Newton's method converges quadratically but requires computing and inverting the Hessian at each step: $O(n^2)$ storage and $O(n^3)$ computation. For problems with large $n$, this is prohibitive.
Quasi-Newton methods keep the speed advantage of curvature information while avoiding the Hessian entirely. They build an approximation to the Hessian (or its inverse) using only gradient information, updated cheaply at each step. The result: superlinear convergence at the cost of first-order methods.
L-BFGS in particular is the default optimizer for medium-scale smooth optimization problems across scientific computing, and it remains competitive for many ML tasks where full-batch gradients are available.
Mental Model
At each step, you have a matrix $B_k$ (Hessian approximation) or $H_k \approx B_k^{-1}$ (inverse Hessian approximation). You use $H_k$ to compute a Newton-like step:

$$p_k = -H_k \nabla f(x_k).$$

After taking the step, you observe how the gradient changed. You use that observation to update $H_k$ to $H_{k+1}$, making it a better approximation. The crucial constraint: $H_{k+1}$ must satisfy the secant condition, which ensures it is consistent with the curvature you just observed.
The Secant Condition
Secant Condition
Define the step and gradient difference:

$$s_k = x_{k+1} - x_k, \qquad y_k = \nabla f(x_{k+1}) - \nabla f(x_k).$$

The secant condition (or quasi-Newton equation) requires the Hessian approximation to satisfy:

$$B_{k+1} s_k = y_k,$$

or equivalently, for the inverse approximation:

$$H_{k+1} y_k = s_k.$$
The secant condition says: the approximate Hessian, when applied to the step direction, must reproduce the observed gradient change. This is a first-order consistency requirement. It ensures the approximation is exact along the direction you just moved.
Derivation of the Secant Condition from Taylor Expansion
Statement
If $\nabla^2 f$ is approximately constant along the segment from $x_k$ to $x_{k+1}$, then $y_k \approx \nabla^2 f(x_{k+1})\, s_k$. The secant condition enforces this relationship exactly for the approximation $B_{k+1}$.
Intuition
By the mean value theorem in integral form for vector-valued functions:

$$y_k = \nabla f(x_{k+1}) - \nabla f(x_k) = \left( \int_0^1 \nabla^2 f(x_k + t\, s_k)\, dt \right) s_k.$$

If the Hessian is roughly constant, the integral is approximately $\nabla^2 f(x_{k+1})$, giving $y_k \approx \nabla^2 f(x_{k+1})\, s_k$. The secant condition forces the approximation to match this.
Proof Sketch
Taylor expand $\nabla f$ around $x_{k+1}$:

$$\nabla f(x_k) = \nabla f(x_{k+1}) - \nabla^2 f(x_{k+1})\, s_k + O(\|s_k\|^2).$$

Rearranging: $y_k = \nabla^2 f(x_{k+1})\, s_k + O(\|s_k\|^2)$. Replacing $\nabla^2 f(x_{k+1})$ with the approximation $B_{k+1}$ and dropping the higher-order term gives $B_{k+1} s_k = y_k$.
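This approximation is easy to check numerically. The sketch below uses a made-up two-variable test function (hand-derived gradient and Hessian) to verify that the observed gradient difference $y_k$ matches $\nabla^2 f(x_{k+1})\, s_k$ up to a residual of order $\|s_k\|^2$:

```python
import numpy as np

# Test function f(x) = x0^4 + x0*x1 + (1 + x1)^2 (chosen arbitrarily),
# with gradient and Hessian written out by hand.
def grad(x):
    return np.array([4 * x[0]**3 + x[1], x[0] + 2 * (1 + x[1])])

def hess(x):
    return np.array([[12 * x[0]**2, 1.0], [1.0, 2.0]])

x_k = np.array([0.7, -0.3])
s_k = np.array([1e-3, -2e-3])            # a small step
x_next = x_k + s_k
y_k = grad(x_next) - grad(x_k)           # observed gradient change

# Residual of the secant relation: O(||s_k||^2), far smaller than ||y_k||
residual = np.linalg.norm(y_k - hess(x_next) @ s_k)
print(residual)
```

Halving $s_k$ should shrink the residual by roughly a factor of four, confirming the second-order error term.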
Why It Matters
The secant condition is the minimal consistency requirement for a Hessian approximation. It constrains $B_{k+1}$ along one direction ($s_k$) but leaves freedom in all other directions. Different quasi-Newton methods (BFGS, DFP, SR1) differ in how they use this freedom.
Failure Mode
The secant condition requires $s_k^\top y_k > 0$ (the curvature condition) for the update to produce a positive definite $B_{k+1}$. This is guaranteed if $f$ is strongly convex but can fail for non-convex functions. When it fails, the BFGS update must be skipped or a safeguard applied.
The BFGS Update
The BFGS (Broyden-Fletcher-Goldfarb-Shanno) update is the most successful quasi-Newton method. It updates the inverse Hessian approximation to satisfy the secant condition while remaining positive definite, using a rank-2 update:
BFGS Inverse Hessian Update

$$H_{k+1} = \left(I - \rho_k s_k y_k^\top\right) H_k \left(I - \rho_k y_k s_k^\top\right) + \rho_k s_k s_k^\top,$$

where $\rho_k = \dfrac{1}{y_k^\top s_k}$.

Equivalently, the direct Hessian approximation update is:

$$B_{k+1} = B_k - \frac{B_k s_k s_k^\top B_k}{s_k^\top B_k s_k} + \frac{y_k y_k^\top}{y_k^\top s_k}.$$
Key properties of the BFGS update:
- Rank-2: $H_{k+1} - H_k$ has rank at most 2, so the update costs $O(n^2)$
- Positive definiteness preserved: if $H_k \succ 0$ and $s_k^\top y_k > 0$, then $H_{k+1} \succ 0$
- Secant condition satisfied: $H_{k+1} y_k = s_k$ by construction
- Minimal change: BFGS is the unique update that satisfies the secant condition, preserves symmetry and positive definiteness, and is closest to $H_k$ in a weighted Frobenius norm
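The first three properties can be verified numerically. A quick sketch, with randomly generated $s_k$ and $y_k$ chosen so the curvature condition holds:

```python
import numpy as np

# Check that the BFGS inverse-Hessian update satisfies the secant
# condition H_new @ y = s and preserves positive definiteness.
rng = np.random.default_rng(0)
n = 5
H = np.eye(n)                            # current inverse approximation, SPD
s = rng.standard_normal(n)
y = s + 0.1 * rng.standard_normal(n)     # small perturbation keeps s.y > 0
assert s @ y > 0                         # curvature condition

rho = 1.0 / (y @ s)
I = np.eye(n)
H_new = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
        + rho * np.outer(s, s)

print(np.allclose(H_new @ y, s))              # True: secant condition
print(np.all(np.linalg.eigvalsh(H_new) > 0))  # True: still positive definite
```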
The Full BFGS Algorithm
1. Initialize $H_0 = I$ (or a scaled identity) and $k = 0$.
2. Compute the search direction: $p_k = -H_k \nabla f(x_k)$.
3. Line search: find $\alpha_k$ satisfying the Wolfe conditions.
4. Update: $x_{k+1} = x_k + \alpha_k p_k$.
5. Compute $s_k = x_{k+1} - x_k$ and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$.
6. Update $H_{k+1}$ using the BFGS formula.
7. Set $k \leftarrow k + 1$ and go to step 2.
The Wolfe conditions in step 3 are crucial: they ensure $s_k^\top y_k > 0$, which guarantees that the BFGS update preserves positive definiteness.
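A compact implementation of this loop might look as follows; it is an illustrative sketch (no damping or restarts), using SciPy's Wolfe line search (`scipy.optimize.line_search`) and tested on the Rosenbrock function:

```python
import numpy as np
from scipy.optimize import line_search

def bfgs(f, grad_f, x0, tol=1e-8, max_iter=200):
    """Sketch of BFGS with Wolfe line search (steps 1-7 above)."""
    n = x0.size
    H = np.eye(n)                            # step 1: H_0 = I
    x = x0.astype(float)
    g = grad_f(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:          # converged
            break
        p = -H @ g                           # step 2: search direction
        alpha = line_search(f, grad_f, x, p)[0]  # step 3: Wolfe conditions
        if alpha is None:                    # line search failed: tiny fallback
            alpha = 1e-4
        x_new = x + alpha * p                # step 4
        g_new = grad_f(x_new)
        s, y = x_new - x, g_new - g          # step 5
        if s @ y > 1e-12:                    # curvature condition holds
            rho = 1.0 / (y @ s)
            I = np.eye(n)
            H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)       # step 6: BFGS update
        x, g = x_new, g_new                  # step 7
    return x

# Rosenbrock function: minimizer at (1, 1)
f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
grad_f = lambda x: np.array([
    -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
    200 * (x[1] - x[0]**2),
])
x_star = bfgs(f, grad_f, np.array([-1.2, 1.0]))
print(np.round(x_star, 4))   # ≈ [1. 1.]
```

Note the curvature-condition guard: when $s_k^\top y_k \le 0$ the update is skipped, the safeguard mentioned earlier.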
Superlinear Convergence
BFGS Superlinear Convergence
Statement
Suppose $f$ is twice continuously differentiable, $\nabla^2 f$ is Lipschitz continuous near the minimizer $x^*$, and $\nabla^2 f(x^*) \succ 0$. Then BFGS with Wolfe line search converges superlinearly to $x^*$:

$$\lim_{k \to \infty} \frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|} = 0.$$

That is, the convergence rate eventually beats any fixed linear rate, though it does not achieve the quadratic rate of exact Newton.
Intuition
As the iterates approach $x^*$, the BFGS approximation $H_k$ converges to $\nabla^2 f(x^*)^{-1}$ along the directions that matter: the directions the algorithm actually moves. The approximation becomes increasingly accurate where it counts, so the steps approach true Newton steps, and the convergence rate transitions from linear to superlinear.
Proof Sketch
The proof (Dennis and Moré, 1974) shows that superlinear convergence holds if and only if:

$$\lim_{k \to \infty} \frac{\left\| \left( B_k - \nabla^2 f(x^*) \right) p_k \right\|}{\|p_k\|} = 0.$$

That is, $B_k$ need not converge to the true Hessian in every direction, only along the search directions $p_k$. The BFGS update, by incorporating curvature information along each step, achieves this directional convergence.
Why It Matters
Superlinear convergence means BFGS is much faster than gradient descent (which is only linearly convergent) and approaches the speed of Newton without computing second derivatives. For smooth, strongly convex problems, BFGS is the practical optimum.
Failure Mode
Superlinear convergence requires strong convexity near the solution and sufficiently accurate line searches. On non-convex problems, BFGS may converge to saddle points, and the Hessian approximation can become increasingly inaccurate. The curvature condition $s_k^\top y_k > 0$ may fail, requiring skipped updates or damping.
L-BFGS: Limited-Memory BFGS
BFGS stores a dense $n \times n$ matrix: $O(n^2)$ memory. For $n = 10^6$, this is 8 TB in double precision. L-BFGS avoids this by never forming $H_k$ explicitly.
L-BFGS
L-BFGS stores only the last $m$ pairs $(s_i, y_i)$ (typically $m = 5$ to $20$). The matrix-vector product $H_k \nabla f(x_k)$ is computed using a two-loop recursion that requires only $O(mn)$ storage and computation.
The two-loop recursion (Nocedal, 1980) computes $H_k q$ for $q = \nabla f(x_k)$ without forming $H_k$:

Loop 1 (backward over stored pairs): For $i = k-1, \dots, k-m$: $\alpha_i = \rho_i\, s_i^\top q$; $q \leftarrow q - \alpha_i y_i$.

Set $r = H_k^0 q$ (where $H_k^0 = \gamma_k I$ is a scaled identity, typically $\gamma_k = \frac{s_{k-1}^\top y_{k-1}}{y_{k-1}^\top y_{k-1}}$).

Loop 2 (forward over stored pairs): For $i = k-m, \dots, k-1$: $\beta = \rho_i\, y_i^\top r$; $r \leftarrow r + (\alpha_i - \beta)\, s_i$.

Return $r = H_k \nabla f(x_k)$.

Cost: $O(mn)$ time and storage, where $m \ll n$.
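In code, the recursion is a few lines. The sketch below (with a hypothetical `two_loop` helper) stores the pairs oldest-to-newest and checks the result against explicitly applying the same rank-2 BFGS updates to the scaled identity:

```python
import numpy as np

def two_loop(q, pairs):
    """L-BFGS two-loop recursion: compute H_k @ q from stored (s, y)
    pairs (ordered oldest -> newest) without ever forming H_k."""
    q = q.copy()
    alphas = []
    for s, y in reversed(pairs):             # loop 1: newest to oldest
        a = (s @ q) / (y @ s)                # alpha_i = rho_i s_i^T q
        q -= a * y
        alphas.append(a)
    s, y = pairs[-1]
    r = (s @ y) / (y @ y) * q                # r = gamma_k q (scaled identity)
    for (s, y), a in zip(pairs, reversed(alphas)):  # loop 2: oldest to newest
        beta = (y @ r) / (y @ s)             # beta = rho_i y_i^T r
        r += (a - beta) * s
    return r

# Sanity check against m explicit rank-2 BFGS updates applied to gamma * I
rng = np.random.default_rng(1)
n, m = 6, 3
pairs = []
while len(pairs) < m:
    s = rng.standard_normal(n)
    y = s + 0.2 * rng.standard_normal(n)
    if s @ y > 0:                            # keep only curvature-valid pairs
        pairs.append((s, y))
s, y = pairs[-1]
H = (s @ y) / (y @ y) * np.eye(n)
I = np.eye(n)
for s, y in pairs:                           # oldest update first
    rho = 1.0 / (y @ s)
    H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
        + rho * np.outer(s, s)
q = rng.standard_normal(n)
print(np.allclose(two_loop(q, pairs), H @ q))   # True
```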
Why L-BFGS Is the Default
For medium-scale smooth optimization ($n$ in the thousands to low millions, full-batch gradients available), L-BFGS dominates because:
- $O(mn)$ memory vs. $O(n^2)$ for BFGS
- Minimal hyperparameter tuning: only the memory size $m$, to which performance is insensitive (unlike SGD, which needs learning rate schedules)
- Superlinear convergence for strongly convex problems
- Built into every serious optimization library (scipy's `L-BFGS-B`, PyTorch's `torch.optim.LBFGS`)
L-BFGS is the standard for logistic regression, CRFs, maximum entropy models, and any smooth ML problem where you can compute full gradients.
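As a usage sketch, here is logistic regression with L2 regularization fit via SciPy's L-BFGS-B driver on synthetic data (variable names are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic binary classification data
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
w_true = rng.standard_normal(10)
t = (X @ w_true > 0).astype(float)       # labels in {0, 1}

def loss_and_grad(w, lam=1e-2):
    """L2-regularized logistic loss and its gradient."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))   # predicted probabilities
    eps = 1e-12                          # guard against log(0)
    loss = -np.mean(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps)) \
           + 0.5 * lam * (w @ w)
    grad = X.T @ (p - t) / len(t) + lam * w
    return loss, grad

res = minimize(loss_and_grad, np.zeros(10), jac=True, method="L-BFGS-B")
print(res.success)    # True: converged with no step-size tuning
```

Passing `jac=True` tells `minimize` that the objective returns the loss and gradient together, avoiding redundant evaluations.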
Connection to Preconditioning
The inverse Hessian approximation $H_k$ acts as a preconditioner for the gradient. Instead of descending along $-\nabla f(x_k)$ (steepest descent), you descend along $-H_k \nabla f(x_k)$, which approximately rescales the gradient by the inverse curvature.
In the eigenspace of the Hessian, this rescaling approximately equalizes the step sizes across all directions: directions with high curvature (large eigenvalues) get smaller steps, and directions with low curvature (small eigenvalues) get larger steps. This is why quasi-Newton methods handle ill-conditioned problems far better than gradient descent.
This is the same principle behind adaptive gradient methods in deep learning: Adam and AdaGrad use diagonal preconditioning (a diagonal curvature proxy built from running gradient statistics), while L-BFGS uses a low-rank-plus-diagonal approximation.
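The curvature-equalizing effect is easiest to see on a tiny ill-conditioned quadratic, where the ideal preconditioner that $H_k$ approximates (the exact inverse Hessian) is available in closed form:

```python
import numpy as np

# f(x) = 1/2 x^T A x with condition number 100: a narrow valley.
A = np.diag([1.0, 100.0])
x = np.array([1.0, 1.0])
g = A @ x                              # gradient of f at x

x_gd = x - (1.0 / 100.0) * g           # gradient descent, step size 1/L
x_pc = x - np.linalg.solve(A, g)       # preconditioned (Newton-like) step

print(np.linalg.norm(x_gd))   # 0.99: barely moved along the flat direction
print(np.linalg.norm(x_pc))   # 0.0: curvature equalized, minimum in one step
```

Gradient descent must use a step size limited by the largest eigenvalue, so progress along the small-eigenvalue direction is slow; the preconditioned step rescales both directions at once.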
Common Confusions
BFGS approximates the Hessian, not the gradient
BFGS does not modify the gradient direction in an ad hoc way. It builds a principled approximation to the Hessian (or its inverse) that satisfies a consistency condition (the secant equation). The resulting search direction is a quasi-Newton direction: close to the true Newton direction when the approximation is accurate.
L-BFGS is not just 'BFGS with less memory'
L-BFGS does not simply truncate the BFGS matrix. It implicitly represents $H_k$ as a sequence of rank-2 updates applied to a scaled identity, and computes the matrix-vector product $H_k \nabla f(x_k)$ without ever forming $H_k$. The two-loop recursion is the key algorithmic insight. The resulting approximation is different from full BFGS (it effectively "forgets" curvature information more than $m$ steps old), but in practice this limited memory is sufficient.
Summary
- Quasi-Newton methods approximate the Hessian using only gradient differences
- The secant condition $B_{k+1} s_k = y_k$ is the fundamental constraint
- BFGS uses a rank-2 update preserving positive definiteness: $O(n^2)$ per step
- L-BFGS stores only $m$ vector pairs: $O(mn)$ per step with the two-loop recursion
- Superlinear convergence for strongly convex problems with Wolfe line search
- L-BFGS is the default for medium-scale smooth optimization
- The Hessian approximation acts as a preconditioner, equalizing curvature across directions
Exercises
Problem
Derive the secant condition from a first-order Taylor expansion of $\nabla f$ around $x_{k+1}$ evaluated at $x_k$. Specifically, show that $y_k \approx \nabla^2 f(x_{k+1})\, s_k$ under the assumption that the Hessian is approximately constant.
Problem
Consider BFGS initialized with $H_0 = I$ on a strictly convex quadratic $f(x) = \frac{1}{2} x^\top A x - b^\top x$ with $A \succ 0$. Show that after $n$ steps with exact line searches, $H_n = A^{-1}$ (i.e., BFGS recovers the exact inverse Hessian in at most $n$ steps on a quadratic). This is called the finite termination property.
References
Canonical:
- Nocedal & Wright, Numerical Optimization (2006), Chapters 6-7
- Dennis & Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations (1996)
Current:
- Nocedal, "Updating Quasi-Newton Matrices with Limited Storage" (1980). The L-BFGS paper
- Liu & Nocedal, "On the Limited Memory BFGS Method for Large Scale Optimization" (1989)
Next Topics
The natural next step from quasi-Newton methods:
- Proximal gradient methods: when the objective has a non-smooth regularizer (e.g., $\ell_1$), quasi-Newton cannot be applied directly
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Newton's Method (Layer 1)
- Convex Optimization Basics (Layer 1)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)