
Numerical Optimization

Quasi-Newton Methods

Approximate the Hessian instead of computing it: BFGS builds a dense approximation, L-BFGS stores only a few vectors. Superlinear convergence without second derivatives.


Why This Matters

Newton's method converges quadratically but requires computing and inverting the $d \times d$ Hessian at each step: $O(d^2)$ storage and $O(d^3)$ computation. For problems with $d > 10^4$, this is prohibitive.

Quasi-Newton methods keep the speed advantage of curvature information while avoiding the Hessian entirely. They build an approximation to the Hessian (or its inverse) using only gradient information, updated cheaply at each step. The result: superlinear convergence at the cost of first-order methods.

L-BFGS in particular is the default optimizer for medium-scale smooth optimization problems across scientific computing, and it remains competitive for many ML tasks where full-batch gradients are available.

Mental Model

At each step, you have a matrix $B_t \approx \nabla^2 f(x^{(t)})$ (Hessian approximation) or $H_t \approx [\nabla^2 f(x^{(t)})]^{-1}$ (inverse Hessian approximation). You use $H_t$ to compute a Newton-like step:

$$x^{(t+1)} = x^{(t)} - \alpha_t H_t \nabla f(x^{(t)})$$

After taking the step, you observe how the gradient changed. You use that observation to update $H_t \to H_{t+1}$, making it a better Hessian approximation. The crucial constraint: $H_{t+1}$ must satisfy the secant condition, which ensures it is consistent with the curvature you just observed.

The Secant Condition

Definition

Secant Condition

Define the step and gradient difference:

$$s_t = x^{(t+1)} - x^{(t)}, \qquad y_t = \nabla f(x^{(t+1)}) - \nabla f(x^{(t)})$$

The secant condition (or quasi-Newton equation) requires the Hessian approximation to satisfy:

$$B_{t+1} s_t = y_t$$

or equivalently, for the inverse approximation:

$$H_{t+1} y_t = s_t$$

The secant condition says: the approximate Hessian, when applied to the step direction, must reproduce the observed gradient change. This is a first-order consistency requirement. It ensures the approximation is exact along the direction you just moved.

Proposition

Derivation of the Secant Condition from Taylor Expansion

Statement

If $\nabla^2 f$ is approximately constant along the segment from $x^{(t)}$ to $x^{(t+1)}$, then $\nabla^2 f \cdot s_t \approx y_t$. The secant condition $B_{t+1} s_t = y_t$ enforces this relationship exactly for the approximation $B_{t+1}$.

Intuition

By the fundamental theorem of calculus applied to $\nabla f$ along the segment from $x^{(t)}$ to $x^{(t+1)}$:

$$y_t = \nabla f(x^{(t+1)}) - \nabla f(x^{(t)}) = \left(\int_0^1 \nabla^2 f(x^{(t)} + \tau s_t) \, d\tau\right) s_t$$

If the Hessian is roughly constant, the integral is approximately $\nabla^2 f(x^{(t)})$, giving $y_t \approx \nabla^2 f(x^{(t)}) s_t$. The secant condition forces the approximation to match this.

Proof Sketch

Taylor expand $\nabla f(x^{(t+1)})$ around $x^{(t)}$:

$$\nabla f(x^{(t+1)}) \approx \nabla f(x^{(t)}) + \nabla^2 f(x^{(t)}) s_t$$

Rearranging: $\nabla^2 f(x^{(t)}) s_t \approx y_t$. Replacing $\nabla^2 f$ with the approximation $B_{t+1}$ gives $B_{t+1} s_t = y_t$.
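The relationship is easy to check numerically. For a quadratic the Hessian is exactly constant, so $y_t = \nabla^2 f \, s_t$ holds with equality; a minimal sketch (the matrix and points below are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
A = M @ M.T + 3 * np.eye(3)    # symmetric positive definite Hessian
b = rng.standard_normal(3)

grad = lambda x: A @ x - b     # gradient of f(x) = 1/2 x^T A x - b^T x

x0 = rng.standard_normal(3)
x1 = rng.standard_normal(3)    # an arbitrary "next iterate"
s = x1 - x0                    # step
y = grad(x1) - grad(x0)        # gradient difference

# The Hessian of a quadratic is constant, so y = A s exactly:
assert np.allclose(A @ s, y)
```

For a non-quadratic function the same check holds only approximately, with error controlled by how much the Hessian varies along the step.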

Why It Matters

The secant condition is the minimal consistency requirement for a Hessian approximation. It constrains $B_{t+1}$ along one direction ($s_t$) but leaves freedom in all other directions. Different quasi-Newton methods (BFGS, DFP, SR1) differ in how they use this freedom.

Failure Mode

The secant condition requires $s_t^\top y_t > 0$ (the curvature condition) for the update to produce a positive definite $B_{t+1}$. This is guaranteed if $f$ is strongly convex but can fail for non-convex functions. When it fails, the BFGS update must be skipped or a safeguard applied.

The BFGS Update

The BFGS (Broyden-Fletcher-Goldfarb-Shanno) update is the most successful quasi-Newton method. It updates the inverse Hessian approximation $H_t$ to satisfy the secant condition while remaining positive definite, using a rank-2 update:

Definition

BFGS Inverse Hessian Update

$$H_{t+1} = \left(I - \rho_t s_t y_t^\top\right) H_t \left(I - \rho_t y_t s_t^\top\right) + \rho_t s_t s_t^\top$$

where $\rho_t = \frac{1}{y_t^\top s_t}$.

Equivalently, the direct Hessian approximation update is:

$$B_{t+1} = B_t - \frac{B_t s_t s_t^\top B_t}{s_t^\top B_t s_t} + \frac{y_t y_t^\top}{y_t^\top s_t}$$

Key properties of the BFGS update:

  • Rank-2: $H_{t+1} - H_t$ has rank at most 2, so the update is $O(d^2)$
  • Positive definiteness preserved: if $H_t \succ 0$ and $y_t^\top s_t > 0$, then $H_{t+1} \succ 0$
  • Secant condition satisfied: $H_{t+1} y_t = s_t$ by construction
  • Minimal change: BFGS is the unique update that satisfies the secant condition, preserves symmetry and positive definiteness, and is closest to $H_t$ in a weighted Frobenius norm

The Full BFGS Algorithm

  1. Initialize $H_0 = I$ (or a scaled identity) and $x^{(0)}$.
  2. Compute search direction: $d_t = -H_t \nabla f(x^{(t)})$.
  3. Line search: find $\alpha_t$ satisfying the Wolfe conditions.
  4. Update: $x^{(t+1)} = x^{(t)} + \alpha_t d_t$.
  5. Compute $s_t = x^{(t+1)} - x^{(t)}$ and $y_t = \nabla f(x^{(t+1)}) - \nabla f(x^{(t)})$.
  6. Update $H_t \to H_{t+1}$ using the BFGS formula.
  7. Go to step 2.

The Wolfe conditions in step 3 are crucial: they ensure $y_t^\top s_t > 0$, which guarantees that the BFGS update preserves positive definiteness.
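The loop above can be sketched on a strictly convex quadratic, where the line search in step 3 has the closed form $\alpha = -\frac{g^\top d}{d^\top A d}$ (an exact line search, used here in place of a Wolfe search; on a quadratic it guarantees $y_t^\top s_t = s_t^\top A s_t > 0$):

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)          # Hessian of f(x) = 1/2 x^T A x - b^T x
b = rng.standard_normal(5)
x_star = np.linalg.solve(A, b)       # true minimizer

x = np.zeros(5)
H = np.eye(5)                        # step 1: H_0 = I
g = A @ x - b
for _ in range(20):
    d = -H @ g                       # step 2: search direction
    alpha = -(g @ d) / (d @ A @ d)   # step 3: exact line search (quadratic case)
    x_new = x + alpha * d            # step 4: take the step
    g_new = A @ x_new - b
    s, y = x_new - x, g_new - g      # step 5: step and gradient difference
    rho = 1.0 / (y @ s)
    V = np.eye(5) - rho * np.outer(s, y)
    H = V @ H @ V.T + rho * np.outer(s, s)   # step 6: BFGS update
    x, g = x_new, g_new
    if np.linalg.norm(g) < 1e-10:    # step 7: loop until converged
        break

assert np.allclose(x, x_star)
```

On a quadratic in $\mathbb{R}^d$ with exact line searches, this loop terminates in at most $d$ iterations (see the second exercise below).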

Superlinear Convergence

Theorem

BFGS Superlinear Convergence

Statement

Under standard assumptions ($f$ twice continuously differentiable and strongly convex near the minimizer, with Lipschitz-continuous Hessian), BFGS with Wolfe line search converges superlinearly to the minimizer $x^*$:

$$\lim_{t \to \infty} \frac{\|x^{(t+1)} - x^*\|}{\|x^{(t)} - x^*\|} = 0$$

That is, the convergence rate eventually beats any fixed linear rate, though it does not achieve the quadratic rate of exact Newton.

Intuition

As the iterates approach $x^*$, the BFGS approximation $H_t$ converges to $[\nabla^2 f(x^*)]^{-1}$ along the directions that matter: the directions the algorithm actually moves in. The approximation becomes increasingly accurate where it counts, so the steps approach true Newton steps, and the convergence rate transitions from linear to superlinear.

Proof Sketch

The proof (Dennis and Moré, 1974) shows that superlinear convergence holds if and only if:

$$\lim_{t \to \infty} \frac{\|(H_t - [\nabla^2 f(x^*)]^{-1}) d_t\|}{\|d_t\|} = 0$$

That is, $H_t$ need not converge to the true inverse Hessian in every direction, only along the search directions $d_t$. The BFGS update, by incorporating curvature information along each step, achieves this directional convergence.

Why It Matters

Superlinear convergence means BFGS is much faster than gradient descent (which is only linearly convergent) and approaches the speed of Newton without computing second derivatives. For smooth, strongly convex problems, BFGS is the practical optimum.

Failure Mode

Superlinear convergence requires strong convexity and sufficiently accurate line searches. On non-convex problems, BFGS may converge to saddle points, and the Hessian approximation can become increasingly inaccurate. The curvature condition $y_t^\top s_t > 0$ may fail, requiring skipped updates or damping.

L-BFGS: Limited-Memory BFGS

BFGS stores a $d \times d$ matrix $H_t$: $O(d^2)$ memory. For $d = 10^6$, this is 8 TB in double precision. L-BFGS avoids this by never forming $H_t$ explicitly.

Definition

L-BFGS

L-BFGS stores only the last $m$ pairs $\{(s_i, y_i)\}_{i=t-m}^{t-1}$ (typically $m = 5$ to $20$). The matrix-vector product $H_t \nabla f$ is computed using a two-loop recursion that requires only $O(md)$ storage and $O(md)$ computation.

The two-loop recursion (Nocedal, 1980) computes $H_t g$ without forming $H_t$. Initialize $q = g$, then:

Loop 1 (backward over stored pairs): For $i = t-1, \ldots, t-m$: $\alpha_i = \rho_i s_i^\top q$; $q \leftarrow q - \alpha_i y_i$.

Set $r = H_0 q$ (where $H_0$ is a scaled identity, typically $H_0 = \frac{y_{t-1}^\top s_{t-1}}{y_{t-1}^\top y_{t-1}} I$).

Loop 2 (forward over stored pairs): For $i = t-m, \ldots, t-1$: $\beta = \rho_i y_i^\top r$; $r \leftarrow r + (\alpha_i - \beta) s_i$.

Return $r = H_t g$.

Cost: $O(md)$ time and storage, where $m \ll d$.
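A sketch of the recursion, cross-checked against an explicitly built dense approximation for a small problem (the pair data below is synthetic, constructed so each $y_i^\top s_i > 0$):

```python
import numpy as np

def two_loop(g, s_list, y_list):
    """L-BFGS two-loop recursion; s_list/y_list hold the stored pairs,
    oldest first. Returns H_t @ g without forming H_t."""
    q = g.copy()
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alphas = [0.0] * len(s_list)
    for i in reversed(range(len(s_list))):        # Loop 1: backward (newest first)
        alphas[i] = rhos[i] * (s_list[i] @ q)
        q -= alphas[i] * y_list[i]
    gamma = (y_list[-1] @ s_list[-1]) / (y_list[-1] @ y_list[-1])
    r = gamma * q                                 # r = H_0 q with scaled identity H_0
    for i in range(len(s_list)):                  # Loop 2: forward (oldest first)
        beta = rhos[i] * (y_list[i] @ r)
        r += (alphas[i] - beta) * s_list[i]
    return r

# Cross-check: apply the m BFGS updates explicitly to gamma * I.
rng = np.random.default_rng(3)
d, m = 6, 3
s_list = [rng.standard_normal(d) for _ in range(m)]
y_list = [s + 0.1 * rng.standard_normal(d) for s in s_list]  # y^T s > 0

gamma = (y_list[-1] @ s_list[-1]) / (y_list[-1] @ y_list[-1])
H = gamma * np.eye(d)
for s, y in zip(s_list, y_list):
    rho = 1.0 / (y @ s)
    V = np.eye(d) - rho * np.outer(s, y)
    H = V @ H @ V.T + rho * np.outer(s, s)

g = rng.standard_normal(d)
assert np.allclose(two_loop(g, s_list, y_list), H @ g)
```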

Why L-BFGS Is the Default

For medium-scale smooth optimization ($d$ in the thousands to low millions, full-batch gradients available), L-BFGS dominates because:

  1. $O(md)$ memory vs. $O(d^2)$ for BFGS
  2. No hyperparameter tuning (unlike SGD, which needs learning rate schedules)
  3. Superlinear convergence for strongly convex problems
  4. Built into every serious optimization library (SciPy, PyTorch's LBFGS)

L-BFGS is the standard for logistic regression, CRFs, maximum entropy models, and any smooth ML problem where you can compute full gradients.
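Typical usage through SciPy might look like the following sketch, fitting an L2-regularized logistic regression with full-batch gradients (the synthetic data and regularization strength are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
X = rng.standard_normal((200, 5))
w_true = rng.standard_normal(5)
labels = (X @ w_true > 0).astype(float)
lam = 0.1                                  # L2 strength: makes the loss strongly convex

def loss_and_grad(w):
    """Full-batch regularized log-loss and its gradient."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))     # sigmoid probabilities
    loss = -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    loss += 0.5 * lam * (w @ w)
    grad = X.T @ (p - labels) / len(labels) + lam * w
    return loss, grad

# jac=True tells minimize that loss_and_grad returns (value, gradient).
res = minimize(loss_and_grad, np.zeros(5), jac=True, method="L-BFGS-B")
assert res.success
```

No learning rate or schedule is supplied: the Wolfe line search and curvature pairs handle step-size selection internally.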

Connection to Preconditioning

The inverse Hessian approximation $H_t$ acts as a preconditioner for the gradient. Instead of descending along $-\nabla f$ (steepest descent), you descend along $-H_t \nabla f$, which approximately rescales the gradient by the inverse curvature.

In the eigenspace of the Hessian, this rescaling approximately equalizes the step sizes across all directions: directions with high curvature (large eigenvalues) get smaller steps, and directions with low curvature (small eigenvalues) get larger steps. This is why quasi-Newton methods handle ill-conditioned problems far better than gradient descent.

This is the same principle behind adaptive gradient methods in deep learning: Adam and AdaGrad use diagonal preconditioning (diagonal Hessian approximation), while L-BFGS uses a low-rank-plus-diagonal approximation.
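The effect is easy to see on a 2-D ill-conditioned quadratic: the raw gradient is dominated by the high-curvature coordinate, while the preconditioned direction points straight at the minimizer (a sketch with the exact $A^{-1}$ standing in for $H_t$):

```python
import numpy as np

A = np.diag([100.0, 1.0])        # Hessian with condition number 100
x = np.array([1.0, 1.0])         # current point; minimizer of 1/2 x^T A x is the origin
g = A @ x                        # gradient: [100, 1], skewed toward the stiff axis

# Steepest descent direction is dominated by the high-curvature coordinate:
assert abs(g[0]) / abs(g[1]) == 100.0

# The inverse-Hessian-preconditioned step equalizes curvature and
# reaches the minimizer in a single unit step:
newton_dir = -np.linalg.solve(A, g)
assert np.allclose(x + newton_dir, 0.0)
```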

Common Confusions

Watch Out

BFGS approximates the Hessian, not the gradient

BFGS does not modify the gradient direction in an ad hoc way. It builds a principled approximation to the Hessian (or its inverse) that satisfies a consistency condition (the secant equation). The resulting search direction is a quasi-Newton direction: close to the true Newton direction when the approximation is accurate.

Watch Out

L-BFGS is not just 'BFGS with less memory'

L-BFGS does not simply truncate the BFGS matrix. It implicitly represents $H_t$ as a product of rank-2 updates applied to a scaled identity, and computes the matrix-vector product $H_t g$ without ever forming $H_t$. The two-loop recursion is the key algorithmic insight. The resulting approximation is different from full BFGS (it effectively "forgets" old curvature information), but in practice this limited memory is sufficient.

Summary

  • Quasi-Newton methods approximate the Hessian using only gradient differences
  • The secant condition $B_{t+1} s_t = y_t$ is the fundamental constraint
  • BFGS uses a rank-2 update preserving positive definiteness: $O(d^2)$ per step
  • L-BFGS stores only $m$ vector pairs: $O(md)$ per step with two-loop recursion
  • Superlinear convergence for strongly convex problems with Wolfe line search
  • L-BFGS is the default for medium-scale smooth optimization
  • The Hessian approximation acts as a preconditioner, equalizing curvature

Exercises

ExerciseCore

Problem

Derive the secant condition from a first-order Taylor expansion of $\nabla f$ around $x^{(t)}$ evaluated at $x^{(t+1)}$. Specifically, show that $\nabla^2 f(x^{(t)}) s_t \approx y_t$ under the assumption that the Hessian is approximately constant.

ExerciseAdvanced

Problem

Consider BFGS initialized with $H_0 = I$ on a strictly convex quadratic $f(x) = \frac{1}{2} x^\top A x - b^\top x$ with $A \succ 0$. Show that after $d$ steps with exact line searches, $H_d = A^{-1}$ (i.e., BFGS recovers the exact inverse Hessian in at most $d$ steps on a quadratic). This is called the finite termination property.

References

Canonical:

  • Nocedal & Wright, Numerical Optimization (2006), Chapters 6-7
  • Dennis & Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations (1996)

Current:

  • Nocedal, "Updating Quasi-Newton Matrices with Limited Storage" (1980). The L-BFGS paper
  • Liu & Nocedal, "On the Limited Memory BFGS Method for Large Scale Optimization" (1989)

Next Topics

The natural next step from quasi-Newton methods:

  • Proximal gradient methods: when the objective has a non-smooth regularizer (e.g., $\ell_1$), quasi-Newton cannot be applied directly

Last reviewed: April 2026
