
Numerical Optimization

Line Search Methods

Choose step sizes that satisfy sufficient-decrease and curvature conditions: the Armijo and Wolfe conditions, backtracking, and why step size selection makes or breaks gradient descent.


Why This Matters

Gradient descent tells you which direction to move (negative gradient) but not how far. If you step too far, you overshoot and the objective increases. If you step too little, convergence is painfully slow. Line search methods automate step size selection with provable guarantees: the objective decreases enough at each step to ensure convergence.

Every practical optimizer (L-BFGS, conjugate gradient, Newton's method) relies on a line search to select step sizes. Getting this right is essential.

Mental Model

You are standing on a hilly landscape and you know the downhill direction. But how far should you walk before stopping? Walk too far and you end up climbing again. Walk too little and it takes forever to reach the valley.

Line search says: walk until you have descended "enough" (Armijo condition), and optionally until the slope has flattened "enough" (Wolfe condition). Backtracking is the simplest strategy: try a big step, and if it does not satisfy the condition, halve it and try again.

Formal Setup and Notation

We want to minimize $f: \mathbb{R}^d \to \mathbb{R}$ where $f$ is smooth. At iteration $k$, we have a descent direction $p_k$ (e.g., $p_k = -\nabla f(x_k)$) and we want to choose a step size $\alpha_k > 0$:

$$x_{k+1} = x_k + \alpha_k p_k$$

Definition

Armijo (Sufficient Decrease) Condition

The step size $\alpha$ satisfies the Armijo condition with parameter $c_1 \in (0, 1)$ if:

$$f(x_k + \alpha p_k) \leq f(x_k) + c_1 \alpha \nabla f(x_k)^T p_k$$

The right side is a linear approximation of $f$ along $p_k$ with slope reduced by the factor $c_1$. Typical choice: $c_1 = 10^{-4}$. This ensures the objective decreases by at least a fraction of what the linear model predicts.
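The condition is cheap to evaluate directly. A minimal sketch in Python; the function, point, and step sizes below are illustrative choices, not taken from the text:

```python
import numpy as np

def armijo_holds(f, grad_f, x, p, alpha, c1=1e-4):
    """Check the sufficient-decrease (Armijo) condition at step size alpha."""
    return f(x + alpha * p) <= f(x) + c1 * alpha * np.dot(grad_f(x), p)

# Illustrative example: f(x) = x^2, starting at x = 1, steepest-descent direction.
f = lambda x: float(x @ x)
grad_f = lambda x: 2 * x
x = np.array([1.0])
p = -grad_f(x)

print(armijo_holds(f, grad_f, x, p, alpha=0.5))   # True: moderate step decreases f enough
print(armijo_holds(f, grad_f, x, p, alpha=2.0))   # False: overshoots past the minimum
```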

Definition

Curvature (Wolfe) Condition

The step size $\alpha$ satisfies the curvature condition with parameter $c_2 \in (c_1, 1)$ if:

$$\nabla f(x_k + \alpha p_k)^T p_k \geq c_2 \nabla f(x_k)^T p_k$$

This prevents $\alpha$ from being too small: it requires that the slope along $p_k$ has flattened sufficiently (become less negative) compared to the slope at the starting point. Typical choices: $c_2 = 0.9$ for Newton and quasi-Newton methods, $c_2 = 0.1$ for conjugate gradient.

Definition

Wolfe Conditions

A step size satisfying both the Armijo condition and the curvature condition is said to satisfy the Wolfe conditions. The strong Wolfe conditions replace the curvature condition with $|\nabla f(x_k + \alpha p_k)^T p_k| \leq c_2 |\nabla f(x_k)^T p_k|$, which also prevents the step from being too large.
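Both conditions (and the strong variant) can be tested with one extra gradient evaluation. A hedged sketch; the 1D quadratic below is chosen purely for illustration:

```python
import numpy as np

def wolfe_conditions(f, grad_f, x, p, alpha, c1=1e-4, c2=0.9, strong=False):
    """Return (armijo, curvature) booleans for a candidate step size alpha."""
    g0 = np.dot(grad_f(x), p)                 # directional derivative at x (< 0 for descent)
    g1 = np.dot(grad_f(x + alpha * p), p)     # directional derivative at the trial point
    armijo = f(x + alpha * p) <= f(x) + c1 * alpha * g0
    curvature = abs(g1) <= c2 * abs(g0) if strong else g1 >= c2 * g0
    return armijo, curvature

# Illustrative 1D quadratic: f(x) = x^2 / 2, minimized at 0.
f = lambda x: 0.5 * x * x
grad_f = lambda x: x

print(wolfe_conditions(f, grad_f, 1.0, -1.0, alpha=1.0))    # (True, True): lands at the minimum
print(wolfe_conditions(f, grad_f, 1.0, -1.0, alpha=0.01))   # (True, False): step too small, slope barely changed
```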

Core Definitions

Backtracking line search is the simplest method to find a step size satisfying the Armijo condition:

  1. Start with $\alpha = \alpha_0$ (e.g., $\alpha_0 = 1$)
  2. While $f(x_k + \alpha p_k) > f(x_k) + c_1 \alpha \nabla f(x_k)^T p_k$: set $\alpha \leftarrow \rho \alpha$ where $\rho \in (0, 1)$ (e.g., $\rho = 0.5$)
  3. Return $\alpha$

This always terminates because the Armijo condition holds for all sufficiently small $\alpha > 0$ (by Taylor expansion).
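The three steps above fit in a few lines of code. A minimal sketch; the test function and starting point are illustrative, and `max_iter` is a hypothetical safeguard against floating-point edge cases:

```python
import numpy as np

def backtracking(f, grad_f, x, p, alpha0=1.0, c1=1e-4, rho=0.5, max_iter=50):
    """Backtracking line search: shrink alpha until the Armijo condition holds."""
    alpha = alpha0
    fx = f(x)
    slope = c1 * np.dot(grad_f(x), p)   # c1 times the directional derivative (< 0)
    for _ in range(max_iter):
        if f(x + alpha * p) <= fx + alpha * slope:
            return alpha
        alpha *= rho
    return alpha  # safeguard; in exact arithmetic the loop always terminates

# Illustrative example: f(x) = x^2 from x = 3 along the steepest-descent direction.
f = lambda x: x**2
grad_f = lambda x: 2 * x
x, p = 3.0, -6.0                        # p = -f'(3)
print(backtracking(f, grad_f, x, p))    # 0.5: alpha = 1 overshoots to x = -3, alpha = 0.5 lands at 0
```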

Exact line search finds $\alpha^* = \arg\min_{\alpha > 0} f(x_k + \alpha p_k)$. This is rarely used in practice because: (a) it requires solving a one-dimensional optimization problem at each step, and (b) inexact line search with Wolfe conditions gives the same convergence guarantees for much less work.

Main Theorems

Theorem

Convergence of Gradient Descent with Backtracking

Statement

Assume $\nabla f$ is $L$-Lipschitz and $f$ is bounded below by $f^*$. Gradient descent with backtracking line search (Armijo condition with parameter $c_1 \in (0,1)$, contraction factor $\rho \in (0,1)$, initial step $\alpha_0$) satisfies:

$$\min_{k=0,\ldots,K-1} \|\nabla f(x_k)\|^2 \leq \frac{f(x_0) - f^*}{K \cdot c_1 \cdot \min(\alpha_0, \rho/L)}$$

In particular, $\|\nabla f(x_k)\| \to 0$ as $k \to \infty$.

Intuition

The Armijo condition guarantees that each step makes enough progress. Backtracking ensures the step size is never too large (we shrink until Armijo holds) and never too small (we stop shrinking as soon as Armijo holds). The accepted step size is at least $\min(\alpha_0, \rho/L)$, ensuring bounded progress per iteration. Over $K$ iterations, at least one gradient must be small.

Proof Sketch

At each iteration, the Armijo condition gives $f(x_k) - f(x_{k+1}) \geq c_1 \alpha_k \|\nabla f(x_k)\|^2$. Backtracking ensures $\alpha_k \geq \min(\alpha_0, \rho/L)$ (if the initial step was rejected, the accepted step is at least $\rho$ times the threshold $1/L$). Summing over iterations: $f(x_0) - f^* \geq c_1 \min(\alpha_0, \rho/L) \sum_{k=0}^{K-1} \|\nabla f(x_k)\|^2$. Divide both sides by $K$ and bound the sum below by $K$ times the smallest term.
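The bound is easy to check empirically. A sketch on a strongly convex quadratic; the matrix, starting point, and iteration count are illustrative choices:

```python
import numpy as np

# f(x) = 0.5 x^T A x with A = diag(1, 10), so L = 10 and f* = 0.
A = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
L, f_star, c1, rho, alpha0 = 10.0, 0.0, 1e-4, 0.5, 1.0

x0 = np.array([5.0, 5.0])
x, grad_norms, K = x0, [], 100
for _ in range(K):
    g = grad(x)
    grad_norms.append(g @ g)
    alpha = alpha0
    while f(x - alpha * g) > f(x) - c1 * alpha * (g @ g):  # Armijo check
        alpha *= rho
    x = x - alpha * g

bound = (f(x0) - f_star) / (K * c1 * min(alpha0, rho / L))
print(min(grad_norms) <= bound)   # True: the theorem's bound holds
```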

Why It Matters

This shows gradient descent converges to a stationary point without knowing the Lipschitz constant $L$ in advance. The backtracking line search automatically adapts the step size. You only need $f$ to be smooth and bounded below.

Failure Mode

This guarantees convergence to a stationary point, not a global minimum. For nonconvex $f$, the algorithm may converge to a saddle point or local minimum. Also, if the initial step size $\alpha_0$ is too small, the algorithm wastes iterations taking unnecessarily short steps, since backtracking only shrinks the step and never grows it.

Canonical Examples

Example

Backtracking on a quadratic

For $f(x) = \frac{1}{2}x^T A x - b^T x$ with $A$ positive definite, the gradient is $\nabla f(x) = Ax - b$ and $L = \lambda_{\max}(A)$. With $\alpha_0 = 1$ and $\rho = 0.5$, backtracking accepts $\alpha = 1$ whenever the condition number is moderate. For ill-conditioned problems, it may need a few halvings to find an acceptable step.
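This effect of conditioning can be demonstrated directly. A sketch, with $b = 0$ and the diagonal matrices chosen for illustration:

```python
import numpy as np

def backtracking_alpha(A, b, x, c1=1e-4, rho=0.5):
    """Step size accepted by backtracking for f(x) = 0.5 x^T A x - b^T x."""
    f = lambda z: 0.5 * z @ A @ z - b @ z
    g = A @ x - b                         # gradient; search direction is -g
    alpha = 1.0
    while f(x - alpha * g) > f(x) - c1 * alpha * (g @ g):
        alpha *= rho
    return alpha

b = np.zeros(2)
x = np.array([1.0, 1.0])
print(backtracking_alpha(np.diag([1.0, 2.0]), b, x))     # 1.0: well-conditioned, full step accepted
print(backtracking_alpha(np.diag([1.0, 100.0]), b, x))   # much smaller: several halvings needed
```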

Common Confusions

Watch Out

Armijo alone is not enough for quasi-Newton methods

The Armijo condition alone allows step sizes that are too small. For quasi-Newton methods (like L-BFGS), you also need the curvature condition (Wolfe conditions) to guarantee $s_k^T y_k > 0$, which keeps the approximate Hessian update positive definite. Without the curvature condition, the quasi-Newton updates can degenerate.

Watch Out

Exact line search is not better in general

Exact line search gives the smallest possible value along the search direction, but this does not necessarily lead to the fewest total iterations. For quadratics, exact line search with steepest descent produces zigzagging. The Wolfe conditions are sufficient for convergence and much cheaper.
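The zigzag is easy to reproduce. For a quadratic $f(x) = \frac{1}{2}x^T A x$, the exact step along $-g$ has the closed form $\alpha^* = g^T g / (g^T A g)$; the matrix and starting point below are illustrative:

```python
import numpy as np

# Steepest descent with *exact* line search on an ill-conditioned quadratic.
A = np.diag([1.0, 10.0])
x = np.array([10.0, 1.0])
for k in range(6):
    g = A @ x
    alpha = (g @ g) / (g @ A @ g)   # closed-form exact minimizer along -g
    x = x - alpha * g
    print(x)                        # the second coordinate alternates sign: zigzag
```

Each iterate overshoots the valley floor in the steep direction, so successive steps are orthogonal and progress along the shallow direction is slow.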

Summary

  • Armijo condition: $f(x + \alpha p) \leq f(x) + c_1 \alpha \nabla f(x)^T p$
  • Wolfe adds curvature: the slope at the new point must be flatter than at the old point
  • Backtracking: start big, shrink by factor $\rho$ until Armijo holds
  • Step size too large: objective increases. Too small: slow convergence
  • Backtracking line search does not require knowing $L$

Exercises

ExerciseCore

Problem

For $f(x) = x^4$, $x_0 = 2$, direction $p = -f'(x_0) = -32$, and $c_1 = 10^{-4}$: does the step size $\alpha = 1$ satisfy the Armijo condition?

ExerciseAdvanced

Problem

Prove that for any $c_1 \in (0,1)$ and any descent direction $p_k$ (with $\nabla f(x_k)^T p_k < 0$), the Armijo condition is satisfied for all sufficiently small $\alpha > 0$.

References

Canonical:

  • Nocedal & Wright, Numerical Optimization (2006), Chapter 3
  • Armijo, "Minimization of Functions Having Lipschitz Continuous First Partial Derivatives" (1966)

Current:

  • Boyd & Vandenberghe, Convex Optimization (2004), Section 9.2
  • Hastie, Tibshirani & Friedman, The Elements of Statistical Learning (2009)

Last reviewed: April 2026