Numerical Optimization
Line Search Methods
Choose step sizes that guarantee sufficient decrease and curvature conditions. Armijo and Wolfe conditions, backtracking, and why step size selection makes or breaks gradient descent.
Why This Matters
Gradient descent tells you which direction to move (negative gradient) but not how far. If you step too far, you overshoot and the objective increases. If you step too little, convergence is painfully slow. Line search methods automate step size selection with provable guarantees: the objective decreases enough at each step to ensure convergence.
Every practical optimizer (L-BFGS, conjugate gradient, Newton's method) relies on a line search to select step sizes. Getting this right is essential.
Mental Model
You are standing on a hilly landscape and you know the downhill direction. But how far should you walk before stopping? Walk too far and you end up climbing again. Walk too little and it takes forever to reach the valley.
Line search says: walk until you have descended "enough" (Armijo condition), and optionally until the slope has flattened "enough" (Wolfe condition). Backtracking is the simplest strategy: try a big step, and if it does not satisfy the condition, halve it and try again.
Formal Setup and Notation
We want to minimize $f: \mathbb{R}^n \to \mathbb{R}$ where $f$ is smooth. At iteration $k$, we have a descent direction $p_k$ (e.g., $p_k = -\nabla f(x_k)$) and we want to choose a step size $\alpha_k > 0$ in the update $x_{k+1} = x_k + \alpha_k p_k$.
Armijo (Sufficient Decrease) Condition
The step size $\alpha$ satisfies the Armijo condition with parameter $c_1 \in (0, 1)$ if:

$$f(x_k + \alpha p_k) \le f(x_k) + c_1 \alpha \nabla f(x_k)^T p_k$$

The right side is a linear approximation of $f$ along $p_k$ with slope reduced by the factor $c_1$. Typical choice: $c_1 = 10^{-4}$. This ensures the objective decreases by at least a fraction $c_1$ of what the linear model predicts.
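The condition is a one-line check in code. A minimal sketch in NumPy (the helper name `armijo_holds` is ours, not from any library):

```python
import numpy as np

def armijo_holds(f, grad_f, x, p, alpha, c1=1e-4):
    """Check f(x + alpha*p) <= f(x) + c1*alpha*grad_f(x)^T p."""
    return f(x + alpha * p) <= f(x) + c1 * alpha * (grad_f(x) @ p)

# Example: f(x) = x^T x, steepest-descent direction from (1, 1)
f = lambda x: x @ x
grad = lambda x: 2 * x
x = np.array([1.0, 1.0])
p = -grad(x)
print(armijo_holds(f, grad, x, p, alpha=0.1))  # True: small step decreases f enough
print(armijo_holds(f, grad, x, p, alpha=1.0))  # False: full step overshoots past the minimum
```

Note that $\alpha = 1$ fails here because the step jumps from $(1,1)$ to $(-1,-1)$, leaving the objective unchanged rather than decreased.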
Curvature (Wolfe) Condition
The step size $\alpha$ satisfies the curvature condition with parameter $c_2 \in (c_1, 1)$ if:

$$\nabla f(x_k + \alpha p_k)^T p_k \ge c_2 \nabla f(x_k)^T p_k$$

This prevents $\alpha$ from being too small: it requires that the slope at the new point has decreased sufficiently in magnitude compared to the slope at the starting point. Typical choice: $c_2 = 0.9$ for Newton/quasi-Newton methods, $c_2 = 0.1$ for conjugate gradient.
Wolfe Conditions
A step size satisfying both the Armijo condition and the curvature condition is said to satisfy the Wolfe conditions. The strong Wolfe conditions replace the curvature condition with $|\nabla f(x_k + \alpha p_k)^T p_k| \le c_2 |\nabla f(x_k)^T p_k|$, which also prevents the step from being too large.
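Both variants can be checked together. A sketch combining the two tests (the function name and example are ours):

```python
import numpy as np

def wolfe_holds(f, grad_f, x, p, alpha, c1=1e-4, c2=0.9, strong=False):
    """Check the (strong) Wolfe conditions at step size alpha."""
    g0 = grad_f(x) @ p                     # initial slope; negative for a descent direction
    armijo = f(x + alpha * p) <= f(x) + c1 * alpha * g0
    g1 = grad_f(x + alpha * p) @ p         # slope at the trial point
    curvature = abs(g1) <= c2 * abs(g0) if strong else g1 >= c2 * g0
    return armijo and curvature

f = lambda x: x @ x
grad = lambda x: 2 * x
x = np.array([1.0, 1.0])
p = -grad(x)
print(wolfe_holds(f, grad, x, p, alpha=0.1))   # True: enough decrease, slope flattened
print(wolfe_holds(f, grad, x, p, alpha=0.01))  # False: Armijo holds but slope is still steep
```

The second call shows why the curvature condition matters: $\alpha = 0.01$ decreases the objective, yet the step is too timid to make real progress.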
Core Definitions
Backtracking line search is the simplest method to find a step size satisfying the Armijo condition:
- Start with $\alpha = \alpha_0$ (e.g., $\alpha_0 = 1$)
- While $f(x_k + \alpha p_k) > f(x_k) + c_1 \alpha \nabla f(x_k)^T p_k$: set $\alpha \leftarrow \rho \alpha$ where $\rho \in (0, 1)$ (e.g., $\rho = 1/2$)
- Return $\alpha_k = \alpha$
This always terminates because the Armijo condition holds for all sufficiently small $\alpha > 0$ (by Taylor expansion).
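The loop above translates directly to code. A minimal sketch in NumPy, applied to one steepest-descent step on the Rosenbrock function (the function names are ours):

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, p, alpha0=1.0, rho=0.5, c1=1e-4):
    """Shrink alpha by factor rho until the Armijo condition holds.
    Terminates because Armijo holds for all small enough alpha."""
    fx = f(x)
    slope = grad_f(x) @ p              # directional derivative; < 0 for a descent direction
    alpha = alpha0
    while f(x + alpha * p) > fx + c1 * alpha * slope:
        alpha *= rho
    return alpha

def rosen(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def rosen_grad(x):
    return np.array([
        -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
        200 * (x[1] - x[0]**2),
    ])

x = np.array([-1.2, 1.0])
p = -rosen_grad(x)
alpha = backtracking_line_search(rosen, rosen_grad, x, p)
assert rosen(x + alpha * p) < rosen(x)  # sufficient decrease achieved
```

With $\alpha_0 = 1$ the full step wildly overshoots on this narrow valley, so several halvings occur before the Armijo test passes.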
Exact line search finds $\alpha_k = \arg\min_{\alpha > 0} f(x_k + \alpha p_k)$. This is rarely used in practice because: (a) it requires solving a one-dimensional optimization problem at each step, and (b) inexact line search with Wolfe conditions gives the same convergence guarantees for much less work.
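One case where exact line search is cheap is a quadratic objective, where the one-dimensional problem has a closed form. A sketch with our notation (diagonal $A$ chosen for illustration):

```python
import numpy as np

# For f(x) = 0.5 x^T A x - b^T x, minimizing f(x + alpha*p) over alpha gives
# the closed form alpha* = -(grad f(x)^T p) / (p^T A p).
def exact_step(A, b, x, p):
    g = A @ x - b
    return -(g @ p) / (p @ A @ p)

A = np.diag([2.0, 10.0])
b = np.zeros(2)
x = np.array([1.0, 1.0])
p = -(A @ x - b)                   # steepest-descent direction
alpha = exact_step(A, b, x, p)
g_new = A @ (x + alpha * p) - b
print(abs(g_new @ p) < 1e-9)       # True: the new gradient is orthogonal to p
```

The orthogonality $\nabla f(x_{k+1})^T p_k = 0$ is the defining property of the exact minimizer, and it is exactly what produces the zigzagging of steepest descent discussed later.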
Main Theorems
Convergence of Gradient Descent with Backtracking
Statement
Suppose $f$ is bounded below with $L$-Lipschitz gradient. Gradient descent with backtracking line search (Armijo condition with parameter $c_1$, contraction factor $\rho$) satisfies:

$$\min_{0 \le i \le k} \|\nabla f(x_i)\|^2 \le \frac{f(x_0) - \inf f}{c_1 \alpha_{\min} (k+1)}, \qquad \alpha_{\min} = \min\!\left(\alpha_0, \frac{2\rho(1-c_1)}{L}\right)$$

In particular, $\nabla f(x_k) \to 0$ as $k \to \infty$.
Intuition
The Armijo condition guarantees that each step makes enough progress. The backtracking ensures the step size is never too large (we shrink until Armijo holds) and never too small (we stop shrinking as soon as Armijo holds). The accepted step size is at least $\min(\alpha_0, 2\rho(1-c_1)/L)$, ensuring bounded progress. Over $k$ iterations, at least one gradient must be small.
Proof Sketch
At each iteration, the Armijo condition (with $p_k = -\nabla f(x_k)$) gives $f(x_{k+1}) \le f(x_k) - c_1 \alpha_k \|\nabla f(x_k)\|^2$. The backtracking ensures $\alpha_k \ge \min(\alpha_0, 2\rho(1-c_1)/L)$: if the initial step was rejected, the accepted step is at least $\rho$ times the threshold $2(1-c_1)/L$ below which Armijo always holds. Summing over iterations: $c_1 \alpha_{\min} \sum_{i=0}^{k} \|\nabla f(x_i)\|^2 \le f(x_0) - \inf f$. Divide both sides by $c_1 \alpha_{\min}(k+1)$ and bound the sum below by $(k+1)$ times its smallest term.
Why It Matters
This shows gradient descent converges to a stationary point without knowing the Lipschitz constant $L$ in advance. The backtracking line search automatically adapts the step size. You only need $f$ to be smooth (with Lipschitz gradient) and bounded below.
Failure Mode
This guarantees convergence to a stationary point, not a global minimum. For nonconvex $f$, the algorithm may converge to a saddle point or a local minimum. Also, if the initial step size $\alpha_0$ is too small, the algorithm wastes iterations by not taking advantage of larger admissible steps.
Canonical Examples
Backtracking on a quadratic
For $f(x) = \frac{1}{2} x^T A x$ with $A$ positive definite, the gradient is $\nabla f(x) = Ax$ and the Hessian is $\nabla^2 f(x) = A$. With $\alpha_0 = 1$ and $c_1 = 10^{-4}$, backtracking accepts $\alpha = 1$ whenever the condition number of $A$ is moderate. For ill-conditioned problems, it may need a few halvings to find an acceptable step.
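A quick numeric check of this behavior (the diagonal matrices and starting point are our illustrative choices):

```python
import numpy as np

def halvings_to_accept(A, x, alpha0=1.0, rho=0.5, c1=1e-4):
    """Count how many times backtracking halves alpha before the Armijo
    condition accepts, for f(x) = 0.5 x^T A x and p = -grad f(x)."""
    g = A @ x                      # gradient of the quadratic
    p = -g
    f = lambda y: 0.5 * (y @ A @ y)
    alpha, n = alpha0, 0
    while f(x + alpha * p) > f(x) + c1 * alpha * (g @ p):
        alpha *= rho
        n += 1
    return n

x = np.array([1.0, 1.0])
print(halvings_to_accept(np.diag([1.0, 2.0]), x))    # 0: alpha = 1 accepted immediately
print(halvings_to_accept(np.diag([1.0, 100.0]), x))  # 6: several halvings needed
```

With condition number 2 the unit step is accepted at once; with condition number 100 the full step overshoots badly along the stiff coordinate, and $\alpha$ must shrink to $2^{-6}$ before Armijo holds.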
Common Confusions
Armijo alone is not enough for quasi-Newton methods
The Armijo condition alone allows step sizes that are too small. For quasi-Newton methods (like L-BFGS), you also need the curvature condition (Wolfe conditions) to ensure that the approximate Hessian update is well-defined. Without the curvature condition, the quasi-Newton updates can degenerate.
Exact line search is not better in general
Exact line search gives the smallest possible value along the search direction, but this does not necessarily lead to the fewest total iterations. For quadratics, exact line search with steepest descent produces zigzagging. The Wolfe conditions are sufficient for convergence and much cheaper.
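The zigzag effect is easy to observe numerically. An illustrative sketch (the diagonal quadratic and classic worst-case starting point $(\kappa, 1)$ are our choices):

```python
import numpy as np

def steepest_descent_exact(A, x, tol=1e-8, max_iter=100000):
    """Steepest descent on f(x) = 0.5 x^T A x with exact line search;
    returns the number of iterations to reach a small gradient."""
    iters = 0
    g = A @ x
    while np.linalg.norm(g) > tol and iters < max_iter:
        alpha = (g @ g) / (g @ A @ g)  # exact minimizer along -g
        x = x - alpha * g
        g = A @ x
        iters += 1
    return iters

for kappa in (2.0, 100.0):
    A = np.diag([1.0, kappa])
    n = steepest_descent_exact(A, np.array([kappa, 1.0]))
    print(kappa, n)   # iteration count grows sharply with the condition number
```

Even though every step is "optimal" along its direction, consecutive directions are orthogonal and the iterates bounce across the narrow valley, so the iteration count scales with the condition number $\kappa$.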
Summary
- Armijo condition: $f(x_k + \alpha p_k) \le f(x_k) + c_1 \alpha \nabla f(x_k)^T p_k$
- Wolfe adds curvature: gradient at new point is flatter than at old point
- Backtracking: start big, shrink by a factor $\rho$ until Armijo holds
- Step size too large: objective increases. Too small: slow convergence
- Backtracking line search does not require knowing the Lipschitz constant $L$
Exercises
Problem
For , , direction , and : does the step size satisfy the Armijo condition?
Problem
Prove that for any $c_1 \in (0, 1)$ and any descent direction $p$ (with $\nabla f(x)^T p < 0$), the Armijo condition is satisfied for all sufficiently small $\alpha > 0$.
References
Canonical:
- Nocedal & Wright, Numerical Optimization (2006), Chapter 3
- Armijo, "Minimization of Functions Having Lipschitz Continuous First Partial Derivatives" (1966)
Current:
- Boyd & Vandenberghe, Convex Optimization (2004), Section 9.2
- Hastie, Tibshirani & Friedman, The Elements of Statistical Learning (2009)
Next Topics
The natural next steps from line search methods:
- Proximal gradient methods: line search for composite objectives
- Quasi-Newton methods: using curvature information with Wolfe conditions
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Convex Optimization Basics (Layer 1)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
Builds on This
- Conjugate Gradient Methods (Layer 2)