
Numerical Optimization

Line Search Methods

Choose step sizes that satisfy sufficient-decrease and curvature conditions: the Armijo and Wolfe conditions, backtracking, and why step size selection makes or breaks gradient descent.


Why This Matters

Gradient descent tells you which direction to move (negative gradient) but not how far. If you step too far, you overshoot and the objective increases. If you step too little, convergence is painfully slow. Line search methods automate step size selection with provable guarantees: the objective decreases enough at each step to ensure convergence.

Every practical optimizer (L-BFGS, conjugate gradient, Newton's method) relies on a line search to select step sizes. Getting this right is essential.

Mental Model

You are standing on a hilly landscape and you know the downhill direction. But how far should you walk before stopping? Walk too far and you end up climbing again. Walk too little and it takes forever to reach the valley.

Line search says: walk until you have descended "enough" (Armijo condition), and optionally until the slope has flattened "enough" (Wolfe condition). Backtracking is the simplest strategy: try a big step, and if it does not satisfy the condition, halve it and try again.

Formal Setup and Notation

We want to minimize $f: \mathbb{R}^d \to \mathbb{R}$ where $f$ is smooth. At iteration $k$, we have a descent direction $p_k$ (e.g., $p_k = -\nabla f(x_k)$) and we want to choose a step size $\alpha_k > 0$:

$$x_{k+1} = x_k + \alpha_k p_k$$

Definition

Armijo (Sufficient Decrease) Condition

The step size $\alpha$ satisfies the Armijo condition with parameter $c_1 \in (0, 1)$ if:

$$f(x_k + \alpha p_k) \leq f(x_k) + c_1 \alpha \nabla f(x_k)^T p_k$$

The right side is a linear approximation of $f$ along $p_k$ with slope reduced by the factor $c_1$. Typical choice: $c_1 = 10^{-4}$. This ensures the objective decreases by at least a fraction of what the linear model predicts.
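The condition is cheap to evaluate directly. A minimal sketch in Python; the function, point, and step sizes below are illustrative choices, not taken from the text:

```python
import numpy as np

def armijo_holds(f, grad_f, x, p, alpha, c1=1e-4):
    """Check the sufficient-decrease (Armijo) condition at step size alpha."""
    return f(x + alpha * p) <= f(x) + c1 * alpha * np.dot(grad_f(x), p)

# Illustrative example: f(x) = x^2, starting at x = 1, steepest-descent direction.
f = lambda x: float(x @ x)
grad_f = lambda x: 2 * x
x = np.array([1.0])
p = -grad_f(x)

print(armijo_holds(f, grad_f, x, p, alpha=0.5))   # True: moderate step decreases f enough
print(armijo_holds(f, grad_f, x, p, alpha=2.0))   # False: overshoots past the minimum
```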

Definition

Curvature (Wolfe) Condition

The step size $\alpha$ satisfies the curvature condition with parameter $c_2 \in (c_1, 1)$ if:

$$\nabla f(x_k + \alpha p_k)^T p_k \geq c_2 \nabla f(x_k)^T p_k$$

This prevents $\alpha$ from being too small: it requires that the slope along $p_k$ has flattened sufficiently (become less negative) compared to the slope at the starting point. Typical choices: $c_2 = 0.9$ for Newton and quasi-Newton methods, $c_2 = 0.1$ for conjugate gradient.

Definition

Wolfe Conditions

A step size satisfying both the Armijo condition and the curvature condition is said to satisfy the Wolfe conditions. The strong Wolfe conditions replace the curvature condition with $|\nabla f(x_k + \alpha p_k)^T p_k| \leq c_2 |\nabla f(x_k)^T p_k|$, which also prevents the step from being too large.
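Both conditions (and the strong variant) can be tested with one extra gradient evaluation. A hedged sketch; the 1D quadratic below is chosen purely for illustration:

```python
import numpy as np

def wolfe_conditions(f, grad_f, x, p, alpha, c1=1e-4, c2=0.9, strong=False):
    """Return (armijo, curvature) booleans for a candidate step size alpha."""
    g0 = np.dot(grad_f(x), p)                 # directional derivative at x (< 0 for descent)
    g1 = np.dot(grad_f(x + alpha * p), p)     # directional derivative at the trial point
    armijo = f(x + alpha * p) <= f(x) + c1 * alpha * g0
    curvature = abs(g1) <= c2 * abs(g0) if strong else g1 >= c2 * g0
    return armijo, curvature

# Illustrative 1D quadratic: f(x) = x^2 / 2, minimized at 0.
f = lambda x: 0.5 * x * x
grad_f = lambda x: x

print(wolfe_conditions(f, grad_f, 1.0, -1.0, alpha=1.0))    # (True, True): lands at the minimum
print(wolfe_conditions(f, grad_f, 1.0, -1.0, alpha=0.01))   # (True, False): step too small, slope barely changed
```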

Core Definitions

Backtracking line search is the simplest method to find a step size satisfying the Armijo condition:

  1. Start with $\alpha = \alpha_0$ (e.g., $\alpha_0 = 1$)
  2. While $f(x_k + \alpha p_k) > f(x_k) + c_1 \alpha \nabla f(x_k)^T p_k$: set $\alpha \leftarrow \rho \alpha$ where $\rho \in (0, 1)$ (e.g., $\rho = 0.5$)
  3. Return $\alpha$

This always terminates because the Armijo condition holds for all sufficiently small $\alpha > 0$ (by Taylor expansion).
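The three steps above fit in a few lines of code. A minimal sketch; the test function and starting point are illustrative, and `max_iter` is a hypothetical safeguard against floating-point edge cases:

```python
import numpy as np

def backtracking(f, grad_f, x, p, alpha0=1.0, c1=1e-4, rho=0.5, max_iter=50):
    """Backtracking line search: shrink alpha until the Armijo condition holds."""
    alpha = alpha0
    fx = f(x)
    slope = c1 * np.dot(grad_f(x), p)   # c1 times the directional derivative (< 0)
    for _ in range(max_iter):
        if f(x + alpha * p) <= fx + alpha * slope:
            return alpha
        alpha *= rho
    return alpha  # safeguard; in exact arithmetic the loop always terminates

# Illustrative example: f(x) = x^2 from x = 3 along the steepest-descent direction.
f = lambda x: x**2
grad_f = lambda x: 2 * x
x, p = 3.0, -6.0                        # p = -f'(3)
print(backtracking(f, grad_f, x, p))    # 0.5: alpha = 1 overshoots to x = -3, alpha = 0.5 lands at 0
```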

Exact line search finds $\alpha^* = \arg\min_{\alpha > 0} f(x_k + \alpha p_k)$. This is rarely used in practice because: (a) it requires solving a one-dimensional optimization problem at each step, and (b) inexact line search with Wolfe conditions gives the same convergence guarantees for much less work.

Main Theorems

Theorem

Convergence of Gradient Descent with Backtracking

Statement

Assume $\nabla f$ is $L$-Lipschitz and $f$ is bounded below by $f^*$. Gradient descent with backtracking line search (Armijo condition with parameter $c_1 \in (0,1)$, contraction factor $\rho \in (0,1)$, initial step $\alpha_0$) satisfies:

$$\min_{k=0,\ldots,K-1} \|\nabla f(x_k)\|^2 \leq \frac{f(x_0) - f^*}{K \cdot c_1 \cdot \min(\alpha_0, \rho/L)}$$

In particular, $\|\nabla f(x_k)\| \to 0$ as $k \to \infty$.

Intuition

The Armijo condition guarantees that each step makes enough progress. Backtracking ensures the step size is never too large (we shrink until Armijo holds) and never too small (we stop shrinking as soon as Armijo holds). The accepted step size is at least $\min(\alpha_0, \rho/L)$, ensuring bounded progress per iteration. Over $K$ iterations, at least one gradient must be small.

Proof Sketch

At each iteration, the Armijo condition gives $f(x_k) - f(x_{k+1}) \geq c_1 \alpha_k \|\nabla f(x_k)\|^2$. Backtracking ensures $\alpha_k \geq \min(\alpha_0, \rho/L)$ (if the initial step was rejected, the accepted step is at least $\rho$ times the threshold $1/L$). Summing over iterations: $f(x_0) - f^* \geq c_1 \min(\alpha_0, \rho/L) \sum_{k=0}^{K-1} \|\nabla f(x_k)\|^2$. Divide both sides by $K$ and bound the sum below by $K$ times the smallest term.
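The bound is easy to check empirically. A sketch on a strongly convex quadratic; the matrix, starting point, and iteration count are illustrative choices:

```python
import numpy as np

# f(x) = 0.5 x^T A x with A = diag(1, 10), so L = 10 and f* = 0.
A = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
L, f_star, c1, rho, alpha0 = 10.0, 0.0, 1e-4, 0.5, 1.0

x0 = np.array([5.0, 5.0])
x, grad_norms, K = x0, [], 100
for _ in range(K):
    g = grad(x)
    grad_norms.append(g @ g)
    alpha = alpha0
    while f(x - alpha * g) > f(x) - c1 * alpha * (g @ g):  # Armijo check
        alpha *= rho
    x = x - alpha * g

bound = (f(x0) - f_star) / (K * c1 * min(alpha0, rho / L))
print(min(grad_norms) <= bound)   # True: the theorem's bound holds
```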

Why It Matters

This shows gradient descent converges to a stationary point without knowing the Lipschitz constant $L$ in advance. The backtracking line search automatically adapts the step size. You only need $f$ to be smooth and bounded below.

Failure Mode

This guarantees convergence to a stationary point, not a global minimum. For nonconvex $f$, the algorithm may converge to a saddle point or local minimum. Also, if the initial step size $\alpha_0$ is too small, the algorithm wastes iterations taking unnecessarily short steps, since backtracking only shrinks the step and never grows it.

Canonical Examples

Example

Backtracking on a quadratic

For $f(x) = \frac{1}{2}x^T A x - b^T x$ with $A$ positive definite, the gradient is $\nabla f(x) = Ax - b$ and $L = \lambda_{\max}(A)$. With $\alpha_0 = 1$ and $\rho = 0.5$, backtracking accepts $\alpha = 1$ whenever the condition number is moderate. For ill-conditioned problems, it may need a few halvings to find an acceptable step.
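This effect of conditioning can be demonstrated directly. A sketch, with $b = 0$ and the diagonal matrices chosen for illustration:

```python
import numpy as np

def backtracking_alpha(A, b, x, c1=1e-4, rho=0.5):
    """Step size accepted by backtracking for f(x) = 0.5 x^T A x - b^T x."""
    f = lambda z: 0.5 * z @ A @ z - b @ z
    g = A @ x - b                         # gradient; search direction is -g
    alpha = 1.0
    while f(x - alpha * g) > f(x) - c1 * alpha * (g @ g):
        alpha *= rho
    return alpha

b = np.zeros(2)
x = np.array([1.0, 1.0])
print(backtracking_alpha(np.diag([1.0, 2.0]), b, x))     # 1.0: well-conditioned, full step accepted
print(backtracking_alpha(np.diag([1.0, 100.0]), b, x))   # much smaller: several halvings needed
```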

Common Confusions

Watch Out

Armijo alone is not enough for quasi-Newton methods

The Armijo condition alone allows step sizes that are too small. For quasi-Newton methods (like L-BFGS), you also need the curvature condition (Wolfe conditions) to guarantee $s_k^T y_k > 0$, which keeps the approximate Hessian update positive definite. Without the curvature condition, the quasi-Newton updates can degenerate.

Watch Out

Exact line search is not better in general

Exact line search gives the smallest possible value along the search direction, but this does not necessarily lead to the fewest total iterations. For quadratics, exact line search with steepest descent produces zigzagging. The Wolfe conditions are sufficient for convergence and much cheaper.
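The zigzag is easy to reproduce. For a quadratic $f(x) = \frac{1}{2}x^T A x$, the exact step along $-g$ has the closed form $\alpha^* = g^T g / (g^T A g)$; the matrix and starting point below are illustrative:

```python
import numpy as np

# Steepest descent with *exact* line search on an ill-conditioned quadratic.
A = np.diag([1.0, 10.0])
x = np.array([10.0, 1.0])
for k in range(6):
    g = A @ x
    alpha = (g @ g) / (g @ A @ g)   # closed-form exact minimizer along -g
    x = x - alpha * g
    print(x)                        # the second coordinate alternates sign: zigzag
```

Each iterate overshoots the valley floor in the steep direction, so successive steps are orthogonal and progress along the shallow direction is slow.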

Summary

  • Armijo condition: $f(x + \alpha p) \leq f(x) + c_1 \alpha \nabla f(x)^T p$
  • Wolfe adds curvature: the slope at the new point must be flatter than at the old point
  • Backtracking: start big, shrink by factor $\rho$ until Armijo holds
  • Step size too large: objective increases. Too small: slow convergence
  • Backtracking line search does not require knowing $L$

Exercises

ExerciseCore

Problem

For $f(x) = x^4$, $x_0 = 2$, direction $p = -f'(x_0) = -32$, and $c_1 = 10^{-4}$: does the step size $\alpha = 1$ satisfy the Armijo condition?

ExerciseAdvanced

Problem

Prove that for any $c_1 \in (0,1)$ and any descent direction $p_k$ (with $\nabla f(x_k)^T p_k < 0$), the Armijo condition is satisfied for all sufficiently small $\alpha > 0$.

References

Canonical:

  • Nocedal & Wright, Numerical Optimization (2006), Chapter 3
  • Armijo, "Minimization of Functions Having Lipschitz Continuous First Partial Derivatives" (1966)

Current:

  • Boyd & Vandenberghe, Convex Optimization (2004), Section 9.2
  • Hastie, Tibshirani & Friedman, The Elements of Statistical Learning (2009)

Last reviewed: April 2026