
Numerical Optimization

Ascent Algorithms and Hill Climbing

Gradient ascent, hill climbing, and their failure modes: local optima, plateaus, and ridges. Random restarts and simulated annealing as strategies for escaping local optima.


Why This Matters

Maximization and minimization are the same problem with a sign flip, but the convention matters. In maximum likelihood estimation, you maximize a likelihood. In reinforcement learning, you maximize a reward. In variational inference, you maximize an ELBO. Many formulations are naturally phrased as maximization, and the algorithms that solve them are ascent methods.

Understanding how ascent methods fail (local optima, plateaus, ridges) is prerequisite to understanding why more sophisticated methods (momentum, adaptive learning rates, second-order methods) exist.

Gradient Ascent

Definition

Gradient Ascent

Given a differentiable function $f: \mathbb{R}^d \to \mathbb{R}$ to maximize, gradient ascent updates:

$$\theta_{k+1} = \theta_k + \eta \nabla f(\theta_k)$$

where $\eta > 0$ is the step size (learning rate). This moves in the direction of steepest increase of $f$.

This is gradient descent on $-f$. Every convergence result from convex optimization applies after the sign flip. If $f$ is concave, every local maximum is global, and gradient ascent converges to it.
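As a minimal sketch (a hypothetical concave quadratic, assuming NumPy is available), the update above in code:

```python
import numpy as np

def gradient_ascent(grad, theta0, eta, steps):
    """Repeat theta <- theta + eta * grad(theta) for a fixed number of steps."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta + eta * grad(theta)
    return theta

# Maximize the concave f(theta) = -(theta - 3)^2; unique maximum at theta = 3.
grad_f = lambda theta: -2.0 * (theta - 3.0)

theta_hat = gradient_ascent(grad_f, theta0=[0.0], eta=0.1, steps=200)
print(theta_hat)  # prints [3.]
```

Because this $f$ is concave, any starting point and any sufficiently small step size reach the same (global) maximum.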

Hill Climbing

Definition

Hill Climbing

Hill climbing is the discrete analogue of gradient ascent. Given a finite set of solutions $S$, a neighborhood function $N: S \to 2^S$, and an objective $f: S \to \mathbb{R}$:

  1. Start at $s_0 \in S$.
  2. At each step, move to $s_{k+1} = \arg\max_{s \in N(s_k)} f(s)$ if $f(s_{k+1}) > f(s_k)$.
  3. If no neighbor improves $f$, stop. You are at a local maximum.

Hill climbing is greedy: it accepts only improving moves. It terminates in finite time on a finite solution space because $f$ strictly increases at each step and $|S|$ is finite.
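The three steps above can be sketched directly in code, here on a hypothetical one-dimensional discrete problem (states 0 through 10 with $\pm 1$ neighborhoods):

```python
def hill_climb(f, neighbors, s0):
    """Steepest-ascent hill climbing: move to the best neighbor while it improves f."""
    s = s0
    while True:
        best = max(neighbors(s), key=f, default=s)
        if f(best) <= f(s):
            return s  # local maximum: no neighbor improves f
        s = best

# Toy example: unimodal objective over the integers 0..10.
f = lambda s: -(s - 7) ** 2
neighbors = lambda s: [n for n in (s - 1, s + 1) if 0 <= n <= 10]

print(hill_climb(f, neighbors, s0=0))  # prints 7
```

On this unimodal toy landscape the unique local maximum is also global, so hill climbing succeeds from any start; the failure modes below appear once the landscape has multiple peaks.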

Main Theorems

Theorem

Gradient Ascent Convergence for Smooth Concave Functions

Statement

If $f: \mathbb{R}^d \to \mathbb{R}$ is concave with $L$-Lipschitz continuous gradient, and gradient ascent uses step size $\eta \leq 1/L$, then after $k$ iterations:

$$f(\theta^*) - f(\theta_k) \leq \frac{\|\theta_0 - \theta^*\|^2}{2\eta k}$$

where $\theta^* = \arg\max f$. With $\eta = 1/L$, this gives an $O(1/k)$ convergence rate.

Intuition

The Lipschitz gradient condition means $f$ does not curve too sharply, so a step of size $1/L$ always makes progress. Concavity ensures there are no local maxima to trap the algorithm. The $O(1/k)$ convergence rate means halving the error requires doubling the number of iterations.

Proof Sketch

The proof mirrors gradient descent on $-f$. By the descent lemma applied to $-f$: $-f(\theta_{k+1}) \leq -f(\theta_k) - \eta \|\nabla f(\theta_k)\|^2 + \frac{L\eta^2}{2}\|\nabla f(\theta_k)\|^2$. With $\eta \leq 1/L$, each step decreases $-f$ (increases $f$). Telescope the inequalities and use concavity ($f(\theta^*) - f(\theta_k) \leq \langle \nabla f(\theta_k), \theta^* - \theta_k \rangle$) to get the bound.

Why It Matters

This is the baseline convergence result for maximization. Every more sophisticated method (heavy ball, Nesterov acceleration, Adam) is measured against this $O(1/k)$ rate. If your function is concave and smooth, plain gradient ascent with step size $1/L$ is a safe default.
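The bound can be checked numerically. A sketch on a hypothetical concave quadratic (assuming NumPy), where $\theta^* = 0$ and $f(\theta^*) = 0$:

```python
import numpy as np

# Concave quadratic f(theta) = -1/2 theta^T A theta; maximum f(0) = 0.
A = np.diag([1.0, 10.0])             # -Hessian; L = 10 is its largest eigenvalue
f = lambda th: -0.5 * th @ A @ th
grad = lambda th: -A @ th
eta = 1.0 / 10.0                     # step size eta = 1/L

theta0 = np.array([5.0, 5.0])
theta = theta0.copy()
for k in range(1, 101):
    theta = theta + eta * grad(theta)
    bound = theta0 @ theta0 / (2 * eta * k)  # ||theta_0 - theta*||^2 / (2 eta k)
    assert 0.0 - f(theta) <= bound           # suboptimality gap obeys the bound
print("O(1/k) bound holds for 100 iterations")
```

On quadratics the actual gap shrinks geometrically, so the $O(1/k)$ bound is loose here; it is tight only in the worst case over all smooth concave functions.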

Failure Mode

For non-concave $f$, gradient ascent converges to a stationary point ($\nabla f = 0$), which may be a local maximum, a saddle point, or (in degenerate cases, such as initialization exactly at one) even a local minimum. The guarantee of global optimality requires concavity.

Failure Modes of Ascent

Three landscape features cause ascent methods to fail on non-concave problems:

Local optima. A point $\theta$ where $\nabla f(\theta) = 0$ and the Hessian is negative definite, but $f(\theta) < f(\theta^*)$. Gradient ascent stops here because there is no local direction of improvement. In discrete settings, hill climbing stops at any solution that is better than all its neighbors.

Plateaus. A region where $\nabla f \approx 0$ but the function is not at a maximum. Gradient ascent takes tiny steps and appears to converge, but the true maximum is far away. Plateaus arise in symmetric parameterizations and in functions with flat regions.

Ridges. A narrow region of high function values where the gradient points nearly perpendicular to the ridge direction. Gradient ascent oscillates across the ridge instead of following it upward. This is the maximization analogue of the narrow valley problem in gradient descent.
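The local-optimum trap can be shown with a sketch on a hypothetical non-concave function: two gradient-ascent runs, one starting in each basin of attraction.

```python
def ascend(grad, theta, eta=0.01, steps=5000):
    """Gradient ascent from a single starting point."""
    for _ in range(steps):
        theta += eta * grad(theta)
    return theta

# Non-concave f(t) = -t^4 + 2t^2 + 0.5t: local max near t = -0.93,
# global max near t = +1.06, separated by a local minimum near t = -0.13.
f = lambda t: -t**4 + 2 * t**2 + 0.5 * t
grad = lambda t: -4 * t**3 + 4 * t + 0.5

left = ascend(grad, -1.5)   # trapped at the lower peak near -0.93
right = ascend(grad, 0.5)   # reaches the global peak near +1.06
print(f(left) < f(right))   # prints True: the left run is stuck
```

Both runs satisfy $\nabla f \approx 0$ at termination; nothing in the update rule distinguishes the local peak from the global one.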

Escaping Local Optima

Random Restarts

Run hill climbing (or gradient ascent) from $m$ random initial points. Return the best solution found. If the fraction of the search space that leads to the global optimum is $p$, then the probability of missing it after $m$ restarts is $(1 - p)^m$. For $p = 0.01$ and $m = 500$, this probability is $(0.99)^{500} \approx 0.0066$.

Random restarts are simple and embarrassingly parallel. They work well when basins of attraction are not too small.
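A sketch of random restarts on a hypothetical multimodal objective (local maximum near $-0.93$, global maximum near $+1.06$): only starts in the global basin reach the better peak, and the best endpoint is kept.

```python
import random

def ascend(grad, theta, eta=0.01, steps=5000):
    """Gradient ascent from a single starting point."""
    for _ in range(steps):
        theta += eta * grad(theta)
    return theta

# Hypothetical non-concave objective with two peaks of unequal height.
f = lambda t: -t**4 + 2 * t**2 + 0.5 * t
grad = lambda t: -4 * t**3 + 4 * t + 0.5

def random_restarts(m, seed=0):
    """Ascend from m uniform random starts in [-2, 2]; return the best endpoint."""
    rng = random.Random(seed)
    return max((ascend(grad, rng.uniform(-2.0, 2.0)) for _ in range(m)), key=f)

best = random_restarts(m=10)
print(round(best, 2))  # near 1.06, the global maximum
```

Each restart is independent, so the $m$ runs can be distributed across workers with no coordination beyond the final max.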

Simulated Annealing

Theorem

Asymptotic Convergence of Simulated Annealing

Statement

Let $S$ be a finite solution space with objective $f$ to maximize. Simulated annealing accepts a neighbor $s'$ of the current solution $s$ with probability:

$$P(\text{accept}) = \begin{cases} 1 & \text{if } f(s') \geq f(s) \\ \exp\left(\frac{f(s') - f(s)}{T(k)}\right) & \text{if } f(s') < f(s) \end{cases}$$

With a logarithmic cooling schedule $T(k) = c / \log(k+1)$, where $c \geq d^*$ (the depth of the deepest non-global local optimum), the probability of being at the global optimum converges to 1 as $k \to \infty$.

Intuition

At high temperature, the algorithm behaves like a random walk, exploring the entire space. As the temperature decreases, it increasingly favors improving moves. Logarithmic cooling is slow enough to ensure the algorithm does not get permanently trapped in any local optimum. The constant $c$ must be large enough to escape the deepest trap.

Proof Sketch

Model the search as a non-homogeneous Markov chain. At temperature $T$, the stationary distribution is the Boltzmann distribution $\pi_T(s) \propto \exp(f(s)/T)$. As $T \to 0$, this distribution concentrates on global maxima. The logarithmic cooling schedule ensures convergence to the stationary distribution at each temperature (using detailed balance and mixing time arguments).

Why It Matters

Simulated annealing provides a theoretical guarantee that no local optimum can permanently trap the search. In practice, the logarithmic schedule is too slow; geometric schedules ($T_{k+1} = \alpha T_k$ with $\alpha \approx 0.95$) are used instead and work well empirically despite lacking the asymptotic guarantee.
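A sketch of the practical variant with geometric cooling, on a hypothetical discrete landscape (cooled more slowly than $\alpha = 0.95$ so this short toy run reliably escapes):

```python
import math
import random

def simulated_annealing(f, neighbors, s0, T0=10.0, alpha=0.999, steps=5000, seed=0):
    """Maximize f over a finite space; geometric cooling T <- alpha * T."""
    rng = random.Random(seed)
    s, T, best = s0, T0, s0
    for _ in range(steps):
        cand = rng.choice(neighbors(s))
        delta = f(cand) - f(s)
        # Always accept improvements; accept worse moves with prob exp(delta / T).
        if delta >= 0 or rng.random() < math.exp(delta / T):
            s = cand
        if f(s) > f(best):
            best = s
        T *= alpha
    return best

# Toy landscape on states 0..10: local maximum at s = 2, global maximum at s = 8.
vals = [0, 3, 5, 3, 1, 0, 2, 6, 9, 6, 3]
f = lambda s: vals[s]
neighbors = lambda s: [n for n in (s - 1, s + 1) if 0 <= n <= 10]

best = simulated_annealing(f, neighbors, s0=0)
print(best, f(best))  # hill climbing from 0 would stop at s = 2; SA typically escapes
```

Early on, downhill moves across the valley at $s = 5$ are accepted often enough to reach the right-hand basin; as $T$ shrinks, the search behaves like plain hill climbing and settles on the best peak it has found.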

Failure Mode

The logarithmic schedule is impractically slow. For a problem with $|S| = 10^{10}$, convergence requires more iterations than are feasible. Practical implementations use faster cooling and may get stuck. Also, the constant $c$ requires knowing $d^*$, the depth of the deepest local optimum, which is typically unknown.

Connection to Optimization Landscape Analysis

The success of ascent methods depends on the landscape structure:

  • Convex/concave landscapes: gradient ascent finds the global optimum. No need for restarts or annealing.
  • Multimodal with a few large basins: random restarts work well.
  • Multimodal with many similar-quality optima: tabu search or simulated annealing explores more systematically.
  • Rugged landscapes with many narrow optima: most local methods struggle. Consider population-based methods (evolutionary algorithms) or problem reformulation.

The choice of method depends on what you know about the landscape. If you know nothing, random restarts are the safest default. If you know the landscape has structure (e.g., solutions near the optimum share features), intensification strategies like tabu search are more efficient.

Common Confusions

Watch Out

Gradient ascent is not hill climbing

Gradient ascent operates on continuous, differentiable functions and uses gradient information. Hill climbing operates on discrete solution spaces and compares objective values of neighbors. They share the same greedy logic (move toward improvement) but apply in different settings.

Watch Out

Simulated annealing is not guaranteed to find the optimum in practice

The asymptotic convergence theorem requires a logarithmic cooling schedule that is too slow for real problems. Practical cooling schedules (geometric, adaptive) are faster but lose the theoretical guarantee. Simulated annealing in practice is a heuristic with good empirical performance, not an exact algorithm.

Summary

  • Gradient ascent: $\theta_{k+1} = \theta_k + \eta \nabla f(\theta_k)$, converges at $O(1/k)$ for concave functions
  • Hill climbing: discrete greedy ascent, terminates at local optima
  • Three failure modes: local optima, plateaus, ridges
  • Random restarts: simple, parallel, work when basins are not too small
  • Simulated annealing: accept worse moves with decreasing probability, asymptotic guarantee requires impractical cooling schedule
  • Method choice depends on landscape structure

Exercises

Exercise (Core)

Problem

A function $f: \mathbb{R}^2 \to \mathbb{R}$ has a global maximum at $(3, 3)$ with $f(3,3) = 10$, and a local maximum at $(0, 0)$ with $f(0,0) = 7$. The basin of attraction for $(3, 3)$ covers 20% of a reasonable initialization region. How many random restarts do you need so that the probability of finding the global maximum is at least 0.99?

Exercise (Advanced)

Problem

In simulated annealing with a geometric cooling schedule $T_k = T_0 \cdot \alpha^k$, the depth of the deepest local optimum is $d^* = 5$, and the initial temperature is $T_0 = 10$. At what iteration $k$ does the acceptance probability of a move that worsens the objective by exactly $d^* = 5$ drop below 0.01? Use $\alpha = 0.95$.

References

Canonical:

  • Kirkpatrick, Gelatt & Vecchi, "Optimization by Simulated Annealing" (1983), Science
  • Russell & Norvig, Artificial Intelligence: A Modern Approach (4th ed., 2021), Chapter 4.1

Current:

  • Boyd & Vandenberghe, Convex Optimization (2004), Chapter 9 (gradient methods)
  • Gendreau & Potvin (eds.), Handbook of Metaheuristics (3rd ed., 2019), Chapters 1-2

Next Topics

  • Tabu search: memory-based local search that prevents cycling

Last reviewed: April 2026
