Numerical Optimization
Ascent Algorithms and Hill Climbing
Gradient ascent, hill climbing, and their failure modes: local optima, plateaus, and ridges. Random restarts and simulated annealing as strategies for escaping local optima.
Prerequisites
Why This Matters
Maximization and minimization are the same problem with a sign flip, but the convention matters. In maximum likelihood estimation, you maximize a likelihood. In reinforcement learning, you maximize a reward. In variational inference, you maximize an ELBO. Many formulations are naturally phrased as maximization, and the algorithms that solve them are ascent methods.
Understanding how ascent methods fail (local optima, plateaus, ridges) is prerequisite to understanding why more sophisticated methods (momentum, adaptive learning rates, second-order methods) exist.
Gradient Ascent
Gradient Ascent
Given a differentiable function to maximize, gradient ascent updates:
where is the step size (learning rate). This moves in the direction of steepest increase of .
This is gradient descent on . Every convergence result from convex optimization applies after the sign flip. If is concave, every local maximum is global, and gradient ascent converges to it.
Hill Climbing
Hill Climbing
Hill climbing is the discrete analogue of gradient ascent. Given a finite set of solutions , a neighborhood function , and an objective :
- Start at .
- At each step, move to if .
- If no neighbor improves , stop. You are at a local maximum.
Hill climbing is greedy: it only accepts improving moves. It terminates in finite time on finite solution spaces because strictly increases at each step and is finite.
Main Theorems
Gradient Ascent Convergence for Smooth Concave Functions
Statement
If is concave with -Lipschitz continuous gradient, and gradient ascent uses step size , then after iterations:
where . With , this gives an convergence rate.
Intuition
The Lipschitz gradient condition means does not curve too sharply, so a step of size always makes progress. The concavity ensures there are no local maxima to trap the algorithm. The convergence rate means halving the error requires doubling the iterations.
Proof Sketch
The proof mirrors gradient descent on . By the descent lemma applied to : . With , each step decreases (increases ). Telescope the inequalities and use concavity () to get the bound.
Why It Matters
This is the baseline convergence result for maximization. Every more sophisticated method (heavy ball, Nesterov acceleration, Adam) is measured against this rate. If your function is concave and smooth, plain gradient ascent with step size is a safe default.
Failure Mode
For non-concave , gradient ascent converges to a stationary point (), but this could be a local maximum, a saddle point, or even a local minimum (if approaching from the wrong direction with bad step size). The guarantee of global optimality requires concavity.
Failure Modes of Ascent
Three landscape features cause ascent methods to fail on non-concave problems:
Local optima. A point where and the Hessian is negative definite, but . Gradient ascent stops here because there is no local direction of improvement. In discrete settings, hill climbing stops at any solution that is better than all its neighbors.
Plateaus. A region where but the function is not at a maximum. Gradient ascent takes tiny steps and appears to converge, but the true maximum is far away. Plateaus arise in symmetric parameterizations and in functions with flat regions.
Ridges. A narrow region of high function values where the gradient points nearly perpendicular to the ridge direction. Gradient ascent oscillates across the ridge instead of following it upward. This is the maximization analogue of the narrow valley problem in gradient descent.
Escaping Local Optima
Random Restarts
Run hill climbing (or gradient ascent) from random initial points. Return the best solution found. If the fraction of the search space that leads to the global optimum is , then the probability of missing it after restarts is . For and , this probability is .
Random restarts are simple and embarrassingly parallel. They work well when basins of attraction are not too small.
Simulated Annealing
Asymptotic Convergence of Simulated Annealing
Statement
Let be a finite solution space with objective to maximize. Simulated annealing accepts a neighbor of current solution with probability:
With a logarithmic cooling schedule where (the depth of the deepest non-global local optimum), the probability of being at the global optimum converges to 1 as .
Intuition
At high temperature, the algorithm behaves like a random walk, exploring the entire space. As the temperature decreases, it increasingly favors improving moves. Logarithmic cooling is slow enough to ensure the algorithm does not get permanently trapped in any local optimum. The constant must be large enough to escape the deepest trap.
Proof Sketch
Model the search as a non-homogeneous Markov chain. At temperature , the stationary distribution is the Boltzmann distribution . As , this distribution concentrates on global maxima. The logarithmic cooling schedule ensures convergence to the stationary distribution at each temperature (using detailed balance and mixing time arguments).
Why It Matters
Simulated annealing provides a theoretical guarantee that no local optimum can permanently trap the search. In practice, the logarithmic schedule is too slow; geometric schedules ( with ) are used instead and work well empirically despite lacking the asymptotic guarantee.
Failure Mode
The logarithmic schedule is impractically slow. For a problem with , convergence requires more iterations than are feasible. Practical implementations use faster cooling and may get stuck. Also, the constant requires knowing , the depth of the deepest local optimum, which is typically unknown.
Connection to Optimization Landscape Analysis
The success of ascent methods depends on the landscape structure:
- Convex/concave landscapes: gradient ascent finds the global optimum. No need for restarts or annealing.
- Multimodal with a few large basins: random restarts work well.
- Multimodal with many similar-quality optima: tabu search or simulated annealing explores more systematically.
- Rugged landscapes with many narrow optima: most local methods struggle. Consider population-based methods (evolutionary algorithms) or problem reformulation.
The choice of method depends on what you know about the landscape. If you know nothing, random restarts are the safest default. If you know the landscape has structure (e.g., solutions near the optimum share features), intensification strategies like tabu search are more efficient.
Common Confusions
Gradient ascent is not hill climbing
Gradient ascent operates on continuous, differentiable functions and uses gradient information. Hill climbing operates on discrete solution spaces and compares objective values of neighbors. They share the same greedy logic (move toward improvement) but apply in different settings.
Simulated annealing is not guaranteed to find the optimum in practice
The asymptotic convergence theorem requires a logarithmic cooling schedule that is too slow for real problems. Practical cooling schedules (geometric, adaptive) are faster but lose the theoretical guarantee. Simulated annealing in practice is a heuristic with good empirical performance, not an exact algorithm.
Summary
- Gradient ascent: , converges at for concave functions
- Hill climbing: discrete greedy ascent, terminates at local optima
- Three failure modes: local optima, plateaus, ridges
- Random restarts: simple, parallel, work when basins are not too small
- Simulated annealing: accept worse moves with decreasing probability, asymptotic guarantee requires impractical cooling schedule
- Method choice depends on landscape structure
Exercises
Problem
A function has a global maximum at with , and a local maximum at with . The basin of attraction for covers 20% of a reasonable initialization region. How many random restarts do you need so that the probability of finding the global maximum is at least 0.99?
Problem
In simulated annealing with a geometric cooling schedule , the depth of the deepest local optimum is , and the initial temperature is . At what iteration does the acceptance probability of a move that worsens the objective by exactly drop below 0.01? Use .
References
Canonical:
- Kirkpatrick, Gelatt & Vecchi, "Optimization by Simulated Annealing" (1983), Science
- Russell & Norvig, Artificial Intelligence: A Modern Approach (4th ed., 2021), Chapter 4.1
Current:
- Boyd & Vandenberghe, Convex Optimization (2004), Chapter 9 (gradient methods)
- Gendreau & Potvin (eds.), Handbook of Metaheuristics (3rd ed., 2019), Chapters 1-2
Next Topics
- Tabu search: memory-based local search that prevents cycling
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Convex Optimization BasicsLayer 1
- Differentiation in RnLayer 0A
- Sets, Functions, and RelationsLayer 0A
- Basic Logic and Proof TechniquesLayer 0A
- Matrix Operations and PropertiesLayer 0A