
Numerical Optimization

Ascent Algorithms and Hill Climbing

Gradient ascent, hill climbing, and their failure modes: local optima, plateaus, and ridges. Random restarts and simulated annealing as strategies for escaping local optima.


Why This Matters

Maximization and minimization are the same problem with a sign flip, but the convention matters. In maximum likelihood estimation, you maximize a likelihood. In reinforcement learning, you maximize a reward. In variational inference, you maximize an ELBO. Many formulations are naturally phrased as maximization, and the algorithms that solve them are ascent methods.

Understanding how ascent methods fail (local optima, plateaus, ridges) is prerequisite to understanding why more sophisticated methods (momentum, adaptive learning rates, second-order methods) exist.

Gradient Ascent

Definition

Gradient Ascent

Given a differentiable function $f: \mathbb{R}^d \to \mathbb{R}$ to maximize, gradient ascent updates:

$$\theta_{k+1} = \theta_k + \eta \nabla f(\theta_k)$$

where $\eta > 0$ is the step size (learning rate). This moves in the direction of steepest increase of $f$.

This is gradient descent on $-f$. Every convergence result from convex optimization applies after the sign flip. If $f$ is concave, every local maximum is global, and gradient ascent converges to it.
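As a minimal sketch (a hypothetical concave quadratic, assuming NumPy is available), the update above in code:

```python
import numpy as np

def gradient_ascent(grad, theta0, eta, steps):
    """Repeat theta <- theta + eta * grad(theta) for a fixed number of steps."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta + eta * grad(theta)
    return theta

# Maximize the concave f(theta) = -(theta - 3)^2; unique maximum at theta = 3.
grad_f = lambda theta: -2.0 * (theta - 3.0)

theta_hat = gradient_ascent(grad_f, theta0=[0.0], eta=0.1, steps=200)
print(theta_hat)  # prints [3.]
```

Because this $f$ is concave, any starting point and any sufficiently small step size reach the same (global) maximum.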

Hill Climbing

Definition

Hill Climbing

Hill climbing is the discrete analogue of gradient ascent. Given a finite set of solutions $S$, a neighborhood function $N: S \to 2^S$, and an objective $f: S \to \mathbb{R}$:

  1. Start at $s_0 \in S$.
  2. At each step, move to $s_{k+1} = \arg\max_{s \in N(s_k)} f(s)$ if $f(s_{k+1}) > f(s_k)$.
  3. If no neighbor improves $f$, stop. You are at a local maximum.

Hill climbing is greedy: it accepts only improving moves. It terminates in finite time on a finite solution space because $f$ strictly increases at each step and $|S|$ is finite.
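The three steps above can be sketched directly in code, here on a hypothetical one-dimensional discrete problem (states 0 through 10 with $\pm 1$ neighborhoods):

```python
def hill_climb(f, neighbors, s0):
    """Steepest-ascent hill climbing: move to the best neighbor while it improves f."""
    s = s0
    while True:
        best = max(neighbors(s), key=f, default=s)
        if f(best) <= f(s):
            return s  # local maximum: no neighbor improves f
        s = best

# Toy example: unimodal objective over the integers 0..10.
f = lambda s: -(s - 7) ** 2
neighbors = lambda s: [n for n in (s - 1, s + 1) if 0 <= n <= 10]

print(hill_climb(f, neighbors, s0=0))  # prints 7
```

On this unimodal toy landscape the unique local maximum is also global, so hill climbing succeeds from any start; the failure modes below appear once the landscape has multiple peaks.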

Main Theorems

Theorem

Gradient Ascent Convergence for Smooth Concave Functions

Statement

If $f: \mathbb{R}^d \to \mathbb{R}$ is concave with $L$-Lipschitz continuous gradient, and gradient ascent uses step size $\eta \leq 1/L$, then after $k$ iterations:

$$f(\theta^*) - f(\theta_k) \leq \frac{\|\theta_0 - \theta^*\|^2}{2\eta k}$$

where $\theta^* = \arg\max f$. With $\eta = 1/L$, this gives an $O(1/k)$ convergence rate.

Intuition

The Lipschitz gradient condition means $f$ does not curve too sharply, so a step of size $1/L$ always makes progress. Concavity ensures there are no local maxima to trap the algorithm. The $O(1/k)$ convergence rate means halving the error requires doubling the number of iterations.

Proof Sketch

The proof mirrors gradient descent on $-f$. By the descent lemma applied to $-f$: $-f(\theta_{k+1}) \leq -f(\theta_k) - \eta \|\nabla f(\theta_k)\|^2 + \frac{L\eta^2}{2}\|\nabla f(\theta_k)\|^2$. With $\eta \leq 1/L$, each step decreases $-f$ (increases $f$). Telescope the inequalities and use concavity ($f(\theta^*) - f(\theta_k) \leq \langle \nabla f(\theta_k), \theta^* - \theta_k \rangle$) to get the bound.

Why It Matters

This is the baseline convergence result for maximization. Every more sophisticated method (heavy ball, Nesterov acceleration, Adam) is measured against this $O(1/k)$ rate. If your function is concave and smooth, plain gradient ascent with step size $1/L$ is a safe default.
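The bound can be checked numerically. A sketch on a hypothetical concave quadratic (assuming NumPy), where $\theta^* = 0$ and $f(\theta^*) = 0$:

```python
import numpy as np

# Concave quadratic f(theta) = -1/2 theta^T A theta; maximum f(0) = 0.
A = np.diag([1.0, 10.0])             # -Hessian; L = 10 is its largest eigenvalue
f = lambda th: -0.5 * th @ A @ th
grad = lambda th: -A @ th
eta = 1.0 / 10.0                     # step size eta = 1/L

theta0 = np.array([5.0, 5.0])
theta = theta0.copy()
for k in range(1, 101):
    theta = theta + eta * grad(theta)
    bound = theta0 @ theta0 / (2 * eta * k)  # ||theta_0 - theta*||^2 / (2 eta k)
    assert 0.0 - f(theta) <= bound           # suboptimality gap obeys the bound
print("O(1/k) bound holds for 100 iterations")
```

On quadratics the actual gap shrinks geometrically, so the $O(1/k)$ bound is loose here; it is tight only in the worst case over all smooth concave functions.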

Failure Mode

For non-concave $f$, gradient ascent converges to a stationary point ($\nabla f = 0$), which may be a local maximum, a saddle point, or (in degenerate cases, such as initialization exactly at one) even a local minimum. The guarantee of global optimality requires concavity.

Failure Modes of Ascent

Three landscape features cause ascent methods to fail on non-concave problems:

Local optima. A point $\theta$ where $\nabla f(\theta) = 0$ and the Hessian is negative definite, but $f(\theta) < f(\theta^*)$. Gradient ascent stops here because there is no local direction of improvement. In discrete settings, hill climbing stops at any solution that is better than all its neighbors.

Plateaus. A region where $\nabla f \approx 0$ but the function is not at a maximum. Gradient ascent takes tiny steps and appears to converge, but the true maximum is far away. Plateaus arise in symmetric parameterizations and in functions with flat regions.

Ridges. A narrow region of high function values where the gradient points nearly perpendicular to the ridge direction. Gradient ascent oscillates across the ridge instead of following it upward. This is the maximization analogue of the narrow valley problem in gradient descent.
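The local-optimum trap can be shown with a sketch on a hypothetical non-concave function: two gradient-ascent runs, one starting in each basin of attraction.

```python
def ascend(grad, theta, eta=0.01, steps=5000):
    """Gradient ascent from a single starting point."""
    for _ in range(steps):
        theta += eta * grad(theta)
    return theta

# Non-concave f(t) = -t^4 + 2t^2 + 0.5t: local max near t = -0.93,
# global max near t = +1.06, separated by a local minimum near t = -0.13.
f = lambda t: -t**4 + 2 * t**2 + 0.5 * t
grad = lambda t: -4 * t**3 + 4 * t + 0.5

left = ascend(grad, -1.5)   # trapped at the lower peak near -0.93
right = ascend(grad, 0.5)   # reaches the global peak near +1.06
print(f(left) < f(right))   # prints True: the left run is stuck
```

Both runs satisfy $\nabla f \approx 0$ at termination; nothing in the update rule distinguishes the local peak from the global one.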

Escaping Local Optima

Random Restarts

Run hill climbing (or gradient ascent) from $m$ random initial points. Return the best solution found. If the fraction of the search space that leads to the global optimum is $p$, then the probability of missing it after $m$ restarts is $(1 - p)^m$. For $p = 0.01$ and $m = 500$, this probability is $(0.99)^{500} \approx 0.0066$.

Random restarts are simple and embarrassingly parallel. They work well when basins of attraction are not too small.
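A sketch of random restarts on a hypothetical multimodal objective (local maximum near $-0.93$, global maximum near $+1.06$): only starts in the global basin reach the better peak, and the best endpoint is kept.

```python
import random

def ascend(grad, theta, eta=0.01, steps=5000):
    """Gradient ascent from a single starting point."""
    for _ in range(steps):
        theta += eta * grad(theta)
    return theta

# Hypothetical non-concave objective with two peaks of unequal height.
f = lambda t: -t**4 + 2 * t**2 + 0.5 * t
grad = lambda t: -4 * t**3 + 4 * t + 0.5

def random_restarts(m, seed=0):
    """Ascend from m uniform random starts in [-2, 2]; return the best endpoint."""
    rng = random.Random(seed)
    return max((ascend(grad, rng.uniform(-2.0, 2.0)) for _ in range(m)), key=f)

best = random_restarts(m=10)
print(round(best, 2))  # near 1.06, the global maximum
```

Each restart is independent, so the $m$ runs can be distributed across workers with no coordination beyond the final max.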

Simulated Annealing

Theorem

Asymptotic Convergence of Simulated Annealing

Statement

Let $S$ be a finite solution space with objective $f$ to maximize. Simulated annealing accepts a neighbor $s'$ of the current solution $s$ with probability:

$$P(\text{accept}) = \begin{cases} 1 & \text{if } f(s') \geq f(s) \\ \exp\left(\frac{f(s') - f(s)}{T(k)}\right) & \text{if } f(s') < f(s) \end{cases}$$

With a logarithmic cooling schedule $T(k) = c / \log(k+1)$, where $c \geq d^*$ (the depth of the deepest non-global local optimum), the probability of being at the global optimum converges to 1 as $k \to \infty$.

Intuition

At high temperature, the algorithm behaves like a random walk, exploring the entire space. As the temperature decreases, it increasingly favors improving moves. Logarithmic cooling is slow enough to ensure the algorithm does not get permanently trapped in any local optimum. The constant $c$ must be large enough to escape the deepest trap.

Proof Sketch

Model the search as a non-homogeneous Markov chain. At temperature $T$, the stationary distribution is the Boltzmann distribution $\pi_T(s) \propto \exp(f(s)/T)$. As $T \to 0$, this distribution concentrates on global maxima. The logarithmic cooling schedule ensures convergence to the stationary distribution at each temperature (using detailed balance and mixing time arguments).

Why It Matters

Simulated annealing provides a theoretical guarantee that no local optimum can permanently trap the search. In practice, the logarithmic schedule is too slow; geometric schedules ($T_{k+1} = \alpha T_k$ with $\alpha \approx 0.95$) are used instead and work well empirically despite lacking the asymptotic guarantee.
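A sketch of the practical variant with geometric cooling, on a hypothetical discrete landscape (cooled more slowly than $\alpha = 0.95$ so this short toy run reliably escapes):

```python
import math
import random

def simulated_annealing(f, neighbors, s0, T0=10.0, alpha=0.999, steps=5000, seed=0):
    """Maximize f over a finite space; geometric cooling T <- alpha * T."""
    rng = random.Random(seed)
    s, T, best = s0, T0, s0
    for _ in range(steps):
        cand = rng.choice(neighbors(s))
        delta = f(cand) - f(s)
        # Always accept improvements; accept worse moves with prob exp(delta / T).
        if delta >= 0 or rng.random() < math.exp(delta / T):
            s = cand
        if f(s) > f(best):
            best = s
        T *= alpha
    return best

# Toy landscape on states 0..10: local maximum at s = 2, global maximum at s = 8.
vals = [0, 3, 5, 3, 1, 0, 2, 6, 9, 6, 3]
f = lambda s: vals[s]
neighbors = lambda s: [n for n in (s - 1, s + 1) if 0 <= n <= 10]

best = simulated_annealing(f, neighbors, s0=0)
print(best, f(best))  # hill climbing from 0 would stop at s = 2; SA typically escapes
```

Early on, downhill moves across the valley at $s = 5$ are accepted often enough to reach the right-hand basin; as $T$ shrinks, the search behaves like plain hill climbing and settles on the best peak it has found.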

Failure Mode

The logarithmic schedule is impractically slow. For a problem with $|S| = 10^{10}$, convergence requires more iterations than are feasible. Practical implementations use faster cooling and may get stuck. Also, the constant $c$ requires knowing $d^*$, the depth of the deepest local optimum, which is typically unknown.

Connection to Optimization Landscape Analysis

The success of ascent methods depends on the landscape structure:

  • Convex/concave landscapes: gradient ascent finds the global optimum. No need for restarts or annealing.
  • Multimodal with a few large basins: random restarts work well.
  • Multimodal with many similar-quality optima: tabu search or simulated annealing explores more systematically.
  • Rugged landscapes with many narrow optima: most local methods struggle. Consider population-based methods (evolutionary algorithms) or problem reformulation.

The choice of method depends on what you know about the landscape. If you know nothing, random restarts are the safest default. If you know the landscape has structure (e.g., solutions near the optimum share features), intensification strategies like tabu search are more efficient.

Common Confusions

Watch Out

Gradient ascent is not hill climbing

Gradient ascent operates on continuous, differentiable functions and uses gradient information. Hill climbing operates on discrete solution spaces and compares objective values of neighbors. They share the same greedy logic (move toward improvement) but apply in different settings.

Watch Out

Simulated annealing is not guaranteed to find the optimum in practice

The asymptotic convergence theorem requires a logarithmic cooling schedule that is too slow for real problems. Practical cooling schedules (geometric, adaptive) are faster but lose the theoretical guarantee. Simulated annealing in practice is a heuristic with good empirical performance, not an exact algorithm.

Summary

  • Gradient ascent: $\theta_{k+1} = \theta_k + \eta \nabla f(\theta_k)$, converges at $O(1/k)$ for concave functions
  • Hill climbing: discrete greedy ascent, terminates at local optima
  • Three failure modes: local optima, plateaus, ridges
  • Random restarts: simple, parallel, work when basins are not too small
  • Simulated annealing: accept worse moves with decreasing probability, asymptotic guarantee requires impractical cooling schedule
  • Method choice depends on landscape structure

Exercises

Exercise (Core)

Problem

A function $f: \mathbb{R}^2 \to \mathbb{R}$ has a global maximum at $(3, 3)$ with $f(3,3) = 10$, and a local maximum at $(0, 0)$ with $f(0,0) = 7$. The basin of attraction for $(3, 3)$ covers 20% of a reasonable initialization region. How many random restarts do you need so that the probability of finding the global maximum is at least 0.99?

Exercise (Advanced)

Problem

In simulated annealing with a geometric cooling schedule $T_k = T_0 \cdot \alpha^k$, the depth of the deepest local optimum is $d^* = 5$, and the initial temperature is $T_0 = 10$. At what iteration $k$ does the acceptance probability of a move that worsens the objective by exactly $d^* = 5$ drop below 0.01? Use $\alpha = 0.95$.

References

Canonical:

  • Kirkpatrick, Gelatt & Vecchi, "Optimization by Simulated Annealing" (1983), Science
  • Russell & Norvig, Artificial Intelligence: A Modern Approach (4th ed., 2021), Chapter 4.1

Current:

  • Boyd & Vandenberghe, Convex Optimization (2004), Chapter 9 (gradient methods)
  • Gendreau & Potvin (eds.), Handbook of Metaheuristics (3rd ed., 2019), Chapters 1-2

Next Topics

  • Tabu search: memory-based local search that prevents cycling

Last reviewed: April 2026
