
Optimization Function Classes

Stability and Optimization Dynamics

Convergence of gradient descent for smooth and strongly convex objectives, the descent lemma, gradient flow as a continuous-time limit, Lyapunov stability analysis, and the edge of stability phenomenon.


Why This Matters

Optimization is the computational engine of machine learning. Every model you train uses some variant of gradient descent. Understanding the convergence theory tells you: how to set the learning rate, why smoothness and convexity matter, and what convergence rate to expect.

Modern ML theory increasingly analyzes training through the lens of dynamical systems. The continuous-time gradient flow limit \dot{x} = -\nabla f(x) reveals stability properties that discrete gradient descent inherits (or violates). The edge of stability phenomenon shows that neural network training operates in a regime where classical convergence theory barely applies.

Core Definitions

Definition

L-Smooth Function

A differentiable function f: \mathbb{R}^d \to \mathbb{R} is L-smooth if its gradient is L-Lipschitz:

\|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\| \quad \text{for all } x, y

Equivalently, f has bounded curvature: -LI \preceq \nabla^2 f(x) \preceq LI for all x (when f is twice differentiable).

Definition

Strongly Convex Function

A differentiable function f is \mu-strongly convex (\mu > 0) if:

f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2

for all x, y. The condition number is \kappa = L/\mu; larger \kappa means harder optimization.
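For a concrete instance, consider a quadratic f(x) = \frac{1}{2}x^T A x with symmetric positive definite A: the Hessian is A everywhere, so L and \mu are its extreme eigenvalues. A minimal numerical check (the matrix below is an arbitrary illustrative choice, not from the text):

```python
import numpy as np

# For f(x) = 0.5 * x^T A x with symmetric positive definite A, the
# Hessian is A everywhere, so L = lambda_max(A) and mu = lambda_min(A).
A = np.diag([1.0, 4.0, 100.0])  # illustrative matrix; eigenvalues 1, 4, 100

eigs = np.linalg.eigvalsh(A)    # sorted ascending
mu, L = eigs[0], eigs[-1]
kappa = L / mu

print(mu, L, kappa)             # mu = 1, L = 100, kappa = 100
```

For this matrix the condition number is \kappa = 100, so the linear rate derived later contracts by only (1 - 1/100) per step in the worst direction.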

Main Theorems

Lemma

Descent Lemma

Statement

For any L-smooth function f and any x, y \in \mathbb{R}^d:

f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2

In particular, for a gradient step x_{t+1} = x_t - \eta \nabla f(x_t) with step size \eta \leq 1/L:

f(x_{t+1}) \leq f(x_t) - \frac{\eta}{2}\|\nabla f(x_t)\|^2

Intuition

Smoothness means the function cannot curve faster than a quadratic with coefficient L/2. A gradient step with step size 1/L is guaranteed to decrease the function value by at least \|\nabla f\|^2/(2L). Larger step sizes risk overshooting the quadratic upper bound.

Proof Sketch

Define g(t) = f(x + t(y-x)). Then g'(t) = \langle \nabla f(x + t(y-x)), y-x \rangle. By L-smoothness, g'(t) - g'(0) \leq L\|y-x\|^2 t. Integrating from 0 to 1: f(y) - f(x) - \langle \nabla f(x), y-x \rangle = g(1) - g(0) - g'(0) \leq L\|y-x\|^2/2. For the gradient step, substitute y = x - \eta\nabla f(x) to get f(x_{t+1}) \leq f(x_t) - \eta(1 - \eta L/2)\|\nabla f(x_t)\|^2, and note that \eta \leq 1/L gives 1 - \eta L/2 \geq 1/2.

Why It Matters

This is the foundation of all gradient descent convergence proofs. The guaranteed per-step decrease \eta\|\nabla f\|^2/2 is the "currency" that convergence arguments spend. Every convergence rate is derived by telescoping this decrease over T steps and relating \sum_t \|\nabla f(x_t)\|^2 to the suboptimality gap.
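The per-step guarantee can be checked numerically. The sketch below verifies f(x - \eta\nabla f(x)) \leq f(x) - \frac{\eta}{2}\|\nabla f(x)\|^2 with \eta = 1/L on an illustrative quadratic (the matrix and starting point are arbitrary choices, not from the text):

```python
import numpy as np

# Check the descent-lemma guarantee on an L-smooth quadratic with eta = 1/L.
A = np.diag([1.0, 10.0])   # Hessian; largest eigenvalue L = 10
L = 10.0

def f(x):
    return 0.5 * x @ A @ x

def grad(x):
    return A @ x

eta = 1.0 / L
x = np.array([3.0, -2.0])  # arbitrary starting point
g = grad(x)
x_next = x - eta * g

guaranteed = f(x) - 0.5 * eta * np.dot(g, g)  # descent-lemma bound
print(f(x_next) <= guaranteed + 1e-12)        # True: the bound holds
```

Here f(x) = 24.5 while the bound guarantees at most 4.05 after one step; the actual value 3.645 is even lower, since the bound only uses worst-case curvature.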

Failure Mode

If f is not smooth (e.g., f(x) = |x|), the descent lemma does not apply. The gradient can be large without implying that a gradient step decreases f. Non-smooth optimization requires subgradient methods, which converge at the slower rate O(1/\sqrt{T}) instead of O(1/T).

Theorem

GD Convergence for Smooth Convex Functions

Statement

Gradient descent with step size \eta = 1/L on a convex, L-smooth function satisfies:

f(x_T) - f(x^*) \leq \frac{L\|x_0 - x^*\|^2}{2T}

The convergence rate is O(1/T): after T steps, suboptimality is inversely proportional to the number of iterations.

Intuition

Convexity ensures that gradient steps make progress toward x^*, not just downhill. The O(1/T) rate means halving the error requires doubling the iterations. This is the "slow" rate; strong convexity accelerates it.

Proof Sketch

Combining convexity (f(x_t) - f(x^*) \leq \langle \nabla f(x_t), x_t - x^* \rangle) with the descent lemma gives f(x_{t+1}) - f(x^*) \leq \langle \nabla f(x_t), x_t - x^* \rangle - \frac{1}{2L}\|\nabla f(x_t)\|^2. Use the identity \langle \nabla f(x_t), x_t - x^* \rangle = \frac{L}{2}(\|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2) + \frac{1}{2L}\|\nabla f(x_t)\|^2, which follows from expanding \|x_{t+1} - x^*\|^2 with x_{t+1} = x_t - \frac{1}{L}\nabla f(x_t). This gives f(x_{t+1}) - f(x^*) \leq \frac{L}{2}(\|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2). Telescope over t = 0, \ldots, T-1 and use monotonicity of the iterates, which gives f(x_T) \leq \frac{1}{T}\sum_{t=1}^{T} f(x_t).

Why It Matters

This establishes the baseline convergence rate for gradient descent. Any faster rate requires additional assumptions (strong convexity) or more sophisticated algorithms (Nesterov acceleration, which achieves O(1/T^2)).
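The O(1/T) bound can be checked empirically. This sketch runs GD with \eta = 1/L on an arbitrary convex quadratic (minimizer x^* = 0; the eigenvalues are an illustrative choice) and confirms the bound at several horizons:

```python
import numpy as np

# GD with eta = 1/L on a smooth convex quadratic (minimizer x* = 0);
# check f(x_T) - f(x*) <= L * ||x0 - x*||^2 / (2T) at several horizons.
A = np.diag([0.1, 1.0, 10.0])  # illustrative Hessian; L = 10
L = 10.0

def f(x):
    return 0.5 * x @ A @ x

x0 = np.ones(3)
eta = 1.0 / L
for T in [1, 10, 100]:
    x = x0.copy()
    for _ in range(T):
        x = x - eta * (A @ x)          # gradient step
    bound = L * np.dot(x0, x0) / (2 * T)
    assert f(x) <= bound               # the theorem's guarantee
print("O(1/T) bound holds for T = 1, 10, 100")
```

The small eigenvalue 0.1 makes the problem slow in one direction, yet the bound still holds because it only depends on L and the initial distance.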

Failure Mode

The O(1/T) rate is tight for smooth convex functions. Without strong convexity, GD cannot converge faster. For non-convex functions, the guarantee weakens to \min_t \|\nabla f(x_t)\|^2 \leq O(1/T) (convergence to a stationary point, not a minimizer).

Theorem

GD Convergence for Smooth Strongly Convex Functions

Statement

Gradient descent with step size \eta = 1/L on a \mu-strongly convex, L-smooth function converges linearly:

f(x_T) - f(x^*) \leq \left(1 - \frac{1}{\kappa}\right)^T (f(x_0) - f(x^*))

where \kappa = L/\mu is the condition number. Equivalently:

\|x_T - x^*\|^2 \leq \left(1 - \frac{1}{\kappa}\right)^T \|x_0 - x^*\|^2

To achieve \epsilon accuracy, T = O(\kappa \log(1/\epsilon)) iterations suffice.

Intuition

Strong convexity ensures that gradients do not vanish near the optimum: \|\nabla f(x)\|^2 \geq 2\mu(f(x) - f(x^*)). Combined with the descent lemma, each step reduces the gap by a factor of (1 - 1/\kappa). The condition number \kappa controls the contraction rate.

Proof Sketch

From the descent lemma: f(x_{t+1}) \leq f(x_t) - \frac{1}{2L}\|\nabla f(x_t)\|^2. By strong convexity: \|\nabla f(x_t)\|^2 \geq 2\mu(f(x_t) - f(x^*)). Substituting: f(x_{t+1}) - f(x^*) \leq (1 - \mu/L)(f(x_t) - f(x^*)). Iterate T times.

Why It Matters

Linear convergence means the number of iterations scales logarithmically with the desired accuracy. This is exponentially faster than the O(1/T) rate for merely convex functions. The condition number \kappa is the key quantity: ill-conditioned problems (\kappa \gg 1) converge slowly. This motivates preconditioning and adaptive methods like Adam.
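A quick numerical sanity check of the linear rate, on an arbitrary strongly convex quadratic with \kappa = 20 (the matrix, starting point, and horizon are illustrative choices):

```python
import numpy as np

# Verify linear convergence on a strongly convex quadratic with x* = 0:
# f(x_T) - f(x*) <= (1 - 1/kappa)^T * (f(x_0) - f(x*)).
A = np.diag([1.0, 20.0])   # mu = 1, L = 20, so kappa = 20
mu, L = 1.0, 20.0
kappa = L / mu

def f(x):
    return 0.5 * x @ A @ x

x = np.array([1.0, 1.0])
f0 = f(x)
T = 50
for _ in range(T):
    x = x - (1.0 / L) * (A @ x)        # gradient step with eta = 1/L

print(f(x) <= (1 - 1 / kappa) ** T * f0)  # True: the linear rate holds
```

Each eigendirection contracts by (1 - \lambda/L) per step, so the slowest direction (curvature \mu) dominates and realizes the (1 - 1/\kappa) factor.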

Failure Mode

The learning rate 1/L requires knowing (or estimating) L. Too large a step size causes divergence. The linear rate assumes exact gradients; stochastic gradients add noise that prevents convergence below the noise floor without decreasing the step size.

Gradient Flow and Lyapunov Analysis

Definition

Gradient Flow

The gradient flow is the continuous-time limit of gradient descent as the step size goes to zero:

\frac{dx}{dt} = -\nabla f(x(t))

Gradient descent with step size \eta is the forward Euler discretization of this ODE.

For gradient flow, f(x(t)) is a natural Lyapunov function:

\frac{d}{dt} f(x(t)) = \langle \nabla f, \dot{x} \rangle = -\|\nabla f(x(t))\|^2 \leq 0

The function value decreases along trajectories. This is the continuous-time analogue of the descent lemma.
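The discretization claim can be illustrated on the scalar objective f(x) = \frac{a}{2}x^2, where gradient flow has the closed form x(t) = x_0 e^{-at}. Forward Euler (i.e., gradient descent) tracks the flow increasingly well as \eta \to 0; the constants below are arbitrary illustrative choices:

```python
import math

# For f(x) = (a/2) x^2, gradient flow is x' = -a*x with solution
# x(t) = x0 * exp(-a*t). Forward Euler with step eta is exactly
# gradient descent; its error at fixed time t shrinks with eta.
a, x0, t_end = 2.0, 1.0, 1.0
exact = x0 * math.exp(-a * t_end)

for eta in [0.1, 0.01, 0.001]:
    steps = int(t_end / eta)
    x = x0
    for _ in range(steps):
        x = x - eta * (a * x)   # one gradient-descent / Euler step
    print(eta, abs(x - exact))  # discretization error at t = t_end
```

The printed errors decrease as \eta shrinks (forward Euler is first-order accurate), which is the precise sense in which gradient flow is the \eta \to 0 limit of gradient descent.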

The Edge of Stability

Classical convergence theory says: use step size \eta \leq 2/L to ensure stability. Cohen et al. (2021) observed that in neural network training, GD with a fixed step size evolves through three phases:

  1. Progressive sharpening: the sharpness (largest eigenvalue of \nabla^2 f) increases during training
  2. Edge of stability: the sharpness stabilizes near 2/\eta, the stability threshold
  3. Non-monotone descent: the loss decreases on average but oscillates non-monotonically

This means neural networks self-tune their curvature to match the learning rate. The loss continues to decrease even though the sharpness sits at the classical stability threshold \eta L \approx 2, well beyond the descent-lemma regime \eta L \leq 1. This phenomenon is not explained by the standard convergence theory presented above.
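The 2/\eta threshold itself is classical and easy to see on a quadratic, where GD acts independently on each eigendirection via x_{k+1} = (1 - \eta\lambda)x_k. A minimal sketch (this shows only the stability threshold, not the neural-network sharpening dynamics described above):

```python
# On a quadratic, GD restricted to an eigendirection with curvature lam
# iterates x_{k+1} = (1 - eta*lam) * x_k, which is stable iff
# |1 - eta*lam| < 1, i.e. lam < 2/eta. This is the threshold that the
# sharpness hovers at in the edge-of-stability regime.
def run(lam, eta, steps=100, x=1.0):
    for _ in range(steps):
        x = x - eta * lam * x
    return abs(x)

eta = 0.1                     # threshold curvature is 2/eta = 20
print(run(19.0, eta) < 1.0)   # lam < 20: oscillates but converges -> True
print(run(21.0, eta) > 1.0)   # lam > 20: oscillates and diverges  -> True
```

Just below the threshold the iterates flip sign each step ((1 - \eta\lambda) \approx -0.9 here) yet still contract, which is the quadratic caricature of the non-monotone descent observed in phase 3.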

Common Confusions

Watch Out

Convergence rate is not the same as convergence speed

The rate O(1/T) vs O((1-1/\kappa)^T) describes the worst case. In practice, the actual convergence can be much faster if the problem has favorable geometry (e.g., the Hessian spectrum is clustered). The condition number captures the worst eigenvalue ratio, but if most eigenvalues are well-conditioned, convergence is faster in practice.
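A small illustration of this point: with \eta = 1/L, the eigendirection with curvature \lambda contracts by (1 - \lambda/L) per step, so a clustered spectrum loses most of its error quickly even when \kappa is large (the eigenvalues below are an illustrative choice):

```python
import numpy as np

# Worst-case rate is governed by kappa = L/mu, but with eta = 1/L each
# eigendirection with curvature lam contracts by (1 - lam/L) per step.
# If most eigenvalues sit near L, most of the error vanishes quickly.
eigs = np.array([1.0, 90.0, 95.0, 100.0])  # kappa = 100, but clustered
L = eigs.max()

x = np.ones_like(eigs)          # initial error in each eigendirection
for _ in range(10):
    x = (1 - eigs / L) * x      # GD step expressed in the eigenbasis

print(np.round(x, 6))           # only the lam = 1 direction keeps error
```

After 10 steps the three clustered directions are essentially at zero, while the worst-case bound (1 - 1/\kappa)^{10} \approx 0.9 describes only the single \lambda = 1 direction.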

Watch Out

Non-convex does not mean GD fails

For non-convex functions, GD still converges to stationary points (\nabla f \approx 0) at rate O(1/T). The issue is that stationary points include saddle points and local minima. In overparameterized neural networks, empirical evidence suggests that most local minima have similar loss to the global minimum. The hard part is not avoiding bad local minima but understanding generalization.

Summary

  • Descent lemma: GD with \eta = 1/L decreases f by \|\nabla f\|^2/(2L) per step
  • Smooth convex: O(1/T) convergence rate
  • Smooth + strongly convex: O((1-1/\kappa)^T) linear convergence
  • Condition number \kappa = L/\mu controls difficulty
  • Gradient flow is the continuous-time limit; f is a Lyapunov function
  • Edge of stability: neural networks violate classical step size conditions but still converge

Exercises

ExerciseCore

Problem

Consider a quadratic function f(x) = \frac{1}{2}x^T A x, where A is symmetric with eigenvalues in [\mu, L]. What is the optimal constant step size for gradient descent, and what is the convergence rate?

ExerciseAdvanced

Problem

Prove that for \mu-strongly convex f, the gradient lower-bounds the suboptimality: \|\nabla f(x)\|^2 \geq 2\mu(f(x) - f(x^*)).

References

Canonical:

  • Nesterov, Introductory Lectures on Convex Optimization, Chapters 1-2
  • Boyd & Vandenberghe, Convex Optimization, Chapter 9

Current:

  • Cohen et al., "Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability" (2021)

  • Nocedal & Wright, Numerical Optimization (2006), Chapters 2-7

  • Bottou, Curtis, Nocedal, "Optimization Methods for Large-Scale Machine Learning" (2018), SIAM Review


Last reviewed: April 2026
