
Optimization Function Classes

Stability and Optimization Dynamics

Convergence of gradient descent for smooth and strongly convex objectives, the descent lemma, gradient flow as a continuous-time limit, Lyapunov stability analysis, and the edge of stability phenomenon.


Why This Matters

Optimization is the computational engine of machine learning. Every model you train uses some variant of gradient descent. Understanding the convergence theory tells you: how to set the learning rate, why smoothness and convexity matter, and what convergence rate to expect.

Modern ML theory increasingly analyzes training through the lens of dynamical systems. The continuous-time gradient flow limit \dot{x} = -\nabla f(x) reveals stability properties that discrete gradient descent inherits (or violates). The edge of stability phenomenon shows that neural network training operates in a regime where classical convergence theory barely applies.

Core Definitions

Definition

L-Smooth Function

A differentiable function f: \mathbb{R}^d \to \mathbb{R} is L-smooth if its gradient is L-Lipschitz:

\|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\| \quad \text{for all } x, y

Equivalently, f has bounded curvature: -LI \preceq \nabla^2 f(x) \preceq LI for all x (when f is twice differentiable).

Definition

Strongly Convex Function

A differentiable function f is \mu-strongly convex (\mu > 0) if:

f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2

for all x, y. The condition number is \kappa = L/\mu; larger \kappa means harder optimization.
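For a concrete instance, consider a quadratic f(x) = \frac{1}{2}x^T A x with symmetric positive definite A: the Hessian is A everywhere, so L and \mu are its extreme eigenvalues. A minimal numerical check (the matrix below is an arbitrary illustrative choice, not from the text):

```python
import numpy as np

# For f(x) = 0.5 * x^T A x with symmetric positive definite A, the
# Hessian is A everywhere, so L = lambda_max(A) and mu = lambda_min(A).
A = np.diag([1.0, 4.0, 100.0])  # illustrative matrix; eigenvalues 1, 4, 100

eigs = np.linalg.eigvalsh(A)    # sorted ascending
mu, L = eigs[0], eigs[-1]
kappa = L / mu

print(mu, L, kappa)             # mu = 1, L = 100, kappa = 100
```

For this matrix the condition number is \kappa = 100, so the linear rate derived later contracts by only (1 - 1/100) per step in the worst direction.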

Main Theorems

Lemma

Descent Lemma

Statement

For any L-smooth function f and any x, y \in \mathbb{R}^d:

f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2

In particular, for a gradient step x_{t+1} = x_t - \eta \nabla f(x_t) with step size \eta \leq 1/L:

f(x_{t+1}) \leq f(x_t) - \frac{\eta}{2}\|\nabla f(x_t)\|^2

Intuition

Smoothness means the function cannot curve faster than a quadratic with coefficient L/2. A gradient step with step size 1/L is guaranteed to decrease the function value by at least \|\nabla f\|^2/(2L). Larger step sizes risk overshooting the quadratic upper bound.

Proof Sketch

Define g(t) = f(x + t(y-x)). Then g'(t) = \langle \nabla f(x + t(y-x)), y-x \rangle. By L-smoothness, g'(t) - g'(0) \leq L\|y-x\|^2 t. Integrating from 0 to 1: f(y) - f(x) - \langle \nabla f(x), y-x \rangle = g(1) - g(0) - g'(0) \leq L\|y-x\|^2/2. For the gradient step, substitute y = x - \eta\nabla f(x) to get f(x_{t+1}) \leq f(x_t) - \eta(1 - \eta L/2)\|\nabla f(x_t)\|^2, and note that \eta \leq 1/L gives 1 - \eta L/2 \geq 1/2.

Why It Matters

This is the foundation of all gradient descent convergence proofs. The guaranteed per-step decrease \eta\|\nabla f\|^2/2 is the "currency" that convergence arguments spend. Every convergence rate is derived by telescoping this decrease over T steps and relating \sum_t \|\nabla f(x_t)\|^2 to the suboptimality gap.
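The per-step guarantee can be checked numerically. The sketch below verifies f(x - \eta\nabla f(x)) \leq f(x) - \frac{\eta}{2}\|\nabla f(x)\|^2 with \eta = 1/L on an illustrative quadratic (the matrix and starting point are arbitrary choices, not from the text):

```python
import numpy as np

# Check the descent-lemma guarantee on an L-smooth quadratic with eta = 1/L.
A = np.diag([1.0, 10.0])   # Hessian; largest eigenvalue L = 10
L = 10.0

def f(x):
    return 0.5 * x @ A @ x

def grad(x):
    return A @ x

eta = 1.0 / L
x = np.array([3.0, -2.0])  # arbitrary starting point
g = grad(x)
x_next = x - eta * g

guaranteed = f(x) - 0.5 * eta * np.dot(g, g)  # descent-lemma bound
print(f(x_next) <= guaranteed + 1e-12)        # True: the bound holds
```

Here f(x) = 24.5 while the bound guarantees at most 4.05 after one step; the actual value 3.645 is even lower, since the bound only uses worst-case curvature.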

Failure Mode

If f is not smooth (e.g., f(x) = |x|), the descent lemma does not apply. The gradient can be large without implying that a gradient step decreases f. Non-smooth optimization requires subgradient methods, which converge at the slower rate O(1/\sqrt{T}) instead of O(1/T).

Theorem

GD Convergence for Smooth Convex Functions

Statement

Gradient descent with step size \eta = 1/L on a convex, L-smooth function satisfies:

f(x_T) - f(x^*) \leq \frac{L\|x_0 - x^*\|^2}{2T}

The convergence rate is O(1/T): after T steps, suboptimality is inversely proportional to the number of iterations.

Intuition

Convexity ensures that gradient steps make progress toward x^*, not just downhill. The O(1/T) rate means halving the error requires doubling the iterations. This is the "slow" rate; strong convexity accelerates it.

Proof Sketch

Combining convexity (f(x_t) - f(x^*) \leq \langle \nabla f(x_t), x_t - x^* \rangle) with the descent lemma gives f(x_{t+1}) - f(x^*) \leq \langle \nabla f(x_t), x_t - x^* \rangle - \frac{1}{2L}\|\nabla f(x_t)\|^2. Use the identity \langle \nabla f(x_t), x_t - x^* \rangle = \frac{L}{2}(\|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2) + \frac{1}{2L}\|\nabla f(x_t)\|^2, which follows from expanding \|x_{t+1} - x^*\|^2 with x_{t+1} = x_t - \frac{1}{L}\nabla f(x_t). This gives f(x_{t+1}) - f(x^*) \leq \frac{L}{2}(\|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2). Telescope over t = 0, \ldots, T-1 and use monotonicity of the iterates, which gives f(x_T) \leq \frac{1}{T}\sum_{t=1}^{T} f(x_t).

Why It Matters

This establishes the baseline convergence rate for gradient descent. Any faster rate requires additional assumptions (strong convexity) or more sophisticated algorithms (Nesterov acceleration, which achieves O(1/T^2)).
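The O(1/T) bound can be checked empirically. This sketch runs GD with \eta = 1/L on an arbitrary convex quadratic (minimizer x^* = 0; the eigenvalues are an illustrative choice) and confirms the bound at several horizons:

```python
import numpy as np

# GD with eta = 1/L on a smooth convex quadratic (minimizer x* = 0);
# check f(x_T) - f(x*) <= L * ||x0 - x*||^2 / (2T) at several horizons.
A = np.diag([0.1, 1.0, 10.0])  # illustrative Hessian; L = 10
L = 10.0

def f(x):
    return 0.5 * x @ A @ x

x0 = np.ones(3)
eta = 1.0 / L
for T in [1, 10, 100]:
    x = x0.copy()
    for _ in range(T):
        x = x - eta * (A @ x)          # gradient step
    bound = L * np.dot(x0, x0) / (2 * T)
    assert f(x) <= bound               # the theorem's guarantee
print("O(1/T) bound holds for T = 1, 10, 100")
```

The small eigenvalue 0.1 makes the problem slow in one direction, yet the bound still holds because it only depends on L and the initial distance.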

Failure Mode

The O(1/T) rate is tight for smooth convex functions. Without strong convexity, GD cannot converge faster. For non-convex functions, the guarantee weakens to \min_t \|\nabla f(x_t)\|^2 \leq O(1/T) (convergence to a stationary point, not a minimizer).

Theorem

GD Convergence for Smooth Strongly Convex Functions

Statement

Gradient descent with step size \eta = 1/L on a \mu-strongly convex, L-smooth function converges linearly:

f(x_T) - f(x^*) \leq \left(1 - \frac{1}{\kappa}\right)^T (f(x_0) - f(x^*))

where \kappa = L/\mu is the condition number. Equivalently:

\|x_T - x^*\|^2 \leq \left(1 - \frac{1}{\kappa}\right)^T \|x_0 - x^*\|^2

To achieve \epsilon accuracy, T = O(\kappa \log(1/\epsilon)) iterations suffice.

Intuition

Strong convexity ensures that gradients do not vanish near the optimum: \|\nabla f(x)\|^2 \geq 2\mu(f(x) - f(x^*)). Combined with the descent lemma, each step reduces the gap by a factor of (1 - 1/\kappa). The condition number \kappa controls the contraction rate.

Proof Sketch

From the descent lemma: f(x_{t+1}) \leq f(x_t) - \frac{1}{2L}\|\nabla f(x_t)\|^2. By strong convexity: \|\nabla f(x_t)\|^2 \geq 2\mu(f(x_t) - f(x^*)). Substituting: f(x_{t+1}) - f(x^*) \leq (1 - \mu/L)(f(x_t) - f(x^*)). Iterate T times.

Why It Matters

Linear convergence means the number of iterations scales logarithmically with the desired accuracy. This is exponentially faster than the O(1/T) rate for merely convex functions. The condition number \kappa is the key quantity: ill-conditioned problems (\kappa \gg 1) converge slowly. This motivates preconditioning and adaptive methods like Adam.
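A quick numerical sanity check of the linear rate, on an arbitrary strongly convex quadratic with \kappa = 20 (the matrix, starting point, and horizon are illustrative choices):

```python
import numpy as np

# Verify linear convergence on a strongly convex quadratic with x* = 0:
# f(x_T) - f(x*) <= (1 - 1/kappa)^T * (f(x_0) - f(x*)).
A = np.diag([1.0, 20.0])   # mu = 1, L = 20, so kappa = 20
mu, L = 1.0, 20.0
kappa = L / mu

def f(x):
    return 0.5 * x @ A @ x

x = np.array([1.0, 1.0])
f0 = f(x)
T = 50
for _ in range(T):
    x = x - (1.0 / L) * (A @ x)        # gradient step with eta = 1/L

print(f(x) <= (1 - 1 / kappa) ** T * f0)  # True: the linear rate holds
```

Each eigendirection contracts by (1 - \lambda/L) per step, so the slowest direction (curvature \mu) dominates and realizes the (1 - 1/\kappa) factor.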

Failure Mode

The learning rate 1/L requires knowing (or estimating) L. Too large a step size causes divergence. The linear rate assumes exact gradients; stochastic gradients add noise that prevents convergence below the noise floor without decreasing the step size.

Gradient Flow and Lyapunov Analysis

Definition

Gradient Flow

The gradient flow is the continuous-time limit of gradient descent as the step size goes to zero:

\frac{dx}{dt} = -\nabla f(x(t))

Gradient descent with step size \eta is the forward Euler discretization of this ODE.

For gradient flow, f(x(t)) is a natural Lyapunov function:

\frac{d}{dt} f(x(t)) = \langle \nabla f, \dot{x} \rangle = -\|\nabla f(x(t))\|^2 \leq 0

The function value decreases along trajectories. This is the continuous-time analogue of the descent lemma.
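The discretization claim can be illustrated on the scalar objective f(x) = \frac{a}{2}x^2, where gradient flow has the closed form x(t) = x_0 e^{-at}. Forward Euler (i.e., gradient descent) tracks the flow increasingly well as \eta \to 0; the constants below are arbitrary illustrative choices:

```python
import math

# For f(x) = (a/2) x^2, gradient flow is x' = -a*x with solution
# x(t) = x0 * exp(-a*t). Forward Euler with step eta is exactly
# gradient descent; its error at fixed time t shrinks with eta.
a, x0, t_end = 2.0, 1.0, 1.0
exact = x0 * math.exp(-a * t_end)

for eta in [0.1, 0.01, 0.001]:
    steps = int(t_end / eta)
    x = x0
    for _ in range(steps):
        x = x - eta * (a * x)   # one gradient-descent / Euler step
    print(eta, abs(x - exact))  # discretization error at t = t_end
```

The printed errors decrease as \eta shrinks (forward Euler is first-order accurate), which is the precise sense in which gradient flow is the \eta \to 0 limit of gradient descent.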

The Edge of Stability

Classical convergence theory says: use step size \eta \leq 2/L to ensure stability. Cohen et al. (2021) observed that in neural network training, GD with a fixed step size evolves through three phases:

  1. Progressive sharpening: the sharpness (largest eigenvalue of \nabla^2 f) increases during training
  2. Edge of stability: the sharpness stabilizes near 2/\eta, the stability threshold
  3. Non-monotone descent: the loss decreases on average but oscillates non-monotonically

This means neural networks self-tune their curvature to match the learning rate. The loss continues to decrease even though the sharpness sits at the classical stability threshold \eta L \approx 2, well beyond the descent-lemma regime \eta L \leq 1. This phenomenon is not explained by the standard convergence theory presented above.
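The 2/\eta threshold itself is classical and easy to see on a quadratic, where GD acts independently on each eigendirection via x_{k+1} = (1 - \eta\lambda)x_k. A minimal sketch (this shows only the stability threshold, not the neural-network sharpening dynamics described above):

```python
# On a quadratic, GD restricted to an eigendirection with curvature lam
# iterates x_{k+1} = (1 - eta*lam) * x_k, which is stable iff
# |1 - eta*lam| < 1, i.e. lam < 2/eta. This is the threshold that the
# sharpness hovers at in the edge-of-stability regime.
def run(lam, eta, steps=100, x=1.0):
    for _ in range(steps):
        x = x - eta * lam * x
    return abs(x)

eta = 0.1                     # threshold curvature is 2/eta = 20
print(run(19.0, eta) < 1.0)   # lam < 20: oscillates but converges -> True
print(run(21.0, eta) > 1.0)   # lam > 20: oscillates and diverges  -> True
```

Just below the threshold the iterates flip sign each step ((1 - \eta\lambda) \approx -0.9 here) yet still contract, which is the quadratic caricature of the non-monotone descent observed in phase 3.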

Common Confusions

Watch Out

Convergence rate is not the same as convergence speed

The rate O(1/T) vs O((1-1/\kappa)^T) describes the worst case. In practice, the actual convergence can be much faster if the problem has favorable geometry (e.g., the Hessian spectrum is clustered). The condition number captures the worst eigenvalue ratio, but if most eigenvalues are well-conditioned, convergence is faster in practice.
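A small illustration of this point: with \eta = 1/L, the eigendirection with curvature \lambda contracts by (1 - \lambda/L) per step, so a clustered spectrum loses most of its error quickly even when \kappa is large (the eigenvalues below are an illustrative choice):

```python
import numpy as np

# Worst-case rate is governed by kappa = L/mu, but with eta = 1/L each
# eigendirection with curvature lam contracts by (1 - lam/L) per step.
# If most eigenvalues sit near L, most of the error vanishes quickly.
eigs = np.array([1.0, 90.0, 95.0, 100.0])  # kappa = 100, but clustered
L = eigs.max()

x = np.ones_like(eigs)          # initial error in each eigendirection
for _ in range(10):
    x = (1 - eigs / L) * x      # GD step expressed in the eigenbasis

print(np.round(x, 6))           # only the lam = 1 direction keeps error
```

After 10 steps the three clustered directions are essentially at zero, while the worst-case bound (1 - 1/\kappa)^{10} \approx 0.9 describes only the single \lambda = 1 direction.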

Watch Out

Non-convex does not mean GD fails

For non-convex functions, GD still converges to stationary points (\nabla f \approx 0) at rate O(1/T). The issue is that stationary points include saddle points and local minima. In overparameterized neural networks, empirical evidence suggests that most local minima have similar loss to the global minimum. The hard part is not avoiding bad local minima but understanding generalization.

Summary

  • Descent lemma: GD with \eta = 1/L decreases f by \|\nabla f\|^2/(2L) per step
  • Smooth convex: O(1/T) convergence rate
  • Smooth + strongly convex: O((1-1/\kappa)^T) linear convergence
  • Condition number \kappa = L/\mu controls difficulty
  • Gradient flow is the continuous-time limit; f is a Lyapunov function
  • Edge of stability: neural networks violate classical step size conditions but still converge

Exercises

ExerciseCore

Problem

Consider a quadratic function f(x) = \frac{1}{2}x^T A x, where A is symmetric with eigenvalues in [\mu, L]. What is the optimal constant step size for gradient descent, and what is the convergence rate?

ExerciseAdvanced

Problem

Prove that for \mu-strongly convex f, the gradient lower-bounds the suboptimality: \|\nabla f(x)\|^2 \geq 2\mu(f(x) - f(x^*)).

References

Canonical:

  • Nesterov, Introductory Lectures on Convex Optimization, Chapters 1-2
  • Boyd & Vandenberghe, Convex Optimization, Chapter 9

Current:

  • Cohen et al., "Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability" (2021)

  • Nocedal & Wright, Numerical Optimization (2006), Chapters 2-7

  • Bottou, Curtis, Nocedal, "Optimization Methods for Large-Scale Machine Learning" (2018), SIAM Review


Last reviewed: April 2026
