Optimization Function Classes
Stability and Optimization Dynamics
Convergence of gradient descent for smooth and strongly convex objectives, the descent lemma, gradient flow as a continuous-time limit, Lyapunov stability analysis, and the edge of stability phenomenon.
Why This Matters
Optimization is the computational engine of machine learning. Every model you train uses some variant of gradient descent. Understanding the convergence theory tells you: how to set the learning rate, why smoothness and convexity matter, and what convergence rate to expect.
Modern ML theory increasingly analyzes training through the lens of dynamical systems. The continuous-time gradient flow limit reveals stability properties that discrete gradient descent inherits (or violates). The edge of stability phenomenon shows that neural network training operates in a regime where classical convergence theory barely applies.
Core Definitions
L-Smooth Function
A differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth if its gradient is $L$-Lipschitz:

$$\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\| \quad \text{for all } x, y.$$

Equivalently, $f$ has bounded curvature: $\|\nabla^2 f(x)\|_2 \le L$ for all $x$ (when $f$ is twice differentiable).
Strongly Convex Function
A differentiable function $f$ is $\mu$-strongly convex ($\mu > 0$) if:

$$f(y) \ge f(x) + \langle \nabla f(x),\, y - x \rangle + \frac{\mu}{2}\|y - x\|^2$$

for all $x, y$. The condition number is $\kappa = L/\mu \ge 1$. Larger $\kappa$ means harder optimization.
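For a quadratic $f(x) = \tfrac{1}{2} x^\top A x$ with symmetric positive definite $A$, the constants are just the extreme Hessian eigenvalues: $L = \lambda_{\max}(A)$ and $\mu = \lambda_{\min}(A)$. A minimal numerical sketch (the matrix below is an arbitrary illustrative choice):

```python
import numpy as np

# For f(x) = 0.5 * x^T A x, the Hessian is A everywhere, so
# L = lambda_max(A) and mu = lambda_min(A).
A = np.array([[10.0, 2.0],
              [2.0, 1.0]])          # illustrative SPD matrix
eigvals = np.linalg.eigvalsh(A)     # ascending order
mu, L = eigvals[0], eigvals[-1]
kappa = L / mu                      # condition number

# Numerically verify the gradient-Lipschitz property at random points:
rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=2), rng.normal(size=2)
    grad_gap = np.linalg.norm(A @ x - A @ y)   # ||grad f(x) - grad f(y)||
    assert grad_gap <= L * np.linalg.norm(x - y) + 1e-9

print(f"mu = {mu:.3f}, L = {L:.3f}, kappa = {kappa:.1f}")
```

For a symmetric matrix, $\|A(x-y)\| \le \lambda_{\max}(A)\|x-y\|$, so the Lipschitz check passes with equality only along the top eigenvector.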
Main Theorems
Descent Lemma
Statement
For any $L$-smooth function $f$ and any $x, y$:

$$f(y) \le f(x) + \langle \nabla f(x),\, y - x \rangle + \frac{L}{2}\|y - x\|^2.$$

In particular, for a gradient step $x^+ = x - \eta \nabla f(x)$ with step size $\eta \le 1/L$:

$$f(x^+) \le f(x) - \frac{\eta}{2}\|\nabla f(x)\|^2.$$
Intuition
Smoothness means the function cannot curve upward faster than a quadratic with coefficient $L$. A gradient step with step size $\eta = 1/L$ is guaranteed to decrease the function value by at least $\frac{1}{2L}\|\nabla f(x)\|^2$. Larger step sizes risk overshooting the quadratic upper bound.
Proof Sketch
Define $g(t) = f(x + t(y - x))$. Then $f(y) - f(x) = \int_0^1 g'(t)\,dt = \int_0^1 \langle \nabla f(x + t(y - x)),\, y - x \rangle\, dt$. By $L$-smoothness: $\langle \nabla f(x + t(y - x)) - \nabla f(x),\, y - x \rangle \le L t \|y - x\|^2$. Integrate from $0$ to $1$: $f(y) \le f(x) + \langle \nabla f(x),\, y - x \rangle + \frac{L}{2}\|y - x\|^2$. For the gradient step, substitute $y = x - \eta \nabla f(x)$ and optimize over $\eta$.
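Both the quadratic upper bound and the guaranteed per-step decrease can be checked numerically. A sketch on an arbitrary quadratic (the matrix and seed are illustrative):

```python
import numpy as np

# Check the descent lemma for f(x) = 0.5 * x^T A x, where L = lambda_max(A):
# f(y) <= f(x) + <grad f(x), y - x> + (L/2)||y - x||^2 for all x, y.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])          # illustrative SPD matrix
L = np.linalg.eigvalsh(A)[-1]

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

rng = np.random.default_rng(1)
for _ in range(1000):
    x, y = rng.normal(size=2), rng.normal(size=2)
    upper = f(x) + grad(x) @ (y - x) + 0.5 * L * np.dot(y - x, y - x)
    assert f(y) <= upper + 1e-9

# Guaranteed decrease for one gradient step with eta = 1/L:
x = rng.normal(size=2)
eta = 1.0 / L
x_next = x - eta * grad(x)
assert f(x_next) <= f(x) - 0.5 * eta * np.dot(grad(x), grad(x)) + 1e-9
```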
Why It Matters
This is the foundation of all gradient descent convergence proofs. The guaranteed per-step decrease is the "currency" that convergence arguments spend. Every convergence rate is derived by telescoping this decrease over $T$ steps and relating $\|\nabla f(x_k)\|^2$ to the suboptimality gap.
Failure Mode
If $f$ is not smooth (e.g., $f(x) = |x|$), the descent lemma does not apply. The gradient can be large without implying that a gradient step decreases $f$. Non-smooth optimization requires subgradient methods, which converge at the slower $O(1/\sqrt{T})$ rate instead of $O(1/T)$.
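A tiny illustration of the failure on $f(x) = |x|$: fixed-step (sub)gradient descent stalls at the scale of the step size instead of converging (starting point and step size below are illustrative):

```python
# Fixed-step subgradient descent on f(x) = |x|: once |x| < eta, the iterate
# overshoots zero and oscillates at magnitude ~ eta instead of converging.
def subgrad_step(x, eta):
    # subgradient of |x| is sign(x) away from 0
    return x - eta * (1.0 if x > 0 else -1.0)

x, eta = 1.0, 0.3
for _ in range(50):
    x = subgrad_step(x, eta)

# The iterate bounces around 0 at the scale of eta; it never settles exactly.
assert abs(x) <= eta
assert abs(x) > 0.0
```

This is why subgradient methods need a decaying step size ($\eta_k \to 0$) to converge, which is the source of the slower $O(1/\sqrt{T})$ rate.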
GD Convergence for Smooth Convex Functions
Statement
Gradient descent with step size $\eta = 1/L$ on a convex, $L$-smooth function satisfies:

$$f(x_T) - f(x^*) \le \frac{L \|x_0 - x^*\|^2}{2T}.$$

The convergence rate is $O(1/T)$. After $T$ steps, the suboptimality is inversely proportional to the number of iterations.
Intuition
Convexity ensures that gradient steps make progress toward $x^*$, not just downhill. The $O(1/T)$ rate means halving the suboptimality requires doubling the iterations. This is the "slow" rate; strong convexity accelerates it.
Proof Sketch
From the descent lemma: $f(x_{k+1}) \le f(x_k) - \frac{1}{2L}\|\nabla f(x_k)\|^2$. Use the identity $\|x_{k+1} - x^*\|^2 = \|x_k - x^*\|^2 - 2\eta \langle \nabla f(x_k),\, x_k - x^* \rangle + \eta^2 \|\nabla f(x_k)\|^2$ together with convexity, $f(x^*) \ge f(x_k) + \langle \nabla f(x_k),\, x^* - x_k \rangle$. This gives $f(x_{k+1}) - f(x^*) \le \frac{L}{2}\left(\|x_k - x^*\|^2 - \|x_{k+1} - x^*\|^2\right)$. Telescope over $k = 0, \dots, T-1$ and use the monotonicity of $f(x_k)$.
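The telescoped bound can be checked end to end. A sketch on a randomly generated least-squares problem, $f(x) = \tfrac{1}{2}\|Ax - b\|^2$ with $L = \lambda_{\max}(A^\top A)$ (all sizes and seeds are arbitrary):

```python
import numpy as np

# GD on a convex, L-smooth least-squares objective; verify the O(1/T) bound
# f(x_T) - f(x*) <= L ||x_0 - x*||^2 / (2T) at every iteration.
rng = np.random.default_rng(2)
A = rng.normal(size=(20, 5))        # illustrative random problem
b = rng.normal(size=20)
L = np.linalg.eigvalsh(A.T @ A)[-1]

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
f_star = f(x_star)

x0 = np.zeros(5)
x = x0.copy()
eta = 1.0 / L
for T in range(1, 201):
    x = x - eta * A.T @ (A @ x - b)                        # gradient step
    bound = L * np.dot(x0 - x_star, x0 - x_star) / (2 * T)
    assert f(x) - f_star <= bound + 1e-9                   # O(1/T) guarantee
```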
Why It Matters
This establishes the baseline convergence rate for gradient descent. Any faster rate requires additional assumptions (strong convexity) or more sophisticated algorithms (Nesterov acceleration, which achieves $O(1/T^2)$).
Failure Mode
The $O(1/T)$ rate is tight for smooth convex functions. Without strong convexity, GD cannot converge faster in the worst case. For non-convex functions, the guarantee weakens to $\min_{k \le T} \|\nabla f(x_k)\|^2 \le O(1/T)$ (convergence to a stationary point, not a minimizer).
GD Convergence for Smooth Strongly Convex Functions
Statement
Gradient descent with step size $\eta = 1/L$ on a $\mu$-strongly convex, $L$-smooth function converges linearly:

$$f(x_T) - f(x^*) \le \left(1 - \frac{\mu}{L}\right)^T \left(f(x_0) - f(x^*)\right),$$

where $\kappa = L/\mu$ is the condition number. Equivalently:

$$f(x_T) - f(x^*) \le e^{-T/\kappa} \left(f(x_0) - f(x^*)\right).$$

To achieve accuracy $\epsilon$, $T = O(\kappa \log(1/\epsilon))$ iterations suffice.
Intuition
Strong convexity ensures that gradients do not vanish near the optimum: $\|\nabla f(x)\|^2 \ge 2\mu \left(f(x) - f(x^*)\right)$. Combined with the descent lemma, each step reduces the suboptimality gap by a factor of $1 - \mu/L = 1 - 1/\kappa$. The condition number controls the contraction rate.
Proof Sketch
From the descent lemma: $f(x_{k+1}) \le f(x_k) - \frac{1}{2L}\|\nabla f(x_k)\|^2$. By strong convexity: $\|\nabla f(x_k)\|^2 \ge 2\mu \left(f(x_k) - f(x^*)\right)$. Substituting: $f(x_{k+1}) - f(x^*) \le \left(1 - \frac{\mu}{L}\right)\left(f(x_k) - f(x^*)\right)$. Iterate $T$ times.
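A numerical check of the contraction, using a diagonal quadratic with $\mu = 1$, $L = 25$ (illustrative values, minimum at the origin so $f^* = 0$):

```python
import numpy as np

# Linear convergence of GD with eta = 1/L on a strongly convex quadratic:
# verify f(x_T) - f* <= (1 - mu/L)^T (f(x_0) - f*) at every iteration.
A = np.diag([1.0, 4.0, 25.0])       # Hessian eigenvalues: mu = 1, L = 25
mu, L = 1.0, 25.0
f = lambda x: 0.5 * x @ A @ x       # f* = 0 at x* = 0

x = np.array([1.0, 1.0, 1.0])
gap0 = f(x)                         # initial suboptimality gap
eta = 1.0 / L
for T in range(1, 101):
    x = x - eta * A @ x             # gradient step
    assert f(x) <= (1 - mu / L) ** T * gap0 + 1e-12
```

Each coordinate contracts by $|1 - \eta \lambda_i|$ per step; the slowest direction ($\lambda = \mu$) dictates the overall $(1 - \mu/L)$ rate.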
Why It Matters
Linear convergence means the number of iterations scales logarithmically with the desired accuracy $\epsilon$. This is exponentially faster than the $O(1/T)$ rate for merely convex functions. The condition number $\kappa$ is the key quantity: ill-conditioned problems ($\kappa \gg 1$) converge slowly. This motivates preconditioning and adaptive methods like Adam.
Failure Mode
The learning rate $\eta = 1/L$ requires knowing (or estimating) $L$. Too large a step size ($\eta > 2/L$) causes divergence. The linear rate assumes exact gradients; stochastic gradients add noise that prevents convergence below the noise floor unless the step size is decreased.
Gradient Flow and Lyapunov Analysis
Gradient Flow
The gradient flow is the continuous-time limit of gradient descent as the step size goes to zero:

$$\dot{x}(t) = -\nabla f(x(t)).$$

Gradient descent with step size $\eta$ is the forward Euler discretization of this ODE: $x_{k+1} = x_k - \eta \nabla f(x_k)$.

For gradient flow, $f$ itself is a natural Lyapunov function:

$$\frac{d}{dt} f(x(t)) = \langle \nabla f(x(t)),\, \dot{x}(t) \rangle = -\|\nabla f(x(t))\|^2 \le 0.$$

The function value decreases along trajectories. This is the continuous-time analogue of the descent lemma.
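A sketch connecting the two views: one forward Euler step of the ODE is exactly one gradient descent step, and for a small enough step size $f$ decreases monotonically along the discrete trajectory (the quadratic, step size, and starting point below are illustrative):

```python
import numpy as np

# Forward-Euler discretization of gradient flow x' = -grad f(x) is gradient
# descent; f acts as a Lyapunov function along the discrete trajectory.
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])          # illustrative SPD quadratic
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = np.array([3.0, -2.0])
eta = 0.05                          # Euler step size = GD learning rate
values = [f(x)]
for _ in range(200):
    x = x - eta * grad(x)           # one Euler step of the ODE
    values.append(f(x))

# Monotone decrease of f, the discrete analogue of d/dt f(x(t)) <= 0:
assert all(v1 >= v2 for v1, v2 in zip(values, values[1:]))
```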
The Edge of Stability
Classical convergence theory says: use step size $\eta < 2/L$ to ensure stability. Cohen et al. (2021) observed that in neural network training, GD with a fixed step size $\eta$ evolves through three phases:
- Progressive sharpening: the sharpness (largest eigenvalue $\lambda_{\max}$ of the Hessian $\nabla^2 f$) increases during training
- Edge of stability: the sharpness stabilizes near $2/\eta$, the stability threshold
- Non-monotone descent: the loss decreases on average but oscillates non-monotonically
This means neural networks self-tune their curvature to match the learning rate. The loss continues to decrease despite violating the classical stability condition $\eta < 2/\lambda_{\max}$. This phenomenon is not explained by the standard convergence theory presented above.
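On a fixed quadratic the $2/\eta$ threshold is sharp, which makes the neural-network behavior striking by contrast. A one-dimensional sketch (the sharpness and step-size values are illustrative):

```python
# GD on f(x) = 0.5 * s * x^2 with sharpness s: the update multiplies x by
# (1 - eta * s), so iterates converge iff s < 2/eta and diverge just above it.
def run_gd(sharpness, eta, steps=100):
    x = 1.0
    for _ in range(steps):
        x = x - eta * sharpness * x     # gradient step on 0.5 * s * x^2
    return abs(x)

eta = 0.1                               # stability threshold: 2/eta = 20
assert run_gd(sharpness=19.0, eta=eta) < 1e-3   # below threshold: converges
assert run_gd(sharpness=21.0, eta=eta) > 1e3    # above threshold: diverges
```

For a fixed quadratic there is no middle ground; the empirical finding is that neural networks hover at sharpness $\approx 2/\eta$ while the loss still decreases on average.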
Common Confusions
Convergence rate is not the same as convergence speed
The rate $O(1/T)$ vs. $O(e^{-T/\kappa})$ describes the worst case. In practice, the actual convergence can be much faster if the problem has favorable geometry (e.g., the Hessian spectrum is clustered). The condition number $\kappa$ captures the worst eigenvalue ratio; if most eigenvalues are well-conditioned, convergence is faster in practice.
Non-convex does not mean GD fails
For non-convex smooth functions, GD still converges to stationary points ($\|\nabla f(x)\| \approx 0$) at rate $O(1/T)$ in $\min_k \|\nabla f(x_k)\|^2$. The issue is that stationary points include saddle points and suboptimal local minima. In overparameterized neural networks, empirical evidence suggests that most local minima have loss similar to the global minimum. The hard part is not avoiding bad local minima but understanding generalization.
Summary
- Descent lemma: GD with $\eta = 1/L$ decreases $f$ by at least $\frac{1}{2L}\|\nabla f\|^2$ per step
- Smooth convex: $O(1/T)$ convergence rate
- Smooth + strongly convex: linear convergence $(1 - 1/\kappa)^T$
- Condition number $\kappa = L/\mu$ controls difficulty
- Gradient flow $\dot{x} = -\nabla f(x)$ is the continuous-time limit; $f$ is a Lyapunov function
- Edge of stability: neural networks violate classical step-size conditions but still converge
Exercises
Problem
A quadratic function has Hessian eigenvalues in $[\mu, L]$. What is the optimal constant step size for gradient descent, and what is the resulting convergence rate?
Problem
Prove that for $\mu$-strongly convex $f$, the gradient lower-bounds the suboptimality: $f(x) - f(x^*) \le \frac{1}{2\mu}\|\nabla f(x)\|^2$.
References
Canonical:
- Nesterov, Introductory Lectures on Convex Optimization, Chapters 1-2
- Boyd & Vandenberghe, Convex Optimization, Chapter 9
Current:
- Cohen et al., "Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability" (2021)
- Nocedal & Wright, Numerical Optimization (2006), Chapters 2-7
- Bottou, Curtis, Nocedal, "Optimization Methods for Large-Scale Machine Learning" (2018), SIAM Review
Next Topics
- Training dynamics and loss landscapes: what happens in the non-convex setting
- Implicit bias and modern generalization: why overparameterized models generalize
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Convex Optimization Basics (Layer 1)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Algorithmic Stability (Layer 3)
- Empirical Risk Minimization (Layer 2)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- VC Dimension (Layer 2)