Stochastic Gradient Descent Convergence
SGD convergence rates for convex and strongly convex functions, the role of noise as both curse and blessing, mini-batch variance reduction, learning rate schedules, and the Robbins-Monro conditions.
Why This Matters
Convergence rates: how fast does excess risk decrease with iterations?
Gradient descent computes the full gradient at each step, costing O(nd) per iteration for n samples in d dimensions. When n is millions or billions (as in modern ML), this is too expensive. SGD replaces the full gradient with a stochastic estimate: the gradient of a single sample (or a small mini-batch). The cost drops to O(d) per iteration.
The trade-off: SGD's gradient estimate is noisy, which slows convergence relative to full gradient descent. But this noise also provides implicit regularization, helping SGD find solutions that generalize better. Understanding SGD convergence theory explains learning rate scheduling, batch size and learning dynamics, and why SGD remains the default optimizer for deep learning.
Setup
We minimize f(x) = E_ξ[f(x; ξ)], where ξ is a random data sample. The theory relies on properties from convex optimization. The SGD update is:

x_{t+1} = x_t − η_t g_t

where g_t is a stochastic gradient satisfying E[g_t | x_t] = ∇f(x_t) (unbiasedness).
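The update rule can be sketched on a toy least-squares problem. Everything below (the matrix sizes, the step size, the sampling scheme) is an illustrative assumption, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical least-squares toy problem: f(x) = (1/2n) * ||A x - b||^2.
n, d = 200, 5
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
b = A @ x_true                       # noiseless targets, so x_true minimizes f

def full_grad(x):
    return A.T @ (A @ x - b) / n     # O(nd) per call

def stoch_grad(x):
    i = rng.integers(n)              # one sample, uniformly at random
    return A[i] * (A[i] @ x - b[i])  # O(d) per call; E[stoch_grad] = full_grad

# SGD update: x_{t+1} = x_t - eta * g_t
x = np.zeros(d)
eta = 0.01
for _ in range(5000):
    x = x - eta * stoch_grad(x)

print(np.linalg.norm(x - x_true))    # close to 0
```

Because the targets here are noiseless, the gradient noise vanishes at the optimum and SGD converges all the way; with noisy labels a constant step size would leave a residual error, as the theory below explains.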
Stochastic Gradient Oracle
A stochastic gradient oracle returns g(x) such that:
- E[g(x)] = ∇f(x) (unbiased)
- E[‖g(x) − ∇f(x)‖²] ≤ σ² (bounded variance)
The variance σ² measures the noise level. For a single sample drawn from a finite dataset, σ² depends on the heterogeneity of the individual gradients ∇f_i(x).
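The two oracle properties can be checked empirically on a toy finite-sum objective (the problem below is an illustrative assumption, not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy finite-sum objective: f(x) = (1/n) sum_i 0.5*(a_i @ x - b_i)^2,
# with labels no single x fits exactly, so gradient noise does not vanish.
n, d = 500, 3
A = rng.normal(size=(n, d))
b = rng.normal(size=n)
x = rng.normal(size=d)

per_sample = A * (A @ x - b)[:, None]    # row i = gradient of f_i at x
full = per_sample.mean(axis=0)           # = gradient of f at x (unbiasedness)

# Empirical sigma^2 at this x: mean squared deviation of single-sample gradients
sigma2 = np.mean(np.sum((per_sample - full) ** 2, axis=1))
print(sigma2)
```

The more the rows of `per_sample` disagree with each other, the larger `sigma2`, which is exactly the heterogeneity dependence described above.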
Main Theorems
SGD Convergence for Convex Functions
Statement
For f convex and L-smooth with bounded gradient variance σ², SGD with a constant learning rate η tuned to the horizon T satisfies:

E[f(x̄_T)] − f(x*) ≤ ‖x_0 − x*‖² / (2ηT) + ησ²/2

where x̄_T = (1/T) Σ_{t=1}^T x_t is the average iterate. Choosing η ∝ 1/(σ√T) balances the two terms, giving the convergence rate O(1/√T).
Intuition
The first term is the deterministic convergence term (how fast GD would converge if there were no noise). The second term is the noise penalty. With the tuned step size, both scale as O(1/√T). Compared to full GD, which converges as O(1/T) for smooth convex functions, SGD is slower by a factor of √T. This is the price of using cheap, noisy gradients.
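A small simulation shows the horizon-tuned constant step driving the averaged iterate's suboptimality down as T grows. The 1D quadratic below is an illustrative assumption; note that on a strongly convex quadratic, averaged SGD actually does better than the generic O(1/√T) guarantee, so this only illustrates that the gap shrinks with T:

```python
import numpy as np

rng = np.random.default_rng(2)

# 1D convex toy: f(x) = 0.5*x^2, stochastic gradient g = x + z with z ~ N(0, 1).
def avg_sgd_gap(T, reps=200):
    """Mean suboptimality f(xbar_T) - f(0) of the averaged iterate over many runs."""
    eta = 1.0 / np.sqrt(T)              # constant step tuned to the horizon T
    gaps = []
    for _ in range(reps):
        x, s = 2.0, 0.0
        for _ in range(T):
            x -= eta * (x + rng.normal())
            s += x
        gaps.append(0.5 * (s / T) ** 2)
    return float(np.mean(gaps))

results = {T: avg_sgd_gap(T) for T in (100, 400, 1600)}
for T, gap in results.items():
    print(T, gap)                       # gap shrinks as T grows
```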
Proof Sketch
Start with the smoothness inequality: f(x_{t+1}) ≤ f(x_t) + ⟨∇f(x_t), x_{t+1} − x_t⟩ + (L/2)‖x_{t+1} − x_t‖². Substitute x_{t+1} = x_t − η g_t and take expectations. The cross term gives −η‖∇f(x_t)‖² (by unbiasedness). The quadratic term gives (Lη²/2)(‖∇f(x_t)‖² + σ²). Combine, use convexity to relate f(x_t) − f(x*) to ⟨∇f(x_t), x_t − x*⟩, telescope over t, and choose η to balance the terms.
Why It Matters
This O(1/√T) rate is optimal for first-order stochastic methods on convex functions. No algorithm using only stochastic gradient queries can do better in the worst case (Nemirovsky and Yudin, 1983). The rate determines how many epochs (passes over the data) you need for a given accuracy.
Failure Mode
The bound requires a constant learning rate tuned to T (the total number of iterations), which must be known in advance. In practice, this is handled by decaying the learning rate. If the learning rate does not decay, SGD oscillates around the optimum with an excess-risk floor proportional to ησ².
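The non-decaying failure mode is easy to simulate. On the illustrative 1D quadratic below (my assumption, not from the text), a constant step η leaves the iterates hovering in a noise ball whose mean squared radius is roughly η/(2 − η), i.e. proportional to η:

```python
import numpy as np

rng = np.random.default_rng(3)

# 1D toy: f(x) = 0.5*x^2 with stochastic gradient g = x + z, z ~ N(0, 1).
def steady_state_msd(eta, T=20000, burn=2000):
    """Approximate stationary E[x^2] under constant-step SGD."""
    x, acc, cnt = 1.0, 0.0, 0
    for t in range(T):
        x -= eta * (x + rng.normal())
        if t >= burn:                 # discard the transient, average the rest
            acc += x * x
            cnt += 1
    return acc / cnt

big, small = steady_state_msd(0.2), steady_state_msd(0.02)
print(big, small)                     # ~10x smaller noise ball with 10x smaller eta
```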
SGD Convergence for Strongly Convex Functions
Statement
For f μ-strongly convex and L-smooth with bounded gradient variance σ², SGD with learning rate η_t = 1/(μt) satisfies:

E[f(x_T)] − f(x*) ≤ O(σ²/(μT))

The dominant term is σ²/(μT), giving a convergence rate of O(1/T).
Intuition
Strong convexity provides curvature that pulls iterates toward x* more aggressively. This improves the rate from O(1/√T) to O(1/T). The condition number κ = L/μ does not appear in the leading term, but μ does: stronger curvature means faster convergence.
Proof Sketch
Use the strong convexity inequality: ⟨∇f(x_t), x_t − x*⟩ ≥ f(x_t) − f(x*) + (μ/2)‖x_t − x*‖². Combined with the SGD update and taking expectations, this gives a recurrence on E[‖x_t − x*‖²]. With the decaying learning rate η_t = 1/(μt), the recurrence solves to E[‖x_T − x*‖²] = O(σ²/(μ²T)).
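The η_t = 1/(μt) schedule can be verified numerically. The 1D quadratic below is an illustrative assumption (with μ = σ = 1, the theory predicts E[x_T²] ≈ σ²/(μ²T)):

```python
import numpy as np

rng = np.random.default_rng(4)

mu, sigma = 1.0, 1.0   # toy objective f(x) = 0.5*mu*x^2, gradient noise N(0, sigma^2)

def sgd_decaying(T, reps=300):
    """Mean of x_T^2 under the schedule eta_t = 1/(mu*t), averaged over many runs."""
    errs = []
    for _ in range(reps):
        x = 2.0
        for t in range(1, T + 1):
            x -= (1.0 / (mu * t)) * (mu * x + sigma * rng.normal())
        errs.append(x * x)
    return float(np.mean(errs))       # theory predicts ~ sigma^2 / (mu^2 * T)

e100, e1000 = sgd_decaying(100), sgd_decaying(1000)
print(e100, e1000)                    # roughly 10x smaller at 10x the iterations
```

Unlike the constant-step case, the error keeps decreasing instead of plateauing at a noise floor.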
Why It Matters
This rate matches the minimax-optimal Ω(1/T) rate for strongly convex stochastic optimization. For the special case of least squares regression with n data points, this means O(1/n) suboptimality per epoch (n iterations), so a constant number of passes over the data suffices for a fixed accuracy level.
Failure Mode
The η_t = 1/(μt) rate requires knowing μ to set the learning rate. If μ is overestimated, the learning rate decays too fast and convergence stalls. If μ is underestimated, the learning rate stays too large and the iterates oscillate. Adaptive methods (AdaGrad, Adam) avoid this by adjusting rates per coordinate.
Mini-Batch SGD
Using a mini-batch of size B reduces variance: if g_t = (1/B) Σ_{i=1}^{B} g_t^{(i)} with independent samples, then:

Var(g_t) = σ²/B

This reduces the noise term in the convergence bound by a factor of B. But each iteration now costs B times as much compute. The total work (iterations × batch size) to achieve error ε is:

B · O(σ²/(Bε²)) = O(σ²/ε²)

for the convex case. The total work is independent of B. Mini-batches do not reduce total computation for convex problems; they trade fewer iterations for more work per iteration. The real benefit is parallelism: the B gradient computations can run simultaneously on a GPU.
Learning Rate Schedules
Robbins-Monro Conditions
A learning rate sequence {η_t} satisfies the Robbins-Monro conditions if:

Σ_t η_t = ∞ and Σ_t η_t² < ∞

The first condition ensures the iterates can reach any point. The second ensures the noise is eventually damped. Examples: η_t = 1/t satisfies both; η_t = 1/√t satisfies the first but not the second.
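Partial sums up to T = 10⁶ make the two conditions concrete (a finite sum cannot prove divergence, but the growth patterns are visible):

```python
import numpy as np

t = np.arange(1, 10**6 + 1, dtype=np.float64)

sum_1_t     = (1.0 / t).sum()           # ~ log T: slowly diverges
sum_1_t_sq  = (1.0 / t**2).sum()        # converges to pi^2/6 ~ 1.645
sum_sqrt    = (1.0 / np.sqrt(t)).sum()  # ~ 2*sqrt(T): diverges
sum_sqrt_sq = (1.0 / t).sum()           # squares of 1/sqrt(t) are 1/t: diverges
print(sum_1_t, sum_1_t_sq, sum_sqrt, sum_sqrt_sq)
```

So η_t = 1/t passes both conditions, while η_t = 1/√t passes only the first, matching the examples above.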
Common schedules in practice:
- Constant then decay. Train with a constant for most of training, then decay (linear or cosine) in the final phase.
- Cosine annealing. η_t = η_min + ½(η_max − η_min)(1 + cos(πt/T)). Smooth decay that spends more time at low learning rates.
- Linear warmup. Start with a small η and increase it linearly over the first few thousand steps, then follow one of the above schedules. This stabilizes early training, when parameter initialization is far from the eventual scale.
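The warmup-plus-cosine recipe can be sketched in a few lines. The specific constants (1000 warmup steps, η_max = 3e-4, η_min = 3e-5) are illustrative defaults, not values from the text:

```python
import math

def lr_schedule(t, total_steps, warmup_steps=1000, eta_max=3e-4, eta_min=3e-5):
    """Linear warmup into cosine annealing; all constants are illustrative."""
    if t < warmup_steps:
        return eta_max * (t + 1) / warmup_steps            # linear warmup
    progress = (t - warmup_steps) / max(1, total_steps - warmup_steps)
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * progress))

print(lr_schedule(0, 10000))      # tiny: start of warmup
print(lr_schedule(999, 10000))    # eta_max: end of warmup
print(lr_schedule(9999, 10000))   # ~eta_min: end of training
```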
Stationary Points in Nonconvex SGD
For nonconvex objectives, SGD convergence is typically stated in terms of stationarity rather than global optimality. The relevant distinction:
First-Order Stationary Point (FOSP)
A first-order stationary point satisfies ∇f(x) = 0. An ε-FOSP satisfies ‖∇f(x)‖ ≤ ε. Local minima, local maxima, and saddle points are all FOSPs.
Second-Order Stationary Point (SOSP)
A second-order stationary point satisfies ∇f(x) = 0 and ∇²f(x) ⪰ 0 (no negative Hessian eigenvalue). An ε-SOSP additionally requires λ_min(∇²f(x)) ≥ −√(ρε) for Hessian-Lipschitz constant ρ. Strict saddle points (those with a negative Hessian eigenvalue) are FOSPs but not SOSPs.
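The FOSP/SOSP distinction is easy to see on the classic saddle f(x, y) = x² − y² (an illustrative example, not from the text):

```python
import numpy as np

# f(x, y) = x^2 - y^2: the origin has zero gradient (a FOSP), but the Hessian
# diag(2, -2) has a negative eigenvalue, so it is a strict saddle, not a SOSP.
def grad(p):
    x, y = p
    return np.array([2.0 * x, -2.0 * y])

def hessian(p):
    return np.array([[2.0, 0.0], [0.0, -2.0]])

p0, eps = np.zeros(2), 1e-3
is_fosp = np.linalg.norm(grad(p0)) <= eps
lam_min = np.linalg.eigvalsh(hessian(p0)).min()
is_sosp = is_fosp and lam_min >= -eps
print(is_fosp, is_sosp)   # True False
```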
Standard nonconvex SGD analyses guarantee only an ε-FOSP in O(ε⁻⁴) stochastic gradient queries. Jin, Ge, Netrapalli, Kakade, Jordan (2017), "How to Escape Saddle Points Efficiently", showed that perturbed gradient descent finds an ε-SOSP in Õ(ε⁻²) iterations (deterministic setting). In the stochastic setting, analogous results (Jin et al., 2021) give polynomial rates for SGD to find approximate SOSPs under Hessian-Lipschitz assumptions.
Noise as Implicit Regularization
SGD noise is not purely harmful. Several observed benefits:
Escaping saddle points. Gradient noise helps SGD escape strict saddle points (where the Hessian has a negative eigenvalue, so they are FOSPs but not SOSPs). Full GD can get stuck at saddle points; SGD almost surely does not.
Flat minima preference. Empirical evidence suggests SGD converges to flatter minima (minima with smaller Hessian eigenvalues), which generalize better. Larger learning rates and smaller batch sizes increase noise, pushing SGD toward flatter regions.
Implicit bias of SGD. For linear models, SGD with small initialization converges to the minimum-norm solution. For matrix factorization problems, it converges to low-rank solutions. These implicit biases help generalization without explicit regularization.
Common Confusions
SGD is not the same as mini-batch GD with B = n
When B = n (full batch), you recover deterministic GD, not SGD. The convergence rates and implicit regularization properties are qualitatively different. Full-batch GD converges as O(1/T) for smooth convex functions vs O(1/√T) for SGD, but SGD's noise provides regularization benefits that full-batch GD lacks.
Adam is not SGD
Adam uses adaptive per-coordinate learning rates and momentum. Its convergence theory is different from SGD. Adam can diverge on simple convex problems (Reddi et al., 2018). The AMSGrad fix addresses this, but in practice, Adam often works well despite the theoretical gap.
Exercises
Problem
You are training a model with SGD on a μ-strongly convex loss with gradient-noise variance σ², and you want excess risk at most ε. Using the strongly convex rate E[f(x_T)] − f(x*) = O(σ²/(μT)), approximately how many iterations T do you need?
Problem
Prove that for SGD with constant learning rate η on a μ-strongly convex, L-smooth function, the iterates do not converge to x* but instead oscillate in a ball of radius O(σ√(η/μ)) around x*.
References
Canonical:
- Nemirovsky & Yudin, Problem Complexity and Method Efficiency in Optimization (1983)
- Robbins & Monro, "A Stochastic Approximation Method" (1951)
Current:
- Bottou, Curtis, Nocedal, "Optimization Methods for Large-Scale Machine Learning" (SIAM Review, 2018)
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapter 14
- Jin, Ge, Netrapalli, Kakade, Jordan, "How to Escape Saddle Points Efficiently" (2017), ICML; arXiv:1703.00887. Perturbed GD finds an ε-SOSP in Õ(ε⁻²) iterations.
- Boyd & Vandenberghe, Convex Optimization (2004), Chapters 2-5
- Nesterov, Introductory Lectures on Convex Optimization (2004), Chapters 1-3
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Gradient Descent Variants (Layer 1)
- Convex Optimization Basics (Layer 1)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
Builds on This
- Adam Optimizer (Layer 2)
- Batch Size and Learning Dynamics (Layer 2)
- Grokking (Layer 4)
- Learning Rate Scheduling (Layer 2)
- Test-Time Training and Adaptive Inference (Layer 5)