
Optimization Function Classes

Stochastic Gradient Descent Convergence

SGD convergence rates for convex and strongly convex functions, the role of noise as both curse and blessing, mini-batch variance reduction, learning rate schedules, and the Robbins-Monro conditions.


Why This Matters

Convergence rates: how fast does excess risk decrease with iterations?

Figure: excess risk versus iterations $T$. Accelerated GD decays as $O(1/T^2)$, SGD on strongly convex objectives as $O(1/T)$, SGD on convex objectives as $O(1/\sqrt{T})$; with a constant learning rate, SGD plateaus at a noise floor set by the gradient variance.

Gradient descent computes the full gradient $\nabla F(w) = \frac{1}{n}\sum_{i=1}^n \nabla f_i(w)$ at each step, costing $O(n)$ per iteration. When $n$ is millions or billions (as in modern ML), this is too expensive. SGD replaces the full gradient with a stochastic estimate: the gradient of a single sample (or small mini-batch). The cost drops to $O(1)$ per iteration.

The trade-off: SGD's gradient estimate is noisy, which slows convergence relative to full gradient descent. But this noise also provides implicit regularization, helping SGD find solutions that generalize better. Understanding SGD convergence theory explains learning rate scheduling, batch size and learning dynamics, and why SGD remains the default optimizer for deep learning.

Setup

We minimize $F(w) = \mathbb{E}_{\xi}[f(w; \xi)]$ where $\xi$ is a random data sample. The theory relies on properties from convex optimization. The SGD update is:

$$w_{t+1} = w_t - \eta_t g_t$$

where $g_t = \nabla f(w_t; \xi_t)$ is a stochastic gradient satisfying $\mathbb{E}[g_t \mid w_t] = \nabla F(w_t)$ (unbiasedness).
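A minimal sketch of this update loop on a toy least-squares problem (the data, dimensions, and step size below are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: f_i(w) = (x_i @ w - y_i)^2 / 2
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def stochastic_grad(w, i):
    """grad f_i(w) for one sampled index i; unbiased for grad F(w)."""
    return (X[i] @ w - y[i]) * X[i]

w = np.zeros(d)
eta = 0.01                                # constant step size for the sketch
for t in range(20000):
    i = rng.integers(n)                   # draw xi_t uniformly from the data
    w = w - eta * stochastic_grad(w, i)   # w_{t+1} = w_t - eta * g_t
```

Each step touches a single row of `X`, so the per-iteration cost is $O(d)$ regardless of $n$.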

Definition

Stochastic Gradient Oracle

A stochastic gradient oracle returns $g_t$ such that:

  1. $\mathbb{E}[g_t \mid w_t] = \nabla F(w_t)$ (unbiased)
  2. $\mathbb{E}[\|g_t - \nabla F(w_t)\|^2 \mid w_t] \leq \sigma^2$ (bounded variance)

The variance $\sigma^2$ measures the noise level. For a single sample drawn from a finite dataset, $\sigma^2$ depends on the heterogeneity of the individual gradients $\nabla f_i(w)$.
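Both oracle properties can be checked empirically for a finite-sum objective by comparing per-sample gradients to the full gradient (the toy data below is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

# Finite-sum least squares: F(w) = (1/n) sum_i (x_i @ w - y_i)^2 / 2
n, d = 500, 3
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(size=n)

w = rng.normal(size=d)
G = (X @ w - y)[:, None] * X            # row i holds grad f_i(w)
full_grad = G.mean(axis=0)              # grad F(w): average of per-sample gradients

# Empirical sigma^2 at this w: mean squared deviation from the full gradient
sigma2 = np.mean(np.sum((G - full_grad) ** 2, axis=1))
```

Sampling $i$ uniformly makes the single-sample gradient exactly unbiased; `sigma2` here is the variance at this particular $w$, and it generally changes as $w$ moves.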

Main Theorems

Theorem

SGD Convergence for Convex Functions

Statement

For convex, $L$-smooth $F$ with bounded gradient variance $\sigma^2$, SGD with learning rate $\eta = \frac{1}{L\sqrt{T}}$ satisfies:

$$\mathbb{E}[F(\bar{w}_T)] - F(w^*) \leq \frac{L\|w_0 - w^*\|^2}{2\sqrt{T}} + \frac{\sigma^2}{2L\sqrt{T}}$$

where $\bar{w}_T = \frac{1}{T}\sum_{t=0}^{T-1} w_t$ is the average iterate. The convergence rate is $O(1/\sqrt{T})$.

Intuition

The first term is the deterministic convergence rate (how fast GD would converge if there were no noise). The second term is the noise penalty. Both decrease as $O(1/\sqrt{T})$. Compared to full GD, which converges as $O(1/T)$ for smooth convex functions, SGD is slower by a factor of $\sqrt{T}$. This is the price of using cheap, noisy gradients.

Proof Sketch

Start with the smoothness inequality: $F(w_{t+1}) \leq F(w_t) + \langle \nabla F(w_t), w_{t+1} - w_t \rangle + \frac{L}{2}\|w_{t+1} - w_t\|^2$. Substitute $w_{t+1} - w_t = -\eta g_t$ and take expectations. The cross term gives $\mathbb{E}[\langle \nabla F(w_t), -\eta g_t \rangle] = -\eta \|\nabla F(w_t)\|^2$ (by unbiasedness). The quadratic term gives $\frac{L\eta^2}{2}(\|\nabla F(w_t)\|^2 + \sigma^2)$. Combine, use convexity to relate $\|\nabla F(w_t)\|^2$ to $F(w_t) - F(w^*)$, telescope over $t$, and choose $\eta$ to balance the two terms.

Why It Matters

This $O(1/\sqrt{T})$ rate is optimal for first-order stochastic methods on convex functions. No algorithm using only stochastic gradient queries can do better in the worst case (Nemirovsky and Yudin, 1983). The rate determines how many epochs (passes over the data) you need for a given accuracy.

Failure Mode

The bound requires a constant learning rate tuned to $T$ (the total number of iterations), which must be known in advance. In practice, this is handled by decaying the learning rate. If the learning rate does not decay, SGD oscillates around the optimum with radius proportional to $\eta\sigma$.
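The oscillation is easy to see on a one-dimensional strongly convex toy objective (values below are illustrative): with a constant step size, the squared distance to the optimum plateaus at a floor that scales with the learning rate.

```python
import numpy as np

rng = np.random.default_rng(2)

# F(w) = mu/2 * w^2 with noisy gradient g_t = mu*w + sigma*xi, xi ~ N(0, 1)
mu, sigma = 1.0, 1.0

def stationary_error(eta, T=20000, tail=5000):
    w = 5.0
    sq = []
    for t in range(T):
        g = mu * w + sigma * rng.normal()   # unbiased, variance sigma^2
        w -= eta * g
        if t >= T - tail:                   # record after the transient dies out
            sq.append(w * w)
    return np.mean(sq)                      # average squared distance to w* = 0

floor_small = stationary_error(eta=0.01)
floor_large = stationary_error(eta=0.1)     # ~10x the step size, ~10x the floor
```

Shrinking $\eta$ lowers the floor but slows the initial transient, which is exactly the trade-off a decaying schedule resolves.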

Theorem

SGD Convergence for Strongly Convex Functions

Statement

For $\mu$-strongly convex, $L$-smooth $F$ with bounded gradient variance $\sigma^2$, SGD with learning rate $\eta_t = \frac{2}{\mu(t+1)}$ satisfies:

$$\mathbb{E}[F(w_T)] - F(w^*) \leq \frac{2\sigma^2}{\mu T} + \frac{2L\|w_0 - w^*\|^2}{T^2}$$

The dominant term is $O(\sigma^2/(\mu T))$, giving a convergence rate of $O(1/T)$.

Intuition

Strong convexity provides curvature that pulls iterates toward $w^*$ more aggressively. This improves the rate from $O(1/\sqrt{T})$ to $O(1/T)$. The condition number $\kappa = L/\mu$ does not appear in the leading term, but $\mu$ does: stronger curvature means faster convergence.

Proof Sketch

Use the strong convexity inequality: $F(w^*) \geq F(w_t) + \langle \nabla F(w_t), w^* - w_t \rangle + \frac{\mu}{2}\|w^* - w_t\|^2$. Combined with the SGD update and taking expectations, this gives a recurrence on $\mathbb{E}[\|w_t - w^*\|^2]$. With the decaying learning rate $\eta_t = 2/(\mu(t+1))$, the recurrence solves to $O(\sigma^2/(\mu T))$.

Why It Matters

This $O(1/T)$ rate matches the minimax-optimal rate for strongly convex stochastic optimization. For the special case of least-squares regression with $n$ data points, this means $O(n/T)$ suboptimality per epoch, so a constant number of passes over the data suffices for a fixed accuracy level.

Failure Mode

The rate requires knowing $\mu$ to set the learning rate. If $\mu$ is overestimated, the learning rate decays too fast and convergence stalls. If $\mu$ is underestimated, the learning rate stays too large and the iterates oscillate. Adaptive methods (AdaGrad, Adam) avoid this by adjusting rates per coordinate.
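The $O(1/T)$ behavior under the $\eta_t = 2/(\mu(t+1))$ schedule can be checked numerically on a one-dimensional quadratic (an illustrative sketch, not the theorem's proof): quadrupling $T$ should cut the expected error roughly fourfold.

```python
import numpy as np

rng = np.random.default_rng(3)

mu, sigma = 1.0, 1.0   # F(w) = mu/2 * w^2, gradient noise variance sigma^2

def mean_final_error(T, reps=200):
    errs = []
    for _ in range(reps):
        w = 5.0
        for t in range(T):
            eta = 2.0 / (mu * (t + 1))       # the decaying schedule from the theorem
            w -= eta * (mu * w + sigma * rng.normal())
        errs.append(w * w)
    return np.mean(errs)                     # Monte Carlo estimate of E||w_T - w*||^2

e_500 = mean_final_error(T=500)
e_2000 = mean_final_error(T=2000)            # 4x iterations -> roughly 4x smaller error
```

Averaging over repetitions is needed because any single run's final error is itself random.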

Mini-Batch SGD

Using a mini-batch of size $B$ reduces variance: if $g_t^{(B)} = \frac{1}{B}\sum_{j=1}^B \nabla f(w_t; \xi_t^{(j)})$, then:

$$\operatorname{Var}(g_t^{(B)}) = \frac{\sigma^2}{B}$$

This reduces the noise term in the convergence bound by a factor of $B$. But each iteration now costs $B$ times as much compute. The total work to achieve error $\epsilon$ is:

$$\text{Work} = B \times T \propto B \times \frac{\sigma^2}{B\epsilon^2} = \frac{\sigma^2}{\epsilon^2}$$

for the convex case. The total work is independent of $B$. Mini-batches do not reduce total computation for convex problems; they trade fewer iterations for more work per iteration. The real benefit is parallelism: the $B$ gradient computations can run simultaneously on a GPU.
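The $\sigma^2/B$ scaling is straightforward to verify by simulation (the gradient model below is a hypothetical stand-in: a fixed full gradient plus i.i.d. noise):

```python
import numpy as np

rng = np.random.default_rng(4)

sigma2 = 4.0        # per-sample gradient variance
full_grad = 1.5     # the quantity the mini-batch average estimates

def minibatch_grad(B):
    # Average of B independent noisy per-sample gradients
    return np.mean(full_grad + np.sqrt(sigma2) * rng.normal(size=B))

# Empirical variance of the mini-batch estimator for several batch sizes;
# var[B] should be close to sigma2 / B
var = {B: np.var([minibatch_grad(B) for B_samples in range(20000)])
       for B in (1, 4, 16)}
```

Note the estimator stays unbiased for every $B$; only its spread shrinks.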

Learning Rate Schedules

Definition

Robbins-Monro Conditions

A learning rate sequence $\{\eta_t\}$ satisfies the Robbins-Monro conditions if:

$$\sum_{t=1}^{\infty} \eta_t = \infty \quad \text{and} \quad \sum_{t=1}^{\infty} \eta_t^2 < \infty$$

The first condition ensures the iterates can reach any point. The second ensures the noise is eventually damped. Examples: $\eta_t = c/t$ satisfies both; $\eta_t = c/\sqrt{t}$ satisfies the first but not the second.
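The partial sums make the distinction concrete (numerical illustration with $c = 1$):

```python
import math

N = 10**6

# eta_t = 1/t: sum diverges (grows like log N), sum of squares converges to pi^2/6
sum_1t  = sum(1.0 / k for k in range(1, N + 1))
sum_1t2 = sum(1.0 / k**2 for k in range(1, N + 1))

# eta_t = 1/sqrt(t): sum diverges (grows like 2*sqrt(N)), but the sum of squares
# is sum 1/t, which also diverges -- so the second condition fails
sum_sqrt = sum(1.0 / math.sqrt(k) for k in range(1, N + 1))
```

At $N = 10^6$ the convergent sum is already within $10^{-3}$ of $\pi^2/6$, while both divergent sums keep growing without bound.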

Common schedules in practice:

  • Constant then decay. Train with a constant $\eta$ for most of training, then decay (linear or cosine) in the final phase.
  • Cosine annealing. $\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})(1 + \cos(\pi t / T))$. Smooth decay that spends more time at low learning rates.
  • Linear warmup. Start with a small $\eta$ and increase linearly for the first few thousand steps, then follow one of the above schedules. This stabilizes early training when parameter initialization is far from the eventual scale.
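The warmup-then-cosine pattern can be written as a single schedule function (the parameter names and default values below are illustrative):

```python
import math

def lr_schedule(t, total_steps, warmup_steps=1000, eta_max=1e-3, eta_min=1e-5):
    """Linear warmup to eta_max, then cosine annealing down to eta_min."""
    if t < warmup_steps:
        return eta_max * (t + 1) / warmup_steps            # linear warmup
    progress = (t - warmup_steps) / max(1, total_steps - warmup_steps)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * progress))
```

The schedule peaks at `eta_max` exactly when warmup ends and reaches `eta_min` at `total_steps`.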

Stationary Points in Nonconvex SGD

For nonconvex objectives, SGD convergence is typically stated in terms of stationarity rather than global optimality. The relevant distinction:

Definition

First-Order Stationary Point (FOSP)

A first-order stationary point satisfies $\nabla F(w) = 0$. An $\epsilon$-FOSP satisfies $\|\nabla F(w)\| \leq \epsilon$. Local minima, local maxima, and saddle points are all FOSPs.

Definition

Second-Order Stationary Point (SOSP)

A second-order stationary point satisfies $\nabla F(w) = 0$ and $\nabla^2 F(w) \succeq 0$ (no negative Hessian eigenvalue). An $\epsilon$-SOSP adds $\lambda_{\min}(\nabla^2 F(w)) \geq -\sqrt{\rho \epsilon}$ for Hessian-Lipschitz constant $\rho$. Saddle points are FOSPs but not SOSPs.

Standard nonconvex SGD analyses guarantee only an $\epsilon$-FOSP in $O(1/\epsilon^4)$ stochastic gradient queries. Jin, Ge, Netrapalli, Kakade, and Jordan (2017), "How to Escape Saddle Points Efficiently", showed that perturbed gradient descent finds an $\epsilon$-SOSP in $\tilde{O}(1/\epsilon^2)$ iterations (deterministic setting). In the stochastic setting, analogous results (Jin et al., 2021) give polynomial rates for SGD to find approximate SOSPs under Hessian-Lipschitz assumptions.

Noise as Implicit Regularization

SGD noise is not purely harmful. Several observed benefits:

Escaping saddle points. Gradient noise helps SGD escape strict saddle points (where the Hessian has a negative eigenvalue, so they are FOSPs but not SOSPs). Full GD can get stuck at saddle points; SGD almost surely does not.
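A minimal illustration on the strict saddle $f(x, y) = x^2 - y^2$ (a toy example, not from the text): deterministic GD initialized exactly at the saddle never moves, while gradient noise pushes the iterate onto the escape direction.

```python
import numpy as np

rng = np.random.default_rng(5)

def grad(p):
    # f(x, y) = x^2 - y^2; the origin is a FOSP but not a SOSP
    return np.array([2.0 * p[0], -2.0 * p[1]])

eta, steps = 0.01, 200

p_gd = np.zeros(2)                  # full GD from the saddle: grad(0) = 0, so it stays put
for _ in range(steps):
    p_gd -= eta * grad(p_gd)

p_sgd = np.zeros(2)                 # SGD: noise breaks the symmetry along the y direction
for _ in range(steps):
    p_sgd -= eta * (grad(p_sgd) + rng.normal(size=2))
```

Once noise puts any mass on the negative-curvature direction, the dynamics amplify it geometrically, which is why the escape is almost sure rather than merely possible.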

Flat minima preference. Empirical evidence suggests SGD converges to flatter minima (minima with smaller Hessian eigenvalues), which generalize better. Larger learning rates and smaller batch sizes increase noise, pushing SGD toward flatter regions.

Implicit bias of SGD. For linear models, SGD with small initialization converges to the minimum-norm solution. For matrix factorization problems, it converges to low-rank solutions. These implicit biases help generalization without explicit regularization.

Common Confusions

Watch Out

SGD is not the same as mini-batch GD with $B = n$

When $B = n$ (full batch), you recover deterministic GD, not SGD. The convergence rates and implicit regularization properties are qualitatively different. Full-batch GD converges as $O(1/T)$ for smooth convex functions vs $O(1/\sqrt{T})$ for SGD, but SGD's noise provides regularization benefits that full-batch GD lacks.

Watch Out

Adam is not SGD

Adam uses adaptive per-coordinate learning rates and momentum. Its convergence theory is different from SGD. Adam can diverge on simple convex problems (Reddi et al., 2018). The AMSGrad fix addresses this, but in practice, Adam often works well despite the theoretical gap.

Exercises

ExerciseCore

Problem

You are training a model with SGD on a $\mu$-strongly convex loss with $\mu = 0.01$, $\sigma^2 = 1$, and you want $\mathbb{E}[F(w_T) - F(w^*)] \leq 0.001$. Using the strongly convex rate $O(\sigma^2/(\mu T))$, approximately how many iterations $T$ do you need?

ExerciseAdvanced

Problem

Prove that for SGD with constant learning rate $\eta$ on a $\mu$-strongly convex, $L$-smooth function, the iterates do not converge to $w^*$ but instead oscillate in a ball of radius $O(\eta\sigma^2/\mu)$ around $w^*$.

References

Canonical:

  • Nemirovsky & Yudin, Problem Complexity and Method Efficiency in Optimization (1983)
  • Robbins & Monro, "A Stochastic Approximation Method" (1951)

Current:

  • Bottou, Curtis, Nocedal, "Optimization Methods for Large-Scale Machine Learning" (SIAM Review, 2018)

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapter 14

  • Jin, Ge, Netrapalli, Kakade, Jordan, "How to Escape Saddle Points Efficiently" (2017), ICML; arXiv:1703.00887. Perturbed GD finds an $\epsilon$-SOSP in $\tilde{O}(1/\epsilon^2)$ iterations.

  • Boyd & Vandenberghe, Convex Optimization (2004), Chapters 2-5

  • Nesterov, Introductory Lectures on Convex Optimization (2004), Chapters 1-3

Last reviewed: April 2026
