
Optimization Function Classes

Gradient Descent Variants

From full-batch to stochastic to mini-batch gradient descent, plus momentum, Nesterov acceleration, AdaGrad, RMSProp, and Adam. Why mini-batch SGD with momentum is the practical default.

Core · Tier 1 · Stable · ~45 min

Why This Matters

Every neural network is trained by some variant of gradient descent (or gradient ascent when maximizing a reward or likelihood). The choice of optimizer affects convergence speed, final model quality, and generalization. Understanding the differences between SGD, momentum, and Adam is necessary for debugging training and making informed choices.

Full-batch GD: smooth, slow · Mini-batch SGD: some noise, faster · SGD (batch = 1): noisy, fast per step

Full-Batch Gradient Descent

Definition

Full-Batch Gradient Descent

Given a loss function $L(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(\theta; x_i, y_i)$, full-batch gradient descent updates:

$$\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$$

where $\eta > 0$ is the learning rate and $\nabla L(\theta_t)$ is computed using all $n$ training examples.

Full-batch GD computes the exact gradient. Each step is expensive ($O(n)$ per update), and for large $n$ this is impractical. A single epoch requires one gradient computation and one parameter update.
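The update above can be sketched on a toy least-squares problem. This is a minimal illustration, assuming synthetic noiseless data and made-up constants (dimensions, learning rate, step count are not from the text):

```python
import numpy as np

# Full-batch gradient descent on L(theta) = (1/n) * sum of 0.5*(x_i^T theta - y_i)^2.
rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true                      # noiseless targets, so theta_true is optimal

def full_grad(theta):
    # Exact gradient, averaged over all n examples: one O(n) pass per update.
    return X.T @ (X @ theta - y) / n

theta = np.zeros(d)
eta = 0.1
for t in range(500):                    # one gradient computation = one update = one epoch
    theta = theta - eta * full_grad(theta)
# theta is now very close to theta_true
```

Each iteration touches all $n$ rows of `X`, which is the cost the stochastic variants below avoid.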

Stochastic Gradient Descent

Definition

Stochastic Gradient Descent (SGD)

At each step, sample a single example $(x_i, y_i)$ uniformly at random and update:

$$\theta_{t+1} = \theta_t - \eta \nabla \ell(\theta_t; x_i, y_i)$$

The stochastic gradient $\nabla \ell(\theta_t; x_i, y_i)$ is an unbiased estimate of the full gradient: $\mathbb{E}_i[\nabla \ell(\theta_t; x_i, y_i)] = \nabla L(\theta_t)$.

SGD is cheap per step ($O(1)$ per update) but noisy. The noise can be beneficial: it helps escape shallow local minima and saddle points. But it also means the iterates oscillate and do not converge to an exact minimum without a decreasing learning rate schedule.
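The same toy least-squares setup makes the single-sample update concrete. A minimal sketch, assuming illustrative data and an illustrative $\eta_t \propto 1/\sqrt{t}$ schedule:

```python
import numpy as np

# SGD with batch size 1: each step costs O(1) in the dataset size.
rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true                            # noiseless targets

theta = np.zeros(d)
for t in range(5000):
    i = rng.integers(n)                       # one example, uniformly at random
    g = X[i] * (X[i] @ theta - y[i])          # unbiased estimate of the full gradient
    eta = 0.1 / np.sqrt(t + 1)                # decreasing schedule tames the noise
    theta = theta - eta * g
```

With a fixed $\eta$ the iterates would keep jittering around the minimum; the decreasing schedule lets them settle.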

Mini-Batch SGD

Definition

Mini-Batch Gradient Descent

At each step, sample a mini-batch $B \subset \{1, \ldots, n\}$ of size $|B| = b$ and update:

$$\theta_{t+1} = \theta_t - \eta \cdot \frac{1}{b} \sum_{i \in B} \nabla \ell(\theta_t; x_i, y_i)$$

Typical batch sizes: 32, 64, 128, 256, or 512.

Mini-batch SGD is the practical default. It balances the noise reduction of averaging over multiple samples with the computational efficiency of not using all samples. Larger batches reduce variance (smoother updates) but cost more per step. Smaller batches add noise but allow more updates per epoch.
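A minimal sketch of the shuffled-epoch loop most frameworks implement, again on an illustrative least-squares problem (data, batch size, and epoch count are made-up choices):

```python
import numpy as np

# Mini-batch SGD: shuffle once per epoch, then walk through the data in batches.
rng = np.random.default_rng(0)
n, d, b = 256, 3, 32
X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true

theta = np.zeros(d)
eta = 0.1
for epoch in range(100):
    perm = rng.permutation(n)                            # reshuffle each epoch
    for start in range(0, n, b):
        idx = perm[start:start + b]
        g = X[idx].T @ (X[idx] @ theta - y[idx]) / b     # averaged mini-batch gradient
        theta = theta - eta * g
```

Averaging over `b` examples divides the gradient variance by roughly `b`, while each epoch still performs `n / b` updates.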

Theorem

SGD Convergence for Smooth Convex Functions

Statement

Let $L(\theta)$ be convex and $L$-smooth. Let the stochastic gradient have bounded variance: $\mathbb{E}[\|\nabla \ell(\theta; x_i) - \nabla L(\theta)\|^2] \leq \sigma^2$. With learning rate $\eta = \frac{1}{\sqrt{T}}$, after $T$ steps of SGD:

$$\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T} \|\nabla L(\theta_t)\|^2\right] \leq O\left(\frac{L \, \|\theta_0 - \theta^*\|^2}{\sqrt{T}} + \frac{\sigma}{\sqrt{T}}\right)$$

The convergence rate is $O(1/\sqrt{T})$, which is slower than the $O(1/T)$ rate of full-batch GD for smooth convex functions.

Intuition

The $\sigma / \sqrt{T}$ term is the price of stochasticity. No matter how many steps you take, the noise in the gradient prevents exact convergence with a fixed learning rate. You need a decreasing schedule ($\eta_t \to 0$) for exact convergence, which slows things further.

Proof Sketch

Apply the descent lemma: $L(\theta_{t+1}) \leq L(\theta_t) + \langle \nabla L(\theta_t), \theta_{t+1} - \theta_t \rangle + \frac{L}{2}\|\theta_{t+1} - \theta_t\|^2$. Substitute the SGD update, take expectations, and telescope the sum. The variance term $\sigma^2$ prevents the telescoping from collapsing to zero.

Why It Matters

This rate explains why SGD is slow in theory but fast in practice. The constant hidden in the $O$ notation depends on $n$ for full-batch but not for SGD. For large $n$, SGD reaches a given accuracy much sooner in wall-clock time despite the worse rate per iteration.

Failure Mode

The bound requires convexity. For non-convex objectives (neural networks), SGD convergence to a stationary point (not a minimum) has rate $O(1/\sqrt{T})$ under smoothness, but there is no guarantee the stationary point is good.

Stationary Points in Nonconvex Optimization

For nonconvex $L$, not every stationary point is a local minimum. Saddle points satisfy $\nabla L(\theta) = 0$ but have directions of negative curvature. The standard taxonomy:

Definition

First-Order Stationary Point (FOSP)

A point $\theta$ is a first-order stationary point of a differentiable $L$ if $\nabla L(\theta) = 0$. An $\epsilon$-FOSP satisfies $\|\nabla L(\theta)\| \leq \epsilon$. FOSPs include local minima, local maxima, and saddle points.

Definition

Second-Order Stationary Point (SOSP)

A point $\theta$ is a second-order stationary point of a twice-differentiable $L$ if $\nabla L(\theta) = 0$ and $\nabla^2 L(\theta) \succeq 0$ (no negative eigenvalue). An $\epsilon$-SOSP satisfies $\|\nabla L(\theta)\| \leq \epsilon$ and $\lambda_{\min}(\nabla^2 L(\theta)) \geq -\sqrt{\rho \epsilon}$ for a Hessian-Lipschitz constant $\rho$. Saddle points are FOSPs but not SOSPs.

Plain gradient descent can get stuck at strict saddle points. Jin, Ge, Netrapalli, Kakade, and Jordan (2017), "How to Escape Saddle Points Efficiently", showed that perturbed gradient descent (adding occasional isotropic noise) finds an $\epsilon$-SOSP in $\tilde{O}(1/\epsilon^2)$ iterations for Hessian-Lipschitz objectives. This matches the rate for finding an $\epsilon$-FOSP up to polylogarithmic factors, so escaping saddles is nearly free.
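A toy illustration of the idea, in the spirit of perturbed gradient descent but not the paper's exact algorithm: the objective $f(x, y) = x^2 + (y^2 - 1)^2 / 4$ (an illustrative choice) has a strict saddle at the origin and minima at $(0, \pm 1)$.

```python
import numpy as np

# Gradient of f(x, y) = x^2 + (y^2 - 1)^2 / 4:
# strict saddle at (0, 0), minima at (0, +1) and (0, -1).
def grad(p):
    x, y = p
    return np.array([2.0 * x, y**3 - y])

rng = np.random.default_rng(0)
eta = 0.1

# Plain GD started on the saddle's stable manifold (y = 0) converges to the saddle.
p = np.array([0.5, 0.0])
for _ in range(400):
    p = p - eta * grad(p)
stuck = p          # ends at the saddle (0, 0), not at a minimum

# Perturbed GD: whenever the gradient is tiny, add a small isotropic kick.
# The kick puts weight on the negative-curvature direction, and GD then escapes.
p = np.array([0.5, 0.0])
for _ in range(400):
    if np.linalg.norm(grad(p)) < 1e-3:
        p = p + rng.normal(scale=1e-3, size=2)
    p = p - eta * grad(p)
escaped = p        # ends near one of the minima (0, +1) or (0, -1)
```

The perturbation is only triggered near stationarity, which is also roughly how the perturbed-GD analysis deploys its noise.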

Momentum

Definition

SGD with Momentum

Momentum adds a velocity term that accumulates past gradients:

$$v_{t+1} = \beta v_t + \nabla L(\theta_t)$$
$$\theta_{t+1} = \theta_t - \eta v_{t+1}$$

where $\beta \in [0, 1)$ is the momentum coefficient, typically 0.9.

Momentum smooths out oscillations in directions with high curvature and accelerates progress in directions with consistent gradient. It acts like a heavy ball rolling downhill: it keeps moving even when the local gradient is small.
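The heavy-ball behavior shows up clearly on an ill-conditioned quadratic. A minimal sketch, assuming an illustrative objective $f(\theta) = \frac{1}{2}\theta^\top \mathrm{diag}(1, 25)\,\theta$ and made-up constants:

```python
import numpy as np

# SGD-with-momentum update (here with exact gradients) on a quadratic whose
# curvatures differ by a factor of 25, where plain GD with a safe learning
# rate crawls along the flat axis.
H = np.diag([1.0, 25.0])
grad = lambda th: H @ th

theta = np.array([1.0, 1.0])
v = np.zeros(2)
eta, beta = 0.02, 0.9
for _ in range(500):
    v = beta * v + grad(theta)     # velocity accumulates past gradients
    theta = theta - eta * v        # step along the smoothed direction
```

The velocity averages out the oscillation along the steep axis while compounding the small, consistent gradients along the flat one.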

Nesterov Momentum

Nesterov's modification computes the gradient at the "look-ahead" position:

$$v_{t+1} = \beta v_t + \nabla L(\theta_t - \eta \beta v_t)$$
$$\theta_{t+1} = \theta_t - \eta v_{t+1}$$

The idea: evaluate the gradient where momentum would take you, not where you currently are. This provides a corrective signal that reduces overshooting.
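The look-ahead change is one line of code relative to heavy-ball momentum. A sketch on the same kind of illustrative quadratic as above:

```python
import numpy as np

# Nesterov momentum: identical to heavy ball except the gradient is taken
# at the look-ahead point theta - eta * beta * v.
H = np.diag([1.0, 25.0])
grad = lambda th: H @ th

theta = np.array([1.0, 1.0])
v = np.zeros(2)
eta, beta = 0.02, 0.9
for _ in range(500):
    lookahead = theta - eta * beta * v   # where momentum alone would carry us
    v = beta * v + grad(lookahead)       # corrective gradient at that point
    theta = theta - eta * v
```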

Adaptive Learning Rate Methods

AdaGrad

Definition

AdaGrad

AdaGrad scales the learning rate per parameter by the accumulated squared gradients:

$$G_{t+1} = G_t + g_t \odot g_t$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_{t+1} + \epsilon}} \odot g_t$$

where $g_t = \nabla L(\theta_t)$, $\odot$ is element-wise multiplication, and $\epsilon \approx 10^{-8}$ prevents division by zero.

AdaGrad gives larger updates to infrequent parameters and smaller updates to frequent ones. This is good for sparse features. The problem: GtG_t only grows, so the effective learning rate monotonically decreases and eventually becomes too small to make progress.
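The sparse-feature behavior can be seen directly. A minimal sketch with a made-up gradient pattern: one coordinate receives a gradient every step, the other only every tenth step.

```python
import numpy as np

# One AdaGrad update: accumulate squared gradients, then take a per-parameter
# scaled step.
def adagrad_step(theta, G, g, eta=0.1, eps=1e-8):
    G = G + g * g
    theta = theta - eta / np.sqrt(G + eps) * g
    return theta, G

theta, G = np.zeros(2), np.zeros(2)
for t in range(1000):
    # coordinate 0 sees a gradient every step; coordinate 1 only every 10th step
    g = np.array([1.0, 1.0 if t % 10 == 0 else 0.0])
    theta, G = adagrad_step(theta, G, g)

# Effective per-coordinate step eta / sqrt(G): the rarely-updated coordinate
# keeps a step about sqrt(10) times larger than the frequently-updated one,
# and both steps only shrink as training continues.
```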

RMSProp

RMSProp fixes AdaGrad's decaying learning rate by using an exponential moving average instead of an accumulation:

$$v_{t+1} = \rho v_t + (1 - \rho) g_t \odot g_t$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_{t+1} + \epsilon}} \odot g_t$$

where $\rho \approx 0.99$. The running average forgets old gradients, so the effective learning rate adapts to recent gradient magnitudes rather than all past gradients.
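The contrast with AdaGrad is that $v$ saturates instead of growing without bound. A sketch with an illustrative constant gradient:

```python
import numpy as np

# One RMSProp update: exponential moving average of squared gradients.
def rmsprop_step(theta, v, g, eta=0.01, rho=0.99, eps=1e-8):
    v = rho * v + (1 - rho) * g * g
    theta = theta - eta / np.sqrt(v + eps) * g
    return theta, v

theta, v = np.zeros(1), np.zeros(1)
g = np.array([3.0])                 # constant gradient of magnitude 3
for _ in range(2000):
    theta, v = rmsprop_step(theta, v, g)
# v converges to g * g = 9, so the effective step settles near eta / 3
# instead of shrinking toward zero the way AdaGrad's would.
```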

Adam

Proposition

Adam Optimizer

Statement

Adam combines momentum (first moment) and RMSProp (second moment) with bias correction. Let $g_t = \nabla L(\theta_t)$:

$$m_{t+1} = \beta_1 m_t + (1 - \beta_1) g_t$$
$$v_{t+1} = \beta_2 v_t + (1 - \beta_2) g_t \odot g_t$$
$$\hat{m}_{t+1} = \frac{m_{t+1}}{1 - \beta_1^{t+1}}, \quad \hat{v}_{t+1} = \frac{v_{t+1}}{1 - \beta_2^{t+1}}$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_{t+1}} + \epsilon} \odot \hat{m}_{t+1}$$

Default hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\eta = 10^{-3}$, $\epsilon = 10^{-8}$.

Intuition

Adam adapts the learning rate per parameter using the ratio of the first moment (gradient direction) to the square root of the second moment (gradient magnitude). Parameters with consistently large gradients get smaller effective learning rates; parameters with small gradients get larger ones. The bias correction accounts for the zero initialization of mm and vv.

Proof Sketch

The bias correction ensures $\mathbb{E}[\hat{m}_t] = \mathbb{E}[g_t]$ and $\mathbb{E}[\hat{v}_t] = \mathbb{E}[g_t^2]$ under stationarity. Without correction, the initial estimates are biased toward zero because $m_0 = 0$ and $v_0 = 0$.

Why It Matters

Adam is the most widely used optimizer for deep learning. It converges reliably with the default hyperparameters ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$) proposed by Kingma and Ba (2015) for most supervised deep learning tasks, including Transformer language models at standard scales. Its per-parameter adaptive learning rate means it requires less tuning than SGD with momentum.

Failure Mode

Adam can converge to worse solutions than SGD with momentum on some problems, particularly in computer vision (ResNets on ImageNet). The adaptive learning rate can be too aggressive, steering the iterates toward sharp minima that generalize poorly. AdamW (Adam with decoupled weight decay) fixes some of these issues.
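The four update equations translate directly into a loop. A minimal sketch with the stated defaults, applied to an illustrative quadratic whose per-coordinate curvatures differ by a factor of 100:

```python
import numpy as np

# Adam: first moment, second moment, bias correction, scaled step.
def adam(grad_fn, theta, steps, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m, v = np.zeros_like(theta), np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g            # first moment (momentum)
        v = beta2 * v + (1 - beta2) * g * g        # second moment (RMSProp)
        m_hat = m / (1 - beta1 ** t)               # bias correction: m, v start at 0
        v_hat = v / (1 - beta2 ** t)
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Curvatures 1 and 100 in the two coordinates: a single global learning rate
# serves both because Adam rescales each coordinate by its own gradient history.
curv = np.array([1.0, 100.0])
theta = adam(lambda th: curv * th, np.array([1.0, 1.0]), steps=5000)
```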

Choosing an Optimizer

The practical default for most deep learning:

  • Start with Adam (or AdamW) with default hyperparameters. It works without tuning for most problems.
  • Switch to SGD with momentum if you have compute budget for hyperparameter tuning and want the best possible generalization (common in computer vision).
  • Use AdaGrad only for sparse problems (NLP with bag-of-words features, recommendation systems with sparse user-item interactions).

Common Confusions

Watch Out

Learning rate for Adam is not comparable to SGD learning rate

Adam's default learning rate is $10^{-3}$, while SGD often uses $10^{-1}$. These are not comparable because Adam divides by the second moment estimate. The effective step size for Adam is $\eta / \sqrt{\hat{v}_t}$, which adapts per parameter.

Watch Out

Batch size affects the implicit learning rate

Doubling the batch size halves the gradient variance. If you double the batch size without adjusting the learning rate, you get a smoother but slower optimizer. The linear scaling rule suggests multiplying the learning rate by the batch size increase factor, but this is an approximation that breaks for very large batches.
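The variance claim is easy to check empirically: the mini-batch gradient is a mean of $b$ i.i.d. per-example gradients, so its variance scales as $1/b$. A sketch with synthetic scalar "gradients" (the distribution and sizes are illustrative):

```python
import numpy as np

# Estimate the variance of the averaged mini-batch gradient at two batch sizes.
rng = np.random.default_rng(0)
per_example_grads = rng.normal(loc=1.0, scale=2.0, size=100_000)

def batch_grad_variance(b, trials=5000):
    # Draw `trials` batches of size b and look at the spread of their means.
    batches = rng.choice(per_example_grads, size=(trials, b))
    return batches.mean(axis=1).var()

v32 = batch_grad_variance(32)
v64 = batch_grad_variance(64)
# v32 / v64 comes out close to 2: doubling the batch size halves the variance
```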

Exercises

Exercise (Core)

Problem

You train a model with SGD (no momentum) and learning rate 0.1. Training loss oscillates wildly. What two modifications would you try first, and why?

Exercise (Advanced)

Problem

Adam uses bias correction terms $1 - \beta_1^t$ and $1 - \beta_2^t$. Explain what happens at $t = 1$ without bias correction. Compute the uncorrected and corrected first moment estimates for $t = 1$ with $\beta_1 = 0.9$ and $g_1 = 5.0$.

References

Canonical:

  • Robbins & Monro, "A Stochastic Approximation Method" (1951)
  • Kingma & Ba, "Adam: A Method for Stochastic Optimization" (2015), ICLR

Current:

  • Loshchilov & Hutter, "Decoupled Weight Decay Regularization" (2019), ICLR (AdamW)

  • Zhang et al., "Gradient Descent Happens in a Tiny Subspace" (2019), for intuition on adaptive methods

  • Jin, Ge, Netrapalli, Kakade, Jordan, "How to Escape Saddle Points Efficiently" (2017), ICML; arXiv:1703.00887. Perturbed GD finds an $\epsilon$-SOSP in $\tilde{O}(1/\epsilon^2)$ iterations.

  • Boyd & Vandenberghe, Convex Optimization (2004), Chapters 2-5

  • Nesterov, Introductory Lectures on Convex Optimization (2004), Chapters 1-3

Next Topics

  • Learning rate scheduling: how to adjust the learning rate during training
  • Optimizer theory: deeper analysis of SGD, Adam, and newer optimizers

Last reviewed: April 2026
