Optimization Function Classes
Gradient Descent Variants
From full-batch to stochastic to mini-batch gradient descent, plus momentum, Nesterov acceleration, AdaGrad, RMSProp, and Adam. Why mini-batch SGD with momentum is the practical default.
Why This Matters
Every neural network is trained by some variant of gradient descent (or gradient ascent when maximizing a reward or likelihood). The choice of optimizer affects convergence speed, final model quality, and generalization. Understanding the differences between SGD, momentum, and Adam is necessary for debugging training and making informed choices.
Full-Batch Gradient Descent
Given a loss function $L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(\theta; x_i, y_i)$, full-batch gradient descent updates:

$$\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$$

where $\eta > 0$ is the learning rate and $\nabla L(\theta_t)$ is computed using all $N$ training examples.
Full-batch GD computes the exact gradient. Each step is expensive ($O(N)$ per update), and for large $N$ this is impractical. A single epoch requires one gradient computation and one parameter update.
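A minimal sketch of the full-batch update on a toy 1-D least-squares problem (the loss, data, and hyperparameters here are illustrative assumptions, not from the text):

```python
# Full-batch gradient descent on 1-D least squares:
# L(theta) = (1/N) * sum_i (theta * x_i - y_i)^2.
def full_batch_gd(xs, ys, theta=0.0, lr=0.1, steps=100):
    n = len(xs)
    for _ in range(steps):
        # Exact gradient: touches all N examples per update (O(N) cost).
        grad = sum(2 * (theta * x - y) * x for x, y in zip(xs, ys)) / n
        theta -= lr * grad
    return theta

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # generated with slope 2, so theta* = 2
theta = full_batch_gd(xs, ys)
```

Note that every update scans the whole dataset; this is the cost that the stochastic variants below avoid.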
Stochastic Gradient Descent (SGD)
At each step, sample a single example $i$ uniformly at random and update:

$$\theta_{t+1} = \theta_t - \eta \nabla \ell(\theta_t; x_i, y_i)$$

The stochastic gradient is an unbiased estimate of the full gradient: $\mathbb{E}_i[\nabla \ell(\theta; x_i, y_i)] = \nabla L(\theta)$.
SGD is cheap per step ($O(1)$ per update) but noisy. The noise can be beneficial: it helps escape shallow local minima and saddle points. But it also means the iterates oscillate and do not converge to an exact minimum without a decreasing learning rate schedule.
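The same toy problem with single-sample updates and mildly noisy labels (an assumed setup for illustration) shows the oscillation: with a fixed learning rate the iterate hovers near, but not exactly at, the optimum.

```python
import random

# Single-sample SGD on noisy 1-D least squares. Each update costs O(1),
# but the iterates hover around theta* instead of settling exactly.
def sgd(xs, ys, theta=0.0, lr=0.01, steps=2000, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        i = rng.randrange(len(xs))                    # one example, uniform
        grad = 2 * (theta * xs[i] - ys[i]) * xs[i]    # unbiased gradient estimate
        theta -= lr * grad
    return theta

xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.05]   # roughly slope 2, with label noise
theta = sgd(xs, ys)     # lands near 2, not exactly at it
```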
Mini-Batch Gradient Descent
At each step, sample a mini-batch $\mathcal{B}_t$ of size $B$ and update:

$$\theta_{t+1} = \theta_t - \frac{\eta}{B} \sum_{i \in \mathcal{B}_t} \nabla \ell(\theta_t; x_i, y_i)$$

Typical batch sizes: 32, 64, 128, 256, or 512.
Mini-batch SGD is the practical default. It balances the noise reduction of averaging over multiple samples with the computational efficiency of not using all samples. Larger batches reduce variance (smoother updates) but cost more per step. Smaller batches add noise but allow more updates per epoch.
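A sketch of the mini-batch update on the same assumed toy problem; the only change from single-sample SGD is averaging the gradient over a sampled batch.

```python
import random

# Mini-batch SGD: average per-example gradients over a sampled batch.
# Larger B smooths updates (lower variance) at higher per-step cost.
def minibatch_sgd(xs, ys, theta=0.0, lr=0.05, batch=2, steps=500, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        idx = rng.sample(range(len(xs)), batch)       # batch without replacement
        grad = sum(2 * (theta * xs[i] - ys[i]) * xs[i] for i in idx) / batch
        theta -= lr * grad
    return theta

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
theta = minibatch_sgd(xs, ys)
```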
SGD Convergence for Smooth Convex Functions
Statement
Let $L$ be convex and $\beta$-smooth, with minimizer $\theta^*$. Let the stochastic gradient $g_t$ have bounded variance: $\mathbb{E}\|g_t - \nabla L(\theta_t)\|^2 \le \sigma^2$. With a suitably chosen constant learning rate $\eta$, after $T$ steps of SGD the averaged iterate $\bar{\theta}_T$ satisfies:

$$\mathbb{E}[L(\bar{\theta}_T)] - L(\theta^*) \le O\!\left(\frac{\beta \|\theta_0 - \theta^*\|^2}{T} + \frac{\sigma \|\theta_0 - \theta^*\|}{\sqrt{T}}\right)$$

The convergence rate is $O(1/\sqrt{T})$, which is slower than the $O(1/T)$ rate of full-batch GD for smooth convex functions.
Intuition
The $O(\sigma/\sqrt{T})$ term is the price of stochasticity. No matter how many steps you take, the noise in the gradient prevents exact convergence with a fixed learning rate. You need a decreasing schedule (e.g. $\eta_t \propto 1/\sqrt{t}$) for exact convergence, which slows things further.
Proof Sketch
Apply the descent lemma: $L(\theta_{t+1}) \le L(\theta_t) + \langle \nabla L(\theta_t), \theta_{t+1} - \theta_t \rangle + \frac{\beta}{2}\|\theta_{t+1} - \theta_t\|^2$. Substitute the SGD update $\theta_{t+1} = \theta_t - \eta g_t$, take expectations, and telescope the sum. The variance term $\sigma^2$ prevents the telescoping from collapsing to zero.
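A hedged one-step version of that calculation (standard SGD analysis for the convex, $\beta$-smooth case; constants not optimized):

```latex
% Descent lemma with the update \theta_{t+1} = \theta_t - \eta g_t:
L(\theta_{t+1}) \le L(\theta_t) - \eta \langle \nabla L(\theta_t), g_t \rangle
                 + \frac{\beta \eta^2}{2} \|g_t\|^2
% Take expectations; \mathbb{E}[g_t] = \nabla L(\theta_t) and
% \mathbb{E}\|g_t\|^2 \le \|\nabla L(\theta_t)\|^2 + \sigma^2 give:
\mathbb{E}[L(\theta_{t+1})] \le \mathbb{E}[L(\theta_t)]
  - \eta\left(1 - \tfrac{\beta\eta}{2}\right)\mathbb{E}\|\nabla L(\theta_t)\|^2
  + \frac{\beta \eta^2 \sigma^2}{2}
% Summing over t = 0,\dots,T-1 telescopes the loss terms, while the
% \sigma^2 term accumulates linearly in T: the noise floor of a fixed \eta.
```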
Why It Matters
This rate explains why SGD is slow in theory but fast in practice. The constant hidden in the $O(\cdot)$ notation depends on $N$ for full-batch but not for SGD. For large $N$, SGD reaches a given accuracy much sooner in wall-clock time despite the worse per-iteration rate.
Failure Mode
The bound requires convexity. For non-convex objectives (neural networks), SGD convergence to a stationary point (not a minimum) has rate $O(1/\sqrt{T})$, measured by $\min_t \mathbb{E}\|\nabla L(\theta_t)\|^2$, under smoothness, but there is no guarantee the stationary point is good.
Stationary Points in Nonconvex Optimization
For nonconvex $L$, not every stationary point is a local minimum. Saddle points satisfy $\nabla L(\theta) = 0$ but have directions of negative curvature. The standard taxonomy:
First-Order Stationary Point (FOSP)
A point $x$ is a first-order stationary point of a differentiable $f$ if $\nabla f(x) = 0$. An $\epsilon$-FOSP satisfies $\|\nabla f(x)\| \le \epsilon$. FOSPs include local minima, local maxima, and saddle points.
Second-Order Stationary Point (SOSP)
A point $x$ is a second-order stationary point of a twice-differentiable $f$ if $\nabla f(x) = 0$ and $\nabla^2 f(x) \succeq 0$ (no negative eigenvalue). An $\epsilon$-SOSP satisfies $\|\nabla f(x)\| \le \epsilon$ and $\lambda_{\min}(\nabla^2 f(x)) \ge -\sqrt{\rho \epsilon}$ for a Hessian-Lipschitz constant $\rho$. Saddle points are FOSPs but not SOSPs.
Plain gradient descent can get stuck at strict saddle points. Jin, Ge, Netrapalli, Kakade, Jordan (2017), "How to Escape Saddle Points Efficiently", showed that perturbed gradient descent (adding occasional isotropic noise) finds an $\epsilon$-SOSP in $\tilde{O}(\epsilon^{-2})$ iterations for Hessian-Lipschitz objectives. This matches the rate for finding an $\epsilon$-FOSP up to polylogarithmic factors, so escaping saddles is nearly free.
Momentum
SGD with Momentum
Momentum adds a velocity term that accumulates past gradients:

$$v_{t+1} = \mu v_t - \eta \nabla L(\theta_t), \qquad \theta_{t+1} = \theta_t + v_{t+1}$$

where $\mu \in [0, 1)$ is the momentum coefficient, typically 0.9.
Momentum smooths out oscillations in directions with high curvature and accelerates progress in directions with consistent gradient. It acts like a heavy ball rolling downhill: it keeps moving even when the local gradient is small.
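The heavy-ball behavior can be sketched on an ill-conditioned 2-D quadratic (the function and hyperparameters are assumptions chosen to make the dynamics visible, not a tuned recipe):

```python
# SGD-style momentum (heavy ball) on f(x, y) = 0.5 * (a*x^2 + b*y^2),
# a quadratic with very different curvature along each axis.
def momentum_gd(a=100.0, b=1.0, lr=0.009, mu=0.9, steps=400):
    x, y = 1.0, 1.0
    vx, vy = 0.0, 0.0
    for _ in range(steps):
        gx, gy = a * x, b * y          # gradient of the quadratic
        vx = mu * vx - lr * gx         # velocity accumulates past gradients
        vy = mu * vy - lr * gy
        x, y = x + vx, y + vy
    return x, y

x, y = momentum_gd()   # both coordinates reach the minimum at the origin
```

Plain GD with this learning rate would crawl along the low-curvature y-axis; the accumulated velocity is what keeps the y-coordinate moving.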
Nesterov Momentum
Nesterov's modification computes the gradient at the "look-ahead" position:

$$v_{t+1} = \mu v_t - \eta \nabla L(\theta_t + \mu v_t), \qquad \theta_{t+1} = \theta_t + v_{t+1}$$
The idea: evaluate the gradient where momentum would take you, not where you currently are. This provides a corrective signal that reduces overshooting.
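The change relative to the heavy-ball sketch is one line: evaluate the gradient at the look-ahead point. Same assumed quadratic test bed as before.

```python
# Nesterov momentum: the gradient is taken at theta + mu * v, the point
# the velocity is about to carry us to, giving a corrective signal.
def nesterov_gd(a=100.0, b=1.0, lr=0.009, mu=0.9, steps=400):
    x, y = 1.0, 1.0
    vx, vy = 0.0, 0.0
    for _ in range(steps):
        lx, ly = x + mu * vx, y + mu * vy   # look-ahead position
        gx, gy = a * lx, b * ly             # gradient at the look-ahead
        vx = mu * vx - lr * gx
        vy = mu * vy - lr * gy
        x, y = x + vx, y + vy
    return x, y

x, y = nesterov_gd()
```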
Adaptive Learning Rate Methods
AdaGrad
AdaGrad scales the learning rate per parameter by the accumulated squared gradients:

$$G_t = G_{t-1} + g_t \odot g_t, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon} \odot g_t$$

where $g_t = \nabla L(\theta_t)$, $\odot$ is element-wise multiplication, and $\epsilon$ prevents division by zero.
AdaGrad gives larger updates to infrequent parameters and smaller updates to frequent ones. This is good for sparse features. The problem: $G_t$ only grows, so the effective learning rate monotonically decreases and eventually becomes too small to make progress.
RMSProp
RMSProp fixes AdaGrad's decaying learning rate by using an exponential moving average instead of an accumulation:

$$s_t = \gamma s_{t-1} + (1 - \gamma)\, g_t \odot g_t, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{s_t} + \epsilon} \odot g_t$$

where $\gamma \approx 0.9$. The running average forgets old gradients, so the effective learning rate adapts to recent gradient magnitudes rather than all past gradients.
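Swapping AdaGrad's sum for the moving average is again a one-line change; on the same assumed 1-D quadratic, RMSProp keeps making progress where AdaGrad would stall.

```python
import math

# RMSProp on f(theta) = 0.5 * theta^2: an exponential moving average of
# squared gradients replaces AdaGrad's ever-growing sum.
def rmsprop(theta=1.0, lr=0.01, gamma=0.9, eps=1e-8, steps=500):
    s = 0.0
    for _ in range(steps):
        g = theta                              # gradient
        s = gamma * s + (1 - gamma) * g * g    # running average of g^2
        theta -= lr * g / (math.sqrt(s) + eps)
    return theta

theta = rmsprop()   # reaches a small neighborhood of the minimum
```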
Adam
Statement
Adam combines momentum (first moment) and RMSProp (second moment) with bias correction. Let $g_t = \nabla L(\theta_t)$:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t \odot g_t$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
$$\theta_{t+1} = \theta_t - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Default hyperparameters: $\eta = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
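A self-contained sketch of these update rules with the default hyperparameters, run on an assumed 1-D quadratic for illustration:

```python
import math

# Adam with bias correction on f(theta) = 0.5 * theta^2.
def adam(theta=1.0, lr=0.001, b1=0.9, b2=0.999, eps=1e-8, steps=10000):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = theta                          # gradient
        m = b1 * m + (1 - b1) * g          # first moment (EMA of g)
        v = b2 * v + (1 - b2) * g * g      # second moment (EMA of g^2)
        m_hat = m / (1 - b1 ** t)          # bias-corrected first moment
        v_hat = v / (1 - b2 ** t)          # bias-corrected second moment
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

theta = adam()
```

Because $\hat{m}_t / \sqrt{\hat{v}_t}$ is close to $\pm 1$ for a slowly varying gradient, each step has magnitude near $\eta$; hence the many steps needed to travel from 1 to the minimum with $\eta = 0.001$.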
Intuition
Adam adapts the learning rate per parameter using the ratio of the first moment (gradient direction) to the square root of the second moment (gradient magnitude). Parameters with consistently large gradients get smaller effective learning rates; parameters with small gradients get larger ones. The bias correction accounts for the zero initialization of $m$ and $v$.
Proof Sketch
The bias correction ensures $\mathbb{E}[\hat{m}_t] \approx \mathbb{E}[g_t]$ and $\mathbb{E}[\hat{v}_t] \approx \mathbb{E}[g_t \odot g_t]$ under stationarity. Without correction, the initial estimates are biased toward zero because $m_0 = 0$ and $v_0 = 0$.
Why It Matters
Adam is the most widely used optimizer for deep learning (Kingma and Ba, 2015). It converges reliably with the default hyperparameters for Transformer language models at standard scales and for most supervised deep learning tasks. Its per-parameter adaptive learning rate means it requires less tuning than SGD with momentum.
Failure Mode
Adam can converge to worse solutions than SGD with momentum on some problems, particularly in computer vision (ResNets on ImageNet). The adaptive learning rate can be too aggressive, leading to sharp minima that generalize poorly. AdamW (Adam with decoupled weight decay) fixes some of these issues.
Choosing an Optimizer
The practical default for most deep learning:
- Start with Adam (or AdamW) with default hyperparameters. It works without tuning for most problems.
- Switch to SGD with momentum if you have compute budget for hyperparameter tuning and want the best possible generalization (common in computer vision).
- Use AdaGrad only for sparse problems (NLP with bag-of-words features, recommendation systems with sparse user-item interactions).
Common Confusions
Learning rate for Adam is not comparable to SGD learning rate
Adam's default learning rate is $10^{-3}$, while SGD often uses $10^{-1}$ or $10^{-2}$. These are not comparable because Adam divides by the second moment estimate. The effective step size for Adam is $\eta\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$, which adapts per parameter.
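One way to see the incomparability: with a constant gradient, Adam's bias-corrected first step has size close to $\eta$ no matter how large or small the gradient is, while an SGD step scales linearly with the gradient. This is a hypothetical single-step calculation, not library code.

```python
import math

# Size of Adam's first (bias-corrected) step for a constant gradient g.
def first_adam_step(g, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = (1 - b1) * g
    v = (1 - b2) * g * g
    m_hat = m / (1 - b1)                   # t = 1 bias correction
    v_hat = v / (1 - b2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

# first_adam_step(0.001) and first_adam_step(1000.0) are both ~0.001 (= lr),
# while SGD steps lr * g for the same gradients differ by six orders of magnitude.
```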
Batch size affects the implicit learning rate
Doubling the batch size halves the gradient variance. If you double the batch size without adjusting the learning rate, you get a smoother but slower optimizer. The linear scaling rule suggests multiplying the learning rate by the batch size increase factor, but this is an approximation that breaks for very large batches.
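The $1/B$ variance scaling can be checked empirically; this sketch uses synthetic scalar "per-example gradients" (an assumption for illustration) and estimates the variance of the size-$B$ batch mean by Monte Carlo.

```python
import random
import statistics

# Variance of a size-B mini-batch mean of uniformly sampled per-example
# gradients; expected to scale as 1/B.
def grad_variance(batch, trials=20000, seed=0):
    rng = random.Random(seed)
    data = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # fixed "dataset"
    ests = [sum(rng.choice(data) for _ in range(batch)) / batch
            for _ in range(trials)]
    return statistics.pvariance(ests)

ratio = grad_variance(1) / grad_variance(4)   # close to 4
```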
Exercises
Problem
You train a model with SGD (no momentum) and learning rate 0.1. Training loss oscillates wildly. What two modifications would you try first, and why?
Problem
Adam uses bias correction terms $1 - \beta_1^t$ and $1 - \beta_2^t$. Explain what happens at $t = 1$ without bias correction. Compute the uncorrected and corrected first moment estimates at $t = 1$ with $\beta_1 = 0.9$ and a given gradient $g_1$.
References
Canonical:
- Robbins & Monro, "A Stochastic Approximation Method" (1951)
- Kingma & Ba, "Adam: A Method for Stochastic Optimization" (2015), ICLR
Current:
- Loshchilov & Hutter, "Decoupled Weight Decay Regularization" (2019), ICLR (AdamW)
- Gur-Ari, Roberts & Dyer, "Gradient Descent Happens in a Tiny Subspace" (2018), for intuition on adaptive methods
- Jin, Ge, Netrapalli, Kakade & Jordan, "How to Escape Saddle Points Efficiently" (2017), ICML; arXiv:1703.00887. Perturbed GD finds an $\epsilon$-SOSP in $\tilde{O}(\epsilon^{-2})$ iterations.
- Boyd & Vandenberghe, Convex Optimization (2004), Chapters 2-5
- Nesterov, Introductory Lectures on Convex Optimization (2004), Chapters 1-3
Next Topics
- Learning rate scheduling: how to adjust the learning rate during training
- Optimizer theory: deeper analysis of SGD, Adam, and newer optimizers
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Convex Optimization Basics (Layer 1)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
Builds on This
- Adam Optimizer (Layer 2)
- Gradient Boosting (Layer 2)
- Implicit Bias and Modern Generalization (Layer 4)
- Stochastic Gradient Descent Convergence (Layer 2)