Training Techniques
Adam Optimizer
Adaptive moment estimation: first and second moment tracking, bias correction, AdamW vs Adam+L2, learning rate warmup, and when Adam fails compared to SGD.
Why This Matters
Adam is the default optimizer for training deep neural networks. It combines momentum (first moment estimation) with adaptive learning rates (second moment estimation) to handle the challenges of non-convex, high-dimensional optimization. Understanding Adam deeply means understanding why bias correction matters, how AdamW differs from Adam+L2 (they are not the same), and when Adam actually hurts generalization compared to plain SGD.
Mental Model
SGD with momentum keeps a running average of gradients to smooth out noise. RMSprop keeps a running average of squared gradients to adapt the learning rate per-parameter (parameters with large gradients get smaller steps). Adam combines both: it maintains a momentum vector (first moment) and an adaptive scaling vector (second moment), with bias correction to handle initialization.
The Algorithm
Adam Update Rule
Given parameters $\theta$, learning rate $\alpha$, decay rates $\beta_1, \beta_2$, and small constant $\epsilon$:
At step $t$, with gradient $g_t = \nabla_\theta L(\theta_{t-1})$:
$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$
$$\theta_t = \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
Default hyperparameters: $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
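The update rule above can be sketched directly in NumPy (an illustrative single-step implementation for checking the math, not a production optimizer):

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns (new_theta, new_m, new_v)."""
    m = beta1 * m + (1 - beta1) * g        # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g**2     # second moment (elementwise)
    m_hat = m / (1 - beta1**t)             # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = theta^2; gradient is 2*theta
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 501):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
```

Because the bias-corrected ratio $\hat{m}_t / \sqrt{\hat{v}_t}$ is close to $\pm 1$ for a consistently-signed gradient, each step moves $\theta$ by roughly $\alpha$, so after 500 steps $\theta$ has moved about $0.5$ toward the minimum.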
Components Explained
First Moment (Momentum)
$m_t$ is an exponential moving average of gradients. With $\beta_1 = 0.9$, this averages roughly the last 10 gradients. It smooths out gradient noise and accumulates direction, like a ball rolling downhill with momentum.
Expanding: $m_t = (1-\beta_1) \sum_{i=1}^{t} \beta_1^{t-i}\, g_i$.
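The recursive EMA and its expanded sum are the same quantity, which can be checked numerically (an illustrative sketch on random scalar "gradients"):

```python
import numpy as np

rng = np.random.default_rng(0)
beta1 = 0.9
g = rng.normal(size=20)   # a sequence of scalar gradients

# Recursive form, as in the update rule
m = 0.0
for gi in g:
    m = beta1 * m + (1 - beta1) * gi

# Closed form: (1 - beta1) * sum_i beta1^(t-i) * g_i
t = len(g)
closed = (1 - beta1) * sum(beta1**(t - i) * g[i - 1] for i in range(1, t + 1))

assert np.isclose(m, closed)
```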
Second Moment (Adaptive Scaling)
$v_t$ is an exponential moving average of squared gradients (elementwise). With $\beta_2 = 0.999$, this averages roughly the last 1000 squared gradients. It estimates the uncentered second moment of each gradient component (roughly the variance when the mean gradient is small).
Dividing by $\sqrt{\hat{v}_t} + \epsilon$ gives each parameter its own effective learning rate: parameters with consistently large gradients get smaller steps, and parameters with small gradients get larger steps.
Bias Correction
Bias Correction for Exponential Moving Averages
Statement
If $g_1, \dots, g_t$ are drawn from a stationary distribution with mean $\mu$, then the raw exponential moving average is biased: $\mathbb{E}[m_t] = \mu\,(1 - \beta_1^t)$.
The bias-corrected estimate $\hat{m}_t = m_t / (1 - \beta_1^t)$ satisfies $\mathbb{E}[\hat{m}_t] = \mu$.
Intuition
When you initialize $m_0 = 0$ and start averaging, the early estimates are biased toward zero. After one step with $\beta_1 = 0.9$, you have $m_1 = 0.1\, g_1$, which underestimates the true gradient by a factor of 10. Dividing by $1 - \beta_1^1 = 0.1$ corrects this. The correction matters most in the first few iterations and becomes negligible as $t$ grows (since $\beta_1^t \to 0$).
Proof Sketch
Unrolling the recursion with $m_0 = 0$: $m_t = (1-\beta_1) \sum_{i=1}^{t} \beta_1^{t-i}\, g_i$.
Taking expectations: $\mathbb{E}[m_t] = \mu\,(1-\beta_1) \sum_{i=1}^{t} \beta_1^{t-i} = \mu\,(1 - \beta_1^t)$ by the geometric series.
So $\mathbb{E}[m_t / (1 - \beta_1^t)] = \mu$. For the second moment, the same argument applies with $\beta_1$ replaced by $\beta_2$ and $g_i$ by $g_i^2$.
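The proof sketch can be verified numerically for the deterministic case where every gradient equals its mean $\mu$ (illustrative values):

```python
import numpy as np

beta1, mu, T = 0.9, 3.0, 5
m = 0.0
for t in range(1, T + 1):
    m = beta1 * m + (1 - beta1) * mu   # constant gradient equal to its mean

# Raw EMA is biased toward zero by exactly the factor (1 - beta1**T)
assert np.isclose(m, mu * (1 - beta1**T))
# The bias-corrected estimate recovers mu exactly
assert np.isclose(m / (1 - beta1**T), mu)
```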
Why It Matters
Without bias correction, the first few Adam steps take very small updates because $m_t$ and $v_t$ are close to zero. With $\beta_2 = 0.999$, the second moment estimate is severely biased for the first ~1000 steps. Bias correction prevents this slow start and is essential for Adam to work well in practice.
Failure Mode
Bias correction assumes a stationary gradient distribution. In the early phase of training when the loss landscape changes rapidly, the stationarity assumption is violated. This is one motivation for learning rate warmup.
AdamW: Decoupled Weight Decay
AdamW Decouples Weight Decay from Gradient Adaptation
Statement
Adam + L2 regularization adds the L2 gradient to the loss gradient before moment estimation:
$$g_t \leftarrow \nabla_\theta L(\theta_{t-1}) + \lambda\, \theta_{t-1}$$
then runs standard Adam. The effective weight decay for parameter $i$ is roughly $\alpha \lambda \theta_i / (\sqrt{\hat{v}_{t,i}} + \epsilon)$, which varies per parameter.
AdamW applies weight decay directly to the parameters, after the adaptive step:
$$\theta_t = \theta_{t-1} - \alpha \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\, \theta_{t-1} \right)$$
In AdamW, the weight decay rate is the same for all parameters, regardless of gradient magnitude.
Intuition
In SGD, L2 regularization and weight decay are equivalent. In Adam, they are not. When Adam divides the gradient by $\sqrt{\hat{v}_t} + \epsilon$, it also divides the L2 gradient, weakening the regularization for parameters with large gradient history. AdamW avoids this by applying decay separately, so all parameters decay at the same rate, as intended.
Proof Sketch
With Adam + L2: the effective update for parameter $i$ includes $-\alpha \lambda \theta_i / (\sqrt{\hat{v}_{t,i}} + \epsilon)$. Parameters with large $\hat{v}_{t,i}$ (historically large gradients) receive weaker decay.
With AdamW: the decay term is $-\alpha \lambda \theta_i$ regardless of $\hat{v}_{t,i}$. The gradient-based update and the decay are fully decoupled.
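A numeric sketch of the two update rules for a single scalar parameter (illustrative NumPy, not from the original text; `wd` is the decay coefficient $\lambda$):

```python
import numpy as np

def adam_l2_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999,
                 eps=1e-8, wd=0.01):
    # L2: the decay term enters the gradient, so it is later
    # rescaled by the adaptive denominator.
    g = grad + wd * theta
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def adamw_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    # AdamW: the moments see only the loss gradient; decay is applied
    # directly to the parameters, outside the adaptive scaling.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * theta
    return theta, m, v

# Same start, same gradients, same wd: the trajectories still diverge.
rng = np.random.default_rng(1)
t1 = t2 = 1.0
m1 = v1 = m2 = v2 = 0.0
for t in range(1, 11):
    g = rng.normal()
    t1, m1, v1 = adam_l2_step(t1, g, m1, v1, t)
    t2, m2, v2 = adamw_step(t2, g, m2, v2, t)
```

The divergence appears from the very first step, because Adam+L2's decay term passes through the $\sqrt{\hat{v}_t}$ normalization while AdamW's does not.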
Why It Matters
Loshchilov and Hutter (2019) showed that AdamW consistently outperforms Adam+L2 across tasks. The key insight: regularization should not interact with the adaptive learning rate mechanism. AdamW is the default optimizer for training Transformer-based language models. For convolutional architectures, SGD with momentum remains competitive or superior (Wilson et al. 2017).
Failure Mode
The optimal $\lambda$ for AdamW is different from the optimal $\lambda$ for Adam+L2. You cannot simply swap one for the other without retuning. The typical AdamW weight decay for Transformers is 0.01-0.1, much larger than what you would use for L2 regularization in Adam.
Learning Rate Warmup
In practice, Adam is often combined with learning rate warmup: start with a very small learning rate and linearly increase it over the first $T_{\text{warmup}}$ steps to the target value. Why?
- Second moment initialization: At step 1, $v_1 = (1-\beta_2)\, g_1^2$, a noisy single-sample estimate. A large learning rate with a noisy denominator produces wild parameter updates. Warmup gives $v_t$ time to stabilize.
- Loss landscape curvature: Early in training, the loss landscape may have regions of very high curvature. Large steps in these regions can be catastrophic. Warmup allows the model to reach a better-conditioned region before taking large steps.
A common schedule is linear warmup for 1-10% of total training steps, followed by cosine decay.
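A minimal sketch of such a schedule (the peak learning rate and warmup fraction below are illustrative choices, not values from the text):

```python
import math

def lr_schedule(step, total_steps, peak_lr=3e-4, warmup_frac=0.05):
    """Linear warmup to peak_lr, then cosine decay toward zero."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps          # linear ramp
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay

total = 1000
lrs = [lr_schedule(s, total) for s in range(total)]
```

With these settings the learning rate rises linearly for the first 50 steps, hits the peak exactly at the end of warmup, and then decays smoothly to near zero.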
When Adam Fails
Adam is not universally superior to SGD:
- Generalization gap (contested): Wilson et al. (2017) reported that SGD with momentum often generalizes better than Adam on image classification. One proposed mechanism is that Adam finds sharper minima while SGD's larger noise finds flatter ones (Keskar et al. 2017), but the flat-minima hypothesis itself has been challenged on reparameterization grounds (Dinh et al. 2017). The generalization gap is real; its causal mechanism is not settled.
- Non-convergence: Reddi et al. (2018) showed that Adam can diverge on simple convex problems because the adaptive learning rate can increase without bound when $v_t$ shrinks. AMSGrad fixes this by taking $\hat{v}_t = \max(\hat{v}_{t-1}, v_t)$ to ensure the effective learning rate never increases. In practice AMSGrad is rarely used; the modification has not consistently improved empirical performance and most codebases default to AdamW.
- Domain dependence: Adam dominates for NLP/Transformers. SGD dominates for CNNs on vision tasks. The optimal optimizer depends on the architecture and data distribution.
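The AMSGrad max trick from the non-convergence point above can be illustrated by comparing denominator sequences (a sketch; `v_seq` stands for a sequence of bias-corrected second-moment estimates):

```python
import numpy as np

def denominators(v_seq):
    """Return (Adam denominators, AMSGrad denominators) for a v-hat sequence.

    AMSGrad replaces v_hat with a running max, so the effective
    learning rate (proportional to 1/sqrt) can never increase.
    """
    v_hat_max = 0.0
    adam_d, ams_d = [], []
    for v in v_seq:
        v_hat_max = max(v_hat_max, v)   # the AMSGrad max step
        adam_d.append(np.sqrt(v))
        ams_d.append(np.sqrt(v_hat_max))
    return adam_d, ams_d

# A shrinking v-hat sequence: Adam's denominator shrinks with it
# (effective LR grows without bound), AMSGrad's stays flat.
adam_d, ams_d = denominators([1.0, 0.5, 0.25, 0.9])
```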
Canonical Examples
Why bias correction matters early
With $\beta_2 = 0.999$, at step $t = 1$:
- Raw: $v_1 = 0.001\, g_1^2$ (underestimates the true second moment by 1000x)
- Corrected: $\hat{v}_1 = v_1 / (1 - 0.999) = g_1^2$ (correct order of magnitude)
Without correction, $\sqrt{v_1}$ would be $\sqrt{1000} \approx 30$x too small, making the effective learning rate ~30x too large. The first step would be catastrophically large.
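These numbers can be checked directly (assuming a unit first gradient $g_1 = 1$ for concreteness):

```python
import numpy as np

beta2, g1 = 0.999, 1.0
v1 = (1 - beta2) * g1**2        # raw second moment after one step
v1_hat = v1 / (1 - beta2**1)    # bias-corrected

assert np.isclose(v1, 0.001)    # 1000x below the true g1^2
assert np.isclose(v1_hat, 1.0)  # corrected to the right scale
# The denominator sqrt(v1) is sqrt(1000) ~ 31.6x too small without
# correction, so the effective step would be ~30x too large.
assert np.isclose(np.sqrt(v1_hat) / np.sqrt(v1), np.sqrt(1000))
```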
Common Confusions
Adam and AdamW are NOT interchangeable
Adam with L2 regularization and AdamW with weight decay produce different parameter trajectories, even with the same $\lambda$. In Adam+L2, the regularization strength varies per parameter (inversely with gradient history). In AdamW, it is uniform. Always use AdamW when you want weight decay with adaptive optimizers.
The epsilon parameter is not negligible
The default $\epsilon = 10^{-8}$ prevents division by zero, but in half-precision training (fp16), this default underflows to zero, since the smallest positive fp16 value is about $6 \times 10^{-8}$. For fp16 training, set $\epsilon = 10^{-4}$ or larger.
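A quick check of the underflow claim, assuming $\epsilon$ is stored in fp16 (NumPy's `float16`):

```python
import numpy as np

# The default eps underflows: the smallest positive fp16 subnormal
# is 2**-24 (~6e-8), so 1e-8 rounds to zero in fp16.
assert np.float16(1e-8) == 0.0
# A larger eps survives the fp16 conversion.
assert np.float16(1e-4) > 0.0
```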
Summary
- Adam = momentum (first moment) + adaptive LR (second moment) + bias correction
- First moment: $m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t$ (smooths gradients)
- Second moment: $v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$ (scales steps per-parameter)
- Bias correction: divide by $1 - \beta^t$ to correct for zero initialization
- AdamW decouples weight decay from adaptive scaling; use AdamW, not Adam+L2
- Adam may find sharper minima than SGD (contested); SGD can generalize better for some tasks
- Warmup stabilizes the second moment estimate in early training
Exercises
Problem
Derive the bias-corrected first moment estimate. Starting from $m_0 = 0$ and $m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t$, show that $\mathbb{E}[\hat{m}_t] = \mathbb{E}[m_t] / (1 - \beta_1^t) = \mu$ when the gradients have constant expectation $\mathbb{E}[g_t] = \mu$.
Problem
Show that for SGD (no adaptive scaling), L2 regularization (adding $\lambda \theta$ to the gradient) is equivalent to weight decay (multiplying $\theta$ by $1 - \alpha\lambda$ each step). Then explain why this equivalence breaks for Adam.
References
Canonical:
- Kingma & Ba, "Adam: A Method for Stochastic Optimization" (2015)
- Loshchilov & Hutter, "Decoupled Weight Decay Regularization" (2019); introduces AdamW
Current:
- Reddi et al., "On the Convergence of Adam and Beyond" (2018); introduces AMSGrad
- Wilson et al., "The Marginal Value of Adaptive Gradient Methods in ML" (2017)
Next Topics
Adam connects to broader optimization topics:
- Convex optimization basics: the theoretical foundation for understanding convergence
- Learning rate schedules: warmup, cosine decay, and their interaction with Adam
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Gradient Descent Variants (Layer 1)
- Convex Optimization Basics (Layer 1)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Stochastic Gradient Descent Convergence (Layer 2)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)