What Each Measures
Both Adam and SGD (Stochastic Gradient Descent) are first-order optimization algorithms used to train neural networks by minimizing a loss function using gradient information. They differ in how they use gradient history to compute parameter updates.
SGD (with momentum) maintains a single global learning rate and optionally accumulates a velocity vector from past gradients.
Adam (Adaptive Moment Estimation) maintains per-parameter learning rates by tracking exponential moving averages of both the first moment (mean) and second moment (uncentered variance) of gradients.
Side-by-Side Statement
SGD with Momentum
At step $t$, given gradient $g_t = \nabla_\theta L(\theta_{t-1})$:

$$v_t = \mu v_{t-1} + g_t$$
$$\theta_t = \theta_{t-1} - \eta\, v_t$$

where $\mu$ is the momentum coefficient (typically $0.9$) and $\eta$ is the global learning rate. Every parameter gets the same $\eta$.
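As a minimal NumPy sketch of this update rule (illustrative only; frameworks differ in whether $\eta$ multiplies the velocity or is folded into it):

```python
import numpy as np

def sgd_momentum_step(theta, v, grad, lr=0.01, mu=0.9):
    """One SGD-with-momentum update:
    v_t = mu * v_{t-1} + g_t;  theta_t = theta_{t-1} - lr * v_t."""
    v = mu * v + grad
    return theta - lr * v, v

# Toy problem: minimize f(x) = x^2 (gradient 2x) starting from x = 5.
theta, v = np.array([5.0]), np.zeros(1)
for _ in range(100):
    theta, v = sgd_momentum_step(theta, v, 2 * theta)
print(theta)  # close to 0
```

Note the single `lr` shared by every coordinate of `theta`: that is the defining contrast with Adam below.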
Adam Optimizer
At step $t$, given gradient $g_t$:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
$$\theta_t = \theta_{t-1} - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Default hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$. Each parameter effectively gets its own learning rate $\eta / (\sqrt{\hat{v}_t} + \epsilon)$.
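The same update, sketched in NumPy (a minimal single-tensor version; the learning rate in the demo loop is raised above the usual default purely so convergence is visible in a few hundred steps):

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias correction; t is 1-indexed."""
    m = b1 * m + (1 - b1) * grad        # EMA of gradients (first moment)
    v = b2 * v + (1 - b2) * grad**2     # EMA of squared gradients (second moment)
    m_hat = m / (1 - b1**t)             # bias-corrected first moment
    v_hat = v / (1 - b2**t)             # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy problem: minimize f(x) = x^2 (gradient 2x) starting from x = 5.
theta, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 501):
    theta, m, v = adam_step(theta, m, v, 2 * theta, t, lr=0.05)
print(theta)  # close to 0
```

Because `m_hat / (sqrt(v_hat) + eps)` has magnitude near 1 for a consistent gradient, each step moves by roughly `lr` regardless of the raw gradient scale.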
Where Each Is Stronger
Adam wins on convergence speed
Adam adapts per-parameter learning rates: parameters with consistently large gradients get smaller effective rates, and parameters with small or sparse gradients get larger rates. This means:
- Rare features (e.g., in NLP embeddings) get meaningful updates even when they appear infrequently.
- The optimizer navigates ravines and ill-conditioned loss surfaces more effectively without manual learning rate tuning.
- In the first few epochs, Adam typically reduces training loss much faster than SGD.
SGD wins on final generalization
A large body of empirical evidence (particularly in computer vision) shows that SGD with momentum, combined with a carefully tuned learning rate schedule, finds solutions that generalize better than those found by Adam. The dominant hypothesis:
- SGD finds flatter minima: The noise in SGD updates helps escape sharp minima and settle into broad, flat basins of the loss landscape. Flat minima are associated with better generalization because the loss does not change much when parameters are perturbed.
- Adam finds sharper minima: The per-parameter adaptation reduces effective noise and can converge to sharper minima that overfit more.
This is the "sharp vs. flat minima" debate. It is not fully settled theoretically, but the empirical pattern is robust in image classification.
Adam wins on hyperparameter robustness
Adam with defaults ($\eta = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$) works reasonably well across many problems. SGD requires careful tuning of the learning rate, momentum, and learning rate schedule (step decay, cosine annealing, warmup). A poorly tuned SGD can perform terribly, while Adam with defaults rarely fails catastrophically.
SGD wins in theoretical understanding
SGD has well-understood convergence guarantees:
- Convex: $O(1/\sqrt{T})$ convergence rate for the general case
- Strongly convex: $O(1/T)$ with appropriate step size decay
Adam's convergence theory is more complex. The original Adam paper's convergence proof had a gap: Reddi et al. (2018) constructed a counterexample showing that Adam can diverge even on simple convex problems. AMSGrad was proposed as a fix, but in practice standard Adam works fine on most deep learning problems.
Where Each Fails
Adam fails on generalization (in some domains)
In computer vision (ResNets on ImageNet, for example), Adam consistently achieves higher test error than SGD with momentum, even when both achieve similar training loss. This "generalization gap" of adaptive methods has been documented repeatedly.
SGD fails without careful tuning
SGD with a bad learning rate diverges or stagnates. Finding the right rate often requires grid search, warmup schedules, and patience. For new problems where you have no prior knowledge of the loss landscape, the cost of tuning SGD can be prohibitive.
Adam fails on convergence guarantees
As Reddi et al. (2018) showed, Adam can fail to converge even on simple convex problems because the exponential moving average of past squared gradients can cause the effective learning rate to increase. AMSGrad fixes this by taking the maximum of all past second-moment estimates, but adds memory overhead.
SGD fails on sparse gradient problems
In NLP tasks with large embedding tables, most embeddings receive zero gradients on any given minibatch. SGD gives these parameters zero updates. Adam's first moment estimate carries forward previous gradient information, and the small second moment estimate amplifies the effective learning rate for rare parameters. This is why Adam (and its variants) dominate NLP.
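This difference can be seen in a small sketch: a "rare feature" parameter receives a nonzero gradient once and zeros thereafter. Plain SGD updates it exactly once; Adam's moment estimates keep it moving for several more steps (hyperparameter values are the usual defaults):

```python
import numpy as np

# Gradient is nonzero on step 1, then zero for nine steps.
grads = [np.array([1.0])] + [np.array([0.0])] * 9

# Plain SGD: zero gradient means zero update.
theta_sgd = np.zeros(1)
for g in grads:
    theta_sgd -= 0.1 * g

# Adam: the first-moment EMA carries a trace of the step-1 gradient,
# so updates continue after the gradient goes to zero.
theta_adam, m, v = np.zeros(1), np.zeros(1), np.zeros(1)
for t, g in enumerate(grads, start=1):
    m = 0.9 * m + 0.1 * g
    v = 0.999 * v + 0.001 * g**2
    m_hat = m / (1 - 0.9**t)
    v_hat = v / (1 - 0.999**t)
    theta_adam -= 0.1 * m_hat / (np.sqrt(v_hat) + 1e-8)

print(theta_sgd, theta_adam)  # Adam has moved the parameter further
```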
The AdamW Distinction
AdamW (Decoupled Weight Decay)
Standard Adam with L2 regularization applies the penalty inside the gradient, which interacts poorly with the adaptive scaling. AdamW decouples weight decay from the adaptive update:

$$\theta_t = \theta_{t-1} - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_{t-1} \right)$$

This means the weight decay $\lambda$ is applied at the same scale for all parameters, not divided by $\sqrt{\hat{v}_t} + \epsilon$. AdamW consistently outperforms Adam with L2 regularization and is the standard choice for training transformers.
The difference: in Adam + L2, the regularization gradient $\lambda \theta$ is scaled by $1 / (\sqrt{\hat{v}_t} + \epsilon)$ like every other gradient component, so parameters with large gradient magnitudes get less regularization. AdamW removes this unwanted coupling.
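The two variants side by side, as a single-step NumPy sketch (hyperparameter values are illustrative; real implementations handle state and schedules differently):

```python
import numpy as np

def adam_l2_step(theta, m, v, grad, t, lr=1e-3, wd=0.01,
                 b1=0.9, b2=0.999, eps=1e-8):
    """Adam + L2: the decay term enters the gradient, so it gets
    rescaled by 1/(sqrt(v_hat) + eps) like everything else."""
    grad = grad + wd * theta
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def adamw_step(theta, m, v, grad, t, lr=1e-3, wd=0.01,
               b1=0.9, b2=0.999, eps=1e-8):
    """AdamW: weight decay applied directly to the parameters,
    outside the adaptive scaling."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
    return theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta), m, v

# One step from the same starting point gives different results.
theta0, g = np.array([1.0]), np.array([0.5])
t_l2, _, _ = adam_l2_step(theta0, np.zeros(1), np.zeros(1), g, t=1)
t_w, _, _ = adamw_step(theta0, np.zeros(1), np.zeros(1), g, t=1)
print(t_l2, t_w)
```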
Key Assumptions That Differ
| | SGD (with momentum) | Adam |
|---|---|---|
| Learning rate | Single global $\eta$ | Per-parameter: $\eta / (\sqrt{\hat{v}_t} + \epsilon)$ |
| Gradient memory | Momentum (first moment only) | Both first and second moments |
| Hyperparameters | $\eta$, $\mu$, schedule | $\eta$, $\beta_1$, $\beta_2$, $\epsilon$ |
| Tuning effort | High (schedule matters a lot) | Low (defaults often work) |
| Sparse gradients | Poor (zero update for zero gradient) | Good (moment estimates persist) |
| Generalization | Often better (flat minima) | Often worse (sharp minima) |
| Convergence theory | Well-understood | Has known pathologies |
| Weight decay | Standard L2 works fine | Needs AdamW (decoupled) |
What to Memorize
- Adam = exponential moving averages of the gradient ($m_t$, first moment) and squared gradient ($v_t$, second moment) + bias correction + per-parameter rates
- SGD + momentum = accumulate velocity, single global rate
- Adam converges faster, SGD generalizes better (empirically, especially in vision)
- AdamW decouples weight decay from adaptive scaling: use AdamW, not Adam + L2
- Adam defaults: $\eta = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$
- Decision rule: Transformers/NLP use AdamW. ResNets/vision use SGD + cosine schedule. New problem with no tuning budget? Start with Adam.
When a Researcher Would Use Each
Training a large language model (GPT, LLaMA)
Use AdamW with a linear warmup followed by cosine decay. Transformers trained with SGD converge slowly or not at all, because the loss landscape of attention layers is highly non-isotropic. AdamW's per-parameter scaling handles this effectively. This is the universal choice in LLM training.
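The warmup-plus-cosine schedule can be sketched as a pure function of the step count. All step counts and rates here are illustrative placeholders, not values from any particular model:

```python
import math

def lr_schedule(step, max_lr=3e-4, warmup_steps=2000,
                total_steps=100_000, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps          # linear ramp
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_schedule(1000))     # halfway through warmup: half of max_lr
print(lr_schedule(2000))     # warmup end: max_lr
print(lr_schedule(100_000))  # end of training: min_lr
```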
Training ResNet-50 on ImageNet
Use SGD with momentum ($\mu = 0.9$) and a step or cosine learning rate schedule. Years of benchmark results show SGD achieving 0.5 to 1.0% lower top-1 error than Adam on this benchmark. The extra tuning effort pays off because ImageNet training is expensive and every fraction of a percent matters.
Quick prototype on a new dataset
Use Adam with defaults. You do not know the loss landscape, you do not want to spend time on learning rate sweeps, and you just want a reasonable model to validate your approach. Switch to SGD later if you need to squeeze out the last bit of performance.
Fine-tuning a pretrained transformer
Use AdamW with a small learning rate (typically $10^{-5}$ to $5 \times 10^{-5}$) and linear warmup. Fine-tuning requires careful optimization because you are near a good minimum and large steps destroy pretrained features. AdamW's adaptive rates help navigate this delicate landscape.
Common Confusions
Adam is not uniformly better than SGD
Adam's faster convergence on training loss does not imply better test performance. The generalization gap between Adam and SGD is well-documented, particularly in image classification. Faster training loss reduction can mean the optimizer is finding sharp, overfit solutions more efficiently.
Adam's learning rate still matters
Despite being called 'adaptive,' Adam still requires a good base learning rate $\eta$. The adaptation is relative: each parameter's rate is $\eta / (\sqrt{\hat{v}_t} + \epsilon)$, so $\eta$ scales all of them. Using a rate like $\eta = 0.1$ with Adam is usually catastrophic. The default $\eta = 10^{-3}$ is a good starting point, but it is not universally optimal.
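A quick numeric check of this scaling (the helper name is hypothetical): at $t = 1$ with a constant gradient, Adam's bias-corrected step magnitude is almost exactly $\eta$, so changing $\eta$ rescales every parameter's update uniformly.

```python
def adam_update_size(lr, g=1.0, b1=0.9, b2=0.999, eps=1e-8, t=1):
    """Magnitude of Adam's step at time t for a constant gradient g."""
    m_hat = (1 - b1) * g / (1 - b1**t)        # equals g when t = 1
    v_hat = (1 - b2) * g**2 / (1 - b2**t)     # equals g^2 when t = 1
    return lr * abs(m_hat) / (v_hat**0.5 + eps)

print(adam_update_size(1e-3))  # ~0.001
print(adam_update_size(1e-1))  # ~0.1: 100x larger step, same gradient
```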
The bias correction matters early in training
Without bias correction, the first few steps of Adam use heavily biased (toward zero) moment estimates, leading to excessively large updates. The correction is not optional. It is essential for stable early training. Some implementations omit it; this is a bug.
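A small numeric illustration: with $\beta_2 = 0.999$ and a constant gradient of 1, the raw second-moment EMA after one step is $0.001$, while the corrected estimate is $1.0$. Without the correction, the first step would be roughly $30\times$ larger than intended.

```python
b2, g = 0.999, 1.0
v = 0.0

v = b2 * v + (1 - b2) * g**2   # after step 1: v = 0.001 (biased toward zero)
v_hat = v / (1 - b2**1)        # bias-corrected: v_hat = 1.0

# Effective step magnitudes with the default lr = 1e-3:
raw_step = 1e-3 / (v**0.5 + 1e-8)            # ~0.0316: far too large
corrected_step = 1e-3 / (v_hat**0.5 + 1e-8)  # ~0.001: sensible
print(raw_step, corrected_step)
```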