
Comparison

Adam vs. SGD

Adam adapts the learning rate per parameter using first and second moment estimates for fast early convergence; SGD with momentum uses a single global learning rate and often finds flatter minima that generalize better. The choice depends on your priorities: speed to convergence or final model quality.

What Each Measures

Both Adam and SGD (Stochastic Gradient Descent) are first-order optimization algorithms used to train neural networks by minimizing a loss function $\mathcal{L}(\theta)$ using gradient information. They differ in how they use gradient history to compute parameter updates.

SGD (with momentum) maintains a single global learning rate $\eta$ and optionally accumulates a velocity vector from past gradients.

Adam (Adaptive Moment Estimation) maintains per-parameter learning rates by tracking exponential moving averages of both the first moment (mean) and second moment (uncentered variance) of gradients.

Side-by-Side Statement

Definition

SGD with Momentum

At step $t$, given gradient $g_t = \nabla_\theta \mathcal{L}(\theta_t)$:

$$v_t = \mu \, v_{t-1} + g_t$$
$$\theta_{t+1} = \theta_t - \eta \, v_t$$

where $\mu \in [0, 1)$ is the momentum coefficient (typically $0.9$) and $\eta$ is the global learning rate. Every parameter gets the same $\eta$.
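The two update equations map directly to a few lines of NumPy. A minimal sketch (the toy quadratic loss and the specific hyperparameter values here are illustrative):

```python
import numpy as np

def sgd_momentum_step(theta, velocity, grad, lr=0.1, mu=0.9):
    """One SGD-with-momentum update: v_t = mu * v_{t-1} + g_t, then theta -= lr * v_t."""
    velocity = mu * velocity + grad
    theta = theta - lr * velocity
    return theta, velocity

# Toy quadratic loss L(theta) = 0.5 * theta^2, whose gradient is theta itself.
theta = np.array([1.0])
velocity = np.zeros_like(theta)
for _ in range(100):
    theta, velocity = sgd_momentum_step(theta, velocity, grad=theta)
# theta has spiraled toward the minimum at 0 (momentum causes some oscillation).
```

Note that every entry of `theta` is scaled by the same `lr`; the velocity only smooths the gradient direction, it does not adapt the step size per parameter.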

Definition

Adam Optimizer

At step $t$, given gradient $g_t$:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \quad \text{(first moment estimate)}$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \quad \text{(second moment estimate)}$$
$$\hat{m}_t = m_t / (1 - \beta_1^t), \quad \hat{v}_t = v_t / (1 - \beta_2^t) \quad \text{(bias correction)}$$
$$\theta_{t+1} = \theta_t - \eta \, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$$

Default hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$. Each parameter effectively gets its own learning rate $\eta / (\sqrt{\hat{v}_t} + \epsilon)$.
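A NumPy sketch of the same four equations. The usage example below (two parameters with wildly different gradient scales) is illustrative, chosen to show the per-parameter normalization at work:

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias correction; t is the 1-indexed step count."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)               # bias-corrected first moment
    v_hat = v / (1 - b2**t)               # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Two parameters whose gradients differ by a factor of 5000 still take
# nearly equal-sized first steps (~lr), because each coordinate is
# normalized by its own gradient magnitude.
theta = np.array([5.0, 0.001])
m, v = np.zeros(2), np.zeros(2)
new_theta, m, v = adam_step(theta, m, v, grad=theta, t=1)
step = theta - new_theta                  # both entries are close to 0.001
```

This is the core contrast with SGD: the update direction is roughly the sign of the gradient per coordinate, scaled by $\eta$, rather than the raw gradient scaled by $\eta$.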

Where Each Is Stronger

Adam wins on convergence speed

Adam adapts per-parameter learning rates: parameters with consistently large gradients get smaller effective rates, and parameters with small or sparse gradients get larger rates. This means rapid early progress even on poorly scaled or ill-conditioned problems, since no single global rate has to suit every parameter at once.

SGD wins on final generalization

A large body of empirical evidence (particularly in computer vision) shows that SGD with momentum, combined with a carefully tuned learning rate schedule, finds solutions that generalize better than those found by Adam. The dominant hypothesis is that SGD's uniformly scaled, noisier updates bias it toward wide, flat minima, while Adam's per-parameter scaling lets it settle efficiently into sharper minima that generalize worse.

This is the "sharp vs. flat minima" debate. It is not fully settled theoretically, but the empirical pattern is robust in image classification.

Adam wins on hyperparameter robustness

Adam with defaults ($\eta = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$) works reasonably well across many problems. SGD requires careful tuning of the learning rate, momentum, and learning rate schedule (step decay, cosine annealing, warmup). A poorly tuned SGD can perform terribly, while Adam with defaults rarely fails catastrophically.

SGD wins in theoretical understanding

SGD has well-understood convergence guarantees: convergence to the global minimum for convex objectives (under suitable step-size schedules) and to stationary points for smooth non-convex objectives, with known rates under standard smoothness and bounded-variance assumptions.

Adam's convergence theory is more complex. The original Adam paper's convergence proof had a gap (the counterexample of Reddi et al., 2018 showed Adam can diverge on simple convex problems). AMSGrad was proposed as a fix, but in practice, standard Adam works fine on most deep learning problems.

Where Each Fails

Adam fails on generalization (in some domains)

In computer vision (ResNets on ImageNet, for example), Adam consistently achieves higher test error than SGD with momentum, even when both achieve similar training loss. This "generalization gap" of adaptive methods has been documented repeatedly.

SGD fails without careful tuning

SGD with a bad learning rate diverges or stagnates. Finding the right rate often requires grid search, warmup schedules, and patience. For new problems where you have no prior knowledge of the loss landscape, the cost of tuning SGD can be prohibitive.

Adam fails on convergence guarantees

As Reddi et al. (2018) showed, Adam can fail to converge even on simple convex problems because the exponential moving average of past squared gradients can cause the effective learning rate to increase. AMSGrad fixes this by taking the maximum of all past second-moment estimates, but adds memory overhead.
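The AMSGrad change is a single extra line of state. A sketch of one common formulation (exact placement of the bias correction varies between the original paper and popular implementations; this version is illustrative):

```python
import numpy as np

def amsgrad_step(theta, m, v, v_max, grad, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """AMSGrad: Adam, but the denominator uses the running max of the second
    moment, so the effective per-parameter learning rate can never increase."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    v_max = np.maximum(v_max, v)          # extra state: this is the memory overhead
    m_hat = m / (1 - b1**t)
    theta = theta - lr * m_hat / (np.sqrt(v_max) + eps)
    return theta, m, v, v_max

# After one large gradient, v decays under small gradients but v_max does not,
# so the effective learning rate stays capped.
theta, m, v, v_max = np.zeros(1), np.zeros(1), np.zeros(1), np.zeros(1)
theta, m, v, v_max = amsgrad_step(theta, m, v, v_max, np.array([10.0]), t=1)
theta, m, v, v_max = amsgrad_step(theta, m, v, v_max, np.array([0.01]), t=2)
```

In standard Adam, the decaying `v` would let the effective rate grow back; freezing the denominator at its maximum is exactly what restores the convergence guarantee.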

SGD fails on sparse gradient problems

In NLP tasks with large embedding tables, most embeddings receive zero gradients on any given minibatch. SGD gives these parameters zero updates. Adam's first moment estimate carries forward previous gradient information, and the small second moment estimate amplifies the effective learning rate for rare parameters. This is why Adam (and its variants) dominate NLP.
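A toy illustration of this effect, comparing plain SGD (no momentum) with Adam on a single embedding weight that receives one gradient and then nothing (all values here are illustrative):

```python
import numpy as np

b1, b2, eps, lr = 0.9, 0.999, 1e-8, 0.001
m, v = 0.0, 0.0
theta_sgd, theta_adam = 1.0, 1.0
for t in range(1, 6):
    g = 1.0 if t == 1 else 0.0            # sparse: gradient only on step 1
    theta_sgd -= lr * g                   # plain SGD: zero gradient, zero update
    m = b1 * m + (1 - b1) * g             # first moment persists after t = 1
    v = b2 * v + (1 - b2) * g**2
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
    theta_adam -= lr * m_hat / (np.sqrt(v_hat) + eps)
# theta_sgd froze after step 1; theta_adam kept moving on steps 2-5
# because the decaying first moment still carries the step-1 signal.
```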

The AdamW Distinction

Definition

AdamW (Decoupled Weight Decay)

Standard Adam with L2 regularization applies the penalty inside the gradient, which interacts poorly with the adaptive scaling. AdamW decouples weight decay from the adaptive update:

$$\theta_{t+1} = \theta_t - \eta\left(\hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon) + \lambda \theta_t\right)$$

This means the weight decay $\lambda\theta_t$ is applied at the same scale for all parameters, not divided by $\sqrt{\hat{v}_t}$. AdamW consistently outperforms Adam with L2 regularization and is the standard choice for training transformers.

The difference: in Adam + L2, the regularization gradient $\lambda\theta$ is scaled by $1/\sqrt{\hat{v}_t}$, so parameters with large gradient magnitudes get less regularization. AdamW removes this unwanted coupling.
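The coupling is easy to see in code. In the sketch below (hyperparameter values illustrative), the task gradient is set to zero so only the decay term acts: Adam + L2 then shrinks a weight by roughly $\eta$ per step regardless of its magnitude, while AdamW shrinks it in proportion to its magnitude, as intended:

```python
import numpy as np

def adam_l2_step(theta, m, v, grad, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=1e-2):
    """Adam + L2: the penalty wd*theta enters the gradient, so it gets
    rescaled by 1/sqrt(v_hat) along with everything else."""
    g = grad + wd * theta
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def adamw_step(theta, m, v, grad, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=1e-2):
    """AdamW: weight decay applied outside the adaptive scaling."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
    return theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta), m, v

theta = np.array([10.0])
t_l2, _, _ = adam_l2_step(theta, np.zeros(1), np.zeros(1), np.zeros(1), t=1)
t_w,  _, _ = adamw_step(theta, np.zeros(1), np.zeros(1), np.zeros(1), t=1)
step_l2 = float((theta - t_l2)[0])        # ~lr: adaptive scaling normalized the decay away
step_w  = float((theta - t_w)[0])         # lr * wd * theta: proportional, as decay should be
```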

Key Assumptions That Differ

| | SGD (with momentum) | Adam |
|---|---|---|
| Learning rate | Single global $\eta$ | Per-parameter: $\eta / (\sqrt{\hat{v}_t} + \epsilon)$ |
| Gradient memory | Momentum $v_t$ (first moment only) | Both first and second moments |
| Hyperparameters | $\eta$, $\mu$, schedule | $\eta$, $\beta_1$, $\beta_2$, $\epsilon$ |
| Tuning effort | High (schedule matters a lot) | Low (defaults often work) |
| Sparse gradients | Poor (zero update for zero gradient) | Good (moment estimates persist) |
| Generalization | Often better (flat minima) | Often worse (sharp minima) |
| Convergence theory | Well understood | Has known pathologies |
| Weight decay | Standard L2 works fine | Needs AdamW (decoupled) |

What to Memorize

  1. Adam = exponential moving average of $g_t$ (first moment) and $g_t^2$ (second moment) + bias correction + per-parameter rates
  2. SGD + momentum = accumulate velocity, single global rate
  3. Adam converges faster, SGD generalizes better (empirically, especially in vision)
  4. AdamW decouples weight decay from adaptive scaling: use AdamW, not Adam + L2
  5. Adam defaults: $\eta=0.001$, $\beta_1=0.9$, $\beta_2=0.999$
  6. Decision rule: Transformers/NLP use AdamW. ResNets/vision use SGD + cosine schedule. New problem with no tuning budget? Start with Adam.

When a Researcher Would Use Each

Example

Training a large language model (GPT, LLaMA)

Use AdamW with a linear warmup followed by cosine decay. Transformers trained with SGD converge slowly or not at all, because the loss landscape of attention layers is highly non-isotropic. AdamW's per-parameter scaling handles this effectively. This is the universal choice in LLM training.
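The warmup-then-cosine schedule mentioned above can be sketched in a few lines. The specific peak rate and step counts below are illustrative placeholders, not values from any particular training run:

```python
import math

def lr_at(step, peak_lr=3e-4, warmup_steps=2000, total_steps=100_000, min_lr=0.0):
    """Linear warmup from 0 to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The schedule multiplies AdamW's base rate at every step: it climbs linearly to the peak at `warmup_steps`, then follows a half-cosine back down to `min_lr` at `total_steps`.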

Example

Training ResNet-50 on ImageNet

Use SGD with momentum ($\mu = 0.9$) and a step or cosine learning rate schedule. Years of empirical evidence show SGD achieves 0.5 to 1.0% lower top-1 error than Adam on this benchmark. The extra tuning effort pays off because ImageNet training is expensive and every fraction of a percent matters.

Example

Quick prototype on a new dataset

Use Adam with defaults. You do not know the loss landscape, you do not want to spend time on learning rate sweeps, and you just want a reasonable model to validate your approach. Switch to SGD later if you need to squeeze out the last bit of performance.

Example

Fine-tuning a pretrained transformer

Use AdamW with a small learning rate ($10^{-5}$ to $5 \times 10^{-5}$) and linear warmup. Fine-tuning requires careful optimization because you are near a good minimum and large steps destroy pretrained features. AdamW's adaptive rates help navigate this delicate landscape.

Common Confusions

Watch Out

Adam is not uniformly better than SGD

Adam's faster convergence on training loss does not imply better test performance. The generalization gap between Adam and SGD is well-documented, particularly in image classification. Faster training loss reduction can mean the optimizer is finding sharp, overfit solutions more efficiently.

Watch Out

Adam's learning rate still matters

Despite being called "adaptive," Adam still requires a good base learning rate $\eta$. The adaptation is relative: each parameter's rate is $\eta / \sqrt{\hat{v}_t}$, so $\eta$ scales all of them. Using $\eta = 0.1$ with Adam is usually catastrophic. The default $\eta = 0.001$ is a good starting point, but it is not universally optimal.

Watch Out

The bias correction matters early in training

Without bias correction, the first few steps of Adam use heavily biased (toward zero) moment estimates, leading to excessively large updates. The correction $\hat{m}_t = m_t / (1 - \beta_1^t)$ (and likewise $\hat{v}_t = v_t / (1 - \beta_2^t)$) is not optional. It is essential for stable early training. Some implementations omit it; this is a bug.
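The counterintuitive part is the direction of the error: both moments are biased toward zero, but $v$ sits under a square root in the denominator, so its bias wins and the uncorrected step comes out too *large*. A quick numerical check of the first step (the gradient value is arbitrary):

```python
import numpy as np

b1, b2, eps = 0.9, 0.999, 1e-8
g = 0.5                                   # any nonzero first gradient
m = (1 - b1) * g                          # m_1: biased toward zero by factor (1 - b1)
v = (1 - b2) * g**2                       # v_1: biased toward zero by factor (1 - b2)

raw = m / (np.sqrt(v) + eps)              # first step direction, no bias correction
corrected = (m / (1 - b1)) / (np.sqrt(v / (1 - b2)) + eps)

# The uncorrected first step is (1 - b1) / sqrt(1 - b2) ~ 3.16x larger
# than the corrected one at the default betas.
ratio = float(raw / corrected)
```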