
Comparison

Adam vs. SGD

Adam adapts the learning rate per parameter using first and second moment estimates for fast early convergence; SGD with momentum uses a single global learning rate and often finds flatter minima that generalize better. The choice depends on your priorities: speed to convergence or final model quality.

What Each Measures

Both Adam and SGD (Stochastic Gradient Descent) are first-order optimization algorithms used to train neural networks by minimizing a loss function $\mathcal{L}(\theta)$ using gradient information. They differ in how they use gradient history to compute parameter updates.

SGD (with momentum) maintains a single global learning rate $\eta$ and optionally accumulates a velocity vector from past gradients.

Adam (Adaptive Moment Estimation) maintains per-parameter learning rates by tracking exponential moving averages of both the first moment (mean) and second moment (uncentered variance) of gradients.

Side-by-Side Statement

Definition

SGD with Momentum

At step $t$, given gradient $g_t = \nabla_\theta \mathcal{L}(\theta_t)$:

$$v_t = \mu \, v_{t-1} + g_t$$
$$\theta_{t+1} = \theta_t - \eta \, v_t$$

where $\mu \in [0, 1)$ is the momentum coefficient (typically $0.9$) and $\eta$ is the global learning rate. Every parameter gets the same $\eta$.
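The two update equations map directly to a few lines of NumPy. A minimal sketch (the toy quadratic loss and the specific hyperparameter values here are illustrative):

```python
import numpy as np

def sgd_momentum_step(theta, velocity, grad, lr=0.1, mu=0.9):
    """One SGD-with-momentum update: v_t = mu * v_{t-1} + g_t, then theta -= lr * v_t."""
    velocity = mu * velocity + grad
    theta = theta - lr * velocity
    return theta, velocity

# Toy quadratic loss L(theta) = 0.5 * theta^2, whose gradient is theta itself.
theta = np.array([1.0])
velocity = np.zeros_like(theta)
for _ in range(100):
    theta, velocity = sgd_momentum_step(theta, velocity, grad=theta)
# theta has spiraled toward the minimum at 0 (momentum causes some oscillation).
```

Note that every entry of `theta` is scaled by the same `lr`; the velocity only smooths the gradient direction, it does not adapt the step size per parameter.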

Definition

Adam Optimizer

At step $t$, given gradient $g_t$:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \quad \text{(first moment estimate)}$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \quad \text{(second moment estimate)}$$
$$\hat{m}_t = m_t / (1 - \beta_1^t), \quad \hat{v}_t = v_t / (1 - \beta_2^t) \quad \text{(bias correction)}$$
$$\theta_{t+1} = \theta_t - \eta \, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$$

Default hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$. Each parameter effectively gets its own learning rate $\eta / (\sqrt{\hat{v}_t} + \epsilon)$.
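A NumPy sketch of the same four equations. The usage example below (two parameters with wildly different gradient scales) is illustrative, chosen to show the per-parameter normalization at work:

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias correction; t is the 1-indexed step count."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)               # bias-corrected first moment
    v_hat = v / (1 - b2**t)               # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Two parameters whose gradients differ by a factor of 5000 still take
# nearly equal-sized first steps (~lr), because each coordinate is
# normalized by its own gradient magnitude.
theta = np.array([5.0, 0.001])
m, v = np.zeros(2), np.zeros(2)
new_theta, m, v = adam_step(theta, m, v, grad=theta, t=1)
step = theta - new_theta                  # both entries are close to 0.001
```

This is the core contrast with SGD: the update direction is roughly the sign of the gradient per coordinate, scaled by $\eta$, rather than the raw gradient scaled by $\eta$.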

Where Each Is Stronger

Adam wins on convergence speed

Adam adapts per-parameter learning rates: parameters with consistently large gradients get smaller effective rates, and parameters with small or sparse gradients get larger rates. This means rapid early progress even on poorly scaled or ill-conditioned problems, since no single global rate has to suit every parameter at once.

SGD wins on final generalization

A large body of empirical evidence (particularly in computer vision) shows that SGD with momentum, combined with a carefully tuned learning rate schedule, finds solutions that generalize better than those found by Adam. The dominant hypothesis is that SGD's uniformly scaled, noisier updates bias it toward wide, flat minima, while Adam's per-parameter scaling lets it settle efficiently into sharper minima that generalize worse.

This is the "sharp vs. flat minima" debate. It is not fully settled theoretically, but the empirical pattern is robust in image classification.

Adam wins on hyperparameter robustness

Adam with defaults ($\eta = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$) works reasonably well across many problems. SGD requires careful tuning of the learning rate, momentum, and learning rate schedule (step decay, cosine annealing, warmup). A poorly tuned SGD can perform terribly, while Adam with defaults rarely fails catastrophically.

SGD wins in theoretical understanding

SGD has well-understood convergence guarantees: convergence to the global minimum for convex objectives (under suitable step-size schedules) and to stationary points for smooth non-convex objectives, with known rates under standard smoothness and bounded-variance assumptions.

Adam's convergence theory is more complex. The original Adam paper's convergence proof had a gap (the counterexample of Reddi et al., 2018 showed Adam can diverge on simple convex problems). AMSGrad was proposed as a fix, but in practice, standard Adam works fine on most deep learning problems.

Where Each Fails

Adam fails on generalization (in some domains)

In computer vision (ResNets on ImageNet, for example), Adam consistently achieves higher test error than SGD with momentum, even when both achieve similar training loss. This "generalization gap" of adaptive methods has been documented repeatedly.

SGD fails without careful tuning

SGD with a bad learning rate diverges or stagnates. Finding the right rate often requires grid search, warmup schedules, and patience. For new problems where you have no prior knowledge of the loss landscape, the cost of tuning SGD can be prohibitive.

Adam fails on convergence guarantees

As Reddi et al. (2018) showed, Adam can fail to converge even on simple convex problems because the exponential moving average of past squared gradients can cause the effective learning rate to increase. AMSGrad fixes this by taking the maximum of all past second-moment estimates, but adds memory overhead.
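The AMSGrad change is a single extra line of state. A sketch of one common formulation (exact placement of the bias correction varies between the original paper and popular implementations; this version is illustrative):

```python
import numpy as np

def amsgrad_step(theta, m, v, v_max, grad, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """AMSGrad: Adam, but the denominator uses the running max of the second
    moment, so the effective per-parameter learning rate can never increase."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    v_max = np.maximum(v_max, v)          # extra state: this is the memory overhead
    m_hat = m / (1 - b1**t)
    theta = theta - lr * m_hat / (np.sqrt(v_max) + eps)
    return theta, m, v, v_max

# After one large gradient, v decays under small gradients but v_max does not,
# so the effective learning rate stays capped.
theta, m, v, v_max = np.zeros(1), np.zeros(1), np.zeros(1), np.zeros(1)
theta, m, v, v_max = amsgrad_step(theta, m, v, v_max, np.array([10.0]), t=1)
theta, m, v, v_max = amsgrad_step(theta, m, v, v_max, np.array([0.01]), t=2)
```

In standard Adam, the decaying `v` would let the effective rate grow back; freezing the denominator at its maximum is exactly what restores the convergence guarantee.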

SGD fails on sparse gradient problems

In NLP tasks with large embedding tables, most embeddings receive zero gradients on any given minibatch. SGD gives these parameters zero updates. Adam's first moment estimate carries forward previous gradient information, and the small second moment estimate amplifies the effective learning rate for rare parameters. This is why Adam (and its variants) dominate NLP.
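A toy illustration of this effect, comparing plain SGD (no momentum) with Adam on a single embedding weight that receives one gradient and then nothing (all values here are illustrative):

```python
import numpy as np

b1, b2, eps, lr = 0.9, 0.999, 1e-8, 0.001
m, v = 0.0, 0.0
theta_sgd, theta_adam = 1.0, 1.0
for t in range(1, 6):
    g = 1.0 if t == 1 else 0.0            # sparse: gradient only on step 1
    theta_sgd -= lr * g                   # plain SGD: zero gradient, zero update
    m = b1 * m + (1 - b1) * g             # first moment persists after t = 1
    v = b2 * v + (1 - b2) * g**2
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
    theta_adam -= lr * m_hat / (np.sqrt(v_hat) + eps)
# theta_sgd froze after step 1; theta_adam kept moving on steps 2-5
# because the decaying first moment still carries the step-1 signal.
```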

The AdamW Distinction

Definition

AdamW (Decoupled Weight Decay)

Standard Adam with L2 regularization applies the penalty inside the gradient, which interacts poorly with the adaptive scaling. AdamW decouples weight decay from the adaptive update:

$$\theta_{t+1} = \theta_t - \eta\left(\hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon) + \lambda \theta_t\right)$$

This means the weight decay $\lambda\theta_t$ is applied at the same scale for all parameters, not divided by $\sqrt{\hat{v}_t}$. AdamW consistently outperforms Adam with L2 regularization and is the standard choice for training transformers.

The difference: in Adam + L2, the regularization gradient $\lambda\theta$ is scaled by $1/\sqrt{\hat{v}_t}$, so parameters with large gradient magnitudes get less regularization. AdamW removes this unwanted coupling.
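The coupling is easy to see in code. In the sketch below (hyperparameter values illustrative), the task gradient is set to zero so only the decay term acts: Adam + L2 then shrinks a weight by roughly $\eta$ per step regardless of its magnitude, while AdamW shrinks it in proportion to its magnitude, as intended:

```python
import numpy as np

def adam_l2_step(theta, m, v, grad, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=1e-2):
    """Adam + L2: the penalty wd*theta enters the gradient, so it gets
    rescaled by 1/sqrt(v_hat) along with everything else."""
    g = grad + wd * theta
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def adamw_step(theta, m, v, grad, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=1e-2):
    """AdamW: weight decay applied outside the adaptive scaling."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
    return theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta), m, v

theta = np.array([10.0])
t_l2, _, _ = adam_l2_step(theta, np.zeros(1), np.zeros(1), np.zeros(1), t=1)
t_w,  _, _ = adamw_step(theta, np.zeros(1), np.zeros(1), np.zeros(1), t=1)
step_l2 = float((theta - t_l2)[0])        # ~lr: adaptive scaling normalized the decay away
step_w  = float((theta - t_w)[0])         # lr * wd * theta: proportional, as decay should be
```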

Key Assumptions That Differ

| | SGD (with momentum) | Adam |
|---|---|---|
| Learning rate | Single global $\eta$ | Per-parameter: $\eta / (\sqrt{\hat{v}_t} + \epsilon)$ |
| Gradient memory | Momentum $v_t$ (first moment only) | Both first and second moments |
| Hyperparameters | $\eta$, $\mu$, schedule | $\eta$, $\beta_1$, $\beta_2$, $\epsilon$ |
| Tuning effort | High (schedule matters a lot) | Low (defaults often work) |
| Sparse gradients | Poor (zero update for zero gradient) | Good (moment estimates persist) |
| Generalization | Often better (flat minima) | Often worse (sharp minima) |
| Convergence theory | Well understood | Has known pathologies |
| Weight decay | Standard L2 works fine | Needs AdamW (decoupled) |

What to Memorize

  1. Adam = exponential moving average of $g_t$ (first moment) and $g_t^2$ (second moment) + bias correction + per-parameter rates
  2. SGD + momentum = accumulate velocity, single global rate
  3. Adam converges faster, SGD generalizes better (empirically, especially in vision)
  4. AdamW decouples weight decay from adaptive scaling: use AdamW, not Adam + L2
  5. Adam defaults: $\eta=0.001$, $\beta_1=0.9$, $\beta_2=0.999$
  6. Decision rule: Transformers/NLP use AdamW. ResNets/vision use SGD + cosine schedule. New problem with no tuning budget? Start with Adam.

When a Researcher Would Use Each

Example

Training a large language model (GPT, LLaMA)

Use AdamW with a linear warmup followed by cosine decay. Transformers trained with SGD converge slowly or not at all, because the loss landscape of attention layers is highly non-isotropic. AdamW's per-parameter scaling handles this effectively. This is the universal choice in LLM training.
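The warmup-then-cosine schedule mentioned above can be sketched in a few lines. The specific peak rate and step counts below are illustrative placeholders, not values from any particular training run:

```python
import math

def lr_at(step, peak_lr=3e-4, warmup_steps=2000, total_steps=100_000, min_lr=0.0):
    """Linear warmup from 0 to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The schedule multiplies AdamW's base rate at every step: it climbs linearly to the peak at `warmup_steps`, then follows a half-cosine back down to `min_lr` at `total_steps`.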

Example

Training ResNet-50 on ImageNet

Use SGD with momentum ($\mu = 0.9$) and a step or cosine learning rate schedule. Years of empirical evidence show SGD achieves 0.5 to 1.0% lower top-1 error than Adam on this benchmark. The extra tuning effort pays off because ImageNet training is expensive and every fraction of a percent matters.

Example

Quick prototype on a new dataset

Use Adam with defaults. You do not know the loss landscape, you do not want to spend time on learning rate sweeps, and you just want a reasonable model to validate your approach. Switch to SGD later if you need to squeeze out the last bit of performance.

Example

Fine-tuning a pretrained transformer

Use AdamW with a small learning rate ($10^{-5}$ to $5 \times 10^{-5}$) and linear warmup. Fine-tuning requires careful optimization because you are near a good minimum and large steps destroy pretrained features. AdamW's adaptive rates help navigate this delicate landscape.

Common Confusions

Watch Out

Adam is not uniformly better than SGD

Adam's faster convergence on training loss does not imply better test performance. The generalization gap between Adam and SGD is well-documented, particularly in image classification. Faster training loss reduction can mean the optimizer is finding sharp, overfit solutions more efficiently.

Watch Out

Adam's learning rate still matters

Despite being called "adaptive," Adam still requires a good base learning rate $\eta$. The adaptation is relative: each parameter's rate is $\eta / \sqrt{\hat{v}_t}$, so $\eta$ scales all of them. Using $\eta = 0.1$ with Adam is usually catastrophic. The default $\eta = 0.001$ is a good starting point, but it is not universally optimal.

Watch Out

The bias correction matters early in training

Without bias correction, the first few steps of Adam use heavily biased (toward zero) moment estimates, leading to excessively large updates. The correction $\hat{m}_t = m_t / (1 - \beta_1^t)$ (and likewise $\hat{v}_t = v_t / (1 - \beta_2^t)$) is not optional. It is essential for stable early training. Some implementations omit it; this is a bug.
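The counterintuitive part is the direction of the error: both moments are biased toward zero, but $v$ sits under a square root in the denominator, so its bias wins and the uncorrected step comes out too *large*. A quick numerical check of the first step (the gradient value is arbitrary):

```python
import numpy as np

b1, b2, eps = 0.9, 0.999, 1e-8
g = 0.5                                   # any nonzero first gradient
m = (1 - b1) * g                          # m_1: biased toward zero by factor (1 - b1)
v = (1 - b2) * g**2                       # v_1: biased toward zero by factor (1 - b2)

raw = m / (np.sqrt(v) + eps)              # first step direction, no bias correction
corrected = (m / (1 - b1)) / (np.sqrt(v / (1 - b2)) + eps)

# The uncorrected first step is (1 - b1) / sqrt(1 - b2) ~ 3.16x larger
# than the corrected one at the default betas.
ratio = float(raw / corrected)
```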