
Training Techniques

Adam Optimizer

Adaptive moment estimation: first and second moment tracking, bias correction, AdamW vs Adam+L2, learning rate warmup, and when Adam fails compared to SGD.

Core · Tier 1 · Stable · ~55 min

Why This Matters

Adam is the default optimizer for training deep neural networks. It combines momentum (first moment estimation) with adaptive learning rates (second moment estimation) to handle the challenges of non-convex, high-dimensional optimization. Understanding Adam deeply means understanding why bias correction matters, how AdamW differs from Adam+L2 (they are not the same), and when Adam actually hurts generalization compared to plain SGD.

[Figure] Adam update dataflow. The gradient gₜ feeds two running estimates: the first moment mₜ = β₁mₜ₋₁ + (1-β₁)gₜ and the second moment vₜ = β₂vₜ₋₁ + (1-β₂)gₜ². These are bias-corrected by 1/(1-β₁ᵗ) and 1/(1-β₂ᵗ) to give m̂ₜ and v̂ₜ, which combine into the update θₜ₊₁ = θₜ - η · m̂ₜ / (√v̂ₜ + ε). m tracks gradient direction (like momentum); v tracks gradient magnitude (adaptive LR per parameter).

Mental Model

SGD with momentum keeps a running average of gradients to smooth out noise. RMSprop keeps a running average of squared gradients to adapt the learning rate per-parameter (parameters with large gradients get smaller steps). Adam combines both: it maintains a momentum vector (first moment) and an adaptive scaling vector (second moment), with bias correction to handle initialization.

The Algorithm

Definition

Adam Update Rule

Given parameters θ, learning rate η, decay rates β₁, β₂, and small constant ε:

At step t, with gradient gₜ = ∇L(θₜ₋₁):

mₜ = β₁mₜ₋₁ + (1 - β₁)gₜ   (first moment estimate)
vₜ = β₂vₜ₋₁ + (1 - β₂)gₜ²   (second moment estimate)
m̂ₜ = mₜ / (1 - β₁ᵗ)   (bias-corrected first moment)
v̂ₜ = vₜ / (1 - β₂ᵗ)   (bias-corrected second moment)
θₜ = θₜ₋₁ - η · m̂ₜ / (√v̂ₜ + ε)

Default hyperparameters: β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸.
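The five equations above map directly to code. A minimal NumPy sketch (not the optimized kernels real frameworks use; the toy quadratic below is just an illustration):

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. t is the 1-indexed step count."""
    m = beta1 * m + (1 - beta1) * g         # first moment: EMA of gradients
    v = beta2 * v + (1 - beta2) * g**2      # second moment: EMA of squared gradients
    m_hat = m / (1 - beta1**t)              # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage: minimize f(x) = x^2 starting from x = 1
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 201):
    g = 2 * theta                           # gradient of x^2
    theta, m, v = adam_step(theta, g, m, v, t, lr=0.1)
```

Note that the bias corrections use the step counter t, so the optimizer state must include it alongside m and v.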

Components Explained

Definition

First Moment (Momentum)

mₜ = β₁mₜ₋₁ + (1-β₁)gₜ is an exponential moving average of gradients. With β₁ = 0.9, this averages roughly the last 10 gradients. It smooths out gradient noise and accumulates direction, like a ball rolling downhill with momentum.

Expanding the recursion: mₜ = (1-β₁) Σᵢ₌₁ᵗ β₁ᵗ⁻ⁱ gᵢ.

Definition

Second Moment (Adaptive Scaling)

vₜ = β₂vₜ₋₁ + (1-β₂)gₜ² is an exponential moving average of squared gradients (elementwise). With β₂ = 0.999, this averages roughly the last 1000 squared gradients. It estimates the variance of each gradient component.

Dividing by √vₜ gives each parameter its own effective learning rate: parameters with consistently large gradients get smaller steps, and parameters with small gradients get larger steps.
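A quick numeric illustration (NumPy sketch; the gradient values are arbitrary): two parameters whose raw gradients differ by four orders of magnitude end up with nearly identical adaptive step sizes.

```python
import numpy as np

beta2 = 0.999
grads = np.array([100.0, 0.01])   # one large-gradient, one small-gradient parameter
v = np.zeros(2)
for t in range(1, 1001):
    v = beta2 * v + (1 - beta2) * grads**2
v_hat = v / (1 - beta2**1000)     # bias-corrected second moment

# Normalized step direction g / sqrt(v_hat): both entries land near 1.0,
# so the division gives each parameter its own effective learning rate.
steps = grads / (np.sqrt(v_hat) + 1e-8)
```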

Bias Correction

Theorem

Bias Correction for Exponential Moving Averages

Statement

If m₀ = 0 and the gₜ are drawn from a stationary distribution with mean E[gₜ] = g, then the raw exponential moving average is biased:

E[mₜ] = (1 - βᵗ) · g

The bias-corrected estimate m̂ₜ = mₜ / (1 - βᵗ) satisfies E[m̂ₜ] = g.

Intuition

When you initialize m₀ = 0 and start averaging, the early estimates are biased toward zero. After one step with β = 0.9, you have m₁ = 0.1 · g₁, which underestimates the true gradient by a factor of 10. Dividing by (1 - 0.9¹) = 0.1 corrects this. The correction matters most in the first few iterations and becomes negligible as t grows (since βᵗ → 0).

Proof Sketch

mₜ = (1-β) Σᵢ₌₁ᵗ βᵗ⁻ⁱ gᵢ.

E[mₜ] = (1-β) g Σᵢ₌₁ᵗ βᵗ⁻ⁱ = (1-β) g · (1-βᵗ)/(1-β) = (1-βᵗ) g.

So E[mₜ/(1-βᵗ)] = g. For the second moment, the same argument applies with g replaced by E[gₜ²].
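The proof can be sanity-checked numerically with a constant gradient (a toy check; g = 5.0 and β = 0.9 are arbitrary choices):

```python
# With a constant gradient, the corrected estimate recovers g exactly at
# every step, while the raw EMA only approaches g as t grows.
g, beta = 5.0, 0.9
m = 0.0
for t in range(1, 6):
    m = beta * m + (1 - beta) * g
    m_hat = m / (1 - beta**t)   # equals g at every t, up to float rounding
```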

Why It Matters

Without bias correction, the first few Adam steps take very small updates because mₜ and vₜ are close to zero. With β₂ = 0.999, the second moment estimate is severely biased for the first ~1000 steps. Bias correction prevents this slow start and is essential for Adam to work well in practice.

Failure Mode

Bias correction assumes a stationary gradient distribution. In the early phase of training when the loss landscape changes rapidly, the stationarity assumption is violated. This is one motivation for learning rate warmup.

AdamW: Decoupled Weight Decay

Theorem

AdamW Decouples Weight Decay from Gradient Adaptation

Statement

Adam + L2 regularization adds the L2 gradient to the gradient before moment estimation:

gₜ = ∇L(θₜ₋₁) + λθₜ₋₁

then runs standard Adam. The effective weight decay for parameter θⱼ is λη/√v̂ⱼ, which varies per parameter.

AdamW applies weight decay directly to the parameters, after the adaptive step:

θₜ = (1 - ηλ)θₜ₋₁ - η · m̂ₜ / (√v̂ₜ + ε)

In AdamW, the weight decay λ is the same for all parameters, regardless of gradient magnitude.

Intuition

In SGD, L2 regularization and weight decay are equivalent. In Adam, they are not. When Adam divides the gradient by √vₜ, it also divides the L2 gradient, weakening the regularization for parameters with large gradient history. AdamW avoids this by applying decay separately. This means all parameters decay at the same rate, as intended.

Proof Sketch

With Adam + L2: the effective update for parameter j includes ηλθⱼ/√v̂ⱼ. Parameters with large v̂ⱼ (historically large gradients) receive weaker decay.

With AdamW: the decay term is ηλθⱼ regardless of v̂ⱼ. The gradient-based update and the decay are fully decoupled.

Why It Matters

Loshchilov and Hutter (2019) showed that AdamW consistently outperforms Adam+L2 across tasks. The key insight: regularization should not interact with the adaptive learning rate mechanism. AdamW is the default optimizer for training Transformer-based language models. For convolutional architectures, SGD with momentum remains competitive or superior (Wilson et al. 2017).

Failure Mode

The optimal λ for AdamW is different from the optimal λ for Adam+L2. You cannot simply swap one for the other without retuning. The typical AdamW weight decay for Transformers is 0.01-0.1, much larger than what you would use for L2 regularization in Adam.
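The two update rules can be placed side by side (a minimal NumPy sketch; the lr, lam, and gradient values in the usage are arbitrary):

```python
import numpy as np

def adam_l2_update(theta, grad, m, v, t, lr, lam, b1=0.9, b2=0.999, eps=1e-8):
    g = grad + lam * theta                 # L2 term enters the gradient...
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2           # ...and gets rescaled by sqrt(v_hat) below
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def adamw_update(theta, grad, m, v, t, lr, lam, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad           # moments see only the loss gradient
    v = b2 * v + (1 - b2) * grad**2
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
    # decay applied directly to the parameters, outside the adaptive step
    return (1 - lr * lam) * theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Same starting point, same lambda: the trajectories already differ at step 1.
theta0, grad0, z = np.array([1.0, 1.0]), np.array([10.0, 0.01]), np.zeros(2)
t_l2, _, _ = adam_l2_update(theta0, grad0, z, z, 1, lr=0.01, lam=0.1)
t_w, _, _ = adamw_update(theta0, grad0, z, z, 1, lr=0.01, lam=0.1)
```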

Learning Rate Warmup

In practice, Adam is often combined with learning rate warmup: start with a very small learning rate and linearly increase it over the first T_w steps to the target value. Why?

  1. Second moment initialization: At step 1, v̂₁ = g₁², a noisy single-sample estimate. A large learning rate with a noisy denominator produces wild parameter updates. Warmup gives vₜ time to stabilize.

  2. Loss landscape curvature: Early in training, the loss landscape may have regions of very high curvature. Large steps in these regions can be catastrophic. Warmup allows the model to reach a better-conditioned region before taking large steps.

A common schedule is linear warmup for 1-10% of total training steps, followed by cosine decay.
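That schedule can be sketched as follows (the warmup fraction and exact step indexing are assumptions; real schedulers differ in such details):

```python
import math

def lr_schedule(step, total_steps, peak_lr, warmup_frac=0.05):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # linear ramp: reaches peak_lr at the end of warmup
        return peak_lr * (step + 1) / warmup_steps
    # cosine decay over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))
```

The returned value is multiplied into (or replaces) the optimizer's learning rate at each step.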

When Adam Fails

Adam is not universally superior to SGD:

  1. Generalization gap (contested): Wilson et al. (2017) reported that SGD with momentum often generalizes better than Adam on image classification. One proposed mechanism is that Adam finds sharper minima while SGD's larger noise finds flatter ones (Keskar et al. 2017), but the flat-minima hypothesis itself has been challenged on reparameterization grounds (Dinh et al. 2017). The generalization gap is real; its causal mechanism is not settled.

  2. Non-convergence: Reddi et al. (2018) showed that Adam can diverge on simple convex problems because the adaptive learning rate can increase without bound when vₜ shrinks. AMSGrad fixes this by taking v̂ₜ = max(v̂ₜ₋₁, vₜ) to ensure the learning rate never increases. In practice AMSGrad is rarely used; the modification has not consistently improved empirical performance and most codebases default to AdamW.

  3. Domain dependence: Adam dominates for NLP/Transformers. SGD dominates for CNNs on vision tasks. The optimal optimizer depends on the architecture and data distribution.
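The AMSGrad fix from point 2 amounts to one extra line of state in the Adam step. A NumPy sketch (implementations differ on whether the max-tracked second moment is bias-corrected; this version corrects it, as common framework implementations do):

```python
import numpy as np

def amsgrad_step(theta, g, m, v, v_max, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    v_max = np.maximum(v_max, v)            # key change: v_max never decreases
    m_hat = m / (1 - b1**t)
    v_hat = v_max / (1 - b2**t)             # correction applied to the max-tracked v
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v, v_max

# Usage: a large gradient followed by a small one; v_max holds at its peak
theta, m, v, v_max = np.array([1.0]), np.zeros(1), np.zeros(1), np.zeros(1)
theta, m, v, v_max = amsgrad_step(theta, np.array([10.0]), m, v, v_max, 1)
v_after_big = v_max.copy()
theta, m, v, v_max = amsgrad_step(theta, np.array([0.1]), m, v, v_max, 2)
```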

Canonical Examples

Example

Why bias correction matters early

With β₂ = 0.999 and m₀ = v₀ = 0, at step t = 1:

  • Raw: v₁ = 0.001 · g₁² (underestimates the true second moment by 1000x)
  • Corrected: v̂₁ = g₁² (correct order of magnitude)

Without correction, the denominator √v₁ would be ~30x too small, making the effective learning rate ~30x too large. The first step would be catastrophically large.
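These numbers can be checked directly (NumPy sketch; g₁ = 0.5 is an arbitrary choice):

```python
import numpy as np

g1 = np.array([0.5])                   # arbitrary first gradient
v1 = (1 - 0.999) * g1**2               # raw: 0.001 * g1^2
v1_hat = v1 / (1 - 0.999**1)           # corrected: exactly g1^2
ratio = np.sqrt(v1_hat) / np.sqrt(v1)  # how much larger the corrected denominator is
```

The ratio is √1000 ≈ 31.6, matching the ~30x figure above.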

Common Confusions

Watch Out

Adam and AdamW are NOT interchangeable

Adam with L2 regularization and AdamW with weight decay produce different parameter trajectories, even with the same λ value. In Adam+L2, the regularization strength varies per parameter (inversely with gradient history). In AdamW, it is uniform. Always use AdamW when you want weight decay with adaptive optimizers.

Watch Out

The epsilon parameter is not negligible

The default ε = 10⁻⁸ prevents division by zero, but in half-precision training (ε_mach ≈ 10⁻⁴ for fp16), this default can cause numerical issues. For fp16 training, set ε = 10⁻⁵ or larger.
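One concrete symptom is easy to demonstrate: the default ε itself is not representable in float16 (NumPy check):

```python
import numpy as np

# float16 cannot hold 1e-8: the smallest positive subnormal is ~6e-8,
# so the default epsilon silently rounds to zero.
eps_default = np.float16(1e-8)
eps_safe = np.float16(1e-5)     # still representable (as a subnormal)
```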

Summary

  • Adam = momentum (first moment) + adaptive LR (second moment) + bias correction
  • First moment: mₜ = β₁mₜ₋₁ + (1-β₁)gₜ (smooths gradients)
  • Second moment: vₜ = β₂vₜ₋₁ + (1-β₂)gₜ² (scales per-parameter)
  • Bias correction: divide by (1-βᵗ) to correct for zero initialization
  • AdamW decouples weight decay from adaptive scaling: use AdamW, not Adam+L2
  • Adam often finds sharper minima than SGD; SGD can generalize better on some tasks (the mechanism is contested)
  • Warmup stabilizes the second moment estimate in early training

Exercises

Exercise · Core

Problem

Derive the bias-corrected first moment estimate. Starting from m₀ = 0 and mₜ = β₁mₜ₋₁ + (1-β₁)gₜ, show that E[mₜ] = (1-β₁ᵗ) g when the gradients have constant expectation E[gₜ] = g.

Exercise · Advanced

Problem

Show that for SGD (no adaptive scaling), L2 regularization (gₜ ← gₜ + λθ) is equivalent to weight decay (θ ← (1-ηλ)θ - ηgₜ). Then explain why this equivalence breaks for Adam.


References

Canonical:

  • Kingma & Ba, "Adam: A Method for Stochastic Optimization" (2015)
  • Loshchilov & Hutter, "Decoupled Weight Decay Regularization" (2019); introduces AdamW

Current:

  • Reddi et al., "On the Convergence of Adam and Beyond" (2018); introduces AMSGrad
  • Wilson et al., "The Marginal Value of Adaptive Gradient Methods in ML" (2017)

Next Topics

Adam connects to broader optimization topics:

  • Convex optimization basics: the theoretical foundation for understanding convergence
  • Learning rate schedules: warmup, cosine decay, and their interaction with Adam

Last reviewed: April 2026
