
Training Techniques

Adam Optimizer

Adaptive moment estimation: first and second moment tracking, bias correction, AdamW vs Adam+L2, learning rate warmup, and when Adam fails compared to SGD.

Core · Tier 1 · Stable · ~55 min

Why This Matters

Adam is the default optimizer for training deep neural networks. It combines momentum (first moment estimation) with adaptive learning rates (second moment estimation) to handle the challenges of non-convex, high-dimensional optimization. Understanding Adam deeply means understanding why bias correction matters, how AdamW differs from Adam+L2 (they are not the same), and when Adam actually hurts generalization compared to plain SGD.

[Figure] Adam update dataflow. The gradient gₜ feeds two running estimates: the first moment mₜ = β₁mₜ₋₁ + (1-β₁)gₜ and the second moment vₜ = β₂vₜ₋₁ + (1-β₂)gₜ². These are bias-corrected by 1/(1-β₁ᵗ) and 1/(1-β₂ᵗ) to give m̂ₜ and v̂ₜ, which combine into the update θₜ₊₁ = θₜ - η · m̂ₜ / (√v̂ₜ + ε). m tracks gradient direction (like momentum); v tracks gradient magnitude (adaptive LR per parameter).

Mental Model

SGD with momentum keeps a running average of gradients to smooth out noise. RMSprop keeps a running average of squared gradients to adapt the learning rate per-parameter (parameters with large gradients get smaller steps). Adam combines both: it maintains a momentum vector (first moment) and an adaptive scaling vector (second moment), with bias correction to handle initialization.

The Algorithm

Definition

Adam Update Rule

Given parameters θ, learning rate η, decay rates β₁, β₂, and small constant ε:

At step t, with gradient gₜ = ∇L(θₜ₋₁):

mₜ = β₁mₜ₋₁ + (1 - β₁)gₜ   (first moment estimate)
vₜ = β₂vₜ₋₁ + (1 - β₂)gₜ²   (second moment estimate)
m̂ₜ = mₜ / (1 - β₁ᵗ)   (bias-corrected first moment)
v̂ₜ = vₜ / (1 - β₂ᵗ)   (bias-corrected second moment)
θₜ = θₜ₋₁ - η · m̂ₜ / (√v̂ₜ + ε)

Default hyperparameters: β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸.
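The five equations above map directly to code. A minimal NumPy sketch (not the optimized kernels real frameworks use; the toy quadratic below is just an illustration):

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. t is the 1-indexed step count."""
    m = beta1 * m + (1 - beta1) * g         # first moment: EMA of gradients
    v = beta2 * v + (1 - beta2) * g**2      # second moment: EMA of squared gradients
    m_hat = m / (1 - beta1**t)              # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage: minimize f(x) = x^2 starting from x = 1
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 201):
    g = 2 * theta                           # gradient of x^2
    theta, m, v = adam_step(theta, g, m, v, t, lr=0.1)
```

Note that the bias corrections use the step counter t, so the optimizer state must include it alongside m and v.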

Components Explained

Definition

First Moment (Momentum)

mₜ = β₁mₜ₋₁ + (1-β₁)gₜ is an exponential moving average of gradients. With β₁ = 0.9, this averages roughly the last 10 gradients. It smooths out gradient noise and accumulates direction, like a ball rolling downhill with momentum.

Expanding the recursion: mₜ = (1-β₁) Σᵢ₌₁ᵗ β₁ᵗ⁻ⁱ gᵢ.

Definition

Second Moment (Adaptive Scaling)

vₜ = β₂vₜ₋₁ + (1-β₂)gₜ² is an exponential moving average of squared gradients (elementwise). With β₂ = 0.999, this averages roughly the last 1000 squared gradients. It estimates the variance of each gradient component.

Dividing by √vₜ gives each parameter its own effective learning rate: parameters with consistently large gradients get smaller steps, and parameters with small gradients get larger steps.
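A quick numeric illustration (NumPy sketch; the gradient values are arbitrary): two parameters whose raw gradients differ by four orders of magnitude end up with nearly identical adaptive step sizes.

```python
import numpy as np

beta2 = 0.999
grads = np.array([100.0, 0.01])   # one large-gradient, one small-gradient parameter
v = np.zeros(2)
for t in range(1, 1001):
    v = beta2 * v + (1 - beta2) * grads**2
v_hat = v / (1 - beta2**1000)     # bias-corrected second moment

# Normalized step direction g / sqrt(v_hat): both entries land near 1.0,
# so the division gives each parameter its own effective learning rate.
steps = grads / (np.sqrt(v_hat) + 1e-8)
```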

Bias Correction

Theorem

Bias Correction for Exponential Moving Averages

Statement

If m₀ = 0 and the gₜ are drawn from a stationary distribution with mean E[gₜ] = g, then the raw exponential moving average is biased:

E[mₜ] = (1 - βᵗ) · g

The bias-corrected estimate m̂ₜ = mₜ / (1 - βᵗ) satisfies E[m̂ₜ] = g.

Intuition

When you initialize m₀ = 0 and start averaging, the early estimates are biased toward zero. After one step with β = 0.9, you have m₁ = 0.1 · g₁, which underestimates the true gradient by a factor of 10. Dividing by (1 - 0.9¹) = 0.1 corrects this. The correction matters most in the first few iterations and becomes negligible as t grows (since βᵗ → 0).

Proof Sketch

mₜ = (1-β) Σᵢ₌₁ᵗ βᵗ⁻ⁱ gᵢ.

E[mₜ] = (1-β) g Σᵢ₌₁ᵗ βᵗ⁻ⁱ = (1-β) g · (1-βᵗ)/(1-β) = (1-βᵗ) g.

So E[mₜ/(1-βᵗ)] = g. For the second moment, the same argument applies with g replaced by E[gₜ²].
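The proof can be sanity-checked numerically with a constant gradient (a toy check; g = 5.0 and β = 0.9 are arbitrary choices):

```python
# With a constant gradient, the corrected estimate recovers g exactly at
# every step, while the raw EMA only approaches g as t grows.
g, beta = 5.0, 0.9
m = 0.0
for t in range(1, 6):
    m = beta * m + (1 - beta) * g
    m_hat = m / (1 - beta**t)   # equals g at every t, up to float rounding
```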

Why It Matters

Without bias correction, the first few Adam steps take very small updates because mₜ and vₜ are close to zero. With β₂ = 0.999, the second moment estimate is severely biased for the first ~1000 steps. Bias correction prevents this slow start and is essential for Adam to work well in practice.

Failure Mode

Bias correction assumes a stationary gradient distribution. In the early phase of training when the loss landscape changes rapidly, the stationarity assumption is violated. This is one motivation for learning rate warmup.

AdamW: Decoupled Weight Decay

Theorem

AdamW Decouples Weight Decay from Gradient Adaptation

Statement

Adam + L2 regularization adds the L2 gradient to the gradient before moment estimation:

gₜ = ∇L(θₜ₋₁) + λθₜ₋₁

then runs standard Adam. The effective weight decay for parameter θⱼ is λη/√v̂ⱼ, which varies per parameter.

AdamW applies weight decay directly to the parameters, after the adaptive step:

θₜ = (1 - ηλ)θₜ₋₁ - η · m̂ₜ / (√v̂ₜ + ε)

In AdamW, the weight decay λ is the same for all parameters, regardless of gradient magnitude.

Intuition

In SGD, L2 regularization and weight decay are equivalent. In Adam, they are not. When Adam divides the gradient by √vₜ, it also divides the L2 gradient, weakening the regularization for parameters with large gradient history. AdamW avoids this by applying decay separately. This means all parameters decay at the same rate, as intended.

Proof Sketch

With Adam + L2: the effective update for parameter j includes ηλθⱼ/√v̂ⱼ. Parameters with large v̂ⱼ (historically large gradients) receive weaker decay.

With AdamW: the decay term is ηλθⱼ regardless of v̂ⱼ. The gradient-based update and the decay are fully decoupled.

Why It Matters

Loshchilov and Hutter (2019) showed that AdamW consistently outperforms Adam+L2 across tasks. The key insight: regularization should not interact with the adaptive learning rate mechanism. AdamW is the default optimizer for training Transformer-based language models. For convolutional architectures, SGD with momentum remains competitive or superior (Wilson et al. 2017).

Failure Mode

The optimal λ for AdamW is different from the optimal λ for Adam+L2. You cannot simply swap one for the other without retuning. The typical AdamW weight decay for Transformers is 0.01-0.1, much larger than what you would use for L2 regularization in Adam.
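The two update rules can be placed side by side (a minimal NumPy sketch; the lr, lam, and gradient values in the usage are arbitrary):

```python
import numpy as np

def adam_l2_update(theta, grad, m, v, t, lr, lam, b1=0.9, b2=0.999, eps=1e-8):
    g = grad + lam * theta                 # L2 term enters the gradient...
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2           # ...and gets rescaled by sqrt(v_hat) below
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def adamw_update(theta, grad, m, v, t, lr, lam, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad           # moments see only the loss gradient
    v = b2 * v + (1 - b2) * grad**2
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
    # decay applied directly to the parameters, outside the adaptive step
    return (1 - lr * lam) * theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Same starting point, same lambda: the trajectories already differ at step 1.
theta0, grad0, z = np.array([1.0, 1.0]), np.array([10.0, 0.01]), np.zeros(2)
t_l2, _, _ = adam_l2_update(theta0, grad0, z, z, 1, lr=0.01, lam=0.1)
t_w, _, _ = adamw_update(theta0, grad0, z, z, 1, lr=0.01, lam=0.1)
```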

Learning Rate Warmup

In practice, Adam is often combined with learning rate warmup: start with a very small learning rate and linearly increase it over the first T_w steps to the target value. Why?

  1. Second moment initialization: At step 1, v̂₁ = g₁², a noisy single-sample estimate. A large learning rate with a noisy denominator produces wild parameter updates. Warmup gives vₜ time to stabilize.

  2. Loss landscape curvature: Early in training, the loss landscape may have regions of very high curvature. Large steps in these regions can be catastrophic. Warmup allows the model to reach a better-conditioned region before taking large steps.

A common schedule is linear warmup for 1-10% of total training steps, followed by cosine decay.
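That schedule can be sketched as follows (the warmup fraction and exact step indexing are assumptions; real schedulers differ in such details):

```python
import math

def lr_schedule(step, total_steps, peak_lr, warmup_frac=0.05):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # linear ramp: reaches peak_lr at the end of warmup
        return peak_lr * (step + 1) / warmup_steps
    # cosine decay over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))
```

The returned value is multiplied into (or replaces) the optimizer's learning rate at each step.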

When Adam Fails

Adam is not universally superior to SGD:

  1. Generalization gap (contested): Wilson et al. (2017) reported that SGD with momentum often generalizes better than Adam on image classification. One proposed mechanism is that Adam finds sharper minima while SGD's larger noise finds flatter ones (Keskar et al. 2017), but the flat-minima hypothesis itself has been challenged on reparameterization grounds (Dinh et al. 2017). The generalization gap is real; its causal mechanism is not settled.

  2. Non-convergence: Reddi et al. (2018) showed that Adam can diverge on simple convex problems because the adaptive learning rate can increase without bound when vₜ shrinks. AMSGrad fixes this by taking v̂ₜ = max(v̂ₜ₋₁, vₜ) to ensure the learning rate never increases. In practice AMSGrad is rarely used; the modification has not consistently improved empirical performance and most codebases default to AdamW.

  3. Domain dependence: Adam dominates for NLP/Transformers. SGD dominates for CNNs on vision tasks. The optimal optimizer depends on the architecture and data distribution.
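The AMSGrad fix from point 2 amounts to one extra line of state in the Adam step. A NumPy sketch (implementations differ on whether the max-tracked second moment is bias-corrected; this version corrects it, as common framework implementations do):

```python
import numpy as np

def amsgrad_step(theta, g, m, v, v_max, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    v_max = np.maximum(v_max, v)            # key change: v_max never decreases
    m_hat = m / (1 - b1**t)
    v_hat = v_max / (1 - b2**t)             # correction applied to the max-tracked v
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v, v_max

# Usage: a large gradient followed by a small one; v_max holds at its peak
theta, m, v, v_max = np.array([1.0]), np.zeros(1), np.zeros(1), np.zeros(1)
theta, m, v, v_max = amsgrad_step(theta, np.array([10.0]), m, v, v_max, 1)
v_after_big = v_max.copy()
theta, m, v, v_max = amsgrad_step(theta, np.array([0.1]), m, v, v_max, 2)
```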

Canonical Examples

Example

Why bias correction matters early

With β₂ = 0.999 and m₀ = v₀ = 0, at step t = 1:

  • Raw: v₁ = 0.001 · g₁² (underestimates the true second moment by 1000x)
  • Corrected: v̂₁ = g₁² (correct order of magnitude)

Without correction, the denominator √v₁ would be ~30x too small, making the effective learning rate ~30x too large. The first step would be catastrophically large.
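These numbers can be checked directly (NumPy sketch; g₁ = 0.5 is an arbitrary choice):

```python
import numpy as np

g1 = np.array([0.5])                   # arbitrary first gradient
v1 = (1 - 0.999) * g1**2               # raw: 0.001 * g1^2
v1_hat = v1 / (1 - 0.999**1)           # corrected: exactly g1^2
ratio = np.sqrt(v1_hat) / np.sqrt(v1)  # how much larger the corrected denominator is
```

The ratio is √1000 ≈ 31.6, matching the ~30x figure above.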

Common Confusions

Watch Out

Adam and AdamW are NOT interchangeable

Adam with L2 regularization and AdamW with weight decay produce different parameter trajectories, even with the same λ value. In Adam+L2, the regularization strength varies per parameter (inversely with gradient history). In AdamW, it is uniform. Always use AdamW when you want weight decay with adaptive optimizers.

Watch Out

The epsilon parameter is not negligible

The default ε = 10⁻⁸ prevents division by zero, but in half-precision training (ε_mach ≈ 10⁻⁴ for fp16), this default can cause numerical issues. For fp16 training, set ε = 10⁻⁵ or larger.
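One concrete symptom is easy to demonstrate: the default ε itself is not representable in float16 (NumPy check):

```python
import numpy as np

# float16 cannot hold 1e-8: the smallest positive subnormal is ~6e-8,
# so the default epsilon silently rounds to zero.
eps_default = np.float16(1e-8)
eps_safe = np.float16(1e-5)     # still representable (as a subnormal)
```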

Summary

  • Adam = momentum (first moment) + adaptive LR (second moment) + bias correction
  • First moment: mₜ = β₁mₜ₋₁ + (1-β₁)gₜ (smooths gradients)
  • Second moment: vₜ = β₂vₜ₋₁ + (1-β₂)gₜ² (scales per-parameter)
  • Bias correction: divide by (1-βᵗ) to correct for zero initialization
  • AdamW decouples weight decay from adaptive scaling: use AdamW, not Adam+L2
  • Adam often finds sharper minima than SGD; SGD can generalize better on some tasks (the mechanism is contested)
  • Warmup stabilizes the second moment estimate in early training

Exercises

Exercise · Core

Problem

Derive the bias-corrected first moment estimate. Starting from m₀ = 0 and mₜ = β₁mₜ₋₁ + (1-β₁)gₜ, show that E[mₜ] = (1-β₁ᵗ) g when the gradients have constant expectation E[gₜ] = g.

Exercise · Advanced

Problem

Show that for SGD (no adaptive scaling), L2 regularization (gₜ ← gₜ + λθ) is equivalent to weight decay (θ ← (1-ηλ)θ - ηgₜ). Then explain why this equivalence breaks for Adam.


References

Canonical:

  • Kingma & Ba, "Adam: A Method for Stochastic Optimization" (2015)
  • Loshchilov & Hutter, "Decoupled Weight Decay Regularization" (2019); introduces AdamW

Current:

  • Reddi et al., "On the Convergence of Adam and Beyond" (2018); introduces AMSGrad
  • Wilson et al., "The Marginal Value of Adaptive Gradient Methods in ML" (2017)

Next Topics

Adam connects to broader optimization topics:

  • Convex optimization basics: the theoretical foundation for understanding convergence
  • Learning rate schedules: warmup, cosine decay, and their interaction with Adam

Last reviewed: April 2026
