
Paper breakdown

Adam: A Method for Stochastic Optimization

Diederik P. Kingma and Jimmy Ba · 2014 · ICLR 2015

Combines exponentially decayed estimates of the first and second gradient moments to produce a per-coordinate adaptive step size. The default optimizer for almost every modern deep network from 2015 onward.

Overview

Kingma and Ba (2014) proposed an optimizer that maintains an exponentially decayed estimate of the gradient (first moment) and an exponentially decayed estimate of the squared gradient (second moment), then applies a per-coordinate step proportional to the ratio. The paper combines two prior ideas — momentum and per-coordinate scaling à la RMSProp — and adds a bias-correction step that fixes the cold-start underestimate of those moving averages.

The recipe is small. The impact is not. By 2016 Adam was the default optimizer for sequence models, in 2018 the default for transformers, and in 2026 it remains the default fine-tune and pretraining optimizer for almost every public LLM and diffusion model. AdamW, the decoupled-weight-decay variant of Loshchilov and Hutter (2017), is what production training scripts actually run, but the moment-tracking core is unchanged.

Mathematical Contributions

The update rule

Let $g_t = \nabla_\theta f_t(\theta_{t-1})$ be the stochastic gradient at step $t$. Adam maintains two running averages:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t \odot g_t$$

with $m_0 = v_0 = 0$. Because both averages are initialized at zero, they underestimate the true mean and second moment for small $t$. The paper's bias correction divides by the missing mass:

$$\hat m_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat v_t = \frac{v_t}{1 - \beta_2^t}$$

The per-step update is then:

$$\theta_t = \theta_{t-1} - \alpha\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}$$

with default hyperparameters $\alpha = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$. The square root and division are coordinate-wise.
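
A minimal NumPy sketch of one update, directly transcribing the formulas above (the function name and the toy quadratic are illustrative, not from the paper):

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter array theta, given its gradient g.

    m and v are the running first/second moment estimates carried between
    calls; t is the 1-based step count. Returns the updated (theta, m, v).
    """
    m = beta1 * m + (1 - beta1) * g          # first moment: decayed mean of gradients
    v = beta2 * v + (1 - beta2) * g * g      # second moment: decayed mean of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage: carry (m, v, t) across iterations of the training loop.
theta = np.array([1.0, -2.0, 0.5])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 101):
    g = 2 * theta                            # gradient of a toy quadratic sum(theta**2)
    theta, m, v = adam_step(theta, g, m, v, t)
```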

Why bias correction matters

Expanding the recursion gives $m_t = (1-\beta_1)\sum_{k=1}^t \beta_1^{t-k} g_k$. If the gradient distribution is stationary with mean $\mu$, then $\mathbb{E}[m_t] = \mu\,(1 - \beta_1^t)$. The factor $(1-\beta_1^t)$ is the missing mass; dividing by it returns an unbiased estimate of $\mu$. Without correction the first hundred steps move at less than the intended rate, which matters for short fine-tuning runs and learning-rate warm-ups.
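
A quick numeric check of that identity for a constant unit gradient (illustrative only):

```python
# With a constant gradient of 1, the uncorrected average is m_t = 1 - beta1**t,
# so early steps see only a fraction of the true mean; dividing by (1 - beta1**t)
# restores it exactly.
beta1, m = 0.9, 0.0
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * 1.0
    print(t, round(m, 4), round(m / (1 - beta1 ** t), 4))
# t=1: 0.1 vs 1.0, t=2: 0.19 vs 1.0, ..., t=5: 0.4095 vs 1.0
```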

Effective per-coordinate step size

The paper argues that the per-coordinate ratio $\hat m_t / \sqrt{\hat v_t}$ is approximately bounded:

$$\left|\frac{\hat m_t}{\sqrt{\hat v_t}}\right| \le \frac{1 - \beta_1}{\sqrt{1 - \beta_2}} \quad \text{(approximately)}$$

so the actual displacement per step is on the order of $\alpha$ regardless of the gradient's scale. This is the property that makes Adam relatively insensitive to gradient-magnitude differences across parameters and across layers, which is why it works without per-layer learning-rate tuning on networks where SGD requires careful scheduling.
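
A small illustration of that scale invariance for a scalar parameter with a synthetic gradient stream (assumed setup, not from the paper): rescaling every gradient by a constant rescales $\hat m_t$ and $\sqrt{\hat v_t}$ by the same factor, so the step size is essentially unchanged whenever $\epsilon$ is negligible.

```python
import numpy as np

def adam_step_sizes(grads, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the sequence of Adam step sizes for a 1-D stream of gradients."""
    m = v = 0.0
    steps = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
        steps.append(alpha * m_hat / (np.sqrt(v_hat) + eps))
    return np.array(steps)

rng = np.random.default_rng(0)
g = rng.normal(size=200)
# Multiplying every gradient by 1000 leaves the step sizes essentially unchanged.
print(adam_step_sizes(g)[-1], adam_step_sizes(1000 * g)[-1])
```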

Convergence claim

For online convex optimization with bounded gradients $\|g_t\|_\infty \le G_\infty$ and a bounded feasible region, the paper proves a regret bound of order:

$$R(T) = \sum_{t=1}^T f_t(\theta_t) - \min_{\theta^*} \sum_{t=1}^T f_t(\theta^*) = O(\sqrt{T})$$

matching the optimal rate for online convex optimization. The proof has a known gap: Reddi, Kale, and Kumar (2018) construct a one-dimensional convex example on which Adam fails to converge, showing the original argument is incomplete. They proposed AMSGrad, which divides by the running maximum $\max(\hat v_{t-1}, v_t)$ to ensure a non-increasing effective step. In practice the convergence pathology is rarely observed in deep learning, but the theory side of the paper is technically incorrect as stated.
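
A sketch of the AMSGrad modification, following the same conventions as the step above (the original proposal omits bias correction on the max-tracked second moment; some library implementations keep it):

```python
import numpy as np

def amsgrad_step(theta, g, m, v, v_max, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam step with the AMSGrad fix: divide by the running maximum of the
    second-moment estimate so the effective step size can never grow."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    v_max = np.maximum(v_max, v)             # the only change relative to Adam
    m_hat = m / (1 - beta1 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_max) + eps)
    return theta, m, v, v_max
```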

AdaMax and Nadam variants

The paper's AdaMax variant (Section 7.1) swaps the $L^2$ second moment for an exponentially weighted $L^\infty$ moment, $u_t = \max(\beta_2 u_{t-1}, |g_t|)$, which is useful when gradient norms are heavy-tailed. Nadam (Dozat, 2016) later folds Nesterov-style lookahead into the first-moment update.
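
A sketch of the AdaMax step under the same conventions (the small eps is added here for numerical safety; the paper's default step size for AdaMax is $\alpha = 0.002$):

```python
import numpy as np

def adamax_step(theta, g, m, u, t, alpha=2e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """AdaMax step: the L2 second moment is replaced by an exponentially
    weighted infinity norm u, which needs no bias correction."""
    m = beta1 * m + (1 - beta1) * g
    u = np.maximum(beta2 * u, np.abs(g))       # u_t = max(beta2 * u_{t-1}, |g_t|)
    m_hat = m / (1 - beta1 ** t)
    theta = theta - alpha * m_hat / (u + eps)  # eps only guards against u == 0
    return theta, m, u
```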


Why It Matters Now

Three reasons Adam stayed dominant for a decade:

The per-coordinate scaling by $\sqrt{\hat v_t}$ adapts to gradient-magnitude differences across layers, parameters, and steps, which is exactly the regime where vanilla SGD requires manual per-layer learning rates. For transformers the layer-wise gradient scale ratios span several orders of magnitude. Adam absorbs that without tuning.

The bias correction makes Adam usable from a cold start, so it composes cleanly with learning-rate warm-up schedules (the standard transformer recipe). RMSProp on its own would underestimate $v_t$ for the first hundred steps and produce overlarge updates.

The hyperparameter defaults ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$) work across vision, language, and RL with no retuning. Few optimizers have defaults that survive that broad a domain shift.

The two known issues are addressed in production by minor variants. AdamW (Loshchilov & Hutter, 2017) decouples weight decay from the gradient-scaled update, fixing the long-known interaction between $L^2$ regularization and Adam's adaptive scaling. AMSGrad (Reddi et al., 2018) fixes the convergence-proof gap but is rarely needed empirically. Most 2026 large-model training uses AdamW with $\beta_2 \in [0.95, 0.999]$ and a cosine learning-rate schedule, as sketched below.
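
A sketch of where the decoupling changes the update, following the same conventions as the Adam step above (the weight_decay value is illustrative):

```python
import numpy as np

def adamw_step(theta, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """AdamW step: weight decay is applied directly to the parameters rather
    than added to the gradient, so it is not rescaled by 1/sqrt(v_hat)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
    theta = theta - alpha * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v
```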

References

Canonical:

  • Kingma, D. P., & Ba, J. (2015). "Adam: A Method for Stochastic Optimization." ICLR. arXiv:1412.6980.

Direct precursors:

  • Duchi, J., Hazan, E., & Singer, Y. (2011). "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization." JMLR 12. AdaGrad: per-coordinate scaling by accumulated squared gradients.
  • Tieleman, T., & Hinton, G. (2012). "Lecture 6.5 — RMSProp." Coursera Neural Networks for Machine Learning. The exponentially decayed second moment that Adam inherits.
  • Polyak, B. T. (1964). "Some methods of speeding up the convergence of iteration methods." USSR Computational Mathematics. Heavy-ball momentum.
  • Nesterov, Y. (1983). "A method for unconstrained convex minimization problem with the rate of convergence O(1/k²)." Soviet Mathematics. Accelerated gradient.

Critique and refinement:

  • Reddi, S. J., Kale, S., & Kumar, S. (2018). "On the Convergence of Adam and Beyond." ICLR. arXiv:1904.09237. Proves the original convergence argument is incomplete and proposes AMSGrad.
  • Loshchilov, I., & Hutter, F. (2019). "Decoupled Weight Decay Regularization." ICLR. arXiv:1711.05101. AdamW: the variant production code actually runs.
  • Wilson, A. C. et al. (2017). "The Marginal Value of Adaptive Gradient Methods in Machine Learning." NeurIPS. arXiv:1705.08292. Argues SGD beats Adam on fully-supervised vision under careful tuning.

Follow-on work:

  • Dozat, T. (2016). "Incorporating Nesterov Momentum into Adam." ICLR Workshop. Nadam.
  • Liu, L. et al. (2020). "On the Variance of the Adaptive Learning Rate and Beyond." ICLR. arXiv:1908.03265. RAdam: variance-rectified Adam at warm-up.
  • Bernstein, J. et al. (2024). "Old Optimizer, New Norm: An Anthology." arXiv:2409.20325. Reframes Adam as steepest descent under a particular norm; motivates Muon.

Standard textbook:

  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 8.5.
  • Bottou, L., Curtis, F. E., & Nocedal, J. (2018). "Optimization Methods for Large-Scale Machine Learning." SIAM Review 60(2). Section 6.


Last reviewed: May 5, 2026