
Paper breakdown

Adam: A Method for Stochastic Optimization

Diederik P. Kingma and Jimmy Ba · 2014 · ICLR 2015

Combines exponentially decayed estimates of the first and second gradient moments to produce a per-coordinate adaptive step size. The default optimizer for almost every modern deep network from 2015 onward.

Overview

Kingma and Ba (2014) proposed an optimizer that maintains an exponentially decayed estimate of the gradient (first moment) and an exponentially decayed estimate of the squared gradient (second moment), then applies a per-coordinate step proportional to the ratio. The paper combines two prior ideas — momentum and per-coordinate scaling à la RMSProp — and adds a bias-correction step that fixes the cold-start underestimate of those moving averages.

The recipe is small. The impact is not. By 2016 Adam was the default optimizer for sequence models, in 2018 the default for transformers, and in 2026 it remains the default fine-tune and pretraining optimizer for almost every public LLM and diffusion model. AdamW, the decoupled-weight-decay variant of Loshchilov and Hutter (2017), is what production training scripts actually run, but the moment-tracking core is unchanged.

Mathematical Contributions

The update rule

Let $g_t = \nabla_\theta f_t(\theta_{t-1})$ be the stochastic gradient at step $t$. Adam maintains two running averages:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t \odot g_t$$

with $m_0 = v_0 = 0$. Because both averages are initialized at zero, they underestimate the true mean and second moment for small $t$. The paper's bias correction divides by the missing mass:

$$\hat m_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat v_t = \frac{v_t}{1 - \beta_2^t}$$

The per-step update is then:

$$\theta_t = \theta_{t-1} - \alpha\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}$$

with default hyperparameters $\alpha = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$. The square root and division are coordinate-wise.
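
A minimal NumPy sketch of one update, directly transcribing the formulas above (the function name and the toy quadratic are illustrative, not from the paper):

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter array theta, given its gradient g.

    m and v are the running first/second moment estimates carried between
    calls; t is the 1-based step count. Returns the updated (theta, m, v).
    """
    m = beta1 * m + (1 - beta1) * g          # first moment: decayed mean of gradients
    v = beta2 * v + (1 - beta2) * g * g      # second moment: decayed mean of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage: carry (m, v, t) across iterations of the training loop.
theta = np.array([1.0, -2.0, 0.5])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 101):
    g = 2 * theta                            # gradient of a toy quadratic sum(theta**2)
    theta, m, v = adam_step(theta, g, m, v, t)
```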

Why bias correction matters

Expanding the recursion gives $m_t = (1-\beta_1)\sum_{k=1}^t \beta_1^{t-k} g_k$. If the gradient distribution is stationary with mean $\mu$, then $\mathbb{E}[m_t] = \mu\,(1 - \beta_1^t)$. The factor $(1-\beta_1^t)$ is the missing mass; dividing by it returns an unbiased estimate of $\mu$. Without correction the first hundred steps move at less than the intended rate, which matters for short fine-tuning runs and learning-rate warm-ups.
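
A quick numeric check of that identity for a constant unit gradient (illustrative only):

```python
# With a constant gradient of 1, the uncorrected average is m_t = 1 - beta1**t,
# so early steps see only a fraction of the true mean; dividing by (1 - beta1**t)
# restores it exactly.
beta1, m = 0.9, 0.0
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * 1.0
    print(t, round(m, 4), round(m / (1 - beta1 ** t), 4))
# t=1: 0.1 vs 1.0, t=2: 0.19 vs 1.0, ..., t=5: 0.4095 vs 1.0
```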

Effective per-coordinate step size

The paper argues that the per-coordinate ratio $\hat m_t / \sqrt{\hat v_t}$ is approximately bounded:

$$\left|\frac{\hat m_t}{\sqrt{\hat v_t}}\right| \le \frac{1 - \beta_1}{\sqrt{1 - \beta_2}} \quad \text{(approximately)}$$

so the actual displacement per step is on the order of $\alpha$ regardless of the gradient's scale. This is the property that makes Adam relatively insensitive to gradient-magnitude differences across parameters and across layers, which is why it works without per-layer learning-rate tuning on networks where SGD requires careful scheduling.
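
A small illustration of that scale invariance for a scalar parameter with a synthetic gradient stream (assumed setup, not from the paper): rescaling every gradient by a constant rescales $\hat m_t$ and $\sqrt{\hat v_t}$ by the same factor, so the step size is essentially unchanged whenever $\epsilon$ is negligible.

```python
import numpy as np

def adam_step_sizes(grads, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the sequence of Adam step sizes for a 1-D stream of gradients."""
    m = v = 0.0
    steps = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
        steps.append(alpha * m_hat / (np.sqrt(v_hat) + eps))
    return np.array(steps)

rng = np.random.default_rng(0)
g = rng.normal(size=200)
# Multiplying every gradient by 1000 leaves the step sizes essentially unchanged.
print(adam_step_sizes(g)[-1], adam_step_sizes(1000 * g)[-1])
```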

Convergence claim

For online convex optimization with bounded gradients $\|g_t\|_\infty \le G_\infty$ and a bounded feasible region, the paper proves a regret bound of order:

$$R(T) = \sum_{t=1}^T f_t(\theta_t) - \min_{\theta^*} \sum_{t=1}^T f_t(\theta^*) = O(\sqrt{T})$$

matching the optimal rate for online convex optimization. The proof has a known gap: Reddi, Kale, and Kumar (2018) construct a one-dimensional convex example on which Adam fails to converge, showing the original argument is incomplete. They proposed AMSGrad, which divides by the running maximum $\max(\hat v_{t-1}, v_t)$ to ensure a non-increasing effective step. In practice the convergence pathology is rarely observed in deep learning, but the theory side of the paper is technically incorrect as stated.
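
A sketch of the AMSGrad modification, following the same conventions as the step above (the original proposal omits bias correction on the max-tracked second moment; some library implementations keep it):

```python
import numpy as np

def amsgrad_step(theta, g, m, v, v_max, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam step with the AMSGrad fix: divide by the running maximum of the
    second-moment estimate so the effective step size can never grow."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    v_max = np.maximum(v_max, v)             # the only change relative to Adam
    m_hat = m / (1 - beta1 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_max) + eps)
    return theta, m, v, v_max
```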

AdaMax and Nadam variants

The paper's AdaMax variant (Section 7.1) swaps the $L^2$ second moment for an exponentially weighted $L^\infty$ moment, $u_t = \max(\beta_2 u_{t-1}, |g_t|)$, which is useful when gradient norms are heavy-tailed. Nadam (Dozat, 2016) later folds Nesterov-style lookahead into the first-moment update.
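
A sketch of the AdaMax step under the same conventions (the small eps is added here for numerical safety; the paper's default step size for AdaMax is $\alpha = 0.002$):

```python
import numpy as np

def adamax_step(theta, g, m, u, t, alpha=2e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """AdaMax step: the L2 second moment is replaced by an exponentially
    weighted infinity norm u, which needs no bias correction."""
    m = beta1 * m + (1 - beta1) * g
    u = np.maximum(beta2 * u, np.abs(g))       # u_t = max(beta2 * u_{t-1}, |g_t|)
    m_hat = m / (1 - beta1 ** t)
    theta = theta - alpha * m_hat / (u + eps)  # eps only guards against u == 0
    return theta, m, u
```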


Why It Matters Now

Three reasons Adam stayed dominant for a decade:

The per-coordinate scaling by $\sqrt{\hat v_t}$ adapts to gradient-magnitude differences across layers, parameters, and steps, which is exactly the regime where vanilla SGD requires manual per-layer learning rates. For transformers the layer-wise gradient scale ratios span several orders of magnitude. Adam absorbs that without tuning.

The bias correction makes Adam usable from a cold start, so it composes cleanly with learning-rate warm-up schedules (the standard transformer recipe). RMSProp on its own would underestimate $v_t$ for the first hundred steps and produce overlarge updates.

The hyperparameter defaults ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$) work across vision, language, and RL with no retuning. Few optimizers have defaults that survive that broad a domain shift.

The two known issues are addressed in production by minor variants. AdamW (Loshchilov & Hutter, 2017) decouples weight decay from the gradient-scaled update, fixing the long-known interaction between $L^2$ regularization and Adam's adaptive scaling. AMSGrad (Reddi et al., 2018) fixes the convergence-proof gap but is rarely needed empirically. Most 2026 large-model training uses AdamW with $\beta_2 \in [0.95, 0.999]$ and a cosine learning-rate schedule, as sketched below.
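
A sketch of where the decoupling changes the update, following the same conventions as the Adam step above (the weight_decay value is illustrative):

```python
import numpy as np

def adamw_step(theta, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """AdamW step: weight decay is applied directly to the parameters rather
    than added to the gradient, so it is not rescaled by 1/sqrt(v_hat)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
    theta = theta - alpha * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v
```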

References

Canonical:

  • Kingma, D. P., & Ba, J. (2015). "Adam: A Method for Stochastic Optimization." ICLR. arXiv:1412.6980.

Direct precursors:

  • Duchi, J., Hazan, E., & Singer, Y. (2011). "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization." JMLR 12. AdaGrad: per-coordinate scaling by accumulated squared gradients.
  • Tieleman, T., & Hinton, G. (2012). "Lecture 6.5 — RMSProp." Coursera Neural Networks for Machine Learning. The exponentially decayed second moment that Adam inherits.
  • Polyak, B. T. (1964). "Some methods of speeding up the convergence of iteration methods." USSR Computational Mathematics. Heavy-ball momentum.
  • Nesterov, Y. (1983). "A method for unconstrained convex minimization problem with the rate of convergence O(1/k²)." Soviet Mathematics. Accelerated gradient.

Critique and refinement:

  • Reddi, S. J., Kale, S., & Kumar, S. (2018). "On the Convergence of Adam and Beyond." ICLR. arXiv:1904.09237. Proves the original convergence argument is incomplete and proposes AMSGrad.
  • Loshchilov, I., & Hutter, F. (2019). "Decoupled Weight Decay Regularization." ICLR. arXiv:1711.05101. AdamW: the variant production code actually runs.
  • Wilson, A. C. et al. (2017). "The Marginal Value of Adaptive Gradient Methods in Machine Learning." NeurIPS. arXiv:1705.08292. Argues SGD beats Adam on fully-supervised vision under careful tuning.

Follow-on work:

  • Dozat, T. (2016). "Incorporating Nesterov Momentum into Adam." ICLR Workshop. Nadam.
  • Liu, L. et al. (2020). "On the Variance of the Adaptive Learning Rate and Beyond." ICLR. arXiv:1908.03265. RAdam: variance-rectified Adam at warm-up.
  • Bernstein, J. et al. (2024). "Old Optimizer, New Norm: An Anthology." arXiv:2409.20325. Reframes Adam as steepest descent under a particular norm; motivates Muon.

Standard textbook:

  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 8.5.
  • Bottou, L., Curtis, F. E., & Nocedal, J. (2018). "Optimization Methods for Large-Scale Machine Learning." SIAM Review 60(2). Section 6.


Last reviewed: May 5, 2026