Paper breakdown
Adam: A Method for Stochastic Optimization
Diederik P. Kingma and Jimmy Ba · 2014 · ICLR 2015
Combines exponentially decayed estimates of the first and second gradient moments to produce a per-coordinate adaptive step size. The default optimizer for almost every modern deep network from 2015 onward.
Overview
Kingma and Ba (2014) proposed an optimizer that maintains an exponentially decayed estimate of the gradient (first moment) and an exponentially decayed estimate of the squared gradient (second moment), then applies a per-coordinate step proportional to the ratio of the first to the square root of the second. The paper combines two prior ideas — momentum and per-coordinate scaling à la RMSProp — and adds a bias-correction step that fixes the cold-start underestimate of those moving averages.
The recipe is small. The impact is not. By 2016 Adam was the default optimizer for sequence models, in 2018 the default for transformers, and in 2026 it remains the default fine-tune and pretraining optimizer for almost every public LLM and diffusion model. AdamW, the decoupled-weight-decay variant of Loshchilov and Hutter (2017), is what production training scripts actually run, but the moment-tracking core is unchanged.
Mathematical Contributions
The update rule
Let $g_t$ denote the stochastic gradient at step $t$. Adam maintains two running averages:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2,$$

with $m_0 = v_0 = 0$. Because both averages are initialized at zero, they underestimate the true mean and second moment for small $t$. The paper's bias correction divides by the missing mass:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}.$$

The per-step update is then:

$$\theta_t = \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},$$

with default hyperparameters $\alpha = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$. The square root and division are coordinate-wise.
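The whole update fits in a few lines. A minimal NumPy sketch of one step, directly transcribing the equations above (function and variable names are ours, not the paper's):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is 1-indexed. A sketch of Algorithm 1, not a production optimizer."""
    m = beta1 * m + (1 - beta1) * grad            # first-moment EMA
    v = beta2 * v + (1 - beta2) * grad ** 2       # second-moment EMA
    m_hat = m / (1 - beta1 ** t)                  # bias corrections
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # coordinate-wise step
    return theta, m, v
```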
Why bias correction matters
Expanding the recursion gives $m_t = (1 - \beta_1) \sum_{i=1}^{t} \beta_1^{\,t-i} g_i$. If the gradient distribution is stationary with mean $\mathbb{E}[g]$, then $\mathbb{E}[m_t] = (1 - \beta_1^t)\,\mathbb{E}[g]$. The factor $(1 - \beta_1^t)$ is the missing mass; dividing by it returns an unbiased estimate of $\mathbb{E}[g]$. Without correction the early steps are miscalibrated: because $\beta_2 > \beta_1$, the second moment is underestimated far more heavily than the first, so the uncorrected ratio $m_t/\sqrt{v_t}$ overshoots for the first hundred steps. That matters for short fine-tuning runs and learning-rate warm-ups.
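To see the size of the effect at the default decay rates, evaluate both missing-mass factors for small $t$ (a quick numeric check; the step ratio is the exact factor by which the uncorrected $m_t/\sqrt{v_t}$ exceeds the corrected one):

```python
beta1, beta2 = 0.9, 0.999
for t in [1, 10, 100, 1000]:
    cm, cv = 1 - beta1 ** t, 1 - beta2 ** t
    # identity: uncorrected m/sqrt(v) = (cm / sqrt(cv)) * corrected m_hat/sqrt(v_hat)
    print(f"t={t:4d}  1-b1^t={cm:.5f}  1-b2^t={cv:.5f}  step ratio={cm / cv**0.5:.2f}")
# t=1 -> 3.16x, t=10 -> 6.53x, t=100 -> 3.24x, t=1000 -> 1.26x too large
```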
Effective per-coordinate step size
The paper argues that the per-coordinate ratio is approximately bounded:

$$\left| \frac{\hat{m}_t}{\sqrt{\hat{v}_t}} \right| \lesssim 1,$$

so the actual displacement per step is on the order of $\alpha$ regardless of the gradient's scale. This is the property that makes Adam relatively insensitive to gradient magnitude differences across parameters and across layers, which is why it works without per-layer learning-rate tuning on networks where SGD requires careful scheduling.
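The invariance is easy to verify numerically: multiply an entire gradient stream by $10^6$ and the ratio, hence the step, is unchanged (a sketch on assumed synthetic gradients):

```python
import numpy as np

def final_ratio(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """Feed a gradient sequence through Adam's moments; return |m_hat / sqrt(v_hat)|."""
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
    m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
    return abs(m_hat) / (np.sqrt(v_hat) + eps)

g = np.random.default_rng(0).normal(1.0, 0.5, size=200)
print(final_ratio(g), final_ratio(1e6 * g))  # identical up to eps: the step is scale-free
```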
Convergence claim
For online convex optimization with bounded gradients and a bounded feasible region, the paper proves a regret bound of order:

$$R(T) = \sum_{t=1}^{T} \big[ f_t(\theta_t) - f_t(\theta^\ast) \big] = O(\sqrt{T}),$$

matching the optimal rate for online convex optimization. The proof has a known gap: Reddi, Kale, and Kumar (2018) construct a one-dimensional convex example where Adam fails to converge, showing the original argument is incomplete. They proposed AMSGrad, which uses $\hat{v}_t = \max(\hat{v}_{t-1}, v_t)$ to ensure a non-increasing effective step. In practice the convergence pathology is rarely observed in deep learning, but the theory side of the paper is technically incorrect as stated.
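The AMSGrad change is one line relative to the sketch above (here the max is taken over the raw second moment, as in Reddi et al., who also drop the $\beta_2$ bias correction; library implementations vary on that detail):

```python
import numpy as np

def amsgrad_step(theta, grad, m, v, v_max, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam with the AMSGrad fix: the denominator can never shrink between steps."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    v_max = np.maximum(v_max, v)               # running max of the second moment
    m_hat = m / (1 - beta1 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_max) + eps)
    return theta, m, v, v_max
```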
AdaMax and Nadam variants
Section 7.1 swaps the $L_2$ moment for an $L_\infty$ moment, $u_t = \max(\beta_2 u_{t-1}, |g_t|)$, giving the AdaMax variant — useful when gradient norms are heavy-tailed. Nesterov-style lookahead came later with Nadam (Dozat, 2016), which folds the lookahead into the first-moment correction.
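AdaMax in the same sketch style (the paper's default step size for this variant is $\alpha = 0.002$; the exponentially weighted infinity norm needs no bias correction, and the eps guard is our addition, not the paper's):

```python
import numpy as np

def adamax_step(theta, grad, m, u, t, alpha=2e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """AdaMax (paper Sec. 7.1): L-infinity second moment; eps guards a zero denominator."""
    m = beta1 * m + (1 - beta1) * grad
    u = np.maximum(beta2 * u, np.abs(grad))    # exponentially weighted infinity norm
    theta = theta - (alpha / (1 - beta1 ** t)) * m / (u + eps)
    return theta, m, u
```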
Connections to TheoremPath Topics
- Adam optimizer — the standard treatment with hyperparameter sensitivity and AdamW.
- Preconditioned optimizers — Adam as a diagonal preconditioner $\mathrm{diag}(\sqrt{\hat{v}_t} + \epsilon)^{-1}$.
- Gradient descent variants — momentum (Polyak, Nesterov) and how Adam's first-moment average $m_t$ generalizes them.
- Stochastic gradient descent convergence — convergence rates for SGD-class methods on smooth nonconvex losses.
- Optimizer theory: SGD, Adam, Muon — comparative analysis including the modern Shampoo/Muon line.
- Online convex optimization — the regret framework used in the paper's analysis.
Why It Matters Now
Three reasons Adam stayed dominant for a decade:
The per-coordinate scaling by $1/(\sqrt{\hat{v}_t} + \epsilon)$ adapts to gradient magnitude differences across layers, parameters, and steps, which is exactly the regime where vanilla SGD requires manual per-layer learning rates. For transformers the layer-wise gradient scale ratios span several orders of magnitude. Adam absorbs that without tuning.
The bias correction makes Adam usable from cold start, so it composes cleanly with learning-rate warm-up schedules (the standard transformer recipe). RMSProp on its own would underestimate $v_t$ for the first hundred steps and produce overlarge updates.
The hyperparameter defaults ($\alpha = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$) work across vision, language, and RL with no retuning. Few optimizers have defaults that survive that broad a domain shift.
The two known issues are addressed in production by minor variants. AdamW (Loshchilov & Hutter, 2017) decouples weight decay from the gradient-scaled update, fixing the long-known $L_2$-regularization-meets-Adam interaction. AMSGrad (Reddi et al., 2018) fixes the convergence-proof gap, but is rarely needed empirically. Most 2026 large-model training uses AdamW with a cosine learning-rate schedule.
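The decoupling itself is small. In Adam-with-$L_2$, the decay term $\lambda\theta$ enters the gradient and gets divided by $\sqrt{\hat{v}_t}$, so high-variance coordinates are barely decayed; AdamW applies it directly to the weights instead. A minimal sketch of the difference (the weight_decay value is illustrative; real implementations also fold in the schedule multiplier):

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """AdamW: decay acts on theta directly, outside the sqrt(v_hat) scaling.
    (Adam + L2 would instead fold grad = grad + weight_decay * theta up front.)"""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v
```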
References
Canonical:
- Kingma, D. P., & Ba, J. (2015). "Adam: A Method for Stochastic Optimization." ICLR. arXiv:1412.6980.
Direct precursors:
- Duchi, J., Hazan, E., & Singer, Y. (2011). "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization." JMLR 12. AdaGrad: per-coordinate scaling by accumulated squared gradients.
- Tieleman, T., & Hinton, G. (2012). "Lecture 6.5 — RMSProp." Coursera Neural Networks for Machine Learning. The exponentially decayed second moment that Adam inherits.
- Polyak, B. T. (1964). "Some methods of speeding up the convergence of iteration methods." USSR Computational Mathematics. Heavy-ball momentum.
- Nesterov, Y. (1983). "A method for unconstrained convex minimization problem with the rate of convergence O(1/k²)." Soviet Mathematics. Accelerated gradient.
Critique and refinement:
- Reddi, S. J., Kale, S., & Kumar, S. (2018). "On the Convergence of Adam and Beyond." ICLR. arXiv:1904.09237. Proves the original convergence argument is incomplete and proposes AMSGrad.
- Loshchilov, I., & Hutter, F. (2019). "Decoupled Weight Decay Regularization." ICLR. arXiv:1711.05101. AdamW: the variant production code actually runs.
- Wilson, A. C. et al. (2017). "The Marginal Value of Adaptive Gradient Methods in Machine Learning." NeurIPS. arXiv:1705.08292. Argues SGD beats Adam on fully-supervised vision under careful tuning.
Follow-on work:
- Dozat, T. (2016). "Incorporating Nesterov Momentum into Adam." ICLR Workshop. Nadam.
- Liu, L. et al. (2020). "On the Variance of the Adaptive Learning Rate and Beyond." ICLR. arXiv:1908.03265. RAdam: variance-rectified Adam at warm-up.
- Bernstein, J. et al. (2024). "Old Optimizer, New Norm: An Anthology." arXiv:2409.20325. Reframes Adam as steepest descent under a particular norm; motivates Muon.
Standard textbook:
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 8.5.
- Bottou, L., Curtis, F. E., & Nocedal, J. (2018). "Optimization Methods for Large-Scale Machine Learning." SIAM Review 60(2). Section 6.
Last reviewed: May 5, 2026