The Core Problem
Adam and AdamW share the same adaptive gradient machinery, but they handle regularization differently. The difference is small in notation and large in practice.
In standard SGD, L2 regularization and weight decay are mathematically equivalent. In Adam, they are not. This is because Adam divides the gradient by $\sqrt{\hat{v}_t} + \epsilon$ before applying the update. Any penalty added to the gradient gets divided by this same factor, which distorts the regularization strength per parameter.
Loshchilov and Hutter (2019) identified this problem and proposed AdamW: apply weight decay directly to the parameters, outside the adaptive scaling.
Side-by-Side Updates
Adam with L2 Regularization
The L2 penalty is added to the gradient before moment estimation:

$$g_t = \nabla f(\theta_t) + \lambda \theta_t$$

This regularized gradient flows through the full Adam update. The second moment estimate is computed from $g_t$, so the weight decay term is scaled by $\frac{1}{\sqrt{\hat{v}_t} + \epsilon}$.

The effective regularization for parameter $\theta_i$ becomes:

$$\frac{\alpha \lambda \theta_i}{\sqrt{\hat{v}_{t,i}} + \epsilon}$$
Parameters with large gradient variance (large $\hat{v}_{t,i}$) receive weaker regularization. Parameters with small gradient variance receive stronger regularization. This is the opposite of what you want: heavily updated parameters should be regularized more, not less.
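The inverse scaling is easy to see numerically. A minimal NumPy sketch (the values of $\alpha$, $\lambda$, and the second-moment estimates are illustrative, not from any real training run):

```python
import numpy as np

alpha, lam, eps = 1e-3, 0.01, 1e-8

# Illustrative second-moment estimates for three parameters:
# large, moderate, and small gradient variance.
v_hat = np.array([1.0, 1e-2, 1e-4])

# Effective per-parameter regularization under Adam + L2:
# the lambda * theta penalty is divided by sqrt(v_hat) + eps.
effective_decay = alpha * lam / (np.sqrt(v_hat) + eps)

# The parameter with the LARGEST gradient variance receives the
# WEAKEST regularization: effective_decay grows as v_hat shrinks.
print(effective_decay)
```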
AdamW (Decoupled Weight Decay)
Weight decay is applied directly to the parameters, outside the adaptive scaling:

$$\theta_{t+1} = \theta_t - \alpha \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t \right)$$

The $\alpha \lambda \theta_t$ term shrinks every parameter by the same relative amount, regardless of gradient magnitude. The adaptive step handles optimization. Weight decay handles regularization. The two do not interact.
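A single AdamW step can be sketched in a few lines. This is a simplified illustration in NumPy (the function name and single-step framing are ours; learning-rate schedules and other practical details are omitted):

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9,
               beta2=0.999, eps=1e-8, lam=0.01):
    """One AdamW update. Weight decay acts on theta directly,
    outside the adaptive scaling, so it never touches m or v."""
    m = beta1 * m + (1 - beta1) * grad       # first moment (gradient only)
    v = beta2 * v + (1 - beta2) * grad**2    # second moment (gradient only)
    m_hat = m / (1 - beta1**t)               # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * (m_hat / (np.sqrt(v_hat) + eps) + lam * theta)
    return theta, m, v
```

Under Adam + L2, the only change would be adding `lam * theta` to `grad` before the moment updates and dropping the `lam * theta` term from the parameter update — which is exactly what pushes the decay through the adaptive scaling.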
Why the Difference Matters
L2 regularization and weight decay are NOT the same in Adam
In SGD, adding $\lambda \theta$ to the gradient produces the update $\theta_{t+1} = \theta_t - \alpha(\nabla f(\theta_t) + \lambda \theta_t) = (1 - \alpha\lambda)\theta_t - \alpha \nabla f(\theta_t)$. The gradient term and the decay term are both scaled by $\alpha$. L2 regularization and weight decay are identical.
In Adam, the gradient is divided by $\sqrt{\hat{v}_t} + \epsilon$. If $\lambda \theta$ is part of the gradient, it gets divided too. The decay strength becomes parameter-dependent and inversely proportional to gradient magnitude. L2 regularization and weight decay produce different updates. Loshchilov and Hutter showed this mismatch degrades both optimization and generalization.
Consider two parameters: one with large, consistent gradients (large $\hat{v}$) and one with rare, small gradients (small $\hat{v}$). Under Adam + L2:
- The frequently updated parameter gets weak regularization (its decay term is divided by a large $\sqrt{\hat{v}}$).
- The rarely updated parameter gets strong regularization (its decay term is divided by a small $\sqrt{\hat{v}}$).
This is backwards. Frequently updated parameters are more likely to overfit and should be regularized more aggressively. AdamW fixes this by applying uniform decay.
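The reversal can be demonstrated with a toy comparison. A NumPy sketch with made-up moment estimates for a frequently updated parameter and a rarely updated one:

```python
import numpy as np

alpha, lam, eps = 1e-3, 0.01, 1e-8
theta = np.array([1.0, 1.0])    # two parameters, same value
v_hat = np.array([1.0, 1e-6])   # frequent (large v_hat) vs rare (small v_hat)

# Adam + L2: the decay term lam * theta passes through the adaptive scaling.
decay_adam_l2 = alpha * lam * theta / (np.sqrt(v_hat) + eps)

# AdamW: decay applied directly to the parameters, uniform for both.
decay_adamw = alpha * lam * theta

print(decay_adam_l2)  # the rarely updated parameter is decayed far harder
print(decay_adamw)    # identical decay for both parameters
```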
Comparison Table
| Property | Adam + L2 | AdamW |
|---|---|---|
| Regularization mechanism | $\lambda \theta$ added to gradient | $\lambda \theta$ applied directly to parameters |
| Interaction with adaptive scaling | Yes: decay is divided by $\sqrt{\hat{v}_t} + \epsilon$ | None: decay is independent |
| Effective decay per parameter | Varies inversely with gradient magnitude | Uniform across all parameters |
| Equivalence to SGD weight decay | No | Yes (same principle, adaptive step added) |
| Training transformers | Suboptimal | Standard choice |
| Hyperparameter coupling | $\alpha$ and $\lambda$ interact through moments | $\alpha$ and $\lambda$ are separable |
| Learning rate warmup interaction | Warmup affects regularization strength | Warmup affects optimization only |
Hyperparameter Separability
A subtle but important consequence: in Adam + L2, the optimal regularization strength depends on the learning rate because both pass through the adaptive scaling. If you change $\alpha$, you must retune $\lambda$.
In AdamW, $\alpha$ controls the optimization step and $\lambda$ controls regularization independently. You can tune one without retuning the other. This makes hyperparameter search cheaper and more modular.
When Each Is Used
Training GPT-style language models
Major LLM training runs (GPT-3, LLaMA, Chinchilla) use AdamW with $\beta_1 = 0.9$, $\beta_2 = 0.95$, and weight decay $\lambda = 0.1$. Using Adam + L2 here would apply inconsistent regularization across embedding, attention, and MLP parameters, degrading both training stability and downstream performance.
Fine-tuning BERT for classification
Standard practice is AdamW with a small learning rate (e.g., $\alpha = 2 \times 10^{-5}$) and $\lambda = 0.01$. The decoupled weight decay prevents the fine-tuning from distorting the pretrained representations unevenly across parameter groups.
Training a small CNN with standard regularization
For small models where you are already using dropout and data augmentation, the difference between Adam + L2 and AdamW may be minor. Both work. But AdamW is strictly better in principle and costs nothing extra, so there is no reason to prefer Adam + L2.
Common Confusions
AdamW is not a different optimizer from Adam
AdamW computes the same adaptive moment estimates as Adam. The only change is where weight decay is applied. The optimization dynamics (step direction, effective learning rate per parameter) are identical for the gradient-based component. The regularization component is what differs.
Most frameworks default to AdamW, not Adam + L2
PyTorch's torch.optim.AdamW implements decoupled weight decay. If you use torch.optim.Adam with weight_decay > 0, you get L2 regularization inside the gradient, which is the wrong behavior. Check which class your code uses. This is a common source of subtle bugs.
The weight decay coefficient is not the same as the L2 coefficient
When converting between Adam + L2 and AdamW, the $\lambda$ values are not interchangeable. The effective regularization differs by a factor that depends on $\sqrt{\hat{v}_t}$. Do not copy $\lambda$ values between the two without adjustment.
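A numeric sketch of why coefficients do not transfer (all values illustrative): for the same $\lambda$, the shrinkage Adam + L2 actually applies depends on $\sqrt{\hat{v}_t}$, so matching AdamW's uniform decay would require a different $\lambda$ per parameter and per step.

```python
import numpy as np

alpha, lam, eps = 1e-3, 0.1, 1e-8
theta, v_hat = 1.0, 0.25   # illustrative parameter value and moment estimate

# Shrinkage applied in one step by each scheme, for the SAME lambda:
shrink_adamw = alpha * lam * theta                             # uniform
shrink_adam_l2 = alpha * lam * theta / (np.sqrt(v_hat) + eps)  # v_hat-dependent

# To reproduce AdamW's decay with Adam + L2, lambda would need rescaling
# by sqrt(v_hat) + eps -- a factor that differs per parameter and per step.
lam_equivalent = lam * (np.sqrt(v_hat) + eps)
```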
References
- Loshchilov, I. and Hutter, F. "Decoupled Weight Decay Regularization." ICLR 2019. The original AdamW paper. Section 2 derives the L2/weight decay mismatch.
- Kingma, D.P. and Ba, J. "Adam: A Method for Stochastic Optimization." ICLR 2015. The original Adam paper. Section 2 defines the algorithm.
- Zhang, M., Lucas, J., Ba, J., and Hinton, G. "Lookahead Optimizer: k steps forward, 1 step back." NeurIPS 2019. Section 4 discusses weight decay interactions.
- Touvron, H. et al. "LLaMA: Open and Efficient Foundation Language Models." 2023. Section 2.2 specifies AdamW hyperparameters for LLM training.
- Zhuang, J. et al. "Understanding AdamW through Proximal Methods and Scale-Freeness." 2022. Proximal interpretation of decoupled weight decay.
- Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016. Chapter 8 covers optimization algorithms and regularization.
- Brown, T. et al. "Language Models are Few-Shot Learners." NeurIPS 2020. GPT-3 training details specify $\beta_2 = 0.95$ and weight decay $0.1$.