What Each Does
Both techniques shrink neural network weights during training. Given loss $L(\theta)$ and parameters $\theta$, they appear similar but operate differently.
L2 regularization adds a penalty to the loss function:

$$L_{\text{reg}}(\theta) = L(\theta) + \frac{\lambda}{2} \|\theta\|_2^2$$

The gradient becomes $\nabla L_{\text{reg}}(\theta) = \nabla L(\theta) + \lambda\theta$. This modified gradient is what the optimizer processes.
Weight decay subtracts a fraction of each weight directly in the update:

$$\theta_{t+1} = \theta_t - \eta\,\nabla L(\theta_t) - \eta\lambda\theta_t$$
The key distinction: L2 modifies the gradient. Weight decay modifies the parameter update. Under vanilla SGD, these are algebraically identical. Under Adam, they are not.
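The SGD identity can be checked in a few lines. A minimal sketch with made-up scalar values (none of these numbers come from the text):

```python
# Under vanilla SGD, L2 regularization and weight decay yield the same update.
eta, lam = 0.1, 0.01   # learning rate and regularization strength (illustrative)
theta = 2.0            # current parameter value
grad = 0.5             # task gradient dL/dtheta at theta

# L2 regularization: the penalty gradient lam*theta is added to the gradient,
# then the optimizer takes an ordinary SGD step.
theta_l2 = theta - eta * (grad + lam * theta)

# Weight decay: the task-gradient step plus a direct shrink of the weight.
theta_wd = theta - eta * grad - eta * lam * theta

print(abs(theta_l2 - theta_wd) < 1e-12)  # identical up to floating point
```

The equivalence holds because vanilla SGD applies the gradient with a single uniform scale $\eta$; any optimizer that rescales gradients per parameter breaks it.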
Why They Diverge Under Adam
Adam scales the gradient by the inverse of the running RMS of past gradients. For parameter $\theta$ at step $t$:

$$\theta_{t+1} = \theta_t - \eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

where $\hat{m}_t$ is the bias-corrected first moment and $\hat{v}_t$ is the bias-corrected second moment.
With L2 regularization, the penalty gradient $\lambda\theta_t$ is added to the gradient before moment estimation. The regularization term gets divided by $\sqrt{\hat{v}_t} + \epsilon$. Parameters with large gradient variance (large $\hat{v}_t$) receive weaker effective regularization. Parameters with small gradient variance receive stronger regularization. The regularization strength becomes dependent on the optimization landscape, which is not what you intended.
With weight decay (as in AdamW), the subtraction happens after the adaptive update, bypassing the moment scaling entirely:

$$\theta_{t+1} = \theta_t - \eta\left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\theta_t\right)$$
Every parameter is regularized proportionally to its magnitude, regardless of gradient history.
The AdamW Fix
Loshchilov and Hutter (2019) showed that the original Adam implementation in most frameworks used L2 regularization, not weight decay. They proposed AdamW, which decouples the weight decay from the adaptive gradient:
| Step | Adam + L2 | AdamW |
|---|---|---|
| Gradient | $g_t = \nabla L(\theta_t) + \lambda\theta_t$ | $g_t = \nabla L(\theta_t)$ |
| First moment | $m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t$ | $m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t$ |
| Second moment | $v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2$ | $v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2$ |
| Update | $\theta_{t+1} = \theta_t - \eta\,\hat{m}_t / (\sqrt{\hat{v}_t}+\epsilon)$ | $\theta_{t+1} = \theta_t - \eta\left(\hat{m}_t / (\sqrt{\hat{v}_t}+\epsilon) + \lambda\theta_t\right)$ |
In Adam + L2, the regularization term contaminates both moment estimates. In AdamW, the moments track only the task gradient, and weight decay is applied as a clean subtraction.
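The two columns can be written as a single step function. This is a sketch, not a framework implementation; the function name, default hyperparameters, and demo values are mine:

```python
import math

def adam_like_step(theta, grad, m, v, t, eta=1e-3, lam=0.01,
                   b1=0.9, b2=0.999, eps=1e-8, decoupled=False):
    """One optimizer step. decoupled=False gives Adam + L2; True gives AdamW."""
    g = grad if decoupled else grad + lam * theta  # L2 enters the gradient
    m = b1 * m + (1 - b1) * g                      # first moment
    v = b2 * v + (1 - b2) * g * g                  # second moment
    m_hat = m / (1 - b1 ** t)                      # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        theta = theta - eta * lam * theta          # decay bypasses the scaling
    return theta, m, v

# With a zero task gradient, Adam + L2 still moves theta by roughly eta
# (the penalty gradient is normalized by its own RMS), while AdamW shrinks
# theta by exactly eta * lam * theta.
t_l2, _, _ = adam_like_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled=False)
t_wd, _, _ = adam_like_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled=True)
print(t_l2, t_wd)
```

The zero-gradient demo makes the contamination vivid: in Adam + L2 the penalty dominates both moments, so the shrinkage step is roughly $\eta$ regardless of $\lambda$'s magnitude.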
Side-by-Side Comparison
| Property | L2 Regularization | Weight Decay |
|---|---|---|
| Mechanism | Adds to gradient | Subtracts from weights |
| Equivalent to WD under SGD | Yes | Yes |
| Equivalent to WD under Adam | No | N/A (is itself) |
| Affected by adaptive scaling | Yes, divided by $\sqrt{\hat{v}_t}$ | No, applied uniformly |
| Effect on moment estimates | Contaminates $m_t$ and $v_t$ | Moments track only task loss |
| Regularization strength | Varies per parameter | Uniform across parameters |
| Hyperparameter coupling | $\lambda$ interacts with learning rate schedule | $\lambda$ independent of LR schedule |
| Modern default | Deprecated in most LLM training | Standard (AdamW, LAMB) |
When It Matters in Practice
The difference is negligible for SGD with momentum: the penalty gradient passes through the momentum buffer rather than being subtracted directly, but because the update remains a uniform linear scaling of the gradient across all parameters, L2 and weight decay produce nearly identical trajectories.
For Adam, the difference is significant. Loshchilov and Hutter showed that AdamW matches SGD generalization on ImageNet while retaining Adam's fast convergence. The key practical consequences:
Hyperparameter transfer. With L2 regularization in Adam, the effective regularization changes when you change the learning rate, because both interact through the moment estimates. With decoupled weight decay, $\eta$ and $\lambda$ are independent. You can tune them separately, and the optimal $\lambda$ transfers across learning rate schedules.
Scale-invariant regularization. Weight decay penalizes all parameters equally by magnitude. L2 in Adam penalizes parameters with small gradients more than those with large gradients. For transformer training where different parameter groups (embeddings, attention weights, FFN weights) have different gradient scales, uniform regularization is more predictable.
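The per-parameter attenuation can be read off the update: when the task gradient dominates the moments, the L2 shrinkage is roughly $\eta\lambda\theta/\sqrt{\hat{v}_t}$, while decoupled decay is $\eta\lambda\theta$ for every parameter. A back-of-envelope sketch with illustrative gradient-RMS values:

```python
eta, lam, theta = 1e-3, 0.1, 1.0               # illustrative values
for grad_rms in (0.01, 1.0, 100.0):            # stand-in for sqrt(v_hat)
    l2_shrink = eta * lam * theta / grad_rms   # L2 term after Adam's scaling
    wd_shrink = eta * lam * theta              # decoupled decay, uniform
    print(f"rms={grad_rms:>6}: L2 shrink={l2_shrink:.2e}, WD shrink={wd_shrink:.2e}")
```

A parameter whose gradient RMS is 100 receives four orders of magnitude less L2 shrinkage than one whose RMS is 0.01; decoupled decay treats both identically.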
Large-scale training. GPT-3, PaLM, LLaMA, and virtually all modern LLMs use AdamW. The decoupled formulation is the standard for fine-tuning and pretraining.
Common Confusions
Weight decay is NOT just L2 regularization by another name
Many textbooks and tutorials treat these as synonyms. They are only equivalent for plain SGD without momentum scaling or adaptive learning rates. For Adam, AdaGrad, RMSProp, and any optimizer that scales gradients per-parameter, L2 regularization and weight decay produce different training dynamics and different final models.
The lambda values are not interchangeable
If you switch from Adam + L2 to AdamW, you cannot keep the same $\lambda$. The effective regularization strength differs because L2 regularization is attenuated by the adaptive scaling. A typical AdamW weight decay of 0.01 to 0.1 does not correspond to the same L2 $\lambda$.
Weight decay does not require a loss modification
L2 regularization modifies the loss function: you minimize $L(\theta) + \frac{\lambda}{2}\|\theta\|^2$. Weight decay is purely an optimizer-level operation. No term is added to the loss. This distinction matters for computing training loss curves: with L2, reported loss includes the penalty. With weight decay, it does not (unless you add it manually for logging).
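The logging difference in a few lines; the task loss, $\lambda$, and weights here are placeholders, not values from the text:

```python
lam = 0.01
weights = [0.5, -1.2, 2.0]
task_loss = 0.37                      # pretend this came from the forward pass

penalty = 0.5 * lam * sum(w * w for w in weights)

loss_reported_by_l2 = task_loss + penalty  # L2: penalty is part of the loss
loss_reported_by_wd = task_loss            # weight decay: loss is untouched

# To compare curves across the two setups, log task_loss and penalty separately.
print(loss_reported_by_l2, loss_reported_by_wd)
```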
Bias terms and LayerNorm parameters are typically excluded from weight decay
Weight decay is usually applied only to weight matrices, not to bias vectors or normalization parameters. These parameters operate at a different scale and do not benefit from magnitude penalization. Most frameworks support parameter group configuration to exclude them.
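Most frameworks express the exclusion through parameter groups. A framework-agnostic sketch, assuming parameter names follow common conventions (a `.bias` suffix, `norm` in normalization-layer names); the helper and the example names are hypothetical:

```python
def split_param_groups(named_params, weight_decay=0.01):
    """Split into a decayed group (weight matrices) and an undecayed group
    (biases and normalization parameters), selected by name."""
    decay, no_decay = [], []
    for name, param in named_params:
        if name.endswith(".bias") or "norm" in name.lower():
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# Hypothetical parameter names in the style of a small transformer block.
params = [("attn.qkv.weight", "w0"), ("attn.qkv.bias", "b0"),
          ("attn_norm.weight", "g0"), ("ffn.fc.weight", "w1")]
groups = split_param_groups(params)
print([len(g["params"]) for g in groups])  # counts: [weights, bias/norm]
```

The resulting list of dicts matches the per-group options format that optimizers like PyTorch's `AdamW` accept, but the name-matching rules should be adapted to your model's actual naming scheme.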
References
- Loshchilov, I. and Hutter, F. (2019). "Decoupled Weight Decay Regularization." ICLR 2019. (Original AdamW paper proving the L2/WD divergence under adaptive optimizers.)
- Kingma, D. P. and Ba, J. (2015). "Adam: A Method for Stochastic Optimization." ICLR 2015. (Original Adam, which used L2 regularization.)
- Hanson, S. J. and Pratt, L. Y. (1989). "Comparing biases for minimal network construction with back-propagation." NIPS 1989. (Early weight decay for neural networks.)
- Krogh, A. and Hertz, J. A. (1991). "A simple weight decay can improve generalization." NIPS 1991. (Theoretical analysis of weight decay as regularization in neural networks.)
- Zhang, M. et al. (2019). "Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model." NeurIPS 2019. (Analysis of optimizer hyperparameter interactions including weight decay.)
- Zhuang, J. et al. (2022). "Understanding AdamW through Proximal Methods and Scale-Freeness." Transactions on Machine Learning Research. (Formal proximal interpretation of decoupled weight decay.)
- Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. Chapter 7.1 (Parameter norm penalties and regularization).