
Weight Decay vs. L2 Regularization

Weight decay and L2 regularization are identical for SGD but diverge under adaptive optimizers. L2 adds the penalty gradient before adaptive scaling, so heavily updated parameters get less regularization. Weight decay subtracts directly from weights after the update, applying uniform regularization regardless of gradient history.

What Each Does

Both techniques shrink neural network weights during training. Given loss $\mathcal{L}(\theta)$ with parameters $\theta$, they appear similar but operate differently.

L2 regularization adds a penalty to the loss function:

$$\mathcal{L}_{\text{reg}}(\theta) = \mathcal{L}(\theta) + \frac{\lambda}{2} \|\theta\|_2^2$$

The gradient becomes $\nabla \mathcal{L}(\theta) + \lambda \theta$. This modified gradient is what the optimizer processes.

Weight decay subtracts a fraction of each weight directly:

$$\theta_{t+1} = \theta_t - \eta \nabla \mathcal{L}(\theta_t) - \eta \lambda \theta_t$$

The key distinction: L2 modifies the gradient. Weight decay modifies the parameter update. Under vanilla SGD, these are algebraically identical. Under Adam, they are not.
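The SGD equivalence can be checked in a few lines. This is a minimal numeric sketch in plain Python (no framework); the values of `theta`, `grad`, `lr`, and `lam` are illustrative.

```python
import math

theta = 0.8          # current parameter value
grad = 0.5           # gradient of the task loss at theta
lr, lam = 0.1, 0.01  # learning rate, regularization strength

# L2: add lambda * theta to the gradient, then take a plain SGD step.
theta_l2 = theta - lr * (grad + lam * theta)

# Weight decay: step on the task gradient, then shrink the weight directly.
theta_wd = theta - lr * grad - lr * lam * theta

print(math.isclose(theta_l2, theta_wd))  # True: identical under vanilla SGD
```

Because the vanilla SGD update is linear in the gradient, the two formulations distribute into exactly the same arithmetic, which is why the distinction only surfaces once an optimizer scales gradients nonlinearly.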

Why They Diverge Under Adam

Adam scales the gradient by the inverse of the running RMS of past gradients. For parameter $\theta_j$ at step $t$:

$$\theta_j \leftarrow \theta_j - \eta \cdot \frac{\hat{m}_{t,j}}{\sqrt{\hat{v}_{t,j}} + \epsilon}$$

where $\hat{m}_t$ is the bias-corrected first moment and $\hat{v}_t$ is the bias-corrected second moment.

With L2 regularization, the penalty gradient $\lambda \theta_j$ is added to $\nabla_{\theta_j} \mathcal{L}$ before moment estimation. The regularization term is therefore divided by $\sqrt{\hat{v}_{t,j}} + \epsilon$ along with everything else. Parameters with large gradient variance (large $\hat{v}_{t,j}$) receive weaker effective regularization; parameters with small gradient variance receive stronger regularization. The regularization strength becomes dependent on the optimization landscape, which is rarely what you intend.
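The scaling effect can be made concrete by looking at the per-step shrinkage the regularizer contributes for a given second-moment estimate $\hat{v}$. This sketch treats the penalty term in isolation (ignoring momentum mixing with the task gradient), so it is a simplification of full Adam dynamics; all values are illustrative.

```python
import math

lr, lam, theta, eps = 1e-3, 0.1, 1.0, 1e-8

def l2_shrinkage(v_hat):
    # Under L2, lambda * theta rides along in the gradient and is
    # divided by sqrt(v_hat) + eps like every other gradient component.
    return lr * lam * theta / (math.sqrt(v_hat) + eps)

def wd_shrinkage(v_hat):
    # Under decoupled weight decay, the shrinkage never touches v_hat.
    return lr * lam * theta

print(l2_shrinkage(100.0))  # noisy parameter: weak effective decay
print(l2_shrinkage(1e-4))   # quiet parameter: strong effective decay
print(wd_shrinkage(100.0) == wd_shrinkage(1e-4))  # True: uniform decay
```

The three-orders-of-magnitude gap between the two `l2_shrinkage` values is exactly the landscape dependence described above.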

With weight decay (as in AdamW), the subtraction of $\lambda \theta_j$ happens after the adaptive update, bypassing the moment scaling entirely:

$$\theta_j \leftarrow \theta_j - \eta \left(\frac{\hat{m}_{t,j}}{\sqrt{\hat{v}_{t,j}} + \epsilon} + \lambda \theta_j\right)$$

Every parameter is regularized proportionally to its magnitude, regardless of gradient history.

The AdamW Fix

Loshchilov and Hutter (2019) showed that the original Adam implementation in most frameworks used L2 regularization, not weight decay. They proposed AdamW, which decouples the weight decay from the adaptive gradient:

| Step | Adam + L2 | AdamW |
|---|---|---|
| Gradient | $g_t = \nabla \mathcal{L}(\theta_t) + \lambda \theta_t$ | $g_t = \nabla \mathcal{L}(\theta_t)$ |
| First moment | $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$ | $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$ |
| Second moment | $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$ | $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$ |
| Update | $\theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}$ | $\theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon} - \eta \lambda \theta_t$ |
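The AdamW column can be sketched as a single-parameter loop in plain Python. Hyperparameter defaults follow the common Adam settings; this is an illustration of the update rule, not a production optimizer.

```python
import math

def adamw(theta, grads, lr=1e-3, lam=0.01, b1=0.9, b2=0.999, eps=1e-8):
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = b1 * m + (1 - b1) * g        # first moment: task gradient only
        v = b2 * v + (1 - b2) * g * g    # second moment: task gradient only
        m_hat = m / (1 - b1 ** t)        # bias correction
        v_hat = v / (1 - b2 ** t)
        adaptive = lr * m_hat / (math.sqrt(v_hat) + eps)
        theta = theta - adaptive - lr * lam * theta  # decoupled decay
    return theta

# With a zero task gradient, the adaptive step vanishes and the weight
# simply decays geometrically: each step multiplies theta by (1 - lr*lam).
print(adamw(1.0, [0.0] * 10))
```

Note that the decay term uses $\theta_t$ from before the adaptive step, matching the table's update line, and that `grads` never contains a penalty term, so the moments track only the task loss.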

In Adam + L2, the regularization term $\lambda \theta_t$ contaminates both moment estimates. In AdamW, the moments track only the task gradient, and weight decay is applied as a clean subtraction.

Side-by-Side Comparison

| Property | L2 Regularization | Weight Decay |
|---|---|---|
| Mechanism | Adds $\lambda \theta$ to gradient | Subtracts $\lambda \theta$ from weights |
| Equivalent to WD under SGD | Yes | Yes |
| Equivalent to WD under Adam | No | N/A (is itself) |
| Affected by adaptive scaling | Yes, divided by $\sqrt{v_t}$ | No, applied uniformly |
| Effect on moment estimates | Contaminates $m_t$ and $v_t$ | Moments track only task loss |
| Regularization strength | Varies per parameter | Uniform across parameters |
| Hyperparameter coupling | $\lambda$ interacts with learning rate schedule | $\lambda$ independent of LR schedule |
| Modern default | Rarely used in LLM training | Standard (AdamW, LAMB) |

When It Matters in Practice

The difference is negligible for SGD, including SGD with momentum: the update is a linear function of the gradient, so folding $\lambda \theta$ into the gradient and decaying the weights directly produce the same trajectory up to a rescaling of $\lambda$.

For Adam, the difference is significant. Loshchilov and Hutter showed that AdamW matches SGD generalization on ImageNet while retaining Adam's fast convergence. The key practical consequences:

Hyperparameter transfer. With L2 regularization in Adam, the effective regularization changes when you change the learning rate, because both interact through the moment estimates. With decoupled weight decay, $\lambda$ and $\eta$ are independent. You can tune them separately, and optimal $\lambda$ transfers across learning rate schedules.

Scale-invariant regularization. Weight decay penalizes all parameters equally by magnitude. L2 in Adam penalizes parameters with small gradients more than those with large gradients. For transformer training where different parameter groups (embeddings, attention weights, FFN weights) have different gradient scales, uniform regularization is more predictable.

Large-scale training. GPT-3, PaLM, LLaMA, and virtually all modern LLMs use AdamW. The decoupled formulation is the standard for fine-tuning and pretraining.

Common Confusions

Watch Out

Weight decay is NOT just L2 regularization by another name

Many textbooks and tutorials treat these as synonyms. They are equivalent only for plain SGD. For Adam, AdaGrad, RMSProp, and any other optimizer that scales gradients per parameter, L2 regularization and weight decay produce different training dynamics and different final models.

Watch Out

The lambda values are not interchangeable

If you switch from Adam + L2 to AdamW, you cannot keep the same $\lambda$. The effective regularization strength differs because L2 regularization is attenuated by the adaptive scaling. A typical AdamW weight decay of 0.01 to 0.1 does not correspond to the same L2 $\lambda$.

Watch Out

Weight decay does not require a loss modification

L2 regularization modifies the loss function: you minimize $\mathcal{L} + \frac{\lambda}{2}\|\theta\|^2$. Weight decay is purely an optimizer-level operation. No term is added to the loss. This distinction matters for computing training loss curves: with L2, reported loss includes the penalty. With weight decay, it does not (unless you add it manually for logging).
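The logging consequence can be sketched directly; the parameter values and task loss below are purely illustrative.

```python
theta = [0.5, -0.3, 1.2]  # hypothetical parameter vector
task_loss = 0.42          # hypothetical task loss at theta
lam = 0.01

# L2: the penalty is part of the objective being minimized, so it
# appears in the reported loss. Weight decay adds nothing to the loss.
penalty = 0.5 * lam * sum(w * w for w in theta)
loss_l2_run = task_loss + penalty  # what an L2 run logs as "loss"
loss_wd_run = task_loss            # what a weight-decay run logs

print(loss_l2_run > loss_wd_run)   # True: curves are not directly comparable
```

When comparing loss curves across the two setups, either subtract the penalty from the L2 run's curve or add it to the weight-decay run's curve for logging.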

Watch Out

Bias terms and LayerNorm parameters are typically excluded from weight decay

Weight decay is usually applied only to weight matrices, not to bias vectors or normalization parameters. These parameters operate at a different scale and do not benefit from magnitude penalization. Most frameworks support parameter group configuration to exclude them.
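A framework-agnostic sketch of the usual split: weight matrices get decay, biases and normalization parameters do not. The name patterns checked here (`bias`, `norm`) are a common convention rather than a rule, the parameter names are hypothetical, and the group-dict shape mirrors what most optimizers accept for per-parameter options.

```python
def split_param_groups(named_params, weight_decay=0.01):
    """Partition (name, param) pairs into decay / no-decay groups."""
    decay, no_decay = [], []
    for name, param in named_params:
        if name.endswith("bias") or "norm" in name.lower():
            no_decay.append(param)   # biases and normalization params
        else:
            decay.append(param)      # weight matrices
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# Hypothetical parameter names from a transformer block:
params = [("attn.weight", "Wq"), ("attn.bias", "b"),
          ("layernorm.weight", "g"), ("ffn.weight", "W1")]
groups = split_param_groups(params)
print([g["weight_decay"] for g in groups])  # [0.01, 0.0]
```

The resulting list of group dicts is the shape most optimizer constructors take when applying different hyperparameters to different parameter subsets.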

References

  1. Loshchilov, I. and Hutter, F. (2019). "Decoupled Weight Decay Regularization." ICLR 2019. (Original AdamW paper proving the L2/WD divergence under adaptive optimizers.)
  2. Kingma, D. P. and Ba, J. (2015). "Adam: A Method for Stochastic Optimization." ICLR 2015. (Original Adam, which used L2 regularization.)
  3. Hanson, S. J. and Pratt, L. Y. (1989). "Comparing biases for minimal network construction with back-propagation." NIPS 1989. (Early weight decay for neural networks.)
  4. Krogh, A. and Hertz, J. A. (1991). "A simple weight decay can improve generalization." NIPS 1991. (Theoretical analysis of weight decay as regularization in neural networks.)
  5. Zhang, M. et al. (2019). "Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model." NeurIPS 2019. (Analysis of optimizer hyperparameter interactions including weight decay.)
  6. Zhuang, J. et al. (2022). "Understanding AdamW through Proximal Methods and Scale-Freeness." Transactions on Machine Learning Research. (Formal proximal interpretation of decoupled weight decay.)
  7. Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. Chapter 7.1 (Parameter norm penalties and regularization).