
Weight Decay vs. L2 Regularization

Weight decay and L2 regularization are identical for SGD but diverge under adaptive optimizers. L2 adds the penalty gradient before adaptive scaling, so heavily updated parameters get less regularization. Weight decay subtracts directly from weights after the update, applying uniform regularization regardless of gradient history.

What Each Does

Both techniques shrink neural network weights during training. Given loss $\mathcal{L}(\theta)$ with parameters $\theta$, they appear similar but operate differently.

L2 regularization adds a penalty to the loss function:

$$\mathcal{L}_{\text{reg}}(\theta) = \mathcal{L}(\theta) + \frac{\lambda}{2} \|\theta\|_2^2$$

The gradient becomes $\nabla \mathcal{L}(\theta) + \lambda \theta$. This modified gradient is what the optimizer processes.

Weight decay subtracts a fraction of each weight directly:

$$\theta_{t+1} = \theta_t - \eta \nabla \mathcal{L}(\theta_t) - \eta \lambda \theta_t$$

The key distinction: L2 modifies the gradient. Weight decay modifies the parameter update. Under vanilla SGD, these are algebraically identical. Under Adam, they are not.
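The SGD equivalence can be checked in a few lines. This is a minimal numeric sketch in plain Python (no framework); the values of `theta`, `grad`, `lr`, and `lam` are illustrative.

```python
import math

theta = 0.8          # current parameter value
grad = 0.5           # gradient of the task loss at theta
lr, lam = 0.1, 0.01  # learning rate, regularization strength

# L2: add lambda * theta to the gradient, then take a plain SGD step.
theta_l2 = theta - lr * (grad + lam * theta)

# Weight decay: step on the task gradient, then shrink the weight directly.
theta_wd = theta - lr * grad - lr * lam * theta

print(math.isclose(theta_l2, theta_wd))  # True: identical under vanilla SGD
```

Because the vanilla SGD update is linear in the gradient, the two formulations distribute into exactly the same arithmetic, which is why the distinction only surfaces once an optimizer scales gradients nonlinearly.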

Why They Diverge Under Adam

Adam scales the gradient by the inverse of the running RMS of past gradients. For parameter $\theta_j$ at step $t$:

$$\theta_j \leftarrow \theta_j - \eta \cdot \frac{\hat{m}_{t,j}}{\sqrt{\hat{v}_{t,j}} + \epsilon}$$

where $\hat{m}_t$ is the bias-corrected first moment and $\hat{v}_t$ is the bias-corrected second moment.

With L2 regularization, the penalty gradient $\lambda \theta_j$ is added to $\nabla_{\theta_j} \mathcal{L}$ before moment estimation. The regularization term is therefore divided by $\sqrt{\hat{v}_{t,j}} + \epsilon$ along with everything else. Parameters with large gradient variance (large $\hat{v}_{t,j}$) receive weaker effective regularization; parameters with small gradient variance receive stronger regularization. The regularization strength becomes dependent on the optimization landscape, which is rarely what you intend.
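The scaling effect can be made concrete by looking at the per-step shrinkage the regularizer contributes for a given second-moment estimate $\hat{v}$. This sketch treats the penalty term in isolation (ignoring momentum mixing with the task gradient), so it is a simplification of full Adam dynamics; all values are illustrative.

```python
import math

lr, lam, theta, eps = 1e-3, 0.1, 1.0, 1e-8

def l2_shrinkage(v_hat):
    # Under L2, lambda * theta rides along in the gradient and is
    # divided by sqrt(v_hat) + eps like every other gradient component.
    return lr * lam * theta / (math.sqrt(v_hat) + eps)

def wd_shrinkage(v_hat):
    # Under decoupled weight decay, the shrinkage never touches v_hat.
    return lr * lam * theta

print(l2_shrinkage(100.0))  # noisy parameter: weak effective decay
print(l2_shrinkage(1e-4))   # quiet parameter: strong effective decay
print(wd_shrinkage(100.0) == wd_shrinkage(1e-4))  # True: uniform decay
```

The three-orders-of-magnitude gap between the two `l2_shrinkage` values is exactly the landscape dependence described above.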

With weight decay (as in AdamW), the subtraction of $\lambda \theta_j$ happens after the adaptive update, bypassing the moment scaling entirely:

$$\theta_j \leftarrow \theta_j - \eta \left(\frac{\hat{m}_{t,j}}{\sqrt{\hat{v}_{t,j}} + \epsilon} + \lambda \theta_j\right)$$

Every parameter is regularized proportionally to its magnitude, regardless of gradient history.

The AdamW Fix

Loshchilov and Hutter (2019) showed that the original Adam implementation in most frameworks used L2 regularization, not weight decay. They proposed AdamW, which decouples the weight decay from the adaptive gradient:

| Step | Adam + L2 | AdamW |
|---|---|---|
| Gradient | $g_t = \nabla \mathcal{L}(\theta_t) + \lambda \theta_t$ | $g_t = \nabla \mathcal{L}(\theta_t)$ |
| First moment | $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$ | $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$ |
| Second moment | $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$ | $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$ |
| Update | $\theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}$ | $\theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon} - \eta \lambda \theta_t$ |
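The AdamW column can be sketched as a single-parameter loop in plain Python. Hyperparameter defaults follow the common Adam settings; this is an illustration of the update rule, not a production optimizer.

```python
import math

def adamw(theta, grads, lr=1e-3, lam=0.01, b1=0.9, b2=0.999, eps=1e-8):
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = b1 * m + (1 - b1) * g        # first moment: task gradient only
        v = b2 * v + (1 - b2) * g * g    # second moment: task gradient only
        m_hat = m / (1 - b1 ** t)        # bias correction
        v_hat = v / (1 - b2 ** t)
        adaptive = lr * m_hat / (math.sqrt(v_hat) + eps)
        theta = theta - adaptive - lr * lam * theta  # decoupled decay
    return theta

# With a zero task gradient, the adaptive step vanishes and the weight
# simply decays geometrically: each step multiplies theta by (1 - lr*lam).
print(adamw(1.0, [0.0] * 10))
```

Note that the decay term uses $\theta_t$ from before the adaptive step, matching the table's update line, and that `grads` never contains a penalty term, so the moments track only the task loss.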

In Adam + L2, the regularization term $\lambda \theta_t$ contaminates both moment estimates. In AdamW, the moments track only the task gradient, and weight decay is applied as a clean subtraction.

Side-by-Side Comparison

| Property | L2 Regularization | Weight Decay |
|---|---|---|
| Mechanism | Adds $\lambda \theta$ to gradient | Subtracts $\lambda \theta$ from weights |
| Equivalent to WD under SGD | Yes | Yes |
| Equivalent to WD under Adam | No | N/A (is itself) |
| Affected by adaptive scaling | Yes, divided by $\sqrt{v_t}$ | No, applied uniformly |
| Effect on moment estimates | Contaminates $m_t$ and $v_t$ | Moments track only task loss |
| Regularization strength | Varies per parameter | Uniform across parameters |
| Hyperparameter coupling | $\lambda$ interacts with learning rate schedule | $\lambda$ independent of LR schedule |
| Modern default | Rarely used in LLM training | Standard (AdamW, LAMB) |

When It Matters in Practice

The difference is negligible for SGD, including SGD with momentum: the update is a linear function of the gradient, so folding $\lambda \theta$ into the gradient and decaying the weights directly produce the same trajectory up to a rescaling of $\lambda$.

For Adam, the difference is significant. Loshchilov and Hutter showed that AdamW matches SGD generalization on ImageNet while retaining Adam's fast convergence. The key practical consequences:

Hyperparameter transfer. With L2 regularization in Adam, the effective regularization changes when you change the learning rate, because both interact through the moment estimates. With decoupled weight decay, $\lambda$ and $\eta$ are independent. You can tune them separately, and optimal $\lambda$ transfers across learning rate schedules.

Scale-invariant regularization. Weight decay penalizes all parameters equally by magnitude. L2 in Adam penalizes parameters with small gradients more than those with large gradients. For transformer training where different parameter groups (embeddings, attention weights, FFN weights) have different gradient scales, uniform regularization is more predictable.

Large-scale training. GPT-3, PaLM, LLaMA, and virtually all modern LLMs use AdamW. The decoupled formulation is the standard for fine-tuning and pretraining.

Common Confusions

Watch Out

Weight decay is NOT just L2 regularization by another name

Many textbooks and tutorials treat these as synonyms. They are equivalent only for plain SGD. For Adam, AdaGrad, RMSProp, and any other optimizer that scales gradients per parameter, L2 regularization and weight decay produce different training dynamics and different final models.

Watch Out

The lambda values are not interchangeable

If you switch from Adam + L2 to AdamW, you cannot keep the same $\lambda$. The effective regularization strength differs because L2 regularization is attenuated by the adaptive scaling. A typical AdamW weight decay of 0.01 to 0.1 does not correspond to the same L2 $\lambda$.

Watch Out

Weight decay does not require a loss modification

L2 regularization modifies the loss function: you minimize $\mathcal{L} + \frac{\lambda}{2}\|\theta\|^2$. Weight decay is purely an optimizer-level operation. No term is added to the loss. This distinction matters for computing training loss curves: with L2, reported loss includes the penalty. With weight decay, it does not (unless you add it manually for logging).
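The logging consequence can be sketched directly; the parameter values and task loss below are purely illustrative.

```python
theta = [0.5, -0.3, 1.2]  # hypothetical parameter vector
task_loss = 0.42          # hypothetical task loss at theta
lam = 0.01

# L2: the penalty is part of the objective being minimized, so it
# appears in the reported loss. Weight decay adds nothing to the loss.
penalty = 0.5 * lam * sum(w * w for w in theta)
loss_l2_run = task_loss + penalty  # what an L2 run logs as "loss"
loss_wd_run = task_loss            # what a weight-decay run logs

print(loss_l2_run > loss_wd_run)   # True: curves are not directly comparable
```

When comparing loss curves across the two setups, either subtract the penalty from the L2 run's curve or add it to the weight-decay run's curve for logging.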

Watch Out

Bias terms and LayerNorm parameters are typically excluded from weight decay

Weight decay is usually applied only to weight matrices, not to bias vectors or normalization parameters. These parameters operate at a different scale and do not benefit from magnitude penalization. Most frameworks support parameter group configuration to exclude them.
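A framework-agnostic sketch of the usual split: weight matrices get decay, biases and normalization parameters do not. The name patterns checked here (`bias`, `norm`) are a common convention rather than a rule, the parameter names are hypothetical, and the group-dict shape mirrors what most optimizers accept for per-parameter options.

```python
def split_param_groups(named_params, weight_decay=0.01):
    """Partition (name, param) pairs into decay / no-decay groups."""
    decay, no_decay = [], []
    for name, param in named_params:
        if name.endswith("bias") or "norm" in name.lower():
            no_decay.append(param)   # biases and normalization params
        else:
            decay.append(param)      # weight matrices
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# Hypothetical parameter names from a transformer block:
params = [("attn.weight", "Wq"), ("attn.bias", "b"),
          ("layernorm.weight", "g"), ("ffn.weight", "W1")]
groups = split_param_groups(params)
print([g["weight_decay"] for g in groups])  # [0.01, 0.0]
```

The resulting list of group dicts is the shape most optimizer constructors take when applying different hyperparameters to different parameter subsets.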

References

  1. Loshchilov, I. and Hutter, F. (2019). "Decoupled Weight Decay Regularization." ICLR 2019. (Original AdamW paper proving the L2/WD divergence under adaptive optimizers.)
  2. Kingma, D. P. and Ba, J. (2015). "Adam: A Method for Stochastic Optimization." ICLR 2015. (Original Adam, which used L2 regularization.)
  3. Hanson, S. J. and Pratt, L. Y. (1989). "Comparing biases for minimal network construction with back-propagation." NIPS 1989. (Early weight decay for neural networks.)
  4. Krogh, A. and Hertz, J. A. (1991). "A simple weight decay can improve generalization." NIPS 1991. (Theoretical analysis of weight decay as regularization in neural networks.)
  5. Zhang, M. et al. (2019). "Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model." NeurIPS 2019. (Analysis of optimizer hyperparameter interactions including weight decay.)
  6. Zhuang, J. et al. (2022). "Understanding AdamW through Proximal Methods and Scale-Freeness." Transactions on Machine Learning Research. (Formal proximal interpretation of decoupled weight decay.)
  7. Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. Chapter 7.1 (Parameter norm penalties and regularization).