
Comparison

AdamW vs. Adam

Adam applies L2 regularization inside the gradient, where the adaptive scaling distorts the penalty. AdamW decouples weight decay from the adaptive step, applying it directly to parameters. This distinction matters: every modern transformer uses AdamW, not Adam with L2.

The Core Problem

Both Adam and AdamW optimize the same objective, but they handle regularization differently. The difference is small in notation and large in practice.

In standard SGD, L2 regularization and weight decay are mathematically equivalent. In Adam, they are not. This is because Adam divides the gradient by $\sqrt{\hat{v}_t} + \epsilon$ before applying the update. Any penalty added to the gradient gets divided by this same factor, which distorts the regularization strength per parameter.
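The SGD equivalence can be checked numerically. A minimal sketch, with illustrative values for the learning rate, penalty, parameter, and gradient:

```python
# In plain SGD, L2-in-the-gradient and decoupled weight decay produce
# the exact same update (values here are purely illustrative).
eta, lam = 0.1, 0.01          # learning rate, regularization strength
theta, grad = 2.0, 0.5        # parameter, loss gradient

# L2 penalty folded into the gradient:
theta_l2 = theta - eta * (grad + lam * theta)
# Decoupled weight decay:
theta_wd = (1 - eta * lam) * theta - eta * grad

assert abs(theta_l2 - theta_wd) < 1e-12  # identical in SGD
```

The two forms only diverge once an adaptive denominator sits between the gradient and the update, as the next sections show.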

Loshchilov and Hutter (2019) identified this problem and proposed AdamW: apply weight decay directly to the parameters, outside the adaptive scaling.

Side-by-Side Updates

Definition

Adam with L2 Regularization

The L2 penalty $\lambda \theta_t$ is added to the gradient before moment estimation:

$g_t^{\text{reg}} = \nabla_\theta \mathcal{L}(\theta_t) + \lambda \theta_t$

This regularized gradient flows through the full Adam update. The second moment estimate $\hat{v}_t$ is computed from $g_t^{\text{reg}}$, so the weight decay term is scaled by $1/(\sqrt{\hat{v}_t} + \epsilon)$.

The effective regularization for parameter $\theta_i$ becomes:

$\lambda \theta_i \cdot \dfrac{1}{\sqrt{\hat{v}_{t,i}} + \epsilon}$

Parameters with large gradient variance (large $\hat{v}_t$) receive weaker regularization. Parameters with small gradient variance receive stronger regularization. This is the opposite of what you want: heavily updated parameters should be regularized more, not less.
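To make the inverse scaling concrete, here is a small sketch of the effective decay expression above, with illustrative values $\lambda = 0.01$, $\theta_i = 1$:

```python
import math

lam, eps, theta = 0.01, 1e-8, 1.0
# Effective per-parameter decay under Adam + L2: lam * theta / (sqrt(v_hat) + eps)
for v_hat in (1e-4, 1e-2, 1.0):       # small -> large gradient variance
    eff = lam * theta / (math.sqrt(v_hat) + eps)
    print(f"v_hat = {v_hat:g} -> effective decay = {eff:.4f}")
# Larger v_hat (a heavily updated parameter) -> weaker regularization.
```

The parameter with ten-thousand-fold larger gradient variance receives a hundred-fold weaker decay, even though both carry the same $\lambda$.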

Definition

AdamW (Decoupled Weight Decay)

Weight decay is applied directly to the parameters, after the adaptive update:

$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$

$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$

$\hat{m}_t = m_t / (1 - \beta_1^t), \quad \hat{v}_t = v_t / (1 - \beta_2^t)$

$\theta_{t+1} = (1 - \eta \lambda)\, \theta_t - \eta\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$

The term $(1 - \eta\lambda)\theta_t$ shrinks every parameter by the same relative amount, regardless of gradient magnitude. The adaptive step $\hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$ handles optimization. Weight decay handles regularization. The two do not interact.
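The update equations translate directly into code. A minimal scalar sketch (`adamw_step` is a hypothetical helper written for illustration, not a library API):

```python
import math

def adamw_step(theta, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, lam=0.01):
    """One AdamW update for a scalar parameter, following the update equations."""
    m = beta1 * m + (1 - beta1) * g                  # first moment
    v = beta2 * v + (1 - beta2) * g * g              # second moment
    m_hat = m / (1 - beta1 ** t)                     # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Decoupled decay: shrink theta directly, outside the adaptive step.
    theta = (1 - eta * lam) * theta - eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = adamw_step(theta=1.0, g=0.5, m=0.0, v=0.0, t=1)
```

Note how the decay factor $(1 - \eta\lambda)$ multiplies $\theta_t$ before the adaptive step is subtracted; the gradient never touches the decay term.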

Why the Difference Matters

Watch Out

L2 regularization and weight decay are NOT the same in Adam

In SGD, adding $\lambda\theta$ to the gradient produces the update $\theta_{t+1} = (1 - \eta\lambda)\theta_t - \eta g_t$. The gradient term and the decay term are both scaled by $\eta$. L2 regularization and weight decay are identical.

In Adam, the gradient is divided by $\sqrt{\hat{v}_t} + \epsilon$. If $\lambda\theta$ is part of the gradient, it gets divided too. The decay strength becomes parameter-dependent and inversely proportional to gradient magnitude. L2 regularization and weight decay produce different updates. Loshchilov and Hutter showed this mismatch degrades both optimization and generalization.
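A one-step numeric comparison makes the divergence visible. This sketch deliberately drops momentum and bias correction and fixes $\hat{v}_t$ by hand, purely to isolate the decay term (all values illustrative):

```python
import math

eta, lam, eps = 1e-3, 0.1, 1e-8
theta, g, v_hat = 1.0, 0.0, 4.0   # zero loss gradient: only decay moves theta

# Adam + L2: the penalty lam*theta is divided by sqrt(v_hat) + eps.
adam_l2 = theta - eta * (g + lam * theta) / (math.sqrt(v_hat) + eps)
# AdamW: the decay multiplies theta directly and never meets v_hat.
adamw = (1 - eta * lam) * theta - eta * g / (math.sqrt(v_hat) + eps)

# With sqrt(v_hat) = 2, Adam + L2 applies only half the decay AdamW does.
```

The larger $\hat{v}_t$ grows, the more the L2 penalty is diluted, while AdamW's shrinkage stays fixed at $\eta\lambda$.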

Consider two parameters: one with large, consistent gradients (large $\hat{v}_t$) and one with rare, small gradients (small $\hat{v}_t$). Under Adam + L2, the first receives weak decay (its penalty is divided by a large $\sqrt{\hat{v}_t}$) and the second receives strong decay.

This is backwards. Frequently updated parameters are more likely to overfit and should be regularized more aggressively. AdamW fixes this by applying uniform decay.

Comparison Table

| Property | Adam + L2 | AdamW |
| --- | --- | --- |
| Regularization mechanism | $\lambda\theta$ added to gradient | $\lambda\theta$ applied directly to parameters |
| Interaction with adaptive scaling | Yes: decay is divided by $\sqrt{\hat{v}_t}$ | None: decay is independent |
| Effective decay per parameter | Varies inversely with gradient magnitude | Uniform across all parameters |
| Equivalence to SGD weight decay | No | Yes (same principle, adaptive step added) |
| Training transformers | Suboptimal | Standard choice |
| Hyperparameter coupling | $\lambda$ and $\eta$ interact through moments | $\lambda$ and $\eta$ are separable |
| Learning rate warmup interaction | Warmup affects regularization strength | Warmup affects optimization only |

Hyperparameter Separability

A subtle but important consequence: in Adam + L2, the optimal regularization strength $\lambda$ depends on the learning rate $\eta$ because both pass through the adaptive scaling. If you change $\eta$, you must retune $\lambda$.

In AdamW, $\eta$ controls the optimization step and $\lambda$ controls regularization independently. You can tune one without retuning the other. This makes hyperparameter search cheaper and more modular.

When Each Is Used

Example

Training GPT-style language models

All major LLM training runs (GPT-3, LLaMA, PaLM, Chinchilla) use AdamW with $\beta_1 = 0.9$, $\beta_2 = 0.95$, and weight decay $\lambda = 0.1$. Using Adam + L2 here would apply inconsistent regularization across embedding, attention, and MLP parameters, degrading both training stability and downstream performance.

Example

Fine-tuning BERT for classification

Standard practice is AdamW with $\eta \in [2 \times 10^{-5}, 5 \times 10^{-5}]$ and $\lambda = 0.01$. The decoupled weight decay prevents the fine-tuning from distorting the pretrained representations unevenly across parameter groups.

Example

Training a small CNN with standard regularization

For small models where you are already using dropout and data augmentation, the difference between Adam + L2 and AdamW may be minor. Both work. But AdamW is strictly better in principle and costs nothing extra, so there is no reason to prefer Adam + L2.

Common Confusions

Watch Out

AdamW is not a different optimizer from Adam

AdamW computes the same adaptive moment estimates as Adam. The only change is where weight decay is applied. The optimization dynamics (step direction, effective learning rate per parameter) are identical for the gradient-based component. The regularization component is what differs.

Watch Out

Most frameworks default to AdamW, not Adam + L2

PyTorch's torch.optim.AdamW implements decoupled weight decay. If you use torch.optim.Adam with weight_decay > 0, you get L2 regularization inside the gradient, which is the wrong behavior. Check which class your code uses. This is a common source of subtle bugs.
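A minimal PyTorch check of the difference (assumes PyTorch is installed; with a zero gradient, any movement of the parameter comes from the decay term alone):

```python
import torch

w = torch.nn.Parameter(torch.ones(1))
# Decoupled weight decay -- the behavior this article recommends:
opt = torch.optim.AdamW([w], lr=1e-3, weight_decay=0.01)

w.grad = torch.zeros(1)   # zero gradient: only the decay term acts
opt.step()
# AdamW shrinks w by (1 - lr * weight_decay) even with a zero gradient.
# torch.optim.Adam with weight_decay=0.01 would instead add 0.01*w to the
# gradient and push it through the adaptive scaling.
```

Running the same experiment with `torch.optim.Adam` gives a different (and $\hat{v}_t$-dependent) shrinkage, which is exactly the coupling described above.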

Watch Out

The weight decay coefficient is not the same as the L2 coefficient

When converting between Adam + L2 and AdamW, $\lambda_{\text{L2}}$ and $\lambda_{\text{AdamW}}$ are not interchangeable. The effective regularization differs by a factor that depends on $\sqrt{\hat{v}_t}$. Do not copy $\lambda$ values between the two without adjustment.

References

  1. Loshchilov, I. and Hutter, F. "Decoupled Weight Decay Regularization." ICLR 2019. The original AdamW paper. Section 2 derives the L2/weight decay mismatch.
  2. Kingma, D.P. and Ba, J. "Adam: A Method for Stochastic Optimization." ICLR 2015. The original Adam paper. Section 2 defines the algorithm.
  3. Zhang, M., Lucas, J., Ba, J., and Hinton, G. "Lookahead Optimizer: k steps forward, 1 step back." NeurIPS 2019. Section 4 discusses weight decay interactions.
  4. Touvron, H. et al. "LLaMA: Open and Efficient Foundation Language Models." 2023. Section 2.2 specifies AdamW hyperparameters for LLM training.
  5. Zhuang, J. et al. "Understanding AdamW through Proximal Methods and Scale-Freeness." 2022. Proximal interpretation of decoupled weight decay.
  6. Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016. Chapter 8 covers optimization algorithms and regularization.
  7. Brown, T. et al. "Language Models are Few-Shot Learners." NeurIPS 2020. GPT-3 training details specify AdamW with $\beta_2 = 0.95$.