
Comparison

AdamW vs. Adam

Adam applies L2 regularization inside the gradient, where the adaptive scaling distorts the penalty. AdamW decouples weight decay from the adaptive step, applying it directly to parameters. This distinction matters: every modern transformer uses AdamW, not Adam with L2.

The Core Problem

Both Adam and AdamW optimize the same objective, but they handle regularization differently. The difference is small in notation and large in practice.

In standard SGD, L2 regularization and weight decay are mathematically equivalent. In Adam, they are not. This is because Adam divides the gradient by $\sqrt{\hat{v}_t} + \epsilon$ before applying the update. Any penalty added to the gradient gets divided by this same factor, which distorts the regularization strength per parameter.
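The SGD equivalence can be checked numerically. A minimal sketch, with illustrative values for the learning rate, penalty, parameter, and gradient:

```python
# In plain SGD, L2-in-the-gradient and decoupled weight decay produce
# the exact same update (values here are purely illustrative).
eta, lam = 0.1, 0.01          # learning rate, regularization strength
theta, grad = 2.0, 0.5        # parameter, loss gradient

# L2 penalty folded into the gradient:
theta_l2 = theta - eta * (grad + lam * theta)
# Decoupled weight decay:
theta_wd = (1 - eta * lam) * theta - eta * grad

assert abs(theta_l2 - theta_wd) < 1e-12  # identical in SGD
```

The two forms only diverge once an adaptive denominator sits between the gradient and the update, as the next sections show.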

Loshchilov and Hutter (2019) identified this problem and proposed AdamW: apply weight decay directly to the parameters, outside the adaptive scaling.

Side-by-Side Updates

Definition

Adam with L2 Regularization

The L2 penalty $\lambda \theta_t$ is added to the gradient before moment estimation:

$g_t^{\text{reg}} = \nabla_\theta \mathcal{L}(\theta_t) + \lambda \theta_t$

This regularized gradient flows through the full Adam update. The second moment estimate $\hat{v}_t$ is computed from $g_t^{\text{reg}}$, so the weight decay term is scaled by $1/(\sqrt{\hat{v}_t} + \epsilon)$.

The effective regularization for parameter $\theta_i$ becomes:

$\lambda \theta_i \cdot \dfrac{1}{\sqrt{\hat{v}_{t,i}} + \epsilon}$

Parameters with large gradient variance (large $\hat{v}_t$) receive weaker regularization. Parameters with small gradient variance receive stronger regularization. This is the opposite of what you want: heavily updated parameters should be regularized more, not less.
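To make the inverse scaling concrete, here is a small sketch of the effective decay expression above, with illustrative values $\lambda = 0.01$, $\theta_i = 1$:

```python
import math

lam, eps, theta = 0.01, 1e-8, 1.0
# Effective per-parameter decay under Adam + L2: lam * theta / (sqrt(v_hat) + eps)
for v_hat in (1e-4, 1e-2, 1.0):       # small -> large gradient variance
    eff = lam * theta / (math.sqrt(v_hat) + eps)
    print(f"v_hat = {v_hat:g} -> effective decay = {eff:.4f}")
# Larger v_hat (a heavily updated parameter) -> weaker regularization.
```

The parameter with ten-thousand-fold larger gradient variance receives a hundred-fold weaker decay, even though both carry the same $\lambda$.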

Definition

AdamW (Decoupled Weight Decay)

Weight decay is applied directly to the parameters, after the adaptive update:

$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$

$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$

$\hat{m}_t = m_t / (1 - \beta_1^t), \quad \hat{v}_t = v_t / (1 - \beta_2^t)$

$\theta_{t+1} = (1 - \eta \lambda)\, \theta_t - \eta\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$

The term $(1 - \eta\lambda)\theta_t$ shrinks every parameter by the same relative amount, regardless of gradient magnitude. The adaptive step $\hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$ handles optimization. Weight decay handles regularization. The two do not interact.
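The update equations translate directly into code. A minimal scalar sketch (`adamw_step` is a hypothetical helper written for illustration, not a library API):

```python
import math

def adamw_step(theta, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, lam=0.01):
    """One AdamW update for a scalar parameter, following the update equations."""
    m = beta1 * m + (1 - beta1) * g                  # first moment
    v = beta2 * v + (1 - beta2) * g * g              # second moment
    m_hat = m / (1 - beta1 ** t)                     # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Decoupled decay: shrink theta directly, outside the adaptive step.
    theta = (1 - eta * lam) * theta - eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = adamw_step(theta=1.0, g=0.5, m=0.0, v=0.0, t=1)
```

Note how the decay factor $(1 - \eta\lambda)$ multiplies $\theta_t$ before the adaptive step is subtracted; the gradient never touches the decay term.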

Why the Difference Matters

Watch Out

L2 regularization and weight decay are NOT the same in Adam

In SGD, adding $\lambda\theta$ to the gradient produces the update $\theta_{t+1} = (1 - \eta\lambda)\theta_t - \eta g_t$. The gradient term and the decay term are both scaled by $\eta$. L2 regularization and weight decay are identical.

In Adam, the gradient is divided by $\sqrt{\hat{v}_t} + \epsilon$. If $\lambda\theta$ is part of the gradient, it gets divided too. The decay strength becomes parameter-dependent and inversely proportional to gradient magnitude. L2 regularization and weight decay produce different updates. Loshchilov and Hutter showed this mismatch degrades both optimization and generalization.
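A one-step numeric comparison makes the divergence visible. This sketch deliberately drops momentum and bias correction and fixes $\hat{v}_t$ by hand, purely to isolate the decay term (all values illustrative):

```python
import math

eta, lam, eps = 1e-3, 0.1, 1e-8
theta, g, v_hat = 1.0, 0.0, 4.0   # zero loss gradient: only decay moves theta

# Adam + L2: the penalty lam*theta is divided by sqrt(v_hat) + eps.
adam_l2 = theta - eta * (g + lam * theta) / (math.sqrt(v_hat) + eps)
# AdamW: the decay multiplies theta directly and never meets v_hat.
adamw = (1 - eta * lam) * theta - eta * g / (math.sqrt(v_hat) + eps)

# With sqrt(v_hat) = 2, Adam + L2 applies only half the decay AdamW does.
```

The larger $\hat{v}_t$ grows, the more the L2 penalty is diluted, while AdamW's shrinkage stays fixed at $\eta\lambda$.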

Consider two parameters: one with large, consistent gradients (large $\hat{v}_t$) and one with rare, small gradients (small $\hat{v}_t$). Under Adam + L2, the first receives weak decay (its penalty is divided by a large $\sqrt{\hat{v}_t}$) and the second receives strong decay.

This is backwards. Frequently updated parameters are more likely to overfit and should be regularized more aggressively. AdamW fixes this by applying uniform decay.

Comparison Table

| Property | Adam + L2 | AdamW |
| --- | --- | --- |
| Regularization mechanism | $\lambda\theta$ added to gradient | $\lambda\theta$ applied directly to parameters |
| Interaction with adaptive scaling | Yes: decay is divided by $\sqrt{\hat{v}_t}$ | None: decay is independent |
| Effective decay per parameter | Varies inversely with gradient magnitude | Uniform across all parameters |
| Equivalence to SGD weight decay | No | Yes (same principle, adaptive step added) |
| Training transformers | Suboptimal | Standard choice |
| Hyperparameter coupling | $\lambda$ and $\eta$ interact through moments | $\lambda$ and $\eta$ are separable |
| Learning rate warmup interaction | Warmup affects regularization strength | Warmup affects optimization only |

Hyperparameter Separability

A subtle but important consequence: in Adam + L2, the optimal regularization strength $\lambda$ depends on the learning rate $\eta$ because both pass through the adaptive scaling. If you change $\eta$, you must retune $\lambda$.

In AdamW, $\eta$ controls the optimization step and $\lambda$ controls regularization independently. You can tune one without retuning the other. This makes hyperparameter search cheaper and more modular.

When Each Is Used

Example

Training GPT-style language models

All major LLM training runs (GPT-3, LLaMA, PaLM, Chinchilla) use AdamW with $\beta_1 = 0.9$, $\beta_2 = 0.95$, and weight decay $\lambda = 0.1$. Using Adam + L2 here would apply inconsistent regularization across embedding, attention, and MLP parameters, degrading both training stability and downstream performance.

Example

Fine-tuning BERT for classification

Standard practice is AdamW with $\eta \in [2 \times 10^{-5}, 5 \times 10^{-5}]$ and $\lambda = 0.01$. The decoupled weight decay prevents the fine-tuning from distorting the pretrained representations unevenly across parameter groups.

Example

Training a small CNN with standard regularization

For small models where you are already using dropout and data augmentation, the difference between Adam + L2 and AdamW may be minor. Both work. But AdamW is strictly better in principle and costs nothing extra, so there is no reason to prefer Adam + L2.

Common Confusions

Watch Out

AdamW is not a different optimizer from Adam

AdamW computes the same adaptive moment estimates as Adam. The only change is where weight decay is applied. The optimization dynamics (step direction, effective learning rate per parameter) are identical for the gradient-based component. The regularization component is what differs.

Watch Out

Most frameworks default to AdamW, not Adam + L2

PyTorch's torch.optim.AdamW implements decoupled weight decay. If you use torch.optim.Adam with weight_decay > 0, you get L2 regularization inside the gradient, which is the wrong behavior. Check which class your code uses. This is a common source of subtle bugs.
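A minimal PyTorch check of the difference (assumes PyTorch is installed; with a zero gradient, any movement of the parameter comes from the decay term alone):

```python
import torch

w = torch.nn.Parameter(torch.ones(1))
# Decoupled weight decay -- the behavior this article recommends:
opt = torch.optim.AdamW([w], lr=1e-3, weight_decay=0.01)

w.grad = torch.zeros(1)   # zero gradient: only the decay term acts
opt.step()
# AdamW shrinks w by (1 - lr * weight_decay) even with a zero gradient.
# torch.optim.Adam with weight_decay=0.01 would instead add 0.01*w to the
# gradient and push it through the adaptive scaling.
```

Running the same experiment with `torch.optim.Adam` gives a different (and $\hat{v}_t$-dependent) shrinkage, which is exactly the coupling described above.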

Watch Out

The weight decay coefficient is not the same as the L2 coefficient

When converting between Adam + L2 and AdamW, $\lambda_{\text{L2}}$ and $\lambda_{\text{AdamW}}$ are not interchangeable. The effective regularization differs by a factor that depends on $\sqrt{\hat{v}_t}$. Do not copy $\lambda$ values between the two without adjustment.

References

  1. Loshchilov, I. and Hutter, F. "Decoupled Weight Decay Regularization." ICLR 2019. The original AdamW paper. Section 2 derives the L2/weight decay mismatch.
  2. Kingma, D.P. and Ba, J. "Adam: A Method for Stochastic Optimization." ICLR 2015. The original Adam paper. Section 2 defines the algorithm.
  3. Zhang, M., Lucas, J., Ba, J., and Hinton, G. "Lookahead Optimizer: k steps forward, 1 step back." NeurIPS 2019. Section 4 discusses weight decay interactions.
  4. Touvron, H. et al. "LLaMA: Open and Efficient Foundation Language Models." 2023. Section 2.2 specifies AdamW hyperparameters for LLM training.
  5. Zhuang, J. et al. "Understanding AdamW through Proximal Methods and Scale-Freeness." 2022. Proximal interpretation of decoupled weight decay.
  6. Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016. Chapter 8 covers optimization algorithms and regularization.
  7. Brown, T. et al. "Language Models are Few-Shot Learners." NeurIPS 2020. GPT-3 training details specify AdamW with $\beta_2 = 0.95$.