What Each Does
These two techniques are often confused because both involve limiting a quantity during training, but they limit different quantities for different reasons.
Gradient clipping rescales the gradient $g$ when its norm $\|g\|$ exceeds a threshold $c$. The two common variants are:
Gradient norm clipping: if $\|g\| > c$, rescale to $g \leftarrow c \, g / \|g\|$
This preserves the gradient direction but caps the step size. The parameter update becomes $\theta \leftarrow \theta - \eta \min\bigl(1, c/\|g\|\bigr)\, g$.
Value clipping: clip each gradient component independently to $[-c, c]$. This changes the gradient direction and is less commonly used.
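As a concrete sketch, the two variants can be written in a few lines of framework-agnostic Python (`grad` here is a flat list of gradient components; function names are illustrative):

```python
import math

def clip_by_norm(grad, c):
    """Gradient norm clipping: rescale grad to norm c if its norm exceeds c.

    Preserves direction; only the magnitude changes.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > c:
        scale = c / norm
        return [g * scale for g in grad]
    return list(grad)

def clip_by_value(grad, c):
    """Value clipping: clamp each component to [-c, c].

    Can change the gradient direction.
    """
    return [max(-c, min(c, g)) for g in grad]
```

For example, `clip_by_norm([3.0, 4.0], 1.0)` rescales the norm-5 gradient to `[0.6, 0.8]` (same direction, norm 1), while `clip_by_value([3.0, 4.0], 1.0)` returns `[1.0, 1.0]`, a different direction.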
Weight decay shrinks weights toward zero at every step: $\theta \leftarrow (1 - \eta\lambda)\,\theta - \eta \nabla L(\theta)$
where $\lambda$ is the decay coefficient and $\eta$ is the learning rate. For plain SGD, this is equivalent to minimizing $L(\theta) + \frac{\lambda}{2}\|\theta\|^2$.
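For plain SGD, the decayed update and the explicit $L_2$ penalty produce identical steps, which a minimal sketch makes easy to verify (function names are illustrative):

```python
def step_l2_penalty(theta, grad, lr, lam):
    # L2 penalty: the decay term lam * theta enters through the gradient
    return [t - lr * (g + lam * t) for t, g in zip(theta, grad)]

def step_decoupled(theta, grad, lr, lam):
    # Decoupled decay: shrink weights directly, then apply the loss gradient
    return [(1 - lr * lam) * t - lr * g for t, g in zip(theta, grad)]
```

Expanding `step_decoupled` gives `t - lr*g - lr*lam*t`, the same expression as `step_l2_penalty`. Under Adam's per-coordinate gradient scaling the two stop coinciding, which is the motivation for AdamW's decoupled formulation.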
Different Problems, Different Solutions
Gradient clipping addresses training stability. Large gradients cause the optimizer to take steps that overshoot, potentially entering regions of even larger gradients and diverging. This is the exploding gradient problem. Clipping ensures no single step is too large, keeping optimization in a stable region.
Weight decay addresses generalization. Large weights allow the model to memorize training data by creating sharp decision boundaries that do not generalize. Weight decay penalizes this by shrinking all weights toward zero, favoring smoother functions. It acts as a regularizer, not a stability mechanism.
The distinction is temporal: gradient clipping prevents catastrophic failures during a single step, while weight decay provides a persistent bias toward small-weight solutions across all of training.
Side-by-Side Comparison
| Property | Gradient Clipping | Weight Decay |
|---|---|---|
| What it limits | Gradient magnitude | Weight magnitude |
| Primary goal | Training stability | Generalization |
| Problem it solves | Exploding gradients | Overfitting |
| When it activates | Only when $\|g\| > c$ | Every step |
| Effect on gradient direction | Preserved (norm clipping) | Not applicable (modifies weights) |
| Effect on weights | Indirect (limits step size) | Direct (shrinks toward zero) |
| Modifies loss function | No | Yes (adds $\frac{\lambda}{2}\|\theta\|^2$) |
| Hyperparameter | Clip threshold $c$ | Decay coefficient $\lambda$ |
| Typical value | $c = 1.0$ for LLMs | $\lambda = 0.01$ to $0.1$ |
| Without it | Training may diverge | Training may overfit |
| Theoretical grounding | Convergence guarantees for non-smooth objectives | Bayesian interpretation as Gaussian prior |
| Used in | All LLM training, RNNs, RL | All deep learning (via AdamW) |
When Each Is Necessary
Gradient clipping is critical: recurrent networks and large transformers
RNNs suffer from exponential gradient growth across time steps. The gradient through $T$ time steps scales as a product of $T$ Jacobian matrices, and if their spectral radius exceeds 1, gradients explode exponentially in $T$. Clipping is not optional for training RNNs: without it, training diverges.
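The exponential growth is easy to see with a scalar stand-in for the Jacobian spectral radius (illustrative numbers, not a real RNN):

```python
def grad_norm_through_time(jacobian_scale, steps):
    """Gradient magnitude after backpropagating through `steps` time
    steps, each multiplying the gradient by `jacobian_scale` (a scalar
    stand-in for the spectral radius of the recurrent Jacobian)."""
    norm = 1.0
    for _ in range(steps):
        norm *= jacobian_scale
    return norm

# Spectral radius 1.1 over 100 steps blows the norm up by ~1.4e4;
# radius 0.9 shrinks it to ~2.7e-5 (the mirror-image vanishing-gradient case).
```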
For large transformers, gradient spikes occur due to rare but high-loss examples, numerical instabilities in attention (large logits before softmax), or data irregularities. LLM training at scale (100B+ parameters) routinely encounters loss spikes that gradient clipping contains. The standard clip value for LLM training is $c = 1.0$.
Weight decay is critical: overparameterized models
Modern neural networks have far more parameters than training examples. Without regularization, they can memorize the training set completely while failing to generalize. Weight decay provides a continuous pressure toward simpler (smaller-norm) solutions. For transformers trained with AdamW, weight decay is applied to all weight matrices but typically excluded from bias terms and normalization parameters.
Both together: the standard recipe
LLM training universally uses both gradient clipping ($c = 1.0$) and weight decay ($\lambda = 0.1$ is common). They are complementary: clipping prevents rare large gradients from destabilizing the optimization trajectory, while weight decay continuously shapes the solution toward better generalization. Removing either degrades training: without clipping, loss spikes can cause irrecoverable divergence; without weight decay, the model overfits.
Interaction Effects
Gradient clipping and weight decay interact subtly. Weight decay increases the gradient norm (the gradient of the penalty term adds to the gradient of the loss). With large $\lambda$, the penalty gradient can dominate, making clipping activate more frequently. This interaction is another reason why decoupled weight decay (AdamW) is preferred: it separates the weight shrinkage from the gradient computation, avoiding this contamination.
When clipping is active (gradient norm $\|g\|$ exceeds $c$), the effective learning rate for that step is reduced from $\eta$ to $\eta\, c / \|g\|$. If weight decay increases the gradient norm enough to trigger clipping frequently, the effective learning rate for the actual loss gradient is reduced. This is an unintended side effect that decoupled weight decay avoids.
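The effective-learning-rate reduction can be sketched directly (hypothetical helper, assuming standard norm clipping):

```python
def effective_lr(lr, grad_norm, c):
    """Learning rate actually applied to the loss gradient when norm
    clipping with threshold c is in effect for this step."""
    return lr * c / grad_norm if grad_norm > c else lr
```

A base rate of `3e-4` with a gradient norm of 5.0 and `c = 1.0` yields an effective rate of `6e-5`: the step is five times smaller than the nominal learning rate suggests.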
Common Confusions
Gradient clipping is not regularization
Gradient clipping does not add an inductive bias toward simpler models. It is a stability mechanism. A model trained with gradient clipping but no other regularization can still memorize the training set. Clipping prevents divergence, not overfitting.
Gradient norm clipping preserves direction
When using norm clipping (the standard variant), the gradient direction is unchanged. Only the magnitude is reduced. This means the optimizer still moves toward the (local) steepest descent direction but takes a shorter step. Value clipping, by contrast, clips each component independently and can change the direction, but this variant is less common.
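The direction claim can be checked numerically: norm clipping leaves the cosine similarity with the original gradient at exactly 1, while value clipping does not (illustrative sketch):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

g = [3.0, 0.5]
g_norm = math.sqrt(sum(x * x for x in g))

# Norm clipping to threshold 1.0: rescale, direction preserved
norm_clipped = [x * (1.0 / g_norm) for x in g]
# Value clipping to [-1, 1]: clamp per component, direction changed
value_clipped = [max(-1.0, min(1.0, x)) for x in g]
```

Here `cosine(g, norm_clipped)` is 1.0, while `cosine(g, value_clipped)` drops below 1 because the large first component is clamped disproportionately.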
Weight decay is not applied to all parameters
Standard practice excludes bias terms and normalization parameters (LayerNorm/RMSNorm scales) from weight decay. These parameters have few degrees of freedom and benefit less from regularization. Applying weight decay to biases can harm performance. The AdamW implementation in most frameworks supports per-parameter-group decay settings.
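A minimal sketch of the grouping logic, assuming a naming convention in which biases and normalization scales carry "bias" or "norm" in their parameter names (the exact filter varies by codebase):

```python
def split_decay_groups(param_names):
    """Partition parameter names into decayed / non-decayed groups.

    Assumed convention: biases and norm scales contain 'bias' or 'norm'.
    """
    decay, no_decay = [], []
    for name in param_names:
        (no_decay if ("bias" in name or "norm" in name) else decay).append(name)
    return decay, no_decay
```

The two groups map directly onto per-parameter-group optimizer settings, with `weight_decay` set to $\lambda$ for the first group and 0 for the second.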
The clip threshold requires tuning
The standard $c = 1.0$ works for most LLM training, but it is not universally optimal. Too small a clip threshold effectively reduces the learning rate, slowing training. Too large a threshold provides insufficient protection against spikes. The gradient norm distribution should be monitored during training: if clipping activates on more than 10-20% of steps, the threshold may be too aggressive or there may be a deeper optimization issue.
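The monitoring itself is straightforward if gradient norms are logged per step (hypothetical helper):

```python
def clip_activation_fraction(grad_norms, c):
    """Fraction of logged steps on which the gradient norm exceeded the
    clip threshold c, i.e. on which clipping actually fired."""
    if not grad_norms:
        return 0.0
    return sum(1 for n in grad_norms if n > c) / len(grad_norms)
```

A value persistently above roughly 0.1-0.2 is the warning sign described above: the threshold is acting as a hidden learning-rate reduction rather than a rare safety net.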
References
- Pascanu, R., Mikolov, T., and Bengio, Y. (2013). "On the difficulty of training Recurrent Neural Networks." ICML 2013. (Gradient clipping for RNNs, exploding gradient analysis.)
- Zhang, J. et al. (2020). "Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity." ICLR 2020. (Convergence theory for clipped SGD under relaxed smoothness assumptions.)
- Loshchilov, I. and Hutter, F. (2019). "Decoupled Weight Decay Regularization." ICLR 2019. (AdamW, interaction between weight decay and adaptive learning rates.)
- Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. Section 10.11.1 (Gradient clipping for RNNs) and Section 7.1.1 (L2 parameter regularization).
- Hoffmann, J. et al. (2022). "Training Compute-Optimal Large Language Models." NeurIPS 2022. (Chinchilla training recipe: gradient clipping at 1.0, weight decay at 0.1.)
- Touvron, H. et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971. Section 2.2 (Training hyperparameters: clip norm 1.0, weight decay 0.1.)