What Each Does
These two techniques are often confused because both involve limiting a quantity during training, but they limit different quantities for different reasons.
Gradient clipping rescales the gradient $g$ when its norm $\|g\|$ exceeds a threshold $c$. The two common variants are:
Gradient norm clipping: if $\|g\| > c$, rescale to $g \leftarrow c \, g / \|g\|$
This preserves the gradient direction but caps the step size. The parameter update becomes $\theta \leftarrow \theta - \eta \min\bigl(1, c/\|g\|\bigr)\, g$.
Value clipping: clip each gradient component independently to $[-c, c]$. This changes the gradient direction and is less commonly used.
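As a concrete sketch, the two variants can be written in a few lines of framework-agnostic Python (`grad` here is a flat list of gradient components; function names are illustrative):

```python
import math

def clip_by_norm(grad, c):
    """Gradient norm clipping: rescale grad to norm c if its norm exceeds c.

    Preserves direction; only the magnitude changes.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > c:
        scale = c / norm
        return [g * scale for g in grad]
    return list(grad)

def clip_by_value(grad, c):
    """Value clipping: clamp each component to [-c, c].

    Can change the gradient direction.
    """
    return [max(-c, min(c, g)) for g in grad]
```

For example, `clip_by_norm([3.0, 4.0], 1.0)` rescales the norm-5 gradient to `[0.6, 0.8]` (same direction, norm 1), while `clip_by_value([3.0, 4.0], 1.0)` returns `[1.0, 1.0]`, a different direction.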
Weight decay shrinks weights toward zero at every step: $\theta \leftarrow (1 - \eta\lambda)\,\theta - \eta \nabla L(\theta)$
where $\lambda$ is the decay coefficient and $\eta$ is the learning rate. For plain SGD, this is equivalent to minimizing $L(\theta) + \frac{\lambda}{2}\|\theta\|^2$.
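For plain SGD, the decayed update and the explicit $L_2$ penalty produce identical steps, which a minimal sketch makes easy to verify (function names are illustrative):

```python
def step_l2_penalty(theta, grad, lr, lam):
    # L2 penalty: the decay term lam * theta enters through the gradient
    return [t - lr * (g + lam * t) for t, g in zip(theta, grad)]

def step_decoupled(theta, grad, lr, lam):
    # Decoupled decay: shrink weights directly, then apply the loss gradient
    return [(1 - lr * lam) * t - lr * g for t, g in zip(theta, grad)]
```

Expanding `step_decoupled` gives `t - lr*g - lr*lam*t`, the same expression as `step_l2_penalty`. Under Adam's per-coordinate gradient scaling the two stop coinciding, which is the motivation for AdamW's decoupled formulation.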
Different Problems, Different Solutions
Gradient clipping addresses training stability. Large gradients cause the optimizer to take steps that overshoot, potentially entering regions of even larger gradients and diverging. This is the exploding gradient problem. Clipping ensures no single step is too large, keeping optimization in a stable region.
Weight decay addresses generalization. Large weights allow the model to memorize training data by creating sharp decision boundaries that do not generalize. Weight decay penalizes this by shrinking all weights toward zero, favoring smoother functions. It acts as a regularizer, not a stability mechanism.
The distinction is temporal: gradient clipping prevents catastrophic failures during a single step, while weight decay provides a persistent bias toward small-weight solutions across all of training.
Side-by-Side Comparison
| Property | Gradient Clipping | Weight Decay |
|---|---|---|
| What it limits | Gradient magnitude | Weight magnitude |
| Primary goal | Training stability | Generalization |
| Problem it solves | Exploding gradients | Overfitting |
| When it activates | Only when $\|g\| > c$ | Every step |
| Effect on gradient direction | Preserved (norm clipping) | Not applicable (modifies weights) |
| Effect on weights | Indirect (limits step size) | Direct (shrinks toward zero) |
| Modifies loss function | No | Yes (adds $\frac{\lambda}{2}\|\theta\|^2$) |
| Hyperparameter | Clip threshold $c$ | Decay coefficient $\lambda$ |
| Typical value | $c = 1.0$ for LLMs | $\lambda = 0.01$ to $0.1$ |
| Without it | Training may diverge | Training may overfit |
| Theoretical grounding | Convergence guarantees for non-smooth objectives | Bayesian interpretation as Gaussian prior |
| Used in | All LLM training, RNNs, RL | All deep learning (via AdamW) |
When Each Is Necessary
Gradient clipping is critical: recurrent networks and large transformers
RNNs suffer from exponential gradient growth across time steps. The gradient through $T$ time steps scales as a product of $T$ Jacobian matrices, and if their spectral radius exceeds 1, gradients explode exponentially in $T$. Clipping is not optional for training RNNs: without it, training diverges.
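The exponential growth is easy to see with a scalar stand-in for the Jacobian spectral radius (illustrative numbers, not a real RNN):

```python
def grad_norm_through_time(jacobian_scale, steps):
    """Gradient magnitude after backpropagating through `steps` time
    steps, each multiplying the gradient by `jacobian_scale` (a scalar
    stand-in for the spectral radius of the recurrent Jacobian)."""
    norm = 1.0
    for _ in range(steps):
        norm *= jacobian_scale
    return norm

# Spectral radius 1.1 over 100 steps blows the norm up by ~1.4e4;
# radius 0.9 shrinks it to ~2.7e-5 (the mirror-image vanishing-gradient case).
```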
For large transformers, gradient spikes occur due to rare but high-loss examples, numerical instabilities in attention (large logits before softmax), or data irregularities. LLM training at scale (100B+ parameters) routinely encounters loss spikes that gradient clipping contains. The standard clip value for LLM training is $c = 1.0$.
Weight decay is critical: overparameterized models
Modern neural networks have far more parameters than training examples. Without regularization, they can memorize the training set completely while failing to generalize. Weight decay provides a continuous pressure toward simpler (smaller-norm) solutions. For transformers trained with AdamW, weight decay is applied to all weight matrices but typically excluded from bias terms and normalization parameters.
Both together: the standard recipe
LLM training universally uses both gradient clipping ($c = 1.0$) and weight decay ($\lambda = 0.1$ is common). They are complementary: clipping prevents rare large gradients from destabilizing the optimization trajectory, while weight decay continuously shapes the solution toward better generalization. Removing either degrades training: without clipping, loss spikes can cause irrecoverable divergence; without weight decay, the model overfits.
Interaction Effects
Gradient clipping and weight decay interact subtly. Weight decay increases the gradient norm (the gradient of the penalty term adds to the gradient of the loss). With large $\lambda$, the penalty gradient can dominate, making clipping activate more frequently. This interaction is another reason why decoupled weight decay (AdamW) is preferred: it separates the weight shrinkage from the gradient computation, avoiding this contamination.
When clipping is active (gradient norm $\|g\|$ exceeds $c$), the effective learning rate for that step is reduced from $\eta$ to $\eta\, c / \|g\|$. If weight decay increases the gradient norm enough to trigger clipping frequently, the effective learning rate for the actual loss gradient is reduced. This is an unintended side effect that decoupled weight decay avoids.
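The effective-learning-rate reduction can be sketched directly (hypothetical helper, assuming standard norm clipping):

```python
def effective_lr(lr, grad_norm, c):
    """Learning rate actually applied to the loss gradient when norm
    clipping with threshold c is in effect for this step."""
    return lr * c / grad_norm if grad_norm > c else lr
```

A base rate of `3e-4` with a gradient norm of 5.0 and `c = 1.0` yields an effective rate of `6e-5`: the step is five times smaller than the nominal learning rate suggests.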
Common Confusions
Gradient clipping is not regularization
Gradient clipping does not add an inductive bias toward simpler models. It is a stability mechanism. A model trained with gradient clipping but no other regularization can still memorize the training set. Clipping prevents divergence, not overfitting.
Gradient norm clipping preserves direction
When using norm clipping (the standard variant), the gradient direction is unchanged. Only the magnitude is reduced. This means the optimizer still moves toward the (local) steepest descent direction but takes a shorter step. Value clipping, by contrast, clips each component independently and can change the direction, but this variant is less common.
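The direction claim can be checked numerically: norm clipping leaves the cosine similarity with the original gradient at exactly 1, while value clipping does not (illustrative sketch):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

g = [3.0, 0.5]
g_norm = math.sqrt(sum(x * x for x in g))

# Norm clipping to threshold 1.0: rescale, direction preserved
norm_clipped = [x * (1.0 / g_norm) for x in g]
# Value clipping to [-1, 1]: clamp per component, direction changed
value_clipped = [max(-1.0, min(1.0, x)) for x in g]
```

Here `cosine(g, norm_clipped)` is 1.0, while `cosine(g, value_clipped)` drops below 1 because the large first component is clamped disproportionately.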
Weight decay is not applied to all parameters
Standard practice excludes bias terms and normalization parameters (LayerNorm/RMSNorm scales) from weight decay. These parameters have few degrees of freedom and benefit less from regularization. Applying weight decay to biases can harm performance. The AdamW implementation in most frameworks supports per-parameter-group decay settings.
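A minimal sketch of the grouping logic, assuming a naming convention in which biases and normalization scales carry "bias" or "norm" in their parameter names (the exact filter varies by codebase):

```python
def split_decay_groups(param_names):
    """Partition parameter names into decayed / non-decayed groups.

    Assumed convention: biases and norm scales contain 'bias' or 'norm'.
    """
    decay, no_decay = [], []
    for name in param_names:
        (no_decay if ("bias" in name or "norm" in name) else decay).append(name)
    return decay, no_decay
```

The two groups map directly onto per-parameter-group optimizer settings, with `weight_decay` set to $\lambda$ for the first group and 0 for the second.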
The clip threshold requires tuning
The standard $c = 1.0$ works for most LLM training, but it is not universally optimal. Too small a clip threshold effectively reduces the learning rate, slowing training. Too large a threshold provides insufficient protection against spikes. The gradient norm distribution should be monitored during training: if clipping activates on more than 10-20% of steps, the threshold may be too aggressive or there may be a deeper optimization issue.
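The monitoring itself is straightforward if gradient norms are logged per step (hypothetical helper):

```python
def clip_activation_fraction(grad_norms, c):
    """Fraction of logged steps on which the gradient norm exceeded the
    clip threshold c, i.e. on which clipping actually fired."""
    if not grad_norms:
        return 0.0
    return sum(1 for n in grad_norms if n > c) / len(grad_norms)
```

A value persistently above roughly 0.1-0.2 is the warning sign described above: the threshold is acting as a hidden learning-rate reduction rather than a rare safety net.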
References
- Pascanu, R., Mikolov, T., and Bengio, Y. (2013). "On the difficulty of training Recurrent Neural Networks." ICML 2013. (Gradient clipping for RNNs, exploding gradient analysis.)
- Zhang, J. et al. (2020). "Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity." ICLR 2020. (Convergence theory for clipped SGD under relaxed smoothness assumptions.)
- Loshchilov, I. and Hutter, F. (2019). "Decoupled Weight Decay Regularization." ICLR 2019. (AdamW, interaction between weight decay and adaptive learning rates.)
- Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. Section 10.11.1 (Gradient clipping for RNNs) and Section 7.1.1 (L2 parameter regularization).
- Hoffmann, J. et al. (2022). "Training Compute-Optimal Large Language Models." NeurIPS 2022. (Chinchilla training recipe: gradient clipping at 1.0, weight decay at 0.1.)
- Touvron, H. et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971. Section 2.2 (Training hyperparameters: clip norm 1.0, weight decay 0.1.)