Training Techniques
Label Smoothing and Regularization
Replace hard targets with soft targets to prevent overconfidence: the label smoothing formula, its connection to maximum entropy regularization, calibration effects, and when it hurts.
Why This Matters
A classifier trained with standard cross-entropy loss on one-hot labels is incentivized to push logits toward infinity: the loss decreases as the predicted probability of the correct class approaches 1. This produces overconfident predictions where the model assigns near-zero probability to all incorrect classes, even when the true label is ambiguous.
Label smoothing is a simple fix: replace the one-hot target with a soft target that reserves some probability mass for incorrect classes. This single change improves calibration, reduces overfitting (complementing other techniques like dropout and batch normalization), and often improves test accuracy.
Formal Setup
Hard Target
For a $K$-class classification problem, the standard one-hot target for class $y$ is:

$$q_i = \begin{cases} 1 & \text{if } i = y \\ 0 & \text{otherwise} \end{cases}$$
Label-Smoothed Target
For smoothing parameter $\epsilon \in (0, 1)$, the label-smoothed target is:

$$q_i' = (1 - \epsilon)\, q_i + \frac{\epsilon}{K}$$

For the correct class $i = y$: $q_y' = 1 - \epsilon + \epsilon/K$. For incorrect classes: $q_i' = \epsilon/K$.
The cross-entropy loss with label-smoothed targets is:

$$\mathcal{L}_{\mathrm{LS}} = -\sum_{i=1}^{K} q_i' \log p_i$$
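As a sketch, the smoothed targets and loss above can be written in a few lines of NumPy (the function names here are illustrative, not from any particular library):

```python
import numpy as np

def smooth_targets(y, num_classes, eps):
    """Integer labels -> label-smoothed target rows q' = (1 - eps) * q + eps / K."""
    q = np.full((len(y), num_classes), eps / num_classes)
    q[np.arange(len(y)), y] += 1.0 - eps
    return q

def cross_entropy(targets, probs):
    """Mean of -sum_i q'_i log p_i over the batch."""
    return float(np.mean(-np.sum(targets * np.log(probs), axis=1)))

# One sample, K = 3, eps = 0.1: the target becomes (0.9333, 0.0333, 0.0333)
q = smooth_targets(np.array([0]), num_classes=3, eps=0.1)
```

With `eps = 0` this reduces to standard one-hot cross-entropy.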
Main Theorem
Label Smoothing as KL Regularization
Statement
The label-smoothed cross-entropy loss decomposes as:

$$\mathcal{L}_{\mathrm{LS}} = (1 - \epsilon)\, H(q, p) + \epsilon\, H(u, p)$$

where $H(q, p)$ is the standard cross-entropy with the hard label, $H(u, p)$ is the cross-entropy with the uniform distribution $u_i = 1/K$, and $p$ is the model's softmax output.

Equivalently:

$$\mathcal{L}_{\mathrm{LS}} = (1 - \epsilon)\, H(q, p) + \epsilon \left( D_{\mathrm{KL}}(u \,\|\, p) + \log K \right)$$

The term $\epsilon\, D_{\mathrm{KL}}(u \,\|\, p)$ (where $D_{\mathrm{KL}}$ is the KL divergence) penalizes the model for being far from the uniform distribution, encouraging higher entropy in predictions.
Intuition
Label smoothing adds a penalty for overconfidence. The term $D_{\mathrm{KL}}(u \,\|\, p)$ is minimized when $p$ is uniform (maximum uncertainty). The standard loss term $H(q, p)$ pushes $p$ toward the correct class. The balance between these two forces prevents the model from pushing logits to extreme values.
Proof Sketch
Expand $H(q', p) = -\sum_i \left[ (1 - \epsilon)\, q_i + \epsilon/K \right] \log p_i = (1 - \epsilon)\, H(q, p) + \epsilon\, H(u, p)$. Then substitute $H(u, p) = D_{\mathrm{KL}}(u \,\|\, p) + \log K$ into the second term, using the definition of KL divergence.
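The decomposition is easy to verify numerically. This sketch checks both forms against a random softmax output (all values illustrative):

```python
import numpy as np

K, eps = 5, 0.1
rng = np.random.default_rng(0)
z = rng.normal(size=K)
p = np.exp(z) / np.exp(z).sum()              # softmax output

q = np.zeros(K); q[2] = 1.0                  # hard target, correct class 2
u = np.full(K, 1.0 / K)                      # uniform distribution
q_s = (1 - eps) * q + eps * u                # smoothed target

H = lambda a, b: -np.sum(a * np.log(b))      # cross-entropy H(a, b)
kl = lambda a, b: np.sum(a * np.log(a / b))  # KL divergence D_KL(a || b)

lhs = H(q_s, p)
rhs = (1 - eps) * H(q, p) + eps * H(u, p)
# H(u, p) = D_KL(u || p) + log K, so the KL form agrees as well:
rhs_kl = (1 - eps) * H(q, p) + eps * (kl(u, p) + np.log(K))
```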
Why It Matters
This decomposition reveals that label smoothing is equivalent to standard training plus a maximum entropy regularizer. The strength of regularization is controlled by $\epsilon$. Typical values are around $0.1$: the original Inception v2 paper (Szegedy et al., 2016) used $\epsilon = 0.1$, and the original Transformer paper (Vaswani et al., 2017) used $\epsilon_{ls} = 0.1$.
Failure Mode
Label smoothing assumes all incorrect classes are equally plausible (the added mass $\epsilon/K$ is uniform across wrong classes). When there is a strong class hierarchy (e.g., misclassifying a dog as a cat is more reasonable than misclassifying a dog as a truck), uniform smoothing wastes probability mass on implausible classes. Non-uniform smoothing based on class similarity can help but requires additional information.
Effects on Calibration
A model is well-calibrated if its predicted probabilities match empirical frequencies: when it says "80% probability of class A," class A occurs about 80% of the time.
Standard training with hard labels produces overconfident models: predicted probabilities are too close to 0 and 1. Label smoothing reduces overconfidence because the model can never achieve zero loss (even a perfect predictor that outputs the smoothed target $q'$ exactly has loss $H(q') > 0$), so it does not need to push logits to infinity.
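To see the nonzero loss floor concretely, this sketch computes the loss of a predictor that outputs the smoothed target exactly ($K = 3$ and $\epsilon = 0.1$ are illustrative choices):

```python
import numpy as np

K, eps = 3, 0.1
q_s = np.full(K, eps / K)
q_s[0] += 1.0 - eps                      # smoothed target for class 0

# The "perfect" predictor p = q_s still pays the entropy of the smoothed target
min_loss = -np.sum(q_s * np.log(q_s))    # H(q') is strictly positive
```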
Müller et al. (2019) showed that label smoothing improves calibration (lower expected calibration error) but introduces a subtle bias: the penultimate-layer representations become more tightly clustered by class. This clustering can hurt knowledge distillation because the teacher's soft outputs contain less inter-class information.
Effect on Logit Magnitudes
Without label smoothing, minimizing cross-entropy drives the logit of the correct class toward $+\infty$ and all other logits toward $-\infty$. The loss approaches zero only in this limit.
With label smoothing at parameter $\epsilon$, the optimal logits satisfy:

$$z_y - z_i = \log \frac{1 - \epsilon + \epsilon/K}{\epsilon/K} \quad \text{for all } i \neq y$$

This is finite. The gap between the correct-class logit and the other logits is a fixed quantity that depends on $\epsilon$ and $K$. For $K = 10$ and $\epsilon = 0.1$: the optimal gap is $\log(0.91 / 0.01) = \log 91 \approx 4.5$, which is moderate.
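One way to sanity-check the finite gap is to minimize the smoothed loss over the logits directly: the gradient of softmax cross-entropy with respect to the logits is $p - q'$. The values $K = 10$ and $\epsilon = 0.1$ below are illustrative:

```python
import numpy as np

K, eps, y = 10, 0.1, 0
q_s = np.full(K, eps / K)
q_s[y] += 1.0 - eps                        # smoothed target

z = np.zeros(K)
for _ in range(5000):                      # plain gradient descent on the logits
    p = np.exp(z - z.max()); p /= p.sum()  # softmax
    z -= 0.5 * (p - q_s)                   # grad of H(q_s, softmax(z)) is p - q_s

gap = z[y] - z[1]                          # converged correct-vs-wrong logit gap
closed_form = np.log((1 - eps + eps / K) / (eps / K))  # log 91, about 4.51
```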
When NOT to Use Label Smoothing
Calibrated probability estimates are required. If downstream decisions depend on the exact predicted probability (e.g., medical diagnosis thresholds, betting markets), label smoothing changes the probability scale in ways that require recalibration. The model no longer tries to output true probabilities but instead targets a smoothed version.
Knowledge distillation from the model. Müller et al. (2019) found that label-smoothed teachers produce worse students than hard-label teachers. The clustering of penultimate-layer features reduces the information content of the soft targets.
Extreme class imbalance. With $K = 10{,}000$ classes and $\epsilon = 0.1$, each wrong class gets probability $\epsilon/K = 10^{-5}$ from smoothing. This negligible amount provides no regularization benefit for rare classes while still degrading the signal for common classes.
Common Confusions
Label smoothing is not the same as mixup
Mixup creates new training examples by interpolating between input-label pairs: $\tilde{x} = \lambda x_i + (1 - \lambda) x_j$, $\tilde{y} = \lambda y_i + (1 - \lambda) y_j$, with $\lambda$ typically drawn from a Beta distribution. The soft labels in mixup reflect actual interpolation of inputs. Label smoothing applies a fixed softening to every example regardless of the input. Mixup is data augmentation; label smoothing is regularization.
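For contrast, here is a minimal mixup sketch (function and parameter names are illustrative): the soft label depends on the sampled $\lambda$ and the particular pair, whereas label smoothing would soften every target identically.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Interpolate one pair of (input, one-hot label) examples."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)  # mixing weight lambda ~ Beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```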
Label smoothing does not change the argmax
The optimal prediction under label smoothing still assigns the highest probability to the correct class. Smoothing only prevents the model from being infinitely confident. The ranking of classes is preserved; only the magnitude of probabilities changes.
Canonical Examples
Effect on a 3-class problem
With $\epsilon = 0.1$ and $K = 3$, the target for class 1 changes from $(1, 0, 0)$ to approximately $(0.933, 0.033, 0.033)$. The cross-entropy loss for a model predicting $p = (0.7, 0.2, 0.1)$: standard loss is $-\log 0.7 \approx 0.357$; smoothed loss is $\approx 0.463$. The smoothed loss is higher because it also penalizes the low probability assigned to incorrect classes.
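These numbers can be checked directly; the predicted distribution $(0.7, 0.2, 0.1)$ is an illustrative choice:

```python
import numpy as np

eps, K = 0.1, 3
p = np.array([0.7, 0.2, 0.1])            # model's predicted probabilities (illustrative)
q_hard = np.array([1.0, 0.0, 0.0])       # one-hot target for class 1
q_soft = (1 - eps) * q_hard + eps / K    # approximately (0.933, 0.033, 0.033)

standard = -np.sum(q_hard * np.log(p))   # -log 0.7, about 0.357
smoothed = -np.sum(q_soft * np.log(p))   # about 0.463, higher than the standard loss
```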
Exercises
Problem
For $K = 5$ classes and $\epsilon = 0.2$, what is the label-smoothed target vector for class 3? What is the minimum achievable smoothed cross-entropy loss, attained by a predictor that always outputs the label-smoothed target?
Problem
Prove that the optimal softmax output $p^*$ minimizing the label-smoothed loss satisfies $p_y^* / p_i^* = (1 - \epsilon + \epsilon/K) / (\epsilon/K)$ for all $i \neq y$. What does this ratio approach as $\epsilon \to 0$?
References
Canonical:
- Szegedy et al., "Rethinking the Inception Architecture" (CVPR 2016), Section 7
- Vaswani et al., "Attention Is All You Need" (NeurIPS 2017), Section 5.4
Current:
- Müller, Kornblith, Hinton, "When Does Label Smoothing Help?" (NeurIPS 2019)
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009)
Next Topics
Label smoothing connects to the broader study of regularization, calibration, and training techniques for neural networks.
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Logistic Regression (Layer 1)
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Differentiation in Rn (Layer 0A)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)