
Training Techniques

Label Smoothing and Regularization

Replace hard targets with soft targets to prevent overconfidence: the label smoothing formula, its connection to maximum entropy regularization, calibration effects, and when it hurts.



Why This Matters

A classifier trained with standard cross-entropy loss on one-hot labels is incentivized to push logits toward infinity: the loss decreases as the predicted probability of the correct class approaches 1. This produces overconfident predictions where the model assigns near-zero probability to all incorrect classes, even when the true label is ambiguous.

Label smoothing is a simple fix: replace the one-hot target with a soft target that reserves some probability mass for incorrect classes. This single change improves calibration, reduces overfitting (complementing other techniques like dropout and batch normalization), and often improves test accuracy.

Formal Setup

Definition

Hard Target

For a $K$-class classification problem, the standard one-hot target for class $c$ is:

$$y_k = \begin{cases} 1 & \text{if } k = c \\ 0 & \text{if } k \neq c \end{cases}$$

Definition

Label-Smoothed Target

For smoothing parameter $\varepsilon \in [0, 1)$, the label-smoothed target is:

$$y_k^{\text{LS}} = (1 - \varepsilon) \cdot y_k + \frac{\varepsilon}{K}$$

For the correct class $c$: $y_c^{\text{LS}} = 1 - \varepsilon + \varepsilon/K = 1 - \varepsilon(K-1)/K$. For incorrect classes: $y_k^{\text{LS}} = \varepsilon/K$.

The cross-entropy loss with label-smoothed targets is:

$$\mathcal{L}_{\text{LS}} = -\sum_{k=1}^{K} y_k^{\text{LS}} \log p_k = -(1 - \varepsilon)\log p_c - \frac{\varepsilon}{K} \sum_{k=1}^{K} \log p_k$$
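The definitions above can be sketched in a few lines of plain Python (a minimal illustration; the helper names are ours, not from any library):

```python
import math

def smooth_labels(y_onehot, eps):
    """Mix a one-hot target with the uniform distribution over K classes."""
    K = len(y_onehot)
    return [(1 - eps) * y + eps / K for y in y_onehot]

def cross_entropy(target, p):
    """-sum_k target_k * log p_k."""
    return -sum(t * math.log(q) for t, q in zip(target, p))

# K = 4 classes, correct class first, eps = 0.2
y = [1.0, 0.0, 0.0, 0.0]
p = [0.7, 0.1, 0.1, 0.1]
hard_loss = cross_entropy(y, p)                        # -log(0.7) ≈ 0.357
smooth_loss = cross_entropy(smooth_labels(y, 0.2), p)  # ≈ 0.649
```

The smoothed loss exceeds the hard loss because it also charges the model for the probability it assigns to wrong classes.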

Main Theorem

Proposition

Label Smoothing as KL Regularization

Statement

The label-smoothed cross-entropy loss decomposes as:

$$\mathcal{L}_{\text{LS}} = (1 - \varepsilon) \cdot H(y, p) + \varepsilon \cdot H(u, p)$$

where $H(y, p) = -\log p_c$ is the standard cross-entropy with the hard label, $H(u, p) = -\frac{1}{K}\sum_{k=1}^K \log p_k$ is the cross-entropy with the uniform distribution $u$, and $p$ is the model's softmax output.

Equivalently:

$$\mathcal{L}_{\text{LS}} = (1 - \varepsilon) \cdot H(y, p) + \varepsilon \cdot [\mathrm{KL}(u \| p) + \log K]$$

The $\varepsilon \cdot \mathrm{KL}(u \| p)$ term (where $\mathrm{KL}$ is the KL divergence) penalizes the model for being far from the uniform distribution, encouraging higher entropy in predictions.
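Both forms of the decomposition can be verified numerically for any softmax output (a quick sketch with arbitrary example logits):

```python
import math

K, eps, c = 5, 0.1, 2                          # 5 classes, correct class is index 2
logits = [0.3, -1.2, 2.0, 0.0, 0.7]            # arbitrary example logits
Z = sum(math.exp(z) for z in logits)
p = [math.exp(z) / Z for z in logits]          # softmax output

# Smoothed target and smoothed cross-entropy
y_ls = [eps / K + (1 - eps if k == c else 0.0) for k in range(K)]
loss_ls = -sum(y * math.log(q) for y, q in zip(y_ls, p))

H_yp = -math.log(p[c])                          # cross-entropy with the hard label
H_up = -sum(math.log(q) for q in p) / K         # cross-entropy with uniform u
kl_up = sum((1 / K) * math.log((1 / K) / q) for q in p)  # KL(u || p)

rhs1 = (1 - eps) * H_yp + eps * H_up                     # first decomposition
rhs2 = (1 - eps) * H_yp + eps * (kl_up + math.log(K))    # KL form
# loss_ls, rhs1, and rhs2 all agree up to floating-point error
```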

Intuition

Label smoothing adds a penalty for overconfidence. The $\mathrm{KL}(u \| p)$ term is minimized when $p$ is uniform (maximum uncertainty). The standard loss term pushes toward the correct class. The balance between these two forces prevents the model from pushing logits to extreme values.

Proof Sketch

Expand $H(u, p) = -\frac{1}{K}\sum_k \log p_k = \mathrm{KL}(u \| p) + H(u) = \mathrm{KL}(u \| p) + \log K$, using the definition of KL divergence and $H(u) = \log K$. Substituting into $\mathcal{L}_{\text{LS}} = (1-\varepsilon)H(y,p) + \varepsilon H(u,p)$ gives the second form.

Why It Matters

This decomposition reveals that label smoothing is equivalent to standard training plus a maximum entropy regularizer, with the strength of regularization controlled by $\varepsilon$. Typical values are $\varepsilon \in [0.05, 0.1]$. Both the Inception paper (Szegedy et al., 2016) and the original Transformer paper (Vaswani et al., 2017) used $\varepsilon = 0.1$.

Failure Mode

Label smoothing assumes all incorrect classes are equally plausible: the $\varepsilon/K$ mass is spread uniformly across wrong classes. When there is strong class hierarchy (e.g., misclassifying a dog as a cat is more reasonable than misclassifying a dog as a truck), uniform smoothing wastes probability mass on implausible classes. Non-uniform smoothing based on class similarity can help but requires additional information.

Effects on Calibration

A model is well-calibrated if its predicted probabilities match empirical frequencies: when it says "80% probability of class A," class A occurs about 80% of the time.

Standard training with hard labels produces overconfident models: predicted probabilities are too close to 0 and 1. Label smoothing reduces overconfidence because the model can never achieve zero loss: even a predictor that outputs the smoothed target exactly incurs a loss equal to the entropy of that target, which is strictly positive. The model therefore has no incentive to push logits to infinity.
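The loss floor equals the entropy of the smoothed target, which can be computed directly (a short check; the helper name is ours):

```python
import math

def min_smoothed_loss(K, eps):
    """Entropy of the smoothed target: the lowest smoothed loss any predictor can reach."""
    p_c = 1 - eps + eps / K          # smoothed mass on the correct class
    p_k = eps / K                    # smoothed mass on each of the K-1 wrong classes
    return -(p_c * math.log(p_c) + (K - 1) * p_k * math.log(p_k))

floor = min_smoothed_loss(K=1000, eps=0.1)   # ≈ 1.01, strictly positive
```

Because this floor is nonzero, the gradient vanishes at a finite-confidence optimum rather than in the limit of infinite logits.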

Müller et al. (2019) showed that label smoothing improves calibration (lower expected calibration error) but introduces a subtle side effect: the penultimate-layer representations become more tightly clustered. This clustering can hurt knowledge distillation because the teacher's soft outputs contain less inter-class information.

Effect on Logit Magnitudes

Without label smoothing, minimizing cross-entropy drives the logit of the correct class toward $+\infty$ and all other logits toward $-\infty$. The loss approaches zero only in this limit.

With label smoothing at parameter $\varepsilon$, the optimal logits satisfy:

$$z_c^* - z_k^* = \log\frac{K(1 - \varepsilon) + \varepsilon}{\varepsilon} \quad \text{for all } k \neq c$$

This is finite: the gap between the correct-class logit and the other logits depends only on $\varepsilon$ and $K$. For $K = 1000$ and $\varepsilon = 0.1$, the optimal gap is $\log(9001) \approx 9.1$, which is moderate.
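As a sanity check, a softmax whose correct-class logit exceeds the others by exactly this gap recovers the smoothed target (an illustrative sketch; the helper name is ours):

```python
import math

def optimal_gap(K, eps):
    """Logit gap z_c* - z_k* at the minimizer of the smoothed loss."""
    return math.log((K * (1 - eps) + eps) / eps)

K, eps = 1000, 0.1
gap = optimal_gap(K, eps)            # log(9001) ≈ 9.105

# Softmax with z_c = gap and z_k = 0 for all other classes:
Z = math.exp(gap) + (K - 1)          # partition function
p_c = math.exp(gap) / Z              # equals 1 - eps + eps/K = 0.9001
p_k = 1.0 / Z                        # equals eps/K = 1e-4
```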

When NOT to Use Label Smoothing

Calibrated probability estimates are required. If downstream decisions depend on the exact predicted probability (e.g., medical diagnosis thresholds, betting markets), label smoothing changes the probability scale in ways that require recalibration. The model no longer tries to output true probabilities but instead targets a smoothed version.

Knowledge distillation from the model. Muller et al. (2019) found that label-smoothed teachers produce worse students than hard-label teachers. The clustering of penultimate features reduces the information content of soft targets.

Extremely large label spaces. With $K = 10{,}000$ classes and $\varepsilon = 0.1$, each wrong class gets probability $10^{-5}$ from smoothing. This negligible amount provides no regularization benefit for rare classes while still degrading the signal for common classes.

Common Confusions

Watch Out

Label smoothing is not the same as mixup

Mixup creates new training examples by interpolating between input-label pairs: $(x', y') = (\lambda x_a + (1-\lambda)x_b,\; \lambda y_a + (1-\lambda)y_b)$. The soft labels in mixup reflect actual interpolation of inputs. Label smoothing applies a fixed softening to every example regardless of the input. Mixup is data augmentation; label smoothing is regularization.
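The contrast shows up directly in how the targets are built (a toy sketch with made-up labels; in full mixup the inputs $x$ are interpolated the same way, omitted here):

```python
# Mixup: the soft label depends on a sampled lambda and on BOTH examples.
lam = 0.7
ya = [1.0, 0.0, 0.0]   # example a is class 0
yb = [0.0, 1.0, 0.0]   # example b is class 1
y_mix = [lam * a + (1 - lam) * b for a, b in zip(ya, yb)]   # [0.7, 0.3, 0.0]

# Label smoothing: the same fixed softening for every example; inputs untouched.
eps, K = 0.1, 3
y_ls = [(1 - eps) * a + eps / K for a in ya]   # [0.933..., 0.033..., 0.033...]
```

Note that the mixup target puts mass only on the two classes actually mixed, while the smoothed target spreads mass over every class.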

Watch Out

Label smoothing does not change the argmax

The optimal prediction under label smoothing still assigns the highest probability to the correct class. Smoothing only prevents the model from being infinitely confident. The ranking of classes is preserved; only the magnitude of probabilities changes.

Canonical Examples

Example

Effect on a 3-class problem

With $K = 3$ and $\varepsilon = 0.1$, the target for class 1 changes from $(1, 0, 0)$ to $(0.933, 0.033, 0.033)$. The cross-entropy loss for a model predicting $(0.9, 0.05, 0.05)$: the standard loss is $-\log(0.9) \approx 0.105$; the smoothed loss is $-0.933\log(0.9) - 0.033\log(0.05) - 0.033\log(0.05) \approx 0.098 + 0.100 + 0.100 = 0.298$. The smoothed loss is higher because it also penalizes the low probability assigned to incorrect classes.
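The arithmetic in this example can be checked in a few lines:

```python
import math

p = [0.9, 0.05, 0.05]                           # model's predicted distribution
y_ls = [1 - 0.1 + 0.1 / 3, 0.1 / 3, 0.1 / 3]    # smoothed target for class 1 (K=3, eps=0.1)

standard = -math.log(p[0])                                 # ≈ 0.105
smoothed = -sum(y * math.log(q) for y, q in zip(y_ls, p))  # ≈ 0.298
```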

Exercises

ExerciseCore

Problem

For $K = 10$ classes and $\varepsilon = 0.1$, what is the label-smoothed target vector for class 3? What is the minimum achievable smoothed cross-entropy loss, i.e., the loss of a predictor that always outputs the label-smoothed target exactly?

ExerciseAdvanced

Problem

Prove that the optimal softmax output $p^*$ minimizing the label-smoothed loss satisfies $p_c^* / p_k^* = (K(1-\varepsilon) + \varepsilon)/\varepsilon$ for all $k \neq c$. What does this ratio approach as $\varepsilon \to 0$?

References

Canonical:

  • Szegedy et al., "Rethinking the Inception Architecture for Computer Vision" (CVPR 2016), Section 7
  • Vaswani et al., "Attention Is All You Need" (NeurIPS 2017), Section 5.4

Current:

  • Müller, Kornblith, Hinton, "When Does Label Smoothing Help?" (NeurIPS 2019)

  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009)

Next Topics

Label smoothing connects to the broader study of regularization, calibration, and training techniques for neural networks.

Last reviewed: April 2026
