
Training Techniques

Label Smoothing and Regularization

Replace hard targets with soft targets to prevent overconfidence: the label smoothing formula, its connection to maximum entropy regularization, calibration effects, and when it hurts.



Why This Matters

A classifier trained with standard cross-entropy loss on one-hot labels is incentivized to push logits toward infinity: the loss decreases as the predicted probability of the correct class approaches 1. This produces overconfident predictions where the model assigns near-zero probability to all incorrect classes, even when the true label is ambiguous.

Label smoothing is a simple fix: replace the one-hot target with a soft target that reserves some probability mass for incorrect classes. This single change improves calibration, reduces overfitting (complementing other techniques like dropout and batch normalization), and often improves test accuracy.

Formal Setup

Definition

Hard Target

For a $K$-class classification problem, the standard one-hot target for class $c$ is:

$$y_k = \begin{cases} 1 & \text{if } k = c \\ 0 & \text{if } k \neq c \end{cases}$$

Definition

Label-Smoothed Target

For smoothing parameter $\varepsilon \in [0, 1)$, the label-smoothed target is:

$$y_k^{\text{LS}} = (1 - \varepsilon) \cdot y_k + \frac{\varepsilon}{K}$$

For the correct class $c$: $y_c^{\text{LS}} = 1 - \varepsilon + \varepsilon/K = 1 - \varepsilon(K-1)/K$. For incorrect classes: $y_k^{\text{LS}} = \varepsilon/K$.

The cross-entropy loss with label-smoothed targets is:

$$\mathcal{L}_{\text{LS}} = -\sum_{k=1}^{K} y_k^{\text{LS}} \log p_k = -(1 - \varepsilon)\log p_c - \frac{\varepsilon}{K} \sum_{k=1}^{K} \log p_k$$
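The definitions above can be sketched in a few lines of plain Python (a minimal illustration; the helper names are ours, not from any library):

```python
import math

def smooth_labels(y_onehot, eps):
    """Mix a one-hot target with the uniform distribution over K classes."""
    K = len(y_onehot)
    return [(1 - eps) * y + eps / K for y in y_onehot]

def cross_entropy(target, p):
    """-sum_k target_k * log p_k."""
    return -sum(t * math.log(q) for t, q in zip(target, p))

# K = 4 classes, correct class first, eps = 0.2
y = [1.0, 0.0, 0.0, 0.0]
p = [0.7, 0.1, 0.1, 0.1]
hard_loss = cross_entropy(y, p)                        # -log(0.7) ≈ 0.357
smooth_loss = cross_entropy(smooth_labels(y, 0.2), p)  # ≈ 0.649
```

The smoothed loss exceeds the hard loss because it also charges the model for the probability it assigns to wrong classes.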

Main Theorem

Proposition

Label Smoothing as KL Regularization

Statement

The label-smoothed cross-entropy loss decomposes as:

$$\mathcal{L}_{\text{LS}} = (1 - \varepsilon) \cdot H(y, p) + \varepsilon \cdot H(u, p)$$

where $H(y, p) = -\log p_c$ is the standard cross-entropy with the hard label, $H(u, p) = -\frac{1}{K}\sum_{k=1}^K \log p_k$ is the cross-entropy with the uniform distribution $u$, and $p$ is the model's softmax output.

Equivalently:

$$\mathcal{L}_{\text{LS}} = (1 - \varepsilon) \cdot H(y, p) + \varepsilon \cdot [\mathrm{KL}(u \| p) + \log K]$$

The $\varepsilon \cdot \mathrm{KL}(u \| p)$ term (where $\mathrm{KL}$ is the KL divergence) penalizes the model for being far from the uniform distribution, encouraging higher entropy in predictions.
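Both forms of the decomposition can be verified numerically for any softmax output (a quick sketch with arbitrary example logits):

```python
import math

K, eps, c = 5, 0.1, 2                          # 5 classes, correct class is index 2
logits = [0.3, -1.2, 2.0, 0.0, 0.7]            # arbitrary example logits
Z = sum(math.exp(z) for z in logits)
p = [math.exp(z) / Z for z in logits]          # softmax output

# Smoothed target and smoothed cross-entropy
y_ls = [eps / K + (1 - eps if k == c else 0.0) for k in range(K)]
loss_ls = -sum(y * math.log(q) for y, q in zip(y_ls, p))

H_yp = -math.log(p[c])                          # cross-entropy with the hard label
H_up = -sum(math.log(q) for q in p) / K         # cross-entropy with uniform u
kl_up = sum((1 / K) * math.log((1 / K) / q) for q in p)  # KL(u || p)

rhs1 = (1 - eps) * H_yp + eps * H_up                     # first decomposition
rhs2 = (1 - eps) * H_yp + eps * (kl_up + math.log(K))    # KL form
# loss_ls, rhs1, and rhs2 all agree up to floating-point error
```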

Intuition

Label smoothing adds a penalty for overconfidence. The $\mathrm{KL}(u \| p)$ term is minimized when $p$ is uniform (maximum uncertainty). The standard loss term pushes toward the correct class. The balance between these two forces prevents the model from pushing logits to extreme values.

Proof Sketch

Expand $H(u, p) = -\frac{1}{K}\sum_k \log p_k = \mathrm{KL}(u \| p) + H(u) = \mathrm{KL}(u \| p) + \log K$, using the definition of KL divergence and $H(u) = \log K$. Substituting into $\mathcal{L}_{\text{LS}} = (1-\varepsilon)H(y,p) + \varepsilon H(u,p)$ gives the second form.

Why It Matters

This decomposition reveals that label smoothing is equivalent to standard training plus a maximum entropy regularizer, with the strength of regularization controlled by $\varepsilon$. Typical values are $\varepsilon \in [0.05, 0.1]$. Both the Inception paper (Szegedy et al., 2016) and the original Transformer paper (Vaswani et al., 2017) used $\varepsilon = 0.1$.

Failure Mode

Label smoothing assumes all incorrect classes are equally plausible: the $\varepsilon/K$ mass is spread uniformly across wrong classes. When there is strong class hierarchy (e.g., misclassifying a dog as a cat is more reasonable than misclassifying a dog as a truck), uniform smoothing wastes probability mass on implausible classes. Non-uniform smoothing based on class similarity can help but requires additional information.

Effects on Calibration

A model is well-calibrated if its predicted probabilities match empirical frequencies: when it says "80% probability of class A," class A occurs about 80% of the time.

Standard training with hard labels produces overconfident models: predicted probabilities are too close to 0 and 1. Label smoothing reduces overconfidence because the model can never achieve zero loss: even a predictor that outputs the smoothed target exactly incurs a loss equal to the entropy of that target, which is strictly positive. The model therefore has no incentive to push logits to infinity.
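The loss floor equals the entropy of the smoothed target, which can be computed directly (a short check; the helper name is ours):

```python
import math

def min_smoothed_loss(K, eps):
    """Entropy of the smoothed target: the lowest smoothed loss any predictor can reach."""
    p_c = 1 - eps + eps / K          # smoothed mass on the correct class
    p_k = eps / K                    # smoothed mass on each of the K-1 wrong classes
    return -(p_c * math.log(p_c) + (K - 1) * p_k * math.log(p_k))

floor = min_smoothed_loss(K=1000, eps=0.1)   # ≈ 1.01, strictly positive
```

Because this floor is nonzero, the gradient vanishes at a finite-confidence optimum rather than in the limit of infinite logits.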

Müller et al. (2019) showed that label smoothing improves calibration (lower expected calibration error) but introduces a subtle side effect: the penultimate-layer representations become more tightly clustered. This clustering can hurt knowledge distillation because the teacher's soft outputs contain less inter-class information.

Effect on Logit Magnitudes

Without label smoothing, minimizing cross-entropy drives the logit of the correct class toward $+\infty$ and all other logits toward $-\infty$. The loss approaches zero only in this limit.

With label smoothing at parameter $\varepsilon$, the optimal logits satisfy:

$$z_c^* - z_k^* = \log\frac{K(1 - \varepsilon) + \varepsilon}{\varepsilon} \quad \text{for all } k \neq c$$

This is finite: the gap between the correct-class logit and the other logits depends only on $\varepsilon$ and $K$. For $K = 1000$ and $\varepsilon = 0.1$, the optimal gap is $\log(9001) \approx 9.1$, which is moderate.
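As a sanity check, a softmax whose correct-class logit exceeds the others by exactly this gap recovers the smoothed target (an illustrative sketch; the helper name is ours):

```python
import math

def optimal_gap(K, eps):
    """Logit gap z_c* - z_k* at the minimizer of the smoothed loss."""
    return math.log((K * (1 - eps) + eps) / eps)

K, eps = 1000, 0.1
gap = optimal_gap(K, eps)            # log(9001) ≈ 9.105

# Softmax with z_c = gap and z_k = 0 for all other classes:
Z = math.exp(gap) + (K - 1)          # partition function
p_c = math.exp(gap) / Z              # equals 1 - eps + eps/K = 0.9001
p_k = 1.0 / Z                        # equals eps/K = 1e-4
```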

When NOT to Use Label Smoothing

Calibrated probability estimates are required. If downstream decisions depend on the exact predicted probability (e.g., medical diagnosis thresholds, betting markets), label smoothing changes the probability scale in ways that require recalibration. The model no longer tries to output true probabilities but instead targets a smoothed version.

Knowledge distillation from the model. Muller et al. (2019) found that label-smoothed teachers produce worse students than hard-label teachers. The clustering of penultimate features reduces the information content of soft targets.

Extremely large label spaces. With $K = 10{,}000$ classes and $\varepsilon = 0.1$, each wrong class gets probability $10^{-5}$ from smoothing. This negligible amount provides no regularization benefit for rare classes while still degrading the signal for common classes.

Common Confusions

Watch Out

Label smoothing is not the same as mixup

Mixup creates new training examples by interpolating between input-label pairs: $(x', y') = (\lambda x_a + (1-\lambda)x_b,\; \lambda y_a + (1-\lambda)y_b)$. The soft labels in mixup reflect actual interpolation of inputs. Label smoothing applies a fixed softening to every example regardless of the input. Mixup is data augmentation; label smoothing is regularization.
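The contrast shows up directly in how the targets are built (a toy sketch with made-up labels; in full mixup the inputs $x$ are interpolated the same way, omitted here):

```python
# Mixup: the soft label depends on a sampled lambda and on BOTH examples.
lam = 0.7
ya = [1.0, 0.0, 0.0]   # example a is class 0
yb = [0.0, 1.0, 0.0]   # example b is class 1
y_mix = [lam * a + (1 - lam) * b for a, b in zip(ya, yb)]   # [0.7, 0.3, 0.0]

# Label smoothing: the same fixed softening for every example; inputs untouched.
eps, K = 0.1, 3
y_ls = [(1 - eps) * a + eps / K for a in ya]   # [0.933..., 0.033..., 0.033...]
```

Note that the mixup target puts mass only on the two classes actually mixed, while the smoothed target spreads mass over every class.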

Watch Out

Label smoothing does not change the argmax

The optimal prediction under label smoothing still assigns the highest probability to the correct class. Smoothing only prevents the model from being infinitely confident. The ranking of classes is preserved; only the magnitude of probabilities changes.

Canonical Examples

Example

Effect on a 3-class problem

With $K = 3$ and $\varepsilon = 0.1$, the target for class 1 changes from $(1, 0, 0)$ to $(0.933, 0.033, 0.033)$. The cross-entropy loss for a model predicting $(0.9, 0.05, 0.05)$: the standard loss is $-\log(0.9) \approx 0.105$; the smoothed loss is $-0.933\log(0.9) - 0.033\log(0.05) - 0.033\log(0.05) \approx 0.098 + 0.100 + 0.100 = 0.298$. The smoothed loss is higher because it also penalizes the low probability assigned to incorrect classes.
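The arithmetic in this example can be checked in a few lines:

```python
import math

p = [0.9, 0.05, 0.05]                           # model's predicted distribution
y_ls = [1 - 0.1 + 0.1 / 3, 0.1 / 3, 0.1 / 3]    # smoothed target for class 1 (K=3, eps=0.1)

standard = -math.log(p[0])                                 # ≈ 0.105
smoothed = -sum(y * math.log(q) for y, q in zip(y_ls, p))  # ≈ 0.298
```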

Exercises

ExerciseCore

Problem

For $K = 10$ classes and $\varepsilon = 0.1$, what is the label-smoothed target vector for class 3? What is the minimum achievable smoothed cross-entropy loss, i.e., the loss of a predictor that always outputs the label-smoothed target exactly?

ExerciseAdvanced

Problem

Prove that the optimal softmax output $p^*$ minimizing the label-smoothed loss satisfies $p_c^* / p_k^* = (K(1-\varepsilon) + \varepsilon)/\varepsilon$ for all $k \neq c$. What does this ratio approach as $\varepsilon \to 0$?

References

Canonical:

  • Szegedy et al., "Rethinking the Inception Architecture for Computer Vision" (CVPR 2016), Section 7
  • Vaswani et al., "Attention Is All You Need" (NeurIPS 2017), Section 5.4

Current:

  • Müller, Kornblith, Hinton, "When Does Label Smoothing Help?" (NeurIPS 2019)

  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009)

Next Topics

Label smoothing connects to the broader study of regularization, calibration, and training techniques for neural networks.

Last reviewed: April 2026
