Beta: content is under active construction and has not been peer-reviewed. Report errors on GitHub.

ML Methods

Cross-Entropy Loss Deep Dive

Why cross-entropy is the correct loss for classification: its derivation as negative log-likelihood, connection to KL divergence, why MSE fails for classification, and practical variants including label smoothing and focal loss.

Core · Tier 1 · Stable · ~40 min

Why This Matters

Cross-entropy is the default loss function for classification in every modern ML framework. When you call nn.CrossEntropyLoss in PyTorch or categorical_crossentropy in Keras, you are using this loss. Understanding why it is the right choice (not just that it is the standard) requires connecting three ideas: maximum likelihood, information theory, and optimization geometry.

Figure: the loss $L = -\log(p)$ plotted against the predicted probability $p$ for the correct class. Confident and correct ($p = 0.95$) gives loss $0.05$; uncertain ($p = 0.5$) gives $0.69$; confident and wrong ($p = 0.1$) gives $2.30$.

Binary Cross-Entropy

Definition

Binary Cross-Entropy

For a binary label $y \in \{0, 1\}$ and predicted probability $p \in (0, 1)$, the binary cross-entropy loss is:

$$\ell(y, p) = -[y \log p + (1 - y) \log(1 - p)]$$

This is the negative log-likelihood of the observation under a Bernoulli model with parameter $p$.

When $y = 1$, the loss is $-\log p$: the model is penalized for assigning low probability to the correct class. When $y = 0$, the loss is $-\log(1-p)$: the model is penalized for assigning high probability to the wrong class. The loss goes to infinity as the prediction moves toward the wrong extreme.
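This formula can be sketched in plain Python. The `eps` clipping is a standard implementation detail to keep the log finite, not part of the mathematical definition:

```python
import math

def binary_cross_entropy(y, p, eps=1e-12):
    """Binary cross-entropy for one example.

    y: true label in {0, 1}; p: predicted probability of class 1.
    Clipping p into [eps, 1 - eps] keeps the log from diverging.
    """
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

# Confident + correct gives a small loss; confident + wrong, a large one.
print(binary_cross_entropy(1, 0.95))  # ~0.051
print(binary_cross_entropy(1, 0.10))  # ~2.303
```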

Multi-Class Cross-Entropy

Definition

Categorical Cross-Entropy

For a one-hot label vector $\mathbf{y} \in \{0,1\}^K$ with $\sum_k y_k = 1$ and predicted probability vector $\mathbf{p} \in \Delta^{K-1}$ (the probability simplex), the categorical cross-entropy is:

$$\ell(\mathbf{y}, \mathbf{p}) = -\sum_{k=1}^{K} y_k \log p_k$$

Since $\mathbf{y}$ is one-hot with $y_c = 1$ for the true class $c$, this reduces to $\ell = -\log p_c$.
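A minimal sketch of the general formula and its one-hot reduction:

```python
import math

def categorical_cross_entropy(y_onehot, p, eps=1e-12):
    """-sum_k y_k log p_k; with a one-hot y this equals -log p_c."""
    return -sum(yk * math.log(max(pk, eps)) for yk, pk in zip(y_onehot, p))

y = [0, 0, 1]        # true class c = 2
p = [0.1, 0.2, 0.7]  # model's predicted distribution
print(categorical_cross_entropy(y, p))  # equals -log(0.7)
```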

Cross-Entropy Equals Negative Log-Likelihood

Theorem

Cross-Entropy as Maximum Likelihood

Statement

Let $(x_i, y_i)_{i=1}^n$ be iid samples with $y_i \in \{1, \ldots, K\}$. Let $p_k(x; \theta) = P(Y = k \mid X = x; \theta)$ be a parametric model. The negative log-likelihood is:

$$-\frac{1}{n} \sum_{i=1}^{n} \log p_{y_i}(x_i; \theta) = \frac{1}{n} \sum_{i=1}^{n} \ell_{\text{CE}}(\mathbf{y}_i, \mathbf{p}(x_i; \theta))$$

Minimizing cross-entropy loss is equivalent to maximum likelihood estimation under the categorical model.

Intuition

Cross-entropy is not an arbitrary choice. It is the unique loss function (up to affine transformations) that corresponds to maximum likelihood estimation for categorical distributions. MLE is the statistically natural way to fit a probability model to data.

Proof Sketch

The likelihood of the data is $L(\theta) = \prod_{i=1}^n p_{y_i}(x_i; \theta)$. Taking the negative log: $-\log L(\theta) = -\sum_i \log p_{y_i}(x_i; \theta)$. Writing $y_i$ as a one-hot vector and expanding the sum over classes, this is exactly the cross-entropy summed over samples.
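The equivalence can be checked numerically on a toy dataset (the probability rows below are illustrative):

```python
import math

# Toy dataset: predicted class-probability rows and integer labels.
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1],
         [0.3, 0.3, 0.4]]
labels = [0, 1, 2]

# Negative log-likelihood: -(1/n) * sum_i log p_{y_i}
nll = -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

# Mean cross-entropy against one-hot labels: the same quantity.
def onehot(c, K):
    return [1.0 if k == c else 0.0 for k in range(K)]

ce = sum(-sum(yk * math.log(pk) for yk, pk in zip(onehot(y, 3), p))
         for p, y in zip(probs, labels)) / len(labels)

assert abs(nll - ce) < 1e-12
```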

Why It Matters

This connection means cross-entropy inherits all the good properties of MLE: consistency (converges to the true model as $n \to \infty$ if the model class contains it), efficiency (achieves the Cramér-Rao lower bound asymptotically), and a clean probabilistic interpretation of the outputs. Cross-entropy is also a proper scoring rule, meaning it is uniquely minimized when the predicted probabilities match the true distribution.

Failure Mode

MLE (and therefore cross-entropy) can overfit when the model class is too rich relative to the sample size. It also assumes the model class contains a good approximation to the true conditional distribution. If the model is badly misspecified, MLE converges to the member of the model class closest in KL divergence to the truth, which may still be far from the truth.

Connection to KL Divergence

Cross-entropy decomposes as:

$$H(q, p) = H(q) + D_{\text{KL}}(q \| p)$$

where $H(q) = -\sum_k q_k \log q_k$ is the entropy of the true distribution $q$ and $D_{\text{KL}}(q \| p) = \sum_k q_k \log(q_k / p_k)$ is the KL divergence of $p$ from $q$.

Since $H(q)$ is constant with respect to the model parameters, minimizing cross-entropy is equivalent to minimizing $D_{\text{KL}}(q \| p)$. The model learns to match its predicted distribution to the true conditional distribution.
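A quick numerical check of the decomposition on an arbitrary pair of distributions:

```python
import math

def entropy(q):
    return -sum(qk * math.log(qk) for qk in q if qk > 0)

def kl(q, p):
    return sum(qk * math.log(qk / pk) for qk, pk in zip(q, p) if qk > 0)

def cross_entropy(q, p):
    return -sum(qk * math.log(pk) for qk, pk in zip(q, p) if qk > 0)

q = [0.5, 0.3, 0.2]  # "true" distribution
p = [0.4, 0.4, 0.2]  # model distribution
# H(q, p) = H(q) + KL(q || p)
assert abs(cross_entropy(q, p) - (entropy(q) + kl(q, p))) < 1e-12
```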

Why MSE Fails for Classification

Proposition

MSE is Non-Convex in Logit Space

Statement

For binary classification with sigmoid output $\sigma(z) = 1/(1 + e^{-z})$, the MSE loss $\ell(y, z) = (y - \sigma(z))^2$ is non-convex in the logit $z$. Its gradient with respect to $z$ is:

$$\frac{\partial \ell}{\partial z} = -2(y - \sigma(z)) \, \sigma(z)(1 - \sigma(z))$$

When $\sigma(z)$ is near 0 or 1, the gradient is near zero regardless of whether the prediction is correct. This creates plateau regions that slow or stall gradient descent.

Intuition

The sigmoid squashes its input to $(0,1)$. MSE penalizes squared differences in this squashed space. When the sigmoid saturates (output near 0 or 1), its derivative is near zero; that derivative multiplies the MSE gradient and kills the learning signal. Cross-entropy avoids this because the log cancels the exponential in the sigmoid.

Proof Sketch

Compute $\partial \ell / \partial z$ by the chain rule. The factor $\sigma(z)(1 - \sigma(z))$ is the sigmoid derivative, which approaches zero as $|z| \to \infty$. For cross-entropy, the gradient is $\sigma(z) - y$, which has no vanishing factor.
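The vanishing factor is easy to see numerically. The sketch below evaluates both gradients at a saturated, confidently wrong prediction (the logit value is illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# A confidently *wrong* prediction: true label y = 1, large negative logit.
y, z = 1.0, -8.0
s = sigmoid(z)

grad_mse = -2.0 * (y - s) * s * (1.0 - s)  # d/dz of (y - sigma(z))^2
grad_ce = s - y                            # d/dz of CE through sigmoid

print(grad_mse)  # ~ -6.7e-4: nearly zero despite a large error
print(grad_ce)   # ~ -1.0: full-strength learning signal
```

Both gradients point the same way, but the MSE gradient is roughly a thousand times smaller exactly where a large correction is needed.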

Why It Matters

This is why every neural network classification head uses cross-entropy, not MSE. The optimization landscape with MSE has flat regions that trap gradient descent, making training slow and unreliable.

Failure Mode

MSE can still work for classification if the learning rate is carefully tuned and the logits are not allowed to saturate (e.g., with gradient clipping). But there is no reason to accept this inconvenience when cross-entropy works better by default.

Practical Variants

Label Smoothing

Replace the hard one-hot label $\mathbf{y}$ with a softened version:

$$\mathbf{y}_{\text{smooth}} = (1 - \alpha) \cdot \mathbf{y} + \frac{\alpha}{K}$$

where $\alpha \in (0, 1)$ is typically 0.1. This prevents the model from becoming overconfident (pushing logits to infinity) and acts as a regularizer. The cross-entropy loss with smoothed labels penalizes extreme confidence.
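A sketch of the smoothing operation:

```python
def smooth_labels(y_onehot, alpha=0.1):
    """Apply (1 - alpha) * y + alpha / K elementwise; output still sums to 1."""
    K = len(y_onehot)
    return [(1.0 - alpha) * yk + alpha / K for yk in y_onehot]

# Every class gets alpha/K ~ 0.033; the true class drops from 1.0 to ~0.933.
print(smooth_labels([0.0, 0.0, 1.0]))
```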

Focal Loss

For datasets with severe class imbalance, focal loss down-weights easy examples:

$$\ell_{\text{focal}}(p_t) = -(1 - p_t)^\gamma \log(p_t)$$

where $p_t$ is the predicted probability for the true class and $\gamma > 0$ is a focusing parameter (typically 2). When the model is confident and correct ($p_t$ near 1), the $(1 - p_t)^\gamma$ factor reduces the loss. When the model is wrong ($p_t$ near 0), nearly the full loss applies. This focuses training on hard and misclassified examples.
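A sketch of the focal loss, compared against plain cross-entropy at an easy and a hard example:

```python
import math

def focal_loss(p_t, gamma=2.0):
    """-(1 - p_t)^gamma * log(p_t): down-weights easy examples."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# Easy example (p_t = 0.9): loss shrinks ~100x relative to plain CE.
print(focal_loss(0.9), -math.log(0.9))
# Hard example (p_t = 0.1): ~81% of the plain CE loss still applies.
print(focal_loss(0.1), -math.log(0.1))
```

With $\gamma = 2$, the modulating factor is $(1 - 0.9)^2 = 0.01$ for the easy example but $(1 - 0.1)^2 = 0.81$ for the hard one, which is precisely the rebalancing the text describes.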

Common Confusions

Watch Out

Cross-entropy is not symmetric

$H(q, p) \neq H(p, q)$ in general. In ML, $q$ is the true label distribution and $p$ is the model prediction. The order matters. Similarly, $D_{\text{KL}}(q \| p) \neq D_{\text{KL}}(p \| q)$.

Watch Out

Softmax and cross-entropy are separate operations

Softmax converts logits to probabilities. Cross-entropy computes the loss from probabilities and labels. In practice, they are fused into a single numerically stable operation (log-sum-exp trick), but conceptually they are distinct. You can use cross-entropy with any probability output, not just softmax.
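A sketch of the fused, numerically stable computation using the log-sum-exp trick (pure Python for illustration; frameworks implement this internally):

```python
import math

def cross_entropy_from_logits(logits, target):
    """Fused log-softmax + negative log-likelihood.

    Subtracting max(logits) before exponentiating avoids overflow
    without changing the result, since log-sum-exp shifts cancel.
    """
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return lse - logits[target]  # equals -log softmax(logits)[target]

# A naive softmax would overflow on exp(1000); the fused form is stable.
print(cross_entropy_from_logits([1000.0, 0.0], 0))  # ~0.0
```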

Watch Out

Cross-entropy loss of zero does not mean perfect learning

A cross-entropy loss of zero means the model assigns probability 1 to every correct class in the training set. This is perfect memorization of the training labels, not necessarily good generalization. Regularization (e.g., weight decay or the label smoothing described above) mitigates this.

Exercises

Exercise (Core)

Problem

A binary classifier predicts $p = 0.9$ for a sample with true label $y = 1$. Compute the binary cross-entropy loss. Then compute the MSE loss. Which penalizes the error more?

Exercise (Advanced)

Problem

Show that minimizing the cross-entropy $H(q, p)$ over the model $p$ is equivalent to minimizing the KL divergence $D_{\text{KL}}(q \| p)$. Under what condition is the minimum achieved, and what is the minimum value?


References

Canonical:

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapter 4.3
  • Cover & Thomas, Elements of Information Theory (2006), Chapter 2

Current:

  • Goodfellow, Bengio, Courville, Deep Learning (2016), Chapter 6.2
  • Lin et al., "Focal Loss for Dense Object Detection" (2017), ICCV
  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 3-15
  • Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapters 1-28


Last reviewed: April 2026
