Beta: content is under active construction and has not been peer-reviewed. Report errors on GitHub.

ML Methods

Cross-Entropy Loss Deep Dive

Why cross-entropy is the correct loss for classification: its derivation as negative log-likelihood, connection to KL divergence, why MSE fails for classification, and practical variants including label smoothing and focal loss.

Core · Tier 1 · Stable · ~40 min

Why This Matters

Cross-entropy is the default loss function for classification in every modern ML framework. When you call nn.CrossEntropyLoss in PyTorch or categorical_crossentropy in Keras, you are using this loss. Understanding why it is the right choice (not just that it is the standard) requires connecting three ideas: maximum likelihood, information theory, and optimization geometry.

Figure: the loss $L = -\log(p)$ plotted against the predicted probability $p$ for the correct class. Confident and correct ($p = 0.95$) gives loss $0.05$; uncertain ($p = 0.5$) gives $0.69$; confident and wrong ($p = 0.1$) gives $2.30$.

Binary Cross-Entropy

Definition

Binary Cross-Entropy

For a binary label $y \in \{0, 1\}$ and predicted probability $p \in (0, 1)$, the binary cross-entropy loss is:

$$\ell(y, p) = -[y \log p + (1 - y) \log(1 - p)]$$

This is the negative log-likelihood of the observation under a Bernoulli model with parameter $p$.

When $y = 1$, the loss is $-\log p$: the model is penalized for assigning low probability to the correct class. When $y = 0$, the loss is $-\log(1-p)$: the model is penalized for assigning high probability to the wrong class. The loss goes to infinity as the prediction moves toward the wrong extreme.
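This formula can be sketched in plain Python. The `eps` clipping is a standard implementation detail to keep the log finite, not part of the mathematical definition:

```python
import math

def binary_cross_entropy(y, p, eps=1e-12):
    """Binary cross-entropy for one example.

    y: true label in {0, 1}; p: predicted probability of class 1.
    Clipping p into [eps, 1 - eps] keeps the log from diverging.
    """
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

# Confident + correct gives a small loss; confident + wrong, a large one.
print(binary_cross_entropy(1, 0.95))  # ~0.051
print(binary_cross_entropy(1, 0.10))  # ~2.303
```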

Multi-Class Cross-Entropy

Definition

Categorical Cross-Entropy

For a one-hot label vector $\mathbf{y} \in \{0,1\}^K$ with $\sum_k y_k = 1$ and predicted probability vector $\mathbf{p} \in \Delta^{K-1}$ (the probability simplex), the categorical cross-entropy is:

$$\ell(\mathbf{y}, \mathbf{p}) = -\sum_{k=1}^{K} y_k \log p_k$$

Since $\mathbf{y}$ is one-hot with $y_c = 1$ for the true class $c$, this reduces to $\ell = -\log p_c$.
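A minimal sketch of the general formula and its one-hot reduction:

```python
import math

def categorical_cross_entropy(y_onehot, p, eps=1e-12):
    """-sum_k y_k log p_k; with a one-hot y this equals -log p_c."""
    return -sum(yk * math.log(max(pk, eps)) for yk, pk in zip(y_onehot, p))

y = [0, 0, 1]        # true class c = 2
p = [0.1, 0.2, 0.7]  # model's predicted distribution
print(categorical_cross_entropy(y, p))  # equals -log(0.7)
```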

Cross-Entropy Equals Negative Log-Likelihood

Theorem

Cross-Entropy as Maximum Likelihood

Statement

Let $(x_i, y_i)_{i=1}^n$ be iid samples with $y_i \in \{1, \ldots, K\}$. Let $p_k(x; \theta) = P(Y = k \mid X = x; \theta)$ be a parametric model. The negative log-likelihood is:

$$-\frac{1}{n} \sum_{i=1}^{n} \log p_{y_i}(x_i; \theta) = \frac{1}{n} \sum_{i=1}^{n} \ell_{\text{CE}}(\mathbf{y}_i, \mathbf{p}(x_i; \theta))$$

Minimizing cross-entropy loss is equivalent to maximum likelihood estimation under the categorical model.

Intuition

Cross-entropy is not an arbitrary choice. It is the unique loss function (up to affine transformations) that corresponds to maximum likelihood estimation for categorical distributions. MLE is the statistically natural way to fit a probability model to data.

Proof Sketch

The likelihood of the data is $L(\theta) = \prod_{i=1}^n p_{y_i}(x_i; \theta)$. Taking the negative log: $-\log L(\theta) = -\sum_i \log p_{y_i}(x_i; \theta)$. Writing $y_i$ as a one-hot vector and expanding the sum over classes, this is exactly the cross-entropy summed over samples.
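The equivalence can be checked numerically on a toy dataset (the probability rows below are illustrative):

```python
import math

# Toy dataset: predicted class-probability rows and integer labels.
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1],
         [0.3, 0.3, 0.4]]
labels = [0, 1, 2]

# Negative log-likelihood: -(1/n) * sum_i log p_{y_i}
nll = -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

# Mean cross-entropy against one-hot labels: the same quantity.
def onehot(c, K):
    return [1.0 if k == c else 0.0 for k in range(K)]

ce = sum(-sum(yk * math.log(pk) for yk, pk in zip(onehot(y, 3), p))
         for p, y in zip(probs, labels)) / len(labels)

assert abs(nll - ce) < 1e-12
```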

Why It Matters

This connection means cross-entropy inherits all the good properties of MLE: consistency (converges to the true model as $n \to \infty$ if the model class contains it), efficiency (achieves the Cramér-Rao lower bound asymptotically), and a clean probabilistic interpretation of the outputs. Cross-entropy is also a proper scoring rule, meaning it is uniquely minimized when the predicted probabilities match the true distribution.

Failure Mode

MLE (and therefore cross-entropy) can overfit when the model class is too rich relative to the sample size. It also assumes the model class contains a good approximation to the true conditional distribution. If the model is badly misspecified, MLE converges to the member of the model class closest in KL divergence to the truth, which may still be far from the truth.

Connection to KL Divergence

Cross-entropy decomposes as:

$$H(q, p) = H(q) + D_{\text{KL}}(q \| p)$$

where $H(q) = -\sum_k q_k \log q_k$ is the entropy of the true distribution $q$ and $D_{\text{KL}}(q \| p) = \sum_k q_k \log(q_k / p_k)$ is the KL divergence of $p$ from $q$.

Since $H(q)$ is constant with respect to the model parameters, minimizing cross-entropy is equivalent to minimizing $D_{\text{KL}}(q \| p)$. The model learns to match its predicted distribution to the true conditional distribution.
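A quick numerical check of the decomposition on an arbitrary pair of distributions:

```python
import math

def entropy(q):
    return -sum(qk * math.log(qk) for qk in q if qk > 0)

def kl(q, p):
    return sum(qk * math.log(qk / pk) for qk, pk in zip(q, p) if qk > 0)

def cross_entropy(q, p):
    return -sum(qk * math.log(pk) for qk, pk in zip(q, p) if qk > 0)

q = [0.5, 0.3, 0.2]  # "true" distribution
p = [0.4, 0.4, 0.2]  # model distribution
# H(q, p) = H(q) + KL(q || p)
assert abs(cross_entropy(q, p) - (entropy(q) + kl(q, p))) < 1e-12
```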

Why MSE Fails for Classification

Proposition

MSE is Non-Convex in Logit Space

Statement

For binary classification with sigmoid output $\sigma(z) = 1/(1 + e^{-z})$, the MSE loss $\ell(y, z) = (y - \sigma(z))^2$ is non-convex in the logit $z$. Its gradient with respect to $z$ is:

$$\frac{\partial \ell}{\partial z} = -2(y - \sigma(z)) \, \sigma(z)(1 - \sigma(z))$$

When $\sigma(z)$ is near 0 or 1, the gradient is near zero regardless of whether the prediction is correct. This creates plateau regions that slow or stall gradient descent.

Intuition

The sigmoid squashes its input to $(0,1)$. MSE penalizes squared differences in this squashed space. When the sigmoid saturates (output near 0 or 1), its derivative is near zero; that derivative multiplies the MSE gradient and kills the learning signal. Cross-entropy avoids this because the log cancels the exponential in the sigmoid.

Proof Sketch

Compute $\partial \ell / \partial z$ by the chain rule. The factor $\sigma(z)(1 - \sigma(z))$ is the sigmoid derivative, which approaches zero as $|z| \to \infty$. For cross-entropy, the gradient is $\sigma(z) - y$, which has no vanishing factor.
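The vanishing factor is easy to see numerically. The sketch below evaluates both gradients at a saturated, confidently wrong prediction (the logit value is illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# A confidently *wrong* prediction: true label y = 1, large negative logit.
y, z = 1.0, -8.0
s = sigmoid(z)

grad_mse = -2.0 * (y - s) * s * (1.0 - s)  # d/dz of (y - sigma(z))^2
grad_ce = s - y                            # d/dz of CE through sigmoid

print(grad_mse)  # ~ -6.7e-4: nearly zero despite a large error
print(grad_ce)   # ~ -1.0: full-strength learning signal
```

Both gradients point the same way, but the MSE gradient is roughly a thousand times smaller exactly where a large correction is needed.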

Why It Matters

This is why every neural network classification head uses cross-entropy, not MSE. The optimization landscape with MSE has flat regions that trap gradient descent, making training slow and unreliable.

Failure Mode

MSE can still work for classification if the learning rate is carefully tuned and the logits are not allowed to saturate (e.g., with gradient clipping). But there is no reason to accept this inconvenience when cross-entropy works better by default.

Practical Variants

Label Smoothing

Replace the hard one-hot label $\mathbf{y}$ with a softened version:

$$\mathbf{y}_{\text{smooth}} = (1 - \alpha) \cdot \mathbf{y} + \frac{\alpha}{K}$$

where $\alpha \in (0, 1)$ is typically 0.1. This prevents the model from becoming overconfident (pushing logits to infinity) and acts as a regularizer. The cross-entropy loss with smoothed labels penalizes extreme confidence.
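A sketch of the smoothing operation:

```python
def smooth_labels(y_onehot, alpha=0.1):
    """Apply (1 - alpha) * y + alpha / K elementwise; output still sums to 1."""
    K = len(y_onehot)
    return [(1.0 - alpha) * yk + alpha / K for yk in y_onehot]

# Every class gets alpha/K ~ 0.033; the true class drops from 1.0 to ~0.933.
print(smooth_labels([0.0, 0.0, 1.0]))
```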

Focal Loss

For datasets with severe class imbalance, focal loss down-weights easy examples:

$$\ell_{\text{focal}}(p_t) = -(1 - p_t)^\gamma \log(p_t)$$

where $p_t$ is the predicted probability for the true class and $\gamma > 0$ is a focusing parameter (typically 2). When the model is confident and correct ($p_t$ near 1), the $(1 - p_t)^\gamma$ factor reduces the loss. When the model is wrong ($p_t$ near 0), nearly the full loss applies. This focuses training on hard and misclassified examples.
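A sketch of the focal loss, compared against plain cross-entropy at an easy and a hard example:

```python
import math

def focal_loss(p_t, gamma=2.0):
    """-(1 - p_t)^gamma * log(p_t): down-weights easy examples."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# Easy example (p_t = 0.9): loss shrinks ~100x relative to plain CE.
print(focal_loss(0.9), -math.log(0.9))
# Hard example (p_t = 0.1): ~81% of the plain CE loss still applies.
print(focal_loss(0.1), -math.log(0.1))
```

With $\gamma = 2$, the modulating factor is $(1 - 0.9)^2 = 0.01$ for the easy example but $(1 - 0.1)^2 = 0.81$ for the hard one, which is precisely the rebalancing the text describes.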

Common Confusions

Watch Out

Cross-entropy is not symmetric

$H(q, p) \neq H(p, q)$ in general. In ML, $q$ is the true label distribution and $p$ is the model prediction. The order matters. Similarly, $D_{\text{KL}}(q \| p) \neq D_{\text{KL}}(p \| q)$.

Watch Out

Softmax and cross-entropy are separate operations

Softmax converts logits to probabilities. Cross-entropy computes the loss from probabilities and labels. In practice, they are fused into a single numerically stable operation (log-sum-exp trick), but conceptually they are distinct. You can use cross-entropy with any probability output, not just softmax.
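A sketch of the fused, numerically stable computation using the log-sum-exp trick (pure Python for illustration; frameworks implement this internally):

```python
import math

def cross_entropy_from_logits(logits, target):
    """Fused log-softmax + negative log-likelihood.

    Subtracting max(logits) before exponentiating avoids overflow
    without changing the result, since log-sum-exp shifts cancel.
    """
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return lse - logits[target]  # equals -log softmax(logits)[target]

# A naive softmax would overflow on exp(1000); the fused form is stable.
print(cross_entropy_from_logits([1000.0, 0.0], 0))  # ~0.0
```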

Watch Out

Cross-entropy loss of zero does not mean perfect learning

A cross-entropy loss of zero means the model assigns probability 1 to every correct class in the training set. This is perfect memorization of the training labels, not necessarily good generalization. Regularization (e.g., weight decay or the label smoothing described above) mitigates this.

Exercises

Exercise (Core)

Problem

A binary classifier predicts $p = 0.9$ for a sample with true label $y = 1$. Compute the binary cross-entropy loss. Then compute the MSE loss. Which penalizes the error more?

Exercise (Advanced)

Problem

Show that minimizing the cross-entropy $H(q, p)$ over the model $p$ is equivalent to minimizing the KL divergence $D_{\text{KL}}(q \| p)$. Under what condition is the minimum achieved, and what is the minimum value?


References

Canonical:

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapter 4.3
  • Cover & Thomas, Elements of Information Theory (2006), Chapter 2

Current:

  • Goodfellow, Bengio, Courville, Deep Learning (2016), Chapter 6.2
  • Lin et al., "Focal Loss for Dense Object Detection" (2017), ICCV
  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 3-15
  • Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapters 1-28


Last reviewed: April 2026
