Cross-Entropy Loss Deep Dive
Why cross-entropy is the correct loss for classification: its derivation as negative log-likelihood, connection to KL divergence, why MSE fails for classification, and practical variants including label smoothing and focal loss.
Why This Matters
Cross-entropy is the default loss function for classification in every modern
ML framework. When you call nn.CrossEntropyLoss in PyTorch or
categorical_crossentropy in Keras, you are using this loss. Understanding
why it is the right choice (not just that it is the standard) requires
connecting three ideas: maximum likelihood, information theory, and
optimization geometry.
Binary Cross-Entropy
For a binary label $y \in \{0, 1\}$ and predicted probability $\hat{p} \in (0, 1)$, the binary cross-entropy loss is:

$$\ell(y, \hat{p}) = -\left[\, y \log \hat{p} + (1 - y) \log(1 - \hat{p}) \,\right]$$

This is the negative log-likelihood of the observation under a Bernoulli model with parameter $\hat{p}$.
When $y = 1$, the loss is $-\log \hat{p}$: the model is penalized for assigning low probability to the correct class. When $y = 0$, the loss is $-\log(1 - \hat{p})$: the model is penalized for assigning high probability to the wrong class. The loss grows without bound as the prediction moves toward the wrong extreme.
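The two cases above can be checked numerically. Here is a minimal pure-Python sketch (the function name and the epsilon clipping are illustrative choices, not from any library):

```python
import math

def binary_cross_entropy(y, p_hat, eps=1e-12):
    """Negative Bernoulli log-likelihood for one observation.

    y is the true label (0 or 1); p_hat is the predicted probability of
    class 1. eps clips the prediction away from 0 and 1 so the log
    never receives an exact zero.
    """
    p_hat = min(max(p_hat, eps), 1.0 - eps)
    return -(y * math.log(p_hat) + (1 - y) * math.log(1 - p_hat))

# Confident and correct: small loss (-log 0.99 ~ 0.01).
print(binary_cross_entropy(1, 0.99))
# Confident and wrong: large loss (-log 0.01 ~ 4.6), unbounded as p_hat -> 0.
print(binary_cross_entropy(1, 0.01))
```

The clipping is the same trick frameworks use internally to keep the loss finite when a prediction hits an exact 0 or 1.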
Multi-Class Cross-Entropy
Categorical Cross-Entropy
For a one-hot label vector $\mathbf{y} \in \{0, 1\}^K$ with $\sum_k y_k = 1$ and predicted probability vector $\hat{\mathbf{p}} \in \Delta^{K-1}$ (the probability simplex), the categorical cross-entropy is:

$$\ell(\mathbf{y}, \hat{\mathbf{p}}) = -\sum_{k=1}^{K} y_k \log \hat{p}_k$$

Since $\mathbf{y}$ is one-hot with $y_c = 1$ for the true class $c$, this reduces to $-\log \hat{p}_c$.
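The reduction to $-\log \hat{p}_c$ is easy to verify in a few lines of Python (the function name is illustrative):

```python
import math

def categorical_cross_entropy(y_onehot, p_hat):
    """Cross-entropy between a one-hot label and a predicted distribution."""
    return -sum(y * math.log(p) for y, p in zip(y_onehot, p_hat) if y > 0)

y = [0, 0, 1, 0]          # true class is index 2
p = [0.1, 0.2, 0.6, 0.1]  # model's predicted distribution

# The full sum collapses to -log of the true-class probability.
assert abs(categorical_cross_entropy(y, p) - (-math.log(0.6))) < 1e-12
```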
Cross-Entropy Equals Negative Log-Likelihood
Cross-Entropy as Maximum Likelihood
Statement
Let $(x_i, y_i)_{i=1}^{n}$ be iid samples with $y_i \in \{1, \dots, K\}$. Let $p_\theta(y \mid x)$ be a parametric model. The negative log-likelihood is:

$$\mathrm{NLL}(\theta) = -\sum_{i=1}^{n} \log p_\theta(y_i \mid x_i)$$
Minimizing cross-entropy loss is equivalent to maximum likelihood estimation under the categorical model.
Intuition
Cross-entropy is not an arbitrary choice. It is the unique loss function (up to affine transformations) that corresponds to maximum likelihood estimation for categorical distributions. MLE is the statistically natural way to fit a probability model to data.
Proof Sketch
The likelihood of the data is $\prod_{i=1}^{n} p_\theta(y_i \mid x_i)$. Taking the negative log: $-\sum_{i=1}^{n} \log p_\theta(y_i \mid x_i)$. Writing each $y_i$ as a one-hot vector and expanding the log-probability as $\sum_k y_{ik} \log p_\theta(k \mid x_i)$: this is exactly the cross-entropy summed over samples.
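The product-to-sum step is just a property of logarithms, which a tiny numerical check makes concrete (the variable names are illustrative):

```python
import math

# Per-sample predicted probabilities for each sample's true class.
probs_true_class = [0.7, 0.9, 0.4]

# Negative log of the product of likelihoods...
nll_from_product = -math.log(math.prod(probs_true_class))

# ...equals the sum of per-sample cross-entropy losses -log p_i.
nll_from_sum = sum(-math.log(p) for p in probs_true_class)

assert abs(nll_from_product - nll_from_sum) < 1e-12
```

Frameworks always compute the sum form: it is numerically stable, while the raw product underflows to zero after a few thousand samples.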
Why It Matters
This connection means cross-entropy inherits all the good properties of MLE: consistency (converges to the true model as $n \to \infty$ if the model class contains it), efficiency (achieves the Cramér-Rao lower bound asymptotically), and a clean probabilistic interpretation of the outputs. Cross-entropy is also a proper scoring rule, meaning it is uniquely minimized when the predicted probabilities match the true distribution.
Failure Mode
MLE (and therefore cross-entropy) can overfit when the model class is too rich relative to the sample size. It also assumes the model class contains a good approximation to the true conditional distribution. If the model is badly misspecified, MLE converges to the member of the model class closest in KL divergence to the truth, which may still be far from the truth.
Connection to KL Divergence
Cross-entropy decomposes as:

$$H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$$

where $H(p)$ is the entropy of the true distribution and $D_{\mathrm{KL}}(p \,\|\, q)$ is the KL divergence from $p$ to $q$.
Since $H(p)$ is constant with respect to the model parameters, minimizing cross-entropy is equivalent to minimizing $D_{\mathrm{KL}}(p \,\|\, q)$. The model learns to match its predicted distribution $q$ to the true conditional distribution $p$.
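The decomposition can be verified directly on a small example (pure Python, illustrative function names):

```python
import math

def entropy(p):
    """H(p) = -sum p_i log p_i."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum p_i log q_i."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(p || q) = sum p_i log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]  # true distribution
q = [0.4, 0.4, 0.2]  # model distribution

# H(p, q) = H(p) + D_KL(p || q)
assert abs(cross_entropy(p, q) - (entropy(p) + kl_divergence(p, q))) < 1e-12
```

Note that when $q = p$, the KL term vanishes and the cross-entropy bottoms out at $H(p)$, which is exactly the proper-scoring-rule property.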
Why MSE Fails for Classification
MSE is Non-Convex in Logit Space
Statement
For binary classification with sigmoid output $\hat{p} = \sigma(z)$, the MSE loss $(\hat{p} - y)^2$ is non-convex in the logit $z$. Its gradient with respect to $z$ is:

$$\frac{\partial}{\partial z}\,(\hat{p} - y)^2 = 2\,(\hat{p} - y)\,\hat{p}\,(1 - \hat{p})$$

When $\hat{p}$ is near 0 or 1, the gradient is near zero regardless of whether the prediction is correct. This creates plateau regions that slow or stall gradient descent.
Intuition
The sigmoid squashes its input to $(0, 1)$. MSE penalizes squared differences in this squashed space. When the sigmoid saturates (output near 0 or 1), the gradient of the sigmoid is near zero, multiplying into the MSE gradient and killing the learning signal. Cross-entropy avoids this because the log cancels the exponential in the sigmoid.
Proof Sketch
Compute $\partial \ell / \partial z = 2\,(\hat{p} - y)\,\hat{p}\,(1 - \hat{p})$ by the chain rule. The factor $\hat{p}\,(1 - \hat{p})$ is the sigmoid derivative, which approaches zero as $|z| \to \infty$. For cross-entropy, the gradient is $\hat{p} - y$, which has no vanishing factor.
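The vanishing-gradient contrast is stark even at moderate logits. A small sketch of both gradients (illustrative function names, assuming the formulas above):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mse_grad(z, y):
    """d/dz of (sigmoid(z) - y)^2: carries the vanishing sigmoid derivative."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

def ce_grad(z, y):
    """d/dz of binary cross-entropy with a sigmoid output: simply p - y."""
    return sigmoid(z) - y

# A badly wrong, saturated prediction: z = -10 while the true label is 1.
z, y = -10.0, 1.0
print(mse_grad(z, y))  # vanishingly small (~ -9e-5): learning stalls
print(ce_grad(z, y))   # close to -1: strong corrective signal
```

The model is maximally wrong in both cases, yet MSE delivers almost no gradient, while cross-entropy pushes back with nearly full force.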
Why It Matters
This is why every neural network classification head uses cross-entropy, not MSE. The optimization landscape with MSE has flat regions that trap gradient descent, making training slow and unreliable.
Failure Mode
MSE can still work for classification if the learning rate is carefully tuned and the logits are not allowed to saturate (e.g., with gradient clipping). But there is no reason to accept this inconvenience when cross-entropy works better by default.
Practical Variants
Label Smoothing
Replace the hard one-hot label with a softened version:

$$\tilde{y}_k = (1 - \varepsilon)\, y_k + \frac{\varepsilon}{K}$$

where $\varepsilon$ is typically 0.1. This prevents the model from becoming overconfident (pushing logits to infinity) and acts as a regularizer. The cross-entropy loss with smoothed labels penalizes extreme confidence.
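A minimal sketch of the smoothing formula (the function name is illustrative; PyTorch exposes the same idea via the `label_smoothing` argument of `nn.CrossEntropyLoss`):

```python
def smooth_labels(y_onehot, eps=0.1):
    """Mix the one-hot target with the uniform distribution over K classes."""
    k = len(y_onehot)
    return [(1.0 - eps) * y + eps / k for y in y_onehot]

hard = [0, 0, 1, 0]
soft = smooth_labels(hard, eps=0.1)
print(soft)  # roughly [0.025, 0.025, 0.925, 0.025]

# The smoothed target is still a valid probability distribution.
assert abs(sum(soft) - 1.0) < 1e-9
```

Because the target for the true class is now 0.925 rather than 1.0, the loss is minimized at a finite logit gap instead of at infinity.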
Focal Loss
For datasets with severe class imbalance, focal loss down-weights easy examples:

$$\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma} \log p_t$$

where $p_t$ is the predicted probability for the true class and $\gamma$ is a focusing parameter (typically 2). When the model is confident and correct ($p_t$ near 1), the factor $(1 - p_t)^{\gamma}$ reduces the loss. When the model is wrong ($p_t$ near 0), the full loss applies. This focuses training on hard and misclassified examples.
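The down-weighting is easy to see by taking the ratio of focal loss to plain cross-entropy, which is exactly the modulating factor $(1 - p_t)^{\gamma}$ (illustrative function names):

```python
import math

def focal_loss(p_t, gamma=2.0):
    """Cross-entropy on the true class, down-weighted by (1 - p_t)^gamma."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

def ce_loss(p_t):
    return -math.log(p_t)

# Easy example (confident, correct): focal loss is a tiny fraction of CE.
print(focal_loss(0.95) / ce_loss(0.95))  # (1 - 0.95)^2 = 0.0025
# Hard example (badly wrong): the factor is near 1, so the full loss applies.
print(focal_loss(0.05) / ce_loss(0.05))  # (1 - 0.05)^2 ~ 0.90
```

With $\gamma = 2$, the easy example contributes 400x less loss than it would under plain cross-entropy, which is what lets the rare hard examples dominate the gradient.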
Common Confusions
Cross-entropy is not symmetric
$H(p, q) \neq H(q, p)$ in general. In ML, $p$ is the true label distribution and $q$ is the model prediction. The order matters. Similarly, $D_{\mathrm{KL}}(p \,\|\, q) \neq D_{\mathrm{KL}}(q \,\|\, p)$.
Softmax and cross-entropy are separate operations
Softmax converts logits to probabilities. Cross-entropy computes the loss from probabilities and labels. In practice, they are fused into a single numerically stable operation (log-sum-exp trick), but conceptually they are distinct. You can use cross-entropy with any probability output, not just softmax.
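The fused, numerically stable form can be sketched in a few lines of pure Python. This is a sketch of the log-sum-exp trick, not a framework's actual implementation, though it mirrors what fused losses like PyTorch's `nn.CrossEntropyLoss` compute:

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax via the log-sum-exp trick.

    Subtracting the max logit before exponentiating prevents overflow;
    the result is unchanged because softmax is shift-invariant.
    """
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def cross_entropy_from_logits(logits, true_class):
    """Fused softmax + cross-entropy: -log_softmax(logits)[true_class]."""
    return -log_softmax(logits)[true_class]

# Huge logits would overflow a naive exp(); the stable version is fine.
loss = cross_entropy_from_logits([1000.0, 1002.0, 998.0], true_class=1)
print(loss)  # small loss: the model strongly favors the correct class
```

A naive `exp(1002.0)` overflows a float64; the shift by the max logit is why frameworks fuse the two operations rather than exposing softmax-then-log to users.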
Cross-entropy loss of zero does not mean perfect learning
A cross-entropy loss of zero means the model assigns probability 1 to every correct class in the training set. This is perfect memorization of the training labels, not necessarily good generalization. Regularization prevents this.
Exercises
Problem
A binary classifier predicts probability $\hat{p}$ for a sample with true label $y$. Compute the binary cross-entropy loss. Then compute the MSE loss $(\hat{p} - y)^2$. Which penalizes a confident wrong prediction more?
Problem
Show that minimizing the cross-entropy $H(p, q_\theta)$ over the model $q_\theta$ is equivalent to minimizing the KL divergence $D_{\mathrm{KL}}(p \,\|\, q_\theta)$. Under what condition is the minimum achieved, and what is the minimum value?
References
Canonical:
- Bishop, Pattern Recognition and Machine Learning (2006), Chapter 4.3
- Cover & Thomas, Elements of Information Theory (2006), Chapter 2
Current:
- Goodfellow, Bengio, Courville, Deep Learning (2016), Chapter 6.2
- Lin et al., "Focal Loss for Dense Object Detection" (2017), ICCV
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 3-15
- Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapters 1-28
Next Topics
- Multi-class and multi-label classification: softmax vs sigmoid, OvR, OvO
- Regularization in practice: preventing overconfident predictions
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Information Theory Foundations (Layer 0B)
- Logistic Regression (Layer 1)
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)