What Each Does
Mean Squared Error (MSE, L2 loss) measures the average squared difference between predictions and targets:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

For a single example with target $y$ and prediction $\hat{y}$, the loss is $(y - \hat{y})^2$. MSE penalizes large errors quadratically.
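As a concrete sketch, the formula above in plain Python (the function name is illustrative):

```python
# Minimal sketch of MSE over a batch of paired targets and predictions.
def mse(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

# Quadratic penalty: an error of 2 costs four times an error of 1.
```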
Cross-Entropy (CE) measures the dissimilarity between the predicted probability distribution $q$ and the true distribution $p$:

$$H(p, q) = -\sum_{x} p(x) \log q(x)$$

For binary classification with true label $y \in \{0, 1\}$ and predicted probability $\hat{p}$:

$$\text{CE} = -\left[\, y \log \hat{p} + (1 - y) \log(1 - \hat{p}) \,\right]$$
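A minimal sketch of the binary case in plain Python (names are illustrative; the clipping constant guards against `log(0)`):

```python
import math

# Binary cross-entropy for one example: -[y log p + (1-y) log(1-p)].
def binary_cross_entropy(y, p_hat, eps=1e-12):
    """y is 0 or 1; p_hat is the predicted probability of class 1."""
    p_hat = min(max(p_hat, eps), 1.0 - eps)  # avoid log(0)
    return -(y * math.log(p_hat) + (1 - y) * math.log(1.0 - p_hat))
```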
Probabilistic Interpretations
Both losses arise naturally as negative log-likelihoods under specific distributional assumptions.
MSE = Gaussian likelihood. If $y \sim \mathcal{N}(\hat{y}, \sigma^2)$, the negative log-likelihood (ignoring constants) is proportional to $(y - \hat{y})^2$. Minimizing MSE is equivalent to maximum likelihood estimation under Gaussian noise. This is why MSE is the canonical loss for regression.
Cross-entropy = Bernoulli/categorical likelihood. If $y \sim \text{Bernoulli}(\hat{p})$ for binary classification, or $y \sim \text{Categorical}(\hat{p}_1, \ldots, \hat{p}_K)$ for multiclass, the negative log-likelihood is exactly the cross-entropy. Minimizing cross-entropy is MLE for the class probabilities.
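The Bernoulli case can be checked numerically; this plain-Python sketch (illustrative names) confirms that the Bernoulli negative log-likelihood and binary cross-entropy coincide:

```python
import math

# Bernoulli NLL: likelihood of label y under probability p is p^y (1-p)^(1-y).
def bernoulli_nll(y, p):
    likelihood = p ** y * (1.0 - p) ** (1 - y)
    return -math.log(likelihood)

# Binary cross-entropy written directly.
def bce(y, p):
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

# The two expressions agree for every label and probability tested.
for y in (0, 1):
    for p in (0.2, 0.5, 0.9):
        assert abs(bernoulli_nll(y, p) - bce(y, p)) < 1e-12
```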
The Information-Theoretic Decomposition
Cross-entropy decomposes into two terms:

$$H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\Vert\, q)$$

where $H(p)$ is the entropy of the true distribution and $D_{\mathrm{KL}}(p \,\Vert\, q)$ is the KL divergence from $p$ to $q$.
Since $H(p)$ is constant with respect to the model parameters, minimizing cross-entropy is equivalent to minimizing the KL divergence between the true and predicted distributions. This gives cross-entropy a clean information-theoretic interpretation: it measures the average number of bits (or nats) needed to encode samples from $p$ using the code optimized for $q$; the excess over $H(p)$ is exactly the KL divergence.
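The decomposition can be verified numerically on a toy pair of distributions (the values here are illustrative):

```python
import math

# Numeric check of H(p, q) = H(p) + KL(p || q) on discrete distributions.
def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]  # "true" distribution (toy values)
q = [0.5, 0.3, 0.2]  # "predicted" distribution (toy values)
assert abs(cross_entropy(p, q) - (entropy(p) + kl(p, q))) < 1e-9
```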
MSE has no analogous decomposition. It measures raw prediction error, not distributional mismatch.
Why Cross-Entropy Beats MSE for Classification
Gradient saturation with MSE on sigmoids
Consider binary classification with a sigmoid output $\hat{p} = \sigma(z)$, where $z$ is the logit. The gradient of MSE with respect to the logit is:

$$\frac{\partial}{\partial z} (\hat{p} - y)^2 = 2(\hat{p} - y)\, \sigma(z)\big(1 - \sigma(z)\big)$$

When the model is confidently wrong (e.g., $y = 1$ but $\hat{p} \approx 0$), the factor $\sigma(z)(1 - \sigma(z)) \approx 0$ makes the gradient vanish. The model is maximally wrong but receives almost no learning signal.
The gradient of cross-entropy with respect to $z$ is:

$$\frac{\partial \text{CE}}{\partial z} = \hat{p} - y$$

The sigmoid derivative cancels with the log in cross-entropy, leaving a clean gradient proportional to the error. When the model is confidently wrong, $|\hat{p} - y| \approx 1$, giving the strongest possible gradient. This is why cross-entropy trains classification networks faster and more reliably.
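To see the saturation concretely, here is a minimal plain-Python sketch (illustrative names) comparing the two gradients at a confidently wrong point:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Gradients with respect to the logit z for one example with label y:
#   MSE: d/dz (p - y)^2 = 2 (p - y) * p * (1 - p), where p = sigmoid(z)
#   CE:  d/dz BCE(y, p) = p - y
def grad_mse(z, y):
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

def grad_ce(z, y):
    return sigmoid(z) - y

# Confidently wrong: y = 1 but z = -10, so p is about 4.5e-5.
# The MSE gradient is vanishingly small; the CE gradient is near -1.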
The same problem occurs with softmax
For multiclass classification with softmax outputs, the gradient of cross-entropy with respect to logit $z_k$ is $\hat{p}_k - y_k$. Clean, linear in the error, no saturation. MSE through softmax suffers the same vanishing-gradient problem as the binary case.
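A finite-difference check of this gradient, on toy logits (values are illustrative):

```python
import math

def softmax(z):
    m = max(z)  # subtract max for numerical stability
    e = [math.exp(zi - m) for zi in z]
    s = sum(e)
    return [ei / s for ei in e]

def ce_loss(z, y):
    """Cross-entropy of one-hot target y against softmax(z)."""
    p = softmax(z)
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p))

# Check dL/dz_k = p_k - y_k by central finite differences.
z = [2.0, -1.0, 0.5]
y = [0, 1, 0]
eps = 1e-6
p = softmax(z)
for k in range(3):
    z_plus = z[:]; z_plus[k] += eps
    z_minus = z[:]; z_minus[k] -= eps
    num_grad = (ce_loss(z_plus, y) - ce_loss(z_minus, y)) / (2 * eps)
    assert abs(num_grad - (p[k] - y[k])) < 1e-6
```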
Side-by-Side Comparison
| Property | Cross-Entropy | MSE |
|---|---|---|
| Formula | $-\sum_x p(x) \log q(x)$ | $\frac{1}{n} \sum_i (y_i - \hat{y}_i)^2$ |
| Probabilistic model | Bernoulli / Categorical | Gaussian |
| Natural pairing | Classification (sigmoid, softmax) | Regression (linear output) |
| Gradient through sigmoid | $\hat{p} - y$ (no saturation) | $2(\hat{p} - y)\,\sigma'(z)$ (saturates) |
| Info-theoretic meaning | $H(p) + D_{\mathrm{KL}}(p \,\Vert\, q)$ | None (raw squared error) |
| Sensitivity to outliers | Moderate (log penalty) | High (quadratic penalty) |
| Output range assumed | Probabilities in $[0, 1]$ | Any real number |
| Calibration | Encourages calibrated probabilities | Does not directly optimize calibration |
| Convexity (in logits) | Convex | Non-convex through sigmoid |
When Each Wins
Cross-entropy wins: classification
For any task where the output is a probability distribution over classes, cross-entropy is the default. This includes binary classification, multiclass classification, language modeling (next-token prediction), and any setting where the target is a discrete distribution.
MSE wins: regression with continuous targets
When predicting a continuous value (price, temperature, distance), MSE is natural. The Gaussian likelihood assumption is reasonable for many real-valued targets, and the quadratic penalty appropriately weights large errors.
MSE wins: autoencoders and reconstruction
In variational autoencoders and image reconstruction tasks, the decoder often predicts pixel values. MSE (or its normalized variant) is appropriate because pixel intensities are continuous and Gaussian noise is a reasonable model for reconstruction error.
Cross-entropy wins: knowledge distillation
When training a student network to match a teacher's soft probability distribution, cross-entropy (or KL divergence, which differs only by a constant) is the correct loss. MSE on logits is sometimes used as an approximation, but it does not properly weight the tails of the distribution.
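A sketch of the distillation objective under these definitions; the temperature value and names are illustrative, not reference code from the paper:

```python
import math

# Distillation loss in the spirit of Hinton et al. (2015): cross-entropy
# between teacher and student distributions, both softened by temperature T.
def softmax(z, T=1.0):
    m = max(z)
    e = [math.exp((zi - m) / T) for zi in z]
    s = sum(e)
    return [ei / s for ei in e]

def distill_loss(teacher_logits, student_logits, T=2.0):
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))
```

By Gibbs' inequality the loss is minimized exactly when the student's soft distribution matches the teacher's, which is why matching soft targets with cross-entropy (or equivalently KL) is the principled choice.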
Where Each Fails
MSE fails at classification
Beyond gradient saturation, MSE treats all errors equally in output space. For a true label of 1, a prediction of $\hat{p} = 0.49$ gets almost the same MSE loss as $\hat{p} = 0.51$ (0.2601 vs 0.2401), even though the two sit on opposite sides of the decision boundary. And as a prediction becomes confidently wrong, MSE barely reacts: moving from $\hat{p} = 0.1$ to $\hat{p} = 0.01$ raises the MSE loss only from 0.81 to 0.98, while cross-entropy correctly doubles, from $-\log 0.1 \approx 2.3$ to $-\log 0.01 \approx 4.6$.
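These claims can be checked directly in plain Python (with $y = 1$ throughout, and illustrative probability values):

```python
import math

# Per-example losses for a single prediction p against binary label y.
def mse_loss(y, p):
    return (y - p) ** 2

def ce_loss(y, p):
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

# Opposite sides of the decision boundary, nearly identical MSE.
# Confident mistakes: CE doubles from p = 0.1 to p = 0.01, MSE moves
# only from 0.81 to 0.98.
```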
Cross-entropy fails with noisy labels
Cross-entropy drives the model to assign probability 1 to the given label. With noisy or incorrect labels, this causes overfitting to noise. Label smoothing (replacing the hard one-hot target with the mixture $(1 - \epsilon)\, y + \epsilon / K$ for $K$ classes) mitigates this, but it changes the loss from pure cross-entropy to a smoothed variant.
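A one-line sketch of the smoothing map (the helper name and $\epsilon$ value are illustrative):

```python
# Label smoothing: mix the one-hot target with the uniform distribution,
# y' = (1 - eps) * y + eps / K over K classes.
def smooth_labels(one_hot, eps=0.1):
    K = len(one_hot)
    return [(1.0 - eps) * y + eps / K for y in one_hot]

# The smoothed target still sums to 1, but no class gets probability 1.
```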
MSE is sensitive to outliers
The quadratic penalty means a single outlier with large $|y - \hat{y}|$ can dominate the loss. Robust alternatives include the Huber loss (quadratic for small errors, linear for large errors) and quantile regression.
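A sketch of the Huber loss as just described (the threshold value `delta` is illustrative):

```python
# Huber loss: quadratic for |error| <= delta, linear beyond, so a single
# large outlier contributes linearly rather than quadratically.
def huber(y, y_hat, delta=1.0):
    e = abs(y - y_hat)
    if e <= delta:
        return 0.5 * e ** 2
    return delta * (e - 0.5 * delta)
```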
Common Confusions
Cross-entropy is not only for classification
Cross-entropy applies whenever you are comparing probability distributions. It is used in language modeling, generative models, variational inference (via KL divergence), and density estimation. The binary classification case is just the most common application.
MSE on probabilities is not the same as Brier score in all contexts
The Brier score is $\frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i)^2$, where $y_i \in \{0, 1\}$. This looks like MSE applied to probabilities, and it is a proper scoring rule. However, using MSE as a training loss through a sigmoid still has the gradient saturation problem. The Brier score is useful for evaluating calibration, not for training.
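The score as defined above, as a small helper (the name is standard, the implementation illustrative):

```python
# Brier score: mean squared difference between predicted probabilities
# and binary outcomes. Lower is better; 0 means perfect sharp predictions.
def brier_score(y_true, p_pred):
    n = len(y_true)
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / n
```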
Minimizing cross-entropy does not guarantee calibration
A model can achieve low cross-entropy while being poorly calibrated. Cross-entropy rewards correct ranking (assigning higher probability to the correct class) but does not enforce that predicted probabilities match empirical frequencies. Post-hoc calibration methods like Platt scaling or temperature scaling are often needed.
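A minimal sketch of temperature scaling; in practice the scalar $T$ is fit on a held-out validation set, and the value used here is illustrative:

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(zi - m) for zi in z]
    s = sum(e)
    return [ei / s for ei in e]

# Temperature scaling (Guo et al., 2017): divide logits by T > 1 to soften
# overconfident probabilities. Ranking of classes is unchanged, since a
# monotone rescaling of the logits preserves their order.
def temperature_scale(logits, T):
    return softmax([z / T for z in logits])
```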
Log loss and cross-entropy are the same thing
In binary classification, log loss, binary cross-entropy, and negative log-likelihood of a Bernoulli model are all identical. The different names come from different communities (information theory, machine learning, statistics) but the mathematical expression is the same.
References
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Sections 4.3.2 (cross-entropy for classification) and 1.2.5 (MLE and squared error).
- Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. Section 6.2.2 (cost functions for maximum likelihood, gradient saturation analysis).
- Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Chapter 5.4 (loss functions and their probabilistic interpretations).
- Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory, 2nd ed. Wiley. Chapter 2 (entropy, cross-entropy, KL divergence).
- Hinton, G., Vinyals, O., and Dean, J. (2015). "Distilling the Knowledge in a Neural Network." arXiv:1503.02531. (Cross-entropy and KL divergence for knowledge distillation.)
- Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017). "On Calibration of Modern Neural Networks." ICML 2017. (Why cross-entropy training does not guarantee calibration.)
- Huber, P. J. (1964). "Robust estimation of a location parameter." Annals of Mathematical Statistics, 35(1), 73-101. (Huber loss as a robust alternative to MSE.)