

Cross-Entropy vs. MSE Loss

Cross-entropy is the natural loss for classification because it equals the negative log-likelihood of a Bernoulli or categorical model, produces strong gradients even when the model is confidently wrong, and decomposes as entropy plus KL divergence. MSE is the natural loss for regression, corresponding to Gaussian likelihood, but causes gradient saturation when paired with sigmoid or softmax outputs.

What Each Does

Mean Squared Error (MSE, L2 loss) measures the average squared difference between predictions and targets:

$$\mathcal{L}_{\text{MSE}} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2$$

For a single example with target $y$ and prediction $\hat{y}$, the loss is $(y - \hat{y})^2$. MSE penalizes large errors quadratically.

Cross-Entropy (CE) measures the dissimilarity between the predicted probability distribution $\hat{p}$ and the true distribution $p$:

$$\mathcal{L}_{\text{CE}} = -\sum_{k=1}^K p(k) \log \hat{p}(k)$$

For binary classification with true label $y \in \{0, 1\}$ and predicted probability $\hat{p}$:

$$\mathcal{L}_{\text{BCE}} = -\left[y \log \hat{p} + (1 - y) \log(1 - \hat{p})\right]$$
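Both formulas can be checked directly in plain Python. The function names and example values below are illustrative, not from any library:

```python
import math

def mse(y, y_hat):
    """Squared error for a single example."""
    return (y - y_hat) ** 2

def bce(y, p_hat):
    """Binary cross-entropy for a single example, y in {0, 1}."""
    return -(y * math.log(p_hat) + (1 - y) * math.log(1 - p_hat))

# A confident correct prediction is cheap under both losses...
print(mse(1.0, 0.9))   # 0.01
print(bce(1, 0.9))     # ~0.105
# ...but BCE punishes a confident mistake far more harshly than MSE.
print(mse(1.0, 0.01))  # ~0.98
print(bce(1, 0.01))    # ~4.61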

Probabilistic Interpretations

Both losses arise naturally as negative log-likelihoods under specific distributional assumptions.

MSE = Gaussian likelihood. If $y_i \sim \mathcal{N}(\hat{y}_i, \sigma^2)$, the negative log-likelihood (ignoring constants) is proportional to $\sum_i (y_i - \hat{y}_i)^2$. Minimizing MSE is equivalent to maximum likelihood estimation under Gaussian noise. This is why MSE is the canonical loss for regression.

Cross-entropy = Bernoulli/categorical likelihood. If $y_i \sim \text{Bernoulli}(\hat{p}_i)$ for binary classification, or $y_i \sim \text{Categorical}(\hat{p}_i)$ for multiclass, the negative log-likelihood is exactly the cross-entropy. Minimizing cross-entropy is MLE for the class probabilities.
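The Bernoulli case of this equivalence is short enough to verify numerically (a sanity-check sketch, not library code):

```python
import math

def bce(y, p_hat):
    """Binary cross-entropy for a single example."""
    return -(y * math.log(p_hat) + (1 - y) * math.log(1 - p_hat))

def bernoulli_nll(y, p_hat):
    # Likelihood of y under Bernoulli(p_hat): p_hat if y == 1, else 1 - p_hat
    likelihood = p_hat if y == 1 else 1 - p_hat
    return -math.log(likelihood)

# The two quantities agree for every label/probability pair
for y in (0, 1):
    for p in (0.1, 0.5, 0.9):
        assert abs(bce(y, p) - bernoulli_nll(y, p)) < 1e-12
```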

The Information-Theoretic Decomposition

Cross-entropy decomposes into two terms:

$$H(p, \hat{p}) = H(p) + D_{\text{KL}}(p \| \hat{p})$$

where $H(p) = -\sum_k p(k) \log p(k)$ is the entropy of the true distribution and $D_{\text{KL}}(p \| \hat{p}) = \sum_k p(k) \log \frac{p(k)}{\hat{p}(k)}$ is the KL divergence from $\hat{p}$ to $p$.

Since $H(p)$ is constant with respect to the model parameters, minimizing cross-entropy is equivalent to minimizing the KL divergence between the true and predicted distributions. This gives cross-entropy a clean information-theoretic interpretation: it measures how many extra bits (or nats) are needed to encode samples from $p$ using the code optimized for $\hat{p}$.

MSE has no analogous decomposition. It measures raw prediction error, not distributional mismatch.
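The decomposition can be confirmed numerically on any pair of distributions (the three-class distributions below are arbitrary example values):

```python
import math

p     = [0.7, 0.2, 0.1]   # true distribution (example values)
p_hat = [0.5, 0.3, 0.2]   # predicted distribution (example values)

cross_entropy = -sum(pk * math.log(qk) for pk, qk in zip(p, p_hat))
entropy       = -sum(pk * math.log(pk) for pk in p)
kl            =  sum(pk * math.log(pk / qk) for pk, qk in zip(p, p_hat))

# H(p, p_hat) = H(p) + KL(p || p_hat), up to floating-point error
assert abs(cross_entropy - (entropy + kl)) < 1e-12
```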

Why Cross-Entropy Beats MSE for Classification

Gradient saturation with MSE on sigmoids

Consider binary classification with a sigmoid output $\hat{p} = \sigma(z) = 1/(1 + e^{-z})$. The gradient of MSE with respect to the logit $z$ is:

$$\frac{\partial \mathcal{L}_{\text{MSE}}}{\partial z} = 2(\hat{p} - y) \cdot \sigma'(z) = 2(\hat{p} - y) \cdot \hat{p}(1 - \hat{p})$$

When the model is confidently wrong (e.g., $y = 1$ but $\hat{p} \approx 0$), the factor $\hat{p}(1 - \hat{p}) \approx 0$ makes the gradient vanish. The model is maximally wrong but receives almost no learning signal.

The gradient of cross-entropy with respect to $z$ is:

$$\frac{\partial \mathcal{L}_{\text{CE}}}{\partial z} = \hat{p} - y$$

The sigmoid derivative cancels with the log in cross-entropy, leaving a clean gradient proportional to the error. When the model is confidently wrong, $|\hat{p} - y| \approx 1$, giving the strongest possible gradient. This is why cross-entropy trains classification networks faster and more reliably.
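The contrast between the two gradients is easy to see numerically at a confidently wrong prediction (a small sketch with hand-derived gradient formulas):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def mse_grad(z, y):
    """d/dz of (sigmoid(z) - y)^2: the sigmoid derivative appears as a factor."""
    p = sigmoid(z)
    return 2 * (p - y) * p * (1 - p)

def ce_grad(z, y):
    """d/dz of BCE(y, sigmoid(z)): the sigmoid derivative cancels."""
    return sigmoid(z) - y

# Confidently wrong: y = 1 but the logit is very negative, so p_hat ~ 0.
z, y = -10.0, 1.0
print(mse_grad(z, y))  # ~ -9e-5: vanishing signal despite maximal error
print(ce_grad(z, y))   # ~ -1.0:  full-strength learning signal
```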

The same problem occurs with softmax

For multiclass classification with softmax outputs, the gradient of cross-entropy with respect to logit $z_k$ is $\hat{p}_k - \mathbf{1}[y = k]$: clean, linear in the error, no saturation. MSE through softmax suffers the same vanishing-gradient problem as the binary case.
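The closed-form softmax gradient can be checked against central finite differences (illustrative logits and helper names, no framework assumed):

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def ce_loss(z, y):
    """Cross-entropy of true class index y against softmax(z)."""
    return -math.log(softmax(z)[y])

z, y = [2.0, -1.0, 0.5], 0
# Closed form: gradient w.r.t. logit k is p_hat_k - 1[y == k]
analytic = [pk - (1.0 if k == y else 0.0) for k, pk in enumerate(softmax(z))]

# Compare against a central finite-difference estimate
eps = 1e-6
for k in range(len(z)):
    z_plus  = z[:]; z_plus[k]  += eps
    z_minus = z[:]; z_minus[k] -= eps
    numeric = (ce_loss(z_plus, y) - ce_loss(z_minus, y)) / (2 * eps)
    assert abs(numeric - analytic[k]) < 1e-5
```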

Side-by-Side Comparison

| Property | Cross-Entropy | MSE |
|---|---|---|
| Formula | $-\sum_k p(k) \log \hat{p}(k)$ | $\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$ |
| Probabilistic model | Bernoulli / categorical | Gaussian |
| Natural pairing | Classification (sigmoid, softmax) | Regression (linear output) |
| Gradient through sigmoid | $\hat{p} - y$ (no saturation) | $2(\hat{p} - y)\hat{p}(1 - \hat{p})$ (saturates) |
| Info-theoretic meaning | $H(p) + D_{\text{KL}}(p \,\Vert\, \hat{p})$ | None (raw squared error) |
| Sensitivity to outliers | Moderate (log penalty) | High (quadratic penalty) |
| Output range assumed | Probabilities in $[0, 1]$ | Any real number |
| Calibration | Encourages calibrated probabilities | Does not directly optimize calibration |
| Convexity (in logits) | Convex | Non-convex through sigmoid |

When Each Wins

Cross-entropy wins: classification

For any task where the output is a probability distribution over classes, cross-entropy is the default. This includes binary classification, multiclass classification, language modeling (next-token prediction), and any setting where the target is a discrete distribution.

MSE wins: regression with continuous targets

When predicting a continuous value (price, temperature, distance), MSE is natural. The Gaussian likelihood assumption is reasonable for many real-valued targets, and the quadratic penalty appropriately weights large errors.

MSE wins: autoencoders and reconstruction

In variational autoencoders and image reconstruction tasks, the decoder often predicts pixel values. MSE (or its normalized variant) is appropriate because pixel intensities are continuous and Gaussian noise is a reasonable model for reconstruction error.

Cross-entropy wins: knowledge distillation

When training a student network to match a teacher's soft probability distribution, cross-entropy (or KL divergence, which differs only by a constant) is the correct loss. MSE on logits is sometimes used as an approximation, but it does not properly weight the tails of the distribution.

Where Each Fails

MSE fails at classification

Beyond gradient saturation, MSE treats all errors equally in output space. A prediction of $\hat{p} = 0.49$ for a true label of 1 gets almost the same MSE loss as $\hat{p} = 0.51$, but these correspond to opposite classifications. Cross-entropy correctly assigns much higher loss to $\hat{p} = 0.01$ than to $\hat{p} = 0.49$ for $y = 1$.
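Plugging in those three probabilities makes the asymmetry concrete (a quick sketch in plain Python):

```python
import math

y = 1  # true label

for p_hat in (0.49, 0.51, 0.01):
    mse = (y - p_hat) ** 2
    ce  = -math.log(p_hat)  # BCE reduces to -log(p_hat) when y = 1
    print(f"p_hat={p_hat}: MSE={mse:.3f}, CE={ce:.3f}")

# MSE rates p_hat=0.01 only ~3.8x worse than p_hat=0.49;
# CE rates it ~6.5x worse, on a scale that grows without bound as p_hat -> 0.
```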

Cross-entropy fails with noisy labels

Cross-entropy drives the model to assign probability 1 to the given label. With noisy or incorrect labels, this causes overfitting to noise. Label smoothing (replacing hard targets with a mixture like $0.9 \cdot \mathbf{1}[y=k] + 0.1/K$) mitigates this, but it changes the loss from pure cross-entropy to a smoothed variant.
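The smoothed target construction is a one-liner; the helper name below is hypothetical, and `eps=0.1` matches the mixture in the text:

```python
def smoothed_targets(y, num_classes, eps=0.1):
    """Replace a hard one-hot target with (1 - eps) * one_hot + eps / K."""
    uniform = eps / num_classes
    return [uniform + (1 - eps if k == y else 0.0) for k in range(num_classes)]

# The true class keeps most of the mass; the remainder is spread uniformly,
# so the model is never pushed to output probability exactly 1.
targets = smoothed_targets(y=2, num_classes=4, eps=0.1)
print(targets)
```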

MSE is sensitive to outliers

The quadratic penalty means a single outlier with large $|y_i - \hat{y}_i|$ can dominate the loss. Robust alternatives include the Huber loss (quadratic for small errors, linear for large errors) and quantile regression.
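A minimal sketch of the Huber loss shows how it caps outlier influence (the `0.5` scaling follows the standard definition; `delta=1.0` is an arbitrary threshold):

```python
def huber(error, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond: caps outlier influence."""
    if abs(error) <= delta:
        return 0.5 * error ** 2
    return delta * (abs(error) - 0.5 * delta)

print(huber(0.5))   # 0.125: matches 0.5 * error^2 for small errors
print(huber(10.0))  # 9.5:   grows linearly, vs 50.0 for 0.5 * error^2
```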

Common Confusions

Watch Out

Cross-entropy is not only for classification

Cross-entropy applies whenever you are comparing probability distributions. It is used in language modeling, generative models, variational inference (via KL divergence), and density estimation. The binary classification case is just the most common application.

Watch Out

MSE on probabilities is not the same as Brier score in all contexts

The Brier score is $\frac{1}{n}\sum_i (\hat{p}_i - y_i)^2$ where $y_i \in \{0, 1\}$. This looks like MSE applied to probabilities, and it is a proper scoring rule. However, using MSE as a training loss through a sigmoid still has the gradient saturation problem. The Brier score is useful for evaluating calibration, not for training.
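As an evaluation metric the Brier score is trivial to compute; the forecaster comparison below uses made-up example data:

```python
def brier_score(p_hats, ys):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(p_hats, ys)) / len(ys)

# A well-calibrated forecaster vs an overconfident one on the same outcomes
print(brier_score([0.8, 0.8, 0.8, 0.8, 0.2], [1, 1, 1, 1, 0]))  # ~0.04
print(brier_score([1.0, 1.0, 1.0, 1.0, 1.0], [1, 1, 1, 1, 0]))  # ~0.20
```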

Watch Out

Minimizing cross-entropy does not guarantee calibration

A model can achieve low cross-entropy while being poorly calibrated. Cross-entropy rewards correct ranking (assigning higher probability to the correct class) but does not enforce that predicted probabilities match empirical frequencies. Post-hoc calibration methods like Platt scaling or temperature scaling are often needed.
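Temperature scaling is a single-parameter transform of the logits. In practice $T$ is fit on a held-out set by minimizing NLL; the sketch below only shows the transform itself, with illustrative logits:

```python
import math

def softmax_with_temperature(z, T=1.0):
    """Divide logits by T before softmax; T > 1 softens overconfident outputs."""
    scaled = [v / T for v in z]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    s = sum(exps)
    return [e / s for e in exps]

z = [4.0, 1.0, 0.0]
print(softmax_with_temperature(z, T=1.0))  # sharply peaked
print(softmax_with_temperature(z, T=2.0))  # softer probabilities
```

Because dividing all logits by the same positive constant preserves their ordering, the argmax prediction (and hence accuracy) is unchanged; only the confidence is adjusted.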

Watch Out

Log loss and cross-entropy are the same thing

In binary classification, log loss, binary cross-entropy, and the negative log-likelihood of a Bernoulli model are all identical. The different names come from different communities (machine learning, information theory, and statistics, respectively), but the mathematical expression is the same.

References

  1. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Sections 4.3.2 (cross-entropy for classification) and 1.2.5 (MLE and squared error).
  2. Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. Section 6.2.2 (cost functions for maximum likelihood, gradient saturation analysis).
  3. Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Chapter 5.4 (loss functions and their probabilistic interpretations).
  4. Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory, 2nd ed. Wiley. Chapter 2 (entropy, cross-entropy, KL divergence).
  5. Hinton, G., Vinyals, O., and Dean, J. (2015). "Distilling the Knowledge in a Neural Network." arXiv:1503.02531. (Cross-entropy and KL divergence for knowledge distillation.)
  6. Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017). "On Calibration of Modern Neural Networks." ICML 2017. (Why cross-entropy training does not guarantee calibration.)
  7. Huber, P. J. (1964). "Robust estimation of a location parameter." Annals of Mathematical Statistics, 35(1), 73-101. (Huber loss as a robust alternative to MSE.)