

Cross-Entropy vs. MSE Loss

Cross-entropy is the natural loss for classification because it equals the negative log-likelihood of a Bernoulli or categorical model, produces strong gradients even when the model is confidently wrong, and decomposes as entropy plus KL divergence. MSE is the natural loss for regression, corresponding to Gaussian likelihood, but causes gradient saturation when paired with sigmoid or softmax outputs.

What Each Does

Mean Squared Error (MSE, L2 loss) measures the average squared difference between predictions and targets:

$$\mathcal{L}_{\text{MSE}} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2$$

For a single example with target $y$ and prediction $\hat{y}$, the loss is $(y - \hat{y})^2$. MSE penalizes large errors quadratically.

Cross-Entropy (CE) measures the dissimilarity between the predicted probability distribution $\hat{p}$ and the true distribution $p$:

$$\mathcal{L}_{\text{CE}} = -\sum_{k=1}^K p(k) \log \hat{p}(k)$$

For binary classification with true label $y \in \{0, 1\}$ and predicted probability $\hat{p}$:

$$\mathcal{L}_{\text{BCE}} = -\left[y \log \hat{p} + (1 - y) \log(1 - \hat{p})\right]$$
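Both formulas can be checked directly in plain Python. The function names and example values below are illustrative, not from any library:

```python
import math

def mse(y, y_hat):
    """Squared error for a single example."""
    return (y - y_hat) ** 2

def bce(y, p_hat):
    """Binary cross-entropy for a single example, y in {0, 1}."""
    return -(y * math.log(p_hat) + (1 - y) * math.log(1 - p_hat))

# A confident correct prediction is cheap under both losses...
print(mse(1.0, 0.9))   # 0.01
print(bce(1, 0.9))     # ~0.105
# ...but BCE punishes a confident mistake far more harshly than MSE.
print(mse(1.0, 0.01))  # ~0.98
print(bce(1, 0.01))    # ~4.61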

Probabilistic Interpretations

Both losses arise naturally as negative log-likelihoods under specific distributional assumptions.

MSE = Gaussian likelihood. If $y_i \sim \mathcal{N}(\hat{y}_i, \sigma^2)$, the negative log-likelihood (ignoring constants) is proportional to $\sum_i (y_i - \hat{y}_i)^2$. Minimizing MSE is equivalent to maximum likelihood estimation under Gaussian noise. This is why MSE is the canonical loss for regression.

Cross-entropy = Bernoulli/categorical likelihood. If $y_i \sim \text{Bernoulli}(\hat{p}_i)$ for binary classification, or $y_i \sim \text{Categorical}(\hat{p}_i)$ for multiclass, the negative log-likelihood is exactly the cross-entropy. Minimizing cross-entropy is MLE for the class probabilities.
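The Bernoulli case of this equivalence is short enough to verify numerically (a sanity-check sketch, not library code):

```python
import math

def bce(y, p_hat):
    """Binary cross-entropy for a single example."""
    return -(y * math.log(p_hat) + (1 - y) * math.log(1 - p_hat))

def bernoulli_nll(y, p_hat):
    # Likelihood of y under Bernoulli(p_hat): p_hat if y == 1, else 1 - p_hat
    likelihood = p_hat if y == 1 else 1 - p_hat
    return -math.log(likelihood)

# The two quantities agree for every label/probability pair
for y in (0, 1):
    for p in (0.1, 0.5, 0.9):
        assert abs(bce(y, p) - bernoulli_nll(y, p)) < 1e-12
```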

The Information-Theoretic Decomposition

Cross-entropy decomposes into two terms:

$$H(p, \hat{p}) = H(p) + D_{\text{KL}}(p \| \hat{p})$$

where $H(p) = -\sum_k p(k) \log p(k)$ is the entropy of the true distribution and $D_{\text{KL}}(p \| \hat{p}) = \sum_k p(k) \log \frac{p(k)}{\hat{p}(k)}$ is the KL divergence from $\hat{p}$ to $p$.

Since $H(p)$ is constant with respect to the model parameters, minimizing cross-entropy is equivalent to minimizing the KL divergence between the true and predicted distributions. This gives cross-entropy a clean information-theoretic interpretation: it measures how many extra bits (or nats) are needed to encode samples from $p$ using the code optimized for $\hat{p}$.

MSE has no analogous decomposition. It measures raw prediction error, not distributional mismatch.
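The decomposition can be confirmed numerically on any pair of distributions (the three-class distributions below are arbitrary example values):

```python
import math

p     = [0.7, 0.2, 0.1]   # true distribution (example values)
p_hat = [0.5, 0.3, 0.2]   # predicted distribution (example values)

cross_entropy = -sum(pk * math.log(qk) for pk, qk in zip(p, p_hat))
entropy       = -sum(pk * math.log(pk) for pk in p)
kl            =  sum(pk * math.log(pk / qk) for pk, qk in zip(p, p_hat))

# H(p, p_hat) = H(p) + KL(p || p_hat), up to floating-point error
assert abs(cross_entropy - (entropy + kl)) < 1e-12
```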

Why Cross-Entropy Beats MSE for Classification

Gradient saturation with MSE on sigmoids

Consider binary classification with a sigmoid output $\hat{p} = \sigma(z) = 1/(1 + e^{-z})$. The gradient of MSE with respect to the logit $z$ is:

$$\frac{\partial \mathcal{L}_{\text{MSE}}}{\partial z} = 2(\hat{p} - y) \cdot \sigma'(z) = 2(\hat{p} - y) \cdot \hat{p}(1 - \hat{p})$$

When the model is confidently wrong (e.g., $y = 1$ but $\hat{p} \approx 0$), the factor $\hat{p}(1 - \hat{p}) \approx 0$ makes the gradient vanish. The model is maximally wrong but receives almost no learning signal.

The gradient of cross-entropy with respect to $z$ is:

$$\frac{\partial \mathcal{L}_{\text{CE}}}{\partial z} = \hat{p} - y$$

The sigmoid derivative cancels with the log in cross-entropy, leaving a clean gradient proportional to the error. When the model is confidently wrong, $|\hat{p} - y| \approx 1$, giving the strongest possible gradient. This is why cross-entropy trains classification networks faster and more reliably.
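The contrast between the two gradients is easy to see numerically at a confidently wrong prediction (a small sketch with hand-derived gradient formulas):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def mse_grad(z, y):
    """d/dz of (sigmoid(z) - y)^2: the sigmoid derivative appears as a factor."""
    p = sigmoid(z)
    return 2 * (p - y) * p * (1 - p)

def ce_grad(z, y):
    """d/dz of BCE(y, sigmoid(z)): the sigmoid derivative cancels."""
    return sigmoid(z) - y

# Confidently wrong: y = 1 but the logit is very negative, so p_hat ~ 0.
z, y = -10.0, 1.0
print(mse_grad(z, y))  # ~ -9e-5: vanishing signal despite maximal error
print(ce_grad(z, y))   # ~ -1.0:  full-strength learning signal
```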

The same problem occurs with softmax

For multiclass classification with softmax outputs, the gradient of cross-entropy with respect to logit $z_k$ is $\hat{p}_k - \mathbf{1}[y = k]$: clean, linear in the error, no saturation. MSE through softmax suffers the same vanishing-gradient problem as the binary case.
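The closed-form softmax gradient can be checked against central finite differences (illustrative logits and helper names, no framework assumed):

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def ce_loss(z, y):
    """Cross-entropy of true class index y against softmax(z)."""
    return -math.log(softmax(z)[y])

z, y = [2.0, -1.0, 0.5], 0
# Closed form: gradient w.r.t. logit k is p_hat_k - 1[y == k]
analytic = [pk - (1.0 if k == y else 0.0) for k, pk in enumerate(softmax(z))]

# Compare against a central finite-difference estimate
eps = 1e-6
for k in range(len(z)):
    z_plus  = z[:]; z_plus[k]  += eps
    z_minus = z[:]; z_minus[k] -= eps
    numeric = (ce_loss(z_plus, y) - ce_loss(z_minus, y)) / (2 * eps)
    assert abs(numeric - analytic[k]) < 1e-5
```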

Side-by-Side Comparison

| Property | Cross-Entropy | MSE |
|---|---|---|
| Formula | $-\sum_k p(k) \log \hat{p}(k)$ | $\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$ |
| Probabilistic model | Bernoulli / categorical | Gaussian |
| Natural pairing | Classification (sigmoid, softmax) | Regression (linear output) |
| Gradient through sigmoid | $\hat{p} - y$ (no saturation) | $2(\hat{p} - y)\hat{p}(1 - \hat{p})$ (saturates) |
| Info-theoretic meaning | $H(p) + D_{\text{KL}}(p \,\Vert\, \hat{p})$ | None (raw squared error) |
| Sensitivity to outliers | Moderate (log penalty) | High (quadratic penalty) |
| Output range assumed | Probabilities in $[0, 1]$ | Any real number |
| Calibration | Encourages calibrated probabilities | Does not directly optimize calibration |
| Convexity (in logits) | Convex | Non-convex through sigmoid |

When Each Wins

Cross-entropy wins: classification

For any task where the output is a probability distribution over classes, cross-entropy is the default. This includes binary classification, multiclass classification, language modeling (next-token prediction), and any setting where the target is a discrete distribution.

MSE wins: regression with continuous targets

When predicting a continuous value (price, temperature, distance), MSE is natural. The Gaussian likelihood assumption is reasonable for many real-valued targets, and the quadratic penalty appropriately weights large errors.

MSE wins: autoencoders and reconstruction

In variational autoencoders and image reconstruction tasks, the decoder often predicts pixel values. MSE (or its normalized variant) is appropriate because pixel intensities are continuous and Gaussian noise is a reasonable model for reconstruction error.

Cross-entropy wins: knowledge distillation

When training a student network to match a teacher's soft probability distribution, cross-entropy (or KL divergence, which differs only by a constant) is the correct loss. MSE on logits is sometimes used as an approximation, but it does not properly weight the tails of the distribution.

Where Each Fails

MSE fails at classification

Beyond gradient saturation, MSE treats all errors equally in output space. A prediction of $\hat{p} = 0.49$ for a true label of 1 gets almost the same MSE loss as $\hat{p} = 0.51$, but these correspond to opposite classifications. Cross-entropy correctly assigns much higher loss to $\hat{p} = 0.01$ than to $\hat{p} = 0.49$ for $y = 1$.
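Plugging in those three probabilities makes the asymmetry concrete (a quick sketch in plain Python):

```python
import math

y = 1  # true label

for p_hat in (0.49, 0.51, 0.01):
    mse = (y - p_hat) ** 2
    ce  = -math.log(p_hat)  # BCE reduces to -log(p_hat) when y = 1
    print(f"p_hat={p_hat}: MSE={mse:.3f}, CE={ce:.3f}")

# MSE rates p_hat=0.01 only ~3.8x worse than p_hat=0.49;
# CE rates it ~6.5x worse, on a scale that grows without bound as p_hat -> 0.
```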

Cross-entropy fails with noisy labels

Cross-entropy drives the model to assign probability 1 to the given label. With noisy or incorrect labels, this causes overfitting to noise. Label smoothing (replacing hard targets with a mixture like $0.9 \cdot \mathbf{1}[y=k] + 0.1/K$) mitigates this, but it changes the loss from pure cross-entropy to a smoothed variant.
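The smoothed target construction is a one-liner; the helper name below is hypothetical, and `eps=0.1` matches the mixture in the text:

```python
def smoothed_targets(y, num_classes, eps=0.1):
    """Replace a hard one-hot target with (1 - eps) * one_hot + eps / K."""
    uniform = eps / num_classes
    return [uniform + (1 - eps if k == y else 0.0) for k in range(num_classes)]

# The true class keeps most of the mass; the remainder is spread uniformly,
# so the model is never pushed to output probability exactly 1.
targets = smoothed_targets(y=2, num_classes=4, eps=0.1)
print(targets)
```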

MSE is sensitive to outliers

The quadratic penalty means a single outlier with large $|y_i - \hat{y}_i|$ can dominate the loss. Robust alternatives include the Huber loss (quadratic for small errors, linear for large errors) and quantile regression.
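A minimal sketch of the Huber loss shows how it caps outlier influence (the `0.5` scaling follows the standard definition; `delta=1.0` is an arbitrary threshold):

```python
def huber(error, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond: caps outlier influence."""
    if abs(error) <= delta:
        return 0.5 * error ** 2
    return delta * (abs(error) - 0.5 * delta)

print(huber(0.5))   # 0.125: matches 0.5 * error^2 for small errors
print(huber(10.0))  # 9.5:   grows linearly, vs 50.0 for 0.5 * error^2
```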

Common Confusions

Watch Out

Cross-entropy is not only for classification

Cross-entropy applies whenever you are comparing probability distributions. It is used in language modeling, generative models, variational inference (via KL divergence), and density estimation. The binary classification case is just the most common application.

Watch Out

MSE on probabilities is not the same as Brier score in all contexts

The Brier score is $\frac{1}{n}\sum_i (\hat{p}_i - y_i)^2$ where $y_i \in \{0, 1\}$. This looks like MSE applied to probabilities, and it is a proper scoring rule. However, using MSE as a training loss through a sigmoid still has the gradient saturation problem. The Brier score is useful for evaluating calibration, not for training.
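As an evaluation metric the Brier score is trivial to compute; the forecaster comparison below uses made-up example data:

```python
def brier_score(p_hats, ys):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(p_hats, ys)) / len(ys)

# A well-calibrated forecaster vs an overconfident one on the same outcomes
print(brier_score([0.8, 0.8, 0.8, 0.8, 0.2], [1, 1, 1, 1, 0]))  # ~0.04
print(brier_score([1.0, 1.0, 1.0, 1.0, 1.0], [1, 1, 1, 1, 0]))  # ~0.20
```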

Watch Out

Minimizing cross-entropy does not guarantee calibration

A model can achieve low cross-entropy while being poorly calibrated. Cross-entropy rewards correct ranking (assigning higher probability to the correct class) but does not enforce that predicted probabilities match empirical frequencies. Post-hoc calibration methods like Platt scaling or temperature scaling are often needed.
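Temperature scaling is a single-parameter transform of the logits. In practice $T$ is fit on a held-out set by minimizing NLL; the sketch below only shows the transform itself, with illustrative logits:

```python
import math

def softmax_with_temperature(z, T=1.0):
    """Divide logits by T before softmax; T > 1 softens overconfident outputs."""
    scaled = [v / T for v in z]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    s = sum(exps)
    return [e / s for e in exps]

z = [4.0, 1.0, 0.0]
print(softmax_with_temperature(z, T=1.0))  # sharply peaked
print(softmax_with_temperature(z, T=2.0))  # softer probabilities
```

Because dividing all logits by the same positive constant preserves their ordering, the argmax prediction (and hence accuracy) is unchanged; only the confidence is adjusted.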

Watch Out

Log loss and cross-entropy are the same thing

In binary classification, log loss, binary cross-entropy, and the negative log-likelihood of a Bernoulli model are all identical. The different names come from different communities (machine learning, information theory, and statistics, respectively), but the mathematical expression is the same.

References

  1. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Sections 4.3.2 (cross-entropy for classification) and 1.2.5 (MLE and squared error).
  2. Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. Section 6.2.2 (cost functions for maximum likelihood, gradient saturation analysis).
  3. Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Chapter 5.4 (loss functions and their probabilistic interpretations).
  4. Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory, 2nd ed. Wiley. Chapter 2 (entropy, cross-entropy, KL divergence).
  5. Hinton, G., Vinyals, O., and Dean, J. (2015). "Distilling the Knowledge in a Neural Network." arXiv:1503.02531. (Cross-entropy and KL divergence for knowledge distillation.)
  6. Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017). "On Calibration of Modern Neural Networks." ICML 2017. (Why cross-entropy training does not guarantee calibration.)
  7. Huber, P. J. (1964). "Robust estimation of a location parameter." Annals of Mathematical Statistics, 35(1), 73-101. (Huber loss as a robust alternative to MSE.)