

Loss Functions Catalog

A systematic catalog of loss functions for classification, regression, and representation learning: cross-entropy, MSE, hinge, Huber, focal, KL divergence, and contrastive loss.



Why This Matters

The loss function defines what "good" means for your model. Two models with identical architectures trained with different loss functions will learn different things. In many practical settings, switching the loss function improves performance more than switching the architecture. The choice of loss encodes your assumptions about the problem: noise distribution, outlier sensitivity, class balance, and what errors cost. See cross-entropy loss deep dive for a detailed treatment of the most common classification loss.

Mental Model

A loss function $\ell(\hat{y}, y)$ measures the cost of predicting $\hat{y}$ when the truth is $y$. Different losses penalize different types of errors. MSE penalizes large errors quadratically, making it the natural choice for linear regression under Gaussian noise. MAE penalizes all errors linearly (robust to outliers). Cross-entropy penalizes confident wrong classification predictions severely. The right loss depends on what errors matter in your application.

Classification Losses

Definition

Cross-Entropy Loss

For a classification problem with $K$ classes, the cross-entropy loss for a single example with true label $y$ (one-hot encoded) and predicted probabilities $p$ is:

$$L_{\text{CE}} = -\sum_{k=1}^{K} y_k \log p_k$$

For binary classification with $y \in \{0, 1\}$ and predicted probability $p$:

$$L_{\text{BCE}} = -y \log p - (1 - y) \log(1 - p)$$

Cross-entropy has a critical property: as $p_k \to 0$ for the true class, the loss goes to infinity. This severe penalty for confident wrong predictions drives the model to assign high probability to the correct class.
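As a quick numeric illustration, here is a minimal plain-Python sketch (the function name and the `eps` clipping are illustrative, not from any particular library):

```python
import math

def cross_entropy(p, true_class, eps=1e-12):
    """Cross-entropy for one example with a one-hot target:
    -log of the predicted probability of the true class.
    eps clips p away from 0 so the loss stays finite."""
    return -math.log(max(p[true_class], eps))

# Confident and correct: small loss.
loss_good = cross_entropy([0.1, 0.2, 0.7], true_class=2)  # -ln(0.7) ≈ 0.357
# Confident and wrong: large loss.
loss_bad = cross_entropy([0.7, 0.2, 0.1], true_class=2)   # -ln(0.1) ≈ 2.303
```

Note how the same margin of probability mass produces a much larger loss when it sits on the wrong class: the penalty grows without bound as the true-class probability approaches zero.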

Definition

Hinge Loss

For binary classification with $y \in \{-1, +1\}$ and raw prediction $f(x) \in \mathbb{R}$:

$$L_{\text{hinge}} = \max(0, 1 - y \cdot f(x))$$

The loss is zero when $y \cdot f(x) \geq 1$ (correct prediction with margin at least 1). This is the loss used by support vector machines.

Hinge loss does not require probability outputs and is not differentiable at $y \cdot f(x) = 1$. In practice, subgradient methods handle the non-differentiability.
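A one-line sketch makes the margin behavior concrete (illustrative helper, not a library function):

```python
def hinge(y, fx):
    """Hinge loss for a label y in {-1, +1} and raw score fx."""
    return max(0.0, 1.0 - y * fx)

hinge(+1, 2.0)   # 0.0: correct with margin >= 1, no penalty
hinge(+1, 0.5)   # 0.5: correct side, but inside the margin
hinge(+1, -1.0)  # 2.0: wrong side of the boundary
```

Correctly classified points beyond the margin contribute exactly zero, which is what makes the SVM solution depend only on the support vectors.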

Definition

Focal Loss

For binary classification with true class probability $p_t$ (the model's predicted probability for the true class):

$$L_{\text{focal}} = -(1 - p_t)^\gamma \log p_t$$

where $\gamma \geq 0$ is a focusing parameter. When $\gamma = 0$, this reduces to cross-entropy.

Focal loss down-weights easy examples (where $p_t$ is high). For $\gamma = 2$, an example with $p_t = 0.9$ gets weight $(0.1)^2 = 0.01$, while an example with $p_t = 0.1$ gets weight $(0.9)^2 = 0.81$. This concentrates learning on hard examples, which is critical for class-imbalanced problems like object detection, where 99%+ of candidates are background.
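The down-weighting is easy to verify numerically; a minimal sketch (illustrative function, not from the paper's reference code):

```python
import math

def focal(p_t, gamma):
    """Focal loss: cross-entropy scaled by (1 - p_t)^gamma."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# gamma = 0 recovers plain cross-entropy.
# gamma = 2 shrinks the easy example's loss far more than the hard one's:
easy = focal(0.9, gamma=2)  # 0.01 * -ln(0.9) ≈ 0.00105
hard = focal(0.1, gamma=2)  # 0.81 * -ln(0.1) ≈ 1.865
```

With $\gamma = 2$ the hard example contributes roughly 1800x more loss than the easy one, versus about 22x under plain cross-entropy.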

Regression Losses

Definition

Mean Squared Error

For a regression problem with prediction $\hat{y}$ and target $y$:

$$L_{\text{MSE}} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

MSE is the maximum likelihood estimator under a Gaussian noise model: $y = f(x) + \epsilon$ where $\epsilon \sim \mathcal{N}(0, \sigma^2)$.
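One consequence worth internalizing: the constant prediction that minimizes MSE is the sample mean, which is exactly why outliers pull MSE-trained estimates. A small plain-Python check (variable names are illustrative):

```python
def mse(preds, targets):
    """Mean squared error over a batch."""
    n = len(targets)
    return sum((y - yh) ** 2 for y, yh in zip(targets, preds)) / n

ys = [1.0, 2.0, 3.0, 10.0]        # 10.0 acts as an outlier
mean = sum(ys) / len(ys)          # 4.0, dragged up by the outlier
# The constant prediction minimizing MSE is the mean:
assert mse([mean] * 4, ys) <= mse([3.0] * 4, ys)
```

A single extreme target shifts the MSE-optimal constant from near the bulk of the data (around 2) to 4.0.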

Definition

Huber Loss

For a threshold $\delta > 0$:

$$L_\delta(r) = \begin{cases} \frac{1}{2}r^2 & \text{if } |r| \leq \delta \\ \delta|r| - \frac{1}{2}\delta^2 & \text{if } |r| > \delta \end{cases}$$

where $r = y - \hat{y}$ is the residual. Huber loss is quadratic for small errors and linear for large errors.

Huber loss combines the benefits of MSE (smooth, efficient for Gaussian errors) and MAE (robust to outliers). The parameter $\delta$ controls the transition. When $\delta$ is large, Huber approaches MSE. When $\delta$ is small, it approaches MAE.
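A direct transcription of the piecewise definition, assuming $\delta = 1$ as the default (illustrative helper):

```python
def huber(r, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond it."""
    if abs(r) <= delta:
        return 0.5 * r * r
    return delta * abs(r) - 0.5 * delta * delta

huber(0.5)   # 0.125, identical to 0.5 * r^2
huber(10.0)  # 9.5 in the linear zone; the quadratic 0.5 * r^2 would be 50
```

The $-\frac{1}{2}\delta^2$ offset in the linear branch makes the two pieces meet continuously (and with matching slope) at $|r| = \delta$.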

Divergence-Based Losses

Definition

KL Divergence Loss

The Kullback-Leibler divergence from distribution $q$ to distribution $p$ is:

$$D_{\text{KL}}(p \| q) = \sum_{k} p_k \log \frac{p_k}{q_k}$$

KL divergence is non-negative ($D_{\text{KL}} \geq 0$ by Gibbs' inequality) and equals zero if and only if $p = q$. It is not symmetric: $D_{\text{KL}}(p \| q) \neq D_{\text{KL}}(q \| p)$ in general.

KL divergence is used in knowledge distillation (matching a student's output distribution to a teacher's), variational autoencoders (regularizing the latent distribution toward a prior), and reinforcement learning from human feedback (penalizing deviation from a reference policy).
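Both properties are easy to check numerically; a minimal sketch for discrete distributions (the `eps` guard is an illustrative choice, not part of the definition):

```python
import math

def kl(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions given as lists summing to 1."""
    return sum(pk * math.log(pk / max(qk, eps))
               for pk, qk in zip(p, q) if pk > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
kl(p, p)              # 0.0: the divergence vanishes iff the distributions match
kl(p, q), kl(q, p)    # two different positive numbers: KL is asymmetric
```

Running this gives $D_{\text{KL}}(p \| q) \approx 0.511$ but $D_{\text{KL}}(q \| p) \approx 0.368$, a concrete instance of the asymmetry noted above.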

Definition

Contrastive Loss

For a pair of examples $(x_i, x_j)$ with label $y_{ij} \in \{0, 1\}$ indicating whether they are similar:

$$L_{\text{contrastive}} = y_{ij} \cdot d(x_i, x_j)^2 + (1 - y_{ij}) \cdot \max(0, m - d(x_i, x_j))^2$$

where $d$ is a distance function and $m$ is a margin. Similar pairs are pulled together; dissimilar pairs are pushed apart until they are at least distance $m$ apart.
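A sketch of the per-pair loss, taking the embedding distance as a precomputed scalar and $m = 1$ as an assumed default (illustrative helper):

```python
def contrastive(d, similar, margin=1.0):
    """d: distance between the pair's embeddings; similar: 1 if same class."""
    if similar:
        return d ** 2                      # pull similar pairs together
    return max(0.0, margin - d) ** 2       # push dissimilar pairs out to the margin

contrastive(0.2, similar=1)  # small penalty: similar pair is already close
contrastive(0.2, similar=0)  # large penalty: dissimilar pair inside the margin
contrastive(1.5, similar=0)  # 0.0: already farther apart than the margin
```

Dissimilar pairs beyond the margin contribute nothing, so the loss only spends gradient on pairs that are currently violating the desired geometry.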

Main Theorems

Theorem

Cross-Entropy Minimization Equals Maximum Likelihood

Statement

For a model parameterized by $\theta$ that outputs class probabilities $p_\theta(y \mid x)$, minimizing the average cross-entropy loss on a dataset $\{(x_i, y_i)\}_{i=1}^n$ is equivalent to maximizing the log-likelihood:

$$\arg\min_\theta \frac{1}{n}\sum_{i=1}^n L_{\text{CE}}(p_\theta(x_i), y_i) = \arg\max_\theta \frac{1}{n}\sum_{i=1}^n \log p_\theta(y_i \mid x_i)$$

Intuition

Cross-entropy loss for a one-hot target is just $-\log p_\theta(y_i \mid x_i)$: the negative log-probability of the true class. Summing over examples gives the negative log-likelihood. Minimizing the negative is maximizing the positive.

Proof Sketch

Expand the cross-entropy: $L_{\text{CE}} = -\sum_k y_k \log p_k$. For a one-hot label where $y_c = 1$ and $y_k = 0$ for $k \neq c$, this simplifies to $-\log p_c = -\log p_\theta(y_i \mid x_i)$. Sum over the dataset and negate.
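The collapse of the sum is a three-line numeric check (values chosen for illustration):

```python
import math

# For a one-hot target, the full cross-entropy sum collapses to
# the negative log-probability of the true class alone.
p = [0.1, 0.2, 0.7]
y = [0.0, 0.0, 1.0]  # one-hot, true class 2
ce = -sum(yk * math.log(pk) for yk, pk in zip(y, p))
nll = -math.log(p[2])
assert abs(ce - nll) < 1e-12
```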

Why It Matters

This equivalence connects two perspectives: the information-theoretic view (cross-entropy measures how many extra bits your model needs) and the statistical view (maximum likelihood is the optimal estimator under regularity conditions). It justifies using cross-entropy as the default classification loss.

Failure Mode

The equivalence holds only when the model outputs valid probability distributions (non-negative, sum to 1). If the model is miscalibrated (probabilities do not reflect true frequencies), cross-entropy still works for discrimination but the probabilistic interpretation breaks down.

Proposition

Huber Loss Bounded Influence Function

Statement

The influence function of the Huber loss estimator is bounded: for any observation $y$,

$$|\psi_\delta(y)| \leq \delta$$

where $\psi_\delta(y) = \partial L_\delta / \partial \hat{y}$. In contrast, the influence function of MSE is unbounded: $|\psi_{\text{MSE}}(y)| = |y - \hat{y}|$, which grows without limit.

Intuition

MSE's gradient is proportional to the residual, so a single outlier with residual 1000 exerts 1000x more influence than a typical point. Huber's gradient is capped at $\delta$, so no single point can dominate the gradient, regardless of how far it is from the prediction.

Proof Sketch

The gradient of Huber loss is $\psi_\delta(r) = r$ for $|r| \leq \delta$ and $\psi_\delta(r) = \delta \cdot \text{sign}(r)$ for $|r| > \delta$. The maximum absolute value is $\delta$, achieved for all $|r| \geq \delta$.
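The capped gradient is a one-line conditional, shown here with $\delta = 1$ as an assumed default (illustrative helper):

```python
def huber_grad(r, delta=1.0):
    """Gradient of Huber loss w.r.t. the residual r: clipped at +/- delta."""
    if abs(r) <= delta:
        return r
    return delta if r > 0 else -delta

huber_grad(0.3)     # 0.3: identical to the MSE gradient for small residuals
huber_grad(1000.0)  # 1.0: an extreme outlier's pull is capped at delta
```

This is the same mechanism as gradient clipping applied per-residual: inliers behave exactly as under MSE, while outliers contribute a bounded pull.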

Why It Matters

In real datasets, outliers are common (mislabeled examples, sensor errors, data entry mistakes). Huber loss provides a principled way to limit their influence without requiring explicit outlier removal. The parameter $\delta$ controls the tradeoff: smaller $\delta$ means more robustness but less statistical efficiency under Gaussian noise.

Failure Mode

Huber loss is robust to outliers in the target $y$, not in the input $x$. A leverage point (outlier in input space) can still distort the fit. For robustness to both, you need methods from robust regression (e.g., M-estimators with bounded leverage).

Why Loss Choice Matters More Than Architecture

For a fixed architecture, the loss function determines what the model optimizes. Concrete examples:

  • Object detection with cross-entropy treats all misclassifications equally. With focal loss, the model focuses on hard negatives and achieves significantly higher mAP.
  • Regression with MSE on heavy-tailed data produces estimates pulled toward outliers. Switching to Huber or MAE can reduce test error by 20%+ without changing the model.
  • Knowledge distillation with hard labels (cross-entropy on argmax) loses information. Soft labels with KL divergence preserve the teacher's inter-class relationships.

Common Confusions

Watch Out

Cross-entropy and log loss are the same thing

In the binary case, cross-entropy loss and log loss (logistic loss) are identical: $-y\log p - (1-y)\log(1-p)$. In the multi-class case, "log loss" typically refers to the same formula as multi-class cross-entropy. The terms are interchangeable.

Watch Out

KL divergence is not a distance

$D_{\text{KL}}(p \| q) \neq D_{\text{KL}}(q \| p)$, and KL divergence does not satisfy the triangle inequality. It is a divergence, not a metric. The direction matters: minimizing $D_{\text{KL}}(p \| q)$ over $q$ heavily penalizes places where $p > 0$ but $q \approx 0$, forcing $q$ to cover all of $p$'s mass (mean-seeking/mass-covering), while minimizing $D_{\text{KL}}(q \| p)$ over $q$ penalizes $q$ placing mass where $p \approx 0$, allowing $q$ to collapse onto a single mode (mode-seeking).

Watch Out

Hinge loss does not produce probability estimates

Unlike cross-entropy, hinge loss does not require or produce probability outputs. An SVM's raw output $f(x)$ is a score proportional to the signed distance from the decision boundary, not a probability. To get probabilities from an SVM, you need Platt scaling as a post-processing step.

Key Takeaways

  • Cross-entropy = negative log-likelihood for classification; the default choice
  • MSE assumes Gaussian noise; use Huber or MAE when outliers are present
  • Focal loss addresses class imbalance by down-weighting easy examples
  • Hinge loss creates maximum-margin classifiers (SVMs)
  • KL divergence measures distributional mismatch; critical for distillation and VAEs
  • Contrastive loss learns representations by comparing pairs
  • The choice of loss encodes assumptions about noise, class balance, and error costs

Exercises

ExerciseCore

Problem

Compute the cross-entropy loss for a 3-class problem where the true label is class 2 (zero-indexed) and the model predicts $p = [0.1, 0.2, 0.7]$.

ExerciseCore

Problem

For Huber loss with $\delta = 1$, compute the loss for residuals $r = 0.5$, $r = 1$, and $r = 10$. Compare with MSE for the same residuals.

ExerciseAdvanced

Problem

Show that focal loss with $\gamma = 0$ reduces to cross-entropy, and explain why increasing $\gamma$ concentrates the loss on hard examples. Compute the ratio of focal loss at $p_t = 0.1$ to focal loss at $p_t = 0.9$ for $\gamma = 0$ and $\gamma = 2$.

References

Canonical:

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapter 4.3 (cross-entropy), Chapter 7.1 (SVM/hinge)
  • Huber, "Robust Estimation of a Location Parameter" (1964), Annals of Mathematical Statistics

Current:

  • Lin et al., "Focal Loss for Dense Object Detection" (2017), ICCV

  • Khosla et al., "Supervised Contrastive Learning" (2020), NeurIPS

  • Murphy, Machine Learning: A Probabilistic Perspective (2012)

Last reviewed: April 2026
