

Loss Functions Catalog

A systematic catalog of loss functions for classification, regression, and representation learning: cross-entropy, MSE, hinge, Huber, focal, KL divergence, and contrastive loss.



Why This Matters

The loss function defines what "good" means for your model. Two models with identical architectures trained with different loss functions will learn different things. In many practical settings, switching the loss function improves performance more than switching the architecture. The choice of loss encodes your assumptions about the problem: noise distribution, outlier sensitivity, class balance, and what errors cost. See cross-entropy loss deep dive for a detailed treatment of the most common classification loss.

Mental Model

A loss function $\ell(\hat{y}, y)$ measures the cost of predicting $\hat{y}$ when the truth is $y$. Different losses penalize different types of errors. MSE penalizes large errors quadratically, making it the natural choice for linear regression under Gaussian noise. MAE penalizes all errors linearly (robust to outliers). Cross-entropy penalizes confident wrong classification predictions severely. The right loss depends on what errors matter in your application.

Classification Losses

Definition

Cross-Entropy Loss

For a classification problem with $K$ classes, the cross-entropy loss for a single example with true label $y$ (one-hot encoded) and predicted probabilities $p$ is:

$$L_{\text{CE}} = -\sum_{k=1}^{K} y_k \log p_k$$

For binary classification with $y \in \{0, 1\}$ and predicted probability $p$:

$$L_{\text{BCE}} = -y \log p - (1 - y) \log(1 - p)$$

Cross-entropy has a critical property: as $p_k \to 0$ for the true class, the loss goes to infinity. This severe penalty for confident wrong predictions drives the model to assign high probability to the correct class.
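As a quick numeric illustration, here is a minimal plain-Python sketch (the function name and the `eps` clipping are illustrative, not from any particular library):

```python
import math

def cross_entropy(p, true_class, eps=1e-12):
    """Cross-entropy for one example with a one-hot target:
    -log of the predicted probability of the true class.
    eps clips p away from 0 so the loss stays finite."""
    return -math.log(max(p[true_class], eps))

# Confident and correct: small loss.
loss_good = cross_entropy([0.1, 0.2, 0.7], true_class=2)  # -ln(0.7) ≈ 0.357
# Confident and wrong: large loss.
loss_bad = cross_entropy([0.7, 0.2, 0.1], true_class=2)   # -ln(0.1) ≈ 2.303
```

Note how the same margin of probability mass produces a much larger loss when it sits on the wrong class: the penalty grows without bound as the true-class probability approaches zero.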

Definition

Hinge Loss

For binary classification with $y \in \{-1, +1\}$ and raw prediction $f(x) \in \mathbb{R}$:

$$L_{\text{hinge}} = \max(0, 1 - y \cdot f(x))$$

The loss is zero when $y \cdot f(x) \geq 1$ (correct prediction with margin at least 1). This is the loss used by support vector machines.

Hinge loss does not require probability outputs and is not differentiable at $y \cdot f(x) = 1$. In practice, subgradient methods handle the non-differentiability.
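A one-line sketch makes the margin behavior concrete (illustrative helper, not a library function):

```python
def hinge(y, fx):
    """Hinge loss for a label y in {-1, +1} and raw score fx."""
    return max(0.0, 1.0 - y * fx)

hinge(+1, 2.0)   # 0.0: correct with margin >= 1, no penalty
hinge(+1, 0.5)   # 0.5: correct side, but inside the margin
hinge(+1, -1.0)  # 2.0: wrong side of the boundary
```

Correctly classified points beyond the margin contribute exactly zero, which is what makes the SVM solution depend only on the support vectors.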

Definition

Focal Loss

For binary classification with true class probability $p_t$ (the model's predicted probability for the true class):

$$L_{\text{focal}} = -(1 - p_t)^\gamma \log p_t$$

where $\gamma \geq 0$ is a focusing parameter. When $\gamma = 0$, this reduces to cross-entropy.

Focal loss down-weights easy examples (where $p_t$ is high). For $\gamma = 2$, an example with $p_t = 0.9$ gets weight $(0.1)^2 = 0.01$, while an example with $p_t = 0.1$ gets weight $(0.9)^2 = 0.81$. This concentrates learning on hard examples, which is critical for class-imbalanced problems like object detection, where 99%+ of candidates are background.
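The down-weighting is easy to verify numerically; a minimal sketch (illustrative function, not from the paper's reference code):

```python
import math

def focal(p_t, gamma):
    """Focal loss: cross-entropy scaled by (1 - p_t)^gamma."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# gamma = 0 recovers plain cross-entropy.
# gamma = 2 shrinks the easy example's loss far more than the hard one's:
easy = focal(0.9, gamma=2)  # 0.01 * -ln(0.9) ≈ 0.00105
hard = focal(0.1, gamma=2)  # 0.81 * -ln(0.1) ≈ 1.865
```

With $\gamma = 2$ the hard example contributes roughly 1800x more loss than the easy one, versus about 22x under plain cross-entropy.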

Regression Losses

Definition

Mean Squared Error

For a regression problem with prediction $\hat{y}$ and target $y$:

$$L_{\text{MSE}} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

MSE is the maximum likelihood estimator under a Gaussian noise model: $y = f(x) + \epsilon$ where $\epsilon \sim \mathcal{N}(0, \sigma^2)$.
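One consequence worth internalizing: the constant prediction that minimizes MSE is the sample mean, which is exactly why outliers pull MSE-trained estimates. A small plain-Python check (variable names are illustrative):

```python
def mse(preds, targets):
    """Mean squared error over a batch."""
    n = len(targets)
    return sum((y - yh) ** 2 for y, yh in zip(targets, preds)) / n

ys = [1.0, 2.0, 3.0, 10.0]        # 10.0 acts as an outlier
mean = sum(ys) / len(ys)          # 4.0, dragged up by the outlier
# The constant prediction minimizing MSE is the mean:
assert mse([mean] * 4, ys) <= mse([3.0] * 4, ys)
```

A single extreme target shifts the MSE-optimal constant from near the bulk of the data (around 2) to 4.0.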

Definition

Huber Loss

For a threshold $\delta > 0$:

$$L_\delta(r) = \begin{cases} \frac{1}{2}r^2 & \text{if } |r| \leq \delta \\ \delta|r| - \frac{1}{2}\delta^2 & \text{if } |r| > \delta \end{cases}$$

where $r = y - \hat{y}$ is the residual. Huber loss is quadratic for small errors and linear for large errors.

Huber loss combines the benefits of MSE (smooth, efficient for Gaussian errors) and MAE (robust to outliers). The parameter $\delta$ controls the transition. When $\delta$ is large, Huber approaches MSE. When $\delta$ is small, it approaches MAE.
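A direct transcription of the piecewise definition, assuming $\delta = 1$ as the default (illustrative helper):

```python
def huber(r, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond it."""
    if abs(r) <= delta:
        return 0.5 * r * r
    return delta * abs(r) - 0.5 * delta * delta

huber(0.5)   # 0.125, identical to 0.5 * r^2
huber(10.0)  # 9.5 in the linear zone; the quadratic 0.5 * r^2 would be 50
```

The $-\frac{1}{2}\delta^2$ offset in the linear branch makes the two pieces meet continuously (and with matching slope) at $|r| = \delta$.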

Divergence-Based Losses

Definition

KL Divergence Loss

The Kullback-Leibler divergence from distribution $q$ to distribution $p$ is:

$$D_{\text{KL}}(p \| q) = \sum_{k} p_k \log \frac{p_k}{q_k}$$

KL divergence is non-negative ($D_{\text{KL}} \geq 0$ by Gibbs' inequality) and equals zero if and only if $p = q$. It is not symmetric: $D_{\text{KL}}(p \| q) \neq D_{\text{KL}}(q \| p)$ in general.

KL divergence is used in knowledge distillation (matching a student's output distribution to a teacher's), variational autoencoders (regularizing the latent distribution toward a prior), and reinforcement learning from human feedback (penalizing deviation from a reference policy).
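Both properties are easy to check numerically; a minimal sketch for discrete distributions (the `eps` guard is an illustrative choice, not part of the definition):

```python
import math

def kl(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions given as lists summing to 1."""
    return sum(pk * math.log(pk / max(qk, eps))
               for pk, qk in zip(p, q) if pk > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
kl(p, p)              # 0.0: the divergence vanishes iff the distributions match
kl(p, q), kl(q, p)    # two different positive numbers: KL is asymmetric
```

Running this gives $D_{\text{KL}}(p \| q) \approx 0.511$ but $D_{\text{KL}}(q \| p) \approx 0.368$, a concrete instance of the asymmetry noted above.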

Definition

Contrastive Loss

For a pair of examples $(x_i, x_j)$ with label $y_{ij} \in \{0, 1\}$ indicating whether they are similar:

$$L_{\text{contrastive}} = y_{ij} \cdot d(x_i, x_j)^2 + (1 - y_{ij}) \cdot \max(0, m - d(x_i, x_j))^2$$

where $d$ is a distance function and $m$ is a margin. Similar pairs are pulled together; dissimilar pairs are pushed apart until they are at least distance $m$ apart.
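A sketch of the per-pair loss, taking the embedding distance as a precomputed scalar and $m = 1$ as an assumed default (illustrative helper):

```python
def contrastive(d, similar, margin=1.0):
    """d: distance between the pair's embeddings; similar: 1 if same class."""
    if similar:
        return d ** 2                      # pull similar pairs together
    return max(0.0, margin - d) ** 2       # push dissimilar pairs out to the margin

contrastive(0.2, similar=1)  # small penalty: similar pair is already close
contrastive(0.2, similar=0)  # large penalty: dissimilar pair inside the margin
contrastive(1.5, similar=0)  # 0.0: already farther apart than the margin
```

Dissimilar pairs beyond the margin contribute nothing, so the loss only spends gradient on pairs that are currently violating the desired geometry.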

Main Theorems

Theorem

Cross-Entropy Minimization Equals Maximum Likelihood

Statement

For a model parameterized by $\theta$ that outputs class probabilities $p_\theta(y \mid x)$, minimizing the average cross-entropy loss on a dataset $\{(x_i, y_i)\}_{i=1}^n$ is equivalent to maximizing the log-likelihood:

$$\arg\min_\theta \frac{1}{n}\sum_{i=1}^n L_{\text{CE}}(p_\theta(x_i), y_i) = \arg\max_\theta \frac{1}{n}\sum_{i=1}^n \log p_\theta(y_i \mid x_i)$$

Intuition

Cross-entropy loss for a one-hot target is just $-\log p_\theta(y_i \mid x_i)$: the negative log-probability of the true class. Summing over examples gives the negative log-likelihood. Minimizing the negative is maximizing the positive.

Proof Sketch

Expand the cross-entropy: $L_{\text{CE}} = -\sum_k y_k \log p_k$. For a one-hot label where $y_c = 1$ and $y_k = 0$ for $k \neq c$, this simplifies to $-\log p_c = -\log p_\theta(y_i \mid x_i)$. Sum over the dataset and negate.
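The collapse of the sum is a three-line numeric check (values chosen for illustration):

```python
import math

# For a one-hot target, the full cross-entropy sum collapses to
# the negative log-probability of the true class alone.
p = [0.1, 0.2, 0.7]
y = [0.0, 0.0, 1.0]  # one-hot, true class 2
ce = -sum(yk * math.log(pk) for yk, pk in zip(y, p))
nll = -math.log(p[2])
assert abs(ce - nll) < 1e-12
```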

Why It Matters

This equivalence connects two perspectives: the information-theoretic view (cross-entropy measures how many extra bits your model needs) and the statistical view (maximum likelihood is the optimal estimator under regularity conditions). It justifies using cross-entropy as the default classification loss.

Failure Mode

The equivalence holds only when the model outputs valid probability distributions (non-negative, sum to 1). If the model is miscalibrated (probabilities do not reflect true frequencies), cross-entropy still works for discrimination but the probabilistic interpretation breaks down.

Proposition

Huber Loss Bounded Influence Function

Statement

The influence function of the Huber loss estimator is bounded: for any observation $y$,

$$|\psi_\delta(y)| \leq \delta$$

where $\psi_\delta(y) = \partial L_\delta / \partial \hat{y}$. In contrast, the influence function of MSE is unbounded: $|\psi_{\text{MSE}}(y)| = |y - \hat{y}|$, which grows without limit.

Intuition

MSE's gradient is proportional to the residual, so a single outlier with residual 1000 exerts 1000x more influence than a typical point. Huber's gradient is capped at $\delta$, so no single point can dominate the gradient, regardless of how far it is from the prediction.

Proof Sketch

The gradient of Huber loss is $\psi_\delta(r) = r$ for $|r| \leq \delta$ and $\psi_\delta(r) = \delta \cdot \text{sign}(r)$ for $|r| > \delta$. The maximum absolute value is $\delta$, achieved for all $|r| \geq \delta$.
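The capped gradient is a one-line conditional, shown here with $\delta = 1$ as an assumed default (illustrative helper):

```python
def huber_grad(r, delta=1.0):
    """Gradient of Huber loss w.r.t. the residual r: clipped at +/- delta."""
    if abs(r) <= delta:
        return r
    return delta if r > 0 else -delta

huber_grad(0.3)     # 0.3: identical to the MSE gradient for small residuals
huber_grad(1000.0)  # 1.0: an extreme outlier's pull is capped at delta
```

This is the same mechanism as gradient clipping applied per-residual: inliers behave exactly as under MSE, while outliers contribute a bounded pull.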

Why It Matters

In real datasets, outliers are common (mislabeled examples, sensor errors, data entry mistakes). Huber loss provides a principled way to limit their influence without requiring explicit outlier removal. The parameter $\delta$ controls the tradeoff: smaller $\delta$ means more robustness but less statistical efficiency under Gaussian noise.

Failure Mode

Huber loss is robust to outliers in the target $y$, not in the input $x$. A leverage point (outlier in input space) can still distort the fit. For robustness to both, you need methods from robust regression (e.g., M-estimators with bounded leverage).

Why Loss Choice Matters More Than Architecture

For a fixed architecture, the loss function determines what the model optimizes. Concrete examples:

  • Object detection with cross-entropy treats all misclassifications equally. With focal loss, the model focuses on hard negatives and achieves significantly higher mAP.
  • Regression with MSE on heavy-tailed data produces estimates pulled toward outliers. Switching to Huber or MAE can reduce test error by 20%+ without changing the model.
  • Knowledge distillation with hard labels (cross-entropy on argmax) loses information. Soft labels with KL divergence preserve the teacher's inter-class relationships.

Common Confusions

Watch Out

Cross-entropy and log loss are the same thing

In the binary case, cross-entropy loss and log loss (logistic loss) are identical: $-y\log p - (1-y)\log(1-p)$. In the multi-class case, "log loss" typically refers to the same formula as multi-class cross-entropy. The terms are interchangeable.

Watch Out

KL divergence is not a distance

$D_{\text{KL}}(p \| q) \neq D_{\text{KL}}(q \| p)$, and KL divergence does not satisfy the triangle inequality. It is a divergence, not a metric. The direction matters: minimizing $D_{\text{KL}}(p \| q)$ over $q$ heavily penalizes places where $p > 0$ but $q \approx 0$, forcing $q$ to cover all of $p$'s mass (mean-seeking/mass-covering), while minimizing $D_{\text{KL}}(q \| p)$ over $q$ penalizes $q$ placing mass where $p \approx 0$, allowing $q$ to collapse onto a single mode (mode-seeking).

Watch Out

Hinge loss does not produce probability estimates

Unlike cross-entropy, hinge loss does not require or produce probability outputs. An SVM's raw output $f(x)$ is a score proportional to the signed distance from the decision boundary, not a probability. To get probabilities from an SVM, you need Platt scaling as a post-processing step.

Key Takeaways

  • Cross-entropy = negative log-likelihood for classification; the default choice
  • MSE assumes Gaussian noise; use Huber or MAE when outliers are present
  • Focal loss addresses class imbalance by down-weighting easy examples
  • Hinge loss creates maximum-margin classifiers (SVMs)
  • KL divergence measures distributional mismatch; critical for distillation and VAEs
  • Contrastive loss learns representations by comparing pairs
  • The choice of loss encodes assumptions about noise, class balance, and error costs

Exercises

ExerciseCore

Problem

Compute the cross-entropy loss for a 3-class problem where the true label is class 2 (zero-indexed) and the model predicts $p = [0.1, 0.2, 0.7]$.

ExerciseCore

Problem

For Huber loss with $\delta = 1$, compute the loss for residuals $r = 0.5$, $r = 1$, and $r = 10$. Compare with MSE for the same residuals.

ExerciseAdvanced

Problem

Show that focal loss with $\gamma = 0$ reduces to cross-entropy, and explain why increasing $\gamma$ concentrates the loss on hard examples. Compute the ratio of focal loss at $p_t = 0.1$ to focal loss at $p_t = 0.9$ for $\gamma = 0$ and $\gamma = 2$.

References

Canonical:

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapter 4.3 (cross-entropy), Chapter 7.1 (SVM/hinge)
  • Huber, "Robust Estimation of a Location Parameter" (1964), Annals of Mathematical Statistics

Current:

  • Lin et al., "Focal Loss for Dense Object Detection" (2017), ICCV

  • Khosla et al., "Supervised Contrastive Learning" (2020), NeurIPS

  • Murphy, Machine Learning: A Probabilistic Perspective (2012)

Last reviewed: April 2026
