
Learning Theory Core

Loss Functions

The generalized loss-function framework from Shalev-Shwartz & Ben-David Ch 3, the six standard losses (0-1, MSE, MAE, Huber, cross-entropy, hinge), and the bridge from supervised learning to ERM and PAC learnability.

Core · Tier 2 · Stable · Core spine · ~55 min

Why This Matters

A loss function turns prediction quality into a number. Once you fix the loss, you have decided what you are optimizing, what counts as an error, and which solutions tie. Two models with the same architecture and the same data will learn different things if you swap the loss; switching from squared loss to absolute loss replaces the conditional mean with the conditional median. The choice is not cosmetic.

The generalized loss framework from Shalev-Shwartz and Ben-David (Understanding Machine Learning, Ch 3) gives one definition that covers classification, regression, density estimation, ranking, and structured prediction. The same definition is what lets empirical risk minimization and the PAC learning framework be stated once instead of once per task.

Mental Model

A loss is a referee with a stopwatch. Hand it the model and a labeled example; it returns a non-negative number that says how badly the model did on that example. Population risk is the expected referee call over the data distribution; empirical risk is the average call on the training set. Everything downstream — ERM, generalization bounds, PAC sample complexity — is bookkeeping on top of those two quantities.

The Generalized Framework

Shalev-Shwartz and Ben-David define a loss as a map on hypothesis-example pairs.

Definition

Loss Function

A loss function sends a hypothesis $h \in \mathcal{H}$ and an example $z \in \mathcal{Z}$ to a non-negative real number $\ell(h, z)$. For supervised learning $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ and a typical loss factors as $\ell(h, (x, y)) = L(h(x), y)$, but the framework does not require this; density estimation uses $\ell(h, z) = -\log h(z)$ and $k$-means uses $\ell(h, z) = \min_{c \in h} \|z - c\|^2$.

Definition

Population Risk

For a distribution $\mathcal{D}$ over $\mathcal{Z}$, the population risk (or true risk) of $h$ is $L_{\mathcal{D}}(h) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(h, z)]$. The expectation runs over the data distribution; $h$ is fixed inside it.

Definition

Empirical Risk

Given a sample $S = (z_1, \ldots, z_m) \sim \mathcal{D}^m$ drawn iid, the empirical risk is the sample average $L_S(h) = \frac{1}{m} \sum_{i=1}^m \ell(h, z_i)$. This is the only one of the two we can compute, since $\mathcal{D}$ is unknown.

The expectation in $L_{\mathcal{D}}$ is the standard one for a random variable: if $Z \sim \mathcal{D}$, then $L_{\mathcal{D}}(h) = \int \ell(h, z) \, d\mathcal{D}(z)$. For a discrete distribution this collapses to $\sum_z \ell(h, z) \, \mathcal{D}(\{z\})$; for an absolutely continuous one it becomes $\int \ell(h, z) p(z) \, dz$. Population risk averages out the example randomness; empirical risk replaces that average with the empirical measure $\frac{1}{m} \sum_i \delta_{z_i}$.
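
To make the two risks concrete, here is a minimal numpy sketch (not from the text; helper names such as `empirical_risk` are our own) that treats the loss as a function of a hypothesis-example pair, exactly as in the definitions above, and computes the empirical risk as a sample average. A k-means-style loss is included to show that the hypothesis need not be a predictor.

```python
import numpy as np

# A loss in the general sense takes the hypothesis itself, not just a prediction.

def squared_loss(h, z):
    """Supervised case: z = (x, y), h is a predictor; loss = (h(x) - y)^2."""
    x, y = z
    return (h(x) - y) ** 2

def kmeans_loss(centers, z):
    """Clustering case: the hypothesis is a set of centers; loss = squared
    distance from the point z to its nearest center."""
    return min(np.sum((z - c) ** 2) for c in centers)

def empirical_risk(loss, h, sample):
    """L_S(h): average loss of h over the observed sample."""
    return np.mean([loss(h, z) for z in sample])

rng = np.random.default_rng(0)

# Regression sample: y = 2x + noise; hypothesis h(x) = 2x.
xs = rng.normal(size=200)
ys = 2 * xs + 0.1 * rng.normal(size=200)
print(empirical_risk(squared_loss, lambda x: 2 * x, list(zip(xs, ys))))

# Clustering sample: points near two centers; hypothesis = guessed centers.
pts = np.concatenate([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
print(empirical_risk(kmeans_loss, [np.zeros(2), np.full(2, 5.0)], list(pts)))
```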

The Six Standard Losses

0-1 loss

Definition

Zero-One Loss

For a binary or multiclass classifier $h : \mathcal{X} \to \mathcal{Y}$, $\ell_{01}(h, (x, y)) = \mathbf{1}[h(x) \ne y]$. The indicator is one when the prediction is wrong and zero when it is right.

The expected 0-1 loss is exactly the misclassification probability: for a Bernoulli indicator $A = \mathbf{1}[h(X) \ne Y]$, the law of the unconscious statistician gives $\mathbb{E}[A] = \mathbb{P}[A = 1] = \mathbb{P}[h(X) \ne Y]$. So $L_{\mathcal{D}}(h)$ under 0-1 loss is the population error rate.
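
A quick sanity check of this identity, sketched with an assumed toy classifier: averaging the 0-1 loss over a sample gives literally the same number as the misclassification rate.

```python
import numpy as np

def zero_one_loss(h, z):
    """1 if the prediction is wrong, 0 if it is right."""
    x, y = z
    return float(h(x) != y)

rng = np.random.default_rng(1)
xs = rng.normal(size=1000)
ys = (xs > 0).astype(int)          # true labels: sign of x

def h(x):
    return int(x > 0.2)            # a deliberately shifted classifier

losses = np.array([zero_one_loss(h, z) for z in zip(xs, ys)])
error_rate = np.mean(ys != np.array([h(x) for x in xs]))

print(losses.mean(), error_rate)   # identical: L_S(h) is the empirical error rate
```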

The 0-1 loss is non-convex in the prediction and discontinuous; the gradient is zero almost everywhere and undefined where it is not. Direct ERM is NP-hard for most non-trivial hypothesis classes (Ben-David, Eiron, Long 2003). This is why classification practice optimizes a convex surrogate and reports 0-1 error only at evaluation time.

Square loss (MSE)

Definition

Squared Loss

For regression with $\mathcal{Y} = \mathbb{R}$, $\ell_{\text{sq}}(h, (x, y)) = (h(x) - y)^2$. The mean squared error of a predictor is the expected squared loss: $\mathrm{MSE}(h) = \mathbb{E}[(h(X) - Y)^2]$.

Squared loss is smooth and strictly convex in the prediction; it has a closed-form minimizer for linear regression and a unique gradient direction for stochastic optimization. Its weak point is outlier sensitivity: an example with residual $10$ contributes $100$ to the loss while one with residual $1$ contributes $1$, so a single point can dominate the fit.

Absolute loss (MAE)

Definition

Absolute Loss

For regression with $\mathcal{Y} = \mathbb{R}$, $\ell_{\text{abs}}(h, (x, y)) = |h(x) - y|$.

Absolute loss is convex but not differentiable at zero, so first-order methods need subgradients. It is robust to outliers because errors enter linearly: doubling a residual doubles its contribution rather than quadrupling it.
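
The contrast with squared loss is easiest to see numerically. The following sketch (illustrative residuals only) compares per-example contributions under squared and absolute loss for a batch containing one outlier.

```python
import numpy as np

residuals = np.array([0.5, 1.0, 2.0, 10.0])   # one clear outlier at 10

squared = residuals ** 2        # MSE contribution per example
absolute = np.abs(residuals)    # MAE contribution per example

for r, s, a in zip(residuals, squared, absolute):
    print(f"residual {r:5.1f}   squared {s:7.2f}   absolute {a:5.1f}")

# Share of the total loss owned by the outlier under each loss:
print("outlier share, MSE:", squared[-1] / squared.sum())    # ~0.95
print("outlier share, MAE:", absolute[-1] / absolute.sum())  # ~0.74
```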

Huber loss

Definition

Huber Loss

For a threshold $\delta > 0$ and residual $e = h(x) - y$,

$$\ell_\delta(h, (x, y)) = \begin{cases} \tfrac{1}{2} e^2 & \text{if } |e| \leq \delta,\\ \delta\bigl(|e| - \tfrac{1}{2}\delta\bigr) & \text{if } |e| > \delta. \end{cases}$$

Huber is the standard hybrid: quadratic near zero (smooth, well-conditioned for small residuals), linear in the tails (outlier-robust). The construction is calibrated so that the function and its first derivative are continuous at $|e| = \delta$. The derivative at the boundary equals $\delta$ from both sides, which is why the linear piece carries the offset $-\tfrac{1}{2}\delta^2$. The threshold $\delta$ is a tunable robustness knob; large $\delta$ recovers MSE, small $\delta$ approaches MAE.
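
A small sketch of the formula above (the helper names are our own), checking numerically that both the loss value and its derivative agree from the two sides of the boundary $|e| = \delta$.

```python
import numpy as np

def huber(e, delta=1.0):
    """Huber loss of a residual e: quadratic inside |e| <= delta, linear outside."""
    e = np.asarray(e, dtype=float)
    quadratic = 0.5 * e ** 2
    linear = delta * (np.abs(e) - 0.5 * delta)
    return np.where(np.abs(e) <= delta, quadratic, linear)

def huber_grad(e, delta=1.0):
    """Derivative in e: e inside the threshold, delta * sign(e) outside."""
    e = np.asarray(e, dtype=float)
    return np.where(np.abs(e) <= delta, e, delta * np.sign(e))

delta = 1.5
eps = 1e-9
# Value and derivative match from both sides of |e| = delta:
print(huber(delta - eps, delta), huber(delta + eps, delta))            # ~1.125 both
print(huber_grad(delta - eps, delta), huber_grad(delta + eps, delta))  # 1.5 both
```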

Cross-entropy / log loss

Definition

Binary Cross-Entropy

For binary classification with $y \in \{0, 1\}$ and predicted probability $p = h(x) \in (0, 1)$, $\ell_{\text{ce}}(h, (x, y)) = -\bigl[y \log p + (1 - y) \log(1 - p)\bigr]$. The general $K$-class form is $-\sum_{k=1}^K \mathbf{1}[y = k] \log p_k$.

Cross-entropy is the negative log-likelihood under a Bernoulli (or categorical) likelihood model: maximizing the likelihood of the labels under predicted probabilities $p$ is identical to minimizing $\ell_{\text{ce}}$ summed over the sample. It is convex in the logit, smooth, and strictly proper as a scoring rule, so it is the natural loss whenever the model outputs a probability.
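
The likelihood identity is easy to verify numerically; this sketch (with made-up labels and probabilities) shows that the summed binary cross-entropy equals the negative log of the Bernoulli likelihood of the labels.

```python
import numpy as np

def bce(p, y):
    """Binary cross-entropy for predicted probability p and label y in {0, 1}."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(2)
y = rng.integers(0, 2, size=5)            # labels
p = rng.uniform(0.05, 0.95, size=5)       # predicted probabilities

# Likelihood of the labels under a Bernoulli model with parameters p:
likelihood = np.prod(np.where(y == 1, p, 1 - p))

# Summed cross-entropy equals the negative log of that likelihood:
print(bce(p, y).sum(), -np.log(likelihood))   # same number
```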

Hinge loss

Definition

Hinge Loss

For binary classification with $y \in \{-1, +1\}$ and raw score $f(x) \in \mathbb{R}$, $\ell_{\text{hinge}}(f, (x, y)) = \max\bigl(0, 1 - y \cdot f(x)\bigr)$.

The product $y \cdot f(x)$ is the margin: positive when the score has the correct sign, negative when wrong, with magnitude $|f(x)|$ measuring how confident the prediction is. The hinge is zero when the margin is at least one (correct with confidence) and grows linearly when the margin falls short. This is the loss minimized by the support vector machine (with an L2 regularizer); maximizing the margin is exactly minimizing this loss plus a norm penalty.
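
A minimal sketch of the hinge as a function of the margin, on illustrative scores: the loss is zero once the margin reaches one and grows linearly as it falls short.

```python
import numpy as np

def hinge(f_x, y):
    """Hinge loss for raw score f(x) and label y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * f_x)

# Four cases: confidently right, barely right, on the decision boundary, wrong.
scores = np.array([2.5, 0.3, 0.0, -1.2])
labels = np.array([1, 1, 1, 1])

for f_x, y in zip(scores, labels):
    print(f"margin {y * f_x:+.1f}   hinge {hinge(f_x, y):.1f}")
# margin +2.5 -> 0.0 (no loss), +0.3 -> 0.7, +0.0 -> 1.0, -1.2 -> 2.2
```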

How to Choose

| Loss | Task | Convex | Differentiable | Outlier-robust |
| --- | --- | --- | --- | --- |
| 0-1 | Classification (eval only) | No | No | N/A |
| MSE | Regression | Yes | Yes (smooth) | No |
| MAE | Regression | Yes | No (kink at 0) | Yes |
| Huber | Regression | Yes | Yes (smooth) | Yes |
| Cross-entropy | Probabilistic classification | Yes (in logits) | Yes | N/A |
| Hinge | Margin classification (SVM) | Yes | No (kink at margin 1) | N/A |

The decision rule:

  • Classification, evaluation: 0-1 loss. It is what end users care about.
  • Classification, optimization: cross-entropy (probabilistic outputs) or hinge (margin-based, SVM-style). Both are convex surrogates for 0-1.
  • Regression, clean data: MSE. Smooth, closed-form for linear models, MLE under Gaussian noise.
  • Regression, outliers expected: Huber if you want smoothness, MAE if you want strict linearity in the tails.
  • Probability calibration needed: cross-entropy. It is strictly proper as a scoring rule.

Mean vs Median: Why MAE is Robust

The deepest fact about MSE versus MAE is that they recover different summaries of the conditional response distribution.

Theorem

MSE minimizer is the conditional mean

Statement

For a joint distribution $(X, Y)$ with $\mathbb{E}[Y^2] < \infty$, the predictor $h^* : \mathcal{X} \to \mathbb{R}$ minimizing the population squared loss $\mathbb{E}[(h(X) - Y)^2]$ over all measurable functions is $h^*(x) = \mathbb{E}[Y \mid X = x]$.

Intuition

Squared loss penalizes large deviations heavily, so the minimizer must balance positive and negative residuals symmetrically. The point that does this on a one-dimensional distribution is the mean.

Proof Sketch

Condition on $X = x$. The problem reduces to choosing a constant $c = h(x)$ that minimizes $\mathbb{E}[(c - Y)^2 \mid X = x]$. Setting the derivative in $c$ to zero gives $c = \mathbb{E}[Y \mid X = x]$. The bias-variance decomposition $\mathbb{E}[(c - Y)^2 \mid X = x] = (c - \mathbb{E}[Y \mid X = x])^2 + \mathrm{Var}(Y \mid X = x)$ confirms this is the minimum, with residual variance independent of $c$.
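
A numerical illustration of the proof sketch, assuming one fixed (skewed) conditional distribution: a grid search over constants $c$ lands on the sample mean, and the bias-variance identity holds for the sample moments.

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.gamma(shape=2.0, scale=1.5, size=50_000)   # a fixed skewed conditional distribution

cs = np.linspace(0.0, 8.0, 801)                    # candidate constants c
risks = np.array([np.mean((c - y) ** 2) for c in cs])

best_c = cs[np.argmin(risks)]
print(best_c, y.mean())                            # grid minimizer ≈ sample mean

# Bias-variance identity E[(c - Y)^2] = (c - E[Y])^2 + Var(Y), checked at an arbitrary c:
c = 5.0
print(np.mean((c - y) ** 2), (c - y.mean()) ** 2 + y.var())
```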

Why It Matters

This identifies what regression with MSE is actually estimating. It also gives the irreducible-error decomposition: the minimum population MSE over all predictors is $\mathbb{E}[\mathrm{Var}(Y \mid X)]$, the noise floor.

Failure Mode

Without a finite second moment, the conditional mean may not exist or the squared loss may be infinite. For Cauchy-distributed $Y$, MSE is undefined and the conditional mean is not the right summary.

Theorem

MAE minimizer is the conditional median

Statement

For a joint distribution $(X, Y)$ with $\mathbb{E}[|Y|] < \infty$, the predictor $h^*$ minimizing $\mathbb{E}[|h(X) - Y|]$ over all measurable functions is $h^*(x) = \mathrm{median}(Y \mid X = x)$.

Proof Sketch

Condition on $X = x$ and choose a constant $c$ to minimize $g(c) = \mathbb{E}[|c - Y| \mid X = x]$. The subgradient in $c$ is $\mathbb{P}[Y < c \mid X = x] - \mathbb{P}[Y > c \mid X = x]$, which is zero exactly at the median.

Failure Mode

The minimizer is non-unique when the conditional distribution has a flat region of mass, since any point in that region is a valid median.

The robustness story falls out: a single outlier shifts the sample mean by $\Theta(1/n)$ per unit of outlier magnitude, while the median moves at most to a neighboring order statistic and stays bounded until contamination crosses the 50% breakdown point. MSE chases means, MAE chases medians, and means are not robust.
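
A sketch of that contrast on an assumed toy sample: grid-searching the best constant predictor under each loss shows the MSE minimizer jumping with a single outlier while the MAE minimizer stays put.

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(loc=0.0, scale=1.0, size=99)
y_contaminated = np.append(y, 1000.0)        # one wild outlier

def best_constant(data, loss):
    """Grid-search the constant c minimizing the average per-example loss."""
    cs = np.linspace(-15, 15, 3001)
    return cs[np.argmin([np.mean(loss(c - data)) for c in cs])]

for name, data in [("clean", y), ("with outlier", y_contaminated)]:
    c_mse = best_constant(data, lambda e: e ** 2)
    c_mae = best_constant(data, np.abs)
    print(f"{name:13s}  MSE-optimal c = {c_mse:+6.2f}   MAE-optimal c = {c_mae:+6.2f}")

# The MSE minimizer tracks the mean and jumps by ~10 with the outlier;
# the MAE minimizer tracks the median and barely moves.
print("mean:", y_contaminated.mean(), "  median:", np.median(y_contaminated))
```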

Connection to ERM and PAC

Once a loss is fixed, empirical risk minimization is the rule $\hat{h} \in \arg\min_{h \in \mathcal{H}} L_S(h)$, and the PAC framework asks how many samples $m$ suffice for $L_{\mathcal{D}}(\hat{h}) \leq \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + \varepsilon$ with probability at least $1 - \delta$. Both definitions are stated against an arbitrary $\ell$, which is why the same theorems cover classification and regression.

The choice of $\ell$ controls more than answer quality:

  • Tractability. Non-convex $\ell$ (0-1 loss) makes ERM NP-hard for many natural classes; convex $\ell$ admits gradient methods. This is the practical reason we minimize cross-entropy or hinge while caring about 0-1.
  • Sample complexity. The dependence on $\ell$ enters concentration bounds through the loss range. Hoeffding-style bounds for bounded losses scale as $\sqrt{B^2 \log|\mathcal{H}| / m}$ where $B$ is the loss range, so unbounded losses (square loss with unbounded $Y$) require sub-Gaussian or sub-exponential machinery instead (a numeric sketch follows this list).
  • Consistency. Minimizing a convex surrogate does not automatically minimize 0-1 risk. A surrogate $\phi$ is classification-calibrated if its population minimizer agrees with the Bayes classifier on every $x$. Cross-entropy, hinge, and logistic losses are calibrated; squared loss on $\pm 1$ targets is calibrated; not every convex loss is. See Bartlett, Jordan, McAuliffe (2006) for the full statement.
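
For the sample-complexity bullet, here is a hedged sketch of the finite-class Hoeffding computation, assuming a loss bounded in $[0, B]$ and a union bound over a finite class $\mathcal{H}$; the function name `uniform_convergence_m` is our own.

```python
import numpy as np

def uniform_convergence_m(n_hypotheses, eps, delta, B=1.0):
    """Smallest m with 2|H| exp(-2 m eps^2 / B^2) <= delta, i.e. every h in H
    satisfies |L_S(h) - L_D(h)| <= eps with probability >= 1 - delta
    (Hoeffding bound on each h, union bound over the class)."""
    return int(np.ceil(B ** 2 * np.log(2 * n_hypotheses / delta) / (2 * eps ** 2)))

# Example: 10^6 hypotheses, loss in [0, 1], eps = 0.05, delta = 0.01.
print(uniform_convergence_m(1e6, eps=0.05, delta=0.01))   # ~3.8k samples
```
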
Watch Out

The loss is on hypothesis-example pairs, not on prediction-label pairs

Most practitioner exposure pretends $\ell$ is a function of two reals, $\ell(\hat{y}, y)$. The Shalev-Shwartz–Ben-David framework is more general: $\ell : \mathcal{H} \times \mathcal{Z} \to \mathbb{R}_+$ takes the hypothesis itself, which lets density estimation ($\ell(h, z) = -\log h(z)$) and clustering ($\ell(h, z) = \min_c \|z - c\|^2$) live in the same definition. The $L(h(x), y)$ form is a special case, not the whole picture.

Watch Out

Calibration is not the same as consistency

"Cross-entropy is calibrated for 0-1 loss" means the population minimizer of cross-entropy gives the right hard classifier. It does NOT mean that ERM under cross-entropy converges to the Bayes-optimal classifier at any specific rate, or that finite-sample empirical minimizers behave well. Calibration is a statement about population minimizers, consistency is a statement about empirical procedures.

Watch Out

Bounded loss is a working assumption, not a fact about regression

The cleanest generalization bounds (Hoeffding, finite-class ERM, basic Rademacher) assume the loss is bounded. Squared loss on unbounded $Y$ is not bounded; getting useful bounds in that setting needs sub-Gaussian or sub-exponential tail control on $Y$ or a clipped loss. Quoting a "$\sqrt{\log|\mathcal{H}| / m}$" bound for ordinary least squares without bounding $Y$ is dropping a real assumption.

Summary

  • A loss is a map $\ell : \mathcal{H} \times \mathcal{Z} \to \mathbb{R}_+$. Population risk averages it over $\mathcal{D}$; empirical risk averages it over a sample.
  • The six standard losses split by task: 0-1 (classification target), MSE / MAE / Huber (regression), cross-entropy / hinge (classification surrogates).
  • 0-1 is non-convex and intractable, so practice optimizes a convex surrogate and reports 0-1 only at evaluation.
  • MSE recovers conditional means, MAE recovers conditional medians; this is why MAE is outlier-robust.
  • ERM and PAC are stated for an arbitrary $\ell$, but the choice controls tractability, sample complexity, and what consistent recovery means.

Exercises

ExerciseCore

Problem

A binary classifier outputs predicted probability $p = 0.9$ for a positive example ($y = 1$) and $p = 0.4$ for a negative example ($y = 0$). Compute the binary cross-entropy on each example and identify which contributes more to the average loss.

ExerciseAdvanced

Problem

Show that the Huber loss with threshold $\delta$ has a continuous first derivative at $|e| = \delta$, and explain why the offset $-\tfrac{1}{2}\delta^2$ in the linear piece is fixed by this requirement together with continuity of $\ell_\delta$ itself.

ExerciseAdvanced

Problem

Show that the population minimizer of MAE is the conditional median by writing $g(c) = \mathbb{E}[|c - Y|]$ for a fixed conditional distribution and computing $g'(c)$ in the sense of subgradients.

References

Canonical:

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapter 3 (generalized loss functions, Definition 3.1; PAC framework with general loss, Section 3.2)
  • Mohri, Rostamizadeh, Talwalkar, Foundations of Machine Learning (2018, 2nd ed.), Chapter 1 (PAC learning), Chapter 5 (SVMs and hinge loss)
  • Hastie, Tibshirani, Friedman, Elements of Statistical Learning (2009, 2nd ed.), Chapter 2 (squared and 0-1 loss), Chapter 10 (Huber, robust regression)

Current:

  • Bartlett, Jordan, McAuliffe, "Convexity, Classification, and Risk Bounds," JASA 101 (2006), arXiv:math/0508276 — classification-calibration of convex surrogate losses
  • Ben-David, Eiron, Long, "On the difficulty of approximately maximizing agreements," J. Computer and System Sciences 66 (2003) — NP-hardness of empirical 0-1 minimization for halfspaces
  • Boyd & Vandenberghe, Convex Optimization (2004), Chapter 6 (Huber and other robust loss functions)

Last reviewed: May 8, 2026
