
Learning Theory Core

Loss Functions

The generalized loss-function framework from Shalev-Shwartz & Ben-David Ch 3, the six standard losses (0-1, MSE, MAE, Huber, cross-entropy, hinge), and the bridge from supervised learning to ERM and PAC learnability.

Core · Tier 2 · Stable · Core spine · ~55 min

Why This Matters

A loss function turns prediction quality into a number. Once you fix the loss, you have decided what you are optimizing, what counts as an error, and which solutions tie. Two models with the same architecture and the same data will learn different things if you swap the loss; switching from squared loss to absolute loss replaces the conditional mean with the conditional median. The choice is not cosmetic.

The generalized loss framework from Shalev-Shwartz and Ben-David (Understanding Machine Learning, Ch 3) gives one definition that covers classification, regression, density estimation, ranking, and structured prediction. The same definition is what lets empirical risk minimization and the PAC learning framework be stated once instead of once per task.

Mental Model

A loss is a referee with a stopwatch. Hand it the model and a labeled example; it returns a non-negative number that says how badly the model did on that example. Population risk is the expected referee call over the data distribution; empirical risk is the average call on the training set. Everything downstream — ERM, generalization bounds, PAC sample complexity — is bookkeeping on top of those two quantities.

The Generalized Framework

Shalev-Shwartz and Ben-David define a loss as a map on hypothesis-example pairs.

Definition

Loss Function

A loss function sends a hypothesis $h \in \mathcal{H}$ and an example $z \in \mathcal{Z}$ to a non-negative real number $\ell(h, z)$. For supervised learning $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ and a typical loss factors as $\ell(h, (x, y)) = L(h(x), y)$, but the framework does not require this; density estimation uses $\ell(h, z) = -\log h(z)$ and $k$-means uses $\ell(h, z) = \min_{c \in h} \|z - c\|^2$.

Definition

Population Risk

For a distribution $\mathcal{D}$ over $\mathcal{Z}$, the population risk (or true risk) of $h$ is $L_{\mathcal{D}}(h) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(h, z)]$. The expectation runs over the data distribution; $h$ is fixed inside it.

Definition

Empirical Risk

Given a sample $S = (z_1, \ldots, z_m) \sim \mathcal{D}^m$ drawn iid, the empirical risk is the sample average $L_S(h) = \frac{1}{m} \sum_{i=1}^m \ell(h, z_i)$. This is the only one of the two we can compute, since $\mathcal{D}$ is unknown.

The expectation in $L_{\mathcal{D}}$ is the standard one for a random variable: if $Z \sim \mathcal{D}$, then $L_{\mathcal{D}}(h) = \int \ell(h, z) \, d\mathcal{D}(z)$. For a discrete distribution this collapses to $\sum_z \ell(h, z) \, \mathcal{D}(\{z\})$; for an absolutely continuous one it becomes $\int \ell(h, z) p(z) \, dz$. Population risk averages out the example randomness; empirical risk replaces that average with the empirical measure $\frac{1}{m} \sum_i \delta_{z_i}$.
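
To make the two risks concrete, here is a minimal numpy sketch (not from the text; helper names such as `empirical_risk` are our own) that treats the loss as a function of a hypothesis-example pair, exactly as in the definitions above, and computes the empirical risk as a sample average. A k-means-style loss is included to show that the hypothesis need not be a predictor.

```python
import numpy as np

# A loss in the general sense takes the hypothesis itself, not just a prediction.

def squared_loss(h, z):
    """Supervised case: z = (x, y), h is a predictor; loss = (h(x) - y)^2."""
    x, y = z
    return (h(x) - y) ** 2

def kmeans_loss(centers, z):
    """Clustering case: the hypothesis is a set of centers; loss = squared
    distance from the point z to its nearest center."""
    return min(np.sum((z - c) ** 2) for c in centers)

def empirical_risk(loss, h, sample):
    """L_S(h): average loss of h over the observed sample."""
    return np.mean([loss(h, z) for z in sample])

rng = np.random.default_rng(0)

# Regression sample: y = 2x + noise; hypothesis h(x) = 2x.
xs = rng.normal(size=200)
ys = 2 * xs + 0.1 * rng.normal(size=200)
print(empirical_risk(squared_loss, lambda x: 2 * x, list(zip(xs, ys))))

# Clustering sample: points near two centers; hypothesis = guessed centers.
pts = np.concatenate([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
print(empirical_risk(kmeans_loss, [np.zeros(2), np.full(2, 5.0)], list(pts)))
```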

The Six Standard Losses

0-1 loss

Definition

Zero-One Loss

For a binary or multiclass classifier $h : \mathcal{X} \to \mathcal{Y}$, $\ell_{01}(h, (x, y)) = \mathbf{1}[h(x) \ne y]$. The indicator is one when the prediction is wrong and zero when it is right.

The expected 0-1 loss is exactly the misclassification probability: for a Bernoulli indicator $A = \mathbf{1}[h(X) \ne Y]$, the law of the unconscious statistician gives $\mathbb{E}[A] = \mathbb{P}[A = 1] = \mathbb{P}[h(X) \ne Y]$. So $L_{\mathcal{D}}(h)$ under 0-1 loss is the population error rate.
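
A quick sanity check of this identity, sketched with an assumed toy classifier: averaging the 0-1 loss over a sample gives literally the same number as the misclassification rate.

```python
import numpy as np

def zero_one_loss(h, z):
    """1 if the prediction is wrong, 0 if it is right."""
    x, y = z
    return float(h(x) != y)

rng = np.random.default_rng(1)
xs = rng.normal(size=1000)
ys = (xs > 0).astype(int)          # true labels: sign of x

def h(x):
    return int(x > 0.2)            # a deliberately shifted classifier

losses = np.array([zero_one_loss(h, z) for z in zip(xs, ys)])
error_rate = np.mean(ys != np.array([h(x) for x in xs]))

print(losses.mean(), error_rate)   # identical: L_S(h) is the empirical error rate
```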

The 0-1 loss is non-convex in the prediction and discontinuous; the gradient is zero almost everywhere and undefined where it is not. Direct ERM is NP-hard for most non-trivial hypothesis classes (Ben-David, Eiron, Long 2003). This is why classification practice optimizes a convex surrogate and reports 0-1 error only at evaluation time.

Square loss (MSE)

Definition

Squared Loss

For regression with $\mathcal{Y} = \mathbb{R}$, $\ell_{\text{sq}}(h, (x, y)) = (h(x) - y)^2$. The mean squared error of a predictor is the expected squared loss: $\mathrm{MSE}(h) = \mathbb{E}[(h(X) - Y)^2]$.

Squared loss is smooth and strictly convex in the prediction; it has a closed-form minimizer for linear regression and a unique gradient direction for stochastic optimization. Its weak point is outlier sensitivity: an example with residual $10$ contributes $100$ to the loss while one with residual $1$ contributes $1$, so a single point can dominate the fit.

Absolute loss (MAE)

Definition

Absolute Loss

For regression with $\mathcal{Y} = \mathbb{R}$, $\ell_{\text{abs}}(h, (x, y)) = |h(x) - y|$.

Absolute loss is convex but not differentiable at zero, so first-order methods need subgradients. It is robust to outliers because errors enter linearly: doubling a residual doubles its contribution rather than quadrupling it.
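
The contrast with squared loss is easiest to see numerically. The following sketch (illustrative residuals only) compares per-example contributions under squared and absolute loss for a batch containing one outlier.

```python
import numpy as np

residuals = np.array([0.5, 1.0, 2.0, 10.0])   # one clear outlier at 10

squared = residuals ** 2        # MSE contribution per example
absolute = np.abs(residuals)    # MAE contribution per example

for r, s, a in zip(residuals, squared, absolute):
    print(f"residual {r:5.1f}   squared {s:7.2f}   absolute {a:5.1f}")

# Share of the total loss owned by the outlier under each loss:
print("outlier share, MSE:", squared[-1] / squared.sum())    # ~0.95
print("outlier share, MAE:", absolute[-1] / absolute.sum())  # ~0.74
```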

Huber loss

Definition

Huber Loss

For a threshold $\delta > 0$ and residual $e = h(x) - y$,

$$\ell_\delta(h, (x, y)) = \begin{cases} \tfrac{1}{2} e^2 & \text{if } |e| \leq \delta,\\ \delta\bigl(|e| - \tfrac{1}{2}\delta\bigr) & \text{if } |e| > \delta. \end{cases}$$

Huber is the standard hybrid: quadratic near zero (smooth, well-conditioned for small residuals), linear in the tails (outlier-robust). The construction is calibrated so that the function and its first derivative are continuous at $|e| = \delta$. The derivative at the boundary equals $\delta$ from both sides, which is why the linear piece carries the offset $-\tfrac{1}{2}\delta^2$. The threshold $\delta$ is a tunable robustness knob; large $\delta$ recovers MSE, small $\delta$ approaches MAE.
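
A small sketch of the formula above (the helper names are our own), checking numerically that both the loss value and its derivative agree from the two sides of the boundary $|e| = \delta$.

```python
import numpy as np

def huber(e, delta=1.0):
    """Huber loss of a residual e: quadratic inside |e| <= delta, linear outside."""
    e = np.asarray(e, dtype=float)
    quadratic = 0.5 * e ** 2
    linear = delta * (np.abs(e) - 0.5 * delta)
    return np.where(np.abs(e) <= delta, quadratic, linear)

def huber_grad(e, delta=1.0):
    """Derivative in e: e inside the threshold, delta * sign(e) outside."""
    e = np.asarray(e, dtype=float)
    return np.where(np.abs(e) <= delta, e, delta * np.sign(e))

delta = 1.5
eps = 1e-9
# Value and derivative match from both sides of |e| = delta:
print(huber(delta - eps, delta), huber(delta + eps, delta))            # ~1.125 both
print(huber_grad(delta - eps, delta), huber_grad(delta + eps, delta))  # 1.5 both
```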

Cross-entropy / log loss

Definition

Binary Cross-Entropy

For binary classification with $y \in \{0, 1\}$ and predicted probability $p = h(x) \in (0, 1)$, $\ell_{\text{ce}}(h, (x, y)) = -\bigl[y \log p + (1 - y) \log(1 - p)\bigr]$. The general $K$-class form is $-\sum_{k=1}^K \mathbf{1}[y = k] \log p_k$.

Cross-entropy is the negative log-likelihood under a Bernoulli (or categorical) likelihood model: maximizing the likelihood of the labels under predicted probabilities $p$ is identical to minimizing $\ell_{\text{ce}}$ summed over the sample. It is convex in the logit, smooth, and strictly proper as a scoring rule, so it is the natural loss whenever the model outputs a probability.
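
The likelihood identity is easy to verify numerically; this sketch (with made-up labels and probabilities) shows that the summed binary cross-entropy equals the negative log of the Bernoulli likelihood of the labels.

```python
import numpy as np

def bce(p, y):
    """Binary cross-entropy for predicted probability p and label y in {0, 1}."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(2)
y = rng.integers(0, 2, size=5)            # labels
p = rng.uniform(0.05, 0.95, size=5)       # predicted probabilities

# Likelihood of the labels under a Bernoulli model with parameters p:
likelihood = np.prod(np.where(y == 1, p, 1 - p))

# Summed cross-entropy equals the negative log of that likelihood:
print(bce(p, y).sum(), -np.log(likelihood))   # same number
```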

Hinge loss

Definition

Hinge Loss

For binary classification with $y \in \{-1, +1\}$ and raw score $f(x) \in \mathbb{R}$, $\ell_{\text{hinge}}(f, (x, y)) = \max\bigl(0, 1 - y \cdot f(x)\bigr)$.

The product $y \cdot f(x)$ is the margin: positive when the score has the correct sign, negative when wrong, with magnitude $|f(x)|$ measuring how confident the prediction is. The hinge is zero when the margin is at least one (correct with confidence) and grows linearly when the margin falls short. This is the loss minimized by the support vector machine (with an L2 regularizer); maximizing the margin is exactly minimizing this loss plus a norm penalty.
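
A minimal sketch of the hinge as a function of the margin, on illustrative scores: the loss is zero once the margin reaches one and grows linearly as it falls short.

```python
import numpy as np

def hinge(f_x, y):
    """Hinge loss for raw score f(x) and label y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * f_x)

# Four cases: confidently right, barely right, on the decision boundary, wrong.
scores = np.array([2.5, 0.3, 0.0, -1.2])
labels = np.array([1, 1, 1, 1])

for f_x, y in zip(scores, labels):
    print(f"margin {y * f_x:+.1f}   hinge {hinge(f_x, y):.1f}")
# margin +2.5 -> 0.0 (no loss), +0.3 -> 0.7, +0.0 -> 1.0, -1.2 -> 2.2
```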

How to Choose

| Loss | Task | Convex | Differentiable | Outlier-robust |
| --- | --- | --- | --- | --- |
| 0-1 | Classification (eval only) | No | No | N/A |
| MSE | Regression | Yes | Yes (smooth) | No |
| MAE | Regression | Yes | No (kink at 0) | Yes |
| Huber | Regression | Yes | Yes (smooth) | Yes |
| Cross-entropy | Probabilistic classification | Yes (in logits) | Yes | N/A |
| Hinge | Margin classification (SVM) | Yes | No (kink at margin 1) | N/A |

The decision rule:

  • Classification, evaluation: 0-1 loss. It is what end users care about.
  • Classification, optimization: cross-entropy (probabilistic outputs) or hinge (margin-based, SVM-style). Both are convex surrogates for 0-1.
  • Regression, clean data: MSE. Smooth, closed-form for linear models, MLE under Gaussian noise.
  • Regression, outliers expected: Huber if you want smoothness, MAE if you want strict linearity in the tails.
  • Probability calibration needed: cross-entropy. It is strictly proper as a scoring rule.

Mean vs Median: Why MAE is Robust

The deepest fact about MSE versus MAE is that they recover different summaries of the conditional response distribution.

Theorem

MSE minimizer is the conditional mean

Statement

For a joint distribution $(X, Y)$ with $\mathbb{E}[Y^2] < \infty$, the predictor $h^* : \mathcal{X} \to \mathbb{R}$ minimizing the population squared loss $\mathbb{E}[(h(X) - Y)^2]$ over all measurable functions is $h^*(x) = \mathbb{E}[Y \mid X = x]$.

Intuition

Squared loss penalizes large deviations heavily, so the minimizer must balance positive and negative residuals symmetrically. The point that does this on a one-dimensional distribution is the mean.

Proof Sketch

Condition on $X = x$. The problem reduces to choosing a constant $c = h(x)$ that minimizes $\mathbb{E}[(c - Y)^2 \mid X = x]$. Setting the derivative in $c$ to zero gives $c = \mathbb{E}[Y \mid X = x]$. The bias-variance decomposition $\mathbb{E}[(c - Y)^2 \mid X = x] = (c - \mathbb{E}[Y \mid X = x])^2 + \mathrm{Var}(Y \mid X = x)$ confirms this is the minimum, with residual variance independent of $c$.
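
A numerical illustration of the proof sketch, assuming one fixed (skewed) conditional distribution: a grid search over constants $c$ lands on the sample mean, and the bias-variance identity holds for the sample moments.

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.gamma(shape=2.0, scale=1.5, size=50_000)   # a fixed skewed conditional distribution

cs = np.linspace(0.0, 8.0, 801)                    # candidate constants c
risks = np.array([np.mean((c - y) ** 2) for c in cs])

best_c = cs[np.argmin(risks)]
print(best_c, y.mean())                            # grid minimizer ≈ sample mean

# Bias-variance identity E[(c - Y)^2] = (c - E[Y])^2 + Var(Y), checked at an arbitrary c:
c = 5.0
print(np.mean((c - y) ** 2), (c - y.mean()) ** 2 + y.var())
```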

Why It Matters

This identifies what regression with MSE is actually estimating. It also gives the irreducible-error decomposition: the minimum population MSE over all predictors is $\mathbb{E}[\mathrm{Var}(Y \mid X)]$, the noise floor.

Failure Mode

Without a finite second moment, the conditional mean may not exist or the squared loss may be infinite. For Cauchy-distributed $Y$, MSE is undefined and the conditional mean is not the right summary.

Theorem

MAE minimizer is the conditional median

Statement

For a joint distribution $(X, Y)$ with $\mathbb{E}[|Y|] < \infty$, the predictor $h^*$ minimizing $\mathbb{E}[|h(X) - Y|]$ over all measurable functions is $h^*(x) = \mathrm{median}(Y \mid X = x)$.

Proof Sketch

Condition on $X = x$ and choose a constant $c$ to minimize $g(c) = \mathbb{E}[|c - Y| \mid X = x]$. The subgradient in $c$ is $\mathbb{P}[Y < c \mid X = x] - \mathbb{P}[Y > c \mid X = x]$, which is zero exactly at the median.

Failure Mode

The minimizer is non-unique when the conditional distribution has a flat region of mass, since any point in that region is a valid median.

The robustness story falls out: a single outlier shifts the sample mean by $\Theta(1/n)$ per unit of outlier magnitude, while the median moves at most to a neighboring order statistic and stays bounded until contamination crosses the 50% breakdown point. MSE chases means, MAE chases medians, and means are not robust.
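
A sketch of that contrast on an assumed toy sample: grid-searching the best constant predictor under each loss shows the MSE minimizer jumping with a single outlier while the MAE minimizer stays put.

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(loc=0.0, scale=1.0, size=99)
y_contaminated = np.append(y, 1000.0)        # one wild outlier

def best_constant(data, loss):
    """Grid-search the constant c minimizing the average per-example loss."""
    cs = np.linspace(-15, 15, 3001)
    return cs[np.argmin([np.mean(loss(c - data)) for c in cs])]

for name, data in [("clean", y), ("with outlier", y_contaminated)]:
    c_mse = best_constant(data, lambda e: e ** 2)
    c_mae = best_constant(data, np.abs)
    print(f"{name:13s}  MSE-optimal c = {c_mse:+6.2f}   MAE-optimal c = {c_mae:+6.2f}")

# The MSE minimizer tracks the mean and jumps by ~10 with the outlier;
# the MAE minimizer tracks the median and barely moves.
print("mean:", y_contaminated.mean(), "  median:", np.median(y_contaminated))
```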

Connection to ERM and PAC

Once a loss is fixed, empirical risk minimization is the rule $\hat{h} \in \arg\min_{h \in \mathcal{H}} L_S(h)$, and the PAC framework asks how many samples $m$ suffice for $L_{\mathcal{D}}(\hat{h}) \leq \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + \varepsilon$ with probability at least $1 - \delta$. Both definitions are stated against an arbitrary $\ell$, which is why the same theorems cover classification and regression.

The choice of $\ell$ controls more than answer quality:

  • Tractability. Non-convex $\ell$ (0-1 loss) makes ERM NP-hard for many natural classes; convex $\ell$ admits gradient methods. This is the practical reason we minimize cross-entropy or hinge while caring about 0-1.
  • Sample complexity. The dependence on $\ell$ enters concentration bounds through the loss range. Hoeffding-style bounds for bounded losses scale as $\sqrt{B^2 \log|\mathcal{H}| / m}$ where $B$ is the loss range, so unbounded losses (square loss with unbounded $Y$) require sub-Gaussian or sub-exponential machinery instead (a numeric sketch follows this list).
  • Consistency. Minimizing a convex surrogate does not automatically minimize 0-1 risk. A surrogate $\phi$ is classification-calibrated if its population minimizer agrees with the Bayes classifier on every $x$. Cross-entropy, hinge, and logistic losses are calibrated; squared loss on $\pm 1$ targets is calibrated; not every convex loss is. See Bartlett, Jordan, McAuliffe (2006) for the full statement.
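
For the sample-complexity bullet, here is a hedged sketch of the finite-class Hoeffding computation, assuming a loss bounded in $[0, B]$ and a union bound over a finite class $\mathcal{H}$; the function name `uniform_convergence_m` is our own.

```python
import numpy as np

def uniform_convergence_m(n_hypotheses, eps, delta, B=1.0):
    """Smallest m with 2|H| exp(-2 m eps^2 / B^2) <= delta, i.e. every h in H
    satisfies |L_S(h) - L_D(h)| <= eps with probability >= 1 - delta
    (Hoeffding bound on each h, union bound over the class)."""
    return int(np.ceil(B ** 2 * np.log(2 * n_hypotheses / delta) / (2 * eps ** 2)))

# Example: 10^6 hypotheses, loss in [0, 1], eps = 0.05, delta = 0.01.
print(uniform_convergence_m(1e6, eps=0.05, delta=0.01))   # ~3.8k samples
```
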
Watch Out

The loss is on hypothesis-example pairs, not on prediction-label pairs

Most practitioner exposure pretends $\ell$ is a function of two reals, $\ell(\hat{y}, y)$. The Shalev-Shwartz–Ben-David framework is more general: $\ell : \mathcal{H} \times \mathcal{Z} \to \mathbb{R}_+$ takes the hypothesis itself, which lets density estimation ($\ell(h, z) = -\log h(z)$) and clustering ($\ell(h, z) = \min_c \|z - c\|^2$) live in the same definition. The $L(h(x), y)$ form is a special case, not the whole picture.

Watch Out

Calibration is not the same as consistency

"Cross-entropy is calibrated for 0-1 loss" means the population minimizer of cross-entropy gives the right hard classifier. It does NOT mean that ERM under cross-entropy converges to the Bayes-optimal classifier at any specific rate, or that finite-sample empirical minimizers behave well. Calibration is a statement about population minimizers, consistency is a statement about empirical procedures.

Watch Out

Bounded loss is a working assumption, not a fact about regression

The cleanest generalization bounds (Hoeffding, finite-class ERM, basic Rademacher) assume the loss is bounded. Squared loss on unbounded $Y$ is not bounded; getting useful bounds in that setting needs sub-Gaussian or sub-exponential tail control on $Y$ or a clipped loss. Quoting a "$\sqrt{\log|\mathcal{H}| / m}$" bound for ordinary least squares without bounding $Y$ is dropping a real assumption.

Summary

  • A loss is a map $\ell : \mathcal{H} \times \mathcal{Z} \to \mathbb{R}_+$. Population risk averages it over $\mathcal{D}$; empirical risk averages it over a sample.
  • The six standard losses split by task: 0-1 (classification target), MSE / MAE / Huber (regression), cross-entropy / hinge (classification surrogates).
  • 0-1 is non-convex and intractable, so practice optimizes a convex surrogate and reports 0-1 only at evaluation.
  • MSE recovers conditional means, MAE recovers conditional medians; this is why MAE is outlier-robust.
  • ERM and PAC are stated for an arbitrary $\ell$, but the choice controls tractability, sample complexity, and what consistent recovery means.

Exercises

ExerciseCore

Problem

A binary classifier outputs predicted probability $p = 0.9$ for a positive example ($y = 1$) and $p = 0.4$ for a negative example ($y = 0$). Compute the binary cross-entropy on each example and identify which contributes more to the average loss.

ExerciseAdvanced

Problem

Show that the Huber loss with threshold $\delta$ has a continuous first derivative at $|e| = \delta$, and explain why the offset $-\tfrac{1}{2}\delta^2$ in the linear piece is fixed by this requirement together with continuity of $\ell_\delta$ itself.

ExerciseAdvanced

Problem

Show that the population minimizer of MAE is the conditional median by writing $g(c) = \mathbb{E}[|c - Y|]$ for a fixed conditional distribution and computing $g'(c)$ in the sense of subgradients.

References

Canonical:

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapter 3 (generalized loss functions, Definition 3.1; PAC framework with general loss, Section 3.2)
  • Mohri, Rostamizadeh, Talwalkar, Foundations of Machine Learning (2018, 2nd ed.), Chapter 1 (PAC learning), Chapter 5 (SVMs and hinge loss)
  • Hastie, Tibshirani, Friedman, Elements of Statistical Learning (2009, 2nd ed.), Chapter 2 (squared and 0-1 loss), Chapter 10 (Huber, robust regression)

Current:

  • Bartlett, Jordan, McAuliffe, "Convexity, Classification, and Risk Bounds," JASA 101 (2006), arXiv:math/0508276 — classification-calibration of convex surrogate losses
  • Ben-David, Eiron, Long, "On the difficulty of approximately maximizing agreements," J. Computer and System Sciences 66 (2003) — NP-hardness of empirical 0-1 minimization for halfspaces
  • Boyd & Vandenberghe, Convex Optimization (2004), Chapter 6 (Huber and other robust loss functions)

Last reviewed: May 8, 2026
