Learning Theory Core
Loss Functions
The generalized loss-function framework from Shalev-Shwartz & Ben-David Ch 3, the six standard losses (0-1, MSE, MAE, Huber, cross-entropy, hinge), and the bridge from supervised learning to ERM and PAC learnability.
Prerequisites
Why This Matters
A loss function turns prediction quality into a number. Once you fix the loss, you have decided what you are optimizing, what counts as an error, and which solutions tie. Two models with the same architecture and the same data will learn different things if you swap the loss; switching from squared loss to absolute loss replaces the conditional mean with the conditional median. The choice is not cosmetic.
The generalized loss framework from Shalev-Shwartz and Ben-David (Understanding Machine Learning, Ch 3) gives one definition that covers classification, regression, density estimation, ranking, and structured prediction. The same definition is what lets empirical risk minimization and the PAC learning framework be stated once instead of once per task.
Mental Model
A loss is a referee with a stopwatch. Hand it the model and a labeled example; it returns a non-negative number that says how badly the model did on that example. Population risk is the expected referee call over the data distribution; empirical risk is the average call on the training set. Everything downstream — ERM, generalization bounds, PAC sample complexity — is bookkeeping on top of those two quantities.
The Generalized Framework
Shalev-Shwartz and Ben-David define a loss as a map on hypothesis-example pairs.
Loss Function
A loss function is a map $\ell : \mathcal{H} \times Z \to \mathbb{R}_{+}$ sending a hypothesis $h$ and an example $z$ to a non-negative real number $\ell(h, z)$. For supervised learning $z = (x, y)$ and a typical loss factors as $\ell(h, (x, y)) = \phi(h(x), y)$, but the framework does not require this; density estimation uses $\ell(h, z) = -\log h(z)$ and $k$-means uses $\ell(h, z) = \min_{c \in h} \lVert z - c \rVert^2$ over the set of centers $h$.
Population Risk
For a distribution $\mathcal{D}$ over $Z$, the population risk (or true risk) of $h$ is
$$L_{\mathcal{D}}(h) = \mathbb{E}_{z \sim \mathcal{D}}\left[\ell(h, z)\right].$$
The expectation runs over the data distribution; $h$ is fixed inside it.
Empirical Risk
Given a sample $S = (z_1, \ldots, z_m)$ drawn iid from $\mathcal{D}$, the empirical risk is the sample average
$$L_S(h) = \frac{1}{m} \sum_{i=1}^{m} \ell(h, z_i).$$
This is the only one of the two we can compute, since $\mathcal{D}$ is unknown.
The expectation in $L_{\mathcal{D}}(h)$ is the standard one for a random variable: if $W = \ell(h, z)$ with $z \sim \mathcal{D}$, then $L_{\mathcal{D}}(h) = \mathbb{E}[W]$. For a discrete distribution this collapses to $\sum_{z} \ell(h, z)\, \mathbb{P}[z]$; for an absolutely continuous one it becomes $\int \ell(h, z)\, p(z)\, dz$. Population risk averages out the example randomness; empirical risk replaces that average with the empirical measure $\frac{1}{m} \sum_{i=1}^{m} \delta_{z_i}$.
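The two risks can be put side by side in a small simulation. This is an illustrative sketch, not from the text: the distribution $\mathcal{D}$ (uniform inputs, linear signal, Gaussian noise), the hypothesis `h`, and the sample sizes are all assumptions chosen for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

def h(x):
    # Hypothetical fixed hypothesis: slightly mis-specified slope.
    return 1.9 * x

def sq_loss(pred, y):
    return (pred - y) ** 2

# Assumed data distribution D: x ~ Uniform(-1, 1), y = 2x + Gaussian noise.
def sample(n):
    x = rng.uniform(-1.0, 1.0, size=n)
    y = 2.0 * x + rng.normal(0.0, 0.1, size=n)
    return x, y

# Empirical risk L_S(h): average loss on an iid training sample of size m = 50.
x_tr, y_tr = sample(50)
L_S = np.mean(sq_loss(h(x_tr), y_tr))

# Population risk L_D(h): unknowable exactly; approximated by Monte Carlo
# on a huge fresh sample standing in for the expectation over D.
x_big, y_big = sample(1_000_000)
L_D = np.mean(sq_loss(h(x_big), y_big))

print(f"L_S(h) = {L_S:.4f}  (computable from data)")
print(f"L_D(h) ~ {L_D:.4f}  (here analytically 0.01*E[x^2] + 0.01)")
```

The empirical risk fluctuates with the draw of the 50-point sample; the Monte Carlo estimate of the population risk sits near its analytic value.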
The Six Standard Losses
0-1 loss
Zero-One Loss
For a binary or multiclass classifier $h : \mathcal{X} \to \mathcal{Y}$,
$$\ell_{0\text{-}1}(h, (x, y)) = \mathbb{1}[h(x) \neq y].$$
The indicator is one when the prediction is wrong and zero when it is right.
The expected 0-1 loss is exactly the misclassification probability: for the Bernoulli indicator $\mathbb{1}[h(x) \neq y]$, the law of the unconscious statistician gives $\mathbb{E}\left[\mathbb{1}[h(x) \neq y]\right] = \mathbb{P}[h(x) \neq y]$. So $L_{\mathcal{D}}(h)$ under 0-1 loss IS the population error rate.
The 0-1 loss is non-convex in the prediction and discontinuous; the gradient is zero almost everywhere and undefined where it is not. Direct ERM is NP-hard for most non-trivial hypothesis classes (Ben-David, Eiron, Long 2003). This is why classification practice optimizes a convex surrogate and reports 0-1 error only at evaluation time.
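A quick numeric check that the average 0-1 loss is the misclassification rate. This is a sketch in which the classifier, the input distribution, and the 20% label-noise rate are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def h(x):
    # Hypothetical classifier: predict the sign of x.
    return np.where(x >= 0, 1, -1)

# Assumed distribution: h is the Bayes classifier, labels flipped w.p. 0.2.
n = 200_000
x = rng.normal(size=n)
flip = rng.random(n) < 0.2
y = np.where(flip, -h(x), h(x))

# Per-example 0-1 loss: indicator that the prediction is wrong.
zero_one = (h(x) != y).astype(float)

# The average 0-1 loss estimates P[h(x) != y], the misclassification rate.
error_rate = zero_one.mean()
print(f"empirical 0-1 risk = {error_rate:.3f}  (noise rate was 0.2)")
```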
Square loss (MSE)
Squared Loss
For regression with $y \in \mathbb{R}$,
$$\ell_{\mathrm{sq}}(h, (x, y)) = (h(x) - y)^2.$$
The mean squared error of a predictor is the expected squared loss: $\mathrm{MSE}(h) = \mathbb{E}\left[(h(X) - Y)^2\right]$.
Squared loss is smooth and strictly convex in the prediction; it has a closed-form minimizer for linear regression and a unique gradient direction for stochastic optimization. Its weak point is outlier sensitivity: an example with residual $r$ contributes $r^2$ to the loss while one with residual $10r$ contributes $100\,r^2$, so a single point can dominate the fit.
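The quadratic-versus-linear contribution argument can be made concrete. The residual values below (99 small ones plus one outlier) are arbitrary illustrative numbers.

```python
import numpy as np

# 99 typical residuals of 0.1 and one outlier residual of 10 (illustrative).
residuals = np.array([0.1] * 99 + [10.0])

sq = residuals ** 2          # squared-loss contributions
ab = np.abs(residuals)       # linear (absolute-value) contributions

sq_share = sq[-1] / sq.sum()   # outlier's share under a quadratic penalty
ab_share = ab[-1] / ab.sum()   # outlier's share under a linear penalty

print(f"outlier share, quadratic penalty: {sq_share:.3f}")  # ~0.990
print(f"outlier share, linear penalty:    {ab_share:.3f}")  # ~0.503
```

Under the quadratic penalty one point carries ~99% of the total loss; under a linear penalty the same point carries about half.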
Absolute loss (MAE)
Absolute Loss
For regression with $y \in \mathbb{R}$,
$$\ell_{\mathrm{abs}}(h, (x, y)) = |h(x) - y|.$$
Absolute loss is convex but not differentiable at zero, so first-order methods need subgradients. It is robust to outliers because errors enter linearly: doubling a residual doubles its contribution rather than quadrupling it.
Huber loss
Huber Loss
For a threshold $\delta > 0$ and residual $r = h(x) - y$,
$$\ell_{\delta}(r) = \begin{cases} \tfrac{1}{2} r^2 & \text{if } |r| \le \delta, \\[2pt] \delta \left( |r| - \tfrac{\delta}{2} \right) & \text{if } |r| > \delta. \end{cases}$$
Huber is the standard hybrid: quadratic near zero (smooth, well-conditioned for small residuals), linear in the tails (outlier-robust). The construction is calibrated so that the function and its first derivative are continuous at $|r| = \delta$. The derivative at the boundary equals $\delta \cdot \operatorname{sign}(r)$ from both sides, which is why the linear piece carries the offset $-\delta^2/2$. The threshold $\delta$ is a tunable robustness knob; large $\delta$ recovers MSE, small $\delta$ approaches (a scaled) MAE.
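A sketch verifying the boundary calibration numerically; the `huber` implementation follows the piecewise definition above, and the choice $\delta = 1.5$ is an assumption for the demo.

```python
import numpy as np

def huber(r, delta):
    # Quadratic inside |r| <= delta; linear with offset -delta^2/2 outside.
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= delta,
                    0.5 * r ** 2,
                    delta * (np.abs(r) - 0.5 * delta))

delta, eps = 1.5, 1e-6

# One-sided difference quotients straddling the boundary r = delta:
# both should approach the common derivative value delta.
left_deriv = (huber(delta, delta) - huber(delta - eps, delta)) / eps
right_deriv = (huber(delta + eps, delta) - huber(delta, delta)) / eps

print(f"derivative from the left:  {left_deriv:.4f}")   # ~ delta = 1.5
print(f"derivative from the right: {right_deriv:.4f}")  # ~ delta = 1.5
```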
Cross-entropy / log loss
Binary Cross-Entropy
For binary classification with $y \in \{0, 1\}$ and predicted probability $\hat{p} = h(x) \in (0, 1)$,
$$\ell_{\mathrm{CE}}(h, (x, y)) = -\left[\, y \log \hat{p} + (1 - y) \log(1 - \hat{p}) \,\right].$$
The general $K$-class form is $-\sum_{k=1}^{K} y_k \log \hat{p}_k$ for a one-hot label $y$.
Cross-entropy is the negative log-likelihood under a Bernoulli (or categorical) likelihood model: maximizing the likelihood of the labels under predicted probabilities is identical to minimizing $-\log \hat{p}(y \mid x)$ summed over the sample. It is convex in the logit, smooth, and strictly proper as a scoring rule, so it is the natural loss whenever the model outputs a probability.
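A minimal check of the cross-entropy/likelihood identity on made-up labels and predicted probabilities (all values are illustrative).

```python
import numpy as np

def bce(p, y):
    # Binary cross-entropy per example.
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical labels and predicted probabilities.
y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.6, 0.99])

# Bernoulli likelihood of the observed labels, computed directly as a
# product of per-example probabilities, then logged.
log_lik = np.log(np.prod(np.where(y == 1, p, 1 - p)))

# Summed cross-entropy equals the negative log-likelihood.
print(f"sum of BCE        = {bce(p, y).sum():.5f}")
print(f"-log-likelihood   = {-log_lik:.5f}")
```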
Hinge loss
Hinge Loss
For binary classification with $y \in \{-1, +1\}$ and raw score $h(x) \in \mathbb{R}$,
$$\ell_{\mathrm{hinge}}(h, (x, y)) = \max\{0,\; 1 - y\, h(x)\}.$$
The product $y\,h(x)$ is the margin: positive when the score has the correct sign, negative when wrong, with magnitude measuring how confident the prediction is. The hinge is zero when the margin is at least one (correct with confidence) and grows linearly when the margin falls short. This is the loss minimized by the support vector machine (with an L2 regularizer); maximizing the margin is exactly minimizing this loss plus a norm penalty.
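A small table of hinge values for a positive example, sketched under the $y \in \{-1, +1\}$ convention above (the scores are arbitrary illustrative values).

```python
import numpy as np

def hinge(score, y):
    # Zero once the margin y*score reaches 1; linear growth below that.
    return float(np.maximum(0.0, 1.0 - y * score))

# Positive example (y = +1) at a range of raw scores: for y = +1 the
# margin equals the score itself.
for s in [2.0, 1.0, 0.5, 0.0, -1.0]:
    print(f"score {s:+.1f} -> hinge {hinge(s, 1):.1f}")
```

Scores of 2.0 and 1.0 (margin ≥ 1) incur zero loss; a correct but unconfident score of 0.5 is still penalized, and a wrong score of -1.0 costs 2.0.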
How to Choose
| Loss | Task | Convex | Differentiable | Outlier-robust |
|---|---|---|---|---|
| 0-1 | Classification (eval only) | No | No | N/A |
| MSE | Regression | Yes | Yes (smooth) | No |
| MAE | Regression | Yes | No (kink at 0) | Yes |
| Huber | Regression | Yes | Yes (smooth) | Yes |
| Cross-entropy | Probabilistic classification | Yes (in logits) | Yes | N/A |
| Hinge | Margin classification (SVM) | Yes | No (kink at margin 1) | N/A |
The decision rule:
- Classification, evaluation: 0-1 loss. It is what end users care about.
- Classification, optimization: cross-entropy (probabilistic outputs) or hinge (margin-based, SVM-style). Both are convex surrogates for 0-1.
- Regression, clean data: MSE. Smooth, closed-form for linear models, MLE under Gaussian noise.
- Regression, outliers expected: Huber if you want smoothness, MAE if you want strict linearity in the tails.
- Probability calibration needed: cross-entropy. It is strictly proper as a scoring rule.
Mean vs Median: Why MAE is Robust
The deepest fact about MSE versus MAE is that they recover different summaries of the conditional response distribution.
MSE minimizer is the conditional mean
Statement
For a joint distribution with $\mathbb{E}[Y^2] < \infty$, the predictor minimizing the population squared loss $\mathbb{E}\left[(Y - f(X))^2\right]$ over all measurable functions is
$$f^*(x) = \mathbb{E}[Y \mid X = x].$$
Intuition
Squared loss penalizes large deviations heavily, so the minimizer must balance positive and negative residuals symmetrically. The point that does this on a one-dimensional distribution is the mean.
Proof Sketch
Condition on $X = x$. The problem reduces to choosing a constant $c$ that minimizes $\mathbb{E}\left[(Y - c)^2 \mid X = x\right]$. Setting the derivative in $c$ to zero gives $c = \mathbb{E}[Y \mid X = x]$. The bias-variance decomposition $\mathbb{E}\left[(Y - c)^2 \mid X = x\right] = \operatorname{Var}(Y \mid X = x) + \left(c - \mathbb{E}[Y \mid X = x]\right)^2$ confirms this is the minimum, with residual variance independent of $c$.
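The proof sketch can be sanity-checked numerically: grid-search the best constant prediction under squared loss for a skewed distribution and compare it to the sample mean. The exponential stand-in for the conditional distribution of $Y$ at a fixed $x$ is an assumption chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
# Stand-in for the conditional distribution of Y at a fixed x: a skewed
# exponential with mean 2.0 (median 2*ln 2 ~ 1.39), chosen for illustration.
y = rng.exponential(scale=2.0, size=20_000)

# Grid search over constant predictions c: which c minimizes mean (Y - c)^2?
grid = np.linspace(0.0, 5.0, 501)
mse_at = np.array([np.mean((y - c) ** 2) for c in grid])
best_c = grid[np.argmin(mse_at)]

print(f"grid minimizer = {best_c:.2f}, sample mean = {y.mean():.2f}")
```

On a skewed distribution the minimizer lands on the mean, not the median, which is the whole point of the statement.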
Why It Matters
This identifies what regression with MSE is actually estimating. It also gives the irreducible-error decomposition: the minimum population MSE over all predictors is $\mathbb{E}\left[\operatorname{Var}(Y \mid X)\right]$, the noise floor.
Failure Mode
Without a finite second moment, the conditional mean may not exist or the squared loss may be infinite. For Cauchy-distributed $Y$, MSE is undefined and the conditional mean is not the right summary.
MAE minimizer is the conditional median
Statement
For a joint distribution with $\mathbb{E}|Y| < \infty$, the predictor minimizing $\mathbb{E}\left[\,|Y - f(X)|\,\right]$ over all measurable functions is
$$f^*(x) = \operatorname{median}(Y \mid X = x).$$
Proof Sketch
Condition on $X = x$ and choose a constant $c$ to minimize $g(c) = \mathbb{E}\left[\,|Y - c|\,\mid X = x\right]$. The subgradient in $c$ is $\mathbb{P}[Y < c \mid X = x] - \mathbb{P}[Y > c \mid X = x]$, which is zero exactly at the median.
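The subgradient condition can be checked empirically on the same kind of skewed stand-in distribution (an illustrative exponential, not from the text): the subgradient vanishes at the sample median, is negative below it, and positive above it.

```python
import numpy as np

rng = np.random.default_rng(3)
# Illustrative stand-in for the conditional distribution: exponential, mean 2.
y = rng.exponential(scale=2.0, size=100_000)

def subgrad(c):
    # Empirical subgradient of c -> E|Y - c|:  P[Y < c] - P[Y > c].
    return np.mean(y < c) - np.mean(y > c)

med = np.median(y)
print(f"subgradient at the median: {subgrad(med):+.4f}")
print(f"below: {subgrad(med - 0.5):+.3f}, above: {subgrad(med + 0.5):+.3f}")
```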
Failure Mode
The minimizer is non-unique when the conditional distribution has a flat region of mass, since any point in that region is a valid median.
The robustness story falls out: under $\varepsilon$-contamination by an outlier at magnitude $M$, the conditional mean shifts by roughly $M$ per unit of contamination mass, but the median does not move at all until contamination crosses the 50% breakdown point. MSE chases means, MAE chases medians, and means are not robust.
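A one-outlier contamination demo; the clean distribution and the outlier value 1000 are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
# 999 clean responses, then inject one wild outlier.
clean = rng.normal(loc=0.0, scale=1.0, size=999)
contaminated = np.append(clean, 1000.0)

# The mean absorbs the outlier in proportion to its magnitude;
# the median barely registers it.
mean_shift = contaminated.mean() - clean.mean()
median_shift = np.median(contaminated) - np.median(clean)

print(f"mean shift from one outlier:   {mean_shift:.3f}")   # ~ 1.0
print(f"median shift from one outlier: {median_shift:.3f}")  # ~ 0.0
```

One point in a thousand moves the mean by about one full unit (1000/1000) while leaving the median essentially untouched.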
Connection to ERM and PAC
Once a loss is fixed, empirical risk minimization is the rule $\mathrm{ERM}_{\mathcal{H}}(S) \in \arg\min_{h \in \mathcal{H}} L_S(h)$, and the PAC framework asks how many samples suffice for $L_{\mathcal{D}}\!\left(\mathrm{ERM}_{\mathcal{H}}(S)\right) \le \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + \epsilon$ with probability at least $1 - \delta$. Both definitions are stated against an arbitrary $\ell$, which is why the same theorems cover classification and regression.
The choice of $\ell$ controls more than answer quality:
- Tractability. A non-convex $\ell$ (0-1 loss) makes ERM NP-hard for many natural classes; a convex $\ell$ admits gradient methods. This is the practical reason we minimize cross-entropy or hinge while caring about 0-1.
- Sample complexity. The dependence on $\ell$ enters concentration bounds through the loss range. Hoeffding-style bounds for bounded losses scale as $m = O\!\left(B^2 \log(1/\delta) / \epsilon^2\right)$ where $B$ is the loss range, so unbounded losses (square loss with unbounded $Y$) require sub-Gaussian or sub-exponential machinery instead.
- Consistency. Minimizing a convex surrogate does not automatically minimize 0-1 risk. A surrogate is classification-calibrated if its population minimizer agrees with the Bayes classifier at every $x$. Cross-entropy, hinge, and logistic loss are calibrated; squared loss on $\{\pm 1\}$ targets is calibrated; not every convex loss is. See Bartlett, Jordan, McAuliffe (2006) for the full statement.
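As a concrete instance of the Hoeffding-style scaling above, here is a small calculator under the single-fixed-hypothesis, bounded-loss assumption. The constants follow the standard two-sided Hoeffding bound, not a specific result quoted in this text.

```python
import math

def hoeffding_m(B, eps, delta):
    # Two-sided Hoeffding for one fixed h with loss bounded in [0, B]:
    #   P(|L_S(h) - L_D(h)| > eps) <= 2 * exp(-2 m eps^2 / B^2)
    # Setting the right side <= delta and solving for m gives
    #   m >= B^2 * ln(2/delta) / (2 * eps^2).
    return math.ceil(B ** 2 * math.log(2.0 / delta) / (2.0 * eps ** 2))

# Loss bounded in [0, 1], accuracy eps = 0.05, confidence 95%.
print(hoeffding_m(B=1.0, eps=0.05, delta=0.05))  # 738
```

Note the quadratic blow-up in both $B$ and $1/\epsilon$: halving $\epsilon$ roughly quadruples the sample size, which is why the loss range $B$ is a real assumption and not a formality.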
The loss is on hypothesis-example pairs, not on prediction-label pairs
Most practitioner exposure pretends $\ell$ is a function of two reals, $\ell(\hat{y}, y)$. The Shai-Shai framework is more general: $\ell$ takes the hypothesis itself, which lets density estimation ($\ell(h, z) = -\log h(z)$) and clustering ($k$-means distortion to the nearest center) live in the same definition. The form $\ell(h, (x, y)) = \phi(h(x), y)$ is a special case, not the whole picture.
Calibration is not the same as consistency
"Cross-entropy is calibrated for 0-1 loss" means the population minimizer of cross-entropy gives the right hard classifier. It does NOT mean that ERM under cross-entropy converges to the Bayes-optimal classifier at any specific rate, or that finite-sample empirical minimizers behave well. Calibration is a statement about population minimizers, consistency is a statement about empirical procedures.
Bounded loss is a working assumption, not a fact about regression
The cleanest generalization bounds (Hoeffding, finite-class ERM, basic Rademacher) assume the loss is bounded. Squared loss on unbounded $Y$ is not bounded; getting useful bounds in that setting needs sub-Gaussian or sub-exponential tail control on $Y$ or a clipped loss. Quoting a "$O(1/\sqrt{m})$" bound for ordinary least squares without bounding $Y$ is dropping a real assumption.
Summary
- A loss is a map $\ell : \mathcal{H} \times Z \to \mathbb{R}_{+}$. Population risk averages it over $\mathcal{D}$; empirical risk averages it over a sample.
- The six standard losses split by task: 0-1 (classification target), MSE / MAE / Huber (regression), cross-entropy / hinge (classification surrogates).
- 0-1 is non-convex and intractable, so practice optimizes a convex surrogate and reports 0-1 only at evaluation.
- MSE recovers conditional means, MAE recovers conditional medians; this is why MAE is outlier-robust.
- ERM and PAC are stated for an arbitrary $\ell$, but the choice of $\ell$ controls tractability, sample complexity, and what consistent recovery means.
Exercises
Problem
A binary classifier outputs predicted probability $\hat{p}_1$ for a positive example ($y = 1$) and $\hat{p}_2$ for a negative example ($y = 0$). Compute the binary cross-entropy on each example in terms of $\hat{p}_1$ and $\hat{p}_2$, and determine the condition under which the positive example contributes more to the average loss.
Problem
Show that the Huber loss with threshold $\delta$ has a continuous first derivative at $|r| = \delta$, and explain why the offset $-\delta^2/2$ in the linear piece is fixed by this requirement together with continuity of $\ell_{\delta}$ itself.
Problem
Show that the population minimizer of MAE is the conditional median by writing $g(c) = \mathbb{E}\left[\,|Y - c|\,\mid X = x\right]$ for a fixed conditional distribution and computing $g'(c)$ in the sense of subgradients.
References
Canonical:
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapter 3 (generalized loss functions, Definition 3.1; PAC framework with general loss, Section 3.2)
- Mohri, Rostamizadeh, Talwalkar, Foundations of Machine Learning (2018, 2nd ed.), Chapter 1 (PAC learning), Chapter 5 (SVMs and hinge loss)
- Hastie, Tibshirani, Friedman, Elements of Statistical Learning (2009, 2nd ed.), Chapter 2 (squared and 0-1 loss), Chapter 10 (Huber, robust regression)
Current:
- Bartlett, Jordan, McAuliffe, "Convexity, Classification, and Risk Bounds," JASA 101 (2006), arXiv:math/0508276 — classification-calibration of convex surrogate losses
- Ben-David, Eiron, Long, "On the difficulty of approximately maximizing agreements," J. Computer and System Sciences 66 (2003) — NP-hardness of empirical 0-1 minimization for halfspaces
- Boyd & Vandenberghe, Convex Optimization (2004), Chapter 6 (Huber and other robust loss functions)
Next Topics
- Empirical risk minimization: the algorithm that picks $\hat{h} \in \arg\min_{h \in \mathcal{H}} L_S(h)$
- PAC learning framework: sample-complexity guarantees stated for arbitrary loss
- Support vector machines: hinge loss plus L2 regularization
- Linear regression: squared loss with a linear hypothesis class
Last reviewed: May 8, 2026
Required prerequisites
- Expectation, Variance, Covariance, and Moments
- Random Variables
- Hypothesis Classes and Function Spaces