Learning Theory Core
Empirical Risk Minimization
The foundational principle of statistical learning: minimize average loss on training data as a proxy for minimizing true population risk.
Why This Matters
Every supervised learning algorithm is an empirical risk minimizer or a regularized variant of one. When you train a neural network by minimizing cross-entropy loss on a training set, you are doing ERM. When you fit a linear regression by minimizing squared error, you are doing ERM.
Understanding ERM is the first step toward understanding why learning from data works at all, and where it fails.
ERM and Generalization: the core structure of learning theory
Mental Model
Imagine you want to find a function that predicts well on future, unseen data. You cannot directly optimize over data you haven't seen. So instead, you optimize over the data you have (your training set) and hope that good performance on training data translates to good performance on new data.
ERM formalizes this hope. The central question of learning theory is: when and why does this hope actually work?
Formal Setup and Notation
Let $\mathcal{X}$ be an input space and $\mathcal{Y}$ an output space. Let $\mathcal{D}$ be an unknown probability distribution over $\mathcal{X} \times \mathcal{Y}$.
We have a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to [0, \infty)$ measuring prediction quality.
Population Risk
The population risk (or true risk) of a hypothesis $h : \mathcal{X} \to \mathcal{Y}$ is:
$$R(h) = \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[\ell(h(x), y)\big]$$
This is the quantity we actually want to minimize, but we cannot compute it directly because $\mathcal{D}$ is unknown.
Empirical Risk
Given a training sample $S = ((x_1, y_1), \dots, (x_n, y_n))$ drawn i.i.d. from $\mathcal{D}$, the empirical risk is:
$$\hat{R}_S(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)$$
This is computable. It is the average loss on training data.
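As an average over the sample, the empirical risk is straightforward to compute. A minimal sketch; the threshold classifier, loss, and toy data below are illustrative, not from the original text:

```python
# Empirical risk: average loss of a hypothesis over a finite sample.

def empirical_risk(h, data, loss):
    """Average of loss(h(x), y) over the sample."""
    return sum(loss(h(x), y) for x, y in data) / len(data)

def zero_one_loss(y_pred, y_true):
    return 0.0 if y_pred == y_true else 1.0

# A toy threshold classifier on scalar inputs.
h = lambda x: 1 if x > 0.5 else 0
sample = [(0.2, 0), (0.7, 1), (0.9, 1), (0.4, 1)]

print(empirical_risk(h, sample, zero_one_loss))  # 0.25: one mistake out of four
```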
Empirical Risk Minimization
Given a hypothesis class $\mathcal{H}$, the ERM principle selects:
$$\hat{h} \in \arg\min_{h \in \mathcal{H}} \hat{R}_S(h)$$
The key question: when is $R(\hat{h})$ close to $\min_{h \in \mathcal{H}} R(h)$?
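For a finite class, ERM is literally an argmin of empirical risk over the class. A toy sketch; the threshold classifiers and sample are hypothetical illustrations, not from the original text:

```python
# Finite-class ERM: select the hypothesis with smallest empirical risk.

def erm(hypotheses, data, loss):
    def emp_risk(h):
        return sum(loss(h(x), y) for x, y in data) / len(data)
    return min(hypotheses, key=emp_risk)

# A finite class: threshold classifiers at five cut points.
hypotheses = [lambda x, t=t: 1 if x > t else 0 for t in (0.1, 0.3, 0.5, 0.7, 0.9)]
sample = [(0.2, 0), (0.35, 1), (0.7, 1), (0.9, 1)]

best = erm(hypotheses, sample, lambda y_pred, y: 0.0 if y_pred == y else 1.0)
print(best(0.2), best(0.35))  # 0 1 -- the t = 0.3 threshold fits this sample perfectly
```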
Core Definitions
The generalization gap of a hypothesis $h$ is the difference between population risk and empirical risk:
$$R(h) - \hat{R}_S(h)$$
If this gap is small uniformly over all $h \in \mathcal{H}$, then minimizing empirical risk also approximately minimizes population risk.
The approximation error is:
$$\min_{h \in \mathcal{H}} R(h) - R^*$$
where $R^*$ is the Bayes risk. This measures how well the best hypothesis in $\mathcal{H}$ can do, regardless of sample size.
The estimation error is:
$$R(\hat{h}) - \min_{h \in \mathcal{H}} R(h)$$
This is what ERM theory actually controls.
Main Theorems
ERM Generalization via Uniform Convergence
Statement
If $\mathcal{H}$ is finite and the loss is bounded in $[0, 1]$, then for any $\delta > 0$, with probability at least $1 - \delta$ over the draw of $S$:
$$\sup_{h \in \mathcal{H}} \big| R(h) - \hat{R}_S(h) \big| \le \sqrt{\frac{\ln(2|\mathcal{H}|/\delta)}{2n}}$$
Intuition
The bound says: if the hypothesis class is not too large (measured by $\ln|\mathcal{H}|$) and the sample is large enough, then empirical risk uniformly approximates population risk. The ERM hypothesis $\hat{h}$ therefore has population risk close to the best in class.
Proof Sketch
Apply Hoeffding's inequality to each $h \in \mathcal{H}$ to get $\Pr\big[|R(h) - \hat{R}_S(h)| > \epsilon\big] \le 2e^{-2n\epsilon^2}$. Then take a union bound over all $|\mathcal{H}|$ hypotheses, so the total failure probability is at most $2|\mathcal{H}| e^{-2n\epsilon^2}$. Set this equal to $\delta$ and solve for $\epsilon$.
Why It Matters
This is the simplest generalization bound in learning theory. It shows why ERM works for finite hypothesis classes. The dependence on $\ln|\mathcal{H}|$ rather than $|\mathcal{H}|$ is crucial: it means the sample complexity grows only logarithmically in the number of hypotheses.
Failure Mode
This bound is vacuous for infinite hypothesis classes (e.g., all linear classifiers). You need more sophisticated complexity measures (VC dimension, Rademacher complexity, or covering numbers) to handle that case.
Proof Ideas and Templates Used
The proof of the finite-class ERM bound uses two standard tools:
- Hoeffding's inequality: to control the deviation of the empirical mean from the true mean for a single hypothesis
- Union bound: to extend the guarantee from a single hypothesis to all hypotheses in the class simultaneously
This "Hoeffding + union bound" pattern is the simplest proof template in learning theory. It breaks for infinite classes because the union bound over uncountably many hypotheses gives a vacuous (infinite) bound.
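The Hoeffding-plus-union-bound guarantee can be sanity-checked by simulation. A sketch under simplifying assumptions: each hypothesis's 0-1 losses are modeled as independent Bernoulli draws, and the class size, sample size, and trial count are arbitrary choices, not values from the text:

```python
import math
import random

random.seed(0)
H, n, delta, trials = 50, 500, 0.05, 200
true_risks = [random.random() for _ in range(H)]  # R(h) for each hypothesis
bound = math.sqrt(math.log(2 * H / delta) / (2 * n))

violations = 0
for _ in range(trials):
    # Empirical risk of h = mean of n Bernoulli(R(h)) losses.
    worst_gap = max(
        abs(sum(random.random() < r for _ in range(n)) / n - r)
        for r in true_risks
    )
    if worst_gap > bound:
        violations += 1

# The guarantee says the worst gap exceeds the bound with probability <= delta.
print(violations / trials, "<=", delta)
```

In practice the observed violation rate is far below delta, reflecting the looseness of the union bound.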
Canonical Examples
Finite binary classifiers
Suppose $\mathcal{H}$ contains 1000 binary classifiers and we use 0-1 loss. With $n = 1000$ training examples and $\delta = 0.05$:
$$\sqrt{\frac{\ln(2 \cdot 1000 / 0.05)}{2 \cdot 1000}} \approx 0.073$$
So with probability at least 0.95, the ERM hypothesis has population risk within about 7.3% of its training risk.
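The arithmetic behind a bound like this is a one-liner. A sketch assuming $|\mathcal{H}| = 1000$, $n = 1000$, and $\delta = 0.05$ (one plausible instantiation), using the two-sided finite-class bound:

```python
import math

# Finite-class uniform convergence bound: sqrt(ln(2|H|/delta) / (2n)).
H, n, delta = 1000, 1000, 0.05  # class size, sample size, failure probability
gap = math.sqrt(math.log(2 * H / delta) / (2 * n))
print(round(gap, 3))  # 0.073
```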
Common Confusions
ERM is not the same as training loss minimization
ERM minimizes over a fixed hypothesis class $\mathcal{H}$. In practice, when you train a neural network, the class is parameterized and the optimization landscape matters. The ERM framework abstracts away optimization difficulty and asks: if you could find the exact minimizer, would it generalize? Real training adds optimization error on top of the estimation error ERM theory controls.
Small training loss does not imply small test loss
A model can memorize the training set (achieving zero empirical risk) while having terrible population risk. This happens when $\mathcal{H}$ is too rich relative to the sample size $n$. The generalization bound tells you when you can trust low training risk as a signal of low population risk.
Training loss and training error are not the same
In classification you typically optimize a convex surrogate loss (cross-entropy, hinge, logistic) while caring about the 0-1 classification error. "Minimizing training loss" (surrogate) and "minimizing training error" (0-1) are not equivalent. The connection is classification-calibration: a surrogate is calibrated if minimizing its population risk implies minimizing the 0-1 population risk (Bartlett, Jordan, McAuliffe 2006). Cross-entropy, hinge, and logistic are calibrated; squared loss on $\pm 1$ targets is calibrated for binary classification; but not every convex surrogate is. Pages that use "training loss" and "training error" interchangeably are eliding this distinction.
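To make the surrogate-vs-0-1 distinction concrete, here is a sketch computing both quantities on the same real-valued margin scores; the scores and labels are arbitrary toy values, not from the original text:

```python
import math

# Logistic (cross-entropy) surrogate vs 0-1 error on the same predictions.
# Labels in {-1, +1}; scores are real-valued margins.
scores = [2.0, -0.5, 0.1, -3.0]
labels = [1, 1, -1, -1]

# Logistic surrogate: log(1 + exp(-y * s)), averaged over examples.
logistic = sum(math.log(1 + math.exp(-y * s))
               for s, y in zip(scores, labels)) / len(labels)
# 0-1 error: fraction of examples with non-positive margin y * s.
zero_one = sum(1 for s, y in zip(scores, labels) if y * s <= 0) / len(labels)

print(round(logistic, 3), zero_one)  # the two numbers generally differ
```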
Summary
- Population risk is what you want to minimize; empirical risk is what you can compute
- The generalization gap must be controlled uniformly
- For finite $\mathcal{H}$: the gap scales as $\sqrt{\ln(2|\mathcal{H}|/\delta) / (2n)}$
- Total excess risk = approximation error + estimation error
- Richer $\mathcal{H}$ reduces approximation error but increases estimation error. This is the bias-variance tradeoff
Exercises
Problem
Suppose $\mathcal{H}$ has 100 hypotheses and the loss is bounded in $[0, 1]$. How many samples do you need so that with probability at least 0.95, the uniform convergence gap is at most 0.1?
Problem
Why can't you apply the finite-class ERM bound to the class of all linear classifiers in $\mathbb{R}^d$? What goes wrong, and what tool do you need instead?
Related Comparisons
References
Canonical:
- Shalev-Shwartz & Ben-David, Understanding Machine Learning, Chapters 2-4
- Vapnik, Statistical Learning Theory (1998), Chapters 1-4
- Boucheron, Lugosi, Massart, Concentration Inequalities (2013), Chapter 4
Current:
- Mohri, Rostamizadeh, Talwalkar, Foundations of Machine Learning (2018), Chapter 2
- Wainwright, High-Dimensional Statistics (2019), Chapters 4-6
Next Topics
The natural next steps from ERM:
- Uniform convergence: the general framework for proving ERM works
- VC dimension: controlling complexity of infinite hypothesis classes
- Rademacher complexity: data-dependent complexity measurement
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
Builds on This
- Algorithmic Stability (Layer 3)
- Bias-Variance Tradeoff (Layer 2)
- Cross-Validation Theory (Layer 2)
- Decision Trees and Ensembles (Layer 2)
- Hallucination Theory (Layer 4)
- Hypothesis Classes and Function Spaces (Layer 2)
- Naive Bayes (Layer 1)
- Overfitting and Underfitting (Layer 1)
- Rademacher Complexity (Layer 3)
- Uniform Convergence (Layer 2)
- VC Dimension (Layer 2)