
Learning Theory Core

Empirical Risk Minimization

The foundational principle of statistical learning: minimize average loss on training data as a proxy for minimizing true population risk.

Core · Tier 1 · Stable · ~60 min

Why This Matters

Every supervised learning algorithm is an empirical risk minimizer or a regularized variant of one. When you train a neural network by minimizing cross-entropy loss on a training set, you are doing ERM. When you fit a linear regression by minimizing squared error, you are doing ERM.

Understanding ERM is the first step toward understanding why learning from data works at all, and where it fails.

ERM and Generalization: the core structure of learning theory

[Figure: population risk R(h) = E[ℓ(h(x),y)] (depends on the unknown D) versus empirical risk R̂ₙ(h) = (1/n)Σᵢℓ(h(xᵢ),yᵢ) (computable). ERM minimizes R̂ₙ(h) over h ∈ H; the generalization gap R(h) − R̂ₙ(h) is controlled by VC dimension, Rademacher complexity, or stability.]

Mental Model

Imagine you want to find a function h that predicts well on future, unseen data. You cannot directly optimize over data you haven't seen. So instead, you optimize over the data you have (your training set) and hope that good performance on training data translates to good performance on new data.

ERM formalizes this hope. The central question of learning theory is: when and why does this hope actually work?

Formal Setup and Notation

Let \mathcal{X} be an input space and \mathcal{Y} an output space. Let \mathcal{D} be an unknown probability distribution over \mathcal{X} \times \mathcal{Y}.

We have a loss function \ell: \mathcal{Y} \times \mathcal{Y} \to [0, \infty) measuring prediction quality.

Definition

Population Risk

The population risk (or true risk) of a hypothesis h is:

R(h) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(h(x), y)]

This is the quantity we actually want to minimize, but we cannot compute it directly because \mathcal{D} is unknown.

Definition

Empirical Risk

Given a training sample S = \{(x_1, y_1), \ldots, (x_n, y_n)\} drawn i.i.d. from \mathcal{D}, the empirical risk is:

\hat{R}_n(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)

This is computable. It is the average loss on training data.
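As a concrete sketch, the definition is a one-line average; the classifier, data, and loss below are invented for illustration:

```python
import numpy as np

def empirical_risk(h, X, y, loss):
    """(1/n) * sum of loss(h(x_i), y_i): the average loss on the training sample."""
    preds = np.array([h(x) for x in X])
    return float(np.mean(loss(preds, np.asarray(y))))

# 0-1 loss for classification
zero_one = lambda pred, true: (pred != true).astype(float)

X = np.array([-2.0, -1.0, 0.5, 1.0, 2.0])
y = np.array([0, 1, 1, 1, 1])
h = lambda x: int(x > 0)                    # a simple threshold classifier
print(empirical_risk(h, X, y, zero_one))    # 0.2: one mistake out of five
```

Any bounded loss can be plugged in for `zero_one`; the definition is agnostic to the choice.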

Definition

Empirical Risk Minimization

Given a hypothesis class \mathcal{H}, the ERM principle selects:

h_{\text{ERM}} = \arg\min_{h \in \mathcal{H}} \hat{R}_n(h)

The key question: when is R(h_{\text{ERM}}) close to \min_{h \in \mathcal{H}} R(h)?
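For a finite class, ERM is literally an argmin over hypotheses. A minimal sketch, with a made-up class of threshold classifiers and made-up data:

```python
import numpy as np

def erm(hypotheses, X, y):
    """Select the hypothesis with the smallest empirical 0-1 risk (the argmin)."""
    risks = [float(np.mean([h(x) != t for x, t in zip(X, y)])) for h in hypotheses]
    best = int(np.argmin(risks))
    return hypotheses[best], risks[best]

# Finite class: threshold classifiers h_c(x) = 1[x > c] on a grid of cutoffs
H = [lambda x, c=c: int(x > c) for c in np.linspace(-3, 3, 13)]

X = np.array([-2.0, -1.0, 0.2, 0.8, 1.5, 2.5])
y = np.array([0, 0, 0, 1, 1, 1])
h_erm, train_risk = erm(H, X, y)
print(train_risk)  # 0.0: some cutoff c with 0.2 <= c < 0.8 separates this sample
```

Note this sketch assumes exact minimization; real training pipelines only approximate the argmin.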

Core Definitions

The generalization gap is the difference between population risk and empirical risk:

R(h) - \hat{R}_n(h)

If this gap is small uniformly over all h \in \mathcal{H}, then minimizing empirical risk also approximately minimizes population risk.

The approximation error is:

\min_{h \in \mathcal{H}} R(h) - R^*

where R^* = \inf_h R(h) is the Bayes risk. This measures how well the best hypothesis in \mathcal{H} can do, regardless of sample size.

The estimation error is:

R(h_{\text{ERM}}) - \min_{h \in \mathcal{H}} R(h)

This is what ERM theory actually controls.
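The approximation side of this decomposition can be computed exactly in a toy discrete problem; the distribution and the restricted class below are invented for illustration:

```python
# Toy distribution: X = {0, 1} uniform, with label noise eta(x) = P(y = 1 | x)
eta = {0: 0.1, 1: 0.8}

def pop_risk(h):
    """Exact population 0-1 risk E[1{h(x) != y}] under the toy distribution."""
    return 0.5 * sum(eta[x] if h(x) == 0 else 1 - eta[x] for x in (0, 1))

bayes = lambda x: int(eta[x] > 0.5)           # unrestricted optimal predictor
R_star = pop_risk(bayes)                      # Bayes risk: 0.5*(0.1 + 0.2) = 0.15

# Restricted class: constant classifiers only
H = [lambda x: 0, lambda x: 1]
best_in_class = min(pop_risk(h) for h in H)   # 0.45 (the always-0 classifier)

approx_error = best_in_class - R_star         # 0.30: the price of the small class
print(R_star, best_in_class, approx_error)
```

The estimation error would be the extra gap R(h_ERM) − 0.45 incurred by picking a constant classifier from a finite sample rather than from the population risks directly.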

Main Theorems

Theorem

ERM Generalization via Uniform Convergence

Statement

If \mathcal{H} is finite and the loss is bounded in [0,1], then for any \delta > 0, with probability at least 1 - \delta over the draw of S:

\sup_{h \in \mathcal{H}} |R(h) - \hat{R}_n(h)| \leq \sqrt{\frac{\log|\mathcal{H}| + \log(2/\delta)}{2n}}

Intuition

The bound says: if the hypothesis class is not too large (measured by \log|\mathcal{H}|) and the sample is large enough, then empirical risk uniformly approximates population risk. The ERM hypothesis therefore has population risk close to the best in class.

Proof Sketch

Apply Hoeffding's inequality to each h \in \mathcal{H} to get |R(h) - \hat{R}_n(h)| \leq \epsilon with probability \geq 1 - 2e^{-2n\epsilon^2}. Then take a union bound over all |\mathcal{H}| hypotheses. Solve for \epsilon.

Why It Matters

This is the simplest generalization bound in learning theory. It shows why ERM works for finite hypothesis classes. The dependence on \log|\mathcal{H}| rather than |\mathcal{H}| is crucial: it means the sample complexity grows only logarithmically in the number of hypotheses.

Failure Mode

This bound is vacuous for infinite hypothesis classes (e.g., all linear classifiers). You need more sophisticated complexity measures, such as VC dimension, Rademacher complexity, or covering numbers, to handle that case.

Proof Ideas and Templates Used

The proof of the finite-class ERM bound uses two standard tools:

  1. Hoeffding's inequality: to control the deviation of the empirical mean from the true mean for a single hypothesis
  2. Union bound: to extend the guarantee from a single hypothesis to all hypotheses in the class simultaneously

This "Hoeffding + union bound" pattern is the simplest proof template in learning theory. It breaks for infinite classes because a union bound over uncountably many hypotheses yields a vacuous (infinite) failure probability.
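A small simulation (with an invented finite class of Bernoulli losses) makes the template concrete: the worst-case deviation over the class rarely exceeds the Hoeffding-plus-union bound, and in practice the bound is quite conservative:

```python
import numpy as np

rng = np.random.default_rng(0)
K, n, delta, trials = 50, 2000, 0.05, 500
true_means = rng.uniform(0.2, 0.8, size=K)    # R(h) for K hypotheses, loss in [0,1]

# The finite-class uniform convergence bound from the theorem above
bound = np.sqrt((np.log(K) + np.log(2 / delta)) / (2 * n))

violations = 0
for _ in range(trials):
    # Bernoulli losses: empirical risk of each hypothesis on a sample of size n
    emp_means = rng.binomial(n, true_means) / n
    if np.max(np.abs(emp_means - true_means)) > bound:
        violations += 1

print(violations / trials)  # well below delta = 0.05: union bounds are loose
```

The observed violation rate being far under delta is expected: the union bound charges the full failure probability to every hypothesis.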

Canonical Examples

Example

Finite binary classifiers

Suppose \mathcal{H} contains 1000 binary classifiers and we use 0-1 loss. With n = 10{,}000 training examples and \delta = 0.05:

\sup_{h \in \mathcal{H}} |R(h) - \hat{R}_n(h)| \leq \sqrt{\frac{\log(1000) + \log(40)}{20000}} \approx 0.023

So the ERM hypothesis has population risk within about 0.023 (2.3 percentage points of 0-1 error) of its training risk.
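Checking the arithmetic (natural logarithms, as is standard for this bound):

```python
import math

H_size, n, delta = 1000, 10_000, 0.05
gap = math.sqrt((math.log(H_size) + math.log(2 / delta)) / (2 * n))
print(round(gap, 3))  # 0.023
```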

Common Confusions

Watch Out

ERM is not the same as training loss minimization

ERM minimizes over a fixed hypothesis class \mathcal{H}. In practice, when you train a neural network, the class is parameterized and the optimization landscape matters. The ERM framework abstracts away optimization difficulty and asks: if you could find the exact minimizer, would it generalize? Real training adds optimization error on top of the estimation error ERM theory controls.

Watch Out

Small training loss does not imply small test loss

A model can memorize the training set (achieving zero empirical risk) while having terrible population risk. This happens when \mathcal{H} is too rich relative to n. The generalization bound tells you when you can trust low training risk as a signal of low population risk.
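This failure is easy to reproduce with a lookup-table "memorizer" trained on pure-noise labels (synthetic data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X_train = rng.standard_normal((n, 5))
y_train = rng.integers(0, 2, size=n)        # labels are coin flips: nothing to learn

# A maximally rich "hypothesis": memorize training pairs, predict 0 elsewhere
table = {tuple(x): int(t) for x, t in zip(X_train, y_train)}
memorizer = lambda x: table.get(tuple(x), 0)

train_err = float(np.mean([memorizer(x) != t for x, t in zip(X_train, y_train)]))

X_test = rng.standard_normal((1000, 5))
y_test = rng.integers(0, 2, size=1000)
test_err = float(np.mean([memorizer(x) != t for x, t in zip(X_test, y_test)]))
print(train_err, test_err)  # 0.0 on training data, roughly 0.5 (chance) on new data
```

Zero empirical risk here carries no information at all, exactly because the implicit class (all lookup tables) is unboundedly rich.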

Watch Out

Training loss and training error are not the same

In classification you typically optimize a convex surrogate loss (cross-entropy, hinge, logistic) while caring about the 0-1 classification error. "Minimizing training loss" (surrogate) and "minimizing training error" (0-1) are not equivalent. The connection is classification-calibration: a surrogate \phi is calibrated if minimizing its population risk implies minimizing the 0-1 population risk (Bartlett, Jordan, McAuliffe 2006). Cross-entropy, hinge, and logistic are calibrated; squared loss on \pm 1 targets is calibrated for binary classification; but not every convex surrogate is. Pages that use "training loss" and "training error" interchangeably are eliding this distinction.

Summary

  • Population risk R(h) is what you want to minimize; empirical risk \hat{R}_n(h) is what you can compute
  • The generalization gap R(h) - \hat{R}_n(h) must be controlled uniformly
  • For finite \mathcal{H}: gap scales as \sqrt{\log|\mathcal{H}|/n}
  • Total excess risk = approximation error + estimation error
  • Richer \mathcal{H} reduces approximation error but increases estimation error. This is the bias-variance tradeoff

Exercises

Exercise (Core)

Problem

Suppose \mathcal{H} has 100 hypotheses and the loss is bounded in [0,1]. How many samples n do you need so that with probability at least 0.95, the uniform convergence gap is at most 0.1?

Exercise (Advanced)

Problem

Why can't you apply the finite-class ERM bound to the class of all linear classifiers in \mathbb{R}^d? What goes wrong, and what tool do you need instead?


References

  • Mohri, Rostamizadeh, Talwalkar, Foundations of Machine Learning (2018), Chapter 2
  • Wainwright, High-Dimensional Statistics (2019), Chapters 4-6


Last reviewed: April 2026
