
Learning Theory Core

Bias-Complexity Tradeoff

The error of an ERM hypothesis decomposes into approximation error (the price of restricting H) and estimation error (the price of finite data). Richer classes shrink approximation but inflate estimation. The optimal class is the smallest one that still captures the truth.

Core · Tier 2 · Stable · Core spine · ~30 min

Why This Matters

Once the No-Free-Lunch theorem forces you to commit to a hypothesis class $\mathcal{H}$, the next question is which one. Smaller classes generalize from less data but may not contain a hypothesis close to the truth. Larger classes can fit the truth but need more data to identify it. The bias-complexity tradeoff is the formal account of this tension.

The decomposition tells you exactly which side of the tradeoff is hurting you. Large empirical risk on the training set means the class is too small (approximation error). Small empirical risk but large test error means the class is too rich for the sample size (estimation error). Every model-selection technique, including cross-validation, AIC/BIC, regularization theory, and Kolmogorov complexity / MDL, is trying to navigate this tradeoff.

This is closely related to but distinct from the statistical bias-variance decomposition. Read the Common Confusions section below for the difference.

Mental Model

Imagine fitting a curve through noisy data. A constant function is a tiny class: it cannot bend, so its best member misses any non-constant truth by a lot (large approximation error). A polynomial of degree 50 is a giant class: it can match almost any 50-point sample, so it tracks noise (large estimation error). Linear or low-degree polynomial is the compromise: small enough to be identifiable from the sample, large enough to capture trends.

The art of model selection is choosing the smallest class that still contains a hypothesis close to the truth. NFL says that choice is forced; the bias-complexity tradeoff quantifies the cost of getting it wrong.

Formal Setup

Fix a domain $\mathcal{X} \times \mathcal{Y}$, an unknown distribution $\mathcal{D}$, a loss $\ell$, and a hypothesis class $\mathcal{H}$. The population risk of a hypothesis is

$$L_\mathcal{D}(h) = \mathbb{E}_{(x, y) \sim \mathcal{D}}[\ell(h(x), y)].$$

Let $h^* \in \arg\min_{h \in \mathcal{H}} L_\mathcal{D}(h)$ be a best-in-class hypothesis (assumed to exist; if not, take an infimum and add a vanishing slack). Let $h_S = \arg\min_{h \in \mathcal{H}} L_S(h)$ be the ERM hypothesis on a sample $S$ of size $m$.
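A minimal numerical sketch of this setup, under invented assumptions (a finite class of 21 threshold classifiers on $[0,1]$, a true threshold of 0.3, 10% label noise; none of this is from the text): ERM picks the threshold with the smallest empirical 0-1 loss, and a large held-out sample stands in for the population risk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite hypothesis class H: threshold classifiers h_t(x) = 1[x >= t], t on a grid.
thresholds = np.linspace(0.0, 1.0, 21)          # |H| = 21

def risk(t, x, y):
    """Average 0-1 loss of the threshold classifier h_t on the pairs (x, y)."""
    return np.mean((x >= t).astype(int) != y)

def sample(m):
    """Draw m labeled points from an assumed distribution D (threshold 0.3, 10% noise)."""
    x = rng.uniform(0.0, 1.0, m)
    y = (x >= 0.3).astype(int)
    flip = rng.uniform(size=m) < 0.1
    return x, np.where(flip, 1 - y, y)

x_train, y_train = sample(30)                    # sample S of size m = 30
x_test, y_test = sample(100_000)                 # large sample as a proxy for D

empirical = np.array([risk(t, x_train, y_train) for t in thresholds])
h_S = thresholds[np.argmin(empirical)]           # ERM hypothesis
print("ERM threshold:", h_S)
print("empirical risk L_S(h_S):", empirical.min())
print("population risk L_D(h_S), estimated:", risk(h_S, x_test, y_test))
```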

The Error Decomposition

Theorem

ERM Error Decomposition (Shalev-Shwartz and Ben-David, Section 5.2)

Statement

The population risk of the ERM hypothesis decomposes as $$L_\mathcal{D}(h_S) = \varepsilon_{\mathrm{app}} + \varepsilon_{\mathrm{est}},$$ where the approximation error is $\varepsilon_{\mathrm{app}} = \min_{h \in \mathcal{H}} L_\mathcal{D}(h) = L_\mathcal{D}(h^*)$ and the estimation error is $\varepsilon_{\mathrm{est}} = L_\mathcal{D}(h_S) - L_\mathcal{D}(h^*)$.

Intuition

The approximation error is the floor: the smallest population risk any hypothesis in $\mathcal{H}$ can achieve. It depends on $\mathcal{H}$ and $\mathcal{D}$ but not on the sample. The estimation error is the gap between ERM and the in-class minimum: it depends on the sample and shrinks with $m$. Total risk is the sum.

Proof Sketch

Add and subtract $L_\mathcal{D}(h^*)$: $$L_\mathcal{D}(h_S) = \underbrace{L_\mathcal{D}(h^*)}_{\varepsilon_{\mathrm{app}}} + \underbrace{L_\mathcal{D}(h_S) - L_\mathcal{D}(h^*)}_{\varepsilon_{\mathrm{est}}}.$$ The decomposition is an identity, not a bound. The content is in the interpretation of the two terms and in the bounds on each.

Why It Matters

This identity organizes every generalization bound in learning theory. Approximation error is what you control by choosing $\mathcal{H}$. Estimation error is what you control by collecting more data or restricting $\mathcal{H}$. Total error cannot go below the approximation floor, no matter how much data you have.

Failure Mode

The decomposition assumes a best-in-class hypothesis exists. For non-compact classes the infimum may not be attained; adopt a vanishing slack (take $h^*_\eta$ with $L_\mathcal{D}(h^*_\eta) \le \inf_{h \in \mathcal{H}} L_\mathcal{D}(h) + \eta$) and let $\eta \to 0$. The identity also assumes a fixed $\mathcal{H}$; data-driven class selection (cross-validation, complexity-penalized class search) requires adjusted bounds that account for the selection step.
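A quick numerical check of the identity, reusing the toy threshold class sketched above (all constants are again invented): estimate $\varepsilon_{\mathrm{app}}$ as the minimum population risk over the class and $\varepsilon_{\mathrm{est}}$ as the ERM hypothesis's gap to it; the two terms sum to $L_\mathcal{D}(h_S)$ by construction.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy finite class: threshold classifiers on [0, 1] (same illustrative setup as above).
thresholds = np.linspace(0.0, 1.0, 21)

def sample(m):
    x = rng.uniform(0.0, 1.0, m)
    y = (x >= 0.3).astype(int)
    return x, np.where(rng.uniform(size=m) < 0.1, 1 - y, y)

def risk(t, x, y):
    return np.mean((x >= t).astype(int) != y)

x_tr, y_tr = sample(30)
x_te, y_te = sample(200_000)                       # proxy for the population

pop_risks = np.array([risk(t, x_te, y_te) for t in thresholds])
eps_app = pop_risks.min()                          # L_D(h*): best-in-class risk
h_S = thresholds[np.argmin([risk(t, x_tr, y_tr) for t in thresholds])]
L_D_hS = risk(h_S, x_te, y_te)
eps_est = L_D_hS - eps_app                         # gap between ERM and best-in-class
print(f"L_D(h_S) = {L_D_hS:.4f} = eps_app {eps_app:.4f} + eps_est {eps_est:.4f}")
```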

Properties of Each Term

Approximation error: $\varepsilon_{\mathrm{app}}$

  • Depends on $\mathcal{H}$ and $\mathcal{D}$, not on $m$.
  • Measures the inductive bias: how well does the best hypothesis in our chosen class fit the truth?
  • Under the realizability assumption, $\varepsilon_{\mathrm{app}} = 0$.
  • In the agnostic setting, $\varepsilon_{\mathrm{app}}$ is the gap to the Bayes risk, $L_\mathcal{D}(h^*) - L_\mathcal{D}^{\mathrm{Bayes}}$, plus the Bayes risk itself. It can be large if $\mathcal{H}$ is too small.
  • Cannot be reduced by collecting more data. The only lever is enlarging $\mathcal{H}$.

Estimation error: $\varepsilon_{\mathrm{est}}$

  • Depends on $m$, $\mathcal{H}$, and the sample.
  • For finite $\mathcal{H}$ in the agnostic setting, $\varepsilon_{\mathrm{est}} \le O\!\left(\sqrt{\log|\mathcal{H}|/m}\right)$ with high probability over $S$. This is the standard uniform-convergence bound (see the sketch after this list).
  • For infinite $\mathcal{H}$ with finite VC dimension $d$, $\varepsilon_{\mathrm{est}} \le O\!\left(\sqrt{d/m}\right)$ up to log factors.
  • Decreases with $m$ at rate $1/\sqrt{m}$ (agnostic) or $1/m$ (realizable).
  • Increases with class complexity (logarithmically in $|\mathcal{H}|$, linearly in $d$ under the square root).
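The finite-class agnostic bound is easy to evaluate numerically. A small sketch, using the $2\sqrt{\log(2|\mathcal{H}|/\delta)/(2m)}$ form that appears in the exercises below, to see how the bound scales with $|\mathcal{H}|$ and $m$:

```python
import numpy as np

def est_bound(H_size, m, delta=0.05):
    """Agnostic finite-class bound: eps_est <= 2 * sqrt(log(2|H|/delta) / (2m))."""
    return 2.0 * np.sqrt(np.log(2 * H_size / delta) / (2 * m))

for H_size in (100, 10_000, 1_000_000):
    for m in (100, 1_000, 10_000):
        print(f"|H| = {H_size:>9,}, m = {m:>6}: eps_est <= {est_bound(H_size, m):.3f}")
```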

The Tradeoff

| Direction | Approximation error $\varepsilon_{\mathrm{app}}$ | Estimation error $\varepsilon_{\mathrm{est}}$ |
| --- | --- | --- |
| Enlarge $\mathcal{H}$ | shrinks (more functions to fit the truth) | grows ($\log\lvert\mathcal{H}\rvert$ or $d$ increases) |
| Shrink $\mathcal{H}$ | grows (truth may not be in the class) | shrinks (less to overfit) |
| Increase $m$ | unchanged | shrinks ($1/\sqrt{m}$) |

For fixed $m$, the optimal class $\mathcal{H}$ minimizes $\varepsilon_{\mathrm{app}}(\mathcal{H}) + \varepsilon_{\mathrm{est}}(\mathcal{H}, m)$. Too small: bias dominates, and even infinite data does not help. Too large: variance dominates, and the bound is vacuous at the available sample size.

The Bayes optimal classifier $h^{\mathrm{Bayes}}(x) = \arg\max_y \Pr_\mathcal{D}[Y = y \mid X = x]$ minimizes population risk over all functions, so taking $\mathcal{H} = \{0,1\}^\mathcal{X}$ (or its regression analog) drives $\varepsilon_{\mathrm{app}}$ to the Bayes-risk floor. But No-Free-Lunch shows that class is not learnable: $\varepsilon_{\mathrm{est}}$ blows up. Practical learning is the search for a class small enough to identify from data and large enough that $\varepsilon_{\mathrm{app}}$ is acceptable.
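A hedged illustration of that search: take a nested sequence of finite classes with assumed (made-up) approximation errors and pick the index that minimizes $\varepsilon_{\mathrm{app}}$ plus the finite-class estimation bound at the available sample size. The selected class grows with $m$.

```python
import numpy as np

def est_bound(H_size, m, delta=0.05):
    """Agnostic finite-class bound on the estimation error."""
    return 2.0 * np.sqrt(np.log(2 * H_size / delta) / (2 * m))

# (|H_k|, eps_app) for a nested sequence H_1 ⊂ H_2 ⊂ ... -- illustrative numbers only.
classes = [(10, 0.30), (100, 0.10), (10_000, 0.02), (10**8, 0.001)]

for m in (30, 1_000, 100_000):
    totals = [eps_app + est_bound(size, m) for size, eps_app in classes]
    k = int(np.argmin(totals))
    print(f"m = {m:>6}: choose class {k + 1} (|H| = {classes[k][0]}), bound {totals[k]:.3f}")
```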

Worked Example: Polynomial Regression

Take $\mathcal{X} = [-1, 1]$, regression with squared loss, true regression function $f(x) = \sin(\pi x)$, and Gaussian noise with $\sigma^2 = 0.01$. Hypothesis class $\mathcal{H}_d$: polynomials of degree at most $d$. Sample size $m = 30$.

| Degree $d$ | $\varepsilon_{\mathrm{app}}$ (best-fit MSE in $\mathcal{H}_d$) | $\varepsilon_{\mathrm{est}}$ (typical, $O(d/m)$) | Total |
| --- | --- | --- | --- |
| 1 (line) | 0.30 | 0.03 | 0.33 |
| 3 | 0.04 | 0.10 | 0.14 |
| 7 | 0.001 | 0.23 | 0.23 |
| 15 | 0.0001 | 0.50 | 0.50 |

Constants are illustrative. The shape is what matters: $\varepsilon_{\mathrm{app}}$ decays fast as $d$ grows from 1 to 7, then plateaus near zero. $\varepsilon_{\mathrm{est}}$ grows roughly linearly in $d/m$. The optimum is at degree 3, not degree 1 (too biased) and not degree 15 (too much estimation error).

This is why model-selection procedures matter: at $m = 30$, degree 3 generalizes; at $m = 3000$, degree 7 might. The right class depends on the sample size, not just on the truth.
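A simulation in the spirit of this example (random seed and exact numbers are arbitrary, so the values will not match the table): fit polynomials of degrees 1, 3, 7, and 15 to $m = 30$ noisy samples of $\sin(\pi x)$ by least squares and compare training and test MSE.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample(m, sigma2=0.01):
    """m noisy samples of the true regression function sin(pi x) on [-1, 1]."""
    x = rng.uniform(-1.0, 1.0, m)
    y = np.sin(np.pi * x) + rng.normal(scale=np.sqrt(sigma2), size=m)
    return x, y

x_tr, y_tr = sample(30)          # training sample, m = 30
x_te, y_te = sample(100_000)     # large test sample as a proxy for population risk

for d in (1, 3, 7, 15):
    coeffs = np.polyfit(x_tr, y_tr, deg=d)                      # least-squares ERM in H_d
    train_mse = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    print(f"degree {d:>2}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```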

Common Confusions

Watch Out

Bias-complexity is not the same as bias-variance

The classical bias-variance decomposition splits squared-error risk at a fixed point $x$ as $\mathbb{E}[(h_S(x) - y)^2] = (\mathbb{E}[h_S(x)] - f(x))^2 + \mathrm{Var}[h_S(x)] + \sigma^2$, with the expectation taken over training samples and label noise. Bias is the gap between the expected prediction and the truth; variance is the spread of predictions across samples; noise is irreducible.

The bias-complexity decomposition splits population risk of the ERM hypothesis into approximation error (the in-class minimum) and estimation error (the gap to that minimum). The two pictures coincide under restrictive conditions (e.g., realizable squared loss with a specific class), but in general they are different cuts of "what goes wrong." Bias-variance lives in the regression literature and applies pointwise; bias-complexity lives in PAC theory and applies to expected risk over the entire input distribution.
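To make the contrast concrete, here is a sketch of the bias-variance side (toy setup, arbitrary constants): average degree-1 least-squares fits over many independent training samples and split the squared error at a single query point into bias$^2$, variance, and noise.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample(m, sigma=0.1):
    """m noisy samples of sin(pi x) on [-1, 1]."""
    x = rng.uniform(-1.0, 1.0, m)
    return x, np.sin(np.pi * x) + rng.normal(scale=sigma, size=m)

x0, sigma = 0.5, 0.1
f_x0 = np.sin(np.pi * x0)                          # noiseless truth at the query point

preds = []
for _ in range(2_000):                             # many independent training samples S
    x, y = sample(30)
    coeffs = np.polyfit(x, y, deg=1)               # h_S: least-squares line
    preds.append(np.polyval(coeffs, x0))
preds = np.array(preds)

bias_sq = (preds.mean() - f_x0) ** 2               # (E[h_S(x0)] - f(x0))^2
variance = preds.var()                             # Var[h_S(x0)] over samples
print(f"bias^2 = {bias_sq:.4f}, variance = {variance:.4f}, noise = {sigma**2:.4f}")
print(f"sum    = {bias_sq + variance + sigma**2:.4f}")
```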

Watch Out

Approximation error does not require knowing the truth

You do not need to know $\mathcal{D}$ or the Bayes-optimal classifier to talk about approximation error. The definition is internal to the class: $\varepsilon_{\mathrm{app}} = \min_{h \in \mathcal{H}} L_\mathcal{D}(h)$ is the best risk among hypotheses in your class. You cannot compute it without $\mathcal{D}$, but you can reason about how it changes when you enlarge or shrink $\mathcal{H}$, even when $\mathcal{D}$ is unknown.

Watch Out

Realizability does not eliminate estimation error

Under realizability, $\varepsilon_{\mathrm{app}} = 0$, but $\varepsilon_{\mathrm{est}}$ remains positive because ERM picks among many zero-empirical-risk hypotheses, only some of which generalize. The realizable PAC bound gives $\varepsilon_{\mathrm{est}} = O(\log|\mathcal{H}|/m)$ in expectation, better than the $O(\sqrt{\log|\mathcal{H}|/m})$ agnostic rate but still nonzero.
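A back-of-the-envelope comparison of the two rates (constants dropped, so the numbers are only indicative): the sample size needed to drive $\varepsilon_{\mathrm{est}}$ below a target $\varepsilon$ scales like $\log(|\mathcal{H}|/\delta)/\varepsilon$ in the realizable case versus $\log(|\mathcal{H}|/\delta)/\varepsilon^2$ in the agnostic case.

```python
import numpy as np

# Rough sample-size requirements for eps_est <= eps, ignoring constants.
H_size, delta = 10_000, 0.05
for eps in (0.1, 0.01):
    m_realizable = np.log(H_size / delta) / eps        # ~ log(|H|/delta) / eps
    m_agnostic = np.log(H_size / delta) / eps**2       # ~ log(|H|/delta) / eps^2
    print(f"eps = {eps}: m ~ {m_realizable:,.0f} (realizable) vs {m_agnostic:,.0f} (agnostic)")
```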

Watch Out

The Bayes classifier is not a practical $\mathcal{H}$

$h^{\mathrm{Bayes}}$ minimizes risk over all functions, so taking $\mathcal{H}$ = all functions makes $\varepsilon_{\mathrm{app}} = L_\mathcal{D}^{\mathrm{Bayes}}$ (the irreducible risk). The catch: that class is not learnable by No-Free-Lunch, so the favorable $\varepsilon_{\mathrm{app}}$ is paid for by an $\varepsilon_{\mathrm{est}}$ that does not vanish with $m$. Bayes optimality is a theoretical benchmark, not an implementable target.

Connection to Other Frameworks

The decomposition refines the PAC framework: PAC bounds the estimation error of ERM, conditional on the choice of $\mathcal{H}$. NFL forces the choice; bias-complexity quantifies the cost; PAC controls one term in the sum. The full story is:

$$L_\mathcal{D}(h_S) = \underbrace{L_\mathcal{D}^{\mathrm{Bayes}}}_{\text{noise floor}} + \underbrace{L_\mathcal{D}(h^*) - L_\mathcal{D}^{\mathrm{Bayes}}}_{\text{class deficit}} + \underbrace{L_\mathcal{D}(h_S) - L_\mathcal{D}(h^*)}_{\varepsilon_{\mathrm{est}}}.$$

The first two terms together equal $\varepsilon_{\mathrm{app}}$; the split into noise and class deficit makes explicit that even $h^*$ cannot beat the Bayes risk.

The operational response is structural risk minimization: parameterize a sequence $\mathcal{H}_1 \subseteq \mathcal{H}_2 \subseteq \cdots$ and let the data choose the index by penalizing complexity (Vapnik 1998, Chapter 1). Regularization, cross-validation, and the MDL-based information criteria are all variations on the same theme.
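A minimal SRM-flavored sketch (the nested polynomial classes and the $\sqrt{(d+1)/m}$ penalty are illustrative choices, not Vapnik's exact construction): select the degree that minimizes empirical risk plus a complexity penalty, using only the training sample.

```python
import numpy as np

rng = np.random.default_rng(4)

# Training sample: m = 30 noisy observations of sin(pi x) on [-1, 1].
m = 30
x_tr = rng.uniform(-1.0, 1.0, m)
y_tr = np.sin(np.pi * x_tr) + rng.normal(scale=0.1, size=m)

scores = {}
for d in range(1, 16):                              # nested classes H_1 ⊆ ... ⊆ H_15
    coeffs = np.polyfit(x_tr, y_tr, deg=d)
    emp_risk = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    penalty = np.sqrt((d + 1) / m)                   # illustrative complexity penalty
    scores[d] = emp_risk + penalty

best_d = min(scores, key=scores.get)
print("SRM-selected degree:", best_d)
```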

Exercises

Exercise (Core)

Problem

Define approximation error and estimation error in one sentence each, and state which depends on the sample size and which does not.

Exercise (Core)

Problem

You enlarge your hypothesis class from a finite $\mathcal{H}_1$ with $|\mathcal{H}_1| = 100$ to a finite $\mathcal{H}_2 \supset \mathcal{H}_1$ with $|\mathcal{H}_2| = 10\,000$. In which direction does each error term move, qualitatively, at fixed $m$?

Exercise (Advanced)

Problem

Show that for finite $\mathcal{H}$ in the agnostic setting, with probability at least $1 - \delta$ over $S$, $$\varepsilon_{\mathrm{est}} = L_\mathcal{D}(h_S) - L_\mathcal{D}(h^*) \le 2\sqrt{\frac{\log(2|\mathcal{H}|/\delta)}{2m}}.$$

Exercise (Advanced)

Problem

You have $m = 1000$ samples and two candidate classes: $\mathcal{H}_A$ with $|\mathcal{H}_A| = 50$ and $\varepsilon_{\mathrm{app}}^A = 0.10$; $\mathcal{H}_B$ with $|\mathcal{H}_B| = 5000$ and $\varepsilon_{\mathrm{app}}^B = 0.02$. Using the agnostic finite-class estimation bound at $\delta = 0.05$, which class minimizes the upper bound on $L_\mathcal{D}(h_S)$?

Summary

  • The error of an ERM hypothesis decomposes: $L_\mathcal{D}(h_S) = \varepsilon_{\mathrm{app}} + \varepsilon_{\mathrm{est}}$.
  • Approximation error: the in-class population minimum; depends on $\mathcal{H}$ and $\mathcal{D}$, not on $m$.
  • Estimation error: the ERM-vs-best-in-class gap; depends on $m$ and class complexity.
  • Richer $\mathcal{H}$ shrinks approximation, inflates estimation; smaller $\mathcal{H}$ does the reverse.
  • The Bayes-optimal "all functions" class is unlearnable (No-Free-Lunch); practical learning is the search for the smallest class that captures the truth.
  • Distinct from the statistical bias-variance decomposition; different variables, different settings, often confused.

References

Canonical:

  • Shalev-Shwartz and Ben-David, Understanding Machine Learning, Section 5.2 — the source for the approximation/estimation split as named here.
  • Vapnik, Statistical Learning Theory (1998), Chapter 1 — the structural risk minimization framing.
  • Devroye, Györfi, Lugosi, A Probabilistic Theory of Pattern Recognition (1996), Chapter 12 — empirical risk minimization and the consistency-vs-rate split.

Current:

  • Mohri, Rostamizadeh, Talwalkar, Foundations of Machine Learning (2018), Section 4.1 — agnostic PAC oracle inequality and the same decomposition under different notation.
  • Wainwright, High-Dimensional Statistics (2019), Chapter 14 — uniform convergence rates that bound the estimation error.
  • Bartlett, Bousquet, Mendelson, "Local Rademacher Complexities," Annals of Statistics 33 (2005), 1497-1537 — fast rates for the estimation error under low-noise / Bernstein conditions, refining the $1/\sqrt{m}$ scaling.

Next Topics

  • VC dimension: how to bound estimation error for infinite hypothesis classes via a single complexity measure.
  • Regularization theory: penalty-based class selection that operationalizes the tradeoff.
  • Bias-variance tradeoff: the related-but-distinct pointwise decomposition for squared-error regression.

Last reviewed: May 8, 2026
