
Learning Theory Core

Bias-Complexity Tradeoff

The error of an ERM hypothesis decomposes into approximation error (the price of restricting H) and estimation error (the price of finite data). Richer classes shrink approximation but inflate estimation. The optimal class is the smallest one that still captures the truth.

Core · Tier 2 · Stable · Core spine · ~30 min

Why This Matters

Once the No-Free-Lunch theorem forces you to commit to a hypothesis class $\mathcal{H}$, the next question is which one. Smaller classes generalize from less data but may not contain a hypothesis close to the truth. Larger classes can fit the truth but need more data to identify it. The bias-complexity tradeoff is the formal account of this tension.

The decomposition tells you exactly which side of the tradeoff is hurting you. Large empirical risk on the training set means the class is too small (approximation error). Small empirical risk but large test error means the class is too rich for the sample size (estimation error). Every model-selection technique, including cross-validation, AIC/BIC, regularization theory, and Kolmogorov complexity / MDL, is trying to navigate this tradeoff.

This is closely related to but distinct from the statistical bias-variance decomposition. Read the Common Confusions section below for the difference.

Mental Model

Imagine fitting a curve through noisy data. A constant function is a tiny class: it cannot bend, so its best member misses any non-constant truth by a lot (large approximation error). A polynomial of degree 50 is a giant class: it can match almost any 50-point sample, so it tracks noise (large estimation error). Linear or low-degree polynomial is the compromise: small enough to be identifiable from the sample, large enough to capture trends.

The art of model selection is choosing the smallest class that still contains a hypothesis close to the truth. NFL says that choice is forced; the bias-complexity tradeoff quantifies the cost of getting it wrong.

Formal Setup

Fix a domain $\mathcal{X} \times \mathcal{Y}$, an unknown distribution $\mathcal{D}$, a loss $\ell$, and a hypothesis class $\mathcal{H}$. The population risk of a hypothesis is

$$L_\mathcal{D}(h) = \mathbb{E}_{(x, y) \sim \mathcal{D}}[\ell(h(x), y)].$$

Let $h^* \in \arg\min_{h \in \mathcal{H}} L_\mathcal{D}(h)$ be a best-in-class hypothesis (assumed to exist; if not, take an infimum and add a vanishing slack). Let $h_S = \arg\min_{h \in \mathcal{H}} L_S(h)$ be the ERM hypothesis on a sample $S$ of size $m$.
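A minimal numerical sketch of this setup, under invented assumptions (a finite class of 21 threshold classifiers on $[0,1]$, a true threshold of 0.3, 10% label noise; none of this is from the text): ERM picks the threshold with the smallest empirical 0-1 loss, and a large held-out sample stands in for the population risk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite hypothesis class H: threshold classifiers h_t(x) = 1[x >= t], t on a grid.
thresholds = np.linspace(0.0, 1.0, 21)          # |H| = 21

def risk(t, x, y):
    """Average 0-1 loss of the threshold classifier h_t on the pairs (x, y)."""
    return np.mean((x >= t).astype(int) != y)

def sample(m):
    """Draw m labeled points from an assumed distribution D (threshold 0.3, 10% noise)."""
    x = rng.uniform(0.0, 1.0, m)
    y = (x >= 0.3).astype(int)
    flip = rng.uniform(size=m) < 0.1
    return x, np.where(flip, 1 - y, y)

x_train, y_train = sample(30)                    # sample S of size m = 30
x_test, y_test = sample(100_000)                 # large sample as a proxy for D

empirical = np.array([risk(t, x_train, y_train) for t in thresholds])
h_S = thresholds[np.argmin(empirical)]           # ERM hypothesis
print("ERM threshold:", h_S)
print("empirical risk L_S(h_S):", empirical.min())
print("population risk L_D(h_S), estimated:", risk(h_S, x_test, y_test))
```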

The Error Decomposition

Theorem

ERM Error Decomposition (Shalev-Shwartz and Ben-David, Section 5.2)

Statement

The population risk of the ERM hypothesis decomposes as $$L_\mathcal{D}(h_S) = \varepsilon_{\mathrm{app}} + \varepsilon_{\mathrm{est}},$$ where the approximation error is $\varepsilon_{\mathrm{app}} = \min_{h \in \mathcal{H}} L_\mathcal{D}(h) = L_\mathcal{D}(h^*)$ and the estimation error is $\varepsilon_{\mathrm{est}} = L_\mathcal{D}(h_S) - L_\mathcal{D}(h^*)$.

Intuition

The approximation error is the floor: the smallest population risk any hypothesis in $\mathcal{H}$ can achieve. It depends on $\mathcal{H}$ and $\mathcal{D}$ but not on the sample. The estimation error is the gap between ERM and the in-class minimum: it depends on the sample and shrinks with $m$. Total risk is the sum.

Proof Sketch

Add and subtract $L_\mathcal{D}(h^*)$: $$L_\mathcal{D}(h_S) = \underbrace{L_\mathcal{D}(h^*)}_{\varepsilon_{\mathrm{app}}} + \underbrace{L_\mathcal{D}(h_S) - L_\mathcal{D}(h^*)}_{\varepsilon_{\mathrm{est}}}.$$ The decomposition is an identity, not a bound. The content is in the interpretation of the two terms and in the bounds on each.

Why It Matters

This identity organizes every generalization bound in learning theory. Approximation error is what you control by choosing $\mathcal{H}$. Estimation error is what you control by collecting more data or restricting $\mathcal{H}$. Total error cannot go below the approximation floor, no matter how much data you have.

Failure Mode

The decomposition assumes a best-in-class hypothesis exists. For non-compact classes the infimum may not be attained; adopt a vanishing slack (take $h^*_\eta$ with $L_\mathcal{D}(h^*_\eta) \le \inf_{h \in \mathcal{H}} L_\mathcal{D}(h) + \eta$) and let $\eta \to 0$. The identity also assumes a fixed $\mathcal{H}$; data-driven class selection (cross-validation, complexity-penalized class search) requires adjusted bounds that account for the selection step.
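A quick numerical check of the identity, reusing the toy threshold class sketched above (all constants are again invented): estimate $\varepsilon_{\mathrm{app}}$ as the minimum population risk over the class and $\varepsilon_{\mathrm{est}}$ as the ERM hypothesis's gap to it; the two terms sum to $L_\mathcal{D}(h_S)$ by construction.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy finite class: threshold classifiers on [0, 1] (same illustrative setup as above).
thresholds = np.linspace(0.0, 1.0, 21)

def sample(m):
    x = rng.uniform(0.0, 1.0, m)
    y = (x >= 0.3).astype(int)
    return x, np.where(rng.uniform(size=m) < 0.1, 1 - y, y)

def risk(t, x, y):
    return np.mean((x >= t).astype(int) != y)

x_tr, y_tr = sample(30)
x_te, y_te = sample(200_000)                       # proxy for the population

pop_risks = np.array([risk(t, x_te, y_te) for t in thresholds])
eps_app = pop_risks.min()                          # L_D(h*): best-in-class risk
h_S = thresholds[np.argmin([risk(t, x_tr, y_tr) for t in thresholds])]
L_D_hS = risk(h_S, x_te, y_te)
eps_est = L_D_hS - eps_app                         # gap between ERM and best-in-class
print(f"L_D(h_S) = {L_D_hS:.4f} = eps_app {eps_app:.4f} + eps_est {eps_est:.4f}")
```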

Properties of Each Term

Approximation error: $\varepsilon_{\mathrm{app}}$

  • Depends on $\mathcal{H}$ and $\mathcal{D}$, not on $m$.
  • Measures the inductive bias: how well does the best hypothesis in our chosen class fit the truth?
  • Under the realizability assumption, $\varepsilon_{\mathrm{app}} = 0$.
  • In the agnostic setting, $\varepsilon_{\mathrm{app}}$ is the gap to the Bayes risk, $L_\mathcal{D}(h^*) - L_\mathcal{D}^{\mathrm{Bayes}}$, plus the Bayes risk itself. It can be large if $\mathcal{H}$ is too small.
  • Cannot be reduced by collecting more data. The only lever is enlarging $\mathcal{H}$.

Estimation error: $\varepsilon_{\mathrm{est}}$

  • Depends on $m$, $\mathcal{H}$, and the sample.
  • For finite $\mathcal{H}$ in the agnostic setting, $\varepsilon_{\mathrm{est}} \le O\!\left(\sqrt{\log|\mathcal{H}|/m}\right)$ with high probability over $S$. This is the standard uniform-convergence bound (see the sketch after this list).
  • For infinite $\mathcal{H}$ with finite VC dimension $d$, $\varepsilon_{\mathrm{est}} \le O\!\left(\sqrt{d/m}\right)$ up to log factors.
  • Decreases with $m$ at rate $1/\sqrt{m}$ (agnostic) or $1/m$ (realizable).
  • Increases with class complexity (logarithmically in $|\mathcal{H}|$, linearly in $d$ under the square root).
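The finite-class agnostic bound is easy to evaluate numerically. A small sketch, using the $2\sqrt{\log(2|\mathcal{H}|/\delta)/(2m)}$ form that appears in the exercises below, to see how the bound scales with $|\mathcal{H}|$ and $m$:

```python
import numpy as np

def est_bound(H_size, m, delta=0.05):
    """Agnostic finite-class bound: eps_est <= 2 * sqrt(log(2|H|/delta) / (2m))."""
    return 2.0 * np.sqrt(np.log(2 * H_size / delta) / (2 * m))

for H_size in (100, 10_000, 1_000_000):
    for m in (100, 1_000, 10_000):
        print(f"|H| = {H_size:>9,}, m = {m:>6}: eps_est <= {est_bound(H_size, m):.3f}")
```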

The Tradeoff

| Direction | Approximation error $\varepsilon_{\mathrm{app}}$ | Estimation error $\varepsilon_{\mathrm{est}}$ |
| --- | --- | --- |
| Enlarge $\mathcal{H}$ | shrinks (more functions to fit the truth) | grows ($\log\lvert\mathcal{H}\rvert$ or $d$ increases) |
| Shrink $\mathcal{H}$ | grows (truth may not be in the class) | shrinks (less to overfit) |
| Increase $m$ | unchanged | shrinks ($1/\sqrt{m}$) |

For fixed $m$, the optimal class $\mathcal{H}$ minimizes $\varepsilon_{\mathrm{app}}(\mathcal{H}) + \varepsilon_{\mathrm{est}}(\mathcal{H}, m)$. Too small: bias dominates, and even infinite data does not help. Too large: variance dominates, and the bound is vacuous at the available sample size.

The Bayes optimal classifier $h^{\mathrm{Bayes}}(x) = \arg\max_y \Pr_\mathcal{D}[Y = y \mid X = x]$ minimizes population risk over all functions, so taking $\mathcal{H} = \{0,1\}^\mathcal{X}$ (or its regression analog) drives $\varepsilon_{\mathrm{app}}$ to the Bayes-risk floor. But No-Free-Lunch shows that class is not learnable: $\varepsilon_{\mathrm{est}}$ blows up. Practical learning is the search for a class small enough to identify from data and large enough that $\varepsilon_{\mathrm{app}}$ is acceptable.
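A hedged illustration of that search: take a nested sequence of finite classes with assumed (made-up) approximation errors and pick the index that minimizes $\varepsilon_{\mathrm{app}}$ plus the finite-class estimation bound at the available sample size. The selected class grows with $m$.

```python
import numpy as np

def est_bound(H_size, m, delta=0.05):
    """Agnostic finite-class bound on the estimation error."""
    return 2.0 * np.sqrt(np.log(2 * H_size / delta) / (2 * m))

# (|H_k|, eps_app) for a nested sequence H_1 ⊂ H_2 ⊂ ... -- illustrative numbers only.
classes = [(10, 0.30), (100, 0.10), (10_000, 0.02), (10**8, 0.001)]

for m in (30, 1_000, 100_000):
    totals = [eps_app + est_bound(size, m) for size, eps_app in classes]
    k = int(np.argmin(totals))
    print(f"m = {m:>6}: choose class {k + 1} (|H| = {classes[k][0]}), bound {totals[k]:.3f}")
```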

Worked Example: Polynomial Regression

Take $\mathcal{X} = [-1, 1]$, regression with squared loss, true regression function $f(x) = \sin(\pi x)$, and Gaussian noise with $\sigma^2 = 0.01$. Hypothesis class $\mathcal{H}_d$: polynomials of degree at most $d$. Sample size $m = 30$.

| Degree $d$ | $\varepsilon_{\mathrm{app}}$ (best-fit MSE in $\mathcal{H}_d$) | $\varepsilon_{\mathrm{est}}$ (typical, $O(d/m)$) | Total |
| --- | --- | --- | --- |
| 1 (line) | 0.30 | 0.03 | 0.33 |
| 3 | 0.04 | 0.10 | 0.14 |
| 7 | 0.001 | 0.23 | 0.23 |
| 15 | 0.0001 | 0.50 | 0.50 |

Constants are illustrative. The shape is what matters: $\varepsilon_{\mathrm{app}}$ decays fast as $d$ grows from 1 to 7, then plateaus near zero. $\varepsilon_{\mathrm{est}}$ grows roughly linearly in $d/m$. The optimum is at degree 3, not degree 1 (too biased) and not degree 15 (too much estimation error).

This is why model-selection procedures matter: at $m = 30$, degree 3 generalizes; at $m = 3000$, degree 7 might. The right class depends on the sample size, not just on the truth.
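A simulation in the spirit of this example (random seed and exact numbers are arbitrary, so the values will not match the table): fit polynomials of degrees 1, 3, 7, and 15 to $m = 30$ noisy samples of $\sin(\pi x)$ by least squares and compare training and test MSE.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample(m, sigma2=0.01):
    """m noisy samples of the true regression function sin(pi x) on [-1, 1]."""
    x = rng.uniform(-1.0, 1.0, m)
    y = np.sin(np.pi * x) + rng.normal(scale=np.sqrt(sigma2), size=m)
    return x, y

x_tr, y_tr = sample(30)          # training sample, m = 30
x_te, y_te = sample(100_000)     # large test sample as a proxy for population risk

for d in (1, 3, 7, 15):
    coeffs = np.polyfit(x_tr, y_tr, deg=d)                      # least-squares ERM in H_d
    train_mse = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    print(f"degree {d:>2}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```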

Common Confusions

Watch Out

Bias-complexity is not the same as bias-variance

The classical bias-variance decomposition splits squared-error risk at a fixed point $x$ as $\mathbb{E}[(h_S(x) - y)^2] = (\mathbb{E}[h_S(x)] - f(x))^2 + \mathrm{Var}[h_S(x)] + \sigma^2$, with the expectation taken over training samples and label noise. Bias is the gap between the expected prediction and the truth; variance is the spread of predictions across samples; noise is irreducible.

The bias-complexity decomposition splits population risk of the ERM hypothesis into approximation error (the in-class minimum) and estimation error (the gap to that minimum). The two pictures coincide under restrictive conditions (e.g., realizable squared loss with a specific class), but in general they are different cuts of "what goes wrong." Bias-variance lives in the regression literature and applies pointwise; bias-complexity lives in PAC theory and applies to expected risk over the entire input distribution.
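To make the contrast concrete, here is a sketch of the bias-variance side (toy setup, arbitrary constants): average degree-1 least-squares fits over many independent training samples and split the squared error at a single query point into bias$^2$, variance, and noise.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample(m, sigma=0.1):
    """m noisy samples of sin(pi x) on [-1, 1]."""
    x = rng.uniform(-1.0, 1.0, m)
    return x, np.sin(np.pi * x) + rng.normal(scale=sigma, size=m)

x0, sigma = 0.5, 0.1
f_x0 = np.sin(np.pi * x0)                          # noiseless truth at the query point

preds = []
for _ in range(2_000):                             # many independent training samples S
    x, y = sample(30)
    coeffs = np.polyfit(x, y, deg=1)               # h_S: least-squares line
    preds.append(np.polyval(coeffs, x0))
preds = np.array(preds)

bias_sq = (preds.mean() - f_x0) ** 2               # (E[h_S(x0)] - f(x0))^2
variance = preds.var()                             # Var[h_S(x0)] over samples
print(f"bias^2 = {bias_sq:.4f}, variance = {variance:.4f}, noise = {sigma**2:.4f}")
print(f"sum    = {bias_sq + variance + sigma**2:.4f}")
```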

Watch Out

Approximation error does not require knowing the truth

You do not need to know $\mathcal{D}$ or the Bayes-optimal classifier to talk about approximation error. The definition is internal to the class: $\varepsilon_{\mathrm{app}} = \min_{h \in \mathcal{H}} L_\mathcal{D}(h)$ is the best risk among hypotheses in your class. You cannot compute it without $\mathcal{D}$, but you can reason about how it changes when you enlarge or shrink $\mathcal{H}$, even when $\mathcal{D}$ is unknown.

Watch Out

Realizability does not eliminate estimation error

Under realizability, $\varepsilon_{\mathrm{app}} = 0$, but $\varepsilon_{\mathrm{est}}$ remains positive because ERM picks among many zero-empirical-risk hypotheses, only some of which generalize. The realizable PAC bound gives $\varepsilon_{\mathrm{est}} = O(\log|\mathcal{H}|/m)$ in expectation, better than the $O(\sqrt{\log|\mathcal{H}|/m})$ agnostic rate but still nonzero.
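A back-of-the-envelope comparison of the two rates (constants dropped, so the numbers are only indicative): the sample size needed to drive $\varepsilon_{\mathrm{est}}$ below a target $\varepsilon$ scales like $\log(|\mathcal{H}|/\delta)/\varepsilon$ in the realizable case versus $\log(|\mathcal{H}|/\delta)/\varepsilon^2$ in the agnostic case.

```python
import numpy as np

# Rough sample-size requirements for eps_est <= eps, ignoring constants.
H_size, delta = 10_000, 0.05
for eps in (0.1, 0.01):
    m_realizable = np.log(H_size / delta) / eps        # ~ log(|H|/delta) / eps
    m_agnostic = np.log(H_size / delta) / eps**2       # ~ log(|H|/delta) / eps^2
    print(f"eps = {eps}: m ~ {m_realizable:,.0f} (realizable) vs {m_agnostic:,.0f} (agnostic)")
```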

Watch Out

The Bayes classifier is not a practical $\mathcal{H}$

$h^{\mathrm{Bayes}}$ minimizes risk over all functions, so taking $\mathcal{H}$ = all functions makes $\varepsilon_{\mathrm{app}} = L_\mathcal{D}^{\mathrm{Bayes}}$ (the irreducible risk). The catch: that class is not learnable by No-Free-Lunch, so the favorable $\varepsilon_{\mathrm{app}}$ is paid for by an $\varepsilon_{\mathrm{est}}$ that does not vanish with $m$. Bayes optimality is a theoretical benchmark, not an implementable target.

Connection to Other Frameworks

The decomposition refines the PAC framework: PAC bounds the estimation error of ERM, conditional on the choice of $\mathcal{H}$. NFL forces the choice; bias-complexity quantifies the cost; PAC controls one term in the sum. The full story is:

$$L_\mathcal{D}(h_S) = \underbrace{L_\mathcal{D}^{\mathrm{Bayes}}}_{\text{noise floor}} + \underbrace{L_\mathcal{D}(h^*) - L_\mathcal{D}^{\mathrm{Bayes}}}_{\text{class deficit}} + \underbrace{L_\mathcal{D}(h_S) - L_\mathcal{D}(h^*)}_{\varepsilon_{\mathrm{est}}}.$$

The first two terms together equal $\varepsilon_{\mathrm{app}}$; the split into noise and class deficit makes explicit that even $h^*$ cannot beat the Bayes risk.

The operational response is structural risk minimization: parameterize a sequence $\mathcal{H}_1 \subseteq \mathcal{H}_2 \subseteq \cdots$ and let the data choose the index by penalizing complexity (Vapnik 1998, Chapter 1). Regularization, cross-validation, and the MDL-based information criteria are all variations on the same theme.
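A minimal SRM-flavored sketch (the nested polynomial classes and the $\sqrt{(d+1)/m}$ penalty are illustrative choices, not Vapnik's exact construction): select the degree that minimizes empirical risk plus a complexity penalty, using only the training sample.

```python
import numpy as np

rng = np.random.default_rng(4)

# Training sample: m = 30 noisy observations of sin(pi x) on [-1, 1].
m = 30
x_tr = rng.uniform(-1.0, 1.0, m)
y_tr = np.sin(np.pi * x_tr) + rng.normal(scale=0.1, size=m)

scores = {}
for d in range(1, 16):                              # nested classes H_1 ⊆ ... ⊆ H_15
    coeffs = np.polyfit(x_tr, y_tr, deg=d)
    emp_risk = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    penalty = np.sqrt((d + 1) / m)                   # illustrative complexity penalty
    scores[d] = emp_risk + penalty

best_d = min(scores, key=scores.get)
print("SRM-selected degree:", best_d)
```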

Exercises

Exercise (Core)

Problem

Define approximation error and estimation error in one sentence each, and state which depends on the sample size and which does not.

Exercise (Core)

Problem

You enlarge your hypothesis class from a finite $\mathcal{H}_1$ with $|\mathcal{H}_1| = 100$ to a finite $\mathcal{H}_2 \supset \mathcal{H}_1$ with $|\mathcal{H}_2| = 10\,000$. In which direction does each error term move, qualitatively, at fixed $m$?

Exercise (Advanced)

Problem

Show that for finite $\mathcal{H}$ in the agnostic setting, with probability at least $1 - \delta$ over $S$, $$\varepsilon_{\mathrm{est}} = L_\mathcal{D}(h_S) - L_\mathcal{D}(h^*) \le 2\sqrt{\frac{\log(2|\mathcal{H}|/\delta)}{2m}}.$$

Exercise (Advanced)

Problem

You have $m = 1000$ samples and two candidate classes: $\mathcal{H}_A$ with $|\mathcal{H}_A| = 50$ and $\varepsilon_{\mathrm{app}}^A = 0.10$; $\mathcal{H}_B$ with $|\mathcal{H}_B| = 5000$ and $\varepsilon_{\mathrm{app}}^B = 0.02$. Using the agnostic finite-class estimation bound at $\delta = 0.05$, which class minimizes the upper bound on $L_\mathcal{D}(h_S)$?

Summary

  • The error of an ERM hypothesis decomposes: $L_\mathcal{D}(h_S) = \varepsilon_{\mathrm{app}} + \varepsilon_{\mathrm{est}}$.
  • Approximation error: the in-class population minimum; depends on $\mathcal{H}$ and $\mathcal{D}$, not on $m$.
  • Estimation error: the ERM-vs-best-in-class gap; depends on $m$ and class complexity.
  • Richer $\mathcal{H}$ shrinks approximation, inflates estimation; smaller $\mathcal{H}$ does the reverse.
  • The Bayes-optimal "all functions" class is unlearnable (No-Free-Lunch); practical learning is the search for the smallest class that captures the truth.
  • Distinct from the statistical bias-variance decomposition; different variables, different settings, often confused.

References

Canonical:

  • Shalev-Shwartz and Ben-David, Understanding Machine Learning, Section 5.2 — the source for the approximation/estimation split as named here.
  • Vapnik, Statistical Learning Theory (1998), Chapter 1 — the structural risk minimization framing.
  • Devroye, Györfi, Lugosi, A Probabilistic Theory of Pattern Recognition (1996), Chapter 12 — empirical risk minimization and the consistency-vs-rate split.

Current:

  • Mohri, Rostamizadeh, Talwalkar, Foundations of Machine Learning (2018), Section 4.1 — agnostic PAC oracle inequality and the same decomposition under different notation.
  • Wainwright, High-Dimensional Statistics (2019), Chapter 14 — uniform convergence rates that bound the estimation error.
  • Bartlett, Bousquet, Mendelson, "Local Rademacher Complexities," Annals of Statistics 33 (2005), 1497-1537 — fast rates for the estimation error under low-noise / Bernstein conditions, refining the $1/\sqrt{m}$ scaling.

Next Topics

  • VC dimension: how to bound estimation error for infinite hypothesis classes via a single complexity measure.
  • Regularization theory: penalty-based class selection that operationalizes the tradeoff.
  • Bias-variance tradeoff: the related-but-distinct pointwise decomposition for squared-error regression.

Last reviewed: May 8, 2026
