Learning Theory Core
Bias-Complexity Tradeoff
The error of an ERM hypothesis decomposes into approximation error (the price of restricting H) and estimation error (the price of finite data). Richer classes shrink approximation but inflate estimation. The optimal class is the smallest one that still captures the truth.
Why This Matters
Once the No-Free-Lunch theorem forces you to commit to a hypothesis class $\mathcal{H}$, the next question is which one. Smaller classes generalize from less data but may not contain a hypothesis close to the truth. Larger classes can fit the truth but need more data to identify it. The bias-complexity tradeoff is the formal account of this tension.
The decomposition tells you exactly which side of the tradeoff is hurting you. Large empirical risk on the training set means the class is too small (approximation error). Small empirical risk but large test error means the class is too rich for the sample size (estimation error). Every model-selection technique, including cross-validation, AIC/BIC, regularization theory, and Kolmogorov complexity / MDL, is trying to navigate this tradeoff.
This is closely related to but distinct from the statistical bias-variance decomposition. Read the Common Confusions section below for the difference.
Mental Model
Imagine fitting a curve through noisy data. A constant function is a tiny class: it cannot bend, so its best member misses any non-constant truth by a lot (large approximation error). A polynomial of degree 50 is a giant class: it can match almost any 50-point sample, so it tracks noise (large estimation error). Linear or low-degree polynomial is the compromise: small enough to be identifiable from the sample, large enough to capture trends.
The art of model selection is choosing the smallest class that still contains a hypothesis close to the truth. NFL says that choice is forced; the bias-complexity tradeoff quantifies the cost of getting it wrong.
Formal Setup
Fix a domain $\mathcal{Z}$, an unknown distribution $\mathcal{D}$ over $\mathcal{Z}$, a loss $\ell : \mathcal{H} \times \mathcal{Z} \to \mathbb{R}_{\ge 0}$, and a hypothesis class $\mathcal{H}$. The population risk of a hypothesis $h \in \mathcal{H}$ is
$$L_{\mathcal{D}}(h) = \mathbb{E}_{z \sim \mathcal{D}}\,[\ell(h, z)].$$
Let $h^* \in \operatorname{argmin}_{h \in \mathcal{H}} L_{\mathcal{D}}(h)$ be a best-in-class hypothesis (assumed to exist; if not, take an infimum and add a vanishing slack). Let $h_S \in \operatorname{argmin}_{h \in \mathcal{H}} L_S(h)$ be the ERM hypothesis on a sample $S \sim \mathcal{D}^m$ of size $m$, where $L_S$ denotes the empirical risk.
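To make these objects concrete, here is a minimal numerical sketch (added here, not from the source): a synthetic distribution, a small finite class of threshold classifiers, and ERM on a small sample, with both error terms estimated by Monte Carlo. The distribution, class, flip rate, and sample size are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic distribution D (an assumption of this sketch):
# x ~ Uniform[0, 1]; label y = 1{x > 0.35}, flipped with probability 0.1.
def sample(n):
    x = rng.uniform(0, 1, n)
    y = (x > 0.35).astype(int)
    flip = rng.uniform(0, 1, n) < 0.1
    return x, np.where(flip, 1 - y, y)

# Finite hypothesis class H: threshold classifiers h_t(x) = 1{x > t}.
thresholds = np.linspace(0, 1, 21)

def risk(t, x, y):
    """0-1 loss of the threshold classifier h_t on the pairs (x, y)."""
    return np.mean((x > t).astype(int) != y)

# Population risk L_D(h), approximated on a large Monte Carlo sample.
x_pop, y_pop = sample(200_000)
pop_risk = np.array([risk(t, x_pop, y_pop) for t in thresholds])
eps_app = pop_risk.min()                      # approximation error = best-in-class risk

# ERM on a small training sample S of size m.
m = 30
x_tr, y_tr = sample(m)
emp_risk = np.array([risk(t, x_tr, y_tr) for t in thresholds])
h_S = thresholds[emp_risk.argmin()]           # ERM picks an empirical minimizer
eps_est = risk(h_S, x_pop, y_pop) - eps_app   # estimation error = gap to the in-class minimum

print(f"eps_app ~ {eps_app:.3f}, eps_est ~ {eps_est:.3f}, total ~ {eps_app + eps_est:.3f}")
```

Rerunning with a larger `m` shrinks the estimated estimation error while the approximation error stays put, which is exactly the split formalized next.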
The Error Decomposition
ERM Error Decomposition (Shalev-Shwartz and Ben-David, Section 5.2)
Statement
The population risk of the ERM hypothesis decomposes as
$$L_{\mathcal{D}}(h_S) = \epsilon_{\mathrm{app}} + \epsilon_{\mathrm{est}},$$
where the approximation error is
$$\epsilon_{\mathrm{app}} = \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) = L_{\mathcal{D}}(h^*)$$
and the estimation error is
$$\epsilon_{\mathrm{est}} = L_{\mathcal{D}}(h_S) - L_{\mathcal{D}}(h^*).$$
Intuition
The approximation error is the floor: the smallest population risk any hypothesis in $\mathcal{H}$ can achieve. It depends on $\mathcal{H}$ and $\mathcal{D}$ but not on the sample. The estimation error is the gap between ERM and the in-class minimum: it depends on the sample and shrinks with $m$. Total risk is the sum.
Proof Sketch
Add and subtract $L_{\mathcal{D}}(h^*)$:
$$L_{\mathcal{D}}(h_S) = L_{\mathcal{D}}(h^*) + \bigl(L_{\mathcal{D}}(h_S) - L_{\mathcal{D}}(h^*)\bigr) = \epsilon_{\mathrm{app}} + \epsilon_{\mathrm{est}}.$$
The decomposition is an identity, not a bound. The content is in the interpretation of the two terms and in the bounds on each.
Why It Matters
This identity organizes every generalization bound in learning theory. Approximation error is what you control by choosing $\mathcal{H}$. Estimation error is what you control by collecting more data or restricting $\mathcal{H}$. Total error cannot go below the approximation floor, no matter how much data you have.
Failure Mode
The decomposition assumes a best-in-class hypothesis exists. For non-compact classes the infimum may not be attained; adopt a vanishing slack (a hypothesis $h_\epsilon$ with $L_{\mathcal{D}}(h_\epsilon) \le \inf_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + \epsilon$) and let $\epsilon \to 0$. The identity also assumes a fixed $\mathcal{H}$; data-driven class selection (cross-validation, complexity-penalized class search) requires adjusted bounds that account for the selection step.
Properties of Each Term
Approximation error:
- Depends on $\mathcal{H}$ and $\mathcal{D}$, not on $m$ or the sample.
- Measures the inductive bias: how well does the best hypothesis in our chosen class fit the truth?
- Under the realizability assumption, $\epsilon_{\mathrm{app}} = 0$.
- In the agnostic setting, $\epsilon_{\mathrm{app}}$ is the gap to the Bayes risk plus the Bayes risk itself. It can be large if $\mathcal{H}$ is too small.
- Cannot be reduced by collecting more data. The only lever is enlarging $\mathcal{H}$.
Estimation error:
- Depends on $\mathcal{H}$, $m$, and the sample.
- For finite $\mathcal{H}$ in the agnostic setting, $\epsilon_{\mathrm{est}} \le \sqrt{\tfrac{2\log(2|\mathcal{H}|/\delta)}{m}}$ with probability at least $1-\delta$ over $S$. This is the standard uniform-convergence bound (a numeric illustration follows this list).
- For infinite $\mathcal{H}$ with finite VC dimension $d$, $\epsilon_{\mathrm{est}} = O\bigl(\sqrt{d/m}\bigr)$ up to log factors.
- Decreases with $m$ at rate $O(1/\sqrt{m})$ (agnostic) or $O(1/m)$ (realizable).
- Increases with class complexity (logarithmically in $|\mathcal{H}|$, linearly in $d$ under the square root).
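A quick numeric illustration (added here, not from the source) of how the finite-class agnostic bound $\sqrt{2\log(2|\mathcal{H}|/\delta)/m}$ scales: logarithmically in $|\mathcal{H}|$, as $1/\sqrt{m}$ in the sample size. The specific values of $|\mathcal{H}|$, $m$, and $\delta$ below are arbitrary.

```python
import math

def est_bound(H_size, m, delta=0.05):
    """Agnostic finite-class bound on the estimation error:
    sqrt(2 * log(2 * |H| / delta) / m)."""
    return math.sqrt(2 * math.log(2 * H_size / delta) / m)

# Logarithmic growth in |H| at fixed m: a vastly larger class barely moves the bound.
for H_size in (10, 10_000, 10**8):
    print(f"|H| = {H_size:>10}, m = 1000: bound ~ {est_bound(H_size, 1000):.3f}")

# 1/sqrt(m) decay at fixed |H|: four times the data roughly halves the bound.
for m in (100, 400, 1600):
    print(f"|H| = 1000, m = {m:>4}: bound ~ {est_bound(1000, m):.3f}")
```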
The Tradeoff
| Direction | Approximation error $\epsilon_{\mathrm{app}}$ | Estimation error $\epsilon_{\mathrm{est}}$ |
|---|---|---|
| Enlarge $\mathcal{H}$ | shrinks (more functions to fit truth) | grows ($|\mathcal{H}|$ or $d$ increases) |
| Shrink $\mathcal{H}$ | grows (truth may not be in class) | shrinks (less to overfit) |
| Increase $m$ | unchanged | shrinks ($O(1/\sqrt{m})$) |
For fixed $m$, the optimal class minimizes $\epsilon_{\mathrm{app}}(\mathcal{H}) + \epsilon_{\mathrm{est}}(\mathcal{H}, m)$. Too small: bias dominates, even infinite data does not help. Too large: variance dominates, the bound is vacuous at the available sample size.
The Bayes optimal classifier minimizes population risk over all functions, so taking $\mathcal{H}$ to be the class of all functions (or its regression analog) drives $\epsilon_{\mathrm{app}}$ to the Bayes-risk floor. But No-Free-Lunch shows that class is not learnable: $\epsilon_{\mathrm{est}}$ blows up. Practical learning is the search for a class small enough to identify from data and large enough that $\epsilon_{\mathrm{app}}$ is acceptable.
Worked Example: Polynomial Regression
Take one-dimensional inputs, regression with squared loss, a smooth non-polynomial true regression function $f^*$, and additive Gaussian noise. Hypothesis class $\mathcal{H}_d$: polynomials of degree at most $d$. Sample size $m$ is fixed and moderate.
| Degree $d$ | $\epsilon_{\mathrm{app}}$ (best-fit MSE in $\mathcal{H}_d$) | $\epsilon_{\mathrm{est}}$ (typical at this $m$) | Total |
|---|---|---|---|
| 1 (line) | 0.30 | 0.03 | 0.33 |
| 3 | 0.04 | 0.10 | 0.14 |
| 7 | 0.001 | 0.23 | 0.23 |
| 15 | 0.0001 | 0.50 | 0.50 |
Constants are illustrative. The shape is what matters: $\epsilon_{\mathrm{app}}$ decays fast as $d$ grows from 1 to 7, then plateaus near zero. $\epsilon_{\mathrm{est}}$ grows roughly linearly in $d$. The optimum is at degree 3, not degree 1 (too biased) and not degree 15 (too much estimation error).
This is why model-selection procedures matter: at this sample size, degree 3 generalizes; with substantially more data, degree 7 might. The right class depends on the sample size, not just on the truth. The simulation sketch below reproduces the same shape.
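A minimal simulation sketch in the spirit of the table (the specific truth $f^*(x) = \sin(\pi x)$, noise level, and sample size are assumptions of this sketch, not values from the text). It fits polynomials of increasing degree with `numpy.polyfit` and compares in-sample MSE, which only falls with degree, against held-out MSE, which turns back up once estimation error dominates.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed setup for illustration only: truth, noise level, and sample size
# are choices of this sketch, not values from the text above.
f_star = lambda x: np.sin(np.pi * x)
sigma, m = 0.3, 50

def make_sample(n):
    x = rng.uniform(-1, 1, n)
    return x, f_star(x) + sigma * rng.normal(size=n)

x_tr, y_tr = make_sample(m)         # training sample of size m
x_te, y_te = make_sample(20_000)    # large held-out sample ~ population risk

for d in (1, 3, 7, 15):
    coeffs = np.polyfit(x_tr, y_tr, deg=d)               # ERM over degree-d polynomials
    train_mse = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    print(f"degree {d:2d}: train MSE {train_mse:.3f}, held-out MSE {test_mse:.3f}")
```

Held-out error typically bottoms out at a low degree and rises again by degree 15, mirroring the U-shape in the table, shifted up by the irreducible noise variance $\sigma^2$.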
Common Confusions
Bias-complexity is not the same as bias-variance
The classical bias-variance decomposition splits squared-error risk at a fixed point $x$ as
$$\mathbb{E}\bigl[(\hat f_S(x) - y)^2\bigr] = \underbrace{\bigl(\mathbb{E}_S[\hat f_S(x)] - f^*(x)\bigr)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}_S\bigl[\bigl(\hat f_S(x) - \mathbb{E}_S[\hat f_S(x)]\bigr)^2\bigr]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{noise}},$$
with the expectation over training samples $S$. Bias is the gap between the expected prediction and the truth; variance is the spread of predictions across samples; noise is irreducible.
The bias-complexity decomposition splits population risk of the ERM hypothesis into approximation error (the in-class minimum) and estimation error (the gap to that minimum). The two pictures coincide under restrictive conditions (e.g., realizable squared loss with a specific class), but in general they are different cuts of "what goes wrong." Bias-variance lives in the regression literature and applies pointwise; bias-complexity lives in PAC theory and applies to expected risk over the entire input distribution.
Approximation error does not require knowing the truth
You do not need to know $\mathcal{D}$ or the Bayes-optimal classifier to talk about approximation error. The definition is internal to the class: $\epsilon_{\mathrm{app}} = \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h)$ is the best risk among hypotheses in your class. You cannot compute it without $\mathcal{D}$, but you can reason about how it changes when you enlarge or shrink $\mathcal{H}$, even when $\mathcal{D}$ is unknown.
Realizability does not eliminate estimation error
Under realizability, $\epsilon_{\mathrm{app}} = 0$, but $\epsilon_{\mathrm{est}}$ remains positive because ERM picks among many zero-empirical-risk hypotheses, only some of which generalize. The realizable PAC bound gives $\epsilon_{\mathrm{est}} = O\bigl(\log|\mathcal{H}| / m\bigr)$ in expectation, better than the agnostic $O(1/\sqrt{m})$ rate but still nonzero.
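A small sketch (an added illustration, not from the source) of the point above: in a realizable threshold problem, many hypotheses fit a small sample perfectly, yet their population risks differ, and ERM cannot tell them apart from the data alone. The class, target threshold, and sample size are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Realizable toy setup (an assumption of this sketch): labels come from a
# threshold t* = 0.5 inside the class, with no label noise.
t_star = 0.5
thresholds = np.linspace(0, 1, 201)           # finite class of threshold classifiers

m = 10
x_tr = rng.uniform(0, 1, m)
y_tr = (x_tr > t_star).astype(int)

# Every threshold lying between the largest negative point and the smallest
# positive point fits this sample with zero empirical risk.
zero_emp = [t for t in thresholds
            if np.all((x_tr > t).astype(int) == y_tr)]

# Under x ~ Uniform[0, 1] and 0-1 loss, the population risk of threshold t is |t - t*|.
pop_risks = [abs(t - t_star) for t in zero_emp]
print(f"{len(zero_emp)} hypotheses have zero empirical risk; "
      f"their population risks range from {min(pop_risks):.3f} to {max(pop_risks):.3f}")
```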
The Bayes classifier is not a practical H
The Bayes-optimal classifier minimizes risk over all functions, so taking $\mathcal{H}$ to be the class of all functions makes $\epsilon_{\mathrm{app}}$ equal to the Bayes risk (the irreducible risk). The catch: that class is not learnable by No-Free-Lunch, so the favorable $\epsilon_{\mathrm{app}}$ is paid for by an $\epsilon_{\mathrm{est}}$ that does not vanish with $m$. Bayes optimality is a theoretical benchmark, not an implementable target.
Connection to Other Frameworks
The decomposition refines the PAC framework: PAC bounds the estimation error of ERM, conditional on the choice of $\mathcal{H}$. NFL forces the choice; bias-complexity quantifies the cost; PAC controls one term in the sum. The full story is:
$$L_{\mathcal{D}}(h_S) = \underbrace{L_{\mathcal{D}}(f_{\mathrm{Bayes}})}_{\text{noise}} + \underbrace{L_{\mathcal{D}}(h^*) - L_{\mathcal{D}}(f_{\mathrm{Bayes}})}_{\text{class deficit}} + \underbrace{L_{\mathcal{D}}(h_S) - L_{\mathcal{D}}(h^*)}_{\epsilon_{\mathrm{est}}}.$$
The first two terms together equal $\epsilon_{\mathrm{app}}$; the split into noise and class deficit makes explicit that even $h^*$ cannot beat the Bayes risk.
The operational response is structural risk minimization: parameterize a nested sequence $\mathcal{H}_1 \subseteq \mathcal{H}_2 \subseteq \cdots$ and let the data choose the index by penalizing complexity (Vapnik 1998, Chapter 1). Regularization, cross-validation, and the MDL-based information criteria are all variations on the same theme.
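A hedged sketch of the SRM idea on the polynomial example (the penalty form $c\sqrt{(d+1)/m}$ and the constant are illustrative assumptions of this sketch, not Vapnik's prescription): pick the degree whose training error plus complexity penalty is smallest.

```python
import numpy as np

rng = np.random.default_rng(3)

# Same toy regression setup as above (assumed truth, noise level, and sample size).
f_star = lambda x: np.sin(np.pi * x)
m, sigma = 50, 0.3
x_tr = rng.uniform(-1, 1, m)
y_tr = f_star(x_tr) + sigma * rng.normal(size=m)

def srm_select(x, y, max_degree=15, c=1.0):
    """SRM-style selection over the nested classes H_1, H_2, ...:
    minimize empirical risk plus a complexity penalty ~ sqrt((d + 1) / m).
    The penalty form and the constant c are illustrative assumptions."""
    best_d, best_score = None, np.inf
    n = len(x)
    for d in range(1, max_degree + 1):
        coeffs = np.polyfit(x, y, deg=d)
        emp_risk = np.mean((np.polyval(coeffs, x) - y) ** 2)
        score = emp_risk + c * np.sqrt((d + 1) / n)   # training error + penalty
        if score < best_score:
            best_d, best_score = d, score
    return best_d

print("SRM-selected degree:", srm_select(x_tr, y_tr))
```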
Exercises
Problem
Define approximation error and estimation error in one sentence each, and state which depends on the sample size and which does not.
Problem
You enlarge your hypothesis class from a finite $\mathcal{H}_1$ to a finite $\mathcal{H}_2 \supseteq \mathcal{H}_1$ with $|\mathcal{H}_2| \gg |\mathcal{H}_1|$. In which direction does each error term move, qualitatively, at fixed $m$?
Problem
Show that for finite $\mathcal{H}$ in the agnostic setting, with probability at least $1-\delta$ over $S \sim \mathcal{D}^m$,
$$L_{\mathcal{D}}(h_S) \le \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + \sqrt{\frac{2\log(2|\mathcal{H}|/\delta)}{m}}.$$
Problem
You have $m$ samples and two candidate classes: $\mathcal{H}_1$, small, with a larger approximation error; and $\mathcal{H}_2$, much larger, with a smaller approximation error. Using the agnostic finite-class estimation bound at confidence level $1-\delta$, which class minimizes the upper bound on $L_{\mathcal{D}}(h_S)$?
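A sketch of the arithmetic this exercise calls for, with placeholder numbers (the exercise's original figures are not reproduced here, so every value below is hypothetical):

```python
import math

def risk_upper_bound(eps_app, H_size, m, delta=0.05):
    """Agnostic finite-class upper bound on L_D(h_S):
    approximation error + sqrt(2 * log(2 * |H| / delta) / m)."""
    return eps_app + math.sqrt(2 * math.log(2 * H_size / delta) / m)

m = 1000  # hypothetical sample size
# Hypothetical candidates: a small class with larger approximation error
# versus a much larger class with smaller approximation error.
candidates = {
    "H1 (small, biased)":   {"eps_app": 0.15, "H_size": 10**3},
    "H2 (large, flexible)": {"eps_app": 0.05, "H_size": 10**9},
}
for name, spec in candidates.items():
    bound = risk_upper_bound(spec["eps_app"], spec["H_size"], m)
    print(f"{name}: upper bound ~ {bound:.3f}")
```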
Summary
- The error of an ERM hypothesis decomposes: $L_{\mathcal{D}}(h_S) = \epsilon_{\mathrm{app}} + \epsilon_{\mathrm{est}}$.
- Approximation error: in-class population minimum, depends on $\mathcal{H}$ and $\mathcal{D}$, not on $m$.
- Estimation error: ERM-vs-best-in-class gap, depends on $m$ and class complexity.
- Richer $\mathcal{H}$ shrinks approximation, inflates estimation; smaller $\mathcal{H}$ does the reverse.
- The Bayes-optimal "all functions" class is unlearnable (No-Free-Lunch); practical learning is the search for the smallest class that captures the truth.
- Distinct from the statistical bias-variance decomposition; different variables, different settings, often confused.
References
Canonical:
- Shalev-Shwartz and Ben-David, Understanding Machine Learning, Section 5.2 — the source for the approximation/estimation split as named here.
- Vapnik, Statistical Learning Theory (1998), Chapter 1 — the structural risk minimization framing.
- Devroye, Györfi, Lugosi, A Probabilistic Theory of Pattern Recognition (1996), Chapter 12 — empirical risk minimization and the consistency-vs-rate split.
Current:
- Mohri, Rostamizadeh, Talwalkar, Foundations of Machine Learning (2018), Section 4.1 — agnostic PAC oracle inequality and the same decomposition under different notation.
- Wainwright, High-Dimensional Statistics (2019), Chapter 14 — uniform convergence rates that bound the estimation error.
- Bartlett, Bousquet, Mendelson, "Local Rademacher Complexities," Annals of Statistics 33 (2005), 1497-1537 — fast rates for the estimation error under low-noise / Bernstein conditions, refining the scaling.
Next Topics
- VC dimension: how to bound estimation error for infinite hypothesis classes via a single complexity measure.
- Regularization theory: penalty-based class selection that operationalizes the tradeoff.
- Bias-variance tradeoff: the related-but-distinct pointwise decomposition for squared-error regression.
Last reviewed: May 8, 2026
Canonical graph
Required before and derived from this topic
Required prerequisites
- PAC Learning Framework (layer 1 · tier 1)
- Empirical Risk Minimization (layer 2 · tier 1)
- No-Free-Lunch Theorem (layer 2 · tier 2)
Derived topics
- No published topic currently declares this as a prerequisite.