Learning Theory Core
Hypothesis Classes and Function Spaces
What is a hypothesis class, why the choice of hypothesis class determines what ERM can learn, and the approximation-estimation tradeoff: bigger classes reduce approximation error but increase estimation error.
Why This Matters
Hypothesis class complexity: richer class gives lower bias, higher variance, needs more data
When you train a model, you do not search over all possible functions. You search over a specific set: linear functions, decision trees, neural networks with a given architecture. This set is the hypothesis class. The choice of hypothesis class is the most consequential decision in learning, because it determines both what you can learn and what you cannot.
A hypothesis class that is too small cannot approximate the true function (high approximation error). A class that is too large needs too many samples to learn reliably (high estimation error). All of learning theory is about understanding this tradeoff.
Mental Model
Think of the hypothesis class as a toolbox. If your toolbox only contains hammers (linear functions), you can fix nails efficiently but you cannot turn screws. If your toolbox contains every tool ever made (all measurable functions), you can in principle do anything, but you cannot figure out which tool to use without trying each one.
The learning theorist asks: given $n$ data points, what is the right-sized toolbox?
Formal Setup
Hypothesis Class
A hypothesis class $\mathcal{H}$ is a set of functions $h: \mathcal{X} \to \mathcal{Y}$ that a learning algorithm considers as candidate predictors. The algorithm selects some $\hat{h} \in \mathcal{H}$ based on the training data.
Realizable Setting
The learning problem is realizable if the true data-generating function belongs to $\mathcal{H}$: there exists $h^* \in \mathcal{H}$ with $R(h^*) = R^*$, where $R^*$ is the Bayes risk. In the realizable setting, the approximation error is zero.
Agnostic Setting
The learning problem is agnostic if we make no assumption about whether the true function lies in $\mathcal{H}$. The best we can hope for is to find $h \in \mathcal{H}$ whose risk is close to $\inf_{h' \in \mathcal{H}} R(h')$, which may be strictly larger than $R^*$.
Examples of Hypothesis Classes
Finite hypothesis classes. $|\mathcal{H}| < \infty$. The ERM generalization bound from the ERM topic gives a gap of $O\big(\sqrt{\log|\mathcal{H}| / n}\big)$.
Linear classifiers. $\mathcal{H} = \{x \mapsto \operatorname{sign}(w^\top x + b) : w \in \mathbb{R}^d,\ b \in \mathbb{R}\}$. This is an infinite class parameterized by $d + 1$ real numbers. The VC dimension is $d + 1$.
Neural networks with fixed architecture. $\mathcal{H} = \{f_\theta : \theta \in \mathbb{R}^p\}$ for a network with $p$ parameters. The class is parameterized by $p$ real numbers, but its effective complexity depends on the architecture, not just $p$.
Nonparametric classes. $\mathcal{H} = \{h \in \mathcal{H}_K : \|h\|_{\mathcal{H}_K} \le B\}$ for a reproducing kernel Hilbert space $\mathcal{H}_K$. The "number of parameters" is not fixed; it can grow with $n$. Examples: kernel SVM, Gaussian process regression.
All measurable functions. $\mathcal{H} = \{\text{all measurable } h : \mathcal{X} \to \mathcal{Y}\}$. This class can memorize any dataset, giving zero training error. But it provides no generalization: the estimation error is maximal.
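The finite-class case above can be made concrete with a small runnable sketch. Everything here is illustrative (the threshold grid, the true threshold 0.37, and the 10% label-noise rate are invented for the example, not from the text): ERM searches a finite class of threshold classifiers and its held-out risk lands near the Bayes risk.

```python
import random

random.seed(0)

# A finite hypothesis class of threshold classifiers h_t(x) = 1[x >= t]
# on [0, 1]; here |H| = 21.
thresholds = [i / 20 for i in range(21)]

def predict(t, x):
    return 1 if x >= t else 0

def sample(n, t_star=0.37, noise=0.10):
    # Labels follow a true threshold at t_star, flipped with probability
    # `noise`, so the Bayes risk equals `noise`.
    data = []
    for _ in range(n):
        x = random.random()
        y = predict(t_star, x)
        if random.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data

def empirical_risk(t, data):
    return sum(predict(t, x) != y for x, y in data) / len(data)

# ERM searches only within the class: pick the threshold with the
# smallest training error.
train = sample(500)
t_hat = min(thresholds, key=lambda t: empirical_risk(t, train))

# A large held-out sample approximates the true risk of the ERM output.
test = sample(20000)
print("ERM threshold:", t_hat)
print("held-out risk:", round(empirical_risk(t_hat, test), 3))
```

With enough training data the selected threshold sits near the true one and the held-out risk approaches the Bayes risk of 0.10, exactly the $O(\sqrt{\log|\mathcal{H}|/n})$ behavior the finite-class bound predicts.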
Main Theorems
Excess Risk Decomposition
Statement
The excess risk of the ERM hypothesis $\hat{h}$ over the Bayes-optimal predictor decomposes as:

$$R(\hat{h}) - R^* = \underbrace{\Big(\inf_{h \in \mathcal{H}} R(h) - R^*\Big)}_{\text{approximation error}} + \underbrace{\Big(R(\hat{h}) - \inf_{h \in \mathcal{H}} R(h)\Big)}_{\text{estimation error}}$$

The approximation error depends only on $\mathcal{H}$ (not on the sample). The estimation error depends on both $\mathcal{H}$ and the sample size $n$.
Intuition
The approximation error asks: how close can the best function in $\mathcal{H}$ get to the truth? The estimation error asks: how close can ERM get to the best function in $\mathcal{H}$, given finite data? These are two different sources of error, and they pull in opposite directions as you vary the size of $\mathcal{H}$.
Proof Sketch
This is an identity: add and subtract $\inf_{h \in \mathcal{H}} R(h)$. The content is in recognizing that the two terms have different dependencies and require different tools to bound.
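The identity can be checked numerically. In the hypothetical setup below (all constants are illustrative: threshold classifiers on $[0,1]$, true threshold 0.37 deliberately *not* on the class's coarse grid, 10% label noise), the Bayes risk and every in-class risk are available in closed form, so both terms of the decomposition can be computed exactly.

```python
import random

random.seed(1)

ETA, T_STAR = 0.10, 0.37
grid = [i / 10 for i in range(11)]  # coarse class; T_STAR is NOT in it

def true_risk(t):
    # Closed form for x ~ Uniform[0, 1]: wrong with prob 1 - ETA on the
    # strip between t and T_STAR, and with prob ETA elsewhere.
    return ETA + (1 - 2 * ETA) * abs(t - T_STAR)

def sample(n):
    out = []
    for _ in range(n):
        x = random.random()
        y = (x >= T_STAR) != (random.random() < ETA)
        out.append((x, int(y)))
    return out

def erm(data):
    return min(grid, key=lambda t: sum((x >= t) != y for x, y in data))

bayes = ETA                                       # R*
best_in_class = min(true_risk(t) for t in grid)   # inf_{h in H} R(h)
t_hat = erm(sample(200))

approximation = best_in_class - bayes             # depends only on H
estimation = true_risk(t_hat) - best_in_class     # depends on H and n
excess = true_risk(t_hat) - bayes
# The identity: excess risk = approximation + estimation.
print(round(approximation, 3), round(estimation, 3), round(excess, 3))
```

Because the true threshold is off the grid, the approximation error is strictly positive ($0.8 \times 0.03 = 0.024$) no matter how much data arrives; only the estimation term shrinks with $n$.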
Why It Matters
This decomposition is the backbone of model selection. Regularization, cross-validation, and structural risk minimization are all about balancing these two terms. When someone says a model is "too complex" or "too simple," they are implicitly referring to this tradeoff.
Failure Mode
This decomposition does not account for optimization error. In practice, ERM does not find the exact minimizer of empirical risk (especially for non-convex problems like neural networks). The true decomposition has three terms: approximation + estimation + optimization. The optimization error can dominate for large models trained with SGD.
Nested Classes and Monotonicity
Statement
For nested hypothesis classes $\mathcal{H}_1 \subseteq \mathcal{H}_2 \subseteq \cdots$:
- Approximation error is non-increasing: $\inf_{h \in \mathcal{H}_{k+1}} R(h) \le \inf_{h \in \mathcal{H}_k} R(h)$.
- Estimation error is non-decreasing (in expectation, for fixed $n$): the uniform convergence gap grows with the complexity of $\mathcal{H}_k$.
The optimal class $\mathcal{H}_{k^*}$ minimizes the sum. It is neither the smallest nor the largest class.
Intuition
A bigger class always contains the best functions from a smaller class, so the approximation error can only improve. But a bigger class also gives ERM more room to overfit, so the estimation error grows. The sweet spot balances these opposing forces.
Proof Sketch
Part 1: $\mathcal{H}_k \subseteq \mathcal{H}_{k+1}$ implies $\inf_{h \in \mathcal{H}_{k+1}} R(h) \le \inf_{h \in \mathcal{H}_k} R(h)$ by minimizing over a larger set. Part 2: the uniform convergence bound for $\mathcal{H}_{k+1}$ depends on the complexity of $\mathcal{H}_{k+1}$ (e.g., VC dimension), which is at least that of $\mathcal{H}_k$.
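Part 1 can be verified directly on a concrete nested family (a sketch with illustrative constants, reusing the threshold-classifier setting with true threshold 0.37 and noise rate 0.10): let $\mathcal{H}_k$ be thresholds on a dyadic grid of resolution $2^{-k}$, so each grid refines the previous one and $\mathcal{H}_1 \subseteq \mathcal{H}_2 \subseteq \cdots$.

```python
ETA, T_STAR = 0.10, 0.37

def true_risk(t):
    # x ~ Uniform[0, 1], labels flipped with probability ETA
    return ETA + (1 - 2 * ETA) * abs(t - T_STAR)

def grid(k):
    # Dyadic grid of resolution 2^-k; grid(k) is a subset of grid(k+1),
    # so the induced classes are nested.
    return [i / 2**k for i in range(2**k + 1)]

# Approximation error of H_k: best achievable risk in the class minus
# the Bayes risk ETA.
approx_err = [min(true_risk(t) for t in grid(k)) - ETA for k in range(1, 9)]
print([round(a, 4) for a in approx_err])
```

The printed sequence is non-increasing in $k$, as the theorem requires; the estimation side of the tradeoff does not appear here because `true_risk` is exact rather than estimated from a sample.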
Why It Matters
This is the formal version of the bias-variance tradeoff. It explains why increasing model capacity helps up to a point and then hurts (in classical statistics). It also frames the puzzle of modern overparameterized models: why does estimation error not blow up when $\mathcal{H}$ is massive? The answer involves implicit regularization and is covered in the double descent and benign overfitting topics.
Failure Mode
The monotonicity of estimation error relies on uniform convergence bounds. For overparameterized models, uniform convergence bounds are vacuous, and the actual estimation error may decrease with increasing class size. This is the "interpolation regime" studied in double descent.
Parameterized vs Nonparametric
A parametric hypothesis class has a fixed-dimensional parameter space $\Theta \subseteq \mathbb{R}^p$. The number of parameters $p$ does not change with the sample size $n$. Examples: linear regression ($p = d + 1$ for $d$ features), logistic regression, fixed-architecture neural networks.
A nonparametric hypothesis class allows the effective number of parameters to grow with $n$. Examples: $k$-nearest neighbors (stores all $n$ training points), kernel methods (one coefficient per data point), random forests (depth grows with $n$).
The distinction matters for generalization bounds: parametric classes have fixed complexity, while nonparametric classes require bounds that account for growing complexity.
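The contrast is easy to see in code. A minimal sketch (the fitting procedures and data are illustrative toys, not canonical implementations): a parametric fit returns a fixed-size model no matter how much data it sees, while a 1-nearest-neighbor "fit" stores the entire training set.

```python
import random

random.seed(2)

def fit_threshold(data):
    # Parametric: the learned model is a single number, whatever n is.
    grid = [i / 100 for i in range(101)]
    return min(grid, key=lambda t: sum((x >= t) != y for x, y in data))

def fit_1nn(data):
    # Nonparametric: the learned "model" IS the training data.
    return list(data)

def predict_1nn(model, x):
    # Prediction consults the stored points directly.
    _, y = min(model, key=lambda p: abs(p[0] - x))
    return y

for n in (100, 1000):
    data = [(random.random(), int(random.random() < 0.5)) for _ in range(n)]
    t = fit_threshold(data)   # model size: 1 parameter
    model = fit_1nn(data)     # model size: n stored points
    print(n, "->", "1 parameter vs", len(model), "stored points")
```

This is why the generalization analysis differs: the threshold model's complexity is fixed in advance, while the 1-NN model's capacity scales with $n$ and must be controlled by other means (e.g., choosing $k$ as a function of $n$).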
Common Confusions
More parameters does not always mean bigger hypothesis class
A neural network with $p$ parameters might have effective complexity much less than $p$ due to symmetries, implicit regularization from SGD, and architectural constraints. The number of parameters is an upper bound on the "size" of the hypothesis class, not the effective size. VC dimension, Rademacher complexity, and PAC-Bayes bounds attempt to measure effective complexity.
The hypothesis class is chosen before seeing data
In the ERM framework, $\mathcal{H}$ is fixed before observing the training set. Using the training data to select $\mathcal{H}$ (e.g., choosing the network architecture via cross-validation) introduces a dependence that invalidates simple ERM bounds. Structural risk minimization (SRM) and model selection theory handle this case.
Approximation error zero does not mean the problem is easy
In the realizable setting, the true function lies in $\mathcal{H}$ and the approximation error is zero. But estimation error can still be large if $\mathcal{H}$ is complex. Realizability makes the problem easier (you only fight estimation error), but not trivial.
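A sketch of this last point (constants are illustrative, reusing the threshold setting): make the problem realizable by putting the true threshold 0.37 on the class's grid, so the approximation error is exactly zero, and watch the estimation error shrink with $n$ rather than vanish outright.

```python
import random

random.seed(3)

ETA, T_STAR = 0.10, 0.37
grid = [i / 100 for i in range(101)]  # contains 0.37: realizable setting

def true_risk(t):
    # x ~ Uniform[0, 1], labels flipped with probability ETA
    return ETA + (1 - 2 * ETA) * abs(t - T_STAR)

def sample(n):
    return [(x, int((x >= T_STAR) != (random.random() < ETA)))
            for x in (random.random() for _ in range(n))]

def erm(data):
    return min(grid, key=lambda t: sum((x >= t) != y for x, y in data))

# Realizability: the best in-class risk equals the Bayes risk.
assert min(true_risk(t) for t in grid) == ETA

for n in (10, 100, 5000):
    est = true_risk(erm(sample(n))) - ETA  # estimation error only
    print(n, round(est, 4))
```

Even with zero approximation error, the small-$n$ runs show a nontrivial gap to the Bayes risk; realizability removes one term of the decomposition, not the learning problem.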
Exercises
Problem
Consider two hypothesis classes: $\mathcal{H}_1$ is all linear functions from $\mathbb{R}$ to $\mathbb{R}$ and $\mathcal{H}_2$ is all polynomials of degree at most 3 from $\mathbb{R}$ to $\mathbb{R}$. Which has smaller approximation error? Which has smaller estimation error (for fixed $n$)? What is the relationship between $\mathcal{H}_1$ and $\mathcal{H}_2$?
Problem
Suppose the true function is $f$ and we use ERM with 0-1 loss. Compare the excess risk decomposition for: (a) $\mathcal{H}$ = linear classifiers, (b) $\mathcal{H}$ = two-layer neural networks with 100 hidden units. Which term dominates in each case, and why?
References
Canonical:
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 2, 5, and 6
- Vapnik, Statistical Learning Theory (1998), Chapter 3
- Devroye, Gyorfi, Lugosi, A Probabilistic Theory of Pattern Recognition (1996), Chapters 4-6 (hypothesis classes and risk)
Current:
- Mohri, Rostamizadeh, Talwalkar, Foundations of Machine Learning (2018), Chapter 2
- Wainwright, High-Dimensional Statistics (2019), Chapter 4.1 (function classes and approximation)
- Bartlett, Bousquet, Mendelson, "Local Rademacher complexities" (2005), Annals of Statistics. Refined analysis of estimation error for localized classes.
Next Topics
- VC dimension: measuring the complexity of infinite hypothesis classes
- Rademacher complexity: data-dependent complexity that adapts to the data distribution
- PAC learning framework: formalization of when learning is possible
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.