Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.

Learning Theory Core

Hypothesis Classes and Function Spaces

What is a hypothesis class, why the choice of hypothesis class determines what ERM can learn, and the approximation-estimation tradeoff: bigger classes reduce approximation error but increase estimation error.

Core · Tier 1 · Stable · ~45 min

Why This Matters

[Figure: Hypothesis class complexity. The same dataset (positive class, negative class, decision boundary) fit with three hypothesis classes: Linear (VC dim = 3) gives high bias, low variance; degree-3 polynomial gives moderate bias and variance; a neural network gives low bias, high variance. Caption: Same data, different hypothesis classes. Increasing complexity trades bias for variance. A richer class gives lower bias and higher variance, and needs more data.]

When you train a model, you do not search over all possible functions. You search over a specific set: linear functions, decision trees, neural networks with a given architecture. This set is the hypothesis class. The choice of hypothesis class is the most consequential decision in learning, because it determines both what you can learn and what you cannot.

A hypothesis class that is too small cannot approximate the true function (high approximation error). A class that is too large needs too many samples to learn reliably (high estimation error). All of learning theory is about understanding this tradeoff.

Mental Model

Think of the hypothesis class as a toolbox. If your toolbox only contains hammers (linear functions), you can fix nails efficiently but you cannot turn screws. If your toolbox contains every tool ever made (all measurable functions), you can in principle do anything, but you cannot figure out which tool to use without trying each one.

The learning theorist asks: given $n$ data points, what is the right-sized toolbox?

Formal Setup

Definition

Hypothesis Class

A hypothesis class is a set of functions $\mathcal{H} \subseteq \{h : \mathcal{X} \to \mathcal{Y}\}$ that a learning algorithm considers as candidate predictors. The algorithm selects $\hat{h} \in \mathcal{H}$ based on training data.

Definition

Realizable Setting

The learning problem is realizable if the true data-generating function $f^*$ belongs to $\mathcal{H}$: there exists $h^* \in \mathcal{H}$ with $R(h^*) = R^*$, where $R^*$ is the Bayes risk. In the realizable setting, the approximation error is zero.

Definition

Agnostic Setting

The learning problem is agnostic if we make no assumption about whether $f^* \in \mathcal{H}$. The best we can hope for is to find $\hat{h}$ whose risk is close to $\min_{h \in \mathcal{H}} R(h)$, which may be strictly larger than $R^*$.

Examples of Hypothesis Classes

Finite hypothesis classes. $\mathcal{H} = \{h_1, \ldots, h_M\}$. The ERM generalization bound from the ERM topic gives a gap of $O(\sqrt{\log M / n})$.
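To get a feel for how mildly the gap grows with class size, the bound can be evaluated directly. This is a hedged sketch using the standard Hoeffding-plus-union-bound form (the exact constants and the function name `erm_gap_bound` are my choices, not from the text):

```python
import math

def erm_gap_bound(M: int, n: int, delta: float = 0.05) -> float:
    """With probability >= 1 - delta, for a finite class of M hypotheses and a
    loss bounded in [0, 1]:  sup_h |R(h) - R_hat(h)| <= sqrt(log(2M/delta) / (2n)).
    (Hoeffding + union bound; constants vary across textbooks.)"""
    return math.sqrt(math.log(2 * M / delta) / (2 * n))

# The gap grows only logarithmically in M, but shrinks as 1/sqrt(n).
for M in (10, 10**3, 10**6):
    print(f"M = {M:>9,}, n = 1000: gap <= {erm_gap_bound(M, 1000):.3f}")
```

Note how a million-fold increase in $M$ barely moves the bound, while quadrupling $n$ halves it.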

Linear classifiers. $\mathcal{H} = \{x \mapsto \text{sign}(w^T x + b) : w \in \mathbb{R}^d, b \in \mathbb{R}\}$. This is an infinite class parameterized by $d + 1$ real numbers. The VC dimension is $d + 1$.

Neural networks with fixed architecture. $\mathcal{H} = \{x \mapsto f_\theta(x) : \theta \in \mathbb{R}^p\}$ for a network $f_\theta$ with $p$ parameters. The class is parameterized by $p$ real numbers, but its effective complexity depends on the architecture, not just $p$.

Nonparametric classes. $\mathcal{H} = \{f : \|f\|_{\text{RKHS}} \leq B\}$ for a reproducing kernel Hilbert space. The "number of parameters" is not fixed; it can grow with $n$. Examples: kernel SVM, Gaussian process regression.

All measurable functions. $\mathcal{H} = \{h : \mathcal{X} \to \mathcal{Y}\}$. This class can memorize any dataset, giving zero training error. But it provides no generalization: the estimation error is maximal.
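These examples can be made concrete with a small experiment: run ERM under squared loss over three nested polynomial classes on the same sample. The toy target $\sin(2\pi x)$, the noise level, and the degrees are illustrative choices of mine, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
f_star = lambda x: np.sin(2 * np.pi * x)   # assumed true regression function

n = 30
x_train = rng.uniform(0, 1, n)
y_train = f_star(x_train) + 0.3 * rng.normal(size=n)
x_test = np.linspace(0, 1, 500)            # dense noiseless grid approximates the risk

train_mse, test_mse = {}, {}
for degree in (1, 3, 15):                  # three nested polynomial hypothesis classes
    coeffs = np.polyfit(x_train, y_train, degree)   # least squares = ERM, squared loss
    train_mse[degree] = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_mse[degree] = float(np.mean((np.polyval(coeffs, x_test) - f_star(x_test)) ** 2))
    print(degree, round(train_mse[degree], 3), round(test_mse[degree], 3))
```

Training error can only fall as the class grows; the degree-1 class is too small to track a full sine period, while the degree-15 class has far more room to chase the noise in 30 points.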

Main Theorems

Theorem

Excess Risk Decomposition

Statement

The excess risk of the ERM hypothesis $\hat{h}$ over the Bayes-optimal predictor $h^*_{\text{Bayes}}$ decomposes as:

$$R(\hat{h}) - R^* = \underbrace{\left(\min_{h \in \mathcal{H}} R(h) - R^*\right)}_{\text{approximation error}} + \underbrace{\left(R(\hat{h}) - \min_{h \in \mathcal{H}} R(h)\right)}_{\text{estimation error}}$$

The approximation error depends only on $\mathcal{H}$ (not on the sample). The estimation error depends on both $\mathcal{H}$ and the sample size $n$.

Intuition

The approximation error asks: how close can the best function in $\mathcal{H}$ get to the truth? The estimation error asks: how close can ERM get to the best function in $\mathcal{H}$, given finite data? These are two different sources of error, and they pull in opposite directions as you vary the size of $\mathcal{H}$.

Proof Sketch

This is an identity: add and subtract $\min_{h \in \mathcal{H}} R(h)$. The content is in recognizing that the two terms have different dependencies and require different tools to bound.

Why It Matters

This decomposition is the backbone of model selection. Regularization, cross-validation, and structural risk minimization are all about balancing these two terms. When someone says a model is "too complex" or "too simple," they are implicitly referring to this tradeoff.

Failure Mode

This decomposition does not account for optimization error. In practice, ERM does not find the exact minimizer of empirical risk (especially for non-convex problems like neural networks). The true decomposition has three terms: approximation + estimation + optimization. The optimization error can dominate for large models trained with SGD.
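The two terms of the decomposition can also be estimated numerically. This is a sketch under assumed settings (squared loss with Gaussian noise of known variance, so $R^* = \sigma^2$; a large-sample fit serves only as a proxy for the best-in-class hypothesis, and the target function is my choice):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.3                                # noise std; Bayes risk R* = sigma^2
f_star = lambda x: np.sin(2 * np.pi * x)   # assumed true regression function
degree, n = 3, 50

def sample(m):
    x = rng.uniform(0, 1, m)
    return x, f_star(x) + sigma * rng.normal(size=m)

def risk(coeffs, m=200_000):
    """Monte-Carlo estimate of the population squared-error risk R(h)."""
    x, y = sample(m)
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

h_best = np.polyfit(*sample(200_000), degree)   # proxy for argmin over the class
h_hat = np.polyfit(*sample(n), degree)          # ERM on a small sample

r_best, r_hat, bayes = risk(h_best), risk(h_hat), sigma ** 2
approx = r_best - bayes     # depends only on the class, not on n
estim = r_hat - r_best      # shrinks as n grows
print(f"approx ~ {approx:.4f}, estim ~ {estim:.4f}, excess ~ {r_hat - bayes:.4f}")
```

By construction the two estimates sum exactly to the excess risk; increasing `n` shrinks `estim` while leaving `approx` essentially unchanged.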

Proposition

Nested Classes and Monotonicity

Statement

For nested hypothesis classes $\mathcal{H}_1 \subset \mathcal{H}_2 \subset \mathcal{H}_3$:

  1. Approximation error is non-increasing: $\min_{h \in \mathcal{H}_3} R(h) \leq \min_{h \in \mathcal{H}_2} R(h) \leq \min_{h \in \mathcal{H}_1} R(h)$

  2. Estimation error is non-decreasing (in expectation, for fixed $n$): the uniform convergence gap grows with the complexity of $\mathcal{H}$.

The optimal class $\mathcal{H}^*$ minimizes the sum. It is neither the smallest nor the largest class.

Intuition

A bigger class always contains the best functions from a smaller class, so the approximation error can only improve. But a bigger class also gives ERM more room to overfit, so the estimation error grows. The sweet spot balances these opposing forces.

Proof Sketch

Part 1: $\mathcal{H}_1 \subset \mathcal{H}_2$ implies $\min_{\mathcal{H}_2} R(h) \leq \min_{\mathcal{H}_1} R(h)$ by minimizing over a larger set. Part 2: the uniform convergence bound for $\mathcal{H}_2$ depends on the complexity of $\mathcal{H}_2$ (e.g., VC dimension), which is at least that of $\mathcal{H}_1$.

Why It Matters

This is the formal version of the bias-variance tradeoff. It explains why increasing model capacity helps up to a point and then hurts (in classical statistics). It also frames the puzzle of modern overparameterized models: why does estimation error not blow up when $\mathcal{H}$ is massive? The answer involves implicit regularization and is covered in the double descent and benign overfitting topics.

Failure Mode

The monotonicity of estimation error relies on uniform convergence bounds. For overparameterized models, uniform convergence bounds are vacuous, and the actual estimation error may decrease with increasing class size. This is the "interpolation regime" studied in double descent.
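Part 1 of the proposition is easy to verify empirically. This sketch fits nested polynomial classes to noiseless data, so the in-sample MSE tracks pure approximation error (the target function and degree range are illustrative assumptions of mine):

```python
import numpy as np

rng = np.random.default_rng(2)
f_star = lambda x: np.sin(2 * np.pi * x)   # assumed true function
x = rng.uniform(0, 1, 50_000)
y = f_star(x)                              # noiseless: fit error = approximation error

# Nested classes: polynomials of degree 1, 2, ..., 7, each containing the last.
approx_errors = []
for degree in range(1, 8):
    coeffs = np.polyfit(x, y, degree)
    approx_errors.append(float(np.mean((np.polyval(coeffs, x) - y) ** 2)))
print([round(e, 4) for e in approx_errors])
```

Each entry is at most the previous one: growing the class never hurts approximation, exactly as part 1 states.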

Parameterized vs Nonparametric

A parametric hypothesis class has a fixed-dimensional parameter space $\Theta \subseteq \mathbb{R}^p$. The number of parameters $p$ does not change with the sample size $n$. Examples: linear regression ($p = d + 1$), logistic regression, fixed-architecture neural networks.

A nonparametric hypothesis class allows the effective number of parameters to grow with $n$. Examples: $k$-nearest neighbors (stores all training points), kernel methods (one coefficient per data point), random forests (depth grows with $n$).

The distinction matters for generalization bounds: parametric classes have fixed complexity, while nonparametric classes require bounds that account for growing complexity.
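One way to see the distinction is to count the numbers each predictor must store. This is an illustrative accounting sketch, not a formal complexity measure:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 10
for n in (100, 1_000, 10_000):
    X = rng.normal(size=(n, d))
    y = rng.normal(size=n)
    # Parametric: linear regression stores d + 1 coefficients, whatever n is.
    w, *_ = np.linalg.lstsq(np.c_[X, np.ones(n)], y, rcond=None)
    # Nonparametric: 1-nearest-neighbor "stores" the entire training set.
    nn_memory = X.size + y.size
    print(f"n = {n:>6}: parametric = {w.size} numbers, 1-NN = {nn_memory} numbers")
```

The parametric count stays at $d + 1 = 11$ while the nonparametric count scales linearly with $n$, which is why their generalization analyses need different tools.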

Common Confusions

Watch Out

More parameters does not always mean bigger hypothesis class

A neural network with $10^9$ parameters might have effective complexity much less than $10^9$ due to symmetries, implicit regularization from SGD, and architectural constraints. The number of parameters is an upper bound on the "size" of the hypothesis class, not the effective size. VC dimension, Rademacher complexity, and PAC-Bayes bounds attempt to measure effective complexity.

Watch Out

The hypothesis class is chosen before seeing data

In the ERM framework, $\mathcal{H}$ is fixed before observing the training set. Using the training data to select $\mathcal{H}$ (e.g., choosing the network architecture via cross-validation) introduces a dependence that invalidates simple ERM bounds. Structural risk minimization (SRM) and model selection theory handle this case.

Watch Out

Approximation error zero does not mean the problem is easy

In the realizable setting, $f^* \in \mathcal{H}$ and the approximation error is zero. But the estimation error can still be large if $\mathcal{H}$ is complex. Realizability makes the problem easier (you only fight estimation error), but not trivial.

Exercises

ExerciseCore

Problem

Consider two hypothesis classes: $\mathcal{H}_1$ is all linear functions from $\mathbb{R}^{10}$ to $\mathbb{R}$ and $\mathcal{H}_2$ is all polynomials of degree at most 3 from $\mathbb{R}^{10}$ to $\mathbb{R}$. Which has smaller approximation error? Which has smaller estimation error (for fixed $n$)? What is the relationship $\mathcal{H}_1 \subset \mathcal{H}_2$?

ExerciseAdvanced

Problem

Suppose the true function is $f^*(x) = \sin(x_1)$ and we use ERM with 0-1 loss. Compare the excess risk decomposition for: (a) $\mathcal{H}_{\text{lin}}$ = linear classifiers, (b) $\mathcal{H}_{\text{nn}}$ = two-layer neural networks with 100 hidden units. Which term dominates in each case, and why?

References

Canonical:

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 2, 5, and 6
  • Vapnik, Statistical Learning Theory (1998), Chapter 3
  • Devroye, Gyorfi, Lugosi, A Probabilistic Theory of Pattern Recognition (1996), Chapters 4-6 (hypothesis classes and risk)

Current:

  • Mohri, Rostamizadeh, Talwalkar, Foundations of Machine Learning (2018), Chapter 2
  • Wainwright, High-Dimensional Statistics (2019), Chapter 4.1 (function classes and approximation)
  • Bartlett, Bousquet, Mendelson, "Local Rademacher complexities" (2005), Annals of Statistics. Refined analysis of estimation error for localized classes.

Next Topics

  • VC dimension: measuring the complexity of infinite hypothesis classes
  • Rademacher complexity: data-dependent complexity that adapts to the data distribution
  • PAC learning framework: formalization of when learning is possible

Last reviewed: April 2026
