
Learning Theory Core

Sample Complexity Bounds

How many samples do you need to learn? Tight answers for finite hypothesis classes, VC classes, and Rademacher-bounded classes, plus matching lower bounds via Fano and Le Cam.

Core · Tier 1 · Stable · ~50 min

Why This Matters

Sample complexity is the right currency for comparing learning algorithms. Asking "how accurate is this algorithm?" without specifying the sample size is meaningless. The question that matters is: how many samples $n$ does an algorithm need to achieve excess risk at most $\epsilon$ with probability at least $1 - \delta$?

Every upper bound on generalization error from ERM, VC theory, or Rademacher complexity can be inverted to give a sample complexity bound. Lower bounds tell you no algorithm can do better, regardless of computational power.

Mental Model

Think of sample complexity as a budget. You want to buy $\epsilon$-accurate generalization. The price depends on how complex your hypothesis class is (measured by $\log|\mathcal{H}|$, VC dimension $d$, or Rademacher complexity $\mathfrak{R}_n$). Upper bounds tell you a price that is always sufficient. Lower bounds tell you the cheapest possible price.

The gap between upper and lower bounds measures how well we understand a learning problem.

Formal Setup

Fix an input space $\mathcal{X}$, output space $\mathcal{Y}$, a loss function $\ell$ bounded in $[0, 1]$, a hypothesis class $\mathcal{H}$, and accuracy parameters $\epsilon, \delta > 0$.

Definition

Sample Complexity

The sample complexity of learning $\mathcal{H}$ to accuracy $\epsilon$ with confidence $1 - \delta$ is the smallest $n$ such that there exists an algorithm $A$ satisfying:

$$\sup_{\mathcal{D}} \Pr_{S \sim \mathcal{D}^n}\left[R(A(S)) - \min_{h \in \mathcal{H}} R(h) > \epsilon\right] \leq \delta$$

The supremum is over all distributions $\mathcal{D}$ on $\mathcal{X} \times \mathcal{Y}$.

This is a worst-case definition. The distribution $\mathcal{D}$ is adversarial.

Main Theorems

Theorem

Sample Complexity of Finite Hypothesis Classes

Statement

For a finite hypothesis class $\mathcal{H}$ with loss in $[0,1]$, ERM achieves excess risk at most $\epsilon$ with probability at least $1 - \delta$ whenever:

$$n \geq \frac{2(\log|\mathcal{H}| + \log(2/\delta))}{\epsilon^2}$$

Equivalently, the sample complexity satisfies $n(\epsilon, \delta, \mathcal{H}) = O\!\left(\frac{\log|\mathcal{H}| + \log(1/\delta)}{\epsilon^2}\right)$.
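The bound above inverts directly into a sample-size calculator. A minimal sketch, with illustrative function name and example numbers (not from the source):

```python
import math

def finite_class_sample_complexity(num_hypotheses: int, epsilon: float, delta: float) -> int:
    """Samples sufficient for ERM over a finite class to reach excess risk
    <= epsilon with probability >= 1 - delta, via
    n >= 2*(log|H| + log(2/delta)) / epsilon^2."""
    n = 2.0 * (math.log(num_hypotheses) + math.log(2.0 / delta)) / epsilon**2
    return math.ceil(n)

n_coarse = finite_class_sample_complexity(1000, 0.1, 0.01)   # about 2.4k samples
n_fine = finite_class_sample_complexity(1000, 0.05, 0.01)    # halving epsilon roughly quadruples n
```

Note the $1/\epsilon^2$ term dominates: tightening accuracy is far more expensive than enlarging the class, whose cost enters only logarithmically.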

Intuition

Each hypothesis needs roughly $1/\epsilon^2$ samples to pin down its population risk (by Hoeffding). The union bound over $|\mathcal{H}|$ hypotheses costs a factor of $\log|\mathcal{H}|$, not $|\mathcal{H}|$, because the bound enters through an exponential inequality.

Proof Sketch

From the ERM generalization bound for finite classes, with probability $\geq 1 - \delta$:

$$\sup_{h \in \mathcal{H}} |R(h) - \hat{R}_n(h)| \leq \sqrt{\frac{\log|\mathcal{H}| + \log(2/\delta)}{2n}}$$

Set this $\leq \epsilon/2$ and solve for $n$, which gives the stated bound. Since the ERM hypothesis $\hat{h}$ satisfies $R(\hat{h}) - \min_h R(h) \leq 2 \sup_h |R(h) - \hat{R}_n(h)|$, the excess risk is at most $\epsilon$.

Why It Matters

This is the baseline sample complexity result. It shows that finite classes are always learnable, and the cost is logarithmic in the class size. A class with $2^d$ hypotheses needs only $O(d/\epsilon^2)$ samples, not $O(2^d/\epsilon^2)$.

Failure Mode

The bound is vacuous for infinite classes. It also gives no help when $|\mathcal{H}|$ is finite but astronomically large (e.g., all possible decision trees of a given depth), because the $\log|\mathcal{H}|$ term may be too large to be useful.

Theorem

Sample Complexity via VC Dimension

Statement

For binary classification with 0-1 loss, if $\mathcal{H}$ has VC dimension $d < \infty$:

Upper bound: There exists a constant $C > 0$ such that ERM achieves excess risk $\leq \epsilon$ with probability $\geq 1 - \delta$ whenever:

$$n \geq C \cdot \frac{d + \log(1/\delta)}{\epsilon^2}$$

Lower bound: There exist constants $c_1, c_2 > 0$ such that for any algorithm:

$$n(\epsilon, \delta, \mathcal{H}) \geq c_1 \cdot \frac{d}{\epsilon^2} + c_2 \cdot \frac{\log(1/\delta)}{\epsilon^2}$$

Intuition

VC dimension $d$ replaces $\log|\mathcal{H}|$ for infinite classes. The Sauer-Shelah lemma shows that the effective number of behaviors of $\mathcal{H}$ on $n$ points is at most $(en/d)^d$. Taking logs recovers $d \log(n/d)$, which leads to a sample complexity of $O(d/\epsilon^2)$ up to log factors. The lower bound shows this is tight.

Proof Sketch

Upper bound: Apply the VC generalization bound: with probability $\geq 1 - \delta$, the uniform deviation is $O(\sqrt{(d\log(n/d) + \log(1/\delta))/n})$. Setting this $\leq \epsilon$ and solving for $n$ gives $n = O(d/\epsilon^2 \cdot \log(d/\epsilon^2))$. Since $n$ appears on both sides, substitute a trial value of order $d/\epsilon^2$ and iterate; the self-reference costs only the logarithmic factor.
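Because $n$ appears on both sides through the $\log(n/d)$ term, solving for $n$ is naturally done by fixed-point iteration. A sketch with an assumed constant $C = 1$ (the theorem does not pin the constant down):

```python
import math

def vc_sample_complexity(d: int, epsilon: float, delta: float, C: float = 1.0) -> int:
    """Iterate n <- C*(d*log(n/d) + log(1/delta))/epsilon^2 to a fixed point.
    The update's sensitivity, C*d/(n*epsilon^2), shrinks as n grows,
    so the iteration settles quickly."""
    n = max(d / epsilon**2, math.e * d)  # start at the bound without the log factor
    for _ in range(100):
        n_new = C * (d * math.log(max(n / d, math.e)) + math.log(1.0 / delta)) / epsilon**2
        if abs(n_new - n) < 0.5:
            break
        n = n_new
    return math.ceil(n_new)

n10 = vc_sample_complexity(d=10, epsilon=0.1, delta=0.05)
```

With $C = 1$ this is a scaling illustration, not a usable constant; the point is the $\log(n/d)$ self-reference inflating $d/\epsilon^2$ by only a log factor.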

Lower bound: Construct a set of $2^d$ distributions that are pairwise hard to distinguish. Apply Fano's inequality.

Why It Matters

This result characterizes PAC learnability for binary classification: $\mathcal{H}$ is PAC learnable if and only if its VC dimension is finite. The sample complexity is $\Theta(d/\epsilon^2)$ up to log factors, giving a tight characterization.

Failure Mode

The VC bound applies only to binary classification with 0-1 loss. For regression, multi-class classification, or other loss functions, you need different complexity measures (pseudo-dimension, Natarajan dimension, or Rademacher complexity). The constants $C, c_1, c_2$ are often not known tightly, making the bound hard to use for choosing $n$ in practice.

Theorem

Sample Complexity via Rademacher Complexity

Statement

If the Rademacher complexity of the loss class satisfies $\mathfrak{R}_n(\ell \circ \mathcal{H}) \leq R(n)$, then ERM achieves excess risk $\leq \epsilon$ with probability $\geq 1 - \delta$ whenever:

$$n \text{ is chosen so that } 2R(n) + \sqrt{\frac{\log(1/\delta)}{2n}} \leq \epsilon$$

For many classes, $R(n) = O(1/\sqrt{n})$, giving sample complexity $O(1/\epsilon^2)$.
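When $R(n) = c/\sqrt{n}$ for a known constant $c$, both terms in the condition scale as $1/\sqrt{n}$ and the inversion is closed-form. A sketch, where the decay constant $c$ is an assumed input:

```python
import math

def rademacher_sample_complexity(c: float, epsilon: float, delta: float) -> int:
    """Smallest n with 2*R(n) + sqrt(log(1/delta)/(2n)) <= epsilon,
    assuming R(n) = c/sqrt(n). Multiplying through by sqrt(n) gives
    n = ((2c + sqrt(log(1/delta)/2)) / epsilon)^2."""
    return math.ceil(((2.0 * c + math.sqrt(math.log(1.0 / delta) / 2.0)) / epsilon) ** 2)
```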

Intuition

Rademacher complexity directly measures how well the class can fit random noise. If the class cannot fit noise well (small $\mathfrak{R}_n$), it cannot overfit real data either. The sample complexity is determined by how fast $\mathfrak{R}_n$ decays with $n$.

Proof Sketch

The Rademacher generalization bound states that with probability $\geq 1 - \delta$:

$$\sup_{h \in \mathcal{H}} (R(h) - \hat{R}_n(h)) \leq 2\mathfrak{R}_n(\ell \circ \mathcal{H}) + \sqrt{\frac{\log(1/\delta)}{2n}}$$

Set the right side $\leq \epsilon$ and solve for $n$.

Why It Matters

Rademacher bounds are data-dependent: you can estimate $\mathfrak{R}_n$ from the training sample itself. This gives tighter sample complexity for specific distributions, not just worst-case over all distributions. For linear classes in $\mathbb{R}^d$ with bounded norm, $\mathfrak{R}_n = O(\sqrt{d/n})$, recovering the VC result without going through combinatorics.
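The data-dependent claim can be made concrete: for simple classes the empirical Rademacher complexity is directly estimable by Monte Carlo over random sign vectors. A sketch for the threshold class used in the examples later in this page (function name and sample are illustrative):

```python
import random

def empirical_rademacher_thresholds(xs, rounds=2000, seed=0):
    """Monte Carlo estimate of E_sigma sup_t (1/n) sum_i sigma_i * 1[x_i >= t]
    for thresholds on the sample xs. For each sign vector, the sup over t
    is the best suffix sum under the sorted order (empty suffix allowed,
    corresponding to a threshold above every point)."""
    rng = random.Random(seed)
    xs_sorted = sorted(xs)
    n = len(xs_sorted)
    total = 0.0
    for _ in range(rounds):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        best = suffix = 0
        for s in reversed(sigma):
            suffix += s
            best = max(best, suffix)
        total += best / n
    return total / rounds

r_hat = empirical_rademacher_thresholds([i / 100 for i in range(100)])
```

The estimate decays like $1/\sqrt{n}$ as the sample grows, matching the VC-dimension-1 prediction for this class.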

Failure Mode

Computing or bounding $\mathfrak{R}_n$ can be difficult for complex hypothesis classes (e.g., deep neural networks). The resulting bounds are often loose in practice and do not explain the generalization of overparameterized models.

Lower Bound Methods

Upper bounds say "this many samples suffice." Lower bounds say "no algorithm can do better." Two classical methods:

Fano's method. Construct $M$ distributions $P_1, \ldots, P_M$ such that: (1) any two $P_i, P_j$ require different hypotheses (i.e., $\|h_i - h_j\| > \epsilon$), and (2) the distributions are hard to tell apart (small per-sample KL divergence). Fano's inequality then implies that driving the error probability below $1/2$ requires:

$$n \geq \frac{\tfrac{1}{2}\log M - \log 2}{\max_{i \neq j} \mathrm{KL}(P_i \| P_j)}$$

Le Cam's method. A two-hypothesis version. Construct $P_0$ and $P_1$ with different optimal hypotheses. The minimax risk is at least $(\epsilon/2)(1 - \mathrm{TV}(P_0^n, P_1^n))$. Simpler than Fano but gives weaker results (no $\log M$ factor).
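As a worked instance of Le Cam's method, consider estimating the mean of a coin that is either $\mathrm{Ber}(1/2 - \epsilon)$ or $\mathrm{Ber}(1/2 + \epsilon)$, bounding TV via Pinsker's inequality. A sketch (the two-point construction here is an illustrative choice, not from the source):

```python
import math

def bernoulli_kl(p: float, q: float) -> float:
    """KL(Ber(p) || Ber(q))."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def le_cam_lower_bound(epsilon: float, n: int) -> float:
    """Minimax risk lower bound (eps/2)*(1 - TV(P0^n, P1^n)) with
    P0 = Ber(1/2 - eps), P1 = Ber(1/2 + eps), using Pinsker's inequality
    TV(P0^n, P1^n) <= sqrt(n * KL(P0 || P1) / 2)."""
    kl = bernoulli_kl(0.5 - epsilon, 0.5 + epsilon)
    tv_upper = min(1.0, math.sqrt(n * kl / 2.0))
    return (epsilon / 2.0) * (1.0 - tv_upper)

bound = le_cam_lower_bound(0.1, 2)  # still bounded away from zero at n = 2
```

The bound decays as $n$ grows and becomes vacuous once the two product measures are statistically distinguishable, which is exactly the sample-size threshold Le Cam's method certifies.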

Theorem

Fano Method Lower Bound

Statement

Let $P_1, \ldots, P_M$ be distributions such that for all $i \neq j$, the associated optimal hypotheses satisfy $d(h_i, h_j) > 2\epsilon$ and the single-observation divergences satisfy $\mathrm{KL}(P_i \| P_j) \leq \beta$. Then for any estimator $\hat{h}$:

$$\max_{1 \leq i \leq M} \Pr_{S \sim P_i^n}[d(\hat{h}(S), h_i) > \epsilon] \geq 1 - \frac{n\beta + \log 2}{\log M}$$

Intuition

If you have $M$ hypotheses that all look similar (low KL divergence) but require different answers, then no algorithm can reliably pick the right one unless $n$ is large enough for the total information $n\beta$ to exceed $\log M$.

Proof Sketch

Apply Fano's inequality to the $M$-ary hypothesis testing problem. The mutual information between the index $i$ and the sample $S$ is at most $n\beta$ (by the data processing inequality and the KL bound). Fano's inequality bounds the probability of error in terms of $I(i; S)/\log M$.
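Reading the theorem as a constraint on $n$: to allow failure probability at most $\delta$, any algorithm needs $n\beta + \log 2 \geq (1-\delta)\log M$. A sketch of that inversion (the example numbers are illustrative):

```python
import math

def fano_min_samples(M: int, beta: float, delta: float) -> int:
    """Smallest n compatible with error probability <= delta under the Fano bound
    1 - (n*beta + log 2)/log M <= delta, where beta bounds the per-sample KL."""
    return max(0, math.ceil(((1.0 - delta) * math.log(M) - math.log(2.0)) / beta))

n_min = fano_min_samples(M=1024, beta=0.01, delta=0.05)
```

Note the roles: a richer packing (larger $M$) raises the floor, while more informative samples (larger $\beta$) lower it.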

Why It Matters

This is the workhorse of minimax lower bounds. Nearly every known sample complexity lower bound in nonparametric statistics and learning theory uses some form of Fano's method or its generalizations.

Failure Mode

Constructing the right packing (the $M$ distributions) is the hard part. A poor choice of packing gives a weak lower bound. The bound is also limited to worst-case analysis; it says nothing about specific distributions that might be easier.

The Gap Between Upper and Lower Bounds

For binary classification with VC dimension $d$:

  • Upper bound: $O((d \log(1/\epsilon) + \log(1/\delta))/\epsilon^2)$ (from the classical VC bound, with its log factor)
  • Lower bound: $\Omega((d + \log(1/\delta))/\epsilon^2)$ (from Fano)

The gap is a $\log(1/\epsilon)$ factor. In the agnostic setting, chaining arguments remove it, so the optimal rate is $\Theta(d/\epsilon^2)$. In the realizable setting, where the rate improves to order $1/\epsilon$, the classical ERM analysis carries an analogous $\log(1/\epsilon)$ factor, and whether it was necessary was an open question for decades. Hanneke (2016) exhibited an algorithm (a majority vote over ERMs on subsamples, not plain ERM) with optimal sample complexity $\Theta((d + \log(1/\delta))/\epsilon)$, closing the gap.

For agnostic (noisy) learning, the picture is more nuanced. The optimal rate depends on whether you measure excess risk or absolute risk, and whether the algorithm is required to be an ERM or can be arbitrary.

Canonical Examples

Example

Learning threshold classifiers on the line

Let $\mathcal{X} = [0, 1]$ and $\mathcal{H} = \{x \mapsto \mathbf{1}[x \geq t] : t \in [0,1]\}$. This class has VC dimension $d = 1$. The sample complexity is $\Theta(1/\epsilon^2)$.

With $n = 100$ samples, ERM achieves excess risk $\approx 0.1$ with high probability. With $n = 10{,}000$ samples, excess risk $\approx 0.01$.
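The shrinking excess risk is easy to check empirically by running ERM on synthetic noisy threshold data. A simulation sketch, with illustrative noise level and trial counts (with a fixed flip probability the noise satisfies a Massart condition, so the observed rate can even beat the worst-case $1/\sqrt{n}$):

```python
import random

def erm_threshold_excess_risk(n, true_t=0.5, noise=0.2, trials=100, seed=0):
    """Average excess 0-1 risk of ERM over thresholds 1[x >= t], with
    x ~ Uniform[0,1] and labels 1[x >= true_t] flipped w.p. `noise`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        xs = sorted(rng.random() for _ in range(n))
        ys = [int((x >= true_t) != (rng.random() < noise)) for x in xs]
        # Sweep the threshold across the order statistics, updating the
        # training error incrementally: moving one point to the predict-0
        # side adds an error if its label is 1, removes one otherwise.
        errors = n - sum(ys)          # threshold 0: predict 1 everywhere
        best_err, best_k = errors, 0
        for k in range(1, n + 1):
            errors += 1 if ys[k - 1] == 1 else -1
            if errors < best_err:
                best_err, best_k = errors, k
        t_hat = 0.0 if best_k == 0 else (1.0 if best_k == n else xs[best_k])
        # Excess risk over the Bayes rule 1[x >= true_t] is
        # (1 - 2*noise) * |t_hat - true_t| for uniform x.
        total += (1 - 2 * noise) * abs(t_hat - true_t)
    return total / trials

risk_small = erm_threshold_excess_risk(100)
risk_large = erm_threshold_excess_risk(1600)
```

The incremental sweep makes ERM run in $O(n \log n)$ per trial, so the scaling experiment stays cheap even at large $n$.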

Example

Linear classifiers in d dimensions

Halfspaces through the origin in $\mathbb{R}^d$ have VC dimension $d$ (affine halfspaces: $d+1$). Sample complexity is $\Theta(d/\epsilon^2)$. In $d = 100$ dimensions, you need roughly $100/\epsilon^2$ samples. At $\epsilon = 0.01$, that is $10^6$ samples. This explains why high-dimensional linear classification needs large datasets.

Common Confusions

Watch Out

Sample complexity is not the same as convergence rate

Sample complexity asks: how many samples for $\epsilon$-accuracy? Convergence rate asks: how fast does the excess risk decrease with $n$? They are related by inversion, but people often confuse the two. A rate of $O(1/\sqrt{n})$ corresponds to a sample complexity of $O(1/\epsilon^2)$. A rate of $O(1/n)$ corresponds to $O(1/\epsilon)$, which is called a "fast rate."

Watch Out

Computational complexity is separate from sample complexity

Sample complexity ignores computation. An algorithm might need only $O(d/\epsilon^2)$ samples but require exponential time to process them. Cryptographic hardness results show that for some problems, computationally efficient algorithms provably need more samples than information-theoretically optimal (but computationally unbounded) ones.

Watch Out

Bounds with unspecified constants are hard to use in practice

The statement $n = O(d/\epsilon^2)$ hides a constant. That constant might be 1 or 1000. For practical sample size planning, you need either explicit constants (which are rarely tight) or empirical methods like learning curves. Sample complexity theory tells you the right scaling, not the right number.
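The learning-curve route can be sketched directly: measure error at a few sample sizes, fit a power law in log-log coordinates, and extrapolate to the target accuracy. A minimal sketch on a synthetic curve (the curve shape and target are illustrative):

```python
import math

def fit_power_law(ns, errs):
    """Least-squares fit of err ~ a * n**(-b) in log-log coordinates."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(e) for e in errs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = -slope
    a = math.exp(my + b * mx)
    return a, b

def samples_for_accuracy(a, b, epsilon):
    """Invert epsilon = a * n**(-b) for n."""
    return (a / epsilon) ** (1.0 / b)

# Synthetic learning curve with the canonical 1/sqrt(n) shape: err = 2/sqrt(n).
ns = [100, 400, 1600, 6400]
a, b = fit_power_law(ns, [2 / math.sqrt(n) for n in ns])
n_needed = samples_for_accuracy(a, b, 0.01)  # recovers n = (2/0.01)^2 = 40000
```

On real learning curves the fit gives the empirical exponent $b$, which can then be compared against the theoretical $1/2$ (slow rate) or $1$ (fast rate).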

Key Takeaways

  • For finite $\mathcal{H}$: $n = O(\log|\mathcal{H}|/\epsilon^2)$
  • For VC dimension $d$: $n = \Theta(d/\epsilon^2)$ up to log factors
  • For Rademacher complexity $R(n)$: choose $n$ so that $2R(n) + \sqrt{\log(1/\delta)/(2n)} \leq \epsilon$
  • Lower bounds via Fano require constructing hard packing sets
  • Sample complexity is the right way to compare learning algorithms across different regimes
  • The gap between upper and lower bounds measures our ignorance about a problem

Exercises

ExerciseCore

Problem

A hypothesis class has 500 elements. How many samples does ERM need to achieve excess risk $\leq 0.05$ with probability $\geq 0.95$?

ExerciseAdvanced

Problem

The class of halfspaces in $\mathbb{R}^{10}$ has VC dimension 11. Use the VC sample complexity bound to determine the order of magnitude of samples needed for $\epsilon = 0.01$ accuracy. Then explain why this is still useful despite the loose constants.

ExerciseResearch

Problem

Construct a Fano-style lower bound argument showing that learning threshold classifiers on $[0,1]$ requires $n = \Omega(1/\epsilon^2)$ samples. Specifically: choose $M$ threshold values spaced $2\epsilon$ apart, bound the KL divergence between the induced distributions, and apply Fano's inequality.

References

Current:

  • Mohri, Rostamizadeh, Talwalkar, Foundations of Machine Learning (2018), Chapter 3
  • Wainwright, High-Dimensional Statistics (2019), Chapter 15 (minimax lower bounds)

Last reviewed: April 2026
