Learning Theory Core
Sample Complexity Bounds
How many samples do you need to learn? Tight answers for finite hypothesis classes, VC classes, and Rademacher-bounded classes, plus matching lower bounds via Fano and Le Cam.
Why This Matters
Sample complexity is the right currency for comparing learning algorithms. Asking "how accurate is this algorithm?" without specifying the sample size is meaningless. The question that matters is: how many samples does an algorithm need to achieve excess risk at most $\epsilon$ with probability at least $1-\delta$?
Every upper bound on generalization error from ERM, VC theory, or Rademacher complexity can be inverted to give a sample complexity bound. Lower bounds tell you no algorithm can do better, regardless of computational power.
Mental Model
Think of sample complexity as a budget. You want to buy $\epsilon$-accurate generalization. The price depends on how complex your hypothesis class is (measured by $\log|\mathcal{H}|$, VC dimension $d$, or Rademacher complexity $\mathfrak{R}_n(\mathcal{H})$). Upper bounds tell you a price that is always sufficient. Lower bounds tell you the cheapest possible price.
The gap between upper and lower bounds measures how well we understand a learning problem.
Formal Setup
Fix an input space $\mathcal{X}$, output space $\mathcal{Y}$, a loss function $\ell$ bounded in $[0, 1]$, a hypothesis class $\mathcal{H}$, and accuracy parameters $\epsilon, \delta \in (0, 1)$.
Sample Complexity
The sample complexity $n(\epsilon, \delta)$ of learning $\mathcal{H}$ to accuracy $\epsilon$ with confidence $1-\delta$ is the smallest $n$ such that there exists an algorithm $A$ satisfying:

$$\sup_{\mathcal{D}} \; \Pr_{S \sim \mathcal{D}^n}\!\left[ L_{\mathcal{D}}(A(S)) - \inf_{h \in \mathcal{H}} L_{\mathcal{D}}(h) > \epsilon \right] \le \delta$$
The supremum is over all distributions $\mathcal{D}$ on $\mathcal{X} \times \mathcal{Y}$.
This is a worst-case definition. The distribution is adversarial.
Main Theorems
Sample Complexity of Finite Hypothesis Classes
Statement
For a finite hypothesis class $\mathcal{H}$ with loss in $[0, 1]$, ERM achieves excess risk at most $\epsilon$ with probability at least $1-\delta$ whenever:

$$n \ge \frac{2 \log(2|\mathcal{H}|/\delta)}{\epsilon^2}$$

Equivalently, the sample complexity satisfies $n(\epsilon, \delta) = O\!\left(\frac{\log(|\mathcal{H}|/\delta)}{\epsilon^2}\right)$.
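As a sanity check on the scaling, here is a minimal sketch (the function name is illustrative, not from any library) that plugs numbers into this bound:

```python
import math

def finite_class_sample_complexity(H_size: int, eps: float, delta: float) -> int:
    """Samples sufficient for ERM on a finite class with loss in [0, 1]:
    n >= 2 * log(2|H|/delta) / eps^2 (Hoeffding plus a union bound)."""
    return math.ceil(2 * math.log(2 * H_size / delta) / eps ** 2)

# Halving eps quadruples the budget; squaring |H| only doubles the log term.
n_coarse = finite_class_sample_complexity(1000, eps=0.1, delta=0.05)   # 2120
n_fine = finite_class_sample_complexity(1000, eps=0.05, delta=0.05)    # 8478
```

Note how mild the dependence on $|\mathcal{H}|$ is: going from 1,000 to 1,000,000 hypotheses only doubles the logarithm.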
Intuition
Each hypothesis needs roughly $1/\epsilon^2$ samples to pin down its population risk (by Hoeffding). The union bound over $|\mathcal{H}|$ hypotheses costs a factor of $\log|\mathcal{H}|$, not $|\mathcal{H}|$, because the bound enters through an exponential inequality.
Proof Sketch
From the ERM generalization bound for finite classes, with probability $1-\delta$:

$$\sup_{h \in \mathcal{H}} \left| L_{\mathcal{D}}(h) - \hat{L}_S(h) \right| \le \sqrt{\frac{\log(2|\mathcal{H}|/\delta)}{2n}}$$

Set this $\le \epsilon/2$ and solve for $n$. Since the ERM hypothesis $\hat{h}$ satisfies $\hat{L}_S(\hat{h}) \le \hat{L}_S(h)$ for every $h \in \mathcal{H}$, the excess risk is at most $2 \cdot \epsilon/2 = \epsilon$.
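A small self-contained simulation (all parameters illustrative) checking that the uniform deviation bound holds empirically for a random finite class of biased-coin hypotheses:

```python
import math
import random

random.seed(0)
K, n, delta = 50, 400, 0.05
risks = [random.random() for _ in range(K)]            # true risk of each hypothesis
bound = math.sqrt(math.log(2 * K / delta) / (2 * n))   # uniform-deviation bound

trials, violations = 200, 0
for _ in range(trials):
    # empirical risk of each hypothesis from n i.i.d. {0,1} losses
    sup_dev = max(
        abs(sum(random.random() < p for _ in range(n)) / n - p)
        for p in risks
    )
    if sup_dev > bound:
        violations += 1

# The bound should fail in at most ~delta fraction of trials (usually far fewer,
# since the union bound is conservative).
```

In practice the empirical violation rate is far below $\delta$, which is one concrete way to see the looseness of the union bound.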
Why It Matters
This is the baseline sample complexity result. It shows that finite classes are always learnable, and the cost is logarithmic in the class size. A class with $2^k$ hypotheses needs only $O(k/\epsilon^2)$ samples, not $O(2^k)$.
Failure Mode
The bound is vacuous for infinite classes. It also gives no help when $\mathcal{H}$ is finite but astronomically large (e.g., all possible decision trees of a given depth), because the $\log|\mathcal{H}|$ term may be too large to be useful.
Sample Complexity via VC Dimension
Statement
For binary classification with 0-1 loss, if $\mathcal{H}$ has VC dimension $d$:

Upper bound: There exists a constant $C$ such that ERM achieves excess risk $\epsilon$ with probability $1-\delta$ whenever:

$$n \ge C \cdot \frac{d \log(1/\epsilon) + \log(1/\delta)}{\epsilon^2}$$

Lower bound: There exists a constant $c > 0$ such that for any algorithm:

$$n(\epsilon, \delta) \ge c \cdot \frac{d + \log(1/\delta)}{\epsilon^2}$$
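To see the two bounds side by side, here is an illustrative sketch with the unspecified constants $C$ and $c$ set to 1 (the theorem does not pin them down; the point is the shape, not the numbers):

```python
import math

def vc_upper(d: int, eps: float, delta: float, C: float = 1.0) -> float:
    """Sufficient sample size, up to the unspecified constant C."""
    return C * (d * math.log(1 / eps) + math.log(1 / delta)) / eps ** 2

def vc_lower(d: int, eps: float, delta: float, c: float = 1.0) -> float:
    """Necessary sample size, up to the unspecified constant c."""
    return c * (d + math.log(1 / delta)) / eps ** 2

# The ratio of upper to lower bound grows like log(1/eps) as eps shrinks.
ratios = [vc_upper(10, eps, 0.05) / vc_lower(10, eps, 0.05)
          for eps in (0.1, 0.01, 0.001)]
```

The growing ratio is exactly the $\log(1/\epsilon)$ gap between the two bounds.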
Intuition
VC dimension replaces $\log|\mathcal{H}|$ for infinite classes. The Sauer-Shelah lemma shows that the effective number of behaviors of $\mathcal{H}$ on $n$ points is at most $(en/d)^d$. Taking logs recovers $d \log(en/d)$, which leads to a sample complexity of $O(d/\epsilon^2)$ up to log factors. The lower bound shows this is tight.
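The polynomial-versus-exponential contrast in the Sauer-Shelah bound can be made concrete with a two-line computation (function name is illustrative):

```python
import math

def sauer_shelah(n: int, d: int) -> float:
    """Upper bound (e*n/d)^d on the number of distinct labelings a class of
    VC dimension d can realize on n points (valid for n >= d)."""
    return (math.e * n / d) ** d

d = 5
few = sauer_shelah(200, d)   # polynomial in n: about 1.5e10
all_labelings = 2 ** 200     # exponential in n: about 1.6e60
```

Even though $\mathcal{H}$ may be infinite, it behaves on any 200 points like a finite class of at most $\sim 10^{10}$ hypotheses, and the finite-class machinery applies.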
Proof Sketch
Upper bound: Apply the VC generalization bound: with probability $1-\delta$, the uniform deviation is $O\!\left(\sqrt{\frac{d \log(n/d) + \log(1/\delta)}{n}}\right)$. Setting this $\le \epsilon$ and solving for $n$ gives $n = O\!\left(\frac{d \log(1/\epsilon) + \log(1/\delta)}{\epsilon^2}\right)$ (the log factor is manageable because $n$ appears on both sides, and the solution satisfies $\log(n/d) = O(\log(1/\epsilon))$, which simplifies the expression).
Lower bound: Construct a set of distributions that are pairwise hard to distinguish. Apply Fano's inequality.
Why It Matters
This result characterizes PAC learnability for binary classification: $\mathcal{H}$ is PAC learnable if and only if its VC dimension is finite. The sample complexity is $\Theta(d/\epsilon^2)$ up to log factors, giving a tight characterization.
Failure Mode
The VC bound applies only to binary classification with 0-1 loss. For regression, multi-class classification, or other loss functions, you need different complexity measures (pseudo-dimension, Natarajan dimension, or Rademacher complexity). The constants are often not known tightly, making the bound hard to use for choosing $n$ in practice.
Sample Complexity via Rademacher Complexity
Statement
If the Rademacher complexity of the loss class satisfies $\mathfrak{R}_n(\ell \circ \mathcal{H}) \le r(n)$, then ERM achieves excess risk $\epsilon$ with probability $1-\delta$ whenever:

$$4\,r(n) + 2\sqrt{\frac{\log(2/\delta)}{2n}} \le \epsilon$$

For many classes, $r(n) = O(\sqrt{d/n})$, giving sample complexity $O(d/\epsilon^2)$.
Intuition
Rademacher complexity directly measures how well the class can fit random noise. If the class cannot fit noise well (small $\mathfrak{R}_n$), it cannot overfit real data either. The sample complexity is determined by how fast $r(n)$ decays with $n$.
Proof Sketch
The Rademacher generalization bound states that with probability $1-\delta$, for all $h \in \mathcal{H}$:

$$L_{\mathcal{D}}(h) \le \hat{L}_S(h) + 2\mathfrak{R}_n(\ell \circ \mathcal{H}) + \sqrt{\frac{\log(2/\delta)}{2n}}$$

Set the deviation terms $\le \epsilon/2$ and solve for $n$; the factor of two covers applying the bound at both the ERM output $\hat{h}$ and the best hypothesis $h^*$.
Why It Matters
Rademacher bounds are data-dependent: you can estimate the empirical Rademacher complexity $\hat{\mathfrak{R}}_S$ from the training sample itself. This gives tighter sample complexity for specific distributions, not just worst-case over all distributions. For linear classes in $\mathbb{R}^d$ with bounded norm, $\mathfrak{R}_n = O(\sqrt{d/n})$, recovering the VC result without going through combinatorics.
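For a class where the supremum over hypotheses is computable exactly, estimating $\hat{\mathfrak{R}}_S$ by Monte Carlo is straightforward. A sketch for threshold classifiers $h_t(x) = \mathbf{1}[x \ge t]$ (function name and parameters are illustrative):

```python
import random

def empirical_rademacher_thresholds(n: int, num_sigma: int = 300, seed: int = 0) -> float:
    """Monte Carlo estimate of E_sigma[ sup_t (1/n) sum_i sigma_i * 1[x_i >= t] ]
    for threshold classifiers on n points; only the ordering of the x_i matters."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_sigma):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        # each threshold selects a suffix of the sorted sample, so the supremum
        # is the best suffix sum of the signs (including the empty suffix = 0)
        best = acc = 0.0
        for s in reversed(sigma):
            acc += s
            best = max(best, acc)
        total += best / n
    return total / num_sigma

# Decays roughly like 1/sqrt(n), as predicted for a VC-dimension-1 class.
r_small = empirical_rademacher_thresholds(50)
r_large = empirical_rademacher_thresholds(800)
```

The exact inner supremum is what makes this class tractable; for richer classes the supremum itself must be approximated, which is where the practical difficulty lies.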
Failure Mode
Computing or bounding $\mathfrak{R}_n$ can be difficult for complex hypothesis classes (e.g., deep neural networks). The resulting bounds are often loose in practice and do not explain the generalization of overparameterized models.
Lower Bound Methods
Upper bounds say "this many samples suffice." Lower bounds say "no algorithm can do better." Two classical methods:
Fano's method. Construct $M$ distributions $P_1, \dots, P_M$ such that: (1) any two require different hypotheses (i.e., their optimal hypotheses are $2\epsilon$-separated in risk), and (2) the distributions are hard to tell apart (measured by KL divergence). Fano's inequality then gives:

$$\inf_{A} \max_{j} \Pr_{P_j}[\text{error}] \ge 1 - \frac{n \max_{j,k} \mathrm{KL}(P_j \| P_k) + \log 2}{\log M}$$
Le Cam's method. A two-hypothesis version. Construct $P_0$ and $P_1$ with different optimal hypotheses. The minimax risk is at least $\frac{\epsilon}{2}\left(1 - \mathrm{TV}(P_0^n, P_1^n)\right)$. Simpler than Fano but gives weaker results (no $\log M$ factor).
Fano Method Lower Bound
Statement
Let $P_1, \dots, P_M$ be distributions such that for all $j \ne k$, the associated optimal hypotheses satisfy $L_{P_j}(h_k^*) - L_{P_j}(h_j^*) \ge 2\epsilon$ and $\mathrm{KL}(P_j \| P_k) \le \beta$. Then for any estimator $\hat{h}$:

$$\max_{j} \Pr_{P_j}\!\left[ L_{P_j}(\hat{h}) - L_{P_j}(h_j^*) \ge \epsilon \right] \ge 1 - \frac{n\beta + \log 2}{\log M}$$
Intuition
If you have $M$ hypotheses that all look similar (low KL divergence) but require different answers, then no algorithm can reliably pick the right one unless $n$ is large enough for the total information $n\beta$ to exceed $\log M$.
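The trade-off between $n\beta$ and $\log M$ is easy to explore numerically. A sketch (function names illustrative) that inverts the Fano bound to find the smallest $n$ no longer ruled out:

```python
import math

def fano_error_lb(n: int, M: int, kl_max: float) -> float:
    """Fano lower bound on the worst-case error probability of any estimator
    choosing among M candidates from n i.i.d. samples, given pairwise
    per-sample KL divergence at most kl_max; clipped to [0, 1]."""
    lb = 1 - (n * kl_max + math.log(2)) / math.log(M)
    return max(0.0, min(1.0, lb))

def min_samples(M: int, kl_max: float, target: float = 0.5) -> int:
    """Smallest n at which Fano no longer forces error above `target`."""
    n = 0
    while fano_error_lb(n, M, kl_max) > target:
        n += 1
    return n

# With M = 2**d well-separated distributions and per-sample KL ~ eps^2,
# the required n scales like d / eps^2 -- the shape of the VC lower bound.
n_needed = min_samples(M=2 ** 10, kl_max=0.01 ** 2)
```

Doubling $d$ (so $\log M$ doubles) roughly doubles `n_needed`, while halving $\epsilon$ quadruples it.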
Proof Sketch
Apply Fano's inequality to the $M$-ary hypothesis testing problem. The mutual information between the index $J$ and the sample $S$ is at most $n \max_{j,k} \mathrm{KL}(P_j \| P_k)$ (by the data processing inequality and the KL bound). Fano's inequality bounds the probability of error in terms of $I(J; S)$ and $\log M$.
Why It Matters
This is the workhorse of minimax lower bounds. Nearly every known sample complexity lower bound in nonparametric statistics and learning theory uses some form of Fano's method or its generalizations.
Failure Mode
Constructing the right packing (the distributions) is the hard part. A poor choice of packing gives a weak lower bound. The bound is also limited to worst-case analysis; it says nothing about specific distributions that might be easier.
The Gap Between Upper and Lower Bounds
For binary classification with VC dimension $d$:
- Upper bound: $O\!\left(\frac{d \log(1/\epsilon) + \log(1/\delta)}{\epsilon^2}\right)$ (from the VC bound with log factor)
- Lower bound: $\Omega\!\left(\frac{d + \log(1/\delta)}{\epsilon^2}\right)$ (from Fano)
The gap is a $\log(1/\epsilon)$ factor. Whether this log factor is necessary was an open question for decades. Hanneke (2016) showed the optimal sample complexity is $\Theta\!\left(\frac{d + \log(1/\delta)}{\epsilon}\right)$ for realizable binary classification, closing the gap; the optimal algorithm is a majority vote over ERMs, and plain ERM can be forced to pay the extra log factor.
For agnostic (noisy) learning, the picture is more nuanced. The optimal rate depends on whether you measure excess risk or absolute risk, and whether the algorithm is required to be an ERM or can be arbitrary.
Canonical Examples
Learning threshold classifiers on the line
Let $\mathcal{X} = \mathbb{R}$ and $\mathcal{H} = \{ h_t : t \in \mathbb{R} \}$ with $h_t(x) = \mathbf{1}[x \ge t]$. This class has VC dimension $1$. The sample complexity is $\Theta\!\left(\frac{\log(1/\delta)}{\epsilon^2}\right)$.
With $n$ samples, ERM achieves excess risk $O(1/\sqrt{n})$ with high probability. With $10{,}000$ samples, that is excess risk on the order of $0.01$.
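A simulation sketch of this example (all names and parameters are illustrative): ERM over thresholds under label noise, confirming that the estimate tightens as $n$ grows.

```python
import random

def erm_threshold(data):
    """ERM for h_t(x) = 1[x >= t]: one pass over the sorted sample,
    tracking empirical mistakes as the threshold moves past each point."""
    data = sorted(data)
    errors = sum(1 for _, y in data if y == 0)   # t below all points: predict 1 everywhere
    best_err, best_t = errors, data[0][0]
    for x, y in data:
        errors += 1 if y == 1 else -1            # t just above x: prediction at x flips to 0
        if errors < best_err:
            best_err, best_t = errors, x         # approximate: exact optimum is just above x
    return best_t

rng = random.Random(0)

def sample(n, t_star=0.5, noise=0.2):
    """Uniform x in [0, 1]; true label 1[x >= t_star], flipped with prob `noise`."""
    return [(x, int((x >= t_star) != (rng.random() < noise)))
            for x in (rng.random() for _ in range(n))]

# |t_hat - t_star| shrinks with n; here excess risk is proportional to it.
err_small = sum(abs(erm_threshold(sample(100)) - 0.5) for _ in range(30)) / 30
err_large = sum(abs(erm_threshold(sample(5000)) - 0.5) for _ in range(30)) / 30
```

The single-pass scan works because the empirical error of $h_t$ changes by exactly $\pm 1$ each time $t$ crosses a data point.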
Linear classifiers in d dimensions
Halfspaces in $\mathbb{R}^d$ have VC dimension $d + 1$. Sample complexity is $\Theta(d/\epsilon^2)$ up to log factors. In $d$ dimensions at accuracy $\epsilon = 0.1$, you need roughly $100d$ samples. At $d = 1000$, that is on the order of $10^5$ samples. This explains why high-dimensional linear classification needs large datasets.
Common Confusions
Sample complexity is not the same as convergence rate
Sample complexity asks: how many samples for $\epsilon$-accuracy? Convergence rate asks: how fast does the excess risk decrease with $n$? They are related by inversion, but people often confuse the two. A rate of $O(1/\sqrt{n})$ corresponds to a sample complexity of $O(1/\epsilon^2)$. A rate of $O(1/n)$ corresponds to $O(1/\epsilon)$, which is called a "fast rate."
Computational complexity is separate from sample complexity
Sample complexity ignores computation. An algorithm might need only polynomially many samples but require exponential time to process them. Cryptographic hardness results show that for some problems, computationally efficient algorithms provably need more samples than information-theoretically optimal (but computationally unbounded) ones.
Bounds with unspecified constants are hard to use in practice
The statement $n = O(d/\epsilon^2)$ hides a constant. That constant might be 1 or 1000. For practical sample size planning, you need either explicit constants (which are rarely tight) or empirical methods like learning curves. Sample complexity theory tells you the right scaling, not the right number.
Key Takeaways
- For finite $\mathcal{H}$: $n = O\!\left(\frac{\log(|\mathcal{H}|/\delta)}{\epsilon^2}\right)$
- For VC dimension $d$: $n = \Theta\!\left(\frac{d}{\epsilon^2}\right)$ up to log factors
- For Rademacher complexity $r(n)$: solve $r(n) \approx \epsilon$ for $n$
- Lower bounds via Fano require constructing hard packing sets
- Sample complexity is the right way to compare learning algorithms across different regimes
- The gap between upper and lower bounds measures our ignorance about a problem
Exercises
Problem
A hypothesis class has 500 elements. How many samples does ERM need to achieve excess risk at most $\epsilon$ with probability at least $1-\delta$? Express your answer in terms of $\epsilon$ and $\delta$.
Problem
The class of halfspaces in $\mathbb{R}^{10}$ has VC dimension 11. Use the VC sample complexity bound to determine the order of magnitude of samples needed for accuracy $\epsilon$. Then explain why this is still useful despite the loose constants.
Problem
Construct a Fano-style lower bound argument showing that learning threshold classifiers on $[0, 1]$ requires $\Omega(1/\epsilon^2)$ samples. Specifically: choose threshold values spaced $\Theta(\epsilon)$ apart, bound the KL divergence between the induced distributions, and apply Fano's inequality.
References
Canonical:
- Shalev-Shwartz & Ben-David, Understanding Machine Learning, Chapters 3-6
- Vapnik, Statistical Learning Theory (1998), Chapters 3-4
Current:
- Mohri, Rostamizadeh, Talwalkar, Foundations of Machine Learning (2018), Chapter 3
- Wainwright, High-Dimensional Statistics (2019), Chapter 15 (minimax lower bounds)
Next Topics
The natural extensions from sample complexity:
- Algorithmic stability: an alternative route to generalization that bypasses uniform convergence
- PAC-Bayes: bounds that depend on a prior over hypotheses, often tighter for overparameterized models
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- VC Dimension (Layer 2)
- Empirical Risk Minimization (Layer 2)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Rademacher Complexity (Layer 3)