Learning Theory Core
PAC Learning Framework
The foundational formalization of what it means to learn from data: a concept is PAC-learnable if an algorithm can, with high probability, find a hypothesis that is approximately correct, using a polynomial number of samples.
Why This Matters
The PAC (Probably Approximately Correct) framework, introduced by Leslie Valiant in 1984, is the mathematical foundation for answering the most basic question in machine learning: when is it possible to learn from data?
Before PAC theory, machine learning was a collection of algorithms without a unifying theory of learnability. PAC provides the definitions and theorems that tell you whether a learning problem is solvable in principle, and how many samples you need.
Every generalization bound you encounter, from ERM to VC dimension to Rademacher complexity, is ultimately answering a question that PAC theory formalized. The sample complexity bounds page covers the quantitative details.
Mental Model
Think of a learner as a detective. The detective wants to identify an unknown rule (the target concept) but can only observe examples. PAC theory asks: how many examples does the detective need to be probably (with probability at least $1 - \delta$) approximately (to within error $\epsilon$) correct about the rule?
The "probably" accounts for the randomness of the training sample: you might get an unlucky draw. The "approximately" accounts for the fact that you do not need to identify the rule exactly; getting close is enough.
Formal Setup and Notation
Let $\mathcal{X}$ be an instance space (e.g., $\{0,1\}^n$ or $\mathbb{R}^d$). A concept is a function $c : \mathcal{X} \to \{0,1\}$. A concept class $\mathcal{C}$ is a collection of concepts.
There is an unknown distribution $\mathcal{D}$ over $\mathcal{X}$ and an unknown target concept $c \in \mathcal{C}$.
The learner receives a training sample $S = ((x_1, c(x_1)), \ldots, (x_m, c(x_m)))$, where each $x_i \sim \mathcal{D}$ independently.
The learner outputs a hypothesis $h : \mathcal{X} \to \{0,1\}$.
Generalization Error
The generalization error (or risk) of a hypothesis $h$ with respect to target $c$ and distribution $\mathcal{D}$ is:

$\mathrm{err}_{\mathcal{D}}(h) = \Pr_{x \sim \mathcal{D}}[h(x) \neq c(x)]$

This is the probability that $h$ disagrees with $c$ on a random example.
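As a concrete illustration (a toy sketch, not part of the formal setup), the risk can be estimated by Monte Carlo: sample from $\mathcal{D}$ and count disagreements between $h$ and $c$. The distribution, target, and hypothesis below are hypothetical.

```python
import random

def generalization_error(h, c, sample_from_D, n_trials=100_000):
    """Monte Carlo estimate of err(h) = Pr_{x~D}[h(x) != c(x)]."""
    draws = (sample_from_D() for _ in range(n_trials))
    return sum(h(x) != c(x) for x in draws) / n_trials

# Toy setup (hypothetical): D = Uniform[0, 1], target c = indicator of [0.3, 0.7],
# hypothesis h = indicator of [0.25, 0.7]; the true risk is the measure of
# the symmetric difference [0.25, 0.3), i.e. 0.05.
random.seed(0)
c = lambda x: 0.3 <= x <= 0.7
h = lambda x: 0.25 <= x <= 0.7
print(generalization_error(h, c, random.random))  # ≈ 0.05
```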
Core Definitions
PAC Learnability (Realizable Case)
A concept class $\mathcal{C}$ is PAC-learnable if there exists an algorithm $A$ and a function $m_{\mathcal{C}} : (0,1)^2 \to \mathbb{N}$ such that for every target concept $c \in \mathcal{C}$, every distribution $\mathcal{D}$ over $\mathcal{X}$, every accuracy parameter $\epsilon \in (0,1)$, and every confidence parameter $\delta \in (0,1)$:

when $A$ is given a sample of size $m \geq m_{\mathcal{C}}(\epsilon, \delta)$ drawn i.i.d. from $\mathcal{D}$ and labeled by $c$, it outputs $h$ satisfying:

$\Pr[\mathrm{err}_{\mathcal{D}}(h) \leq \epsilon] \geq 1 - \delta$

If $m_{\mathcal{C}}$ is polynomial in $1/\epsilon$ and $1/\delta$, the class is efficiently PAC-learnable. If additionally $A$ runs in time polynomial in $1/\epsilon$, $1/\delta$, and the representation size of the instances and target, we have polynomial-time PAC learnability.
The key features of this definition:
- For all distributions. The learner must work regardless of $\mathcal{D}$
- For all target concepts. The learner must work for any $c \in \mathcal{C}$
- Probably: the guarantee holds with probability at least $1 - \delta$
- Approximately: the error is at most $\epsilon$, not necessarily zero
Sample Complexity
The sample complexity $m_{\mathcal{C}}(\epsilon, \delta)$ is the minimum number of training examples needed to achieve error at most $\epsilon$ with confidence $1 - \delta$. This is the fundamental quantity that PAC theory characterizes.
Realizable vs. Agnostic PAC
Realizable Setting
In the realizable setting, we assume $c \in \mathcal{H}$. The target concept is in our hypothesis class. The learner can achieve zero training error. The question is whether zero training error implies low test error.
Agnostic PAC Learning
In the agnostic setting, we make no assumption about the target. Labels may be generated by any joint distribution over $\mathcal{X} \times \{0,1\}$. The goal is to compete with the best hypothesis in $\mathcal{H}$:

$\mathrm{err}_{\mathcal{D}}(h) \leq \min_{h' \in \mathcal{H}} \mathrm{err}_{\mathcal{D}}(h') + \epsilon$

This is harder because even the best hypothesis in $\mathcal{H}$ may have nonzero error (the approximation error). Agnostic PAC learning requires more samples.
The agnostic setting is more realistic: in practice, you never know if the true relationship is exactly representable by your model class.
Main Theorems
PAC Bound for Finite Hypothesis Classes
Statement
Let $\mathcal{H}$ be a finite concept class with $|\mathcal{H}| < \infty$. In the realizable setting, the ERM algorithm (which outputs any $h \in \mathcal{H}$ consistent with the training data) satisfies: for any $\epsilon, \delta \in (0,1)$, if the sample size satisfies:

$m \geq \frac{1}{\epsilon}\left(\ln|\mathcal{H}| + \ln\frac{1}{\delta}\right)$

then with probability at least $1 - \delta$, the output $h$ has $\mathrm{err}_{\mathcal{D}}(h) \leq \epsilon$.
Intuition
Each "bad" hypothesis (one with error greater than $\epsilon$) has probability at most $(1 - \epsilon)^m$ of being consistent with all $m$ training examples. There are at most $|\mathcal{H}|$ bad hypotheses. By a union bound, the probability that any bad hypothesis survives is at most $|\mathcal{H}|(1 - \epsilon)^m \leq |\mathcal{H}| e^{-\epsilon m}$. Setting this $\leq \delta$ and solving for $m$ gives the bound.
Proof Sketch
- Fix a hypothesis $h$ with $\mathrm{err}_{\mathcal{D}}(h) > \epsilon$. The probability that $h$ is consistent with a single random example is at most $1 - \epsilon$.
- By independence, $\Pr[h \text{ is consistent with all } m \text{ examples}] \leq (1 - \epsilon)^m \leq e^{-\epsilon m}$.
- Union bound over all $h$ with $\mathrm{err}_{\mathcal{D}}(h) > \epsilon$: $\Pr[\text{some bad } h \text{ is consistent}] \leq |\mathcal{H}| e^{-\epsilon m}$.
- Set $|\mathcal{H}| e^{-\epsilon m} \leq \delta$ and solve: $m \geq \frac{1}{\epsilon}\left(\ln|\mathcal{H}| + \ln\frac{1}{\delta}\right)$.
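The bound from the last step is easy to evaluate numerically. A minimal sketch (the function name is mine):

```python
import math

def pac_sample_size(h_size, eps, delta):
    """Realizable finite-class bound: m >= (ln|H| + ln(1/delta)) / eps."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

# A class of 2**20 hypotheses, 5% error, 99% confidence:
print(pac_sample_size(2**20, eps=0.05, delta=0.01))  # → 370
```

Note that doubling the number of hypotheses adds only $(\ln 2)/\epsilon \approx 14$ samples here: the logarithmic dependence on $|\mathcal{H}|$ in action.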
Why It Matters
This is the simplest PAC bound and the template for all subsequent results. The sample complexity is $O\left(\frac{\ln|\mathcal{H}| + \ln(1/\delta)}{\epsilon}\right)$: logarithmic in the size of the hypothesis class. This logarithmic dependence is what makes learning from finite classes tractable even when $|\mathcal{H}|$ is exponentially large.
Failure Mode
This bound requires the realizable assumption ($c \in \mathcal{H}$). In the agnostic case, the rate changes from $1/\epsilon$ to $1/\epsilon^2$. The square is the cost of not knowing the target is in your class.
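To see the cost concretely, here is a side-by-side of the realizable bound and the standard Hoeffding-based agnostic bound $m \geq \frac{2}{\epsilon^2}\left(\ln|\mathcal{H}| + \ln\frac{2}{\delta}\right)$ for a finite class (the function name and the constants chosen below are illustrative):

```python
import math

def sample_bounds(ln_H, eps, delta):
    """Realizable O(1/eps) bound vs. Hoeffding-based agnostic O(1/eps^2) bound."""
    m_realizable = math.ceil((ln_H + math.log(1 / delta)) / eps)
    m_agnostic = math.ceil(2 * (ln_H + math.log(2 / delta)) / eps ** 2)
    return m_realizable, m_agnostic

for eps in (0.1, 0.01):
    print(eps, sample_bounds(ln_H=10, eps=eps, delta=0.05))
```

At $\epsilon = 0.01$ the agnostic bound is already hundreds of times larger than the realizable one, reflecting the $1/\epsilon$ vs. $1/\epsilon^2$ gap.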
Fundamental Theorem of PAC Learning
Statement
For binary classification with 0-1 loss, the following are equivalent:
- $\mathcal{H}$ has finite VC dimension
- $\mathcal{H}$ is agnostic PAC-learnable
- $\mathcal{H}$ is PAC-learnable (realizable case)
- ERM over $\mathcal{H}$ is a successful agnostic PAC learner
Moreover, the sample complexity of agnostic PAC learning is:

$m(\epsilon, \delta) = \Theta\left(\frac{d + \ln(1/\delta)}{\epsilon^2}\right)$

where $d$ is the VC dimension of $\mathcal{H}$.
Intuition
The VC dimension is the exact measure of hypothesis class complexity that determines learnability. A class is learnable if and only if it cannot shatter arbitrarily large sets of points. The VC dimension replaces the crude $\ln|\mathcal{H}|$ of the finite case with a geometric measure of complexity that works for infinite classes.
Why It Matters
This is the most important theorem in statistical learning theory. It completely characterizes PAC learnability: finite VC dimension is both necessary and sufficient. Every other complexity measure (Rademacher complexity, covering numbers, fat-shattering dimension) can be related back to this fundamental characterization.
Failure Mode
This theorem applies only to binary classification with 0-1 loss. For regression, multi-class classification, or other loss functions, the characterization is more complex and involves different complexity measures.
Proof Ideas and Templates Used
The proofs in PAC theory repeatedly use two techniques:
- Union bound + pointwise concentration: Bound the probability that a single hypothesis is misleading (pointwise), then take a union bound over all hypotheses. This works for finite classes.
- Symmetrization + growth function: For infinite classes, the union bound over hypotheses is replaced by a union bound over behaviors of hypotheses on a ghost sample. The number of distinct behaviors is bounded by the growth function $\Pi_{\mathcal{H}}(m)$, which is polynomial in $m$ when the VC dimension is finite (Sauer's lemma).
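Sauer's lemma can be checked numerically: the growth function is at most $\sum_{i=0}^{d} \binom{m}{i}$, which grows like $m^d$ rather than $2^m$. A quick sketch:

```python
from math import comb

def sauer_bound(m, d):
    """Sauer's lemma upper bound on the growth function: sum_{i=0}^{d} C(m, i)."""
    return sum(comb(m, i) for i in range(d + 1))

# With VC dimension d = 3, the number of distinct behaviors grows polynomially
# in m, while the number of all possible labelings 2^m grows exponentially:
for m in (5, 10, 20):
    print(m, sauer_bound(m, 3), 2 ** m)  # e.g. m=20: 1351 vs 1048576
```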
Canonical Examples
Learning conjunctions
The class of conjunctions over $n$ Boolean variables contains $3^n$ hypotheses (each variable can appear positive, negative, or absent). The PAC bound gives sample complexity $O\left(\frac{n + \ln(1/\delta)}{\epsilon}\right)$. There is also a simple polynomial-time algorithm: start with all $2n$ literals and remove any literal falsified by a positive example.
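The elimination algorithm just described can be sketched in a few lines (the literal encoding and the toy dataset are mine):

```python
def learn_conjunction(samples, n):
    """Elimination algorithm for conjunctions: start with all 2n literals
    (x_i and its negation), drop any literal falsified by a positive example."""
    literals = {(i, v) for i in range(n) for v in (0, 1)}  # (index, required value)
    for x, label in samples:
        if label == 1:
            literals = {(i, v) for (i, v) in literals if x[i] == v}
    def h(x):
        return int(all(x[i] == v for (i, v) in literals))
    return h

# Toy target: x0 AND NOT x2 over n = 3 bits.
data = [((1, 0, 0), 1), ((1, 1, 0), 1), ((0, 1, 0), 0), ((1, 0, 1), 0)]
h = learn_conjunction(data, 3)
print(h((1, 0, 0)), h((0, 0, 0)))  # → 1 0
```

In the realizable setting, negative examples can be ignored: the learned conjunction keeps every literal of the target, so it is at least as restrictive and never misclassifies a negative.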
Learning intervals on the real line
The class of intervals $[a, b] \subseteq \mathbb{R}$ has VC dimension 2. By the fundamental theorem, the sample complexity for agnostic PAC learning is $\Theta\left(\frac{2 + \ln(1/\delta)}{\epsilon^2}\right)$, independent of the number of possible intervals (which is uncountably infinite).
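The claim that the VC dimension is 2 can be verified by brute force: intervals realize all $2^2$ labelings of 2 points, but the labeling $(1, 0, 1)$ of 3 points is unrealizable. A small sketch (helper names are mine):

```python
from itertools import product

def intervals_shatter(points):
    """Can the class of intervals [a, b] realize every labeling of `points`?"""
    pts = sorted(points)
    # Candidate endpoints just below/above each point suffice; a > b gives
    # the empty interval, covering the all-zeros labeling.
    ends = [p + d for p in pts for d in (-0.5, 0.5)]
    for labeling in product([0, 1], repeat=len(pts)):
        if not any(all((a <= p <= b) == bool(y) for p, y in zip(pts, labeling))
                   for a in ends for b in ends):
            return False
    return True

print(intervals_shatter([1, 2]))     # → True: VC dimension is at least 2
print(intervals_shatter([1, 2, 3]))  # → False: (1, 0, 1) needs two intervals
```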
Common Confusions
PAC learnability is distribution-free
The PAC guarantee must hold for every distribution $\mathcal{D}$. This is a very strong requirement. It means the sample complexity bounds are worst-case over distributions. For specific, well-behaved distributions, you might need far fewer samples. This is both the strength (universality) and weakness (conservatism) of PAC theory.
Efficient learnability requires polynomial time
PAC learnability as defined above only requires polynomial sample complexity. It says nothing about computational efficiency. A class can be PAC-learnable (finite VC dimension) but not efficiently PAC-learnable (finding the ERM hypothesis is NP-hard). For example, learning intersections of halfspaces is PAC-learnable but believed to be computationally hard.
The realizable assumption is very strong
Assuming $c \in \mathcal{H}$ means you know the true concept is representable by your hypothesis class. In practice, this is almost never true: real data are noisy, and the true relationship is rarely exactly representable by your model class. The agnostic setting is more realistic but requires $O(1/\epsilon^2)$ samples instead of $O(1/\epsilon)$. The quadratic cost of agnosticism is a fundamental price you pay for not knowing the true model.
Summary
- PAC = Probably (with probability at least $1 - \delta$) Approximately (error at most $\epsilon$) Correct
- Sample complexity for finite class (realizable): $O\left(\frac{\ln|\mathcal{H}| + \ln(1/\delta)}{\epsilon}\right)$
- Sample complexity for VC dimension $d$ (agnostic): $\Theta\left(\frac{d + \ln(1/\delta)}{\epsilon^2}\right)$
- The fundamental theorem: finite VC dimension is equivalent to PAC learnability for binary classification
- PAC is distribution-free: bounds hold for the worst-case distribution
- Computational complexity is a separate question from sample complexity
Exercises
Problem
A concept class has 256 concepts. In the realizable setting, how many samples do you need to guarantee that with probability at least $1 - \delta$, the ERM hypothesis has error at most $\epsilon$?
Problem
Explain in one sentence why the sample complexity for the agnostic case is $O(1/\epsilon^2)$ while the realizable case is $O(1/\epsilon)$.
Problem
The class of all Boolean functions on $n$ bits has $2^{2^n}$ concepts. What is the PAC sample complexity in the realizable case? Is this class efficiently PAC-learnable?
References
Canonical:
- Valiant, "A Theory of the Learnable," Communications of the ACM (1984)
- Shalev-Shwartz & Ben-David, Understanding Machine Learning, Chapters 2-6
- Kearns & Vazirani, An Introduction to Computational Learning Theory (1994), Chapters 1-3
Current:
- Mohri, Rostamizadeh, and Talwalkar, Foundations of Machine Learning (2018), Chapter 2
- Wainwright, High-Dimensional Statistics (2019), Chapters 4-6
Next Topics
The natural next steps from PAC learning:
- Empirical risk minimization: the algorithmic principle that implements PAC learning
- VC dimension: the complexity measure that characterizes PAC learnability
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- VC Dimension (Layer 2)
- Empirical Risk Minimization (Layer 2)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)