
Learning Theory Core

PAC Learning Framework

The foundational formalization of what it means to learn from data: a concept is PAC-learnable if an algorithm can, with high probability, find a hypothesis that is approximately correct, using a polynomial number of samples.


Why This Matters

Figure: train error, generalization gap, and true risk versus sample size $n$ (target $\epsilon = 0.2$, $n_0 = 189$). With probability $\geq 1-\delta$, $R(h) \leq \hat{R}(h) + \epsilon$ after $n \geq n_0$ samples.

The PAC (Probably Approximately Correct) framework, introduced by Leslie Valiant in 1984, is the mathematical foundation for answering the most basic question in machine learning: when is it possible to learn from data?

Before PAC theory, machine learning was a collection of algorithms without a unifying theory of learnability. PAC provides the definitions and theorems that tell you whether a learning problem is solvable in principle, and how many samples you need.

Every generalization bound you encounter, from ERM to VC dimension to Rademacher complexity, is ultimately answering a question that PAC theory formalized. The sample complexity bounds page covers the quantitative details.

Figure: samples needed $n$ versus target error $\epsilon$ for several settings ($d=5, 20, 50$ with $\delta=0.05$; $d=20$ with $\delta=0.01$), dividing the feasible region from the region that needs too many samples.

Mental Model

Think of a learner as a detective. The detective wants to identify an unknown rule (the target concept) but can only observe examples. PAC theory asks: how many examples does the detective need to be probably ($1 - \delta$) approximately ($\epsilon$) correct about the rule?

The "probably" accounts for the randomness of the training sample: you might get an unlucky draw. The "approximately" accounts for the fact that you do not need to identify the rule exactly; getting close is enough.

Formal Setup and Notation

Let $\mathcal{X}$ be an instance space (e.g., $\{0,1\}^n$). A concept is a function $c: \mathcal{X} \to \{0, 1\}$. A concept class $\mathcal{C}$ is a collection of concepts.

There is an unknown distribution $\mathcal{D}$ over $\mathcal{X}$ and an unknown target concept $c^* \in \mathcal{C}$.

The learner receives a training sample $S = \{(x_1, c^*(x_1)), \ldots, (x_n, c^*(x_n))\}$ where each $x_i \sim \mathcal{D}$ independently.

The learner outputs a hypothesis $h: \mathcal{X} \to \{0, 1\}$.

Definition

Generalization Error

The generalization error (or risk) of a hypothesis $h$ with respect to target $c^*$ and distribution $\mathcal{D}$ is:

$$R(h) = \Pr_{x \sim \mathcal{D}}[h(x) \neq c^*(x)]$$

This is the probability that $h$ disagrees with $c^*$ on a random example.
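When $\mathcal{D}$ is known (or can be sampled), $R(h)$ can be estimated by Monte Carlo. The sketch below uses an entirely illustrative setup: $\mathcal{X} = \{0,1\}^5$ under the uniform distribution, a made-up target $c^*(x) = x_0$, and a made-up hypothesis; only the estimator itself is the point.

```python
import random

n_bits = 5  # instance space {0,1}^5, D = uniform (illustrative choices)

def c_star(x):
    # hypothetical target concept: label = first bit
    return x[0]

def h(x):
    # hypothetical hypothesis: disagrees with c_star only when
    # x0 = 0 and x1 = x2 = 1, an event of probability 1/8 under uniform D
    return int(x[0] == 1 or (x[1] == 1 and x[2] == 1))

def risk(hyp, target, n_samples=100_000, seed=0):
    """Monte Carlo estimate of R(h) = Pr_{x~D}[h(x) != c*(x)]."""
    rng = random.Random(seed)
    disagreements = 0
    for _ in range(n_samples):
        x = [rng.randint(0, 1) for _ in range(n_bits)]
        disagreements += hyp(x) != target(x)
    return disagreements / n_samples

print(risk(h, c_star))  # close to the true risk 0.125
```

With 100,000 samples the estimate is within about $\pm 0.003$ of the true value $1/8$ with high probability.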

Core Definitions

Definition

PAC Learnability (Realizable Case)

A concept class $\mathcal{C}$ is PAC-learnable if there exists an algorithm $\mathcal{A}$ and a function $n_{\mathcal{C}}: (0,1)^2 \to \mathbb{N}$ such that for every target concept $c^* \in \mathcal{C}$, every distribution $\mathcal{D}$ over $\mathcal{X}$, every accuracy parameter $\epsilon \in (0, 1)$, and every confidence parameter $\delta \in (0, 1)$:

when $\mathcal{A}$ is given a sample $S$ of size $n \geq n_{\mathcal{C}}(\epsilon, \delta)$ drawn i.i.d. from $\mathcal{D}$ and labeled by $c^*$, it outputs $h$ satisfying:

$$\Pr_S[R(h) \leq \epsilon] \geq 1 - \delta$$

If $n_{\mathcal{C}}(\epsilon, \delta)$ is polynomial in $1/\epsilon$ and $1/\delta$, the class is efficiently PAC-learnable. If additionally $\mathcal{A}$ runs in time polynomial in $1/\epsilon$, $1/\delta$, and $\text{size}(c^*)$, we have polynomial-time PAC learnability.

The key features of this definition:

  1. For all distributions: the learner must work regardless of $\mathcal{D}$.
  2. For all target concepts: the learner must work for any $c^* \in \mathcal{C}$.
  3. Probably: the guarantee holds with probability $\geq 1 - \delta$.
  4. Approximately: the error is at most $\epsilon$, not necessarily zero.
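The definition is easy to check empirically on a toy class. The sketch below (all settings are illustrative, not from the text) uses threshold functions $h_t(x) = \mathbb{1}[x \geq t]$ on $\mathcal{X} = \{0,\ldots,9\}$ with uniform $\mathcal{D}$, runs ERM over many independent training samples, and measures how often $R(h) \leq \epsilon$ holds:

```python
import random

# Hypothetical toy setting: thresholds h_t(x) = 1[x >= t] on {0,...,9},
# target t* = 4, D = uniform. ERM returns the smallest consistent threshold.
rng = random.Random(0)

def h(t, x): return int(x >= t)

t_star = 4
def risk(t):  # exact R(h_t) under uniform D on 10 points
    return sum(h(t, x) != h(t_star, x) for x in range(10)) / 10

def erm(sample):
    for t in range(11):  # t = 10 is the all-zeros hypothesis
        if all(h(t, x) == y for x, y in sample):
            return t
    raise ValueError("unreachable in the realizable setting")

eps, delta, n, trials = 0.2, 0.1, 30, 2000
good = 0
for _ in range(trials):
    sample = [(x, h(t_star, x)) for x in (rng.randrange(10) for _ in range(n))]
    good += risk(erm(sample)) <= eps
print(good / trials)  # empirical success rate, comfortably above 1 - delta = 0.9
```

The observed success rate should exceed $1 - \delta$ by a wide margin here, since $n = 30$ is far more than this tiny class needs.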
Definition

Sample Complexity

The sample complexity $n_{\mathcal{C}}(\epsilon, \delta)$ is the minimum number of training examples needed to achieve error $\leq \epsilon$ with confidence $\geq 1 - \delta$. This is the fundamental quantity that PAC theory characterizes.

Realizable vs. Agnostic PAC

Definition

Realizable Setting

In the realizable setting, we assume $c^* \in \mathcal{C}$. The target concept is in our hypothesis class. The learner can achieve zero training error. The question is whether zero training error implies low test error.

Definition

Agnostic PAC Learning

In the agnostic setting, we make no assumption about the target. Labels may be generated by any joint distribution over $\mathcal{X} \times \{0,1\}$. The goal is to compete with the best hypothesis in $\mathcal{C}$:

$$\Pr_S\!\left[R(h) \leq \min_{h' \in \mathcal{C}} R(h') + \epsilon\right] \geq 1 - \delta$$

This is harder because even the best hypothesis in $\mathcal{C}$ may have nonzero error (the approximation error). Agnostic PAC learning requires more samples.

The agnostic setting is more realistic: in practice, you never know if the true relationship is exactly representable by your model class.

Main Theorems

Theorem

PAC Bound for Finite Hypothesis Classes

Statement

Let $\mathcal{C}$ be a finite concept class with $|\mathcal{C}| = M$. In the realizable setting, the ERM algorithm (which outputs any $h \in \mathcal{C}$ consistent with the training data) satisfies: for any $\epsilon, \delta > 0$, if the sample size satisfies:

$$n \geq \frac{1}{\epsilon}\left(\log M + \log \frac{1}{\delta}\right)$$

then with probability $\geq 1 - \delta$, the output $h$ has $R(h) \leq \epsilon$.

Intuition

Each "bad" hypothesis (one with error $> \epsilon$) has probability at most $(1 - \epsilon)^n \leq e^{-n\epsilon}$ of being consistent with all $n$ training examples. There are at most $M$ bad hypotheses. By a union bound, the probability that any bad hypothesis survives is at most $M e^{-n\epsilon}$. Setting this $\leq \delta$ and solving for $n$ gives the bound.

Proof Sketch

  1. Fix a hypothesis $h$ with $R(h) > \epsilon$. The probability that $h$ is consistent with a single random example is at most $1 - \epsilon$.
  2. By independence, $\Pr[h \text{ consistent with } S] \leq (1 - \epsilon)^n \leq e^{-n\epsilon}$.
  3. Union bound over all $h \in \mathcal{C}$ with $R(h) > \epsilon$: $\Pr[\exists \text{ bad consistent } h] \leq M e^{-n\epsilon}$.
  4. Set $M e^{-n\epsilon} \leq \delta$ and solve: $n \geq \frac{1}{\epsilon}(\log M + \log(1/\delta))$.
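The final step of the sketch is a direct computation. A minimal sketch (the values of $M$, $\epsilon$, $\delta$ are made up for illustration):

```python
import math

def realizable_sample_bound(M, eps, delta):
    """Smallest integer n with M * exp(-n * eps) <= delta,
    i.e. n >= (1/eps) * (ln M + ln(1/delta)) -- the finite-class PAC bound."""
    return math.ceil((math.log(M) + math.log(1 / delta)) / eps)

# Example: M = 2**20 hypotheses, eps = 0.05, delta = 0.01.
n = realizable_sample_bound(2**20, 0.05, 0.01)
print(n)  # → 370

# Sanity check: the union-bound failure probability is indeed below delta.
assert (2**20) * math.exp(-n * 0.05) <= 0.01
```

Note that $M \approx 10^6$ here but contributes only $\log M \approx 13.9$ to the sample count, illustrating the logarithmic dependence.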

Why It Matters

This is the simplest PAC bound and the template for all subsequent results. The sample complexity is $O(\frac{1}{\epsilon}\log M)$: logarithmic in the size of the hypothesis class. This logarithmic dependence is what makes learning from finite classes tractable even when $M$ is exponentially large.

Failure Mode

This bound requires the realizable assumption ($c^* \in \mathcal{C}$). In the agnostic case, the rate changes from $O(1/\epsilon)$ to $O(1/\epsilon^2)$. The square is the cost of not knowing the target is in your class.

Theorem

Fundamental Theorem of PAC Learning

Statement

For binary classification with 0-1 loss, the following are equivalent:

  1. $\mathcal{C}$ has finite VC dimension
  2. $\mathcal{C}$ is agnostic PAC-learnable
  3. $\mathcal{C}$ is PAC-learnable (realizable case)
  4. ERM over $\mathcal{C}$ is a successful agnostic PAC learner

Moreover, the sample complexity of agnostic PAC learning is:

$$n = \Theta\!\left(\frac{d_{\text{VC}} + \log(1/\delta)}{\epsilon^2}\right)$$

where $d_{\text{VC}}$ is the VC dimension of $\mathcal{C}$.
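To see how the $1/\epsilon$ (realizable) and $1/\epsilon^2$ (agnostic) rates diverge, one can tabulate both. The sketch below drops the hidden constants (the theorem only fixes the $\Theta$ rate), so the absolute numbers are illustrative; only the scaling is meaningful:

```python
import math

def realizable_n(d, eps, delta, C=1.0):
    """Order-of-magnitude realizable rate: n ~ C * (d + log(1/delta)) / eps.
    The constant C is illustrative; the theorem fixes only the Theta rate."""
    return math.ceil(C * (d + math.log(1 / delta)) / eps)

def agnostic_n(d, eps, delta, C=1.0):
    """Agnostic rate: n ~ C * (d + log(1/delta)) / eps**2."""
    return math.ceil(C * (d + math.log(1 / delta)) / eps ** 2)

# Hypothetical class with VC dimension 10, confidence delta = 0.05.
for eps in (0.1, 0.05, 0.01):
    r, a = realizable_n(10, eps, 0.05), agnostic_n(10, eps, 0.05)
    print(f"eps={eps}: realizable ~{r}, agnostic ~{a}")
```

Halving $\epsilon$ roughly doubles the realizable count but quadruples the agnostic one; at $\epsilon = 0.01$ the agnostic requirement is about $100\times$ the realizable one.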

Intuition

The VC dimension is the exact measure of hypothesis class complexity that determines learnability. A class is learnable if and only if it cannot shatter arbitrarily large sets of points. The VC dimension replaces the crude $\log M$ of the finite case with a geometric measure of complexity that works for infinite classes.

Why It Matters

This is the most important theorem in statistical learning theory. It completely characterizes PAC learnability: finite VC dimension is both necessary and sufficient. Every other complexity measure (Rademacher complexity, covering numbers, fat-shattering dimension) can be related back to this fundamental characterization.

Failure Mode

This theorem applies only to binary classification with 0-1 loss. For regression, multi-class classification, or other loss functions, the characterization is more complex and involves different complexity measures.

Proof Ideas and Templates Used

The proofs in PAC theory repeatedly use two techniques:

  1. Union bound + pointwise concentration: Bound the probability that a single hypothesis is misleading (pointwise), then take a union bound over all hypotheses. This works for finite classes.

  2. Symmetrization + growth function: For infinite classes, the union bound over hypotheses is replaced by a union bound over behaviors of hypotheses on a ghost sample. The number of distinct behaviors is bounded by the growth function $\Pi_{\mathcal{C}}(n)$, which is polynomial in $n$ when the VC dimension is finite (Sauer's lemma).
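Sauer's lemma itself is a one-line computation: a class of VC dimension $d$ satisfies $\Pi_{\mathcal{C}}(n) \leq \sum_{i=0}^{d} \binom{n}{i} = O(n^d)$. A minimal sketch comparing this bound to the $2^n$ labelings an unrestricted class could produce:

```python
from math import comb

def sauer_bound(n, d):
    """Sauer's lemma upper bound on the growth function:
    Pi(n) <= sum_{i=0}^{d} C(n, i), which is O(n^d) -- polynomial in n."""
    return sum(comb(n, i) for i in range(min(n, d) + 1))

# For n <= d every labeling of n points is possible (2^n); beyond d,
# the bound grows only polynomially while 2^n explodes.
for n in (2, 5, 10, 20):
    print(n, sauer_bound(n, d=3), 2 ** n)
```

For $d = 3$ the bound at $n = 20$ is $1351$ versus $2^{20} \approx 10^6$ unrestricted labelings; this polynomial-versus-exponential gap is what makes the ghost-sample union bound work.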

Canonical Examples

Example

Learning conjunctions

The class of conjunctions over $n$ Boolean variables contains $3^n$ hypotheses (each variable can appear positive, negated, or absent). The PAC bound gives sample complexity $O(\frac{n}{\epsilon}\log 3) = O(\frac{n}{\epsilon})$. There is also a simple polynomial-time algorithm: start with all literals and remove any literal falsified by a positive example.
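A minimal sketch of this elimination algorithm (the data, target, and names are illustrative):

```python
def learn_conjunction(samples, n):
    """Start with all 2n literals; delete every literal that some
    positive example falsifies. Literal (i, v) means "x_i must equal v"."""
    literals = {(i, v) for i in range(n) for v in (0, 1)}
    for x, y in samples:
        if y == 1:
            literals -= {(i, v) for (i, v) in literals if x[i] != v}
    return literals

def predict(literals, x):
    return int(all(x[i] == v for (i, v) in literals))

# Illustrative data labeled by the hidden target: x0 AND (NOT x2), n = 3.
data = [((1, 0, 0), 1), ((1, 1, 0), 1), ((0, 1, 0), 0), ((1, 0, 1), 0)]
h = learn_conjunction(data, 3)
print(sorted(h))  # → [(0, 1), (2, 0)], i.e. x0 AND (NOT x2)
```

Only positive examples shrink the hypothesis; in the realizable setting the result is automatically consistent with the negatives, since the learned conjunction never accepts more instances than the target does.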

Example

Learning intervals on the real line

The class of intervals $[a, b] \subset \mathbb{R}$ has VC dimension 2. By the fundamental theorem, the sample complexity for agnostic PAC learning is $O(1/\epsilon^2)$, independent of the number of possible intervals (which is uncountably infinite).
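In the realizable case, one natural ERM learner for this class returns the tightest interval containing the positive points. A sketch under that assumption, with made-up data:

```python
def erm_interval(samples):
    """Realizable-case ERM for intervals [a, b]: return the tightest
    interval covering all positive points (None = label everything 0)."""
    pos = [x for x, y in samples if y == 1]
    if not pos:
        return None
    return (min(pos), max(pos))

def predict(interval, x):
    return int(interval is not None and interval[0] <= x <= interval[1])

# Illustrative data labeled by a hidden interval around [2, 5].
data = [(1.0, 0), (2.5, 1), (4.0, 1), (4.9, 1), (6.0, 0)]
h = erm_interval(data)
print(h)                # → (2.5, 4.9)
print(predict(h, 3.0))  # → 1
print(predict(h, 6.5))  # → 0
```

This learner achieves zero training error whenever the data really are labeled by some interval, which is exactly the realizable assumption.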

Common Confusions

Watch Out

PAC learnability is distribution-free

The PAC guarantee must hold for every distribution $\mathcal{D}$. This is a very strong requirement. It means the sample complexity bounds are worst-case over distributions. For specific, well-behaved distributions, you might need far fewer samples. This is both the strength (universality) and weakness (conservatism) of PAC theory.

Watch Out

Efficient learnability requires polynomial time

PAC learnability as defined above only requires polynomial sample complexity. It says nothing about computational efficiency. A class can be PAC-learnable (finite VC dimension) but not efficiently PAC-learnable (finding the ERM hypothesis is NP-hard). For example, learning intersections of halfspaces is PAC-learnable but believed to be computationally hard.

Watch Out

The realizable assumption is very strong

Assuming $c^* \in \mathcal{C}$ means you know the true concept is representable by your hypothesis class. In practice, this is almost never true: real model classes are almost always misspecified to some degree. The agnostic setting is more realistic but requires $O(1/\epsilon^2)$ samples instead of $O(1/\epsilon)$. The quadratic cost of agnosticism is a fundamental price you pay for not knowing the true model.

Summary

  • PAC = Probably ($1-\delta$) Approximately ($\epsilon$) Correct
  • Sample complexity for finite class (realizable): $n = O(\frac{1}{\epsilon}\log|\mathcal{C}|)$
  • Sample complexity for finite VC dimension (agnostic): $n = O(\frac{d_{\text{VC}}}{\epsilon^2})$
  • The fundamental theorem: finite VC dimension is equivalent to PAC learnability for binary classification
  • PAC is distribution-free: bounds hold for the worst-case distribution
  • Computational complexity is a separate question from sample complexity

Exercises

Exercise (Core)

Problem

A concept class $\mathcal{C}$ has 256 concepts. In the realizable setting, how many samples do you need to guarantee that with probability at least $0.9$, the ERM hypothesis has error at most $0.05$?

Exercise (Core)

Problem

Explain in one sentence why the sample complexity for the agnostic case is $O(1/\epsilon^2)$ while the realizable case is $O(1/\epsilon)$.

Exercise (Advanced)

Problem

The class of all Boolean functions on $n$ bits has $2^{2^n}$ concepts. What is the PAC sample complexity in the realizable case? Is this class efficiently PAC-learnable?


References

Canonical:

  • Valiant, "A Theory of the Learnable," Communications of the ACM (1984)
  • Shalev-Shwartz & Ben-David, Understanding Machine Learning, Chapters 2-6
  • Kearns & Vazirani, An Introduction to Computational Learning Theory (1994), Chapters 1-3

Current:

  • Mohri, Rostamizadeh, Talwalkar, Foundations of Machine Learning (2018), Chapter 2
  • Wainwright, High-Dimensional Statistics (2019), Chapters 4-6

Next Topics

The natural next steps from PAC learning:

  • Empirical risk minimization: the algorithmic principle that implements PAC learning
  • VC dimension: the complexity measure that characterizes PAC learnability

Last reviewed: April 2026
