Learning Theory Core
Realizability Assumption
The technical assumption that the target function lies inside the hypothesis class. Realizable PAC learning is the simpler half of the story; agnostic PAC drops this assumption.
Why This Matters
The realizability assumption is one of the cleanest dividing lines in learning theory. It says that the target function which labeled the data is itself in the hypothesis class the learner searches over. When the assumption holds, ERM generalizes via a short union-bound argument. When it fails, you need uniform convergence and a stronger result for the agnostic case.
The assumption is named in Shalev-Shwartz and Ben-David (definition 2.1). It is also the implicit setup behind most introductory treatments of PAC, because the derivation is simpler under it. Most applied learning settings violate it: there is no reason a real-world labeling function is exactly representable by, say, the class of axis-aligned rectangles. The realizable case is therefore a stepping stone, not the destination.
Mental Model
Two objects sit underneath every supervised problem: an instance distribution $\mathcal{D}$ over the domain $\mathcal{X}$, and a labeling rule $f : \mathcal{X} \to \mathcal{Y}$. The realizability assumption is a statement about $f$ relative to a hypothesis class $\mathcal{H}$.
If $f \in \mathcal{H}$, then some hypothesis in the class achieves zero population error. ERM, by definition, can then match the training data exactly. So the gap to bound is between the empirical and true error of the selected ERM hypothesis, not against an unreachable Bayes-optimal predictor.
If $f \notin \mathcal{H}$, the best possible hypothesis in $\mathcal{H}$ has some nonzero error $\varepsilon^* = \min_{h \in \mathcal{H}} L_{\mathcal{D},f}(h) > 0$, and ERM has to compete against that floor rather than against zero. That regime is what agnostic PAC handles.
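To make the floor concrete, here is a minimal numerical sketch; the threshold class, uniform distribution, and interval target are all assumptions made for illustration. With a realizable target the best-in-class error is zero, while an interval target leaves every threshold with a nonzero error floor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative class (assumed): thresholds h_t(x) = 1[x >= t] on X = [0, 1],
# with D = Uniform[0, 1]. Risks are estimated by Monte Carlo.
thresholds = np.linspace(0.0, 1.0, 101)
xs = rng.uniform(0.0, 1.0, 200_000)

def risk(t, f):
    """Estimate L_{D,f}(h_t) = Pr[h_t(x) != f(x)]."""
    return np.mean((xs >= t).astype(int) != f(xs))

# Realizable: f is itself a threshold at 0.3, so f is in the class.
f_real = lambda x: (x >= 0.3).astype(int)
print(min(risk(t, f_real) for t in thresholds))  # ~0.0

# Not realizable: f is the indicator of the interval [0.3, 0.6]. No threshold
# matches it; the best the class can do is predict all-zero (t = 1), paying
# the interval's mass, so the floor is ~0.3.
f_agn = lambda x: ((x >= 0.3) & (x <= 0.6)).astype(int)
print(min(risk(t, f_agn) for t in thresholds))   # ~0.3
```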
Accuracy and Confidence Parameters
Two parameters drive every PAC-style guarantee. They name the two ways the learner can fail.
- Accuracy parameter $\varepsilon$: the largest population error the learner is willing to accept. Returning $h$ with $L_{\mathcal{D},f}(h) \le \varepsilon$ is approximately correct; returning $h$ with $L_{\mathcal{D},f}(h) > \varepsilon$ counts as a failure.
- Confidence parameter $\delta$: the largest probability of failure the learner is willing to tolerate. The training sample is itself random, so even a correct procedure has some chance of landing on a non-representative sample. With probability at least $1 - \delta$ the bound certifies $L_{\mathcal{D},f}(h_S) \le \varepsilon$; with probability at most $\delta$ the certificate is silent.
The pair $(\varepsilon, \delta)$ controls both knobs at once. The corollary below certifies a single sample size $m(\varepsilon, \delta)$ that delivers $\varepsilon$-accuracy with confidence $1 - \delta$.
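Both knobs can be watched directly in simulation. The sketch below uses a toy setup assumed purely for illustration (twenty domain points, a threshold class, a realizable target): it runs ERM on many independent samples and measures how often the returned hypothesis misses $\varepsilon$-accuracy. The corollary says this failure rate stays below $\delta$ once $m$ clears the threshold.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup (assumed for illustration): X = {0, ..., 19}, D uniform,
# target f(x) = 1[x >= 7], H = the 21 thresholds t = 0..20, so f is in H.
xs_all = np.arange(20)
target_t = 7
H = np.arange(21)

def erm(sample_x, sample_y):
    # Return the first threshold with zero empirical risk; realizability
    # guarantees at least one exists (t = 7).
    for t in H:
        if np.array_equal((sample_x >= t).astype(int), sample_y):
            return t
    raise AssertionError("unreachable under realizability")

def true_risk(t):
    # L_{D,f}(h_t) under the uniform distribution on xs_all.
    return np.mean((xs_all >= t) != (xs_all >= target_t))

eps, delta = 0.1, 0.05
m = 61  # >= ln(|H| / delta) / eps = ln(21 / 0.05) / 0.1 ~= 60.4
trials, failures = 2000, 0
for _ in range(trials):
    sx = rng.choice(xs_all, size=m)
    sy = (sx >= target_t).astype(int)
    if true_risk(erm(sx, sy)) > eps:
        failures += 1
print(failures / trials)  # stays below delta = 0.05, as the corollary certifies
```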
Formal Setup
Let $\mathcal{X}$ be a domain set, $\mathcal{Y} = \{0, 1\}$ a label set, $\mathcal{D}$ a distribution over $\mathcal{X}$, and $f : \mathcal{X} \to \mathcal{Y}$ a target labeling function.
True risk under target
The true risk of a hypothesis $h$ under $\mathcal{D}$ and $f$ is the probability that $h$ disagrees with $f$ on a fresh sample:

$$L_{\mathcal{D},f}(h) = \Pr_{x \sim \mathcal{D}}[h(x) \neq f(x)].$$
This is the quantity the learner cannot directly compute, because $\mathcal{D}$ and $f$ are both unknown.
Empirical risk
For a training sample $S = ((x_1, y_1), \ldots, (x_m, y_m))$, the empirical risk is the fraction of training points on which $h$ disagrees with the observed label:

$$L_S(h) = \frac{|\{i \in [m] : h(x_i) \neq y_i\}|}{m}.$$
Under realizability with i.i.d. samples $x_i \sim \mathcal{D}$, the labels are $y_i = f(x_i)$, so $L_S(f) = 0$ with probability one for the target $f$.
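As a quick sanity check, empirical risk is a one-line computation, and under realizable labels the target's empirical risk is identically zero (the threshold target below is an assumption made for illustration):

```python
import numpy as np

def empirical_risk(h, sample_x, sample_y):
    """L_S(h): fraction of training points where h disagrees with the label."""
    return np.mean(h(sample_x) != sample_y)

rng = np.random.default_rng(2)
f = lambda x: (x >= 0.5).astype(int)   # illustrative target (assumed)
sx = rng.uniform(0.0, 1.0, 100)
sy = f(sx)                             # realizable labels: y_i = f(x_i)
print(empirical_risk(f, sx, sy))       # 0.0, with probability one
```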
Realizability assumption
The hypothesis class $\mathcal{H}$ realizes the target if there exists $h^* \in \mathcal{H}$ with $L_{\mathcal{D},f}(h^*) = 0$. Equivalently, $f$ is in $\mathcal{H}$ up to measure zero under $\mathcal{D}$.
The assumption says nothing about how the learner finds $h^*$. It only says the search space contains a perfect predictor.
Main Theorem
Realizable Finite-Class Sample Complexity
Statement
Fix $\varepsilon, \delta \in (0, 1)$. If $\mathcal{H}$ is a finite hypothesis class and the realizability assumption holds, then for every

$$m \ge \frac{\log(|\mathcal{H}|/\delta)}{\varepsilon},$$

with probability at least $1 - \delta$ over the i.i.d. draw of $S \sim \mathcal{D}^m$, every ERM hypothesis $h_S$ satisfies $L_{\mathcal{D},f}(h_S) \le \varepsilon$.
Intuition
Under realizability, ERM is forced to pick some $h_S$ with $L_S(h_S) = 0$ (because $h^*$ achieves zero training error and ERM picks an empirical minimizer). So the only way ERM fails is if a "bad" hypothesis with population error larger than $\varepsilon$ also happens to have zero training error on this particular sample. The probability of any fixed bad hypothesis surviving $m$ i.i.d. coin flips is at most $(1 - \varepsilon)^m \le e^{-\varepsilon m}$. A union bound over bad hypotheses gives $|\mathcal{H}| e^{-\varepsilon m}$. Setting that equal to $\delta$ and solving for $m$ recovers the threshold.
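The per-hypothesis survival probability is easy to verify numerically. The sketch below (disagreement mass $0.15$ assumed for illustration) estimates the chance that a fixed bad hypothesis sees no disagreeing point in $m$ draws and compares it to $(1 - \varepsilon)^m$ and $e^{-\varepsilon m}$:

```python
import numpy as np

rng = np.random.default_rng(3)

# Fixed bad hypothesis (assumed): it disagrees with f on a set of D-mass 0.15.
# It "survives" a sample iff all m i.i.d. draws miss the disagreement set,
# which happens with probability (1 - 0.15)^m.
eps_h, m = 0.15, 30
draws = rng.random((100_000, m))          # a draw u >= 0.85 models landing
survived = np.mean(draws.max(axis=1) < 1 - eps_h)  # in the disagreement set
print(survived)                  # Monte Carlo estimate, ~0.0076
print((1 - eps_h) ** m)          # exact: 0.85^30 ~= 0.0076
print(np.exp(-eps_h * m))        # upper bound: e^{-4.5} ~= 0.0111
```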
Proof Sketch
Let $\mathcal{H}_B = \{h \in \mathcal{H} : L_{\mathcal{D},f}(h) > \varepsilon\}$ be the set of bad hypotheses. Let $M = \{S : \exists h \in \mathcal{H}_B \text{ with } L_S(h) = 0\}$ be the misleading samples. Realizability plus the ERM-picks-zero-empirical-risk fact imply $\{S : L_{\mathcal{D},f}(h_S) > \varepsilon\} \subseteq M$. The union bound gives $\Pr[M] \le \sum_{h \in \mathcal{H}_B} \Pr[L_S(h) = 0]$. For each $h \in \mathcal{H}_B$, the probability that all $m$ training points happen to land in the agreement set is at most $(1 - \varepsilon)^m \le e^{-\varepsilon m}$. So $\Pr[\text{failure}] \le |\mathcal{H}| e^{-\varepsilon m}$. Setting this to $\delta$ and solving for $m$ recovers the bound.
Why It Matters
This is the simplest sample-complexity result in PAC learning. Three things stand out: the rate is $1/\varepsilon$ rather than $1/\varepsilon^2$ (the agnostic rate), the dependence on the class size is only logarithmic in $|\mathcal{H}|$, and there is no Hoeffding step: a pure union bound suffices. Each of these structural properties weakens when realizability is dropped.
Failure Mode
The bound is vacuous when realizability fails, i.e. when $f \notin \mathcal{H}$. In that case the ERM hypothesis can have $L_S(h_S) > 0$, so the proof's anchor (that ERM achieves zero training error) is gone. The agnostic version of the result uses Hoeffding plus a union bound and recovers a $1/\varepsilon^2$ rate.
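To see what the rate change costs, the comparison below uses one common form of the agnostic finite-class bound, $m \ge 2\log(2|\mathcal{H}|/\delta)/\varepsilon^2$; the exact constants vary across textbooks, so treat the agnostic numbers as indicative rather than definitive:

```python
import math

def m_realizable(H_size, eps, delta):
    return math.ceil(math.log(H_size / delta) / eps)

def m_agnostic(H_size, eps, delta):
    # One common form of the agnostic finite-class bound; constants differ
    # across sources, so this is indicative rather than definitive.
    return math.ceil(2 * math.log(2 * H_size / delta) / eps ** 2)

for eps in (0.1, 0.05, 0.01):
    print(eps, m_realizable(100, eps, 0.05), m_agnostic(100, eps, 0.05))
# At eps = 0.01: 761 realizable vs roughly 166,000 agnostic, the 1/eps^2 price.
```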
Step-by-step Proof of Corollary 2.3
The full argument packages four ideas: a decomposition of the failure event, the union bound, an i.i.d. factorization, and the inequality $1 - \varepsilon \le e^{-\varepsilon}$. Each appears once and the corollary falls out.
Step 1: Set up the failure event. Let $S = ((x_1, y_1), \ldots, (x_m, y_m))$ with i.i.d. instances $x_i \sim \mathcal{D}$ and labels $y_i = f(x_i)$. Write $\bar{x} = (x_1, \ldots, x_m)$ for the unlabeled instance vector. The probability we want to upper bound is

$$\mathcal{D}^m\big(\{\bar{x} : L_{\mathcal{D},f}(h_S) > \varepsilon\}\big).$$
Sample randomness lives on the $m$-fold product space $(\mathcal{X}^m, \mathcal{D}^m)$; $h_S$ depends on $\bar{x}$ through the ERM rule.
Step 2: Define the bad and misleading sets. A hypothesis is bad if its true risk exceeds the accuracy parameter:

$$\mathcal{H}_B = \{h \in \mathcal{H} : L_{\mathcal{D},f}(h) > \varepsilon\}.$$
A sample is misleading if it makes some bad hypothesis look perfect on the training set:

$$M = \{\bar{x} : \exists h \in \mathcal{H}_B \text{ with } L_S(h) = 0\}.$$
Step 3: Establish the failure-event subset relation. Realizability gives $h^* \in \mathcal{H}$ with $L_{\mathcal{D},f}(h^*) = 0$, so $L_S(h^*) = 0$ holds with probability one (the agreement set $\{x : h^*(x) = f(x)\}$ has full $\mathcal{D}$-measure, and i.i.d. sampling lands every $x_i$ in it almost surely). ERM picks an empirical minimizer, so $L_S(h_S) = 0$ as well. If additionally $L_{\mathcal{D},f}(h_S) > \varepsilon$ (the failure event), then by definition $h_S \in \mathcal{H}_B$, and the sample is misleading. Therefore

$$\{\bar{x} : L_{\mathcal{D},f}(h_S) > \varepsilon\} \subseteq M.$$
Step 4: Apply the union bound. Probability is monotone under set inclusion, and $M$ is a finite union over $h \in \mathcal{H}_B$ of the events $\{\bar{x} : L_S(h) = 0\}$. Combining,

$$\mathcal{D}^m\big(\{\bar{x} : L_{\mathcal{D},f}(h_S) > \varepsilon\}\big) \le \mathcal{D}^m(M) \le \sum_{h \in \mathcal{H}_B} \mathcal{D}^m\big(\{\bar{x} : L_S(h) = 0\}\big).$$
Step 5: Factor the per-hypothesis term over the i.i.d. coordinates. For any fixed $h \in \mathcal{H}_B$, the event $L_S(h) = 0$ holds iff $h(x_i) = f(x_i)$ for every $i \in [m]$. Independence factors the joint probability into the product of marginals:

$$\mathcal{D}^m\big(\{\bar{x} : L_S(h) = 0\}\big) = \prod_{i=1}^{m} \mathcal{D}\big(\{x : h(x) = f(x)\}\big) = \big(1 - L_{\mathcal{D},f}(h)\big)^m.$$
The first equality uses independence; the second uses the definition of the true risk, since the $\mathcal{D}$-probability that $h$ agrees with $f$ on a fresh draw is $1 - L_{\mathcal{D},f}(h)$.
Step 6: Apply the exponential bound. For $h \in \mathcal{H}_B$, $L_{\mathcal{D},f}(h) > \varepsilon$, so

$$\big(1 - L_{\mathcal{D},f}(h)\big)^m \le (1 - \varepsilon)^m \le e^{-\varepsilon m}.$$
The first inequality is monotonicity of $t \mapsto t^m$ on $[0, 1]$. The second uses $1 - \varepsilon \le e^{-\varepsilon}$ for all $\varepsilon$, which follows from convexity: the tangent to $e^{-\varepsilon}$ at $\varepsilon = 0$ is $1 - \varepsilon$, and a convex function lies above each of its tangents.
Step 7: Combine and solve for $m$. Substituting the per-hypothesis bound back into the union-bound inequality,

$$\mathcal{D}^m\big(\{\bar{x} : L_{\mathcal{D},f}(h_S) > \varepsilon\}\big) \le |\mathcal{H}_B| \, e^{-\varepsilon m} \le |\mathcal{H}| \, e^{-\varepsilon m}.$$
The failure probability is at most $|\mathcal{H}| e^{-\varepsilon m}$. To make the right side at most $\delta$, solve:

$$|\mathcal{H}| e^{-\varepsilon m} \le \delta \iff m \ge \frac{\log(|\mathcal{H}|/\delta)}{\varepsilon}.$$
This is exactly the threshold in the corollary statement.
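Assembled into a single chain, Steps 3 through 7 read:

$$\mathcal{D}^m\big(\{\bar{x} : L_{\mathcal{D},f}(h_S) > \varepsilon\}\big) \le \mathcal{D}^m(M) \le \sum_{h \in \mathcal{H}_B} \mathcal{D}^m\big(\{\bar{x} : L_S(h) = 0\}\big) = \sum_{h \in \mathcal{H}_B} \big(1 - L_{\mathcal{D},f}(h)\big)^m \le |\mathcal{H}_B| (1 - \varepsilon)^m \le |\mathcal{H}| e^{-\varepsilon m} \le \delta,$$

where the final inequality holds once $m \ge \log(|\mathcal{H}|/\delta)/\varepsilon$.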
Numerical Scaling
The bound $m \ge \log(|\mathcal{H}|/\delta)/\varepsilon$ is logarithmic in $|\mathcal{H}|$ and $1/\delta$, and linear in $1/\varepsilon$. Concretely:
| $|\mathcal{H}|$ | $\varepsilon$ | $\delta$ | $m$ |
| --- | --- | --- | --- |
| 10 | 0.10 | 0.05 | 53 |
| 100 | 0.10 | 0.05 | 77 |
| 1,000 | 0.10 | 0.05 | 100 |
| 10,000 | 0.10 | 0.05 | 123 |
| 100 | 0.05 | 0.05 | 153 |
| 100 | 0.01 | 0.05 | 761 |
| 100 | 0.10 | 0.001 | 116 |
Reading the table: each decade of growth in $|\mathcal{H}|$ costs roughly $\ln(10)/\varepsilon \approx 23$ extra samples at $\varepsilon = 0.1$; halving $\varepsilon$ doubles $m$; tightening $\delta$ from $0.05$ to $0.001$ adds about $\ln(50)/\varepsilon \approx 39$ samples through the $\log(1/\delta)$ term.
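The table regenerates from a one-line formula. A minimal script (assuming natural log and ceiling, which reproduce the figures above):

```python
import math

def m_bound(H_size, eps, delta):
    """Smallest integer m with H_size * exp(-eps * m) <= delta."""
    return math.ceil(math.log(H_size / delta) / eps)

rows = [(10, 0.10, 0.05), (100, 0.10, 0.05), (1_000, 0.10, 0.05),
        (10_000, 0.10, 0.05), (100, 0.05, 0.05), (100, 0.01, 0.05),
        (100, 0.10, 0.001)]
for H_size, eps, delta in rows:
    print(f"{H_size:>6}  {eps:4.2f}  {delta:5.3f}  {m_bound(H_size, eps, delta):>4}")
```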
Common Confusions
Realizability is about the class containing $f$, not about training error being zero.
A learner can achieve zero training error on a finite sample while $f \notin \mathcal{H}$. It just means the sample happened to be coverable by some $h \in \mathcal{H}$, not that the labeling rule itself is in the class. Conversely, realizability does not guarantee any specific learner finds $h^*$; it only guarantees the search space contains a perfect predictor.
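A four-point example makes this concrete (the interval target and threshold class are assumptions for illustration): the sample happens to be one-sided, so a threshold fits it perfectly even though no threshold equals the target.

```python
import numpy as np

# Target: indicator of [0.3, 0.6]; class: thresholds 1[x >= t]. f is NOT in H.
f = lambda x: ((x >= 0.3) & (x <= 0.6)).astype(int)
sx = np.array([0.1, 0.2, 0.4, 0.5])    # a one-sided sample: nothing above 0.6
sy = f(sx)                             # labels: 0, 0, 1, 1
h = lambda x: (x >= 0.3).astype(int)   # a threshold that happens to fit
print(np.mean(h(sx) != sy))            # 0.0: zero training error, yet f not in H
```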
Realizability is not the i.i.d. assumption.
The two assumptions are independent. Realizability is about $f$ relative to $\mathcal{H}$; i.i.d. is about how $S$ is drawn from $\mathcal{D}$. Both are needed for the corollary above. Drop realizability and you move to agnostic PAC. Drop i.i.d. and you move into online learning, distribution shift, or martingale-style analyses.
Worked Example
Realizable axis-aligned rectangles
Let $\mathcal{X} = \mathbb{R}^2$, let $f$ be the indicator of an unknown axis-aligned rectangle $R^* = [a_1, b_1] \times [a_2, b_2]$, and let $\mathcal{H}$ be the class of indicators of all axis-aligned rectangles. By construction $f \in \mathcal{H}$, so realizability holds. The corollary above would give $m \ge \log(|\mathcal{H}|/\delta)/\varepsilon$, but $\mathcal{H}$ is uncountable in this setup, so the bound as stated is vacuous and the analysis must move to VC dimension. The realizability assumption is doing real work in the analysis; the finite-class wrapper around it is just the simplest illustration.
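A standard learner for this class returns the tightest rectangle around the positive examples, which has zero training error under realizability. A minimal simulation (rectangle coordinates and sample sizes are assumptions for illustration) shows it generalizing well even though the finite-class corollary never applies here:

```python
import numpy as np

rng = np.random.default_rng(4)
R = np.array([0.2, 0.7, 0.3, 0.8])  # unknown rectangle [a1, b1] x [a2, b2]
f = lambda X: ((X[:, 0] >= R[0]) & (X[:, 0] <= R[1]) &
               (X[:, 1] >= R[2]) & (X[:, 1] <= R[3])).astype(int)

def erm_rect(X, y):
    """Tightest rectangle around the positive points: zero training error
    under realizability, and it never overshoots the true rectangle."""
    P = X[y == 1]
    if len(P) == 0:                      # degenerate case: predict all-negative
        return np.array([1.0, 0.0, 1.0, 0.0])
    return np.array([P[:, 0].min(), P[:, 0].max(), P[:, 1].min(), P[:, 1].max()])

X = rng.uniform(0, 1, (200, 2))
Rh = erm_rect(X, f(X))
h = lambda Z: ((Z[:, 0] >= Rh[0]) & (Z[:, 0] <= Rh[1]) &
               (Z[:, 1] >= Rh[2]) & (Z[:, 1] <= Rh[3])).astype(int)
Xtest = rng.uniform(0, 1, (100_000, 2))
print(np.mean(h(Xtest) != f(Xtest)))     # small: the tightest fit generalizes
```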
Exercises
Problem
Suppose $\mathcal{H}$ is a finite class, and the realizable corollary is invoked at accuracy $\varepsilon$ with confidence $1 - \delta$. What is the smallest sample size $m$ the bound certifies, as a function of $|\mathcal{H}|$, $\varepsilon$, and $\delta$? Evaluate it at $|\mathcal{H}| = 1{,}000$, $\varepsilon = 0.1$, $\delta = 0.05$ and check your answer against the table above.
Problem
Show that without realizability, the empirical risk of the ERM hypothesis can be strictly positive, even for the optimal-in-class predictor, and give an example where this gap is the binding constraint on the agnostic-PAC sample complexity.
Summary
Realizability is a small assumption with a large structural effect. Under realizability plus i.i.d. plus a finite hypothesis class, the sample complexity to reach $\varepsilon$-accuracy with confidence $1 - \delta$ is $m \ge \log(|\mathcal{H}|/\delta)/\varepsilon$, a clean linear-in-$1/\varepsilon$ rate. Drop realizability and you move to the agnostic regime, where the rate weakens to $1/\varepsilon^2$ and the analysis needs uniform convergence rather than a single union bound.
The assumption is rarely true in practice. Most labeling functions are not exactly representable by any practical hypothesis class. But the realizable case is the right starting point because every concept in PAC learning (the bad-hypothesis decomposition, the union-bound technique, the role of $(\varepsilon, \delta)$, the sample-complexity scaling) appears in its simplest form here.
References
Canonical:
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Definition 2.1 and Corollary 2.3.
- Kearns & Vazirani, An Introduction to Computational Learning Theory (1994), chapters 1-3.
- Vapnik, Statistical Learning Theory (1998), chapter 1.
Current:
- Mohri, Rostamizadeh, Talwalkar, Foundations of Machine Learning (2018), chapter 2.
Next Topics
- PAC learning framework: the agnostic version of this corollary.
- Uniform convergence: the proof technique that handles the agnostic case.
- Sample complexity bounds: how the realizable rate compares to the VC bound.
Last reviewed: May 3, 2026