

Realizability Assumption

The technical assumption that the target function lies inside the hypothesis class. Realizable PAC learning is the simpler half of the story; agnostic PAC drops this assumption.


Why This Matters

The realizability assumption is one of the cleanest dividing lines in learning theory. It says: the target function the data was labeled by is itself in the hypothesis class the learner is searching over. When the assumption holds, ERM generalizes via a short union-bound argument. When it fails, you need uniform convergence and a stronger result for the agnostic case.

The assumption is named in Shalev-Shwartz and Ben-David (definition 2.1). It is also the implicit setup behind most introductory treatments of PAC, because the derivation is simpler under it. Most applied learning settings violate it: there is no reason a real-world labeling function is exactly representable by, say, the class of axis-aligned rectangles. The realizable case is therefore a stepping stone, not the destination.

Mental Model

Two objects sit underneath every supervised problem: an instance distribution $\mathcal{D}$ over $\mathcal{X}$, and a labeling rule $f : \mathcal{X} \to \mathcal{Y}$. The realizability assumption is a statement about $f$ relative to a hypothesis class $\mathcal{H}$.

If $f \in \mathcal{H}$, then some hypothesis in the class achieves zero population error, and ERM can match the training data exactly. So the gap to bound is between the empirical and true error of the selected ERM hypothesis, not a gap to some better predictor outside the class.

If $f \notin \mathcal{H}$, the best hypothesis in $\mathcal{H}$ generally has some nonzero error $\epsilon_{\mathcal{H}}^{*}$, and ERM has to compete against that floor rather than against zero. That regime is what agnostic PAC handles.

Accuracy and Confidence Parameters

Two parameters drive every PAC-style guarantee. They name the two ways the learner can fail.

  • Accuracy parameter $\epsilon \in (0, 1)$: the largest population error the learner is willing to accept. Returning $h$ with $L_{(\mathcal{D}, f)}(h) \le \epsilon$ is approximately correct; returning $h$ with $L_{(\mathcal{D}, f)}(h) > \epsilon$ counts as a failure.
  • Confidence parameter $\delta \in (0, 1)$: the largest probability of failure the learner is willing to tolerate. The training sample $S$ is itself random, so even a correct procedure has some chance of landing on a non-representative sample. With probability at least $1 - \delta$ the bound certifies $L_{(\mathcal{D}, f)}(h_S) \le \epsilon$; with probability at most $\delta$ the certificate is silent.

The pair $(\epsilon, \delta)$ controls both knobs at once. The corollary below certifies a single sample size $m$ that delivers $\epsilon$-accuracy with confidence $1 - \delta$.

Formal Setup

Let $\mathcal{X}$ be a domain set, $\mathcal{Y} = \{0, 1\}$ a label set, $\mathcal{D}$ a distribution over $\mathcal{X}$, and $f : \mathcal{X} \to \mathcal{Y}$ a target labeling function.

Definition

True risk under target

The true risk of a hypothesis $h$ under $\mathcal{D}$ and $f$ is the probability that $h$ disagrees with $f$ on a fresh sample:

$$L_{(\mathcal{D}, f)}(h) := \mathbb{P}_{x \sim \mathcal{D}}[h(x) \ne f(x)] = \mathcal{D}(\{x : h(x) \ne f(x)\}).$$

This is the quantity the learner cannot directly compute, because $\mathcal{D}$ and $f$ are both unknown.
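When the domain is finite and the distribution is given explicitly, the true risk is just a weighted count of disagreements. A minimal sketch in Python; the finite domain, the pmf, and the function names are illustrative, not part of the formal setup:

```python
def true_risk(h, f, pmf):
    """P_{x ~ D}[h(x) != f(x)] when D is an explicit {point: probability} map."""
    return sum(p for x, p in pmf.items() if h(x) != f(x))

pmf = {x: 0.25 for x in range(4)}          # uniform D on {0, 1, 2, 3}
f = lambda x: int(x >= 2)                  # target labeling rule
h = lambda x: int(x >= 3)                  # candidate hypothesis
print(true_risk(h, f, pmf))                # 0.25 -- they disagree only at x = 2
```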

Definition

Empirical risk

For a training sample $S = ((x_1, y_1), \ldots, (x_m, y_m))$, the empirical risk is the fraction of training points on which $h$ disagrees with the observed label:

$$L_S(h) := \frac{|\{i \in [m] : h(x_i) \ne y_i\}|}{m}.$$

Under realizability with i.i.d. samples, the labels are $y_i = f(x_i)$, so the perfect hypothesis $h^{*} \in \mathcal{H}$ satisfies $L_S(h^{*}) = 0$ with probability one.
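The empirical counterpart needs only the sample. A companion sketch, reusing the toy target and hypothesis from the previous block (again purely illustrative):

```python
def empirical_risk(h, S):
    """Fraction of labeled examples (x, y) in S that h gets wrong."""
    return sum(1 for x, y in S if h(x) != y) / len(S)

f = lambda x: int(x >= 2)                  # target from the previous sketch
h = lambda x: int(x >= 3)
S = [(x, f(x)) for x in [0, 1, 1, 3, 2]]   # realizable labels: y_i = f(x_i)
print(empirical_risk(f, S))                # 0.0 -- the target is never wrong
print(empirical_risk(h, S))                # 0.2 -- h mislabels the x = 2 point
```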

Definition

Realizability assumption

The hypothesis class $\mathcal{H}$ realizes the target $(\mathcal{D}, f)$ if there exists $h^{*} \in \mathcal{H}$ with $L_{(\mathcal{D}, f)}(h^{*}) = 0$. Equivalently, $f$ is in $\mathcal{H}$ up to measure zero under $\mathcal{D}$.

The assumption says nothing about how the learner finds $h^{*}$. It only says the search space contains a perfect predictor.
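On a finite domain the assumption can be checked by brute force. A sketch under that (unrealistically small) setup, with illustrative names; real hypothesis classes are rarely enumerable like this:

```python
def is_realizable(H, f, support):
    """True iff some h in H agrees with f on every point D can produce."""
    return any(all(h(x) == f(x) for x in support) for h in H)

H = [lambda x, t=t: int(x >= t) for t in range(5)]          # thresholds on {0,...,4}
print(is_realizable(H, lambda x: int(x >= 2), range(4)))    # True:  f is itself a threshold
print(is_realizable(H, lambda x: int(x == 1), range(4)))    # False: no threshold matches
```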

Main Theorem

Corollary

Realizable Finite-Class Sample Complexity

Statement

Fix $\epsilon, \delta > 0$. If $\mathcal{H}$ is a finite hypothesis class and the realizability assumption holds, then for every

$$m \ge \frac{\log(|\mathcal{H}| / \delta)}{\epsilon},$$

with probability at least $1 - \delta$ over the i.i.d. draw of $S$, every ERM hypothesis $h_S$ satisfies $L_{(\mathcal{D}, f)}(h_S) \le \epsilon$.
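The threshold is one line of arithmetic. A small helper, with example values chosen for illustration:

```python
import math

def sample_complexity(h_size, eps, delta):
    """Smallest integer m with m >= log(|H| / delta) / eps."""
    return math.ceil(math.log(h_size / delta) / eps)

print(sample_complexity(1_000, 0.10, 0.05))   # 100
print(sample_complexity(1_000, 0.05, 0.05))   # 199 -- halving eps roughly doubles m
```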

Intuition

Under realizability, ERM is forced to pick some $h_S$ with $L_S(h_S) = 0$ (because $h^{*}$ achieves zero training error and ERM picks an empirical minimizer). So the only way ERM fails is if a "bad" hypothesis with population error larger than $\epsilon$ also happens to have zero training error on this particular sample. The probability that any fixed bad hypothesis survives $m$ i.i.d. coin flips is at most $(1 - \epsilon)^m \le e^{-\epsilon m}$. A union bound over the at most $|\mathcal{H}|$ bad hypotheses gives $|\mathcal{H}|\, e^{-\epsilon m}$. Setting that $\le \delta$ and solving recovers the threshold.
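To get a feel for the sizes involved, a quick numeric check with arbitrarily chosen values $\epsilon = 0.1$, $m = 100$, $|\mathcal{H}| = 1000$:

```python
import math

eps, m, H_size = 0.10, 100, 1_000
survive_one = (1 - eps) ** m         # ~2.7e-5: one fixed bad hypothesis looks perfect
exp_bound   = math.exp(-eps * m)     # ~4.5e-5: the relaxation (1 - eps)^m <= e^(-eps*m)
union_bound = H_size * exp_bound     # ~0.045:  worst case over the whole class
print(survive_one, exp_bound, union_bound)
```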

Proof Sketch

Let $\mathcal{H}_B = \{h \in \mathcal{H} : L_{(\mathcal{D}, f)}(h) > \epsilon\}$ be the set of bad hypotheses. Let $M = \{S : \exists h \in \mathcal{H}_B,\, L_S(h) = 0\}$ be the set of misleading samples. Realizability, plus the fact that ERM attains zero empirical risk, implies $\{S : L_{(\mathcal{D}, f)}(h_S) > \epsilon\} \subseteq M$. The union bound gives $\mathbb{P}[M] \le \sum_{h \in \mathcal{H}_B} \mathbb{P}[L_S(h) = 0]$. For each $h \in \mathcal{H}_B$, the probability that all $m$ training points land in its agreement set is at most $(1 - \epsilon)^m \le e^{-\epsilon m}$. So $\mathbb{P}[M] \le |\mathcal{H}_B|\, e^{-\epsilon m} \le |\mathcal{H}|\, e^{-\epsilon m}$. Setting this $\le \delta$ and solving for $m$ recovers the bound.
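The whole argument can be sanity-checked by simulation. The sketch below uses a made-up finite class of integer thresholds with the target inside it, runs ERM many times at the corollary's sample size, and estimates the failure probability; the setup and the tie-breaking rule are illustrative choices, not part of the theorem:

```python
import math
import random

random.seed(0)
domain = range(100)                      # X = {0, ..., 99}, D uniform
thresholds = range(101)                  # H = {h_t : h_t(x) = 1[x >= t]}, |H| = 101
target_t = 30                            # f = h_30, so realizability holds
eps, delta = 0.10, 0.05
m = math.ceil(math.log(len(thresholds) / delta) / eps)    # corollary threshold: 77

def true_risk(t):
    # h_t and h_30 disagree exactly on the integers between the two thresholds
    return abs(t - target_t) / len(domain)

def erm(S):
    # any empirical minimizer works; ties broken toward the smallest threshold
    return min(thresholds,
               key=lambda t: (sum(int(x >= t) != y for x, y in S), t))

trials, failures = 2000, 0
for _ in range(trials):
    S = [(x, int(x >= target_t)) for x in random.choices(domain, k=m)]
    failures += true_risk(erm(S)) > eps

print(m, failures / trials)   # observed failure rate sits far below delta = 0.05
```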

Why It Matters

This is the simplest sample-complexity result in PAC learning. Three things stand out: the rate is $1/\epsilon$ rather than $1/\epsilon^2$ (the agnostic rate), the dependence on $\mathcal{H}$ is only logarithmic in $|\mathcal{H}|$, and there is no Hoeffding step, since a pure union bound suffices. Each of these structural properties weakens when realizability is dropped.

Failure Mode

The bound is vacuous when $f \notin \mathcal{H}$. In that case the ERM hypothesis can have $L_S(h_S) > 0$, so the proof's anchor (that ERM achieves zero training error) is gone. The agnostic version of the result uses Hoeffding's inequality plus a union bound and recovers a $1/\epsilon^2$ rate.

Step-by-step Proof of Corollary 2.3

The full argument packages four ideas: a decomposition of the failure event, the union bound, an i.i.d. factorization, and the inequality $1 - \epsilon \le e^{-\epsilon}$. Each appears once and the corollary falls out.

Step 1: Set up the failure event. Let $S = ((x_1, y_1), \ldots, (x_m, y_m))$ with i.i.d. instances $x_i \sim \mathcal{D}$ and labels $y_i = f(x_i)$. Write $S|_x = (x_1, \ldots, x_m)$ for the unlabeled instance vector. The probability we want to upper bound is

$$\mathcal{D}^m\bigl(\bigl\{S|_x : L_{(\mathcal{D}, f)}(h_S) > \epsilon\bigr\}\bigr).$$

Sample randomness lives on the $m$-fold product space $(\mathcal{X}^m, \mathcal{D}^m)$; $h_S$ depends on $S|_x$ through the ERM rule.

Step 2: Define the bad and misleading sets. A hypothesis is bad if its true risk exceeds the accuracy parameter:

$$\mathcal{H}_B := \bigl\{h \in \mathcal{H} : L_{(\mathcal{D}, f)}(h) > \epsilon\bigr\}.$$

A sample is misleading if it makes some bad hypothesis look perfect on the training set:

$$M := \bigl\{S|_x : \exists h \in \mathcal{H}_B,\ L_S(h) = 0\bigr\} = \bigcup_{h \in \mathcal{H}_B} \bigl\{S|_x : L_S(h) = 0\bigr\}.$$

Step 3: Establish the failure-event subset relation. Realizability gives $h^{*} \in \mathcal{H}$ with $L_{(\mathcal{D}, f)}(h^{*}) = 0$, so $L_S(h^{*}) = 0$ holds with probability one (the agreement set has full $\mathcal{D}$-measure, and i.i.d. sampling lands every $x_i$ in it almost surely). ERM picks an empirical minimizer, so $L_S(h_S) = 0$ as well. If additionally $L_{(\mathcal{D}, f)}(h_S) > \epsilon$ (the failure event), then by definition $h_S \in \mathcal{H}_B$, and the sample $S|_x$ is misleading. Therefore

$$\bigl\{S|_x : L_{(\mathcal{D}, f)}(h_S) > \epsilon\bigr\} \subseteq M.$$

Step 4: Apply the union bound. Probability is monotone under set inclusion, and $M$ is a finite union over $\mathcal{H}_B$. Combining,

$$\mathcal{D}^m\bigl(\bigl\{S|_x : L_{(\mathcal{D}, f)}(h_S) > \epsilon\bigr\}\bigr) \le \mathcal{D}^m(M) \le \sum_{h \in \mathcal{H}_B} \mathcal{D}^m\bigl(\bigl\{S|_x : L_S(h) = 0\bigr\}\bigr).$$

Step 5: Factor the per-hypothesis term over the i.i.d. coordinates. For any fixed $h$, the event $L_S(h) = 0$ holds iff $h(x_i) = f(x_i)$ for every $i$. Independence factors the joint probability into a product of marginals:

$$\mathcal{D}^m\bigl(\bigl\{S|_x : L_S(h) = 0\bigr\}\bigr) = \prod_{i=1}^{m} \mathcal{D}\bigl(\bigl\{x_i : h(x_i) = f(x_i)\bigr\}\bigr) = \bigl(1 - L_{(\mathcal{D}, f)}(h)\bigr)^m.$$

The first equality uses independence; the second uses the definition of the true risk, since the $\mathcal{D}$-probability that $h$ agrees with $f$ on a fresh draw is $1 - L_{(\mathcal{D}, f)}(h)$.

Step 6: Apply the exponential bound. For $h \in \mathcal{H}_B$, $L_{(\mathcal{D}, f)}(h) > \epsilon$, so

$$\bigl(1 - L_{(\mathcal{D}, f)}(h)\bigr)^m \le (1 - \epsilon)^m \le e^{-\epsilon m}.$$

The first inequality is monotonicity of $x \mapsto x^m$ on $[0, 1]$. The second uses $1 - x \le e^{-x}$ for $x \ge 0$, which follows from convexity: the tangent to $e^{-x}$ at $x = 0$ is $1 - x$, and a convex function lies above each of its tangents.
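For completeness, a one-line calculus version of the same inequality (an elementary standard argument, not quoted from the source text):

```latex
\[
g(x) := e^{-x} - (1 - x), \qquad g(0) = 0, \qquad g'(x) = 1 - e^{-x} \ge 0 \ \ (x \ge 0),
\]
\[
\text{so } g \text{ is nondecreasing on } [0, \infty) \text{ and } e^{-x} \ge 1 - x \text{ for all } x \ge 0.
\]
```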

Step 7: Combine and solve for $m$. Substituting the per-hypothesis bound back into the union-bound inequality,

$$\mathcal{D}^m\bigl(\bigl\{S|_x : L_{(\mathcal{D}, f)}(h_S) > \epsilon\bigr\}\bigr) \le |\mathcal{H}_B|\, e^{-\epsilon m} \le |\mathcal{H}|\, e^{-\epsilon m}.$$

The failure probability is at most $|\mathcal{H}|\, e^{-\epsilon m}$. To make the right-hand side at most $\delta$, solve:

$$|\mathcal{H}|\, e^{-\epsilon m} \le \delta \iff e^{-\epsilon m} \le \delta / |\mathcal{H}| \iff m \ge \frac{\log(|\mathcal{H}| / \delta)}{\epsilon}.$$

This is exactly the threshold in the corollary statement. $\blacksquare$

Numerical Scaling

The bound is logarithmic in $|\mathcal{H}|$ and linear in $1/\epsilon$ and $\log(1/\delta)$. Concretely:

| $\lvert\mathcal{H}\rvert$ | $\epsilon$ | $\delta$ | $\lceil \log(\lvert\mathcal{H}\rvert / \delta) / \epsilon \rceil$ |
| --- | --- | --- | --- |
| 10 | 0.10 | 0.05 | 53 |
| 100 | 0.10 | 0.05 | 77 |
| 1,000 | 0.10 | 0.05 | 100 |
| 10,000 | 0.10 | 0.05 | 123 |
| 100 | 0.05 | 0.05 | 153 |
| 100 | 0.01 | 0.05 | 761 |
| 100 | 0.10 | 0.001 | 116 |
Reading the table: each decade of growth in $|\mathcal{H}|$ costs roughly $\log(10)/\epsilon \approx 23$ extra samples at $\epsilon = 0.1$; halving $\epsilon$ doubles $m$; tightening $\delta$ from $0.05$ to $0.001$ adds $\log(50)/\epsilon \approx 39$ samples at $\epsilon = 0.1$ through the $\log(1/\delta)$ term.
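The table can be regenerated directly from the formula; a short sketch using the same parameter triples:

```python
import math

# Reproduce the table rows straight from the corollary's formula.
for H, eps, delta in [(10, 0.10, 0.05), (100, 0.10, 0.05), (1_000, 0.10, 0.05),
                      (10_000, 0.10, 0.05), (100, 0.05, 0.05), (100, 0.01, 0.05),
                      (100, 0.10, 0.001)]:
    print(H, eps, delta, math.ceil(math.log(H / delta) / eps))
# Last column: 53, 77, 100, 123, 153, 761, 116
```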

Common Confusions

Watch Out

Realizability is about the class containing f, not about training error being zero.

A learner can achieve zero training error on a finite sample while $f \notin \mathcal{H}$. It just means the sample happened to be consistent with some $h \in \mathcal{H}$, not that the labeling rule itself is in the class. Conversely, realizability does not guarantee that any specific learner finds $h^{*}$; it only guarantees the search space contains a perfect predictor.
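A tiny concrete instance of the first point, with a made-up domain and class: XOR is not expressible by either single-coordinate projection, yet a lucky two-point sample is still fit perfectly.

```python
domain = [(0, 0), (0, 1), (1, 0), (1, 1)]
f = lambda x: x[0] ^ x[1]                       # XOR target
H = [lambda x: x[0], lambda x: x[1]]            # only the two coordinate projections

# f is not realizable by H: no projection matches XOR on the whole domain.
print(any(all(h(x) == f(x) for x in domain) for h in H))   # False

# Yet on a lucky two-point sample, the first projection has zero training error.
S = [((0, 0), 0), ((1, 0), 1)]                  # labels are f's labels
print(all(H[0](x) == y for x, y in S))          # True
```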

Watch Out

Realizability is not the i.i.d. assumption.

The two assumptions are independent. Realizability is about $f$ relative to $\mathcal{H}$; the i.i.d. assumption is about how $S$ is drawn from $\mathcal{D}$. Both are needed for the corollary above. Drop realizability and you move to agnostic PAC. Drop i.i.d. and you move into online learning, distribution shift, or martingale-style analyses.

Worked Example

Example

Realizable axis-aligned rectangles

Let $\mathcal{X} = [0, 1]^2$, let $f$ be the indicator of an unknown rectangle $R^{*} \subseteq \mathcal{X}$, and let $\mathcal{H}$ be the class of all axis-aligned rectangles. By construction $f \in \mathcal{H}$, so realizability holds. The corollary above would give $m = O(\log(|\mathcal{H}|/\delta) / \epsilon)$, but $\mathcal{H}$ is uncountable in this setup, so the bound as stated is vacuous and the analysis must move to the VC dimension. The realizability assumption is doing real work in the analysis; the finite-class wrapper around it is just the simplest illustration.
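The standard ERM rule for this class, returning the tightest axis-aligned rectangle around the positive examples, is easy to simulate. A sketch with an illustrative target rectangle and sample size; the true risk is estimated by Monte Carlo rather than computed in closed form:

```python
import random

random.seed(1)
# Hypothetical target: R* = [0.2, 0.7] x [0.3, 0.8]; f is its indicator.
f = lambda p: int(0.2 <= p[0] <= 0.7 and 0.3 <= p[1] <= 0.8)

# Realizable training set: uniform instances on [0, 1]^2, labels from f.
m = 200
points = [(random.random(), random.random()) for _ in range(m)]
S = [(p, f(p)) for p in points]

# ERM by "tightest fit": smallest axis-aligned rectangle containing every positive point.
pos = [p for p, y in S if y == 1]
if pos:
    x0, x1 = min(p[0] for p in pos), max(p[0] for p in pos)
    y0, y1 = min(p[1] for p in pos), max(p[1] for p in pos)
else:
    x0 = y0 = 1.0
    x1 = y1 = 0.0            # empty rectangle: predict 0 everywhere
h = lambda p: int(x0 <= p[0] <= x1 and y0 <= p[1] <= y1)

# Monte Carlo estimate of the true risk; it shrinks as m grows.
test = [(random.random(), random.random()) for _ in range(100_000)]
print(sum(h(p) != f(p) for p in test) / len(test))
```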

Exercises

ExerciseCore

Problem

Suppose $\mathcal{H} = \{h_1, h_2\}$ with $|\mathcal{H}| = 2$, and the realizable corollary requires accuracy $\epsilon = 0.05$ with confidence $1 - \delta = 0.95$. What is the smallest $m$ the bound certifies?

ExerciseAdvanced

Problem

Show that without realizability even the optimal-in-class predictor can have strictly positive empirical risk, so the ERM hypothesis need not achieve zero training error, and give an example where this gap is the binding constraint on the agnostic-PAC sample complexity.

Summary

Realizability is a small assumption with a large structural effect. Under realizability plus i.i.d. sampling plus a finite hypothesis class, the sample complexity to reach $\epsilon$-accuracy with confidence $1 - \delta$ is $O(\log(|\mathcal{H}|/\delta)/\epsilon)$, a clean linear-in-$1/\epsilon$ rate. Drop realizability and you move to the agnostic regime, where the rate weakens to $1/\epsilon^2$ and the analysis needs uniform convergence rather than a single union bound.

The assumption is rarely true in practice. Most labeling functions are not exactly representable by any practical hypothesis class. But the realizable case is the right starting point, because every concept in PAC learning (the bad-hypothesis decomposition, the union-bound technique, the role of $|\mathcal{H}|$, the sample-complexity scaling) appears in its simplest form here.

References

Canonical:

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), definition 2.1 and corollary 2.3.
  • Kearns & Vazirani, An Introduction to Computational Learning Theory (1994), chapters 1-3.
  • Vapnik, Statistical Learning Theory (1998), chapter 1.

Current:

  • Mohri, Rostamizadeh, Talwalkar, Foundations of Machine Learning (2018), chapter 2.


Last reviewed: May 3, 2026
