

Realizability Assumption

The technical assumption that the target function lies inside the hypothesis class. Realizable PAC learning is the simpler half of the story; agnostic PAC drops this assumption.


Why This Matters

The realizability assumption is one of the cleanest dividing lines in learning theory. It says: the target function the data was labeled by is itself in the hypothesis class the learner is searching over. When the assumption holds, ERM generalizes via a short union-bound argument. When it fails, you need uniform convergence and a stronger result for the agnostic case.

The assumption is named in Shalev-Shwartz and Ben-David (definition 2.1). It is also the implicit setup behind most introductory treatments of PAC, because the derivation is simpler under it. Most applied learning settings violate it: there is no reason a real-world labeling function is exactly representable by, say, the class of axis-aligned rectangles. The realizable case is therefore a stepping stone, not the destination.

Mental Model

Two objects sit underneath every supervised problem: an instance distribution $\mathcal{D}$ over $\mathcal{X}$, and a labeling rule $f : \mathcal{X} \to \mathcal{Y}$. The realizability assumption is a statement about $f$ relative to a hypothesis class $\mathcal{H}$.

If $f \in \mathcal{H}$, then some hypothesis in the class achieves zero population error, and ERM can match the training data exactly. So the gap to bound is between the empirical and true error of the selected ERM hypothesis, not a gap to some better predictor outside the class.

If $f \notin \mathcal{H}$, the best hypothesis in $\mathcal{H}$ generally has some nonzero error $\epsilon_{\mathcal{H}}^{*}$, and ERM has to compete against that floor rather than against zero. That regime is what agnostic PAC handles.

Accuracy and Confidence Parameters

Two parameters drive every PAC-style guarantee. They name the two ways the learner can fail.

  • Accuracy parameter $\epsilon \in (0, 1)$: the largest population error the learner is willing to accept. Returning $h$ with $L_{(\mathcal{D}, f)}(h) \le \epsilon$ is approximately correct; returning $h$ with $L_{(\mathcal{D}, f)}(h) > \epsilon$ counts as a failure.
  • Confidence parameter $\delta \in (0, 1)$: the largest probability of failure the learner is willing to tolerate. The training sample $S$ is itself random, so even a correct procedure has some chance of landing on a non-representative sample. With probability at least $1 - \delta$ the bound certifies $L_{(\mathcal{D}, f)}(h_S) \le \epsilon$; with probability at most $\delta$ the certificate is silent.

The pair $(\epsilon, \delta)$ controls both knobs at once. The corollary below certifies a single sample size $m$ that delivers $\epsilon$-accuracy with confidence $1 - \delta$.

Formal Setup

Let $\mathcal{X}$ be a domain set, $\mathcal{Y} = \{0, 1\}$ a label set, $\mathcal{D}$ a distribution over $\mathcal{X}$, and $f : \mathcal{X} \to \mathcal{Y}$ a target labeling function.

Definition

True risk under target

The true risk of a hypothesis $h$ under $\mathcal{D}$ and $f$ is the probability that $h$ disagrees with $f$ on a fresh sample:

$$L_{(\mathcal{D}, f)}(h) := \mathbb{P}_{x \sim \mathcal{D}}[h(x) \ne f(x)] = \mathcal{D}(\{x : h(x) \ne f(x)\}).$$

This is the quantity the learner cannot directly compute, because $\mathcal{D}$ and $f$ are both unknown.
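When the domain is finite and the distribution is given explicitly, the true risk is just a weighted count of disagreements. A minimal sketch in Python; the finite domain, the pmf, and the function names are illustrative, not part of the formal setup:

```python
def true_risk(h, f, pmf):
    """P_{x ~ D}[h(x) != f(x)] when D is an explicit {point: probability} map."""
    return sum(p for x, p in pmf.items() if h(x) != f(x))

pmf = {x: 0.25 for x in range(4)}          # uniform D on {0, 1, 2, 3}
f = lambda x: int(x >= 2)                  # target labeling rule
h = lambda x: int(x >= 3)                  # candidate hypothesis
print(true_risk(h, f, pmf))                # 0.25 -- they disagree only at x = 2
```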

Definition

Empirical risk

For a training sample $S = ((x_1, y_1), \ldots, (x_m, y_m))$, the empirical risk is the fraction of training points on which $h$ disagrees with the observed label:

$$L_S(h) := \frac{|\{i \in [m] : h(x_i) \ne y_i\}|}{m}.$$

Under realizability with i.i.d. samples, the labels are $y_i = f(x_i)$, so the perfect hypothesis $h^{*} \in \mathcal{H}$ satisfies $L_S(h^{*}) = 0$ with probability one.
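The empirical counterpart needs only the sample. A companion sketch, reusing the toy target and hypothesis from the previous block (again purely illustrative):

```python
def empirical_risk(h, S):
    """Fraction of labeled examples (x, y) in S that h gets wrong."""
    return sum(1 for x, y in S if h(x) != y) / len(S)

f = lambda x: int(x >= 2)                  # target from the previous sketch
h = lambda x: int(x >= 3)
S = [(x, f(x)) for x in [0, 1, 1, 3, 2]]   # realizable labels: y_i = f(x_i)
print(empirical_risk(f, S))                # 0.0 -- the target is never wrong
print(empirical_risk(h, S))                # 0.2 -- h mislabels the x = 2 point
```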

Definition

Realizability assumption

The hypothesis class $\mathcal{H}$ realizes the target $(\mathcal{D}, f)$ if there exists $h^{*} \in \mathcal{H}$ with $L_{(\mathcal{D}, f)}(h^{*}) = 0$. Equivalently, $f$ is in $\mathcal{H}$ up to measure zero under $\mathcal{D}$.

The assumption says nothing about how the learner finds $h^{*}$. It only says the search space contains a perfect predictor.
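On a finite domain the assumption can be checked by brute force. A sketch under that (unrealistically small) setup, with illustrative names; real hypothesis classes are rarely enumerable like this:

```python
def is_realizable(H, f, support):
    """True iff some h in H agrees with f on every point D can produce."""
    return any(all(h(x) == f(x) for x in support) for h in H)

H = [lambda x, t=t: int(x >= t) for t in range(5)]          # thresholds on {0,...,4}
print(is_realizable(H, lambda x: int(x >= 2), range(4)))    # True:  f is itself a threshold
print(is_realizable(H, lambda x: int(x == 1), range(4)))    # False: no threshold matches
```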

Main Theorem

Corollary

Realizable Finite-Class Sample Complexity

Statement

Fix $\epsilon, \delta > 0$. If $\mathcal{H}$ is a finite hypothesis class and the realizability assumption holds, then for every

$$m \ge \frac{\log(|\mathcal{H}| / \delta)}{\epsilon},$$

with probability at least $1 - \delta$ over the i.i.d. draw of $S$, every ERM hypothesis $h_S$ satisfies $L_{(\mathcal{D}, f)}(h_S) \le \epsilon$.
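The threshold is one line of arithmetic. A small helper, with example values chosen for illustration:

```python
import math

def sample_complexity(h_size, eps, delta):
    """Smallest integer m with m >= log(|H| / delta) / eps."""
    return math.ceil(math.log(h_size / delta) / eps)

print(sample_complexity(1_000, 0.10, 0.05))   # 100
print(sample_complexity(1_000, 0.05, 0.05))   # 199 -- halving eps roughly doubles m
```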

Intuition

Under realizability, ERM is forced to pick some $h_S$ with $L_S(h_S) = 0$ (because $h^{*}$ achieves zero training error and ERM picks an empirical minimizer). So the only way ERM fails is if a "bad" hypothesis with population error larger than $\epsilon$ also happens to have zero training error on this particular sample. The probability that any fixed bad hypothesis survives $m$ i.i.d. coin flips is at most $(1 - \epsilon)^m \le e^{-\epsilon m}$. A union bound over the at most $|\mathcal{H}|$ bad hypotheses gives $|\mathcal{H}|\, e^{-\epsilon m}$. Setting that $\le \delta$ and solving recovers the threshold.
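To get a feel for the sizes involved, a quick numeric check with arbitrarily chosen values $\epsilon = 0.1$, $m = 100$, $|\mathcal{H}| = 1000$:

```python
import math

eps, m, H_size = 0.10, 100, 1_000
survive_one = (1 - eps) ** m         # ~2.7e-5: one fixed bad hypothesis looks perfect
exp_bound   = math.exp(-eps * m)     # ~4.5e-5: the relaxation (1 - eps)^m <= e^(-eps*m)
union_bound = H_size * exp_bound     # ~0.045:  worst case over the whole class
print(survive_one, exp_bound, union_bound)
```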

Proof Sketch

Let $\mathcal{H}_B = \{h \in \mathcal{H} : L_{(\mathcal{D}, f)}(h) > \epsilon\}$ be the set of bad hypotheses. Let $M = \{S : \exists h \in \mathcal{H}_B,\, L_S(h) = 0\}$ be the set of misleading samples. Realizability, plus the fact that ERM attains zero empirical risk, implies $\{S : L_{(\mathcal{D}, f)}(h_S) > \epsilon\} \subseteq M$. The union bound gives $\mathbb{P}[M] \le \sum_{h \in \mathcal{H}_B} \mathbb{P}[L_S(h) = 0]$. For each $h \in \mathcal{H}_B$, the probability that all $m$ training points land in its agreement set is at most $(1 - \epsilon)^m \le e^{-\epsilon m}$. So $\mathbb{P}[M] \le |\mathcal{H}_B|\, e^{-\epsilon m} \le |\mathcal{H}|\, e^{-\epsilon m}$. Setting this $\le \delta$ and solving for $m$ recovers the bound.
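The whole argument can be sanity-checked by simulation. The sketch below uses a made-up finite class of integer thresholds with the target inside it, runs ERM many times at the corollary's sample size, and estimates the failure probability; the setup and the tie-breaking rule are illustrative choices, not part of the theorem:

```python
import math
import random

random.seed(0)
domain = range(100)                      # X = {0, ..., 99}, D uniform
thresholds = range(101)                  # H = {h_t : h_t(x) = 1[x >= t]}, |H| = 101
target_t = 30                            # f = h_30, so realizability holds
eps, delta = 0.10, 0.05
m = math.ceil(math.log(len(thresholds) / delta) / eps)    # corollary threshold: 77

def true_risk(t):
    # h_t and h_30 disagree exactly on the integers between the two thresholds
    return abs(t - target_t) / len(domain)

def erm(S):
    # any empirical minimizer works; ties broken toward the smallest threshold
    return min(thresholds,
               key=lambda t: (sum(int(x >= t) != y for x, y in S), t))

trials, failures = 2000, 0
for _ in range(trials):
    S = [(x, int(x >= target_t)) for x in random.choices(domain, k=m)]
    failures += true_risk(erm(S)) > eps

print(m, failures / trials)   # observed failure rate sits far below delta = 0.05
```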

Why It Matters

This is the simplest sample-complexity result in PAC learning. Three things stand out: the rate is $1/\epsilon$ rather than $1/\epsilon^2$ (the agnostic rate), the dependence on $\mathcal{H}$ is only logarithmic in $|\mathcal{H}|$, and there is no Hoeffding step, since a pure union bound suffices. Each of these structural properties weakens when realizability is dropped.

Failure Mode

The bound is vacuous when $f \notin \mathcal{H}$. In that case the ERM hypothesis can have $L_S(h_S) > 0$, so the proof's anchor (that ERM achieves zero training error) is gone. The agnostic version of the result uses Hoeffding's inequality plus a union bound and recovers a $1/\epsilon^2$ rate.

Step-by-step Proof of Corollary 2.3

The full argument packages four ideas: a decomposition of the failure event, the union bound, an i.i.d. factorization, and the inequality $1 - \epsilon \le e^{-\epsilon}$. Each appears once and the corollary falls out.

Step 1: Set up the failure event. Let $S = ((x_1, y_1), \ldots, (x_m, y_m))$ with i.i.d. instances $x_i \sim \mathcal{D}$ and labels $y_i = f(x_i)$. Write $S|_x = (x_1, \ldots, x_m)$ for the unlabeled instance vector. The probability we want to upper bound is

$$\mathcal{D}^m\bigl(\bigl\{S|_x : L_{(\mathcal{D}, f)}(h_S) > \epsilon\bigr\}\bigr).$$

Sample randomness lives on the $m$-fold product space $(\mathcal{X}^m, \mathcal{D}^m)$; $h_S$ depends on $S|_x$ through the ERM rule.

Step 2: Define the bad and misleading sets. A hypothesis is bad if its true risk exceeds the accuracy parameter:

$$\mathcal{H}_B := \bigl\{h \in \mathcal{H} : L_{(\mathcal{D}, f)}(h) > \epsilon\bigr\}.$$

A sample is misleading if it makes some bad hypothesis look perfect on the training set:

$$M := \bigl\{S|_x : \exists h \in \mathcal{H}_B,\ L_S(h) = 0\bigr\} = \bigcup_{h \in \mathcal{H}_B} \bigl\{S|_x : L_S(h) = 0\bigr\}.$$

Step 3: Establish the failure-event subset relation. Realizability gives $h^{*} \in \mathcal{H}$ with $L_{(\mathcal{D}, f)}(h^{*}) = 0$, so $L_S(h^{*}) = 0$ holds with probability one (the agreement set has full $\mathcal{D}$-measure, and i.i.d. sampling lands every $x_i$ in it almost surely). ERM picks an empirical minimizer, so $L_S(h_S) = 0$ as well. If additionally $L_{(\mathcal{D}, f)}(h_S) > \epsilon$ (the failure event), then by definition $h_S \in \mathcal{H}_B$, and the sample $S|_x$ is misleading. Therefore

$$\bigl\{S|_x : L_{(\mathcal{D}, f)}(h_S) > \epsilon\bigr\} \subseteq M.$$

Step 4: Apply the union bound. Probability is monotone under set inclusion, and $M$ is a finite union over $\mathcal{H}_B$. Combining,

$$\mathcal{D}^m\bigl(\bigl\{S|_x : L_{(\mathcal{D}, f)}(h_S) > \epsilon\bigr\}\bigr) \le \mathcal{D}^m(M) \le \sum_{h \in \mathcal{H}_B} \mathcal{D}^m\bigl(\bigl\{S|_x : L_S(h) = 0\bigr\}\bigr).$$

Step 5: Factor the per-hypothesis term over the i.i.d. coordinates. For any fixed $h$, the event $L_S(h) = 0$ holds iff $h(x_i) = f(x_i)$ for every $i$. Independence factors the joint probability into a product of marginals:

$$\mathcal{D}^m\bigl(\bigl\{S|_x : L_S(h) = 0\bigr\}\bigr) = \prod_{i=1}^{m} \mathcal{D}\bigl(\bigl\{x_i : h(x_i) = f(x_i)\bigr\}\bigr) = \bigl(1 - L_{(\mathcal{D}, f)}(h)\bigr)^m.$$

The first equality uses independence; the second uses the definition of the true risk, since the $\mathcal{D}$-probability that $h$ agrees with $f$ on a fresh draw is $1 - L_{(\mathcal{D}, f)}(h)$.

Step 6: Apply the exponential bound. For $h \in \mathcal{H}_B$, $L_{(\mathcal{D}, f)}(h) > \epsilon$, so

$$\bigl(1 - L_{(\mathcal{D}, f)}(h)\bigr)^m \le (1 - \epsilon)^m \le e^{-\epsilon m}.$$

The first inequality is monotonicity of $x \mapsto x^m$ on $[0, 1]$. The second uses $1 - x \le e^{-x}$ for $x \ge 0$, which follows from convexity: the tangent to $e^{-x}$ at $x = 0$ is $1 - x$, and a convex function lies above each of its tangents.
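For completeness, a one-line calculus version of the same inequality (an elementary standard argument, not quoted from the source text):

```latex
\[
g(x) := e^{-x} - (1 - x), \qquad g(0) = 0, \qquad g'(x) = 1 - e^{-x} \ge 0 \ \ (x \ge 0),
\]
\[
\text{so } g \text{ is nondecreasing on } [0, \infty) \text{ and } e^{-x} \ge 1 - x \text{ for all } x \ge 0.
\]
```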

Step 7: Combine and solve for $m$. Substituting the per-hypothesis bound back into the union-bound inequality,

$$\mathcal{D}^m\bigl(\bigl\{S|_x : L_{(\mathcal{D}, f)}(h_S) > \epsilon\bigr\}\bigr) \le |\mathcal{H}_B|\, e^{-\epsilon m} \le |\mathcal{H}|\, e^{-\epsilon m}.$$

The failure probability is at most $|\mathcal{H}|\, e^{-\epsilon m}$. To make the right-hand side at most $\delta$, solve:

$$|\mathcal{H}|\, e^{-\epsilon m} \le \delta \iff e^{-\epsilon m} \le \delta / |\mathcal{H}| \iff m \ge \frac{\log(|\mathcal{H}| / \delta)}{\epsilon}.$$

This is exactly the threshold in the corollary statement. $\blacksquare$

Numerical Scaling

The bound is logarithmic in $|\mathcal{H}|$ and linear in $1/\epsilon$ and $\log(1/\delta)$. Concretely:

| $\lvert\mathcal{H}\rvert$ | $\epsilon$ | $\delta$ | $\lceil \log(\lvert\mathcal{H}\rvert / \delta) / \epsilon \rceil$ |
| --- | --- | --- | --- |
| 10 | 0.10 | 0.05 | 53 |
| 100 | 0.10 | 0.05 | 77 |
| 1,000 | 0.10 | 0.05 | 100 |
| 10,000 | 0.10 | 0.05 | 123 |
| 100 | 0.05 | 0.05 | 153 |
| 100 | 0.01 | 0.05 | 761 |
| 100 | 0.10 | 0.001 | 116 |
Reading the table: each decade of growth in $|\mathcal{H}|$ costs roughly $\log(10)/\epsilon \approx 23$ extra samples at $\epsilon = 0.1$; halving $\epsilon$ doubles $m$; tightening $\delta$ from $0.05$ to $0.001$ adds $\log(50)/\epsilon \approx 39$ samples at $\epsilon = 0.1$ through the $\log(1/\delta)$ term.
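The table can be regenerated directly from the formula; a short sketch using the same parameter triples:

```python
import math

# Reproduce the table rows straight from the corollary's formula.
for H, eps, delta in [(10, 0.10, 0.05), (100, 0.10, 0.05), (1_000, 0.10, 0.05),
                      (10_000, 0.10, 0.05), (100, 0.05, 0.05), (100, 0.01, 0.05),
                      (100, 0.10, 0.001)]:
    print(H, eps, delta, math.ceil(math.log(H / delta) / eps))
# Last column: 53, 77, 100, 123, 153, 761, 116
```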

Common Confusions

Watch Out

Realizability is about the class containing f, not about training error being zero.

A learner can achieve zero training error on a finite sample while $f \notin \mathcal{H}$. It just means the sample happened to be consistent with some $h \in \mathcal{H}$, not that the labeling rule itself is in the class. Conversely, realizability does not guarantee that any specific learner finds $h^{*}$; it only guarantees the search space contains a perfect predictor.
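A tiny concrete instance of the first point, with a made-up domain and class: XOR is not expressible by either single-coordinate projection, yet a lucky two-point sample is still fit perfectly.

```python
domain = [(0, 0), (0, 1), (1, 0), (1, 1)]
f = lambda x: x[0] ^ x[1]                       # XOR target
H = [lambda x: x[0], lambda x: x[1]]            # only the two coordinate projections

# f is not realizable by H: no projection matches XOR on the whole domain.
print(any(all(h(x) == f(x) for x in domain) for h in H))   # False

# Yet on a lucky two-point sample, the first projection has zero training error.
S = [((0, 0), 0), ((1, 0), 1)]                  # labels are f's labels
print(all(H[0](x) == y for x, y in S))          # True
```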

Watch Out

Realizability is not the i.i.d. assumption.

The two assumptions are independent. Realizability is about $f$ relative to $\mathcal{H}$; the i.i.d. assumption is about how $S$ is drawn from $\mathcal{D}$. Both are needed for the corollary above. Drop realizability and you move to agnostic PAC. Drop i.i.d. and you move into online learning, distribution shift, or martingale-style analyses.

Worked Example

Example

Realizable axis-aligned rectangles

Let $\mathcal{X} = [0, 1]^2$, let $f$ be the indicator of an unknown rectangle $R^{*} \subseteq \mathcal{X}$, and let $\mathcal{H}$ be the class of all axis-aligned rectangles. By construction $f \in \mathcal{H}$, so realizability holds. The corollary above would give $m = O(\log(|\mathcal{H}|/\delta) / \epsilon)$, but $\mathcal{H}$ is uncountable in this setup, so the bound as stated is vacuous and the analysis must move to the VC dimension. The realizability assumption is doing real work in the analysis; the finite-class wrapper around it is just the simplest illustration.
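The standard ERM rule for this class, returning the tightest axis-aligned rectangle around the positive examples, is easy to simulate. A sketch with an illustrative target rectangle and sample size; the true risk is estimated by Monte Carlo rather than computed in closed form:

```python
import random

random.seed(1)
# Hypothetical target: R* = [0.2, 0.7] x [0.3, 0.8]; f is its indicator.
f = lambda p: int(0.2 <= p[0] <= 0.7 and 0.3 <= p[1] <= 0.8)

# Realizable training set: uniform instances on [0, 1]^2, labels from f.
m = 200
points = [(random.random(), random.random()) for _ in range(m)]
S = [(p, f(p)) for p in points]

# ERM by "tightest fit": smallest axis-aligned rectangle containing every positive point.
pos = [p for p, y in S if y == 1]
if pos:
    x0, x1 = min(p[0] for p in pos), max(p[0] for p in pos)
    y0, y1 = min(p[1] for p in pos), max(p[1] for p in pos)
else:
    x0 = y0 = 1.0
    x1 = y1 = 0.0            # empty rectangle: predict 0 everywhere
h = lambda p: int(x0 <= p[0] <= x1 and y0 <= p[1] <= y1)

# Monte Carlo estimate of the true risk; it shrinks as m grows.
test = [(random.random(), random.random()) for _ in range(100_000)]
print(sum(h(p) != f(p) for p in test) / len(test))
```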

Exercises

ExerciseCore

Problem

Suppose $\mathcal{H} = \{h_1, h_2\}$ with $|\mathcal{H}| = 2$, and the realizable corollary requires accuracy $\epsilon = 0.05$ with confidence $1 - \delta = 0.95$. What is the smallest $m$ the bound certifies?

ExerciseAdvanced

Problem

Show that without realizability even the optimal-in-class predictor can have strictly positive empirical risk, so the ERM hypothesis need not achieve zero training error, and give an example where this gap is the binding constraint on the agnostic-PAC sample complexity.

Summary

Realizability is a small assumption with a large structural effect. Under realizability plus i.i.d. sampling plus a finite hypothesis class, the sample complexity to reach $\epsilon$-accuracy with confidence $1 - \delta$ is $O(\log(|\mathcal{H}|/\delta)/\epsilon)$, a clean linear-in-$1/\epsilon$ rate. Drop realizability and you move to the agnostic regime, where the rate weakens to $1/\epsilon^2$ and the analysis needs uniform convergence rather than a single union bound.

The assumption is rarely true in practice. Most labeling functions are not exactly representable by any practical hypothesis class. But the realizable case is the right starting point, because every concept in PAC learning (the bad-hypothesis decomposition, the union-bound technique, the role of $|\mathcal{H}|$, the sample-complexity scaling) appears in its simplest form here.

References

Canonical:

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), definition 2.1 and corollary 2.3.
  • Kearns & Vazirani, An Introduction to Computational Learning Theory (1994), chapters 1-3.
  • Vapnik, Statistical Learning Theory (1998), chapter 1.

Current:

  • Mohri, Rostamizadeh, Talwalkar, Foundations of Machine Learning (2018), chapter 2.


Last reviewed: May 3, 2026
