

PAC Learning vs. Agnostic PAC Learning

Realizable PAC learning assumes the target is in the hypothesis class. Agnostic PAC drops this assumption and instead competes with the best hypothesis in the class. Agnostic learning is harder: it requires two-sided uniform convergence, and its sample complexity scales as $1/\varepsilon^2$ rather than $1/\varepsilon$.

What Each Assumes

Both PAC and agnostic PAC define what it means for a learning algorithm to succeed. They differ in a single assumption: whether the true target function is in the hypothesis class.

PAC (realizable): There exists $h^* \in \mathcal{H}$ with $R(h^*) = 0$. The target is perfectly representable by the class.

Agnostic PAC: No assumption on the target. The learner competes with $\min_{h \in \mathcal{H}} R(h)$, which may be nonzero.

Side-by-Side Statement

Definition

PAC Learnability (Realizable)

A hypothesis class $\mathcal{H}$ is PAC learnable if there exists an algorithm $A$ and a function $m(\varepsilon, \delta)$ such that: for every distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$ for which there exists $h^* \in \mathcal{H}$ with $R(h^*) = 0$, given $n \geq m(\varepsilon, \delta)$ i.i.d. samples, $A$ returns $\hat{h}$ satisfying:

$$\Pr[R(\hat{h}) \leq \varepsilon] \geq 1 - \delta$$

Definition

Agnostic PAC Learnability

A hypothesis class $\mathcal{H}$ is agnostic PAC learnable if there exists an algorithm $A$ and a function $m(\varepsilon, \delta)$ such that: for every distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$ (no realizability assumption), given $n \geq m(\varepsilon, \delta)$ i.i.d. samples, $A$ returns $\hat{h}$ satisfying:

$$\Pr\!\left[R(\hat{h}) \leq \min_{h \in \mathcal{H}} R(h) + \varepsilon\right] \geq 1 - \delta$$

Where Each Is Stronger

Realizable PAC has better sample complexity

In the realizable setting with a finite hypothesis class, ERM achieves:

$$m(\varepsilon, \delta) = \frac{\log|\mathcal{H}| + \log(1/\delta)}{\varepsilon}$$

The dependence on $\varepsilon$ is $1/\varepsilon$, not $1/\varepsilon^2$. This is because in the realizable case the true risk $R(h^*) = 0$, so any hypothesis with positive empirical risk can be eliminated. You only need one-sided concentration.
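The one-sided argument is short. A fixed hypothesis with $R(h) > \varepsilon$ has zero empirical risk on $n$ i.i.d. samples with probability at most $(1 - \varepsilon)^n \leq e^{-n\varepsilon}$, so a union bound over the class gives

$$\Pr\big[\exists h \in \mathcal{H}: \hat{R}_n(h) = 0 \text{ and } R(h) > \varepsilon\big] \leq |\mathcal{H}|\, e^{-n\varepsilon},$$

which is at most $\delta$ once $n \geq \frac{\log|\mathcal{H}| + \log(1/\delta)}{\varepsilon}$.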

Agnostic PAC applies to real problems

The realizability assumption almost never holds in practice. The Bayes-optimal classifier is rarely in $\mathcal{H}$. Agnostic PAC is the relevant framework for practical learning, because it handles model misspecification.

Sample Complexity Comparison

Theorem

Sample Complexity Gap

Statement

For a finite hypothesis class $\mathcal{H}$ with 0-1 loss:

Realizable PAC: ERM achieves sample complexity $m(\varepsilon, \delta) = \left\lceil \frac{\log(|\mathcal{H}|/\delta)}{\varepsilon} \right\rceil$.

Agnostic PAC: ERM achieves sample complexity $m(\varepsilon, \delta) = \left\lceil \frac{2(\log(2|\mathcal{H}|) + \log(2/\delta))}{\varepsilon^2} \right\rceil$.
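Plugging concrete numbers into the two formulas makes the gap tangible. A minimal sketch; the class size $|\mathcal{H}| = 1000$ and the targets for $\varepsilon$ and $\delta$ are illustrative choices, not values from the text:

```python
import math

def m_realizable(H_size, eps, delta):
    # Realizable bound: ceil(log(|H|/delta) / eps)
    return math.ceil(math.log(H_size / delta) / eps)

def m_agnostic(H_size, eps, delta):
    # Agnostic bound: ceil(2(log(2|H|) + log(2/delta)) / eps^2)
    return math.ceil(2 * (math.log(2 * H_size) + math.log(2 / delta)) / eps**2)

H_size, delta = 1000, 0.05
for eps in (0.1, 0.01):
    print(eps, m_realizable(H_size, eps, delta), m_agnostic(H_size, eps, delta))
```

At $\varepsilon = 0.01$ the agnostic bound exceeds the realizable one by more than two orders of magnitude, reflecting the $1/\varepsilon$ vs. $1/\varepsilon^2$ rates.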

Intuition

The gap comes from the difference between one-sided and two-sided concentration. In the realizable case, $R(h^*) = 0$, so $\hat{R}_n(h^*) = 0$ always. Any hypothesis with $\hat{R}_n(h) > 0$ is certified to be imperfect. The probability that any "dangerous" hypothesis (one with $\hat{R}_n = 0$ but $R > \varepsilon$) survives shrinks exponentially in $n\varepsilon$.

In the agnostic case, no hypothesis has zero risk. You must estimate $R(h)$ to accuracy $\varepsilon$ for all $h$ simultaneously, which requires $n \sim 1/\varepsilon^2$ by standard concentration.
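A quick simulation illustrates why estimating a nonzero risk to accuracy $\varepsilon$ costs $n \sim 1/\varepsilon^2$: the spread of the empirical risk shrinks only like $1/\sqrt{n}$. This sketch assumes a single fixed hypothesis whose 0-1 loss is Bernoulli with true risk $p = 0.3$ (an illustrative value):

```python
import random
import statistics

random.seed(0)
p = 0.3  # true risk of a fixed hypothesis (illustrative)

def empirical_risk(n):
    # Empirical risk on n i.i.d. samples: mean of n Bernoulli(p) losses.
    return sum(random.random() < p for _ in range(n)) / n

# Spread of the empirical risk across repeated draws, for two sample sizes.
for n in (100, 10000):
    estimates = [empirical_risk(n) for _ in range(200)]
    print(n, statistics.stdev(estimates))
```

Multiplying $n$ by 100 shrinks the spread by roughly a factor of 10, so halving the target accuracy $\varepsilon$ quadruples the required sample size.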

Failure Mode

For infinite hypothesis classes, the finite-class analysis does not apply. The fundamental theorem of statistical learning shows that for binary classification, $\mathcal{H}$ is agnostic PAC learnable if and only if it has finite VC dimension $d$, with sample complexity $\Theta(d/\varepsilon^2)$.

The Mechanism: Why Agnostic Is Harder

In realizable PAC, the algorithm only needs to identify a consistent hypothesis (one with zero empirical risk). The set of consistent hypotheses is a "version space" that shrinks with more data, and any element of this set has low true risk.
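The version-space idea can be sketched with a finite class of threshold classifiers $h_t(x) = \mathbb{1}[x \geq t]$ and a realizable labeling; the grid of thresholds and the true target $t^* = 0.5$ below are illustrative choices:

```python
import random

random.seed(1)
thresholds = [i / 20 for i in range(21)]  # finite class: h_t(x) = 1[x >= t]
t_star = 0.5                              # true target is in the class (realizability)

def draw(n):
    # n i.i.d. uniform points labeled by the target threshold.
    xs = [random.random() for _ in range(n)]
    return [(x, int(x >= t_star)) for x in xs]

def version_space(sample):
    # Hypotheses consistent with the sample, i.e. with zero empirical risk.
    return [t for t in thresholds
            if all(int(x >= t) == y for x, y in sample)]

for n in (5, 50, 500):
    print(n, len(version_space(draw(n))))
```

The version space always contains $t^*$ and can only shrink as more labeled points arrive, which is why returning any consistent hypothesis suffices in the realizable setting.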

In agnostic PAC, the algorithm must uniformly estimate the risk of all hypotheses. This requires two-sided uniform convergence:

$$\sup_{h \in \mathcal{H}} |R(h) - \hat{R}_n(h)| \leq \varepsilon$$

Uniform convergence is sufficient for agnostic PAC learning: if the above holds, then the ERM hypothesis satisfies $R(\hat{h}) \leq \min_h R(h) + 2\varepsilon$. For binary classification with finite VC dimension, uniform convergence is also necessary.
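The factor of $2\varepsilon$ comes from a three-step chain, with $\hat{h}$ the ERM output and $h^* \in \arg\min_{h \in \mathcal{H}} R(h)$:

$$R(\hat{h}) \;\leq\; \hat{R}_n(\hat{h}) + \varepsilon \;\leq\; \hat{R}_n(h^*) + \varepsilon \;\leq\; R(h^*) + 2\varepsilon$$

The first and third inequalities use uniform convergence; the middle one uses the fact that $\hat{h}$ minimizes empirical risk.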

Key Assumptions That Differ

| | Realizable PAC | Agnostic PAC |
| --- | --- | --- |
| Realizability | $\exists h^* \in \mathcal{H}$ with $R(h^*) = 0$ | No assumption |
| Success criterion | $R(\hat{h}) \leq \varepsilon$ | $R(\hat{h}) \leq \min_h R(h) + \varepsilon$ |
| Sample complexity | $O(\log\lvert\mathcal{H}\rvert/\varepsilon)$ | $O(\log\lvert\mathcal{H}\rvert/\varepsilon^2)$ |
| Required tool | One-sided concentration | Uniform convergence |
| Practical relevance | Rarely applicable | Standard setting |

The Fundamental Theorem

For binary classification, PAC learnability and agnostic PAC learnability are equivalent: both are characterized by finite VC dimension. The difference is only in the sample complexity rate ($1/\varepsilon$ vs. $1/\varepsilon^2$). This equivalence does not extend to all loss functions. For some losses, agnostic learnability is strictly harder than realizable learnability.

Common Confusions

Watch Out

ERM works in both settings, but for different reasons

In the realizable case, ERM succeeds because any consistent hypothesis has low risk. In the agnostic case, ERM succeeds because uniform convergence ensures the empirical risk is a good proxy for true risk everywhere. The proofs are structurally different, even though the algorithm (minimize training loss) is the same.

Watch Out

Agnostic PAC does not mean the learner is worse

The learner in agnostic PAC competes with the best hypothesis in the class, not with the Bayes-optimal predictor. If $\mathcal{H}$ is rich enough that $\min_h R(h) \approx 0$, agnostic PAC gives a bound close to what realizable PAC gives. The extra $\varepsilon$ is relative to the best-in-class risk, not an absolute degradation.

Watch Out

The $1/\varepsilon$ vs. $1/\varepsilon^2$ gap is real but context-dependent

The gap between $1/\varepsilon$ and $1/\varepsilon^2$ sample complexity is a genuine difference in the finite-class case. For infinite classes with VC dimension $d$, the realizable rate is $O(d/\varepsilon)$ and the agnostic rate is $O(d/\varepsilon^2)$. However, "fast rates" in agnostic learning (using Bernstein-type conditions) can recover the $1/\varepsilon$ rate when the best hypothesis has small risk. The gap is worst when $\min_h R(h)$ is large.
