

PAC Learning vs. Agnostic PAC Learning

Realizable PAC learning assumes the target is in the hypothesis class. Agnostic PAC drops this assumption and instead competes with the best hypothesis in the class. Agnostic learning is harder: it requires two-sided uniform convergence, and its sample complexity scales as $1/\varepsilon^2$ rather than $1/\varepsilon$.

What Each Assumes

Both PAC and agnostic PAC define what it means for a learning algorithm to succeed. They differ in a single assumption: whether the true target function is in the hypothesis class.

PAC (realizable): There exists $h^* \in \mathcal{H}$ with $R(h^*) = 0$. The target is perfectly representable by the class.

Agnostic PAC: No assumption on the target. The learner competes with $\min_{h \in \mathcal{H}} R(h)$, which may be nonzero.

Side-by-Side Statement

Definition

PAC Learnability (Realizable)

A hypothesis class $\mathcal{H}$ is PAC learnable if there exists an algorithm $A$ and a function $m(\varepsilon, \delta)$ such that: for every distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$ for which there exists $h^* \in \mathcal{H}$ with $R(h^*) = 0$, given $n \geq m(\varepsilon, \delta)$ i.i.d. samples, $A$ returns $\hat{h}$ satisfying:

$$\Pr[R(\hat{h}) \leq \varepsilon] \geq 1 - \delta$$

Definition

Agnostic PAC Learnability

A hypothesis class $\mathcal{H}$ is agnostic PAC learnable if there exists an algorithm $A$ and a function $m(\varepsilon, \delta)$ such that: for every distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$ (no realizability assumption), given $n \geq m(\varepsilon, \delta)$ i.i.d. samples, $A$ returns $\hat{h}$ satisfying:

$$\Pr\!\left[R(\hat{h}) \leq \min_{h \in \mathcal{H}} R(h) + \varepsilon\right] \geq 1 - \delta$$

Where Each Is Stronger

Realizable PAC has better sample complexity

In the realizable setting with a finite hypothesis class, ERM achieves:

$$m(\varepsilon, \delta) = \frac{\log|\mathcal{H}| + \log(1/\delta)}{\varepsilon}$$

The dependence on $\varepsilon$ is $1/\varepsilon$, not $1/\varepsilon^2$. This is because in the realizable case the true risk $R(h^*) = 0$, so any hypothesis with positive empirical risk can be eliminated. You only need one-sided concentration.
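The one-sided argument is short. A fixed hypothesis with $R(h) > \varepsilon$ has zero empirical risk on $n$ i.i.d. samples with probability at most $(1 - \varepsilon)^n \leq e^{-n\varepsilon}$, so a union bound over the class gives

$$\Pr\big[\exists h \in \mathcal{H}: \hat{R}_n(h) = 0 \text{ and } R(h) > \varepsilon\big] \leq |\mathcal{H}|\, e^{-n\varepsilon},$$

which is at most $\delta$ once $n \geq \frac{\log|\mathcal{H}| + \log(1/\delta)}{\varepsilon}$.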

Agnostic PAC applies to real problems

The realizability assumption almost never holds in practice. The Bayes-optimal classifier is rarely in $\mathcal{H}$. Agnostic PAC is the relevant framework for practical learning, because it handles model misspecification.

Sample Complexity Comparison

Theorem

Sample Complexity Gap

Statement

For a finite hypothesis class $\mathcal{H}$ with 0-1 loss:

Realizable PAC: ERM achieves sample complexity $m(\varepsilon, \delta) = \left\lceil \frac{\log(|\mathcal{H}|/\delta)}{\varepsilon} \right\rceil$.

Agnostic PAC: ERM achieves sample complexity $m(\varepsilon, \delta) = \left\lceil \frac{2(\log(2|\mathcal{H}|) + \log(2/\delta))}{\varepsilon^2} \right\rceil$.
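Plugging concrete numbers into the two formulas makes the gap tangible. A minimal sketch; the class size $|\mathcal{H}| = 1000$ and the targets for $\varepsilon$ and $\delta$ are illustrative choices, not values from the text:

```python
import math

def m_realizable(H_size, eps, delta):
    # Realizable bound: ceil(log(|H|/delta) / eps)
    return math.ceil(math.log(H_size / delta) / eps)

def m_agnostic(H_size, eps, delta):
    # Agnostic bound: ceil(2(log(2|H|) + log(2/delta)) / eps^2)
    return math.ceil(2 * (math.log(2 * H_size) + math.log(2 / delta)) / eps**2)

H_size, delta = 1000, 0.05
for eps in (0.1, 0.01):
    print(eps, m_realizable(H_size, eps, delta), m_agnostic(H_size, eps, delta))
```

At $\varepsilon = 0.01$ the agnostic bound exceeds the realizable one by more than two orders of magnitude, reflecting the $1/\varepsilon$ vs. $1/\varepsilon^2$ rates.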

Intuition

The gap comes from the difference between one-sided and two-sided concentration. In the realizable case, $R(h^*) = 0$, so $\hat{R}_n(h^*) = 0$ always. Any hypothesis with $\hat{R}_n(h) > 0$ is certified to be imperfect. The probability that any "dangerous" hypothesis (one with $\hat{R}_n = 0$ but $R > \varepsilon$) survives shrinks exponentially in $n\varepsilon$.

In the agnostic case, no hypothesis has zero risk. You must estimate $R(h)$ to accuracy $\varepsilon$ for all $h$ simultaneously, which requires $n \sim 1/\varepsilon^2$ by standard concentration.
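A quick simulation illustrates why estimating a nonzero risk to accuracy $\varepsilon$ costs $n \sim 1/\varepsilon^2$: the spread of the empirical risk shrinks only like $1/\sqrt{n}$. This sketch assumes a single fixed hypothesis whose 0-1 loss is Bernoulli with true risk $p = 0.3$ (an illustrative value):

```python
import random
import statistics

random.seed(0)
p = 0.3  # true risk of a fixed hypothesis (illustrative)

def empirical_risk(n):
    # Empirical risk on n i.i.d. samples: mean of n Bernoulli(p) losses.
    return sum(random.random() < p for _ in range(n)) / n

# Spread of the empirical risk across repeated draws, for two sample sizes.
for n in (100, 10000):
    estimates = [empirical_risk(n) for _ in range(200)]
    print(n, statistics.stdev(estimates))
```

Multiplying $n$ by 100 shrinks the spread by roughly a factor of 10, so halving the target accuracy $\varepsilon$ quadruples the required sample size.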

Failure Mode

For infinite hypothesis classes, the finite-class analysis does not apply. The fundamental theorem of statistical learning shows that for binary classification, $\mathcal{H}$ is agnostic PAC learnable if and only if it has finite VC dimension $d$, with sample complexity $\Theta(d/\varepsilon^2)$.

The Mechanism: Why Agnostic Is Harder

In realizable PAC, the algorithm only needs to identify a consistent hypothesis (one with zero empirical risk). The set of consistent hypotheses is a "version space" that shrinks with more data, and any element of this set has low true risk.
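The version-space idea can be sketched with a finite class of threshold classifiers $h_t(x) = \mathbb{1}[x \geq t]$ and a realizable labeling; the grid of thresholds and the true target $t^* = 0.5$ below are illustrative choices:

```python
import random

random.seed(1)
thresholds = [i / 20 for i in range(21)]  # finite class: h_t(x) = 1[x >= t]
t_star = 0.5                              # true target is in the class (realizability)

def draw(n):
    # n i.i.d. uniform points labeled by the target threshold.
    xs = [random.random() for _ in range(n)]
    return [(x, int(x >= t_star)) for x in xs]

def version_space(sample):
    # Hypotheses consistent with the sample, i.e. with zero empirical risk.
    return [t for t in thresholds
            if all(int(x >= t) == y for x, y in sample)]

for n in (5, 50, 500):
    print(n, len(version_space(draw(n))))
```

The version space always contains $t^*$ and can only shrink as more labeled points arrive, which is why returning any consistent hypothesis suffices in the realizable setting.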

In agnostic PAC, the algorithm must uniformly estimate the risk of all hypotheses. This requires two-sided uniform convergence:

$$\sup_{h \in \mathcal{H}} |R(h) - \hat{R}_n(h)| \leq \varepsilon$$

Uniform convergence is sufficient for agnostic PAC learning: if the above holds, then the ERM hypothesis satisfies $R(\hat{h}) \leq \min_h R(h) + 2\varepsilon$. For binary classification with finite VC dimension, uniform convergence is also necessary.
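The factor of $2\varepsilon$ comes from a three-step chain, with $\hat{h}$ the ERM output and $h^* \in \arg\min_{h \in \mathcal{H}} R(h)$:

$$R(\hat{h}) \;\leq\; \hat{R}_n(\hat{h}) + \varepsilon \;\leq\; \hat{R}_n(h^*) + \varepsilon \;\leq\; R(h^*) + 2\varepsilon$$

The first and third inequalities use uniform convergence; the middle one uses the fact that $\hat{h}$ minimizes empirical risk.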

Key Assumptions That Differ

| | Realizable PAC | Agnostic PAC |
| --- | --- | --- |
| Realizability | $\exists h^* \in \mathcal{H}$ with $R(h^*) = 0$ | No assumption |
| Success criterion | $R(\hat{h}) \leq \varepsilon$ | $R(\hat{h}) \leq \min_h R(h) + \varepsilon$ |
| Sample complexity | $O(\log\lvert\mathcal{H}\rvert/\varepsilon)$ | $O(\log\lvert\mathcal{H}\rvert/\varepsilon^2)$ |
| Required tool | One-sided concentration | Uniform convergence |
| Practical relevance | Rarely applicable | Standard setting |

The Fundamental Theorem

For binary classification, PAC learnability and agnostic PAC learnability are equivalent: both are characterized by finite VC dimension. The difference is only in the sample complexity rate ($1/\varepsilon$ vs. $1/\varepsilon^2$). This equivalence does not extend to all loss functions. For some losses, agnostic learnability is strictly harder than realizable learnability.

Common Confusions

Watch Out

ERM works in both settings, but for different reasons

In the realizable case, ERM succeeds because any consistent hypothesis has low risk. In the agnostic case, ERM succeeds because uniform convergence ensures the empirical risk is a good proxy for true risk everywhere. The proofs are structurally different, even though the algorithm (minimize training loss) is the same.

Watch Out

Agnostic PAC does not mean the learner is worse

The learner in agnostic PAC competes with the best hypothesis in the class, not with the Bayes-optimal predictor. If $\mathcal{H}$ is rich enough that $\min_h R(h) \approx 0$, agnostic PAC gives a bound close to what realizable PAC gives. The extra $\varepsilon$ is relative to the best-in-class risk, not an absolute degradation.

Watch Out

The $1/\varepsilon$ vs. $1/\varepsilon^2$ gap is real but context-dependent

The gap between $1/\varepsilon$ and $1/\varepsilon^2$ sample complexity is a genuine difference in the finite-class case. For infinite classes with VC dimension $d$, the realizable rate is $O(d/\varepsilon)$ and the agnostic rate is $O(d/\varepsilon^2)$. However, "fast rates" in agnostic learning (using Bernstein-type conditions) can recover the $1/\varepsilon$ rate when the best hypothesis has small risk. The gap is worst when $\min_h R(h)$ is large.
