What Each Assumes
Both PAC and agnostic PAC define what it means for a learning algorithm to succeed. They differ in a single assumption: whether the true target function is in the hypothesis class.
PAC (realizable): There exists $h^\star \in \mathcal{H}$ with $L_{\mathcal{D}}(h^\star) = 0$. The target is perfectly representable by the class.
Agnostic PAC: No assumption on the target. The learner competes with $\min_{h \in \mathcal{H}} L_{\mathcal{D}}(h)$, which may be nonzero.
Side-by-Side Statement
PAC Learnability (Realizable)
A hypothesis class $\mathcal{H}$ is PAC learnable if there exist an algorithm $A$ and a function $m_{\mathcal{H}} : (0,1)^2 \to \mathbb{N}$ such that: for every distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$ for which there exists $h^\star \in \mathcal{H}$ with $L_{\mathcal{D}}(h^\star) = 0$, given $m \ge m_{\mathcal{H}}(\epsilon, \delta)$ i.i.d. samples, $A$ returns $h$ satisfying:

$$\Pr\big[L_{\mathcal{D}}(h) \le \epsilon\big] \ge 1 - \delta.$$
Agnostic PAC Learnability
A hypothesis class $\mathcal{H}$ is agnostic PAC learnable if there exist an algorithm $A$ and a function $m_{\mathcal{H}} : (0,1)^2 \to \mathbb{N}$ such that: for every distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$ (no realizability assumption), given $m \ge m_{\mathcal{H}}(\epsilon, \delta)$ i.i.d. samples, $A$ returns $h$ satisfying:

$$\Pr\Big[L_{\mathcal{D}}(h) \le \min_{h' \in \mathcal{H}} L_{\mathcal{D}}(h') + \epsilon\Big] \ge 1 - \delta.$$
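The distinction can be made concrete with a toy example. The following sketch (a hypothetical two-hypothesis class under 0-1 loss, not from the text) shows a distribution where the realizable success criterion is unattainable but the agnostic one is met:

```python
# Toy illustration (assumed setup): two constant classifiers, 0-1 loss.
def risk(h, dist):
    """True risk: probability mass on which h disagrees with the label."""
    return sum(p for x, y, p in dist if h(x) != y)

H = {"always0": lambda x: 0, "always1": lambda x: 1}

# Non-realizable distribution: y = 1 with probability 0.7, independent of x.
dist = [(0, 1, 0.7), (0, 0, 0.3)]

risks = {name: risk(h, dist) for name, h in H.items()}
best = min(risks.values())  # best-in-class risk, nonzero here

# Realizable criterion L(h) <= eps fails for small eps (no h has risk 0);
# agnostic criterion L(h) <= best + eps is met by "always1" for any eps > 0.
print(risks, best)
```

No hypothesis reaches risk zero, so only the agnostic guarantee is meaningful for this distribution.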
Where Each Is Stronger
Realizable PAC has better sample complexity
In the realizable setting with a finite hypothesis class, ERM achieves:

$$m_{\mathcal{H}}(\epsilon, \delta) = O\!\left(\frac{\log(|\mathcal{H}|/\delta)}{\epsilon}\right).$$
The dependence on $\epsilon$ is $1/\epsilon$, not $1/\epsilon^2$. This is because in the realizable case, the true risk $L_{\mathcal{D}}(h^\star) = 0$, so any hypothesis with positive empirical risk can be eliminated. You only need one-sided concentration.
Agnostic PAC applies to real problems
The realizability assumption almost never holds in practice. The Bayes-optimal classifier is rarely in . Agnostic PAC is the relevant framework for practical learning, because it handles model misspecification.
Sample Complexity Comparison
Sample Complexity Gap
Statement
For a finite hypothesis class with 0-1 loss:
Realizable PAC: ERM achieves sample complexity $O\!\left(\frac{\log(|\mathcal{H}|/\delta)}{\epsilon}\right)$.
Agnostic PAC: ERM achieves sample complexity $O\!\left(\frac{\log(|\mathcal{H}|/\delta)}{\epsilon^2}\right)$.
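Plugging numbers into the two bounds shows the scale of the gap. A minimal sketch, using standard constants from the elimination and Hoeffding/union-bound proofs (exact constants vary by source):

```python
import math

def m_realizable(H_size, eps, delta):
    """Realizable finite-class bound: m >= log(|H|/delta) / eps."""
    return math.ceil(math.log(H_size / delta) / eps)

def m_agnostic(H_size, eps, delta):
    """Agnostic finite-class bound (Hoeffding + union bound):
    m >= 2 log(2|H|/delta) / eps^2."""
    return math.ceil(2 * math.log(2 * H_size / delta) / eps ** 2)

# With |H| = 1000, eps = 0.01, delta = 0.05, the agnostic bound is
# larger by roughly a factor of 2/eps.
print(m_realizable(1000, 0.01, 0.05), m_agnostic(1000, 0.01, 0.05))
```

At $\epsilon = 0.01$ the agnostic requirement is a few hundred times the realizable one, which is the $1/\epsilon$ vs. $1/\epsilon^2$ gap made tangible.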
Intuition
The gap comes from the difference between one-sided and two-sided concentration. In the realizable case, $L_{\mathcal{D}}(h^\star) = 0$, so $L_S(h^\star) = 0$ always. Any hypothesis with $L_S(h) > 0$ is certified to be imperfect. The effective number of "dangerous" hypotheses (those with $L_{\mathcal{D}}(h) > \epsilon$ but $L_S(h) = 0$) shrinks exponentially in $m$: each such hypothesis stays consistent with probability at most $(1 - \epsilon)^m \le e^{-\epsilon m}$.
In the agnostic case, no hypothesis has zero risk. You must estimate $L_{\mathcal{D}}(h)$ to accuracy $\epsilon/2$ for all $h \in \mathcal{H}$ simultaneously, which requires $m = \Omega(1/\epsilon^2)$ by standard concentration (Hoeffding's inequality).
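The $1/\sqrt{m}$ estimation error behind the $1/\epsilon^2$ rate is easy to see in simulation. A hypothetical Monte Carlo sketch: the per-example 0-1 loss of a fixed hypothesis is modeled as a Bernoulli(0.3) draw, and quadrupling the sample size roughly halves the mean estimation error:

```python
import random

def mean_abs_deviation(m, p=0.3, trials=2000, seed=0):
    """Average |empirical risk - true risk| when each example's 0-1 loss
    is an independent Bernoulli(p) draw."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        emp = sum(rng.random() < p for _ in range(m)) / m
        total += abs(emp - p)
    return total / trials

d100, d400 = mean_abs_deviation(100), mean_abs_deviation(400)
print(d100, d400)  # ratio close to 2: the 1/sqrt(m) scaling
```

Inverting error $\approx 1/\sqrt{m}$ to reach accuracy $\epsilon$ gives $m \approx 1/\epsilon^2$, which is exactly the agnostic rate.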
Failure Mode
For infinite hypothesis classes, the finite-class analysis does not apply. The fundamental theorem of statistical learning shows that for binary classification, $\mathcal{H}$ is agnostic PAC learnable if and only if it has finite VC dimension $d$, with sample complexity $\Theta\!\left(\frac{d + \log(1/\delta)}{\epsilon^2}\right)$.
The Mechanism: Why Agnostic Is Harder
In realizable PAC, the algorithm only needs to identify a consistent hypothesis (one with zero empirical risk). The set of consistent hypotheses is a "version space" that shrinks with more data, and any element of this set has low true risk.
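A minimal version-space sketch (hypothetical threshold class on $[0,1]$ with an assumed in-class target at $t = 0.5$): the set of consistent hypotheses collapses around the target as data accumulates:

```python
import random

def version_space(hypotheses, sample):
    """Hypotheses with zero empirical risk on the labeled sample."""
    return [h for h in hypotheses if all(h(x) == y for x, y in sample)]

# Finite class: threshold classifiers h_t(x) = 1[x >= t], t on a 0.01 grid.
hypotheses = [lambda x, t=i / 100: int(x >= t) for i in range(101)]

# Realizable data: labels come from the in-class target t = 0.5.
rng = random.Random(0)
data = [(x, int(x >= 0.5)) for x in (rng.random() for _ in range(200))]

# With 200 samples, only a narrow band of thresholds around t = 0.5
# remains consistent, and every survivor has low true risk.
print(len(version_space(hypotheses, data)))
```

The learner never needs to estimate risks accurately; it only needs the version space to exclude all $\epsilon$-bad hypotheses, which is the one-sided argument.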
In agnostic PAC, the algorithm must uniformly estimate the risk of all hypotheses. This requires two-sided uniform convergence:

$$\Pr\Big[\sup_{h \in \mathcal{H}} \big| L_S(h) - L_{\mathcal{D}}(h) \big| \le \epsilon\Big] \ge 1 - \delta.$$
Uniform convergence is sufficient for agnostic PAC learning: if the above holds, then the ERM hypothesis satisfies $L_{\mathcal{D}}(h_{\mathrm{ERM}}) \le \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + 2\epsilon$. For binary classification, uniform convergence is not only sufficient but also necessary for agnostic PAC learnability.
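The sufficiency step is a two-line chain. Writing $h_{\mathrm{ERM}}$ for an empirical risk minimizer and $h^\star$ for a best-in-class hypothesis, and assuming the uniform convergence event holds with accuracy $\epsilon$:

$$L_{\mathcal{D}}(h_{\mathrm{ERM}}) \le L_S(h_{\mathrm{ERM}}) + \epsilon \le L_S(h^\star) + \epsilon \le L_{\mathcal{D}}(h^\star) + 2\epsilon,$$

where the first and last inequalities use uniform convergence and the middle one uses that ERM minimizes empirical risk.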
Key Assumptions That Differ
| | Realizable PAC | Agnostic PAC |
|---|---|---|
| Realizability | $\exists\, h^\star \in \mathcal{H}$ with $L_{\mathcal{D}}(h^\star) = 0$ | No assumption |
| Success criterion | $L_{\mathcal{D}}(h) \le \epsilon$ | $L_{\mathcal{D}}(h) \le \min_{h' \in \mathcal{H}} L_{\mathcal{D}}(h') + \epsilon$ |
| Sample complexity (finite class) | $O(\log(\lvert\mathcal{H}\rvert/\delta)/\epsilon)$ | $O(\log(\lvert\mathcal{H}\rvert/\delta)/\epsilon^2)$ |
| Required tool | One-sided concentration | Two-sided uniform convergence |
| Practical relevance | Rarely applicable | Standard setting |
The Fundamental Theorem
For binary classification, PAC learnability and agnostic PAC learnability are equivalent: both are characterized by finite VC dimension. The difference is only in the sample complexity rate ($1/\epsilon$ vs. $1/\epsilon^2$). This equivalence does not extend to all loss functions. For some losses, agnostic learnability is strictly harder than realizable learnability.
Common Confusions
ERM works in both settings, but for different reasons
In the realizable case, ERM succeeds because any consistent hypothesis has low risk. In the agnostic case, ERM succeeds because uniform convergence ensures the empirical risk is a good proxy for true risk everywhere. The proofs are structurally different, even though the algorithm (minimize training loss) is the same.
Agnostic PAC does not mean the learner is worse
The learner in agnostic PAC competes with the best hypothesis in the class, not with the Bayes-optimal predictor. If $\mathcal{H}$ is rich enough that $\min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) \approx L_{\mathcal{D}}(h_{\text{Bayes}})$, agnostic PAC gives a bound close to what realizable PAC gives. The extra $\epsilon$ is relative to the best-in-class risk, not an absolute degradation.
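This is the standard approximation/estimation decomposition. Writing $h^\star \in \arg\min_{h \in \mathcal{H}} L_{\mathcal{D}}(h)$ for a best-in-class hypothesis:

$$L_{\mathcal{D}}(h) - L_{\mathcal{D}}(h_{\text{Bayes}}) = \underbrace{\big(L_{\mathcal{D}}(h) - L_{\mathcal{D}}(h^\star)\big)}_{\text{estimation error}\ \le\ \epsilon} + \underbrace{\big(L_{\mathcal{D}}(h^\star) - L_{\mathcal{D}}(h_{\text{Bayes}})\big)}_{\text{approximation error}}.$$

Agnostic PAC controls only the first term; the second is a property of $\mathcal{H}$, not of the learner.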
The $1/\epsilon$ vs. $1/\epsilon^2$ gap is real but context-dependent
The gap between $1/\epsilon$ and $1/\epsilon^2$ sample complexity is a genuine difference in the finite-class case. For infinite classes with VC dimension $d$, the realizable rate is $O\!\left(\frac{d \log(1/\epsilon) + \log(1/\delta)}{\epsilon}\right)$ and the agnostic rate is $\Theta\!\left(\frac{d + \log(1/\delta)}{\epsilon^2}\right)$. However, "fast rates" in agnostic learning (using Bernstein-type conditions) can recover the $1/\epsilon$ rate when the best hypothesis has small risk. The gap is worst when $\min_{h \in \mathcal{H}} L_{\mathcal{D}}(h)$ is large.
References
Canonical:
- Shalev-Shwartz & Ben-David, Understanding Machine Learning, Chapters 2-6
- Vapnik, Statistical Learning Theory (1998), Chapters 1-3
Current:
- Mohri, Rostamizadeh, Talwalkar, Foundations of Machine Learning (2018), Chapters 2-3