Learning Theory Core
No-Free-Lunch Theorem
For binary classification with 0-1 loss, no learning algorithm can succeed on every distribution: for any algorithm and any sample size $m \le |\mathcal{X}|/2$, some realizable distribution forces error at least 1/8 with probability at least 1/7. Universal learners do not exist; prior knowledge enters through the choice of hypothesis class.
Why This Matters
The No-Free-Lunch theorem is the formal answer to a tempting question: is there a single learning algorithm that works on every problem? The answer is no, and the proof is constructive. For any fixed algorithm $A$ and any sample size at most half the domain size, there exists a distribution under which $A$ errs on a constant fraction of inputs, even though some hypothesis achieves zero error on that distribution.
The theorem clarifies what learning actually requires. It is not enough to have a clever optimization procedure or more data. You must also commit to a hypothesis class, and that commitment is the prior knowledge that makes generalization possible. The choice of $\mathcal{H}$ is the bias the algorithm imposes on the world; without it, the PAC framework provides no guarantee.
This is also the entry point to the bias-complexity tradeoff and to VC dimension: NFL says you must restrict $\mathcal{H}$, and the next question is: how restricted is restricted enough?
Mental Model
Think of a learner as a guesser. NFL says: for every fixed strategy, an adversary picking the distribution last can ensure the guesser is wrong on a noticeable fraction of unseen points. The adversary's power comes from the unseen-point set: anything the learner has not observed is fair game, and the adversary chooses the labels there to maximize confusion.
The way out is to refuse to consider every possible labeling. If the learner restricts attention to a small class of "reasonable" hypotheses, the adversary's choices shrink. NFL is the formal proof that this restriction is not optional.
Formal Setup
Fix a domain $\mathcal{X}$ and label space $\mathcal{Y} = \{0, 1\}$. Loss is 0-1. A learning algorithm $A$ takes a labeled sample $S = ((x_1, y_1), \dots, (x_m, y_m))$ and returns a hypothesis $A(S): \mathcal{X} \to \mathcal{Y}$. The population risk of a hypothesis $h$ under a distribution $\mathcal{D}$ on $\mathcal{X} \times \mathcal{Y}$ is

$$L_{\mathcal{D}}(h) = \mathbb{P}_{(x, y) \sim \mathcal{D}}[h(x) \ne y].$$

A distribution $\mathcal{D}$ is realizable by some target $f: \mathcal{X} \to \mathcal{Y}$ if $L_{\mathcal{D}}(f) = 0$, that is, the labels are deterministic given $x$ via $y = f(x)$.
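As a concrete illustration, a minimal sketch (our code, not from SSBD) of 0-1 population risk for a finitely supported distribution, which is the only kind the NFL construction needs:

```python
from typing import Callable, Dict

def population_risk(h: Callable[[int], int],
                    mass: Dict[int, float],
                    f: Callable[[int], int]) -> float:
    """0-1 risk of h under a finitely supported realizable distribution.

    `mass` maps each domain point to its probability; labels come from f.
    """
    return sum(p for x, p in mass.items() if h(x) != f(x))

# Example: uniform mass on {0, 1, 2, 3}, target = parity, hypothesis = constant 0.
uniform = {x: 0.25 for x in range(4)}
print(population_risk(lambda x: 0, uniform, lambda x: x % 2))  # 0.5
```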
The Theorem
No-Free-Lunch (Shalev-Shwartz and Ben-David, Theorem 5.1)
Statement
Let $A$ be any learning algorithm for binary classification with 0-1 loss over a domain $\mathcal{X}$, and let $m \le |\mathcal{X}|/2$ be a sample size. There exists a distribution $\mathcal{D}$ over $\mathcal{X} \times \{0, 1\}$ such that:
- There is a function $f: \mathcal{X} \to \{0, 1\}$ with $L_{\mathcal{D}}(f) = 0$ (the distribution is realizable).
- With probability at least $1/7$ over the iid sample $S \sim \mathcal{D}^m$, $L_{\mathcal{D}}(A(S)) \ge 1/8$.
Intuition
The proof picks a subset $C \subseteq \mathcal{X}$ of size $2m$ and considers all $2^{2m}$ ways to label $C$. Each labeling $f_i$ defines a realizable distribution uniform on $\{(x, f_i(x)) : x \in C\}$. Averaged over the labelings, the algorithm cannot do better than chance on points it has not seen, because half of all labelings disagree with whatever $A$ predicts at any unobserved point. Pigeonhole on the worst labeling converts the average bound into an existence statement.
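The averaging argument can be checked exactly on a tiny domain. The sketch below (our construction, mirroring the proof) takes $m = 2$ and $|C| = 4$, enumerates all $2^{2m} = 16$ labelings and all $(2m)^m = 16$ sample sequences, and computes the double average of the risk for one concrete learner; any learner substituted for `learner` will average at least $1/4$:

```python
import itertools

m = 2
C = list(range(2 * m))                                        # |C| = 2m
labelings = list(itertools.product([0, 1], repeat=len(C)))    # T = 2^(2m) targets
sequences = list(itertools.product(C, repeat=m))              # k = (2m)^m samples

def learner(sample):
    """An arbitrary fixed algorithm: memorize, predict the majority label elsewhere."""
    table = dict(sample)
    default = int(2 * sum(y for _, y in sample) >= len(sample))
    return lambda x: table.get(x, default)

total = 0.0
for f in labelings:                                  # average over distributions D_i ...
    for seq in sequences:                            # ... and over samples S_j^i
        h = learner([(x, f[x]) for x in seq])
        total += sum(h(x) != f[x] for x in C) / len(C)   # L_{D_i}(A(S_j^i))

avg = total / (len(labelings) * len(sequences))
print(avg)   # 0.28125 for this learner; the proof guarantees >= 0.25
```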
Proof Sketch
Fix $C \subseteq \mathcal{X}$ with $|C| = 2m$. There are $T = 2^{2m}$ functions $f_1, \dots, f_T$ from $C$ to $\{0, 1\}$. For each $i$, let $\mathcal{D}_i$ be the uniform distribution on $\{(x, f_i(x)) : x \in C\}$. Each $\mathcal{D}_i$ is realizable by $f_i$.
Goal: show

$$\max_{i \in [T]} \; \mathbb{E}_{S \sim \mathcal{D}_i^m}\left[L_{\mathcal{D}_i}(A(S))\right] \ge \frac{1}{4}.$$
Let $S_1, \dots, S_k$ enumerate the $k = (2m)^m$ possible $m$-sequences of points in $C$, and write $S_j^i$ for the sample obtained by labeling the $j$-th sequence with $f_i$. Then

$$\mathbb{E}_{S \sim \mathcal{D}_i^m}\left[L_{\mathcal{D}_i}(A(S))\right] = \frac{1}{k} \sum_{j=1}^{k} L_{\mathcal{D}_i}(A(S_j^i)).$$
Use the chain max ≥ average ≥ min:

$$\max_i \frac{1}{k} \sum_{j=1}^{k} L_{\mathcal{D}_i}(A(S_j^i)) \;\ge\; \frac{1}{T} \sum_{i=1}^{T} \frac{1}{k} \sum_{j=1}^{k} L_{\mathcal{D}_i}(A(S_j^i)) \;=\; \frac{1}{k} \sum_{j=1}^{k} \frac{1}{T} \sum_{i=1}^{T} L_{\mathcal{D}_i}(A(S_j^i)) \;\ge\; \min_j \frac{1}{T} \sum_{i=1}^{T} L_{\mathcal{D}_i}(A(S_j^i)).$$
Fix any $j$. The sequence $S_j$ uses at most $m$ distinct points of $C$, so there are $p \ge m$ points $v_1, \dots, v_p$ in $C$ that the sample never visits. Each point of $C$ has mass $\frac{1}{2m}$ under $\mathcal{D}_i$, so for any hypothesis $h$, keeping only the unobserved points,

$$L_{\mathcal{D}_i}(h) \ge \frac{1}{2m} \sum_{r=1}^{p} \mathbb{1}\left[h(v_r) \ne f_i(v_r)\right] \ge \frac{1}{2p} \sum_{r=1}^{p} \mathbb{1}\left[h(v_r) \ne f_i(v_r)\right],$$

where the final inequality uses $p \ge m$.

Pair the labelings: for each unobserved $v_r$, group the $f_i$ into $T/2$ pairs that differ only at $v_r$. The two members of a pair produce the same sample $S_j^i$, so the algorithm's prediction at $v_r$ matches one labeling and not the other; exactly half the $f_i$ disagree with $A(S_j^i)$ at $v_r$. Hence

$$\frac{1}{T} \sum_{i=1}^{T} \mathbb{1}\left[A(S_j^i)(v_r) \ne f_i(v_r)\right] = \frac{1}{2}.$$

Averaging the risk bound over $i$ and exchanging sums,

$$\frac{1}{T} \sum_{i=1}^{T} L_{\mathcal{D}_i}(A(S_j^i)) \ge \frac{1}{2p} \sum_{r=1}^{p} \frac{1}{T} \sum_{i=1}^{T} \mathbb{1}\left[A(S_j^i)(v_r) \ne f_i(v_r)\right] = \frac{1}{2p} \cdot p \cdot \frac{1}{2} = \frac{1}{4}.$$
Combining with the chain, there exists $i$ with $\mathbb{E}_{S \sim \mathcal{D}_i^m}\left[L_{\mathcal{D}_i}(A(S))\right] \ge \frac{1}{4}$.
Convert expectation to high probability. Let $Z = L_{\mathcal{D}_i}(A(S))$, so $Z \in [0, 1]$ with $\mathbb{E}[Z] \ge \frac{1}{4}$. For any $a \in (0, 1)$, reverse Markov gives $\mathbb{P}[Z \ge a] \ge \frac{\mathbb{E}[Z] - a}{1 - a}$. Take $a = \frac{1}{8}$:

$$\mathbb{P}\left[L_{\mathcal{D}_i}(A(S)) \ge \tfrac{1}{8}\right] \ge \frac{1/4 - 1/8}{1 - 1/8} = \frac{1}{7}.$$
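The reverse-Markov step is the standard lemma for bounded random variables (SSBD prove an equivalent form in their appendix); a one-line derivation for $Z \in [0, 1]$ and $a \in (0, 1)$:

```latex
\begin{align*}
\mathbb{E}[Z] &\le a\,\mathbb{P}[Z < a] + 1 \cdot \mathbb{P}[Z \ge a]
  && \text{(bound $Z$ by $a$ below the threshold, by $1$ above)} \\
              &= a + (1 - a)\,\mathbb{P}[Z \ge a]
  && \text{(use $\mathbb{P}[Z < a] = 1 - \mathbb{P}[Z \ge a]$)} \\
\Longrightarrow \mathbb{P}[Z \ge a] &\ge \frac{\mathbb{E}[Z] - a}{1 - a}.
\end{align*}
```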
Why It Matters
NFL is the lower-bound counterpart to the PAC upper bounds. Together they say: learning is possible exactly when you commit to a hypothesis class of controlled complexity, and impossible when you do not. Every PAC bound is a quantitative version of "you bought enough prior knowledge"; NFL is the proof that you had to buy any at all.
Failure Mode
NFL applies to binary classification with 0-1 loss in the worst case over distributions. It says nothing about average-case behavior on natural distributions, regression with a different loss, or settings where the adversary cannot place mass on points absent from training. The constants $1/8$ and $1/7$ also depend on the $m \le |\mathcal{X}|/2$ assumption; once the sample is large enough to cover the domain, the learner can simply memorize.
Corollary: All Functions Is Not PAC-Learnable
The class of all binary functions on an infinite domain is not PAC-learnable (SSBD Corollary 5.2)
Statement
Let $\mathcal{X}$ be an infinite domain and let $\mathcal{H}_{\text{all}} = \{0, 1\}^{\mathcal{X}}$ be the class of all binary functions on $\mathcal{X}$. Then $\mathcal{H}_{\text{all}}$ is not PAC-learnable.
Intuition
PAC learnability requires a sample complexity $m_{\mathcal{H}}(\epsilon, \delta)$ that works for every distribution and target in $\mathcal{H}_{\text{all}}$. Pick $\epsilon < 1/8$ and $\delta < 1/7$; if a learner achieved this on $\mathcal{H}_{\text{all}}$, NFL would give a distribution where it fails, a direct contradiction.
Proof Sketch
Suppose for contradiction that $\mathcal{H}_{\text{all}}$ is PAC-learnable. Then there is an algorithm $A$ and a function $m_{\mathcal{H}}(\epsilon, \delta)$ such that for every realizable distribution $\mathcal{D}$ and every $\epsilon, \delta \in (0, 1)$, $\mathbb{P}_{S \sim \mathcal{D}^m}\left[L_{\mathcal{D}}(A(S)) \le \epsilon\right] \ge 1 - \delta$ whenever $m \ge m_{\mathcal{H}}(\epsilon, \delta)$.
Fix $\epsilon < 1/8$ and $\delta < 1/7$, and let $m = m_{\mathcal{H}}(\epsilon, \delta)$. Since $\mathcal{X}$ is infinite, $|\mathcal{X}| \ge 2m$, so NFL applies: there exists a realizable distribution $\mathcal{D}$ with $\mathbb{P}_{S \sim \mathcal{D}^m}\left[L_{\mathcal{D}}(A(S)) \ge 1/8\right] \ge 1/7$, i.e., $\mathbb{P}\left[L_{\mathcal{D}}(A(S)) \le \epsilon\right] \le 1 - 1/7 < 1 - \delta$. The PAC guarantee says this probability should be at least $1 - \delta$. Contradiction.
Why It Matters
This is the cleanest way to see why some hypothesis classes are simply unlearnable. The class of all functions is too rich for any algorithm to generalize from a finite sample. The next chapter of learning theory is about which restrictions on $\mathcal{H}$ are sufficient: finite classes work via the union bound, and infinite classes work iff they have finite VC dimension.
Failure Mode
The corollary needs infinite $\mathcal{X}$ (or at least large enough that NFL applies at the chosen $m$). For finite $\mathcal{X}$ small enough that $m > |\mathcal{X}|/2$, the learner can memorize and the argument breaks. Memorization on a finite domain is technically PAC learning, but the sample complexity scales with $|\mathcal{X}|$ and is not what the framework is designed to characterize.
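For concreteness, a small simulation (our code; names hypothetical) of the memorize-and-default learner on a 50-point domain, estimating its expected risk as the sample grows:

```python
import random

random.seed(0)
n = 50
domain = list(range(n))                              # finite X, uniform distribution
target = {x: random.randrange(2) for x in domain}    # a fixed random labeling f

def memorize_learner(sample):
    """Store observed labels; predict 0 on points never seen."""
    table = dict(sample)
    return lambda x: table.get(x, 0)

def risk(h):
    """Population 0-1 risk under the uniform distribution on the domain."""
    return sum(h(x) != target[x] for x in domain) / n

for m in [10, 25, 100, 400]:
    trials = 200
    est = sum(
        risk(memorize_learner([(x, target[x]) for x in random.choices(domain, k=m)]))
        for _ in range(trials)
    ) / trials
    print(f"m={m:4d}  expected risk ~ {est:.3f}")    # decays as the sample covers X
```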
NFL and Prior Knowledge
NFL says no algorithm wins on every problem; it does not say nothing works. The way out is prior knowledge, and the most common form of prior knowledge is a restricted hypothesis class.
When you commit to $\mathcal{H}$, you declare which patterns you expect the world to follow. If the true labeling lies in $\mathcal{H}$ (realizable case) or close to it (agnostic case), the PAC machinery applies and learning succeeds with sample complexity controlled by $\log|\mathcal{H}|$ or by VC dimension. If the true labeling lies far outside $\mathcal{H}$, the approximation error is large and even infinite data does not save you. Either way, the choice of $\mathcal{H}$ is the prior, and that prior is the commitment NFL forces you to make.
Other ways to express prior knowledge include weighting hypotheses (structural risk minimization, MDL), restricting the loss function or optimization landscape, and Bayesian priors. SSBD Chapter 7 develops these alternatives. All of them encode the same fact: no commitment, no generalization.
Common Confusions
NFL says learning is possible, given the right H
NFL does not say learning is impossible. It says no algorithm works on every task. Choosing the right hypothesis class is the prior knowledge that makes learning possible. PAC bounds quantify what "right" means: a class with small VC dimension, or with $\log|\mathcal{H}|$ small relative to the sample size.
NFL is worst-case, not average-case
The bound is over the worst distribution for the fixed algorithm. For natural distributions (images, text, physical processes), well-chosen hypothesis classes like neural networks, linear models on engineered features, or kernel methods routinely succeed. NFL says these classes must fail on some distribution; it does not say they fail on distributions you actually care about.
The 1/4 average error is not a bound on a single algorithm's behavior
The proof shows the average over labelings of the algorithm's expected risk is at least $1/4$. This is what lets you pull out a worst labeling where the fixed algorithm has expected risk at least $1/4$. It does not mean any specific $\mathcal{D}_i$ produces error $1/4$ in expectation; only that some $\mathcal{D}_i$ does.
Realizability does not save you
The hard distribution constructed in the proof is realizable by $f_i$ (i.e., $L_{\mathcal{D}_i}(f_i) = 0$). NFL therefore holds even with the realizability assumption that powers the cleanest PAC bounds. The reason is that the learner does not know which $f_i$ generated the labels, and must commit to a hypothesis from data alone. With $m \le |\mathcal{X}|/2$, nothing in the data identifies $f_i$ on the unseen points.
Connection to PAC and VC
The relationship between NFL, PAC, and VC dimension fits a single chain:
- NFL: $\mathcal{H}_{\text{all}}$, the class of all binary functions, is not PAC-learnable.
- Finite-class PAC bound: $|\mathcal{H}| < \infty$ implies $\mathcal{H}$ is PAC-learnable with sample complexity $O\!\left(\frac{\log(|\mathcal{H}|/\delta)}{\epsilon}\right)$ (realizable) or $O\!\left(\frac{\log(|\mathcal{H}|/\delta)}{\epsilon^2}\right)$ (agnostic).
- Fundamental theorem of PAC learning: for binary classification with 0-1 loss, $\mathcal{H}$ is PAC-learnable iff $\mathcal{H}$ has finite VC dimension.
NFL is the negative end of the fundamental theorem: infinite VC dimension means not PAC-learnable. Restriction is mandatory; finite VC dimension is the right notion of "restricted enough" for binary classification.
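To put numbers on the finite-class step of the chain, here is a minimal calculator (our code; the constants follow the standard finite-class bounds in SSBD):

```python
import math

def m_realizable(h_size: int, eps: float, delta: float) -> int:
    """Realizable finite-class bound: ceil(log(|H| / delta) / eps)."""
    return math.ceil(math.log(h_size / delta) / eps)

def m_agnostic(h_size: int, eps: float, delta: float) -> int:
    """Agnostic finite-class bound: ceil(2 * log(2|H| / delta) / eps^2)."""
    return math.ceil(2 * math.log(2 * h_size / delta) / eps ** 2)

# Example: hypotheses described by 20 bits, so |H| = 2**20.
print(m_realizable(2 ** 20, eps=0.1, delta=0.05))  # 169
print(m_agnostic(2 ** 20, eps=0.1, delta=0.05))    # 3511
```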
Worked Example: Why m ≤ |X|/2 Is Tight
Take a domain with $|\mathcal{X}| = 2m$, so the theorem's condition $m \le |\mathcal{X}|/2$ holds with equality. NFL applies: some realizable distribution forces error at least $1/8$ with probability at least $1/7$. Now let $m$ grow past $|\mathcal{X}|/2$. The learner can adopt the strategy "memorize and predict 0 on unseen points" and the adversary loses traction: with $m = |\mathcal{X}|/2$ and $\mathcal{D}$ uniform on $C = \mathcal{X}$ of size $2m$, the sample covers at most half of $C$, and with positive probability the learner sees fewer than $m$ distinct points, so the unseen-point argument still bites; only once $m$ is large enough that every point has been observed with high probability does the adversary lose entirely.
The clean cutoff is $m \le |\mathcal{X}|/2$ for the proof as stated. The constants $1/8$ and $1/7$ are specific to the $C$ of size $2m$ construction, not fundamental: tighter constructions improve them (Anthony and Bartlett 1999, Theorem 5.3) at the cost of a more delicate argument.
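A small experiment (ours, not from the references) makes the cutoff visible: average the memorize-and-default learner's risk over every labeling of an 8-point domain and watch the average fall below $1/8$ only once $m$ exceeds $|\mathcal{X}|$:

```python
import itertools, random

random.seed(1)
X = list(range(8))                                            # |X| = 8, so |X|/2 = 4
labelings = list(itertools.product([0, 1], repeat=len(X)))    # all 2^8 targets

def memorize0(sample):
    table = dict(sample)
    return lambda x: table.get(x, 0)          # memorize; default to 0 when unseen

for m in [2, 4, 8, 16, 32, 64]:
    trials, total = 300, 0.0
    for f in labelings:
        for _ in range(trials):
            xs = random.choices(X, k=m)                       # iid uniform sample
            h = memorize0([(x, f[x]) for x in xs])
            total += sum(h(x) != f[x] for x in X) / len(X)
    print(f"m={m:3d}  avg risk ~ {total / (len(labelings) * trials):.3f}")
# At m = |X|/2 = 4 the average stays near 0.29 >= 1/4; it drops below 1/8
# only after m exceeds |X|, once most of the domain has been observed.
```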
Exercises
Problem
State the No-Free-Lunch theorem in your own words, naming the loss, the domain assumption, and the two probabilistic constants.
Problem
Where in the proof is the realizability assumption used? Could the proof be adapted to the agnostic setting?
Problem
Use NFL to prove that if $\mathcal{H}$ shatters an infinite set $K \subseteq \mathcal{X}$ (every finite subset of $K$ admits every binary labeling within $\mathcal{H}$), then $\mathcal{H}$ is not PAC-learnable.
Problem
Show that on a finite domain of size $n$, the class of all binary functions is PAC-learnable with sample complexity $O\!\left(\frac{n + \log(1/\delta)}{\epsilon}\right)$ in the realizable case. Reconcile with the NFL corollary.
Summary
- For 0-1 binary classification with $m \le |\mathcal{X}|/2$, every algorithm has a realizable distribution where, with probability at least $1/7$, it fails to achieve error below $1/8$.
- Proof structure: pick $C$ of size $2m$, average over the $2^{2m}$ realizable labelings, pair labelings on each unseen point so half disagree with the algorithm.
- Corollary: the class of all binary functions on an infinite domain is not PAC-learnable.
- NFL forces every learner to commit to a hypothesis class. The choice of $\mathcal{H}$ is the prior knowledge that enables generalization.
- Restriction is mandatory; finite VC dimension characterizes "restricted enough" for binary classification.
References
Canonical:
- Shalev-Shwartz and Ben-David, Understanding Machine Learning, Chapter 5, Theorem 5.1 and Corollary 5.2.
- Wolpert, "The Lack of A Priori Distinctions Between Learning Algorithms," Neural Computation 8 (1996), 1341-1390 — the original NFL theorem in a more general form.
- Wolpert and Macready, "No Free Lunch Theorems for Optimization," IEEE Transactions on Evolutionary Computation 1 (1997), 67-82 — the optimization analog.
Current:
- Mohri, Rostamizadeh, Talwalkar, Foundations of Machine Learning (2018), Section 3.4.
- Anthony and Bartlett, Neural Network Learning: Theoretical Foundations (1999), Chapter 5 — sharper constants and the connection to VC lower bounds.
- Vapnik, Statistical Learning Theory (1998), Chapter 4 — the VC-dimension framing of why prior knowledge is needed.
Critique:
- Sterkenburg and Grünwald, "The No-Free-Lunch Theorems of Supervised Learning," Synthese 199 (2021), 9979-10015, doi:10.1007/s11229-021-03233-1 — argues the NFL theorems are routinely overstated and clarifies what they do and do not imply about real-world learning.
Next Topics
- VC dimension: the complexity measure that, by the fundamental theorem, determines exactly which hypothesis classes are PAC-learnable.
- Bias-complexity tradeoff: the decomposition of ERM excess risk into approximation error (the price of restricting $\mathcal{H}$) and estimation error (the price of finite data).
Last reviewed: May 8, 2026