
Learning Theory Core

No-Free-Lunch Theorem

For binary classification with 0-1 loss, no learning algorithm can succeed on every distribution: for any algorithm and any sample size m smaller than half the domain, some realizable distribution forces error at least 1/8 with probability at least 1/7. Universal learners do not exist; prior knowledge enters through the choice of hypothesis class.

Core · Tier 2 · Stable · Core spine · ~35 min

Why This Matters

The No-Free-Lunch theorem is the formal answer to a tempting question: is there a single learning algorithm that works on every problem? The answer is no, and the proof is constructive. For any fixed algorithm $A$ and any sample size $m$ smaller than half the domain, there exists a distribution under which $A$ has error at least $1/8$ with probability at least $1/7$, even though some hypothesis achieves zero error on that distribution.

The theorem clarifies what learning actually requires. It is not enough to have a clever optimization procedure or more data. You must also commit to a hypothesis class, and that commitment is the prior knowledge that makes generalization possible. The choice of $\mathcal{H}$ is the bias the algorithm imposes on the world; without it, the PAC framework provides no guarantee.

This is also the entry point to the bias-complexity tradeoff and to VC dimension: NFL says you must restrict $\mathcal{H}$, and the next question is how restricted is restricted enough?

Mental Model

Think of a learner as a guesser. NFL says: for every fixed strategy, an adversary picking the distribution last can ensure the guesser is wrong on a noticeable fraction of unseen points. The adversary's power comes from the unseen-point set: anything the learner has not observed is fair game, and the adversary chooses the labels there to maximize confusion.

The way out is to refuse to consider every possible labeling. If the learner restricts attention to a small class $\mathcal{H}$ of "reasonable" hypotheses, the adversary's choices shrink. NFL is the formal proof that this restriction is not optional.

Formal Setup

Fix a domain $\mathcal{X}$ and label space $\{0, 1\}$. Loss is 0-1. A learning algorithm $A$ takes a labeled sample $S \in (\mathcal{X} \times \{0,1\})^m$ and returns a hypothesis $A(S): \mathcal{X} \to \{0, 1\}$. Population risk of a hypothesis $h$ under distribution $\mathcal{D}$ on $\mathcal{X} \times \{0,1\}$ is

$$L_\mathcal{D}(h) = \Pr_{(x, y) \sim \mathcal{D}}[h(x) \ne y].$$

A distribution $\mathcal{D}$ is realizable by some target $f: \mathcal{X} \to \{0,1\}$ if $L_\mathcal{D}(f) = 0$, that is, the labels are deterministic given $x$ via $f$.
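To make the definitions concrete, here is a minimal sketch (the helper name and the toy distribution are mine, not from the text) of the 0-1 population risk for a distribution with finite support:

```python
def population_risk(h, D):
    """0-1 risk L_D(h) = Pr_{(x,y)~D}[h(x) != y] for a finitely supported D.

    D maps labeled points (x, y) to probabilities that sum to 1.
    """
    return sum(p for (x, y), p in D.items() if h(x) != y)

# A realizable distribution: labels come deterministically from f(x) = x % 2,
# with x uniform on {0, 1, 2, 3}.
f = lambda x: x % 2
D = {(x, f(x)): 0.25 for x in range(4)}
print(population_risk(f, D))              # 0.0, so D is realizable by f
print(population_risk(lambda x: 0, D))    # 0.5
```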

The Theorem

Theorem

No-Free-Lunch (Shalev-Shwartz and Ben-David, Theorem 5.1)

Statement

Let $A$ be any learning algorithm for binary classification with 0-1 loss over a domain $\mathcal{X}$, and let $m < |\mathcal{X}|/2$ be a sample size. There exists a distribution $\mathcal{D}$ over $\mathcal{X} \times \{0,1\}$ such that:

  1. There is a function $f: \mathcal{X} \to \{0, 1\}$ with $L_\mathcal{D}(f) = 0$ (the distribution is realizable).
  2. With probability at least $1/7$ over the iid sample $S \sim \mathcal{D}^m$,

$$L_\mathcal{D}(A(S)) \ge \tfrac{1}{8}.$$

Intuition

The proof picks a subset $C \subset \mathcal{X}$ of size $2m$ and considers all $2^{2m}$ ways to label $C$. Each labeling $f_i$ defines a realizable distribution $\mathcal{D}_i$ uniform on $C$. Averaged over $i$, the algorithm cannot do better than chance on points it has not seen, because half of all labelings disagree with whatever $A$ predicts at any unobserved point. Since the maximum over $i$ is at least the average, the bound on the average yields an existence statement for the worst $i$.

Proof Sketch

Fix $C \subset \mathcal{X}$ with $|C| = 2m$. There are $T = 2^{2m}$ functions $f_1, \ldots, f_T$ from $C$ to $\{0, 1\}$. For each $i$, let $\mathcal{D}_i$ be the uniform distribution on $\{(x, f_i(x)) : x \in C\}$. Each $\mathcal{D}_i$ is realizable by $f_i$.

Goal: show $$\max_{i \in [T]} \mathbb{E}_{S \sim \mathcal{D}_i^m}\!\left[L_{\mathcal{D}_i}(A(S))\right] \ge \tfrac{1}{4}.$$

There are $k = (2m)^m$ possible sequences of $m$ instances from $C$. Denote the $j$-th sequence by $x_{j,1}, \ldots, x_{j,m}$ and write $S^i_j$ for the labeled sample $(x_{j,1}, f_i(x_{j,1})), \ldots, (x_{j,m}, f_i(x_{j,m}))$ it induces under labeling $f_i$. Then $$\mathbb{E}_{S \sim \mathcal{D}_i^m}[L_{\mathcal{D}_i}(A(S))] = \frac{1}{k} \sum_{j=1}^{k} L_{\mathcal{D}_i}(A(S^i_j)).$$

Use the chain $\max \ge \mathrm{avg} \ge \min$: $$\max_i \frac{1}{k} \sum_j L_{\mathcal{D}_i}(A(S^i_j)) \ge \frac{1}{T} \sum_i \frac{1}{k} \sum_j L_{\mathcal{D}_i}(A(S^i_j)) = \frac{1}{k} \sum_j \frac{1}{T} \sum_i L_{\mathcal{D}_i}(A(S^i_j)) \ge \min_j \frac{1}{T} \sum_i L_{\mathcal{D}_i}(A(S^i_j)).$$

Fix any $j$. The sequence $x_{j,1}, \ldots, x_{j,m}$ uses at most $m$ distinct points of $C$, so there are $p \ge m$ points $v_1, \ldots, v_p$ in $C$ that the sample never visits. Each point of $C$ has mass $1/(2m)$ under $\mathcal{D}_i$, so keeping only the unobserved points in the risk sum gives $$L_{\mathcal{D}_i}(A(S^i_j)) = \frac{1}{2m} \sum_{x \in C} \mathbb{1}\!\left[A(S^i_j)(x) \ne f_i(x)\right] \ge \frac{1}{2m} \sum_{r=1}^{p} \mathbb{1}\!\left[A(S^i_j)(v_r) \ne f_i(v_r)\right].$$

Pair the labelings: for each $r$, group the $f_i$ into $T/2$ pairs that differ only at $v_r$. Since $v_r$ is not in the sample, both members of a pair induce the same sample $S^i_j$, so the algorithm returns the same hypothesis for both, and its prediction at $v_r$ matches one labeling and not the other. Exactly half the $f_i$ therefore disagree with $A(S^i_j)(v_r)$: $$\frac{1}{T} \sum_i \mathbb{1}\!\left[A(S^i_j)(v_r) \ne f_i(v_r)\right] = \tfrac{1}{2}.$$

Averaging the per-sample bound over $i$ and exchanging sums, $$\frac{1}{T} \sum_i L_{\mathcal{D}_i}(A(S^i_j)) \ge \frac{1}{2m} \sum_{r=1}^{p} \frac{1}{T} \sum_i \mathbb{1}\!\left[A(S^i_j)(v_r) \ne f_i(v_r)\right] = \frac{1}{2m} \cdot p \cdot \tfrac{1}{2} = \frac{p}{4m} \ge \tfrac{1}{4},$$ where the final inequality uses $p \ge m$.

So there exists $i^*$ with $\mathbb{E}_{S \sim \mathcal{D}_{i^*}^m}[L_{\mathcal{D}_{i^*}}(A(S))] \ge 1/4$.
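The averaging step can be checked by brute force for small $m$. The sketch below is a toy setup of my own, not from SSBD: it fixes one concrete learner (memorize the sample, predict 0 elsewhere; the theorem holds for any learner), enumerates all $2^{2m}$ labelings of $C$ and all $(2m)^m$ sample sequences, and confirms that the average of the expected risks is at least $1/4$, so some labeling must reach $1/4$.

```python
import itertools

m = 3                     # sample size; |C| = 2m = 6 keeps the enumeration small
C = list(range(2 * m))

def memorize_then_zero(sample):
    """A fixed learner: repeat labels seen in the sample, predict 0 elsewhere."""
    seen = {x: y for x, y in sample}
    return lambda x: seen.get(x, 0)

def expected_risk(f):
    """E_{S ~ D_f^m}[L_{D_f}(A(S))], computed exactly over all (2m)^m sequences."""
    total = 0.0
    for xs in itertools.product(C, repeat=m):
        h = memorize_then_zero([(x, f[x]) for x in xs])
        total += sum(h(x) != f[x] for x in C) / len(C)   # D_f is uniform on C
    return total / len(C) ** m

risks = [expected_risk(f) for f in itertools.product((0, 1), repeat=len(C))]
print(f"average over all labelings: {sum(risks) / len(risks):.3f}  (proof: >= 0.25)")
print(f"worst single labeling:      {max(risks):.3f}  (so some D_i reaches >= 0.25)")
```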

Convert expectation to high probability. Let $Z = L_{\mathcal{D}_{i^*}}(A(S)) \in [0, 1]$ with $\mathbb{E}[Z] \ge 1/4$. For any $a \in (0, \mathbb{E}[Z])$, $$\Pr[Z \ge a] \ge \frac{\mathbb{E}[Z] - a}{1 - a}.$$ Take $a = 1/8$: $\Pr[Z \ge 1/8] \ge (1/4 - 1/8)/(1 - 1/8) = (1/8)/(7/8) = 1/7$.
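The inequality in the last step is the reverse Markov inequality for bounded random variables. For completeness, its one-line proof uses only $Z \in [0, 1]$: $$\mathbb{E}[Z] \le a \Pr[Z < a] + 1 \cdot \Pr[Z \ge a] = a + (1 - a)\Pr[Z \ge a],$$ and rearranging gives $\Pr[Z \ge a] \ge (\mathbb{E}[Z] - a)/(1 - a)$.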

Why It Matters

NFL is the lower-bound counterpart to the PAC upper bounds. Together they say: learning is possible exactly when you commit to a hypothesis class of controlled complexity, and impossible when you do not. Every PAC bound is a quantitative version of "you bought enough prior knowledge"; NFL is the proof that you had to buy any at all.

Failure Mode

NFL applies to binary classification with 0-1 loss in the worst case over distributions. It says nothing about average-case behavior on natural distributions, regression with a different loss, or settings where the adversary cannot place mass on points absent from training. The constants $1/8$ and $1/7$ also depend on the $|\mathcal{X}| \ge 2m$ assumption; once $m$ is large relative to $|\mathcal{X}|$, the construction no longer fits and memorization-style strategies begin to give guarantees.

Corollary: All Functions Is Not PAC-Learnable

Corollary

The class of all binary functions on an infinite domain is not PAC-learnable (SSBD Corollary 5.2)

Statement

Let $\mathcal{X}$ be an infinite domain and let $\mathcal{H} = \{0, 1\}^\mathcal{X}$ be the class of all binary functions on $\mathcal{X}$. Then $\mathcal{H}$ is not PAC-learnable.

Intuition

PAC learnability requires sample complexity $m_\mathcal{H}(\epsilon, \delta)$ that works for every distribution and target in $\mathcal{H}$. Pick $\epsilon < 1/8$ and $\delta < 1/7$; if a learner $A$ achieved this on $\mathcal{H}$, NFL would give a distribution where $A$ fails, a direct contradiction.

Proof Sketch

Suppose for contradiction that $\mathcal{H} = \{0,1\}^\mathcal{X}$ is PAC-learnable. Then there is an algorithm $A$ and a function $m_\mathcal{H}(\epsilon, \delta)$ such that for every realizable distribution and every $\epsilon, \delta \in (0, 1)$, $\Pr_{S \sim \mathcal{D}^m}[L_\mathcal{D}(A(S)) \le \epsilon] \ge 1 - \delta$ whenever $m \ge m_\mathcal{H}(\epsilon, \delta)$.

Pick any $\epsilon < 1/8$ and $\delta < 1/7$, say $\epsilon = 1/16$ and $\delta = 1/8$, and let $m = m_\mathcal{H}(\epsilon, \delta)$. Since $\mathcal{X}$ is infinite, $|\mathcal{X}| > 2m$, so NFL applies: there exists a realizable distribution $\mathcal{D}$ with $$\Pr_{S \sim \mathcal{D}^m}[L_\mathcal{D}(A(S)) \ge 1/8] \ge 1/7,$$ so $\Pr[L_\mathcal{D}(A(S)) \le \epsilon] \le 1 - 1/7 < 1 - \delta$. The PAC guarantee says this probability should be at least $1 - \delta$. Contradiction.

Why It Matters

This is the cleanest way to see why some hypothesis classes are simply unlearnable. The class of all functions is too rich for any algorithm to generalize from a finite sample. The next chapter of learning theory is about which restrictions on $\mathcal{H}$ are sufficient: finite classes work via the union bound, and infinite classes work iff they have finite VC dimension.

Failure Mode

The corollary needs $\mathcal{X}$ infinite (or at least large enough that NFL applies at the chosen $\epsilon, \delta$). For finite $\mathcal{X}$ small enough that $m_\mathcal{H}(\epsilon, \delta) \ge |\mathcal{X}|/2$, the learner can memorize and the argument breaks. Memorization on a finite domain is technically PAC-learning, but the sample complexity scales with $|\mathcal{X}|$ and is not what the framework is designed to characterize.

NFL and Prior Knowledge

NFL says no algorithm wins on every problem; it does not say nothing works. The way out is prior knowledge, and the most common form of prior knowledge is a restricted hypothesis class.

When you commit to $\mathcal{H} \subsetneq \{0,1\}^\mathcal{X}$, you declare which patterns you expect the world to follow. If the true labeling lies in $\mathcal{H}$ (realizable case) or close to it (agnostic case), the PAC machinery applies and learning succeeds with sample complexity controlled by $\log|\mathcal{H}|$ or by VC dimension. If the true labeling lies far outside $\mathcal{H}$, the approximation error is large and even infinite data does not save you. Either way, the choice of $\mathcal{H}$ is the prior, and making that choice is the commitment NFL forces.

Other ways to express prior knowledge include weighting hypotheses (structural risk minimization, MDL), restricting the loss function or optimization landscape, and Bayesian priors. SSBD Chapter 7 develops these alternatives. All of them encode the same fact: no commitment, no generalization.

Common Confusions

Watch Out

NFL says learning is possible, given the right H

NFL does not say learning is impossible. It says no algorithm works on every task. Choosing the right hypothesis class is the prior knowledge that makes learning possible. PAC bounds quantify what "right" means: a class with small VC dimension or small $\log|\mathcal{H}|$ relative to the sample size.

Watch Out

NFL is worst-case, not average-case

The bound is over the worst distribution for the fixed algorithm. For natural distributions (images, text, physical processes), well-chosen hypothesis classes like neural networks, linear models on engineered features, or kernel methods routinely succeed. NFL says these classes must fail on some distribution; it does not say they fail on distributions you actually care about.

Watch Out

The 1/4 average error is not a bound for every distribution

The proof shows that the average over labelings of the algorithm's expected risk is at least $1/4$. This is what lets you pull out a worst labeling on which the fixed algorithm has expected risk $\ge 1/4$. It does not mean that every specific $\mathcal{D}_i$ produces expected error $\ge 1/4$; only that some $\mathcal{D}_i$ does.

Watch Out

Realizability does not save you

The hard distribution constructed in the proof is realizable by $f_i$ (i.e., $L_{\mathcal{D}_i}(f_i) = 0$). NFL therefore holds even with the realizability assumption that powers the cleanest PAC bounds. The reason is that the learner does not know which $f_i$ generated the labels, and must commit to a hypothesis from data alone. With $\mathcal{H} = \{0,1\}^\mathcal{X}$, nothing in the data identifies $f_i$ on the unseen points.

Connection to PAC and VC

The relationship between NFL, PAC, and VC dimension fits a single chain:

  1. NFL: $\mathcal{H} = \{0,1\}^\mathcal{X}$ is not PAC-learnable.
  2. Finite-class PAC bound: $|\mathcal{H}| < \infty$ implies $\mathcal{H}$ is PAC-learnable with sample complexity $O((\log|\mathcal{H}| + \log(1/\delta))/\epsilon)$ (realizable) or $O((\log|\mathcal{H}| + \log(1/\delta))/\epsilon^2)$ (agnostic).
  3. Fundamental theorem of PAC learning: for binary classification with 0-1 loss, $\mathcal{H}$ is PAC-learnable iff $\mathcal{H}$ has finite VC dimension.

NFL is the negative end of (3): infinite VC dimension means not PAC-learnable. Restriction is mandatory; finite VC dimension is the right notion of "restricted enough" for binary classification.
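As a rough sense of scale for the finite-class bound in item 2, here is a minimal sketch (the function name and the example numbers are mine) of the standard realizable-case sufficient sample size $m \ge (\ln|\mathcal{H}| + \ln(1/\delta))/\epsilon$:

```python
import math

def realizable_sample_size(h_size, eps, delta):
    """Sufficient m for ERM over a finite class in the realizable case:
    m >= (ln|H| + ln(1/delta)) / eps."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

# All binary functions on a 20-point domain: |H| = 2^20, so log|H| grows with the domain.
print(realizable_sample_size(2 ** 20, eps=0.1, delta=0.05))   # 169
```

The same formula blows up as the class approaches all of $\{0,1\}^\mathcal{X}$ on a growing domain, which is the corollary's point in quantitative form.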

Worked Example: Why m < |X|/2 Is Tight

Take $|\mathcal{X}| = 100$ and $m = 49$. NFL applies: some realizable distribution forces $L_\mathcal{D}(A(S)) \ge 1/8$ with probability $\ge 1/7$. Now take $m = 50$. The theorem as stated no longer applies, but nothing changes abruptly. The sample is drawn iid with replacement, so $50$ draws from a uniform $\mathcal{D}$ on a set $C$ of size $100$ see only about $40$ distinct points in expectation; the unseen-point argument still has plenty of room, and a strategy like "memorize $S$ and predict 0 on unseen points" becomes reliable only once $m$ is large enough to cover essentially all of $\mathcal{X}$, which takes on the order of $|\mathcal{X}| \log |\mathcal{X}|$ draws.

The clean cutoff is $m < |\mathcal{X}|/2$ for the proof as stated. The constants $1/8$ and $1/7$ are specific to the construction with $|C| = 2m$, not fundamental: tighter constructions improve them (Anthony and Bartlett 1999, Theorem 5.3) at the cost of a more delicate argument.
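A quick calculation (my own illustration, not from the text) backs up the coverage numbers in the worked example: with iid uniform draws from $n$ points, the expected number of distinct points seen after $m$ draws is $n(1 - (1 - 1/n)^m)$, and full coverage needs roughly $n \ln n$ draws.

```python
import math

def expected_distinct(n, m):
    """Expected number of distinct points among m iid uniform draws from n points."""
    return n * (1 - (1 - 1 / n) ** m)

n = 100
for m in (49, 50, 100, math.ceil(n * math.log(n))):
    print(f"m = {m:3d}: about {expected_distinct(n, m):5.1f} of {n} points seen")
# m =  49: about  38.9 of 100 points seen
# m =  50: about  39.5 of 100 points seen
# m = 100: about  63.4 of 100 points seen
# m = 461: about  99.0 of 100 points seen   (coupon-collector scale n*ln(n))
```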

Exercises

ExerciseCore

Problem

State the No-Free-Lunch theorem in your own words, naming the loss, the domain assumption, and the two probabilistic constants.

ExerciseCore

Problem

Where in the proof is the realizability assumption used? Could the proof be adapted to the agnostic setting?

ExerciseAdvanced

Problem

Use NFL to prove that if $\mathcal{H}$ shatters an infinite set $T \subset \mathcal{X}$ (every finite subset of $T$ admits every binary labeling within $\mathcal{H}$), then $\mathcal{H}$ is not PAC-learnable.

ExerciseAdvanced

Problem

Show that on a finite domain $\mathcal{X}$ of size $N$, the class $\mathcal{H} = \{0,1\}^\mathcal{X}$ is PAC-learnable with sample complexity $m = O((N + \log(1/\delta))/\epsilon)$ in the realizable case. Reconcile with the NFL corollary.

Summary

  • For 0-1 binary classification with $m < |\mathcal{X}|/2$, every algorithm has a realizable distribution where it fails with probability $\ge 1/7$ to achieve error $< 1/8$.
  • Proof structure: pick $C$ of size $2m$, average over the $2^{2m}$ realizable labelings, pair labelings on each unseen point so half disagree with the algorithm.
  • Corollary: the class of all binary functions on an infinite domain is not PAC-learnable.
  • NFL forces every learner to commit to a hypothesis class. The choice of $\mathcal{H}$ is the prior knowledge that enables generalization.
  • Restriction is mandatory; finite VC dimension characterizes "restricted enough" for binary classification.

References

Canonical:

  • Shalev-Shwartz and Ben-David, Understanding Machine Learning, Chapter 5, Theorem 5.1 and Corollary 5.2.
  • Wolpert, "The Lack of A Priori Distinctions Between Learning Algorithms," Neural Computation 8 (1996), 1341-1390 — the original NFL theorem in a more general form.
  • Wolpert and Macready, "No Free Lunch Theorems for Optimization," IEEE Transactions on Evolutionary Computation 1 (1997), 67-82 — the optimization analog.

Current:

  • Mohri, Rostamizadeh, Talwalkar, Foundations of Machine Learning (2018), Section 3.4.
  • Anthony and Bartlett, Neural Network Learning: Theoretical Foundations (1999), Chapter 5 — sharper constants and the connection to VC lower bounds.
  • Vapnik, Statistical Learning Theory (1998), Chapter 4 — the VC-dimension framing of why prior knowledge is needed.

Critique:

  • Sterkenburg and Grünwald, "The No-Free-Lunch Theorems of Supervised Learning," Synthese 199 (2021), 9979-10015, doi:10.1007/s11229-021-03233-1 — argues the NFL theorems are routinely overstated and clarifies what they do and do not imply about real-world learning.

Next Topics

  • VC dimension: the complexity measure that, by the fundamental theorem, determines exactly which hypothesis classes are PAC-learnable.
  • Bias-complexity tradeoff: the decomposition of ERM excess risk into approximation error (the price of restricting $\mathcal{H}$) and estimation error (the price of finite data).

Last reviewed: May 8, 2026
