
Learning Theory Core

No-Free-Lunch Theorem

For binary classification with 0-1 loss, no learning algorithm can succeed on every distribution: for any algorithm and any sample size m smaller than half the domain, some realizable distribution forces error at least 1/8 with probability at least 1/7. Universal learners do not exist; prior knowledge enters through the choice of hypothesis class.

Core · Tier 2 · Stable · Core spine · ~35 min

Why This Matters

The No-Free-Lunch theorem is the formal answer to a tempting question: is there a single learning algorithm that works on every problem? The answer is no, and the proof is constructive. For any fixed algorithm $A$ and any sample size $m$ smaller than half the domain, there exists a distribution under which $A$ has error at least $1/8$ with probability at least $1/7$, even though some hypothesis achieves zero error on that distribution.

The theorem clarifies what learning actually requires. It is not enough to have a clever optimization procedure or more data. You must also commit to a hypothesis class, and that commitment is the prior knowledge that makes generalization possible. The choice of $\mathcal{H}$ is the bias the algorithm imposes on the world; without it, the PAC framework provides no guarantee.

This is also the entry point to the bias-complexity tradeoff and to VC dimension: NFL says you must restrict $\mathcal{H}$, and the next question is how restricted is restricted enough?

Mental Model

Think of a learner as a guesser. NFL says: for every fixed strategy, an adversary picking the distribution last can ensure the guesser is wrong on a noticeable fraction of unseen points. The adversary's power comes from the unseen-point set: anything the learner has not observed is fair game, and the adversary chooses the labels there to maximize confusion.

The way out is to refuse to consider every possible labeling. If the learner restricts attention to a small class $\mathcal{H}$ of "reasonable" hypotheses, the adversary's choices shrink. NFL is the formal proof that this restriction is not optional.

Formal Setup

Fix a domain $\mathcal{X}$ and label space $\{0, 1\}$. Loss is 0-1. A learning algorithm $A$ takes a labeled sample $S \in (\mathcal{X} \times \{0,1\})^m$ and returns a hypothesis $A(S): \mathcal{X} \to \{0, 1\}$. Population risk of a hypothesis $h$ under distribution $\mathcal{D}$ on $\mathcal{X} \times \{0,1\}$ is

$$L_\mathcal{D}(h) = \Pr_{(x, y) \sim \mathcal{D}}[h(x) \ne y].$$

A distribution $\mathcal{D}$ is realizable by some target $f: \mathcal{X} \to \{0,1\}$ if $L_\mathcal{D}(f) = 0$, that is, the labels are deterministic given $x$ via $f$.
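To make the definitions concrete, here is a minimal sketch (the helper name and the toy distribution are mine, not from the text) of the 0-1 population risk for a distribution with finite support:

```python
def population_risk(h, D):
    """0-1 risk L_D(h) = Pr_{(x,y)~D}[h(x) != y] for a finitely supported D.

    D maps labeled points (x, y) to probabilities that sum to 1.
    """
    return sum(p for (x, y), p in D.items() if h(x) != y)

# A realizable distribution: labels come deterministically from f(x) = x % 2,
# with x uniform on {0, 1, 2, 3}.
f = lambda x: x % 2
D = {(x, f(x)): 0.25 for x in range(4)}
print(population_risk(f, D))              # 0.0, so D is realizable by f
print(population_risk(lambda x: 0, D))    # 0.5
```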

The Theorem

Theorem

No-Free-Lunch (Shalev-Shwartz and Ben-David, Theorem 5.1)

Statement

Let $A$ be any learning algorithm for binary classification with 0-1 loss over a domain $\mathcal{X}$, and let $m < |\mathcal{X}|/2$ be a sample size. There exists a distribution $\mathcal{D}$ over $\mathcal{X} \times \{0,1\}$ such that:

  1. There is a function $f: \mathcal{X} \to \{0, 1\}$ with $L_\mathcal{D}(f) = 0$ (the distribution is realizable).
  2. With probability at least $1/7$ over the iid sample $S \sim \mathcal{D}^m$,

$$L_\mathcal{D}(A(S)) \ge \tfrac{1}{8}.$$

Intuition

The proof picks a subset $C \subset \mathcal{X}$ of size $2m$ and considers all $2^{2m}$ ways to label $C$. Each labeling $f_i$ defines a realizable distribution $\mathcal{D}_i$ uniform on $C$. Averaged over $i$, the algorithm cannot do better than chance on points it has not seen, because half of all labelings disagree with whatever $A$ predicts at any unobserved point. Since the maximum over $i$ is at least the average, the bound on the average yields an existence statement for the worst $i$.

Proof Sketch

Fix $C \subset \mathcal{X}$ with $|C| = 2m$. There are $T = 2^{2m}$ functions $f_1, \ldots, f_T$ from $C$ to $\{0, 1\}$. For each $i$, let $\mathcal{D}_i$ be the uniform distribution on $\{(x, f_i(x)) : x \in C\}$. Each $\mathcal{D}_i$ is realizable by $f_i$.

Goal: show $$\max_{i \in [T]} \mathbb{E}_{S \sim \mathcal{D}_i^m}\!\left[L_{\mathcal{D}_i}(A(S))\right] \ge \tfrac{1}{4}.$$

There are $k = (2m)^m$ possible sequences of $m$ instances from $C$. Denote the $j$-th sequence by $x_{j,1}, \ldots, x_{j,m}$ and write $S^i_j$ for the labeled sample $(x_{j,1}, f_i(x_{j,1})), \ldots, (x_{j,m}, f_i(x_{j,m}))$ it induces under labeling $f_i$. Then $$\mathbb{E}_{S \sim \mathcal{D}_i^m}[L_{\mathcal{D}_i}(A(S))] = \frac{1}{k} \sum_{j=1}^{k} L_{\mathcal{D}_i}(A(S^i_j)).$$

Use the chain $\max \ge \mathrm{avg} \ge \min$: $$\max_i \frac{1}{k} \sum_j L_{\mathcal{D}_i}(A(S^i_j)) \ge \frac{1}{T} \sum_i \frac{1}{k} \sum_j L_{\mathcal{D}_i}(A(S^i_j)) = \frac{1}{k} \sum_j \frac{1}{T} \sum_i L_{\mathcal{D}_i}(A(S^i_j)) \ge \min_j \frac{1}{T} \sum_i L_{\mathcal{D}_i}(A(S^i_j)).$$

Fix any $j$. The sequence $x_{j,1}, \ldots, x_{j,m}$ uses at most $m$ distinct points of $C$, so there are $p \ge m$ points $v_1, \ldots, v_p$ in $C$ that the sample never visits. Each point of $C$ has mass $1/(2m)$ under $\mathcal{D}_i$, so keeping only the unobserved points in the risk sum gives $$L_{\mathcal{D}_i}(A(S^i_j)) = \frac{1}{2m} \sum_{x \in C} \mathbb{1}\!\left[A(S^i_j)(x) \ne f_i(x)\right] \ge \frac{1}{2m} \sum_{r=1}^{p} \mathbb{1}\!\left[A(S^i_j)(v_r) \ne f_i(v_r)\right].$$

Pair the labelings: for each $r$, group the $f_i$ into $T/2$ pairs that differ only at $v_r$. Since $v_r$ is not in the sample, both members of a pair induce the same sample $S^i_j$, so the algorithm returns the same hypothesis for both, and its prediction at $v_r$ matches one labeling and not the other. Exactly half the $f_i$ therefore disagree with $A(S^i_j)(v_r)$: $$\frac{1}{T} \sum_i \mathbb{1}\!\left[A(S^i_j)(v_r) \ne f_i(v_r)\right] = \tfrac{1}{2}.$$

Averaging the per-sample bound over $i$ and exchanging sums, $$\frac{1}{T} \sum_i L_{\mathcal{D}_i}(A(S^i_j)) \ge \frac{1}{2m} \sum_{r=1}^{p} \frac{1}{T} \sum_i \mathbb{1}\!\left[A(S^i_j)(v_r) \ne f_i(v_r)\right] = \frac{1}{2m} \cdot p \cdot \tfrac{1}{2} = \frac{p}{4m} \ge \tfrac{1}{4},$$ where the final inequality uses $p \ge m$.

So there exists $i^*$ with $\mathbb{E}_{S \sim \mathcal{D}_{i^*}^m}[L_{\mathcal{D}_{i^*}}(A(S))] \ge 1/4$.
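The averaging step can be checked by brute force for small $m$. The sketch below is a toy setup of my own, not from SSBD: it fixes one concrete learner (memorize the sample, predict 0 elsewhere; the theorem holds for any learner), enumerates all $2^{2m}$ labelings of $C$ and all $(2m)^m$ sample sequences, and confirms that the average of the expected risks is at least $1/4$, so some labeling must reach $1/4$.

```python
import itertools

m = 3                     # sample size; |C| = 2m = 6 keeps the enumeration small
C = list(range(2 * m))

def memorize_then_zero(sample):
    """A fixed learner: repeat labels seen in the sample, predict 0 elsewhere."""
    seen = {x: y for x, y in sample}
    return lambda x: seen.get(x, 0)

def expected_risk(f):
    """E_{S ~ D_f^m}[L_{D_f}(A(S))], computed exactly over all (2m)^m sequences."""
    total = 0.0
    for xs in itertools.product(C, repeat=m):
        h = memorize_then_zero([(x, f[x]) for x in xs])
        total += sum(h(x) != f[x] for x in C) / len(C)   # D_f is uniform on C
    return total / len(C) ** m

risks = [expected_risk(f) for f in itertools.product((0, 1), repeat=len(C))]
print(f"average over all labelings: {sum(risks) / len(risks):.3f}  (proof: >= 0.25)")
print(f"worst single labeling:      {max(risks):.3f}  (so some D_i reaches >= 0.25)")
```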

Convert expectation to high probability. Let $Z = L_{\mathcal{D}_{i^*}}(A(S)) \in [0, 1]$ with $\mathbb{E}[Z] \ge 1/4$. For any $a \in (0, \mathbb{E}[Z])$, $$\Pr[Z \ge a] \ge \frac{\mathbb{E}[Z] - a}{1 - a}.$$ Take $a = 1/8$: $\Pr[Z \ge 1/8] \ge (1/4 - 1/8)/(1 - 1/8) = (1/8)/(7/8) = 1/7$.
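The inequality in the last step is the reverse Markov inequality for bounded random variables. For completeness, its one-line proof uses only $Z \in [0, 1]$: $$\mathbb{E}[Z] \le a \Pr[Z < a] + 1 \cdot \Pr[Z \ge a] = a + (1 - a)\Pr[Z \ge a],$$ and rearranging gives $\Pr[Z \ge a] \ge (\mathbb{E}[Z] - a)/(1 - a)$.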

Why It Matters

NFL is the lower-bound counterpart to the PAC upper bounds. Together they say: learning is possible exactly when you commit to a hypothesis class of controlled complexity, and impossible when you do not. Every PAC bound is a quantitative version of "you bought enough prior knowledge"; NFL is the proof that you had to buy any at all.

Failure Mode

NFL applies to binary classification with 0-1 loss in the worst case over distributions. It says nothing about average-case behavior on natural distributions, regression with a different loss, or settings where the adversary cannot place mass on points absent from training. The constants $1/8$ and $1/7$ also depend on the $|\mathcal{X}| \ge 2m$ assumption; once $m$ is large relative to $|\mathcal{X}|$, the construction no longer fits and memorization-style strategies begin to give guarantees.

Corollary: All Functions Is Not PAC-Learnable

Corollary

The class of all binary functions on an infinite domain is not PAC-learnable (SSBD Corollary 5.2)

Statement

Let $\mathcal{X}$ be an infinite domain and let $\mathcal{H} = \{0, 1\}^\mathcal{X}$ be the class of all binary functions on $\mathcal{X}$. Then $\mathcal{H}$ is not PAC-learnable.

Intuition

PAC learnability requires sample complexity $m_\mathcal{H}(\epsilon, \delta)$ that works for every distribution and target in $\mathcal{H}$. Pick $\epsilon < 1/8$ and $\delta < 1/7$; if a learner $A$ achieved this on $\mathcal{H}$, NFL would give a distribution where $A$ fails, a direct contradiction.

Proof Sketch

Suppose for contradiction that $\mathcal{H} = \{0,1\}^\mathcal{X}$ is PAC-learnable. Then there is an algorithm $A$ and a function $m_\mathcal{H}(\epsilon, \delta)$ such that for every realizable distribution and every $\epsilon, \delta \in (0, 1)$, $\Pr_{S \sim \mathcal{D}^m}[L_\mathcal{D}(A(S)) \le \epsilon] \ge 1 - \delta$ whenever $m \ge m_\mathcal{H}(\epsilon, \delta)$.

Pick any $\epsilon < 1/8$ and $\delta < 1/7$, say $\epsilon = 1/16$ and $\delta = 1/8$, and let $m = m_\mathcal{H}(\epsilon, \delta)$. Since $\mathcal{X}$ is infinite, $|\mathcal{X}| > 2m$, so NFL applies: there exists a realizable distribution $\mathcal{D}$ with $$\Pr_{S \sim \mathcal{D}^m}[L_\mathcal{D}(A(S)) \ge 1/8] \ge 1/7,$$ so $\Pr[L_\mathcal{D}(A(S)) \le \epsilon] \le 1 - 1/7 < 1 - \delta$. The PAC guarantee says this probability should be at least $1 - \delta$. Contradiction.

Why It Matters

This is the cleanest way to see why some hypothesis classes are simply unlearnable. The class of all functions is too rich for any algorithm to generalize from a finite sample. The next chapter of learning theory is about which restrictions on $\mathcal{H}$ are sufficient: finite classes work via the union bound, and infinite classes work iff they have finite VC dimension.

Failure Mode

The corollary needs $\mathcal{X}$ infinite (or at least large enough that NFL applies at the chosen $\epsilon, \delta$). For finite $\mathcal{X}$ small enough that $m_\mathcal{H}(\epsilon, \delta) \ge |\mathcal{X}|/2$, the learner can memorize and the argument breaks. Memorization on a finite domain is technically PAC-learning, but the sample complexity scales with $|\mathcal{X}|$ and is not what the framework is designed to characterize.

NFL and Prior Knowledge

NFL says no algorithm wins on every problem; it does not say nothing works. The way out is prior knowledge, and the most common form of prior knowledge is a restricted hypothesis class.

When you commit to $\mathcal{H} \subsetneq \{0,1\}^\mathcal{X}$, you declare which patterns you expect the world to follow. If the true labeling lies in $\mathcal{H}$ (realizable case) or close to it (agnostic case), the PAC machinery applies and learning succeeds with sample complexity controlled by $\log|\mathcal{H}|$ or by VC dimension. If the true labeling lies far outside $\mathcal{H}$, the approximation error is large and even infinite data does not save you. Either way, the choice of $\mathcal{H}$ is the prior, and making that choice is the commitment NFL forces.

Other ways to express prior knowledge include weighting hypotheses (structural risk minimization, MDL), restricting the loss function or optimization landscape, and Bayesian priors. SSBD Chapter 7 develops these alternatives. All of them encode the same fact: no commitment, no generalization.

Common Confusions

Watch Out

NFL says learning is possible, given the right H

NFL does not say learning is impossible. It says no algorithm works on every task. Choosing the right hypothesis class is the prior knowledge that makes learning possible. PAC bounds quantify what "right" means: a class with small VC dimension or small $\log|\mathcal{H}|$ relative to the sample size.

Watch Out

NFL is worst-case, not average-case

The bound is over the worst distribution for the fixed algorithm. For natural distributions (images, text, physical processes), well-chosen hypothesis classes like neural networks, linear models on engineered features, or kernel methods routinely succeed. NFL says these classes must fail on some distribution; it does not say they fail on distributions you actually care about.

Watch Out

The 1/4 average error is not a bound for every distribution

The proof shows that the average over labelings of the algorithm's expected risk is at least $1/4$. This is what lets you pull out a worst labeling on which the fixed algorithm has expected risk $\ge 1/4$. It does not mean that every specific $\mathcal{D}_i$ produces expected error $\ge 1/4$; only that some $\mathcal{D}_i$ does.

Watch Out

Realizability does not save you

The hard distribution constructed in the proof is realizable by $f_i$ (i.e., $L_{\mathcal{D}_i}(f_i) = 0$). NFL therefore holds even with the realizability assumption that powers the cleanest PAC bounds. The reason is that the learner does not know which $f_i$ generated the labels, and must commit to a hypothesis from data alone. With $\mathcal{H} = \{0,1\}^\mathcal{X}$, nothing in the data identifies $f_i$ on the unseen points.

Connection to PAC and VC

The relationship between NFL, PAC, and VC dimension fits a single chain:

  1. NFL: $\mathcal{H} = \{0,1\}^\mathcal{X}$ is not PAC-learnable.
  2. Finite-class PAC bound: $|\mathcal{H}| < \infty$ implies $\mathcal{H}$ is PAC-learnable with sample complexity $O((\log|\mathcal{H}| + \log(1/\delta))/\epsilon)$ (realizable) or $O((\log|\mathcal{H}| + \log(1/\delta))/\epsilon^2)$ (agnostic).
  3. Fundamental theorem of PAC learning: for binary classification with 0-1 loss, $\mathcal{H}$ is PAC-learnable iff $\mathcal{H}$ has finite VC dimension.

NFL is the negative end of (3): infinite VC dimension means not PAC-learnable. Restriction is mandatory; finite VC dimension is the right notion of "restricted enough" for binary classification.
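As a rough sense of scale for the finite-class bound in item 2, here is a minimal sketch (the function name and the example numbers are mine) of the standard realizable-case sufficient sample size $m \ge (\ln|\mathcal{H}| + \ln(1/\delta))/\epsilon$:

```python
import math

def realizable_sample_size(h_size, eps, delta):
    """Sufficient m for ERM over a finite class in the realizable case:
    m >= (ln|H| + ln(1/delta)) / eps."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

# All binary functions on a 20-point domain: |H| = 2^20, so log|H| grows with the domain.
print(realizable_sample_size(2 ** 20, eps=0.1, delta=0.05))   # 169
```

The same formula blows up as the class approaches all of $\{0,1\}^\mathcal{X}$ on a growing domain, which is the corollary's point in quantitative form.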

Worked Example: Why m < |X|/2 Is Tight

Take $|\mathcal{X}| = 100$ and $m = 49$. NFL applies: some realizable distribution forces $L_\mathcal{D}(A(S)) \ge 1/8$ with probability $\ge 1/7$. Now take $m = 50$. The theorem as stated no longer applies, but nothing changes abruptly. The sample is drawn iid with replacement, so $50$ draws from a uniform $\mathcal{D}$ on a set $C$ of size $100$ see only about $40$ distinct points in expectation; the unseen-point argument still has plenty of room, and a strategy like "memorize $S$ and predict 0 on unseen points" becomes reliable only once $m$ is large enough to cover essentially all of $\mathcal{X}$, which takes on the order of $|\mathcal{X}| \log |\mathcal{X}|$ draws.

The clean cutoff is $m < |\mathcal{X}|/2$ for the proof as stated. The constants $1/8$ and $1/7$ are specific to the construction with $|C| = 2m$, not fundamental: tighter constructions improve them (Anthony and Bartlett 1999, Theorem 5.3) at the cost of a more delicate argument.
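A quick calculation (my own illustration, not from the text) backs up the coverage numbers in the worked example: with iid uniform draws from $n$ points, the expected number of distinct points seen after $m$ draws is $n(1 - (1 - 1/n)^m)$, and full coverage needs roughly $n \ln n$ draws.

```python
import math

def expected_distinct(n, m):
    """Expected number of distinct points among m iid uniform draws from n points."""
    return n * (1 - (1 - 1 / n) ** m)

n = 100
for m in (49, 50, 100, math.ceil(n * math.log(n))):
    print(f"m = {m:3d}: about {expected_distinct(n, m):5.1f} of {n} points seen")
# m =  49: about  38.9 of 100 points seen
# m =  50: about  39.5 of 100 points seen
# m = 100: about  63.4 of 100 points seen
# m = 461: about  99.0 of 100 points seen   (coupon-collector scale n*ln(n))
```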

Exercises

ExerciseCore

Problem

State the No-Free-Lunch theorem in your own words, naming the loss, the domain assumption, and the two probabilistic constants.

ExerciseCore

Problem

Where in the proof is the realizability assumption used? Could the proof be adapted to the agnostic setting?

ExerciseAdvanced

Problem

Use NFL to prove that if $\mathcal{H}$ shatters an infinite set $T \subset \mathcal{X}$ (every finite subset of $T$ admits every binary labeling within $\mathcal{H}$), then $\mathcal{H}$ is not PAC-learnable.

ExerciseAdvanced

Problem

Show that on a finite domain $\mathcal{X}$ of size $N$, the class $\mathcal{H} = \{0,1\}^\mathcal{X}$ is PAC-learnable with sample complexity $m = O((N + \log(1/\delta))/\epsilon)$ in the realizable case. Reconcile with the NFL corollary.

Summary

  • For 0-1 binary classification with $m < |\mathcal{X}|/2$, every algorithm has a realizable distribution where it fails with probability $\ge 1/7$ to achieve error $< 1/8$.
  • Proof structure: pick $C$ of size $2m$, average over the $2^{2m}$ realizable labelings, pair labelings on each unseen point so half disagree with the algorithm.
  • Corollary: the class of all binary functions on an infinite domain is not PAC-learnable.
  • NFL forces every learner to commit to a hypothesis class. The choice of $\mathcal{H}$ is the prior knowledge that enables generalization.
  • Restriction is mandatory; finite VC dimension characterizes "restricted enough" for binary classification.

References

Canonical:

  • Shalev-Shwartz and Ben-David, Understanding Machine Learning, Chapter 5, Theorem 5.1 and Corollary 5.2.
  • Wolpert, "The Lack of A Priori Distinctions Between Learning Algorithms," Neural Computation 8 (1996), 1341-1390 — the original NFL theorem in a more general form.
  • Wolpert and Macready, "No Free Lunch Theorems for Optimization," IEEE Transactions on Evolutionary Computation 1 (1997), 67-82 — the optimization analog.

Current:

  • Mohri, Rostamizadeh, Talwalkar, Foundations of Machine Learning (2018), Section 3.4.
  • Anthony and Bartlett, Neural Network Learning: Theoretical Foundations (1999), Chapter 5 — sharper constants and the connection to VC lower bounds.
  • Vapnik, Statistical Learning Theory (1998), Chapter 4 — the VC-dimension framing of why prior knowledge is needed.

Critique:

  • Sterkenburg and Grünwald, "The No-Free-Lunch Theorems of Supervised Learning," Synthese 199 (2021), 9979-10015, doi:10.1007/s11229-021-03233-1 — argues the NFL theorems are routinely overstated and clarifies what they do and do not imply about real-world learning.

Next Topics

  • VC dimension: the complexity measure that, by the fundamental theorem, determines exactly which hypothesis classes are PAC-learnable.
  • Bias-complexity tradeoff: the decomposition of ERM excess risk into approximation error (the price of restricting $\mathcal{H}$) and estimation error (the price of finite data).

Last reviewed: May 8, 2026
