
Statistical Foundations

Fano Inequality

The Fano inequality is the standard tool for information-theoretic lower bounds: if X \to Y \to \hat{X} is a Markov chain, the error probability is bounded below in terms of the conditional entropy and the alphabet size.


Why This Matters

Fano inequality is the standard tool for proving information-theoretic lower bounds in statistics and machine learning. Whenever you see a minimax lower bound proved via "reduction to multiple hypothesis testing," it almost certainly uses Fano. It converts a bound on mutual information into a bound on error probability, giving you a direct way to show that estimation is hard.

Le Cam uses two hypotheses. Fano uses M hypotheses, which makes it far more powerful for problems with rich parameter spaces. Most minimax lower bounds in high-dimensional statistics, nonparametric regression, and learning theory are proved via the Fano method.

Mental Model

You observe data Y and want to recover a discrete random variable X (which takes M values). You produce an estimate \hat{X}(Y). Fano says: if the data Y does not carry much information about X (low mutual information), then you must make errors frequently.

The bound is sharp in an information-theoretic sense: the mutual information I(X; Y) measures how much the data tells you about X, and \log M measures how hard the problem is (how many things you need to distinguish). When I(X; Y) \ll \log M, you cannot reliably recover X.

The Fano Inequality

Theorem

Fano Inequality

Statement

Let X be uniformly distributed on \{1, \ldots, M\}, and let X \to Y \to \hat{X} be a Markov chain (meaning \hat{X} is a function of Y only). Then:

P(\hat{X} \neq X) \geq \frac{H(X \mid Y) - \log 2}{\log M}

Equivalently, using H(X \mid Y) = H(X) - I(X; Y) = \log M - I(X; Y):

P(\hat{X} \neq X) \geq 1 - \frac{I(X; Y) + \log 2}{\log M}

Intuition

If you need to distinguish among M possibilities (\log M nats of information) but the data only provides I(X; Y) nats, you cannot reliably succeed. The error probability must be at least 1 - (I(X; Y) + \log 2)/\log M. When I(X; Y) is small relative to \log M, the error probability is close to 1 - 1/M (random guessing).
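The inequality can be checked numerically. Here is a minimal sketch (all names are illustrative, not from any library) for a small symmetric channel: it computes H(X \mid Y) exactly and compares the optimal decoder's error probability against the Fano bound.

```python
import numpy as np

# Minimal numerical check of Fano's inequality (natural-log version) on a
# small symmetric channel. All names here are illustrative.
M = 8       # alphabet size; X is uniform on {0, ..., M-1}
eps = 0.4   # probability the channel corrupts X to a uniformly chosen wrong symbol

# Channel matrix P(Y = j | X = i).
P_y_given_x = np.full((M, M), eps / (M - 1))
np.fill_diagonal(P_y_given_x, 1 - eps)

P_xy = P_y_given_x / M        # joint distribution, since X is uniform
P_y = P_xy.sum(axis=0)        # marginal of Y (uniform by symmetry)

# I(X; Y) = sum_{x,y} P(x,y) log( P(y|x) / P(y) ), in nats.
I_xy = np.sum(P_xy * np.log(P_y_given_x / P_y))
H_x_given_y = np.log(M) - I_xy

# The MAP decoder here is X_hat = Y (since 1 - eps > eps/(M-1)),
# so the optimal error probability is exactly eps.
p_error = eps

fano_bound = (H_x_given_y - np.log(2)) / np.log(M)
print(f"P(error) = {p_error:.3f} >= Fano bound = {fano_bound:.3f}")
assert p_error >= fano_bound
```

For this channel the bound is reasonably tight; as eps grows toward 1 - 1/M the channel becomes useless and the bound approaches the random-guessing error.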

Proof Sketch

Define the error indicator E = \mathbf{1}[\hat{X} \neq X]. Since E is determined by (X, \hat{X}), the chain rule of entropy gives:

H(E, X \mid \hat{X}) = H(X \mid \hat{X}) = H(E \mid \hat{X}) + H(X \mid E, \hat{X})

Now bound each term:

  • H(E \mid \hat{X}) \leq H(E) \leq \log 2 (binary entropy)
  • H(X \mid E = 0, \hat{X}) = 0 (if no error, X = \hat{X})
  • H(X \mid E = 1, \hat{X}) \leq \log(M - 1) \leq \log M (if error, X can be any of the other M - 1 values)

Combining: H(X \mid \hat{X}) \leq \log 2 + P(E) \log M.

Since \hat{X} is a function of Y, data processing gives H(X \mid \hat{X}) \geq H(X \mid Y). Rearranging:

P(E) \geq \frac{H(X \mid Y) - \log 2}{\log M}

Why It Matters

Fano converts an information-theoretic quantity (mutual information or conditional entropy) into a concrete error probability bound. This is the bridge between information theory and statistical decision theory. The KL divergence plays a central role in bounding the mutual information term. Once you can bound the mutual information I(X; Y), you immediately get a lower bound on the probability of error for any estimator.

Failure Mode

Fano requires the hypotheses to be well-separated (so that an estimation error implies a testing error) and the mutual information to be bounded. If the hypotheses are too close, the bound becomes trivial. If M is too small (e.g., M = 2), Fano reduces to something weaker than Le Cam. The power of Fano comes from using many (M large) well-separated hypotheses.

The Fano Method for Minimax Lower Bounds

The Fano inequality becomes a lower bound tool for minimax estimation when combined with a clever construction of hypotheses. The recipe:

Proposition

The Fano Method for Minimax Rates

Statement

Let \theta_1, \ldots, \theta_M \in \Theta satisfy:

  1. Separation: \ell(\theta_j, \theta_k) \geq 2s for all j \neq k
  2. Closeness: \frac{1}{M} \sum_{j=1}^M D_{\text{KL}}(P_{\theta_j}^n \| \bar{P}^n) \leq \beta \log M for some \beta < 1, where \bar{P}^n = \frac{1}{M}\sum_j P_{\theta_j}^n

Then:

R_n^*(\Theta, \ell) \geq s \cdot (1 - \beta - \log 2 / \log M)

Intuition

Step 1 ensures that if an estimator confuses any two hypotheses, it incurs loss at least s. Step 2 ensures that the data does not carry enough information to reliably distinguish among the M hypotheses. Together, they force any estimator to have large risk on at least one hypothesis.

Proof Sketch

Let V be uniform on \{1, \ldots, M\} (the index of the true hypothesis). The estimator observes X^n \sim P_{\theta_V}^n and must guess V.

By the separation condition, the induced test \hat{V} = \arg\min_j \ell(\hat{\theta}, \theta_j) correctly identifies V whenever \ell(\hat{\theta}, \theta_V) < s. So:

\sup_j \mathbb{E}_{\theta_j}[\ell(\hat\theta, \theta_j)] \geq s \cdot P(\hat{V} \neq V)

The mutual information satisfies I(V; X^n) \leq \frac{1}{M}\sum_j D_{\text{KL}}(P_{\theta_j}^n \| \bar{P}^n) \leq \beta \log M.

Applying Fano: P(\hat{V} \neq V) \geq 1 - \frac{\beta \log M + \log 2}{\log M} = 1 - \beta - \frac{\log 2}{\log M}.

Why It Matters

This is the practical form of Fano that you use to prove lower bounds. The art is in steps 1 and 2: constructing M hypotheses that are both well-separated in the loss metric and close in KL divergence. Typically, you want M as large as possible (exponential in d for d-dimensional problems) while keeping the average KL divergence per hypothesis bounded.
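The proposition's conclusion is easy to tabulate. A small sketch (the helper `fano_method_bound` is hypothetical, not from any library) showing how the bound trades off \beta against M:

```python
import math

# Hypothetical helper evaluating the Fano-method conclusion
# R_n^* >= s * (1 - beta - log 2 / log M).
def fano_method_bound(s: float, beta: float, M: int) -> float:
    return s * (1.0 - beta - math.log(2) / math.log(M))

# Separation s = 0.1 with M = 1024 hypotheses: log 2 / log M = 1/10 exactly,
# so the bound is 0.1 * (1 - 0.5 - 0.1) = 0.04.
print(fano_method_bound(0.1, 0.5, 1024))

# With only M = 2 hypotheses the bound is vacuous (log 2 / log M = 1):
print(fano_method_bound(0.1, 0.5, 2))   # negative, i.e., no information
```

This makes concrete the remark below about small M: at M = 2 the \log 2 term swallows the entire bound.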

Failure Mode

The method requires the KL condition to hold for the average over hypotheses, not for every pair. This is weaker than requiring small pairwise KL divergence, but bounding the average KL against the mixture \bar{P}^n can still be technically challenging. In practice, the pairwise KL bound \max_{j,k} D_{\text{KL}}(P_{\theta_j} \| P_{\theta_k}) \leq \alpha gives I(V; X^n) \leq n\alpha, which is a looser but simpler condition.

Bounding the Mutual Information

The main technical step in the Fano method is bounding I(V; X^n). Common approaches:

Pairwise KL bound: If \max_{j \neq k} D_{\text{KL}}(P_{\theta_j} \| P_{\theta_k}) \leq \alpha, then I(V; X^n) \leq n\alpha. This is the simplest approach but can be loose.

Average KL bound: I(V; X^n) = \frac{1}{M}\sum_{j=1}^M D_{\text{KL}}(P_{\theta_j}^n \| \bar{P}^n) \leq \frac{n}{M^2}\sum_{j, k} D_{\text{KL}}(P_{\theta_j} \| P_{\theta_k}). This uses the convexity of KL divergence in its second argument, together with the tensorization D_{\text{KL}}(P^n \| Q^n) = n \, D_{\text{KL}}(P \| Q) for i.i.d. samples.

For Gaussian location models P_\theta = \mathcal{N}(\theta, \sigma^2 I_d):

D_{\text{KL}}(P_{\theta_j} \| P_{\theta_k}) = \frac{\|\theta_j - \theta_k\|^2}{2\sigma^2}

This makes the KL computation particularly clean.
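The closed form is easy to sanity-check by Monte Carlo, writing the log-likelihood ratio of two isotropic Gaussians explicitly. A sketch, with arbitrary parameter values:

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 3, 2.0
theta_j = np.array([1.0, 0.0, -1.0])
theta_k = np.array([0.0, 0.5, 0.0])

# Closed form: ||theta_j - theta_k||^2 / (2 sigma^2).
closed_form = np.sum((theta_j - theta_k) ** 2) / (2 * sigma ** 2)

# Monte Carlo: D_KL = E_{X ~ P_j}[ log p_j(X) - log p_k(X) ]; for isotropic
# Gaussians the log-ratio is (||X - theta_k||^2 - ||X - theta_j||^2) / (2 sigma^2).
X = theta_j + sigma * rng.standard_normal((200_000, d))
log_ratio = (np.sum((X - theta_k) ** 2, axis=1)
             - np.sum((X - theta_j) ** 2, axis=1)) / (2 * sigma ** 2)
mc_estimate = log_ratio.mean()

print(f"closed form = {closed_form:.4f}, Monte Carlo = {mc_estimate:.4f}")
```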

Connection to Mutual Information

Fano inequality can be restated purely in terms of mutual information:

P(\hat{X} \neq X) \geq 1 - \frac{I(X; Y) + \log 2}{\log M}

This reveals the fundamental tradeoff:

  • \log M measures the complexity of the problem (how many things to distinguish)
  • I(X; Y) measures the information the data provides
  • If information \ll complexity, estimation is impossible

This is why Fano is called an "information-theoretic" lower bound: it directly quantifies how much information the data carries about the parameter, and shows that insufficient information implies large error.

Canonical Examples

Example

Fano lower bound for Gaussian mean estimation

Let X_1, \ldots, X_n \sim \mathcal{N}(\theta, I_d) with \theta \in \mathbb{R}^d. We want a lower bound on \mathbb{E}[\|\hat{\theta} - \theta\|^2].

Construct hypotheses: let \theta_j = \delta e_j for j = 1, \ldots, d, where e_j is the j-th standard basis vector and \delta > 0 is chosen later.

Separation: \|\theta_j - \theta_k\|^2 = 2\delta^2 for j \neq k.

KL divergence: D_{\text{KL}}(P_{\theta_j} \| P_{\theta_k}) = \|\theta_j - \theta_k\|^2/2 = \delta^2.

Mutual information bound: I(V; X^n) \leq n\delta^2 (using the pairwise bound).

Apply Fano: P(\hat{V} \neq V) \geq 1 - (n\delta^2 + \log 2)/\log d.

For this to be bounded away from zero, we need n\delta^2 \lesssim \log d. Choose \delta^2 = c \log d / n. Then P(\hat{V} \neq V) \geq c' > 0, and:

R_n^* \geq c' \cdot \delta^2 = \frac{c'' \log d}{n}

This is not tight -- the true rate is d/n. To get the tight bound, use M = 2^d hypotheses on the hypercube (Assouad), or use a more refined packing with the Varshamov-Gilbert bound to get M = 2^{d/8} hypotheses with larger pairwise separation. The full argument gives R_n^* \geq c d/n.
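To see what the (loose) basis-vector construction yields numerically, here is a sketch that plugs concrete n and d into the bound above; the helper name `fano_error_bound` and the constant c = 1/4 are illustrative choices, not canonical ones.

```python
import math

# Illustrative evaluation of the basis-vector Fano bound for Gaussian mean
# estimation: P(V_hat != V) >= 1 - (n delta^2 + log 2) / log d.
def fano_error_bound(n: int, d: int, delta_sq: float) -> float:
    return 1.0 - (n * delta_sq + math.log(2)) / math.log(d)

n, d = 100, 1000
delta_sq = 0.25 * math.log(d) / n   # delta^2 = c log d / n with c = 1/4
p_err = fano_error_bound(n, d, delta_sq)
risk_bound = p_err * delta_sq       # lower bound on the risk, up to constants

print(f"P(error) >= {p_err:.3f}, risk lower bound ~ {risk_bound:.4f}")
print(f"compare to the rate log d / n = {math.log(d) / n:.4f}")
```

As the text notes, this recovers only the (log d)/n rate, not the true d/n rate; the hypercube constructions fix that.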

Common Confusions

Watch Out

Fano requires uniform prior, but the lower bound is frequentist

A subtle point: the Fano method places a uniform prior over the M hypotheses for the purpose of analysis. But the final minimax lower bound is frequentist: it says that for some fixed (non-random) \theta_j, the risk is large. The uniform prior is an analytical device, not an assumption about the data-generating process. The key step is: \sup_j \mathbb{E}_{\theta_j}[\ell] \geq \mathbb{E}_V[\mathbb{E}_{\theta_V}[\ell]], which converts the sup (frequentist) into an average (Bayesian).

Watch Out

More hypotheses is not always better

Increasing M makes \log M larger (stronger bound) but also makes it harder to satisfy the KL condition (you need more hypotheses that are mutually close). There is an optimal M that balances separation against closeness. For d-dimensional Gaussian problems, M \approx e^{cd} (exponential in dimension) is typically optimal.

Watch Out

The log 2 term is not important

The \log 2 in Fano's bound is an artifact of the proof technique (bounding the binary entropy). For large M, \log 2 / \log M is negligible. You can safely ignore it when computing rates. It only matters when M is very small (e.g., M = 2), where Fano gives a weaker result than Le Cam.

Summary

  • Fano inequality: P(\hat{X} \neq X) \geq (H(X \mid Y) - \log 2) / \log M
  • Equivalently: P(\text{error}) \geq 1 - (I(X; Y) + \log 2)/\log M
  • The Fano method: construct M separated hypotheses, bound mutual information, get a lower bound on minimax risk
  • Art of the method: pack as many hypotheses as possible while keeping distributions close (mutual information small)
  • For Gaussian problems: KL divergence is \|\theta_j - \theta_k\|^2/(2\sigma^2)
  • Fano is the standard tool for proving minimax rates in statistics

Exercises

ExerciseAdvanced

Problem

Use the Fano method to prove that for estimating the mean of a \mathcal{N}(\theta, \sigma^2) distribution (one-dimensional) from n i.i.d. observations, the minimax risk for squared error satisfies R_n^* \geq c\sigma^2/n.

ExerciseResearch

Problem

The Varshamov-Gilbert lemma states that there exist M \geq 2^{d/8} binary vectors \tau_1, \ldots, \tau_M \in \{0, 1\}^d such that \|\tau_j - \tau_k\|_1 \geq d/4 for all j \neq k. How is this used in the Fano method to prove lower bounds for high-dimensional problems?
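The lemma itself (though not its probabilistic proof, and not the exercise's answer) can be illustrated by a greedy search: scan the hypercube and keep every vector at Hamming distance at least d/4 from everything kept so far. A small sketch for d = 12, where 2^{d/8} \approx 2.8:

```python
import itertools

# Greedy packing of {0,1}^d at Hamming distance >= d/4; illustrative only,
# feasible for small d by exhaustive scan.
d = 12
min_dist = d // 4   # = 3

codebook = []
for v in itertools.product((0, 1), repeat=d):
    if all(sum(a != b for a, b in zip(v, w)) >= min_dist for w in codebook):
        codebook.append(v)

# When the greedy scan finishes, balls of radius min_dist - 1 around the kept
# vectors cover {0,1}^d, so the codebook has size >= 2^d / V(d, min_dist - 1),
# far more than the 2^(d/8) the lemma promises.
print(len(codebook))
```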


References

Canonical:

  • Fano, Transmission of Information (1961) -- the original
  • Cover & Thomas, Elements of Information Theory, Chapter 2 -- clean presentation of Fano inequality
  • Tsybakov, Introduction to Nonparametric Estimation (2009), Chapter 2 -- the Fano method for minimax bounds
  • Yu, "Assouad, Fano, and Le Cam" (1997)

Current:

  • Duchi, Information-Theoretic Methods in Statistics (Stanford lecture notes)
  • Wainwright, High-Dimensional Statistics (2019), Chapter 15

Next Topics

Fano inequality is a terminal topic in the lower bounds sequence. From here, apply the Fano method to specific estimation problems as they arise in other topics (nonparametric regression, density estimation, high-dimensional statistics).

Last reviewed: April 2026
