Statistical Foundations
Fano Inequality
The Fano inequality is the standard tool for information-theoretic lower bounds: if V -> Y -> V_hat is a Markov chain, then the error probability P(V_hat != V) is bounded below in terms of the conditional entropy H(V | Y) and the alphabet size M.
Why This Matters
Fano inequality is the standard tool for proving information-theoretic lower bounds in statistics and machine learning. Whenever you see a minimax lower bound proved via "reduction to multiple hypothesis testing," it almost certainly uses Fano. It converts a bound on mutual information into a bound on error probability, giving you a direct way to show that estimation is hard.
Le Cam's method uses two hypotheses. Fano uses M hypotheses, which makes it far more powerful for problems with rich parameter spaces. Most minimax lower bounds in high-dimensional statistics, nonparametric regression, and learning theory are proved via the Fano method.
Mental Model
You observe data Y and want to recover a discrete random variable V (which takes M values). You produce an estimate V_hat = V_hat(Y). Fano says: if the data does not carry much information about V (low mutual information I(V; Y)), then you must make errors frequently.
The bound is sharp in an information-theoretic sense: the mutual information I(V; Y) measures how much the data tells you about V, and log M measures how hard the problem is (how many things you need to distinguish). When I(V; Y) << log M, you cannot reliably recover V.
The Fano Inequality
Fano Inequality
Statement
Let V be uniformly distributed on {1, ..., M}, and let V -> Y -> V_hat be a Markov chain (meaning V_hat is a function of Y only). Then:

P(V_hat != V) >= 1 - (I(V; Y) + log 2) / log M

Equivalently, using I(V; Y) = H(V) - H(V | Y) = log M - H(V | Y):

P(V_hat != V) >= (H(V | Y) - log 2) / log M
Intuition
If you need to distinguish among M possibilities (log M bits of information) but the data only provides I(V; Y) bits, you cannot possibly succeed. The error probability must be at least 1 - (I(V; Y) + log 2) / log M. When I(V; Y) is small relative to log M, the error probability is close to 1 (nearly as bad as random guessing).
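As a quick numeric sanity check, here is a small helper (the function name is mine, purely for illustration) that evaluates the bound 1 - (I + log 2) / log M, clipped at zero since the bound can go vacuous:

```python
import math

def fano_lower_bound(mutual_info, M):
    """Fano lower bound on P(V_hat != V): max(0, 1 - (I + log 2) / log M)."""
    return max(0.0, 1.0 - (mutual_info + math.log(2)) / math.log(M))

# With M = 1024 hypotheses (log M ~ 6.93 nats) and little information,
# the error probability is forced close to 1.
print(fano_lower_bound(0.5, 1024))   # ~ 0.83
print(fano_lower_bound(5.0, 1024))   # ~ 0.18
print(fano_lower_bound(10.0, 1024))  # 0.0: bound is vacuous once I > log M
```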
Proof Sketch
Define the error indicator E = 1{V_hat != V} and write P_e = P(V_hat != V). Expand H(E, V | V_hat) two ways by the chain rule of entropy:

H(E, V | V_hat) = H(V | V_hat) + H(E | V, V_hat) = H(E | V_hat) + H(V | E, V_hat)

Since E is a function of (V, V_hat), the term H(E | V, V_hat) = 0. Now bound each remaining term:
- H(E | V_hat) <= H(E) <= log 2 (binary entropy)
- H(V | V_hat, E = 0) = 0 (if no error, V = V_hat)
- H(V | V_hat, E = 1) <= log(M - 1) (if error, V can be any of the M - 1 other values)
Combining: H(V | V_hat) <= log 2 + P_e log(M - 1) <= log 2 + P_e log M.
Since V_hat is a function of Y, data processing gives H(V | Y) <= H(V | V_hat). Rearranging:

P_e >= (H(V | Y) - log 2) / log M = 1 - (I(V; Y) + log 2) / log M
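The central inequality in this sketch, H(V | V_hat) <= log 2 + P_e log(M - 1), holds for any joint distribution of (V, V_hat). A minimal sketch checking it numerically on a random joint distribution:

```python
import math
import random

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0)

random.seed(0)
M = 5
# Random joint distribution over (V, V_hat) on {0,...,M-1}^2.
joint = [[random.random() for _ in range(M)] for _ in range(M)]
total = sum(sum(row) for row in joint)
joint = [[x / total for x in row] for row in joint]

# H(V | V_hat) = H(V, V_hat) - H(V_hat)
h_joint = entropy([x for row in joint for x in row])
h_vhat = entropy([sum(joint[v][vh] for v in range(M)) for vh in range(M)])
h_cond = h_joint - h_vhat

# Error probability and Fano's upper bound on H(V | V_hat).
p_err = 1.0 - sum(joint[v][v] for v in range(M))
fano_rhs = entropy([p_err, 1.0 - p_err]) + p_err * math.log(M - 1)

assert h_cond <= fano_rhs  # the inequality from the proof sketch
print(h_cond, "<=", fano_rhs)
```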
Why It Matters
Fano converts an information-theoretic quantity (mutual information or conditional entropy) into a concrete error probability bound. This is the bridge between information theory and statistical decision theory. The KL divergence plays a central role in bounding the mutual information term. Once you can bound the mutual information I(V; Y), you immediately get a lower bound on the probability of error for any estimator.
Failure Mode
Fano requires the hypotheses to be well-separated (so that an estimation error implies a testing error) and the mutual information to be bounded. If the hypotheses are too close, the bound becomes trivial. If M is too small (e.g., M = 2), Fano reduces to something weaker than Le Cam. The power of Fano comes from using many (M large) well-separated hypotheses.
The Fano Method for Minimax Lower Bounds
The Fano inequality becomes a lower bound tool for minimax estimation when combined with a clever construction of hypotheses. The recipe: construct well-separated hypotheses, bound the mutual information, and apply Fano.
The Fano Method for Minimax Rates
Statement
Let theta_1, ..., theta_M in Theta satisfy:
- Separation: d(theta_j, theta_k) >= 2 * delta for all j != k
- Closeness: (1/M) * sum_j KL(P_{theta_j} || P_bar) <= alpha * log M for some alpha < 1, where P_bar = (1/M) * sum_k P_{theta_k}
Then:

inf_{theta_hat} sup_{theta} P_theta(d(theta_hat, theta) >= delta) >= 1 - alpha - log 2 / log M
Intuition
Step 1 ensures that if an estimator confuses any two hypotheses, it incurs loss at least delta. Step 2 ensures that the data does not carry enough information to reliably distinguish among the M hypotheses. Together, they force any estimator to have large risk on at least one hypothesis.
Proof Sketch
Let V be uniform on {1, ..., M} (the index of the true hypothesis). The estimator observes Y ~ P_{theta_V} and must guess V.
By the separation condition, any estimator theta_hat with d(theta_hat, theta_V) < delta must correctly identify V (take V_hat = argmin_j d(theta_hat, theta_j)). So:

P(d(theta_hat, theta_V) >= delta) >= P(V_hat != V)

The mutual information satisfies I(V; Y) <= (1/M) * sum_j KL(P_{theta_j} || P_bar) <= alpha * log M.
Applying Fano: P(V_hat != V) >= 1 - (alpha * log M + log 2) / log M = 1 - alpha - log 2 / log M.
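The conclusion of the method can be packaged as a small helper (a sketch; the function name and the final Markov-inequality step from error probability to squared-error risk are my additions, not part of the statement above):

```python
import math

def fano_method_bound(delta, M, mi_bound):
    """Lower bounds from the Fano method, given:
    delta: half the pairwise separation, d(theta_j, theta_k) >= 2*delta,
    M: number of hypotheses,
    mi_bound: an upper bound on I(V; Y).
    Returns (testing-error lower bound, squared-loss risk lower bound)."""
    p_err = max(0.0, 1.0 - (mi_bound + math.log(2)) / math.log(M))
    # Markov: E[d(theta_hat, theta)^2] >= delta^2 * P(d >= delta)
    return p_err, delta ** 2 * p_err

# Example: M = 256 hypotheses, separation 2*delta = 0.2, I(V; Y) <= 1.0
p_err, risk = fano_method_bound(delta=0.1, M=256, mi_bound=1.0)
print(p_err, risk)
```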
Why It Matters
This is the practical form of Fano that you use to prove lower bounds. The art is in step 1 and step 2: constructing hypotheses that are both well-separated in the loss metric and close in KL divergence. Typically, you want M as large as possible (exponential in d for d-dimensional problems) while keeping the average KL divergence per hypothesis bounded.
Failure Mode
The method requires the KL condition to hold for the average over hypotheses, not for every pair. This is weaker than requiring small pairwise KL divergence, but bounding the average KL against the mixture P_bar can still be technically challenging. In practice, the pairwise KL bound gives I(V; Y) <= max_{j,k} KL(P_{theta_j} || P_{theta_k}), which is a looser but simpler condition.
Bounding the Mutual Information
The main technical step in the Fano method is bounding I(V; Y). Common approaches:
Pairwise KL bound: If KL(P_{theta_j} || P_{theta_k}) <= beta for all j != k, then I(V; Y) <= beta. This is the simplest approach but can be loose.
Average KL bound: I(V; Y) = (1/M) * sum_j KL(P_{theta_j} || P_bar) <= (1/M^2) * sum_{j,k} KL(P_{theta_j} || P_{theta_k}). This uses the convexity of KL divergence.
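For V uniform, I(V; Y) equals the average KL divergence to the mixture exactly, and convexity then gives the average pairwise bound. A quick check on a toy discrete family (the three distributions are arbitrary, chosen just for illustration):

```python
import math

def kl(p, q):
    """KL divergence between discrete distributions (nats)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Three distributions on a 3-point alphabet; V uniform on {0, 1, 2}.
P = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.6, 0.1, 0.3]]
M = len(P)
mix = [sum(Pj[y] for Pj in P) / M for y in range(3)]  # the mixture P_bar

mi = sum(kl(Pj, mix) for Pj in P) / M                      # I(V; Y) exactly
avg_pair = sum(kl(Pj, Pk) for Pj in P for Pk in P) / M**2  # average pairwise KL

assert mi <= avg_pair + 1e-12  # convexity bound
print(mi, "<=", avg_pair)
```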
For Gaussian location models P_theta = N(theta, sigma^2 * I_d):

KL(P_{theta_j} || P_{theta_k}) = ||theta_j - theta_k||^2 / (2 * sigma^2)

This makes the KL computation particularly clean.
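A sketch checking this closed form against a Monte Carlo estimate in one dimension (the function name is mine; the log-ratio of the two Gaussian densities simplifies to the quadratic below):

```python
import math
import random

def kl_gaussian_location(mu1, mu2, sigma):
    """KL(N(mu1, sigma^2) || N(mu2, sigma^2)) in closed form."""
    return (mu1 - mu2) ** 2 / (2 * sigma ** 2)

# Monte Carlo check: KL = E_{x ~ N(mu1, sigma^2)}[log p1(x) - log p2(x)],
# and the log-ratio equals ((x - mu2)^2 - (x - mu1)^2) / (2 sigma^2).
random.seed(0)
mu1, mu2, sigma = 0.0, 1.5, 2.0
n = 200_000
est = sum(
    ((x - mu2) ** 2 - (x - mu1) ** 2) / (2 * sigma ** 2)
    for x in (random.gauss(mu1, sigma) for _ in range(n))
) / n
print(kl_gaussian_location(mu1, mu2, sigma), est)  # both ~ 0.28
```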
Connection to Mutual Information
Fano inequality can be restated purely in terms of mutual information:

P(V_hat != V) >= 1 - (I(V; Y) + log 2) / log M
This reveals the fundamental tradeoff:
- log M measures the complexity of the problem (how many things to distinguish)
- I(V; Y) measures the information the data provides
- If I(V; Y) << log M (information << complexity), estimation is impossible
This is why Fano is called an "information-theoretic" lower bound: it directly quantifies how much information the data carries about the parameter, and shows that insufficient information implies large error.
Canonical Examples
Fano lower bound for Gaussian mean estimation
Let Y_1, ..., Y_n be i.i.d. N(theta, sigma^2 * I_d) with theta in R^d. We want a lower bound on inf_{theta_hat} sup_theta E||theta_hat - theta||^2.
Construct hypotheses: let theta_j = delta * e_j for j = 1, ..., d, where e_j is the j-th standard basis vector and delta > 0 is chosen later.
Separation: ||theta_j - theta_k|| = sqrt(2) * delta for j != k.
KL divergence: KL(P_{theta_j}^n || P_{theta_k}^n) = n * ||theta_j - theta_k||^2 / (2 * sigma^2) = n * delta^2 / sigma^2.
Mutual information bound: I(V; Y) <= n * delta^2 / sigma^2 (using the pairwise bound).
Apply Fano: P(V_hat != V) >= 1 - (n * delta^2 / sigma^2 + log 2) / log d.
For this to be bounded away from zero, we need n * delta^2 / sigma^2 <= c * log d. Choose delta^2 = sigma^2 * log d / (4n). Then the error probability is at least a constant for d large enough, and:

inf_{theta_hat} sup_theta E||theta_hat - theta||^2 >= c * sigma^2 * log d / n

This is not tight -- the true rate is sigma^2 * d / n. To get the tight bound, use hypotheses on the hypercube {0, 1}^d (Assouad), or use a more refined packing with the Varshamov-Gilbert bound to get M = 2^{c*d} hypotheses with larger pairwise separation. The full argument gives inf sup E||theta_hat - theta||^2 >= c * sigma^2 * d / n.
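The arithmetic of the basis-vector construction can be scripted (a crude sketch: the constant 1/4 in the choice of delta^2, and the factor delta^2 / 2 converting the sqrt(2)*delta separation into squared loss, are unoptimized choices of mine):

```python
import math

def fano_bound_sparse_construction(n, d, sigma):
    """Risk lower bound from the d basis-vector hypotheses theta_j = delta*e_j
    (assumes d >= 3 so that log d > log 2 and the bound is nontrivial)."""
    delta2 = sigma ** 2 * math.log(d) / (4 * n)  # choice of delta^2
    mi_bound = n * delta2 / sigma ** 2           # pairwise KL bound on I(V; Y)
    p_err = 1 - (mi_bound + math.log(2)) / math.log(d)
    # separation ||theta_j - theta_k||^2 = 2*delta^2, so loss >= delta^2/2 on error
    return (delta2 / 2) * p_err

# e.g. n = 100 observations, dimension d = 1000, unit noise:
print(fano_bound_sparse_construction(n=100, d=1000, sigma=1.0))
```

The result scales as sigma^2 * log d / n, visibly short of the true sigma^2 * d / n rate.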
Common Confusions
Fano requires uniform prior, but the lower bound is frequentist
A subtle point: the Fano method places a uniform prior over the hypotheses for the purpose of analysis. But the final minimax lower bound is frequentist: it says that for some fixed (non-random) theta_j, the risk is large. The uniform prior is an analytical device, not an assumption about the data-generating process. The key step is: sup_j E_j[loss] >= (1/M) * sum_j E_j[loss], which converts the sup (frequentist) into an average (Bayesian).
More hypotheses is not always better
Increasing M makes log M larger (stronger bound) but also makes it harder to satisfy the KL condition (you need more hypotheses that are mutually close). There is an optimal M that balances separation against closeness. For d-dimensional Gaussian problems, M = 2^{c*d} (exponential in dimension) is typically optimal.
The log 2 term is not important
The log 2 in Fano's bound is an artifact of the proof technique (bounding the binary entropy). For large M, log 2 / log M is negligible. You can safely ignore it when computing rates. It only matters when M is very small (e.g., M = 2), where Fano gives a weaker result than Le Cam.
Summary
- Fano inequality: P(V_hat != V) >= 1 - (I(V; Y) + log 2) / log M
- Equivalently: P(V_hat != V) >= (H(V | Y) - log 2) / log M
- The Fano method: construct M separated hypotheses, bound mutual information, get a lower bound on minimax risk
- Art of the method: pack as many hypotheses as possible while keeping distributions close (mutual information small)
- For Gaussian problems: KL divergence is ||theta_j - theta_k||^2 / (2 * sigma^2)
- Fano is the standard tool for proving minimax rates in statistics
Exercises
Problem
Use the Fano method to prove that for estimating the mean of a N(theta, sigma^2) distribution (one-dimensional) from n i.i.d. observations, the minimax risk for squared error satisfies inf sup E(theta_hat - theta)^2 >= c * sigma^2 / n.
Problem
The Varshamov-Gilbert lemma states that for d >= 8 there exist M >= 2^{d/8} binary vectors w_1, ..., w_M in {0, 1}^d such that the Hamming distance satisfies rho(w_j, w_k) >= d/8 for all j != k. How is this used in the Fano method to prove lower bounds for high-dimensional problems?
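As an illustration of the packing flavor of the lemma (not a proof, and with an arbitrary d/4 distance threshold rather than the lemma's d/8), a randomized greedy packing of the Hamming cube for small d:

```python
import random

# Greedily collect vectors in {0,1}^d whose Hamming distance to every
# previously kept vector is at least d/4. Varshamov-Gilbert guarantees
# exponentially many such vectors exist; this just exhibits a packing.
random.seed(0)
d = 16
code = []
for _ in range(2000):
    w = tuple(random.randint(0, 1) for _ in range(d))
    if all(sum(a != b for a, b in zip(w, v)) >= d // 4 for v in code):
        code.append(w)
print(len(code))  # a sizable pairwise-separated family
```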
References
Canonical:
- Fano, Transmission of Information (1961) -- the original
- Cover & Thomas, Elements of Information Theory, Chapter 2 -- clean presentation of Fano inequality
- Tsybakov, Introduction to Nonparametric Estimation (2009), Chapter 2 -- the Fano method for minimax bounds
- Yu, "Assouad, Fano, and Le Cam" (1997)
Current:
- Duchi, Information-Theoretic Methods in Statistics (Stanford lecture notes)
- Wainwright, High-Dimensional Statistics (2019), Chapter 15
Next Topics
Fano inequality is a terminal topic in the lower bounds sequence. From here, apply the Fano method to specific estimation problems as they arise in other topics (nonparametric regression, density estimation, high-dimensional statistics).
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Minimax Lower Bounds (Layer 3)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Maximum Likelihood Estimation (Layer 0B)
- Differentiation in R^n (Layer 0A)