
Statistical Foundations

Fano Inequality

The Fano inequality is the standard tool for information-theoretic lower bounds: if X \to Y \to \hat{X} is a Markov chain, the error probability is bounded below in terms of the conditional entropy and the alphabet size.


Why This Matters

Fano inequality is the standard tool for proving information-theoretic lower bounds in statistics and machine learning. Whenever you see a minimax lower bound proved via "reduction to multiple hypothesis testing," it almost certainly uses Fano. It converts a bound on mutual information into a bound on error probability, giving you a direct way to show that estimation is hard.

Le Cam uses two hypotheses. Fano uses M hypotheses, which makes it far more powerful for problems with rich parameter spaces. Most minimax lower bounds in high-dimensional statistics, nonparametric regression, and learning theory are proved via the Fano method.

Mental Model

You observe data Y and want to recover a discrete random variable X (which takes M values). You produce an estimate \hat{X}(Y). Fano says: if the data Y does not carry much information about X (low mutual information), then you must make errors frequently.

The bound is sharp in an information-theoretic sense: the mutual information I(X; Y) measures how much the data tells you about X, and \log M measures how hard the problem is (how many things you need to distinguish). When I(X; Y) \ll \log M, you cannot reliably recover X.

The Fano Inequality

Theorem

Fano Inequality

Statement

Let X be uniformly distributed on \{1, \ldots, M\}, and let X \to Y \to \hat{X} be a Markov chain (meaning \hat{X} is a function of Y only). Then:

P(\hat{X} \neq X) \geq \frac{H(X \mid Y) - \log 2}{\log M}

Equivalently, using H(X \mid Y) = H(X) - I(X; Y) = \log M - I(X; Y):

P(\hat{X} \neq X) \geq 1 - \frac{I(X; Y) + \log 2}{\log M}

Intuition

If you need to distinguish among M possibilities (\log M nats of information) but the data only provides I(X; Y) nats, you cannot reliably succeed. The error probability must be at least 1 - (I(X; Y) + \log 2)/\log M. When I(X; Y) is small relative to \log M, the error probability is close to 1 - 1/M (random guessing).
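The inequality can be checked numerically. Here is a minimal sketch (all names are illustrative, not from any library) for a small symmetric channel: it computes H(X \mid Y) exactly and compares the optimal decoder's error probability against the Fano bound.

```python
import numpy as np

# Minimal numerical check of Fano's inequality (natural-log version) on a
# small symmetric channel. All names here are illustrative.
M = 8       # alphabet size; X is uniform on {0, ..., M-1}
eps = 0.4   # probability the channel corrupts X to a uniformly chosen wrong symbol

# Channel matrix P(Y = j | X = i).
P_y_given_x = np.full((M, M), eps / (M - 1))
np.fill_diagonal(P_y_given_x, 1 - eps)

P_xy = P_y_given_x / M        # joint distribution, since X is uniform
P_y = P_xy.sum(axis=0)        # marginal of Y (uniform by symmetry)

# I(X; Y) = sum_{x,y} P(x,y) log( P(y|x) / P(y) ), in nats.
I_xy = np.sum(P_xy * np.log(P_y_given_x / P_y))
H_x_given_y = np.log(M) - I_xy

# The MAP decoder here is X_hat = Y (since 1 - eps > eps/(M-1)),
# so the optimal error probability is exactly eps.
p_error = eps

fano_bound = (H_x_given_y - np.log(2)) / np.log(M)
print(f"P(error) = {p_error:.3f} >= Fano bound = {fano_bound:.3f}")
assert p_error >= fano_bound
```

For this channel the bound is reasonably tight; as eps grows toward 1 - 1/M the channel becomes useless and the bound approaches the random-guessing error.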

Proof Sketch

Define the error indicator E = \mathbf{1}[\hat{X} \neq X]. Since E is determined by (X, \hat{X}), the chain rule of entropy gives:

H(E, X \mid \hat{X}) = H(X \mid \hat{X}) = H(E \mid \hat{X}) + H(X \mid E, \hat{X})

Now bound each term:

  • H(E \mid \hat{X}) \leq H(E) \leq \log 2 (binary entropy)
  • H(X \mid E = 0, \hat{X}) = 0 (if no error, X = \hat{X})
  • H(X \mid E = 1, \hat{X}) \leq \log(M - 1) \leq \log M (if error, X can be any of the other M - 1 values)

Combining: H(X \mid \hat{X}) \leq \log 2 + P(E) \log M.

Since \hat{X} is a function of Y, data processing gives H(X \mid \hat{X}) \geq H(X \mid Y). Rearranging:

P(E) \geq \frac{H(X \mid Y) - \log 2}{\log M}

Why It Matters

Fano converts an information-theoretic quantity (mutual information or conditional entropy) into a concrete error probability bound. This is the bridge between information theory and statistical decision theory. The KL divergence plays a central role in bounding the mutual information term. Once you can bound the mutual information I(X; Y), you immediately get a lower bound on the probability of error for any estimator.

Failure Mode

Fano requires the hypotheses to be well-separated (so that an estimation error implies a testing error) and the mutual information to be bounded. If the hypotheses are too close, the bound becomes trivial. If M is too small (e.g., M = 2), Fano reduces to something weaker than Le Cam. The power of Fano comes from using many (M large) well-separated hypotheses.

The Fano Method for Minimax Lower Bounds

The Fano inequality becomes a lower bound tool for minimax estimation when combined with a clever construction of hypotheses. The recipe:

Proposition

The Fano Method for Minimax Rates

Statement

Let \theta_1, \ldots, \theta_M \in \Theta satisfy:

  1. Separation: \ell(\theta_j, \theta_k) \geq 2s for all j \neq k
  2. Closeness: \frac{1}{M} \sum_{j=1}^M D_{\text{KL}}(P_{\theta_j}^n \| \bar{P}^n) \leq \beta \log M for some \beta < 1, where \bar{P}^n = \frac{1}{M}\sum_j P_{\theta_j}^n

Then:

R_n^*(\Theta, \ell) \geq s \cdot (1 - \beta - \log 2 / \log M)

Intuition

Step 1 ensures that if an estimator confuses any two hypotheses, it incurs loss at least s. Step 2 ensures that the data does not carry enough information to reliably distinguish among the M hypotheses. Together, they force any estimator to have large risk on at least one hypothesis.

Proof Sketch

Let V be uniform on \{1, \ldots, M\} (the index of the true hypothesis). The estimator observes X^n \sim P_{\theta_V}^n and must guess V.

By the separation condition, the induced test \hat{V} = \arg\min_j \ell(\hat{\theta}, \theta_j) correctly identifies V whenever \ell(\hat{\theta}, \theta_V) < s. So:

\sup_j \mathbb{E}_{\theta_j}[\ell(\hat\theta, \theta_j)] \geq s \cdot P(\hat{V} \neq V)

The mutual information satisfies I(V; X^n) \leq \frac{1}{M}\sum_j D_{\text{KL}}(P_{\theta_j}^n \| \bar{P}^n) \leq \beta \log M.

Applying Fano: P(\hat{V} \neq V) \geq 1 - \frac{\beta \log M + \log 2}{\log M} = 1 - \beta - \frac{\log 2}{\log M}.

Why It Matters

This is the practical form of Fano that you use to prove lower bounds. The art is in steps 1 and 2: constructing M hypotheses that are both well-separated in the loss metric and close in KL divergence. Typically, you want M as large as possible (exponential in d for d-dimensional problems) while keeping the average KL divergence per hypothesis bounded.
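The proposition's conclusion is easy to tabulate. A small sketch (the helper `fano_method_bound` is hypothetical, not from any library) showing how the bound trades off \beta against M:

```python
import math

# Hypothetical helper evaluating the Fano-method conclusion
# R_n^* >= s * (1 - beta - log 2 / log M).
def fano_method_bound(s: float, beta: float, M: int) -> float:
    return s * (1.0 - beta - math.log(2) / math.log(M))

# Separation s = 0.1 with M = 1024 hypotheses: log 2 / log M = 1/10 exactly,
# so the bound is 0.1 * (1 - 0.5 - 0.1) = 0.04.
print(fano_method_bound(0.1, 0.5, 1024))

# With only M = 2 hypotheses the bound is vacuous (log 2 / log M = 1):
print(fano_method_bound(0.1, 0.5, 2))   # negative, i.e., no information
```

This makes concrete the remark below about small M: at M = 2 the \log 2 term swallows the entire bound.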

Failure Mode

The method requires the KL condition to hold for the average over hypotheses, not for every pair. This is weaker than requiring small pairwise KL divergence, but bounding the average KL against the mixture \bar{P}^n can still be technically challenging. In practice, the pairwise KL bound \max_{j,k} D_{\text{KL}}(P_{\theta_j} \| P_{\theta_k}) \leq \alpha gives I(V; X^n) \leq n\alpha, which is a looser but simpler condition.

Bounding the Mutual Information

The main technical step in the Fano method is bounding I(V; X^n). Common approaches:

Pairwise KL bound: If \max_{j \neq k} D_{\text{KL}}(P_{\theta_j} \| P_{\theta_k}) \leq \alpha, then I(V; X^n) \leq n\alpha. This is the simplest approach but can be loose.

Average KL bound: I(V; X^n) = \frac{1}{M}\sum_{j=1}^M D_{\text{KL}}(P_{\theta_j}^n \| \bar{P}^n) \leq \frac{n}{M^2}\sum_{j, k} D_{\text{KL}}(P_{\theta_j} \| P_{\theta_k}). This uses the convexity of KL divergence in its second argument, together with the tensorization D_{\text{KL}}(P^n \| Q^n) = n \, D_{\text{KL}}(P \| Q) for i.i.d. samples.

For Gaussian location models P_\theta = \mathcal{N}(\theta, \sigma^2 I_d):

D_{\text{KL}}(P_{\theta_j} \| P_{\theta_k}) = \frac{\|\theta_j - \theta_k\|^2}{2\sigma^2}

This makes the KL computation particularly clean.
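The closed form is easy to sanity-check by Monte Carlo, writing the log-likelihood ratio of two isotropic Gaussians explicitly. A sketch, with arbitrary parameter values:

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 3, 2.0
theta_j = np.array([1.0, 0.0, -1.0])
theta_k = np.array([0.0, 0.5, 0.0])

# Closed form: ||theta_j - theta_k||^2 / (2 sigma^2).
closed_form = np.sum((theta_j - theta_k) ** 2) / (2 * sigma ** 2)

# Monte Carlo: D_KL = E_{X ~ P_j}[ log p_j(X) - log p_k(X) ]; for isotropic
# Gaussians the log-ratio is (||X - theta_k||^2 - ||X - theta_j||^2) / (2 sigma^2).
X = theta_j + sigma * rng.standard_normal((200_000, d))
log_ratio = (np.sum((X - theta_k) ** 2, axis=1)
             - np.sum((X - theta_j) ** 2, axis=1)) / (2 * sigma ** 2)
mc_estimate = log_ratio.mean()

print(f"closed form = {closed_form:.4f}, Monte Carlo = {mc_estimate:.4f}")
```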

Connection to Mutual Information

Fano inequality can be restated purely in terms of mutual information:

P(\hat{X} \neq X) \geq 1 - \frac{I(X; Y) + \log 2}{\log M}

This reveals the fundamental tradeoff:

  • \log M measures the complexity of the problem (how many things to distinguish)
  • I(X; Y) measures the information the data provides
  • If information \ll complexity, estimation is impossible

This is why Fano is called an "information-theoretic" lower bound: it directly quantifies how much information the data carries about the parameter, and shows that insufficient information implies large error.

Canonical Examples

Example

Fano lower bound for Gaussian mean estimation

Let X_1, \ldots, X_n \sim \mathcal{N}(\theta, I_d) with \theta \in \mathbb{R}^d. We want a lower bound on \mathbb{E}[\|\hat{\theta} - \theta\|^2].

Construct hypotheses: let \theta_j = \delta e_j for j = 1, \ldots, d, where e_j is the j-th standard basis vector and \delta > 0 is chosen later.

Separation: \|\theta_j - \theta_k\|^2 = 2\delta^2 for j \neq k.

KL divergence: D_{\text{KL}}(P_{\theta_j} \| P_{\theta_k}) = \|\theta_j - \theta_k\|^2/2 = \delta^2.

Mutual information bound: I(V; X^n) \leq n\delta^2 (using the pairwise bound).

Apply Fano: P(\hat{V} \neq V) \geq 1 - (n\delta^2 + \log 2)/\log d.

For this to be bounded away from zero, we need n\delta^2 \lesssim \log d. Choose \delta^2 = c \log d / n. Then P(\hat{V} \neq V) \geq c' > 0, and:

R_n^* \geq c' \cdot \delta^2 = \frac{c'' \log d}{n}

This is not tight -- the true rate is d/n. To get the tight bound, use M = 2^d hypotheses on the hypercube (Assouad), or use a more refined packing with the Varshamov-Gilbert bound to get M = 2^{d/8} hypotheses with larger pairwise separation. The full argument gives R_n^* \geq c d/n.
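To see what the (loose) basis-vector construction yields numerically, here is a sketch that plugs concrete n and d into the bound above; the helper name `fano_error_bound` and the constant c = 1/4 are illustrative choices, not canonical ones.

```python
import math

# Illustrative evaluation of the basis-vector Fano bound for Gaussian mean
# estimation: P(V_hat != V) >= 1 - (n delta^2 + log 2) / log d.
def fano_error_bound(n: int, d: int, delta_sq: float) -> float:
    return 1.0 - (n * delta_sq + math.log(2)) / math.log(d)

n, d = 100, 1000
delta_sq = 0.25 * math.log(d) / n   # delta^2 = c log d / n with c = 1/4
p_err = fano_error_bound(n, d, delta_sq)
risk_bound = p_err * delta_sq       # lower bound on the risk, up to constants

print(f"P(error) >= {p_err:.3f}, risk lower bound ~ {risk_bound:.4f}")
print(f"compare to the rate log d / n = {math.log(d) / n:.4f}")
```

As the text notes, this recovers only the (log d)/n rate, not the true d/n rate; the hypercube constructions fix that.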

Common Confusions

Watch Out

Fano requires uniform prior, but the lower bound is frequentist

A subtle point: the Fano method places a uniform prior over the M hypotheses for the purpose of analysis. But the final minimax lower bound is frequentist: it says that for some fixed (non-random) \theta_j, the risk is large. The uniform prior is an analytical device, not an assumption about the data-generating process. The key step is: \sup_j \mathbb{E}_{\theta_j}[\ell] \geq \mathbb{E}_V[\mathbb{E}_{\theta_V}[\ell]], which converts the sup (frequentist) into an average (Bayesian).

Watch Out

More hypotheses is not always better

Increasing M makes \log M larger (stronger bound) but also makes it harder to satisfy the KL condition (you need more hypotheses that are mutually close). There is an optimal M that balances separation against closeness. For d-dimensional Gaussian problems, M \approx e^{cd} (exponential in dimension) is typically optimal.

Watch Out

The log 2 term is not important

The \log 2 in Fano's bound is an artifact of the proof technique (bounding the binary entropy). For large M, \log 2 / \log M is negligible. You can safely ignore it when computing rates. It only matters when M is very small (e.g., M = 2), where Fano gives a weaker result than Le Cam.

Summary

  • Fano inequality: P(\hat{X} \neq X) \geq (H(X \mid Y) - \log 2) / \log M
  • Equivalently: P(\text{error}) \geq 1 - (I(X; Y) + \log 2)/\log M
  • The Fano method: construct M separated hypotheses, bound mutual information, get a lower bound on minimax risk
  • Art of the method: pack as many hypotheses as possible while keeping distributions close (mutual information small)
  • For Gaussian problems: KL divergence is \|\theta_j - \theta_k\|^2/(2\sigma^2)
  • Fano is the standard tool for proving minimax rates in statistics

Exercises

ExerciseAdvanced

Problem

Use the Fano method to prove that for estimating the mean of a \mathcal{N}(\theta, \sigma^2) distribution (one-dimensional) from n i.i.d. observations, the minimax risk for squared error satisfies R_n^* \geq c\sigma^2/n.

ExerciseResearch

Problem

The Varshamov-Gilbert lemma states that there exist M \geq 2^{d/8} binary vectors \tau_1, \ldots, \tau_M \in \{0, 1\}^d such that \|\tau_j - \tau_k\|_1 \geq d/4 for all j \neq k. How is this used in the Fano method to prove lower bounds for high-dimensional problems?
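The lemma itself (though not its probabilistic proof, and not the exercise's answer) can be illustrated by a greedy search: scan the hypercube and keep every vector at Hamming distance at least d/4 from everything kept so far. A small sketch for d = 12, where 2^{d/8} \approx 2.8:

```python
import itertools

# Greedy packing of {0,1}^d at Hamming distance >= d/4; illustrative only,
# feasible for small d by exhaustive scan.
d = 12
min_dist = d // 4   # = 3

codebook = []
for v in itertools.product((0, 1), repeat=d):
    if all(sum(a != b for a, b in zip(v, w)) >= min_dist for w in codebook):
        codebook.append(v)

# When the greedy scan finishes, balls of radius min_dist - 1 around the kept
# vectors cover {0,1}^d, so the codebook has size >= 2^d / V(d, min_dist - 1),
# far more than the 2^(d/8) the lemma promises.
print(len(codebook))
```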


References

Canonical:

  • Fano, Transmission of Information (1961) -- the original
  • Cover & Thomas, Elements of Information Theory, Chapter 2 -- clean presentation of Fano inequality
  • Tsybakov, Introduction to Nonparametric Estimation (2009), Chapter 2 -- the Fano method for minimax bounds
  • Yu, "Assouad, Fano, and Le Cam" (1997)

Current:

  • Duchi, Information-Theoretic Methods in Statistics (Stanford lecture notes)
  • Wainwright, High-Dimensional Statistics (2019), Chapter 15

Next Topics

Fano inequality is a terminal topic in the lower bounds sequence. From here, apply the Fano method to specific estimation problems as they arise in other topics (nonparametric regression, density estimation, high-dimensional statistics).

Last reviewed: April 2026
