
Statistical Foundations

Minimax Lower Bounds

Why upper bounds are not enough: minimax risk, Le Cam two-point method, Fano inequality, and Assouad lemma for proving that no estimator can beat a given rate.


Why This Matters

Proving an upper bound (e.g., "my estimator achieves rate $n^{-1/2}$") is only half the story. The natural question is: can any estimator do better? A minimax lower bound answers this by showing that for every estimator, there exists a problem instance where the estimator must incur at least a certain error.

Lower bounds are intellectually crucial because they tell you when to stop trying to improve. If your estimator achieves rate $n^{-2/(2+d)}$ and you can prove a matching lower bound, you know your estimator is rate-optimal. No amount of cleverness can beat the fundamental statistical limit.

In modern ML theory, lower bounds separate what is statistically possible from what is merely a failure of current algorithms.

Mental Model

Upper bounds say: "here is an estimator that works this well." Lower bounds say: "no estimator can do better than this." Together, they characterize the fundamental difficulty of a statistical problem.

The key technique for lower bounds is reduction to testing: show that estimating a parameter accurately requires distinguishing between hypotheses. If no test can reliably distinguish the hypotheses, no estimator can reliably estimate the parameter.

Formal Setup

Definition

Minimax Risk

The minimax risk over a parameter space $\Theta$ with loss function $\ell$ and sample size $n$ is:

$$R_n^*(\Theta, \ell) = \inf_{\hat{\theta}} \sup_{\theta \in \Theta} \mathbb{E}_\theta[\ell(\hat{\theta}, \theta)]$$

The infimum is over all estimators $\hat{\theta} = \hat{\theta}(X_1, \ldots, X_n)$, and the supremum is over all parameter values in $\Theta$. This is the best worst-case risk achievable by any estimator.

The minimax framework asks: what is the best any estimator can guarantee, when nature is adversarial? The inf-sup structure captures this game:

  1. The statistician chooses an estimator (inf)
  2. Nature chooses the worst-case parameter (sup)
  3. The minimax risk is the value of this game

A lower bound on $R_n^*$ means: no matter what estimator you use, there exists some $\theta$ where your expected loss is at least this large.
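The inf-sup game can be made concrete with a small numerical sketch (Python with NumPy; the shrinkage estimator below is the classical constant-risk construction for the Bernoulli mean, used purely as an illustration). Nature's sup becomes a maximum over a grid of parameter values, and two candidate estimators are compared by their worst-case risk:

```python
import numpy as np

# Worst-case risk of two estimators of a Bernoulli mean p from n samples,
# computed from closed-form MSE curves. Nature's "sup" is the max over p.
n = 100
p = np.linspace(0.0, 1.0, 2001)

# Sample mean S/n: MSE(p) = p(1-p)/n, worst case 1/(4n) at p = 1/2.
mse_mean = p * (1 - p) / n

# Classical minimax estimator (S + sqrt(n)/2) / (n + sqrt(n)):
# MSE(p) = [n p(1-p) + n (1/2 - p)^2] / (n + sqrt(n))^2, constant in p.
mse_minimax = (n * p * (1 - p) + n * (0.5 - p) ** 2) / (n + np.sqrt(n)) ** 2

print(mse_mean.max())     # sup over p: 1/(4n) = 0.0025
print(mse_minimax.max())  # n / (4 (n + sqrt(n))^2), roughly 0.00207
```

The sample mean has smaller risk near $p = 0$ or $1$, but the shrunk estimator wins the worst case; the minimax criterion compares only the suprema.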

The Reduction to Testing

The fundamental idea behind all minimax lower bound techniques:

If you cannot test, you cannot estimate.

More precisely, if two parameter values $\theta_0$ and $\theta_1$ are far apart (in the loss metric) but the corresponding distributions $P_{\theta_0}$ and $P_{\theta_1}$ are close (in a statistical sense), then no estimator can reliably tell them apart, which means no estimator can accurately estimate $\theta$.

This insight takes three forms, in order of increasing power:

  1. Le Cam (two hypotheses): reduce to a single binary test
  2. Fano (multiple hypotheses): reduce to an $M$-ary test
  3. Assouad (hypercube): reduce to $d$ simultaneous binary tests

Le Cam Two-Point Method

Theorem

Le Cam Two-Point Method

Statement

Let $\theta_0, \theta_1 \in \Theta$ with $\ell(\theta_0, \theta_1) \geq 2s > 0$. Then:

$$R_n^*(\Theta, \ell) \geq \frac{s}{2}\left(1 - \|P_{\theta_0}^n - P_{\theta_1}^n\|_{\mathrm{TV}}\right)$$

where $\|P - Q\|_{\mathrm{TV}} \in [0, 1]$ is the total variation distance between the joint distributions of the $n$ samples. We follow the convention of Yu (1997, Lemma 1) and Tsybakov (2009, Theorem 2.2): $s$ is the half-separation in the loss (the assumption $\ell(\theta_0, \theta_1) \geq 2s$), TV is normalized to $[0, 1]$, and the $1/2$ factor comes from bounding the supremum by the average risk, $\sup_\theta \mathbb{E}_\theta[\ell] \geq \tfrac{1}{2}(\mathbb{E}_{\theta_0}[\ell] + \mathbb{E}_{\theta_1}[\ell])$.

Intuition

If the two distributions $P_{\theta_0}^n$ and $P_{\theta_1}^n$ are nearly indistinguishable (small total variation distance), then any estimator must err on at least one of them. Since $\theta_0$ and $\theta_1$ are far apart in loss, this error is of order $s$.

The total variation distance controls how well you can distinguish the two hypotheses. If $\|P_{\theta_0}^n - P_{\theta_1}^n\|_{\mathrm{TV}} \leq 1/2$, you get a lower bound of at least $s/4$.
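As a sanity check, the bound is easy to evaluate numerically. The sketch below (Python with NumPy; the helper name `le_cam_bound` is ours) plugs the standard Gaussian two-point construction into the formula, upper-bounding the TV term via Pinsker's inequality; since TV enters as $1 - \mathrm{TV}$, an upper bound on TV still yields a valid lower bound:

```python
import numpy as np

# Le Cam two-point bound for N(-delta/2, 1) vs N(+delta/2, 1) with n i.i.d.
# samples and squared-error loss: ell(theta0, theta1) = delta^2 = 2s.
# TV is upper-bounded via Pinsker: TV <= sqrt(KL/2), with
# KL(P0^n || P1^n) = n * delta^2 / 2 for unit-variance Gaussians.
def le_cam_bound(n, delta):
    s = delta ** 2 / 2              # half-separation in squared error
    kl = n * delta ** 2 / 2         # KL between the n-sample products
    tv = min(1.0, np.sqrt(kl / 2))  # Pinsker upper bound on TV
    return (s / 2) * (1 - tv)       # valid: 1 - TV >= 1 - (TV upper bound)

n = 1000
delta = 1.0 / np.sqrt(n)  # the usual scaling: keeps the TV bound at 1/2
print(le_cam_bound(n, delta) * n)  # ~0.125: the bound is of order 1/n
```

Scaling $\delta \propto 1/\sqrt{n}$ is the balancing act: larger $\delta$ increases the separation $s$ but drives TV toward 1, killing the $(1 - \mathrm{TV})$ factor.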

Proof Sketch

Any estimator $\hat{\theta}$ satisfies:

$$\sup_\theta \mathbb{E}_\theta[\ell(\hat{\theta}, \theta)] \geq \frac{1}{2}\left(\mathbb{E}_{\theta_0}[\ell(\hat{\theta}, \theta_0)] + \mathbb{E}_{\theta_1}[\ell(\hat{\theta}, \theta_1)]\right)$$

For any outcome, the triangle-like property of the loss gives $\ell(\hat{\theta}, \theta_0) + \ell(\hat{\theta}, \theta_1) \geq 2s$. Define the test $\psi = \mathbf{1}[\ell(\hat{\theta}, \theta_0) \geq s]$, which decides $\theta_1$ when $\hat{\theta}$ is far from $\theta_0$. Under $\theta_0$, the event $\psi = 1$ means $\ell(\hat{\theta}, \theta_0) \geq s$; under $\theta_1$, the event $\psi = 0$ forces $\ell(\hat{\theta}, \theta_1) \geq 2s - s = s$. Then:

$$\frac{1}{2}\left(\mathbb{E}_{\theta_0}[\ell(\hat\theta, \theta_0)] + \mathbb{E}_{\theta_1}[\ell(\hat\theta, \theta_1)]\right) \geq s \cdot \frac{1}{2}\left(P_{\theta_0}(\psi = 1) + P_{\theta_1}(\psi = 0)\right)$$

The last term is the average testing error, which is bounded below by $(1 - \|P_{\theta_0}^n - P_{\theta_1}^n\|_{\mathrm{TV}})/2$ by the Neyman-Pearson lemma.

Why It Matters

Le Cam is the simplest lower bound technique and often the first one to try. It reduces the estimation problem to binary hypothesis testing, which is the most basic statistical task. If even this simple test is hard, estimation must be hard.

Failure Mode

Le Cam only uses two hypotheses. When the parameter space is rich (high-dimensional), two points cannot capture the full difficulty of the problem. Fano and Assouad address this by using many hypotheses simultaneously.

Fano Method (Preview)

Fano extends Le Cam by using $M$ hypotheses instead of 2. The idea: construct $M$ parameter values $\theta_1, \ldots, \theta_M$ that are pairwise well-separated in loss but whose distributions are mutually close (bounded mutual information).

Fano's inequality then gives:

$$R_n^* \geq s \left(1 - \frac{I(X^n; \Theta) + \log 2}{\log M}\right)$$

where $I(X^n; \Theta)$ is the mutual information between the data and the uniform index over the $M$ hypotheses. The more hypotheses you can pack while keeping the mutual information small, the stronger the lower bound.

The full treatment of Fano's inequality is in the dedicated topic page.
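The interplay between $s$, $M$, and the information budget can be sketched numerically (Python; the numbers below are hypothetical placeholders for what a real packing construction would deliver, not a derived bound):

```python
import numpy as np

# Generic evaluation of the Fano lower bound: given a separation s, the
# number of hypotheses M, and an upper bound on the mutual information
# I(X^n; Theta), return the implied minimax lower bound. The strength of
# the final bound lives entirely in the construction supplying these numbers.
def fano_bound(s, mutual_info, M):
    return s * max(0.0, 1 - (mutual_info + np.log(2)) / np.log(M))

# Toy instance: M = 2^d hypotheses for d = 20, with I(X^n; Theta) assumed
# to be at most 0.25 * log M (as a good packing would arrange).
d = 20
M = 2 ** d
mi_bound = 0.25 * np.log(M)  # hypothetical: supplied by the construction
print(fano_bound(1.0, mi_bound, M))  # ~0.7: a constant fraction of s survives
```

If the mutual information grows like $\log M$ with a constant smaller than 1, a constant fraction of the separation survives; if it reaches $\log M$, the bound collapses to zero.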

Assouad Lemma

Lemma

Assouad Lemma

Statement

Let the parameter space be indexed by $\tau \in \{-1, +1\}^d$, with loss satisfying $\ell(\hat{\theta}, \theta_\tau) \geq \sum_{j=1}^d s_j \, \mathbf{1}[\hat{\tau}_j \neq \tau_j]$, where $\hat{\tau}_j$ is the coordinate sign recovered from $\hat{\theta}$. Then:

$$R_n^* \geq \frac{1}{2}\sum_{j=1}^d s_j \left(1 - \|P_{\tau_j = +1}^n - P_{\tau_j = -1}^n\|_{\mathrm{TV}}\right)$$

where $P_{\tau_j = \pm 1}^n$ are mixtures over all $\tau$ with the $j$-th coordinate fixed. Convention: Yu (1997, Lemma 2), Tsybakov (2009, Theorem 2.12), Wainwright (2019, Proposition 15.5); $s_j$ is the per-coordinate separation in the loss and TV is normalized to $[0, 1]$. The $1/2$ factor arises from reducing each coordinate to a Le Cam two-point test.

Intuition

Assouad reduces the $d$-dimensional estimation problem to $d$ independent binary testing problems, one per coordinate of the hypercube. Each coordinate contributes a Le Cam-style lower bound. The total lower bound is the sum over all coordinates.

Proof Sketch

The loss decomposes across coordinates. For each coordinate $j$, estimating $\tau_j$ correctly is a binary testing problem between the mixture with $\tau_j = +1$ and the mixture with $\tau_j = -1$. Apply Le Cam to each coordinate and sum the results.
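The arithmetic for the standard Gaussian-mean instance can be sketched as follows (Python with NumPy; the helper name `assouad_bound` is ours, and the mixture TV is simplified by bounding it, via convexity of TV, by the TV of a single coordinate flip):

```python
import numpy as np

# Assouad bound for the d-dimensional Gaussian mean on the hypercube
# {-delta, +delta}^d with squared l2 loss. Flipping coordinate j changes
# that coordinate's squared error by at least delta^2 (so s_j = delta^2).
# A single-coordinate flip has mean gap 2*delta, hence KL = 2 n delta^2
# between the n-sample products, and Pinsker gives TV <= sqrt(n) * delta.
def assouad_bound(n, d, delta):
    s_j = delta ** 2                       # per-coordinate separation
    tv_j = min(1.0, np.sqrt(n) * delta)    # Pinsker bound on each flip's TV
    return 0.5 * d * s_j * (1 - tv_j)

n, d = 1000, 50
delta = 0.5 / np.sqrt(n)  # keeps each per-coordinate TV bound at 1/2
bound = assouad_bound(n, d, delta)
print(bound * n / d)  # ~1/16: the bound scales as d/n, the minimax rate
```

With $\delta \propto 1/\sqrt{n}$, each of the $d$ coordinates contributes order $1/n$, and the sum recovers the $d/n$ rate that a two-point Le Cam argument cannot reach.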

Why It Matters

Assouad is the go-to method for lower bounds in high-dimensional problems. When the parameter is a $d$-dimensional vector, Assouad naturally captures the "curse of dimensionality" by showing that each coordinate independently contributes to the difficulty. It is the standard tool for proving lower bounds in nonparametric estimation, where $d$ plays the role of the effective dimensionality.

Failure Mode

Assouad requires the loss to decompose (at least approximately) as a sum over coordinates. When the loss has a more complex structure (e.g., spectral norm for matrix estimation), Assouad may not apply directly, and you may need a modified version or a different technique.

How to Use These Tools

A practical guide to choosing the right lower bound method:

  1. Start with Le Cam if the problem is one-dimensional or you only need a rough lower bound. Construct two hypotheses that are far apart in loss but close in total variation.

  2. Use Fano when the parameter space is rich and you can construct many well-separated hypotheses. This is the most common method for proving minimax rates in statistics and learning theory.

  3. Use Assouad for high-dimensional problems where the loss decomposes coordinate-wise. The hypercube construction naturally yields $2^d$ hypotheses from $d$ binary choices.

In all cases, the art is in the construction: choosing the right hypotheses (or packing) so that the separation is large and the distributions are hard to distinguish.

Canonical Examples

Example

Le Cam for Gaussian mean estimation

Let $X_1, \ldots, X_n \sim \mathcal{N}(\theta, 1)$ with $\theta \in \mathbb{R}$. Take $\theta_0 = -\delta/2$ and $\theta_1 = +\delta/2$. The total variation distance satisfies:

$$\|P_{\theta_0}^n - P_{\theta_1}^n\|_{\mathrm{TV}} \leq \sqrt{\frac{n\delta^2}{4}} = \frac{\delta\sqrt{n}}{2}$$

by Pinsker's inequality $\|P - Q\|_{\mathrm{TV}} \leq \sqrt{D_{\mathrm{KL}}(P\|Q)/2}$ applied to $D_{\mathrm{KL}}(P_{\theta_0}^n \| P_{\theta_1}^n) = n\delta^2/2$ (the KL divergence for unit-variance Gaussians with mean gap $\delta$, times $n$). Setting $\delta = c/\sqrt{n}$ makes the TV distance bounded by the constant $c/2$, and the lower bound gives $R_n^* \geq c'/n$ for squared error loss. This matches the upper bound from the sample mean ($\mathrm{Var}(\bar{X}) = 1/n$).
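A quick numerical companion to this example (Python with NumPy): evaluate the two-point bound at $c = 1$ and compare it with a Monte Carlo estimate of the sample mean's risk. Both are of order $1/n$, differing only in the constant:

```python
import numpy as np

# Le Cam lower bound vs the sample mean's actual risk for Gaussian mean
# estimation, at c = 1 (so delta = 1/sqrt(n), 2s = delta^2, TV bound = 1/2).
rng = np.random.default_rng(0)
n = 200
delta = 1.0 / np.sqrt(n)
lower = (delta ** 2 / 4) * (1 - 0.5)  # (s/2)(1 - TV bound) = 1/(8n)

# Monte Carlo risk of the sample mean at theta = 0 (its risk is 1/n for
# every theta, so one parameter value suffices here).
reps = 20000
x = rng.normal(0.0, 1.0, size=(reps, n))
upper = np.mean(x.mean(axis=1) ** 2)

print(lower * n)  # 0.125: the lower bound, rescaled by n
print(upper * n)  # close to 1.0: the sample mean achieves the 1/n rate
```

The gap between 0.125 and 1.0 is a constant-factor gap only; optimizing $c$ (or using chi-square or Hellinger bounds instead of Pinsker) tightens the constant but not the rate.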

Example

Why lower bounds matter: nonparametric regression

For estimating a function in a Sobolev ball of smoothness $s$ in $d$ dimensions, the minimax rate for squared error is $n^{-2s/(2s+d)}$. Without the lower bound, you would not know whether this rate is optimal or whether a better estimator exists. The lower bound (typically proved via Assouad) shows this rate is tight: no estimator can do better, and the curse of dimensionality (the $d$ in the exponent) is unavoidable.
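The dependence on dimension is easy to tabulate. A short Python loop showing how the exponent $2s/(2s+d)$ collapses as $d$ grows, for smoothness $s = 2$:

```python
# Minimax exponent 2s/(2s+d) for Sobolev smoothness s = 2: the rate
# n^{-2s/(2s+d)} degrades rapidly with the dimension d.
s = 2.0
for d in (1, 2, 10, 100):
    print(d, 2 * s / (2 * s + d))  # 0.8, ~0.667, ~0.286, ~0.038
```

At $d = 100$ the exponent is about 0.038, so even $n = 10^6$ samples shrink the error by less than a factor of 2 relative to $n = 10^5$.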

Common Confusions

Watch Out

A lower bound does not mean all estimators achieve that rate

The minimax lower bound says: the best worst-case risk is at least this large. It does not say that every estimator achieves this rate. Most estimators are much worse than minimax-optimal. The lower bound is a statement about the best possible estimator, not about typical estimators.

Watch Out

Minimax optimality does not mean the estimator is good for your problem

Minimax theory is worst-case. An estimator that is minimax-optimal may be overly conservative for specific easy instances. Adaptive estimators try to achieve near-minimax rates while also exploiting favorable structure when it is present.

Watch Out

Total variation and KL divergence are different

Le Cam uses total variation distance. Fano uses mutual information (related to KL divergence). They are connected by Pinsker's inequality: $\|P - Q\|_{\mathrm{TV}} \leq \sqrt{D_{\mathrm{KL}}(P \| Q)/2}$. But they are not interchangeable. Using the wrong divergence in the wrong method gives incorrect bounds.
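The relationship is easy to check numerically. A small verification (Python with NumPy; the helper names are ours) on Bernoulli pairs, where TV is simply $|p - q|$:

```python
import numpy as np

# Checking Pinsker's inequality TV <= sqrt(KL/2) on Bernoulli pairs.
def tv_bern(p, q):
    # Total variation between Bernoulli(p) and Bernoulli(q).
    return abs(p - q)

def kl_bern(p, q):
    # KL divergence KL(Bernoulli(p) || Bernoulli(q)), 0 < p, q < 1.
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

for p, q in [(0.5, 0.6), (0.1, 0.2), (0.3, 0.7)]:
    assert tv_bern(p, q) <= np.sqrt(kl_bern(p, q) / 2)
print("Pinsker holds on all test pairs")
```

Note the inequality goes one way only: small KL forces small TV, but small TV does not force small KL (KL blows up when $q$ puts near-zero mass where $p$ does not).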

Summary

  • Minimax risk $= \inf_{\hat\theta} \sup_\theta \mathbb{E}_\theta[\ell(\hat\theta, \theta)]$: the best worst-case performance
  • Lower bounds prove that no estimator can beat a certain rate
  • All methods reduce estimation to hypothesis testing: if you cannot test, you cannot estimate
  • Le Cam: two hypotheses, uses total variation distance
  • Fano: $M$ hypotheses, uses mutual information
  • Assouad: $2^d$ hypotheses on a hypercube, decomposes into $d$ binary tests
  • A matching upper and lower bound proves an estimator is rate-optimal

Exercises

ExerciseAdvanced

Problem

Use Le Cam's two-point method to prove a lower bound for estimating the mean of a Bernoulli distribution. Let $X_1, \ldots, X_n \sim \mathrm{Bernoulli}(p)$ with $p \in [0, 1]$. Show that for squared error loss, $R_n^* \geq c/n$ for some constant $c > 0$.

ExerciseResearch

Problem

Why does Le Cam give a suboptimal lower bound for high-dimensional problems? Specifically, for estimating a $d$-dimensional Gaussian mean with identity covariance, Le Cam with two points gives $R_n^* \geq c/n$, while the true minimax rate for squared $\ell_2$ loss is of order $d/n$. Where does the factor of $d$ come from?


References

Canonical:

  • Tsybakov, Introduction to Nonparametric Estimation (2009), Chapter 2 -- the best textbook treatment
  • Le Cam, "Convergence of Estimates Under Dimensionality Restrictions" (1973)
  • Yu, "Assouad, Fano, and Le Cam" (1997) -- a unified exposition

Current:

  • Duchi, Information-Theoretic Methods in Statistics (lecture notes, Stanford) -- modern treatment

  • Wainwright, High-Dimensional Statistics (2019), Chapter 15

  • Casella & Berger, Statistical Inference (2002), Chapters 5-10

Next Topics

Deepening the lower bound toolkit:

  • Fano inequality: the detailed mechanics of the most widely used lower bound technique

Last reviewed: April 2026
