Statistical Foundations
Minimax Lower Bounds
Why upper bounds are not enough: minimax risk, Le Cam two-point method, Fano inequality, and Assouad lemma for proving that no estimator can beat a given rate.
Why This Matters
Proving an upper bound (e.g., "my estimator achieves a given rate of convergence") is only half the story. The natural question is: can any estimator do better? A minimax lower bound answers this by showing that for every estimator, there exists a problem instance where the estimator must incur at least a certain error.
Lower bounds are intellectually crucial because they tell you when to stop trying to improve. If your estimator achieves some rate and you can prove a matching lower bound, you know your estimator is rate-optimal. No amount of cleverness can beat the fundamental statistical limit.
In modern ML theory, lower bounds separate what is statistically possible from what is merely a failure of current algorithms.
Mental Model
Upper bounds say: "here is an estimator that works this well." Lower bounds say: "no estimator can do better than this." Together, they characterize the fundamental difficulty of a statistical problem.
The key technique for lower bounds is reduction to testing: show that estimating a parameter accurately requires distinguishing between hypotheses. If no test can reliably distinguish the hypotheses, no estimator can reliably estimate the parameter.
Formal Setup
Minimax Risk
The minimax risk over a parameter space $\Theta$ with loss function $\ell$ and sample size $n$ is:

$$\mathcal{R}_n^*(\Theta) = \inf_{\hat\theta} \sup_{\theta \in \Theta} \mathbb{E}_\theta\big[\ell(\hat\theta(X_1, \dots, X_n), \theta)\big]$$

The infimum is over all estimators $\hat\theta$, and the supremum is over all parameter values $\theta$ in $\Theta$. This is the best worst-case risk achievable by any estimator.
The minimax framework asks: what is the best any estimator can guarantee, when nature is adversarial? The inf-sup structure captures this game:
- The statistician chooses an estimator (inf)
- Nature chooses the worst-case parameter (sup)
- The minimax risk is the value of this game
A lower bound on $\mathcal{R}_n^*(\Theta)$ means: no matter what estimator you use, there exists some $\theta \in \Theta$ where your expected loss is at least this large.
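The inf-sup game can be made concrete with a small numerical sketch (an assumed example using exact risk formulas for Bernoulli mean estimation; the grid and estimators are illustrative): nature plays the sup over parameters, then the statistician compares worst-case risks across estimators.

```python
import numpy as np

n = 100
p_grid = np.linspace(0.0, 1.0, 1001)  # nature's candidate parameters

# Exact MSE of the sample mean for Bernoulli(p): p(1-p)/n.
risk_mean = p_grid * (1 - p_grid) / n

# Exact (constant) MSE of the classical minimax estimator
# (S + sqrt(n)/2) / (n + sqrt(n)): 1 / (4 (1 + sqrt(n))^2) for every p.
risk_minimax = np.full_like(p_grid, 1.0 / (4 * (1 + np.sqrt(n)) ** 2))

# Nature plays the sup; the statistician then takes the inf over estimators.
worst_mean = risk_mean.max()        # 1/(4n), attained at p = 1/2
worst_minimax = risk_minimax.max()  # slightly smaller than 1/(4n)
print(worst_mean, worst_minimax)
```

The sample mean wins for $p$ near 0 or 1 but loses the worst-case comparison, which is exactly the distinction the minimax framework formalizes.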
The Reduction to Testing
The fundamental idea behind all minimax lower bound techniques:
If you cannot test, you cannot estimate.
More precisely, if two parameter values $\theta_0$ and $\theta_1$ are far apart (in the loss metric) but the corresponding distributions $P_{\theta_0}$ and $P_{\theta_1}$ are close (in a statistical sense), then no estimator can reliably tell them apart, which means no estimator can accurately estimate $\theta$.
This insight takes three forms, in order of increasing power:
- Le Cam (two hypotheses): reduce to a single binary test
- Fano (multiple hypotheses): reduce to an $M$-ary test
- Assouad (hypercube): reduce to $d$ simultaneous binary tests
Le Cam Two-Point Method
Statement
Let $\theta_0, \theta_1 \in \Theta$ with $\ell(\theta_0, \theta_1) \ge 2\delta$. Then:

$$\inf_{\hat\theta} \max_{j \in \{0,1\}} \mathbb{E}_{\theta_j}\big[\ell(\hat\theta, \theta_j)\big] \ge \frac{\delta}{2}\Big(1 - \mathrm{TV}\big(P_{\theta_0}^{\otimes n}, P_{\theta_1}^{\otimes n}\big)\Big)$$

where $\mathrm{TV}$ is the total variation distance between the joint distributions of $n$ samples. We follow the convention of Yu (1997, Lemma 1) and Tsybakov (2009, Theorem 2.2): $\delta$ is the half-separation in the loss (assumption $\ell(\theta_0, \theta_1) \ge 2\delta$), TV is normalized to $[0, 1]$, and the factor $\frac{1}{2}$ comes from the average-error bound $\inf_\psi \frac{1}{2}\big(P_0(\psi = 1) + P_1(\psi = 0)\big) = \frac{1}{2}\big(1 - \mathrm{TV}(P_0, P_1)\big)$.
Intuition
If the two distributions $P_{\theta_0}$ and $P_{\theta_1}$ are nearly indistinguishable (small total variation distance), then any estimator must err on at least one of them. Since $\theta_0$ and $\theta_1$ are far apart in loss, this error is of order $\delta$.
The total variation distance controls how well you can distinguish the two hypotheses. If $\mathrm{TV}(P_{\theta_0}^{\otimes n}, P_{\theta_1}^{\otimes n}) \le \frac{1}{2}$, you get a lower bound of at least $\delta/4$.
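For intuition, the bound can be evaluated exactly for two unit-variance Gaussian hypotheses, where the TV distance between the $n$-sample product measures has a closed form. A minimal sketch (the choice of $n$ and the candidate separations are illustrative):

```python
import math

def tv_gaussian(n, mu):
    """Exact TV between N(0,1)^n and N(mu,1)^n: 2*Phi(sqrt(n)*mu/2) - 1."""
    z = math.sqrt(n) * mu / 2
    return math.erf(z / math.sqrt(2))  # identity: 2*Phi(z) - 1 = erf(z/sqrt(2))

def le_cam_bound(n, mu):
    """Two-point bound for squared loss: separation mu^2 = 2*delta."""
    delta = mu ** 2 / 2
    return (delta / 2) * (1 - tv_gaussian(n, mu))

n = 100
# Separations of order 1/sqrt(n) keep the TV bounded away from 1.
best = max(le_cam_bound(n, c / math.sqrt(n)) for c in (0.5, 1.0, 1.5, 2.0))
print(best)  # a constant times 1/n, matching the parametric rate
```

Taking the separation too small makes $\delta$ tiny; taking it too large drives TV toward 1. The useful regime balances the two, which is where the $1/n$ rate appears.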
Proof Sketch
Any estimator $\hat\theta$ satisfies:

$$\max_{j \in \{0,1\}} \mathbb{E}_{\theta_j}\big[\ell(\hat\theta, \theta_j)\big] \ge \frac{1}{2}\Big(\mathbb{E}_{\theta_0}\big[\ell(\hat\theta, \theta_0)\big] + \mathbb{E}_{\theta_1}\big[\ell(\hat\theta, \theta_1)\big]\Big)$$

For any outcome, by the triangle-like property, $\ell(\hat\theta, \theta_0) + \ell(\hat\theta, \theta_1) \ge \ell(\theta_0, \theta_1) \ge 2\delta$. Define $\psi = \arg\min_{j} \ell(\hat\theta, \theta_j)$ as a test; whenever $\psi \ne j$, we have $\ell(\hat\theta, \theta_j) \ge \delta$. Then:

$$\frac{1}{2}\Big(\mathbb{E}_{\theta_0}\big[\ell(\hat\theta, \theta_0)\big] + \mathbb{E}_{\theta_1}\big[\ell(\hat\theta, \theta_1)\big]\Big) \ge \frac{\delta}{2}\Big(P_0(\psi = 1) + P_1(\psi = 0)\Big)$$

The last term is the average testing error, which is bounded below by $\frac{\delta}{2}\big(1 - \mathrm{TV}(P_{\theta_0}^{\otimes n}, P_{\theta_1}^{\otimes n})\big)$ by the Neyman-Pearson lemma.
Why It Matters
Le Cam is the simplest lower bound technique and often the first one to try. It reduces the estimation problem to binary hypothesis testing, which is the most basic statistical task. If even this simple test is hard, estimation must be hard.
Failure Mode
Le Cam only uses two hypotheses. When the parameter space is rich (high-dimensional), two points cannot capture the full difficulty of the problem. Fano and Assouad address this by using many hypotheses simultaneously.
Fano Method (Preview)
Fano extends Le Cam by using $M$ hypotheses instead of 2. The idea: construct $M$ parameter values $\theta_1, \dots, \theta_M$ that are pairwise well-separated in loss but whose distributions are mutually close (bounded mutual information).
Fano's inequality then gives:

$$\inf_{\hat\theta} \max_{j} \mathbb{E}_{\theta_j}\big[\ell(\hat\theta, \theta_j)\big] \ge \frac{\delta}{2}\left(1 - \frac{I(X; J) + \log 2}{\log M}\right)$$

where $I(X; J)$ is the mutual information between the data $X$ and the uniform index $J$ over the $M$ hypotheses. The more hypotheses you can pack while keeping the mutual information small, the stronger the lower bound.
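The displayed inequality is easy to evaluate numerically. The sketch below uses made-up values of $M$, $I(X; J)$, and $\delta$ purely to illustrate when the bound is non-vacuous:

```python
import math

def fano_lower_bound(delta, mutual_info, M):
    """Fano bound: risk >= (delta/2) * (1 - (I(X;J) + log 2) / log M)."""
    return (delta / 2) * (1 - (mutual_info + math.log(2)) / math.log(M))

# Illustrative (assumed) numbers: M = 256 hypotheses with half-separation
# delta = 0.1 and mutual information I(X;J) = 2.0 nats.
b = fano_lower_bound(0.1, 2.0, 256)
print(b)  # positive because I + log 2 < log M = log 256

# If the data pins down the index too well, the bound becomes vacuous:
print(fano_lower_bound(0.1, 6.0, 256))  # negative: no useful lower bound
```

This is the packing trade-off in miniature: more hypotheses increase $\log M$, but the construction must keep $I(X; J)$ small for the bracket to stay positive.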
The full treatment of Fano's inequality is in the dedicated topic page.
Assouad Lemma
Statement
Let the parameter space be indexed by the hypercube $\tau \in \{0, 1\}^d$, with loss satisfying the coordinate-wise separation $\ell(\theta_\tau, \theta_{\tau'}) \ge 2\delta \sum_{j=1}^{d} \mathbf{1}\{\tau_j \ne \tau'_j\}$. Then:

$$\inf_{\hat\theta} \max_{\tau \in \{0,1\}^d} \mathbb{E}_{\tau}\big[\ell(\hat\theta, \theta_\tau)\big] \ge \frac{d\,\delta}{2} \min_{j} \big(1 - \mathrm{TV}(P_{+j}, P_{-j})\big)$$

where $P_{+j}, P_{-j}$ are mixtures over all $\tau$ with the $j$-th coordinate fixed to $1$ and $0$ respectively. Convention: Yu (1997, Lemma 2), Tsybakov (2009, Theorem 2.12), Wainwright (2019, Proposition 15.5); $2\delta$ is the per-coordinate separation in the loss and TV is normalized to $[0, 1]$. The factor $\frac{1}{2}$ arises from reducing each coordinate to a Le Cam two-point test.
Intuition
Assouad reduces the $d$-dimensional estimation problem to $d$ independent binary testing problems, one per coordinate of the hypercube. Each coordinate contributes a Le Cam-style lower bound. The total lower bound is the sum over all coordinates.
Proof Sketch
The loss decomposes across coordinates. For each coordinate $j$, estimating $\tau_j$ correctly is a binary testing problem between the mixture with $\tau_j = 0$ and the mixture with $\tau_j = 1$. Apply Le Cam to each coordinate and sum the results.
Why It Matters
Assouad is the go-to method for lower bounds in high-dimensional problems. When the parameter is a $d$-dimensional vector, Assouad naturally captures the "curse of dimensionality" by showing that each coordinate independently contributes to the difficulty. It is the standard tool for proving lower bounds in nonparametric estimation, where the hypercube dimension $d$ plays the role of the effective dimensionality.
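For the $d$-dimensional Gaussian mean, a hypercube construction (a standard choice, assumed here) makes the $d/n$ scaling visible: each coordinate contributes a Le Cam-style term, and Pinsker controls the TV between the coordinate mixtures.

```python
import math

def assouad_bound_gaussian(n, d, eps):
    """Assouad bound for a d-dim Gaussian mean on the hypercube {0, eps}^d.
    Per-coordinate squared-loss separation is eps^2 (= 2*delta_j); the mixtures
    P_{+j}, P_{-j} satisfy TV <= sqrt(n * eps^2 / 4) by Pinsker, since the
    paired distributions differ only in coordinate j with KL = n * eps^2 / 2."""
    delta_j = eps ** 2 / 2
    tv = min(1.0, math.sqrt(n * eps ** 2 / 4))
    return d * (delta_j / 2) * (1 - tv)  # sum of d two-point terms

n = 100
eps = 1 / math.sqrt(n)  # calibrated so the per-coordinate TV is 1/2
print(assouad_bound_gaussian(n, 50, eps))   # d/(8n) for d = 50
print(assouad_bound_gaussian(n, 100, eps))  # doubles with d: the d/n scaling
```

Doubling $d$ doubles the bound, which is exactly the per-coordinate additivity that two-point Le Cam cannot capture.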
Failure Mode
Assouad requires the loss to decompose (at least approximately) as a sum over coordinates. When the loss has a more complex structure (e.g., spectral norm for matrix estimation), Assouad may not apply directly, and you may need a modified version or a different technique.
How to Use These Tools
A practical guide to choosing the right lower bound method:
- Start with Le Cam if the problem is one-dimensional or you only need a rough lower bound. Construct two hypotheses that are far apart in loss but close in total variation.
- Use Fano when the parameter space is rich and you can construct many well-separated hypotheses. This is the most common method for proving minimax rates in statistics and learning theory.
- Use Assouad for high-dimensional problems where the loss decomposes coordinate-wise. The hypercube construction naturally yields $2^d$ hypotheses from $d$ binary choices.
In all cases, the art is in the construction: choosing the right hypotheses (or packing) so that the separation is large and the distributions are hard to distinguish.
Canonical Examples
Le Cam for Gaussian mean estimation
Let $X_1, \dots, X_n \sim N(\theta, 1)$ i.i.d. with $\theta \in \mathbb{R}$. Take $\theta_0 = 0$ and $\theta_1 = \delta$. The total variation distance satisfies:

$$\mathrm{TV}\big(P_0^{\otimes n}, P_1^{\otimes n}\big) \le \sqrt{\tfrac{1}{2}\mathrm{KL}\big(P_0^{\otimes n} \,\|\, P_1^{\otimes n}\big)} = \sqrt{\frac{n\delta^2}{4}}$$

by Pinsker's inequality applied to $\mathrm{KL}(P_0^{\otimes n} \| P_1^{\otimes n}) = n\delta^2/2$ (KL for Gaussians with unit variance and mean gap $\delta$). Setting $\delta = 1/\sqrt{n}$ makes the TV distance bounded by the constant $1/2$, and the lower bound gives $\mathcal{R}_n^* \gtrsim 1/n$ for squared error loss. This matches the upper bound from the sample mean ($\mathbb{E}[(\bar{X}_n - \theta)^2] = 1/n$).
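Both halves of the example can be checked numerically: the Pinsker bound on TV at $\delta = 1/\sqrt{n}$ is exactly $1/2$, and a Monte Carlo run (seed and replication count are arbitrary) confirms the sample mean's $1/n$ risk.

```python
import math
import numpy as np

n = 200
delta = 1 / math.sqrt(n)

# Pinsker: TV <= sqrt(KL/2), with KL = n * delta^2 / 2 between the two
# n-sample product measures (unit-variance Gaussians, mean gap delta).
kl = n * delta ** 2 / 2
tv_bound = math.sqrt(kl / 2)
print(tv_bound)  # 0.5: the two hypotheses remain hard to distinguish

# Monte Carlo risk of the sample mean at theta = 0: should be close to 1/n.
rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=(20000, n))
risk = float(np.mean(samples.mean(axis=1) ** 2))
print(risk)  # roughly 1/200 = 0.005
```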
Why lower bounds matter: nonparametric regression
For estimating a function in the Sobolev ball of smoothness $\beta$ in $d$ dimensions, the minimax rate for squared error is $n^{-2\beta/(2\beta + d)}$. Without the lower bound, you would not know whether this rate is optimal or whether a better estimator exists. The lower bound (typically proved via Assouad) shows this rate is tight -- no estimator can do better, and the curse of dimensionality ($d$ in the exponent) is unavoidable.
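The curse of dimensionality in the exponent is stark when evaluated numerically (the smoothness and dimensions below are chosen for illustration):

```python
def nonparametric_rate(n, beta, d):
    """Minimax rate n^(-2*beta / (2*beta + d)) for Sobolev smoothness beta."""
    return n ** (-2 * beta / (2 * beta + d))

n, beta = 10 ** 6, 2
for d in (1, 5, 20):
    print(d, nonparametric_rate(n, beta, d))
```

With a million samples and smoothness $\beta = 2$, the achievable error shrinks by orders of magnitude in one dimension yet barely moves in twenty: the lower bound certifies that this is a property of the problem, not of any particular estimator.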
Common Confusions
A lower bound does not mean all estimators achieve that rate
The minimax lower bound says: the best worst-case risk is at least this large. It does not say that every estimator achieves this rate. Most estimators are much worse than minimax-optimal. The lower bound is a statement about the best possible estimator, not about typical estimators.
Minimax optimality does not mean the estimator is good for your problem
Minimax theory is worst-case. An estimator that is minimax-optimal may be overly conservative for specific easy instances. Adaptive estimators try to achieve near-minimax rates while also exploiting favorable structure when it is present.
Total variation and KL divergence are different
Le Cam uses total variation distance. Fano uses mutual information (related to KL divergence). They are connected by Pinsker's inequality: $\mathrm{TV}(P, Q) \le \sqrt{\tfrac{1}{2}\mathrm{KL}(P \| Q)}$. But they are not interchangeable. Using the wrong divergence in the wrong method gives incorrect bounds.
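A quick spot-check of Pinsker's inequality for Bernoulli distributions, where both sides have closed forms (the test pairs are arbitrary):

```python
import math

def kl_bernoulli(p, q):
    """KL(Bern(p) || Bern(q)) in nats, for p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# TV between Bern(p) and Bern(q) is simply |p - q|.
for p, q in [(0.5, 0.6), (0.1, 0.3), (0.45, 0.55)]:
    tv = abs(p - q)
    pinsker = math.sqrt(kl_bernoulli(p, q) / 2)
    assert tv <= pinsker
    print(f"TV = {tv:.3f} <= sqrt(KL/2) = {pinsker:.4f}")
```

Note how tight the inequality is for nearby parameters (e.g., $p = 0.45$, $q = 0.55$): Pinsker loses little in the small-separation regime that lower bound constructions live in.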
Summary
- Minimax risk $= \inf_{\hat\theta} \sup_{\theta \in \Theta} \mathbb{E}_\theta[\ell(\hat\theta, \theta)]$: the best worst-case performance
- Lower bounds prove that no estimator can beat a certain rate
- All methods reduce estimation to hypothesis testing: if you cannot test, you cannot estimate
- Le Cam: two hypotheses, uses total variation distance
- Fano: $M$ hypotheses, uses mutual information
- Assouad: $2^d$ hypotheses on a hypercube, decomposes into $d$ binary tests
- A matching upper and lower bound proves an estimator is rate-optimal
Exercises
Problem
Use Le Cam's two-point method to prove a lower bound for estimating the mean of a Bernoulli distribution. Let $X_1, \dots, X_n \sim \mathrm{Bernoulli}(p)$ i.i.d. with $p \in (0, 1)$. Show that for squared error loss, $\inf_{\hat p} \sup_{p} \mathbb{E}\big[(\hat p - p)^2\big] \ge c/n$ for some constant $c > 0$.
Problem
Why does Le Cam give a suboptimal lower bound for high-dimensional problems? Specifically, for estimating a $d$-dimensional Gaussian mean with identity covariance, Le Cam with two points gives a lower bound of order $1/n$, while the true minimax rate for squared $\ell_2$ loss is $d/n$. Where does the factor of $d$ come from?
Related Comparisons
References
Canonical:
- Tsybakov, Introduction to Nonparametric Estimation (2009), Chapter 2 -- the best textbook treatment
- Le Cam, "Convergence of Estimates Under Dimensionality Restrictions" (1973)
- Yu, "Assouad, Fano, and Le Cam" (1997) -- a unified exposition
Current:
- Duchi, Information-Theoretic Methods in Statistics (lecture notes, Stanford) -- modern treatment
- Wainwright, High-Dimensional Statistics (2019), Chapter 15
- Casella & Berger, Statistical Inference (2002), Chapters 5-10
Next Topics
Deepening the lower bound toolkit:
- Fano inequality: the detailed mechanics of the most widely used lower bound technique
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Maximum Likelihood Estimation (Layer 0B)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)
Builds on This
- Fano Inequality (Layer 3)
- Robust Adversarial Policies (Layer 4)