Statistical Estimation
Sufficient Statistics and Exponential Families
Sufficient statistics compress data without losing information about the parameter. The Neyman-Fisher factorization theorem, exponential families, completeness, and Rao-Blackwell improvement of estimators.
Why This Matters
Every time you compute a sample mean and sample variance from Gaussian data, you are using sufficient statistics without realizing it. The sample mean captures all the information the data has about the population mean. You could throw away the original data points and lose nothing.
Sufficient statistics tell you when data compression is lossless for inference. Exponential families are the class of distributions where sufficient statistics take a particularly clean form. These two ideas together explain why so many classical estimators have the structure they do, and they underlie the theoretical guarantees for MLE in parametric models.
Mental Model
You observe data points $X_1, \dots, X_n$ and want to estimate a parameter $\theta$. A sufficient statistic $T(X)$ is a function of the data that captures everything the data can tell you about $\theta$. Given $T(X)$, the conditional distribution of the data does not depend on $\theta$. So $T(X)$ is a lossless summary for the purpose of inference.
The factorization theorem gives a simple test: the statistic $T$ is sufficient if and only if the joint density factors into a piece that depends on $\theta$ only through $T(x)$ and a piece that does not depend on $\theta$ at all.
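A quick numerical illustration of this definition (a Python sketch under an i.i.d. Bernoulli model; the helper name is ours): conditioning on the sum $T = \sum_i X_i$ wipes out all dependence on the parameter $p$.

```python
from itertools import product

def cond_dist(n, t, p):
    """P(X = x | sum(x) = t) for i.i.d. Bernoulli(p) data of length n."""
    seqs = [x for x in product([0, 1], repeat=n) if sum(x) == t]
    probs = [p ** sum(x) * (1 - p) ** (n - sum(x)) for x in seqs]
    z = sum(probs)
    return [q / z for q in probs]

# The conditional distribution is uniform over sequences with t ones,
# and identical for every value of p -- exactly the definition of sufficiency.
print(cond_dist(4, 2, 0.3))
print(cond_dist(4, 2, 0.8))
```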
Formal Setup and Notation
Let $X_1, \dots, X_n$ be i.i.d. from $f(x \mid \theta)$ where $\theta \in \Theta$.
Sufficient Statistic
A statistic $T = T(X_1, \dots, X_n)$ is sufficient for $\theta$ if the conditional distribution of $X_1, \dots, X_n$ given $T$ does not depend on $\theta$:
$$P(X_1 = x_1, \dots, X_n = x_n \mid T = t, \theta) = P(X_1 = x_1, \dots, X_n = x_n \mid T = t)$$
Equivalently, $T$ captures all the information in the sample about $\theta$. Once you know $T$, the remaining randomness in the data is pure noise with respect to $\theta$.
Minimal Sufficient Statistic
A sufficient statistic $T$ is minimal sufficient if it is a function of every other sufficient statistic. That is, for any other sufficient statistic $S$, there exists a function $g$ such that $T = g(S)$. A minimal sufficient statistic achieves the maximum data reduction possible without losing information about $\theta$.
Main Theorems
Neyman-Fisher Factorization Theorem
Statement
A statistic $T(X)$ is sufficient for $\theta$ if and only if the joint density (or pmf) can be factored as:
$$f(x_1, \dots, x_n \mid \theta) = g(T(x_1, \dots, x_n), \theta)\, h(x_1, \dots, x_n)$$
where $g$ depends on the data only through $T(x_1, \dots, x_n)$, and $h$ depends on the data but not on $\theta$.
Intuition
The factorization says the likelihood splits into two parts. The part that depends on $\theta$ sees the data only through $T(x)$. The part that depends on the full data does not care about $\theta$. So for the purpose of learning about $\theta$, $T(x)$ is all you need.
Proof Sketch
(Sufficiency implies factorization): Write $f(x \mid \theta) = P_\theta(T = T(x)) \cdot P_\theta(X = x \mid T = T(x))$. Since $T$ is sufficient, the second factor does not depend on $\theta$. Set $g(T(x), \theta) = P_\theta(T = T(x))$ and $h(x) = P(X = x \mid T = T(x))$.
(Factorization implies sufficiency): If $f(x \mid \theta) = g(T(x), \theta)\, h(x)$, then $P_\theta(X = x \mid T = t) = \frac{f(x \mid \theta)}{\sum_{y : T(y) = t} f(y \mid \theta)}$. The numerator is $g(t, \theta)\, h(x)$ and the denominator is $g(t, \theta) \sum_{y : T(y) = t} h(y)$. The $g(t, \theta)$ factors cancel, giving $h(x) / \sum_{y : T(y) = t} h(y)$, which does not depend on $\theta$.
Why It Matters
The factorization theorem is the practical workhorse for finding sufficient statistics. You write down the likelihood, identify what functions of the data appear in the $\theta$-dependent part, and those functions form a sufficient statistic. For exponential families, this immediately identifies the natural sufficient statistics.
Failure Mode
The factorization must hold for ALL values of $\theta$ simultaneously. A common mistake is to find a factorization that works for one specific value of $\theta$ but not all. Also, the factorization depends on the support of the distribution: if the support depends on $\theta$ (e.g., Uniform$(0, \theta)$), be careful with indicator functions.
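For instance, the Uniform$(0, \theta)$ likelihood factorizes only once the support is written with indicator functions; the $\theta$-dependent factor then sees the data through $T(x) = \max_i x_i$:

```latex
f(x_1, \dots, x_n \mid \theta)
  = \prod_{i=1}^n \frac{1}{\theta}\,\mathbf{1}\{0 \le x_i \le \theta\}
  = \underbrace{\theta^{-n}\,\mathbf{1}\{\max_i x_i \le \theta\}}_{g(T(x),\,\theta)}
    \;\underbrace{\mathbf{1}\{\min_i x_i \ge 0\}}_{h(x)}
```

Forgetting the indicators makes $\theta^{-n}$ look free of the data, which would wrongly suggest that no statistic is needed at all.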
Exponential Families
Exponential Family
A parametric family is an exponential family if the density can be written as:
$$f(x \mid \theta) = h(x) \exp\{\eta(\theta)^\top T(x) - A(\theta)\}$$
where:
- $T(x)$ is the sufficient statistic
- $\eta(\theta)$ is the natural parameter
- $A(\theta)$ is the log-partition function (ensures normalization)
- $h(x)$ is the base measure
When the parameterization uses $\eta$ directly (i.e., $\eta$ is the free parameter), the family is in canonical form: $f(x \mid \eta) = h(x) \exp\{\eta^\top T(x) - A(\eta)\}$.
Most distributions you encounter are exponential families: Gaussian, Bernoulli, Poisson, Exponential, Gamma, Beta, Multinomial, and Wishart. Notable exceptions: the Cauchy distribution, mixture models, and the Uniform distribution.
Key properties of exponential families:
- Sufficient statistics: $T(x)$ (summed over i.i.d. observations) is always sufficient, by the factorization theorem
- MLE is unique when it exists: the log-likelihood is concave in $\eta$ (strictly concave when the family is minimal and of full rank), so there are no local optima. Existence can fail at the boundary of the natural parameter space. Canonical failure cases: all-success or all-failure Bernoulli samples (the MLE for $p$ is $1$ or $0$, so $\hat\eta = \pm\infty$), all-zero Poisson samples ($\hat\lambda = 0$), and separated data in logistic regression. Existence typically requires the observed sufficient statistic to lie in the interior of the convex hull of its support
- Moment-generating properties: $\nabla A(\eta) = E_\eta[T(X)]$ and $\nabla^2 A(\eta) = \operatorname{Var}_\eta[T(X)]$. The log-partition function generates all the cumulants (and hence the moments) of $T(X)$
- Conjugate priors: every exponential family has a natural conjugate prior, making Bayesian inference tractable
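The boundary failure for Bernoulli data can be seen directly from the canonical-form log-likelihood $\ell(\eta) = \eta \sum_i x_i - n \log(1 + e^\eta)$ (a Python sketch; the function name is ours):

```python
import math

def bernoulli_loglik(eta, xs):
    """Canonical-form Bernoulli log-likelihood: eta*sum(x) - n*A(eta), A(eta) = log(1 + e^eta)."""
    return eta * sum(xs) - len(xs) * math.log1p(math.exp(eta))

xs = [1, 1, 1, 1, 1]  # all-success sample: sufficient statistic sits on the boundary
lls = [bernoulli_loglik(eta, xs) for eta in range(0, 10)]

# The log-likelihood keeps increasing in eta, so no maximizer exists in the
# open natural parameter space (p_hat = 1 corresponds to eta_hat = +infinity).
assert all(b > a for a, b in zip(lls, lls[1:]))
```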
Log-Partition Function
The log-partition function ensures normalization:
$$A(\eta) = \log \int h(x) \exp\{\eta^\top T(x)\}\, dx$$
It is always convex in $\eta$ (because it is a log of an integral of exponentials). Its first derivative gives the expected sufficient statistic: $\nabla A(\eta) = E_\eta[T(X)]$. Its second derivative gives the variance: $\nabla^2 A(\eta) = \operatorname{Var}_\eta[T(X)]$, which is the Fisher information in canonical form.
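A numerical sanity check of these identities for the Poisson family, where $A(\eta) = e^\eta$ with $\eta = \log\lambda$, and the mean and variance both equal $\lambda$ (a Python sketch using central finite differences):

```python
import math

def A(eta):
    """Poisson log-partition function: A(eta) = exp(eta), with eta = log(lambda)."""
    return math.exp(eta)

lam = 2.5
eta = math.log(lam)
h = 1e-5
dA = (A(eta + h) - A(eta - h)) / (2 * h)              # numeric A'(eta)
d2A = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h ** 2  # numeric A''(eta)

# For Poisson(lambda): E[X] = Var[X] = lambda, matching A' and A''.
assert abs(dA - lam) < 1e-5
assert abs(d2A - lam) < 1e-3
```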
Completeness
Complete Statistic
A sufficient statistic $T$ is complete if for any function $g$:
$$E_\theta[g(T)] = 0 \ \text{for all } \theta \in \Theta \implies P_\theta(g(T) = 0) = 1 \ \text{for all } \theta \in \Theta$$
Completeness means there is no non-trivial function of $T$ that has mean zero for all $\theta$. In exponential families whose natural parameter space contains an open set, the natural sufficient statistic is always complete.
Completeness matters because it guarantees uniqueness: if $T$ is complete and sufficient, then any unbiased estimator that is a function of $T$ is the unique best unbiased estimator (UMVUE). This connects to the Rao-Blackwell theorem below.
Rao-Blackwell Theorem
Rao-Blackwell Theorem
Statement
Let $\hat\theta$ be any unbiased estimator of $\theta$ and let $T$ be a sufficient statistic. Define:
$$\tilde\theta = E[\hat\theta \mid T]$$
Then $\tilde\theta$ is:
- A function of $T$ alone (not of the full data)
- Unbiased for $\theta$
- At least as good as $\hat\theta$: $\operatorname{Var}_\theta(\tilde\theta) \le \operatorname{Var}_\theta(\hat\theta)$ for all $\theta$, with equality only if $\hat\theta$ is already a function of $T$.
Intuition
Conditioning on a sufficient statistic can only help (or at least not hurt) estimation. The sufficient statistic contains all the information about $\theta$. Any remaining randomness in $\hat\theta$ beyond what $T$ captures is pure noise. Conditioning on $T$ averages out this noise, reducing variance while preserving unbiasedness. Sufficiency is also what makes $\tilde\theta$ a legitimate estimator in the first place: it guarantees that the conditional expectation $E[\hat\theta \mid T]$ does not involve the unknown $\theta$.
Proof Sketch
Unbiasedness: $E[\tilde\theta] = E[E[\hat\theta \mid T]] = E[\hat\theta] = \theta$ by the tower property.
Variance reduction: by the law of total variance, $\operatorname{Var}(\hat\theta) = E[\operatorname{Var}(\hat\theta \mid T)] + \operatorname{Var}(E[\hat\theta \mid T]) = E[\operatorname{Var}(\hat\theta \mid T)] + \operatorname{Var}(\tilde\theta)$.
Since $E[\operatorname{Var}(\hat\theta \mid T)] \ge 0$, we get $\operatorname{Var}(\tilde\theta) \le \operatorname{Var}(\hat\theta)$.
Why It Matters
Rao-Blackwell says: never ignore a sufficient statistic. If you have any unbiased estimator, you can improve it (or at least not hurt it) by conditioning on a sufficient statistic. Combined with completeness, this gives the Lehmann-Scheffé theorem: if $T$ is complete and sufficient, then $\tilde\theta = E[\hat\theta \mid T]$ is the unique minimum-variance unbiased estimator (UMVUE).
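A classic illustration (a Monte Carlo sketch; the construction is ours, not from the source): to estimate $e^{-\lambda} = P(X = 0)$ from Poisson data, start with the crude unbiased estimator $\mathbf{1}\{X_1 = 0\}$ and condition on the sufficient statistic $S = \sum_i X_i$, which gives $E[\mathbf{1}\{X_1 = 0\} \mid S = s] = ((n-1)/n)^s$.

```python
import math
import random

random.seed(0)

def poisson(lam):
    """Knuth's product-of-uniforms Poisson sampler (fine for small lambda)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

lam, n, reps = 2.0, 10, 20000
naive, rb = [], []
for _ in range(reps):
    xs = [poisson(lam) for _ in range(n)]
    s = sum(xs)
    naive.append(1.0 if xs[0] == 0 else 0.0)  # crude unbiased estimate of e^{-lambda}
    rb.append(((n - 1) / n) ** s)             # E[naive | S = s]: Rao-Blackwellized

def var(a):
    m = sum(a) / len(a)
    return sum((x - m) ** 2 for x in a) / len(a)

# Both estimators are unbiased for e^{-lambda} ~ 0.135;
# conditioning on the sufficient statistic slashes the variance.
assert var(rb) < var(naive)
```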
Failure Mode
Rao-Blackwell improves unbiased estimators, but unbiasedness itself is not always desirable. Biased estimators (like the James-Stein estimator or ridge regression) can have lower MSE. The Rao-Blackwell theorem operates within the class of unbiased estimators and cannot compare across that boundary.
Canonical Examples
Sufficient statistic for Gaussian mean
Let $X_1, \dots, X_n \sim \mathcal{N}(\mu, \sigma^2)$ with known $\sigma^2$. The joint density is:
$$f(x_1, \dots, x_n \mid \mu) = (2\pi\sigma^2)^{-n/2} \exp\left\{-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2\right\}$$
Expanding the square: $\sum_i (x_i - \mu)^2 = \sum_i x_i^2 - 2\mu \sum_i x_i + n\mu^2$.
By factorization: $g(T, \mu) = \exp\{(\mu/\sigma^2) \sum_i x_i - n\mu^2/(2\sigma^2)\}$ and $h(x) = (2\pi\sigma^2)^{-n/2} \exp\{-\sum_i x_i^2/(2\sigma^2)\}$, where $T(x) = \sum_i x_i$. The sample mean $\bar{X}$ is sufficient for $\mu$. This is an exponential family with natural parameter $\eta = \mu/\sigma^2$ and sufficient statistic $T(x) = \sum_i x_i$.
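One way to see this sufficiency concretely (a Python sketch with made-up data): two samples of the same size with the same $\sum_i x_i$ produce identical likelihood ratios between any two values of $\mu$, even though their raw likelihoods differ through $h(x)$.

```python
import math

def gauss_loglik(mu, xs, sigma2=1.0):
    """Gaussian log-likelihood with known variance sigma2."""
    return sum(-0.5 * math.log(2 * math.pi * sigma2)
               - (x - mu) ** 2 / (2 * sigma2) for x in xs)

# Two different samples with the same size and the same sum (= 6.0):
a = [1.0, 2.0, 3.0]
b = [0.0, 2.0, 4.0]

# Log-likelihood *differences* across mu agree exactly: the mu-dependent
# factor g(T, mu) sees the data only through T = sum(x).
r_a = gauss_loglik(0.7, a) - gauss_loglik(-1.3, a)
r_b = gauss_loglik(0.7, b) - gauss_loglik(-1.3, b)
assert abs(r_a - r_b) < 1e-9
```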
Exponential family form of the Poisson distribution
$P(X = x \mid \lambda) = \frac{\lambda^x e^{-\lambda}}{x!} = \frac{1}{x!} \exp\{x \log\lambda - \lambda\}$.
This is an exponential family with $\eta = \log\lambda$, $T(x) = x$, $A(\eta) = e^\eta = \lambda$, and $h(x) = 1/x!$. For $n$ i.i.d. observations, $\sum_{i=1}^n X_i$ is sufficient for $\lambda$.
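Rewriting the pmf in this form is easy to verify numerically (a Python sketch; function names are ours):

```python
import math

def poisson_pmf(x, lam):
    """Standard Poisson pmf: lambda^x e^{-lambda} / x!."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

def poisson_expfam(x, lam):
    """Exponential family form: h(x) * exp(eta * T(x) - A(eta)),
    with eta = log(lambda), T(x) = x, A(eta) = exp(eta), h(x) = 1/x!."""
    eta = math.log(lam)
    return (1 / math.factorial(x)) * math.exp(eta * x - math.exp(eta))

# The two parameterizations agree to floating-point precision.
for x in range(8):
    assert abs(poisson_pmf(x, 3.2) - poisson_expfam(x, 3.2)) < 1e-9
```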
Common Confusions
Sufficient does not mean minimal sufficient
The entire data vector $X = (X_1, \dots, X_n)$ is always trivially sufficient (the identity is a sufficient statistic). The interesting question is how much you can compress. Minimal sufficiency gives the maximum compression. For exponential families with a $k$-dimensional natural parameter, the minimal sufficient statistic is $k$-dimensional, regardless of sample size $n$.
Not all distributions are exponential families
Mixture distributions are not exponential families (the dimension of a minimal sufficient statistic grows with $n$). The Cauchy distribution is not an exponential family. The Uniform$(0, \theta)$ is not either (because the support depends on $\theta$). When you are outside exponential families, the clean theory of sufficient statistics and conjugate priors does not apply as neatly.
Summary
- A statistic $T$ is sufficient if the conditional distribution of the data given $T$ does not depend on $\theta$
- Factorization theorem: $f(x \mid \theta) = g(T(x), \theta)\, h(x)$ characterizes sufficiency
- Exponential families: $f(x \mid \theta) = h(x) \exp\{\eta(\theta)^\top T(x) - A(\theta)\}$
- The log-partition function generates moments of $T$: $\nabla A(\eta) = E_\eta[T]$, $\nabla^2 A(\eta) = \operatorname{Var}_\eta[T]$
- Completeness + sufficiency gives uniqueness of UMVUE
- Rao-Blackwell: condition on a sufficient statistic to improve any unbiased estimator
Exercises
Problem
Find the sufficient statistic for $p$ in the Bernoulli model: $X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} \operatorname{Bernoulli}(p)$. Write the joint pmf in exponential family form and identify the natural parameter, sufficient statistic, and log-partition function.
Problem
Let $X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} \operatorname{Uniform}(0, \theta)$. Show that $T = \max_i X_i$ is sufficient for $\theta$ but that this is not an exponential family. Why does this matter for the MLE?
Problem
Prove that in a $k$-parameter exponential family whose natural parameter space contains an open set, the natural sufficient statistic $T(X)$ is complete. Why does this, combined with Rao-Blackwell, imply that any unbiased estimator that is a function of $T(X)$ is UMVUE?
References
Canonical:
- Casella & Berger, Statistical Inference (2nd ed., 2002), Chapters 6-7
- Lehmann & Casella, Theory of Point Estimation (2nd ed., 1998), Chapters 1-4
- Keener, Theoretical Statistics (2010), Chapters 3-4
Current:
- Wasserman, All of Statistics (2004), Chapter 9
- Wainwright & Jordan, "Graphical Models, Exponential Families, and Variational Inference" (2008)
- van der Vaart, Asymptotic Statistics (1998), Chapters 2-8
Next Topics
Building on sufficient statistics and exponential families:
- Fisher information: the curvature of the log-likelihood, equal to the second derivative of the log-partition function in canonical exponential families
- Hypothesis testing for ML: using sufficient statistics to construct optimal tests
- EM algorithm: exploiting exponential family structure for latent variable models
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)
Builds on This
- Basu's Theorem (Layer 0B)
- Rao-Blackwellization (Layer 2)