
Statistical Estimation

Sufficient Statistics and Exponential Families

Sufficient statistics compress data without losing information about the parameter. This section covers the Neyman-Fisher factorization theorem, exponential families, completeness, and Rao-Blackwell improvement of estimators.

Core · Tier 2 · Stable · ~60 min

Why This Matters

Every time you compute a sample mean and sample variance from Gaussian data, you are using sufficient statistics without realizing it. The sample mean captures all the information the data has about the population mean. You could throw away the original data points and lose nothing.

Sufficient statistics tell you when data compression is lossless for inference. Exponential families are the class of distributions where sufficient statistics take a particularly clean form. These two ideas together explain why so many classical estimators have the structure they do, and they underlie the theoretical guarantees for MLE in parametric models.

Mental Model

You observe $n$ data points and want to estimate $\theta$. A sufficient statistic $T(X)$ is a function of the data that captures everything the data can tell you about $\theta$. Given $T(X)$, the conditional distribution of the data does not depend on $\theta$. So $T(X)$ is a lossless summary for the purpose of inference.

The factorization theorem gives a simple test: the statistic $T(X)$ is sufficient if and only if the joint density factors into a piece that depends on $\theta$ only through $T$ and a piece that does not depend on $\theta$ at all.

Formal Setup and Notation

Let $X = (X_1, \ldots, X_n)$ be i.i.d. from $p(x | \theta)$ where $\theta \in \Theta$.

Definition

Sufficient Statistic

A statistic $T(X)$ is sufficient for $\theta$ if the conditional distribution of $X$ given $T(X)$ does not depend on $\theta$:

$$p(X | T(X) = t, \theta) = p(X | T(X) = t) \quad \text{for all } \theta$$

Equivalently, $T(X)$ captures all the information in $X$ about $\theta$. Once you know $T(X)$, the remaining randomness in $X$ is pure noise with respect to $\theta$.
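
To make the definition concrete, here is a small Python sketch (my own illustration, not from the text): for i.i.d. Bernoulli data it enumerates all outcomes and checks that the conditional distribution given $T = \sum_i X_i$ is the same for two different parameter values, namely uniform over arrangements with the same sum.

```python
# Illustrative sketch (not from the text): for i.i.d. Bernoulli(theta) data,
# the conditional law of X given T = sum(X) is uniform over arrangements
# and free of theta.
from itertools import product
from math import comb

def cond_dist(theta, n=4):
    """Conditional pmf p(x | T = sum(x)) for n i.i.d. Bernoulli(theta) draws."""
    joint = {x: theta**sum(x) * (1 - theta)**(n - sum(x))
             for x in product((0, 1), repeat=n)}
    p_T = {}
    for x, p in joint.items():
        p_T[sum(x)] = p_T.get(sum(x), 0.0) + p
    return {x: joint[x] / p_T[sum(x)] for x in joint}

d_low, d_high = cond_dist(0.3), cond_dist(0.8)
for x in d_low:
    assert abs(d_low[x] - d_high[x]) < 1e-12            # same for both thetas
    assert abs(d_low[x] - 1 / comb(4, sum(x))) < 1e-12  # uniform given T
```

Once $T$ is fixed, the only remaining question is which arrangement occurred, and that question carries no information about $\theta$.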

Definition

Minimal Sufficient Statistic

A sufficient statistic $T$ is minimal sufficient if it is a function of every other sufficient statistic. That is, for any other sufficient statistic $U$, there exists a function $g$ such that $T = g(U)$. A minimal sufficient statistic achieves the maximum data reduction possible without losing information about $\theta$.

Main Theorems

Theorem

Neyman-Fisher Factorization Theorem

Statement

A statistic $T(X)$ is sufficient for $\theta$ if and only if the joint density (or pmf) can be factored as:

$$p(x_1, \ldots, x_n | \theta) = g(T(x), \theta) \cdot h(x)$$

where $g$ depends on the data only through $T(x)$, and $h$ depends on the data but not on $\theta$.

Intuition

The factorization says the likelihood splits into two parts. The part that depends on $\theta$ sees the data only through $T$. The part that depends on the full data does not care about $\theta$. So for the purpose of learning about $\theta$, $T$ is all you need.

Proof Sketch

(Sufficiency implies factorization): Write $p(x | \theta) = p(x | T(x), \theta) \cdot p(T(x) | \theta)$. Since $T$ is sufficient, $p(x | T(x), \theta) = p(x | T(x)) = h(x)$. Set $g(T(x), \theta) = p(T(x) | \theta)$.

(Factorization implies sufficiency): If $p(x | \theta) = g(T(x), \theta) \cdot h(x)$, then $p(x | T(x) = t, \theta) = p(x | \theta) / p(T(x) = t | \theta)$. The numerator is $g(t, \theta) h(x)$ and the denominator is $g(t, \theta) \sum_{x': T(x')=t} h(x')$. The factors $g(t, \theta)$ cancel, giving $h(x) / \sum_{x': T(x')=t} h(x')$, which does not depend on $\theta$.

Why It Matters

The factorization theorem is the practical workhorse for finding sufficient statistics. You write down the likelihood, identify what functions of the data appear in the $\theta$-dependent part, and those functions form a sufficient statistic. For exponential families, this immediately identifies the natural sufficient statistics.

Failure Mode

The factorization must hold for ALL values of $\theta$ simultaneously. A common mistake is to find a factorization that works for one specific $\theta$ value but not all. Also, the factorization depends on the support of the distribution: if the support depends on $\theta$ (e.g., Uniform$(0, \theta)$), be careful with indicator functions.

Exponential Families

Definition

Exponential Family

A parametric family is an exponential family if the density can be written as:

$$p(x | \theta) = h(x) \exp\!\left(\eta(\theta)^\top T(x) - A(\theta)\right)$$

where:

  • $T(x) \in \mathbb{R}^k$ is the sufficient statistic
  • $\eta(\theta) \in \mathbb{R}^k$ is the natural parameter
  • $A(\theta)$ is the log-partition function (ensures normalization)
  • $h(x) \geq 0$ is the base measure

When the parameterization uses $\eta$ directly (i.e., $\eta$ is the free parameter), the family is in canonical form: $p(x | \eta) = h(x) \exp(\eta^\top T(x) - A(\eta))$.

Most distributions you encounter are exponential families: Gaussian, Bernoulli, Poisson, Exponential, Gamma, Beta, Multinomial, and Wishart. Notable exceptions: the Cauchy distribution, mixture models, and the Uniform$(0, \theta)$ distribution.

Key properties of exponential families:

  1. Sufficient statistics: $T(X)$ is always sufficient (by factorization)
  2. MLE is unique when it exists: the log-likelihood is concave in $\eta$ (strictly concave when the family is minimal and of full rank), so there are no local optima. Existence can fail at the boundary of the natural parameter space. Canonical failure cases: all-success or all-failure Bernoulli samples (the MLE for $\eta = \text{logit}(p)$ is $\pm\infty$), all-zero Poisson samples ($\eta = \log\lambda = -\infty$), and separated data in logistic regression. Existence typically requires the observed sufficient statistic to lie in the interior of the convex hull of its support
  3. Moment-generating properties: $\mathbb{E}[T(X)] = \nabla_\eta A(\eta)$ and $\text{Cov}(T(X)) = \nabla^2_\eta A(\eta)$. The log-partition function generates all the moments of $T$
  4. Conjugate priors: every exponential family has a natural conjugate prior, making Bayesian inference tractable
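
The boundary failure in point 2 can be seen directly. A short sketch (my own illustration, assuming the canonical Bernoulli parameterization): with an all-success sample, the canonical log-likelihood $\ell(\eta) = t\eta - n\log(1 + e^\eta)$ has $t = n$ and is strictly increasing in $\eta$, so no finite maximizer exists.

```python
# Sketch (illustrative, not from the text): canonical Bernoulli log-likelihood
# with an all-success sample (t = n). l(eta) = t*eta - n*log(1 + e^eta) is
# then strictly increasing, so the MLE for eta runs off to +infinity.
from math import exp, log

def loglik(eta, t, n):
    """Canonical Bernoulli log-likelihood for observed sufficient statistic t."""
    return t * eta - n * log(1.0 + exp(eta))

n = 10
vals = [loglik(eta, t=n, n=n) for eta in (0.0, 2.0, 5.0, 10.0)]
assert all(a < b for a, b in zip(vals, vals[1:]))  # monotone increasing
assert vals[-1] < 0.0  # sup is 0 (p -> 1) but never attained at finite eta
```

The observed sufficient statistic $t = n$ sits on the boundary of the convex hull of $\{0, 1, \ldots, n\}$, exactly the condition under which existence fails.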

Definition

Log-Partition Function

The log-partition function ensures normalization:

$$A(\eta) = \log \int h(x) \exp(\eta^\top T(x)) \, dx$$

It is always convex in $\eta$ (because it is a log of an integral of exponentials). Its first derivative gives the expected sufficient statistic: $\nabla A(\eta) = \mathbb{E}_\eta[T(X)]$. Its second derivative gives the covariance: $\nabla^2 A(\eta) = \text{Cov}_\eta(T(X))$, which is the Fisher information in canonical form.
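
These derivative identities can be checked numerically. A quick sketch (my own, not from the text) for the Poisson family, where $A(\eta) = e^\eta$: finite differences of $A$ recover the mean and variance of $T(X) = X$, both equal to $\lambda$.

```python
# Numerical sketch: for the Poisson family in canonical form, A(eta) = e^eta.
# Finite differences of A recover E[T] = lambda and Var(T) = lambda.
from math import exp

A = exp                     # Poisson log-partition function
eta, h = 0.7, 1e-5
lam = exp(eta)              # the mean parameter lambda

mean_T = (A(eta + h) - A(eta - h)) / (2 * h)               # ~ A'(eta)
var_T = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2      # ~ A''(eta)

assert abs(mean_T - lam) < 1e-6   # E[T] = lambda
assert abs(var_T - lam) < 1e-4    # Var(T) = lambda
```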

Completeness

Definition

Complete Statistic

A sufficient statistic $T$ is complete if for any function $g$:

$$\mathbb{E}_\theta[g(T)] = 0 \text{ for all } \theta \implies g(T) = 0 \text{ a.s.}$$

Completeness means there is no non-trivial function of $T$ that has mean zero for all $\theta$. In exponential families with natural parameter space containing an open set, the natural sufficient statistic is always complete.

Completeness matters because it guarantees uniqueness: if $T$ is complete and sufficient, then any unbiased estimator based on $T$ is the unique best unbiased estimator (UMVUE). This connects to the Rao-Blackwell theorem below.

Rao-Blackwell Theorem

Theorem

Rao-Blackwell Theorem

Statement

Let $U$ be any unbiased estimator of $\tau(\theta)$ and let $T$ be a sufficient statistic. Define:

$$\tilde{U} = \mathbb{E}[U | T]$$

Then $\tilde{U}$ is:

  1. A function of $T$ alone (not of the full data)
  2. Unbiased for $\tau(\theta)$
  3. At least as good as $U$: $\text{Var}_\theta(\tilde{U}) \leq \text{Var}_\theta(U)$ for all $\theta$, with equality only if $U$ is already a function of $T$.

Intuition

Conditioning on a sufficient statistic can only help (or at least not hurt) estimation. The sufficient statistic contains all the information about $\theta$. Any remaining randomness in $U$ beyond what $T$ captures is pure noise. Conditioning on $T$ averages out this noise, reducing variance while preserving unbiasedness.

Proof Sketch

Unbiasedness: $\mathbb{E}[\tilde{U}] = \mathbb{E}[\mathbb{E}[U|T]] = \mathbb{E}[U] = \tau(\theta)$ by the tower property.

Variance reduction: by the law of total variance, $\text{Var}(U) = \mathbb{E}[\text{Var}(U|T)] + \text{Var}(\mathbb{E}[U|T]) = \mathbb{E}[\text{Var}(U|T)] + \text{Var}(\tilde{U})$.

Since $\mathbb{E}[\text{Var}(U|T)] \geq 0$, we get $\text{Var}(U) \geq \text{Var}(\tilde{U})$.

Why It Matters

Rao-Blackwell says: never ignore a sufficient statistic. If you have any unbiased estimator, you can improve it (or at least not hurt it) by conditioning on a sufficient statistic. Combined with completeness, this gives the Lehmann-Scheffé theorem: if $T$ is complete and sufficient, then $\mathbb{E}[U|T]$ is the unique minimum-variance unbiased estimator (UMVUE).
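
The variance reduction can be seen in a small simulation (my own sketch, not from the text): estimating $\tau(\lambda) = e^{-\lambda} = P(X = 0)$ in a Poisson model, starting from the crude unbiased estimator $U = \mathbf{1}\{X_1 = 0\}$. Given $T = \sum_i X_i = t$, $X_1 \sim \text{Binomial}(t, 1/n)$, so the Rao-Blackwellized estimator is $\mathbb{E}[U|T] = ((n-1)/n)^T$.

```python
# Simulation sketch: Rao-Blackwellizing U = 1{X1 = 0} for tau = e^{-lambda}
# in a Poisson model. Both estimators are unbiased; the conditioned one
# has visibly lower variance.
import random
from math import exp
from statistics import mean, pvariance

random.seed(0)
lam, n, reps = 2.0, 10, 20000

def poisson(lam):
    # Knuth's multiplication method; fine for small lambda
    L, k, p = exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p < L:
            return k
        k += 1

U, RB = [], []
for _ in range(reps):
    x = [poisson(lam) for _ in range(n)]
    U.append(1.0 if x[0] == 0 else 0.0)       # crude unbiased estimator
    RB.append(((n - 1) / n) ** sum(x))        # E[U | T] = ((n-1)/n)^T

assert abs(mean(U) - exp(-lam)) < 0.02        # both unbiased for e^{-lambda}
assert abs(mean(RB) - exp(-lam)) < 0.02
assert pvariance(RB) < pvariance(U)           # Rao-Blackwell reduces variance
```

Note that $((n-1)/n)^T$ depends on the data only through the sufficient statistic $T$, exactly as the theorem promises.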

Failure Mode

Rao-Blackwell improves unbiased estimators, but unbiasedness itself is not always desirable. Biased estimators (like the James-Stein estimator or ridge regression) can have lower MSE. The Rao-Blackwell theorem operates within the class of unbiased estimators and cannot compare across that boundary.

Canonical Examples

Example

Sufficient statistic for Gaussian mean

Let $X_1, \ldots, X_n \sim \mathcal{N}(\mu, \sigma^2)$ with known $\sigma^2$. The joint density is:

$$p(x|\mu) = (2\pi\sigma^2)^{-n/2} \exp\!\left(-\frac{1}{2\sigma^2}\sum_i(x_i - \mu)^2\right)$$

Expanding the square: $\sum_i(x_i - \mu)^2 = \sum_i x_i^2 - 2\mu \sum_i x_i + n\mu^2$.

By factorization: $g(T, \mu) = \exp\!\left(-\frac{1}{2\sigma^2}(-2\mu n\bar{x} + n\mu^2)\right)$ where $T = \bar{X} = \frac{1}{n}\sum_i X_i$, and the $\mu$-free terms are absorbed into $h(x)$. The sample mean is sufficient for $\mu$. This is an exponential family with natural parameter $\eta = \mu/\sigma^2$ and sufficient statistic $T = \sum_i x_i$.
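
A small Python check (my own sketch, not from the text): two datasets with the same sample mean have a likelihood ratio that is constant in $\mu$, which is exactly what sufficiency of $\bar{X}$ predicts, since the $\mu$-dependent factor of the likelihood sees the data only through $\bar{x}$.

```python
# Sketch: with sigma^2 known, two datasets sharing the same sample mean have
# likelihoods that differ only by a mu-free factor, so their log-likelihood
# difference is constant in mu -- the sample mean is sufficient.
from math import log, pi

def loglik(xs, mu, sigma2=1.0):
    """Gaussian log-likelihood with known variance sigma2."""
    n = len(xs)
    return (-0.5 * n * log(2 * pi * sigma2)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma2))

a = [0.0, 1.0, 2.0]   # sample mean 1.0
b = [0.5, 1.0, 1.5]   # same mean, different data
diffs = [loglik(a, mu) - loglik(b, mu) for mu in (-3.0, 0.0, 1.0, 4.0)]
assert max(diffs) - min(diffs) < 1e-9   # constant in mu
```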

Example

Exponential family form of the Poisson distribution

$$p(x | \lambda) = \frac{\lambda^x e^{-\lambda}}{x!} = \frac{1}{x!} \exp(x \log\lambda - \lambda)$$

This is an exponential family with $T(x) = x$, $\eta = \log\lambda$, $A(\eta) = e^\eta = \lambda$, and $h(x) = 1/x!$. For $n$ i.i.d. observations, $T = \sum_i X_i$ is sufficient for $\lambda$.
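
As a sanity check (my own sketch, not from the text), the canonical form $h(x)\exp(\eta x - A(\eta))$ with these ingredients reproduces the Poisson pmf exactly:

```python
# Sketch: the canonical form h(x) exp(eta*x - A(eta)) with eta = log(lambda),
# A(eta) = e^eta, h(x) = 1/x! matches the textbook Poisson pmf.
from math import exp, log, factorial

def pois_pmf(x, lam):
    return lam**x * exp(-lam) / factorial(x)

def expfam_pmf(x, lam):
    eta = log(lam)                                   # natural parameter
    return (1 / factorial(x)) * exp(eta * x - exp(eta))  # h(x) e^{eta x - A(eta)}

for x in range(8):
    assert abs(pois_pmf(x, 2.5) - expfam_pmf(x, 2.5)) < 1e-12
```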

Common Confusions

Watch Out

Sufficient does not mean minimal sufficient

The entire data vector $X = (X_1, \ldots, X_n)$ is always trivially sufficient (the identity is a sufficient statistic). The interesting question is how much you can compress. Minimal sufficiency gives the maximum compression. For exponential families with a $k$-dimensional natural parameter, the minimal sufficient statistic is $k$-dimensional, regardless of sample size $n$.

Watch Out

Not all distributions are exponential families

Mixture distributions are not exponential families (the sufficient statistic dimension grows with $n$). The Cauchy distribution is not an exponential family. Uniform$(0, \theta)$ is not either (because the support depends on $\theta$). When you are outside exponential families, the clean theory of sufficient statistics and conjugate priors does not apply as neatly.

Summary

  • A statistic $T(X)$ is sufficient if the conditional distribution of $X$ given $T$ does not depend on $\theta$
  • Factorization theorem: $p(x|\theta) = g(T(x), \theta) \cdot h(x)$ characterizes sufficiency
  • Exponential families: $p(x|\theta) = h(x) \exp(\eta(\theta)^\top T(x) - A(\theta))$
  • The log-partition function $A(\eta)$ generates moments of $T$: $\mathbb{E}[T] = \nabla A$, $\text{Cov}(T) = \nabla^2 A$
  • Completeness + sufficiency gives uniqueness of the UMVUE
  • Rao-Blackwell: condition on a sufficient statistic to improve any unbiased estimator

Exercises

Exercise (Core)

Problem

Find the sufficient statistic for $\theta$ in the Bernoulli model: $X_1, \ldots, X_n \sim \text{Bernoulli}(\theta)$. Write the joint pmf in exponential family form and identify the natural parameter, sufficient statistic, and log-partition function.

Exercise (Advanced)

Problem

Let $X_1, \ldots, X_n \sim \text{Uniform}(0, \theta)$. Show that $T = X_{(n)} = \max_i X_i$ is sufficient for $\theta$ but that this is not an exponential family. Why does this matter for the MLE?

Exercise (Research)

Problem

Prove that in a $k$-parameter exponential family whose natural parameter space contains an open set, the natural sufficient statistic $T(X) = \sum_{i=1}^n T(X_i)$ is complete. Why does this, combined with Rao-Blackwell, imply that any unbiased estimator based on $T$ is the UMVUE?

References

Canonical:

  • Casella & Berger, Statistical Inference (2nd ed., 2002), Chapters 6-7
  • Lehmann & Casella, Theory of Point Estimation (2nd ed., 1998), Chapters 1-4
  • Keener, Theoretical Statistics (2010), Chapters 3-4

Current:

  • Wasserman, All of Statistics (2004), Chapter 9
  • Wainwright & Jordan, "Graphical Models, Exponential Families, and Variational Inference" (2008)
  • van der Vaart, Asymptotic Statistics (1998), Chapters 2-8

Next Topics

Building on sufficient statistics and exponential families:

  • Fisher information: the curvature of the log-likelihood, directly related to the log-partition function in exponential families
  • Hypothesis testing for ML: using sufficient statistics to construct optimal tests
  • EM algorithm: exploiting exponential family structure for latent variable models

Last reviewed: April 2026
