
Statistical Estimation

Sufficient Statistics and Exponential Families

Sufficient statistics compress data without losing information about the parameter. This section covers the Neyman-Fisher factorization theorem, exponential families, completeness, and Rao-Blackwell improvement of estimators.

Core · Tier 2 · Stable · ~60 min

Why This Matters

Every time you compute a sample mean and sample variance from Gaussian data, you are using sufficient statistics without realizing it. The sample mean captures all the information the data has about the population mean. You could throw away the original data points and lose nothing.

Sufficient statistics tell you when data compression is lossless for inference. Exponential families are the class of distributions where sufficient statistics take a particularly clean form. These two ideas together explain why so many classical estimators have the structure they do, and they underlie the theoretical guarantees for MLE in parametric models.

Mental Model

You observe $n$ data points and want to estimate $\theta$. A sufficient statistic $T(X)$ is a function of the data that captures everything the data can tell you about $\theta$. Given $T(X)$, the conditional distribution of the data does not depend on $\theta$. So $T(X)$ is a lossless summary for the purpose of inference.

The factorization theorem gives a simple test: the statistic $T(X)$ is sufficient if and only if the joint density factors into a piece that depends on $\theta$ only through $T$ and a piece that does not depend on $\theta$ at all.

Formal Setup and Notation

Let $X = (X_1, \ldots, X_n)$ be i.i.d. from $p(x | \theta)$ where $\theta \in \Theta$.

Definition

Sufficient Statistic

A statistic $T(X)$ is sufficient for $\theta$ if the conditional distribution of $X$ given $T(X)$ does not depend on $\theta$:

$$p(X | T(X) = t, \theta) = p(X | T(X) = t) \quad \text{for all } \theta$$

Equivalently, $T(X)$ captures all the information in $X$ about $\theta$. Once you know $T(X)$, the remaining randomness in $X$ is pure noise with respect to $\theta$.
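
To make the definition concrete, here is a small Python sketch (my own illustration, not from the text): for i.i.d. Bernoulli data it enumerates all outcomes and checks that the conditional distribution given $T = \sum_i X_i$ is the same for two different parameter values, namely uniform over arrangements with the same sum.

```python
# Illustrative sketch (not from the text): for i.i.d. Bernoulli(theta) data,
# the conditional law of X given T = sum(X) is uniform over arrangements
# and free of theta.
from itertools import product
from math import comb

def cond_dist(theta, n=4):
    """Conditional pmf p(x | T = sum(x)) for n i.i.d. Bernoulli(theta) draws."""
    joint = {x: theta**sum(x) * (1 - theta)**(n - sum(x))
             for x in product((0, 1), repeat=n)}
    p_T = {}
    for x, p in joint.items():
        p_T[sum(x)] = p_T.get(sum(x), 0.0) + p
    return {x: joint[x] / p_T[sum(x)] for x in joint}

d_low, d_high = cond_dist(0.3), cond_dist(0.8)
for x in d_low:
    assert abs(d_low[x] - d_high[x]) < 1e-12            # same for both thetas
    assert abs(d_low[x] - 1 / comb(4, sum(x))) < 1e-12  # uniform given T
```

Once $T$ is fixed, the only remaining question is which arrangement occurred, and that question carries no information about $\theta$.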

Definition

Minimal Sufficient Statistic

A sufficient statistic $T$ is minimal sufficient if it is a function of every other sufficient statistic. That is, for any other sufficient statistic $U$, there exists a function $g$ such that $T = g(U)$. A minimal sufficient statistic achieves the maximum data reduction possible without losing information about $\theta$.

Main Theorems

Theorem

Neyman-Fisher Factorization Theorem

Statement

A statistic $T(X)$ is sufficient for $\theta$ if and only if the joint density (or pmf) can be factored as:

$$p(x_1, \ldots, x_n | \theta) = g(T(x), \theta) \cdot h(x)$$

where $g$ depends on the data only through $T(x)$, and $h$ depends on the data but not on $\theta$.

Intuition

The factorization says the likelihood splits into two parts. The part that depends on $\theta$ sees the data only through $T$. The part that depends on the full data does not care about $\theta$. So for the purpose of learning about $\theta$, $T$ is all you need.

Proof Sketch

(Sufficiency implies factorization): Write $p(x | \theta) = p(x | T(x), \theta) \cdot p(T(x) | \theta)$. Since $T$ is sufficient, $p(x | T(x), \theta) = p(x | T(x)) = h(x)$. Set $g(T(x), \theta) = p(T(x) | \theta)$.

(Factorization implies sufficiency): If $p(x | \theta) = g(T(x), \theta) \cdot h(x)$, then $p(x | T(x) = t, \theta) = p(x | \theta) / p(T(x) = t | \theta)$. The numerator is $g(t, \theta) h(x)$ and the denominator is $g(t, \theta) \sum_{x': T(x')=t} h(x')$. The factors $g(t, \theta)$ cancel, giving $h(x) / \sum_{x': T(x')=t} h(x')$, which does not depend on $\theta$.

Why It Matters

The factorization theorem is the practical workhorse for finding sufficient statistics. You write down the likelihood, identify what functions of the data appear in the $\theta$-dependent part, and those functions form a sufficient statistic. For exponential families, this immediately identifies the natural sufficient statistics.

Failure Mode

The factorization must hold for ALL values of $\theta$ simultaneously. A common mistake is to find a factorization that works for one specific $\theta$ value but not all. Also, the factorization depends on the support of the distribution: if the support depends on $\theta$ (e.g., Uniform$(0, \theta)$), be careful with indicator functions.

Exponential Families

Definition

Exponential Family

A parametric family is an exponential family if the density can be written as:

$$p(x | \theta) = h(x) \exp\!\left(\eta(\theta)^\top T(x) - A(\theta)\right)$$

where:

  • $T(x) \in \mathbb{R}^k$ is the sufficient statistic
  • $\eta(\theta) \in \mathbb{R}^k$ is the natural parameter
  • $A(\theta)$ is the log-partition function (ensures normalization)
  • $h(x) \geq 0$ is the base measure

When the parameterization uses $\eta$ directly (i.e., $\eta$ is the free parameter), the family is in canonical form: $p(x | \eta) = h(x) \exp(\eta^\top T(x) - A(\eta))$.

Most distributions you encounter are exponential families: Gaussian, Bernoulli, Poisson, Exponential, Gamma, Beta, Multinomial, and Wishart. Notable exceptions: the Cauchy distribution, mixture models, and the Uniform$(0, \theta)$ distribution.

Key properties of exponential families:

  1. Sufficient statistics: $T(X)$ is always sufficient (by factorization)
  2. MLE is unique when it exists: the log-likelihood is concave in $\eta$ (strictly concave when the family is minimal and of full rank), so there are no local optima. Existence can fail at the boundary of the natural parameter space. Canonical failure cases: all-success or all-failure Bernoulli samples (the MLE for $\eta = \text{logit}(p)$ is $\pm\infty$), all-zero Poisson samples ($\eta = \log\lambda = -\infty$), and separated data in logistic regression. Existence typically requires the observed sufficient statistic to lie in the interior of the convex hull of its support
  3. Moment-generating properties: $\mathbb{E}[T(X)] = \nabla_\eta A(\eta)$ and $\text{Cov}(T(X)) = \nabla^2_\eta A(\eta)$. The log-partition function generates all the moments of $T$
  4. Conjugate priors: every exponential family has a natural conjugate prior, making Bayesian inference tractable
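
The boundary failure in point 2 can be seen directly. A short sketch (my own illustration, assuming the canonical Bernoulli parameterization): with an all-success sample, the canonical log-likelihood $\ell(\eta) = t\eta - n\log(1 + e^\eta)$ has $t = n$ and is strictly increasing in $\eta$, so no finite maximizer exists.

```python
# Sketch (illustrative, not from the text): canonical Bernoulli log-likelihood
# with an all-success sample (t = n). l(eta) = t*eta - n*log(1 + e^eta) is
# then strictly increasing, so the MLE for eta runs off to +infinity.
from math import exp, log

def loglik(eta, t, n):
    """Canonical Bernoulli log-likelihood for observed sufficient statistic t."""
    return t * eta - n * log(1.0 + exp(eta))

n = 10
vals = [loglik(eta, t=n, n=n) for eta in (0.0, 2.0, 5.0, 10.0)]
assert all(a < b for a, b in zip(vals, vals[1:]))  # monotone increasing
assert vals[-1] < 0.0  # sup is 0 (p -> 1) but never attained at finite eta
```

The observed sufficient statistic $t = n$ sits on the boundary of the convex hull of $\{0, 1, \ldots, n\}$, exactly the condition under which existence fails.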

Definition

Log-Partition Function

The log-partition function ensures normalization:

$$A(\eta) = \log \int h(x) \exp(\eta^\top T(x)) \, dx$$

It is always convex in $\eta$ (because it is a log of an integral of exponentials). Its first derivative gives the expected sufficient statistic: $\nabla A(\eta) = \mathbb{E}_\eta[T(X)]$. Its second derivative gives the covariance: $\nabla^2 A(\eta) = \text{Cov}_\eta(T(X))$, which is the Fisher information in canonical form.
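
These derivative identities can be checked numerically. A quick sketch (my own, not from the text) for the Poisson family, where $A(\eta) = e^\eta$: finite differences of $A$ recover the mean and variance of $T(X) = X$, both equal to $\lambda$.

```python
# Numerical sketch: for the Poisson family in canonical form, A(eta) = e^eta.
# Finite differences of A recover E[T] = lambda and Var(T) = lambda.
from math import exp

A = exp                     # Poisson log-partition function
eta, h = 0.7, 1e-5
lam = exp(eta)              # the mean parameter lambda

mean_T = (A(eta + h) - A(eta - h)) / (2 * h)               # ~ A'(eta)
var_T = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2      # ~ A''(eta)

assert abs(mean_T - lam) < 1e-6   # E[T] = lambda
assert abs(var_T - lam) < 1e-4    # Var(T) = lambda
```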

Completeness

Definition

Complete Statistic

A sufficient statistic $T$ is complete if for any function $g$:

$$\mathbb{E}_\theta[g(T)] = 0 \text{ for all } \theta \implies g(T) = 0 \text{ a.s.}$$

Completeness means there is no non-trivial function of $T$ that has mean zero for all $\theta$. In exponential families with natural parameter space containing an open set, the natural sufficient statistic is always complete.

Completeness matters because it guarantees uniqueness: if $T$ is complete and sufficient, then any unbiased estimator based on $T$ is the unique best unbiased estimator (UMVUE). This connects to the Rao-Blackwell theorem below.

Rao-Blackwell Theorem

Theorem

Rao-Blackwell Theorem

Statement

Let $U$ be any unbiased estimator of $\tau(\theta)$ and let $T$ be a sufficient statistic. Define:

$$\tilde{U} = \mathbb{E}[U | T]$$

Then $\tilde{U}$ is:

  1. A function of $T$ alone (not of the full data)
  2. Unbiased for $\tau(\theta)$
  3. At least as good as $U$: $\text{Var}_\theta(\tilde{U}) \leq \text{Var}_\theta(U)$ for all $\theta$, with equality only if $U$ is already a function of $T$.

Intuition

Conditioning on a sufficient statistic can only help (or at least not hurt) estimation. The sufficient statistic contains all the information about $\theta$. Any remaining randomness in $U$ beyond what $T$ captures is pure noise. Conditioning on $T$ averages out this noise, reducing variance while preserving unbiasedness.

Proof Sketch

Unbiasedness: $\mathbb{E}[\tilde{U}] = \mathbb{E}[\mathbb{E}[U|T]] = \mathbb{E}[U] = \tau(\theta)$ by the tower property.

Variance reduction: by the law of total variance, $\text{Var}(U) = \mathbb{E}[\text{Var}(U|T)] + \text{Var}(\mathbb{E}[U|T]) = \mathbb{E}[\text{Var}(U|T)] + \text{Var}(\tilde{U})$.

Since $\mathbb{E}[\text{Var}(U|T)] \geq 0$, we get $\text{Var}(U) \geq \text{Var}(\tilde{U})$.

Why It Matters

Rao-Blackwell says: never ignore a sufficient statistic. If you have any unbiased estimator, you can improve it (or at least not hurt it) by conditioning on a sufficient statistic. Combined with completeness, this gives the Lehmann-Scheffé theorem: if $T$ is complete and sufficient, then $\mathbb{E}[U|T]$ is the unique minimum-variance unbiased estimator (UMVUE).
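
The variance reduction can be seen in a small simulation (my own sketch, not from the text): estimating $\tau(\lambda) = e^{-\lambda} = P(X = 0)$ in a Poisson model, starting from the crude unbiased estimator $U = \mathbf{1}\{X_1 = 0\}$. Given $T = \sum_i X_i = t$, $X_1 \sim \text{Binomial}(t, 1/n)$, so the Rao-Blackwellized estimator is $\mathbb{E}[U|T] = ((n-1)/n)^T$.

```python
# Simulation sketch: Rao-Blackwellizing U = 1{X1 = 0} for tau = e^{-lambda}
# in a Poisson model. Both estimators are unbiased; the conditioned one
# has visibly lower variance.
import random
from math import exp
from statistics import mean, pvariance

random.seed(0)
lam, n, reps = 2.0, 10, 20000

def poisson(lam):
    # Knuth's multiplication method; fine for small lambda
    L, k, p = exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p < L:
            return k
        k += 1

U, RB = [], []
for _ in range(reps):
    x = [poisson(lam) for _ in range(n)]
    U.append(1.0 if x[0] == 0 else 0.0)       # crude unbiased estimator
    RB.append(((n - 1) / n) ** sum(x))        # E[U | T] = ((n-1)/n)^T

assert abs(mean(U) - exp(-lam)) < 0.02        # both unbiased for e^{-lambda}
assert abs(mean(RB) - exp(-lam)) < 0.02
assert pvariance(RB) < pvariance(U)           # Rao-Blackwell reduces variance
```

Note that $((n-1)/n)^T$ depends on the data only through the sufficient statistic $T$, exactly as the theorem promises.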

Failure Mode

Rao-Blackwell improves unbiased estimators, but unbiasedness itself is not always desirable. Biased estimators (like the James-Stein estimator or ridge regression) can have lower MSE. The Rao-Blackwell theorem operates within the class of unbiased estimators and cannot compare across that boundary.

Canonical Examples

Example

Sufficient statistic for Gaussian mean

Let $X_1, \ldots, X_n \sim \mathcal{N}(\mu, \sigma^2)$ with known $\sigma^2$. The joint density is:

$$p(x|\mu) = (2\pi\sigma^2)^{-n/2} \exp\!\left(-\frac{1}{2\sigma^2}\sum_i(x_i - \mu)^2\right)$$

Expanding the square: $\sum_i(x_i - \mu)^2 = \sum_i x_i^2 - 2\mu \sum_i x_i + n\mu^2$.

By factorization: $g(T, \mu) = \exp\!\left(-\frac{1}{2\sigma^2}(-2\mu n\bar{x} + n\mu^2)\right)$ where $T = \bar{X} = \frac{1}{n}\sum_i X_i$, and the $\mu$-free terms are absorbed into $h(x)$. The sample mean is sufficient for $\mu$. This is an exponential family with natural parameter $\eta = \mu/\sigma^2$ and sufficient statistic $T = \sum_i x_i$.
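
A small Python check (my own sketch, not from the text): two datasets with the same sample mean have a likelihood ratio that is constant in $\mu$, which is exactly what sufficiency of $\bar{X}$ predicts, since the $\mu$-dependent factor of the likelihood sees the data only through $\bar{x}$.

```python
# Sketch: with sigma^2 known, two datasets sharing the same sample mean have
# likelihoods that differ only by a mu-free factor, so their log-likelihood
# difference is constant in mu -- the sample mean is sufficient.
from math import log, pi

def loglik(xs, mu, sigma2=1.0):
    """Gaussian log-likelihood with known variance sigma2."""
    n = len(xs)
    return (-0.5 * n * log(2 * pi * sigma2)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma2))

a = [0.0, 1.0, 2.0]   # sample mean 1.0
b = [0.5, 1.0, 1.5]   # same mean, different data
diffs = [loglik(a, mu) - loglik(b, mu) for mu in (-3.0, 0.0, 1.0, 4.0)]
assert max(diffs) - min(diffs) < 1e-9   # constant in mu
```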

Example

Exponential family form of the Poisson distribution

$$p(x | \lambda) = \frac{\lambda^x e^{-\lambda}}{x!} = \frac{1}{x!} \exp(x \log\lambda - \lambda)$$

This is an exponential family with $T(x) = x$, $\eta = \log\lambda$, $A(\eta) = e^\eta = \lambda$, and $h(x) = 1/x!$. For $n$ i.i.d. observations, $T = \sum_i X_i$ is sufficient for $\lambda$.
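
As a sanity check (my own sketch, not from the text), the canonical form $h(x)\exp(\eta x - A(\eta))$ with these ingredients reproduces the Poisson pmf exactly:

```python
# Sketch: the canonical form h(x) exp(eta*x - A(eta)) with eta = log(lambda),
# A(eta) = e^eta, h(x) = 1/x! matches the textbook Poisson pmf.
from math import exp, log, factorial

def pois_pmf(x, lam):
    return lam**x * exp(-lam) / factorial(x)

def expfam_pmf(x, lam):
    eta = log(lam)                                   # natural parameter
    return (1 / factorial(x)) * exp(eta * x - exp(eta))  # h(x) e^{eta x - A(eta)}

for x in range(8):
    assert abs(pois_pmf(x, 2.5) - expfam_pmf(x, 2.5)) < 1e-12
```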

Common Confusions

Watch Out

Sufficient does not mean minimal sufficient

The entire data vector $X = (X_1, \ldots, X_n)$ is always trivially sufficient (the identity is a sufficient statistic). The interesting question is how much you can compress. Minimal sufficiency gives the maximum compression. For exponential families with a $k$-dimensional natural parameter, the minimal sufficient statistic is $k$-dimensional, regardless of sample size $n$.

Watch Out

Not all distributions are exponential families

Mixture distributions are not exponential families (the sufficient statistic dimension grows with $n$). The Cauchy distribution is not an exponential family. Uniform$(0, \theta)$ is not either (because the support depends on $\theta$). When you are outside exponential families, the clean theory of sufficient statistics and conjugate priors does not apply as neatly.

Summary

  • A statistic $T(X)$ is sufficient if the conditional distribution of $X$ given $T$ does not depend on $\theta$
  • Factorization theorem: $p(x|\theta) = g(T(x), \theta) \cdot h(x)$ characterizes sufficiency
  • Exponential families: $p(x|\theta) = h(x) \exp(\eta(\theta)^\top T(x) - A(\theta))$
  • The log-partition function $A(\eta)$ generates moments of $T$: $\mathbb{E}[T] = \nabla A$, $\text{Cov}(T) = \nabla^2 A$
  • Completeness + sufficiency gives uniqueness of the UMVUE
  • Rao-Blackwell: condition on a sufficient statistic to improve any unbiased estimator

Exercises

Exercise (Core)

Problem

Find the sufficient statistic for $\theta$ in the Bernoulli model: $X_1, \ldots, X_n \sim \text{Bernoulli}(\theta)$. Write the joint pmf in exponential family form and identify the natural parameter, sufficient statistic, and log-partition function.

Exercise (Advanced)

Problem

Let $X_1, \ldots, X_n \sim \text{Uniform}(0, \theta)$. Show that $T = X_{(n)} = \max_i X_i$ is sufficient for $\theta$ but that this is not an exponential family. Why does this matter for the MLE?

Exercise (Research)

Problem

Prove that in a $k$-parameter exponential family whose natural parameter space contains an open set, the natural sufficient statistic $T(X) = \sum_{i=1}^n T(X_i)$ is complete. Why does this, combined with Rao-Blackwell, imply that any unbiased estimator based on $T$ is the UMVUE?

References

Canonical:

  • Casella & Berger, Statistical Inference (2nd ed., 2002), Chapters 6-7
  • Lehmann & Casella, Theory of Point Estimation (2nd ed., 1998), Chapters 1-4
  • Keener, Theoretical Statistics (2010), Chapters 3-4

Current:

  • Wasserman, All of Statistics (2004), Chapter 9
  • Wainwright & Jordan, "Graphical Models, Exponential Families, and Variational Inference" (2008)
  • van der Vaart, Asymptotic Statistics (1998), Chapters 2-8

Next Topics

Building on sufficient statistics and exponential families:

  • Fisher information: the curvature of the log-likelihood, directly related to the log-partition function in exponential families
  • Hypothesis testing for ML: using sufficient statistics to construct optimal tests
  • EM algorithm: exploiting exponential family structure for latent variable models

Last reviewed: April 2026
