
Statistical Estimation

Cramér-Rao Bound

The fundamental lower bound on the variance of any unbiased estimator: no unbiased estimator can have variance smaller than the reciprocal of the Fisher information.


Prerequisites

Why This Matters

When you build an estimator, you want to know: how good can it possibly be? The Cramér-Rao bound answers this for unbiased estimators. It says the variance of any unbiased estimator of $\theta$ is at least $1/(nI(\theta))$, where $I(\theta)$ is the Fisher information per observation. No cleverness in estimator design can beat this limit.

This bound tells you whether your estimator is efficient (achieves the bound) or whether there is room for improvement. It also connects estimation theory to information geometry: the Fisher information is the Riemannian metric on the statistical manifold.

[Figure: $I(\theta)$ as the curvature of the log-likelihood $\log L(\theta)$ at the true $\theta$. High $I(\theta)$ gives a sharp peak and small variance $\mathrm{Var}(\hat{\theta}) \approx 1/I(\theta)$; low $I(\theta)$ gives a flat peak and large variance.]

Formal Setup

Let $X_1, \ldots, X_n$ be i.i.d. from $p(x|\theta)$, where $\theta \in \Theta \subseteq \mathbb{R}$.

Definition

Score Function

The score function is the derivative of the log-likelihood:

$$s(x; \theta) = \frac{\partial}{\partial \theta} \log p(x|\theta)$$

Key property: $\mathbb{E}_\theta[s(X; \theta)] = 0$ for all $\theta$.
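This zero-mean property is easy to verify numerically. A minimal Monte Carlo sketch for the $N(\theta, 1)$ model, where the score works out to $s(x; \theta) = x - \theta$ (all variable names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 2.0
x = rng.normal(theta, 1.0, size=1_000_000)

# For N(theta, 1): log p(x|theta) = -(x - theta)^2 / 2 + const,
# so the score is s(x; theta) = x - theta.
score = x - theta

mean_score = score.mean()  # should be close to 0 for the true theta
```

Evaluating the score at the true parameter is essential: at any other value of $\theta$ the average score is nonzero, which is exactly what drives the MLE toward the truth.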

Definition

Fisher Information

The Fisher information is the variance of the score:

$$I(\theta) = \mathrm{Var}_\theta[s(X; \theta)] = \mathbb{E}_\theta[s(X; \theta)^2]$$

Under regularity conditions, equivalently $I(\theta) = -\mathbb{E}_\theta[\partial^2 \log p(X|\theta) / \partial \theta^2]$.
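Both expressions can be checked against the closed form $I(p) = 1/(p(1-p))$ for the Bernoulli model. A Monte Carlo sketch (tolerances are deliberately loose):

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.3
x = rng.binomial(1, p, size=1_000_000).astype(float)

# Bernoulli(p): log p(x|p) = x*log(p) + (1-x)*log(1-p)
score = x / p - (1 - x) / (1 - p)             # first derivative in p
second = -x / p**2 - (1 - x) / (1 - p) ** 2   # second derivative in p

var_score = score.var()          # variance-of-score form
neg_mean_second = -second.mean() # negative-expected-curvature form
analytic = 1.0 / (p * (1 - p))   # closed form, about 4.76 here
```

Both estimates should agree with the analytic value; the equivalence of the two forms is exactly the identity stated above.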

Definition

Efficient Estimator

An unbiased estimator $\hat{\theta}$ is efficient if it achieves the Cramér-Rao bound with equality: $\mathrm{Var}_\theta(\hat{\theta}) = 1/(nI(\theta))$ for all $\theta$.

Main Theorems

Theorem

Cramér-Rao Lower Bound

Statement

For any unbiased estimator $\hat{\theta}$ of $\theta$:

$$\mathrm{Var}_\theta(\hat{\theta}) \geq \frac{1}{nI(\theta)}$$

Equality holds if and only if the score function is a linear function of $\hat{\theta}$: there exists $a(\theta)$ such that $\sum_{i=1}^n s(X_i; \theta) = a(\theta)(\hat{\theta} - \theta)$ a.s.

Intuition

The score function measures the sensitivity of the likelihood to $\theta$. High Fisher information means the data is very informative about $\theta$, so estimators can be more precise. The bound quantifies this: more information means lower achievable variance.

Proof Sketch

Apply the Cauchy-Schwarz inequality to the covariance of $\hat{\theta}$ and the total score $S_n = \sum_{i=1}^n s(X_i; \theta)$:

$$[\mathrm{Cov}(\hat{\theta}, S_n)]^2 \leq \mathrm{Var}(\hat{\theta}) \cdot \mathrm{Var}(S_n)$$

Since $\hat{\theta}$ is unbiased, differentiating $\mathbb{E}[\hat{\theta}] = \theta$ under the integral gives $\mathrm{Cov}(\hat{\theta}, S_n) = 1$. Since $S_n$ is a sum of i.i.d. terms, $\mathrm{Var}(S_n) = nI(\theta)$. Substituting: $1 \leq \mathrm{Var}(\hat{\theta}) \cdot nI(\theta)$.
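Both identities in the proof can be checked by simulation, here for the $N(\theta, 1)$ model with $\hat{\theta} = \bar{X}$, where the score is $x - \theta$ and $I(\theta) = 1$ (a sketch; parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 1.5, 20, 200_000

x = rng.normal(theta, 1.0, size=(reps, n))
theta_hat = x.mean(axis=1)        # unbiased estimator: the sample mean
S_n = (x - theta).sum(axis=1)     # total score for N(theta, 1)

cov = np.cov(theta_hat, S_n)[0, 1]  # should be close to 1
var_Sn = S_n.var()                  # should be close to n * I(theta) = n
```

The Cauchy-Schwarz step then gives $\mathrm{Var}(\hat{\theta}) \geq 1^2 / n$, which $\bar{X}$ attains exactly in this model.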

Why It Matters

This is the central result of classical estimation theory. It provides a universal benchmark: if your estimator's variance equals $1/(nI(\theta))$, you know no unbiased estimator can do better. The MLE achieves this bound asymptotically, which is one reason MLE is the default estimation method.

Failure Mode

The bound only applies to unbiased estimators. Biased estimators can have lower MSE than the Cramér-Rao bound (this is the James-Stein phenomenon). The regularity conditions matter: for uniform distributions $U(0, \theta)$, the support depends on $\theta$, the bound does not apply, and the MLE converges at rate $1/n$ instead of $1/\sqrt{n}$.

Theorem

Multivariate Cramér-Rao Bound

Statement

For any unbiased estimator $\hat{\theta}$ of $\theta \in \mathbb{R}^d$:

$$\mathrm{Cov}_\theta(\hat{\theta}) \succeq \frac{1}{n}[I(\theta)]^{-1}$$

where $\succeq$ denotes the Loewner (positive semidefinite) ordering and $I(\theta)$ is the $d \times d$ Fisher information matrix with entries $I_{jk}(\theta) = \mathbb{E}[\partial_j \log p \cdot \partial_k \log p]$.

Intuition

The matrix inequality says that for any direction $v \in \mathbb{R}^d$, the variance of $v^T \hat{\theta}$ is at least $v^T [nI(\theta)]^{-1} v$. Parameters that are poorly identified (low Fisher information in their direction) have high minimum variance.
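The Fisher information matrix can be estimated empirically as the average outer product of score vectors. A sketch for $N(\mu, \sigma^2)$ with both parameters unknown, where the exact matrix is $\mathrm{diag}(1/\sigma^2,\ 1/(2\sigma^4))$ (parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma2 = 1.0, 2.0
x = rng.normal(mu, np.sqrt(sigma2), size=1_000_000)

# Score components for N(mu, sigma^2), parameters (mu, sigma^2):
s_mu = (x - mu) / sigma2
s_s2 = -1.0 / (2 * sigma2) + (x - mu) ** 2 / (2 * sigma2**2)

scores = np.stack([s_mu, s_s2])           # shape (2, N)
I_hat = scores @ scores.T / x.size        # empirical E[s s^T]

I_exact = np.array([[1 / sigma2, 0.0],
                    [0.0, 1 / (2 * sigma2**2)]])
```

Here the off-diagonal entries vanish, so $\mu$ and $\sigma^2$ are orthogonal parameters: not knowing one does not inflate the minimum variance for the other.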

Proof Sketch

Same Cauchy-Schwarz argument applied to each component, or equivalently, consider $v^T \hat{\theta}$ for an arbitrary direction $v$ and apply the scalar Cramér-Rao bound to this one-dimensional projection.

Why It Matters

In practice, models have multiple parameters. The matrix version tells you which parameters are easy to estimate (high Fisher information) and which are hard. It also connects to the natural gradient: gradient descent in Fisher information geometry converges in fewer steps because it accounts for parameter curvature.
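The natural-gradient remark can be made concrete on a toy Bernoulli model: preconditioning the log-likelihood gradient by $I(p)^{-1} = p(1-p)$ makes a single unit-step update land exactly on the MLE $\bar{X}$ in this one-parameter case. A minimal sketch (data and starting point are arbitrary):

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 0, 1], dtype=float)
x_bar = x.mean()  # the MLE for Bernoulli(p)

p = 0.5  # initial guess
# Gradient of the average log-likelihood x*log(p) + (1-x)*log(1-p) at p:
grad = x_bar / p - (1 - x_bar) / (1 - p)
# Fisher information for Bernoulli(p):
fisher = 1.0 / (p * (1 - p))

# One natural-gradient ascent step with unit learning rate:
p_natural = p + grad / fisher
# Algebraically: p + p(1-p) * (x_bar - p)/(p(1-p)) = x_bar
```

A plain gradient step with the same learning rate would wildly overshoot near $p \approx 0$ or $p \approx 1$; dividing by the Fisher information rescales the step to the local curvature.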

Failure Mode

Same as the scalar case: applies only to unbiased estimators, requires regularity conditions, and biased estimators can achieve lower MSE.

Canonical Examples

Example

Estimating the mean of a normal distribution

Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ with $\sigma^2$ known. The Fisher information per observation is $I(\mu) = 1/\sigma^2$. The Cramér-Rao bound gives $\mathrm{Var}(\hat{\mu}) \geq \sigma^2/n$. The sample mean $\bar{X}$ has variance exactly $\sigma^2/n$, so it is efficient.
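A quick simulation confirms that $\bar{X}$ sits exactly on the bound (a sketch with arbitrary parameter choices):

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n, reps = 0.0, 2.0, 25, 100_000

x = rng.normal(mu, sigma, size=(reps, n))
var_xbar = x.mean(axis=1).var()    # empirical variance of the sample mean
cr_bound = sigma**2 / n            # 1/(n I(mu)) = sigma^2/n
```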

Example

Estimating the rate of a Poisson distribution

Let $X_1, \ldots, X_n \sim \mathrm{Poisson}(\lambda)$. The Fisher information is $I(\lambda) = 1/\lambda$. The Cramér-Rao bound gives $\mathrm{Var}(\hat{\lambda}) \geq \lambda/n$. The sample mean $\bar{X}$ has variance $\lambda/n$, so it is efficient.
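As a numerical sanity check (a minimal sketch; parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
lam, n, reps = 3.0, 40, 100_000

x = rng.poisson(lam, size=(reps, n))
var_lam_hat = x.mean(axis=1).var()  # empirical variance of X-bar
cr_bound = lam / n                  # 1/(n I(lambda)) = lambda/n
```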

Common Confusions

Watch Out

The Cramér-Rao bound does not apply to biased estimators

The bound $\mathrm{Var} \geq 1/(nI)$ is only for unbiased estimators. A biased estimator can have lower variance. The bias-variance tradeoff means that the optimal MSE estimator is often biased. The James-Stein estimator beats $\bar{X}$ in MSE when the dimension $d \geq 3$, despite $\bar{X}$ being efficient in the Cramér-Rao sense.
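The James-Stein effect is easy to reproduce by simulation. A sketch for a single observation $X \sim N(\theta, I_d)$ with $d = 10$ (the choice of $\theta$ and the shrinkage target of zero are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
d, reps = 10, 100_000
theta = np.ones(d)

x = rng.normal(theta, 1.0, size=(reps, d))   # X ~ N(theta, I_d)
norm2 = (x ** 2).sum(axis=1, keepdims=True)
js = (1 - (d - 2) / norm2) * x               # James-Stein: shrink toward 0

mse_mle = ((x - theta) ** 2).sum(axis=1).mean()  # risk of X itself, = d
mse_js = ((js - theta) ** 2).sum(axis=1).mean()  # strictly smaller for d >= 3
```

The unbiased estimator $X$ has total MSE exactly $d$; the biased shrinkage estimator does better at every $\theta$, which is the sense in which the Cramér-Rao bound says nothing about MSE-optimality.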

Watch Out

Achieving the bound does not mean achieving the best MSE

Efficiency (in the Cramér-Rao sense) means minimum variance among unbiased estimators. It does not mean minimum MSE among all estimators. For small samples, regularized or shrinkage estimators with some bias often have lower MSE than the efficient unbiased estimator.

Watch Out

Regularity conditions are not just technicalities

For $X \sim U(0, \theta)$, the support depends on $\theta$, so the Cramér-Rao bound does not apply. The MLE $\hat{\theta} = X_{(n)}$ has variance of order $\theta^2/n^2$, which is much smaller than $1/(nI)$ would suggest. When regularity fails, convergence can be faster than the $1/\sqrt{n}$ parametric rate.
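The $1/n^2$ rate can be seen directly by simulation: doubling $n$ should cut the variance of $X_{(n)}$ by roughly a factor of 4, not the factor of 2 a regular model would give. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(7)
theta, reps = 1.0, 50_000

def mle_var(n):
    """Empirical variance of the MLE X_(n) for U(0, theta)."""
    x = rng.uniform(0, theta, size=(reps, n))
    return x.max(axis=1).var()

v100, v200 = mle_var(100), mle_var(200)
ratio = v100 / v200   # close to 4 => variance scales like 1/n^2
```

The exact value is $\mathrm{Var}(X_{(n)}) = n\theta^2 / ((n+1)^2(n+2))$, so the ratio approaches 4 as $n$ grows.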

Summary

  • For unbiased estimators: $\mathrm{Var}(\hat{\theta}) \geq 1/(nI(\theta))$
  • The proof is a single application of Cauchy-Schwarz
  • The MLE is asymptotically efficient: it achieves the bound as $n \to \infty$
  • Biased estimators can beat the bound in MSE
  • Regularity conditions matter: the bound fails for non-regular families

Exercises

ExerciseCore

Problem

Compute the Fisher information for $X \sim \mathrm{Bernoulli}(p)$ and state the Cramér-Rao bound for estimating $p$ from $n$ i.i.d. observations.

ExerciseAdvanced

Problem

Show that for the exponential family $p(x|\theta) = h(x)\exp(\eta(\theta) T(x) - A(\theta))$, an efficient unbiased estimator of $\theta$ exists if and only if $\theta$ is an affine function of the mean parameter $\mathbb{E}_\theta[T(X)]$, in which case the estimator is an affine function of $\sum_{i=1}^n T(X_i)$. (Hint: use the equality condition of the Cramér-Rao theorem.)


References

Canonical:

  • Casella & Berger, Statistical Inference (2002), Chapter 7.3
  • Lehmann & Casella, Theory of Point Estimation (1998), Chapter 2
  • Schervish, Theory of Statistics (1995), Section 2.3 (information inequalities)

Current:

  • van der Vaart, Asymptotic Statistics (1998), Chapter 8
  • Cover & Thomas, Elements of Information Theory (2006), Chapter 11.10 (Fisher information and the Cramér-Rao bound)
  • Keener, Theoretical Statistics (2010), Chapter 3 (unbiased estimation and efficiency)

Next Topics

  • Asymptotic statistics: MLE achieves the Cramér-Rao bound asymptotically
  • Minimax lower bounds: going beyond unbiased estimators to minimax optimality

Last reviewed: April 2026
