
Statistical Estimation

Fisher Information

The Fisher information quantifies how much a sample tells you about an unknown parameter: it measures the curvature of the log-likelihood, sets the Cramér-Rao lower bound on estimator variance, and serves as a natural Riemannian metric on parameter space.


Why This Matters

Figure: Fisher information as curvature of the log-likelihood at the true parameter. A sharply peaked log-likelihood (high curvature, large I(\theta)) gives a small Cramér-Rao variance bound \text{Var}(T) \geq 1/I(\theta); a broad peak gives a large bound. The plot contrasts a high-information case (\sigma = 0.5) with a low-information case (\sigma = 1.5).

Every time you fit a model to data, there is a fundamental limit on how precisely you can estimate the parameters. That limit is set by the Fisher information. It tells you: given the statistical model p(x|\theta), how much information does a single observation x carry about \theta?

The Cramér-Rao bound says no unbiased estimator can have variance smaller than the inverse of the Fisher information. MLE achieves this bound asymptotically, which is why MLE is the default method for parametric estimation.

Beyond classical statistics, Fisher information appears as the natural Riemannian metric on parameter space. This gives rise to the natural gradient, which plays a role in modern optimization for neural networks and policy gradient methods.

Mental Model

Think of the log-likelihood \ell(\theta) = \log p(x|\theta) as a landscape over parameter space. The Fisher information measures the expected curvature of this landscape. If the curvature is high, the likelihood is sharply peaked around the true parameter, so even a small sample pins down \theta precisely. If the curvature is low, the likelihood is flat and you need many samples to distinguish nearby parameter values.

Formal Setup and Notation

Let X be a random variable with density p(x|\theta) for \theta \in \Theta \subseteq \mathbb{R} (scalar case first, then we generalize).

Definition

Score Function

The score function is the derivative of the log-likelihood with respect to \theta:

s(\theta; x) = \frac{\partial}{\partial \theta} \log p(x|\theta)

Under regularity conditions (interchange of differentiation and integration), the score has zero mean:

\mathbb{E}_\theta[s(\theta; X)] = 0

Definition

Fisher Information (Scalar)

The Fisher information is the variance of the score function:

I(\theta) = \mathbb{E}_\theta\!\left[\left(\frac{\partial}{\partial \theta} \log p(X|\theta)\right)^2\right] = \text{Var}_\theta[s(\theta; X)]

The second equality uses the fact that the score has zero mean.
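Both faces of the definition can be verified numerically. The sketch below is a Monte Carlo check under a model of our choosing (an Exponential with rate \theta = 2, for which the score is 1/\theta - x and I(\theta) = 1/\theta^2): it estimates the score's mean and variance and compares the variance against the negative expected second derivative.

```python
import numpy as np

# Monte Carlo sanity check (model choice ours): for an Exponential(rate=theta)
# density p(x|theta) = theta * exp(-theta * x), the log-likelihood is
# log(theta) - theta*x, so the score is 1/theta - x and I(theta) = 1/theta^2.
rng = np.random.default_rng(0)
theta = 2.0
x = rng.exponential(scale=1.0 / theta, size=200_000)

score = 1.0 / theta - x               # d/dtheta log p(x|theta)
var_score = score.var()               # variance-of-score form of I(theta)
neg_exp_hessian = 1.0 / theta**2      # -E[d^2/dtheta^2 log p], exact here
print(score.mean())                   # ~ 0 (score has zero mean)
print(var_score, neg_exp_hessian)     # both ~ 1/theta^2 = 0.25
```

The two estimates agree because the regularity conditions hold for this model: the support does not depend on \theta.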

Core Definitions

The Fisher information has a second, equivalent form that is often more convenient for computation.

Theorem

Fisher Information as Expected Negative Hessian

Statement

Under standard regularity conditions:

I(\theta) = -\mathbb{E}_\theta\!\left[\frac{\partial^2}{\partial \theta^2} \log p(X|\theta)\right]

The Fisher information equals the expected curvature (negative second derivative) of the log-likelihood.

Intuition

High curvature means the log-likelihood changes rapidly as \theta varies. This makes the peak of the likelihood sharp, so the data pins down \theta precisely. Low curvature means a flat likelihood and poor identifiability.

Proof Sketch

Start from \int p(x|\theta) \, dx = 1. Differentiate both sides twice with respect to \theta. After interchanging differentiation and integration, the second-derivative identity yields the result when you take expectations.

Why It Matters

This identity lets you compute Fisher information by taking the second derivative of the log-likelihood (often easier than computing the variance of the score directly). It also reveals that Fisher information is curvature.

Failure Mode

The identity fails when the regularity conditions break down, for example when the support of p(x|\theta) depends on \theta (as in the \text{Uniform}(0, \theta) distribution). In such cases, interchanging differentiation and integration is invalid, so the score's mean-zero property fails. As a result, both identities break down: the score-variance quantity \mathbb{E}[s^2] and the expected negative Hessian are no longer equal, and neither equals the Cramér-Rao denominator for unbiased estimators. The classical symptom is \text{Uniform}(0, \theta): the MLE \hat{\theta} = \max_i X_i has variance of order 1/n^2, not 1/n, which would be impossible if a naively computed Fisher information plugged into the Cramér-Rao bound applied. Non-regular models like this require different tools (order statistics, Hájek-Le Cam bounds); see Lehmann and Casella, Theory of Point Estimation §2.7.

Fisher Information for n Samples

If X_1, \ldots, X_n are i.i.d. from p(x|\theta), the Fisher information of the full sample is:

I_n(\theta) = n \cdot I(\theta)

This follows from the additivity of variances for independent random variables: the full-sample score is the sum of the per-observation scores. More data means proportionally more information about \theta.
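A quick simulation illustrates the additivity. The sketch assumes a Gaussian model with known variance (for which I(\mu) = 1/\sigma^2, as derived in the examples below): the variance of the full-sample score should come out near n/\sigma^2.

```python
import numpy as np

# Sketch (model choice ours): for N(mu, sigma^2) with sigma known, the
# per-observation score is (x - mu)/sigma^2 and I(mu) = 1/sigma^2. The score
# of an i.i.d. sample is the sum of per-observation scores, so its variance
# -- the sample's Fisher information -- should be n * I(mu) = n/sigma^2.
rng = np.random.default_rng(1)
mu, sigma, n = 0.0, 2.0, 10
x = rng.normal(mu, sigma, size=(100_000, n))      # 100k samples of size n

sample_score = ((x - mu) / sigma**2).sum(axis=1)  # full-sample score
print(sample_score.var())                         # ~ n / sigma^2 = 2.5
print(n / sigma**2)
```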

The Fisher Information Matrix

Definition

Fisher Information Matrix

For a parameter vector \boldsymbol{\theta} \in \mathbb{R}^d, the Fisher information matrix is the d \times d matrix:

[\mathbf{I}(\boldsymbol{\theta})]_{jk} = \mathbb{E}_{\boldsymbol{\theta}}\!\left[\frac{\partial \log p(X|\boldsymbol{\theta})}{\partial \theta_j} \cdot \frac{\partial \log p(X|\boldsymbol{\theta})}{\partial \theta_k}\right]

Equivalently, under regularity conditions:

[\mathbf{I}(\boldsymbol{\theta})]_{jk} = -\mathbb{E}_{\boldsymbol{\theta}}\!\left[\frac{\partial^2 \log p(X|\boldsymbol{\theta})}{\partial \theta_j \partial \theta_k}\right]

In matrix form: \mathbf{I}(\boldsymbol{\theta}) = \text{Cov}[\nabla_{\boldsymbol{\theta}} \log p(X|\boldsymbol{\theta})] = -\mathbb{E}[\nabla^2_{\boldsymbol{\theta}} \log p(X|\boldsymbol{\theta})].

The Fisher information matrix is always positive semidefinite (it is a covariance matrix). It is positive definite when the model is identifiable.
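The covariance form and the positive-semidefiniteness can be checked empirically. The sketch below uses a toy two-parameter model of our choosing (two independent Bernoulli coordinates), estimates \mathbf{I}(\boldsymbol{\theta}) as the empirical covariance of score vectors, and inspects its eigenvalues.

```python
import numpy as np

# Sketch (toy model ours): two independent Bernoulli coordinates with
# parameters p and q. The exact FIM is diag(1/(p(1-p)), 1/(q(1-q)));
# we estimate it as the empirical covariance of the score vector.
rng = np.random.default_rng(2)
p, q = 0.3, 0.7
x1 = rng.binomial(1, p, 500_000).astype(float)
x2 = rng.binomial(1, q, 500_000).astype(float)

scores = np.stack([x1 / p - (1 - x1) / (1 - p),    # d(log p)/dp
                   x2 / q - (1 - x2) / (1 - q)],   # d(log p)/dq
                  axis=1)
fim = np.cov(scores, rowvar=False)

print(np.round(fim, 2))           # ~ diag(4.76, 4.76), off-diagonal ~ 0
print(np.linalg.eigvalsh(fim))    # strictly positive: positive definite
```

The off-diagonal entries are near zero because the two coordinates are independent; for correlated parameters the matrix would have nonzero cross terms.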

Main Theorems

Theorem

Cramér-Rao Lower Bound

Statement

Let \hat{\theta} be any unbiased estimator of \theta based on n i.i.d. observations from p(x|\theta). Then:

\text{Var}(\hat{\theta}) \geq \frac{1}{n \, I(\theta)}

In the multivariate case, for any unbiased estimator \hat{\boldsymbol{\theta}}:

\text{Cov}(\hat{\boldsymbol{\theta}}) \succeq \frac{1}{n} \mathbf{I}(\boldsymbol{\theta})^{-1}

where \succeq denotes the positive semidefinite ordering.

Intuition

The Cramér-Rao bound says: no matter how clever your estimator is, if it is unbiased, its variance cannot beat 1/(n I(\theta)). High Fisher information means a tighter bound and better estimation is possible. Low Fisher information means even the best estimator will be noisy.

Proof Sketch

By definition, \mathbb{E}[\hat{\theta}] = \theta. Differentiate both sides with respect to \theta (interchanging differentiation and integration) to get \text{Cov}(\hat{\theta}, s_n) = 1, where s_n = \sum_i s(\theta; X_i) is the full-sample score. Now apply the Cauchy-Schwarz inequality: 1 = \text{Cov}(\hat{\theta}, s_n)^2 \leq \text{Var}(\hat{\theta}) \cdot \text{Var}(s_n) = \text{Var}(\hat{\theta}) \cdot n I(\theta). Rearrange.

Why It Matters

This is the fundamental impossibility result in estimation theory. It tells you the best possible precision for any unbiased estimator, and it lets you check whether a given estimator is efficient (achieves the bound).

Failure Mode

The bound applies only to unbiased estimators. Biased estimators (like ridge regression) can have lower mean squared error by trading bias for variance. The Cramér-Rao bound says nothing about such estimators.

Theorem

Asymptotic Efficiency of MLE

Statement

Under standard regularity conditions, the MLE \hat{\theta}_{\text{MLE}} is asymptotically efficient:

\sqrt{n}(\hat{\theta}_{\text{MLE}} - \theta_0) \xrightarrow{d} \mathcal{N}\!\left(0, \frac{1}{I(\theta_0)}\right)

The asymptotic variance of the MLE equals the Cramér-Rao lower bound.

Intuition

MLE extracts all available information from the data. No other consistent estimator can do better asymptotically. This is why MLE is the default choice for parametric estimation.

Proof Sketch

Taylor-expand the score function \sum_i s(\theta; X_i) around the true parameter \theta_0. Set the expansion to zero (the MLE condition). By the law of large numbers, the normalized second-derivative term converges to -I(\theta_0). By the central limit theorem, the score sum is asymptotically normal with variance n I(\theta_0). Combining gives the result.

Why It Matters

This theorem justifies MLE as the gold standard for parametric models. It means you cannot improve on MLE (in terms of asymptotic variance) without either using prior information or accepting bias.

Failure Mode

The efficiency result is asymptotic. In finite samples, MLE can be outperformed by biased estimators (James-Stein shrinkage) or by Bayesian estimators with informative priors. The result also fails for non-regular models, for example when the parameter lies on the boundary of the parameter space.
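The asymptotic normality is easy to see in simulation. In the sketch below (model and sample sizes our choice), the Bernoulli MLE is the sample mean, so we draw many samples, rescale by \sqrt{n}, and compare the empirical variance with 1/I(\theta_0) = \theta_0(1 - \theta_0).

```python
import numpy as np

# Simulation sketch (model choice ours): for Bernoulli(theta0), the MLE is
# the sample mean, and sqrt(n)*(MLE - theta0) should have variance close to
# 1/I(theta0) = theta0*(1 - theta0).
rng = np.random.default_rng(3)
theta0, n, reps = 0.3, 2_000, 50_000
mles = rng.binomial(n, theta0, size=reps) / n   # each entry is one MLE

scaled = np.sqrt(n) * (mles - theta0)
print(scaled.var())            # ~ theta0 * (1 - theta0) = 0.21
print(theta0 * (1 - theta0))   # the Cramér-Rao limit 1/I(theta0)
```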

Fisher Information and the Natural Gradient

The Fisher information matrix \mathbf{I}(\boldsymbol{\theta}) defines a Riemannian metric on parameter space. In this geometry, the steepest-descent direction is not the ordinary gradient \nabla L(\boldsymbol{\theta}) but the natural gradient:

\tilde{\nabla} L(\boldsymbol{\theta}) = \mathbf{I}(\boldsymbol{\theta})^{-1} \nabla L(\boldsymbol{\theta})

The natural gradient accounts for the curvature of the parameter space. Two parameter values that are close in Euclidean distance might produce very different distributions, and vice versa. The Fisher metric captures this.

In practice, computing \mathbf{I}(\boldsymbol{\theta})^{-1} is expensive for large models. Approximations like K-FAC (Kronecker-Factored Approximate Curvature) make natural gradient methods practical for deep learning. The natural policy gradient in reinforcement learning (used in TRPO) is a direct application.
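For intuition on the preconditioning, here is a one-dimensional sketch (our toy setup: a Bernoulli average log-likelihood, where the Fisher "matrix" is the scalar 1/(\theta(1-\theta))) comparing an ordinary gradient step with the natural-gradient step.

```python
# Sketch (toy setup ours): one natural-gradient step for a Bernoulli(theta)
# average log-likelihood with observed sample mean x_mean.
# Fisher information: I(theta) = 1/(theta*(1-theta)).
theta, x_mean, lr = 0.2, 0.5, 0.1

g = x_mean / theta - (1 - x_mean) / (1 - theta)   # gradient of avg log-lik
fisher = 1.0 / (theta * (1 - theta))              # scalar Fisher information

plain_step = lr * g               # ordinary gradient ascent step
natural_step = lr * g / fisher    # step preconditioned by I(theta)^{-1}
print(plain_step, natural_step)   # the natural step is much smaller here:
                                  # curvature near theta = 0.2 is high
```

Near \theta = 1/2, where I(\theta) is smallest, the preconditioner would instead enlarge the step: the natural gradient moves a fixed distance in distribution space, not in raw parameter space.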

Canonical Examples

Example

Fisher information for the Bernoulli distribution

Let X \sim \text{Bernoulli}(\theta). Then \log p(x|\theta) = x \log \theta + (1-x) \log(1-\theta). The score is s(\theta; x) = \frac{x}{\theta} - \frac{1-x}{1-\theta}. Computing the variance gives:

I(\theta) = \frac{1}{\theta(1-\theta)}

Near \theta = 0 or \theta = 1, the Fisher information is large (a single observation tells you a lot). At \theta = 1/2, it is smallest (I = 4): the most uncertain case is hardest to estimate.
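A Monte Carlo check of this formula (sample sizes and seed are our choice): the empirical variance of the score should track 1/(\theta(1-\theta)) across several values of \theta, bottoming out at I = 4 for \theta = 1/2.

```python
import numpy as np

# Monte Carlo check (setup ours): the empirical variance of the Bernoulli
# score should match I(theta) = 1/(theta*(1-theta)) at each theta.
rng = np.random.default_rng(4)
results = {}
for theta in (0.1, 0.5, 0.9):
    x = rng.binomial(1, theta, 1_000_000).astype(float)
    score = x / theta - (1 - x) / (1 - theta)
    results[theta] = score.var()
    print(theta, results[theta], 1 / (theta * (1 - theta)))
# theta = 0.5 is the minimum (I = 4); the extremes give I ~ 11.1
```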

Example

Fisher information for the Gaussian (known variance)

Let X \sim \mathcal{N}(\mu, \sigma^2) with \sigma^2 known. Then \log p(x|\mu) = -\frac{(x-\mu)^2}{2\sigma^2} + \text{const}. The second derivative with respect to \mu is -1/\sigma^2, so:

I(\mu) = \frac{1}{\sigma^2}

The Cramér-Rao bound gives \text{Var}(\hat{\mu}) \geq \sigma^2/n. The sample mean \bar{X} achieves this bound exactly (not just asymptotically), so \bar{X} is a uniformly minimum variance unbiased estimator (UMVUE).
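This exact finite-sample attainment is easy to confirm by simulation; the sketch below, with parameter values of our choosing, compares the empirical variance of \bar{X} against \sigma^2/n.

```python
import numpy as np

# Simulation sketch (parameter values ours): the variance of the sample mean
# of n Gaussian draws equals the Cramér-Rao bound sigma^2/n in finite samples.
rng = np.random.default_rng(5)
mu, sigma, n, reps = 1.0, 2.0, 25, 200_000
means = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

print(means.var())    # ~ sigma^2 / n = 0.16
print(sigma**2 / n)   # the Cramér-Rao bound for unbiased estimators of mu
```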

Common Confusions

Watch Out

Fisher information is not the same as observed information

The Fisher information I(\theta) is an expectation over the data. The observed information J(\theta) = -\frac{\partial^2}{\partial \theta^2} \ell(\theta) (evaluated at the data you actually observed) is a random quantity. Asymptotically they agree, but in finite samples they can differ. Some statisticians prefer using J(\theta) for inference because it conditions on the observed data.
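The distinction is easy to see numerically. In the sketch below (our toy setup: n = 50 Bernoulli draws), the observed information at the true \theta is a random quantity that fluctuates around the expected information n I(\theta) = n/(\theta(1-\theta)).

```python
import numpy as np

# Sketch (toy setup ours): observed vs expected information for n Bernoulli
# draws, both evaluated at the true theta. The observed information is
# J(theta) = sum_i [x_i/theta^2 + (1-x_i)/(1-theta)^2]; its expectation
# equals the expected information n * I(theta) = n/(theta*(1-theta)).
rng = np.random.default_rng(6)
theta, n = 0.3, 50
x = rng.binomial(1, theta, n).astype(float)

observed = (x / theta**2 + (1 - x) / (1 - theta)**2).sum()
expected = n / (theta * (1 - theta))
print(observed, expected)   # observed fluctuates around expected ~ 238.1
```

Rerunning with different seeds shows the spread of J(\theta) around n I(\theta); it shrinks (relatively) as n grows, which is the asymptotic agreement mentioned above.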

Watch Out

The Cramér-Rao bound applies only to unbiased estimators

A common mistake is to claim that no estimator can beat the Cramér-Rao bound. This is false. Biased estimators can have lower MSE than the Cramér-Rao bound. Ridge regression is a classic example: it introduces bias but can dramatically reduce variance, yielding lower total MSE. The bound only applies to the class of unbiased estimators.

Summary

  • Fisher information I(\theta) = \text{Var}[s(\theta; X)] = -\mathbb{E}[\ell''(\theta)] measures how informative data is about \theta
  • For n i.i.d. samples, total Fisher information is n I(\theta)
  • Cramér-Rao bound: \text{Var}(\hat{\theta}) \geq 1/(n I(\theta)) for unbiased estimators
  • MLE achieves the Cramér-Rao bound asymptotically (efficiency)
  • The Fisher information matrix defines the natural gradient: \tilde{\nabla} = \mathbf{I}^{-1} \nabla
  • High curvature of the log-likelihood means high Fisher information, which means precise estimation

Exercises

ExerciseCore

Problem

Compute the Fisher information I(\lambda) for a single observation from a Poisson distribution with parameter \lambda. What is the Cramér-Rao lower bound for an unbiased estimator of \lambda based on n i.i.d. observations?

ExerciseAdvanced

Problem

Let X_1, \ldots, X_n \sim \mathcal{N}(\mu, \sigma^2) where both \mu and \sigma^2 are unknown. Compute the 2 \times 2 Fisher information matrix \mathbf{I}(\mu, \sigma^2). Is the matrix diagonal? What does this mean?

ExerciseResearch

Problem

The natural gradient update is \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \, \mathbf{I}(\boldsymbol{\theta}_t)^{-1} \nabla L(\boldsymbol{\theta}_t). Show that the natural gradient is parameterization invariant: if you reparameterize \boldsymbol{\theta} = g(\boldsymbol{\phi}), the natural gradient step in \boldsymbol{\phi}-space produces the same update as in \boldsymbol{\theta}-space. Why does ordinary gradient descent not have this property?

References

Canonical:

  • Casella & Berger, Statistical Inference (2002), Chapter 7
  • Lehmann & Casella, Theory of Point Estimation (1998), Chapters 2-3

Last reviewed: April 2026
