Statistical Estimation
Fisher Information
The Fisher information quantifies how much a sample tells you about an unknown parameter: it measures the curvature of the log-likelihood, sets the Cramér-Rao lower bound on estimator variance, and serves as a natural Riemannian metric on parameter space.
Why This Matters
Fisher information = curvature of log-likelihood at the true parameter
Every time you fit a model to data, there is a fundamental limit on how precisely you can estimate the parameters. That limit is set by the Fisher information. It tells you: given the statistical model $p(x \mid \theta)$, how much information does a single observation carry about $\theta$?
The Cramér-Rao bound says no unbiased estimator can have variance smaller than the inverse of the Fisher information. MLE achieves this bound asymptotically, which is why MLE is the default method for parametric estimation.
Beyond classical statistics, Fisher information appears as the natural Riemannian metric on parameter space. This gives rise to the natural gradient, which plays a role in modern optimization for neural networks and policy gradient methods.
Mental Model
Think of the log-likelihood as a landscape over parameter space. The Fisher information measures the expected curvature of this landscape. If the curvature is high, the likelihood is sharply peaked around the true parameter, so even a small sample pins down $\theta$ precisely. If the curvature is low, the likelihood is flat and you need many samples to distinguish nearby parameter values.
Formal Setup and Notation
Let $X$ be a random variable with density $p(x \mid \theta)$ for $\theta \in \Theta \subseteq \mathbb{R}$ (scalar case first, then we generalize).
Score Function
The score function is the derivative of the log-likelihood with respect to $\theta$:

$$s(\theta; x) = \frac{\partial}{\partial \theta} \log p(x \mid \theta)$$
Under regularity conditions (interchange of differentiation and integration), the score has zero mean:

$$\mathbb{E}_\theta\big[s(\theta; X)\big] = 0$$
Fisher Information (Scalar)
The Fisher information $I(\theta)$ is the variance of the score function:

$$I(\theta) = \mathrm{Var}_\theta\big[s(\theta; X)\big] = \mathbb{E}_\theta\big[s(\theta; X)^2\big]$$
The second equality uses the fact that the score has zero mean.
Core Definitions
The Fisher information has a second, equivalent form that is often more convenient for computation.
Fisher Information as Expected Negative Hessian
Statement
Under standard regularity conditions:

$$I(\theta) = -\,\mathbb{E}_\theta\!\left[\frac{\partial^2}{\partial \theta^2} \log p(X \mid \theta)\right]$$
The Fisher information equals the expected curvature (negative second derivative) of the log-likelihood.
Intuition
High curvature means the log-likelihood changes rapidly as $\theta$ varies. This makes the peak of the likelihood sharp, so the data pins down $\theta$ precisely. Low curvature means a flat likelihood and poor identifiability.
Proof Sketch
Start from $\int p(x \mid \theta)\,dx = 1$. Differentiating both sides twice with respect to $\theta$ and interchanging differentiation and integration gives $\mathbb{E}_\theta\!\left[\partial_\theta^2 p / p\right] = 0$. Expanding the second derivative of the log, $\partial_\theta^2 \log p = \partial_\theta^2 p / p - \left(\partial_\theta p / p\right)^2$, and taking expectations then yields $\mathbb{E}_\theta\!\left[\partial_\theta^2 \log p\right] = -\mathbb{E}_\theta\big[s^2\big] = -I(\theta)$.
Why It Matters
This identity lets you compute Fisher information by taking the second derivative of the log-likelihood (often easier than computing the variance of the score directly). It also reveals that Fisher information is curvature.
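As a quick sanity check, the two forms can be compared by Monte Carlo. The Poisson setup below is an illustrative sketch, not an example from the text: for $\mathrm{Poisson}(\lambda)$ the score is $x/\lambda - 1$ and the second derivative of the log-likelihood is $-x/\lambda^2$, so both forms should give $I(\lambda) = 1/\lambda$.

```python
import numpy as np

# Illustrative check: Fisher information of Poisson(lam) is 1/lam,
# computed both as Var(score) and as E[-second derivative of log p].
rng = np.random.default_rng(0)
lam = 3.0
x = rng.poisson(lam, size=1_000_000)

score = x / lam - 1.0                 # d/dlam of log p(x | lam)
var_score = score.var()               # first form: variance of the score
neg_hessian = (x / lam**2).mean()     # second form: expected negative Hessian

print(var_score, neg_hessian, 1.0 / lam)  # all approximately 1/3
```

Both estimates agree with $1/\lambda$ up to Monte Carlo error, as the identity predicts.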
Failure Mode
The identity fails when the regularity conditions break down, for example when the support of $p(x \mid \theta)$ depends on $\theta$ (as in the $\mathrm{Uniform}(0, \theta)$ distribution). In such cases, interchange of differentiation and integration is invalid, so the score's mean-zero property fails. As a result, both identities break down: the score-variance quantity and the expected negative Hessian are no longer equal, and neither equals the Cramér-Rao denominator for unbiased estimators. The classical symptom is $\mathrm{Uniform}(0, \theta)$: the MLE $\hat\theta_n = \max_i X_i$ has variance of order $n^{-2}$, not $n^{-1}$, which would be impossible if a naive Fisher information plugged into the CRLB applied. The right treatment for such non-regular models uses different tools (order statistics, Hájek-Le Cam bounds); see Lehmann and Casella, Theory of Point Estimation §2.7.
Fisher Information for n Samples
If $X_1, \dots, X_n$ are i.i.d. from $p(x \mid \theta)$, the Fisher information of the full sample is:

$$I_n(\theta) = n\, I(\theta)$$
This follows from the additivity of variances for independent random variables. More data means proportionally more information about $\theta$.
The Fisher Information Matrix
Fisher Information Matrix
For a parameter vector $\theta \in \mathbb{R}^d$, the Fisher information matrix is the $d \times d$ matrix:

$$[I(\theta)]_{ij} = \mathbb{E}_\theta\!\left[\frac{\partial \log p(X \mid \theta)}{\partial \theta_i}\,\frac{\partial \log p(X \mid \theta)}{\partial \theta_j}\right]$$
Equivalently, under regularity conditions:

$$I(\theta) = -\,\mathbb{E}_\theta\!\left[\nabla_\theta^2 \log p(X \mid \theta)\right]$$
In matrix form: $I(\theta) = \mathbb{E}_\theta\!\left[\nabla_\theta \log p(X \mid \theta)\,\nabla_\theta \log p(X \mid \theta)^\top\right]$.
The Fisher information matrix is always positive semidefinite (it is a covariance matrix). It is positive definite when the model is identifiable.
Main Theorems
Cramér-Rao Lower Bound
Statement
Let $\hat\theta$ be any unbiased estimator of $\theta$ based on $n$ i.i.d. observations from $p(x \mid \theta)$. Then:

$$\mathrm{Var}(\hat\theta) \ge \frac{1}{n\, I(\theta)}$$
In the multivariate case, for any unbiased estimator $\hat\theta$ of $\theta \in \mathbb{R}^d$:

$$\mathrm{Cov}(\hat\theta) \succeq \frac{1}{n}\, I(\theta)^{-1}$$

where $\succeq$ denotes the positive semidefinite ordering.
Intuition
The Cramér-Rao bound says: no matter how clever your estimator is, if it is unbiased, its variance cannot beat $1/(n I(\theta))$. High Fisher information means a tighter bound and better estimation is possible. Low Fisher information means even the best estimator will be noisy.
Proof Sketch
By definition, $\mathbb{E}_\theta[\hat\theta] = \theta$. Differentiate both sides with respect to $\theta$ (interchanging differentiation and integration) to get $\mathrm{Cov}_\theta(\hat\theta, s_n) = 1$, where $s_n$ is the score of the full sample. Now apply the Cauchy-Schwarz inequality: $1 = \mathrm{Cov}_\theta(\hat\theta, s_n)^2 \le \mathrm{Var}(\hat\theta)\,\mathrm{Var}(s_n) = \mathrm{Var}(\hat\theta)\, n I(\theta)$. Rearrange.
Why It Matters
This is the fundamental impossibility result in estimation theory. It tells you the best possible precision for any unbiased estimator, and it lets you check whether a given estimator is efficient (achieves the bound).
Failure Mode
The bound applies only to unbiased estimators. Biased estimators (like ridge regression) can have lower mean squared error by trading bias for variance. The Cramér-Rao bound says nothing about such estimators.
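A small simulation makes this concrete. The setup below is an illustrative assumption (estimating a Gaussian mean, with an arbitrary shrinkage factor $c = 0.5$), not an example from the text: the biased estimator $c\,\bar{X}$ can have MSE below the Cramér-Rao bound $\sigma^2/n$ when the true mean is small.

```python
import numpy as np

# Illustrative parameters (assumed, not from the text).
rng = np.random.default_rng(1)
mu, sigma, n, c = 0.1, 1.0, 10, 0.5
crlb = sigma**2 / n                         # CRLB for UNBIASED estimators

# 100k replications of a sample of size n; xbar is the unbiased MLE.
xbar = rng.normal(mu, sigma, size=(100_000, n)).mean(axis=1)
mse_unbiased = ((xbar - mu) ** 2).mean()    # approximately sigma^2/n
mse_shrunk = ((c * xbar - mu) ** 2).mean()  # biased, but lower MSE here

print(mse_unbiased, mse_shrunk, crlb)       # shrunk MSE falls below the CRLB
```

The bound is not violated: the shrinkage estimator is simply outside the unbiased class the bound governs.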
Asymptotic Efficiency of MLE
Statement
Under standard regularity conditions, the MLE $\hat\theta_n$ is asymptotically efficient:

$$\sqrt{n}\,\big(\hat\theta_n - \theta_0\big) \xrightarrow{d} \mathcal{N}\!\left(0,\; I(\theta_0)^{-1}\right)$$
The asymptotic variance of the MLE equals the Cramér-Rao lower bound.
Intuition
MLE extracts all available information from the data. No other consistent estimator can do better asymptotically. This is why MLE is the default choice for parametric estimation.
Proof Sketch
Taylor-expand the score of the sample around the true parameter $\theta_0$ and set it to zero (the MLE condition $\sum_i s(\hat\theta_n; X_i) = 0$). By the law of large numbers, the averaged second-derivative term converges to $-I(\theta_0)$. By the central limit theorem, $n^{-1/2} \sum_i s(\theta_0; X_i)$ is asymptotically normal with variance $I(\theta_0)$. Combining the two gives the result.
Why It Matters
This theorem justifies MLE as the gold standard for parametric models. It means you cannot improve on MLE (in terms of asymptotic variance) without either using prior information or accepting bias.
Failure Mode
The efficiency result is asymptotic. In finite samples, MLE can be outperformed by biased estimators (James-Stein shrinkage) or by Bayesian estimators with informative priors. The result also fails for non-regular models, for example when the true parameter lies on the boundary of the parameter space.
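The efficiency claim can be sketched numerically. The Bernoulli setup below is a hypothetical illustration: the MLE for $\mathrm{Bernoulli}(p)$ is the sample mean, and its variance should match $1/(n I(p)) = p(1-p)/n$.

```python
import numpy as np

# Illustrative check: variance of the Bernoulli MLE vs the CRLB.
rng = np.random.default_rng(2)
p, n = 0.3, 200
I = 1.0 / (p * (1 - p))                # Fisher information per observation

# 200k replications; the MLE is the fraction of successes.
p_hat = rng.binomial(n, p, size=200_000) / n
var_mle = p_hat.var()

print(var_mle, 1.0 / (n * I))          # both approximately 0.00105
```

For this model the bound is attained exactly at every $n$ (the sample mean is efficient in finite samples too); in general models the match is only asymptotic.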
Fisher Information and the Natural Gradient
The Fisher information matrix defines a Riemannian metric on parameter space. In this geometry, the steepest-descent direction is not the ordinary gradient but the natural gradient:

$$\tilde\nabla L(\theta) = I(\theta)^{-1}\, \nabla L(\theta)$$
The natural gradient accounts for the curvature of the parameter space. Two parameter values that are close in Euclidean distance might produce very different distributions, and vice versa. The Fisher metric captures this.
In practice, computing $I(\theta)^{-1}$ is expensive for large models. Approximations like K-FAC (Kronecker-Factored Approximate Curvature) make natural gradient methods practical for deep learning. The natural policy gradient in reinforcement learning (used in TRPO) is a direct application.
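A minimal sketch of natural versus ordinary gradient ascent, using a one-parameter Bernoulli model where $I(p) = 1/(p(1-p))$ is available in closed form. This is an illustrative toy setup (assumed data and step size), not a production method: note that the natural step $\eta\,I(p)^{-1}\nabla$ simplifies to $\eta(\bar{x} - p)$, a uniform step in distribution space regardless of where $p$ sits.

```python
import numpy as np

# Toy data: 1000 Bernoulli draws with true p = 0.8 (illustrative).
rng = np.random.default_rng(3)
x = rng.binomial(1, 0.8, size=1000)
xbar = x.mean()

def grad(p):
    # d/dp of the mean log-likelihood xbar*log p + (1-xbar)*log(1-p)
    return xbar / p - (1 - xbar) / (1 - p)

p_plain, p_nat, eta = 0.1, 0.1, 0.1
for _ in range(100):
    # Ordinary gradient ascent (clipped to keep p in (0, 1)).
    p_plain = np.clip(p_plain + eta * grad(p_plain), 1e-6, 1 - 1e-6)
    # Natural gradient: precondition by I(p)^{-1} = p(1-p).
    fisher = 1.0 / (p_nat * (1 - p_nat))
    p_nat = np.clip(p_nat + eta * grad(p_nat) / fisher, 1e-6, 1 - 1e-6)

print(p_plain, p_nat, xbar)   # both converge to the MLE xbar
```

The ordinary gradient takes huge steps where the likelihood is steep and tiny ones elsewhere; the natural gradient's effective step size is the same everywhere on the probability simplex.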
Canonical Examples
Fisher information for the Bernoulli distribution
Let $X \sim \mathrm{Bernoulli}(p)$. Then $\log p(x \mid p) = x \log p + (1 - x) \log(1 - p)$. The score is $\frac{x}{p} - \frac{1 - x}{1 - p} = \frac{x - p}{p(1 - p)}$. Computing the variance:

$$I(p) = \frac{\mathrm{Var}(X)}{p^2 (1 - p)^2} = \frac{1}{p(1 - p)}$$
Near $p = 0$ or $p = 1$, the Fisher information is large (a single observation tells you a lot). Near $p = 1/2$, it is smallest ($I(1/2) = 4$): the most uncertain case is hardest to estimate.
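A quick numeric check of $I(p) = 1/(p(1-p))$ via the variance of the score, for a few illustrative values of $p$:

```python
import numpy as np

# For each p, the sample variance of the score should approach 1/(p(1-p)).
rng = np.random.default_rng(4)
for p in (0.1, 0.5, 0.9):
    x = rng.binomial(1, p, size=1_000_000)
    score = (x - p) / (p * (1 - p))
    print(p, score.var(), 1.0 / (p * (1 - p)))
```

The printed values show the U-shape: large information near the endpoints, a minimum of 4 at $p = 1/2$.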
Fisher information for the Gaussian (known variance)
Let $X \sim \mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ known. Then $\log p(x \mid \mu) = -\frac{(x - \mu)^2}{2\sigma^2} + \mathrm{const}$. The second derivative with respect to $\mu$ is $-1/\sigma^2$, so:

$$I(\mu) = \frac{1}{\sigma^2}$$
The Cramér-Rao bound gives $\mathrm{Var}(\hat\mu) \ge \sigma^2 / n$. The sample mean $\bar{X}_n$ achieves this bound exactly (not just asymptotically), so $\bar{X}_n$ is a uniformly minimum variance unbiased estimator (UMVUE).
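This exactness is easy to confirm by simulation (parameter values here are illustrative):

```python
import numpy as np

# Variance of the sample mean vs the Cramér-Rao bound sigma^2/n.
rng = np.random.default_rng(5)
mu, sigma, n = 2.0, 1.5, 25
xbar = rng.normal(mu, sigma, size=(200_000, n)).mean(axis=1)

print(xbar.var(), sigma**2 / n)   # both approximately 0.09
```

Unlike the generic asymptotic guarantee for MLE, here the bound is met at every finite $n$.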
Common Confusions
Fisher information is not the same as observed information
The Fisher information $I(\theta)$ is an expectation over the data distribution. The observed information $J(\theta) = -\ell''(\theta)$, the negative second derivative of the sample log-likelihood (evaluated at the data you actually observed), is a random quantity. Asymptotically they agree, but in finite samples they can differ. Some statisticians prefer using $J(\hat\theta)$ for inference because it conditions on the observed data.
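To see the distinction concretely, the sketch below evaluates the observed information for Bernoulli samples at the true parameter (an illustrative choice of evaluation point; in practice one plugs in the MLE). The observed information fluctuates from sample to sample around its expectation $n I(p)$.

```python
import numpy as np

# Observed information J = -l''(p) for Bernoulli samples, at the true p.
rng = np.random.default_rng(6)
p, n = 0.3, 50
x = rng.binomial(1, p, size=(100_000, n))   # 100k samples of size n

# -d^2/dp^2 log-likelihood = sum_i [x_i/p^2 + (1-x_i)/(1-p)^2]
J = (x / p**2 + (1 - x) / (1 - p) ** 2).sum(axis=1)

print(J.mean(), n / (p * (1 - p)))   # mean of J matches n*I(p)
print(J.std())                       # but J itself is genuinely random
```

The Fisher information $nI(p)$ is the fixed center; $J$ is the data-dependent quantity scattered around it.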
The Cramér-Rao bound applies only to unbiased estimators
A common mistake is to claim that no estimator can beat the Cramér-Rao bound. This is false. Biased estimators can have lower MSE than the Cramér-Rao bound. Ridge regression is a classic example: it introduces bias but can dramatically reduce variance, yielding lower total MSE. The bound only applies to the class of unbiased estimators.
Summary
- Fisher information $I(\theta)$ measures how informative data is about $\theta$
- For $n$ i.i.d. samples, total Fisher information is $I_n(\theta) = n I(\theta)$
- Cramér-Rao bound: $\mathrm{Var}(\hat\theta) \ge 1/(n I(\theta))$ for unbiased estimators
- MLE achieves the Cramér-Rao bound asymptotically (efficiency)
- Fisher information matrix defines the natural gradient: $\tilde\nabla L = I(\theta)^{-1} \nabla L$
- High curvature of log-likelihood means high Fisher information means precise estimation
Exercises
Problem
Compute the Fisher information $I(\lambda)$ for a single observation from a Poisson distribution with parameter $\lambda$. What is the Cramér-Rao lower bound for an unbiased estimator of $\lambda$ based on $n$ i.i.d. observations?
Problem
Let $X \sim \mathcal{N}(\mu, \sigma^2)$ where both $\mu$ and $\sigma^2$ are unknown. Compute the $2 \times 2$ Fisher information matrix $I(\mu, \sigma^2)$. Is the matrix diagonal? What does this mean?
Problem
The natural gradient update is $\theta \leftarrow \theta - \eta\, I(\theta)^{-1} \nabla L(\theta)$. Show that the natural gradient is parameterization invariant: if you reparameterize $\phi = g(\theta)$ for a smooth invertible $g$, the natural gradient step in $\phi$-space produces (to first order in the step size) the same update as in $\theta$-space. Why does ordinary gradient descent not have this property?
References
Canonical:
- Casella & Berger, Statistical Inference (2002), Chapter 7
- Lehmann & Casella, Theory of Point Estimation (1998), Chapters 2-3
Current:
- Amari, Information Geometry and Its Applications (2016)
- Martens, "New insights and perspectives on the natural gradient method" (2020)
- van der Vaart, Asymptotic Statistics (1998), Chapters 2-8
Next Topics
The natural next steps from Fisher information:
- Hypothesis testing for ML: Fisher information in likelihood ratio tests
- Minimax lower bounds: Fisher information sets local minimax rates
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)