Mathematical Infrastructure
Information Geometry
Riemannian geometry on the space of probability distributions: the Fisher information metric, natural gradient descent, exponential families as dually flat manifolds, and the connection to mirror descent.
Why This Matters
Standard gradient descent treats parameter space as Euclidean: a step of size $\epsilon$ in $\theta$ always means the same thing regardless of where you are. But a small change in $\theta$ can produce a huge change in the distribution $p_\theta$ in one region and a negligible change in another. The Fisher information metric measures distances in distribution space, not parameter space.
The natural gradient (gradient with respect to the Fisher metric) corrects for this. It is invariant to reparameterization: the update is the same whether you parameterize a Gaussian by $(\mu, \sigma)$ or $(\mu, \sigma^2)$ or $(\mu, \log \sigma)$. This invariance is why natural gradient methods often converge faster than vanilla SGD.
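For a Gaussian with known variance, a mean shift $\delta$ induces a KL divergence of $\delta^2 / (2\sigma^2)$, so the same Euclidean step can be enormous or invisible in distribution space depending on $\sigma$. A minimal numerical sketch (all numbers are illustrative):

```python
def kl_gauss(mu1, mu2, sigma):
    # KL(N(mu1, sigma^2) || N(mu2, sigma^2)) = (mu1 - mu2)^2 / (2 sigma^2)
    return (mu1 - mu2) ** 2 / (2 * sigma ** 2)

delta = 0.1  # the same Euclidean step in parameter space...
print(kl_gauss(0.0, delta, sigma=0.01))  # ...is a huge move in KL: 50.0
print(kl_gauss(0.0, delta, sigma=10.0))  # ...or a negligible one: 5e-05
```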
Formal Setup
Statistical Manifold
A statistical manifold is a family of probability distributions $\mathcal{M} = \{p_\theta : \theta \in \Theta \subseteq \mathbb{R}^d\}$, where the map $\theta \mapsto p_\theta$ is smooth and injective.
Fisher Information Metric
The Fisher information metric on $\mathcal{M}$ is the Riemannian metric tensor:
$$g_{ij}(\theta) = \mathbb{E}_{x \sim p_\theta}\!\left[\frac{\partial \log p_\theta(x)}{\partial \theta_i}\,\frac{\partial \log p_\theta(x)}{\partial \theta_j}\right],$$
where $F(\theta) = [g_{ij}(\theta)]$ is the Fisher information matrix.
The Fisher metric is the unique (up to scale) Riemannian metric on that is invariant under sufficient statistics. This is the Chentsov theorem, originally proved for finite sample spaces (Markov morphisms between finite probability simplices); the extension to smooth families on continuous sample spaces is due to Ay, Jost, Lê, and Schwachhöfer (2015).
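The definition can be made concrete by Monte Carlo: sample from $p_\theta$, compute the score, and average its square (the outer product, here one-dimensional). A sketch for a Bernoulli model; the parameter value and sample size are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3
x = (rng.random(200_000) < p).astype(float)   # Bernoulli(p) samples

# Score of the Bernoulli log-likelihood: d/dp [x log p + (1 - x) log(1 - p)]
score = x / p - (1 - x) / (1 - p)

fisher_mc = np.mean(score ** 2)      # Monte Carlo estimate of E[score^2]
fisher_exact = 1 / (p * (1 - p))     # analytic Fisher information
print(fisher_mc, fisher_exact)       # both close to 4.76
```

Note that the score averages to zero over samples, which is the fact used in the proof sketch below that the first-order term of the KL expansion vanishes.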
Main Theorems
Fisher Metric as KL Hessian
Statement
The Fisher information metric equals the Hessian of the KL divergence at $\theta' = \theta$:
$$g_{ij}(\theta) = \frac{\partial^2}{\partial \theta'_i\, \partial \theta'_j} D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta'}) \Big|_{\theta' = \theta}.$$
Equivalently, for nearby parameters:
$$D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta + \delta}) = \tfrac{1}{2}\, \delta^\top F(\theta)\, \delta + O(\|\delta\|^3).$$
Intuition
The KL divergence is the natural "distance" between distributions (though not symmetric). Its local quadratic approximation is the Fisher metric. So the Fisher metric measures how fast distributions change in the KL sense as you move in parameter space.
Proof Sketch
Expand $D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta+\delta})$ in $\delta$ around $\delta = 0$ to second order. The first-order term vanishes because the score has zero mean. The second-order term gives $\tfrac{1}{2}\,\delta^\top\, \mathbb{E}\!\left[-\nabla^2_\theta \log p_\theta(x)\right] \delta$, and the expected Hessian of the negative log-likelihood equals the Fisher information under standard regularity conditions.
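The Hessian identity can be verified by finite differences for the Bernoulli family, whose Fisher information is $1/(p(1-p))$ (the parameter value and step size below are arbitrary):

```python
import numpy as np

def kl_bern(p, q):
    # KL(Bernoulli(p) || Bernoulli(q))
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

p, eps = 0.3, 1e-4
# Central second difference of KL(p || q) in q, evaluated at q = p
hess = (kl_bern(p, p + eps) - 2 * kl_bern(p, p) + kl_bern(p, p - eps)) / eps ** 2
fisher = 1 / (p * (1 - p))
print(hess, fisher)   # both close to 4.762
```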
Why It Matters
This result connects three perspectives: the Riemannian metric on the statistical manifold, the curvature of the KL divergence, and the Fisher information from estimation theory. It is the foundation for natural gradient descent.
Failure Mode
The approximation is only valid for small $\delta$. For large parameter changes, the KL divergence and the quadratic form $\tfrac{1}{2}\,\delta^\top F(\theta)\,\delta$ can differ substantially. This is why trust region methods (like TRPO) enforce explicit constraints on the step size.
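The breakdown for large steps is easy to see in the same Bernoulli family: the quadratic form tracks the KL divergence for small $\delta$ but undershoots it badly for large $\delta$ (the step sizes below are illustrative):

```python
import numpy as np

def kl_bern(p, q):
    # KL(Bernoulli(p) || Bernoulli(q))
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

p = 0.3
F = 1 / (p * (1 - p))           # Fisher information of Bernoulli(p)

for delta in (0.01, 0.6):
    exact = kl_bern(p, p + delta)
    quad = 0.5 * F * delta ** 2  # Fisher quadratic approximation
    print(delta, exact, quad)    # close at 0.01; ~1.03 vs ~0.86 at 0.6
```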
Natural Gradient Invariance
Statement
The natural gradient of a loss $L(\theta)$ is:
$$\tilde{\nabla} L(\theta) = F(\theta)^{-1}\, \nabla L(\theta).$$
If $\phi = h(\theta)$ is a smooth reparameterization, then the natural gradient update in $\phi$-coordinates produces the same distribution update as in $\theta$-coordinates. That is, the trajectory in distribution space is invariant to the choice of parameterization.
Intuition
Euclidean gradient descent moves in the direction of steepest descent measured in norm on parameter space. Natural gradient descent moves in the direction of steepest descent measured in KL divergence on distribution space. Since KL divergence is a property of distributions (not parameters), the result does not depend on how you parameterize the family.
Proof Sketch
Under the reparameterization $\theta = \psi(\phi)$, the Fisher metric transforms as $F_\phi = J^\top F_\theta J$, where $J = \partial\theta/\partial\phi$ is the Jacobian. The gradient transforms as $\nabla_\phi L = J^\top \nabla_\theta L$. Therefore $F_\phi^{-1} \nabla_\phi L = J^{-1} F_\theta^{-1} J^{-\top} J^\top \nabla_\theta L = J^{-1} F_\theta^{-1} \nabla_\theta L$, which is exactly the Jacobian-transformed version of the natural gradient in $\theta$-coordinates.
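The transformation rule can be checked in one dimension for $\mathcal{N}(0, \sigma^2)$, using the standard facts that the Fisher information is $2/\sigma^2$ in the parameter $\sigma$ and $1/(2v^2)$ in the parameter $v = \sigma^2$ (the value of $\sigma$ is arbitrary):

```python
import numpy as np

sigma = 1.7
v = sigma ** 2                  # alternative parameterization: variance

F_sigma = 2 / sigma ** 2        # Fisher info of N(0, sigma^2) w.r.t. sigma
F_v = 1 / (2 * v ** 2)          # Fisher info w.r.t. v

J = 1 / (2 * np.sqrt(v))        # Jacobian d(sigma)/d(v) at sigma = sqrt(v)
print(J * F_sigma * J, F_v)     # equal: F_v = J^T F_sigma J
```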
Why It Matters
Parameterization invariance means you do not need to worry about choosing the "right" parameterization. For neural networks, this is significant because the loss landscape depends heavily on the parameterization. Adam and other adaptive methods approximate the natural gradient by using diagonal approximations to $F(\theta)$.
Failure Mode
Computing $F(\theta)^{-1}$ costs $O(d^3)$ time (and $O(d^2)$ memory to store $F$) for $d$ parameters, which is intractable for neural networks with millions of parameters. Practical approximations (K-FAC, diagonal Fisher, empirical Fisher) trade off accuracy for computational feasibility, and these approximations break the invariance property.
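The loss of invariance under a diagonal approximation shows up already in a two-parameter linear toy example (the matrix and Jacobian below are arbitrary illustrative choices, not a real model's Fisher):

```python
import numpy as np

F = np.array([[2.0, 1.0], [1.0, 2.0]])      # a non-diagonal Fisher matrix
g = np.array([1.0, 0.0])                    # loss gradient in theta-coordinates
J = np.array([[1.0, 1.0], [0.0, 1.0]])      # Jacobian of a linear reparam

# Exact natural gradient is invariant: both routes give the same theta-step
F_phi = J.T @ F @ J
g_phi = J.T @ g
exact_theta = np.linalg.solve(F, g)
exact_from_phi = J @ np.linalg.solve(F_phi, g_phi)
print(np.allclose(exact_theta, exact_from_phi))   # True

# Diagonal-Fisher preconditioning is not invariant
diag_theta = g / np.diag(F)
diag_from_phi = J @ (g_phi / np.diag(F_phi))
print(np.allclose(diag_theta, diag_from_phi))     # False
```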
Exponential Families and Dual Flatness
Exponential families have a special place in information geometry. An exponential family $p_\theta(x) = h(x) \exp(\theta^\top T(x) - A(\theta))$, with sufficient statistic $T$ and log-partition function $A$, admits two natural coordinate systems:
- Natural parameters $\theta$ (the canonical coordinates)
- Expectation parameters $\mu = \mathbb{E}_{p_\theta}[T(x)]$
These are related by the Legendre transform: $\mu = \nabla A(\theta)$ and $\theta = \nabla A^*(\mu)$, where $A^*(\mu) = \sup_\theta (\theta^\top \mu - A(\theta))$. The manifold is dually flat: it is flat in both coordinate systems simultaneously, with a pair of dual affine connections.
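For the Bernoulli family in natural parameter $\theta = \log(p/(1-p))$, the log-partition function is $A(\theta) = \log(1 + e^\theta)$ and the expectation parameter is the sigmoid of $\theta$. A finite-difference check of $\mu = \nabla A(\theta)$ (the evaluation point and step size are arbitrary):

```python
import numpy as np

def A(theta):
    # Bernoulli log-partition in the natural parameter theta = log(p / (1 - p))
    return np.log1p(np.exp(theta))

theta, eps = 0.8, 1e-5
mu_numeric = (A(theta + eps) - A(theta - eps)) / (2 * eps)  # central difference
mu_exact = 1 / (1 + np.exp(-theta))                         # sigmoid(theta)
print(mu_numeric, mu_exact)     # expectation parameter mu = A'(theta)
```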
The KL divergence takes a simple form, the Bregman divergence of the log-partition function:
$$D_{\mathrm{KL}}(p_{\theta_1} \,\|\, p_{\theta_2}) = A(\theta_2) - A(\theta_1) - \nabla A(\theta_1)^\top (\theta_2 - \theta_1) = B_A(\theta_2, \theta_1).$$
Connection to Mirror Descent
Mirror descent with the log-partition function as the mirror map is equivalent to natural gradient descent on an exponential family. The update:
$$\theta_{t+1} = \theta_t - \eta\, F(\theta_t)^{-1}\, \nabla_\theta L(\theta_t)$$
in natural parameters is a mirror descent step with Bregman divergence $B_A$. This unifies two important optimization frameworks:
- Natural gradient: use the Fisher metric to precondition gradients
- Mirror descent: use a Bregman divergence to define the proximity term
For exponential families, these are the same algorithm.
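A one-dimensional sketch for the Bernoulli family, using the dual-coordinate form of mirror descent (mirror map $A^*$, so the update is written in $\theta = \nabla A^*(\mu)$). The loss and constants are illustrative assumptions:

```python
import numpy as np

def sigmoid(t):                  # mu = A'(theta) for A(theta) = log(1 + e^theta)
    return 1 / (1 + np.exp(-t))

def logit(m):                    # theta = (A*)'(mu), the inverse map
    return np.log(m / (1 - m))

theta, target, eta = 0.5, 0.9, 0.05
mu = sigmoid(theta)
dL_dmu = 2 * (mu - target)       # gradient of the loss L(mu) = (mu - target)^2

# Natural gradient step in theta: F(theta) = A''(theta) = mu (1 - mu)
F = mu * (1 - mu)
dL_dtheta = F * dL_dmu           # chain rule through mu = sigmoid(theta)
theta_natural = theta - eta * dL_dtheta / F

# Mirror descent step: the theta-update is the plain gradient w.r.t. mu
theta_mirror = logit(mu) - eta * dL_dmu
print(theta_natural, theta_mirror)   # identical
```

The Fisher preconditioner $F^{-1} = (\nabla^2 A)^{-1}$ exactly cancels the $\nabla^2 A$ introduced by the chain rule, which is the one-line reason the two algorithms coincide here.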
Common Confusions
The empirical Fisher is not the true Fisher
In practice, many implementations compute the "Fisher information matrix" using $\nabla_\theta \log p_\theta(x_i)\, \nabla_\theta \log p_\theta(x_i)^\top$ averaged over training data. This is the empirical Fisher, which equals the true Fisher only when the model is correctly specified (i.e., the true distribution is in the family). The empirical Fisher is still a PSD outer-product average, so it defines a (possibly degenerate) Riemannian form; the real issue is that under misspecification it converges to the wrong limit and is no longer invariant under sufficient statistics, so it is not the information-theoretically correct Fisher metric for the model family. See Kunstner, Hennig, and Balles (2019) and Martens (2020, §5).
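A sketch of the misspecification point for the location model $\mathcal{N}(\theta, 1)$, whose true Fisher information is exactly 1. The data-generating variances below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.0                       # model: N(theta, 1)

# Well-specified data: drawn from the model itself
x_good = rng.normal(theta, 1.0, 100_000)
# Misspecified data: wrong variance
x_bad = rng.normal(theta, 3.0, 100_000)

# Score of N(theta, 1): d/dtheta log p(x) = x - theta
emp_fisher_good = np.mean((x_good - theta) ** 2)  # ~1.0, matches true Fisher
emp_fisher_bad = np.mean((x_bad - theta) ** 2)    # ~9.0, converges to wrong limit
print(emp_fisher_good, emp_fisher_bad)            # true Fisher is exactly 1.0
```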
Adam is not the natural gradient
Adam uses $\sqrt{\hat{v}_t}$, a running average of squared gradients, as a diagonal preconditioner. This resembles a diagonal approximation to $F(\theta)$, but Adam uses squared gradients of the loss, not squared score functions, and applies the power $-1/2$ rather than $-1$. The two coincide only for specific loss functions and model classes.
Canonical Examples
Natural gradient for a Gaussian mean
Consider $p_\mu = \mathcal{N}(\mu, \sigma^2)$ with known variance $\sigma^2$. The Fisher information is $F(\mu) = 1/\sigma^2$. The natural gradient of a loss $L(\mu)$ is $\tilde{\nabla} L = \sigma^2\, \partial L / \partial \mu$. This rescales the gradient by the variance, taking larger steps when the distribution is broad (uncertain) and smaller steps when it is narrow (confident). Now reparameterize by any smooth invertible map $\nu = h(\mu)$. Standard gradient descent gives a different trajectory in distribution space, but natural gradient gives the same trajectory.
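Taking the linear special case $\nu = c\mu$, the invariance can be checked in a few lines (all constants are illustrative):

```python
sigma, eta, c = 2.0, 0.1, 5.0    # c: arbitrary rescaling nu = c * mu
dL_dmu = 0.7                     # gradient of some loss at the current mean

# Natural gradient step directly in mu (Fisher = 1 / sigma^2)
step_mu = -eta * sigma ** 2 * dL_dmu

# Natural gradient step in nu = c * mu, mapped back to mu
F_nu = (1 / c) ** 2 / sigma ** 2        # Fisher transforms by (dmu/dnu)^2
dL_dnu = dL_dmu / c                     # chain rule
step_nu = -eta * dL_dnu / F_nu
print(step_nu / c, step_mu)             # identical steps in mu

# Euclidean gradient descent, by contrast, is not invariant:
print(-eta * dL_dnu / c, -eta * dL_dmu) # mu-steps differ by a factor of c^2
```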
Key Takeaways
- The Fisher information metric is the Hessian of KL divergence at zero separation
- Natural gradient is invariant to reparameterization
- Exponential families are dually flat: natural parameters and expectation parameters are Legendre dual
- Mirror descent with the log-partition function equals natural gradient on exponential families
- Computing the exact natural gradient is $O(d^3)$; practical methods use approximations
Exercises
Problem
For a Bernoulli distribution with parameter $\theta \in (0, 1)$, compute the Fisher information $F(\theta)$ and write down the natural gradient update for minimizing a loss $L(\theta)$.
Problem
Show that for an exponential family with log-partition function $A(\theta)$, the Fisher information matrix equals the Hessian of $A$: $F(\theta) = \nabla^2 A(\theta)$. Use this to prove that the natural gradient update in natural parameters is $\theta_{t+1} = \theta_t - \eta\, \nabla_\mu L$ (i.e., the Fisher preconditioning cancels the Hessian of $A$, leaving the plain gradient with respect to the expectation parameters $\mu = \nabla A(\theta)$).
References
Canonical:
- Amari, Information Geometry and Its Applications (2016), Chapters 1-4
- Amari, "Natural Gradient Works Efficiently in Learning" (Neural Computation 1998)
- Chentsov, Statistical Decision Rules and Optimal Inference (1982 English translation), AMS Translations Vol. 53 (finite sample space uniqueness of the Fisher metric)
Current:
- Martens, "New Insights and Perspectives on the Natural Gradient Method" (JMLR 2020), §5 on empirical Fisher
- Raskutti & Mukherjee, "The Information Geometry of Mirror Descent" (2015)
- Ay, Jost, Lê, and Schwachhöfer, "Information geometry and sufficient statistics" (Probability Theory and Related Fields, 2015), extending Chentsov to continuous sample spaces
- Kunstner, Hennig, and Balles, "Limitations of the Empirical Fisher Approximation for Natural Gradient Descent" (NeurIPS 2019, arXiv:1905.12558)
- Nielsen, "An Elementary Introduction to Information Geometry" (Entropy 2020, arXiv:1808.08271)
Next Topics
- Optimizer theory (SGD, Adam, Muon): practical approximations to the natural gradient
- Mean field theory: information geometry in variational inference
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Fisher Information (Layer 0B)
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)
- Convex Duality (Layer 0B)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)