
Mathematical Infrastructure

Information Geometry

Riemannian geometry on the space of probability distributions: the Fisher information metric, natural gradient descent, exponential families as dually flat manifolds, and the connection to mirror descent.

Advanced · Tier 3 · Stable · ~60 min

Why This Matters

Standard gradient descent treats parameter space as Euclidean: a step of size $\epsilon$ in $\theta$ always means the same thing regardless of where you are. But a small change in $\theta$ can produce a huge change in the distribution $p_\theta$ in one region and a negligible change in another. The Fisher information metric measures distances in distribution space, not parameter space.

The natural gradient (the gradient with respect to the Fisher metric) corrects for this. It is invariant to reparameterization: the update is the same whether you parameterize a Gaussian by $(\mu, \sigma)$, $(\mu, \sigma^2)$, or $(\mu, 1/\sigma^2)$. This invariance is why natural gradient methods often converge faster than vanilla SGD.

Formal Setup

Definition

Statistical Manifold

A statistical manifold $\mathcal{M} = \{p_\theta : \theta \in \Theta\}$ is a family of probability distributions parameterized by $\theta \in \Theta \subseteq \mathbb{R}^d$, where the map $\theta \mapsto p_\theta$ is smooth and injective.

Definition

Fisher Information Metric

The Fisher information metric on $\mathcal{M}$ is the Riemannian metric tensor:

$$g_{ij}(\theta) = \mathbb{E}_{x \sim p_\theta}\left[\frac{\partial \log p_\theta(x)}{\partial \theta_i} \frac{\partial \log p_\theta(x)}{\partial \theta_j}\right] = [I(\theta)]_{ij}$$

where $I(\theta)$ is the Fisher information matrix.

The Fisher metric is the unique (up to scale) Riemannian metric on $\mathcal{M}$ that is invariant under sufficient statistics. This is the Chentsov theorem, originally proved for finite sample spaces (Markov morphisms between finite probability simplices); the extension to smooth families on continuous sample spaces is due to Ay, Jost, Lê, and Schwachhöfer (2015).
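As a concrete sanity check (a sketch, not part of the source material): estimate the expectation in the definition by Monte Carlo for a Bernoulli family, where the closed form is $I(\theta) = 1/(\theta(1-\theta))$.

```python
import numpy as np

# Monte Carlo estimate of the Fisher information for Bernoulli(theta),
# compared against the closed form I(theta) = 1 / (theta * (1 - theta)).
rng = np.random.default_rng(0)
theta = 0.3
x = (rng.random(200_000) < theta).astype(float)   # samples from p_theta

# Score function: d/dtheta log p_theta(x) = x/theta - (1 - x)/(1 - theta)
score = x / theta - (1.0 - x) / (1.0 - theta)

fisher_mc = np.mean(score**2)                     # E[score^2]
fisher_exact = 1.0 / (theta * (1.0 - theta))
print(fisher_mc, fisher_exact)                    # close to each other
```

The score also has zero mean under $p_\theta$, which is the fact used in the KL-Hessian proof sketch below.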

Main Theorems

Theorem

Fisher Metric as KL Hessian

Statement

The Fisher information metric equals the Hessian of the KL divergence at $\theta' = \theta$:

$$g_{ij}(\theta) = \frac{\partial^2}{\partial \theta'_i \partial \theta'_j} \mathrm{KL}(p_\theta \,\|\, p_{\theta'}) \bigg|_{\theta' = \theta}$$

Equivalently, for nearby parameters:

$$\mathrm{KL}(p_\theta \,\|\, p_{\theta + \delta}) \approx \frac{1}{2} \delta^\top I(\theta)\, \delta$$

Intuition

The KL divergence is the natural "distance" between distributions (though not symmetric). Its local quadratic approximation is the Fisher metric. So the Fisher metric measures how fast distributions change in the KL sense as you move in parameter space.

Proof Sketch

Expand $\log p_{\theta'}(x)$ around $\theta' = \theta$ to second order. The first-order term vanishes because the score has zero mean. The second-order term gives $\mathrm{KL}(p_\theta \| p_{\theta + \delta}) \approx \frac{1}{2} \delta^\top \mathbb{E}[\nabla^2 (-\log p_\theta)]\, \delta$, and the expected Hessian of the negative log-likelihood equals the Fisher information under standard regularity conditions.

Why It Matters

This result connects three perspectives: the Riemannian metric on the statistical manifold, the curvature of the KL divergence, and the Fisher information from estimation theory. It is the foundation for natural gradient descent.

Failure Mode

The approximation $\mathrm{KL} \approx \frac{1}{2}\delta^\top I(\theta)\delta$ is only valid for small $\delta$. For large parameter changes, the KL divergence and the quadratic form can differ substantially. This is why trust region methods (like TRPO) enforce explicit constraints on the step size.
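Both the quadratic approximation and its breakdown can be seen numerically (a sketch for the Bernoulli family, where the KL divergence has a closed form):

```python
import numpy as np

# Check KL(p_theta || p_{theta+delta}) ≈ 0.5 * delta^2 * I(theta)
# for a Bernoulli family, where I(theta) = 1 / (theta * (1 - theta)).
def kl_bernoulli(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

theta = 0.4
fisher = 1.0 / (theta * (1.0 - theta))

for delta in [0.001, 0.01, 0.1, 0.5]:
    exact = kl_bernoulli(theta, theta + delta)
    quad = 0.5 * delta**2 * fisher
    print(delta, exact, quad)   # quadratic approx degrades as delta grows
```

For $\delta = 0.001$ the two agree to a fraction of a percent; for $\delta = 0.5$ they differ by roughly 30%, which is the failure mode the trust-region constraint guards against.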

Theorem

Natural Gradient Invariance

Statement

The natural gradient of a loss $L(\theta)$ is:

$$\tilde{\nabla}_\theta L = I(\theta)^{-1} \nabla_\theta L$$

If $\phi = f(\theta)$ is a smooth reparameterization, then the natural gradient update in $\phi$-coordinates produces the same distribution update as in $\theta$-coordinates: exactly for the continuous-time flow, and to first order in the step size for discrete updates. That is, the trajectory in distribution space $\{p_{\theta_t}\}$ does not depend on the choice of parameterization.

Intuition

Euclidean gradient descent moves $\theta$ in the direction of steepest descent measured in the $\ell^2$ norm on parameter space. Natural gradient descent moves $\theta$ in the direction of steepest descent measured in KL divergence on distribution space. Since KL divergence is a property of distributions (not parameters), the result does not depend on how you parameterize the family.

Proof Sketch

Under the reparameterization $\phi = f(\theta)$, the Fisher metric transforms as $I_\phi = J^{-\top} I_\theta J^{-1}$, where $J = \partial f / \partial \theta$ is the Jacobian. The gradient transforms as $\nabla_\phi L = J^{-\top} \nabla_\theta L$. Therefore $I_\phi^{-1} \nabla_\phi L = J\,(I_\theta^{-1} \nabla_\theta L)$, which is exactly the Jacobian-transformed version of the natural gradient in $\theta$-coordinates.
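The transformation rules above can be verified numerically in one dimension (an illustrative check with an arbitrary loss $(p - 0.8)^2$; not from the source): parameterize a Bernoulli by its mean $p$ or its logit $\phi$, and compare the two natural gradients.

```python
# 1-D check of the proof sketch: for phi = f(p) with Jacobian J = f'(p),
# I_phi = I_p / J**2 and dL/dphi = (dL/dp) / J, hence
#   I_phi^{-1} dL/dphi = J * (I_p^{-1} dL/dp).
# Family: Bernoulli(p); reparameterization: phi = logit(p).
p = 0.3
I_p = 1.0 / (p * (1.0 - p))      # Fisher information in the mean parameter
J = 1.0 / (p * (1.0 - p))        # d logit(p) / dp

dL_dp = 2.0 * (p - 0.8)          # gradient of the arbitrary loss (p - 0.8)^2
dL_dphi = dL_dp / J              # chain rule

I_phi = I_p / J**2               # metric transformation (1-D case)
nat_p = dL_dp / I_p              # natural gradient in p-coordinates
nat_phi = dL_dphi / I_phi        # natural gradient in phi-coordinates
print(nat_phi, J * nat_p)        # equal, as the theorem predicts
```

As a cross-check, $I_\phi = p(1-p)$ is exactly the Bernoulli Fisher information in natural (logit) parameters, consistent with the exponential-family section below.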

Why It Matters

Parameterization invariance means you do not need to worry about choosing the "right" parameterization. For neural networks this is significant, because the loss landscape depends heavily on the parameterization. Adam and other adaptive methods can be viewed as rough diagonal approximations to preconditioning by $I(\theta)^{-1}$.

Failure Mode

Computing $I(\theta)^{-1}$ is $O(d^3)$ for $d$ parameters, which is intractable for neural networks with millions of parameters. Practical approximations (K-FAC, diagonal Fisher, empirical Fisher) trade off accuracy for computational feasibility, and these approximations break the invariance property.

Exponential Families and Dual Flatness

Exponential families have a special place in information geometry. An exponential family $p_\theta(x) = h(x) \exp(\theta^\top T(x) - A(\theta))$ admits two natural coordinate systems:

  1. Natural parameters $\theta$ (the canonical coordinates)
  2. Expectation parameters $\eta = \mathbb{E}[T(x)] = \nabla A(\theta)$

These are related by the Legendre transform: $A^*(\eta) = \theta^\top \eta - A(\theta)$. The manifold is dually flat: it is flat in both coordinate systems simultaneously, with a pair of dual affine connections.

The KL divergence between members of the family is the Bregman divergence generated by $A$:

$$\mathrm{KL}(p_\theta \,\|\, p_{\theta'}) = D_A(\theta' \,\|\, \theta) = A(\theta') - A(\theta) - \nabla A(\theta)^\top(\theta' - \theta)$$
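This identity can be verified for a concrete family (a sketch, not from the source): the Bernoulli in natural parameters, with $T(x) = x$ and $A(\theta) = \log(1 + e^\theta)$.

```python
import numpy as np

# Bernoulli as an exponential family: p_theta(x) = exp(theta*x - A(theta)),
# A(theta) = log(1 + e^theta), eta = A'(theta) = sigmoid(theta).
# Check: KL(p_theta || p_theta') equals the Bregman divergence D_A(theta' || theta).
def A(t):
    return np.log1p(np.exp(t))

def eta(t):                      # expectation parameter (mean of x)
    return 1.0 / (1.0 + np.exp(-t))

def kl(t1, t2):                  # KL between two Bernoullis, natural params
    p, q = eta(t1), eta(t2)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def bregman_A(t2, t1):           # D_A(theta' || theta)
    return A(t2) - A(t1) - eta(t1) * (t2 - t1)

t1, t2 = 0.3, -1.2
print(kl(t1, t2), bregman_A(t2, t1))   # the two agree exactly
```

The agreement is exact (not just approximate) because for exponential families the expansion of the KL in natural parameters terminates: all the $x$-dependence sits in the linear term $\theta^\top T(x)$.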

Connection to Mirror Descent

Mirror descent and natural gradient descent coincide on exponential families. Because $I(\theta) = \nabla^2 A(\theta)$ and $\eta = \nabla A(\theta)$, the natural gradient update in natural parameters is

$$\theta_{t+1} = \theta_t - \alpha\, I(\theta_t)^{-1} \nabla_\theta L = \theta_t - \alpha\, \nabla_\eta L$$

which is a mirror descent step on the expectation parameters with mirror map $A^*$ and Bregman divergence $D_{A^*}$; dually, mirror descent with mirror map $A$ performs natural gradient descent in $\eta$-coordinates. This unifies two important optimization frameworks:

  • Natural gradient: use the Fisher metric to precondition gradients
  • Mirror descent: use a Bregman divergence to define the proximity term

For exponential families, these are the same algorithm.
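The cancellation $I(\theta)^{-1} \nabla_\theta L = \nabla_\eta L$ that underlies this equivalence can be checked directly (a sketch for the Bernoulli family, with a hypothetical loss $L(\eta) = (\eta - 0.8)^2$ chosen only for illustration):

```python
import numpy as np

# For an exponential family, the natural gradient in natural parameters
# equals the plain gradient in expectation parameters:
#   I(theta)^{-1} dL/dtheta = dL/deta,  since I(theta) = A''(theta) = deta/dtheta.
theta = 0.4
eta = 1.0 / (1.0 + np.exp(-theta))   # eta = sigmoid(theta) = A'(theta)
fisher = eta * (1.0 - eta)           # I(theta) = A''(theta)

dL_deta = 2.0 * (eta - 0.8)          # gradient in expectation parameters
dL_dtheta = dL_deta * fisher         # chain rule: deta/dtheta = fisher

nat_grad = dL_dtheta / fisher        # natural gradient in theta
print(nat_grad, dL_deta)             # identical
```

This is also the content of the advanced exercise below: preconditioning by $I(\theta)^{-1} = (\nabla^2 A)^{-1}$ converts the $\theta$-gradient into the $\eta$-gradient.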

Common Confusions

Watch Out

The empirical Fisher is not the true Fisher

In practice, many implementations compute the "Fisher information matrix" by averaging $\nabla \log p_\theta(x)\, \nabla \log p_\theta(x)^\top$ over the training data. This is the empirical Fisher, which equals the true Fisher only when the model is correctly specified (i.e., the true distribution lies in the family). The empirical Fisher is still a PSD outer-product average, so it defines a (possibly degenerate) Riemannian form; the real issue is that under misspecification it converges to the wrong limit and loses invariance under sufficient statistics, so it is not the information-theoretically correct Fisher metric for the model family. See Kunstner, Hennig, and Balles (2019) and Martens (2020, §5).
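A tiny closed-form illustration (assuming a Bernoulli model; the particular numbers are illustrative): the expected empirical Fisher matches the true Fisher exactly when the data distribution equals the model, and deviates from it otherwise.

```python
# Empirical vs true Fisher for a Bernoulli model p_theta, evaluated in
# expectation with data drawn from Bernoulli(q).
# score(x) = x/theta - (1 - x)/(1 - theta)
theta, q = 0.3, 0.7                    # model parameter vs data distribution

true_fisher = 1.0 / (theta * (1.0 - theta))               # E_{p_theta}[score^2]
emp_fisher = q / theta**2 + (1.0 - q) / (1.0 - theta)**2  # E_q[score^2]
print(true_fisher, emp_fisher)         # differ badly when q != theta

# When the model matches the data (theta = q), the two coincide:
emp_at_q = q / q**2 + (1.0 - q) / (1.0 - q)**2            # = 1/(q(1-q))
print(emp_at_q, 1.0 / (q * (1.0 - q)))
```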

Watch Out

Adam is not the natural gradient

Adam uses $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$ as a diagonal second-moment estimate and divides the update by $\sqrt{v_t}$. This resembles preconditioning by a diagonal approximation to $I(\theta)^{-1/2}$ rather than $I(\theta)^{-1}$, and it uses squared gradients of the loss, not squared score functions. The two coincide only for specific loss functions and model classes.

Canonical Examples

Example

Natural gradient for a Gaussian mean

Consider $p_\theta(x) = \mathcal{N}(x; \theta, \sigma^2)$ with known variance. The Fisher information is $I(\theta) = 1/\sigma^2$. The natural gradient of a loss $L(\theta)$ is $\sigma^2 \nabla_\theta L$. This rescales the gradient by the variance, taking larger steps when the distribution is broad (uncertain) and smaller steps when it is narrow (confident). Now reparameterize as $\phi = \theta^3$. Standard gradient descent gives a different trajectory in distribution space, but natural gradient gives the same trajectory.
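A numeric version of this example (a sketch; the quadratic loss $L(\theta) = \tfrac12(\theta - 1)^2$, step size, and starting point are illustrative choices, not from the source). Discrete natural gradient steps agree across parameterizations to first order in the step size, and both runs converge to the same distribution:

```python
import numpy as np

# N(theta, sigma^2) with sigma = 2; loss L = 0.5*(theta - 1)^2.
# Compare natural gradient descent in theta vs in phi = theta**3.
sigma, alpha, steps = 2.0, 0.1, 50

def dL(t):                           # dL/dtheta
    return t - 1.0

# Natural gradient in theta: I(theta) = 1/sigma^2, so step = alpha*sigma^2*dL
t = 3.0
for _ in range(steps):
    t -= alpha * sigma**2 * dL(t)

# Natural gradient in phi = theta**3: J = 3*theta^2, I_phi = (1/sigma^2)/J^2,
# dL/dphi = dL/J, so the phi step is alpha*sigma^2*J*dL
phi = 3.0**3
for _ in range(steps):
    th = np.cbrt(phi)
    phi -= alpha * sigma**2 * (3 * th**2) * dL(th)

print(t, np.cbrt(phi))               # both converge to theta = 1
```

At finite step size the intermediate iterates differ at $O(\alpha^2)$, consistent with the invariance theorem holding exactly only for the continuous-time flow; both runs nonetheless reach the same distribution.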

Key Takeaways

  • The Fisher information metric is the Hessian of the KL divergence at zero separation
  • The natural gradient $\tilde{\nabla} L = I(\theta)^{-1} \nabla L$ is invariant to reparameterization
  • Exponential families are dually flat: natural parameters and expectation parameters are Legendre dual
  • Mirror descent with the log-partition geometry recovers natural gradient descent on exponential families
  • Computing the exact natural gradient is $O(d^3)$; practical methods use approximations

Exercises

ExerciseCore

Problem

For a Bernoulli distribution $p_\theta(x) = \theta^x (1-\theta)^{1-x}$ with $\theta \in (0,1)$, compute the Fisher information $I(\theta)$ and write down the natural gradient update for minimizing a loss $L(\theta)$.

ExerciseAdvanced

Problem

Show that for an exponential family with log-partition function $A(\theta)$, the Fisher information matrix equals the Hessian of $A$: $I(\theta) = \nabla^2 A(\theta)$. Use this to prove that the natural gradient update in natural parameters is $\theta_{t+1} = \theta_t - \alpha \nabla_\eta L$, i.e., the Fisher preconditioning converts the gradient in $\theta$ into the gradient with respect to the expectation parameters $\eta = \nabla A(\theta)$.

References

Canonical:

  • Amari, Information Geometry and Its Applications (2016), Chapters 1-4
  • Amari, "Natural Gradient Works Efficiently in Learning" (Neural Computation 1998)
  • Chentsov, Statistical Decision Rules and Optimal Inference (1982 English translation), AMS Translations Vol. 53 (finite sample space uniqueness of the Fisher metric)

Current:

  • Martens, "New Insights and Perspectives on the Natural Gradient Method" (JMLR 2020), §5 on empirical Fisher
  • Raskutti & Mukherjee, "The Information Geometry of Mirror Descent" (2015)
  • Ay, Jost, Lê, and Schwachhöfer, "Information geometry and sufficient statistics" (Probability Theory and Related Fields, 2015), extending Chentsov to continuous sample spaces
  • Kunstner, Hennig, and Balles, "Limitations of the Empirical Fisher Approximation for Natural Gradient Descent" (NeurIPS 2019, arXiv:1905.12558)
  • Nielsen, "An Elementary Introduction to Information Geometry" (Entropy 2020, arXiv:1808.08271)

Next Topics

  • Optimizer theory (SGD, Adam, Muon): practical approximations to the natural gradient
  • Mean field theory: information geometry in variational inference

Last reviewed: April 2026
