
Mathematical Infrastructure

Information Geometry

Riemannian geometry on the space of probability distributions: the Fisher information metric, natural gradient descent, exponential families as dually flat manifolds, and the connection to mirror descent.

Advanced · Tier 3 · Stable · ~60 min

Why This Matters

Standard gradient descent treats parameter space as Euclidean: a step of size $\epsilon$ in $\theta$ always means the same thing regardless of where you are. But a small change in $\theta$ can produce a huge change in the distribution $p_\theta$ in one region and a negligible change in another. The Fisher information metric measures distances in distribution space, not parameter space.

The natural gradient (the gradient with respect to the Fisher metric) corrects for this. It is invariant to reparameterization: the update is the same whether you parameterize a Gaussian by $(\mu, \sigma)$, $(\mu, \sigma^2)$, or $(\mu, 1/\sigma^2)$. This invariance is why natural gradient methods often converge faster than vanilla SGD.

Formal Setup

Definition

Statistical Manifold

A statistical manifold $\mathcal{M} = \{p_\theta : \theta \in \Theta\}$ is a family of probability distributions parameterized by $\theta \in \Theta \subseteq \mathbb{R}^d$, where the map $\theta \mapsto p_\theta$ is smooth and injective.

Definition

Fisher Information Metric

The Fisher information metric on $\mathcal{M}$ is the Riemannian metric tensor:

$$g_{ij}(\theta) = \mathbb{E}_{x \sim p_\theta}\left[\frac{\partial \log p_\theta(x)}{\partial \theta_i} \frac{\partial \log p_\theta(x)}{\partial \theta_j}\right] = [I(\theta)]_{ij}$$

where $I(\theta)$ is the Fisher information matrix.

The Fisher metric is the unique (up to scale) Riemannian metric on $\mathcal{M}$ that is invariant under sufficient statistics. This is the Chentsov theorem, originally proved for finite sample spaces (Markov morphisms between finite probability simplices); the extension to smooth families on continuous sample spaces is due to Ay, Jost, Lê, and Schwachhöfer (2015).
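As a concrete sanity check (a sketch, not part of the source material): estimate the expectation in the definition by Monte Carlo for a Bernoulli family, where the closed form is $I(\theta) = 1/(\theta(1-\theta))$.

```python
import numpy as np

# Monte Carlo estimate of the Fisher information for Bernoulli(theta),
# compared against the closed form I(theta) = 1 / (theta * (1 - theta)).
rng = np.random.default_rng(0)
theta = 0.3
x = (rng.random(200_000) < theta).astype(float)   # samples from p_theta

# Score function: d/dtheta log p_theta(x) = x/theta - (1 - x)/(1 - theta)
score = x / theta - (1.0 - x) / (1.0 - theta)

fisher_mc = np.mean(score**2)                     # E[score^2]
fisher_exact = 1.0 / (theta * (1.0 - theta))
print(fisher_mc, fisher_exact)                    # close to each other
```

The score also has zero mean under $p_\theta$, which is the fact used in the KL-Hessian proof sketch below.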

Main Theorems

Theorem

Fisher Metric as KL Hessian

Statement

The Fisher information metric equals the Hessian of the KL divergence at $\theta' = \theta$:

$$g_{ij}(\theta) = \frac{\partial^2}{\partial \theta'_i \partial \theta'_j} \mathrm{KL}(p_\theta \,\|\, p_{\theta'}) \bigg|_{\theta' = \theta}$$

Equivalently, for nearby parameters:

$$\mathrm{KL}(p_\theta \,\|\, p_{\theta + \delta}) \approx \frac{1}{2} \delta^\top I(\theta)\, \delta$$

Intuition

The KL divergence is the natural "distance" between distributions (though not symmetric). Its local quadratic approximation is the Fisher metric. So the Fisher metric measures how fast distributions change in the KL sense as you move in parameter space.

Proof Sketch

Expand $\log p_{\theta'}(x)$ around $\theta' = \theta$ to second order. The first-order term vanishes because the score has zero mean. The second-order term gives $\mathrm{KL}(p_\theta \| p_{\theta + \delta}) \approx \frac{1}{2} \delta^\top \mathbb{E}[\nabla^2 (-\log p_\theta)]\, \delta$, and the expected Hessian of the negative log-likelihood equals the Fisher information under standard regularity conditions.

Why It Matters

This result connects three perspectives: the Riemannian metric on the statistical manifold, the curvature of the KL divergence, and the Fisher information from estimation theory. It is the foundation for natural gradient descent.

Failure Mode

The approximation $\mathrm{KL} \approx \frac{1}{2}\delta^\top I(\theta)\delta$ is only valid for small $\delta$. For large parameter changes, the KL divergence and the quadratic form can differ substantially. This is why trust region methods (like TRPO) enforce explicit constraints on the step size.
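Both the quadratic approximation and its breakdown can be seen numerically (a sketch for the Bernoulli family, where the KL divergence has a closed form):

```python
import numpy as np

# Check KL(p_theta || p_{theta+delta}) ≈ 0.5 * delta^2 * I(theta)
# for a Bernoulli family, where I(theta) = 1 / (theta * (1 - theta)).
def kl_bernoulli(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

theta = 0.4
fisher = 1.0 / (theta * (1.0 - theta))

for delta in [0.001, 0.01, 0.1, 0.5]:
    exact = kl_bernoulli(theta, theta + delta)
    quad = 0.5 * delta**2 * fisher
    print(delta, exact, quad)   # quadratic approx degrades as delta grows
```

For $\delta = 0.001$ the two agree to a fraction of a percent; for $\delta = 0.5$ they differ by roughly 30%, which is the failure mode the trust-region constraint guards against.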

Theorem

Natural Gradient Invariance

Statement

The natural gradient of a loss $L(\theta)$ is:

$$\tilde{\nabla}_\theta L = I(\theta)^{-1} \nabla_\theta L$$

If $\phi = f(\theta)$ is a smooth reparameterization, then the natural gradient update in $\phi$-coordinates produces the same distribution update as in $\theta$-coordinates: exactly for the continuous-time flow, and to first order in the step size for discrete updates. That is, the trajectory in distribution space $\{p_{\theta_t}\}$ does not depend on the choice of parameterization.

Intuition

Euclidean gradient descent moves $\theta$ in the direction of steepest descent measured in the $\ell^2$ norm on parameter space. Natural gradient descent moves $\theta$ in the direction of steepest descent measured in KL divergence on distribution space. Since KL divergence is a property of distributions (not parameters), the result does not depend on how you parameterize the family.

Proof Sketch

Under the reparameterization $\phi = f(\theta)$, the Fisher metric transforms as $I_\phi = J^{-\top} I_\theta J^{-1}$, where $J = \partial f / \partial \theta$ is the Jacobian. The gradient transforms as $\nabla_\phi L = J^{-\top} \nabla_\theta L$. Therefore $I_\phi^{-1} \nabla_\phi L = J\,(I_\theta^{-1} \nabla_\theta L)$, which is exactly the Jacobian-transformed version of the natural gradient in $\theta$-coordinates.
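The transformation rules above can be verified numerically in one dimension (an illustrative check with an arbitrary loss $(p - 0.8)^2$; not from the source): parameterize a Bernoulli by its mean $p$ or its logit $\phi$, and compare the two natural gradients.

```python
# 1-D check of the proof sketch: for phi = f(p) with Jacobian J = f'(p),
# I_phi = I_p / J**2 and dL/dphi = (dL/dp) / J, hence
#   I_phi^{-1} dL/dphi = J * (I_p^{-1} dL/dp).
# Family: Bernoulli(p); reparameterization: phi = logit(p).
p = 0.3
I_p = 1.0 / (p * (1.0 - p))      # Fisher information in the mean parameter
J = 1.0 / (p * (1.0 - p))        # d logit(p) / dp

dL_dp = 2.0 * (p - 0.8)          # gradient of the arbitrary loss (p - 0.8)^2
dL_dphi = dL_dp / J              # chain rule

I_phi = I_p / J**2               # metric transformation (1-D case)
nat_p = dL_dp / I_p              # natural gradient in p-coordinates
nat_phi = dL_dphi / I_phi        # natural gradient in phi-coordinates
print(nat_phi, J * nat_p)        # equal, as the theorem predicts
```

As a cross-check, $I_\phi = p(1-p)$ is exactly the Bernoulli Fisher information in natural (logit) parameters, consistent with the exponential-family section below.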

Why It Matters

Parameterization invariance means you do not need to worry about choosing the "right" parameterization. For neural networks this is significant, because the loss landscape depends heavily on the parameterization. Adam and other adaptive methods can be viewed as rough diagonal approximations to preconditioning by $I(\theta)^{-1}$.

Failure Mode

Computing $I(\theta)^{-1}$ is $O(d^3)$ for $d$ parameters, which is intractable for neural networks with millions of parameters. Practical approximations (K-FAC, diagonal Fisher, empirical Fisher) trade off accuracy for computational feasibility, and these approximations break the invariance property.

Exponential Families and Dual Flatness

Exponential families have a special place in information geometry. An exponential family $p_\theta(x) = h(x) \exp(\theta^\top T(x) - A(\theta))$ admits two natural coordinate systems:

  1. Natural parameters $\theta$ (the canonical coordinates)
  2. Expectation parameters $\eta = \mathbb{E}[T(x)] = \nabla A(\theta)$

These are related by the Legendre transform: $A^*(\eta) = \theta^\top \eta - A(\theta)$. The manifold is dually flat: it is flat in both coordinate systems simultaneously, with a pair of dual affine connections.

The KL divergence between members of the family is the Bregman divergence generated by $A$:

$$\mathrm{KL}(p_\theta \,\|\, p_{\theta'}) = D_A(\theta' \,\|\, \theta) = A(\theta') - A(\theta) - \nabla A(\theta)^\top(\theta' - \theta)$$
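This identity can be verified for a concrete family (a sketch, not from the source): the Bernoulli in natural parameters, with $T(x) = x$ and $A(\theta) = \log(1 + e^\theta)$.

```python
import numpy as np

# Bernoulli as an exponential family: p_theta(x) = exp(theta*x - A(theta)),
# A(theta) = log(1 + e^theta), eta = A'(theta) = sigmoid(theta).
# Check: KL(p_theta || p_theta') equals the Bregman divergence D_A(theta' || theta).
def A(t):
    return np.log1p(np.exp(t))

def eta(t):                      # expectation parameter (mean of x)
    return 1.0 / (1.0 + np.exp(-t))

def kl(t1, t2):                  # KL between two Bernoullis, natural params
    p, q = eta(t1), eta(t2)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def bregman_A(t2, t1):           # D_A(theta' || theta)
    return A(t2) - A(t1) - eta(t1) * (t2 - t1)

t1, t2 = 0.3, -1.2
print(kl(t1, t2), bregman_A(t2, t1))   # the two agree exactly
```

The agreement is exact (not just approximate) because for exponential families the expansion of the KL in natural parameters terminates: all the $x$-dependence sits in the linear term $\theta^\top T(x)$.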

Connection to Mirror Descent

Mirror descent and natural gradient descent coincide on exponential families. Because $I(\theta) = \nabla^2 A(\theta)$ and $\eta = \nabla A(\theta)$, the natural gradient update in natural parameters is

$$\theta_{t+1} = \theta_t - \alpha\, I(\theta_t)^{-1} \nabla_\theta L = \theta_t - \alpha\, \nabla_\eta L$$

which is a mirror descent step on the expectation parameters with mirror map $A^*$ and Bregman divergence $D_{A^*}$; dually, mirror descent with mirror map $A$ performs natural gradient descent in $\eta$-coordinates. This unifies two important optimization frameworks:

  • Natural gradient: use the Fisher metric to precondition gradients
  • Mirror descent: use a Bregman divergence to define the proximity term

For exponential families, these are the same algorithm.
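The cancellation $I(\theta)^{-1} \nabla_\theta L = \nabla_\eta L$ that underlies this equivalence can be checked directly (a sketch for the Bernoulli family, with a hypothetical loss $L(\eta) = (\eta - 0.8)^2$ chosen only for illustration):

```python
import numpy as np

# For an exponential family, the natural gradient in natural parameters
# equals the plain gradient in expectation parameters:
#   I(theta)^{-1} dL/dtheta = dL/deta,  since I(theta) = A''(theta) = deta/dtheta.
theta = 0.4
eta = 1.0 / (1.0 + np.exp(-theta))   # eta = sigmoid(theta) = A'(theta)
fisher = eta * (1.0 - eta)           # I(theta) = A''(theta)

dL_deta = 2.0 * (eta - 0.8)          # gradient in expectation parameters
dL_dtheta = dL_deta * fisher         # chain rule: deta/dtheta = fisher

nat_grad = dL_dtheta / fisher        # natural gradient in theta
print(nat_grad, dL_deta)             # identical
```

This is also the content of the advanced exercise below: preconditioning by $I(\theta)^{-1} = (\nabla^2 A)^{-1}$ converts the $\theta$-gradient into the $\eta$-gradient.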

Common Confusions

Watch Out

The empirical Fisher is not the true Fisher

In practice, many implementations compute the "Fisher information matrix" by averaging $\nabla \log p_\theta(x)\, \nabla \log p_\theta(x)^\top$ over the training data. This is the empirical Fisher, which equals the true Fisher only when the model is correctly specified (i.e., the true distribution lies in the family). The empirical Fisher is still a PSD outer-product average, so it defines a (possibly degenerate) Riemannian form; the real issue is that under misspecification it converges to the wrong limit and loses invariance under sufficient statistics, so it is not the information-theoretically correct Fisher metric for the model family. See Kunstner, Hennig, and Balles (2019) and Martens (2020, §5).
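A tiny closed-form illustration (assuming a Bernoulli model; the particular numbers are illustrative): the expected empirical Fisher matches the true Fisher exactly when the data distribution equals the model, and deviates from it otherwise.

```python
# Empirical vs true Fisher for a Bernoulli model p_theta, evaluated in
# expectation with data drawn from Bernoulli(q).
# score(x) = x/theta - (1 - x)/(1 - theta)
theta, q = 0.3, 0.7                    # model parameter vs data distribution

true_fisher = 1.0 / (theta * (1.0 - theta))               # E_{p_theta}[score^2]
emp_fisher = q / theta**2 + (1.0 - q) / (1.0 - theta)**2  # E_q[score^2]
print(true_fisher, emp_fisher)         # differ badly when q != theta

# When the model matches the data (theta = q), the two coincide:
emp_at_q = q / q**2 + (1.0 - q) / (1.0 - q)**2            # = 1/(q(1-q))
print(emp_at_q, 1.0 / (q * (1.0 - q)))
```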

Watch Out

Adam is not the natural gradient

Adam uses $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$ as a diagonal second-moment estimate and divides the update by $\sqrt{v_t}$. This resembles preconditioning by a diagonal approximation to $I(\theta)^{-1/2}$ rather than $I(\theta)^{-1}$, and it uses squared gradients of the loss, not squared score functions. The two coincide only for specific loss functions and model classes.

Canonical Examples

Example

Natural gradient for a Gaussian mean

Consider $p_\theta(x) = \mathcal{N}(x; \theta, \sigma^2)$ with known variance. The Fisher information is $I(\theta) = 1/\sigma^2$. The natural gradient of a loss $L(\theta)$ is $\sigma^2 \nabla_\theta L$. This rescales the gradient by the variance, taking larger steps when the distribution is broad (uncertain) and smaller steps when it is narrow (confident). Now reparameterize as $\phi = \theta^3$. Standard gradient descent gives a different trajectory in distribution space, but natural gradient gives the same trajectory.
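A numeric version of this example (a sketch; the quadratic loss $L(\theta) = \tfrac12(\theta - 1)^2$, step size, and starting point are illustrative choices, not from the source). Discrete natural gradient steps agree across parameterizations to first order in the step size, and both runs converge to the same distribution:

```python
import numpy as np

# N(theta, sigma^2) with sigma = 2; loss L = 0.5*(theta - 1)^2.
# Compare natural gradient descent in theta vs in phi = theta**3.
sigma, alpha, steps = 2.0, 0.1, 50

def dL(t):                           # dL/dtheta
    return t - 1.0

# Natural gradient in theta: I(theta) = 1/sigma^2, so step = alpha*sigma^2*dL
t = 3.0
for _ in range(steps):
    t -= alpha * sigma**2 * dL(t)

# Natural gradient in phi = theta**3: J = 3*theta^2, I_phi = (1/sigma^2)/J^2,
# dL/dphi = dL/J, so the phi step is alpha*sigma^2*J*dL
phi = 3.0**3
for _ in range(steps):
    th = np.cbrt(phi)
    phi -= alpha * sigma**2 * (3 * th**2) * dL(th)

print(t, np.cbrt(phi))               # both converge to theta = 1
```

At finite step size the intermediate iterates differ at $O(\alpha^2)$, consistent with the invariance theorem holding exactly only for the continuous-time flow; both runs nonetheless reach the same distribution.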

Key Takeaways

  • The Fisher information metric is the Hessian of the KL divergence at zero separation
  • The natural gradient $\tilde{\nabla} L = I(\theta)^{-1} \nabla L$ is invariant to reparameterization
  • Exponential families are dually flat: natural parameters and expectation parameters are Legendre dual
  • Mirror descent with the log-partition geometry recovers natural gradient descent on exponential families
  • Computing the exact natural gradient is $O(d^3)$; practical methods use approximations

Exercises

ExerciseCore

Problem

For a Bernoulli distribution $p_\theta(x) = \theta^x (1-\theta)^{1-x}$ with $\theta \in (0,1)$, compute the Fisher information $I(\theta)$ and write down the natural gradient update for minimizing a loss $L(\theta)$.

ExerciseAdvanced

Problem

Show that for an exponential family with log-partition function $A(\theta)$, the Fisher information matrix equals the Hessian of $A$: $I(\theta) = \nabla^2 A(\theta)$. Use this to prove that the natural gradient update in natural parameters is $\theta_{t+1} = \theta_t - \alpha \nabla_\eta L$, i.e., the Fisher preconditioning converts the gradient in $\theta$ into the gradient with respect to the expectation parameters $\eta = \nabla A(\theta)$.

References

Canonical:

  • Amari, Information Geometry and Its Applications (2016), Chapters 1-4
  • Amari, "Natural Gradient Works Efficiently in Learning" (Neural Computation 1998)
  • Chentsov, Statistical Decision Rules and Optimal Inference (1982 English translation), AMS Translations Vol. 53 (finite sample space uniqueness of the Fisher metric)

Current:

  • Martens, "New Insights and Perspectives on the Natural Gradient Method" (JMLR 2020), §5 on empirical Fisher
  • Raskutti & Mukherjee, "The Information Geometry of Mirror Descent" (2015)
  • Ay, Jost, Lê, and Schwachhöfer, "Information geometry and sufficient statistics" (Probability Theory and Related Fields, 2015), extending Chentsov to continuous sample spaces
  • Kunstner, Hennig, and Balles, "Limitations of the Empirical Fisher Approximation for Natural Gradient Descent" (NeurIPS 2019, arXiv:1905.12558)
  • Nielsen, "An Elementary Introduction to Information Geometry" (Entropy 2020, arXiv:1808.08271)

Next Topics

  • Optimizer theory (SGD, Adam, Muon): practical approximations to the natural gradient
  • Mean field theory: information geometry in variational inference

Last reviewed: April 2026
