
Optimization Function Classes

Preconditioned Optimizers: Shampoo, K-FAC, and Natural Gradient

Optimizers that use curvature information to precondition gradients: the natural gradient via Fisher information, K-FAC's Kronecker approximation, and Shampoo's full-matrix preconditioning. How they connect to Riemannian optimization and why they outperform Adam on certain architectures.

Advanced · Tier 2 · Frontier · ~55 min

Why This Matters

Adam preconditions each parameter independently using running averages of squared gradients. This is diagonal preconditioning: each weight gets its own learning rate, but correlations between weights are ignored.

Preconditioned optimizers go further. They use the full (or approximated) curvature structure of the loss landscape to transform gradients before taking a step. The result: updates that account for how parameters interact, not just how large each gradient is. In practice, this means faster convergence on problems with strong parameter correlations, which includes most neural networks.

The cost is computational: maintaining and applying a preconditioner is more expensive per step than Adam. The question is whether the improved convergence compensates. For large-scale training (where each step costs thousands of GPU-hours), the answer is increasingly yes.
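To make the distinction concrete, here is a small NumPy sketch (illustrative only, not any optimizer's actual implementation) comparing diagonal and full-matrix preconditioning on an ill-conditioned quadratic whose parameters are strongly correlated:

```python
import numpy as np

# Toy loss L(w) = 0.5 * w^T H w with strong off-diagonal correlation in H.
H = np.array([[100.0, 30.0],
              [30.0, 10.0]])  # symmetric positive definite

def run(precondition, steps=50, lr=0.9):
    w = np.array([1.0, 1.0])
    for _ in range(steps):
        g = H @ w                    # gradient of the quadratic loss
        w = w - lr * precondition(g)
    return 0.5 * w @ H @ w           # final loss

diag_inv = 1.0 / np.diag(H)          # diagonal preconditioner (Adam-like)
full_inv = np.linalg.inv(H)          # full-matrix preconditioner (Newton-like)

loss_diag = run(lambda g: diag_inv * g)
loss_full = run(lambda g: full_inv @ g)
```

With the full preconditioner the iteration contracts every direction at the same rate; the diagonal preconditioner cannot correct the coupled direction, so after the same number of steps its loss remains orders of magnitude higher.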

The Natural Gradient

Theorem

Natural Gradient Descent

Statement

The natural gradient of a loss $\mathcal{L}(\theta)$ is:

$$\tilde{\nabla} \mathcal{L}(\theta) = F(\theta)^{-1} \nabla \mathcal{L}(\theta)$$

where $F(\theta) = \mathbb{E}_{x \sim p(\cdot\,; \theta)}\big[\nabla \log p(x; \theta) \, \nabla \log p(x; \theta)^\top\big]$ is the Fisher information matrix.

The natural gradient update is:

$$\theta_{t+1} = \theta_t - \eta \, F(\theta_t)^{-1} \nabla \mathcal{L}(\theta_t)$$

This is steepest descent in the KL divergence geometry: it finds the direction that decreases the loss most per unit of distributional change $D_{\text{KL}}(p_{\theta + \delta} \,\|\, p_\theta)$.
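The update above can be sketched end to end on a tiny model. Here is one natural-gradient step on a categorical distribution $p(x = i; \theta) = \text{softmax}(\theta)_i$, with the Fisher estimated by Monte Carlo from the model's own samples, matching the expectation in the definition (function names are illustrative, not from any library):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def score(theta, x):
    # grad_theta log p(x; theta) = e_x - softmax(theta)
    g = -softmax(theta)
    g[x] += 1.0
    return g

theta = np.array([0.5, -0.3, 0.1])
p = softmax(theta)

# Monte Carlo Fisher: F = E_{x ~ p}[score score^T]
samples = rng.choice(3, size=20000, p=p)
S = np.stack([score(theta, x) for x in samples])
F = S.T @ S / len(S)

# Loss: KL(q || p_theta) toward a fixed target q; its theta-gradient is p - q.
q = np.array([0.2, 0.5, 0.3])
grad = p - q

# Natural gradient F^{-1} grad. F is singular along the all-ones direction
# (softmax is shift-invariant), so use a pseudo-inverse here; real
# implementations add damping instead.
nat_grad = np.linalg.pinv(F) @ grad
theta_new = theta - 1.0 * nat_grad      # one natural-gradient step
```

A single step with learning rate 1 moves the model distribution most of the way to the target $q$, which a plain gradient step of the same size does not; this is the per-step advantage the preconditioning buys.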

Intuition

Euclidean gradient descent treats all parameter directions equally: a step of size $\epsilon$ in weight 1 is the same as a step of size $\epsilon$ in weight 1000. But in a neural network, some parameters have much larger effects on the output distribution than others. The Fisher information matrix captures these sensitivities.

Multiplying by $F^{-1}$ rescales the gradient so that the update produces equal change in the output distribution in all directions. This is the same idea as Newton's method (which uses the Hessian instead of the Fisher), but adapted for probabilistic models.

Why It Matters

The natural gradient is parameterization-invariant: it gives the same update regardless of how you parameterize the model. Reparameterizing a neural network (e.g., scaling a weight matrix by a constant and dividing the next layer by the same constant) changes the Euclidean gradient but not the natural gradient. This is why natural gradient methods can be more robust to architecture choices.

Practically, natural gradient methods converge in fewer steps than Adam on problems with ill-conditioned Fisher matrices (which is most neural networks). The challenge is computing $F^{-1}$, which costs $O(d^3)$ for $d$ parameters.

Failure Mode

The Fisher matrix has $d^2$ entries for $d$ parameters. For a model with 100M parameters, storing $F$ would require $10^{16}$ entries, which is impossible. Every practical preconditioned optimizer is an approximation to the natural gradient: diagonal (Adam), block-diagonal (K-FAC), Kronecker-factored (Shampoo), or low-rank.

K-FAC: Kronecker-Factored Approximate Curvature

K-FAC (Martens & Grosse, 2015) approximates the Fisher matrix for each layer using a Kronecker product factorization.

For a fully-connected layer with input $a$ and output pre-activation $s = Wa$, the Fisher block for that layer's weights is:

$$F_W \approx \mathbb{E}[aa^\top] \otimes \mathbb{E}[\nabla_s \mathcal{L} \, \nabla_s \mathcal{L}^\top] = A \otimes S$$

where $A$ is the input covariance and $S$ is the gradient covariance. The Kronecker structure means:

$$F_W^{-1} \, \text{vec}(G) \approx \text{vec}(S^{-1} G A^{-1})$$

This reduces the storage from $O(d_{\text{in}}^2 d_{\text{out}}^2)$ to $O(d_{\text{in}}^2 + d_{\text{out}}^2)$ and the inversion cost to $O(d_{\text{in}}^3 + d_{\text{out}}^3)$, a massive saving. The Kronecker factorization is exact for linear networks and approximate for nonlinear ones.
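The Kronecker-inverse identity above can be verified numerically. This small self-contained sketch (sizes and names are illustrative) checks that solving against the full Kronecker product agrees with the cheap factored form:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 4, 3

def random_spd(n):
    # Well-conditioned symmetric positive definite matrix.
    M = rng.standard_normal((n, n))
    return M @ M.T + n * np.eye(n)

A = random_spd(d_in)                    # input covariance  E[a a^T]
S = random_spd(d_out)                   # pre-activation gradient covariance
G = rng.standard_normal((d_out, d_in))  # gradient w.r.t. W  (d_out x d_in)

# With column-stacking vec and F = A (kron) S, solving F x = vec(G) directly...
F = np.kron(A, S)
vec_G = G.reshape(-1, order="F")        # column-stacking vectorization
lhs = np.linalg.solve(F, vec_G)

# ...matches the factored form vec(S^{-1} G A^{-1}), which never builds F.
rhs = (np.linalg.inv(S) @ G @ np.linalg.inv(A)).reshape(-1, order="F")
assert np.allclose(lhs, rhs)
```

The factored form inverts a $3 \times 3$ and a $4 \times 4$ matrix instead of a $12 \times 12$ one; at layer sizes like $4096$, that gap is the entire point of K-FAC.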

Shampoo

Proposition

Shampoo Update Rule

Statement

Shampoo (Gupta, Koren, Singer, 2018) maintains left and right preconditioners for each weight matrix:

$$L_t = (1 - \beta) L_{t-1} + \beta \, G_t G_t^\top \in \mathbb{R}^{m \times m}$$

$$R_t = (1 - \beta) R_{t-1} + \beta \, G_t^\top G_t \in \mathbb{R}^{n \times n}$$

where $G_t \in \mathbb{R}^{m \times n}$ is the gradient with respect to the weight matrix.

The preconditioned update is:

$$\Delta W_t = L_t^{-1/4} \, G_t \, R_t^{-1/4}$$

The $-1/4$ power (rather than $-1/2$) is derived rather than tuned: $L_t$ and $R_t$ each accumulate the full second moment of the gradients, so the second moment of the vectorized gradient is approximated (up to scaling) by $L^{1/2} \otimes R^{1/2}$. The AdaGrad-style preconditioner is its inverse square root, $(L^{1/2} \otimes R^{1/2})^{-1/2} = L^{-1/4} \otimes R^{-1/4}$, which applied to the matrix $G$ gives $L^{-1/4} G R^{-1/4}$.
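A minimal single-matrix Shampoo step can be sketched as follows (an illustration of the update rule above, not the Distributed Shampoo implementation; the eigenvalue floor `eps` is an assumption practical variants make to keep the powers well defined):

```python
import numpy as np

def matrix_power(M, p, eps=1e-12):
    # Fractional power of a symmetric PSD matrix via eigendecomposition.
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag(np.maximum(vals, eps) ** p) @ vecs.T

def shampoo_step(W, G, L, R, lr=0.1, beta=0.1):
    L = (1 - beta) * L + beta * (G @ G.T)   # left statistics,  (m, m)
    R = (1 - beta) * R + beta * (G.T @ G)   # right statistics, (n, n)
    update = matrix_power(L, -0.25) @ G @ matrix_power(R, -0.25)
    return W - lr * update, L, R

rng = np.random.default_rng(0)
m, n = 8, 5
W = rng.standard_normal((m, n))
L, R = 1e-4 * np.eye(m), 1e-4 * np.eye(n)   # epsilon * I initialization
for _ in range(3):
    G = W.copy()              # gradient of the toy loss 0.5 * ||W||_F^2
    W, L, R = shampoo_step(W, G, L, R)
```

Production variants add considerably more machinery (blocking for very large layers, grafting, periodic recomputation of the matrix powers), but the core update is exactly these few lines.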

Intuition

Shampoo is K-FAC for general weight matrices, simplified. Instead of separating the Fisher into input covariance and gradient covariance, it directly accumulates $GG^\top$ (left) and $G^\top G$ (right) from the gradient itself. The left preconditioner captures correlations between output neurons; the right captures correlations between input features. Together, they approximate the full Fisher.

The matrix fourth root $L^{-1/4}$ is the geometric mean of the identity and the inverse square root $L^{-1/2}$. It interpolates between no preconditioning ($I$) and full-matrix AdaGrad-style preconditioning ($F^{-1/2}$).

Why It Matters

Shampoo has shown strong empirical results on transformer training, sometimes matching or exceeding Adam with fewer steps (though each step is more expensive due to the matrix power computation). Google has used Shampoo-derived optimizers (Distributed Shampoo) for production training of large models. The connection to Riemannian optimization is direct: Shampoo's update can be interpreted as Riemannian gradient descent on the space of weight matrices with a specific metric.

Failure Mode

Computing $L^{-1/4}$ requires an eigendecomposition of $L$, costing $O(m^3)$ per step. For large layers ($m = 4096$), this is expensive. Practical implementations amortize this cost by recomputing the preconditioner every $k$ steps (e.g., $k = 100$). The matrix power can also be approximated using Newton-Schulz iterations (the same technique as Muon). Memory cost is $O(m^2 + n^2)$ per weight matrix for the preconditioners.
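The amortization trick can be sketched as a training loop that accumulates statistics every step but refreshes the expensive eigendecomposition-based powers only every $k$ steps (the loop structure, names, and toy gradient stream here are illustrative assumptions):

```python
import numpy as np

def inv_quarter(M, eps=1e-12):
    # M^{-1/4} for symmetric PSD M via eigendecomposition, O(m^3).
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag(np.maximum(vals, eps) ** -0.25) @ vecs.T

def train(W, grads, lr=0.01, beta=0.1, k=100):
    m, n = W.shape
    L, R = 1e-4 * np.eye(m), 1e-4 * np.eye(n)
    P_left, P_right = np.eye(m), np.eye(n)      # cached matrix powers
    for t, G in enumerate(grads):
        L = (1 - beta) * L + beta * (G @ G.T)   # cheap: every step
        R = (1 - beta) * R + beta * (G.T @ G)
        if t % k == 0:                          # expensive: amortized
            P_left, P_right = inv_quarter(L), inv_quarter(R)
        W = W - lr * P_left @ G @ P_right
    return W

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 8))
grads = [rng.standard_normal((16, 8)) for _ in range(250)]
W = train(W, grads)   # powers refreshed at t = 0, 100, 200
```

Between refreshes the preconditioner is slightly stale, which is an accepted trade-off: curvature statistics change slowly relative to the gradients.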

Comparison Table

| Optimizer | Preconditioner | Cost per step | Memory overhead | Convergence | Best for |
|---|---|---|---|---|---|
| SGD | None ($I$) | $O(d)$ | 0 | Slow on ill-conditioned problems | Convex, well-conditioned |
| Adam | Diagonal ($\text{diag}(v_t)^{-1/2}$) | $O(d)$ | $2d$ (moments) | Good general-purpose | Default choice |
| K-FAC | Kronecker ($A^{-1} \otimes S^{-1}$) | $O(d_{\text{in}}^3 + d_{\text{out}}^3)$ | $d_{\text{in}}^2 + d_{\text{out}}^2$ | Fast on FC layers | Large FC networks |
| Shampoo | Full matrix ($L^{-1/4} \cdot R^{-1/4}$) | $O(m^3 + n^3)$ | $m^2 + n^2$ | Fast, especially on matrices | Transformers, large models |
| Muon | Stiefel projection | $O(mn)$ (Newton-Schulz) | $O(mn)$ | Very fast on orthogonal-like problems | Transformer weights |
| Natural gradient | Full Fisher ($F^{-1}$) | $O(d^3)$ | $O(d^2)$ | Optimal | Infeasible for large $d$ |

Common Confusions

Watch Out

Shampoo is not just Adam with bigger matrices

Adam uses diagonal preconditioning: each parameter gets its own adaptive learning rate based on its own squared gradient history. Shampoo uses full-matrix preconditioning: the update to parameter $w_{ij}$ depends on the gradient history of all other parameters in the same weight matrix. This captures cross-parameter correlations that Adam misses entirely. The difference matters most when parameters are highly correlated (which they are in attention weight matrices).

Watch Out

The matrix fourth root is not arbitrary

The $-1/4$ exponent in Shampoo comes from the Kronecker factorization math, not from tuning. Full-matrix AdaGrad preconditions with $\bar{H}^{-1/2}$, where $\bar{H} = \sum_s \text{vec}(G_s)\,\text{vec}(G_s)^\top$ is the accumulated second moment. Shampoo's factors bound this second moment (up to scaling) by $L^{1/2} \otimes R^{1/2}$, so the preconditioner becomes $(L^{1/2} \otimes R^{1/2})^{-1/2} = L^{-1/4} \otimes R^{-1/4}$, which applied to $G$ yields $L^{-1/4} G R^{-1/4}$. The exponent is derived, not chosen.
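The Kronecker-power identity this derivation relies on, that fractional powers factor across a Kronecker product of SPD matrices, can be checked numerically (a small illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

def spd(n):
    # Well-conditioned symmetric positive definite matrix.
    M = rng.standard_normal((n, n))
    return M @ M.T + n * np.eye(n)

def matrix_power(M, p):
    # Fractional power of an SPD matrix via eigendecomposition.
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag(vals ** p) @ vecs.T

L, R = spd(3), spd(4)
p = -0.25

# Powering the full Kronecker product...
direct = matrix_power(np.kron(L, R), p)
# ...equals the Kronecker product of the powered factors.
factored = np.kron(matrix_power(L, p), matrix_power(R, p))
assert np.allclose(direct, factored)
```

This is why Shampoo never needs to materialize the $mn \times mn$ Kronecker product: powering the small factors separately gives the same preconditioner.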

Watch Out

More preconditioning is not always better

Full natural gradient ($F^{-1} \nabla \mathcal{L}$) is theoretically optimal per step, but each step is vastly more expensive. The wall-clock time to reach a target loss depends on both the number of steps and the cost per step. Adam is cheap per step. Shampoo is expensive per step but takes fewer steps. The crossover depends on model size, hardware, and how ill-conditioned the problem is. For small models, Adam wins on wall-clock. For large transformers with many GPU-hours per step, Shampoo can win.

Exercises

ExerciseCore

Problem

Adam maintains running averages $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$ where $g_t^2$ is elementwise. Explain why this is equivalent to a diagonal approximation of the Fisher matrix and why it misses cross-parameter correlations.

ExerciseAdvanced

Problem

For a weight matrix $W \in \mathbb{R}^{512 \times 768}$, compute the memory cost of Shampoo's preconditioners ($L$ and $R$) compared to Adam's two moment buffers. At what matrix size does Shampoo's memory overhead exceed Adam's by more than 2x?


References

Canonical:

  • Amari, "Natural Gradient Works Efficiently in Learning" (Neural Computation, 1998). The foundational paper on natural gradient.
  • Martens & Grosse, "Optimizing Neural Networks with Kronecker-factored Approximate Curvature" (ICML 2015). K-FAC.
  • Gupta, Koren, Singer, "Shampoo: Preconditioned Stochastic Tensor Optimization" (ICML 2018). The original Shampoo paper.

Current:

  • Anil et al., "Scalable Second Order Optimization for Deep Learning" (2021). Distributed Shampoo at Google scale.

  • Bernstein & Newhouse, "Old Optimizer, New Norm: An Anthology" (2024). Muon and the Newton-Schulz connection.

  • Boyd & Vandenberghe, Convex Optimization (2004), Chapters 2-5

Next Topics

  • Riemannian optimization: the geometric framework underlying manifold-constrained updates
  • Optimizer theory: SGD, Adam, Muon: how Muon approximates Stiefel projection via Newton-Schulz

Last reviewed: April 2026
