
Optimization Function Classes

Preconditioned Optimizers: Shampoo, K-FAC, and Natural Gradient

Optimizers that use curvature information to precondition gradients: the natural gradient via Fisher information, K-FAC's Kronecker approximation, and Shampoo's full-matrix preconditioning. How they connect to Riemannian optimization and why they outperform Adam on certain architectures.

Advanced · Tier 2 · Frontier · ~55 min

Why This Matters

Adam preconditions each parameter independently using running averages of squared gradients. This is diagonal preconditioning: each weight gets its own learning rate, but correlations between weights are ignored.

Preconditioned optimizers go further. They use the full (or approximated) curvature structure of the loss landscape to transform gradients before taking a step. The result: updates that account for how parameters interact, not just how large each gradient is. In practice, this means faster convergence on problems with strong parameter correlations, which includes most neural networks.

The cost is computational: maintaining and applying a preconditioner is more expensive per step than Adam. The question is whether the improved convergence compensates. For large-scale training (where each step costs thousands of GPU-hours), the answer is increasingly yes.
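To make the distinction concrete, here is a small NumPy sketch (illustrative only, not any optimizer's actual implementation) comparing diagonal and full-matrix preconditioning on an ill-conditioned quadratic whose parameters are strongly correlated:

```python
import numpy as np

# Toy loss L(w) = 0.5 * w^T H w with strong off-diagonal correlation in H.
H = np.array([[100.0, 30.0],
              [30.0, 10.0]])  # symmetric positive definite

def run(precondition, steps=50, lr=0.9):
    w = np.array([1.0, 1.0])
    for _ in range(steps):
        g = H @ w                    # gradient of the quadratic loss
        w = w - lr * precondition(g)
    return 0.5 * w @ H @ w           # final loss

diag_inv = 1.0 / np.diag(H)          # diagonal preconditioner (Adam-like)
full_inv = np.linalg.inv(H)          # full-matrix preconditioner (Newton-like)

loss_diag = run(lambda g: diag_inv * g)
loss_full = run(lambda g: full_inv @ g)
```

With the full preconditioner the iteration contracts every direction at the same rate; the diagonal preconditioner cannot correct the coupled direction, so after the same number of steps its loss remains orders of magnitude higher.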

The Natural Gradient

Theorem

Natural Gradient Descent

Statement

The natural gradient of a loss $\mathcal{L}(\theta)$ is:

$$\tilde{\nabla} \mathcal{L}(\theta) = F(\theta)^{-1} \nabla \mathcal{L}(\theta)$$

where $F(\theta) = \mathbb{E}_{x \sim p(\cdot\,; \theta)}\big[\nabla \log p(x; \theta) \, \nabla \log p(x; \theta)^\top\big]$ is the Fisher information matrix.

The natural gradient update is:

$$\theta_{t+1} = \theta_t - \eta \, F(\theta_t)^{-1} \nabla \mathcal{L}(\theta_t)$$

This is steepest descent in the KL divergence geometry: it finds the direction that decreases the loss most per unit of distributional change $D_{\text{KL}}(p_{\theta + \delta} \,\|\, p_\theta)$.
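The update above can be sketched end to end on a tiny model. Here is one natural-gradient step on a categorical distribution $p(x = i; \theta) = \text{softmax}(\theta)_i$, with the Fisher estimated by Monte Carlo from the model's own samples, matching the expectation in the definition (function names are illustrative, not from any library):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def score(theta, x):
    # grad_theta log p(x; theta) = e_x - softmax(theta)
    g = -softmax(theta)
    g[x] += 1.0
    return g

theta = np.array([0.5, -0.3, 0.1])
p = softmax(theta)

# Monte Carlo Fisher: F = E_{x ~ p}[score score^T]
samples = rng.choice(3, size=20000, p=p)
S = np.stack([score(theta, x) for x in samples])
F = S.T @ S / len(S)

# Loss: KL(q || p_theta) toward a fixed target q; its theta-gradient is p - q.
q = np.array([0.2, 0.5, 0.3])
grad = p - q

# Natural gradient F^{-1} grad. F is singular along the all-ones direction
# (softmax is shift-invariant), so use a pseudo-inverse here; real
# implementations add damping instead.
nat_grad = np.linalg.pinv(F) @ grad
theta_new = theta - 1.0 * nat_grad      # one natural-gradient step
```

A single step with learning rate 1 moves the model distribution most of the way to the target $q$, which a plain gradient step of the same size does not; this is the per-step advantage the preconditioning buys.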

Intuition

Euclidean gradient descent treats all parameter directions equally: a step of size $\epsilon$ in weight 1 is the same as a step of size $\epsilon$ in weight 1000. But in a neural network, some parameters have much larger effects on the output distribution than others. The Fisher information matrix captures these sensitivities.

Multiplying by $F^{-1}$ rescales the gradient so that the update produces equal change in the output distribution in all directions. This is the same idea as Newton's method (which uses the Hessian instead of the Fisher), but adapted for probabilistic models.

Why It Matters

The natural gradient is parameterization-invariant: it gives the same update regardless of how you parameterize the model. Reparameterizing a neural network (e.g., scaling a weight matrix by a constant and dividing the next layer by the same constant) changes the Euclidean gradient but not the natural gradient. This is why natural gradient methods can be more robust to architecture choices.

Practically, natural gradient methods converge in fewer steps than Adam on problems with ill-conditioned Fisher matrices (which is most neural networks). The challenge is computing $F^{-1}$, which costs $O(d^3)$ for $d$ parameters.

Failure Mode

The Fisher matrix has $d^2$ entries for $d$ parameters. For a model with 100M parameters, storing $F$ would require $10^{16}$ entries, which is impossible. Every practical preconditioned optimizer is an approximation to the natural gradient: diagonal (Adam), block-diagonal (K-FAC), Kronecker-factored (Shampoo), or low-rank.

K-FAC: Kronecker-Factored Approximate Curvature

K-FAC (Martens & Grosse, 2015) approximates the Fisher matrix for each layer using a Kronecker product factorization.

For a fully-connected layer with input $a$ and output pre-activation $s = Wa$, the Fisher block for that layer's weights is:

$$F_W \approx \mathbb{E}[aa^\top] \otimes \mathbb{E}[\nabla_s \mathcal{L} \, \nabla_s \mathcal{L}^\top] = A \otimes S$$

where $A$ is the input covariance and $S$ is the gradient covariance. The Kronecker structure means:

$$F_W^{-1} \, \text{vec}(G) \approx \text{vec}(S^{-1} G A^{-1})$$

This reduces the storage from $O(d_{\text{in}}^2 d_{\text{out}}^2)$ to $O(d_{\text{in}}^2 + d_{\text{out}}^2)$ and the inversion cost to $O(d_{\text{in}}^3 + d_{\text{out}}^3)$, a massive saving. The Kronecker factorization is exact for linear networks and approximate for nonlinear ones.
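The Kronecker-inverse identity above can be verified numerically. This small self-contained sketch (sizes and names are illustrative) checks that solving against the full Kronecker product agrees with the cheap factored form:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 4, 3

def random_spd(n):
    # Well-conditioned symmetric positive definite matrix.
    M = rng.standard_normal((n, n))
    return M @ M.T + n * np.eye(n)

A = random_spd(d_in)                    # input covariance  E[a a^T]
S = random_spd(d_out)                   # pre-activation gradient covariance
G = rng.standard_normal((d_out, d_in))  # gradient w.r.t. W  (d_out x d_in)

# With column-stacking vec and F = A (kron) S, solving F x = vec(G) directly...
F = np.kron(A, S)
vec_G = G.reshape(-1, order="F")        # column-stacking vectorization
lhs = np.linalg.solve(F, vec_G)

# ...matches the factored form vec(S^{-1} G A^{-1}), which never builds F.
rhs = (np.linalg.inv(S) @ G @ np.linalg.inv(A)).reshape(-1, order="F")
assert np.allclose(lhs, rhs)
```

The factored form inverts a $3 \times 3$ and a $4 \times 4$ matrix instead of a $12 \times 12$ one; at layer sizes like $4096$, that gap is the entire point of K-FAC.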

Shampoo

Proposition

Shampoo Update Rule

Statement

Shampoo (Gupta, Koren, Singer, 2018) maintains left and right preconditioners for each weight matrix:

$$L_t = (1 - \beta) L_{t-1} + \beta \, G_t G_t^\top \in \mathbb{R}^{m \times m}$$

$$R_t = (1 - \beta) R_{t-1} + \beta \, G_t^\top G_t \in \mathbb{R}^{n \times n}$$

where $G_t \in \mathbb{R}^{m \times n}$ is the gradient with respect to the weight matrix.

The preconditioned update is:

$$\Delta W_t = L_t^{-1/4} \, G_t \, R_t^{-1/4}$$

The $-1/4$ power (rather than $-1/2$) is derived rather than tuned: $L_t$ and $R_t$ each accumulate the full second moment of the gradients, so the second moment of the vectorized gradient is approximated (up to scaling) by $L^{1/2} \otimes R^{1/2}$. The AdaGrad-style preconditioner is its inverse square root, $(L^{1/2} \otimes R^{1/2})^{-1/2} = L^{-1/4} \otimes R^{-1/4}$, which applied to the matrix $G$ gives $L^{-1/4} G R^{-1/4}$.
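A minimal single-matrix Shampoo step can be sketched as follows (an illustration of the update rule above, not the Distributed Shampoo implementation; the eigenvalue floor `eps` is an assumption practical variants make to keep the powers well defined):

```python
import numpy as np

def matrix_power(M, p, eps=1e-12):
    # Fractional power of a symmetric PSD matrix via eigendecomposition.
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag(np.maximum(vals, eps) ** p) @ vecs.T

def shampoo_step(W, G, L, R, lr=0.1, beta=0.1):
    L = (1 - beta) * L + beta * (G @ G.T)   # left statistics,  (m, m)
    R = (1 - beta) * R + beta * (G.T @ G)   # right statistics, (n, n)
    update = matrix_power(L, -0.25) @ G @ matrix_power(R, -0.25)
    return W - lr * update, L, R

rng = np.random.default_rng(0)
m, n = 8, 5
W = rng.standard_normal((m, n))
L, R = 1e-4 * np.eye(m), 1e-4 * np.eye(n)   # epsilon * I initialization
for _ in range(3):
    G = W.copy()              # gradient of the toy loss 0.5 * ||W||_F^2
    W, L, R = shampoo_step(W, G, L, R)
```

Production variants add considerably more machinery (blocking for very large layers, grafting, periodic recomputation of the matrix powers), but the core update is exactly these few lines.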

Intuition

Shampoo is K-FAC for general weight matrices, simplified. Instead of separating the Fisher into input covariance and gradient covariance, it directly accumulates $GG^\top$ (left) and $G^\top G$ (right) from the gradient itself. The left preconditioner captures correlations between output neurons; the right captures correlations between input features. Together, they approximate the full Fisher.

The matrix fourth root $L^{-1/4}$ is the geometric mean of the identity and the inverse square root $L^{-1/2}$. It interpolates between no preconditioning ($I$) and full-matrix AdaGrad-style preconditioning ($F^{-1/2}$).

Why It Matters

Shampoo has shown strong empirical results on transformer training, sometimes matching or exceeding Adam with fewer steps (though each step is more expensive due to the matrix power computation). Google has used Shampoo-derived optimizers (Distributed Shampoo) for production training of large models. The connection to Riemannian optimization is direct: Shampoo's update can be interpreted as Riemannian gradient descent on the space of weight matrices with a specific metric.

Failure Mode

Computing $L^{-1/4}$ requires an eigendecomposition of $L$, costing $O(m^3)$ per step. For large layers ($m = 4096$), this is expensive. Practical implementations amortize this cost by recomputing the preconditioner every $k$ steps (e.g., $k = 100$). The matrix power can also be approximated using Newton-Schulz iterations (the same technique as Muon). Memory cost is $O(m^2 + n^2)$ per weight matrix for the preconditioners.
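The amortization trick can be sketched as a training loop that accumulates statistics every step but refreshes the expensive eigendecomposition-based powers only every $k$ steps (the loop structure, names, and toy gradient stream here are illustrative assumptions):

```python
import numpy as np

def inv_quarter(M, eps=1e-12):
    # M^{-1/4} for symmetric PSD M via eigendecomposition, O(m^3).
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag(np.maximum(vals, eps) ** -0.25) @ vecs.T

def train(W, grads, lr=0.01, beta=0.1, k=100):
    m, n = W.shape
    L, R = 1e-4 * np.eye(m), 1e-4 * np.eye(n)
    P_left, P_right = np.eye(m), np.eye(n)      # cached matrix powers
    for t, G in enumerate(grads):
        L = (1 - beta) * L + beta * (G @ G.T)   # cheap: every step
        R = (1 - beta) * R + beta * (G.T @ G)
        if t % k == 0:                          # expensive: amortized
            P_left, P_right = inv_quarter(L), inv_quarter(R)
        W = W - lr * P_left @ G @ P_right
    return W

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 8))
grads = [rng.standard_normal((16, 8)) for _ in range(250)]
W = train(W, grads)   # powers refreshed at t = 0, 100, 200
```

Between refreshes the preconditioner is slightly stale, which is an accepted trade-off: curvature statistics change slowly relative to the gradients.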

Comparison Table

| Optimizer | Preconditioner | Cost per step | Memory overhead | Convergence | Best for |
|---|---|---|---|---|---|
| SGD | None ($I$) | $O(d)$ | 0 | Slow on ill-conditioned problems | Convex, well-conditioned |
| Adam | Diagonal ($\text{diag}(v_t)^{-1/2}$) | $O(d)$ | $2d$ (moments) | Good general-purpose | Default choice |
| K-FAC | Kronecker ($A^{-1} \otimes S^{-1}$) | $O(d_{\text{in}}^3 + d_{\text{out}}^3)$ | $d_{\text{in}}^2 + d_{\text{out}}^2$ | Fast on FC layers | Large FC networks |
| Shampoo | Full matrix ($L^{-1/4} \cdot R^{-1/4}$) | $O(m^3 + n^3)$ | $m^2 + n^2$ | Fast, especially on matrices | Transformers, large models |
| Muon | Stiefel projection | $O(mn)$ (Newton-Schulz) | $O(mn)$ | Very fast on orthogonal-like problems | Transformer weights |
| Natural gradient | Full Fisher ($F^{-1}$) | $O(d^3)$ | $O(d^2)$ | Optimal | Infeasible for large $d$ |

Common Confusions

Watch Out

Shampoo is not just Adam with bigger matrices

Adam uses diagonal preconditioning: each parameter gets its own adaptive learning rate based on its own squared gradient history. Shampoo uses full-matrix preconditioning: the update to parameter $w_{ij}$ depends on the gradient history of all other parameters in the same weight matrix. This captures cross-parameter correlations that Adam misses entirely. The difference matters most when parameters are highly correlated (which they are in attention weight matrices).

Watch Out

The matrix fourth root is not arbitrary

The $-1/4$ exponent in Shampoo comes from the Kronecker factorization math, not from tuning. Full-matrix AdaGrad preconditions with $\bar{H}^{-1/2}$, where $\bar{H} = \sum_s \text{vec}(G_s)\,\text{vec}(G_s)^\top$ is the accumulated second moment. Shampoo's factors bound this second moment (up to scaling) by $L^{1/2} \otimes R^{1/2}$, so the preconditioner becomes $(L^{1/2} \otimes R^{1/2})^{-1/2} = L^{-1/4} \otimes R^{-1/4}$, which applied to $G$ yields $L^{-1/4} G R^{-1/4}$. The exponent is derived, not chosen.
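The Kronecker-power identity this derivation relies on, that fractional powers factor across a Kronecker product of SPD matrices, can be checked numerically (a small illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

def spd(n):
    # Well-conditioned symmetric positive definite matrix.
    M = rng.standard_normal((n, n))
    return M @ M.T + n * np.eye(n)

def matrix_power(M, p):
    # Fractional power of an SPD matrix via eigendecomposition.
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag(vals ** p) @ vecs.T

L, R = spd(3), spd(4)
p = -0.25

# Powering the full Kronecker product...
direct = matrix_power(np.kron(L, R), p)
# ...equals the Kronecker product of the powered factors.
factored = np.kron(matrix_power(L, p), matrix_power(R, p))
assert np.allclose(direct, factored)
```

This is why Shampoo never needs to materialize the $mn \times mn$ Kronecker product: powering the small factors separately gives the same preconditioner.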

Watch Out

More preconditioning is not always better

Full natural gradient ($F^{-1} \nabla \mathcal{L}$) is theoretically optimal per step, but each step is vastly more expensive. The wall-clock time to reach a target loss depends on both the number of steps and the cost per step. Adam is cheap per step. Shampoo is expensive per step but takes fewer steps. The crossover depends on model size, hardware, and how ill-conditioned the problem is. For small models, Adam wins on wall-clock. For large transformers with many GPU-hours per step, Shampoo can win.

Exercises

ExerciseCore

Problem

Adam maintains running averages $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$ where $g_t^2$ is elementwise. Explain why this is equivalent to a diagonal approximation of the Fisher matrix and why it misses cross-parameter correlations.

ExerciseAdvanced

Problem

For a weight matrix $W \in \mathbb{R}^{512 \times 768}$, compute the memory cost of Shampoo's preconditioners ($L$ and $R$) compared to Adam's two moment buffers. At what matrix size does Shampoo's memory overhead exceed Adam's by more than 2x?


References

Canonical:

  • Amari, "Natural Gradient Works Efficiently in Learning" (Neural Computation, 1998). The foundational paper on natural gradient.
  • Martens & Grosse, "Optimizing Neural Networks with Kronecker-factored Approximate Curvature" (ICML 2015). K-FAC.
  • Gupta, Koren, Singer, "Shampoo: Preconditioned Stochastic Tensor Optimization" (ICML 2018). The original Shampoo paper.

Current:

  • Anil et al., "Scalable Second Order Optimization for Deep Learning" (2021). Distributed Shampoo at Google scale.

  • Bernstein & Newhouse, "Old Optimizer, New Norm: An Anthology" (2024). Muon and the Newton-Schulz connection.

  • Boyd & Vandenberghe, Convex Optimization (2004), Chapters 2-5

Next Topics

  • Riemannian optimization: the geometric framework underlying manifold-constrained updates
  • Optimizer theory: SGD, Adam, Muon: how Muon approximates Stiefel projection via Newton-Schulz

Last reviewed: April 2026
