Optimization Function Classes
Preconditioned Optimizers: Shampoo, K-FAC, and Natural Gradient
Optimizers that use curvature information to precondition gradients: the natural gradient via Fisher information, K-FAC's Kronecker approximation, and Shampoo's full-matrix preconditioning. How they connect to Riemannian optimization and why they outperform Adam on certain architectures.
Why This Matters
Adam preconditions each parameter independently using running averages of squared gradients. This is diagonal preconditioning: each weight gets its own learning rate, but correlations between weights are ignored.
Preconditioned optimizers go further. They use the full (or approximated) curvature structure of the loss landscape to transform gradients before taking a step. The result: updates that account for how parameters interact, not just how large each gradient is. In practice, this means faster convergence on problems with strong parameter correlations, which includes most neural networks.
The cost is computational: maintaining and applying a preconditioner is more expensive per step than Adam. The question is whether the improved convergence compensates. For large-scale training (where each step costs thousands of GPU-hours), the answer is increasingly yes.
The Natural Gradient
Natural Gradient Descent
Statement
The natural gradient of a loss $L(\theta)$ is:

$$\tilde{\nabla} L(\theta) = F(\theta)^{-1} \nabla L(\theta)$$

where $F(\theta) = \mathbb{E}_{x \sim p_\theta}\left[\nabla_\theta \log p_\theta(x)\, \nabla_\theta \log p_\theta(x)^\top\right]$ is the Fisher information matrix.

The natural gradient update is:

$$\theta_{t+1} = \theta_t - \eta\, F(\theta_t)^{-1} \nabla L(\theta_t)$$

This is steepest descent in the KL divergence geometry: it finds the direction that decreases the loss most per unit of distributional change, measured by $\mathrm{KL}(p_{\theta} \,\|\, p_{\theta + d\theta})$.
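A concrete toy example (ours, not from the text): for a Bernoulli model $p = \sigma(\theta)$ with a single logit parameter $\theta$, the Fisher information reduces to the scalar $p(1-p)$, so the natural gradient is just the Euclidean gradient rescaled:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Bernoulli model p = sigmoid(theta), one observation y = 1.
theta, y = 0.0, 1.0
p = sigmoid(theta)

euclid_grad = p - y                   # d/dtheta of the NLL -log p(y | theta)
fisher = p * (1.0 - p)                # scalar Fisher information of the logit
natural_grad = euclid_grad / fisher   # F^{-1} * grad; equals -1/p when y = 1

print(euclid_grad, natural_grad)  # -0.5 -2.0
```

Near saturation, where $p(1-p)$ is small, the Fisher shrinks, so the natural gradient takes larger parameter steps exactly where the raw gradient underestimates how far the logit must move to change the distribution.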
Intuition
Euclidean gradient descent treats all parameter directions equally: a step of size $\epsilon$ in weight 1 is the same as a step of size $\epsilon$ in weight 1000. But in a neural network, some parameters have much larger effects on the output distribution than others. The Fisher information matrix captures these sensitivities.
Multiplying by $F^{-1}$ rescales the gradient so that the update produces equal change in the output distribution in all directions. This is the same idea as Newton's method (which uses the Hessian instead of the Fisher), but adapted for probabilistic models.
Why It Matters
The natural gradient is parameterization-invariant: it gives the same update regardless of how you parameterize the model. Reparameterizing a neural network (e.g., scaling a weight matrix by a constant and dividing the next layer by the same constant) changes the Euclidean gradient but not the natural gradient. This is why natural gradient methods can be more robust to architecture choices.
Practically, natural gradient methods converge in fewer steps than Adam on problems with ill-conditioned Fisher matrices (which is most neural networks). The challenge is computing $F^{-1}$, which costs $O(n^3)$ for $n$ parameters.
Failure Mode
The Fisher matrix has $n^2$ entries for $n$ parameters. For a model with 100M parameters, storing $F$ requires $10^{16}$ entries, which is impossible. Every practical preconditioned optimizer is an approximation to the natural gradient: diagonal (Adam), block-diagonal (K-FAC), Kronecker-factored (Shampoo), or low-rank.
K-FAC: Kronecker-Factored Approximate Curvature
K-FAC (Martens & Grosse, 2015) approximates the Fisher matrix for each layer using a Kronecker product factorization.
For a fully-connected layer with input $a$ and output pre-activation $s = W a$, the Fisher block for that layer's weights is approximately:

$$F_W \approx A \otimes G$$

where $A = \mathbb{E}[a a^\top]$ is the input covariance and $G = \mathbb{E}[\nabla_s \mathcal{L}\, \nabla_s \mathcal{L}^\top]$ is the gradient covariance. The Kronecker structure means:

$$(A \otimes G)^{-1} = A^{-1} \otimes G^{-1}$$

This reduces the cost of inverting the Fisher block for an $m \times n$ weight matrix from $O((mn)^3)$ to $O(m^3 + n^3)$, a massive saving. The Kronecker factorization is exact for linear networks and approximate for nonlinear networks.
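The inverse identity above can be checked numerically. This NumPy sketch (our own illustration; row-major flattening, symmetric positive-definite stand-ins for the factors) confirms that applying the full Kronecker inverse to the flattened gradient equals the cheap factored update $G^{-1} \nabla_W A^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4   # illustrative layer: n inputs, m outputs

# Symmetric positive-definite stand-ins for the K-FAC factors.
X = rng.standard_normal((n, 50)); A = X @ X.T / 50 + 0.1 * np.eye(n)   # input covariance
Y = rng.standard_normal((m, 50)); G = Y @ Y.T / 50 + 0.1 * np.eye(m)   # gradient covariance
V = rng.standard_normal((m, n))   # gradient w.r.t. the m x n weight matrix

# Applying the full (mn x mn) Kronecker inverse to the flattened gradient...
full = np.linalg.solve(np.kron(G, A), V.flatten())

# ...matches the factored form G^{-1} V A^{-1}: O(m^3 + n^3) instead of O((mn)^3).
factored = np.linalg.solve(G, V) @ np.linalg.inv(A)

print(np.allclose(full, factored.flatten()))  # True
```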
Shampoo
Shampoo Update Rule
Statement
Shampoo (Gupta, Koren, Singer, 2018) maintains left and right preconditioners for each weight matrix $W \in \mathbb{R}^{m \times n}$ with gradient $G_t$:

$$L_t = L_{t-1} + G_t G_t^\top, \qquad R_t = R_{t-1} + G_t^\top G_t$$

The preconditioned update is:

$$W_{t+1} = W_t - \eta\, L_t^{-1/4}\, G_t\, R_t^{-1/4}$$

The power $1/4$ (rather than $1/2$) arises because the Kronecker approximation gives the full-matrix AdaGrad preconditioner as $H_t \approx (L_t \otimes R_t)^{1/2}$; applying $H_t^{-1/2} = (L_t \otimes R_t)^{-1/4}$ to the gradient matrix (not its vectorization) becomes $L_t^{-1/4} G_t R_t^{-1/4}$.
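A minimal sketch of the update rule (our own simplification: no momentum, no grafting, and the inverse fourth root recomputed every step via eigendecomposition):

```python
import numpy as np

def inv_fourth_root(M, eps=1e-8):
    # M^{-1/4} for a symmetric PSD matrix, via eigendecomposition.
    w, Q = np.linalg.eigh(M)
    return Q @ np.diag((w + eps) ** -0.25) @ Q.T

def shampoo_step(W, G, L, R, lr=0.1):
    # Accumulate the left/right preconditioners, then precondition the gradient.
    L = L + G @ G.T
    R = R + G.T @ G
    W = W - lr * inv_fourth_root(L) @ G @ inv_fourth_root(R)
    return W, L, R

m, n = 2, 2
W = np.zeros((m, n))
L, R = np.zeros((m, m)), np.zeros((n, n))
G = np.eye(2)   # toy gradient: with G = I, L = R = I and the step reduces to SGD
W, L, R = shampoo_step(W, G, L, R)
print(W)   # approximately -0.1 * I
```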
Intuition
Shampoo is K-FAC for general weight matrices, simplified. Instead of separating the Fisher into input covariance and gradient covariance, it directly accumulates (left) and (right) from the gradient itself. The left preconditioner captures correlations between output neurons; the right captures correlations between input features. Together, they approximate the full Fisher.
The matrix fourth root $H^{-1/4}$ is the geometric mean of the identity $I$ and the inverse square root $H^{-1/2}$. It provides a balance between no preconditioning ($H^0 = I$) and full-strength adaptive preconditioning ($H^{-1/2}$, the full-matrix AdaGrad step that plays the role of the natural gradient here).
Why It Matters
Shampoo has shown strong empirical results on transformer training, sometimes matching or exceeding Adam with fewer steps (though each step is more expensive due to the matrix power computation). Google has used Shampoo-derived optimizers (Distributed Shampoo) for production training of large models. The connection to Riemannian optimization is direct: Shampoo's update can be interpreted as Riemannian gradient descent on the space of weight matrices with a specific metric.
Failure Mode
Computing $L_t^{-1/4}$ requires an eigendecomposition of $L_t$, costing $O(m^3)$ per step (and similarly $O(n^3)$ for $R_t^{-1/4}$). For large layers ($m, n$ on the order of $10^4$), this is expensive. Practical implementations amortize this cost by recomputing the preconditioner every $k$ steps (e.g., $k = 100$). The matrix power can also be approximated using Newton-Schulz iterations (the same technique as Muon). Memory cost is $O(m^2 + n^2)$ per $m \times n$ weight matrix for the preconditioners.
Comparison Table
| Optimizer | Preconditioner | Cost per step | Memory overhead | Convergence | Best for |
|---|---|---|---|---|---|
| SGD | None ($I$) | $O(n)$ | 0 | Slow on ill-conditioned problems | Convex, well-conditioned |
| Adam | Diagonal ($\mathrm{diag}(v_t)^{-1/2}$) | $O(n)$ | $O(2n)$ (moments) | Good general-purpose | Default choice |
| K-FAC | Kronecker ($A^{-1} \otimes G^{-1}$) | $O(m^3 + n^3)$, amortized | $O(m^2 + n^2)$ | Fast on FC layers | Large FC networks |
| Shampoo | Full matrix per layer ($L^{-1/4}, R^{-1/4}$) | $O(m^3 + n^3)$, amortized | $O(m^2 + n^2)$ | Fast, especially on matrix parameters | Transformers, large models |
| Muon | Stiefel projection | $O(mn \min(m, n))$ (Newton-Schulz) | $O(mn)$ (momentum) | Very fast on orthogonal-like weights | Transformer weights |
| Natural gradient | Full Fisher ($F^{-1}$) | $O(n^3)$ | $O(n^2)$ | Optimal per step | Infeasible for large $n$ |
Common Confusions
Shampoo is not just Adam with bigger matrices
Adam uses diagonal preconditioning: each parameter gets its own adaptive learning rate based on its own squared gradient history. Shampoo uses full-matrix preconditioning: the update to parameter $i$ depends on the gradient history of all other parameters in the same weight matrix. This captures cross-parameter correlations that Adam misses entirely. The difference matters most when parameters are highly correlated (which they are in attention weight matrices).
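A toy numerical contrast (ours, not from the text): when two parameters always receive identical gradients, the full second-moment matrix exposes the rank-1 correlation structure, while the diagonal statistics Adam keeps cannot distinguish this from two independent parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.standard_normal(1000)
grads = np.stack([a, a], axis=1)      # every gradient is (a_t, a_t): perfect correlation

full = grads.T @ grads / len(grads)   # full second moment: off-diagonals equal diagonals
diag = np.mean(grads ** 2, axis=0)    # Adam's view: per-parameter magnitudes only

print(np.round(full, 3))   # a rank-1 matrix, revealing the correlation
print(np.round(diag, 3))   # the same diagonal an uncorrelated problem would give
```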
The matrix fourth root is not arbitrary
The exponent $1/4$ in Shampoo comes from the Kronecker factorization math, not from tuning. Full-matrix AdaGrad preconditions with $H^{-1/2}$, where $H = \sum_t \mathrm{vec}(G_t)\,\mathrm{vec}(G_t)^\top$. If $H$ factored exactly as $L \otimes R$, you would use $H^{-1/2} = L^{-1/2} \otimes R^{-1/2}$. But $L_t$ and $R_t$ each accumulate the full gradient magnitude, so $L_t \otimes R_t$ double-counts it; the right approximation is the geometric mean $H \approx (L_t \otimes R_t)^{1/2}$, which gives $H^{-1/2} \approx L_t^{-1/4} \otimes R_t^{-1/4}$, i.e. $L_t^{-1/4} G_t R_t^{-1/4}$ in matrix form. The exponent is derived, not chosen.
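The last step uses the standard vec/Kronecker identity; sketched here with a row-major $\operatorname{vec}$, for which $(B \otimes C)\operatorname{vec}(X) = \operatorname{vec}(B X C)$ when $C$ is symmetric:

$$H \approx (L_t \otimes R_t)^{1/2} \;\Longrightarrow\; H^{-1/2} \approx L_t^{-1/4} \otimes R_t^{-1/4}, \qquad \big(L_t^{-1/4} \otimes R_t^{-1/4}\big)\operatorname{vec}(G_t) = \operatorname{vec}\!\big(L_t^{-1/4}\, G_t\, R_t^{-1/4}\big)$$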
More preconditioning is not always better
Full natural gradient ($F^{-1}$) is theoretically optimal per step, but each step is vastly more expensive. The wall-clock time to reach a target loss depends on both the number of steps and the cost per step. Adam is cheap per step. Shampoo is expensive per step but takes fewer steps. The crossover depends on model size, hardware, and how ill-conditioned the problem is. For small models, Adam wins on wall-clock. For large transformers with many GPU-hours per step, Shampoo can win.
Exercises
Problem
Adam maintains running averages $v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$, where the square $g_t^2$ is elementwise. Explain why this is equivalent to a diagonal approximation of the Fisher matrix and why it misses cross-parameter correlations.
Problem
For a weight matrix $W \in \mathbb{R}^{m \times n}$, compute the memory cost of Shampoo's preconditioners ($L \in \mathbb{R}^{m \times m}$ and $R \in \mathbb{R}^{n \times n}$) compared to Adam's two moment buffers. At what matrix size does Shampoo's memory overhead exceed Adam's by more than 2x?
References
Canonical:
- Amari, "Natural Gradient Works Efficiently in Learning" (Neural Computation, 1998). The foundational paper on natural gradient.
- Martens & Grosse, "Optimizing Neural Networks with Kronecker-factored Approximate Curvature" (ICML 2015). K-FAC.
- Gupta, Koren, Singer, "Shampoo: Preconditioned Stochastic Tensor Optimization" (ICML 2018). The original Shampoo paper.
Current:
- Anil et al., "Scalable Second Order Optimization for Deep Learning" (2021). Distributed Shampoo at Google scale.
- Bernstein & Newhouse, "Old Optimizer, New Norm: An Anthology" (2024). Muon and the Newton-Schulz connection.
- Boyd & Vandenberghe, Convex Optimization (2004), Chapters 2-5.
Next Topics
- Riemannian optimization: the geometric framework underlying manifold-constrained updates
- Optimizer theory (SGD, Adam, Muon): how Muon approximates Stiefel projection via Newton-Schulz
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Convex Optimization Basics (Layer 1)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Fisher Information (Layer 0B)
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- The Hessian Matrix (Layer 0A)