## What Each Does
Adam tracks per-parameter running averages of the gradient ($m_t$) and squared gradient ($v_t$). The update divides by $\sqrt{v_t} + \epsilon$, giving each parameter its own adaptive learning rate. This is diagonal preconditioning: it ignores correlations between parameters.
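In code, one Adam step looks roughly like this (a minimal NumPy sketch; `adam_step` and its defaults are illustrative, not a reference implementation):

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One bias-corrected Adam step for a single parameter tensor.

    w: parameters, g: gradient, m/v: running first/second moments,
    t: 1-based step count. All names here are illustrative.
    """
    m = b1 * m + (1 - b1) * g        # first moment (running mean of gradients)
    v = b2 * v + (1 - b2) * g * g    # second moment (running mean of squared gradients)
    m_hat = m / (1 - b1 ** t)        # bias correction for zero initialization
    v_hat = v / (1 - b2 ** t)
    # Diagonal preconditioning: each parameter gets its own step size.
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

Note that the preconditioner here is a vector, not a matrix: each coordinate is rescaled independently, which is exactly what "diagonal" means.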
Shampoo tracks full left and right covariance matrices $L_t = \sum_s G_s G_s^\top$ and $R_t = \sum_s G_s^\top G_s$ for each weight matrix. The update applies $L_t^{-1/4} G_t R_t^{-1/4}$, accounting for cross-parameter correlations. This is full-matrix Kronecker preconditioning: an approximation to the natural gradient.
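A minimal, un-amortized sketch of this update (names like `shampoo_step` are illustrative; production implementations amortize the matrix roots over many steps and add damping):

```python
import numpy as np

def matrix_power(A, p, eps=1e-6):
    """A^p for a symmetric PSD matrix via eigendecomposition.

    Eigenvalues are clamped at eps so negative powers stay finite.
    """
    vals, vecs = np.linalg.eigh(A)
    return (vecs * np.maximum(vals, eps) ** p) @ vecs.T

def shampoo_step(W, G, L, R, lr=1e-2):
    """One Shampoo step for an m x n weight matrix W with gradient G."""
    L = L + G @ G.T                  # left covariance, m x m
    R = R + G.T @ G                  # right covariance, n x n
    # Kronecker preconditioning: inverse fourth roots on both sides.
    update = matrix_power(L, -0.25) @ G @ matrix_power(R, -0.25)
    return W - lr * update, L, R
```

The two-sided multiply is where the cross-parameter correlations enter: every entry of the update mixes information from an entire row and column of the gradient.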
Muon orthogonalizes the gradient update using Newton-Schulz iterations, which approximate the polar factor of the gradient matrix. This projects the update onto the Stiefel manifold of semi-orthogonal matrices. This is manifold preconditioning: it constrains the update geometry rather than scaling it.
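A sketch of the orthogonalization step using the classical cubic Newton-Schulz iteration (Muon itself uses a tuned quintic polynomial and only a handful of iterations; this illustrative cubic version may need more steps to converge fully):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximate the polar factor (nearest semi-orthogonal matrix) of G.

    Classical cubic Newton-Schulz iteration, shown for illustration.
    Normalizing by the Frobenius norm bounds the spectral norm by 1,
    which the iteration needs in order to converge.
    """
    X = G / (np.linalg.norm(G) + 1e-12)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X   # pushes all singular values toward 1
    return X
```

The iteration uses only matrix multiplies, no decompositions, which is why it runs well on accelerators; the result has (approximately) all singular values equal to 1, so the update direction is kept but its per-direction scale is discarded.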
## Side-by-Side Comparison
| Property | Adam | Shampoo | Muon |
|---|---|---|---|
| Preconditioner | Diagonal ($v_t^{-1/2}$) | Kronecker ($L_t^{-1/4} \otimes R_t^{-1/4}$) | Orthogonal (polar factor) |
| Curvature captured | Per-parameter scale | Cross-parameter correlations | Update geometry |
| Cost per step | $O(mn)$ per matrix | $O(m^3 + n^3)$ (matrix roots) | $O(m^2 n)$ (NS iterations) |
| Memory overhead | $2mn$ (two moments) | $m^2 + n^2$ per weight matrix | $mn$ (momentum) per weight matrix |
| Steps to convergence | Baseline | Fewer (better conditioning) | Fewer on transformers |
| Wall-clock efficiency | Best for small models | Best for large, ill-conditioned models | Competitive on transformers |
| Theory connection | Diagonal natural gradient | Kronecker-factored natural gradient | Riemannian optimization on Stiefel manifold |
## When Each Wins
Adam wins when:
- The model is small or the per-step compute budget is tight
- The loss landscape is well-conditioned (small models, simple architectures)
- You need a well-understood, debugged optimizer with decades of practical usage
- AdamW is the correct choice for most practitioners most of the time
Shampoo wins when:
- The model has large weight matrices with strong cross-parameter correlations
- The compute budget per step is not the bottleneck (e.g., data loading dominates)
- You can amortize the matrix power computation over many steps (e.g., recompute every 100 steps)
- Training runs at Google scale, where even small per-step improvements compound
Muon wins when:
- Training transformers specifically (where orthogonalized updates match the weight matrix structure)
- You want better conditioning than Adam without the cost of Shampoo
- The Newton-Schulz iterations (typically around 5 per update) are affordable
## The Exponent in Shampoo
Adam uses $v_t^{-1/2}$: the inverse square root of the second moment. Shampoo uses $L_t^{-1/4}$ and $R_t^{-1/4}$: inverse fourth roots. This is not arbitrary. Full-matrix preconditioning (the Fisher/AdaGrad view) would use $H_t^{-1/2}$, where $H_t = \sum_s g_s g_s^\top$ is built from vectorized gradients. Shampoo bounds $H_t$ by the Kronecker factorization $(L_t \otimes R_t)^{1/2}$. Since $\big((L_t \otimes R_t)^{1/2}\big)^{-1/2} = L_t^{-1/4} \otimes R_t^{-1/4}$, and this Kronecker product acting on the vectorized gradient becomes $L_t^{-1/4} G_t R_t^{-1/4}$ when applied to the matrix directly, the exponent is derived, not tuned.
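The Kronecker-to-matrix step in this derivation can be checked numerically (an illustrative NumPy sketch using the column-major identity $\mathrm{vec}(AXB) = (B^\top \otimes A)\,\mathrm{vec}(X)$; all names here are made up for the demo):

```python
import numpy as np

def matrix_power(A, p):
    """A^p for a symmetric positive-definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(A)
    return (vecs * vals ** p) @ vecs.T

rng = np.random.default_rng(0)
G = rng.standard_normal((3, 2))
L = G @ G.T + 0.1 * np.eye(3)    # left covariance (m x m), jittered to be PD
R = G.T @ G + 0.1 * np.eye(2)    # right covariance (n x n)

# Kronecker form acting on vec(G); column-major vec matches the identity
# vec(A X B) = (B^T kron A) vec(X), and R^{-1/4} is symmetric.
P = np.kron(matrix_power(R, -0.25), matrix_power(L, -0.25))
vec_form = (P @ G.flatten(order="F")).reshape((3, 2), order="F")

# Direct matrix form: the update Shampoo actually computes.
mat_form = matrix_power(L, -0.25) @ G @ matrix_power(R, -0.25)

assert np.allclose(vec_form, mat_form)
```

The matrix form is why Shampoo is tractable: it never materializes the $mn \times mn$ Kronecker matrix, only the $m \times m$ and $n \times n$ factors.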
## Common Confusions
### Muon is not just Adam with an extra step
Adam and Muon have structurally different update rules. Adam rescales each parameter independently. Muon orthogonalizes the entire gradient matrix, which changes the direction of the update, not just its scale. The orthogonalization step (Newton-Schulz) is the core innovation, not a post-processing trick.
### Shampoo is not always better than Adam on wall-clock time
Shampoo typically converges in fewer steps, but each step is more expensive ($O(m^3 + n^3)$ for the matrix roots). On small models where Adam's per-step cost dominates, Adam finishes faster. On large models where the matrix power is amortized, Shampoo can win. The crossover point depends on model size, hardware, and amortization frequency.
## References
- Kingma & Ba, "Adam: A Method for Stochastic Optimization" (ICLR 2015)
- Gupta, Koren, Singer, "Shampoo: Preconditioned Stochastic Tensor Optimization" (ICML 2018)
- Bernstein & Newhouse, "Old Optimizer, New Norm: An Anthology" (2024). Muon.
- Anil et al., "Scalable Second Order Optimization for Deep Learning" (2021). Distributed Shampoo.