Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.

Comparison

Shampoo vs. Adam vs. Muon

Three approaches to preconditioning gradients: Adam (diagonal, per-parameter), Shampoo (full-matrix Kronecker), and Muon (orthogonalized updates via Newton-Schulz). Each uses increasingly rich curvature information at increasing computational cost.

What Each Does

Adam tracks per-parameter running averages of the gradient ($m_t$) and squared gradient ($v_t$). The update divides by $\sqrt{v_t}$, giving each parameter its own adaptive learning rate. This is diagonal preconditioning: it ignores correlations between parameters.
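As a concrete reference, here is a minimal NumPy sketch of one Adam step (a hypothetical `adam_step` helper, not from any particular library). The elementwise division by $\sqrt{v_t}$ is exactly the diagonal preconditioner:

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. w, g, m, v are same-shape arrays; t is the 1-based step."""
    m = beta1 * m + (1 - beta1) * g       # first moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * g**2    # second moment: running mean of squared gradients
    m_hat = m / (1 - beta1**t)            # bias correction for zero initialization
    v_hat = v / (1 - beta2**t)
    # Diagonal preconditioning: each parameter is rescaled independently.
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

Because the scaling is elementwise, no information flows between parameters: the update direction within a weight matrix is never rotated, only stretched.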

Shampoo tracks full left and right covariance matrices $L_t = \sum G_t G_t^\top$ and $R_t = \sum G_t^\top G_t$ for each weight matrix. The update applies $L_t^{-1/4} G_t R_t^{-1/4}$, accounting for cross-parameter correlations. This is full-matrix Kronecker preconditioning: an approximation to the natural gradient.
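A minimal sketch of this update under the definitions above (hypothetical helper names; production implementations add damping, grafting, and recompute the matrix roots only every few hundred steps):

```python
import numpy as np

def matrix_power(A, p, eps=1e-8):
    """Power of a symmetric PSD matrix via eigendecomposition."""
    lam, Q = np.linalg.eigh(A)
    return (Q * np.maximum(lam, eps) ** p) @ Q.T

def shampoo_step(W, G, L, R, lr=0.1):
    """One Shampoo update for an m x n weight matrix W with gradient G."""
    L = L + G @ G.T                 # left statistics,  m x m
    R = R + G.T @ G                 # right statistics, n x n
    # Inverse fourth roots on both sides: the Kronecker preconditioner.
    precond_G = matrix_power(L, -0.25) @ G @ matrix_power(R, -0.25)
    return W - lr * precond_G, L, R
```

Unlike Adam's elementwise scaling, the left and right multiplications mix entries within a row and within a column, which is how cross-parameter correlations enter the update.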

Muon orthogonalizes the gradient update using Newton-Schulz iterations, which approximate the polar factor of the gradient matrix. This projects the update onto the Stiefel manifold of semi-orthogonal matrices. This is manifold preconditioning: it constrains the geometry of the update rather than scaling it.

Side-by-Side Comparison

| Property | Adam | Shampoo | Muon |
| --- | --- | --- | --- |
| Preconditioner | Diagonal ($\text{diag}(v_t)^{-1/2}$) | Kronecker ($L^{-1/4} \cdot R^{-1/4}$) | Orthogonal (polar factor) |
| Curvature captured | Per-parameter scale | Cross-parameter correlations | Update geometry |
| Cost per step | $O(d)$ | $O(m^3 + n^3)$ per $m \times n$ matrix | $O(m^2 n)$ per Newton-Schulz iteration |
| Memory overhead | $2d$ (two moments) | $m^2 + n^2$ per weight matrix | $O(mn)$ per weight matrix |
| Steps to convergence | Baseline | Fewer (better conditioning) | Fewer on transformers |
| Wall-clock efficiency | Best for small models | Best for large, ill-conditioned models | Competitive on transformers |
| Theory connection | Diagonal natural gradient | Kronecker-factored natural gradient | Riemannian optimization on the Stiefel manifold |

When Each Wins

Adam wins when:

- The model is small, so its low per-step cost dominates total wall-clock time.
- Parameters lack exploitable matrix structure (biases, norms, embeddings).

Shampoo wins when:

- The model is large and ill-conditioned, so the reduction in steps outweighs the extra per-step cost.
- The matrix-root computation can be amortized by recomputing the preconditioner infrequently.

Muon wins when:

- Training transformer weight matrices, where orthogonalized updates reach the target loss in fewer steps at near-Adam per-step cost.

The $-1/4$ Exponent in Shampoo

Adam uses $v_t^{-1/2}$: the inverse square root of the second moment. Shampoo uses $L^{-1/4}$ and $R^{-1/4}$: inverse fourth roots. This is not arbitrary. The Fisher information matrix for a weight matrix $W$ approximately factorizes as a Kronecker product of the left and right statistics, and the natural gradient preconditions with $F^{-1/2}$. Because $L$ and $R$ each aggregate the full gradient second moment, the product $L \otimes R$ double-counts it; the geometric-mean correction gives $F \approx (L \otimes R)^{1/2}$, so $F^{-1/2} = (L \otimes R)^{-1/4} = L^{-1/4} \otimes R^{-1/4}$, which acts on the weight matrix directly as $L^{-1/4} G R^{-1/4}$. The exponent $-1/4$ is derived, not tuned.
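The equivalence between the matrix form and the Kronecker form can be checked numerically. With column-major vectorization, $\operatorname{vec}(AXB) = (B^\top \otimes A)\operatorname{vec}(X)$, so $L^{-1/4} G R^{-1/4}$ flattens to $(R^{-1/4} \otimes L^{-1/4})\operatorname{vec}(G)$:

```python
import numpy as np

def sym_power(A, p):
    """Power of a symmetric PD matrix via eigendecomposition."""
    lam, Q = np.linalg.eigh(A)
    return (Q * lam ** p) @ Q.T

rng = np.random.default_rng(0)
G = rng.standard_normal((3, 4))
L = G @ G.T + 0.1 * np.eye(3)   # left factor,  3 x 3, made positive definite
R = G.T @ G + 0.1 * np.eye(4)   # right factor, 4 x 4, made positive definite

# Matrix form of the Shampoo preconditioned gradient:
matrix_form = sym_power(L, -0.25) @ G @ sym_power(R, -0.25)

# Kronecker form acting on the column-major flattened gradient:
vec_form = np.kron(sym_power(R, -0.25), sym_power(L, -0.25)) @ G.flatten(order="F")

assert np.allclose(matrix_form.flatten(order="F"), vec_form)
```

Note the Kronecker matrix here is $12 \times 12$ for a $3 \times 4$ weight; the matrix form computes the same product without ever materializing it.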

Common Confusions

Watch Out

Muon is not just Adam with an extra step

Adam and Muon have structurally different update rules. Adam rescales each parameter independently. Muon orthogonalizes the entire gradient matrix, which changes the direction of the update, not just its scale. The orthogonalization step (Newton-Schulz) is the core innovation, not a post-processing trick.
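A toy numerical contrast makes this concrete: below, a one-step caricature of Adam's elementwise scaling (no moment accumulation) against the exact polar factor that Newton-Schulz approximates. The two produce genuinely different directions, and only the orthogonalized one is (semi-)orthogonal:

```python
import numpy as np

# Gradient with an off-diagonal entry, so the two updates visibly differ.
G = np.array([[3.0, 1.0],
              [0.0, 0.3]])

# Adam-like diagonal scaling: each entry is rescaled independently,
# so the sign pattern of G is preserved but nothing is rotated.
adam_dir = G / np.sqrt(G**2 + 1e-8)

# Muon-like orthogonalization: exact polar factor via SVD
# (the quantity the Newton-Schulz iteration approximates).
U, _, Vt = np.linalg.svd(G)
muon_dir = U @ Vt
```

`adam_dir` is close to the elementwise sign matrix of `G` and is not orthogonal; `muon_dir` is an orthogonal matrix whose direction differs from any elementwise rescaling of `G`.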

Watch Out

Shampoo is not always better than Adam on wall-clock time

Shampoo typically converges in fewer steps, but each step is more expensive ($O(m^3 + n^3)$ for the matrix roots). On small models where Adam's per-step cost dominates, Adam finishes faster. On large models where the matrix-root cost is amortized, Shampoo can win. The crossover point depends on model size, hardware, and how often the preconditioner is recomputed (the amortization frequency).
