
Optimization Function Classes

Riemannian Optimization and Manifold Constraints

Optimization on curved spaces: the Stiefel manifold for orthogonal matrices, symmetric positive definite matrices, Riemannian gradient descent, retractions, and why flat-space intuitions break on manifolds. The geometric backbone of Shampoo, Muon, and constrained neural network training.

Advanced · Tier 2 · ~55 min

Why This Matters

Standard gradient descent assumes parameters live in flat Euclidean space $\mathbb{R}^d$. But many ML problems have constraints that define curved spaces: weight matrices that must stay orthogonal, covariance matrices that must stay positive definite, probability distributions that must stay on the simplex, or embeddings that must stay on a sphere.

If you just project back to the constraint set after each gradient step, you waste computation and can get poor convergence. Riemannian optimization does better: it works directly on the curved surface (manifold), computing gradients that are tangent to the surface and taking steps that stay on it.

This is not esoteric geometry. The Muon optimizer uses Newton-Schulz iteration to approximate projection onto the Stiefel manifold. Shampoo preconditions gradients using the space of symmetric positive definite matrices. Understanding the manifold geometry explains why these optimizers work and where they fail.

Core Concepts

Definition

Smooth Manifold

A smooth manifold $\mathcal{M}$ is a space that locally looks like $\mathbb{R}^d$ but may be globally curved. Each point $p \in \mathcal{M}$ has a tangent space $T_p\mathcal{M}$, a vector space of directions you can move from $p$ while staying on $\mathcal{M}$ (infinitesimally).

For ML, the key manifolds are:

  • The Stiefel manifold $\text{St}(n, p) = \{W \in \mathbb{R}^{n \times p} : W^\top W = I_p\}$: matrices with orthonormal columns
  • The SPD manifold $\mathcal{S}_{++}^n$: $n \times n$ symmetric positive definite matrices
  • The sphere $S^{d-1} = \{x \in \mathbb{R}^d : \|x\| = 1\}$: unit vectors
  • The Grassmannian $\text{Gr}(n, p)$: $p$-dimensional subspaces of $\mathbb{R}^n$
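These constraint sets are easy to check numerically. Below is a minimal NumPy sketch (the helper names are illustrative, not from any library) that tests membership in three of the manifolds above, up to floating-point tolerance:

```python
import numpy as np

def on_stiefel(W, tol=1e-8):
    """True if W has orthonormal columns, i.e. W^T W = I_p."""
    p = W.shape[1]
    return np.allclose(W.T @ W, np.eye(p), atol=tol)

def on_spd(P, tol=1e-8):
    """True if P is symmetric with strictly positive eigenvalues."""
    if not np.allclose(P, P.T, atol=tol):
        return False
    return np.linalg.eigvalsh(P).min() > tol

def on_sphere(x, tol=1e-8):
    """True if x is a unit vector."""
    return abs(np.linalg.norm(x) - 1.0) < tol

rng = np.random.default_rng(0)
W, _ = np.linalg.qr(rng.standard_normal((5, 3)))    # a point on St(5, 3)
A = rng.standard_normal((4, 4))
P = A @ A.T + np.eye(4)                             # a random SPD matrix
x = rng.standard_normal(6); x /= np.linalg.norm(x)  # a point on S^5
```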
Definition

Riemannian Metric

A Riemannian metric assigns an inner product $g_p: T_p\mathcal{M} \times T_p\mathcal{M} \to \mathbb{R}$ to each tangent space, varying smoothly with $p$. This defines distances, angles, and curvature on the manifold.

For the Stiefel manifold embedded in $\mathbb{R}^{n \times p}$, the simplest metric is the inherited Euclidean one: $g_W(\xi, \eta) = \text{tr}(\xi^\top \eta)$ for tangent vectors $\xi, \eta \in T_W \text{St}(n, p)$.

Definition

Riemannian Gradient

The Riemannian gradient of $f: \mathcal{M} \to \mathbb{R}$ at $p$ is the tangent vector $\text{grad } f(p) \in T_p\mathcal{M}$ such that:

$$g_p(\text{grad } f(p), \xi) = Df(p)[\xi] \quad \forall \xi \in T_p\mathcal{M}$$

For a submanifold of $\mathbb{R}^d$, this is the projection of the Euclidean gradient onto the tangent space:

$$\text{grad } f(p) = \Pi_{T_p\mathcal{M}}(\nabla f(p))$$

The Riemannian gradient is the steepest ascent direction on the manifold (its negative is the steepest descent direction); the projection discards directions that would leave the constraint surface.

The Stiefel Manifold

The Stiefel manifold is the most important manifold in ML optimization because orthogonality constraints appear in neural network weight matrices, embedding layers, and preconditioner updates.

Proposition

Tangent Space and Retraction on the Stiefel Manifold

Statement

The tangent space of the Stiefel manifold at $W$ is:

$$T_W \text{St}(n, p) = \{\xi \in \mathbb{R}^{n \times p} : W^\top \xi + \xi^\top W = 0\}$$

The Riemannian gradient of $f$ at $W$ is obtained by projecting the Euclidean gradient $G = \nabla f(W)$:

$$\text{grad } f(W) = G - W \cdot \text{sym}(W^\top G)$$

where $\text{sym}(A) = (A + A^\top)/2$. A retraction (first-order approximation to the exponential map) is the QR-based retraction:

$$R_W(\xi) = \text{qf}(W + \xi)$$

where $\text{qf}$ extracts the $Q$ factor from the QR decomposition.
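The statement can be sketched in a few lines of NumPy (function names are illustrative): project a Euclidean gradient onto the tangent space, then retract with QR. The sign fix on the $Q$ factor keeps the retraction well defined when diagonal entries of $R$ come out negative.

```python
import numpy as np

def sym(A):
    return 0.5 * (A + A.T)

def riem_grad_stiefel(W, G):
    """Tangent-space projection: grad f(W) = G - W sym(W^T G)."""
    return G - W @ sym(W.T @ G)

def qr_retract(W, xi):
    """QR retraction R_W(xi) = qf(W + xi), signs fixed so diag(R) > 0."""
    Q, R = np.linalg.qr(W + xi)
    return Q * np.sign(np.diag(R))

rng = np.random.default_rng(1)
W, _ = np.linalg.qr(rng.standard_normal((6, 3)))  # a point on St(6, 3)
G = rng.standard_normal((6, 3))                   # a Euclidean gradient
xi = riem_grad_stiefel(W, G)

tangency = W.T @ xi + xi.T @ W      # ~0: the tangent-space condition holds
W_new = qr_retract(W, -0.1 * xi)    # one descent step, back on St(6, 3)
```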

Intuition

The tangent space condition $W^\top \xi + \xi^\top W = 0$ says that tangent directions must be "skew" relative to $W$. Moving in direction $\xi$ from $W$ should keep $W^\top W$ close to $I$ to first order. The projection $G - W \cdot \text{sym}(W^\top G)$ removes the component of the Euclidean gradient that would move $W$ off the manifold. The retraction maps back to the manifold after taking a step.

Why It Matters

This is exactly what Muon's Newton-Schulz iteration approximates. Instead of the expensive QR decomposition, Newton-Schulz uses polynomial iterations to approximate the orthogonal polar factor, which is another valid retraction. The choice of retraction affects computational cost but not the direction of descent (to first order).

Shampoo also exploits manifold structure implicitly: its preconditioner updates maintain positive definiteness, staying on the SPD manifold $\mathcal{S}_{++}^n$, and the Kronecker factorization exploits the product structure of weight matrices.

Failure Mode

QR retraction costs $O(np^2)$, which is expensive for large matrices. The Cayley and polar retractions are alternatives with different cost-accuracy tradeoffs. For very large $n$ (e.g., transformer weight matrices with $n = 4096$), even the retraction cost can dominate training time, which is why approximations like Newton-Schulz are used in practice.
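For intuition, here is the textbook cubic Newton-Schulz iteration in NumPy; it converges to the orthogonal polar factor when the singular values of the (scaled) input lie in $(0, \sqrt{3})$. Muon uses a specially tuned odd polynomial rather than this classic variant, so treat this as an illustrative sketch only.

```python
import numpy as np

def newton_schulz_orth(G, steps=25):
    """Cubic Newton-Schulz: X <- 1.5 X - 0.5 X X^T X.
    Converges to the orthogonal polar factor of G provided the
    singular values of the initial X lie in (0, sqrt(3))."""
    X = G / np.linalg.norm(G)  # Frobenius >= spectral norm, so sigma_max(X) <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(2)
G = rng.standard_normal((8, 4))
Q = newton_schulz_orth(G)

# Compare with the exact polar factor U V^T from the SVD of G.
U, _, Vt = np.linalg.svd(G, full_matrices=False)
```

Each iteration only needs matrix multiplies, which is why this is GPU-friendly compared to a QR or SVD at every step.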

Riemannian Gradient Descent

Theorem

Riemannian Gradient Descent

Statement

Riemannian gradient descent updates:

$$p_{t+1} = R_{p_t}(-\eta \cdot \text{grad } f(p_t))$$

where $R$ is a retraction. Under geodesic $L$-smoothness (the Riemannian analog of $L$-smoothness), with step size $\eta = 1/L$:

$$f(p_T) - f(p^*) \leq \frac{L \cdot d^2(p_0, p^*)}{2T}$$

where $d$ is the geodesic distance. This matches the $O(1/T)$ Euclidean rate.

Intuition

Riemannian gradient descent is the direct analog of Euclidean gradient descent: compute the steepest descent direction (now on the manifold), take a step (now using the retraction instead of addition), and repeat. The convergence theory parallels the Euclidean theory, with geodesic convexity and smoothness replacing their Euclidean counterparts.
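As a concrete instance, here is a NumPy sketch of Riemannian gradient descent on the sphere $S^{d-1}$, minimizing the Rayleigh quotient $x^\top A x$, whose minimum over the sphere is the smallest eigenvalue of $A$. The tangent projection is $(I - xx^\top)$ and the retraction is renormalization; the step size and iteration count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 10
B = rng.standard_normal((d, d))
A = B + B.T                      # random symmetric matrix

x = rng.standard_normal(d)
x /= np.linalg.norm(x)           # start on the sphere
eta = 0.02
for _ in range(10_000):
    egrad = 2.0 * A @ x                  # Euclidean gradient of x^T A x
    rgrad = egrad - (x @ egrad) * x      # project onto tangent space at x
    x = x - eta * rgrad                  # step in the tangent direction
    x /= np.linalg.norm(x)               # retraction: renormalize

rayleigh = x @ A @ x
lam_min = np.linalg.eigvalsh(A).min()
```

The loop is exactly Euclidean gradient descent with two extra lines: the tangent projection and the retraction.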

Why It Matters

This guarantees that constrained optimization on manifolds is no harder in rate than unconstrained optimization. The same $O(1/T)$ rate holds, and with geodesic strong convexity you get the same $O(\kappa \log(1/\epsilon))$ linear rate. The manifold structure is exploited, not fought against.

Failure Mode

Geodesic convexity is a stronger requirement than Euclidean convexity restricted to the manifold. Some functions that are convex in $\mathbb{R}^d$ are not geodesically convex on curved manifolds, and vice versa. The curvature of the manifold introduces additional terms in the convergence analysis. On manifolds with high curvature (small injectivity radius), the step size must be correspondingly small.

Why Low-Rank Does Not Mean Low Complexity

Watch Out

Low rank does not mean simple or easy to optimize

Low-rank approximation via SVD truncation is a Euclidean operation: find the closest rank-$k$ matrix in Frobenius norm. But the set of rank-at-most-$k$ matrices is not convex and not even a smooth manifold (it has singularities where the rank drops further). The smooth part (matrices of exactly rank $k$) forms a manifold, but it has curvature that depends on the singular value gaps.

Optimizing over low-rank matrices is a non-convex problem. The landscape has saddle points corresponding to singular value crossings. Riemannian optimization on the fixed-rank manifold avoids the singularities and can converge provably, but it requires careful treatment of the tangent space (which involves both the column space and the row space of the matrix).
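A NumPy sketch of that tangent space (an illustrative helper, not a full Riemannian solver): at $X = USV^\top$, the tangent projection of an ambient direction $Z$ is $P_U Z + Z P_V - P_U Z P_V$ with $P_U = UU^\top$ and $P_V = VV^\top$, which involves both the column space and the row space.

```python
import numpy as np

def project_fixed_rank_tangent(U, V, Z):
    """Project ambient direction Z onto the tangent space of the
    rank-k manifold at X = U S V^T (U, V with orthonormal columns):
    P(Z) = U U^T Z + Z V V^T - U U^T Z V V^T."""
    UtZ = U.T @ Z
    ZV = Z @ V
    return U @ UtZ + ZV @ V.T - U @ (UtZ @ V) @ V.T

rng = np.random.default_rng(4)
m, n, k = 8, 6, 2
X = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))  # rank-k point
U, S, Vt = np.linalg.svd(X, full_matrices=False)
U, V = U[:, :k], Vt[:k].T

Z = rng.standard_normal((m, n))
PZ = project_fixed_rank_tangent(U, V, Z)
# An orthogonal projection is idempotent: P(P(Z)) = P(Z).
PPZ = project_fixed_rank_tangent(U, V, PZ)
```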

Watch Out

The Stiefel manifold is not flat

The Stiefel manifold $\text{St}(n, p)$ is a curved space embedded in $\mathbb{R}^{n \times p}$. Straight-line steps in the ambient space leave the manifold. This is not just a technical nuisance: the curvature means that parallel transport (moving vectors from one tangent space to another) is path-dependent, and the shortest path between two orthogonal matrices (the geodesic) is not a straight line but a matrix exponential curve.

For the special case $p = n$ (the orthogonal group $O(n)$), the geodesics are $W(t) = W_0 \exp(t \Omega)$ where $\Omega$ is skew-symmetric. With the bi-invariant metric, the sectional curvature of $O(n)$ is nonnegative, given by $\tfrac{1}{4}\|[X, Y]\|^2$ for orthonormal tangent directions $X, Y$.
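The geodesic claim can be checked numerically. In this NumPy sketch, `expm_skew` is an illustrative helper (in practice you would use `scipy.linalg.expm`); every point along $W(t) = W_0 \exp(t\Omega)$ stays orthogonal.

```python
import numpy as np

def expm_skew(Omega):
    """Matrix exponential of a skew-symmetric (hence normal) matrix
    via complex eigendecomposition; the result is real orthogonal."""
    w, P = np.linalg.eig(Omega)
    return (P @ np.diag(np.exp(w)) @ np.linalg.inv(P)).real

rng = np.random.default_rng(5)
n = 4
W0, _ = np.linalg.qr(rng.standard_normal((n, n)))  # a point on O(n)
A = rng.standard_normal((n, n))
Omega = 0.5 * (A - A.T)                            # skew-symmetric direction

curve = [W0 @ expm_skew(t * Omega) for t in (0.0, 0.5, 1.0)]
```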

SPD Matrices and the Information Geometry Connection

The manifold of symmetric positive definite (SPD) matrices $\mathcal{S}_{++}^n$ appears in ML through covariance estimation, Fisher information matrices, kernel matrices, and preconditioner design.

The natural metric on $\mathcal{S}_{++}^n$ is the affine-invariant metric:

$$g_P(\xi, \eta) = \text{tr}(P^{-1} \xi P^{-1} \eta)$$

Under this metric, the geodesic distance between two SPD matrices $P, Q$ is:

$$d(P, Q) = \|\log(P^{-1/2} Q P^{-1/2})\|_F$$

This connects directly to information geometry: the Fisher information metric on statistical models induces the affine-invariant metric on the space of Gaussian covariance matrices. Shampoo's preconditioner updates operate on this space.
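A NumPy sketch of this distance (an illustrative helper; libraries such as `pyriemann` ship production versions): compute $P^{-1/2}$ by eigendecomposition, then take the log-eigenvalues of $P^{-1/2} Q P^{-1/2}$. A good sanity check is affine invariance, $d(XPX^\top, XQX^\top) = d(P, Q)$ for any invertible $X$.

```python
import numpy as np

def spd_distance(P, Q):
    """Affine-invariant distance d(P, Q) = ||log(P^{-1/2} Q P^{-1/2})||_F."""
    w, Vp = np.linalg.eigh(P)
    P_inv_sqrt = Vp @ np.diag(w ** -0.5) @ Vp.T
    lam = np.linalg.eigvalsh(P_inv_sqrt @ Q @ P_inv_sqrt)
    return np.sqrt(np.sum(np.log(lam) ** 2))

rng = np.random.default_rng(6)
A = rng.standard_normal((4, 4))
P = A @ A.T + np.eye(4)          # random SPD matrices
B = rng.standard_normal((4, 4))
Q = B @ B.T + np.eye(4)

X = rng.standard_normal((4, 4)) + 3 * np.eye(4)  # (almost surely) invertible
d1 = spd_distance(P, Q)
d2 = spd_distance(X @ P @ X.T, X @ Q @ X.T)      # affine invariance: d2 == d1
```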

Exercises

ExerciseCore

Problem

Verify that the tangent space condition $W^\top \xi + \xi^\top W = 0$ is necessary for $\xi$ to be tangent to $\text{St}(n, p)$ at $W$. Start from the constraint $W^\top W = I_p$ and differentiate.

ExerciseAdvanced

Problem

The polar retraction on the Stiefel manifold is $R_W(\xi) = (W + \xi)(I_p + \xi^\top \xi)^{-1/2}$. Show that this maps back to $\text{St}(n, p)$, i.e., verify $R_W(\xi)^\top R_W(\xi) = I_p$. Why might this be preferred over the QR retraction in some settings?

References

Canonical:

  • Absil, Mahony, Sepulchre, Optimization Algorithms on Matrix Manifolds (2008). The definitive reference.
  • Edelman, Arias, Smith, "The Geometry of Algorithms with Orthogonality Constraints" (SIMAX, 1998). Foundational paper on Stiefel/Grassmann optimization.

Current:

  • Boumal, An Introduction to Optimization on Smooth Manifolds (2023). Modern textbook, freely available.

  • Hu et al., "Brief Introduction to Manifold Optimization" (JMLR, 2020)

  • Becigneul & Ganea, "Riemannian Adaptive Optimization Methods" (ICLR 2019). Riemannian Adam.

  • Boyd & Vandenberghe, Convex Optimization (2004), Chapters 2-5

Next Topics

  • Optimizer theory: SGD, Adam, Muon: how Muon uses Newton-Schulz to approximate Stiefel retraction
  • Second-order optimization: natural gradient and preconditioned methods that operate on manifold structure

Last reviewed: April 2026
