Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.

LLM Construction

Optimizer Theory: SGD, Adam, and Muon

Convergence theory of SGD (convex and strongly convex), momentum methods (Polyak and Nesterov), Adam as adaptive + momentum, why SGD can generalize better, the Muon optimizer, and learning rate schedules.

Advanced · Tier 1 · Current · ~70 min

Why This Matters

Optimizer Update Rules Compared

| Optimizer | Update Rule | Trajectory Shape |
| --- | --- | --- |
| SGD | θ -= η·g | zigzag |
| SGD + Momentum | v = β·v + g; θ -= η·v | smooth curve |
| Adam | m = β₁·m + (1−β₁)·g; v = β₂·v + (1−β₂)·g²; θ -= η·m̂/(√v̂ + ε) | adaptive step |
| Muon | G̃ = orthogonalize(G) via Newton-Schulz iteration; θ -= η·μ·G̃ (with momentum) | near-direct |

g = gradient, η = learning rate, β = momentum coefficient, m̂/v̂ = bias-corrected estimates.
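
These update rules can be sketched in a few lines of numpy. The step sizes and gradient below are hypothetical, and Muon's orthogonalization step is omitted here since it requires the Newton-Schulz iteration covered later in this topic.

```python
import numpy as np

# One step of each update rule on the same toy gradient (hypothetical values).
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
theta = np.zeros(3)
g = np.array([1.0, -2.0, 0.5])

# SGD: step is proportional to the raw gradient
sgd = theta - eta * g

# SGD + momentum: velocity starts at zero, so the first step matches plain SGD
v = beta1 * 0.0 + g
mom = theta - eta * v

# Adam: bias-corrected first/second moments (step t = 1)
m = (1 - beta1) * g
s = (1 - beta2) * g**2
m_hat, s_hat = m / (1 - beta1), s / (1 - beta2)
adam = theta - eta * m_hat / (np.sqrt(s_hat) + eps)

print(sgd)   # [-0.1   0.2  -0.05]
print(adam)  # roughly -eta * sign(g): equal magnitude in every coordinate
```

Note how Adam's first step has magnitude η in every coordinate regardless of the gradient's scale, which is the "adaptive step" behavior in the table.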

The optimizer is the algorithm that actually finds the parameters of a neural network. Every model is found by some variant of stochastic gradient descent. Understanding optimizer theory means understanding convergence rates (how fast can you reach a good solution?), the role of stochasticity (why is noise helpful?), and the deep connection between optimization and generalization (why do some optimizers find solutions that generalize better?).

This topic covers the theoretical landscape from classical SGD convergence theory through modern adaptive methods (Adam) to recent optimizers. Muon (Keller Jordan et al., 2024) is an experimental optimizer that has shown strong results on specific benchmarks (the NanoGPT speedrun, some post-training work). It is not yet a production default for large-scale LLM training. We include it here because the underlying mathematical idea (steepest descent under a matrix-norm geometry) is worth understanding even if the practical standard remains AdamW.

Mental Model

Gradient descent walks downhill on the loss surface. Stochastic gradient descent uses a noisy estimate of the gradient (from a mini-batch) instead of the true gradient. This noise is both a curse (convergence is slower) and a blessing (the noise helps escape sharp minima and find flatter regions that generalize better).

Momentum methods add inertia: the optimizer remembers its recent direction and continues moving that way, smoothing out the noise. Adaptive methods (Adam) additionally scale the step size per-parameter based on gradient history. The art of optimization is combining these ideas to converge fast and find solutions that generalize.

SGD Convergence Theory

Definition

Stochastic Gradient Descent

At iteration $t$, SGD updates:

$$\theta_{t+1} = \theta_t - \eta_t \hat{g}_t$$

where $\hat{g}_t = \nabla_\theta \ell(\theta_t; z_t)$ is the gradient on a random sample (or mini-batch) $z_t$, and $\eta_t$ is the learning rate.

The key assumption: $\hat{g}_t$ is an unbiased estimator of the true gradient, $\mathbb{E}[\hat{g}_t \mid \theta_t] = \nabla f(\theta_t)$, with bounded variance $\mathbb{E}[\|\hat{g}_t - \nabla f(\theta_t)\|^2] \leq \sigma^2$.

Theorem

SGD Convergence for Convex Functions

Statement

For a convex, $L$-smooth function $f$ with stochastic gradients of variance $\sigma^2$, SGD with constant learning rate $\eta \leq 1/L$ achieves after $T$ iterations:

$$\mathbb{E}[f(\bar{\theta}_T)] - f(\theta^*) \leq \frac{\|\theta_0 - \theta^*\|^2}{2\eta T} + \frac{\eta \sigma^2}{2}$$

where $\bar{\theta}_T = \frac{1}{T}\sum_{t=1}^{T} \theta_t$ is the averaged iterate.

With the optimal constant step size $\eta = \frac{\|\theta_0 - \theta^*\|}{\sigma\sqrt{T}}$:

$$\mathbb{E}[f(\bar{\theta}_T)] - f(\theta^*) \leq \frac{\|\theta_0 - \theta^*\| \, \sigma}{\sqrt{T}} = O\left(\frac{1}{\sqrt{T}}\right)$$

Intuition

There are two competing forces. The first term $\|\theta_0 - \theta^*\|^2/(2\eta T)$ is the "optimization" term: it shrinks with more iterations and larger step sizes. The second term $\eta\sigma^2/2$ is the "noise" term: it grows with larger step sizes because the stochastic noise accumulates. The optimal step size balances these terms, giving the $O(1/\sqrt{T})$ rate.

This rate is optimal for convex stochastic optimization: no algorithm using only stochastic first-order information can do better in the worst case.
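
To make the trade-off concrete, here is a small script that evaluates the two terms of the bound for hypothetical values of $R = \|\theta_0 - \theta^*\|$, $\sigma$, and $T$; any step size other than the balancing one gives a worse bound.

```python
import math

# Evaluate the convex SGD bound R^2/(2*eta*T) + eta*sigma^2/2
# for hypothetical values R = 4, sigma = 2, T = 2500.
R, sigma, T = 4.0, 2.0, 2500

def bound(eta):
    return R**2 / (2 * eta * T) + eta * sigma**2 / 2   # optimization + noise term

eta_star = R / (sigma * math.sqrt(T))   # step size balancing the two terms
print(eta_star)                          # 0.04
print(bound(eta_star))                   # ~0.16, equal to R * sigma / sqrt(T)
print(bound(eta_star / 2), bound(eta_star * 2))   # both larger than 0.16
```

At the balancing step size the two terms are equal, and the bound collapses to the $R\sigma/\sqrt{T}$ rate from the theorem.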

Proof Sketch

By the update rule and convexity:

$$\|\theta_{t+1} - \theta^*\|^2 = \|\theta_t - \theta^*\|^2 - 2\eta\,\hat{g}_t^\top(\theta_t - \theta^*) + \eta^2\|\hat{g}_t\|^2$$

Taking expectations and using convexity ($\nabla f(\theta_t)^\top(\theta_t - \theta^*) \geq f(\theta_t) - f(\theta^*)$):

$$\mathbb{E}[\|\theta_{t+1} - \theta^*\|^2] \leq \mathbb{E}[\|\theta_t - \theta^*\|^2] - 2\eta(f(\theta_t) - f(\theta^*)) + \eta^2(\|\nabla f(\theta_t)\|^2 + \sigma^2)$$

Using $L$-smoothness to bound $\|\nabla f\|^2$ and telescoping the sum over $T$ steps gives the result.

Why It Matters

The $O(1/\sqrt{T})$ rate explains why training neural networks requires many iterations. To halve the suboptimality, you need 4x more iterations. This is fundamental to stochastic optimization, not a limitation of SGD specifically. The rate also explains the importance of variance reduction: if you can reduce $\sigma^2$ (by using larger batches or variance-reduction techniques), you can use larger step sizes and converge faster.

Failure Mode

This bound assumes convexity, which neural network loss functions violate spectacularly. For non-convex problems, the analogous guarantee is convergence to a stationary point ($\|\nabla f\| \leq \epsilon$), not to the global minimum. The practical success of SGD on non-convex neural networks is not fully explained by the convex theory.

Theorem

SGD Convergence for Strongly Convex Functions

Statement

For a $\mu$-strongly convex, $L$-smooth function with stochastic gradients of variance $\sigma^2$, SGD with step size $\eta_t = \frac{2}{\mu(t+1)}$ achieves:

$$\mathbb{E}[f(\theta_T)] - f(\theta^*) \leq \frac{2L\sigma^2}{\mu^2 T} = O\left(\frac{1}{T}\right)$$

This is a faster rate than the $O(1/\sqrt{T})$ rate for convex functions, because strong convexity provides additional curvature that helps the optimizer converge.

Intuition

Strong convexity means the function curves upward at least quadratically near the optimum. This curvature provides a "restoring force" that pulls the iterates toward $\theta^*$. With a constant step size, the curvature would give linear convergence down to a noise floor; the decreasing step size $\eta_t \propto 1/t$ shrinks the noise contribution over time, yielding the $O(1/T)$ rate.

Why It Matters

Regularized objectives (e.g., ridge regression) are strongly convex, and the $O(1/T)$ rate applies directly. For neural networks, local strong convexity near a minimum can partially explain why SGD converges faster in practice than the worst-case convex rate suggests.

Mini-Batch SGD and Variance Reduction

Definition

Mini-Batch SGD

Instead of using a single sample, estimate the gradient from a mini-batch $B_t$ of size $b$:

$$\hat{g}_t = \frac{1}{b}\sum_{z \in B_t} \nabla \ell(\theta_t; z)$$

The variance of the mini-batch gradient is $\mathrm{Var}(\hat{g}_t) = \sigma^2/b$. By the SGD convergence theorem, replacing $\sigma^2$ with $\sigma^2/b$ gives:

$$\mathbb{E}[f(\bar{\theta}_T)] - f(\theta^*) \leq O\left(\frac{\sigma}{\sqrt{bT}}\right)$$

Increasing the batch size by a factor of 4 is equivalent to running 4x more iterations at the same convergence rate. However, it does not improve the rate: the total number of gradient evaluations $bT$ is what matters.

When does batch size help? Larger batches reduce gradient variance, but the compute cost per iteration increases proportionally. The critical batch size $b^* = \sigma^2 / \|\nabla f\|^2$ is the batch size at which the gradient noise is comparable to the signal. Below $b^*$, increasing the batch size improves efficiency (you waste less compute on noise). Above $b^*$, further increases yield diminishing returns because the gradient is already accurate.
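
The critical batch size can be estimated empirically by comparing the per-example gradient variance to the squared norm of the full gradient. A minimal sketch on a hypothetical linear-regression problem (the data, dimensions, and evaluation point are all made up for illustration):

```python
import numpy as np

# Estimate b* = sigma^2 / ||grad f||^2 on a toy least-squares problem.
# sigma^2 here is the total variance of a single-example gradient.
rng = np.random.default_rng(0)
d, n = 10, 20000
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + rng.normal(scale=1.0, size=n)

w = np.zeros(d)                                  # evaluate at the initialization
per_example = X * ((X @ w - y)[:, None])         # rows are per-example gradients
full_grad = per_example.mean(axis=0)
sigma2 = per_example.var(axis=0).sum()           # summed per-coordinate variance
b_star = sigma2 / (full_grad @ full_grad)
print(f"estimated critical batch size: {b_star:.1f}")
```

Note that $b^*$ depends on where you evaluate it: near a minimum the signal $\|\nabla f\|^2$ shrinks while the noise does not, so the critical batch size grows over training.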

Momentum Methods

Definition

Polyak Heavy Ball Momentum

SGD with heavy ball momentum maintains a velocity vector:

$$v_{t+1} = \beta v_t + \hat{g}_t$$
$$\theta_{t+1} = \theta_t - \eta v_{t+1}$$

where $\beta \in [0, 1)$ is the momentum coefficient (typically $\beta = 0.9$).

The velocity $v_t$ is an exponential moving average of past gradients. This smooths the update direction: instead of responding only to the current gradient, the optimizer continues in the direction it has been moving.
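
A few lines make the smoothing effect concrete; the two gradient sequences below are hypothetical extremes:

```python
# The velocity v_t = beta * v_{t-1} + g_t is a geometric sum of past
# gradients: a persistent direction is amplified toward g / (1 - beta),
# while an oscillating direction largely cancels out.
beta = 0.9

def final_velocity(grads, beta):
    v = 0.0
    for g in grads:
        v = beta * v + g
    return v

persistent = [1.0] * 100                          # same direction every step
oscillating = [(-1.0) ** t for t in range(100)]   # flips sign every step

print(final_velocity(persistent, beta))   # approaches 1 / (1 - beta) = 10
print(final_velocity(oscillating, beta))  # stays small, about 0.53 in magnitude
```

This is exactly the zigzag-vs-smooth-curve distinction from the comparison table: oscillating components of the gradient are damped, consistent ones are amplified by up to $1/(1-\beta)$.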

Theorem

Nesterov Accelerated Gradient Descent

Statement

Nesterov accelerated gradient descent evaluates the gradient at a "lookahead" point:

$$y_t = \theta_t + \frac{t-1}{t+2}(\theta_t - \theta_{t-1})$$
$$\theta_{t+1} = y_t - \frac{1}{L}\nabla f(y_t)$$

For convex, $L$-smooth functions, this achieves:

$$f(\theta_T) - f(\theta^*) \leq \frac{2L\|\theta_0 - \theta^*\|^2}{T^2} = O\left(\frac{1}{T^2}\right)$$

This is the optimal rate for first-order methods on convex smooth functions (by the Nemirovski-Yudin lower bound). Standard gradient descent achieves only $O(1/T)$.

Intuition

The key idea is to evaluate the gradient not at the current point but at where you expect to be after the momentum step. This "look ahead" allows the method to correct overshooting before it happens, rather than after. The momentum coefficient $\frac{t-1}{t+2}$ grows over time, gradually increasing the aggressiveness of the extrapolation.

Why It Matters

Nesterov acceleration demonstrates that momentum is not just a heuristic: it provably speeds up convergence for smooth convex optimization by a quadratic factor. In the stochastic setting, however, Nesterov acceleration does not improve the $O(1/\sqrt{T})$ rate because the noise dominates the acceleration benefit. This is why practical deep learning mostly uses Polyak momentum (which is simpler) rather than Nesterov momentum.

Failure Mode

The $O(1/T^2)$ rate requires exact gradients. With stochastic gradients, the benefit of acceleration is reduced because the noise variance is not affected by the momentum scheme. For stochastic optimization, Nesterov momentum offers at best a constant-factor improvement over Polyak momentum, not an asymptotic improvement.
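
A small deterministic experiment illustrates the acceleration; the quadratic and its curvatures are hypothetical:

```python
import numpy as np

# Compare GD and Nesterov on an ill-conditioned quadratic
# f(x) = 0.5 * sum_i h_i * x_i^2.
h = np.array([0.01, 100.0])     # Hessian eigenvalues; L = 100
L = h.max()
x0 = np.array([1.0, 1.0])

def f(x):
    return 0.5 * np.sum(h * x**2)

def run_gd(T):
    x = x0.copy()
    for _ in range(T):
        x = x - (1.0 / L) * (h * x)          # gradient of f is h * x
    return f(x)

def run_nesterov(T):
    x, x_prev = x0.copy(), x0.copy()
    for t in range(1, T + 1):
        y = x + (t - 1) / (t + 2) * (x - x_prev)   # lookahead extrapolation
        x_prev = x
        x = y - (1.0 / L) * (h * y)
    return f(x)

print(run_gd(300), run_nesterov(300))   # Nesterov ends noticeably lower
```

With step size $1/L$, plain GD barely moves along the low-curvature coordinate, while the growing extrapolation lets Nesterov make quadratically faster progress there; this is the $O(1/T)$ vs $O(1/T^2)$ gap in miniature.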

Adam: Adaptive + Momentum

Adam combines first-moment estimation (momentum) with second-moment estimation (per-parameter adaptive learning rates). The full algorithm is covered in the Adam optimizer topic. Here we focus on the theoretical comparison with SGD.

Adam convergence: For convex problems, Adam with an appropriate step size achieves $O(1/\sqrt{T})$ convergence, matching SGD. However, Reddi et al. (2018) showed that Adam can diverge on simple convex problems because the adaptive learning rate can increase without bound when $v_t$ shrinks.

Adam vs SGD generalization: The key empirical finding (Wilson et al., 2017; Smith et al., 2021) is that SGD with momentum often finds solutions that generalize better than Adam, even when Adam converges faster on the training set.

Proposition

Flat Minima Generalize Better

Statement

The PAC-Bayes bound provides a generalization bound that depends on the "flatness" of a minimum. For a minimum at $\theta^*$ with Hessian $H$:

$$\text{generalization gap} \leq O\left(\sqrt{\frac{\text{tr}(H) \cdot \|\theta^*\|^2}{n}}\right)$$

Flatter minima (smaller Hessian trace/eigenvalues) have tighter generalization bounds. An optimizer that consistently finds flatter minima produces models that generalize better.

Intuition

A flat minimum is one where the loss does not change much when you perturb the parameters. If the minimum is flat, then different training sets (drawn from the same distribution) will have their minima in roughly the same region, so the training minimum is a good predictor of the test minimum. A sharp minimum is "fragile": it works for exactly the training data but a slight distribution shift changes the optimal parameters dramatically.

Why It Matters

SGD with large learning rate and small batch size injects more noise into the optimization, which helps it escape sharp minima and settle in flatter ones. Adam, by adapting the learning rate per-parameter, can converge to sharper minima because it reduces the effective noise in each parameter direction. This is the leading theoretical explanation for why SGD often outperforms Adam in generalization on certain tasks (especially vision).

Failure Mode

The notion of "flatness" is not reparameterization-invariant: reparameterization can change the Hessian eigenvalues without changing the function the network computes. Dinh et al. (2017) showed that for any minimum, you can reparameterize to make it arbitrarily sharp without changing generalization. The PAC-Bayes framework resolves this by fixing a prior, but the philosophical status of the flat minima hypothesis remains debated.

The Muon Optimizer

Definition

Muon Optimizer

Muon (MomentUm Orthogonalized by Newton-Schulz) is an optimizer designed specifically for hidden-layer weight matrices in neural networks. For a weight matrix $W \in \mathbb{R}^{m \times n}$:

  1. Compute the gradient $G_t = \nabla_W \mathcal{L}$
  2. Apply momentum: $M_t = \beta M_{t-1} + G_t$
  3. Orthogonalize the update using Newton-Schulz iterations to approximate the matrix sign function:

$$U_t \approx M_t (M_t^\top M_t)^{-1/2}$$

This projects the update onto the Stiefel manifold (the set of matrices with orthonormal columns/rows), ensuring that the update direction has equal magnitude across all singular value directions.

Muon applies this orthogonalized update to hidden-layer weights, while using Adam for embedding and normalization parameters.

Why orthogonalization? In standard optimizers, the update direction is dominated by the top singular vectors of the gradient. Low-magnitude singular directions, which may carry important structural information, are suppressed. By orthogonalizing, Muon ensures that all directions of the gradient are treated equally, regardless of their magnitude.

Empirical results: Muon achieved state-of-the-art results in the NanoGPT speedrun benchmark, reaching the target validation loss faster than AdamW. The key advantage appears to be in the early-to-mid phase of training, where Muon's orthogonalized updates help the model learn useful representations faster.

Newton-Schulz iterations: The orthogonalization step uses 5-10 iterations of $X_{k+1} = \frac{3}{2}X_k - \frac{1}{2}X_k X_k^\top X_k$ to approximate the orthogonal factor $U$ of the polar decomposition $M = UP$ (with $U$ orthogonal and $P$ symmetric positive semidefinite). This is cheaper than computing a full SVD and is numerically stable.
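
A minimal numpy sketch of the cubic iteration, applied to a matrix with deliberately unequal singular values. The matrix and step count here are illustrative; the plain cubic can need more iterations than the 5-10 quoted above when the smallest singular values are tiny.

```python
import numpy as np

def newton_schulz(M, steps=25):
    # Cubic iteration X <- 1.5*X - 0.5*X @ X.T @ X, which drives every
    # singular value of X toward 1.  Normalizing by the Frobenius norm
    # first puts all singular values in (0, 1], inside the iteration's
    # basin of convergence.
    X = M / (np.linalg.norm(M) + 1e-7)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

# Build a matrix with wildly unequal singular values (hypothetical example).
rng = np.random.default_rng(0)
Q1, _ = np.linalg.qr(rng.normal(size=(4, 4)))
Q2, _ = np.linalg.qr(rng.normal(size=(4, 4)))
G = Q1 @ np.diag([5.0, 1.0, 0.1, 0.01]) @ Q2.T

U = newton_schulz(G)
s = np.linalg.svd(U, compute_uv=False)
print(s)  # all four singular values are driven close to 1
```

The output matrix is (approximately) the orthogonal polar factor of $G$: same singular vectors, but all singular values flattened to 1, which is exactly the "equal magnitude in every singular direction" property the definition describes.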

Practical Considerations

Muon is most beneficial for transformer hidden-layer weights (the large $d_{\text{model}} \times d_{\text{model}}$ matrices). It is not a drop-in replacement for Adam on every parameter shape: the original Muon recipe applies orthogonalized updates to hidden 2D weights and falls back to Adam for embeddings, normalization scales, and biases. The Newton-Schulz step is more expensive than a single Adam update, but the cost is small relative to the matmuls in a forward pass and is often outweighed by faster convergence in the early-to-mid phase of training.

The Muon literature is moving quickly. Variants that overlap Newton-Schulz with gradient all-reduce, that apply momentum corrections to the orthogonalized step, or that reformulate the iteration via the Gram matrix MMM^\top M have all been proposed. Specific citations are intentionally omitted here until the variants stabilize and are independently reproduced.

Learning Rate Schedules

The learning rate schedule is as important as the optimizer itself for LLM training:

Definition

Warmup + Cosine Decay Schedule

The standard LLM learning rate schedule has two phases:

Linear warmup (first $T_w$ steps): $\eta_t = \eta_{\max} \cdot t / T_w$

Cosine decay (remaining steps): $$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\pi \cdot \frac{t - T_w}{T - T_w}\right)\right)$$

Typical values: $T_w \approx 0.01T$ to $0.1T$, $\eta_{\min} = 0.1\,\eta_{\max}$.

Why warmup: In the first steps, Adam's second-moment estimate is unreliable (based on very few gradient samples). Large learning rates with noisy denominator estimates cause wild parameter updates. Warmup gives the second-moment estimate time to stabilize.

Why cosine decay: As training progresses and the model approaches a minimum, the gradient noise increasingly dominates the signal. Reducing the learning rate allows the optimizer to settle into the minimum rather than bouncing around it. The cosine shape provides a smooth decay that is empirically superior to linear or step decay for language models.
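
The full schedule is a few lines of code; the values of $\eta_{\max}$, $\eta_{\min}$, $T$, and $T_w$ below are hypothetical:

```python
import math

# Warmup + cosine decay, exactly as defined above, with hypothetical
# values eta_max = 3e-4, eta_min = 3e-5, T = 10_000, T_w = 500.
def lr_at(t, T, T_w, eta_max, eta_min):
    if t < T_w:
        return eta_max * t / T_w                     # linear warmup (0 at t = 0)
    progress = (t - T_w) / (T - T_w)                 # 0 -> 1 over the decay phase
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * progress))

T, T_w = 10_000, 500
lrs = [lr_at(t, T, T_w, 3e-4, 3e-5) for t in range(T)]
print(lrs[0], lrs[T_w], lrs[-1])   # 0 at the start, peak 3e-4 at T_w, then decays toward 3e-5
```

The peak occurs exactly at $t = T_w$, where the warmup line meets the start of the cosine; in practice the schedule is usually queried per optimizer step via a callback or a framework scheduler.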

Common Confusions

Watch Out

SGD convergence theory does not directly apply to neural networks

The $O(1/\sqrt{T})$ and $O(1/T)$ rates assume convexity (or strong convexity). Neural network loss functions are non-convex with saddle points, local minima, and plateaus. The practical success of SGD on neural networks is empirically well-established but not fully explained by classical convex theory. The theory gives useful intuition (larger batches = less noise, lower learning rate = smaller noise floor) but the quantitative bounds are vacuous for typical deep learning problems.

Watch Out

Momentum does not help asymptotically in the stochastic setting

Nesterov acceleration gives $O(1/T^2)$ vs $O(1/T)$ for deterministic convex optimization, a provable speedup. But for stochastic optimization, both momentum and no-momentum SGD achieve $O(1/\sqrt{T})$, and this is optimal. Momentum helps with the constant factors (smoother trajectory, faster initial convergence) but does not improve the asymptotic rate. The practical benefits of momentum in deep learning are real but not captured by worst-case convergence theory.

Watch Out

Adam is not universally better than SGD

Adam converges faster on the training loss for most problems. But faster training convergence does not imply better generalization. For CNNs on image classification, SGD with momentum consistently generalizes better than Adam. For transformers on language modeling, Adam (specifically AdamW) is standard because the generalization gap is small and the faster convergence matters. The optimal optimizer depends on the architecture and task.

Summary

  • SGD convex rate: $O(1/\sqrt{T})$; strongly convex: $O(1/T)$
  • Nesterov acceleration: $O(1/T^2)$ for deterministic convex optimization (optimal), but no asymptotic help in the stochastic setting
  • Mini-batch of size $b$ reduces gradient variance to $\sigma^2/b$ but does not improve the fundamental rate
  • Critical batch size $b^* = \sigma^2/\|\nabla f\|^2$: above this, larger batches waste compute
  • Adam = momentum + adaptive learning rates; can converge to sharper minima than SGD
  • Flat minima (small Hessian eigenvalues) correlate with better generalization
  • Muon: orthogonalize gradient updates for hidden layers; all singular directions treated equally
  • Learning rate schedule: warmup (stabilize Adam's second moment) + cosine decay (settle into the minimum)
  • SGD generalizes better on vision; Adam/AdamW standard for transformers

Exercises

ExerciseCore

Problem

For SGD on a convex $L$-smooth function with gradient variance $\sigma^2 = 1$ and initial distance $\|\theta_0 - \theta^*\| = 10$, what is the optimal constant learning rate to minimize the bound after $T = 10{,}000$ iterations? What suboptimality does the bound guarantee?

ExerciseAdvanced

Problem

Prove that mini-batch SGD with batch size $b$ has gradient variance $\sigma^2/b$. Then explain why doubling the batch size and halving the number of iterations does not change the convergence bound.

ExerciseResearch

Problem

The Muon optimizer orthogonalizes the gradient update using Newton-Schulz iterations to approximate $M (M^\top M)^{-1/2}$. Why would this help compared to standard Adam updates? Consider a scenario where the gradient matrix has singular values $[100, 1, 0.01]$ and analyze what happens with Adam vs Muon.


References

Canonical:

  • Polyak, "Some methods of speeding up the convergence of iteration methods" (1964): heavy-ball momentum
  • Nesterov, "A method for unconstrained convex minimization problem with the rate of convergence $O(1/k^2)$" (1983)
  • Kingma & Ba, "Adam: A Method for Stochastic Optimization" (2015)

Current:

  • Wilson et al., "The Marginal Value of Adaptive Gradient Methods in Machine Learning" (2017)
  • Jordan, "Muon: An optimizer for hidden layers" (2024): the modded-nanogpt write-up by Keller Jordan
  • Smith et al., "On the Origin of Implicit Regularization in Stochastic Gradient Descent" (2021)

Next Topics

Optimizer theory connects to:

  • Scaling laws: how optimizer choice and learning rate schedule affect compute-optimal training
  • Preconditioned optimizers: Shampoo, K-FAC, and the full-matrix preconditioning story
  • Riemannian optimization: the geometric framework underlying Muon's orthogonalization

Last reviewed: April 2026
