LLM Construction
Optimizer Theory: SGD, Adam, and Muon
Convergence theory of SGD (convex and strongly convex), momentum methods (Polyak and Nesterov), Adam as adaptive + momentum, why SGD can generalize better, the Muon optimizer, and learning rate schedules.
Why This Matters
The optimizer is the algorithm that actually finds the parameters of a neural network. Every model is found by some variant of stochastic gradient descent. Understanding optimizer theory means understanding convergence rates (how fast can you reach a good solution?), the role of stochasticity (why is noise helpful?), and the deep connection between optimization and generalization (why do some optimizers find solutions that generalize better?).
This topic covers the theoretical landscape from classical SGD convergence theory through modern adaptive methods (Adam) to recent optimizers. Muon (Keller Jordan et al. 2024) is an experimental optimizer that has shown strong results on specific benchmarks (NanoGPT speedrun, some post-training work). It is not yet a production default for large-scale LLM training. It is included here because the underlying mathematical idea, steepest descent under a matrix-norm geometry, is worth understanding even if the practical standard remains AdamW.
Mental Model
Gradient descent walks downhill on the loss surface. Stochastic gradient descent uses a noisy estimate of the gradient (from a mini-batch) instead of the true gradient. This noise is both a curse (convergence is slower) and a blessing (the noise helps escape sharp minima and find flatter regions that generalize better).
Momentum methods add inertia: the optimizer remembers its recent direction and continues moving that way, smoothing out the noise. Adaptive methods (Adam) additionally scale the step size per-parameter based on gradient history. The art of optimization is combining these ideas to converge fast and find solutions that generalize.
SGD Convergence Theory
Stochastic Gradient Descent
At iteration $t$, SGD updates:

$$\theta_{t+1} = \theta_t - \eta_t \nabla f_{i_t}(\theta_t)$$

where $\nabla f_{i_t}(\theta_t)$ is the gradient on a random sample (or mini-batch) $i_t$, and $\eta_t$ is the learning rate.

The key assumption: $\nabla f_{i_t}(\theta)$ is an unbiased estimator of the true gradient, $\mathbb{E}_{i_t}[\nabla f_{i_t}(\theta)] = \nabla f(\theta)$, with bounded variance $\mathbb{E}\,\|\nabla f_{i_t}(\theta) - \nabla f(\theta)\|^2 \le \sigma^2$.
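As a concrete illustration (a toy least-squares problem of my choosing, not from the text above), a minimal SGD loop with single-sample gradients:

```python
import numpy as np

rng = np.random.default_rng(0)

# Least-squares objective f(theta) = (1/2n) ||X theta - y||^2: convex and smooth.
n, d = 1000, 5
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=n)

def stoch_grad(theta, i):
    # Gradient on one random sample: an unbiased estimate of the full gradient.
    return X[i] * (X[i] @ theta - y[i])

theta = np.zeros(d)
eta = 0.01
for t in range(20_000):
    i = rng.integers(n)
    theta -= eta * stoch_grad(theta, i)

# SGD approaches the least-squares solution, up to a noise floor set by eta.
theta_star = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.linalg.norm(theta - theta_star))
```

The constant step size leaves a residual error proportional to $\eta\sigma^2$, matching the noise-floor term in the convergence bound below.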
SGD Convergence for Convex Functions
Statement
For a convex, $L$-smooth function $f$ with stochastic gradients of variance $\sigma^2$, SGD with constant learning rate $\eta$ achieves after $T$ iterations:

$$\mathbb{E}[f(\bar\theta_T)] - f(\theta^*) \le \frac{\|\theta_0 - \theta^*\|^2}{2\eta T} + \frac{\eta \sigma^2}{2}$$

where $\bar\theta_T = \frac{1}{T}\sum_{t=1}^{T} \theta_t$ is the averaged iterate.

With the optimal constant step size $\eta = \frac{\|\theta_0 - \theta^*\|}{\sigma\sqrt{T}}$:

$$\mathbb{E}[f(\bar\theta_T)] - f(\theta^*) \le \frac{\|\theta_0 - \theta^*\|\,\sigma}{\sqrt{T}} = O(1/\sqrt{T})$$
Intuition
There are two competing forces. The first term is the "optimization" term. It shrinks with more iterations and larger step sizes. The second term is the "noise" term. It grows with larger step sizes because the stochastic noise accumulates. The optimal step size balances these terms, giving the $O(1/\sqrt{T})$ rate.
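Balancing the two terms can be done in one line (writing $D = \|\theta_0 - \theta^*\|$ for the initial distance; this uses the standard textbook form of the bound, optimization term $D^2/(2\eta T)$ plus noise term $\eta\sigma^2/2$):

```latex
\frac{d}{d\eta}\left[\frac{D^2}{2\eta T} + \frac{\eta \sigma^2}{2}\right]
= -\frac{D^2}{2\eta^2 T} + \frac{\sigma^2}{2} = 0
\;\Longrightarrow\;
\eta^* = \frac{D}{\sigma \sqrt{T}},
\qquad
\frac{D^2}{2\eta^* T} + \frac{\eta^* \sigma^2}{2} = \frac{D\sigma}{\sqrt{T}}.
```

At the optimum the two terms are exactly equal, each contributing $D\sigma/(2\sqrt{T})$.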
This $O(1/\sqrt{T})$ rate is optimal for convex stochastic optimization: no algorithm using only stochastic first-order information can do better in the worst case.
Proof Sketch
Expanding the update rule:

$$\|\theta_{t+1} - \theta^*\|^2 = \|\theta_t - \theta^*\|^2 - 2\eta\,\langle \nabla f_{i_t}(\theta_t),\, \theta_t - \theta^*\rangle + \eta^2\,\|\nabla f_{i_t}(\theta_t)\|^2$$

Taking expectations and using convexity ($\langle \nabla f(\theta_t),\, \theta_t - \theta^*\rangle \ge f(\theta_t) - f(\theta^*)$):

$$\mathbb{E}\,\|\theta_{t+1} - \theta^*\|^2 \le \mathbb{E}\,\|\theta_t - \theta^*\|^2 - 2\eta\,\mathbb{E}[f(\theta_t) - f(\theta^*)] + \eta^2\,\mathbb{E}\,\|\nabla f_{i_t}(\theta_t)\|^2$$

Using $L$-smoothness to bound $\mathbb{E}\,\|\nabla f_{i_t}(\theta_t)\|^2$ and telescoping the sum over $T$ steps gives the result.
Why It Matters
The $O(1/\sqrt{T})$ rate explains why training neural networks requires many iterations. To halve the suboptimality, you need 4x more iterations. This is fundamental to stochastic optimization, not a limitation of SGD specifically. The rate also explains the importance of variance reduction: if you can reduce $\sigma^2$ (by using larger batches or variance-reduction techniques), you can use larger step sizes and converge faster.
Failure Mode
This bound assumes convexity, which neural network loss functions violate spectacularly. For non-convex problems, the analogous guarantee is convergence to a stationary point ($\|\nabla f(\theta)\| \le \epsilon$), not to the global minimum. The practical success of SGD on non-convex neural networks is not fully explained by the convex theory.
SGD Convergence for Strongly Convex Functions
Statement
For a $\mu$-strongly convex, $L$-smooth function with stochastic gradients of variance $\sigma^2$, SGD with step size $\eta_t = O(1/(\mu t))$ achieves:

$$\mathbb{E}[f(\theta_T)] - f(\theta^*) = O\!\left(\frac{\sigma^2}{\mu T}\right)$$

This $O(1/T)$ rate is faster than the $O(1/\sqrt{T})$ rate for convex functions, because strong convexity provides additional curvature that helps the optimizer converge.
Intuition
Strong convexity means the function curves upward at least quadratically near the optimum. This curvature provides a "restoring force" that pulls the iterates toward $\theta^*$. Even with stochastic noise, the curvature dominates, giving linear convergence up to a noise floor. The decreasing step size $\eta_t \propto 1/t$ reduces the noise contribution over time.
Why It Matters
Regularized objectives (e.g., ridge regression) are strongly convex, and the $O(1/T)$ rate applies directly. For neural networks, local strong convexity near a minimum can partially explain why SGD converges faster in practice than the worst-case convex rate suggests.
Mini-Batch SGD and Variance Reduction
Mini-Batch SGD
Instead of using a single sample, estimate the gradient from a mini-batch $\mathcal{B}_t$ of size $B$:

$$g_t = \frac{1}{B} \sum_{i \in \mathcal{B}_t} \nabla f_i(\theta_t)$$

The variance of the mini-batch gradient is $\sigma^2 / B$. By the SGD convergence theorem, replacing $\sigma^2$ with $\sigma^2 / B$ gives:

$$\mathbb{E}[f(\bar\theta_T)] - f(\theta^*) = O\!\left(\frac{\sigma}{\sqrt{BT}}\right)$$

Increasing the batch size by a factor of 4 is equivalent to running 4x more iterations at the same convergence rate. However, it does not improve the rate: the total number of gradient evaluations $BT$ is what matters.
When does batch size help? Larger batches reduce gradient variance, but the compute cost per iteration increases proportionally. The critical batch size $B_{\text{crit}}$ is the batch size at which the gradient noise is comparable to the signal. Below $B_{\text{crit}}$, increasing batch size improves efficiency (you waste less compute on noise). Above $B_{\text{crit}}$, further increases yield diminishing returns because the gradient is already accurate.
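The $\sigma^2/B$ scaling is easy to check empirically (a synthetic sketch: i.i.d. Gaussian draws stand in for per-sample gradients, which is my simplification, not the text's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-sample "gradients": i.i.d. samples with variance sigma^2 = 4.
n = 100_000
g = rng.normal(loc=1.0, scale=2.0, size=n)

def batch_grad_variance(B, trials=5000):
    # Variance of the mean of B randomly drawn per-sample gradients.
    idx = rng.integers(n, size=(trials, B))
    return g[idx].mean(axis=1).var()

ratio = batch_grad_variance(1) / batch_grad_variance(16)
print(ratio)  # close to 16: variance scales as sigma^2 / B
```

Going from $B=1$ to $B=16$ cuts the gradient variance by roughly the batch-size factor, as the theorem predicts.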
Momentum Methods
Polyak Heavy Ball Momentum
SGD with heavy ball momentum maintains a velocity vector:

$$v_{t+1} = \beta v_t + \nabla f_{i_t}(\theta_t), \qquad \theta_{t+1} = \theta_t - \eta\, v_{t+1}$$

where $\beta \in [0, 1)$ is the momentum coefficient (typically $\beta = 0.9$).
The velocity is an exponential moving average of past gradients. This smooths the update direction: instead of responding only to the current gradient, the optimizer continues in the direction it has been moving.
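A minimal heavy-ball loop (a sketch on a toy ill-conditioned quadratic; the matrix and hyperparameters are illustrative choices, nothing beyond the update rule is assumed):

```python
import numpy as np

# Heavy-ball momentum on f(theta) = 0.5 * theta^T A theta, illustrating the
# velocity as an exponential moving average of past gradients.
A = np.diag([1.0, 10.0])          # ill-conditioned quadratic
theta = np.array([1.0, 1.0])
v = np.zeros(2)
eta, beta = 0.02, 0.9

for t in range(500):
    grad = A @ theta
    v = beta * v + grad           # velocity: EMA of gradients
    theta = theta - eta * v

print(np.linalg.norm(theta))      # converges toward the minimum at the origin
```

With $\beta = 0.9$ the effective step along a persistent gradient direction is roughly $\eta/(1-\beta) = 10\eta$, which is why momentum lets you make progress along low-curvature directions without diverging along high-curvature ones.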
Nesterov Accelerated Gradient Descent
Statement
Nesterov accelerated gradient descent evaluates the gradient at a "lookahead" point:

$$y_t = \theta_t + \beta_t (\theta_t - \theta_{t-1}), \qquad \theta_{t+1} = y_t - \eta \nabla f(y_t)$$

For convex, $L$-smooth functions, this achieves:

$$f(\theta_T) - f(\theta^*) = O\!\left(\frac{L \|\theta_0 - \theta^*\|^2}{T^2}\right)$$

This $O(1/T^2)$ rate is the optimal rate for first-order methods on convex smooth functions (by the Nemirovsky-Yudin lower bound). Standard gradient descent achieves only $O(1/T)$.
Intuition
The key idea is to evaluate the gradient not at the current point but at where you expect to be after the momentum step. This "look ahead" allows the method to correct overshooting before it happens, rather than after. The momentum coefficient $\beta_t$ (a standard choice is $\beta_t = \frac{t-1}{t+2}$) grows toward 1 over time, gradually increasing the aggressiveness of the extrapolation.
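The lookahead scheme as code (a sketch on a toy quadratic; the growing coefficient $\beta_t = (t-1)/(t+2)$ is one standard choice, and the problem instance is illustrative):

```python
import numpy as np

# Nesterov's method on f(theta) = 0.5 * theta^T A theta: the gradient is
# evaluated at the extrapolated point y, not at the current iterate.
A = np.diag([1.0, 100.0])
theta = np.array([1.0, 1.0])
theta_prev = theta.copy()
eta = 1.0 / 100.0                        # 1/L, L = largest Hessian eigenvalue

for t in range(1, 300):
    beta = (t - 1) / (t + 2)             # momentum grows toward 1 over time
    y = theta + beta * (theta - theta_prev)  # lookahead point
    theta_prev = theta
    theta = y - eta * (A @ y)            # gradient at y, not at theta

f_val = 0.5 * theta @ A @ theta
print(f_val)  # small: consistent with the accelerated O(L D^2 / T^2) guarantee
```

With $L = 100$, $D^2 = 2$, and $T \approx 300$, the worst-case guarantee is already below $10^{-2}$; in practice the iterates do much better.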
Why It Matters
Nesterov acceleration demonstrates that momentum is not just a heuristic: it provably speeds up convergence for smooth convex optimization, from $O(1/T)$ to $O(1/T^2)$. In the stochastic setting, however, Nesterov acceleration does not improve the $O(1/\sqrt{T})$ rate because the noise dominates the acceleration benefit. This is why practical deep learning mostly uses Polyak momentum (which is simpler) rather than Nesterov momentum.
Failure Mode
The $O(1/T^2)$ rate requires exact gradients. With stochastic gradients, the benefit of acceleration is reduced because the noise variance is not affected by the momentum scheme. For stochastic optimization, Nesterov momentum offers at best a constant-factor improvement over Polyak momentum, not an asymptotic improvement.
Adam: Adaptive + Momentum
Adam combines first-moment estimation (momentum) with second-moment estimation (per-parameter adaptive learning rates). The full algorithm is covered in the Adam optimizer topic. Here we focus on the theoretical comparison with SGD.
Adam convergence: For convex problems, Adam with appropriate step size achieves $O(1/\sqrt{T})$ convergence, matching SGD. However, Reddi et al. (2018) showed that Adam can diverge on simple convex problems because the adaptive learning rate $\eta/\sqrt{v_t}$ can increase without bound when the second-moment estimate $v_t$ shrinks.
Adam vs SGD generalization: The key empirical finding (Wilson et al., 2017; Smith et al., 2021): SGD with momentum often finds solutions that generalize better than Adam, even when Adam converges faster on the training set.
Flat Minima Generalize Better
Statement
The PAC-Bayes framework provides a generalization bound that depends on the "flatness" of a minimum. For a minimum at $\theta^*$ with Hessian $H$, a second-order expansion of the perturbed training loss gives, roughly:

$$\mathbb{E}_{\epsilon \sim \mathcal{N}(0, \sigma^2 I)}\big[L_{\text{test}}(\theta^* + \epsilon)\big] \;\lesssim\; L_{\text{train}}(\theta^*) + \frac{\sigma^2}{2}\,\mathrm{tr}(H) + O\!\left(\sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \log(n/\delta)}{n}}\right)$$

Flatter minima (smaller Hessian trace/eigenvalues) have tighter generalization bounds. An optimizer that consistently finds flatter minima produces models that generalize better.
Intuition
A flat minimum is one where the loss does not change much when you perturb the parameters. If the minimum is flat, then different training sets (drawn from the same distribution) will have their minima in roughly the same region, so the training minimum is a good predictor of the test minimum. A sharp minimum is "fragile": it works for exactly the training data but a slight distribution shift changes the optimal parameters dramatically.
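The perturbation picture can be made quantitative with a small Monte Carlo probe (a synthetic sketch of my own: explicit quadratic minima with known Hessians, not an experiment from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sharpness probe: average loss increase under Gaussian parameter perturbation.
# For a quadratic minimum f(w) = 0.5 w^T H w at w = 0, the expected increase
# is (sigma^2 / 2) * tr(H), so flat minima (small trace) are less sensitive.
def expected_increase(H, sigma=0.1, trials=20_000):
    d = H.shape[0]
    eps = sigma * rng.normal(size=(trials, d))
    return np.mean(0.5 * np.einsum("ti,ij,tj->t", eps, H, eps))

flat = expected_increase(np.diag([1.0, 1.0]))       # tr(H) = 2
sharp = expected_increase(np.diag([100.0, 100.0]))  # tr(H) = 200
print(flat, sharp)  # roughly 0.01 vs 1.0
```

The 100x larger Hessian trace produces a 100x larger expected loss increase under the same perturbation, which is exactly the $\mathrm{tr}(H)$ dependence in the flatness bound.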
Why It Matters
SGD with large learning rate and small batch size injects more noise into the optimization, which helps it escape sharp minima and settle in flatter ones. Adam, by adapting the learning rate per-parameter, can converge to sharper minima because it reduces the effective noise in each parameter direction. This is the leading theoretical explanation for why SGD often outperforms Adam in generalization on certain tasks (especially vision).
Failure Mode
"Flatness" is not reparameterization-invariant: reparameterization can change the Hessian eigenvalues without changing the function. Dinh et al. (2017) showed that for any minimum, you can reparameterize to make it arbitrarily sharp without changing generalization. The PAC-Bayes framework resolves this by fixing a prior, but the philosophical status of the flat minima hypothesis remains debated.
The Muon Optimizer
Muon Optimizer
Muon (Momentum + Orthogonalization) is an optimizer designed specifically for hidden-layer weight matrices in neural networks. For a weight matrix $W \in \mathbb{R}^{m \times n}$:

- Compute the gradient $G_t = \nabla_W \mathcal{L}(W_t)$
- Apply momentum: $M_t = \beta M_{t-1} + G_t$
- Orthogonalize the update using Newton-Schulz iterations to approximate the matrix sign function: $O_t \approx U V^\top$, where $M_t = U \Sigma V^\top$ is the SVD, and update $W_{t+1} = W_t - \eta\, O_t$
This projects the update onto the Stiefel manifold (the set of matrices with orthonormal columns/rows), ensuring that the update direction has equal magnitude across all singular value directions.
Muon applies this orthogonalized update to hidden-layer weights, while using Adam for embedding and normalization parameters.
Why orthogonalization? In standard optimizers, the update direction is dominated by the top singular vectors of the gradient. Low-magnitude singular directions, which may carry important structural information, are suppressed. By orthogonalizing, Muon ensures that all directions of the gradient are treated equally, regardless of their magnitude.
Empirical results: Muon achieved state-of-the-art results in the NanoGPT speedrun benchmark, reaching the target validation loss faster than AdamW. The key advantage appears to be in the early-to-mid phase of training, where Muon's orthogonalized updates help the model learn useful representations faster.
Newton-Schulz iterations: The orthogonalization step uses 5-10 iterations of $X_{k+1} = \frac{3}{2} X_k - \frac{1}{2} X_k X_k^\top X_k$ (in practice, a tuned higher-degree polynomial variant) to approximate the polar decomposition $M = Q H$, where $Q$ is orthogonal. This is cheaper than computing a full SVD and is numerically stable.
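A sketch of the cubic Newton-Schulz iteration in NumPy (the matrix shape, normalization by the Frobenius norm, and iteration count are illustrative choices; Muon's production version uses a tuned polynomial with different coefficients):

```python
import numpy as np

# Cubic Newton-Schulz: X <- 1.5 X - 0.5 X X^T X, applied to a normalized
# matrix. Converges to the orthogonal polar factor U V^T of M = U S V^T.
rng = np.random.default_rng(0)
M = rng.normal(size=(64, 32))      # stand-in for a momentum matrix

X = M / np.linalg.norm(M)          # Frobenius norm: cheap bound ensuring
                                   # spectral norm <= 1 (needed for convergence)
for _ in range(30):
    X = 1.5 * X - 0.5 * X @ X.T @ X

# Compare against the exact orthogonal factor from the SVD.
U, S, Vt = np.linalg.svd(M, full_matrices=False)
print(np.linalg.norm(X - U @ Vt))  # small: X is (nearly) the polar factor
```

Each iteration pushes every singular value of $X$ toward 1 while leaving the singular vectors unchanged, which is exactly the "treat all singular directions equally" behavior described above.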
Practical Considerations
Muon is most beneficial for transformer hidden-layer weights (the large matrices). It is not a drop-in replacement for Adam on every parameter shape: the original Muon recipe applies orthogonalized updates to hidden 2D weights and falls back to Adam for embeddings, normalization scales, and biases. The Newton-Schulz step is more expensive than a single Adam update, but the cost is small relative to the matmuls in a forward pass and is often outweighed by faster convergence in the early-to-mid phase of training.
The Muon literature is moving quickly. Variants that overlap Newton-Schulz with gradient all-reduce, that apply momentum corrections to the orthogonalized step, or that reformulate the iteration via the Gram matrix have all been proposed. Specific citations are intentionally omitted here until the variants stabilize and are independently reproduced.
Learning Rate Schedules
The learning rate schedule is as important as the optimizer itself for LLM training:
Warmup + Cosine Decay Schedule
The standard LLM learning rate schedule has two phases:

Linear warmup (first $T_{\text{warmup}}$ steps): $\eta_t = \eta_{\max} \cdot \frac{t}{T_{\text{warmup}}}$

Cosine decay (remaining steps): $\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\frac{\pi (t - T_{\text{warmup}})}{T - T_{\text{warmup}}}\right)$

Typical values: $\eta_{\max} = 10^{-4}$ to $10^{-3}$, $\eta_{\min} \approx 0.1\,\eta_{\max}$.
Why warmup: In the first steps of training, Adam's second-moment estimate $v_t$ is unreliable (based on very few gradient samples). Large learning rates with noisy denominator estimates cause wild parameter updates. Warmup gives the second-moment estimate time to stabilize.
Why cosine decay: As training progresses and the model approaches a minimum, the gradient noise increasingly dominates the signal. Reducing the learning rate allows the optimizer to settle into the minimum rather than bouncing around it. The cosine shape provides a smooth decay that is empirically superior to linear or step decay for language models.
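The two-phase schedule as a small function (a sketch; the step counts and learning-rate values below are illustrative, not prescribed by the text):

```python
import math

def lr_schedule(t, total_steps, warmup_steps, lr_max, lr_min):
    """Linear warmup to lr_max, then cosine decay to lr_min."""
    if t < warmup_steps:
        return lr_max * (t + 1) / warmup_steps
    progress = (t - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Peaks at the end of warmup, decays to lr_min by the final step.
lrs = [lr_schedule(t, 10_000, 1_000, 3e-4, 3e-5) for t in range(10_000)]
print(max(lrs), lrs[-1])
```

In a training loop this value would be assigned to the optimizer's learning rate at every step; frameworks typically wrap the same logic in a scheduler object.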
Common Confusions
SGD convergence theory does not directly apply to neural networks
The $O(1/\sqrt{T})$ and $O(1/T)$ rates assume convexity (or strong convexity). Neural network loss functions are non-convex with saddle points, local minima, and plateaus. The practical success of SGD on neural networks is empirically well-established but not fully explained by classical convex theory. The theory gives useful intuition (larger batches = less noise, lower learning rate = smaller noise floor) but the quantitative bounds are vacuous for typical deep learning problems.
Momentum does not help asymptotically in the stochastic setting
Nesterov acceleration gives $O(1/T^2)$ vs $O(1/T)$ for deterministic convex optimization: a provable speedup. But for stochastic optimization, both momentum and no-momentum SGD achieve $O(1/\sqrt{T})$, and this is optimal. Momentum helps with the constant factors (smoother trajectory, faster initial convergence) but does not improve the asymptotic rate. The practical benefits of momentum in deep learning are real but not captured by worst-case convergence theory.
Adam is not universally better than SGD
Adam converges faster on the training loss for most problems. But faster training convergence does not imply better generalization. For CNNs on image classification, SGD with momentum consistently generalizes better than Adam. For transformers on language modeling, Adam (specifically AdamW) is standard because the generalization gap is small and the faster convergence matters. The optimal optimizer depends on the architecture and task.
Summary
- SGD convex rate: $O(1/\sqrt{T})$; strongly convex: $O(1/T)$
- Nesterov acceleration: $O(1/T^2)$ for deterministic convex (optimal), but no help in stochastic setting asymptotically
- Mini-batch reduces variance by a factor of $B$ but does not improve the fundamental rate
- Critical batch size $B_{\text{crit}}$: above this, larger batches waste compute
- Adam = momentum + adaptive LR; can converge to sharper minima than SGD
- Flat minima (small Hessian eigenvalues) correlate with better generalization
- Muon: orthogonalize gradient updates for hidden layers; all singular directions treated equally
- Learning rate schedule: warmup (stabilize Adam's second moment) + cosine decay (settle into minimum)
- SGD generalizes better on vision; Adam/AdamW standard for transformers
Exercises
Problem
For SGD on a convex $L$-smooth function with gradient variance $\sigma^2$ and initial distance $\|\theta_0 - \theta^*\| = D$, what is the optimal constant learning rate $\eta$ to minimize the bound after $T$ iterations? What suboptimality does the bound guarantee?
Problem
Prove that mini-batch SGD with batch size $B$ has gradient variance $\sigma^2/B$. Then explain why doubling the batch size and halving the number of iterations does not change the convergence bound.
Problem
The Muon optimizer orthogonalizes the gradient update using Newton-Schulz iterations to approximate $U V^\top$ (the orthogonal factor of the SVD $G = U \Sigma V^\top$). Why would this help compared to standard Adam updates? Consider a scenario where the gradient matrix has singular values of very different magnitudes (e.g., $\sigma_1 \gg \sigma_2$) and analyze what happens with Adam vs Muon.
References
Canonical:
- Polyak, "Some methods of speeding up the convergence of iteration methods" (1964): heavy ball momentum
- Nesterov, "A method for unconstrained convex minimization problem with the rate of convergence $O(1/k^2)$" (1983)
- Kingma & Ba, "Adam: A Method for Stochastic Optimization" (2015)
Current:
- Wilson et al., "The Marginal Value of Adaptive Gradient Methods in ML" (2017)
- Keller Jordan, "Muon: An optimizer for hidden layers" (2024): the modded-nanogpt write-up
- Smith et al., "On the Origin of Implicit Regularization in Stochastic Gradient Descent" (2021)
Next Topics
Optimizer theory connects to:
- Scaling laws: how optimizer choice and learning rate schedule affect compute-optimal training
- Preconditioned optimizers: Shampoo, K-FAC, and the full-matrix preconditioning story
- Riemannian optimization: the geometric framework underlying Muon's orthogonalization
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Convex Optimization Basics (Layer 1)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Adam Optimizer (Layer 2)
- Gradient Descent Variants (Layer 1)
- Stochastic Gradient Descent Convergence (Layer 2)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
Builds on This
- Distributed Training Theory (Layer 5)