LLM Construction
Training Dynamics and Loss Landscapes
The geometry of neural network loss surfaces: why saddle points dominate over local minima in high dimensions, how flat minima relate to generalization, and why SGD finds solutions that generalize.
Why This Matters
When you train a neural network, you are navigating a loss surface in millions or billions of dimensions. The geometry of this surface determines whether training succeeds, how fast it converges, and whether the solution generalizes. Strategies like curriculum learning exploit this structure by ordering training examples from easy to hard. Yet our intuitions from low-dimensional optimization are almost entirely wrong in high dimensions.
Understanding loss landscapes explains some of the deepest puzzles in deep learning: why overparameterized networks do not get stuck in bad local minima, why SGD generalizes better than full-batch gradient descent, and why flat minima correlate with good test performance.
Mental Model
Imagine a mountainous landscape. In 2D, you might worry about getting stuck in a valley that is not the deepest. In $d$ dimensions, with $d$ in the millions or billions, the picture changes radically. At any critical point, the Hessian has $d$ eigenvalues. A local minimum requires all of them to be positive. A saddle point only requires some to be negative. Combinatorially, saddle points are exponentially more common than local minima.
The practical consequence: gradient descent in high dimensions almost never gets stuck in bad local minima. It gets stuck at saddle points, which are escapable with noise or momentum.
Formal Setup and Notation
Let $L : \mathbb{R}^d \to \mathbb{R}$ be the loss function, $\theta \in \mathbb{R}^d$ the parameters, $\nabla L(\theta)$ the gradient, and $H(\theta) = \nabla^2 L(\theta)$ the Hessian.
A critical point $\theta^*$ satisfies $\nabla L(\theta^*) = 0$. It is a:
- Local minimum if all eigenvalues of $H(\theta^*)$ are positive
- Saddle point if $H(\theta^*)$ has both positive and negative eigenvalues
- Local maximum if all eigenvalues of $H(\theta^*)$ are negative
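This second-order test is mechanical once the eigenvalues are known. A minimal numpy sketch (the tolerance and the example Hessians are illustrative choices):

```python
import numpy as np

def classify_critical_point(hessian, tol=1e-8):
    """Classify a critical point by the signs of its Hessian eigenvalues."""
    eigvals = np.linalg.eigvalsh(hessian)  # symmetric matrix -> real eigenvalues
    if np.all(eigvals > tol):
        return "local minimum"
    if np.all(eigvals < -tol):
        return "local maximum"
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point"
    return "degenerate"  # near-zero eigenvalues: the second-order test is inconclusive

# Hessians at the origin of x^2 + y^2, -(x^2 + y^2), and x^2 - y^2:
print(classify_critical_point(np.diag([2.0, 2.0])))    # local minimum
print(classify_critical_point(np.diag([-2.0, -2.0])))  # local maximum
print(classify_critical_point(np.diag([2.0, -2.0])))   # saddle point
```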
Index of a Critical Point
The index of a critical point is the number of negative eigenvalues of the Hessian $H(\theta^*)$. A local minimum has index $0$. A saddle point has index $k$ with $0 < k < d$, meaning there are $k$ directions of negative curvature.
Sharpness
The sharpness of a minimum $\theta^*$ is characterized by the largest eigenvalue of the Hessian:

$$\text{sharpness}(\theta^*) = \lambda_{\max}\big(H(\theta^*)\big)$$

Flat minima have small $\lambda_{\max}$; sharp minima have large $\lambda_{\max}$. The trace $\operatorname{tr}(H)$ (equivalently the average eigenvalue $\operatorname{tr}(H)/d$) is sometimes used as an alternative sharpness measure.
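For large models the Hessian is never materialized; $\lambda_{\max}$ is instead estimated from Hessian-vector products. A toy power-iteration sketch on a small explicit quadratic (the matrix `H` and the iteration count are illustrative):

```python
import numpy as np

def lambda_max_power_iteration(hvp, dim, iters=200, seed=0):
    """Estimate the largest-magnitude Hessian eigenvalue using only
    Hessian-vector products (hvp), never the full matrix. At a minimum
    all eigenvalues are positive, so this is the sharpness lambda_max."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        w = hvp(v)
        lam = float(v @ w)        # Rayleigh quotient estimate
        v = w / np.linalg.norm(w)
    return lam

# Toy loss L(theta) = 0.5 * theta^T H theta, so the Hessian is exactly H
H = np.diag([10.0, 1.0, 0.1])
lam = lambda_max_power_iteration(lambda v: H @ v, dim=3)
print(round(lam, 6))  # 10.0: the curvature of the sharpest direction
```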
Main Theorems
Saddle Points Dominate in High Dimensions
Statement
For a random smooth Gaussian function $L : \mathbb{R}^d \to \mathbb{R}$ with $d$ large, the expected number $\mathbb{E}[N_k]$ of critical points with index $k$ (i.e., $k$ negative Hessian eigenvalues) satisfies:

$$\mathbb{E}[N_k] \sim e^{d\,\Sigma(k/d)},$$

where the rate function $\Sigma$ is maximized near $k/d = 1/2$.
The number of local minima (index $0$) is exponentially smaller than the number of saddle points (index $k \geq 1$). Furthermore, high-index saddle points tend to have higher loss values, while low-index critical points (including local minima) tend to cluster at similar, low loss values.
Intuition
At a random critical point, each Hessian eigenvalue is roughly equally likely to be positive or negative. The probability that all $d$ eigenvalues are positive is approximately $2^{-d}$ --- exponentially small. Most critical points have roughly $d/2$ negative eigenvalues (high-index saddle points). The few critical points with index near 0 tend to have similar loss values, explaining why "all local minima are equally good" in practice.
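Under this independent-sign toy model, the $2^{-d}$ collapse is easy to check numerically (the trial count is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def fraction_local_minima(d, trials=100_000):
    """Fraction of random critical points (independent +/-1 eigenvalue signs)
    whose eigenvalues are all positive, i.e. that are local minima."""
    signs = rng.choice([-1, 1], size=(trials, d))
    return np.mean(np.all(signs > 0, axis=1))

for d in [2, 5, 10, 20]:
    print(d, fraction_local_minima(d), 2.0 ** -d)
# the empirical fraction tracks 2^-d: by d = 20 a minimum is a ~1-in-a-million event
```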
Proof Sketch
This result comes from random matrix theory applied to the Hessian of random Gaussian fields (Bray & Dean, 2007). The Kac-Rice formula counts critical points by integrating over the zero set of $\nabla L$. The expected count factorizes into a term from the gradient (how often it vanishes) and a term from the Hessian determinant (which depends on the distribution of eigenvalue signs).
Why It Matters
This explains the empirical observation that deep networks, despite having highly non-convex loss landscapes, rarely get stuck in bad local minima. The problematic critical points are high-loss saddle points, which gradient descent with noise (SGD) escapes by moving along negative curvature directions.
Failure Mode
The random function model does not perfectly describe real neural network losses. Neural network losses have symmetries (permutation of hidden units), structured Hessians, and data-dependent correlations that deviate from the random model. The qualitative conclusion (saddle points dominate) holds empirically, but the quantitative predictions of the random model should be treated as approximate.
Flat Minima Generalize Better
Statement
(PAC-Bayes bound, simplified) For a posterior $Q = \mathcal{N}(\theta^*, \sigma^2 I)$ centered at a minimum $\theta^*$ with prior $P$, the generalization gap satisfies, with probability at least $1 - \delta$ over the draw of $n$ training samples:

$$\mathbb{E}_{\theta \sim Q}\left[L_{\text{test}}(\theta)\right] - \mathbb{E}_{\theta \sim Q}\left[L_{\text{train}}(\theta)\right] \leq \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln(n/\delta)}{2(n-1)}}$$

If $\theta^*$ is in a flat region (the loss does not change much under Gaussian perturbation), then $\sigma$ can be taken large, $\mathrm{KL}(Q \,\|\, P)$ is small, and the bound gives meaningful generalization guarantees. Sharp minima require smaller $\sigma$, increasing $\mathrm{KL}(Q \,\|\, P)$ and loosening the bound.
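A toy evaluation of a simplified McAllester-style bound with isotropic Gaussian posterior and prior, both of which have a closed-form KL divergence. Everything here (dimension, sample size, and the $\sigma$ values standing in for flat vs. sharp minima) is an illustrative assumption:

```python
import numpy as np

def kl_gaussian(theta_star, sigma_q, sigma_p):
    """KL( N(theta_star, sigma_q^2 I) || N(0, sigma_p^2 I) ) in closed form."""
    d = len(theta_star)
    return (d * np.log(sigma_p / sigma_q)
            + (d * sigma_q**2 + np.sum(theta_star**2)) / (2 * sigma_p**2)
            - d / 2)

def pac_bayes_gap_bound(kl, n, delta=0.05):
    """Simplified McAllester-style bound on the expected generalization gap."""
    return np.sqrt((kl + np.log(n / delta)) / (2 * (n - 1)))

theta = np.full(1000, 0.1)   # a minimum with 1000 parameters (illustrative)
n = 50_000                   # training-set size (illustrative)

# A flat minimum tolerates a wide posterior (large sigma_q); a sharp one does not.
flat = pac_bayes_gap_bound(kl_gaussian(theta, sigma_q=0.05, sigma_p=0.1), n)
sharp = pac_bayes_gap_bound(kl_gaussian(theta, sigma_q=0.005, sigma_p=0.1), n)
print(flat < sharp)  # True: the wider (flatter) posterior yields the tighter bound
```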
Intuition
A flat minimum is one where perturbing the parameters slightly does not change the loss much. This means many parameter configurations near all achieve similar training loss. A large volume of near-optimal parameters is associated with generalization: the model is not relying on a fragile, specific parameter configuration that could break on new data.
Gradient Flow: The Continuous-Time Limit
Gradient flow is the continuous-time limit of gradient descent with infinitesimal learning rate:

$$\frac{d\theta(t)}{dt} = -\nabla L(\theta(t))$$
This is an ODE whose solutions follow the steepest descent path on the loss surface. Gradient flow is mathematically cleaner than discrete gradient descent and provides insights about convergence and implicit bias.
Key properties of gradient flow:
- $L(\theta(t))$ is monotonically decreasing: $\frac{d}{dt} L(\theta(t)) = -\|\nabla L(\theta(t))\|^2 \leq 0$
- The trajectory converges to a critical point (under mild conditions)
- For linear models initialized at zero, gradient flow converges to the minimum-norm solution
This last property is the implicit bias of gradient methods: among all solutions that achieve zero training loss, gradient flow selects the one with minimum parameter norm. For deeper networks, the implicit bias is more complex and not fully understood.
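The linear-model claim can be checked directly by Euler-discretizing gradient flow on an underdetermined least-squares problem. The step size, iteration count, and problem shape below are illustrative; the zero initialization is what keeps the trajectory in the row space of $A$ and hence leads to the minimum-norm solution:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 20))  # 5 equations, 20 unknowns: underdetermined
b = rng.standard_normal(5)

# Euler discretization of d(theta)/dt = -grad L with L = 0.5 * ||A theta - b||^2,
# started at theta = 0
theta = np.zeros(20)
dt = 0.01
for _ in range(100_000):
    theta -= dt * A.T @ (A @ theta - b)

# Among the infinitely many interpolating solutions, gradient flow from zero
# picks out the minimum-norm one (the pseudoinverse solution).
theta_min_norm = np.linalg.pinv(A) @ b
print(np.allclose(theta, theta_min_norm, atol=1e-6))  # True
```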
Mode Connectivity
A striking empirical discovery: independently trained neural networks that achieve similar loss can be connected by simple curves in parameter space along which the loss remains low.
Linear mode connectivity: Two minima $\theta_1$ and $\theta_2$ are linearly mode-connected if $L\big((1-\alpha)\,\theta_1 + \alpha\,\theta_2\big) \approx L(\theta_1) \approx L(\theta_2)$ for all $\alpha \in [0, 1]$.
This does not hold for arbitrary minima --- the linear path typically passes through a high-loss barrier. But after accounting for permutation symmetries of hidden units, many independently trained networks are linearly mode-connected. More generally, there exist smooth curves (quadratic Bezier paths) connecting minima without loss barriers.
Mode connectivity suggests the loss landscape has a connected low-loss manifold rather than isolated basins, explaining why different random initializations find solutions of similar quality.
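A small utility for probing the loss along the linear path makes barriers concrete. Here it is applied to a 2D double well, a toy loss whose two minima are separated by a barrier exactly at the midpoint:

```python
import numpy as np

def loss_along_path(loss_fn, theta_1, theta_2, n_points=21):
    """Evaluate loss_fn along the straight line from theta_1 to theta_2."""
    alphas = np.linspace(0.0, 1.0, n_points)
    losses = np.array([loss_fn((1 - a) * theta_1 + a * theta_2) for a in alphas])
    return alphas, losses

def double_well(theta):
    x, y = theta
    return (x**2 - 1.0) ** 2 + y**2  # minima at (+-1, 0), both with loss 0

theta_a = np.array([-1.0, 0.0])
theta_b = np.array([+1.0, 0.0])
_, losses = loss_along_path(double_well, theta_a, theta_b)
print(losses[0], losses[-1], losses.max())  # 0.0 0.0 1.0: a barrier at the midpoint
```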
Edge of Stability
The edge of stability phenomenon (Cohen et al., 2021) describes a surprising training dynamic:
- Early in training, the largest Hessian eigenvalue $\lambda_{\max}$ increases until it reaches $2/\eta$, where $\eta$ is the learning rate
- At this point, classical convergence theory predicts divergence (the learning rate exceeds $2/\lambda_{\max}$)
- Instead, training enters a regime where $\lambda_{\max}$ hovers around $2/\eta$ and the loss continues to decrease non-monotonically
This means the optimizer self-tunes the curvature: if the landscape becomes too sharp, gradient descent destabilizes and moves to flatter regions, keeping $\lambda_{\max}$ at the stability threshold $2/\eta$.
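The $2/\eta$ threshold is exact for a quadratic, which makes it easy to see in one dimension (the step counts and test values below are illustrative):

```python
def gd_on_quadratic(lam, eta, steps=200, theta0=1.0):
    """GD on L(theta) = 0.5 * lam * theta^2. Each update multiplies theta by
    (1 - eta * lam), so the iterates diverge exactly when eta * lam > 2."""
    theta = theta0
    for _ in range(steps):
        theta -= eta * lam * theta
    return abs(theta)

eta = 0.1
print(gd_on_quadratic(lam=19.0, eta=eta) < 1e-6)  # True: eta*lam = 1.9 < 2, converges
print(gd_on_quadratic(lam=21.0, eta=eta) > 1e6)   # True: eta*lam = 2.1 > 2, diverges
```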
Canonical Examples
Saddle points in a 2D toy problem
Consider $L(x, y) = x^2 - y^2$. The origin is a saddle point with Hessian eigenvalues $+2$ and $-2$. Gradient descent from most initializations escapes along the $y$ direction (negative curvature). In high dimensions, most critical points look like this but with many more negative directions.
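Running gradient descent on this toy saddle makes the escape explicit (the step size and the tiny perturbation are illustrative):

```python
import numpy as np

def gd_on_saddle(theta0, eta=0.1, steps=100):
    """GD on L(x, y) = x^2 - y^2, whose gradient is (2x, -2y)."""
    theta = np.array(theta0, dtype=float)
    for _ in range(steps):
        grad = np.array([2.0 * theta[0], -2.0 * theta[1]])
        theta -= eta * grad
    return theta

# A tiny component along the negative-curvature direction y is enough to escape:
final = gd_on_saddle([1.0, 1e-6])
print(abs(final[0]) < 1e-6, abs(final[1]) > 1.0)  # True True: x collapses, y grows
```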
Flat vs sharp minimum
Train the same architecture with SGD (large learning rate, small batch) and full-batch GD (small learning rate, all data). SGD typically converges to flatter minima (lower $\lambda_{\max}$) and achieves better test accuracy. The noise in SGD acts as implicit regularization, biasing toward flat regions of the loss landscape.
Common Confusions
Local minima are not the main obstacle
The folklore that gradient descent fails because it gets stuck in bad local minima is incorrect for overparameterized networks. The actual obstacles are: (1) saddle points that slow convergence (but do not trap), (2) flat regions (plateaus) where the gradient is small, and (3) ill-conditioning (large ratio of max to min Hessian eigenvalues) that makes some directions much harder to optimize than others.
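Obstacle (3) is visible even on a diagonal quadratic: with the learning rate set by the sharpest direction, the flattest direction needs a number of steps that grows with the condition number. A sketch (the eigenvalues and tolerance are illustrative):

```python
import numpy as np

def gd_steps_to_converge(eigvals, tol=1e-6, max_steps=1_000_000):
    """GD on L(theta) = 0.5 * sum_i lam_i * theta_i^2 from theta = (1, ..., 1),
    with learning rate eta = 1 / lam_max set by the sharpest direction.
    Returns the number of steps until every coordinate is below tol."""
    lam = np.asarray(eigvals, dtype=float)
    eta = 1.0 / lam.max()
    theta = np.ones_like(lam)
    for step in range(1, max_steps + 1):
        theta -= eta * lam * theta
        if np.abs(theta).max() < tol:
            return step
    return max_steps

print(gd_steps_to_converge([1.0, 1.0]))    # 1: perfectly conditioned, one step
print(gd_steps_to_converge([100.0, 1.0]))  # ~1375: roughly kappa * ln(1/tol) steps
```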
Flat minima are not always better
The flat-minima-generalize hypothesis has caveats. The definition of flatness depends on the parameterization: reparameterizing the network (e.g., scaling weights by a constant and biases by its inverse) can change a flat minimum into a sharp one without changing the function. Sharpness measures must be reparameterization-invariant to be meaningful.
SGD noise is not Gaussian
The stochastic gradient noise in SGD is often modeled as Gaussian for theoretical convenience, but real mini-batch noise has heavier tails and is state-dependent. The noise covariance depends on the current parameters $\theta$, which changes the dynamics qualitatively compared to additive Gaussian noise.
Summary
- In high dimensions, saddle points vastly outnumber local minima
- Low-index critical points (near-minima) cluster at similar loss values
- Flat minima generalize better (PAC-Bayes connection)
- Gradient flow has implicit bias toward simple solutions
- Mode connectivity: good minima are connected by low-loss paths
- Edge of stability: gradient descent self-regulates curvature at $\lambda_{\max} \approx 2/\eta$
Exercises
Problem
In $d$ dimensions, if each Hessian eigenvalue at a critical point is independently positive or negative with equal probability, what is the probability that the critical point is a local minimum?
Problem
Show that gradient flow monotonically decreases the loss. What additional condition ensures convergence to a critical point?
Problem
The edge-of-stability phenomenon shows that $\lambda_{\max}$ stabilizes near $2/\eta$ during GD training. Why does classical GD convergence theory predict divergence when $\lambda_{\max} > 2/\eta$, and what mechanism allows training to continue despite this?
References
Canonical:
- Dauphin et al., Identifying and Attacking the Saddle Point Problem in High-Dimensional Non-Convex Optimization (NeurIPS 2014)
- Keskar et al., On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima (ICLR 2017)
Current:
- Cohen et al., Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability (ICLR 2021)
- Frankle et al., Linear Mode Connectivity and the Lottery Ticket Hypothesis (ICML 2020)
Next Topics
The natural next steps from training dynamics:
- Implicit bias and modern generalization: why overparameterized models generalize
- Optimizer theory (SGD, Adam, Muon): how different optimizers navigate the loss landscape
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Convex Optimization Basics (Layer 1)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- The Hessian Matrix (Layer 0A)