
LLM Construction

Training Dynamics and Loss Landscapes

The geometry of neural network loss surfaces: why saddle points dominate over local minima in high dimensions, how flat minima relate to generalization, and why SGD finds solutions that generalize.


Why This Matters

When you train a neural network, you are navigating a loss surface in millions or billions of dimensions. The geometry of this surface determines whether training succeeds, how fast it converges, and whether the solution generalizes. Strategies like curriculum learning exploit this structure by ordering training examples from easy to hard. Yet our intuitions from low-dimensional optimization are almost entirely wrong in high dimensions.

Understanding loss landscapes explains some of the deepest puzzles in deep learning: why overparameterized networks do not get stuck in bad local minima, why SGD generalizes better than full-batch gradient descent, and why flat minima correlate with good test performance.

Figure: a 1D slice of parameter space showing two local minima, a saddle point, the global minimum, and the point where SGD starts. Real loss surfaces have millions of dimensions with many saddle points.

Mental Model

Imagine a mountainous landscape. In 2D, you might worry about getting stuck in a valley that is not the deepest. In $10^9$ dimensions, the picture changes radically. At any critical point, the Hessian has billions of eigenvalues. A local minimum requires all of them to be positive. A saddle point only requires some to be negative. Combinatorially, saddle points are exponentially more common than local minima.

The practical consequence: gradient descent in high dimensions almost never gets stuck in bad local minima. It gets stuck at saddle points, which are escapable with noise or momentum.

Formal Setup and Notation

Let $L(\theta)$ be the loss function, $\theta \in \mathbb{R}^p$ the parameters, $\nabla L(\theta)$ the gradient, and $H(\theta) = \nabla^2 L(\theta)$ the Hessian.

A critical point satisfies $\nabla L(\theta) = 0$. It is a:

  • Local minimum if all eigenvalues of $H(\theta)$ are positive
  • Saddle point if $H(\theta)$ has both positive and negative eigenvalues
  • Local maximum if all eigenvalues are negative
Definition

Index of a Critical Point

The index of a critical point $\theta$ is the number of negative eigenvalues of the Hessian $H(\theta)$. A local minimum has index 0. A saddle point has index $k > 0$, meaning there are $k$ directions of negative curvature.
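To make the definition concrete, here is a minimal numpy sketch that classifies a critical point by counting negative Hessian eigenvalues. The diagonal Hessian is a hypothetical toy, taken from the saddle example later in this section:

```python
import numpy as np

# Classify a critical point by the eigenvalues of its Hessian.
# Toy Hessian: L(theta1, theta2) = theta1^2 - theta2^2 at the origin.
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eigvals = np.linalg.eigvalsh(H)   # symmetric Hessian -> real eigenvalues
index = int(np.sum(eigvals < 0))  # index = number of negative eigenvalues

if index == 0:
    kind = "local minimum"
elif index == len(eigvals):
    kind = "local maximum"
else:
    kind = f"saddle point (index {index})"

print(eigvals, kind)  # [-2.  2.] saddle point (index 1)
```

The same three lines of classification logic apply unchanged in millions of dimensions; only the eigenvalue computation becomes expensive.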

Definition

Sharpness

The sharpness of a minimum $\theta^*$ is characterized by the largest eigenvalue of the Hessian:

$$S(\theta^*) = \lambda_{\max}(H(\theta^*))$$

Flat minima have small $\lambda_{\max}$; sharp minima have large $\lambda_{\max}$. The trace $\operatorname{tr}(H)$, or equivalently the average eigenvalue $\operatorname{tr}(H)/p$, is sometimes used as an alternative sharpness measure.
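Both measures can be read off the eigenvalue spectrum. A quick numerical sketch, where the random positive semi-definite matrix is a stand-in for a real Hessian at a minimum, not one computed from a network:

```python
import numpy as np

# Sharpness measures for a hypothetical Hessian at a minimum:
# lambda_max (largest eigenvalue) and the average eigenvalue tr(H)/p.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 50))
H = A @ A.T / 50  # positive semi-definite: models a (local) minimum

lam_max = np.linalg.eigvalsh(H)[-1]  # sharpness S(theta*)
avg_eig = np.trace(H) / H.shape[0]   # trace-based alternative

print(f"lambda_max = {lam_max:.3f}, average eigenvalue = {avg_eig:.3f}")
```

For ill-conditioned Hessians the two measures can disagree sharply: a single large eigenvalue dominates $\lambda_{\max}$ but barely moves the average.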

Main Theorems

Theorem

Saddle Points Dominate in High Dimensions

Statement

For a random smooth function $L: \mathbb{R}^p \to \mathbb{R}$ with $p$ large, the expected number of critical points with index $k$ (i.e., $k$ negative Hessian eigenvalues) satisfies:

$$\mathbb{E}[\text{number of index-}k\text{ critical points}] \propto \binom{p}{k}$$

The number of local minima (index 0) is exponentially smaller than the number of saddle points (index $k \approx p/2$). Furthermore, high-index saddle points tend to have higher loss values, while low-index critical points (including local minima) tend to cluster at similar, low loss values.

Intuition

At a random critical point, each Hessian eigenvalue is roughly equally likely to be positive or negative. The probability that all $p$ eigenvalues are positive is approximately $2^{-p}$, which is exponentially small. Most critical points have roughly $p/2$ negative eigenvalues (high-index saddle points). The few critical points with index near 0 tend to have similar loss values, explaining why "all local minima are equally good" in practice.
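The $2^{-p}$ heuristic can be sanity-checked with a small Monte Carlo simulation over random sign patterns. This is a toy model of the sign-counting argument, not an actual Hessian computation:

```python
import numpy as np

# Monte Carlo check: if each eigenvalue is independently positive with
# probability 1/2, the chance a critical point is a local minimum
# (all p signs positive) is 2^-p.
rng = np.random.default_rng(0)

for p in (2, 5, 10):
    signs = rng.choice([-1, 1], size=(200_000, p))
    frac_minima = np.mean(np.all(signs > 0, axis=1))
    print(f"p={p:2d}: empirical {frac_minima:.5f} vs 2^-p = {2.0**-p:.5f}")
```

Already at $p = 10$ fewer than one critical point in a thousand is a minimum; at realistic $p$ the fraction is astronomically small.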

Proof Sketch

This result comes from random matrix theory applied to the Hessians of random Gaussian fields (Bray & Dean, 2007). The Kac-Rice formula counts critical points by integrating $|\det H(\theta)|$ over the zero set of $\nabla L$. The expected count factorizes into a term from the gradient (how often it vanishes) and a term from the Hessian determinant (which depends on the distribution of eigenvalue signs).

Why It Matters

This explains the empirical observation that deep networks, despite having highly non-convex loss landscapes, rarely get stuck in bad local minima. The problematic critical points are high-loss saddle points, which gradient descent with noise (SGD) escapes by moving along negative curvature directions.

Failure Mode

The random function model does not perfectly describe real neural network losses. Neural network losses have symmetries (permutation of hidden units), structured Hessians, and data-dependent correlations that deviate from the random model. The qualitative conclusion (saddle points dominate) holds empirically, but the quantitative predictions of the random model should be treated as approximate.

Proposition

Flat Minima Generalize Better

Statement

(PAC-Bayes bound, simplified) For a posterior $Q = \mathcal{N}(\theta^*, \sigma^2 I)$ centered at a minimum $\theta^*$ with prior $P = \mathcal{N}(0, \sigma_0^2 I)$, the generalization gap satisfies:

$$\mathbb{E}_{\theta \sim Q}[R(\theta)] - \mathbb{E}_{\theta \sim Q}[\hat{R}_n(\theta)] \leq \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \log(2n/\delta)}{2n}}$$

If $\theta^*$ is in a flat region (the loss does not change much under Gaussian perturbation), then $\mathbb{E}_Q[\hat{R}_n(\theta)] \approx \hat{R}_n(\theta^*)$ and the bound gives meaningful generalization guarantees. Sharp minima require smaller $\sigma$, increasing $\mathrm{KL}(Q \,\|\, P)$ and loosening the bound.

Intuition

A flat minimum is one where perturbing the parameters slightly does not change the loss much. This means many parameter configurations near $\theta^*$ all achieve similar training loss. A large volume of near-optimal parameters is associated with generalization: the model is not relying on a fragile, specific parameter configuration that could break on new data.
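This perturbation view of flatness can be sketched numerically. The two diagonal Hessians below are hypothetical stand-ins for a flat and a sharp quadratic minimum:

```python
import numpy as np

# Average the loss over Gaussian parameter noise around two quadratic
# minima. For L(eps) = 0.5 * eps^T H eps the expected increase is
# (sigma^2 / 2) * tr(H), so the flat minimum barely moves.
rng = np.random.default_rng(0)

def expected_loss_under_noise(H, sigma=0.1, n_samples=100_000):
    eps = rng.normal(scale=sigma, size=(n_samples, H.shape[0]))
    return np.mean(0.5 * np.einsum("ni,ij,nj->n", eps, H, eps))

H_flat = np.diag([0.1, 0.1])
H_sharp = np.diag([100.0, 100.0])

flat_gap = expected_loss_under_noise(H_flat)    # ~ 0.005 * tr(H) = 0.001
sharp_gap = expected_loss_under_noise(H_sharp)  # ~ 0.005 * tr(H) = 1.0
print(flat_gap, sharp_gap)
```

The same Gaussian posterior $Q$ that leaves the flat minimum's loss nearly unchanged inflates the sharp minimum's loss by three orders of magnitude, which is exactly the asymmetry the PAC-Bayes bound exploits.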

Gradient Flow: The Continuous-Time Limit

Gradient flow is the continuous-time limit of gradient descent with infinitesimal learning rate:

$$\frac{d\theta}{dt} = -\nabla L(\theta(t))$$

This is an ODE whose solutions follow the steepest descent path on the loss surface. Gradient flow is mathematically cleaner than discrete gradient descent and provides insights about convergence and implicit bias.

Key properties of gradient flow:

  • $L(\theta(t))$ is monotonically non-increasing: $dL/dt = -\|\nabla L\|^2 \leq 0$
  • The trajectory converges to a critical point (under mild conditions)
  • For linear models, gradient flow converges to the minimum-norm solution $\theta^* = \arg\min \|\theta\| \text{ s.t. } L(\theta) = 0$

This last property is the implicit bias of gradient methods: among all solutions that achieve zero training loss, gradient flow selects the one with minimum parameter norm. For deeper networks, the implicit bias is more complex and not fully understood.
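The minimum-norm property can be checked numerically. The sketch below approximates gradient flow with small-step gradient descent on a hypothetical underdetermined least-squares problem:

```python
import numpy as np

# Small-step gradient descent on an underdetermined least-squares problem
# (5 equations, 20 unknowns). Started from theta = 0, it converges to the
# minimum-norm interpolating solution, i.e. the pseudoinverse solution.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))  # wide data matrix: infinitely many solutions
y = rng.normal(size=5)

theta = np.zeros(20)
lr = 0.01
for _ in range(50_000):
    theta -= lr * (X.T @ (X @ theta - y))  # gradient of 0.5 * ||X theta - y||^2

theta_min_norm = np.linalg.pinv(X) @ y
print(np.allclose(theta, theta_min_norm, atol=1e-6))  # True
```

The mechanism is visible in the update: every gradient lies in the row space of $X$, so starting from zero the iterate never acquires a component orthogonal to it, and the limit is the interpolator of smallest norm.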

Mode Connectivity

A striking empirical discovery: independently trained neural networks that achieve similar loss can be connected by simple curves in parameter space along which the loss remains low.

Linear mode connectivity: Two minima $\theta_A$ and $\theta_B$ are linearly mode-connected if $L((1-t)\theta_A + t\theta_B) \approx L(\theta_A)$ for all $t \in [0,1]$.

This does not hold for arbitrary minima: the linear path typically passes through a high-loss barrier. But after accounting for permutation symmetries of hidden units, many independently trained networks are linearly mode-connected. More generally, there exist smooth curves (e.g., quadratic Bézier paths) connecting minima without loss barriers.

Mode connectivity suggests the loss landscape has a connected low-loss manifold rather than isolated basins, explaining why different random initializations find solutions of similar quality.
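A minimal sketch of the interpolation scan, using a toy loss of my own choosing rather than a trained network; it shows the barrier that appears when two minima are not linearly mode-connected:

```python
import numpy as np

# Scan the loss along the straight line between two zero-loss minima of
# the toy loss L(a, b) = (a*b - 1)^2. The linear path crosses a barrier
# at the midpoint (0, 0), so these minima are not linearly mode-connected.
def loss(theta):
    a, b = theta
    return (a * b - 1.0) ** 2

theta_A = np.array([1.0, 1.0])    # one minimum
theta_B = np.array([-1.0, -1.0])  # another minimum
ts = np.linspace(0.0, 1.0, 101)
path_losses = [loss((1 - t) * theta_A + t * theta_B) for t in ts]

print(min(path_losses), max(path_losses))  # 0.0 at endpoints, 1.0 at t=0.5
```

The same scan applied to trained networks, after permuting hidden units to align them, is how linear mode connectivity is measured in practice.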

Edge of Stability

The edge of stability phenomenon (Cohen et al., 2021) describes a surprising training dynamic:

  1. Early in training, the largest Hessian eigenvalue $\lambda_{\max}$ increases until it reaches $2/\eta$, where $\eta$ is the learning rate
  2. At this point, classical convergence theory predicts divergence (the learning rate exceeds $2/\lambda_{\max}$)
  3. Instead, training enters a regime where $\lambda_{\max} \approx 2/\eta$ and the loss continues to decrease non-monotonically

This means the optimizer self-tunes the curvature: if the landscape becomes too sharp, gradient descent destabilizes and moves to flatter regions, keeping $\lambda_{\max}$ at the stability threshold.
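The classical $2/\eta$ threshold itself is easy to verify on a 1D quadratic (a toy with none of the self-stabilization that real networks exhibit):

```python
# Classical stability threshold on the 1D quadratic L(x) = (lam/2) x^2.
# The GD update is x <- (1 - lr*lam) x, which contracts iff lr*lam < 2.
def run_gd(lam, lr, steps=100, x0=1.0):
    x = x0
    for _ in range(steps):
        x -= lr * lam * x  # gradient of (lam/2) x^2 is lam * x
    return abs(x)

lr = 0.1  # stability threshold: lam = 2/lr = 20
print(run_gd(lam=19.0, lr=lr))  # lr*lam = 1.9 < 2: |x| shrinks toward 0
print(run_gd(lam=21.0, lr=lr))  # lr*lam = 2.1 > 2: |x| blows up
```

On a fixed quadratic the curvature cannot change, so crossing the threshold means divergence; the edge-of-stability observation is that neural losses are not fixed quadratics, and the instability itself pushes the iterate toward flatter curvature.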

Canonical Examples

Example

Saddle points in a 2D toy problem

Consider $L(\theta_1, \theta_2) = \theta_1^2 - \theta_2^2$. The origin is a saddle point with $H = \operatorname{diag}(2, -2)$. Gradient descent from most initializations escapes along the $\theta_2$ direction (negative curvature). In high dimensions, most critical points look like this but with many more negative directions.
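A short sketch of this escape dynamic (a toy: this loss is unbounded below, so the escaping coordinate grows without limit):

```python
import numpy as np

# Gradient descent on L = theta1^2 - theta2^2, starting near the saddle
# at the origin. The theta1 coordinate contracts, while any tiny theta2
# component grows exponentially along the negative-curvature direction.
theta = np.array([1.0, 1e-6])  # almost exactly on the stable manifold theta2 = 0
lr = 0.1
for _ in range(200):
    grad = np.array([2 * theta[0], -2 * theta[1]])
    theta -= lr * grad

print(theta)  # theta1 ~ 0; theta2 has escaped far from the saddle
```

The initialization is deliberately adversarial: even a $10^{-6}$ perturbation off the stable manifold is amplified by the factor $(1 + 2\eta)$ per step, which is why injected noise (as in SGD) reliably frees iterates that stall near saddles.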

Example

Flat vs sharp minimum

Train the same architecture with SGD (large learning rate, small batch) and full-batch GD (small learning rate, all data). SGD typically converges to flatter minima (lower $\lambda_{\max}$) and achieves better test accuracy. The noise in SGD acts as implicit regularization, biasing toward flat regions of the loss landscape.

Common Confusions

Watch Out

Local minima are not the main obstacle

The folklore that gradient descent fails because it gets stuck in bad local minima is incorrect for overparameterized networks. The actual obstacles are: (1) saddle points that slow convergence (but do not trap), (2) flat regions (plateaus) where the gradient is small, and (3) ill-conditioning (large ratio of max to min Hessian eigenvalues) that makes some directions much harder to optimize than others.

Watch Out

Flat minima are not always better

The flat-minima-generalize hypothesis has caveats. The definition of flatness depends on the parameterization: reparameterizing the network (e.g., scaling weights by a constant and biases by its inverse) can change a flat minimum into a sharp one without changing the function. Sharpness measures must be reparameterization-invariant to be meaningful.
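The reparameterization caveat can be demonstrated on a toy two-parameter product model (my own illustrative choice): the minima $(1, 1)$ and $(\alpha, 1/\alpha)$ compute the same function, but their measured sharpness differs by orders of magnitude:

```python
import numpy as np

# Toy loss L(w1, w2) = (w1 * w2 - 1)^2. Every point with w1 * w2 = 1 is a
# zero-loss minimum implementing the same function, yet lambda_max of the
# Hessian depends on the chosen parameterization.
def hessian_at_minimum(w1, w2):
    # Analytic Hessian of (w1*w2 - 1)^2, evaluated where w1*w2 = 1.
    return 2.0 * np.array([[w2 ** 2, 1.0],
                           [1.0, w1 ** 2]])

lams = []
for w1, w2 in [(1.0, 1.0), (10.0, 0.1)]:  # same function, rescaled weights
    lam_max = np.linalg.eigvalsh(hessian_at_minimum(w1, w2))[-1]
    lams.append(lam_max)
    print(f"minimum ({w1}, {w2}): lambda_max = {lam_max:.2f}")
```

The rescaled minimum is roughly 50 times sharper by the $\lambda_{\max}$ measure despite representing an identical predictor, which is why non-invariant sharpness measures cannot by themselves explain generalization.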

Watch Out

SGD noise is not Gaussian

The stochastic gradient noise in SGD is often modeled as Gaussian for theoretical convenience, but real mini-batch noise has heavier tails and is state-dependent. The noise covariance depends on the current parameters $\theta$, which changes the dynamics qualitatively compared to additive Gaussian noise.

Summary

  • In high dimensions, saddle points vastly outnumber local minima
  • Low-index critical points (near-minima) cluster at similar loss values
  • Flat minima generalize better (PAC-Bayes connection)
  • Gradient flow has implicit bias toward simple solutions
  • Mode connectivity: good minima are connected by low-loss paths
  • Edge of stability: SGD self-regulates curvature at $\lambda_{\max} \approx 2/\eta$

Exercises

ExerciseCore

Problem

In $\mathbb{R}^{100}$, if each Hessian eigenvalue at a critical point is independently positive or negative with equal probability, what is the probability that the critical point is a local minimum?

ExerciseAdvanced

Problem

Show that gradient flow $d\theta/dt = -\nabla L(\theta)$ monotonically decreases the loss. What additional condition ensures convergence to a critical point?

ExerciseResearch

Problem

The edge-of-stability phenomenon shows that $\lambda_{\max}(H)$ stabilizes near $2/\eta$ during GD training. Why does classical GD convergence theory predict divergence when $\lambda_{\max} > 2/\eta$, and what mechanism allows training to continue despite this?

References

Canonical:

  • Dauphin et al., Identifying and Attacking the Saddle Point Problem in High-Dimensional Non-Convex Optimization (NeurIPS 2014)
  • Keskar et al., On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima (ICLR 2017)

Current:

  • Cohen et al., Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability (ICLR 2021)
  • Frankle et al., Linear Mode Connectivity and the Lottery Ticket Hypothesis (ICML 2020)

Last reviewed: April 2026
