
LLM Construction

Training Dynamics and Loss Landscapes

The geometry of neural network loss surfaces: why saddle points dominate over local minima in high dimensions, how flat minima relate to generalization, and why SGD finds solutions that generalize.


Why This Matters

When you train a neural network, you are navigating a loss surface in millions or billions of dimensions. The geometry of this surface determines whether training succeeds, how fast it converges, and whether the solution generalizes. Strategies like curriculum learning exploit this structure by ordering training examples from easy to hard. Yet our intuitions from low-dimensional optimization are almost entirely wrong in high dimensions.

Understanding loss landscapes explains some of the deepest puzzles in deep learning: why overparameterized networks do not get stuck in bad local minima, why SGD generalizes better than full-batch gradient descent, and why flat minima correlate with good test performance.

Figure: a 1D slice of parameter space showing two local minima, a saddle point, the global minimum, and the point where SGD starts. Real loss surfaces have millions of dimensions with many saddle points.

Mental Model

Imagine a mountainous landscape. In 2D, you might worry about getting stuck in a valley that is not the deepest. In $10^9$ dimensions, the picture changes radically. At any critical point, the Hessian has billions of eigenvalues. A local minimum requires all of them to be positive. A saddle point only requires some to be negative. Combinatorially, saddle points are exponentially more common than local minima.

The practical consequence: gradient descent in high dimensions almost never gets stuck in bad local minima. It gets stuck at saddle points, which are escapable with noise or momentum.

Formal Setup and Notation

Let $L(\theta)$ be the loss function, $\theta \in \mathbb{R}^p$ the parameters, $\nabla L(\theta)$ the gradient, and $H(\theta) = \nabla^2 L(\theta)$ the Hessian.

A critical point satisfies $\nabla L(\theta) = 0$. It is a:

  • Local minimum if all eigenvalues of $H(\theta)$ are positive
  • Saddle point if $H(\theta)$ has both positive and negative eigenvalues
  • Local maximum if all eigenvalues are negative
Definition

Index of a Critical Point

The index of a critical point $\theta$ is the number of negative eigenvalues of the Hessian $H(\theta)$. A local minimum has index 0. A saddle point has index $k > 0$, meaning there are $k$ directions of negative curvature.
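To make the definition concrete, here is a minimal numpy sketch that classifies a critical point by counting negative Hessian eigenvalues. The diagonal Hessian is a hypothetical toy, taken from the saddle example later in this section:

```python
import numpy as np

# Classify a critical point by the eigenvalues of its Hessian.
# Toy Hessian: L(theta1, theta2) = theta1^2 - theta2^2 at the origin.
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eigvals = np.linalg.eigvalsh(H)   # symmetric Hessian -> real eigenvalues
index = int(np.sum(eigvals < 0))  # index = number of negative eigenvalues

if index == 0:
    kind = "local minimum"
elif index == len(eigvals):
    kind = "local maximum"
else:
    kind = f"saddle point (index {index})"

print(eigvals, kind)  # [-2.  2.] saddle point (index 1)
```

The same three lines of classification logic apply unchanged in millions of dimensions; only the eigenvalue computation becomes expensive.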

Definition

Sharpness

The sharpness of a minimum $\theta^*$ is characterized by the largest eigenvalue of the Hessian:

$$S(\theta^*) = \lambda_{\max}(H(\theta^*))$$

Flat minima have small $\lambda_{\max}$; sharp minima have large $\lambda_{\max}$. The trace $\operatorname{tr}(H)$, or equivalently the average eigenvalue $\operatorname{tr}(H)/p$, is sometimes used as an alternative sharpness measure.
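Both measures can be read off the eigenvalue spectrum. A quick numerical sketch, where the random positive semi-definite matrix is a stand-in for a real Hessian at a minimum, not one computed from a network:

```python
import numpy as np

# Sharpness measures for a hypothetical Hessian at a minimum:
# lambda_max (largest eigenvalue) and the average eigenvalue tr(H)/p.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 50))
H = A @ A.T / 50  # positive semi-definite: models a (local) minimum

lam_max = np.linalg.eigvalsh(H)[-1]  # sharpness S(theta*)
avg_eig = np.trace(H) / H.shape[0]   # trace-based alternative

print(f"lambda_max = {lam_max:.3f}, average eigenvalue = {avg_eig:.3f}")
```

For ill-conditioned Hessians the two measures can disagree sharply: a single large eigenvalue dominates $\lambda_{\max}$ but barely moves the average.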

Main Theorems

Theorem

Saddle Points Dominate in High Dimensions

Statement

For a random smooth function $L: \mathbb{R}^p \to \mathbb{R}$ with $p$ large, the expected number of critical points with index $k$ (i.e., $k$ negative Hessian eigenvalues) satisfies:

$$\mathbb{E}[\text{number of index-}k\text{ critical points}] \propto \binom{p}{k}$$

The number of local minima (index 0) is exponentially smaller than the number of saddle points (index $k \approx p/2$). Furthermore, high-index saddle points tend to have higher loss values, while low-index critical points (including local minima) tend to cluster at similar, low loss values.

Intuition

At a random critical point, each Hessian eigenvalue is roughly equally likely to be positive or negative. The probability that all $p$ eigenvalues are positive is approximately $2^{-p}$, which is exponentially small. Most critical points have roughly $p/2$ negative eigenvalues (high-index saddle points). The few critical points with index near 0 tend to have similar loss values, explaining why "all local minima are equally good" in practice.
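The $2^{-p}$ heuristic can be sanity-checked with a small Monte Carlo simulation over random sign patterns. This is a toy model of the sign-counting argument, not an actual Hessian computation:

```python
import numpy as np

# Monte Carlo check: if each eigenvalue is independently positive with
# probability 1/2, the chance a critical point is a local minimum
# (all p signs positive) is 2^-p.
rng = np.random.default_rng(0)

for p in (2, 5, 10):
    signs = rng.choice([-1, 1], size=(200_000, p))
    frac_minima = np.mean(np.all(signs > 0, axis=1))
    print(f"p={p:2d}: empirical {frac_minima:.5f} vs 2^-p = {2.0**-p:.5f}")
```

Already at $p = 10$ fewer than one critical point in a thousand is a minimum; at realistic $p$ the fraction is astronomically small.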

Proof Sketch

This result comes from random matrix theory applied to the Hessians of random Gaussian fields (Bray & Dean, 2007). The Kac-Rice formula counts critical points by integrating $|\det H(\theta)|$ over the zero set of $\nabla L$. The expected count factorizes into a term from the gradient (how often it vanishes) and a term from the Hessian determinant (which depends on the distribution of eigenvalue signs).

Why It Matters

This explains the empirical observation that deep networks, despite having highly non-convex loss landscapes, rarely get stuck in bad local minima. The problematic critical points are high-loss saddle points, which gradient descent with noise (SGD) escapes by moving along negative curvature directions.

Failure Mode

The random function model does not perfectly describe real neural network losses. Neural network losses have symmetries (permutation of hidden units), structured Hessians, and data-dependent correlations that deviate from the random model. The qualitative conclusion (saddle points dominate) holds empirically, but the quantitative predictions of the random model should be treated as approximate.

Proposition

Flat Minima Generalize Better

Statement

(PAC-Bayes bound, simplified) For a posterior $Q = \mathcal{N}(\theta^*, \sigma^2 I)$ centered at a minimum $\theta^*$ with prior $P = \mathcal{N}(0, \sigma_0^2 I)$, the generalization gap satisfies:

$$\mathbb{E}_{\theta \sim Q}[R(\theta)] - \mathbb{E}_{\theta \sim Q}[\hat{R}_n(\theta)] \leq \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \log(2n/\delta)}{2n}}$$

If $\theta^*$ is in a flat region (the loss does not change much under Gaussian perturbation), then $\mathbb{E}_Q[\hat{R}_n(\theta)] \approx \hat{R}_n(\theta^*)$ and the bound gives meaningful generalization guarantees. Sharp minima require smaller $\sigma$, increasing $\mathrm{KL}(Q \,\|\, P)$ and loosening the bound.

Intuition

A flat minimum is one where perturbing the parameters slightly does not change the loss much. This means many parameter configurations near $\theta^*$ all achieve similar training loss. A large volume of near-optimal parameters is associated with generalization: the model is not relying on a fragile, specific parameter configuration that could break on new data.
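This perturbation view of flatness can be sketched numerically. The two diagonal Hessians below are hypothetical stand-ins for a flat and a sharp quadratic minimum:

```python
import numpy as np

# Average the loss over Gaussian parameter noise around two quadratic
# minima. For L(eps) = 0.5 * eps^T H eps the expected increase is
# (sigma^2 / 2) * tr(H), so the flat minimum barely moves.
rng = np.random.default_rng(0)

def expected_loss_under_noise(H, sigma=0.1, n_samples=100_000):
    eps = rng.normal(scale=sigma, size=(n_samples, H.shape[0]))
    return np.mean(0.5 * np.einsum("ni,ij,nj->n", eps, H, eps))

H_flat = np.diag([0.1, 0.1])
H_sharp = np.diag([100.0, 100.0])

flat_gap = expected_loss_under_noise(H_flat)    # ~ 0.005 * tr(H) = 0.001
sharp_gap = expected_loss_under_noise(H_sharp)  # ~ 0.005 * tr(H) = 1.0
print(flat_gap, sharp_gap)
```

The same Gaussian posterior $Q$ that leaves the flat minimum's loss nearly unchanged inflates the sharp minimum's loss by three orders of magnitude, which is exactly the asymmetry the PAC-Bayes bound exploits.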

Gradient Flow: The Continuous-Time Limit

Gradient flow is the continuous-time limit of gradient descent with infinitesimal learning rate:

$$\frac{d\theta}{dt} = -\nabla L(\theta(t))$$

This is an ODE whose solutions follow the steepest descent path on the loss surface. Gradient flow is mathematically cleaner than discrete gradient descent and provides insights about convergence and implicit bias.

Key properties of gradient flow:

  • $L(\theta(t))$ is monotonically non-increasing: $dL/dt = -\|\nabla L\|^2 \leq 0$
  • The trajectory converges to a critical point (under mild conditions)
  • For linear models, gradient flow converges to the minimum-norm solution $\theta^* = \arg\min \|\theta\| \text{ s.t. } L(\theta) = 0$

This last property is the implicit bias of gradient methods: among all solutions that achieve zero training loss, gradient flow selects the one with minimum parameter norm. For deeper networks, the implicit bias is more complex and not fully understood.
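The minimum-norm property can be checked numerically. The sketch below approximates gradient flow with small-step gradient descent on a hypothetical underdetermined least-squares problem:

```python
import numpy as np

# Small-step gradient descent on an underdetermined least-squares problem
# (5 equations, 20 unknowns). Started from theta = 0, it converges to the
# minimum-norm interpolating solution, i.e. the pseudoinverse solution.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))  # wide data matrix: infinitely many solutions
y = rng.normal(size=5)

theta = np.zeros(20)
lr = 0.01
for _ in range(50_000):
    theta -= lr * (X.T @ (X @ theta - y))  # gradient of 0.5 * ||X theta - y||^2

theta_min_norm = np.linalg.pinv(X) @ y
print(np.allclose(theta, theta_min_norm, atol=1e-6))  # True
```

The mechanism is visible in the update: every gradient lies in the row space of $X$, so starting from zero the iterate never acquires a component orthogonal to it, and the limit is the interpolator of smallest norm.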

Mode Connectivity

A striking empirical discovery: independently trained neural networks that achieve similar loss can be connected by simple curves in parameter space along which the loss remains low.

Linear mode connectivity: Two minima $\theta_A$ and $\theta_B$ are linearly mode-connected if $L((1-t)\theta_A + t\theta_B) \approx L(\theta_A)$ for all $t \in [0,1]$.

This does not hold for arbitrary minima: the linear path typically passes through a high-loss barrier. But after accounting for permutation symmetries of hidden units, many independently trained networks are linearly mode-connected. More generally, there exist smooth curves (e.g., quadratic Bézier paths) connecting minima without loss barriers.

Mode connectivity suggests the loss landscape has a connected low-loss manifold rather than isolated basins, explaining why different random initializations find solutions of similar quality.
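A minimal sketch of the interpolation scan, using a toy loss of my own choosing rather than a trained network; it shows the barrier that appears when two minima are not linearly mode-connected:

```python
import numpy as np

# Scan the loss along the straight line between two zero-loss minima of
# the toy loss L(a, b) = (a*b - 1)^2. The linear path crosses a barrier
# at the midpoint (0, 0), so these minima are not linearly mode-connected.
def loss(theta):
    a, b = theta
    return (a * b - 1.0) ** 2

theta_A = np.array([1.0, 1.0])    # one minimum
theta_B = np.array([-1.0, -1.0])  # another minimum
ts = np.linspace(0.0, 1.0, 101)
path_losses = [loss((1 - t) * theta_A + t * theta_B) for t in ts]

print(min(path_losses), max(path_losses))  # 0.0 at endpoints, 1.0 at t=0.5
```

The same scan applied to trained networks, after permuting hidden units to align them, is how linear mode connectivity is measured in practice.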

Edge of Stability

The edge of stability phenomenon (Cohen et al., 2021) describes a surprising training dynamic:

  1. Early in training, the largest Hessian eigenvalue $\lambda_{\max}$ increases until it reaches $2/\eta$, where $\eta$ is the learning rate
  2. At this point, classical convergence theory predicts divergence (the learning rate exceeds $2/\lambda_{\max}$)
  3. Instead, training enters a regime where $\lambda_{\max} \approx 2/\eta$ and the loss continues to decrease non-monotonically

This means the optimizer self-tunes the curvature: if the landscape becomes too sharp, gradient descent destabilizes and moves to flatter regions, keeping $\lambda_{\max}$ at the stability threshold.
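The classical $2/\eta$ threshold itself is easy to verify on a 1D quadratic (a toy with none of the self-stabilization that real networks exhibit):

```python
# Classical stability threshold on the 1D quadratic L(x) = (lam/2) x^2.
# The GD update is x <- (1 - lr*lam) x, which contracts iff lr*lam < 2.
def run_gd(lam, lr, steps=100, x0=1.0):
    x = x0
    for _ in range(steps):
        x -= lr * lam * x  # gradient of (lam/2) x^2 is lam * x
    return abs(x)

lr = 0.1  # stability threshold: lam = 2/lr = 20
print(run_gd(lam=19.0, lr=lr))  # lr*lam = 1.9 < 2: |x| shrinks toward 0
print(run_gd(lam=21.0, lr=lr))  # lr*lam = 2.1 > 2: |x| blows up
```

On a fixed quadratic the curvature cannot change, so crossing the threshold means divergence; the edge-of-stability observation is that neural losses are not fixed quadratics, and the instability itself pushes the iterate toward flatter curvature.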

Canonical Examples

Example

Saddle points in a 2D toy problem

Consider $L(\theta_1, \theta_2) = \theta_1^2 - \theta_2^2$. The origin is a saddle point with $H = \operatorname{diag}(2, -2)$. Gradient descent from most initializations escapes along the $\theta_2$ direction (negative curvature). In high dimensions, most critical points look like this but with many more negative directions.
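A short sketch of this escape dynamic (a toy: this loss is unbounded below, so the escaping coordinate grows without limit):

```python
import numpy as np

# Gradient descent on L = theta1^2 - theta2^2, starting near the saddle
# at the origin. The theta1 coordinate contracts, while any tiny theta2
# component grows exponentially along the negative-curvature direction.
theta = np.array([1.0, 1e-6])  # almost exactly on the stable manifold theta2 = 0
lr = 0.1
for _ in range(200):
    grad = np.array([2 * theta[0], -2 * theta[1]])
    theta -= lr * grad

print(theta)  # theta1 ~ 0; theta2 has escaped far from the saddle
```

The initialization is deliberately adversarial: even a $10^{-6}$ perturbation off the stable manifold is amplified by the factor $(1 + 2\eta)$ per step, which is why injected noise (as in SGD) reliably frees iterates that stall near saddles.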

Example

Flat vs sharp minimum

Train the same architecture with SGD (large learning rate, small batch) and full-batch GD (small learning rate, all data). SGD typically converges to flatter minima (lower $\lambda_{\max}$) and achieves better test accuracy. The noise in SGD acts as implicit regularization, biasing toward flat regions of the loss landscape.

Common Confusions

Watch Out

Local minima are not the main obstacle

The folklore that gradient descent fails because it gets stuck in bad local minima is incorrect for overparameterized networks. The actual obstacles are: (1) saddle points that slow convergence (but do not trap), (2) flat regions (plateaus) where the gradient is small, and (3) ill-conditioning (large ratio of max to min Hessian eigenvalues) that makes some directions much harder to optimize than others.

Watch Out

Flat minima are not always better

The flat-minima-generalize hypothesis has caveats. The definition of flatness depends on the parameterization: reparameterizing the network (e.g., scaling weights by a constant and biases by its inverse) can change a flat minimum into a sharp one without changing the function. Sharpness measures must be reparameterization-invariant to be meaningful.
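The reparameterization caveat can be demonstrated on a toy two-parameter product model (my own illustrative choice): the minima $(1, 1)$ and $(\alpha, 1/\alpha)$ compute the same function, but their measured sharpness differs by orders of magnitude:

```python
import numpy as np

# Toy loss L(w1, w2) = (w1 * w2 - 1)^2. Every point with w1 * w2 = 1 is a
# zero-loss minimum implementing the same function, yet lambda_max of the
# Hessian depends on the chosen parameterization.
def hessian_at_minimum(w1, w2):
    # Analytic Hessian of (w1*w2 - 1)^2, evaluated where w1*w2 = 1.
    return 2.0 * np.array([[w2 ** 2, 1.0],
                           [1.0, w1 ** 2]])

lams = []
for w1, w2 in [(1.0, 1.0), (10.0, 0.1)]:  # same function, rescaled weights
    lam_max = np.linalg.eigvalsh(hessian_at_minimum(w1, w2))[-1]
    lams.append(lam_max)
    print(f"minimum ({w1}, {w2}): lambda_max = {lam_max:.2f}")
```

The rescaled minimum is roughly 50 times sharper by the $\lambda_{\max}$ measure despite representing an identical predictor, which is why non-invariant sharpness measures cannot by themselves explain generalization.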

Watch Out

SGD noise is not Gaussian

The stochastic gradient noise in SGD is often modeled as Gaussian for theoretical convenience, but real mini-batch noise has heavier tails and is state-dependent. The noise covariance depends on the current parameters $\theta$, which changes the dynamics qualitatively compared to additive Gaussian noise.

Summary

  • In high dimensions, saddle points vastly outnumber local minima
  • Low-index critical points (near-minima) cluster at similar loss values
  • Flat minima generalize better (PAC-Bayes connection)
  • Gradient flow has implicit bias toward simple solutions
  • Mode connectivity: good minima are connected by low-loss paths
  • Edge of stability: SGD self-regulates curvature at $\lambda_{\max} \approx 2/\eta$

Exercises

ExerciseCore

Problem

In $\mathbb{R}^{100}$, if each Hessian eigenvalue at a critical point is independently positive or negative with equal probability, what is the probability that the critical point is a local minimum?

ExerciseAdvanced

Problem

Show that gradient flow $d\theta/dt = -\nabla L(\theta)$ monotonically decreases the loss. What additional condition ensures convergence to a critical point?

ExerciseResearch

Problem

The edge-of-stability phenomenon shows that $\lambda_{\max}(H)$ stabilizes near $2/\eta$ during GD training. Why does classical GD convergence theory predict divergence when $\lambda_{\max} > 2/\eta$, and what mechanism allows training to continue despite this?

References

Canonical:

  • Dauphin et al., Identifying and Attacking the Saddle Point Problem in High-Dimensional Non-Convex Optimization (NeurIPS 2014)
  • Keskar et al., On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima (ICLR 2017)

Current:

  • Cohen et al., Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability (ICLR 2021)
  • Frankle et al., Linear Mode Connectivity and the Lottery Ticket Hypothesis (ICML 2020)

Last reviewed: April 2026
