
Training Techniques

Learning Rate Scheduling

Why learning rate moves the loss more than any other hyperparameter, and how warmup, cosine decay, step decay, cyclic schedules, and 1cycle policy control training dynamics.


Why This Matters

Figure: a learning rate schedule plotted over 1,000 training steps, with the learning rate ranging from 0 to 0.001.

Changing the learning rate schedule can matter more than changing the model architecture. A badly tuned constant learning rate either diverges (too high) or converges to a poor solution (too low). Every modern training recipe uses some form of learning rate scheduling.

Mental Model

The learning rate controls the step size in parameter space. Early in training, you want large steps to move away from random initialization. Late in training, you want small steps to settle into a good minimum. The schedule is the function $\eta(t)$ mapping training step $t$ to a learning rate. The theoretical foundations for convergence under decaying step sizes come from SGD convergence theory.

Formal Setup

Definition

Learning Rate Schedule

A learning rate schedule is a function $\eta: \{0, 1, \ldots, T\} \to \mathbb{R}_{>0}$ that determines the step size at each iteration of gradient descent:

$$\theta_{t+1} = \theta_t - \eta(t) \nabla \hat{R}(\theta_t)$$

where $\hat{R}$ is the empirical risk.
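The update rule above can be sketched in a few lines of Python. This is an illustrative toy, not a reference implementation: the quadratic objective and the $0.1/\sqrt{t+1}$ schedule are choices made for the example.

```python
import math

def gradient_descent(grad, theta0, schedule, num_steps):
    """Gradient descent where the step size at step t is schedule(t)."""
    theta = theta0
    for t in range(num_steps):
        theta = theta - schedule(t) * grad(theta)
    return theta

# Toy example: minimize f(theta) = theta^2, whose gradient is 2 * theta,
# under the decaying schedule eta(t) = 0.1 / sqrt(t + 1).
theta_final = gradient_descent(
    grad=lambda th: 2.0 * th,
    theta0=5.0,
    schedule=lambda t: 0.1 / math.sqrt(t + 1),
    num_steps=1000,
)
```

Swapping in a different `schedule` lambda is all it takes to compare the schedules discussed below on a toy problem.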

Common Schedules

Constant Learning Rate

The simplest schedule: $\eta(t) = \eta_0$ for all $t$. Rarely optimal, but useful as a baseline. For convex problems with $L$-Lipschitz gradients, a constant learning rate $\eta \leq 1/L$ guarantees convergence at rate $O(1/T)$.

Step Decay

Drop the learning rate by a constant factor $\gamma \in (0,1)$ at fixed milestones:

$$\eta(t) = \eta_0 \cdot \gamma^{\lfloor t / s \rfloor}$$

where $s$ is the step interval. Common choice: $\gamma = 0.1$ every 30 epochs. This was the default schedule for ResNet training.
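The formula translates directly to code. A minimal sketch, with the ResNet-style defaults ($\eta_0 = 0.1$, $\gamma = 0.1$, $s = 30$ epochs) baked in as keyword arguments:

```python
def step_decay(t, eta0=0.1, gamma=0.1, s=30):
    """Step decay: multiply the base rate by gamma once per s epochs elapsed."""
    return eta0 * gamma ** (t // s)

# ResNet-style settings: 0.1 for epochs 0-29, 0.01 for 30-59, 0.001 from 60 on.
rates = [step_decay(epoch) for epoch in (0, 29, 30, 60)]
```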

Cosine Decay

The learning rate follows a cosine curve from $\eta_{\max}$ to $\eta_{\min}$:

$$\eta(t) = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{\pi t}{T}\right)\right)$$

Cosine decay provides a smooth annealing that spends more iterations at moderate learning rates than step decay does. Loshchilov and Hutter (2016) introduced cosine annealing (with warm restarts) and reported gains over step decay on image classification benchmarks.
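The cosine formula above, as a sketch in Python (the `1e-3` peak below is just a sample value):

```python
import math

def cosine_decay(t, T, eta_max, eta_min=0.0):
    """Cosine annealing from eta_max at t = 0 down to eta_min at t = T."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T))

start = cosine_decay(0, 1000, 1e-3)    # cos(0) = 1, so this equals eta_max
end = cosine_decay(1000, 1000, 1e-3)   # cos(pi) = -1, so this equals eta_min
```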

Linear Warmup

Start from a small $\eta_{\text{init}}$ and increase linearly to $\eta_{\max}$ over $T_w$ warmup steps:

$$\eta(t) = \eta_{\text{init}} + \frac{t}{T_w}(\eta_{\max} - \eta_{\text{init}}), \quad t \leq T_w$$
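In practice, warmup is usually combined with a decay phase. A minimal sketch of the common warmup-then-cosine pattern (the function name and the split into two branches are this example's choices, not a standard API):

```python
import math

def warmup_cosine(t, T, T_w, eta_max, eta_init=0.0, eta_min=0.0):
    """Linear warmup over the first T_w steps, then cosine decay until step T."""
    if t <= T_w:
        return eta_init + (t / T_w) * (eta_max - eta_init)
    progress = (t - T_w) / (T - T_w)  # fraction of the decay phase completed
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * progress))

peak = warmup_cosine(100, 1000, 100, 1e-3)   # end of warmup reaches eta_max
final = warmup_cosine(1000, 1000, 100, 1e-3) # end of training reaches eta_min
```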

Cyclic Learning Rate

Alternate between $\eta_{\min}$ and $\eta_{\max}$ over a cycle of length $2C$ steps:

$$\eta(t) = \eta_{\min} + (\eta_{\max} - \eta_{\min}) \cdot \max\left(0,\; 1 - \left|\frac{t}{C} - 2\left\lfloor \frac{t}{2C} \right\rfloor - 1\right|\right)$$

Smith (2017) showed that cycling can achieve equivalent accuracy in fewer epochs. The key insight: periodically increasing LR helps escape sharp minima. This interacts with the choice of gradient descent variant used.
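The triangular formula above can be transcribed directly; a sketch with illustrative bounds of $10^{-4}$ and $10^{-3}$:

```python
def triangular_lr(t, C, eta_min, eta_max):
    """Triangular cyclic schedule: ramp up over C steps, then down over C steps."""
    # Distance from the peak of the current cycle: 0 at the peak (t = C, 3C, ...),
    # 1 at the cycle boundaries (t = 0, 2C, 4C, ...).
    cycle_pos = abs(t / C - 2 * (t // (2 * C)) - 1)
    return eta_min + (eta_max - eta_min) * max(0.0, 1 - cycle_pos)

# One full cycle of length 2C = 200: min at t = 0, peak at t = 100, min at t = 200.
cycle_rates = [triangular_lr(t, 100, 1e-4, 1e-3) for t in (0, 100, 200)]
```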

1Cycle Policy

A single cycle: warm up from $\eta_{\min}$ to $\eta_{\max}$ over the first ~30% of training, then decay back to $\eta_{\min}$ (or below) for the remaining 70%. Smith and Topin (2019) demonstrated "super-convergence," where 1cycle training reaches the same accuracy in up to 10x fewer iterations.
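A simplified sketch of the shape described above. This omits parts of the full 1cycle recipe (which also cycles momentum and adds a final phase that anneals below $\eta_{\min}$); it only shows the linear up-then-down learning rate path, with the peak step `T_up` passed in explicitly:

```python
def one_cycle(t, T, T_up, eta_min, eta_max):
    """Simplified 1cycle LR path: linear ramp to eta_max by step T_up, then back."""
    if t <= T_up:
        return eta_min + (t / T_up) * (eta_max - eta_min)
    return eta_max - ((t - T_up) / (T - T_up)) * (eta_max - eta_min)

# Peak placed at 30% of a 1000-step run.
policy_rates = [one_cycle(t, 1000, 300, 1e-4, 1e-3) for t in (0, 300, 1000)]
```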

Why Warmup Matters for Transformers

Warmup is not optional for transformer training with Adam. The reason is specific to Adam's mechanics.

Adam divides the gradient by $\sqrt{\hat{v}_t} + \epsilon$, where $\hat{v}_t$ is the bias-corrected second moment estimate. At initialization, $v_0 = 0$. During the first few steps, $\hat{v}_t$ is estimated from very few gradient samples. This estimate has high variance, so the effective step sizes are unreliable.

With warmup, the small initial LR limits damage from these noisy early updates. By the time the LR reaches its peak, Adam's second moment estimates have stabilized.

Without warmup, transformers often diverge in the first few hundred steps. This failure mode is especially common with large batch sizes, where each gradient estimate is more precise but the second moment still needs time to accumulate.
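To make the mechanics concrete, here is a minimal scalar Adam update (a sketch using the standard default hyperparameters, not tied to any particular library). Note what happens at $t = 1$ from fresh state: $\hat{m}_1 = g$ and $\hat{v}_1 = g^2$, so the update magnitude is $\approx \eta$ regardless of the gradient's scale. Keeping $\eta$ small during warmup is what limits the damage from those early, poorly calibrated steps.

```python
import math

def adam_step(theta, g, m, v, t, eta, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adam update at step t (1-indexed) with learning rate eta."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# First step from fresh state: the update size is ~eta for any gradient scale.
th_small, _, _ = adam_step(1.0, g=0.5, m=0.0, v=0.0, t=1, eta=0.01)
th_large, _, _ = adam_step(1.0, g=100.0, m=0.0, v=0.0, t=1, eta=0.01)
```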

Main Theorems

Theorem

Convergence Rate with Decaying Learning Rate

Statement

For gradient descent on an $L$-smooth convex function $f$ with a learning rate schedule $\eta(t)$ satisfying $\sum_{t=0}^{\infty} \eta(t) = \infty$ and $\sum_{t=0}^{\infty} \eta(t)^2 < \infty$, the iterates satisfy:

$$\min_{t \leq T} \|\nabla f(\theta_t)\|^2 \leq \frac{f(\theta_0) - f^* + \frac{L}{2} G^2 \sum_{t=0}^{T}\eta(t)^2}{\sum_{t=0}^{T}\eta(t)}$$

where $G$ bounds the gradient norm and $f^*$ is the minimum value.

Intuition

The numerator stays bounded (the sum of squared learning rates converges). The denominator grows without bound (the sum of learning rates diverges). So the best gradient norm seen so far goes to zero. The schedule $\eta(t) = c/(t+1)$ satisfies both conditions; $\eta(t) = c/\sqrt{t+1}$ satisfies the first but not the second, since its squares form the divergent harmonic series.
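A quick numeric check of the two sums for $\eta(t) = 1/(t+1)$ (a sketch; the truncation at 10,000 terms is arbitrary):

```python
def partial_sums(schedule, T):
    """Return (sum of eta(t), sum of eta(t)^2) for t = 0, ..., T - 1."""
    total = sum(schedule(t) for t in range(T))
    total_sq = sum(schedule(t) ** 2 for t in range(T))
    return total, total_sq

# eta(t) = 1/(t+1): the plain sum grows like log T (diverges as T -> infinity),
# while the squared sum approaches pi^2 / 6 (converges), so both conditions hold.
s, s_sq = partial_sums(lambda t: 1.0 / (t + 1), 10_000)
```

Running `partial_sums` at a few values of `T` makes the contrast visible: the first component keeps growing while the second plateaus.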

Proof Sketch

Start from the descent lemma: $f(\theta_{t+1}) \leq f(\theta_t) - \eta(t)\|\nabla f(\theta_t)\|^2 + \frac{L}{2}\eta(t)^2 G^2$. Telescope the sum over $T$ steps. Rearrange to isolate $\sum_t \eta(t)\|\nabla f(\theta_t)\|^2$. Bound the minimum by the weighted average.

Why It Matters

This theorem explains why the Robbins-Monro conditions ($\sum_t \eta_t = \infty$, $\sum_t \eta_t^2 < \infty$) appear in every optimization textbook. They are the standard sufficient conditions for a schedule to guarantee convergence.

Failure Mode

The theorem assumes convexity. Neural network loss landscapes are non-convex. In practice, schedules that violate the $\sum_t \eta_t^2 < \infty$ condition (like cyclic LR) can outperform theoretically valid ones. The theory does not explain super-convergence.

Common Confusions

Watch Out

Learning rate vs effective learning rate in Adam

When using Adam, the actual step size is $\eta / (\sqrt{\hat{v}_t} + \epsilon)$, not $\eta$ alone. Changing the learning rate schedule changes the numerator of this fraction. The effective step also depends on the gradient history through $\hat{v}_t$. This means the same $\eta$ schedule behaves differently with Adam than with SGD.
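A small sketch of this effect (the helper name and the two synthetic gradient histories are this example's choices): tracking $\eta / (\sqrt{\hat{v}_t} + \epsilon)$ for two histories with the same $\eta$ shows very different effective multipliers.

```python
import math

def effective_steps(grads, eta, beta2=0.999, eps=1e-8):
    """Effective Adam multipliers eta / (sqrt(v_hat) + eps) along a gradient history."""
    v, out = 0.0, []
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1 - beta2) * g * g
        v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
        out.append(eta / (math.sqrt(v_hat) + eps))
    return out

# Same eta, different gradient scales -> very different effective step sizes.
small_grads = effective_steps([0.01] * 5, eta=1e-3)
large_grads = effective_steps([10.0] * 5, eta=1e-3)
```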

Watch Out

Warmup is not about the model, it is about the optimizer

Warmup is sometimes described as "letting the model learn the data distribution gradually." This is wrong. Warmup compensates for the optimizer state being poorly initialized. With SGD (which has no adaptive state), warmup is less critical. With Adam, it is nearly mandatory for large models.

Watch Out

Cosine decay is not a single schedule

Cosine decay requires choosing $\eta_{\max}$, $\eta_{\min}$, and $T$. Restarting the cosine (SGDR) further adds restart period and multiplier parameters. The cosine shape itself is not the key; the smooth annealing is.

Key Takeaways

  • Learning rate schedule often matters more than architecture choices
  • Warmup is required for Adam on transformers because the second moment needs time to stabilize
  • Cosine decay outperforms step decay in most modern settings
  • Cyclic LR and 1cycle can achieve "super-convergence" with fewer iterations
  • The Robbins-Monro conditions ($\sum_t \eta_t = \infty$, $\sum_t \eta_t^2 < \infty$) are sufficient for convergence in the convex case

Exercises

ExerciseCore

Problem

A cosine decay schedule runs for $T = 1000$ steps with $\eta_{\max} = 0.001$ and $\eta_{\min} = 0$. What is the learning rate at step $t = 500$?

ExerciseAdvanced

Problem

Show that the schedule $\eta(t) = c/(t+1)$ satisfies the Robbins-Monro conditions, while $\eta(t) = c/\sqrt{t+1}$ violates the square-summability condition. Then compute $\sum_{t=0}^{T} \eta(t)$ for both and explain why the faster-decaying $c/(t+1)$ schedule is often too conservative in practice.

References

Canonical:

  • Robbins & Monro, "A Stochastic Approximation Method" (1951)
  • Bottou, "Stochastic Gradient Descent Tricks" in Neural Networks: Tricks of the Trade, Chapter 1

Current:

  • Loshchilov & Hutter, "SGDR: Stochastic Gradient Descent with Warm Restarts" (2016)
  • Smith, "Cyclical Learning Rates for Training Neural Networks" (2017)
  • Smith & Topin, "Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates" (2019)
  • Gotmare et al., "A Closer Look at Deep Learning Heuristics: Learning Rate Restarts, Warmup and Distillation" (2019), Section 3

Last reviewed: April 2026
