Training Techniques
Learning Rate Scheduling
Why learning rate moves the loss more than any other hyperparameter, and how warmup, cosine decay, step decay, cyclic schedules, and 1cycle policy control training dynamics.
Why This Matters
Changing the learning rate schedule can matter more than changing the model architecture. A badly tuned constant learning rate either diverges (too high) or converges to a poor solution (too low). Every modern training recipe uses some form of learning rate scheduling.
Mental Model
The learning rate controls the step size in parameter space. Early in training, you want large steps to move away from random initialization. Late in training, you want small steps to settle into a good minimum. The schedule is the function mapping training step to learning rate. The theoretical foundations for convergence under decaying step sizes come from SGD convergence theory.
Formal Setup
Learning Rate Schedule
A learning rate schedule is a function $t \mapsto \eta_t$ that determines the step size at each iteration of gradient descent:

$$\theta_{t+1} = \theta_t - \eta_t \nabla L(\theta_t)$$

where $L$ is the empirical risk.
Common Schedules
Constant Learning Rate
The simplest schedule: $\eta_t = \eta_0$ for all $t$. Rarely optimal, but useful as a baseline. For convex problems with $L$-Lipschitz gradients, constant LR with value $\eta = 1/L$ guarantees convergence at rate $O(1/T)$.
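A one-dimensional quadratic makes the constant-LR tradeoff concrete. This is a minimal sketch (the function, constants, and the name `gd_quadratic` are illustrative, not from the original): a step below $2/L$ contracts toward the minimizer, while a larger step diverges geometrically.

```python
def gd_quadratic(eta, steps, L=4.0, theta0=1.0):
    """Run gradient descent on f(theta) = (L/2) * theta**2, whose gradient
    L * theta is L-Lipschitz, and return the final iterate."""
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * L * theta  # theta <- theta - eta * f'(theta)
    return theta

# eta = 1/L jumps straight to the minimizer at 0 for this quadratic;
# eta > 2/L makes |theta| grow geometrically (divergence).
print(gd_quadratic(eta=0.25, steps=10))   # converges (eta = 1/L)
print(gd_quadratic(eta=0.625, steps=10))  # diverges (eta = 2.5/L)
```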
Step Decay
Drop the learning rate by a constant factor $\gamma$ at fixed milestones:

$$\eta_t = \eta_0 \cdot \gamma^{\lfloor t/s \rfloor}$$

where $s$ is the step interval. Common choice: $\gamma = 0.1$ every 30 epochs. This was the default schedule for ResNet training.
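A minimal sketch of the formula above (the function name and defaults are illustrative; the defaults mirror the classic ResNet-style recipe of a 10x drop every 30 epochs):

```python
def step_decay(epoch, eta0=0.1, gamma=0.1, interval=30):
    """eta_t = eta0 * gamma ** floor(epoch / interval): hold the LR constant
    within each interval, then drop it by a factor of gamma."""
    return eta0 * gamma ** (epoch // interval)

# LR stays at 0.1 for epochs 0-29, drops to 0.01 at epoch 30, 0.001 at 60.
print([step_decay(e) for e in (0, 29, 30, 60)])
```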
Cosine Decay
The learning rate follows a cosine curve from $\eta_{\max}$ to $\eta_{\min}$ over $T$ total steps:

$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\frac{\pi t}{T}\right)$$
Cosine decay provides a smooth annealing that spends more iterations at moderate learning rates compared to step decay. Loshchilov and Hutter (2016) showed this consistently outperforms step decay across architectures.
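The cosine formula translates directly to code. A minimal sketch (function name and arguments are illustrative):

```python
import math

def cosine_decay(t, T, eta_max, eta_min=0.0):
    """eta_t = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * t / T))."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T))

# Starts at eta_max, passes through the midpoint at t = T/2, ends at eta_min.
```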
Linear Warmup
Start from a small $\eta_{\text{init}}$ and increase linearly to $\eta_{\max}$ over $T_{\text{warmup}}$ warmup steps:

$$\eta_t = \eta_{\text{init}} + (\eta_{\max} - \eta_{\text{init}}) \frac{t}{T_{\text{warmup}}}, \qquad t \le T_{\text{warmup}}$$
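A sketch of linear warmup, holding the peak rate after the ramp (the name and the hold-after-warmup behavior are illustrative choices; in practice warmup is usually chained into a decay schedule):

```python
def linear_warmup(t, warmup_steps, eta_max, eta_init=0.0):
    """Ramp linearly from eta_init to eta_max over warmup_steps, then hold."""
    if t >= warmup_steps:
        return eta_max
    return eta_init + (eta_max - eta_init) * t / warmup_steps
```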
Cyclic Learning Rate
Alternate between $\eta_{\min}$ and $\eta_{\max}$ over a cycle of length $2s$ steps, rising for $s$ steps and falling for $s$ steps (the triangular policy):

$$\eta_t = \eta_{\min} + (\eta_{\max} - \eta_{\min}) \max\!\left(0,\; 1 - \left|\frac{t}{s} - 2\left\lfloor 1 + \frac{t}{2s} \right\rfloor + 1\right|\right)$$
Smith (2017) showed that cycling can achieve equivalent accuracy in fewer epochs. The key insight: periodically increasing LR helps escape sharp minima. This interacts with the choice of gradient descent variant used.
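Smith's triangular policy can be sketched as follows (a minimal illustration; the function name is ours, and `stepsize` is half a cycle as in the paper's formulation):

```python
import math

def triangular_clr(t, eta_min, eta_max, stepsize):
    """Triangular cyclic LR: rise from eta_min to eta_max over `stepsize`
    steps, fall back over the next `stepsize` steps, then repeat."""
    cycle = math.floor(1 + t / (2 * stepsize))        # which cycle we are in
    x = abs(t / stepsize - 2 * cycle + 1)             # distance from the peak
    return eta_min + (eta_max - eta_min) * max(0.0, 1 - x)

# Peak at t = stepsize, back to eta_min at t = 2 * stepsize, and so on.
```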
1Cycle Policy
A single cycle: warmup from $\eta_{\min}$ to $\eta_{\max}$ over the first ~30% of training, then decay back to $\eta_{\min}$ (or below) for the remaining 70%. Smith and Topin (2019) demonstrated "super-convergence," where 1cycle training reaches the same accuracy in up to 10x fewer iterations.
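A sketch of the 1cycle shape with linear ramps (illustrative: the function name and the choice of linear rather than cosine annealing are our assumptions; implementations vary in the annealing curve and final LR):

```python
def one_cycle(t, T, eta_max, eta_start, eta_final, pct_up=0.3):
    """Single cycle: linear ramp eta_start -> eta_max over the first pct_up
    of training, then linear decay to eta_final (typically <= eta_start)."""
    t_up = pct_up * T
    if t <= t_up:
        return eta_start + (eta_max - eta_start) * t / t_up
    return eta_max + (eta_final - eta_max) * (t - t_up) / (T - t_up)
```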
Why Warmup Matters for Transformers
Warmup is not optional for transformer training with Adam. The reason is specific to Adam's mechanics.
Adam divides the gradient by $\sqrt{\hat{v}_t} + \epsilon$, where $\hat{v}_t$ is the bias-corrected second moment estimate. At initialization, $v_0 = 0$. During the first few steps, $\hat{v}_t$ is estimated from very few gradient samples. This estimate has high variance, so the effective step sizes are unreliable.
With warmup, the small initial LR limits damage from these noisy early updates. By the time the LR reaches its peak, Adam's second moment estimates have stabilized.
Without warmup, transformers often diverge in the first few hundred steps. This failure mode is especially common with large batch sizes, where each gradient estimate is more precise but the second moment still needs time to accumulate.
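Putting the pieces together, a common transformer recipe is linear warmup followed by cosine decay. A minimal sketch (the function name and the choice to chain warmup into cosine are illustrative, though this pairing is standard):

```python
import math

def warmup_cosine(t, warmup_steps, total_steps, eta_max, eta_min=0.0):
    """Linear warmup from 0 to eta_max, then cosine decay to eta_min."""
    if t < warmup_steps:
        return eta_max * t / warmup_steps
    progress = (t - warmup_steps) / (total_steps - warmup_steps)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * progress))

# Rises for warmup_steps, peaks at eta_max, then anneals to eta_min.
```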
Main Theorems
Convergence Rate with Decaying Learning Rate
Statement
For gradient descent on an $L$-smooth convex function $f$ with learning rate schedule satisfying $\sum_{t} \eta_t = \infty$ and $\sum_{t} \eta_t^2 < \infty$, the iterates satisfy:

$$\min_{t < T} \|\nabla f(\theta_t)\|^2 \le \frac{f(\theta_0) - f^* + \frac{L G^2}{2} \sum_{t=0}^{T-1} \eta_t^2}{\sum_{t=0}^{T-1} \eta_t}$$

where $G$ bounds the gradient norm and $f^*$ is the minimum value.
Intuition
The numerator stays bounded (the sum of squared learning rates converges). The denominator grows without bound (the sum of learning rates diverges). So the best gradient norm seen so far goes to zero. The schedule $\eta_t = \eta_0/t$ satisfies both conditions.
Proof Sketch
Start from the descent lemma: $f(\theta_{t+1}) \le f(\theta_t) - \eta_t \|\nabla f(\theta_t)\|^2 + \frac{L \eta_t^2}{2} \|\nabla f(\theta_t)\|^2$. Telescope the sum over $T$ steps. Rearrange to isolate $\sum_t \eta_t \|\nabla f(\theta_t)\|^2$. Bound the minimum by the weighted average.
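Written out, the standard chain of inequalities is (under the theorem's smoothness and bounded-gradient assumptions):

```latex
% Descent lemma for the update \theta_{t+1} = \theta_t - \eta_t \nabla f(\theta_t):
f(\theta_{t+1}) \le f(\theta_t) - \eta_t\left(1 - \tfrac{L\eta_t}{2}\right)\|\nabla f(\theta_t)\|^2

% Telescoping over t = 0, \dots, T-1 and using \|\nabla f(\theta_t)\| \le G:
\sum_{t=0}^{T-1} \eta_t \|\nabla f(\theta_t)\|^2
  \le f(\theta_0) - f^* + \frac{L G^2}{2} \sum_{t=0}^{T-1} \eta_t^2

% The minimum is at most the \eta_t-weighted average:
\min_{t < T} \|\nabla f(\theta_t)\|^2
  \le \frac{f(\theta_0) - f^* + \frac{L G^2}{2} \sum_{t=0}^{T-1} \eta_t^2}
           {\sum_{t=0}^{T-1} \eta_t}
```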
Why It Matters
This theorem explains why the Robbins-Monro conditions ($\sum_t \eta_t = \infty$, $\sum_t \eta_t^2 < \infty$) appear in every optimization textbook. They are the classic sufficient conditions for a schedule to guarantee convergence.
Failure Mode
The theorem assumes convexity. Neural network loss landscapes are non-convex. In practice, schedules that violate the $\sum_t \eta_t^2 < \infty$ condition (like cyclic LR) can outperform theoretically valid ones. The theory does not explain super-convergence.
Common Confusions
Learning rate vs effective learning rate in Adam
When using Adam, the actual step size is $\eta_t \, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$, not $\eta_t$ alone. Changing the learning rate schedule changes only the $\eta_t$ factor. The effective step also depends on the gradient history through $\hat{m}_t$ and $\hat{v}_t$. This means the same schedule behaves differently with Adam than with SGD.
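The point can be seen with a scalar toy computation (a sketch; the function name is ours, and the update follows the standard Adam recurrences with bias correction):

```python
import math

def adam_steps(grads, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the effective step eta * m_hat / (sqrt(v_hat) + eps) that Adam
    takes for a single scalar parameter fed the given gradient sequence."""
    m = v = 0.0
    out = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g          # first moment (EMA of g)
        v = beta2 * v + (1 - beta2) * g * g      # second moment (EMA of g^2)
        m_hat = m / (1 - beta1 ** t)             # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)             # bias-corrected second moment
        out.append(eta * m_hat / (math.sqrt(v_hat) + eps))
    return out

# With a constant gradient, m_hat = g and v_hat = g**2, so the effective
# step is ~eta regardless of the gradient's scale: the schedule sets the
# step magnitude only up to this gradient-history-dependent factor.
```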
Warmup is not about the model, it is about the optimizer
Warmup is sometimes described as "letting the model learn the data distribution gradually." This is wrong. Warmup compensates for the optimizer state being poorly initialized. With SGD (which has no adaptive state), warmup is less critical. With Adam, it is nearly mandatory for large models.
Cosine decay is not a single schedule
Cosine decay requires choosing $\eta_{\max}$, $\eta_{\min}$, and the horizon $T$. Restarting the cosine (SGDR) further adds restart period and multiplier parameters. The cosine shape itself is not the key; the smooth annealing is.
Key Takeaways
- Learning rate schedule often matters more than architecture choices
- Warmup is required for Adam on transformers because the second moment needs time to stabilize
- Cosine decay outperforms step decay in most modern settings
- Cyclic LR and 1cycle can achieve "super-convergence" with fewer iterations
- The Robbins-Monro conditions ($\sum \eta_t = \infty$, $\sum \eta_t^2 < \infty$) are sufficient for convergence in the convex case
Exercises
Problem
A cosine decay schedule runs for $T$ steps with peak $\eta_{\max}$ and floor $\eta_{\min}$. What is the learning rate at step $t = T/2$?
Problem
Show that the schedule $\eta_t = \eta_0/t$ satisfies the Robbins-Monro conditions, but decays too fast for practical use. Compute $\sum_{t=1}^{T} \eta_t$ for both $\eta_t = \eta_0/t$ and the constant schedule $\eta_t = \eta_0$, and compare how much total learning each accumulates.
References
Canonical:
- Robbins & Monro, "A Stochastic Approximation Method" (1951)
- Bottou, "Stochastic Gradient Descent Tricks" in Neural Networks: Tricks of the Trade, Chapter 1
Current:
- Loshchilov & Hutter, "SGDR: Stochastic Gradient Descent with Warm Restarts" (2016)
- Smith & Topin, "Super-Convergence" (2019)
- Gotmare et al., "A Closer Look at Deep Learning Heuristics" (2019), Section 3
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Stochastic Gradient Descent Convergence (Layer 2)
- Gradient Descent Variants (Layer 1)
- Convex Optimization Basics (Layer 1)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)