
Training Techniques

Learning Rate Scheduling

Why learning rate moves the loss more than any other hyperparameter, and how warmup, cosine decay, step decay, cyclic schedules, and 1cycle policy control training dynamics.


Why This Matters

Figure: a learning rate schedule plotted over 1,000 training steps, with the learning rate ranging from 0 to 0.001.

Changing the learning rate schedule can matter more than changing the model architecture. A badly tuned constant learning rate either diverges (too high) or converges to a poor solution (too low). Every modern training recipe uses some form of learning rate scheduling.

Mental Model

The learning rate controls the step size in parameter space. Early in training, you want large steps to move away from random initialization. Late in training, you want small steps to settle into a good minimum. The schedule is the function $\eta(t)$ mapping training step $t$ to a learning rate. The theoretical foundations for convergence under decaying step sizes come from SGD convergence theory.

Formal Setup

Definition

Learning Rate Schedule

A learning rate schedule is a function $\eta: \{0, 1, \ldots, T\} \to \mathbb{R}_{>0}$ that determines the step size at each iteration of gradient descent:

$$\theta_{t+1} = \theta_t - \eta(t) \nabla \hat{R}(\theta_t)$$

where $\hat{R}$ is the empirical risk.
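The update rule above can be sketched in a few lines of Python. This is an illustrative toy, not a reference implementation: the quadratic objective and the $0.1/\sqrt{t+1}$ schedule are choices made for the example.

```python
import math

def gradient_descent(grad, theta0, schedule, num_steps):
    """Gradient descent where the step size at step t is schedule(t)."""
    theta = theta0
    for t in range(num_steps):
        theta = theta - schedule(t) * grad(theta)
    return theta

# Toy example: minimize f(theta) = theta^2, whose gradient is 2 * theta,
# under the decaying schedule eta(t) = 0.1 / sqrt(t + 1).
theta_final = gradient_descent(
    grad=lambda th: 2.0 * th,
    theta0=5.0,
    schedule=lambda t: 0.1 / math.sqrt(t + 1),
    num_steps=1000,
)
```

Swapping in a different `schedule` lambda is all it takes to compare the schedules discussed below on a toy problem.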

Common Schedules

Constant Learning Rate

The simplest schedule: $\eta(t) = \eta_0$ for all $t$. Rarely optimal, but useful as a baseline. For convex problems with $L$-Lipschitz gradients, a constant learning rate $\eta \leq 1/L$ guarantees convergence at rate $O(1/T)$.

Step Decay

Drop the learning rate by a constant factor $\gamma \in (0,1)$ at fixed milestones:

$$\eta(t) = \eta_0 \cdot \gamma^{\lfloor t / s \rfloor}$$

where $s$ is the step interval. Common choice: $\gamma = 0.1$ every 30 epochs. This was the default schedule for ResNet training.
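The formula translates directly to code. A minimal sketch, with the ResNet-style defaults ($\eta_0 = 0.1$, $\gamma = 0.1$, $s = 30$ epochs) baked in as keyword arguments:

```python
def step_decay(t, eta0=0.1, gamma=0.1, s=30):
    """Step decay: multiply the base rate by gamma once per s epochs elapsed."""
    return eta0 * gamma ** (t // s)

# ResNet-style settings: 0.1 for epochs 0-29, 0.01 for 30-59, 0.001 from 60 on.
rates = [step_decay(epoch) for epoch in (0, 29, 30, 60)]
```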

Cosine Decay

The learning rate follows a cosine curve from $\eta_{\max}$ to $\eta_{\min}$:

$$\eta(t) = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{\pi t}{T}\right)\right)$$

Cosine decay provides a smooth annealing that spends more iterations at moderate learning rates than step decay does. Loshchilov and Hutter (2016) introduced cosine annealing (with warm restarts) and reported gains over step decay on image classification benchmarks.
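The cosine formula above, as a sketch in Python (the `1e-3` peak below is just a sample value):

```python
import math

def cosine_decay(t, T, eta_max, eta_min=0.0):
    """Cosine annealing from eta_max at t = 0 down to eta_min at t = T."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T))

start = cosine_decay(0, 1000, 1e-3)    # cos(0) = 1, so this equals eta_max
end = cosine_decay(1000, 1000, 1e-3)   # cos(pi) = -1, so this equals eta_min
```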

Linear Warmup

Start from a small $\eta_{\text{init}}$ and increase linearly to $\eta_{\max}$ over $T_w$ warmup steps:

$$\eta(t) = \eta_{\text{init}} + \frac{t}{T_w}(\eta_{\max} - \eta_{\text{init}}), \quad t \leq T_w$$
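In practice, warmup is usually combined with a decay phase. A minimal sketch of the common warmup-then-cosine pattern (the function name and the split into two branches are this example's choices, not a standard API):

```python
import math

def warmup_cosine(t, T, T_w, eta_max, eta_init=0.0, eta_min=0.0):
    """Linear warmup over the first T_w steps, then cosine decay until step T."""
    if t <= T_w:
        return eta_init + (t / T_w) * (eta_max - eta_init)
    progress = (t - T_w) / (T - T_w)  # fraction of the decay phase completed
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * progress))

peak = warmup_cosine(100, 1000, 100, 1e-3)   # end of warmup reaches eta_max
final = warmup_cosine(1000, 1000, 100, 1e-3) # end of training reaches eta_min
```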

Cyclic Learning Rate

Alternate between $\eta_{\min}$ and $\eta_{\max}$ over a cycle of length $2C$ steps:

$$\eta(t) = \eta_{\min} + (\eta_{\max} - \eta_{\min}) \cdot \max\left(0,\; 1 - \left|\frac{t}{C} - 2\left\lfloor \frac{t}{2C} \right\rfloor - 1\right|\right)$$

Smith (2017) showed that cycling can achieve equivalent accuracy in fewer epochs. The key insight: periodically increasing LR helps escape sharp minima. This interacts with the choice of gradient descent variant used.
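The triangular formula above can be transcribed directly; a sketch with illustrative bounds of $10^{-4}$ and $10^{-3}$:

```python
def triangular_lr(t, C, eta_min, eta_max):
    """Triangular cyclic schedule: ramp up over C steps, then down over C steps."""
    # Distance from the peak of the current cycle: 0 at the peak (t = C, 3C, ...),
    # 1 at the cycle boundaries (t = 0, 2C, 4C, ...).
    cycle_pos = abs(t / C - 2 * (t // (2 * C)) - 1)
    return eta_min + (eta_max - eta_min) * max(0.0, 1 - cycle_pos)

# One full cycle of length 2C = 200: min at t = 0, peak at t = 100, min at t = 200.
cycle_rates = [triangular_lr(t, 100, 1e-4, 1e-3) for t in (0, 100, 200)]
```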

1Cycle Policy

A single cycle: warm up from $\eta_{\min}$ to $\eta_{\max}$ over the first ~30% of training, then decay back to $\eta_{\min}$ (or below) for the remaining 70%. Smith and Topin (2019) demonstrated "super-convergence," where 1cycle training reaches the same accuracy in up to 10x fewer iterations.
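A simplified sketch of the shape described above. This omits parts of the full 1cycle recipe (which also cycles momentum and adds a final phase that anneals below $\eta_{\min}$); it only shows the linear up-then-down learning rate path, with the peak step `T_up` passed in explicitly:

```python
def one_cycle(t, T, T_up, eta_min, eta_max):
    """Simplified 1cycle LR path: linear ramp to eta_max by step T_up, then back."""
    if t <= T_up:
        return eta_min + (t / T_up) * (eta_max - eta_min)
    return eta_max - ((t - T_up) / (T - T_up)) * (eta_max - eta_min)

# Peak placed at 30% of a 1000-step run.
policy_rates = [one_cycle(t, 1000, 300, 1e-4, 1e-3) for t in (0, 300, 1000)]
```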

Why Warmup Matters for Transformers

Warmup is not optional for transformer training with Adam. The reason is specific to Adam's mechanics.

Adam divides the gradient by $\sqrt{\hat{v}_t} + \epsilon$, where $\hat{v}_t$ is the bias-corrected second moment estimate. At initialization, $v_0 = 0$. During the first few steps, $\hat{v}_t$ is estimated from very few gradient samples. This estimate has high variance, so the effective step sizes are unreliable.

With warmup, the small initial LR limits damage from these noisy early updates. By the time the LR reaches its peak, Adam's second moment estimates have stabilized.

Without warmup, transformers often diverge in the first few hundred steps. This failure mode is especially common with large batch sizes, where each gradient estimate is more precise but the second moment still needs time to accumulate.
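To make the mechanics concrete, here is a minimal scalar Adam update (a sketch using the standard default hyperparameters, not tied to any particular library). Note what happens at $t = 1$ from fresh state: $\hat{m}_1 = g$ and $\hat{v}_1 = g^2$, so the update magnitude is $\approx \eta$ regardless of the gradient's scale. Keeping $\eta$ small during warmup is what limits the damage from those early, poorly calibrated steps.

```python
import math

def adam_step(theta, g, m, v, t, eta, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adam update at step t (1-indexed) with learning rate eta."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# First step from fresh state: the update size is ~eta for any gradient scale.
th_small, _, _ = adam_step(1.0, g=0.5, m=0.0, v=0.0, t=1, eta=0.01)
th_large, _, _ = adam_step(1.0, g=100.0, m=0.0, v=0.0, t=1, eta=0.01)
```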

Main Theorems

Theorem

Convergence Rate with Decaying Learning Rate

Statement

For gradient descent on an $L$-smooth convex function $f$ with a learning rate schedule $\eta(t)$ satisfying $\sum_{t=0}^{\infty} \eta(t) = \infty$ and $\sum_{t=0}^{\infty} \eta(t)^2 < \infty$, the iterates satisfy:

$$\min_{t \leq T} \|\nabla f(\theta_t)\|^2 \leq \frac{f(\theta_0) - f^* + \frac{L}{2} G^2 \sum_{t=0}^{T}\eta(t)^2}{\sum_{t=0}^{T}\eta(t)}$$

where $G$ bounds the gradient norm and $f^*$ is the minimum value.

Intuition

The numerator stays bounded (the sum of squared learning rates converges). The denominator grows without bound (the sum of learning rates diverges). So the best gradient norm seen so far goes to zero. The schedule $\eta(t) = c/(t+1)$ satisfies both conditions; $\eta(t) = c/\sqrt{t+1}$ satisfies the first but not the second, since its squares form the divergent harmonic series.
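A quick numeric check of the two sums for $\eta(t) = 1/(t+1)$ (a sketch; the truncation at 10,000 terms is arbitrary):

```python
def partial_sums(schedule, T):
    """Return (sum of eta(t), sum of eta(t)^2) for t = 0, ..., T - 1."""
    total = sum(schedule(t) for t in range(T))
    total_sq = sum(schedule(t) ** 2 for t in range(T))
    return total, total_sq

# eta(t) = 1/(t+1): the plain sum grows like log T (diverges as T -> infinity),
# while the squared sum approaches pi^2 / 6 (converges), so both conditions hold.
s, s_sq = partial_sums(lambda t: 1.0 / (t + 1), 10_000)
```

Running `partial_sums` at a few values of `T` makes the contrast visible: the first component keeps growing while the second plateaus.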

Proof Sketch

Start from the descent lemma: $f(\theta_{t+1}) \leq f(\theta_t) - \eta(t)\|\nabla f(\theta_t)\|^2 + \frac{L}{2}\eta(t)^2 G^2$. Telescope the sum over $T$ steps. Rearrange to isolate $\sum_t \eta(t)\|\nabla f(\theta_t)\|^2$. Bound the minimum by the weighted average.

Why It Matters

This theorem explains why the Robbins-Monro conditions ($\sum_t \eta_t = \infty$, $\sum_t \eta_t^2 < \infty$) appear in every optimization textbook. They are the standard sufficient conditions for a schedule to guarantee convergence.

Failure Mode

The theorem assumes convexity. Neural network loss landscapes are non-convex. In practice, schedules that violate the $\sum_t \eta_t^2 < \infty$ condition (like cyclic LR) can outperform theoretically valid ones. The theory does not explain super-convergence.

Common Confusions

Watch Out

Learning rate vs effective learning rate in Adam

When using Adam, the actual step size is $\eta / (\sqrt{\hat{v}_t} + \epsilon)$, not $\eta$ alone. Changing the learning rate schedule changes the numerator of this fraction. The effective step also depends on the gradient history through $\hat{v}_t$. This means the same $\eta$ schedule behaves differently with Adam than with SGD.
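A small sketch of this effect (the helper name and the two synthetic gradient histories are this example's choices): tracking $\eta / (\sqrt{\hat{v}_t} + \epsilon)$ for two histories with the same $\eta$ shows very different effective multipliers.

```python
import math

def effective_steps(grads, eta, beta2=0.999, eps=1e-8):
    """Effective Adam multipliers eta / (sqrt(v_hat) + eps) along a gradient history."""
    v, out = 0.0, []
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1 - beta2) * g * g
        v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
        out.append(eta / (math.sqrt(v_hat) + eps))
    return out

# Same eta, different gradient scales -> very different effective step sizes.
small_grads = effective_steps([0.01] * 5, eta=1e-3)
large_grads = effective_steps([10.0] * 5, eta=1e-3)
```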

Watch Out

Warmup is not about the model, it is about the optimizer

Warmup is sometimes described as "letting the model learn the data distribution gradually." This is wrong. Warmup compensates for the optimizer state being poorly initialized. With SGD (which has no adaptive state), warmup is less critical. With Adam, it is nearly mandatory for large models.

Watch Out

Cosine decay is not a single schedule

Cosine decay requires choosing $\eta_{\max}$, $\eta_{\min}$, and $T$. Restarting the cosine (SGDR) further adds restart period and multiplier parameters. The cosine shape itself is not the key; the smooth annealing is.

Key Takeaways

  • Learning rate schedule often matters more than architecture choices
  • Warmup is required for Adam on transformers because the second moment needs time to stabilize
  • Cosine decay outperforms step decay in most modern settings
  • Cyclic LR and 1cycle can achieve "super-convergence" with fewer iterations
  • The Robbins-Monro conditions ($\sum_t \eta_t = \infty$, $\sum_t \eta_t^2 < \infty$) are sufficient for convergence in the convex case

Exercises

ExerciseCore

Problem

A cosine decay schedule runs for $T = 1000$ steps with $\eta_{\max} = 0.001$ and $\eta_{\min} = 0$. What is the learning rate at step $t = 500$?

ExerciseAdvanced

Problem

Show that the schedule $\eta(t) = c/(t+1)$ satisfies the Robbins-Monro conditions, while $\eta(t) = c/\sqrt{t+1}$ violates the square-summability condition. Then compute $\sum_{t=0}^{T} \eta(t)$ for both and explain why the faster-decaying $c/(t+1)$ schedule is often too conservative in practice.

References

Canonical:

  • Robbins & Monro, "A Stochastic Approximation Method" (1951)
  • Bottou, "Stochastic Gradient Descent Tricks" in Neural Networks: Tricks of the Trade, Chapter 1

Current:

  • Loshchilov & Hutter, "SGDR: Stochastic Gradient Descent with Warm Restarts" (2016)
  • Smith, "Cyclical Learning Rates for Training Neural Networks" (2017)
  • Smith & Topin, "Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates" (2019)
  • Gotmare et al., "A Closer Look at Deep Learning Heuristics: Learning Rate Restarts, Warmup and Distillation" (2019), Section 3

Last reviewed: April 2026
