
Training Techniques

Curriculum Learning

Train on easy examples first, gradually increase difficulty. Curriculum learning can speed convergence and improve generalization, but defining difficulty is the hard part. This section covers self-paced learning, anti-curriculum strategies, and the connection to importance sampling.


Why This Matters

Humans learn arithmetic before calculus. We learn letters before words, words before sentences. The order in which examples are presented affects learning speed and final performance.

Curriculum learning applies this principle to machine learning: present training examples in order of increasing difficulty. Bengio et al. (2009) showed that this can both speed up convergence and lead to better local optima. The idea is simple. The difficulty is defining "easy" and implementing the schedule.

Formal Setup

Definition

Curriculum

A curriculum is a sequence of distributions $D_1, D_2, \ldots, D_T$ over training examples, where the support of $D_t$ is typically a subset of the full training set, and difficulty increases with $t$. The final distribution $D_T$ is the uniform distribution over all training examples.

Formally, let $w(x, t) \geq 0$ be a weighting function over examples $x$ at step $t$. The curriculum defines a weighted empirical risk:

$$\hat{R}_t(h) = \frac{1}{n} \sum_{i=1}^{n} w(x_i, t) \cdot \ell(h(x_i), y_i)$$

A curriculum schedules $w(x_i, t)$ so that easy examples have higher weight early in training.
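As a concrete sketch, the weighted risk can be computed directly from per-example losses and curriculum weights. The `curriculum_risk` helper and the numbers below are illustrative, not from the original text:

```python
import numpy as np

def curriculum_risk(losses, weights):
    """Weighted empirical risk: (1/n) * sum_i w(x_i, t) * loss_i."""
    losses = np.asarray(losses, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.mean(weights * losses))

# Early in training: easy examples (low loss) get weight 1, hard ones 0.
losses = np.array([0.1, 0.9, 0.2, 1.5])
early_weights = np.array([1.0, 0.0, 1.0, 0.0])
late_weights = np.ones(4)  # final step: uniform weight on all examples

print(curriculum_risk(losses, early_weights))  # (0.1 + 0.2) / 4 = 0.075
print(curriculum_risk(losses, late_weights))   # 2.7 / 4 = 0.675
```

Note that the $1/n$ normalization is over all $n$ examples even when hard ones are zero-weighted; some implementations instead normalize by $\sum_i w(x_i, t)$.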

Definition

Difficulty Score

A difficulty score $d(x_i)$ assigns a scalar to each training example measuring how hard it is to learn. Common choices:

  1. Loss-based: $d(x_i) = \ell(h_0(x_i), y_i)$, where $h_0$ is an initial or pretrained model
  2. Confidence-based: $d(x_i) = 1 - p(y_i \mid x_i; h_0)$
  3. Human-defined: annotator agreement, label noise estimates
  4. Data complexity: input length, number of objects in an image
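A minimal sketch of the first two scores, assuming a hypothetical initial model $h_0$ that outputs class probabilities. The `probs` and `labels` arrays below are made-up stand-ins:

```python
import numpy as np

def loss_based_difficulty(probs, labels):
    """d(x_i) as the cross-entropy loss of h_0: -log p(y_i | x_i; h_0)."""
    probs = np.asarray(probs, dtype=float)
    return -np.log(probs[np.arange(len(labels)), labels])

def confidence_based_difficulty(probs, labels):
    """d(x_i) = 1 - p(y_i | x_i; h_0)."""
    probs = np.asarray(probs, dtype=float)
    return 1.0 - probs[np.arange(len(labels)), labels]

# Class probabilities from a hypothetical initial model h_0 (2 classes).
probs = np.array([[0.9, 0.1],
                  [0.4, 0.6],
                  [0.2, 0.8]])
labels = np.array([0, 0, 1])

d = confidence_based_difficulty(probs, labels)
print(d)              # approximately [0.1, 0.6, 0.2]
print(np.argsort(d))  # easy-to-hard order: [0, 2, 1]
```

Sorting (or thresholding) by `d` is then what defines the curriculum schedule.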

Why Curricula Can Help

Proposition

Curriculum as Continuation Method

Statement

Consider a sequence of objectives $L_1, L_2, \ldots, L_T = L$ where $L_t$ is a smoothed version of the full loss (using only easy examples). If the minimizer of $L_t$ lies in the basin of attraction of a good minimizer of $L_{t+1}$, then optimizing the sequence $L_1 \to L_2 \to \ldots \to L_T$ converges to a better local minimum of $L$ than optimizing $L$ directly.

Intuition

Easy examples define a simpler loss landscape with fewer, wider local minima. Starting optimization on this simpler landscape finds a broadly correct solution. Gradually adding harder examples refines this solution without getting trapped in poor local optima that would capture optimization starting from random initialization on the full loss.

Proof Sketch

This is an instance of a continuation method (homotopy method). Define $L_\lambda = (1-\lambda) L_{\text{easy}} + \lambda L_{\text{full}}$ and increase $\lambda$ from 0 to 1. The trajectory of minimizers $w^*(\lambda)$ is continuous if the losses are smooth. The assumption that $w^*(\lambda_t)$ lies in the basin of attraction of $w^*(\lambda_{t+1})$ ensures the optimizer follows this trajectory.
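The continuation idea can be illustrated on a toy one-dimensional problem. The specific losses, learning rate, and schedule below are illustrative assumptions, not from the text: a smooth "easy" quadratic is blended with a "full" loss whose oscillatory term creates spurious local minima.

```python
import numpy as np

# L_easy(w) = (w - 1)^2 has a single minimum; L_full adds an oscillatory
# term 3*sin(5w) to its gradient, creating spurious local minima.
grad_easy = lambda w: 2.0 * (w - 1.0)
grad_full = lambda w: 2.0 * (w - 1.0) + 3.0 * np.sin(5.0 * w)

def descend(w, grad, steps=200, lr=0.05):
    """Plain gradient descent from initialization w."""
    for _ in range(steps):
        w -= lr * grad(w)
    return w

# Direct optimization of the full loss from a poor initialization
# stalls in a spurious local minimum near w = 0.13.
w_direct = descend(-2.0, grad_full, steps=2000)

# Continuation: minimize L_lambda = (1-lam) L_easy + lam L_full for
# lam: 0 -> 1, warm-starting each stage from the previous minimizer.
w = -2.0
for lam in np.linspace(0.0, 1.0, 20):
    g = lambda w, lam=lam: (1.0 - lam) * grad_easy(w) + lam * grad_full(w)
    w = descend(w, g)

print(w_direct)  # stuck near w ~ 0.13 (higher full loss)
print(w)         # tracks the better minimum near w ~ 1.23
```

The continuation run follows the minimizer trajectory $w^*(\lambda)$ and ends in a deeper basin of the full loss than direct descent does.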

Why It Matters

Non-convex optimization is the central difficulty of training neural networks. Any method that reliably guides optimization toward better minima is valuable. Curriculum learning is one of the few such methods with both theoretical motivation (continuation methods) and empirical support.

Failure Mode

The assumption that smoothed objectives lead to the basin of a good final minimizer is not guaranteed. If easy and hard examples require qualitatively different features, the curriculum can guide the model to a local minimum that is good for easy examples but poor overall.

Self-Paced Learning

Kumar, Packer, and Koller (2010): instead of fixing the curriculum in advance, let the model decide what is easy. At each step, include examples that the current model can handle (low loss) and exclude examples it cannot.

Proposition

Self-Paced Learning as Joint Optimization

Statement

Self-paced learning solves:

$$\min_{w, v} \; \sum_{i=1}^{n} v_i \, \ell(h_w(x_i), y_i) - \lambda \sum_{i=1}^{n} v_i$$

where $v_i \in [0, 1]$ are selection weights and $\lambda > 0$ is a pace parameter. For fixed $w$, the optimal weights are $v_i^* = \mathbf{1}[\ell(h_w(x_i), y_i) < \lambda]$: include examples with loss below $\lambda$ and exclude the rest. Increasing $\lambda$ over training adds harder examples.

Intuition

The $-\lambda \sum_i v_i$ term rewards including examples (without it, setting all $v_i = 0$ trivially minimizes the objective). The balance between fitting included examples and the reward for inclusion yields a threshold: examples easier than $\lambda$ are included, harder ones are excluded.

Proof Sketch

For fixed $w$, minimize over each $v_i$ independently. The objective for example $i$ is $v_i(\ell_i - \lambda)$, which is minimized by $v_i = 1$ if $\ell_i < \lambda$ and $v_i = 0$ if $\ell_i > \lambda$. This gives the threshold rule.
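The inner minimization is a one-line threshold. A minimal sketch, with made-up losses and pace values:

```python
import numpy as np

def spl_select(losses, lam):
    """Closed-form solution of the self-paced inner problem for fixed w:
    v_i* = 1 if loss_i < lambda else 0."""
    return (np.asarray(losses, dtype=float) < lam).astype(float)

losses = np.array([0.05, 1.2, 0.4, 0.9])   # illustrative per-example losses
print(spl_select(losses, lam=0.5))  # [1. 0. 1. 0.]

# Increasing the pace parameter admits harder examples.
print(spl_select(losses, lam=1.0))  # [1. 0. 1. 1.]
```

In full self-paced learning this selection alternates with a gradient step on $w$ over the currently included examples, while $\lambda$ is slowly increased.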

Why It Matters

Self-paced learning removes the need for external difficulty labels. The model's own loss serves as the difficulty measure. This makes curriculum learning applicable even when human difficulty labels are unavailable.

Failure Mode

Self-paced learning can ignore hard but important examples indefinitely if $\lambda$ increases too slowly. It can also create a feedback loop: the model never trains on hard examples, so they always have high loss, so they are never included.

Anti-Curriculum: Hard Examples First

Bengio et al. (2009) found curricula generally help, but subsequent work showed exceptions. Presenting hard or diverse examples first can sometimes work better, especially when:

  1. Hard examples contain the most information about decision boundaries.
  2. Easy examples are redundant (the model learns them quickly regardless).
  3. The difficulty ordering is inaccurate.

This connects to importance sampling: upweighting examples with high loss (hard examples) reduces the variance of the gradient estimator. The optimal importance sampling distribution is proportional to the per-example gradient norm, which correlates with difficulty.

Connection to Importance Sampling

The gradient of the empirical risk is $\nabla L = \frac{1}{n} \sum_i \nabla \ell_i$. If we sample examples with probability $p_i \propto \|\nabla \ell_i\|$ instead of uniformly, and reweight each sampled gradient by $1/(n p_i)$ to keep the estimator unbiased, the variance of the stochastic gradient decreases.

This is the opposite of curriculum learning: importance sampling upweights hard examples (high gradient norm) while curriculum learning downweights them. The resolution is that they solve different problems. Curriculum learning addresses non-convex optimization (finding good basins), while importance sampling addresses stochastic gradient variance (faster convergence within a basin).
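A small simulation of this variance reduction, using random vectors as stand-ins for per-example gradients (all names and numbers below are illustrative assumptions). Sampling with $p_i \propto \|\nabla \ell_i\|$ and reweighting each draw by $1/(n p_i)$ keeps the estimator unbiased while shrinking its variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for per-example gradients (rows = examples), with a few
# "hard" examples that have much larger gradient norm.
grads = rng.normal(size=(1000, 10))
grads[:50] *= 10.0

norms = np.linalg.norm(grads, axis=1)
p = norms / norms.sum()          # importance distribution p_i ~ ||grad_i||
n = len(grads)

def sg_estimate(probs, k=100):
    """Unbiased stochastic gradient: average of grad_i / (n * p_i)
    over k examples drawn with the given probabilities."""
    idx = rng.choice(n, size=k, p=probs)
    return (grads[idx] / (n * probs[idx, None])).mean(axis=0)

uniform = np.full(n, 1.0 / n)
var_unif = np.var([sg_estimate(uniform) for _ in range(200)], axis=0).sum()
var_imp = np.var([sg_estimate(p) for _ in range(200)], axis=0).sum()
print(var_imp < var_unif)  # True: importance sampling reduces variance
```

Both estimators have the same expectation (the full-batch gradient); the importance-sampled one simply concentrates samples where gradients are large, which is why it prescribes the opposite weighting from a curriculum.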

Common Confusions

Watch Out

Curriculum learning is not the same as data augmentation

Data augmentation adds modified copies of examples. Curriculum learning changes the order or weighting of existing examples. They are complementary and can be combined.

Watch Out

Shuffling is not a curriculum

Standard training shuffles data randomly each epoch. A curriculum is a deliberate ordering. Random shuffling is the control against which curriculum benefits are measured.

Why Curriculum Learning Is Underused

Despite positive results, curriculum learning is rare in practice because:

  1. Difficulty is hard to define: loss-based difficulty changes during training, human annotations are expensive, data complexity metrics are domain-specific.
  2. Hyperparameter sensitivity: the pace schedule (how fast to add harder examples) requires tuning.
  3. Modern regularization works well: dropout, data augmentation, and large batch training often provide sufficient generalization without curricula.
  4. Conflicting evidence: results vary across tasks and architectures, with no clear consensus on when curricula help.

Exercises

ExerciseCore

Problem

In self-paced learning with pace parameter $\lambda = 0.5$, which examples are included if the per-example losses are $\ell = (0.1, 0.8, 0.3, 0.6, 0.2)$?

ExerciseAdvanced

Problem

Explain why curriculum learning and importance sampling give opposite prescriptions for example weighting. Under what conditions would you prefer each approach?

References

Canonical:

  • Bengio et al., "Curriculum Learning", ICML 2009
  • Kumar, Packer, Koller, "Self-Paced Learning for Latent Variable Models", NeurIPS 2010

Current:

  • Soviany et al., "Curriculum Learning: A Survey", IJCV 2022
  • Katharopoulos and Fleuret, "Not All Samples Are Created Equal: Deep Learning with Importance Sampling", ICML 2018

Last reviewed: April 2026