Training Techniques
Curriculum Learning
Train on easy examples first, then gradually increase difficulty. Curriculum learning can speed convergence and improve generalization, but defining difficulty is the hard part. We cover self-paced learning, anti-curriculum, and the connection to importance sampling.
Why This Matters
Humans learn arithmetic before calculus. We learn letters before words, words before sentences. The order in which examples are presented affects learning speed and final performance.
Curriculum learning applies this principle to machine learning: present training examples in order of increasing difficulty. Bengio et al. (2009) showed that this can both speed up convergence and lead to better local optima. The idea is simple. The difficulty is defining "easy" and implementing the schedule.
Formal Setup
Curriculum
A curriculum is a sequence of distributions $Q_1, \dots, Q_T$ over training examples, where the support of $Q_t$ is typically a subset of the full training set, and difficulty increases with $t$. The final distribution $Q_T$ is the uniform distribution over all training examples.
Formally, let $w_t(i) \ge 0$ be a weighting function over examples at step $t$. The curriculum defines a weighted empirical risk:

$$R_t(\theta) = \frac{1}{\sum_{i=1}^n w_t(i)} \sum_{i=1}^n w_t(i)\,\ell(f_\theta(x_i), y_i)$$

A curriculum schedules $w_t$ so that easy examples have higher weight early in training.
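A minimal sketch of one such schedule, assuming a precomputed difficulty score per example (the linear pacing function, `start_frac`, and binary weights are illustrative choices, not part of the definition):

```python
import numpy as np

def curriculum_weights(difficulty, t, T, start_frac=0.2):
    """Binary curriculum weights w_t(i): at step t, include only the
    easiest fraction of examples, growing linearly from start_frac to 1."""
    frac = start_frac + (1.0 - start_frac) * min(t / T, 1.0)
    k = max(1, int(frac * len(difficulty)))        # how many examples to include
    threshold = np.sort(difficulty)[k - 1]         # difficulty cutoff for inclusion
    return (difficulty <= threshold).astype(float)

# Toy difficulty scores: low = easy, high = hard.
d = np.array([0.1, 0.9, 0.3, 0.7, 0.5])
w_early = curriculum_weights(d, t=0, T=100)    # only the easiest example
w_late = curriculum_weights(d, t=100, T=100)   # full training set
```

After normalization, these weights can serve either as per-example loss weights or as sampling probabilities for minibatch construction.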
Difficulty Score
A difficulty score $d(x_i, y_i)$ assigns a scalar to each training example measuring how hard it is to learn. Common choices:
- Loss-based: $d(x_i, y_i) = \ell(f_{\theta_0}(x_i), y_i)$, where $\theta_0$ is an initial or pretrained model
- Confidence-based: $d(x_i, y_i) = 1 - p_{\theta_0}(y_i \mid x_i)$, the reference model's lack of confidence in the true label
- Human-defined: annotator agreement, label noise estimates
- Data complexity: input length, number of objects in an image
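The loss-based and confidence-based scores can be computed directly from a reference model's predicted class probabilities; a small sketch with invented numbers:

```python
import numpy as np

def loss_based_difficulty(probs, labels):
    """Cross-entropy of a reference model at the true label.
    Higher loss = harder example."""
    p_true = probs[np.arange(len(labels)), labels]
    return -np.log(np.clip(p_true, 1e-12, None))

def confidence_based_difficulty(probs, labels):
    """One minus the reference model's probability of the true label."""
    return 1.0 - probs[np.arange(len(labels)), labels]

# Reference model's softmax outputs for 3 examples, 2 classes (toy values).
probs = np.array([[0.9, 0.1], [0.4, 0.6], [0.5, 0.5]])
labels = np.array([0, 0, 1])
d_loss = loss_based_difficulty(probs, labels)
d_conf = confidence_based_difficulty(probs, labels)
# Both scores rank example 1 (p_true = 0.4) as hardest, example 0 as easiest.
```

Because both scores are monotone in $p_{\theta_0}(y_i \mid x_i)$, they induce the same ordering of examples; they differ only in scale.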
Why Curricula Can Help
Curriculum as Continuation Method
Statement
Consider a sequence of objectives $L_1, \dots, L_T$, where $L_1$ is a smoothed version of the full loss $L_T$ (using only easy examples). If the minimizer of each $L_t$ lies in the basin of attraction of a good minimizer of $L_{t+1}$, then optimizing the sequence converges to a better local minimum of $L_T$ than optimizing $L_T$ directly.
Intuition
Easy examples define a simpler loss landscape with fewer, wider local minima. Starting optimization on this simpler landscape finds a broadly correct solution. Gradually adding harder examples refines this solution without getting trapped in poor local optima that would capture optimization starting from random initialization on the full loss.
Proof Sketch
This is an instance of a continuation method (homotopy method). Define $L_\lambda = (1 - \lambda) L_{\text{easy}} + \lambda L_{\text{full}}$ and increase $\lambda$ from 0 to 1. The trajectory of minimizers $\theta^*(\lambda)$ is continuous if the losses are smooth. The assumption that $\theta^*(\lambda)$ is in the basin of attraction of the next objective's minimizer ensures the optimizer follows this trajectory.
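The continuation idea can be shown on a toy 1-D problem. Everything here is invented for illustration (the smooth "easy" loss, the wiggly "full" loss, and the $\lambda$ schedule): warm-starting each blended objective from the previous minimizer reaches a better minimum of the full loss than direct gradient descent from the same initialization, which gets trapped in a spurious basin.

```python
import numpy as np

def grad_descent(grad, x0, steps=500, lr=0.01):
    """Plain gradient descent on a 1-D objective."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Toy losses: L_easy = (x - 1)^2 has one wide basin;
# L_full = (x - 1)^2 - 2 cos(3x) adds spurious local minima.
L_easy_grad = lambda x: 2 * (x - 1)
L_full_grad = lambda x: 2 * (x - 1) + 6 * np.sin(3 * x)

def blended_grad(x, lam):
    """Gradient of L_lambda = (1 - lambda) * L_easy + lambda * L_full."""
    return (1 - lam) * L_easy_grad(x) + lam * L_full_grad(x)

# Continuation: warm-start each stage from the previous minimizer.
x = grad_descent(L_easy_grad, x0=-3.0)             # start on the smooth loss
for lam in np.linspace(0.25, 1.0, 4):
    x = grad_descent(lambda z: blended_grad(z, lam), x0=x)

# Baseline: optimize the full loss directly from the same start.
x_direct = grad_descent(L_full_grad, x0=-3.0)
```

The direct run converges to a poor local minimum on the far side of a barrier, while the continuation run tracks the minimizer as $\lambda$ grows and ends near the good minimum.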
Why It Matters
Non-convex optimization is the central difficulty of training neural networks. Any method that reliably guides optimization toward better minima is valuable. Curriculum learning is one of the few such methods with both theoretical motivation (continuation methods) and empirical support.
Failure Mode
The assumption that smoothed objectives lead to the basin of a good final minimizer is not guaranteed. If easy and hard examples require qualitatively different features, the curriculum can guide the model to a local minimum that is good for easy examples but poor overall.
Self-Paced Learning
Kumar, Packer, and Koller (2010): instead of fixing the curriculum in advance, let the model decide what is easy. At each step, include examples that the current model can handle (low loss) and exclude examples it cannot.
Self-Paced Learning as Joint Optimization
Statement
Self-paced learning solves:

$$\min_{\theta,\ v \in \{0,1\}^n} \sum_{i=1}^n v_i\,\ell(f_\theta(x_i), y_i) \;-\; \lambda \sum_{i=1}^n v_i$$

where $v_i \in \{0, 1\}$ are selection weights and $\lambda > 0$ is a pace parameter. For fixed $\theta$, the optimal $v$ thresholds on loss: include examples with loss below $\lambda$ and exclude the rest. Increasing $\lambda$ over training adds harder examples.
Intuition
The term $-\lambda \sum_i v_i$ rewards including examples (without it, setting all $v_i = 0$ trivially minimizes the objective). The balance between fitting included examples and the reward for inclusion yields a threshold: examples with loss below $\lambda$ are included, harder ones are excluded.
Proof Sketch
For fixed $\theta$, minimize over each $v_i$ independently. The objective for each $i$ is $v_i(\ell_i - \lambda)$, which is minimized by $v_i = 1$ if $\ell_i < \lambda$ and $v_i = 0$ if $\ell_i > \lambda$ (either value is optimal at $\ell_i = \lambda$). This gives the threshold rule.
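The alternating scheme (fix $\theta$, solve for $v$ in closed form; fix $v$, refit $\theta$) can be sketched on a toy mean-estimation problem. The data, the squared loss, and the $\lambda$ schedule are all illustrative:

```python
import numpy as np

def spl_select(losses, lam):
    """Closed-form optimal v for fixed theta: v_i = 1 iff ell_i < lambda."""
    return losses < lam

def self_paced_mean(y, lam_schedule):
    """Estimate a mean by alternating: refit theta on the selected
    examples, then reselect with a growing pace parameter lambda."""
    theta = np.median(y)                 # crude initialization
    for lam in lam_schedule:
        losses = (y - theta) ** 2        # per-example squared loss
        v = spl_select(losses, lam)
        if v.any():
            theta = y[v].mean()          # refit on included examples only
    return theta

y = np.array([1.0, 1.2, 0.8, 1.1, 10.0])   # 10.0 is an outlier
theta = self_paced_mean(y, lam_schedule=[0.5, 1.0, 2.0])
```

Here the outlier's loss stays above every $\lambda$ in the schedule, so it is never selected and the fit stays on the inliers; with a schedule that grows $\lambda$ far enough, it would eventually be included.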
Why It Matters
Self-paced learning removes the need for external difficulty labels. The model's own loss serves as the difficulty measure. This makes curriculum learning applicable even when human difficulty labels are unavailable.
Failure Mode
Self-paced learning can ignore hard but important examples indefinitely if $\lambda$ increases too slowly. It can also create a feedback loop: the model never trains on hard examples, so they always have high loss, so they are never included.
Anti-Curriculum: Hard Examples First
Bengio et al. (2009) found curricula generally help, but subsequent work showed exceptions. Presenting hard or diverse examples first can sometimes work better, especially when:
- Hard examples contain the most information about decision boundaries.
- Easy examples are redundant (the model learns them quickly regardless).
- The difficulty ordering is inaccurate.
This connects to importance sampling: upweighting examples with high loss (hard examples) reduces the variance of the gradient estimator. The optimal importance sampling distribution is proportional to the per-example gradient norm, which correlates with difficulty.
Connection to Importance Sampling
The gradient of the empirical risk is $\nabla R(\theta) = \frac{1}{n} \sum_{i=1}^n \nabla \ell(f_\theta(x_i), y_i)$. If we sample example $i$ with probability $p_i \propto \|\nabla \ell_i\|$ instead of uniformly, reweighting each sampled gradient by $1/(n p_i)$ to keep the estimator unbiased, the variance of the stochastic gradient decreases.
This is the opposite of curriculum learning: importance sampling upweights hard examples (high gradient norm) while curriculum learning downweights them. The resolution is that they solve different problems. Curriculum learning addresses non-convex optimization (finding good basins), while importance sampling addresses stochastic gradient variance (faster convergence within a basin).
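A small numerical check of the variance claim, using synthetic per-example gradients (the dimensions, the heavy-tailed ×20 scaling, and the batch size are arbitrary): sampling proportional to gradient norm, with the $1/(n p_i)$ reweighting that keeps the estimator unbiased, gives lower mean squared error than uniform sampling.

```python
import numpy as np

rng = np.random.default_rng(0)
grads = rng.normal(0.0, 1.0, size=(1000, 5))   # per-example gradients
grads[:50] *= 20                                # a few dominate the variance
true_grad = grads.mean(axis=0)
n = len(grads)

# Importance distribution: p_i proportional to the gradient norm.
p = np.linalg.norm(grads, axis=1)
p /= p.sum()

def mse_of_estimator(probs, batch=32, trials=2000):
    """Average squared error of the reweighted minibatch gradient."""
    errs = []
    for _ in range(trials):
        idx = rng.choice(n, size=batch, p=probs)
        # Unbiased: each sampled gradient is reweighted by 1 / (n * p_i).
        est = (grads[idx] / (n * probs[idx, None])).sum(axis=0) / batch
        errs.append(((est - true_grad) ** 2).sum())
    return float(np.mean(errs))

mse_uniform = mse_of_estimator(np.full(n, 1.0 / n))
mse_importance = mse_of_estimator(p)
```

Both estimators target the same full gradient; only their variance differs, which is why importance sampling accelerates convergence within a basin rather than changing which basin is found.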
Common Confusions
Curriculum learning is not the same as data augmentation
Data augmentation adds modified copies of examples. Curriculum learning changes the order or weighting of existing examples. They are complementary and can be combined.
Shuffling is not a curriculum
Standard training shuffles data randomly each epoch. A curriculum is a deliberate ordering. Random shuffling is the control against which curriculum benefits are measured.
Why Curriculum Learning Is Underused
Despite positive results, curriculum learning is rare in practice because:
- Difficulty is hard to define: loss-based difficulty changes during training, human annotations are expensive, data complexity metrics are domain-specific.
- Hyperparameter sensitivity: the pace schedule (how fast to add harder examples) requires tuning.
- Modern regularization works well: dropout, data augmentation, and large batch training often provide sufficient generalization without curricula.
- Conflicting evidence: results vary across tasks and architectures, with no clear consensus on when curricula help.
Exercises
Problem
In self-paced learning with pace parameter $\lambda$, which examples are included, in terms of the per-example losses $\ell_1, \dots, \ell_n$?
Problem
Explain why curriculum learning and importance sampling give opposite prescriptions for example weighting. Under what conditions would you prefer each approach?
References
Canonical:
- Bengio et al., "Curriculum Learning", ICML 2009
- Kumar, Packer, Koller, "Self-Paced Learning for Latent Variable Models", NeurIPS 2010
Current:
- Soviany et al., "Curriculum Learning: A Survey", IJCV 2022
- Katharopoulos and Fleuret, "Not All Samples Are Created Equal: Deep Learning with Importance Sampling", ICML 2018
Last reviewed: April 2026