AI Safety
Continual Learning and Forgetting
Learning sequentially without destroying previous knowledge: Elastic Weight Consolidation, progressive networks, replay methods, and the stability-plasticity tradeoff in deployed systems.
Why This Matters
Real-world ML systems must learn new tasks over time. A deployed model needs to handle new product categories, new languages, or updated regulations without retraining from scratch on all historical data. Naive sequential training destroys old knowledge. Continual learning studies how to avoid this.
The problem is broader than catastrophic forgetting alone. Forgetting describes what goes wrong. Continual learning studies what to do about it: algorithms, architectures, and training protocols that allow sequential learning while preserving performance on earlier tasks.
Problem Formulation
Continual Learning Setting
A model encounters a sequence of tasks $T_1, T_2, \ldots, T_K$. For task $T_k$, training data $D_k$ is available. After training on $T_k$, data from previous tasks may be partially or fully unavailable. The goal is to find parameters $\theta$ that perform well on all tasks:

$$\min_\theta \sum_{k=1}^{K} \mathcal{L}_k(\theta)$$

The constraint is that you cannot store or revisit all of $D_1, \ldots, D_{k-1}$.
Stability-Plasticity Tradeoff
Stability is the ability to retain performance on old tasks. Plasticity is the ability to learn new tasks. These are in tension: a model that never changes its weights is perfectly stable but has zero plasticity. A model that freely overwrites weights has maximum plasticity but zero stability. Every continual learning method navigates this tradeoff.
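The tension is visible even in a one-parameter toy model: unconstrained fine-tuning on a second target erases the first solution, while an anchoring penalty (a preview of EWC-style regularization) trades some plasticity for stability. The model, targets, and penalty strength below are purely illustrative:

```python
def train(w, target, steps=200, lr=0.1, anchor=None, lam=0.0):
    """Minimize (w - target)^2 by gradient descent, optionally adding an
    anchoring pull lam * (w - anchor)^2 toward an old solution."""
    for _ in range(steps):
        grad = 2 * (w - target)
        if anchor is not None:
            grad += 2 * lam * (w - anchor)
        w -= lr * grad
    return w

w_task_a = train(0.0, target=1.0)                            # task A optimum: w ~ 1
w_plastic = train(w_task_a, target=-1.0)                     # plain fine-tuning on task B
w_stable = train(w_task_a, target=-1.0, anchor=1.0, lam=1.0) # anchored update

loss_a = lambda w: (w - 1.0) ** 2
# Full plasticity drives w to -1, quadrupling task-A loss; the anchored
# run compromises between the two targets and forgets less.
assert loss_a(w_plastic) > loss_a(w_stable)
```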
Method 1: Elastic Weight Consolidation (EWC)
EWC Objective via Laplace Approximation
Statement
After training on task $A$ with optimal parameters $\theta_A^*$, the loss for training on task $B$ while preserving task $A$ performance is approximately:

$$\mathcal{L}(\theta) = \mathcal{L}_B(\theta) + \sum_i \frac{\lambda}{2} F_i \left(\theta_i - \theta_{A,i}^*\right)^2$$

where $F_i$ is the $i$-th diagonal element of the Fisher information matrix computed on task $A$ data, and $\lambda$ controls the strength of the constraint.
Intuition
The Fisher information $F_i$ measures how important weight $\theta_i$ is for task $A$. If $F_i$ is large, changing $\theta_i$ significantly hurts task $A$ performance, so the penalty for deviating from $\theta_{A,i}^*$ is large. If $F_i$ is small, the weight can be freely repurposed for task $B$.
Proof Sketch
Model the posterior over $\theta$ after task $A$ as $p(\theta \mid D_A)$. Apply a Laplace approximation: approximate this posterior as a Gaussian centered at $\theta_A^*$ with precision matrix equal to the Fisher information $F$. The log posterior for training on $B$ becomes $\log p(\theta \mid D_A, D_B) = \log p(D_B \mid \theta) + \log p(\theta \mid D_A) + \text{const}$. Using the diagonal approximation to $F$ yields the EWC objective.
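The two ingredients, a diagonal Fisher estimate and the quadratic penalty, can be sketched in a few lines. This is a toy illustration with made-up per-example gradients, not the original EWC training code:

```python
import numpy as np

def fisher_diagonal(grads):
    """Diagonal Fisher estimate: mean squared per-example gradient of the
    log-likelihood, evaluated at the task-A optimum.

    grads: array of shape (n_examples, n_params)."""
    return np.mean(grads ** 2, axis=0)

def ewc_penalty(theta, theta_a, fisher, lam):
    """Quadratic EWC penalty: (lam / 2) * sum_i F_i (theta_i - theta_A_i)^2."""
    return 0.5 * lam * np.sum(fisher * (theta - theta_a) ** 2)

# Illustrative gradients: parameter 0 matters a lot for task A, parameter 2
# barely at all (the scale factors are arbitrary).
rng = np.random.default_rng(0)
grads = rng.normal(size=(100, 3)) * np.array([5.0, 1.0, 0.01])
theta_a = np.zeros(3)
F = fisher_diagonal(grads)

# Moving the high-Fisher weight is penalized far more than the low-Fisher one.
move_important = ewc_penalty(np.array([1.0, 0.0, 0.0]), theta_a, F, lam=1.0)
move_unimportant = ewc_penalty(np.array([0.0, 0.0, 1.0]), theta_a, F, lam=1.0)
assert move_important > move_unimportant
```

In a real network the per-example gradients come from backpropagation on task A data after training converges; only `F` and `theta_a` need to be kept around.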
Why It Matters
EWC provides a principled, Bayesian-motivated approach to continual learning. It does not require storing old data (only the Fisher diagonal and old parameters). It shows that the key information for preventing forgetting is which weights matter, not what the old data looked like.
Failure Mode
The diagonal Fisher approximation is crude for deep networks where weights interact strongly. The Laplace approximation assumes a unimodal posterior, which is wrong for neural networks with many equivalent minima. EWC degrades when the number of tasks grows large because the accumulated penalties leave fewer free parameters. It also cannot handle tasks that require conflicting weight values.
Method 2: Progressive Neural Networks
Progressive networks side-step forgetting entirely by never modifying old weights. For each new task $k$, a new "column" (a full network) is added. The new column receives lateral connections from all previous columns:

$$h_i^{(k)} = f\!\left(W_i^{(k)} h_{i-1}^{(k)} + \sum_{j<k} U_i^{(k:j)} h_{i-1}^{(j)}\right)$$

where $h_i^{(k)}$ is the activation at layer $i$ for column $k$, $W_i^{(k)}$ are the column's own weights, and $U_i^{(k:j)}$ are lateral connection weights.
Advantage: zero forgetting by construction. Old columns are frozen.
Disadvantage: model size grows linearly with the number of tasks. After 100 tasks, you have 100 full networks plus lateral connections. This is impractical for long task sequences.
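A minimal forward pass over a stack of columns can be sketched as follows. The two-column, two-layer setup and the ReLU nonlinearity are assumptions for illustration; the data-structure layout (`columns`, `laterals`) is this sketch's own convention, not the paper's:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def progressive_forward(x, columns, laterals):
    """Forward pass through progressive-network columns.

    columns[k]     : list of weight matrices, one per layer of column k
                     (columns for k < last are frozen after their task).
    laterals[k][i] : list of lateral matrices, one per earlier column j,
                     applied to column j's layer i-1 activations.
    Returns the top-layer activation of the newest (last) column."""
    n_layers = len(columns[0])
    acts = [[None] * n_layers for _ in columns]  # acts[k][i] = layer-i activation
    for k, col in enumerate(columns):
        h = x
        for i, W in enumerate(col):
            pre = W @ h
            if i > 0:  # lateral inputs start at the second layer
                for j, U in enumerate(laterals[k][i]):
                    pre = pre + U @ acts[j][i - 1]
            h = relu(pre)
            acts[k][i] = h
    return acts[-1][-1]

rng = np.random.default_rng(1)
d = 4
columns = [[rng.normal(size=(d, d)) for _ in range(2)] for _ in range(2)]
laterals = [
    [None, []],                          # column 0: no earlier columns
    [None, [rng.normal(size=(d, d))]],   # column 1, layer 1: lateral from column 0
]
out = progressive_forward(rng.normal(size=d), columns, laterals)
assert out.shape == (d,)
```

Note that every earlier column must still be evaluated at inference time to feed the lateral connections, which is part of why cost grows with the task count.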
Method 3: Replay Methods
Store a small buffer $\mathcal{B}$ of examples from previous tasks. When training on task $T_k$, interleave buffer examples with current data:

$$\mathcal{L}(\theta) = \mathcal{L}_k(\theta; D_k) + \mathcal{L}_{\text{replay}}(\theta; \mathcal{B})$$
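A common way to keep the buffer uniform over everything seen so far is reservoir sampling. The sketch below uses integers as stand-in examples and a hypothetical half-and-half batch mix; real systems tune the mixing ratio:

```python
import random

class ReplayBuffer:
    """Fixed-size buffer filled by reservoir sampling, so every example
    seen so far has equal probability of being retained."""
    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.buffer = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(example)
        else:
            # Keep the new example with probability capacity / seen.
            idx = self.rng.randrange(self.seen)
            if idx < self.capacity:
                self.buffer[idx] = example

    def sample(self, n):
        return self.rng.sample(self.buffer, min(n, len(self.buffer)))

buf = ReplayBuffer(capacity=200)
for old_example in range(1000):        # stand-in for task-1 data
    buf.add(old_example)

# Interleave: half the batch from the current task, half from the buffer.
current_batch = list(range(2000, 2016))
mixed_batch = current_batch + buf.sample(len(current_batch))
assert len(buf.buffer) == 200 and len(mixed_batch) == 32
```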
Replay Buffer Convergence
Statement
If the replay buffer $\mathcal{B}$ contains $m$ examples drawn uniformly from $D_1, \ldots, D_{k-1}$ and each task loss is convex and $L$-Lipschitz, then with probability at least $1 - \delta$:

$$\mathcal{L}(\hat{\theta}) - \mathcal{L}(\theta^*) \le O\!\left(L \sqrt{\frac{\log(1/\delta)}{m}}\right)$$

where $\hat{\theta}$ is the replay-trained model and $\theta^*$ is the joint optimum.
Intuition
A uniformly sampled buffer approximates training on all tasks jointly. The approximation quality scales as $O(1/\sqrt{m})$, where $m$ is the buffer size. Larger buffers approach the joint-training baseline.
Proof Sketch
The buffer empirical loss is an unbiased estimator of the average population loss across all tasks. Apply standard uniform convergence arguments (Hoeffding or Rademacher bounds) to bound the gap between buffer loss and true joint loss. The convexity assumption ensures that minimizing the approximate loss stays close to minimizing the true loss.
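The $1/\sqrt{m}$ scaling (with $m$ the buffer size) can be checked numerically. Here a synthetic array of per-example losses stands in for the population over all old tasks, and we measure how far a uniform buffer's mean loss sits from the true joint mean:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic per-example losses across all old tasks (the "population").
population = rng.exponential(scale=1.0, size=100_000)
true_mean = population.mean()

def buffer_error(m, trials=2000):
    """Mean |buffer loss - joint loss| over random uniform buffers of size m."""
    idx = rng.integers(0, population.size, size=(trials, m))
    return np.abs(population[idx].mean(axis=1) - true_mean).mean()

e100, e400 = buffer_error(100), buffer_error(400)
# Quadrupling the buffer should roughly halve the error (1/sqrt(m) scaling).
assert 1.5 < e100 / e400 < 2.7
```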
Why It Matters
This justifies replay as a principled strategy: with a buffer of modest size, you can approximate joint training. In practice, even small buffers (a few hundred examples per task) substantially reduce forgetting.
Failure Mode
The convexity assumption does not hold for neural networks. In practice, replay still helps for non-convex models, but the theoretical guarantee breaks. Buffer selection matters: uniform sampling may waste capacity on easy examples. Privacy constraints may prohibit storing old data at all.
Generative replay replaces stored examples with a generative model trained on previous tasks. Train a generator $G$ on old data, then use $G$ to produce synthetic examples during new-task training. This avoids storing real data but adds the complexity of training and maintaining a generator.
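A minimal sketch of the idea, with a diagonal Gaussian standing in for the generator (a real system would use a VAE or GAN, and the data here is synthetic):

```python
import numpy as np

class GaussianGenerator:
    """Toy stand-in for a generative model: fit a diagonal Gaussian to
    old-task features, then sample synthetic replacements for them."""
    def fit(self, X):
        self.mu = X.mean(axis=0)
        self.sigma = X.std(axis=0) + 1e-8
        return self

    def sample(self, n, rng):
        return rng.normal(self.mu, self.sigma, size=(n, self.mu.size))

rng = np.random.default_rng(0)
old_task_data = rng.normal(loc=3.0, scale=0.5, size=(500, 8))
gen = GaussianGenerator().fit(old_task_data)

# Synthetic examples used in place of stored real ones during task-B training.
synthetic = gen.sample(64, rng)
assert synthetic.shape == (64, 8)
```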
Common Confusions
Continual learning is not multi-task learning
Multi-task learning trains on all tasks simultaneously with all data available. Continual learning trains sequentially with limited access to old data. The optimization landscapes are different: multi-task can find a joint optimum, while continual learning must navigate a sequence of changing objectives.
EWC does not prevent all forgetting
EWC slows forgetting, it does not eliminate it. The diagonal Fisher approximation is lossy, and the Gaussian posterior assumption is wrong for deep networks. On long task sequences (10+ tasks), EWC performance degrades substantially compared to joint training.
Canonical Examples
Permuted MNIST benchmark
Train on MNIST, then on MNIST with pixels shuffled by a fixed permutation (a different "task" with identical pixel statistics but different spatial structure). Naive fine-tuning on the permuted task drops original MNIST accuracy from 98% to about 50%. EWC retains about 95% accuracy on the original task while achieving 96% on the permuted task. This is a clean demonstration because both tasks have identical difficulty.
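Constructing a permuted task is a one-line transform over flattened images. The sketch below uses random arrays in place of real MNIST data:

```python
import numpy as np

def make_permuted_task(images, seed):
    """Apply one fixed pixel permutation to every image: identical pixel
    statistics, different spatial structure, hence a new 'task'."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(images.shape[1])
    return images[:, perm]

# Stand-in for flattened 28x28 MNIST images.
rng = np.random.default_rng(0)
images = rng.random(size=(100, 784))
task2 = make_permuted_task(images, seed=42)

# Same multiset of pixel values per image, different arrangement.
assert np.allclose(np.sort(task2[0]), np.sort(images[0]))
```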
Summary
- Continual learning studies sequential task learning under limited access to old data
- The stability-plasticity tradeoff is the central tension
- EWC uses Fisher information to identify which weights to protect
- Progressive networks avoid forgetting by freezing old weights and growing the model
- Replay methods approximate joint training with a small buffer
- No method fully solves continual learning for long task sequences in deep networks
Exercises
Problem
In EWC, suppose a weight $\theta_i$ has Fisher information $F_i = 0$ for task $A$. What does this mean about the weight, and what happens to it during task $B$ training?
Problem
A progressive network has been trained on 50 tasks. Each task column has 10M parameters. Lateral connections between column $k$ and all previous columns at each layer add further overhead. Estimate the total parameter count and explain why this approach does not scale to hundreds of tasks.
References
Canonical:
- Kirkpatrick et al., "Overcoming Catastrophic Forgetting in Neural Networks" (EWC, 2017), Sections 1-4
- Rusu et al., "Progressive Neural Networks" (2016), Sections 1-3
Current:
- Rolnick et al., "Experience Replay for Continual Learning" (2019)
- van de Ven & Tolias, "Three Scenarios for Continual Learning" (2019)
- De Lange et al., "A Continual Learning Survey" (2021), Sections 2-5
Next Topics
- The methods here build on the catastrophic forgetting analysis of why forgetting occurs
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Catastrophic Forgetting (Layer 4)