
AI Safety

Continual Learning and Forgetting

Learning sequentially without destroying previous knowledge: Elastic Weight Consolidation, progressive networks, replay methods, and the stability-plasticity tradeoff in deployed systems.


Why This Matters

Real-world ML systems must learn new tasks over time. A deployed model needs to handle new product categories, new languages, or updated regulations without retraining from scratch on all historical data. Naive sequential training destroys old knowledge. Continual learning studies how to avoid this.

The problem is broader than catastrophic forgetting alone. Forgetting describes what goes wrong. Continual learning studies what to do about it: algorithms, architectures, and training protocols that allow sequential learning while preserving performance on earlier tasks.

Problem Formulation

Definition

Continual Learning Setting

A model encounters a sequence of tasks $T_1, T_2, \ldots, T_K$. For task $T_k$, training data $\mathcal{D}_k = \{(x_i^{(k)}, y_i^{(k)})\}$ is available. After training on $T_k$, data from previous tasks $T_1, \ldots, T_{k-1}$ may be partially or fully unavailable. The goal is to find parameters $\theta$ that perform well on all tasks:

$$\min_\theta \sum_{k=1}^{K} \mathbb{E}_{(x,y) \sim \mathcal{D}_k}\left[\ell(f_\theta(x), y)\right]$$

The constraint is that you cannot store or revisit all of $\mathcal{D}_1, \ldots, \mathcal{D}_{K-1}$.
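The failure mode this setting guards against can be reproduced in a few lines. The sketch below (an illustrative toy, not from the source: two synthetic linear regression tasks trained with plain gradient descent) shows naive sequential training destroying task-1 performance.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

def make_task(w_true, n=200):
    """Linear regression task with its own ground-truth weights."""
    X = rng.normal(size=(n, d))
    y = X @ w_true + 0.01 * rng.normal(size=n)
    return X, y

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

def train(w, X, y, lr=0.05, steps=500):
    """Plain gradient descent on this task's loss only."""
    for _ in range(steps):
        w = w - lr * (2 * X.T @ (X @ w - y) / len(y))
    return w

X1, y1 = make_task(rng.normal(size=d))   # task T1
X2, y2 = make_task(rng.normal(size=d))   # task T2, conflicting optimum

w = train(np.zeros(d), X1, y1)           # learn T1
loss1_before = mse(w, X1, y1)
w = train(w, X2, y2)                     # naive sequential fine-tuning on T2
loss1_after = mse(w, X1, y1)             # T1 loss blows up: forgetting

print(f"T1 loss after T1 training: {loss1_before:.5f}")
print(f"T1 loss after T2 training: {loss1_after:.5f}")
```

Because nothing in the task-2 gradient references task 1, the weights drift to wherever task 2 wants them; the methods below each add a mechanism that resists this drift.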

Definition

Stability-Plasticity Tradeoff

Stability is the ability to retain performance on old tasks. Plasticity is the ability to learn new tasks. These are in tension: a model that never changes its weights is perfectly stable but has zero plasticity. A model that freely overwrites weights has maximum plasticity but zero stability. Every continual learning method navigates this tradeoff.

Method 1: Elastic Weight Consolidation (EWC)

Theorem

EWC Objective via Laplace Approximation

Statement

After training on task $T_1$ with optimal parameters $\theta_1^*$, the loss for training on task $T_2$ while preserving task $T_1$ performance is approximately:

$$\mathcal{L}_{\text{EWC}}(\theta) = \mathcal{L}_{T_2}(\theta) + \frac{\lambda}{2} \sum_i F_i (\theta_i - \theta_{1,i}^*)^2$$

where $F_i$ is the $i$-th diagonal element of the Fisher information matrix computed on task $T_1$ data, and $\lambda$ controls the strength of the constraint.

Intuition

The Fisher information $F_i$ measures how important weight $\theta_i$ is for task $T_1$. If $F_i$ is large, changing $\theta_i$ significantly hurts task $T_1$ performance, so the penalty for deviating from $\theta_{1,i}^*$ is large. If $F_i$ is small, the weight can be freely repurposed for task $T_2$.

Proof Sketch

Model the posterior over $\theta$ after task $T_1$ as $p(\theta \mid \mathcal{D}_1)$. Apply a Laplace approximation: approximate this posterior as a Gaussian centered at $\theta_1^*$ with precision matrix equal to the Fisher information $F$. The log posterior for training on $T_2$ becomes $\log p(\mathcal{D}_2 \mid \theta) + \log p(\theta \mid \mathcal{D}_1) \approx -\mathcal{L}_{T_2}(\theta) - \frac{1}{2}(\theta - \theta_1^*)^T F (\theta - \theta_1^*)$, so maximizing it is equivalent to minimizing $\mathcal{L}_{T_2}(\theta) + \frac{1}{2}(\theta - \theta_1^*)^T F (\theta - \theta_1^*)$. Using the diagonal approximation to $F$ yields the EWC objective.

Why It Matters

EWC provides a principled, Bayesian-motivated approach to continual learning. It does not require storing old data (only the Fisher diagonal and old parameters). It shows that the key information for preventing forgetting is which weights matter, not what the old data looked like.

Failure Mode

The diagonal Fisher approximation is crude for deep networks, where weights interact strongly. The Laplace approximation assumes a unimodal posterior, which is wrong for neural networks with many equivalent minima. EWC degrades as the number of tasks grows, because the accumulated penalties leave fewer free parameters. It also cannot handle tasks that require conflicting weight values.
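The EWC update can be sketched concretely. This is an illustrative toy (not from the source): two linear regression tasks, where for a linear-Gaussian model the diagonal Fisher reduces to the per-feature second moment of the inputs, and the penalty gradient $\lambda F_i (\theta_i - \theta_{1,i}^*)$ is added to the task-2 gradient.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
X1 = rng.normal(size=(200, d)); y1 = X1 @ rng.normal(size=d)   # task T1
X2 = rng.normal(size=(200, d)); y2 = X2 @ rng.normal(size=d)   # task T2

mse = lambda w, X, y: float(np.mean((X @ w - y) ** 2))
grad_mse = lambda w, X, y: 2 * X.T @ (X @ w - y) / len(y)

def train(grad_fn, w, lr=0.02, steps=2000):
    for _ in range(steps):
        w = w - lr * grad_fn(w)
    return w

# Task 1: ordinary training to (near-)optimal parameters w1.
w1 = train(lambda w: grad_mse(w, X1, y1), np.zeros(d))

# Diagonal Fisher: for a linear-Gaussian model with unit noise, F_i = E[x_i^2].
fisher = np.mean(X1 ** 2, axis=0)

# Task 2 with the EWC penalty (lambda/2) * sum_i F_i (w_i - w1_i)^2.
lam = 10.0
w_ewc = train(lambda w: grad_mse(w, X2, y2) + lam * fisher * (w - w1), w1.copy())
# Naive baseline: fine-tune on T2 with no penalty.
w_naive = train(lambda w: grad_mse(w, X2, y2), w1.copy())

print("T1 loss, EWC:  ", mse(w_ewc, X1, y1))   # EWC retains T1 far better
print("T1 loss, naive:", mse(w_naive, X1, y1))
```

Note the tradeoff in action: the penalty that protects task 1 also pulls $w$ away from the task-2 optimum, which is the stability-plasticity tension in miniature.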

Method 2: Progressive Neural Networks

Progressive networks side-step forgetting entirely by never modifying old weights. For each new task $T_k$, a new "column" (a full network) is added. The new column receives lateral connections from all previous columns:

$$h_i^{(k)} = \sigma\left(W_i^{(k)} h_{i-1}^{(k)} + \sum_{j=1}^{k-1} U_i^{(k,j)} h_{i-1}^{(j)}\right)$$

where $h_i^{(j)}$ is the activation at layer $i$ for column $j$, and $U_i^{(k,j)}$ are lateral connection weights.

Advantage: zero forgetting by construction. Old columns are frozen.

Disadvantage: model size grows linearly with the number of tasks. After 100 tasks, you have 100 full networks plus lateral connections. This is impractical for long task sequences.
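The lateral-connection forward pass can be sketched for the two-column case. This is a minimal illustrative example (random weights, no training): column 1 is frozen, and column 2's second layer combines its own input with lateral input from column 1's first-layer activation.

```python
import numpy as np

rng = np.random.default_rng(2)
relu = lambda z: np.maximum(z, 0.0)

d_in, d_h = 4, 8
# Column 1 (task 1): trained earlier, now frozen.
W1 = [rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))]
# Column 2 (task 2): its own weights W2 plus lateral weights U2 from column 1.
W2 = [rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))]
U2 = rng.normal(size=(d_h, d_h))   # lateral connection at layer 2

def forward(x):
    # Column 1 forward pass; these weights are never updated for task 2.
    h1_1 = relu(W1[0] @ x)
    h1_2 = relu(W1[1] @ h1_1)
    # Column 2 layer 2 sees its own h^{(2)}_1 plus lateral input from h^{(1)}_1.
    h2_1 = relu(W2[0] @ x)
    h2_2 = relu(W2[1] @ h2_1 + U2 @ h1_1)
    return h1_2, h2_2

x = rng.normal(size=d_in)
out1, out2 = forward(x)
print(out1.shape, out2.shape)
```

Training task 2 updates only `W2` and `U2`, so task-1 behavior is bit-for-bit unchanged; the cost is that every new task adds another full set of `W` matrices plus laterals to all earlier columns.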

Method 3: Replay Methods

Store a small buffer $\mathcal{M}$ of examples from previous tasks. When training on task $T_k$, interleave buffer examples with current data:

$$\mathcal{L}_{\text{replay}}(\theta) = \mathcal{L}_{T_k}(\theta) + \alpha \cdot \frac{1}{|\mathcal{M}|} \sum_{(x,y) \in \mathcal{M}} \ell(f_\theta(x), y)$$
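The replay objective translates directly into code. Below is an illustrative toy (not from the source): a small uniform sample of old-task data serves as the buffer, and each gradient step combines the current-task gradient with $\alpha$ times the buffer gradient.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 5
X1 = rng.normal(size=(500, d)); y1 = X1 @ rng.normal(size=d)   # old task T1
X2 = rng.normal(size=(500, d)); y2 = X2 @ rng.normal(size=d)   # current task T2

# Keep a small uniform sample of D1 as the buffer M (|M| = 50).
idx = rng.choice(len(X1), size=50, replace=False)
X_buf, y_buf = X1[idx], y1[idx]

mse = lambda w, X, y: float(np.mean((X @ w - y) ** 2))
grad_mse = lambda w, X, y: 2 * X.T @ (X @ w - y) / len(y)

def train(grad_fn, w, lr=0.02, steps=2000):
    for _ in range(steps):
        w = w - lr * grad_fn(w)
    return w

# L_replay = L_{T2} + alpha * buffer loss; both terms enter each step.
alpha = 1.0
w_replay = train(lambda w: grad_mse(w, X2, y2)
                           + alpha * grad_mse(w, X_buf, y_buf), np.zeros(d))
w_naive = train(lambda w: grad_mse(w, X2, y2), np.zeros(d))

print("T1 loss with replay:   ", mse(w_replay, X1, y1))
print("T1 loss without replay:", mse(w_naive, X1, y1))
```

Even this 10% buffer keeps the old-task loss far below the naive baseline, consistent with the proposition below: the buffer loss is an unbiased estimate of the old-task loss, so the combined objective approximates joint training.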

Proposition

Replay Buffer Convergence

Statement

If the replay buffer $\mathcal{M}$ contains $m$ examples drawn uniformly from $\mathcal{D}_1 \cup \cdots \cup \mathcal{D}_{k-1}$ and each task loss is convex and $L$-Lipschitz, then with probability at least $1 - \delta$:

$$\frac{1}{k}\sum_{j=1}^{k} \mathcal{L}_{T_j}(\hat{\theta}) - \frac{1}{k}\sum_{j=1}^{k} \mathcal{L}_{T_j}(\theta^*) \leq O\left(\frac{L}{\sqrt{m}} + \frac{L\sqrt{\log(1/\delta)}}{\sqrt{m}}\right)$$

where $\hat{\theta}$ is the replay-trained model and $\theta^*$ is the joint optimum.

Intuition

A uniformly sampled buffer approximates training on all tasks jointly. The approximation quality scales as $1/\sqrt{m}$, where $m$ is the buffer size. Larger buffers approach the joint-training baseline.

Proof Sketch

The buffer empirical loss is an unbiased estimator of the average population loss across all tasks. Apply standard uniform convergence arguments (Hoeffding or Rademacher bounds) to bound the gap between buffer loss and true joint loss. The convexity assumption ensures that minimizing the approximate loss stays close to minimizing the true loss.

Why It Matters

This justifies replay as a principled strategy: with a buffer of modest size, you can approximate joint training. In practice, even small buffers (a few hundred examples per task) substantially reduce forgetting.

Failure Mode

The convexity assumption does not hold for neural networks. In practice, replay still helps for non-convex models, but the theoretical guarantee breaks. Buffer selection matters: uniform sampling may waste capacity on easy examples. Privacy constraints may prohibit storing old data at all.

Generative replay replaces stored examples with a generative model trained on previous tasks. Train a generator $G$ on old data, then use $G$ to produce synthetic examples during new-task training. This avoids storing real data but adds the complexity of training and maintaining a generator.
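The moving parts of generative replay can be sketched in miniature. This is a deliberately crude illustration, not a real system: the "generator" is just a Gaussian fit to the old inputs (real systems train a VAE or GAN), and a frozen copy of the old model supplies pseudo-labels for the synthetic inputs.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 5
X1 = rng.normal(size=(500, d))
y1 = X1 @ rng.normal(size=d)        # old-task data, about to be discarded

# Stand-in "generator" G: fit a Gaussian to the old inputs.
mu = X1.mean(axis=0)
cov = np.cov(X1, rowvar=False)

# Frozen copy of the old model provides pseudo-labels (here: least squares).
w_old, *_ = np.linalg.lstsq(X1, y1, rcond=None)

# During task-2 training, sample synthetic (x, y) pairs instead of storing D1.
X_syn = rng.multivariate_normal(mu, cov, size=100)
y_syn = X_syn @ w_old               # pseudo-labels from the frozen old model
print(X_syn.shape, y_syn.shape)
```

The synthetic pairs then take the place of $\mathcal{M}$ in the replay loss above; the quality of the whole scheme is capped by how faithfully the generator reproduces the old data distribution.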

Common Confusions

Watch Out

Continual learning is not multi-task learning

Multi-task learning trains on all tasks simultaneously with all data available. Continual learning trains sequentially with limited access to old data. The optimization landscapes are different: multi-task can find a joint optimum, while continual learning must navigate a sequence of changing objectives.

Watch Out

EWC does not prevent all forgetting

EWC slows forgetting, it does not eliminate it. The diagonal Fisher approximation is lossy, and the Gaussian posterior assumption is wrong for deep networks. On long task sequences (10+ tasks), EWC performance degrades substantially compared to joint training.

Canonical Examples

Example

Permuted MNIST benchmark

Train on MNIST, then on MNIST with pixels permuted by a fixed permutation (a different "task" with identical statistics but different spatial structure). Naive fine-tuning on the permuted task drops original MNIST accuracy from 98% to about 50%. EWC retains about 95% accuracy on the original task while achieving 96% on the permuted task. This is a clean demonstration because both tasks have identical difficulty.
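Constructing the permuted task is a one-liner worth seeing. The sketch below uses a random stand-in batch rather than real MNIST; the point is that one fixed permutation is drawn once and applied identically to every image, so per-image pixel statistics are exactly preserved while spatial structure is destroyed.

```python
import numpy as np

rng = np.random.default_rng(5)
# Stand-in batch; the real benchmark uses 28x28 MNIST digits flattened to 784.
X = rng.random((64, 784))

perm = rng.permutation(784)   # one fixed permutation defines the whole task
X_perm = X[:, perm]           # every image is permuted the same way

# Identical statistics, different spatial structure: each image's pixel values
# are merely reordered, so sorted pixel values match exactly.
assert np.allclose(np.sort(X, axis=1), np.sort(X_perm, axis=1))
print(X_perm.shape)
```

For a fully connected network the two tasks are equally hard by symmetry, which is what makes the benchmark a clean, controlled test of forgetting.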

Summary

  • Continual learning studies sequential task learning under limited access to old data
  • The stability-plasticity tradeoff is the central tension
  • EWC uses Fisher information to identify which weights to protect
  • Progressive networks avoid forgetting by freezing old weights and growing the model
  • Replay methods approximate joint training with a small buffer
  • No method fully solves continual learning for long task sequences in deep networks

Exercises

ExerciseCore

Problem

In EWC, suppose a weight $\theta_i$ has Fisher information $F_i = 0$ for task $T_1$. What does this mean about the weight, and what happens to it during task $T_2$ training?

ExerciseAdvanced

Problem

A progressive network has been trained on 50 tasks. Each task's column has 10M parameters. Lateral connections between column $k$ and all previous columns at each layer add overhead. Estimate the total parameter count and explain why this approach does not scale to hundreds of tasks.

References

Canonical:

  • Kirkpatrick et al., "Overcoming Catastrophic Forgetting in Neural Networks" (EWC, 2017), Sections 1-4
  • Rusu et al., "Progressive Neural Networks" (2016), Sections 1-3

Current:

  • Rolnick et al., "Experience Replay for Continual Learning" (2019)
  • van de Ven & Tolias, "Three Scenarios for Continual Learning" (2019)
  • De Lange et al., "A Continual Learning Survey" (2021), Sections 2-5

Next Topics

  • These methods build on the catastrophic forgetting analysis, which explains why forgetting occurs in the first place

Last reviewed: April 2026
