
AI Safety

Continual Learning and Forgetting

Learning sequentially without destroying previous knowledge: Elastic Weight Consolidation, progressive networks, replay methods, and the stability-plasticity tradeoff in deployed systems.


Why This Matters

Real-world ML systems must learn new tasks over time. A deployed model needs to handle new product categories, new languages, or updated regulations without retraining from scratch on all historical data. Naive sequential training destroys old knowledge. Continual learning studies how to avoid this.

The problem is broader than catastrophic forgetting alone. Forgetting describes what goes wrong. Continual learning studies what to do about it: algorithms, architectures, and training protocols that allow sequential learning while preserving performance on earlier tasks.

Problem Formulation

Definition

Continual Learning Setting

A model encounters a sequence of tasks $T_1, T_2, \ldots, T_K$. For task $T_k$, training data $\mathcal{D}_k = \{(x_i^{(k)}, y_i^{(k)})\}$ is available. After training on $T_k$, data from previous tasks $T_1, \ldots, T_{k-1}$ may be partially or fully unavailable. The goal is to find parameters $\theta$ that perform well on all tasks:

$$\min_\theta \sum_{k=1}^{K} \mathbb{E}_{(x,y) \sim \mathcal{D}_k}\left[\ell(f_\theta(x), y)\right]$$

The constraint is that you cannot store or revisit all of $\mathcal{D}_1, \ldots, \mathcal{D}_{K-1}$.
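The failure mode this setting guards against can be reproduced in a few lines. The sketch below (an illustrative toy, not from the source: two synthetic linear regression tasks trained with plain gradient descent) shows naive sequential training destroying task-1 performance.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

def make_task(w_true, n=200):
    """Linear regression task with its own ground-truth weights."""
    X = rng.normal(size=(n, d))
    y = X @ w_true + 0.01 * rng.normal(size=n)
    return X, y

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

def train(w, X, y, lr=0.05, steps=500):
    """Plain gradient descent on this task's loss only."""
    for _ in range(steps):
        w = w - lr * (2 * X.T @ (X @ w - y) / len(y))
    return w

X1, y1 = make_task(rng.normal(size=d))   # task T1
X2, y2 = make_task(rng.normal(size=d))   # task T2, conflicting optimum

w = train(np.zeros(d), X1, y1)           # learn T1
loss1_before = mse(w, X1, y1)
w = train(w, X2, y2)                     # naive sequential fine-tuning on T2
loss1_after = mse(w, X1, y1)             # T1 loss blows up: forgetting

print(f"T1 loss after T1 training: {loss1_before:.5f}")
print(f"T1 loss after T2 training: {loss1_after:.5f}")
```

Because nothing in the task-2 gradient references task 1, the weights drift to wherever task 2 wants them; the methods below each add a mechanism that resists this drift.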

Definition

Stability-Plasticity Tradeoff

Stability is the ability to retain performance on old tasks. Plasticity is the ability to learn new tasks. These are in tension: a model that never changes its weights is perfectly stable but has zero plasticity. A model that freely overwrites weights has maximum plasticity but zero stability. Every continual learning method navigates this tradeoff.

Method 1: Elastic Weight Consolidation (EWC)

Theorem

EWC Objective via Laplace Approximation

Statement

After training on task $T_1$ with optimal parameters $\theta_1^*$, the loss for training on task $T_2$ while preserving task $T_1$ performance is approximately:

$$\mathcal{L}_{\text{EWC}}(\theta) = \mathcal{L}_{T_2}(\theta) + \frac{\lambda}{2} \sum_i F_i (\theta_i - \theta_{1,i}^*)^2$$

where $F_i$ is the $i$-th diagonal element of the Fisher information matrix computed on task $T_1$ data, and $\lambda$ controls the strength of the constraint.

Intuition

The Fisher information $F_i$ measures how important weight $\theta_i$ is for task $T_1$. If $F_i$ is large, changing $\theta_i$ significantly hurts task $T_1$ performance, so the penalty for deviating from $\theta_{1,i}^*$ is large. If $F_i$ is small, the weight can be freely repurposed for task $T_2$.

Proof Sketch

Model the posterior over $\theta$ after task $T_1$ as $p(\theta \mid \mathcal{D}_1)$. Apply a Laplace approximation: approximate this posterior as a Gaussian centered at $\theta_1^*$ with precision matrix equal to the Fisher information $F$. The log posterior for training on $T_2$ becomes $\log p(\mathcal{D}_2 \mid \theta) + \log p(\theta \mid \mathcal{D}_1) \approx -\mathcal{L}_{T_2}(\theta) - \frac{1}{2}(\theta - \theta_1^*)^T F (\theta - \theta_1^*)$, so maximizing it is equivalent to minimizing $\mathcal{L}_{T_2}(\theta) + \frac{1}{2}(\theta - \theta_1^*)^T F (\theta - \theta_1^*)$. Using the diagonal approximation to $F$ yields the EWC objective.

Why It Matters

EWC provides a principled, Bayesian-motivated approach to continual learning. It does not require storing old data (only the Fisher diagonal and old parameters). It shows that the key information for preventing forgetting is which weights matter, not what the old data looked like.

Failure Mode

The diagonal Fisher approximation is crude for deep networks, where weights interact strongly. The Laplace approximation assumes a unimodal posterior, which is wrong for neural networks with many equivalent minima. EWC degrades as the number of tasks grows, because the accumulated penalties leave fewer free parameters. It also cannot handle tasks that require conflicting weight values.
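The EWC update can be sketched concretely. This is an illustrative toy (not from the source): two linear regression tasks, where for a linear-Gaussian model the diagonal Fisher reduces to the per-feature second moment of the inputs, and the penalty gradient $\lambda F_i (\theta_i - \theta_{1,i}^*)$ is added to the task-2 gradient.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
X1 = rng.normal(size=(200, d)); y1 = X1 @ rng.normal(size=d)   # task T1
X2 = rng.normal(size=(200, d)); y2 = X2 @ rng.normal(size=d)   # task T2

mse = lambda w, X, y: float(np.mean((X @ w - y) ** 2))
grad_mse = lambda w, X, y: 2 * X.T @ (X @ w - y) / len(y)

def train(grad_fn, w, lr=0.02, steps=2000):
    for _ in range(steps):
        w = w - lr * grad_fn(w)
    return w

# Task 1: ordinary training to (near-)optimal parameters w1.
w1 = train(lambda w: grad_mse(w, X1, y1), np.zeros(d))

# Diagonal Fisher: for a linear-Gaussian model with unit noise, F_i = E[x_i^2].
fisher = np.mean(X1 ** 2, axis=0)

# Task 2 with the EWC penalty (lambda/2) * sum_i F_i (w_i - w1_i)^2.
lam = 10.0
w_ewc = train(lambda w: grad_mse(w, X2, y2) + lam * fisher * (w - w1), w1.copy())
# Naive baseline: fine-tune on T2 with no penalty.
w_naive = train(lambda w: grad_mse(w, X2, y2), w1.copy())

print("T1 loss, EWC:  ", mse(w_ewc, X1, y1))   # EWC retains T1 far better
print("T1 loss, naive:", mse(w_naive, X1, y1))
```

Note the tradeoff in action: the penalty that protects task 1 also pulls $w$ away from the task-2 optimum, which is the stability-plasticity tension in miniature.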

Method 2: Progressive Neural Networks

Progressive networks side-step forgetting entirely by never modifying old weights. For each new task $T_k$, a new "column" (a full network) is added. The new column receives lateral connections from all previous columns:

$$h_i^{(k)} = \sigma\left(W_i^{(k)} h_{i-1}^{(k)} + \sum_{j=1}^{k-1} U_i^{(k,j)} h_{i-1}^{(j)}\right)$$

where $h_i^{(j)}$ is the activation at layer $i$ for column $j$, and $U_i^{(k,j)}$ are lateral connection weights.

Advantage: zero forgetting by construction. Old columns are frozen.

Disadvantage: model size grows linearly with the number of tasks. After 100 tasks, you have 100 full networks plus lateral connections. This is impractical for long task sequences.
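The lateral-connection forward pass can be sketched for the two-column case. This is a minimal illustrative example (random weights, no training): column 1 is frozen, and column 2's second layer combines its own input with lateral input from column 1's first-layer activation.

```python
import numpy as np

rng = np.random.default_rng(2)
relu = lambda z: np.maximum(z, 0.0)

d_in, d_h = 4, 8
# Column 1 (task 1): trained earlier, now frozen.
W1 = [rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))]
# Column 2 (task 2): its own weights W2 plus lateral weights U2 from column 1.
W2 = [rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))]
U2 = rng.normal(size=(d_h, d_h))   # lateral connection at layer 2

def forward(x):
    # Column 1 forward pass; these weights are never updated for task 2.
    h1_1 = relu(W1[0] @ x)
    h1_2 = relu(W1[1] @ h1_1)
    # Column 2 layer 2 sees its own h^{(2)}_1 plus lateral input from h^{(1)}_1.
    h2_1 = relu(W2[0] @ x)
    h2_2 = relu(W2[1] @ h2_1 + U2 @ h1_1)
    return h1_2, h2_2

x = rng.normal(size=d_in)
out1, out2 = forward(x)
print(out1.shape, out2.shape)
```

Training task 2 updates only `W2` and `U2`, so task-1 behavior is bit-for-bit unchanged; the cost is that every new task adds another full set of `W` matrices plus laterals to all earlier columns.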

Method 3: Replay Methods

Store a small buffer $\mathcal{M}$ of examples from previous tasks. When training on task $T_k$, interleave buffer examples with current data:

$$\mathcal{L}_{\text{replay}}(\theta) = \mathcal{L}_{T_k}(\theta) + \alpha \cdot \frac{1}{|\mathcal{M}|} \sum_{(x,y) \in \mathcal{M}} \ell(f_\theta(x), y)$$
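The replay objective translates directly into code. Below is an illustrative toy (not from the source): a small uniform sample of old-task data serves as the buffer, and each gradient step combines the current-task gradient with $\alpha$ times the buffer gradient.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 5
X1 = rng.normal(size=(500, d)); y1 = X1 @ rng.normal(size=d)   # old task T1
X2 = rng.normal(size=(500, d)); y2 = X2 @ rng.normal(size=d)   # current task T2

# Keep a small uniform sample of D1 as the buffer M (|M| = 50).
idx = rng.choice(len(X1), size=50, replace=False)
X_buf, y_buf = X1[idx], y1[idx]

mse = lambda w, X, y: float(np.mean((X @ w - y) ** 2))
grad_mse = lambda w, X, y: 2 * X.T @ (X @ w - y) / len(y)

def train(grad_fn, w, lr=0.02, steps=2000):
    for _ in range(steps):
        w = w - lr * grad_fn(w)
    return w

# L_replay = L_{T2} + alpha * buffer loss; both terms enter each step.
alpha = 1.0
w_replay = train(lambda w: grad_mse(w, X2, y2)
                           + alpha * grad_mse(w, X_buf, y_buf), np.zeros(d))
w_naive = train(lambda w: grad_mse(w, X2, y2), np.zeros(d))

print("T1 loss with replay:   ", mse(w_replay, X1, y1))
print("T1 loss without replay:", mse(w_naive, X1, y1))
```

Even this 10% buffer keeps the old-task loss far below the naive baseline, consistent with the proposition below: the buffer loss is an unbiased estimate of the old-task loss, so the combined objective approximates joint training.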

Proposition

Replay Buffer Convergence

Statement

If the replay buffer $\mathcal{M}$ contains $m$ examples drawn uniformly from $\mathcal{D}_1 \cup \cdots \cup \mathcal{D}_{k-1}$ and each task loss is convex and $L$-Lipschitz, then with probability at least $1 - \delta$:

$$\frac{1}{k}\sum_{j=1}^{k} \mathcal{L}_{T_j}(\hat{\theta}) - \frac{1}{k}\sum_{j=1}^{k} \mathcal{L}_{T_j}(\theta^*) \leq O\left(\frac{L}{\sqrt{m}} + \frac{L\sqrt{\log(1/\delta)}}{\sqrt{m}}\right)$$

where $\hat{\theta}$ is the replay-trained model and $\theta^*$ is the joint optimum.

Intuition

A uniformly sampled buffer approximates training on all tasks jointly. The approximation quality scales as $1/\sqrt{m}$, where $m$ is the buffer size. Larger buffers approach the joint-training baseline.

Proof Sketch

The buffer empirical loss is an unbiased estimator of the average population loss across all tasks. Apply standard uniform convergence arguments (Hoeffding or Rademacher bounds) to bound the gap between buffer loss and true joint loss. The convexity assumption ensures that minimizing the approximate loss stays close to minimizing the true loss.

Why It Matters

This justifies replay as a principled strategy: with a buffer of modest size, you can approximate joint training. In practice, even small buffers (a few hundred examples per task) substantially reduce forgetting.

Failure Mode

The convexity assumption does not hold for neural networks. In practice, replay still helps for non-convex models, but the theoretical guarantee breaks. Buffer selection matters: uniform sampling may waste capacity on easy examples. Privacy constraints may prohibit storing old data at all.

Generative replay replaces stored examples with a generative model trained on previous tasks. Train a generator $G$ on old data, then use $G$ to produce synthetic examples during new-task training. This avoids storing real data but adds the complexity of training and maintaining a generator.
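The moving parts of generative replay can be sketched in miniature. This is a deliberately crude illustration, not a real system: the "generator" is just a Gaussian fit to the old inputs (real systems train a VAE or GAN), and a frozen copy of the old model supplies pseudo-labels for the synthetic inputs.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 5
X1 = rng.normal(size=(500, d))
y1 = X1 @ rng.normal(size=d)        # old-task data, about to be discarded

# Stand-in "generator" G: fit a Gaussian to the old inputs.
mu = X1.mean(axis=0)
cov = np.cov(X1, rowvar=False)

# Frozen copy of the old model provides pseudo-labels (here: least squares).
w_old, *_ = np.linalg.lstsq(X1, y1, rcond=None)

# During task-2 training, sample synthetic (x, y) pairs instead of storing D1.
X_syn = rng.multivariate_normal(mu, cov, size=100)
y_syn = X_syn @ w_old               # pseudo-labels from the frozen old model
print(X_syn.shape, y_syn.shape)
```

The synthetic pairs then take the place of $\mathcal{M}$ in the replay loss above; the quality of the whole scheme is capped by how faithfully the generator reproduces the old data distribution.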

Common Confusions

Watch Out

Continual learning is not multi-task learning

Multi-task learning trains on all tasks simultaneously with all data available. Continual learning trains sequentially with limited access to old data. The optimization landscapes are different: multi-task can find a joint optimum, while continual learning must navigate a sequence of changing objectives.

Watch Out

EWC does not prevent all forgetting

EWC slows forgetting, it does not eliminate it. The diagonal Fisher approximation is lossy, and the Gaussian posterior assumption is wrong for deep networks. On long task sequences (10+ tasks), EWC performance degrades substantially compared to joint training.

Canonical Examples

Example

Permuted MNIST benchmark

Train on MNIST, then on MNIST with pixels permuted by a fixed permutation (a different "task" with identical statistics but different spatial structure). Naive fine-tuning on the permuted task drops original MNIST accuracy from 98% to about 50%. EWC retains about 95% accuracy on the original task while achieving 96% on the permuted task. This is a clean demonstration because both tasks have identical difficulty.
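Constructing the permuted task is a one-liner worth seeing. The sketch below uses a random stand-in batch rather than real MNIST; the point is that one fixed permutation is drawn once and applied identically to every image, so per-image pixel statistics are exactly preserved while spatial structure is destroyed.

```python
import numpy as np

rng = np.random.default_rng(5)
# Stand-in batch; the real benchmark uses 28x28 MNIST digits flattened to 784.
X = rng.random((64, 784))

perm = rng.permutation(784)   # one fixed permutation defines the whole task
X_perm = X[:, perm]           # every image is permuted the same way

# Identical statistics, different spatial structure: each image's pixel values
# are merely reordered, so sorted pixel values match exactly.
assert np.allclose(np.sort(X, axis=1), np.sort(X_perm, axis=1))
print(X_perm.shape)
```

For a fully connected network the two tasks are equally hard by symmetry, which is what makes the benchmark a clean, controlled test of forgetting.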

Summary

  • Continual learning studies sequential task learning under limited access to old data
  • The stability-plasticity tradeoff is the central tension
  • EWC uses Fisher information to identify which weights to protect
  • Progressive networks avoid forgetting by freezing old weights and growing the model
  • Replay methods approximate joint training with a small buffer
  • No method fully solves continual learning for long task sequences in deep networks

Exercises

ExerciseCore

Problem

In EWC, suppose a weight $\theta_i$ has Fisher information $F_i = 0$ for task $T_1$. What does this mean about the weight, and what happens to it during task $T_2$ training?

ExerciseAdvanced

Problem

A progressive network has been trained on 50 tasks. Each task's column has 10M parameters. Lateral connections between column $k$ and all previous columns at each layer add overhead. Estimate the total parameter count and explain why this approach does not scale to hundreds of tasks.

References

Canonical:

  • Kirkpatrick et al., "Overcoming Catastrophic Forgetting in Neural Networks" (EWC, 2017), Sections 1-4
  • Rusu et al., "Progressive Neural Networks" (2016), Sections 1-3

Current:

  • Rolnick et al., "Experience Replay for Continual Learning" (2019)
  • van de Ven & Tolias, "Three Scenarios for Continual Learning" (2019)
  • De Lange et al., "A Continual Learning Survey" (2021), Sections 2-5

Next Topics

  • These methods build on the catastrophic forgetting analysis, which explains why forgetting occurs in the first place

Last reviewed: April 2026
