Beta: content is under active construction and has not been peer-reviewed. Report errors on GitHub.

AI Safety

Catastrophic Forgetting

Fine-tuning a neural network on new data destroys knowledge of old data. Understanding the stability-plasticity dilemma and mitigation strategies (EWC, progressive networks, replay) is essential for continual learning and safe LLM fine-tuning.

Advanced · Tier 2 · Current · ~50 min

Why This Matters

Live Demo: Scroll Down to See Forgetting

Neural networks learn by adjusting weights to minimize loss on training data.

Each weight encodes information about patterns in the data it was trained on.

When you fine-tune on new data, the weights shift to accommodate the new task.

But the old information was stored in those same weights.

As the weights change, the old knowledge is overwritten.

The network forgets what it previously knew.

This is catastrophic forgetting.

As you scroll, earlier sentences degrade. This is what happens to neural network weights during sequential fine-tuning.

Catastrophic forgetting is one of the most fundamental limitations of neural networks. When you fine-tune a model on task B, it forgets task A. This is not a minor degradation. Performance on the old task can drop to chance level.

This matters practically because:

  • Fine-tuning LLMs on new instructions can destroy prior capabilities
  • Models cannot learn continuously from a stream of tasks without explicit mitigation
  • Safety alignment via fine-tuning can be undone by subsequent fine-tuning on other data
  • The problem limits how we deploy and update AI systems

Mental Model

Imagine learning to ride a bicycle, then learning to play piano. For humans, learning piano does not make you forget how to ride a bicycle. Neural networks are not like this. Training a network on piano data overwrites the weights that encoded bicycle riding.

The root cause: all knowledge in a neural network is stored in shared weights. Learning task B modifies the same weights that encoded task A via gradient descent. If the two tasks need different weight configurations, learning B destroys A.

Formal Setup and Notation

Consider a neural network with parameters $\theta$. Suppose we have two tasks, A and B, with data $\mathcal{D}_A$ and $\mathcal{D}_B$ and losses $\mathcal{L}_A(\theta)$ and $\mathcal{L}_B(\theta)$.

Sequential training: First train on $\mathcal{D}_A$ to get $\theta_A^*$. Then fine-tune on $\mathcal{D}_B$ starting from $\theta_A^*$ to get $\theta_{AB}$.

Catastrophic forgetting: $\mathcal{L}_A(\theta_{AB}) \gg \mathcal{L}_A(\theta_A^*)$. Performance on task A degrades severely after learning task B.
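This setup can be reproduced in a few lines of numpy. The sketch below uses made-up toy data and illustrative hyperparameters: a linear model is trained on a task A, fine-tuned on a conflicting task B, and evaluated on A before and after.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task(true_w):
    """Generate a toy linear-regression task: y = x @ true_w + noise."""
    X = rng.normal(size=(200, 2))
    y = X @ true_w + 0.01 * rng.normal(size=200)
    return X, y

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

def train(w, X, y, lr=0.1, steps=500):
    """Plain gradient descent on the task's MSE loss."""
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

# Tasks A and B require very different weight configurations.
XA, yA = make_task(np.array([1.0, -1.0]))
XB, yB = make_task(np.array([-1.0, 1.0]))

wA = train(np.zeros(2), XA, yA)   # theta_A*: fits task A
wAB = train(wA, XB, yB)           # fine-tune on B starting from theta_A*

print(mse(wA, XA, yA))    # low: task A learned
print(mse(wAB, XB, yB))   # low: task B learned
print(mse(wAB, XA, yA))   # high: task A forgotten
```

Because both tasks share the same two weights and demand opposite values, fitting B necessarily moves the weights out of A's low-loss region, which is exactly the mechanism described above.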

Core Definitions

Definition

Stability-Plasticity Dilemma

A learning system must balance two competing objectives:

  • Plasticity: The ability to learn new information from new data
  • Stability: The ability to retain previously learned knowledge

Maximum plasticity (unconstrained fine-tuning) causes catastrophic forgetting. Maximum stability (frozen weights) prevents learning anything new. Every continual learning method is a different point on this tradeoff.

Definition

Continual Learning (Lifelong Learning)

A learning setting where a model must learn from a sequence of tasks $\mathcal{T}_1, \mathcal{T}_2, \ldots, \mathcal{T}_K$ without access to data from previous tasks when learning the current task. The model must perform well on all tasks after training on the full sequence.

Definition

Task-Incremental vs. Class-Incremental

  • Task-incremental: The model knows which task it is being evaluated on (e.g., receives a task ID). Easier because the model can route to task-specific components.
  • Class-incremental: The model must distinguish between all classes from all tasks without knowing the task ID. Much harder because old classes must compete with new classes in a shared output space.

Why Forgetting Happens: The Weight-Level View

Consider the loss landscape. After training on task A, the parameters $\theta_A^*$ sit in a region of low $\mathcal{L}_A$. When we optimize $\mathcal{L}_B$ starting from $\theta_A^*$, the gradient of $\mathcal{L}_B$ pushes $\theta$ away from $\theta_A^*$ toward regions where $\mathcal{L}_B$ is low. In general, these regions have high $\mathcal{L}_A$.

The severity of forgetting depends on:

  1. Task similarity: If tasks A and B are similar, the low-loss regions overlap and forgetting is mild. If they are dissimilar, the regions are far apart and forgetting is catastrophic.
  2. Network capacity: Larger networks can potentially represent both tasks in different subnetworks, reducing interference.
  3. Training duration: The longer you train on B, the further $\theta$ moves from $\theta_A^*$, and the worse forgetting becomes.

Mitigation Strategies

Regularization-Based: Elastic Weight Consolidation (EWC)

Proposition

Elastic Weight Consolidation

Statement

After learning task A with optimal parameters $\theta_A^*$, the EWC objective for learning task B is:

$$\mathcal{L}_{\text{EWC}}(\theta) = \mathcal{L}_B(\theta) + \frac{\lambda}{2} \sum_i F_i (\theta_i - \theta_{A,i}^*)^2$$

where $F_i$ is the diagonal of the Fisher information matrix evaluated at $\theta_A^*$:

$$F_i = \mathbb{E}_{x \sim \mathcal{D}_A}\!\left[\left(\frac{\partial \log p(y \mid x, \theta_A^*)}{\partial \theta_i}\right)^2\right]$$

Intuition

EWC penalizes moving away from $\theta_A^*$, but not equally for all parameters. Parameters that are important for task A (high Fisher information) get a strong penalty. Parameters that are unimportant for task A (low Fisher information) are free to change for task B.

The Fisher information measures how much the output distribution changes when you perturb a parameter. If the Fisher information for $\theta_i$ is large, then $\theta_i$ strongly determines task A predictions, and changing it will damage task A performance. If it is small, the parameter is redundant for task A and can be repurposed for task B.

Why It Matters

EWC is one of the most influential theoretical contributions to continual learning. It provides a principled Bayesian justification for the penalty: the Fisher information is the precision (inverse variance) of the approximate Gaussian posterior over parameters after observing task A data. The EWC penalty is equivalent to using this posterior as a prior for task B.

Failure Mode

EWC assumes a diagonal approximation to the Fisher information, ignoring parameter correlations. For large networks, this is a severe approximation. EWC also accumulates penalties as the number of tasks grows: after $K$ tasks, the penalty has $K$ terms, and eventually all parameters are heavily constrained, leaving no room for new learning. This is called the "ossification" problem.
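The EWC objective can be illustrated on a toy linear-regression problem. In the sketch below (data and hyperparameters are made up for illustration), the diagonal Fisher of a linear-Gaussian model is approximated as proportional to the second moment of each input feature, so the weight that matters for task A is anchored while the unimportant one is free to move.

```python
import numpy as np

rng = np.random.default_rng(1)

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

def train(w, X, y, lr=0.01, steps=3000, anchor=None, fisher=None, lam=0.0):
    """Gradient descent, optionally with the EWC quadratic penalty."""
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        if anchor is not None:
            grad = grad + lam * fisher * (w - anchor)  # EWC term
        w = w - lr * grad
    return w

# Task A: feature 1 has tiny variance, so theta_1 barely matters for A.
XA = np.column_stack([rng.normal(size=300), 0.05 * rng.normal(size=300)])
yA = XA @ np.array([1.0, 0.0]) + 0.01 * rng.normal(size=300)

# Task B conflicts with A in the important direction (theta_0).
XB = rng.normal(size=(300, 2))
yB = XB @ np.array([-1.0, 3.0]) + 0.01 * rng.normal(size=300)

wA = train(np.zeros(2), XA, yA)

# Diagonal Fisher for a linear-Gaussian model is proportional to E[x_i^2]:
# large for feature 0 (important to A), tiny for feature 1 (unimportant).
fisher = np.mean(XA ** 2, axis=0)

w_plain = train(wA, XB, yB)                                    # plain fine-tune
w_ewc = train(wA, XB, yB, anchor=wA, fisher=fisher, lam=50.0)  # EWC

print(mse(w_plain, XA, yA))  # large: theta_0 was overwritten
print(mse(w_ewc, XA, yA))    # small: theta_0 protected, theta_1 repurposed
```

Note how EWC still learns task B's new weight in the low-Fisher direction while refusing to move the high-Fisher one; this is the anisotropic penalty at work.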

Replay-Based: Experience Replay

Store a small buffer of examples from previous tasks. When learning task B, mix examples from the buffer with task B data. This directly prevents forgetting by keeping old data in the training stream.
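The mixing step above can be sketched with the same kind of toy linear tasks (all data and ratios here are illustrative): the replay gradient is averaged with the new-task gradient, turning sequential training into an approximation of multitask training.

```python
import numpy as np

rng = np.random.default_rng(2)

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

def grad(w, X, y):
    return 2 * X.T @ (X @ w - y) / len(y)

# Two conflicting linear-regression tasks.
XA = rng.normal(size=(200, 2)); yA = XA @ np.array([1.0, -1.0])
XB = rng.normal(size=(200, 2)); yB = XB @ np.array([-1.0, 1.0])

w = np.zeros(2)
for _ in range(500):                 # learn task A
    w -= 0.05 * grad(w, XA, yA)

buf_X, buf_y = XA[:20], yA[:20]      # small replay buffer (10% of task A)

w_plain, w_replay = w.copy(), w.copy()
for _ in range(500):                 # learn task B
    w_plain -= 0.05 * grad(w_plain, XB, yB)
    # Replay: average the task B gradient with the buffer gradient.
    g = 0.5 * grad(w_replay, XB, yB) + 0.5 * grad(w_replay, buf_X, buf_y)
    w_replay -= 0.05 * g

print(mse(w_plain, XA, yA))   # large: task A forgotten
print(mse(w_replay, XA, yA))  # smaller: forgetting reduced (at some cost on B)
```

Because these toy tasks genuinely conflict, replay lands at a compromise between them rather than solving both; that compromise is the stability-plasticity tradeoff made concrete.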

Variants:

  • Exact replay: Store and replay actual examples from old tasks. Simple but requires storage and may raise privacy concerns.
  • Generative replay: Train a generative model to produce synthetic examples from old tasks. No storage needed, but the generative model itself can suffer from forgetting.
  • Gradient episodic memory (GEM): Use stored examples to constrain gradients: project the task B gradient to ensure it does not increase loss on stored examples from task A.
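The GEM projection step can be written in a few lines (the gradient vectors below are hypothetical numbers chosen to show a conflict): if the task B gradient has negative dot product with the gradient on stored task A examples, the conflicting component is removed.

```python
import numpy as np

def gem_project(g, g_ref):
    """Project the task-B gradient g so that following it does not
    increase loss on the replayed task-A examples (gradient g_ref).
    If the gradients conflict (negative dot product), subtract the
    component of g along g_ref."""
    dot = g @ g_ref
    if dot < 0:
        g = g - (dot / (g_ref @ g_ref)) * g_ref
    return g

# Conflicting gradients: stepping along g would raise task A's loss.
g = np.array([1.0, -1.0])       # gradient on task B data
g_ref = np.array([-1.0, 0.0])   # gradient on stored task A examples
g_proj = gem_project(g, g_ref)

print(g_proj)            # [0, -1]: conflicting component removed
print(g_proj @ g_ref)    # 0.0: no longer conflicts with task A
```

Full GEM solves a small quadratic program over constraints from every previous task; the single-constraint case above reduces to this closed-form projection.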

Architecture-Based: Progressive Neural Networks

Allocate a separate set of parameters for each task. Old parameters are frozen; new parameters are added with lateral connections to old ones.

  • Progressive networks (Rusu et al., 2016): Add a new column of layers for each task, with lateral connections from all previous columns. Forgetting is impossible (old weights are frozen), but the model grows linearly with the number of tasks.
  • PackNet: Prune the network after each task, freeing up capacity for new tasks. Avoids the growth problem but requires a pruning strategy.
  • Supermasks / lottery tickets: Find different binary masks over a shared set of weights for different tasks. Each task uses a different subnet.
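A progressive-network forward pass can be sketched as follows (weights are random stand-ins for trained values; shapes and names are illustrative). The key structural property is that task B's column reads from task A's frozen features via a lateral connection but never writes to them.

```python
import numpy as np

rng = np.random.default_rng(3)

def relu(z):
    return np.maximum(z, 0.0)

# Column 1: trained on task A, then frozen (values here are stand-ins).
W1 = rng.normal(size=(4, 3))          # task A hidden layer
head_A = rng.normal(size=4)           # task A output head

# Column 2: fresh parameters for task B, plus a lateral connection
# that lets task B reuse task A's frozen features.
W2 = rng.normal(size=(4, 3))
U_lateral = rng.normal(size=(4, 4))   # column 1 hidden -> column 2 hidden
head_B = rng.normal(size=4)

def forward_A(x):
    return head_A @ relu(W1 @ x)

def forward_B(x):
    h1 = relu(W1 @ x)                   # frozen: never updated for task B
    h2 = relu(W2 @ x + U_lateral @ h1)  # new column + lateral input
    return head_B @ h2

x = rng.normal(size=3)
print(forward_A(x), forward_B(x))
# Training task B updates only W2, U_lateral, and head_B, so forward_A
# is unchanged by construction: forgetting is architecturally impossible.
```

The cost is visible in the code: every new task adds a column and lateral connections to all previous columns, so parameters grow with the number of tasks.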

Connection to LLM Fine-Tuning

Catastrophic forgetting has direct implications for LLM deployment:

Instruction tuning can destroy base capabilities. When you fine-tune a base LLM on instruction-following data, the model can lose some of its original capabilities (e.g., factual knowledge, reasoning). This is why careful data mixing and evaluation on diverse benchmarks are essential during fine-tuning.

Safety alignment can be undone. If a model is aligned via RLHF or DPO to refuse harmful requests, subsequent fine-tuning on non-safety data can undo this alignment. This is a security concern: an adversary who can fine-tune a model can potentially remove safety guardrails. This motivates research on robust alignment that survives fine-tuning.

LoRA and parameter-efficient fine-tuning (PEFT). Methods like LoRA (Low-Rank Adaptation) mitigate forgetting by only modifying a small number of parameters (low-rank updates to attention matrices). The base model weights are frozen, so forgetting of base capabilities is reduced. This is a practical form of the architecture-based approach.
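The LoRA parameterization is simple enough to sketch directly (dimensions and scaling here are illustrative, not taken from any particular model): the frozen weight $W_0$ is augmented by a trainable low-rank product $BA$, with $B$ initialized to zero so the adapted model starts identical to the base model.

```python
import numpy as np

rng = np.random.default_rng(4)

d, r = 8, 2                        # full dimension 8, LoRA rank 2

W0 = rng.normal(size=(d, d))       # frozen pretrained weight matrix
A = rng.normal(size=(r, d)) * 0.1  # trainable down-projection
B = np.zeros((d, r))               # trainable up-projection, starts at zero

alpha = 4.0                        # LoRA scaling hyperparameter

def lora_forward(x):
    # Effective weight is W0 + (alpha / r) * B @ A; only A and B train.
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d)
# Before any fine-tuning, the adapted model equals the base model,
# because B = 0 makes the low-rank update vanish.
print(np.allclose(lora_forward(x), W0 @ x))  # True

# Trainable parameters: 2*d*r = 32 here versus d*d = 64 for full
# fine-tuning; for LLM matrices (d in the thousands, r ~ 8-64) the
# savings are dramatic, and W0 is never touched.
```

Freezing $W_0$ is what limits forgetting of base capabilities: the base computation is always recoverable by dropping the adapter.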

Common Confusions

Watch Out

Forgetting is not the same as overfitting

Overfitting means the model memorizes training data and fails to generalize to test data from the same distribution. Forgetting means the model loses performance on a different task it previously learned. A model can overfit to task B while simultaneously forgetting task A, but these are distinct phenomena with different causes and remedies.

Watch Out

Larger models do not automatically solve forgetting

While larger models have more capacity and can potentially represent multiple tasks, standard fine-tuning still causes forgetting in large models. The issue is not lack of capacity but the optimization process: gradients from the new task push shared weights away from the old optimum. LLMs with hundreds of billions of parameters still suffer from catastrophic forgetting during fine-tuning.

Watch Out

Multitask learning is not the same as continual learning

In multitask learning, you have simultaneous access to data from all tasks and train on them together. This avoids forgetting entirely but requires all data upfront. Continual learning requires learning tasks sequentially without revisiting old data. The distinction is about data availability, not model architecture.

Summary

  • Catastrophic forgetting: fine-tuning on new data destroys old knowledge
  • Root cause: shared weights store all knowledge; updating for task B overwrites task A
  • Stability-plasticity dilemma: you cannot maximize both
  • EWC: penalize changes to important parameters (measured by Fisher information)
  • Replay: mix old examples into new training
  • Progressive networks: freeze old, add new
  • LoRA/PEFT: modify few parameters, preserve base model
  • Safety implication: fine-tuning can undo alignment

Exercises

ExerciseCore

Problem

Explain why EWC uses the Fisher information matrix rather than a uniform penalty on all parameter changes. What would happen if you used a uniform penalty $\frac{\lambda}{2}\|\theta - \theta_A^*\|^2$ instead?

ExerciseAdvanced

Problem

After learning tasks A, B, and C sequentially with EWC, the penalty for task D has three terms (one for each previous task). Explain the "ossification" problem and propose a modification to address it.

ExerciseResearch

Problem

If an adversary can fine-tune a safety-aligned LLM on a small dataset of harmful examples, the alignment may be removed. Propose a defense strategy that makes the alignment robust to fine-tuning, and discuss its limitations.

References

Canonical:

  • McCloskey & Cohen, "Catastrophic Interference in Connectionist Networks" (1989)
  • Kirkpatrick et al., "Overcoming Catastrophic Forgetting in Neural Networks" (EWC, 2017)

Current:

  • Rusu et al., "Progressive Neural Networks" (2016)
  • Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models" (2021)
  • De Lange et al., "A Continual Learning Survey" (2021)

Next Topics

The natural next steps from catastrophic forgetting:

  • Connections to Bayesian inference and posterior approximation
  • Implications for LLM alignment and safety

Last reviewed: April 2026
