AI Safety
Catastrophic Forgetting
Fine-tuning a neural network on new data destroys knowledge of old data. Understanding the stability-plasticity dilemma and mitigation strategies (EWC, progressive networks, replay) is essential for continual learning and safe LLM fine-tuning.
Why This Matters
Neural networks learn by adjusting weights to minimize loss on training data.
Each weight encodes information about patterns in the data it was trained on.
When you fine-tune on new data, the weights shift to accommodate the new task.
But the old information was stored in those same weights.
As the weights change, the old knowledge is overwritten.
The network forgets what it previously knew.
This is catastrophic forgetting.
Catastrophic forgetting is one of the most fundamental limitations of neural networks. When you fine-tune a model on task B, it forgets task A. This is not a minor degradation. Performance on the old task can drop to chance level.
This matters practically because:
- Fine-tuning LLMs on new instructions can destroy prior capabilities
- Models cannot learn continuously from a stream of tasks without explicit mitigation
- Safety alignment via fine-tuning can be undone by subsequent fine-tuning on other data
- The problem limits how we deploy and update AI systems
Mental Model
Imagine learning to ride a bicycle, then learning to play piano. For humans, learning piano does not make you forget how to ride a bicycle. Neural networks are not like this. Training a network on piano data overwrites the weights that encoded bicycle riding.
The root cause: all knowledge in a neural network is stored in shared weights. Learning task B modifies the same weights that encoded task A via gradient descent. If the two tasks need different weight configurations, learning B destroys A.
Formal Setup and Notation
Consider a neural network with parameters $\theta$. Suppose we have two tasks, A and B, with data $\mathcal{D}_A$ and $\mathcal{D}_B$ and losses $L_A(\theta)$ and $L_B(\theta)$.
Sequential training: First train on $\mathcal{D}_A$ to get $\theta_A^*$. Then fine-tune on $\mathcal{D}_B$ starting from $\theta_A^*$ to get $\theta_B^*$.
Catastrophic forgetting: $L_A(\theta_B^*) \gg L_A(\theta_A^*)$. Performance on task A degrades severely after learning task B.
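The sequential-training setup above can be reproduced with a toy model. The sketch below (all names illustrative, not from any library) trains a one-parameter linear model on task A ($y = 2x$), then fine-tunes it on a conflicting task B ($y = -x$); the task A loss goes from near zero to large.

```python
# Minimal sketch of catastrophic forgetting with a 1-parameter linear model.
# Task A wants w = 2 (fit y = 2x); task B wants w = -1 (fit y = -x).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)

def loss(w, x, y):
    return float(np.mean((w * x - y) ** 2))

def train(w, x, y, lr=0.1, steps=200):
    # Plain gradient descent on the mean squared error
    for _ in range(steps):
        grad = np.mean(2 * (w * x - y) * x)
        w -= lr * grad
    return w

y_a, y_b = 2.0 * x, -1.0 * x          # task A and task B targets

w_a = train(0.0, x, y_a)              # theta_A*: near-zero loss on task A
loss_a_before = loss(w_a, x, y_a)

w_b = train(w_a, x, y_b)              # fine-tune on B starting from theta_A*
loss_a_after = loss(w_b, x, y_a)      # task A loss after learning B

print(loss_a_before, loss_a_after)    # loss on A jumps after fine-tuning
```

Because the single shared weight cannot satisfy both tasks, fine-tuning on B drives it all the way to B's optimum, which is exactly the weight-level story told above.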
Core Definitions
Stability-Plasticity Dilemma
A learning system must balance two competing objectives:
- Plasticity: The ability to learn new information from new data
- Stability: The ability to retain previously learned knowledge
Maximum plasticity (unconstrained fine-tuning) causes catastrophic forgetting. Maximum stability (frozen weights) prevents learning anything new. Every continual learning method is a different point on this tradeoff.
Continual Learning (Lifelong Learning)
A learning setting where a model must learn from a sequence of tasks without access to data from previous tasks when learning the current task. The model must perform well on all tasks after training on the full sequence.
Task-Incremental vs. Class-Incremental
- Task-incremental: The model knows which task it is being evaluated on (e.g., receives a task ID). Easier because the model can route to task-specific components.
- Class-incremental: The model must distinguish between all classes from all tasks without knowing the task ID. Much harder because old classes must compete with new classes in a shared output space.
Why Forgetting Happens: The Weight-Level View
Consider the loss landscape. After training on task A, the parameters sit in a region of low $L_A$. When we optimize $L_B$ starting from $\theta_A^*$, the gradient of $L_B$ pushes $\theta$ away from $\theta_A^*$ toward regions where $L_B$ is low. In general, these regions have high $L_A$.
The severity of forgetting depends on:
- Task similarity: If tasks A and B are similar, the low-loss regions overlap and forgetting is mild. If they are dissimilar, the regions are far apart and forgetting is catastrophic.
- Network capacity: Larger networks can potentially represent both tasks in different subnetworks, reducing interference.
- Training duration: The longer you train on B, the further $\theta$ moves from $\theta_A^*$, and the worse forgetting becomes.
Mitigation Strategies
Regularization-Based: Elastic Weight Consolidation (EWC)
Elastic Weight Consolidation
Statement
After learning task A with optimal parameters $\theta_A^*$, the EWC objective for learning task B is:
$L(\theta) = L_B(\theta) + \frac{\lambda}{2} \sum_i F_i \, (\theta_i - \theta_{A,i}^*)^2$
where $F_i$ is the $i$-th diagonal entry of the Fisher information matrix evaluated at $\theta_A^*$:
$F_i = \mathbb{E}_{x \sim \mathcal{D}_A,\, y \sim p_\theta(\cdot \mid x)} \left[ \left( \frac{\partial \log p_\theta(y \mid x)}{\partial \theta_i} \right)^2 \right] \Bigg|_{\theta = \theta_A^*}$
Intuition
EWC penalizes moving away from , but not equally for all parameters. Parameters that are important for task A (high Fisher information) get a strong penalty. Parameters that are unimportant for task A (low Fisher information) are free to change for task B.
The Fisher information measures how much the output distribution changes when you perturb a parameter. If $F_i$ is large, then $\theta_i$ strongly determines task A predictions, and changing it will damage task A performance. If $F_i$ is small, the parameter is redundant for task A and can be repurposed for task B.
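A toy numeric sketch of this intuition (assumptions: a linear model with squared loss, where the diagonal Fisher reduces to $F_i \propto \mathbb{E}[x_i^2]$; names and constants are illustrative):

```python
# EWC sketch: penalize movement of parameters in proportion to their
# diagonal Fisher information from task A.
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 2
X = rng.normal(size=(n, d))
X[:, 1] *= 0.05                        # feature 1 barely matters for task A
y_a = 2.0 * X[:, 0]                    # task A: y depends only on feature 0

def grad_mse(w, X, y):
    return 2 * X.T @ (X @ w - y) / len(X)

# Train on task A to get theta_A*
w = np.zeros(d)
for _ in range(500):
    w -= 0.1 * grad_mse(w, X, y_a)
w_a_star = w.copy()

# Diagonal Fisher at theta_A*: for this model, F_i = E[x_i^2] (up to scale)
F = np.mean(X ** 2, axis=0)

# Task B uses only feature 1
Xb = rng.normal(size=(n, d))
y_b = 3.0 * Xb[:, 1]

lam = 100.0
w = w_a_star.copy()
for _ in range(500):
    # Task B gradient plus the EWC quadratic penalty gradient
    g = grad_mse(w, Xb, y_b) + lam * F * (w - w_a_star)
    w -= 0.01 * g

# Parameter 0 (important for A, high Fisher) barely moves;
# parameter 1 (unimportant for A, low Fisher) is repurposed for B.
print(abs(w[0] - w_a_star[0]), abs(w[1] - w_a_star[1]))
```

The high-Fisher weight stays pinned near its task A value while the low-Fisher weight moves freely, which is exactly the selective penalty EWC is designed to apply.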
Why It Matters
EWC is one of the most influential theoretical contributions to continual learning. It provides a principled, Bayesian justification for the penalty: the Fisher information is the precision (inverse variance) of the approximate Gaussian posterior over parameters after observing task A data. The EWC penalty is equivalent to using this posterior as a prior for task B.
Failure Mode
EWC assumes a diagonal approximation to the Fisher information, ignoring parameter correlations. For large networks, this is a severe approximation. EWC also accumulates penalties as the number of tasks grows: after $T$ tasks, the penalty has $T$ terms, and eventually all parameters are heavily constrained, leaving no room for new learning. This is called the "ossification" problem.
Replay-Based: Experience Replay
Store a small buffer of examples from previous tasks. When learning task B, mix examples from the buffer with task B data. This directly prevents forgetting by keeping old data in the training stream.
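A toy sketch of the buffer-mixing idea (assumptions: a 2-feature linear model with deliberately conflicting tasks, and a buffer upweighted to balance old and new data; all names illustrative):

```python
# Exact replay sketch: fine-tuning on task B with vs. without a replay
# buffer of task A examples. The tasks conflict, so plain fine-tuning
# forgets task A badly; replay retains much more of it.
import numpy as np

rng = np.random.default_rng(2)
n = 400
Xa = rng.normal(size=(n, 2)); ya = 2.0 * Xa[:, 0]   # task A: y = 2*x0
Xb = rng.normal(size=(n, 2)); yb = 3.0 * Xb[:, 1]   # task B: y = 3*x1

def sgd(w, X, y, lr=0.05, steps=400):
    for _ in range(steps):
        w = w - lr * 2 * X.T @ (X @ w - y) / len(X)
    return w

def loss_a(w):
    return float(np.mean((Xa @ w - ya) ** 2))

w_a = sgd(np.zeros(2), Xa, ya)                      # theta_A*

# Plain fine-tuning on task B: forgets task A
w_plain = sgd(w_a.copy(), Xb, yb)

# Replay: small buffer of task A examples, tiled to balance task B data
buf_X, buf_y = Xa[:40], ya[:40]
Xm = np.vstack([Xb, np.tile(buf_X, (10, 1))])
ym = np.concatenate([yb, np.tile(buf_y, 10)])
w_replay = sgd(w_a.copy(), Xm, ym)

print(loss_a(w_plain), loss_a(w_replay))            # replay keeps A loss lower
```

Note the tradeoff: replay does not eliminate forgetting here, because the shared weights still cannot fit both tasks perfectly; it trades some task B performance for much better task A retention.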
Variants:
- Exact replay: Store and replay actual examples from old tasks. Simple but requires storage and may raise privacy concerns.
- Generative replay: Train a generative model to produce synthetic examples from old tasks. No storage needed, but the generative model itself can suffer from forgetting.
- Gradient episodic memory (GEM): Use stored examples to constrain gradients: project the task B gradient to ensure it does not increase loss on stored examples from task A.
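The GEM projection in the single-constraint case is simple enough to sketch directly (a hedged illustration, not the full quadratic-program formulation used with many memory constraints):

```python
# GEM-style gradient projection, single-constraint case: if the task B
# gradient conflicts with the gradient on stored task A examples,
# project it onto the non-conflicting halfspace.
import numpy as np

def gem_project(g_new, g_mem):
    """Return g_new unchanged if <g_new, g_mem> >= 0; otherwise project
    g_new so the step no longer increases memory loss to first order."""
    dot = g_new @ g_mem
    if dot >= 0:
        return g_new                      # no conflict, leave unchanged
    return g_new - (dot / (g_mem @ g_mem)) * g_mem

# Conflicting case: stepping along -g_new would raise the memory loss
g_new = np.array([1.0, -2.0])
g_mem = np.array([1.0, 1.0])              # g_new @ g_mem = -1 < 0: conflict

g_proj = gem_project(g_new, g_mem)
print(g_proj, g_proj @ g_mem)             # projected gradient is orthogonal
```

After projection the inner product with the memory gradient is zero, so a small step along the projected direction leaves the task A loss unchanged to first order.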
Architecture-Based: Progressive Neural Networks
Allocate a separate set of parameters for each task. Old parameters are frozen; new parameters are added with lateral connections to old ones.
- Progressive networks (Rusu et al., 2016): Add a new column of layers for each task, with lateral connections from all previous columns. Forgetting is impossible (old weights are frozen), but the model grows linearly with the number of tasks.
- PackNet: Prune the network after each task, freeing up capacity for new tasks. Avoids the growth problem but requires a pruning strategy.
- Supermasks / lottery tickets: Find different binary masks over a shared set of weights for different tasks. Each task uses a different subnet.
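The column-and-lateral-connection structure of a progressive network can be sketched minimally (assumptions: one hidden layer per column, ReLU activations, random untrained weights purely for illustration):

```python
# Structural sketch of a progressive network with two columns.
import numpy as np

rng = np.random.default_rng(3)
relu = lambda z: np.maximum(z, 0.0)

d_in, d_h = 4, 8

# Column 1: trained on task A, then frozen (never updated again)
W1 = rng.normal(size=(d_h, d_in))
V1 = rng.normal(size=(1, d_h))

def column1(x):
    h = relu(W1 @ x)
    return h, V1 @ h                  # hidden features and task A output

# Column 2: fresh parameters for task B, plus a lateral connection U
W2 = rng.normal(size=(d_h, d_in))
U = rng.normal(size=(d_h, d_h))       # lateral: reads column 1's hidden layer
V2 = rng.normal(size=(1, d_h))

def column2(x):
    h1, _ = column1(x)                # frozen features from the old column
    h2 = relu(W2 @ x + U @ h1)        # new features reuse old ones laterally
    return V2 @ h2                    # task B output

x = rng.normal(size=d_in)
_, out_a = column1(x)                 # task A path is untouched by task B
out_b = column2(x)
```

Because task B training would only update `W2`, `U`, and `V2`, the task A path through column 1 is bit-for-bit unchanged, which is why forgetting is impossible at the cost of linear parameter growth.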
Connection to LLM Fine-Tuning
Catastrophic forgetting has direct implications for LLM deployment:
Instruction tuning destroys base capabilities. When you fine-tune a base LLM on instruction-following data, the model can lose some of its original capabilities (e.g., factual knowledge, reasoning). This is why careful data mixing and evaluation on diverse benchmarks are essential during fine-tuning.
Safety alignment can be undone. If a model is aligned via RLHF or DPO to refuse harmful requests, subsequent fine-tuning on non-safety data can undo this alignment. This is a security concern: an adversary who can fine-tune a model can potentially remove safety guardrails. This motivates research on robust alignment that survives fine-tuning.
LoRA and parameter-efficient fine-tuning (PEFT). Methods like LoRA (Low-Rank Adaptation) mitigate forgetting by only modifying a small number of parameters (low-rank updates to attention matrices). The base model weights are frozen, so forgetting of base capabilities is reduced. This is a practical form of the architecture-based approach.
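The core LoRA mechanism can be sketched in a few lines (assumptions: a single frozen weight matrix `W0` and rank-`r` factors `A` and `B` with the standard zero initialization of `B`; dimensions are illustrative):

```python
# LoRA-style low-rank adaptation sketch: the frozen base weight W0 is
# augmented with a trainable rank-r update B @ A.
import numpy as np

rng = np.random.default_rng(4)
d, k, r = 64, 64, 4                   # r << min(d, k)

W0 = rng.normal(size=(d, k))          # frozen pretrained weight
A = rng.normal(size=(r, k)) * 0.01    # trainable low-rank factor
B = np.zeros((d, r))                  # zero init: the update starts as a no-op

def forward(x):
    return W0 @ x + B @ (A @ x)       # W0 x + (B A) x, rank(B A) <= r

x = rng.normal(size=k)
# With B = 0, the adapted model exactly matches the frozen base model
print(np.allclose(forward(x), W0 @ x))
```

Only `A` and `B` (here 2 * 64 * 4 = 512 values versus 4096 in `W0`) would receive gradients during fine-tuning, so the base model's knowledge in `W0` cannot be overwritten.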
Common Confusions
Forgetting is not the same as overfitting
Overfitting means the model memorizes training data and fails to generalize to test data from the same distribution. Forgetting means the model loses performance on a different task it previously learned. A model can overfit to task B while simultaneously forgetting task A, but these are distinct phenomena with different causes and remedies.
Larger models do not automatically solve forgetting
While larger models have more capacity and can potentially represent multiple tasks, standard fine-tuning still causes forgetting in large models. The issue is not lack of capacity but the optimization process: gradients from the new task push shared weights away from the old optimum. LLMs with hundreds of billions of parameters still suffer from catastrophic forgetting during fine-tuning.
Multitask learning is not the same as continual learning
In multitask learning, you have simultaneous access to data from all tasks and train on them together. This avoids forgetting entirely but requires all data upfront. Continual learning requires learning tasks sequentially without revisiting old data. The distinction is about data availability, not model architecture.
Summary
- Catastrophic forgetting: fine-tuning on new data destroys old knowledge
- Root cause: shared weights store all knowledge; updating for task B overwrites task A
- Stability-plasticity dilemma: you cannot maximize both
- EWC: penalize changes to important parameters (measured by Fisher information)
- Replay: mix old examples into new training
- Progressive networks: freeze old, add new
- LoRA/PEFT: modify few parameters, preserve base model
- Safety implication: fine-tuning can undo alignment
Exercises
Problem
Explain why EWC uses the Fisher information matrix rather than a uniform penalty on all parameter changes. What would happen if you used a uniform penalty instead?
Problem
After learning tasks A, B, and C sequentially with EWC, the penalty for task D has three terms (one for each previous task). Explain the "ossification" problem and propose a modification to address it.
Problem
If an adversary can fine-tune a safety-aligned LLM on a small dataset of harmful examples, the alignment may be removed. Propose a defense strategy that makes the alignment robust to fine-tuning, and discuss its limitations.
References
Canonical:
- McCloskey & Cohen, "Catastrophic Interference in Connectionist Networks" (1989)
- Kirkpatrick et al., "Overcoming Catastrophic Forgetting in Neural Networks" (EWC, 2017)
Current:
- Rusu et al., "Progressive Neural Networks" (2016)
- Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models" (2021)
- De Lange et al., "A Continual Learning Survey" (2021)
Next Topics
The natural next steps from catastrophic forgetting:
- Connections to Bayesian inference and posterior approximation
- Implications for LLM alignment and safety
Last reviewed: April 2026