LLM Construction
Fine-Tuning and Adaptation
Methods for adapting pretrained models to new tasks: full fine-tuning, feature extraction, LoRA, QLoRA, adapter layers, and prompt tuning. When to use each, and how catastrophic forgetting constrains adaptation.
Why This Matters
Pretraining a large language model costs millions of dollars and requires trillions of tokens. Adaptation takes a pretrained model and specializes it for a specific task or domain at a fraction of the cost. This is the practical realization of transfer learning: knowledge acquired during pretraining transfers to downstream tasks. The central question is: how do you update a model with billions of parameters efficiently, without destroying the knowledge learned during pretraining?
Full fine-tuning updates every parameter. This works but requires storing a separate copy of the full model for each task, and risks catastrophic forgetting of pretrained knowledge. Parameter-efficient methods (LoRA, adapters, prompt tuning) modify only a small fraction of parameters, reducing memory, storage, and forgetting risk.
Formal Setup
Let $\theta_0 \in \mathbb{R}^N$ be the pretrained model parameters, where $N$ can be in the billions. Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ be the task-specific dataset. The goal is to find parameters $\theta^*$ that perform well on the task while staying close (in some sense) to $\theta_0$.
Core Definitions
Full Fine-Tuning
Update all parameters by minimizing the task loss:

$$\theta^* = \arg\min_{\theta} \mathcal{L}(\theta; \mathcal{D}),$$

initialized at $\theta = \theta_0$. Every parameter in the model is trainable. The number of trainable parameters equals $N$.
Feature Extraction (Linear Probing)
Freeze all pretrained parameters $\theta_0$. Add a new head $W \in \mathbb{R}^{k \times d}$ (where $k$ is the number of classes and $d$ is the hidden dimension) and train only $W$:

$$\min_{W} \; \mathcal{L}\big(W f_{\theta_0}(x); \mathcal{D}\big).$$

Trainable parameters: $kd$, typically a tiny fraction of $N$.
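A minimal numpy sketch of linear probing, with a fixed random projection standing in for the frozen pretrained backbone (all names and sizes here are illustrative, not from any specific library):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen "backbone": a fixed random projection standing in
# for the pretrained feature map f_theta0(x). It is never updated.
d_in, d_hidden, k = 10, 32, 3
W_frozen = rng.normal(size=(d_hidden, d_in))
features = lambda x: np.tanh(W_frozen @ x)

# Trainable linear head W (k x d_hidden): the only parameters we learn.
W_head = np.zeros((k, d_hidden))

def predict(x):
    return W_head @ features(x)

# One SGD step on a squared-error loss touches only W_head.
x, y = rng.normal(size=d_in), np.array([1.0, 0.0, 0.0])
h = features(x)
grad = np.outer(predict(x) - y, h)   # dL/dW_head for L = 0.5*||Wh - y||^2
W_head -= 0.1 * grad

# Trainable parameter count is k*d_hidden, independent of backbone size.
print(W_head.size)  # 96
```

Note that the gradient never propagates into `W_frozen`: no new features can be learned, only a linear readout of the existing ones.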
LoRA (Low-Rank Adaptation)
For a pretrained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, LoRA adds a low-rank update:

$$W = W_0 + \Delta W = W_0 + BA,$$

where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$. Only $A$ and $B$ are trained; $W_0$ is frozen.

Trainable parameters per layer: $r(d + k)$ instead of $dk$.
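A numpy sketch of the LoRA parametrization and its parameter count (sizes are illustrative; the standard init of $A$ Gaussian, $B$ zero is assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 64, 4  # illustrative sizes; r << min(d, k)

W0 = rng.normal(size=(d, k))          # frozen pretrained weight
A = rng.normal(size=(r, k)) * 0.01    # trainable, Gaussian init
B = np.zeros((d, r))                  # trainable, zero init

W_eff = W0 + B @ A                    # effective weight in the forward pass

trainable = A.size + B.size           # r*(d + k)
full = W0.size                        # d*k
print(trainable, full)                # 512 4096
```

In practice the product $BA$ is never materialized separately during inference; it can be merged into $W_0$ after training at no extra latency cost.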
QLoRA
LoRA applied to a quantized base model. The pretrained weights $W_0$ are stored in 4-bit precision (NormalFloat4). The low-rank matrices $A$ and $B$ are kept in higher precision (BFloat16). This reduces memory for the frozen weights by approximately 4x compared to FP16.
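The memory saving follows from bytes-per-parameter arithmetic. A back-of-envelope sketch, ignoring quantization block constants, activations, and optimizer state:

```python
# Frozen base weights of a 7B-parameter model under two storage formats.
n_params = 7e9

fp16_gb = n_params * 2 / 1e9     # FP16: 2 bytes per parameter
nf4_gb = n_params * 0.5 / 1e9    # NF4: 4 bits = 0.5 bytes per parameter

print(fp16_gb, nf4_gb)           # 14.0 3.5
print(fp16_gb / nf4_gb)          # 4.0
```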
Adapter Layers
Small bottleneck modules inserted into frozen transformer layers. Each adapter has a down-projection $W_{\text{down}} \in \mathbb{R}^{m \times d}$, a nonlinearity $\sigma$, and an up-projection $W_{\text{up}} \in \mathbb{R}^{d \times m}$, applied with a residual connection:

$$h' = h + W_{\text{up}}\, \sigma(W_{\text{down}} h).$$

Trainable parameters per adapter: $2md$ (plus biases). The residual connection ensures the adapter can represent the identity function.
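A numpy sketch of the bottleneck forward pass (illustrative sizes; the up-projection is zero-initialized, a common choice so the adapter starts as the identity):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 8  # hidden size and bottleneck size (illustrative)

W_down = rng.normal(size=(m, d)) * 0.01  # trainable down-projection
W_up = np.zeros((d, m))                  # trainable up-projection, zero init

def adapter(h):
    # Residual bottleneck: h + W_up @ relu(W_down @ h)
    return h + W_up @ np.maximum(W_down @ h, 0.0)

h = rng.normal(size=d)
# With W_up = 0 the adapter is exactly the identity, so inserting it
# leaves the pretrained computation unchanged at the start of training.
print(np.allclose(adapter(h), h))  # True
```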
Prompt Tuning
Prepend $p$ learnable embedding vectors $P \in \mathbb{R}^{p \times d}$ to the input sequence. The model parameters $\theta_0$ are entirely frozen. Only $P$ is trained.

Trainable parameters: $pd$. For $p = 20$ and $d = 4096$, this is 81,920 parameters, compared to billions for full fine-tuning.
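A numpy sketch of soft-prompt prepending with the numbers from the text (the embedding values and input length are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p, d = 20, 4096        # number of soft-prompt vectors and embedding dim

P = rng.normal(size=(p, d)) * 0.02   # trainable soft-prompt embeddings

# Token embeddings for an input of length 7 (frozen, stand-in values).
X = rng.normal(size=(7, d))

# The model sees the soft prompts prepended to the real token embeddings.
X_in = np.concatenate([P, X], axis=0)

print(X_in.shape)   # (27, 4096)
print(P.size)       # 81920 trainable parameters
```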
Main Theorems
LoRA Expressiveness Bound
Statement
Let $W_0 \in \mathbb{R}^{d \times k}$ be the pretrained weight. The set of weight matrices reachable by LoRA with rank $r$ is:

$$\mathcal{S}_r = \{\, W_0 + BA : B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k} \,\}.$$

The updates $BA$ of rank exactly $r$ form a smooth manifold of dimension $r(d + k - r)$ embedded in $\mathbb{R}^{d \times k}$. When $r = \min(d, k)$, LoRA can represent any weight matrix, recovering full fine-tuning.
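Both claims can be checked numerically with numpy (illustrative sizes; the full-rank case uses an SVD factorization of an arbitrary target update):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 20, 30, 3

B = rng.normal(size=(d, r))
A = rng.normal(size=(r, k))

# Any LoRA update BA has rank at most r.
assert np.linalg.matrix_rank(B @ A) <= r

# With r = min(d, k), any target update Delta is reachable: factor it
# through its (thin) SVD, Delta = (U diag(s)) Vt, giving B and A of rank
# min(d, k).
Delta = rng.normal(size=(d, k))
U, s, Vt = np.linalg.svd(Delta, full_matrices=False)
B_full, A_full = U * s, Vt          # shapes (d, min(d,k)), (min(d,k), k)
assert np.allclose(B_full @ A_full, Delta)
```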
Intuition
LoRA restricts weight updates to a low-dimensional subspace. The hypothesis is that the task-specific weight change has low intrinsic rank: most of the adaptation can be captured by a few directions in weight space. Empirically, Hu et al. (2021) found that even $r = 1$ or $r = 2$ suffices for many NLP tasks, suggesting the adaptation subspace is far smaller than the full parameter space.
Proof Sketch
The set of matrices of rank at most $r$ is a well-studied algebraic variety. Its smooth part (matrices of rank exactly $r$) has dimension $r(d + k - r)$. This follows from the parametrization $BA$, where $B$ and $A$ together have $r(d + k)$ entries, minus $r^2$ for the symmetry $BA = (BG^{-1})(GA)$ for any invertible $G \in \mathbb{R}^{r \times r}$.
Why It Matters
This explains why LoRA works: if the true adaptation has low rank, LoRA with sufficient $r$ captures it exactly. The parameter savings are substantial. For $d = k = 4096$ and $r = 8$: LoRA uses $r(d + k) = 65{,}536$ parameters instead of $dk = 16{,}777{,}216$, a 256x reduction.
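The arithmetic behind the 256x figure:

```python
d = k = 4096
r = 8

lora_params = r * (d + k)   # parameters in B (d x r) plus A (r x k)
full_params = d * k         # parameters in the full weight matrix

print(lora_params, full_params, full_params // lora_params)
# 65536 16777216 256
```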
Failure Mode
If the required adaptation has high intrinsic rank, small $r$ is insufficient and performance degrades. This can happen when the task distribution is very different from the pretraining distribution, requiring changes in many independent directions of weight space. In such cases, full fine-tuning or a higher $r$ is needed.
When to Use Each Method
| Method | Trainable Params | Memory | Best When |
|---|---|---|---|
| Full fine-tuning | All ($N$) | High (full model + optimizer states) | Large task dataset, sufficient compute |
| Feature extraction | $kd$ (head only) | Low (frozen backbone) | Very small dataset, avoiding overfitting |
| LoRA | $r(d + k)$ per layer | Medium (frozen weights + small adapters) | Moderate dataset, need parameter efficiency |
| QLoRA | Same as LoRA | Low (4-bit base + small adapters) | Large model, limited GPU memory |
| Adapters | $2md$ per layer | Medium | Multiple tasks, modular deployment |
| Prompt tuning | $pd$ | Low | Very parameter-efficient, large models |
Catastrophic Forgetting
Fine-tuning does not just add new knowledge
Fine-tuning modifies the same parameters that encode pretrained knowledge. Large gradient updates can overwrite representations learned during pretraining. This is catastrophic forgetting: the model improves on the task but degrades on capabilities it had before fine-tuning. A model fine-tuned on medical text may lose its ability to write code.
Three factors control forgetting severity:
- Learning rate: Fine-tuning learning rates should be 10x to 100x smaller than pretraining rates. Typical pretraining LR: $10^{-4}$ to $10^{-3}$. Typical fine-tuning LR: $10^{-6}$ to $10^{-5}$.
- Parameter count: Methods that update fewer parameters (LoRA, adapters) cause less forgetting because most of the model is frozen.
- Task similarity: Fine-tuning on a distribution close to the pretraining data causes less forgetting than fine-tuning on a very different distribution.
Common Confusions
LoRA initialization matters
$A$ is initialized from a random Gaussian; $B$ is initialized to zero. This ensures $\Delta W = BA = 0$ at the start of training, so the model begins with exactly the pretrained weights. If both $A$ and $B$ were randomly initialized, the initial model would be corrupted by a random perturbation.
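A numpy sketch of both cases (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 32, 32, 4

W0 = rng.normal(size=(d, k))   # frozen pretrained weight
x = rng.normal(size=k)

# Standard LoRA init: A Gaussian, B zero -> BA = 0, output unchanged.
A = rng.normal(size=(r, k)) * 0.01
B = np.zeros((d, r))
assert np.allclose((W0 + B @ A) @ x, W0 @ x)

# If both were random, the pretrained forward pass would already be
# perturbed before any training happens.
B_bad = rng.normal(size=(d, r))
assert not np.allclose((W0 + B_bad @ A) @ x, W0 @ x)
```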
Prompt tuning is not prompt engineering
Prompt engineering selects discrete tokens as instructions (natural language). Prompt tuning optimizes continuous vectors in embedding space via gradient descent. The learned soft prompts are not interpretable as natural language and occupy a different part of the embedding space than any real token.
Feature extraction is underpowered for complex tasks
Linear probing tests what the pretrained representations already encode. It cannot learn new features. If the task requires representations that the pretrained model does not already compute, feature extraction will fail regardless of dataset size. Full fine-tuning or LoRA can reshape the internal representations.
Key Takeaways
- Full fine-tuning is the most expressive but most expensive and most prone to forgetting
- LoRA constrains weight updates to a low-rank subspace, reducing trainable parameters by 100x or more
- QLoRA combines LoRA with 4-bit quantization for memory-efficient fine-tuning of large models
- Adapter layers and prompt tuning offer alternative parameter-efficient approaches
- Catastrophic forgetting is the central risk: use small learning rates and parameter-efficient methods to mitigate it
- The right method depends on dataset size, compute budget, and how different the task is from pretraining
Exercises
Problem
A transformer layer has weight matrices $W_Q, W_K, W_V, W_O \in \mathbb{R}^{d \times d}$. You apply LoRA with rank $r$ to all four matrices. How many trainable parameters does this add per layer? What fraction of the original $4d^2$ parameters is this?
Problem
Why does initializing $A$ from a random Gaussian and $B = 0$ in LoRA ensure that the model starts at exactly the pretrained weights? What would go wrong if both $A$ and $B$ were initialized randomly?
Problem
A 7B parameter model uses FP16 (2 bytes per parameter). How much GPU memory does the base model require? With QLoRA using NF4 (0.5 bytes per parameter for the base) and rank-$r$ LoRA adapters on all linear layers (approximately 200 matrices of size $d \times d$), what is the total memory?
References
Canonical:
- Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models" (2021)
- Houlsby et al., "Parameter-Efficient Transfer Learning for NLP" (2019)
- Dettmers et al., "QLoRA: Efficient Finetuning of Quantized Language Models" (2023)
Current:
- Lester, Al-Rfou, Constant, "The Power of Scale for Parameter-Efficient Prompt Tuning" (2021)
- Lialin, Deshpande, Rumshisky, "Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning" (2023)
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)
- Transformer Architecture (Layer 4)
- Attention Mechanism Theory (Layer 4)
- Softmax and Numerical Stability (Layer 1)