
LLM Construction

Fine-Tuning and Adaptation

Methods for adapting pretrained models to new tasks: full fine-tuning, feature extraction, LoRA, QLoRA, adapter layers, and prompt tuning. When to use each, and how catastrophic forgetting constrains adaptation.

Core · Tier 1 · Current · ~55 min

Why This Matters

Pretraining a large language model costs millions of dollars and requires trillions of tokens. Adaptation takes a pretrained model and specializes it for a specific task or domain at a fraction of the cost. This is the practical realization of transfer learning: knowledge acquired during pretraining transfers to downstream tasks. The central question is: how do you update a model with billions of parameters efficiently, without destroying the knowledge learned during pretraining?

[Diagram: a pretrained LLM (GPT, Llama, etc.) feeds into four adaptation methods — full fine-tune (update all params), LoRA (low-rank adapters), adapter layers (small inserted modules), prompt tuning (learned prefix tokens) — producing a deployed task model.]

Full fine-tuning updates every parameter. This works but requires storing a separate copy of the full model for each task, and risks catastrophic forgetting of pretrained knowledge. Parameter-efficient methods (LoRA, adapters, prompt tuning) modify only a small fraction of parameters, reducing memory, storage, and forgetting risk.

Formal Setup

Let $\theta_0 \in \mathbb{R}^d$ be the pretrained model parameters, where $d$ can be in the billions. Let $\mathcal{D}_{\text{task}}$ be the task-specific dataset. The goal is to find $\theta^*$ that performs well on the task while staying close (in some sense) to $\theta_0$.

Core Definitions

Definition

Full Fine-Tuning

Update all parameters $\theta$ by minimizing the task loss:

$$\theta^* = \arg\min_{\theta} \mathcal{L}(\theta; \mathcal{D}_{\text{task}})$$

initialized at $\theta = \theta_0$. Every parameter in the model is trainable. The number of trainable parameters equals $d$.
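As a toy sketch (the variable names and the quadratic loss below are invented purely for illustration), full fine-tuning is just gradient descent over every entry of $\theta$, starting from the pretrained $\theta_0$:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1_000
theta0 = rng.normal(size=d)        # pretrained parameters theta_0
theta = theta0.copy()              # full fine-tuning: all d entries are trainable

task_optimum = theta0 + 0.1        # toy task whose optimum sits near theta_0

def grad(theta):
    # gradient of the toy quadratic loss 0.5 * ||theta - task_optimum||^2
    return theta - task_optimum

lr = 1e-2
for _ in range(2000):
    theta -= lr * grad(theta)      # every parameter receives an update

# all d parameters have moved toward the task optimum
print(np.abs(theta - task_optimum).max() < 1e-4)  # True
```

Note that the model ends up near the task optimum only because the toy optimum is close to $\theta_0$; nothing in the procedure itself prevents large drift away from the pretrained weights.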

Definition

Feature Extraction (Linear Probing)

Freeze all pretrained parameters $\theta_0$. Add a new head $W_{\text{head}} \in \mathbb{R}^{k \times h}$ (where $k$ is the number of classes and $h$ is the hidden dimension) and train only $W_{\text{head}}$:

$$W_{\text{head}}^* = \arg\min_{W} \mathcal{L}(W; f_{\theta_0}, \mathcal{D}_{\text{task}})$$

Trainable parameters: $k \times h$, typically a tiny fraction of $d$.
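A quick back-of-the-envelope check of that fraction, assuming an illustrative 7B-parameter backbone with $h = 4096$ and a 10-class head (both values hypothetical):

```python
h, k = 4096, 10                # hidden dim and number of classes (illustrative)
d = 7_000_000_000              # total parameters of a hypothetical 7B backbone

head_params = k * h            # only the new head W_head is trained
fraction = head_params / d

print(head_params)             # 40960
print(f"{fraction:.2e}")       # 5.85e-06 -> well under a millionth of d
```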

Definition

LoRA (Low-Rank Adaptation)

For a pretrained weight matrix $W_0 \in \mathbb{R}^{m \times n}$, LoRA adds a low-rank update:

$$W = W_0 + \Delta W = W_0 + BA$$

where $B \in \mathbb{R}^{m \times r}$, $A \in \mathbb{R}^{r \times n}$, and $r \ll \min(m, n)$. Only $A$ and $B$ are trained. $W_0$ is frozen.

Trainable parameters per layer: $r(m + n)$ instead of $mn$.
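A minimal NumPy sketch of the LoRA parametrization (illustrative sizes, not a reference implementation). With the standard initialization — $B = 0$, Gaussian $A$ — the layer starts out identical to the frozen base layer:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 64, 32, 4

W0 = rng.normal(size=(m, n))              # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, n))   # trained, Gaussian init
B = np.zeros((m, r))                      # trained, zero init -> Delta W = 0

def lora_forward(x):
    # effective weight is W0 + B @ A; only A and B would receive gradients
    return x @ (W0 + B @ A).T

x = rng.normal(size=(8, n))
assert np.allclose(lora_forward(x), x @ W0.T)  # matches base model at init

trainable = A.size + B.size   # r(m + n) = 4 * (64 + 32)
full = W0.size                # m * n
print(trainable, full)        # 384 2048
```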

Definition

QLoRA

LoRA applied to a quantized base model. The pretrained weights $W_0$ are stored in 4-bit precision (NormalFloat4). The low-rank matrices $A$ and $B$ are computed in higher precision (BFloat16). This reduces memory for the frozen weights by approximately 4x compared to FP16.
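The memory claim can be sanity-checked with simple arithmetic (7B parameters is an illustrative size; real deployments also need memory for activations, the LoRA matrices, and optimizer states):

```python
params = 7e9                     # hypothetical 7B-parameter base model

fp16_gb = params * 2 / 1e9       # FP16: 2 bytes per parameter
nf4_gb = params * 0.5 / 1e9      # NF4: ~0.5 bytes per parameter

print(fp16_gb)                   # 14.0 GB for the frozen weights
print(nf4_gb)                    # 3.5 GB -> roughly a 4x reduction
```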

Definition

Adapter Layers

Small bottleneck modules inserted between frozen transformer layers. Each adapter has a down-projection $W_{\text{down}} \in \mathbb{R}^{r \times h}$, a nonlinearity $\sigma$, and an up-projection $W_{\text{up}} \in \mathbb{R}^{h \times r}$:

$$\text{Adapter}(x) = x + W_{\text{up}} \, \sigma(W_{\text{down}} \, x)$$

Trainable parameters per adapter: $2rh$. The residual connection ensures the adapter can learn the identity function.
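A NumPy sketch of one adapter block, with illustrative sizes, ReLU standing in for $\sigma$, and a zero-initialized up-projection (a common choice, assumed here) so the block starts as the identity:

```python
import numpy as np

rng = np.random.default_rng(1)
h, r = 512, 16   # hidden size and bottleneck dim (illustrative)

W_down = rng.normal(scale=0.02, size=(r, h))  # trained down-projection
W_up = np.zeros((h, r))   # zero init (assumed) -> adapter starts as identity

def adapter(x):
    # x: (batch, h); residual plus up(sigma(down(x)))
    z = np.maximum(x @ W_down.T, 0.0)   # sigma = ReLU in this sketch
    return x + z @ W_up.T

x = rng.normal(size=(4, h))
assert np.allclose(adapter(x), x)  # identity at init, thanks to W_up = 0

print(W_down.size + W_up.size)     # 2rh = 16384 trainable params per adapter
```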

Definition

Prompt Tuning

Prepend $k$ learnable embedding vectors $P \in \mathbb{R}^{k \times h}$ to the input sequence. The model parameters are entirely frozen. Only $P$ is trained.

Trainable parameters: $kh$. For $k = 20$ and $h = 4096$, this is 81,920 parameters, compared to billions for full fine-tuning.
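In NumPy terms (a sketch only; a real implementation would prepend $P$ at the embedding layer of a frozen transformer and backpropagate only into $P$):

```python
import numpy as np

rng = np.random.default_rng(2)
k, h = 20, 4096   # soft-prompt length and hidden dim, as in the text

P = rng.normal(scale=0.02, size=(k, h))  # the only trainable parameters

def prepend_soft_prompt(token_embeddings):
    # token_embeddings: (seq_len, h); the k learned vectors go in front
    return np.concatenate([P, token_embeddings], axis=0)

x = rng.normal(size=(10, h))      # embeddings of a 10-token input
out = prepend_soft_prompt(x)
print(out.shape)                  # (30, 4096): 20 soft tokens + 10 real ones
print(P.size)                     # 81920 trainable parameters
```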

Main Theorems

Proposition

LoRA Expressiveness Bound

Statement

Let $W_0 \in \mathbb{R}^{m \times n}$ be the pretrained weight. The set of weight matrices reachable by LoRA with rank $r$ is:

$$\{W_0 + \Delta W : \operatorname{rank}(\Delta W) \leq r\}$$

The matrices in this set whose update has rank exactly $r$ form a smooth manifold of dimension $r(m + n - r)$ embedded in $\mathbb{R}^{m \times n}$. When $r = \min(m, n)$, LoRA can represent any weight matrix, recovering full fine-tuning.

Intuition

LoRA restricts weight updates to a low-dimensional subspace. The hypothesis is that the task-specific weight change $\Delta W$ has low intrinsic rank: most of the adaptation can be captured by a few directions in weight space. Empirically, Hu et al. (2021) found that $r = 4$ or $r = 8$ suffices for many NLP tasks, suggesting the adaptation subspace is far smaller than the full parameter space.

Proof Sketch

The set of $m \times n$ matrices of rank at most $r$ is a well-studied algebraic variety. Its smooth part (matrices of rank exactly $r$) has dimension $r(m + n - r)$: the parametrization $\Delta W = BA$, with $B \in \mathbb{R}^{m \times r}$ and $A \in \mathbb{R}^{r \times n}$, has $r(m + n)$ entries, minus $r^2$ dimensions for the symmetry $BA = (BQ)(Q^{-1}A)$ for any invertible $Q \in \mathbb{R}^{r \times r}$.

Why It Matters

This explains why LoRA works: if the true adaptation has low rank, LoRA with sufficient $r$ captures it exactly. The parameter savings are substantial. For $m = n = 4096$ and $r = 8$: LoRA uses $8 \times 8192 = 65{,}536$ parameters instead of $4096^2 = 16{,}777{,}216$, a 256x reduction.
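Both claims are easy to check numerically: any product $BA$ has rank at most $r$, and the quoted parameter counts follow directly (the small sizes below are illustrative; the $4096$/$r = 8$ figures come from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 64, 48, 8

B = rng.normal(size=(m, r))
A = rng.normal(size=(r, n))
delta_W = B @ A                        # the LoRA update

print(np.linalg.matrix_rank(delta_W))  # never exceeds r = 8

# parameter counts quoted in the text for m = n = 4096, r = 8
lora_params = r * (4096 + 4096)        # 65536
full_params = 4096 * 4096              # 16777216
print(full_params // lora_params)      # 256
```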

Failure Mode

If the required adaptation has high intrinsic rank, a small $r$ is insufficient and performance degrades. This can happen when the task distribution is very different from the pretraining distribution, requiring changes in many independent directions of weight space. In such cases, full fine-tuning or a higher $r$ is needed.

When to Use Each Method

| Method | Trainable Params | Memory | Best When |
|---|---|---|---|
| Full fine-tuning | All $d$ | High (full model + optimizer states) | Large task dataset, sufficient compute |
| Feature extraction | $kh$ | Low (frozen backbone) | Very small dataset, avoiding overfitting |
| LoRA | $r(m+n)$ per layer | Medium (frozen weights + small adapters) | Moderate dataset, need parameter efficiency |
| QLoRA | Same as LoRA | Low (4-bit base + small adapters) | Large model, limited GPU memory |
| Adapters | $2rh$ per layer | Medium | Multiple tasks, modular deployment |
| Prompt tuning | $kh$ | Low | Very parameter-efficient, large models |

Catastrophic Forgetting

Watch Out

Fine-tuning does not just add new knowledge

Fine-tuning modifies the same parameters that encode pretrained knowledge. Large gradient updates can overwrite representations learned during pretraining. This is catastrophic forgetting: the model improves on the task but degrades on capabilities it had before fine-tuning. A model fine-tuned on medical text may lose its ability to write code.

Three factors control forgetting severity:

  1. Learning rate: Fine-tuning learning rates should be 10x to 100x smaller than pretraining rates. Typical pretraining LR: $10^{-4}$ to $3 \times 10^{-4}$. Typical fine-tuning LR: $10^{-5}$ to $5 \times 10^{-5}$.
  2. Parameter count: Methods that update fewer parameters (LoRA, adapters) cause less forgetting because most of the model is frozen.
  3. Task similarity: Fine-tuning on a distribution close to the pretraining data causes less forgetting than fine-tuning on a very different distribution.

Common Confusions

Watch Out

LoRA initialization matters

$A$ is initialized from a random Gaussian; $B$ is initialized to zero. This ensures $\Delta W = BA = 0$ at the start of training, so the model begins with exactly the pretrained weights. If both $A$ and $B$ were randomly initialized, the initial model would be corrupted by a random perturbation.

Watch Out

Prompt tuning is not prompt engineering

Prompt engineering selects discrete tokens as instructions (natural language). Prompt tuning optimizes continuous vectors in embedding space via gradient descent. The learned soft prompts are not interpretable as natural language and occupy a different part of the embedding space than any real token.

Watch Out

Feature extraction is underpowered for complex tasks

Linear probing tests what the pretrained representations already encode. It cannot learn new features. If the task requires representations that the pretrained model does not already compute, feature extraction will fail regardless of dataset size. Full fine-tuning or LoRA can reshape the internal representations.

Key Takeaways

  • Full fine-tuning is the most expressive but most expensive and most prone to forgetting
  • LoRA constrains weight updates to a low-rank subspace, reducing trainable parameters by 100x or more
  • QLoRA combines LoRA with 4-bit quantization for memory-efficient fine-tuning of large models
  • Adapter layers and prompt tuning offer alternative parameter-efficient approaches
  • Catastrophic forgetting is the central risk: use small learning rates and parameter-efficient methods to mitigate it
  • The right method depends on dataset size, compute budget, and how different the task is from pretraining

Exercises

ExerciseCore

Problem

A transformer layer has weight matrices $W_Q, W_K, W_V, W_O \in \mathbb{R}^{4096 \times 4096}$. You apply LoRA with rank $r = 8$ to all four matrices. How many trainable parameters does this add per layer? What fraction of the original parameters is this?

ExerciseAdvanced

Problem

Why does initializing $B = 0$ and $A \sim \mathcal{N}(0, \sigma^2)$ in LoRA ensure that the model starts at exactly the pretrained weights? What would go wrong if both $A$ and $B$ were initialized randomly?

ExerciseAdvanced

Problem

A 7B-parameter model uses FP16 (2 bytes per parameter). How much GPU memory does the base model require? With QLoRA using NF4 (0.5 bytes per parameter for the base) and rank $r = 16$ LoRA adapters on all linear layers (approximately 200 matrices of size $4096 \times 4096$), what is the total memory?


References

Canonical:

  • Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models" (2021)
  • Houlsby et al., "Parameter-Efficient Transfer Learning for NLP" (2019)
  • Dettmers et al., "QLoRA: Efficient Finetuning of Quantized LLMs" (2023)

Current:

  • Lester, Al-Rfou, Constant, "The Power of Scale for Parameter-Efficient Prompt Tuning" (2021)
  • Lialin, Deshpande, Rumshisky, "Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning" (2023)

Last reviewed: April 2026
