
LLM Construction

Fine-Tuning and Adaptation

Methods for adapting pretrained models to new tasks: full fine-tuning, feature extraction, LoRA, QLoRA, adapter layers, and prompt tuning. When to use each, and how catastrophic forgetting constrains adaptation.

Core · Tier 1 · Current · ~55 min

Why This Matters

Pretraining a large language model costs millions of dollars and requires trillions of tokens. Adaptation takes a pretrained model and specializes it for a specific task or domain at a fraction of the cost. This is the practical realization of transfer learning: knowledge acquired during pretraining transfers to downstream tasks. The central question is: how do you update a model with billions of parameters efficiently, without destroying the knowledge learned during pretraining?

[Diagram: a pretrained LLM (GPT, Llama, etc.) feeds into four adaptation methods — full fine-tune (update all params), LoRA (low-rank adapters), adapter layers (small inserted modules), prompt tuning (learned prefix tokens) — producing a deployed task model.]

Full fine-tuning updates every parameter. This works but requires storing a separate copy of the full model for each task, and risks catastrophic forgetting of pretrained knowledge. Parameter-efficient methods (LoRA, adapters, prompt tuning) modify only a small fraction of parameters, reducing memory, storage, and forgetting risk.

Formal Setup

Let $\theta_0 \in \mathbb{R}^d$ be the pretrained model parameters, where $d$ can be in the billions. Let $\mathcal{D}_{\text{task}}$ be the task-specific dataset. The goal is to find $\theta^*$ that performs well on the task while staying close (in some sense) to $\theta_0$.

Core Definitions

Definition

Full Fine-Tuning

Update all parameters $\theta$ by minimizing the task loss:

$$\theta^* = \arg\min_{\theta} \mathcal{L}(\theta; \mathcal{D}_{\text{task}})$$

initialized at $\theta = \theta_0$. Every parameter in the model is trainable. The number of trainable parameters equals $d$.
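As a toy sketch (the variable names and the quadratic loss below are invented purely for illustration), full fine-tuning is just gradient descent over every entry of $\theta$, starting from the pretrained $\theta_0$:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1_000
theta0 = rng.normal(size=d)        # pretrained parameters theta_0
theta = theta0.copy()              # full fine-tuning: all d entries are trainable

task_optimum = theta0 + 0.1        # toy task whose optimum sits near theta_0

def grad(theta):
    # gradient of the toy quadratic loss 0.5 * ||theta - task_optimum||^2
    return theta - task_optimum

lr = 1e-2
for _ in range(2000):
    theta -= lr * grad(theta)      # every parameter receives an update

# all d parameters have moved toward the task optimum
print(np.abs(theta - task_optimum).max() < 1e-4)  # True
```

Note that the model ends up near the task optimum only because the toy optimum is close to $\theta_0$; nothing in the procedure itself prevents large drift away from the pretrained weights.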

Definition

Feature Extraction (Linear Probing)

Freeze all pretrained parameters $\theta_0$. Add a new head $W_{\text{head}} \in \mathbb{R}^{k \times h}$ (where $k$ is the number of classes and $h$ is the hidden dimension) and train only $W_{\text{head}}$:

$$W_{\text{head}}^* = \arg\min_{W} \mathcal{L}(W; f_{\theta_0}, \mathcal{D}_{\text{task}})$$

Trainable parameters: $k \times h$, typically a tiny fraction of $d$.
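A quick back-of-the-envelope check of that fraction, assuming an illustrative 7B-parameter backbone with $h = 4096$ and a 10-class head (both values hypothetical):

```python
h, k = 4096, 10                # hidden dim and number of classes (illustrative)
d = 7_000_000_000              # total parameters of a hypothetical 7B backbone

head_params = k * h            # only the new head W_head is trained
fraction = head_params / d

print(head_params)             # 40960
print(f"{fraction:.2e}")       # 5.85e-06 -> well under a millionth of d
```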

Definition

LoRA (Low-Rank Adaptation)

For a pretrained weight matrix $W_0 \in \mathbb{R}^{m \times n}$, LoRA adds a low-rank update:

$$W = W_0 + \Delta W = W_0 + BA$$

where $B \in \mathbb{R}^{m \times r}$, $A \in \mathbb{R}^{r \times n}$, and $r \ll \min(m, n)$. Only $A$ and $B$ are trained. $W_0$ is frozen.

Trainable parameters per layer: $r(m + n)$ instead of $mn$.
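A minimal NumPy sketch of the LoRA parametrization (illustrative sizes, not a reference implementation). With the standard initialization — $B = 0$, Gaussian $A$ — the layer starts out identical to the frozen base layer:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 64, 32, 4

W0 = rng.normal(size=(m, n))              # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, n))   # trained, Gaussian init
B = np.zeros((m, r))                      # trained, zero init -> Delta W = 0

def lora_forward(x):
    # effective weight is W0 + B @ A; only A and B would receive gradients
    return x @ (W0 + B @ A).T

x = rng.normal(size=(8, n))
assert np.allclose(lora_forward(x), x @ W0.T)  # matches base model at init

trainable = A.size + B.size   # r(m + n) = 4 * (64 + 32)
full = W0.size                # m * n
print(trainable, full)        # 384 2048
```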

Definition

QLoRA

LoRA applied to a quantized base model. The pretrained weights $W_0$ are stored in 4-bit precision (NormalFloat4). The low-rank matrices $A$ and $B$ are computed in higher precision (BFloat16). This reduces memory for the frozen weights by approximately 4x compared to FP16.
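The memory claim can be sanity-checked with simple arithmetic (7B parameters is an illustrative size; real deployments also need memory for activations, the LoRA matrices, and optimizer states):

```python
params = 7e9                     # hypothetical 7B-parameter base model

fp16_gb = params * 2 / 1e9       # FP16: 2 bytes per parameter
nf4_gb = params * 0.5 / 1e9      # NF4: ~0.5 bytes per parameter

print(fp16_gb)                   # 14.0 GB for the frozen weights
print(nf4_gb)                    # 3.5 GB -> roughly a 4x reduction
```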

Definition

Adapter Layers

Small bottleneck modules inserted between frozen transformer layers. Each adapter has a down-projection $W_{\text{down}} \in \mathbb{R}^{r \times h}$, a nonlinearity $\sigma$, and an up-projection $W_{\text{up}} \in \mathbb{R}^{h \times r}$:

$$\text{Adapter}(x) = x + W_{\text{up}} \, \sigma(W_{\text{down}} \, x)$$

Trainable parameters per adapter: $2rh$. The residual connection ensures the adapter can learn the identity function.
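A NumPy sketch of one adapter block, with illustrative sizes, ReLU standing in for $\sigma$, and a zero-initialized up-projection (a common choice, assumed here) so the block starts as the identity:

```python
import numpy as np

rng = np.random.default_rng(1)
h, r = 512, 16   # hidden size and bottleneck dim (illustrative)

W_down = rng.normal(scale=0.02, size=(r, h))  # trained down-projection
W_up = np.zeros((h, r))   # zero init (assumed) -> adapter starts as identity

def adapter(x):
    # x: (batch, h); residual plus up(sigma(down(x)))
    z = np.maximum(x @ W_down.T, 0.0)   # sigma = ReLU in this sketch
    return x + z @ W_up.T

x = rng.normal(size=(4, h))
assert np.allclose(adapter(x), x)  # identity at init, thanks to W_up = 0

print(W_down.size + W_up.size)     # 2rh = 16384 trainable params per adapter
```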

Definition

Prompt Tuning

Prepend $k$ learnable embedding vectors $P \in \mathbb{R}^{k \times h}$ to the input sequence. The model parameters are entirely frozen. Only $P$ is trained.

Trainable parameters: $kh$. For $k = 20$ and $h = 4096$, this is 81,920 parameters, compared to billions for full fine-tuning.
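In NumPy terms (a sketch only; a real implementation would prepend $P$ at the embedding layer of a frozen transformer and backpropagate only into $P$):

```python
import numpy as np

rng = np.random.default_rng(2)
k, h = 20, 4096   # soft-prompt length and hidden dim, as in the text

P = rng.normal(scale=0.02, size=(k, h))  # the only trainable parameters

def prepend_soft_prompt(token_embeddings):
    # token_embeddings: (seq_len, h); the k learned vectors go in front
    return np.concatenate([P, token_embeddings], axis=0)

x = rng.normal(size=(10, h))      # embeddings of a 10-token input
out = prepend_soft_prompt(x)
print(out.shape)                  # (30, 4096): 20 soft tokens + 10 real ones
print(P.size)                     # 81920 trainable parameters
```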

Main Theorems

Proposition

LoRA Expressiveness Bound

Statement

Let $W_0 \in \mathbb{R}^{m \times n}$ be the pretrained weight. The set of weight matrices reachable by LoRA with rank $r$ is:

$$\{W_0 + \Delta W : \operatorname{rank}(\Delta W) \leq r\}$$

The matrices in this set whose update has rank exactly $r$ form a smooth manifold of dimension $r(m + n - r)$ embedded in $\mathbb{R}^{m \times n}$. When $r = \min(m, n)$, LoRA can represent any weight matrix, recovering full fine-tuning.

Intuition

LoRA restricts weight updates to a low-dimensional subspace. The hypothesis is that the task-specific weight change $\Delta W$ has low intrinsic rank: most of the adaptation can be captured by a few directions in weight space. Empirically, Hu et al. (2021) found that $r = 4$ or $r = 8$ suffices for many NLP tasks, suggesting the adaptation subspace is far smaller than the full parameter space.

Proof Sketch

The set of $m \times n$ matrices of rank at most $r$ is a well-studied algebraic variety. Its smooth part (matrices of rank exactly $r$) has dimension $r(m + n - r)$: the parametrization $\Delta W = BA$, with $B \in \mathbb{R}^{m \times r}$ and $A \in \mathbb{R}^{r \times n}$, has $r(m + n)$ entries, minus $r^2$ dimensions for the symmetry $BA = (BQ)(Q^{-1}A)$ for any invertible $Q \in \mathbb{R}^{r \times r}$.

Why It Matters

This explains why LoRA works: if the true adaptation has low rank, LoRA with sufficient $r$ captures it exactly. The parameter savings are substantial. For $m = n = 4096$ and $r = 8$: LoRA uses $8 \times 8192 = 65{,}536$ parameters instead of $4096^2 = 16{,}777{,}216$, a 256x reduction.
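Both claims are easy to check numerically: any product $BA$ has rank at most $r$, and the quoted parameter counts follow directly (the small sizes below are illustrative; the $4096$/$r = 8$ figures come from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 64, 48, 8

B = rng.normal(size=(m, r))
A = rng.normal(size=(r, n))
delta_W = B @ A                        # the LoRA update

print(np.linalg.matrix_rank(delta_W))  # never exceeds r = 8

# parameter counts quoted in the text for m = n = 4096, r = 8
lora_params = r * (4096 + 4096)        # 65536
full_params = 4096 * 4096              # 16777216
print(full_params // lora_params)      # 256
```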

Failure Mode

If the required adaptation has high intrinsic rank, a small $r$ is insufficient and performance degrades. This can happen when the task distribution is very different from the pretraining distribution, requiring changes in many independent directions of weight space. In such cases, full fine-tuning or a higher $r$ is needed.

When to Use Each Method

| Method | Trainable Params | Memory | Best When |
|---|---|---|---|
| Full fine-tuning | All $d$ | High (full model + optimizer states) | Large task dataset, sufficient compute |
| Feature extraction | $kh$ | Low (frozen backbone) | Very small dataset, avoiding overfitting |
| LoRA | $r(m+n)$ per layer | Medium (frozen weights + small adapters) | Moderate dataset, need parameter efficiency |
| QLoRA | Same as LoRA | Low (4-bit base + small adapters) | Large model, limited GPU memory |
| Adapters | $2rh$ per layer | Medium | Multiple tasks, modular deployment |
| Prompt tuning | $kh$ | Low | Very parameter-efficient, large models |

Catastrophic Forgetting

Watch Out

Fine-tuning does not just add new knowledge

Fine-tuning modifies the same parameters that encode pretrained knowledge. Large gradient updates can overwrite representations learned during pretraining. This is catastrophic forgetting: the model improves on the task but degrades on capabilities it had before fine-tuning. A model fine-tuned on medical text may lose its ability to write code.

Three factors control forgetting severity:

  1. Learning rate: Fine-tuning learning rates should be 10x to 100x smaller than pretraining rates. Typical pretraining LR: $10^{-4}$ to $3 \times 10^{-4}$. Typical fine-tuning LR: $10^{-5}$ to $5 \times 10^{-5}$.
  2. Parameter count: Methods that update fewer parameters (LoRA, adapters) cause less forgetting because most of the model is frozen.
  3. Task similarity: Fine-tuning on a distribution close to the pretraining data causes less forgetting than fine-tuning on a very different distribution.

Common Confusions

Watch Out

LoRA initialization matters

$A$ is initialized from a random Gaussian; $B$ is initialized to zero. This ensures $\Delta W = BA = 0$ at the start of training, so the model begins with exactly the pretrained weights. If both $A$ and $B$ were randomly initialized, the initial model would be corrupted by a random perturbation.

Watch Out

Prompt tuning is not prompt engineering

Prompt engineering selects discrete tokens as instructions (natural language). Prompt tuning optimizes continuous vectors in embedding space via gradient descent. The learned soft prompts are not interpretable as natural language and occupy a different part of the embedding space than any real token.

Watch Out

Feature extraction is underpowered for complex tasks

Linear probing tests what the pretrained representations already encode. It cannot learn new features. If the task requires representations that the pretrained model does not already compute, feature extraction will fail regardless of dataset size. Full fine-tuning or LoRA can reshape the internal representations.

Key Takeaways

  • Full fine-tuning is the most expressive but most expensive and most prone to forgetting
  • LoRA constrains weight updates to a low-rank subspace, reducing trainable parameters by 100x or more
  • QLoRA combines LoRA with 4-bit quantization for memory-efficient fine-tuning of large models
  • Adapter layers and prompt tuning offer alternative parameter-efficient approaches
  • Catastrophic forgetting is the central risk: use small learning rates and parameter-efficient methods to mitigate it
  • The right method depends on dataset size, compute budget, and how different the task is from pretraining

Exercises

ExerciseCore

Problem

A transformer layer has weight matrices $W_Q, W_K, W_V, W_O \in \mathbb{R}^{4096 \times 4096}$. You apply LoRA with rank $r = 8$ to all four matrices. How many trainable parameters does this add per layer? What fraction of the original parameters is this?

ExerciseAdvanced

Problem

Why does initializing $B = 0$ and $A \sim \mathcal{N}(0, \sigma^2)$ in LoRA ensure that the model starts at exactly the pretrained weights? What would go wrong if both $A$ and $B$ were initialized randomly?

ExerciseAdvanced

Problem

A 7B-parameter model uses FP16 (2 bytes per parameter). How much GPU memory does the base model require? With QLoRA using NF4 (0.5 bytes per parameter for the base) and rank $r = 16$ LoRA adapters on all linear layers (approximately 200 matrices of size $4096 \times 4096$), what is the total memory?


References

Canonical:

  • Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models" (2021)
  • Houlsby et al., "Parameter-Efficient Transfer Learning for NLP" (2019)
  • Dettmers et al., "QLoRA: Efficient Finetuning of Quantized LLMs" (2023)

Current:

  • Lester, Al-Rfou, Constant, "The Power of Scale for Parameter-Efficient Prompt Tuning" (2021)
  • Lialin, Deshpande, Rumshisky, "Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning" (2023)

Last reviewed: April 2026
