What Each Method Does
All three adapt a pretrained model to a downstream task. They differ in which parameters are updated and how the base model is stored in memory.
Full fine-tuning updates every parameter in the model. For a model with $N$ parameters, you need memory for the full weights ($N$ params in fp16/bf16), optimizer states ($2N$ for Adam: first and second moments, plus an fp32 master copy of the weights), and gradients ($N$). Total: roughly $5N$ values in memory, or about 16 bytes per parameter with bf16 weights/gradients and fp32 optimizer states.
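The 16-bytes-per-parameter accounting can be made explicit with a small sketch (the breakdown assumes standard mixed-precision Adam; activation memory is excluded):

```python
def full_ft_memory_bytes(n_params: int) -> int:
    """Approximate training memory for full fine-tuning with mixed precision.

    bf16 weights (2 B) + bf16 gradients (2 B) + fp32 master weights (4 B)
    + two fp32 Adam moments (2 * 4 B) = 16 bytes per parameter.
    """
    bf16_weights = 2 * n_params
    bf16_grads = 2 * n_params
    fp32_master = 4 * n_params
    fp32_adam_moments = 2 * 4 * n_params
    return bf16_weights + bf16_grads + fp32_master + fp32_adam_moments

print(full_ft_memory_bytes(7_000_000_000) / 1e9)  # 112.0 GB for a 7B model
```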
LoRA (Low-Rank Adaptation) freezes all base model parameters and injects trainable low-rank decompositions into selected weight matrices. For a weight matrix $W_0 \in \mathbb{R}^{d \times k}$, LoRA adds:

$$W = W_0 + \Delta W = W_0 + BA$$

where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$. Only $A$ and $B$ are trained. The base model is frozen and shared across tasks.
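A minimal numpy sketch of a LoRA-adapted linear layer (dimensions and init scales here are illustrative choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 64, 8                       # output dim, input dim, LoRA rank

W0 = rng.normal(size=(d, k))              # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, k))   # trainable, small random init
B = np.zeros((d, r))                      # trainable, zero init

def lora_forward(x):
    # Base path plus low-rank path; only A and B would receive gradients
    return W0 @ x + B @ (A @ x)

x = rng.normal(size=(k,))
# With B initialized to zero, the adapted model matches the frozen base exactly
assert np.allclose(lora_forward(x), W0 @ x)
```

Note that the low-rank path costs two thin matmuls ($r \times k$ then $d \times r$) instead of one full $d \times k$ update.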
QLoRA combines 4-bit quantization of the base model with LoRA adapters in bf16. The base weights are stored in NF4 (4-bit NormalFloat) format, dequantized on the fly for forward and backward passes, and only the LoRA parameters receive gradient updates.
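As a concrete setup, QLoRA training with the Hugging Face stack typically looks like the following sketch (the model id and hyperparameters are illustrative; assumes `transformers`, `peft`, and `bitsandbytes` are installed):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 base with double quantization; compute runs in bf16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # illustrative model id
    quantization_config=bnb_config,
)

# bf16 LoRA adapters on the attention projections; the NF4 base stays frozen
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA matrices are trainable
```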
Memory Analysis
For a 7B parameter model (e.g., LLaMA-2-7B):
| Component | Full Fine-Tune | LoRA (r=16) | QLoRA (r=16) |
|---|---|---|---|
| Base model weights | 14 GB (bf16) | 14 GB (bf16, frozen) | 3.5 GB (NF4) |
| Trainable parameters | 7B | ~17M (0.24%) | ~17M (0.24%) |
| Optimizer states | 84 GB (fp32 master + Adam moments) | ~130 MB | ~130 MB |
| Gradients | 14 GB | ~34 MB | ~34 MB |
| Total training memory | ~112 GB | ~15 GB | ~5 GB |
| GPU requirement | 4x A100 (80GB) | 1x A100 (40GB) | 1x RTX 3090 (24GB) |
LoRA reduces trainable parameters by 400x but still loads the full bf16 model. QLoRA's 4-bit quantization cuts the frozen model memory by 4x, making the total footprint dramatically smaller.
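The ~17M trainable-parameter figure can be reproduced by counting adapter weights. This sketch assumes r=16 adapters on all four attention projections of a LLaMA-2-7B-style model (32 layers, hidden size 4096):

```python
def lora_param_count(n_layers: int, d_model: int, r: int, n_proj: int) -> int:
    """Trainable LoRA parameters: each adapted d x d projection gets
    A (r x d) and B (d x r), i.e. 2 * d * r parameters."""
    return n_layers * n_proj * 2 * d_model * r

# LLaMA-2-7B attention: 32 layers, d_model = 4096, LoRA on q,k,v,o with r = 16
n = lora_param_count(n_layers=32, d_model=4096, r=16, n_proj=4)
print(n, n / 7e9)  # 16_777_216 trainable params, ~0.24% of 7B
```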
How LoRA Works
LoRA exploits the hypothesis that the weight updates during fine-tuning have low intrinsic rank. Instead of updating the full matrix $W_0$, it constrains the update to rank $r$:

$$\Delta W = \frac{\alpha}{r} BA$$

$A$ is initialized from a random Gaussian and $B$ is initialized to zero, so $\Delta W = 0$ at the start of training. The scaling factor $\alpha/r$ controls the magnitude of the adaptation relative to the pretrained weights.
LoRA is typically applied to the query and value projection matrices ($W_q$, $W_v$) in each attention layer. Applying it to all linear layers (Q, K, V, output projection, FFN up/down) improves performance at the cost of more trainable parameters.
At inference time, the adapter can be merged: $W = W_0 + \frac{\alpha}{r} BA$. This adds zero latency compared to the base model.
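The zero-latency claim follows from the merge being exact: folding the adapter into the weight once gives the same outputs as running the two-path forward. A numpy check (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r, alpha = 16, 16, 4, 8.0

W0 = rng.normal(size=(d, k))
A = rng.normal(size=(r, k))
B = rng.normal(size=(d, r))
x = rng.normal(size=(k,))

# Unmerged: base path plus adapter path (extra matmuls every forward pass)
y_adapter = W0 @ x + (alpha / r) * (B @ (A @ x))

# Merged: fold the adapter into the weight once, then a single matmul
W_merged = W0 + (alpha / r) * (B @ A)
y_merged = W_merged @ x

assert np.allclose(y_adapter, y_merged)  # identical outputs, no extra latency
```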
How QLoRA Works
QLoRA adds three innovations on top of LoRA:
- NF4 quantization. Base model weights are quantized to 4-bit NormalFloat, which is information-theoretically optimal for normally distributed weights. Each weight is mapped to one of 16 values chosen to minimize quantization error under a Gaussian assumption.
- Double quantization. The quantization constants themselves (one per block of 64 weights) are quantized to 8-bit, saving an additional 0.37 bits per parameter.
- Paged optimizers. Optimizer states for LoRA parameters are offloaded to CPU RAM when GPU memory spikes during long-sequence gradient checkpointing, then paged back in.
The forward pass dequantizes NF4 weights to bf16 on the fly, multiplies with activations, and the LoRA path adds in bf16. Gradients flow only through the LoRA parameters. The base model weights receive no gradient updates.
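The blockwise quantize/dequantize mechanics can be sketched in numpy. This is a simplified stand-in, not real NF4: it uses 16 evenly spaced levels, whereas NF4 places its 16 levels at quantiles of a standard normal.

```python
import numpy as np

def quantize_4bit_blockwise(w, block=64, levels=None):
    """Illustrative 4-bit blockwise quantization (simplified stand-in for NF4)."""
    if levels is None:
        levels = np.linspace(-1.0, 1.0, 16)          # NF4 would use Gaussian quantiles
    w = w.reshape(-1, block)
    absmax = np.abs(w).max(axis=1, keepdims=True)    # one fp constant per block
    normalized = w / absmax                          # map each block into [-1, 1]
    idx = np.abs(normalized[..., None] - levels).argmin(axis=-1)  # 4-bit codes
    return idx.astype(np.uint8), absmax, levels

def dequantize(idx, absmax, levels):
    # In QLoRA this happens on the fly inside each forward/backward pass
    return levels[idx] * absmax

w = np.random.default_rng(0).normal(scale=0.02, size=(4096,)).astype(np.float32)
idx, absmax, levels = quantize_4bit_blockwise(w)
w_hat = dequantize(idx, absmax, levels).reshape(-1)
print(np.abs(w - w_hat).max())  # small per-weight reconstruction error
```

Double quantization would additionally compress the per-block `absmax` constants to 8-bit.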
Side-by-Side Comparison
| Property | Full Fine-Tune | LoRA | QLoRA |
|---|---|---|---|
| Parameters updated | All | $r(d + k)$ per adapted matrix | Same as LoRA |
| Base model storage | bf16 | bf16 (frozen) | NF4 (4-bit, frozen) |
| Task performance | Best (upper bound) | Within 1-2% of full FT | Within 1-3% of full FT |
| Multi-task serving | Separate model copy per task | Swap adapters, share base | Swap adapters, share quantized base |
| Training speed | Slowest (full backprop) | 30-50% faster (fewer grads) | Similar to LoRA (dequant overhead) |
| Inference latency | Baseline | Zero overhead (merge adapters) | Slight overhead from dequantization |
| Hyperparameters | LR, WD, epochs | LR, WD, r, alpha, target modules | Same as LoRA + quantization config |
When Each Wins
Full fine-tuning: maximum performance, unlimited budget
When you need the absolute best task performance and have the compute budget, full fine-tuning remains the ceiling. It is also necessary when the task distribution differs substantially from pretraining (e.g., adapting an English LLM to a low-resource language). The full parameter space can accommodate larger distributional shifts than a low-rank subspace.
LoRA: efficient multi-task deployment
LoRA's primary advantage is serving: one base model in GPU memory, with task-specific adapters (~34 MB each for r=16 on a 7B model) swapped at request time. For serving 100 tasks, full fine-tuning requires 100 model copies (1.4 TB in bf16). LoRA requires one base model (14 GB) plus 100 adapters (3.4 GB total).
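The serving arithmetic above, as a sketch (base and adapter sizes taken from the 7B / r=16 example):

```python
def serving_memory_gb(n_tasks: int, base_gb: float = 14.0, adapter_gb: float = 0.034):
    """Memory to serve n_tasks: full fine-tuning needs one model copy per task;
    LoRA shares one base model and stores only per-task adapters."""
    full_ft = n_tasks * base_gb
    lora = base_gb + n_tasks * adapter_gb
    return full_ft, lora

full_ft, lora = serving_memory_gb(100)
print(full_ft, lora)  # 1400.0 GB of copies vs ~17.4 GB shared
```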
QLoRA: constrained GPU budget
QLoRA enables fine-tuning models that would otherwise not fit in available GPU memory. Fine-tuning a 65B model with QLoRA fits on a single 48GB GPU. Without QLoRA, the same model requires multiple 80GB A100s for full fine-tuning.
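The 65B claim follows from the 4-bit storage cost alone (ignoring the small overhead of quantization constants, adapters, and activations):

```python
def qlora_base_gb(n_params: float) -> float:
    """NF4 base weights: 4 bits = 0.5 bytes per parameter."""
    return n_params * 0.5 / 1e9

print(qlora_base_gb(65e9))  # 32.5 GB: the 65B base fits a 48 GB GPU with headroom
```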
Choosing the LoRA Rank
The rank $r$ controls the expressiveness of the adaptation. Hu et al. found that ranks as low as $r = 1$ to $r = 4$ suffice for many NLP tasks on GPT-3-scale models. Higher ranks provide marginally better performance but increase memory and compute linearly.
The intrinsic dimensionality of the fine-tuning task determines the required rank. Simple classification tasks (sentiment, NLI) need only a small rank ($r \le 8$). Complex generation tasks (instruction following, code generation) benefit from $r = 16$ to $r = 64$. Beyond that, gains plateau for most tasks.
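The low-intrinsic-rank intuition can be illustrated with an SVD: a synthetic update built from a few outer products concentrates nearly all of its spectral energy in its top singular values, so a small rank suffices to approximate it. The construction below is illustrative, not taken from the papers:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128

# Simulate a fine-tuning update with low intrinsic rank: 4 outer products + noise
true_r = 4
delta_W = sum(np.outer(rng.normal(size=d), rng.normal(size=d)) for _ in range(true_r))
delta_W = delta_W + 0.01 * rng.normal(size=(d, d))

s = np.linalg.svd(delta_W, compute_uv=False)
energy = np.cumsum(s**2) / np.sum(s**2)       # fraction of energy in top-k singular values
print(energy[true_r - 1])  # a rank-4 approximation captures nearly all the energy
```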
Common Confusions
LoRA does not reduce inference cost
LoRA reduces training cost and multi-task serving memory, but after merging the adapter into the base weights ($W = W_0 + \frac{\alpha}{r} BA$), the model is the same size and has the same inference cost as a fully fine-tuned model. The savings are in training memory and in serving multiple tasks from a shared base.
QLoRA quantization is training-time only (by default)
QLoRA quantizes the base model to reduce training memory. At inference time, you can either: (1) serve the quantized base with LoRA adapters (lower quality, lower memory), or (2) merge adapters into the full-precision base and serve at bf16 (full quality, full memory). The 4-bit base is a training convenience, not necessarily the deployment format.
LoRA rank is not the same as model rank
The rank $r$ in LoRA is the rank of the adaptation $\Delta W = BA$, not the rank of the full weight matrix $W_0$. Even when $r = 1$, the effective weight $W_0 + BA$ is full-rank (assuming $W_0$ is full-rank). LoRA constrains the update, not the model.
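A quick numpy check of this point: the update is rank-1, but adding it to a generic (full-rank) base matrix leaves the effective weight full-rank.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
W0 = rng.normal(size=(d, d))                 # generic matrix: full rank w.p. 1
B = rng.normal(size=(d, 1))
A = rng.normal(size=(1, d))                  # rank-1 LoRA update

delta_W = B @ A
W = W0 + delta_W

assert np.linalg.matrix_rank(delta_W) == 1   # the *update* is rank-1
assert np.linalg.matrix_rank(W) == d         # the effective weight stays full-rank
```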
Applying LoRA to more layers is not always better
Adding LoRA to every linear layer increases trainable parameters and can lead to overfitting on small datasets. The original paper showed that targeting only $W_q$ and $W_v$ was often sufficient. The optimal set of target modules depends on the task and dataset size.
References
- Hu, E. J. et al. (2022). "LoRA: Low-Rank Adaptation of Large Language Models." ICLR 2022. (Original LoRA paper with rank analysis and GPT-3 experiments.)
- Dettmers, T. et al. (2023). "QLoRA: Efficient Finetuning of Quantized Language Models." NeurIPS 2023. (QLoRA with NF4 quantization and double quantization.)
- Aghajanyan, A. et al. (2021). "Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning." ACL 2021. (Theoretical motivation for low-rank adaptation.)
- Dettmers, T. et al. (2022). "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale." NeurIPS 2022. (Foundation for quantized inference and training.)
- Houlsby, N. et al. (2019). "Parameter-Efficient Transfer Learning for NLP." ICML 2019. (Adapter modules, a precursor to LoRA.)
- Liu, H. et al. (2024). "DoRA: Weight-Decomposed Low-Rank Adaptation." ICML 2024. (Extension decomposing pretrained weights into magnitude and direction components.)
- Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. Chapter 8.7 (Transfer learning and fine-tuning fundamentals).