

LoRA vs. Full Fine-Tune vs. QLoRA

Full fine-tuning updates all parameters and achieves the best task performance but requires storing a complete copy of the model per task. LoRA freezes the base model and trains low-rank additive matrices, cutting trainable parameters by 100x or more. QLoRA quantizes the base model to 4-bit and applies LoRA on top, enabling fine-tuning of 65B models on a single 48GB GPU.

What Each Method Does

All three adapt a pretrained model to a downstream task. They differ in which parameters are updated and how the base model is stored in memory.

Full fine-tuning updates every parameter in the model. For a model with $N$ parameters, you need memory for the full weights ($N$ params in fp16/bf16), optimizer states ($2N$ for Adam: first and second moments, in fp32), and gradients ($N$). Total: roughly $4N$ values in memory; counting the fp32 master copy of the weights that mixed-precision training also keeps, this works out to about 16 bytes per parameter.
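The 16-bytes-per-parameter figure can be sanity-checked with a rough calculator. This is a sketch: the byte counts assume the usual mixed-precision layout (bf16 weights and gradients, fp32 master weights and Adam moments) and ignore activation memory.

```python
# Rough memory budget for full fine-tuning with Adam in mixed precision.
# Byte counts per parameter; activation memory is ignored.
BYTES_PER_PARAM = {
    "bf16 weights": 2,
    "bf16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}  # totals 16 bytes per parameter

def full_ft_memory_gb(n_params: float) -> float:
    return n_params * sum(BYTES_PER_PARAM.values()) / 1e9

print(full_ft_memory_gb(7e9))  # 112.0 -> ~112 GB for a 7B model
```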

LoRA (Low-Rank Adaptation) freezes all base model parameters and injects trainable low-rank decompositions into selected weight matrices. For a weight matrix $W \in \mathbb{R}^{d \times k}$, LoRA adds:

$$W' = W + \frac{\alpha}{r} BA$$

where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$. Only $A$ and $B$ are trained. The base model $W$ is frozen and shared across tasks.
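The trainable-parameter count is simple arithmetic: each adapted $d \times k$ matrix contributes $r(d+k)$ parameters. A sketch for a LLaMA-2-7B-like shape (32 layers, hidden size 4096, LoRA on the four attention projections; these dimensions are assumptions for illustration, not read from the checkpoint):

```python
# LoRA trainable parameters: r * (d + k) per adapted d-by-k matrix.
def lora_param_count(r, shapes):
    return sum(r * (d + k) for d, k in shapes)

layers, d_model, r = 32, 4096, 16
# Q, K, V, and output projections in every attention layer (assumed shapes)
shapes = [(d_model, d_model)] * 4 * layers
n = lora_param_count(r, shapes)
print(n, f"{n / 7e9:.2%}")  # 16777216 0.24%
```

This reproduces the ~17M (0.24%) figure quoted in the memory table for a 7B model.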

QLoRA combines 4-bit quantization of the base model with LoRA adapters in bf16. The base weights are stored in NF4 (4-bit NormalFloat) format, dequantized on the fly for forward and backward passes, and only the LoRA parameters receive gradient updates.

Memory Analysis

For a 7B parameter model (e.g., LLaMA-2-7B):

| Component | Full Fine-Tune | LoRA (r=16) | QLoRA (r=16) |
| --- | --- | --- | --- |
| Base model weights | 14 GB (bf16) | 14 GB (bf16, frozen) | 3.5 GB (NF4) |
| Trainable parameters | 7B | ~17M (0.24%) | ~17M (0.24%) |
| Optimizer states | 84 GB (fp32 master weights + Adam moments) | ~130 MB | ~130 MB |
| Gradients | 14 GB | ~34 MB | ~34 MB |
| Total training memory | ~112 GB | ~15 GB | ~5 GB |
| GPU requirement | 4x A100 (80GB) | 1x A100 (40GB) | 1x RTX 3090 (24GB) |

LoRA reduces trainable parameters by 400x but still loads the full bf16 model. QLoRA's 4-bit quantization cuts the frozen model memory by 4x, making the total footprint dramatically smaller.

How LoRA Works

LoRA exploits the hypothesis that the weight updates during fine-tuning have low intrinsic rank. Instead of updating the full $d \times k$ matrix $\Delta W$, it constrains the update to rank $r$:

$$\Delta W = \frac{\alpha}{r} BA$$

$A$ is initialized from $\mathcal{N}(0, \sigma^2)$ and $B$ is initialized to zero, so $\Delta W = 0$ at the start of training. The scaling factor $\alpha/r$ controls the magnitude of the adaptation relative to the pretrained weights.

LoRA is typically applied to the query and value projection matrices ($W_Q$, $W_V$) in each attention layer. Applying it to all linear layers (Q, K, V, output projection, FFN up/down) improves performance at the cost of more trainable parameters.

At inference time, the adapter can be merged: $W' = W + \frac{\alpha}{r} BA$. This adds zero latency compared to the base model.
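The init-and-merge behavior can be sketched in pure Python with toy 4×4 shapes (real implementations use tensor libraries): because $B$ starts at zero, the merged weight equals the base weight exactly at initialization.

```python
import random

# Minimal LoRA sketch: B starts at zero, so the update (alpha/r) * B @ A
# is zero at init; merging produces a single matrix with no extra
# inference cost. Shapes are tiny and illustrative.
d, k, r, alpha = 4, 4, 2, 4

random.seed(0)
W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # A ~ N(0, sigma^2)
B = [[0.0] * r for _ in range(d)]                                  # B = 0

def matmul(X, Y):
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def merge(W, B, A, alpha, r):
    BA = matmul(B, A)
    s = alpha / r
    return [[W[i][j] + s * BA[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# At initialization the merged weight equals the base weight exactly.
print(merge(W, B, A, alpha, r) == W)  # True
```

After training updates $A$ and $B$, the same `merge` call folds the adapter into the base weights for zero-overhead serving.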

How QLoRA Works

QLoRA adds three innovations on top of LoRA:

  1. NF4 quantization. Base model weights are quantized to 4-bit NormalFloat, which is information-theoretically optimal for normally distributed weights. Each weight is mapped to one of 16 values chosen to minimize quantization error under a Gaussian assumption.

  2. Double quantization. The quantization constants themselves (one per block of 64 weights) are quantized to 8-bit, saving an additional 0.37 bits per parameter.

  3. Paged optimizers. Optimizer states for LoRA parameters are offloaded to CPU RAM when GPU memory spikes during long-sequence gradient checkpointing, then paged back in.
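The blockwise quantize/dequantize round trip in step 1 can be sketched in a few lines. The 16 code values below approximate the NF4 levels (Gaussian quantiles normalized to [-1, 1]); the exact constants and the block size of 64 live in the bitsandbytes implementation.

```python
# Blockwise 4-bit quantization in the style of NF4: scale each block by
# its absmax, then snap each scaled weight to the nearest of 16 fixed
# code values. Code values are approximate NF4 levels.
NF4_LEVELS = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911,
              0.0, 0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]

def quantize_block(block):
    absmax = max(abs(w) for w in block) or 1.0
    idxs = [min(range(16), key=lambda i: abs(w / absmax - NF4_LEVELS[i]))
            for w in block]
    return idxs, absmax          # 4-bit indices plus one scale per block

def dequantize_block(idxs, absmax):
    return [NF4_LEVELS[i] * absmax for i in idxs]

block = [0.03, -0.12, 0.25, -0.40]   # toy "weight" block (real blocks: 64)
idxs, scale = quantize_block(block)
print(dequantize_block(idxs, scale))
```

Each reconstructed weight lands close to the original because the levels are dense near zero, where Gaussian-distributed weights concentrate.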

The forward pass dequantizes NF4 weights to bf16 on the fly, multiplies with activations, and the LoRA path adds $BA \cdot x$ in bf16. Gradients flow only through the LoRA parameters. The base model weights receive no gradient updates.
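The 0.37 bits-per-parameter savings from double quantization (step 2 above) is easy to verify: one fp32 scale per 64-weight block costs 0.5 bits per parameter, while an 8-bit scale plus one fp32 second-level constant per 256 first-level scales costs about 0.127. The block size of 64 and the 256-scale grouping follow the QLoRA paper.

```python
# Double-quantization savings: cost of quantization constants per parameter.
BLOCK = 64    # weights per first-level quantization block
GROUP = 256   # first-level scales per second-level fp32 constant

plain = 32 / BLOCK                         # fp32 scale per block: 0.5 bits/param
double = 8 / BLOCK + 32 / (BLOCK * GROUP)  # 8-bit scales + fp32 second level
print(round(plain - double, 2))  # 0.37
```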

Side-by-Side Comparison

| Property | Full Fine-Tune | LoRA | QLoRA |
| --- | --- | --- | --- |
| Parameters updated | All $N$ | $r(d+k)$ per adapted matrix | Same as LoRA |
| Base model storage | bf16 | bf16 (frozen) | NF4 (4-bit, frozen) |
| Task performance | Best (upper bound) | Within 1-2% of full FT | Within 1-3% of full FT |
| Multi-task serving | Separate model copy per task | Swap adapters, share base | Swap adapters, share quantized base |
| Training speed | Slowest (full backprop) | 30-50% faster (fewer grads) | Similar to LoRA (dequant overhead) |
| Inference latency | Baseline | Zero overhead (merged adapters) | Slight overhead from dequantization |
| Hyperparameters | LR, WD, epochs | LR, WD, $r$, $\alpha$, target modules | Same as LoRA + quantization config |

When Each Wins

Full fine-tuning: maximum performance, unlimited budget

When you need the absolute best task performance and have the compute budget, full fine-tuning remains the ceiling. It is also necessary when the task distribution differs substantially from pretraining (e.g., adapting an English LLM to a low-resource language). The full parameter space can accommodate larger distributional shifts than a low-rank subspace.

LoRA: efficient multi-task deployment

LoRA's primary advantage is serving: one base model in GPU memory, with task-specific adapters (~34 MB each for r=16 on a 7B model) swapped at request time. For serving 100 tasks, full fine-tuning requires 100 model copies (1.4 TB in bf16). LoRA requires one base model (14 GB) plus 100 adapters (3.4 GB total).
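The serving arithmetic above, as a quick check (using the 14 GB bf16 base and ~34 MB per-adapter figures from the memory table):

```python
# Storage for serving 100 tasks: full fine-tuning vs LoRA adapter swapping.
tasks = 100
base_gb, adapter_gb = 14, 0.034   # bf16 7B base; one r=16 adapter (~34 MB)

full_ft_gb = tasks * base_gb            # one full model copy per task
lora_gb = base_gb + tasks * adapter_gb  # one shared base + 100 adapters

print(full_ft_gb, round(lora_gb, 1))  # 1400 17.4
```

That is 1.4 TB versus under 18 GB, an ~80x reduction in serving storage.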

QLoRA: constrained GPU budget

QLoRA enables fine-tuning models that would otherwise not fit in available GPU memory. Fine-tuning a 65B model with QLoRA fits on a single 48GB GPU. Without QLoRA, the same model requires multiple 80GB A100s for full fine-tuning.

Choosing the LoRA Rank

The rank $r$ controls the expressiveness of the adaptation. Hu et al. found that $r = 4$ to $r = 16$ suffices for most NLP tasks on GPT-3-scale models. Higher ranks provide marginally better performance but increase memory and compute linearly.

The intrinsic dimensionality of the fine-tuning task determines the required rank. Simple classification tasks (sentiment, NLI) often need only $r = 4$. Complex generation tasks (instruction following, code generation) benefit from $r = 16$ to $r = 64$. Beyond $r = 64$, gains plateau for most tasks.
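The linear scaling of trainable parameters with rank is easy to see concretely, using the same assumed 7B shapes as before (32 layers, hidden size 4096, LoRA on the four attention projections):

```python
# Trainable-parameter count as a function of rank r: grows linearly in r.
def lora_params(r, layers=32, d_model=4096, matrices_per_layer=4):
    return matrices_per_layer * layers * r * 2 * d_model  # r*(d+k) with d=k

for r in (4, 16, 64):
    n = lora_params(r)
    print(r, n, f"{n / 7e9:.2%}")
```

Even at $r = 64$, the adapters stay under 1% of the 7B base model's parameters.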

Common Confusions

Watch Out

LoRA does not reduce inference cost

LoRA reduces training cost and multi-task serving memory, but after merging the adapter into the base weights ($W' = W + BA$), the model is the same size and has the same inference cost as a fully fine-tuned model. The savings are in training memory and in serving multiple tasks from a shared base.

Watch Out

QLoRA quantization is training-time only (by default)

QLoRA quantizes the base model to reduce training memory. At inference time, you can either: (1) serve the quantized base with LoRA adapters (lower quality, lower memory), or (2) merge adapters into the full-precision base and serve at bf16 (full quality, full memory). The 4-bit base is a training convenience, not necessarily the deployment format.

Watch Out

LoRA rank is not the same as model rank

The rank $r$ in LoRA is the rank of the adaptation $\Delta W$, not the rank of the full weight matrix $W$. Even when $r = 8$, the effective weight $W + BA$ is generically full-rank (assuming $W$ is full-rank). LoRA constrains the update, not the model.

Watch Out

Applying LoRA to more layers is not always better

Adding LoRA to every linear layer increases trainable parameters and can lead to overfitting on small datasets. The original paper showed that targeting only $W_Q$ and $W_V$ was often sufficient. The optimal set of target modules depends on the task and dataset size.

References

  1. Hu, E. J. et al. (2022). "LoRA: Low-Rank Adaptation of Large Language Models." ICLR 2022. (Original LoRA paper with rank analysis and GPT-3 experiments.)
  2. Dettmers, T. et al. (2023). "QLoRA: Efficient Finetuning of Quantized LLMs." NeurIPS 2023. (QLoRA with NF4 quantization and double quantization.)
  3. Aghajanyan, A. et al. (2021). "Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning." ACL 2021. (Theoretical motivation for low-rank adaptation.)
  4. Dettmers, T. et al. (2022). "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale." NeurIPS 2022. (Foundation for quantized inference and training.)
  5. Houlsby, N. et al. (2019). "Parameter-Efficient Transfer Learning for NLP." ICML 2019. (Adapter modules, a precursor to LoRA.)
  6. Liu, H. et al. (2024). "DoRA: Weight-Decomposed Low-Rank Adaptation." ICML 2024. (Extension decomposing into magnitude and direction components.)
  7. Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. Chapter 8.7 (Transfer learning and fine-tuning fundamentals).