What Each Method Does
All three adapt a pretrained model to a downstream task. They differ in which parameters are updated and how the base model is stored in memory.
Full fine-tuning updates every parameter in the model. For a model with $N$ parameters, you need memory for the full weights ($N$ params in fp16/bf16), optimizer states ($2N$ for Adam: first and second moments, plus an fp32 master copy of the weights), and gradients ($N$). Total: roughly $5N$ values in memory, or about 16 bytes per parameter with bf16 weights/gradients and fp32 optimizer states.
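The 16-bytes-per-parameter accounting can be made explicit with a small sketch (the breakdown assumes standard mixed-precision Adam; activation memory is excluded):

```python
def full_ft_memory_bytes(n_params: int) -> int:
    """Approximate training memory for full fine-tuning with mixed precision.

    bf16 weights (2 B) + bf16 gradients (2 B) + fp32 master weights (4 B)
    + two fp32 Adam moments (2 * 4 B) = 16 bytes per parameter.
    """
    bf16_weights = 2 * n_params
    bf16_grads = 2 * n_params
    fp32_master = 4 * n_params
    fp32_adam_moments = 2 * 4 * n_params
    return bf16_weights + bf16_grads + fp32_master + fp32_adam_moments

print(full_ft_memory_bytes(7_000_000_000) / 1e9)  # 112.0 GB for a 7B model
```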
LoRA (Low-Rank Adaptation) freezes all base model parameters and injects trainable low-rank decompositions into selected weight matrices. For a weight matrix $W_0 \in \mathbb{R}^{d \times k}$, LoRA adds:

$$W = W_0 + \Delta W = W_0 + BA$$

where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$. Only $A$ and $B$ are trained. The base model is frozen and shared across tasks.
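A minimal numpy sketch of a LoRA-adapted linear layer (dimensions and init scales here are illustrative choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 64, 8                       # output dim, input dim, LoRA rank

W0 = rng.normal(size=(d, k))              # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, k))   # trainable, small random init
B = np.zeros((d, r))                      # trainable, zero init

def lora_forward(x):
    # Base path plus low-rank path; only A and B would receive gradients
    return W0 @ x + B @ (A @ x)

x = rng.normal(size=(k,))
# With B initialized to zero, the adapted model matches the frozen base exactly
assert np.allclose(lora_forward(x), W0 @ x)
```

Note that the low-rank path costs two thin matmuls ($r \times k$ then $d \times r$) instead of one full $d \times k$ update.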
QLoRA combines 4-bit quantization of the base model with LoRA adapters in bf16. The base weights are stored in NF4 (4-bit NormalFloat) format, dequantized on the fly for forward and backward passes, and only the LoRA parameters receive gradient updates.
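As a concrete setup, QLoRA training with the Hugging Face stack typically looks like the following sketch (the model id and hyperparameters are illustrative; assumes `transformers`, `peft`, and `bitsandbytes` are installed):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 base with double quantization; compute runs in bf16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # illustrative model id
    quantization_config=bnb_config,
)

# bf16 LoRA adapters on the attention projections; the NF4 base stays frozen
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA matrices are trainable
```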
Memory Analysis
For a 7B parameter model (e.g., LLaMA-2-7B):
| Component | Full Fine-Tune | LoRA (r=16) | QLoRA (r=16) |
|---|---|---|---|
| Base model weights | 14 GB (bf16) | 14 GB (bf16, frozen) | 3.5 GB (NF4) |
| Trainable parameters | 7B | ~17M (0.24%) | ~17M (0.24%) |
| Optimizer states | 84 GB (fp32 master + Adam moments) | ~130 MB | ~130 MB |
| Gradients | 14 GB | ~34 MB | ~34 MB |
| Total training memory | ~112 GB | ~15 GB | ~5 GB |
| GPU requirement | 4x A100 (80GB) | 1x A100 (40GB) | 1x RTX 3090 (24GB) |
LoRA reduces trainable parameters by 400x but still loads the full bf16 model. QLoRA's 4-bit quantization cuts the frozen model memory by 4x, making the total footprint dramatically smaller.
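The ~17M trainable-parameter figure can be reproduced by counting adapter weights. This sketch assumes r=16 adapters on all four attention projections of a LLaMA-2-7B-style model (32 layers, hidden size 4096):

```python
def lora_param_count(n_layers: int, d_model: int, r: int, n_proj: int) -> int:
    """Trainable LoRA parameters: each adapted d x d projection gets
    A (r x d) and B (d x r), i.e. 2 * d * r parameters."""
    return n_layers * n_proj * 2 * d_model * r

# LLaMA-2-7B attention: 32 layers, d_model = 4096, LoRA on q,k,v,o with r = 16
n = lora_param_count(n_layers=32, d_model=4096, r=16, n_proj=4)
print(n, n / 7e9)  # 16_777_216 trainable params, ~0.24% of 7B
```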
How LoRA Works
LoRA exploits the hypothesis that the weight updates during fine-tuning have low intrinsic rank. Instead of updating the full matrix $W_0$, it constrains the update to rank $r$:

$$\Delta W = \frac{\alpha}{r} BA$$

$A$ is initialized from a random Gaussian and $B$ is initialized to zero, so $\Delta W = 0$ at the start of training. The scaling factor $\alpha/r$ controls the magnitude of the adaptation relative to the pretrained weights.
LoRA is typically applied to the query and value projection matrices ($W_q$, $W_v$) in each attention layer. Applying it to all linear layers (Q, K, V, output projection, FFN up/down) improves performance at the cost of more trainable parameters.
At inference time, the adapter can be merged: $W = W_0 + \frac{\alpha}{r} BA$. This adds zero latency compared to the base model.
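The zero-latency claim follows from the merge being exact: folding the adapter into the weight once gives the same outputs as running the two-path forward. A numpy check (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r, alpha = 16, 16, 4, 8.0

W0 = rng.normal(size=(d, k))
A = rng.normal(size=(r, k))
B = rng.normal(size=(d, r))
x = rng.normal(size=(k,))

# Unmerged: base path plus adapter path (extra matmuls every forward pass)
y_adapter = W0 @ x + (alpha / r) * (B @ (A @ x))

# Merged: fold the adapter into the weight once, then a single matmul
W_merged = W0 + (alpha / r) * (B @ A)
y_merged = W_merged @ x

assert np.allclose(y_adapter, y_merged)  # identical outputs, no extra latency
```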
How QLoRA Works
QLoRA adds three innovations on top of LoRA:
- NF4 quantization. Base model weights are quantized to 4-bit NormalFloat, which is information-theoretically optimal for normally distributed weights. Each weight is mapped to one of 16 values chosen to minimize quantization error under a Gaussian assumption.
- Double quantization. The quantization constants themselves (one per block of 64 weights) are quantized to 8-bit, saving an additional 0.37 bits per parameter.
- Paged optimizers. Optimizer states for LoRA parameters are offloaded to CPU RAM when GPU memory spikes during long-sequence gradient checkpointing, then paged back in.
The forward pass dequantizes NF4 weights to bf16 on the fly, multiplies with activations, and the LoRA path adds in bf16. Gradients flow only through the LoRA parameters. The base model weights receive no gradient updates.
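The blockwise quantize/dequantize mechanics can be sketched in numpy. This is a simplified stand-in, not real NF4: it uses 16 evenly spaced levels, whereas NF4 places its 16 levels at quantiles of a standard normal.

```python
import numpy as np

def quantize_4bit_blockwise(w, block=64, levels=None):
    """Illustrative 4-bit blockwise quantization (simplified stand-in for NF4)."""
    if levels is None:
        levels = np.linspace(-1.0, 1.0, 16)          # NF4 would use Gaussian quantiles
    w = w.reshape(-1, block)
    absmax = np.abs(w).max(axis=1, keepdims=True)    # one fp constant per block
    normalized = w / absmax                          # map each block into [-1, 1]
    idx = np.abs(normalized[..., None] - levels).argmin(axis=-1)  # 4-bit codes
    return idx.astype(np.uint8), absmax, levels

def dequantize(idx, absmax, levels):
    # In QLoRA this happens on the fly inside each forward/backward pass
    return levels[idx] * absmax

w = np.random.default_rng(0).normal(scale=0.02, size=(4096,)).astype(np.float32)
idx, absmax, levels = quantize_4bit_blockwise(w)
w_hat = dequantize(idx, absmax, levels).reshape(-1)
print(np.abs(w - w_hat).max())  # small per-weight reconstruction error
```

Double quantization would additionally compress the per-block `absmax` constants to 8-bit.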
Side-by-Side Comparison
| Property | Full Fine-Tune | LoRA | QLoRA |
|---|---|---|---|
| Parameters updated | All | $r(d + k)$ per adapted matrix | Same as LoRA |
| Base model storage | bf16 | bf16 (frozen) | NF4 (4-bit, frozen) |
| Task performance | Best (upper bound) | Within 1-2% of full FT | Within 1-3% of full FT |
| Multi-task serving | Separate model copy per task | Swap adapters, share base | Swap adapters, share quantized base |
| Training speed | Slowest (full backprop) | 30-50% faster (fewer grads) | Similar to LoRA (dequant overhead) |
| Inference latency | Baseline | Zero overhead (merge adapters) | Slight overhead from dequantization |
| Hyperparameters | LR, WD, epochs | LR, WD, r, alpha, target modules | Same as LoRA + quantization config |
When Each Wins
Full fine-tuning: maximum performance, unlimited budget
When you need the absolute best task performance and have the compute budget, full fine-tuning remains the ceiling. It is also necessary when the task distribution differs substantially from pretraining (e.g., adapting an English LLM to a low-resource language). The full parameter space can accommodate larger distributional shifts than a low-rank subspace.
LoRA: efficient multi-task deployment
LoRA's primary advantage is serving: one base model in GPU memory, with task-specific adapters (~34 MB each for r=16 on a 7B model) swapped at request time. For serving 100 tasks, full fine-tuning requires 100 model copies (1.4 TB in bf16). LoRA requires one base model (14 GB) plus 100 adapters (3.4 GB total).
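The serving arithmetic above, as a sketch (base and adapter sizes taken from the 7B / r=16 example):

```python
def serving_memory_gb(n_tasks: int, base_gb: float = 14.0, adapter_gb: float = 0.034):
    """Memory to serve n_tasks: full fine-tuning needs one model copy per task;
    LoRA shares one base model and stores only per-task adapters."""
    full_ft = n_tasks * base_gb
    lora = base_gb + n_tasks * adapter_gb
    return full_ft, lora

full_ft, lora = serving_memory_gb(100)
print(full_ft, lora)  # 1400.0 GB of copies vs ~17.4 GB shared
```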
QLoRA: constrained GPU budget
QLoRA enables fine-tuning models that would otherwise not fit in available GPU memory. Fine-tuning a 65B model with QLoRA fits on a single 48GB GPU. Without QLoRA, the same model requires multiple 80GB A100s for full fine-tuning.
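The 65B claim follows from the 4-bit storage cost alone (ignoring the small overhead of quantization constants, adapters, and activations):

```python
def qlora_base_gb(n_params: float) -> float:
    """NF4 base weights: 4 bits = 0.5 bytes per parameter."""
    return n_params * 0.5 / 1e9

print(qlora_base_gb(65e9))  # 32.5 GB: the 65B base fits a 48 GB GPU with headroom
```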
Choosing the LoRA Rank
The rank $r$ controls the expressiveness of the adaptation. Hu et al. found that ranks as low as $r = 1$ to $r = 4$ suffice for many NLP tasks on GPT-3-scale models. Higher ranks provide marginally better performance but increase memory and compute linearly.
The intrinsic dimensionality of the fine-tuning task determines the required rank. Simple classification tasks (sentiment, NLI) need only a small rank ($r \le 8$). Complex generation tasks (instruction following, code generation) benefit from $r = 16$ to $r = 64$. Beyond that, gains plateau for most tasks.
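The low-intrinsic-rank intuition can be illustrated with an SVD: a synthetic update built from a few outer products concentrates nearly all of its spectral energy in its top singular values, so a small rank suffices to approximate it. The construction below is illustrative, not taken from the papers:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128

# Simulate a fine-tuning update with low intrinsic rank: 4 outer products + noise
true_r = 4
delta_W = sum(np.outer(rng.normal(size=d), rng.normal(size=d)) for _ in range(true_r))
delta_W = delta_W + 0.01 * rng.normal(size=(d, d))

s = np.linalg.svd(delta_W, compute_uv=False)
energy = np.cumsum(s**2) / np.sum(s**2)       # fraction of energy in top-k singular values
print(energy[true_r - 1])  # a rank-4 approximation captures nearly all the energy
```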
Common Confusions
LoRA does not reduce inference cost
LoRA reduces training cost and multi-task serving memory, but after merging the adapter into the base weights ($W = W_0 + \frac{\alpha}{r} BA$), the model is the same size and has the same inference cost as a fully fine-tuned model. The savings are in training memory and in serving multiple tasks from a shared base.
QLoRA quantization is training-time only (by default)
QLoRA quantizes the base model to reduce training memory. At inference time, you can either: (1) serve the quantized base with LoRA adapters (lower quality, lower memory), or (2) merge adapters into the full-precision base and serve at bf16 (full quality, full memory). The 4-bit base is a training convenience, not necessarily the deployment format.
LoRA rank is not the same as model rank
The rank $r$ in LoRA is the rank of the adaptation $\Delta W = BA$, not the rank of the full weight matrix $W_0$. Even when $r = 1$, the effective weight $W_0 + BA$ is full-rank (assuming $W_0$ is full-rank). LoRA constrains the update, not the model.
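A quick numpy check of this point: the update is rank-1, but adding it to a generic (full-rank) base matrix leaves the effective weight full-rank.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
W0 = rng.normal(size=(d, d))                 # generic matrix: full rank w.p. 1
B = rng.normal(size=(d, 1))
A = rng.normal(size=(1, d))                  # rank-1 LoRA update

delta_W = B @ A
W = W0 + delta_W

assert np.linalg.matrix_rank(delta_W) == 1   # the *update* is rank-1
assert np.linalg.matrix_rank(W) == d         # the effective weight stays full-rank
```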
Applying LoRA to more layers is not always better
Adding LoRA to every linear layer increases trainable parameters and can lead to overfitting on small datasets. The original paper showed that targeting only $W_q$ and $W_v$ was often sufficient. The optimal set of target modules depends on the task and dataset size.
References
- Hu, E. J. et al. (2022). "LoRA: Low-Rank Adaptation of Large Language Models." ICLR 2022. (Original LoRA paper with rank analysis and GPT-3 experiments.)
- Dettmers, T. et al. (2023). "QLoRA: Efficient Finetuning of Quantized Language Models." NeurIPS 2023. (QLoRA with NF4 quantization and double quantization.)
- Aghajanyan, A. et al. (2021). "Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning." ACL 2021. (Theoretical motivation for low-rank adaptation.)
- Dettmers, T. et al. (2022). "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale." NeurIPS 2022. (Foundation for quantized inference and training.)
- Houlsby, N. et al. (2019). "Parameter-Efficient Transfer Learning for NLP." ICML 2019. (Adapter modules, a precursor to LoRA.)
- Liu, H. et al. (2024). "DoRA: Weight-Decomposed Low-Rank Adaptation." ICML 2024. (Extension decomposing pretrained weights into magnitude and direction components.)
- Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. Chapter 8.7 (Transfer learning and fine-tuning fundamentals).