

LLaMA and Open Weight Models

The open weight movement in large language models: LLaMA 1/2/3, the ecosystem of fine-tuning and quantization tools, and why open weights changed the dynamics of AI research.


Why This Matters

Before LLaMA (February 2023), state-of-the-art language models were accessible only through APIs controlled by a handful of companies. Researchers could not inspect weights, reproduce results, or build on the models freely. LLaMA changed this by releasing competitive model weights to the research community. Within weeks, the weights leaked publicly. Within months, an ecosystem of fine-tuning, quantization, and inference tools emerged. Open weight models made it possible for individual researchers and small companies to run, study, and modify frontier-class language models.

LLaMA 1 (February 2023)

Architecture. Standard decoder-only transformer with modifications from recent literature: RMSNorm (pre-normalization), SwiGLU activation function, and rotary positional embeddings (RoPE). Context length: 2048 tokens.

Sizes. 7B, 13B, 33B, and 65B parameters.

Training. Trained on 1.0T to 1.4T tokens of publicly available data (Common Crawl, C4, GitHub, Wikipedia, Books, ArXiv, Stack Exchange). No proprietary data. The 65B model used approximately 2048 A100 GPUs for 21 days.

Key result. LLaMA-13B outperformed GPT-3 (175B) on most benchmarks. LLaMA-65B was competitive with Chinchilla (70B) and PaLM (540B). This demonstrated that smaller models trained on more tokens (following Chinchilla-optimal scaling) can match much larger models trained on fewer tokens.

What it taught us. Two things. First, open data + Chinchilla-optimal training yields competitive models. Second, the research community can do significant work with model weights alone, even without access to training infrastructure. Within weeks, Alpaca (Stanford) fine-tuned LLaMA-7B on 52K instruction-following examples for under $600, producing a model that qualitatively matched GPT-3.5 on simple tasks.

LLaMA 2 (July 2023)

Improvements. Trained on 2T tokens (40% more than LLaMA 1). Context length extended to 4096 tokens. Grouped-Query Attention (GQA) for the 34B and 70B models, reducing memory during inference.
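The memory saving from GQA comes from shrinking the key/value cache: keys and values are stored per KV head, so cutting 64 query-sharing heads down to 8 KV heads cuts the cache eightfold. A back-of-the-envelope sketch, using assumed 70B-class shapes (80 layers, head dimension 128, 64 query heads vs. 8 KV heads; these specific numbers are illustrative):

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Size of the KV cache: one K and one V tensor per layer, in fp16."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# Assumed 70B-class shapes: 80 layers, head_dim 128, 4096-token context
mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=4096)

print(round(mha / 1e9, 1))  # 10.7  (GB per sequence without GQA)
print(round(gqa / 1e9, 1))  # 1.3   (GB per sequence with 8 KV heads)
```

At batch size greater than one, this per-sequence cache multiplies, which is why GQA matters most for serving throughput.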

Sizes. 7B, 13B, 34B (not released), and 70B parameters.

Chat models. LLaMA 2 included chat-optimized variants trained with RLHF, similar to the InstructGPT recipe. These were the first open weight models with competitive instruction-following behavior.

License. Released under a custom commercial license allowing use by organizations with fewer than 700M monthly active users. This was a significant shift from LLaMA 1's research-only license.

What it taught us. Open models can include post-training alignment (RLHF) and still be released openly. The commercial license enabled startups and enterprises to build products on open weight models.

LLaMA 3 (April 2024)

Improvements. Trained on over 15T tokens, approximately 7x more than LLaMA 2. Larger vocabulary: 128K tokens (up from 32K). Context length: 8192 tokens, with extended-context variants reaching 128K.

Sizes. 8B and 70B initially, with a 405B model released later.

Key result. LLaMA 3 70B was competitive with GPT-4 on many benchmarks. LLaMA 3 405B approached or matched GPT-4 on most tasks. The 8B model outperformed LLaMA 2 70B on several benchmarks, demonstrating the impact of training on significantly more data.

What it taught us. Training data quantity and quality remain the dominant factors. The 8B model trained on 15T tokens outperforms a 70B model trained on 2T tokens on many tasks. This is the Chinchilla insight taken further: keep scaling tokens, not just parameters.

The Open Weight Ecosystem

The release of model weights enabled a tooling ecosystem that did not exist when models were API-only.

Fine-Tuning Tools

LoRA (Low-Rank Adaptation). Instead of updating all parameters during fine-tuning, decompose weight updates into low-rank matrices: $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$ with $r \ll d$. This reduces trainable parameters from $d^2$ to $2dr$.
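The parameter arithmetic above can be sketched directly. A minimal NumPy illustration (the small $d = 64$ matmul just checks the rank; the 8192/16 values echo LLaMA-scale dimensions for the count):

```python
import numpy as np

def lora_params(d: int, r: int) -> tuple[int, int]:
    """Trainable parameters: full d x d update vs. the low-rank B, A factors."""
    return d * d, 2 * d * r

# Numeric check that Delta W = BA has rank at most r
rng = np.random.default_rng(0)
d, r = 64, 4
B = rng.standard_normal((d, r))
A = rng.standard_normal((r, d))
delta_w = B @ A
print(np.linalg.matrix_rank(delta_w))  # 4

full, low = lora_params(8192, 16)      # LLaMA-like hidden dim, rank 16
print(full // low)                     # 256x fewer trainable parameters
```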

QLoRA. Combine 4-bit quantization of the base model with LoRA fine-tuning. This allows fine-tuning a 65B model on a single 48GB GPU, which is impossible with full-precision fine-tuning.

Quantization

Quantization reduces model size and inference cost by representing weights in lower precision.

Definition

Weight Quantization

Map floating-point weights $w \in \mathbb{R}$ to a discrete set of values using fewer bits. 4-bit quantization represents each weight with 4 bits (16 possible values) instead of 16 bits (float16). A 70B parameter model in float16 requires 140GB; in 4-bit, approximately 35GB.
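A minimal sketch of min-max uniform quantization plus the memory arithmetic from the definition. This rounds weights to a grid of $2^k$ values spanning the observed range; practical formats quantize in small blocks with per-block scales rather than one grid for the whole tensor:

```python
import numpy as np

def quantize_minmax(w: np.ndarray, k: int) -> np.ndarray:
    """Round weights to a uniform grid of 2^k values spanning [min, max]."""
    a, b = float(w.min()), float(w.max())
    step = (b - a) / (2 ** k - 1)          # grid spacing, endpoints included
    return a + np.round((w - a) / step) * step

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=10_000)     # toy weight tensor
w_q = quantize_minmax(w, k=4)              # at most 16 distinct values

# Memory arithmetic for a 70B-parameter model
params = 70e9
print(params * 2 / 1e9)                    # 140.0  (float16, GB)
print(params * 0.5 / 1e9)                  # 35.0   (4-bit, GB)
```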

GGUF format. A file format for quantized models designed for CPU inference. Used by llama.cpp, the C++ inference engine that runs LLaMA models on consumer hardware (MacBooks, desktop CPUs). This brought LLM inference to hardware without GPUs.

GPTQ and AWQ. GPU-optimized quantization methods that calibrate quantization parameters using a small dataset to minimize accuracy loss.

Main Theorems

Proposition

Quantization Error for Uniform Scalar Quantization

Statement

For uniform scalar quantization of a weight $w \in [a, b]$ to $2^k$ levels, the maximum quantization error per weight is:

$$|w - Q(w)| \leq \frac{b - a}{2^{k+1}}$$

The mean squared quantization error, assuming weights are uniformly distributed in $[a, b]$, is:

$$\mathbb{E}[(w - Q(w))^2] = \frac{(b - a)^2}{12 \cdot 2^{2k}}$$

For $k = 4$ (16 levels) and typical weight ranges, this is small enough that model quality degrades only modestly.

Intuition

Dividing the weight range into $2^k$ equal intervals means each weight is rounded to the nearest grid point. The worst case is when the weight falls exactly between two grid points. More bits means a finer grid and smaller error. The practical question is whether the cumulative effect of small per-weight errors degrades model outputs.

Proof Sketch

The maximum distance from any point in $[a, b]$ to the nearest quantization level is half the interval width: $(b-a)/(2 \cdot 2^k) = (b-a)/2^{k+1}$. The MSE follows from computing the variance of a uniform distribution on an interval of width $(b-a)/2^k$.
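Both bounds can be checked numerically with a midpoint quantizer that partitions $[a, b]$ into $2^k$ equal cells, matching the proposition's setup (the range and sample count here are arbitrary):

```python
import numpy as np

def quantize_midpoint(w: np.ndarray, a: float, b: float, k: int) -> np.ndarray:
    """Map each weight to the midpoint of its cell among 2^k equal cells of [a, b]."""
    step = (b - a) / 2 ** k
    idx = np.clip(np.floor((w - a) / step), 0, 2 ** k - 1)
    return a + (idx + 0.5) * step

rng = np.random.default_rng(0)
a, b, k = -1.0, 1.0, 4
w = rng.uniform(a, b, size=200_000)        # uniform weights, as the proposition assumes
err = w - quantize_midpoint(w, a, b, k)

worst_case = (b - a) / 2 ** (k + 1)               # 0.0625
mse_theory = (b - a) ** 2 / (12 * 2 ** (2 * k))   # ~0.0013
print(np.abs(err).max() <= worst_case)            # True
print(abs(float(np.mean(err ** 2)) - mse_theory) / mse_theory < 0.02)  # True
```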

Why It Matters

This bounds the per-weight error, but the critical question is how errors accumulate through the network. Empirically, 4-bit quantization (with calibration) preserves most model quality. 2-bit quantization causes noticeable degradation. The gap between worst-case bounds and empirical behavior suggests that weight distributions have structure (near-Gaussian, not uniform) that makes quantization more forgiving than worst-case analysis predicts.

Failure Mode

Uniform quantization assumes a uniform weight distribution. Real weight distributions are approximately Gaussian with outliers. Outlier weights far from the mean are poorly quantized because most quantization levels are wasted on the dense central region. Methods like GPTQ and AWQ handle this by using non-uniform quantization or isolating outlier channels.
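The outlier effect is easy to demonstrate: a handful of large weights stretch the min-max range, coarsening the grid for the dense bulk of the distribution. A toy sketch (the Gaussian scale and outlier value are arbitrary choices for illustration):

```python
import numpy as np

def minmax_quant_mse(w: np.ndarray, k: int = 4) -> float:
    """Mean squared error after rounding w to a 2^k-level grid over [min, max]."""
    a, b = float(w.min()), float(w.max())
    step = (b - a) / (2 ** k - 1)
    w_q = a + np.round((w - a) / step) * step
    return float(np.mean((w - w_q) ** 2))

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=10_000)     # near-Gaussian weights
w_outliers = w.copy()
w_outliers[:10] = 1.0                      # 0.1% of weights become outliers

# Outliers inflate the quantization error for all the other weights
print(minmax_quant_mse(w_outliers) > 10 * minmax_quant_mse(w))  # True
```

This is exactly why outlier-aware methods keep a few channels in higher precision or calibrate per-channel scales.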

Comparison with Other Open Models

Mistral 7B (September 2023). Outperformed LLaMA 2 13B on most benchmarks with only 7B parameters. Introduced sliding window attention for efficient long-context processing. Mixtral 8x7B used mixture-of-experts with 47B total parameters but only 13B active per token.

Qwen (Alibaba). Strong multilingual models (especially Chinese and English). Qwen-72B and Qwen2 series competitive with LLaMA 3.

Gemma (Google). 2B and 7B models trained on large proprietary datasets. Competitive at smaller sizes.

The open weight ecosystem is now competitive with closed models for most applications. The gap between open and closed models has shrunk from years to months.

Why Open Weights Matter

Reproducibility. ML research requires running experiments on models. API access does not allow modifying architectures, inspecting internal representations, or controlling inference exactly. Open weights enable mechanistic interpretability, ablation studies, and controlled experiments.

Fine-tuning. Organizations can adapt open models to specific domains (medicine, law, finance) without sending sensitive data to third-party APIs. This addresses privacy and compliance requirements.

Cost. Running inference on your own hardware can be 10-100x cheaper than API access for high-volume applications, especially with quantization.

Resilience. API providers can change pricing, terms, or model behavior at any time. Open weight models provide stability: the weights you download today will work identically forever.

Common Confusions

Watch Out

Open weights is not open source

Releasing model weights is not the same as open source. Open source means releasing code, data, training scripts, and weights under a permissive license. LLaMA releases weights and inference code but not training data or the full training pipeline. You can use the model but cannot fully reproduce the training.

Watch Out

Quantization is not always free

4-bit quantization typically costs 1-3% on benchmarks compared to float16. But this average hides variation: some tasks (especially reasoning-heavy or low-frequency knowledge) degrade more than others. Always evaluate quantized models on your specific task before deploying.

Watch Out

Smaller open models do not replace larger closed models

LLaMA 3 70B is competitive with GPT-4 on many benchmarks, but GPT-4 still outperforms on complex reasoning, long-context tasks, and multimodal understanding (as of early 2024). The gap narrows with each release, but for the hardest tasks, closed frontier models still lead.

Canonical Examples

Example

QLoRA fine-tuning cost estimate

Fine-tuning LLaMA 2 70B with QLoRA on 50,000 instruction examples. Base model quantized to 4-bit: 35GB, fits on one 48GB GPU. LoRA rank $r = 64$: trainable parameters $\approx 2 \times 64 \times 8192 \times 80 \text{ layers} \approx 84\text{M}$ (0.12% of total). Training time: approximately 8 hours on one GPU (~$16 on cloud). Compare to full fine-tuning: requires 8x A100 80GB GPUs, ~$500+. QLoRA achieves 95-99% of full fine-tuning quality at 3% of the cost.
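The parameter count in this example is plain arithmetic (assuming, as above, one rank-64 adapter on an $8192 \times 8192$ matrix in each of 80 layers; this is a simplification, since real setups typically attach adapters to several matrices per layer):

```python
d, num_layers, r = 8192, 80, 64            # hidden dim, layer count, LoRA rank
total_params = 70e9                        # base model size

trainable = 2 * d * r * num_layers         # B and A factors in every layer
print(trainable)                           # 83886080  (~84M)
print(round(100 * trainable / total_params, 2))  # 0.12  (% of 70B)
```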

Exercises

ExerciseCore

Problem

A LLaMA 2 70B model has 70 billion parameters stored in float16 (2 bytes per parameter). How much GPU memory is needed to load the model for inference? If you quantize to 4-bit (0.5 bytes per parameter), how much memory is needed? How many consumer GPUs with 24GB VRAM would you need in each case?

ExerciseAdvanced

Problem

LoRA approximates a weight update $\Delta W \in \mathbb{R}^{d \times d}$ as $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$. For LLaMA 2 70B with hidden dimension $d = 8192$ and rank $r = 16$, compute the number of trainable parameters per weight matrix and the compression ratio. What rank $r$ would you need to represent an arbitrary $\Delta W$ exactly?

References

Canonical:

  • Touvron et al., "LLaMA: Open and Efficient Foundation Language Models" (2023)
  • Touvron et al., "LLaMA 2: Open Foundation and Fine-Tuned Chat Models" (2023)
  • Dubey et al., "The Llama 3 Herd of Models" (2024)

Current:

  • Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models" (2022), ICLR
  • Dettmers et al., "QLoRA: Efficient Finetuning of Quantized LLMs" (2023), NeurIPS
  • Jiang et al., "Mistral 7B" (2023)


Last reviewed: April 2026
