Model Timeline
LLaMA and Open Weight Models
The open weight movement in large language models: LLaMA 1/2/3, the ecosystem of fine-tuning and quantization tools, and why open weights changed the dynamics of AI research.
Why This Matters
Before LLaMA (February 2023), state-of-the-art language models were accessible only through APIs controlled by a handful of companies. Researchers could not inspect weights, reproduce results, or build on the models freely. LLaMA changed this by releasing competitive model weights to the research community. Within weeks, the weights leaked publicly. Within months, an ecosystem of fine-tuning, quantization, and inference tools emerged. Open weight models made it possible for individual researchers and small companies to run, study, and modify frontier-class language models.
LLaMA 1 (February 2023)
Architecture. Standard decoder-only transformer with modifications from recent literature: RMSNorm (pre-normalization), SwiGLU activation function, and rotary positional embeddings (RoPE). Context length: 2048 tokens.
Sizes. 7B, 13B, 33B, and 65B parameters.
Training. Trained on 1.0T to 1.4T tokens of publicly available data (Common Crawl, C4, GitHub, Wikipedia, Books, ArXiv, Stack Exchange). No proprietary data. The 65B model used approximately 2048 A100 GPUs for 21 days.
Key result. LLaMA-13B outperformed GPT-3 (175B) on most benchmarks. LLaMA-65B was competitive with Chinchilla (70B) and PaLM (540B). This demonstrated that smaller models trained on more tokens (at, and beyond, Chinchilla-optimal token budgets) can match much larger models trained on fewer tokens.
What it taught us. Two things. First, open data + Chinchilla-optimal training yields competitive models. Second, the research community can do significant work with model weights alone, even without access to training infrastructure. Within weeks, Alpaca (Stanford) fine-tuned LLaMA-7B on 52K instruction-following examples for under $600, producing a model that qualitatively matched GPT-3.5 on simple tasks.
LLaMA 2 (July 2023)
Improvements. Trained on 2T tokens (40% more than LLaMA 1). Context length extended to 4096 tokens. Grouped-Query Attention (GQA) for the 34B and 70B models, reducing memory during inference.
Sizes. 7B, 13B, 34B (not released), and 70B parameters.
Chat models. LLaMA 2 included chat-optimized variants trained with RLHF, similar to the InstructGPT recipe. These were the first open weight models with competitive instruction-following behavior.
License. Released under a custom commercial license allowing use by organizations with fewer than 700M monthly active users. This was a significant shift from LLaMA 1's research-only license.
What it taught us. Open models can include post-training alignment (RLHF) and still be released openly. The commercial license enabled startups and enterprises to build products on open weight models.
LLaMA 3 (April 2024)
Improvements. Trained on over 15T tokens, approximately 7x more than LLaMA 2. Larger vocabulary: 128K tokens (up from 32K). Context length: 8192 tokens, with extended-context variants reaching 128K.
Sizes. 8B and 70B initially, with a 405B model released later.
Key result. LLaMA 3 70B was competitive with GPT-4 on many benchmarks. LLaMA 3 405B approached or matched GPT-4 on most tasks. The 8B model outperformed LLaMA 2 70B on several benchmarks, demonstrating the impact of training on significantly more data.
What it taught us. Training data quantity and quality continue to be the dominant factor. The 8B model trained on 15T tokens outperforms a 70B model trained on 2T tokens on many tasks. This is the Chinchilla insight taken further: keep scaling tokens, not just parameters.
The Open Weight Ecosystem
The release of model weights enabled a tooling ecosystem that did not exist when models were API-only.
Fine-Tuning Tools
LoRA (Low-Rank Adaptation). Instead of updating all parameters during fine-tuning, decompose the weight update into low-rank matrices: ΔW = BA, where B ∈ R^(d×r) and A ∈ R^(r×k) with r ≪ min(d, k). This reduces trainable parameters from d·k to r(d + k).
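A minimal NumPy sketch of the LoRA parameterization (dimensions and initialization scales are illustrative, not the values used by any particular library). Note that the low-rank correction is applied without ever materializing the full d×k update matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, r = 4096, 4096, 8              # weight matrix is d x k; LoRA rank r << min(d, k)

W = rng.normal(size=(d, k))          # frozen base weight
B = np.zeros((d, r))                 # LoRA factor B starts at zero, so the
A = rng.normal(size=(r, k)) * 0.01   # initial update BA is exactly zero

def lora_forward(x):
    # Effective weight is W + B @ A, but we never form the d x k update:
    # compute x @ W^T plus the low-rank correction (x @ A^T) @ B^T.
    return x @ W.T + (x @ A.T) @ B.T

full_params = d * k          # parameters updated by full fine-tuning
lora_params = r * (d + k)    # parameters updated by LoRA
print(full_params, lora_params, full_params / lora_params)
```

For these illustrative dimensions the compression ratio is 256x: 16,777,216 full parameters versus 65,536 LoRA parameters.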
QLoRA. Combine 4-bit quantization of the base model with LoRA fine-tuning. This allows fine-tuning a 65B model on a single 48GB GPU, which is impossible with full-precision fine-tuning.
Quantization
Quantization reduces model size and inference cost by representing weights in lower precision.
Weight Quantization
Map floating-point weights to a discrete set of values using fewer bits. 4-bit quantization represents each weight with 4 bits (16 possible values) instead of 16 bits (float16). A 70B parameter model in float16 requires 140GB; in 4-bit, approximately 35GB.
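The memory arithmetic above can be checked directly. A quick sketch, counting weight storage only (ignoring KV cache and activation memory):

```python
# Weight-only memory footprint of a 70B-parameter model at two precisions.
params = 70e9                 # 70 billion parameters
print(params * 2 / 1e9)       # float16: 2 bytes per parameter -> 140.0 GB
print(params * 0.5 / 1e9)     # 4-bit: 0.5 bytes per parameter -> 35.0 GB
```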
GGUF format. A file format for quantized models designed for CPU inference. Used by llama.cpp, the C++ inference engine that runs LLaMA models on consumer hardware (MacBooks, desktop CPUs). This brought LLM inference to hardware without GPUs.
GPTQ and AWQ. GPU-optimized quantization methods that calibrate quantization parameters using a small dataset to minimize accuracy loss.
Main Theorems
Quantization Error for Uniform Scalar Quantization
Statement
For uniform scalar quantization of a weight w ∈ [w_min, w_max] to 2^b levels, the maximum quantization error per weight is:

|w − Q(w)| ≤ Δ/2, where Δ = (w_max − w_min)/(2^b − 1) is the step size between adjacent levels.

The mean squared quantization error, assuming weights are uniformly distributed in [w_min, w_max], is:

MSE = Δ²/12.

For b = 4 (16 levels) and a typical weight range, this is small enough that model quality degrades only modestly.
Intuition
Dividing the weight range into equal intervals means each weight is rounded to the nearest grid point. The worst case is when the weight falls exactly between two grid points. More bits means finer grid and smaller error. The practical question is whether the cumulative effect of small per-weight errors degrades model outputs.
Proof Sketch
The maximum distance from any point in [w_min, w_max] to the nearest quantization level is half the step size: Δ/2. The MSE follows from computing the variance of a uniform distribution on an interval of width Δ, which is Δ²/12.
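Both bounds can be checked numerically. A sketch with an illustrative range of [−1, 1] and b = 4 bits, using round-to-nearest uniform quantization:

```python
import numpy as np

rng = np.random.default_rng(0)

b = 4                               # bits per weight
levels = 2 ** b                     # 16 quantization levels
lo, hi = -1.0, 1.0                  # illustrative weight range
delta = (hi - lo) / (levels - 1)    # step size between adjacent levels

w = rng.uniform(lo, hi, size=1_000_000)
q = lo + np.round((w - lo) / delta) * delta   # round to nearest grid point

max_err = np.abs(w - q).max()
mse = np.mean((w - q) ** 2)
print(max_err <= delta / 2 + 1e-12)                        # worst case is half the step
print(abs(mse - delta**2 / 12) / (delta**2 / 12) < 0.01)   # empirical MSE matches delta^2/12
```

Both checks print True: the maximum error never exceeds Δ/2, and the empirical MSE matches Δ²/12 to within 1%.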
Why It Matters
This bounds the per-weight error, but the critical question is how errors accumulate through the network. Empirically, 4-bit quantization (with calibration) preserves most model quality. 2-bit quantization causes noticeable degradation. The gap between worst-case bounds and empirical behavior suggests that weight distributions have structure (near-Gaussian, not uniform) that makes quantization more forgiving than worst-case analysis predicts.
Failure Mode
Uniform quantization assumes a uniform weight distribution. Real weight distributions are approximately Gaussian with outliers. Outlier weights far from the mean are poorly quantized because most quantization levels are wasted on the dense central region. Methods like GPTQ and AWQ handle this by using non-uniform quantization or isolating outlier channels.
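A small simulation of this failure mode (the weight scale and outlier value are illustrative): a single outlier stretches the quantization range, inflating the step size and the error on all the well-behaved weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_uniform(x, bits=4):
    # Min-max uniform quantization: grid endpoints sit at x.min() and x.max().
    lo, hi = x.min(), x.max()
    delta = (hi - lo) / (2**bits - 1)
    return lo + np.round((x - lo) / delta) * delta

w = rng.normal(0.0, 0.02, size=10_000)   # near-Gaussian weights, typical scale
w_out = w.copy()
w_out[0] = 1.0                            # one outlier stretches the range ~7x

rmse = lambda a, b: np.sqrt(np.mean((a - b) ** 2))
err_plain = rmse(w, quantize_uniform(w))
err_outlier = rmse(w_out, quantize_uniform(w_out))
print(err_outlier / err_plain)   # error grows several-fold with a single outlier
```

This is why calibration-based methods isolate outliers or use per-channel scales: the outlier itself quantizes fine (it sits at the grid maximum), but it wastes levels that the dense central region needs.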
Comparison with Other Open Models
Mistral 7B (September 2023). Outperformed LLaMA 2 13B on most benchmarks with only 7B parameters. Introduced sliding window attention for efficient long-context processing. Mixtral 8x7B used mixture-of-experts with 47B total parameters but only 13B active per token.
Qwen (Alibaba). Strong multilingual models (especially Chinese and English). Qwen-72B and Qwen2 series competitive with LLaMA 3.
Gemma (Google). 2B and 7B models trained on large proprietary datasets. Competitive at smaller sizes.
The open weight ecosystem is now competitive with closed models for most applications. The gap between open and closed models has shrunk from years to months.
Why Open Weights Matter
Reproducibility. ML research requires running experiments on models. API access does not allow modifying architectures, inspecting internal representations, or controlling inference exactly. Open weights enable mechanistic interpretability, ablation studies, and controlled experiments.
Fine-tuning. Organizations can adapt open models to specific domains (medicine, law, finance) without sending sensitive data to third-party APIs. This addresses privacy and compliance requirements.
Cost. Running inference on your own hardware can be 10-100x cheaper than API access for high-volume applications, especially with quantization.
Resilience. API providers can change pricing, terms, or model behavior at any time. Open weight models provide stability: the weights you download today will work identically forever.
Common Confusions
Open weights is not open source
Releasing model weights is not the same as open source. Open source means releasing code, data, training scripts, and weights under a permissive license. LLaMA releases weights and inference code but not training data or the full training pipeline. You can use the model but cannot fully reproduce the training.
Quantization is not always free
4-bit quantization typically costs 1-3% on benchmarks compared to float16. But this average hides variation: some tasks (especially reasoning-heavy or low-frequency knowledge) degrade more than others. Always evaluate quantized models on your specific task before deploying.
Smaller open models do not replace larger closed models
LLaMA 3 70B is competitive with GPT-4 on many benchmarks, but GPT-4 still outperforms on complex reasoning, long-context tasks, and multimodal understanding (as of early 2024). The gap narrows with each release, but for the hardest tasks, closed frontier models still lead.
Canonical Examples
QLoRA fine-tuning cost estimate
Fine-tuning LLaMA 2 70B with QLoRA on 50,000 instruction examples. Base model quantized to 4-bit: 35GB, fits on a single 48GB GPU. With a small LoRA rank, trainable parameters come to roughly 84M (0.12% of total). Training time: approximately 8 hours on one A100 ($16 on cloud). Compare to full fine-tuning: requires 8x A100 80GB GPUs, ~$500+. QLoRA achieves 95-99% of full fine-tuning quality at 3% of the cost.
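The footprint figures above can be sketched in a few lines (using the 0.12% trainable fraction stated in the example; adapter precision of float16 is an assumption):

```python
# Rough QLoRA memory arithmetic for a 70B model.
total = 70e9                        # total parameters
base_gb = total * 0.5 / 1e9         # 4-bit base weights -> 35.0 GB
trainable = 0.0012 * total          # ~0.12% of parameters in LoRA adapters
adapter_gb = trainable * 2 / 1e9    # adapters in float16 (assumed) -> ~0.17 GB
print(base_gb, round(trainable / 1e6), round(adapter_gb, 2))
```

The adapters themselves are tiny; the quantized base model dominates memory, which is why the whole job fits on one 48GB GPU.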
Exercises
Problem
A LLaMA 2 70B model has 70 billion parameters stored in float16 (2 bytes per parameter). How much GPU memory is needed to load the model for inference? If you quantize to 4-bit (0.5 bytes per parameter), how much memory is needed? How many consumer GPUs with 24GB VRAM would you need in each case?
Problem
LoRA approximates a weight update as ΔW = BA, where B ∈ R^(d×r) and A ∈ R^(r×d). For LLaMA 2 70B with hidden dimension d = 8192 and rank r = 16, compute the number of trainable parameters per weight matrix and the compression ratio. What rank would you need to represent an arbitrary ΔW ∈ R^(d×d) exactly?
References
Canonical:
- Touvron et al., "LLaMA: Open and Efficient Foundation Language Models" (2023)
- Touvron et al., "LLaMA 2: Open Foundation and Fine-Tuned Chat Models" (2023)
- Dubey et al., "The Llama 3 Herd of Models" (2024)
Current:
- Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models" (2022), ICLR
- Dettmers et al., "QLoRA: Efficient Finetuning of Quantized LLMs" (2023), NeurIPS
- Jiang et al., "Mistral 7B" (2023)
Next Topics
- Post-training overview: RLHF, DPO, and instruction tuning applied to open models
- Mixture of experts: the architecture behind Mixtral and other sparse models
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Transformer Architecture (Layer 4)
- Attention Mechanism Theory (Layer 4)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Softmax and Numerical Stability (Layer 1)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in R^n (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)