Model Timeline
A structured factual timeline of major language and multimodal models from GPT-2 through the current frontier, with parameter counts, key innovations, and the ideas that defined each era.
Why This Matters
The landscape of large language models has changed faster than almost any other area of technology. Understanding the timeline is not just historical trivia: each model introduced ideas (scaling laws, RLHF, mixture-of-experts, multimodal fusion, long context, constitutional AI) that define how the field works today.
This page is a structured factual reference. Dates, parameter counts, and capabilities are stated as precisely as public information allows.
The Scaling Era (2019-2020)
GPT-2 (February 2019)
GPT-2
Organization: OpenAI. Parameters: 1.5 billion. Architecture: Decoder-only Transformer. Training data: WebText (40GB of web pages linked from Reddit posts with at least 3 karma, used as a proxy for quality).
Key contribution: Demonstrated that language models trained at sufficient scale exhibit zero-shot task performance. GPT-2 could generate coherent multi-paragraph text, summarize articles, and perform rudimentary translation without task-specific fine-tuning. OpenAI initially withheld the full model citing misuse concerns, sparking the first major debate about responsible release in the LLM era.
GPT-3 (June 2020)
GPT-3
Organization: OpenAI. Parameters: 175 billion. Architecture: Decoder-only Transformer (96 layers, 96 attention heads, hidden dimension 12,288). Training data: Approximately 300 billion tokens from Common Crawl, WebText2, Books, and Wikipedia.
Key contribution: In-context learning (ICL). GPT-3 showed that by simply placing examples in the prompt (few-shot), a frozen model could perform tasks it was never explicitly trained on. This was a qualitative shift: instead of fine-tuning a model per task, you write a prompt. The paper also showed that few-shot performance improves smoothly with model size, consistent with the scaling laws published by Kaplan et al. earlier in 2020.
In-Context Learning as Implicit Bayesian Inference
Few-shot prompting works by conditioning the model on examples that implicitly specify the task. Theoretical work (Xie et al., 2022) argues ICL can be understood as implicit Bayesian inference: the model maintains a posterior over latent "concepts" (tasks) given the prompt, and generates completions consistent with the inferred concept.
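Mechanically, a few-shot prompt is nothing more than demonstrations concatenated ahead of an unanswered query. A minimal sketch (the translation task and the `Input`/`Output` labels are illustrative, not from the GPT-3 paper):

```python
def build_few_shot_prompt(examples, query, input_label="Input", output_label="Output"):
    """Concatenate (input, output) demonstrations, then the unanswered query."""
    lines = []
    for x, y in examples:
        lines.append(f"{input_label}: {x}")
        lines.append(f"{output_label}: {y}")
    lines.append(f"{input_label}: {query}")
    lines.append(f"{output_label}:")  # the frozen model completes from here
    return "\n".join(lines)

# Three demonstrations implicitly specify the task "translate English to French".
demos = [("sea otter", "loutre de mer"), ("cheese", "fromage"), ("hello", "bonjour")]
prompt = build_few_shot_prompt(demos, "thank you")
```

Under the Bayesian-inference view above, the demonstrations shift the model's posterior toward the latent "translation" concept before it ever sees the query.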
The Alignment Era (2022-2023)
ChatGPT (November 2022)
ChatGPT
Organization: OpenAI. Base model: GPT-3.5 (a GPT-3-series model further trained on code and instruction data). Key innovation: Reinforcement Learning from Human Feedback (RLHF).
Key contribution: The consumer product that brought LLMs to mainstream awareness. The RLHF pipeline (supervised fine-tuning followed by reward model training followed by PPO optimization, drawing on policy gradient methods) made the model substantially more helpful, harmless, and conversational than the base GPT-3.5. Reached 100 million users within two months.
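The reward-model stage of that pipeline can be made concrete. Given scalar scores for a chosen and a rejected response, InstructGPT-style pipelines minimize a pairwise (Bradley-Terry) loss; the sketch below uses plain floats standing in for reward-model outputs:

```python
import math

def pairwise_reward_loss(score_chosen, score_rejected):
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected)."""
    diff = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

# The loss shrinks as the reward model ranks the chosen response higher.
loss_good = pairwise_reward_loss(2.0, -1.0)  # chosen clearly preferred
loss_bad = pairwise_reward_loss(-1.0, 2.0)   # ranking inverted
```

The trained reward model then supplies the scalar signal that PPO optimizes against during the final RL stage.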
GPT-4 (March 2023)
GPT-4
Organization: OpenAI. Parameters: Not officially disclosed (rumored mixture-of-experts with approximately 1.8 trillion total parameters). Architecture: Believed to be a mixture-of-experts Transformer.
Key contribution: Multimodal input (text and images). Substantially improved reasoning, factuality, and instruction following over GPT-3.5. Scored around the 90th percentile on a simulated bar exam. Demonstrated that scaling plus RLHF plus careful data curation could produce models with broad expert-level knowledge across many domains.
Claude Family (2023-present)
Claude (Anthropic)
Organization: Anthropic. Models: Claude 1 (March 2023), Claude 2 (July 2023), Claude 3 family (March 2024: Haiku, Sonnet, Opus), Claude 3.5 Sonnet (June 2024), Claude 4 family (2025).
Key innovation: Constitutional AI (CAI). Instead of relying solely on human preference labels for alignment, CAI uses a set of written principles (a "constitution") to guide the model. The model critiques and revises its own outputs according to these principles, then trains on the revised outputs. This makes the alignment process more transparent and scalable than pure RLHF.
Anthropic has emphasized safety research alongside capability, including work on interpretability (circuit analysis, dictionary learning), honest calibration, and refusal behavior.
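The critique-and-revise loop at the heart of CAI can be sketched in a few lines. Here `generate` is a hypothetical stand-in for any model call (string in, string out); real CAI samples critiques and revisions from the model itself and then fine-tunes on the revised outputs:

```python
def constitutional_revision(generate, prompt, principles):
    """Critique-and-revise loop in the style of Constitutional AI (sketch only).

    `generate` stands in for a model call: str -> str. The revised responses
    would become training data for the next round of fine-tuning.
    """
    response = generate(prompt)
    for principle in principles:
        critique = generate(
            f"Critique this response against the principle '{principle}':\n{response}"
        )
        response = generate(
            f"Revise the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {response}"
        )
    return response
```

The key design point: the supervision signal comes from written principles plus the model's own critiques, rather than from per-example human preference labels.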
The Open Model Revolution (2023-present)
Llama Family (2023-present)
Llama (Meta)
Organization: Meta. Models: Llama 1 (Feb 2023, 7B-65B), Llama 2 (July 2023, 7B-70B), Llama 3 (April 2024, 8B and 70B), Llama 3.1 (July 2024, 8B-405B).
Key contribution: Established the open-weight model ecosystem. Llama 1 showed that smaller models trained on more tokens could match larger models (the Chinchilla scaling insight applied aggressively). Llama 2 added RLHF and a permissive license. Llama 3 pushed quality to near-frontier levels. The 405B Llama 3.1 model demonstrated that open-weight models could compete with proprietary systems on many benchmarks.
The Llama family spawned an enormous ecosystem of fine-tunes, quantizations, and derivative models (Alpaca, Vicuna, Code Llama, and many others).
Qwen Family (2023-present)
Qwen (Alibaba)
Organization: Alibaba Cloud. Models: Qwen 1 (2023), Qwen 1.5 (2024), Qwen 2 (June 2024), Qwen 2.5 (September 2024), with sizes from 0.5B to 72B.
Key contribution: Strong multilingual performance, particularly in Chinese and English. Qwen models are open-weight with permissive licenses and have demonstrated competitive performance across coding, mathematics, and general language tasks. The Qwen-VL series added vision capabilities. The family exemplifies the rapid advancement of models developed outside the United States and Europe.
Architectural Innovation (2024-2025)
DeepSeek Family (2024-2025)
DeepSeek (DeepSeek AI)
Organization: DeepSeek AI (China). Key models: DeepSeek-V2 (May 2024), DeepSeek-V3 (December 2024, 671B total, 37B active per token), DeepSeek-R1 (January 2025).
Key innovations:
- Mixture-of-Experts (MoE): DeepSeek-V3 uses MoE with 671 billion total parameters but only 37 billion active per token, achieving frontier performance at dramatically lower inference cost.
- Multi-head Latent Attention (MLA): A KV-cache compression technique that reduces memory requirements for long sequences.
- Reasoning models: DeepSeek-R1 demonstrated that chain-of-thought reasoning can emerge from reinforcement learning on reasoning tasks, achieving competitive performance with OpenAI o1 on math and coding benchmarks. The model was released open-weight, enabling the community to study reasoning model internals.
DeepSeek demonstrated that architectural innovation (MoE, MLA) and clever training (RL for reasoning) can compensate for having less compute than the largest Western labs.
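The routing idea behind MoE can be illustrated with a toy top-k gate: take the router's logits over experts, keep the k largest, and renormalize their softmax weights. Real routers (including DeepSeek-V3's, which adds shared experts and load-balancing objectives) are more elaborate; this is only a sketch:

```python
import math

def topk_gate(logits, k=2):
    """Select the top-k experts for one token and renormalize their weights.

    Returns a dict mapping expert index -> gate weight (weights sum to 1).
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = {i: math.exp(logits[i]) for i in top}
    z = sum(exps.values())
    return {i: exps[i] / z for i in top}

# Four experts; only the two with the highest router logits fire for this token.
weights = topk_gate([0.1, 2.0, -1.0, 1.5], k=2)
```

The token's output is then the gate-weighted sum of the selected experts' outputs, which is why compute per token tracks active (not total) parameters.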
Gemini Family (2023-present)
Gemini (Google DeepMind)
Organization: Google DeepMind. Models: Gemini 1.0 (December 2023: Ultra, Pro, Nano), Gemini 1.5 (February 2024), Gemini 2.0 (December 2024).
Key innovations:
- Natively multimodal: Trained from the ground up on interleaved text, images, audio, and video, rather than bolting vision onto a text model.
- Long context: Gemini 1.5 Pro demonstrated reliable processing of up to 1 million tokens (later extended to 2 million), enabling processing of entire codebases, hours of video, or thousands of pages of documents in a single context window.
- Mixture-of-Experts: Gemini 1.5 uses a MoE architecture for efficiency.
Kimi (2024-present)
Kimi (Moonshot AI)
Organization: Moonshot AI (China). Key innovation: Pioneered extremely long context windows for consumer applications, initially supporting 200K tokens and extending further. Kimi demonstrated that long context is not just a benchmark metric but enables qualitatively different use patterns: processing entire books, long legal documents, and extended conversation histories.
Key Architectural Trends
Dense Models vs Mixture-of-Experts
Early models (GPT-2, GPT-3, Llama 1-2) used dense architectures where every parameter activates for every token. MoE models (DeepSeek-V3, Gemini 1.5, likely GPT-4) activate only a subset of parameters per token. This means a 671B MoE model can have the quality of a very large dense model but the inference cost of a much smaller one. The trade-off: MoE models require more total memory (all experts must be loaded) even though compute per token is lower.
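The trade-off can be quantified with rough arithmetic: per-token compute scales with active parameters (about 2 FLOPs per active parameter for a forward pass), while weight memory scales with total parameters. A back-of-envelope sketch (fp16 weights assumed; KV cache and activation memory ignored):

```python
def moe_vs_dense(total_params_b, active_params_b, bytes_per_param=2):
    """Rough per-token forward compute and weight memory for a model.

    ~2 FLOPs per active parameter per token; bytes_per_param=2 assumes
    fp16/bf16 weights. Deliberately ignores KV cache and activations.
    """
    flops_per_token = 2 * active_params_b * 1e9
    weight_memory_gb = total_params_b * bytes_per_param
    return flops_per_token, weight_memory_gb

# DeepSeek-V3-like MoE (671B total / 37B active) vs a dense 70B model.
moe_flops, moe_mem = moe_vs_dense(671, 37)
dense_flops, dense_mem = moe_vs_dense(70, 70)
```

The numbers make the asymmetry explicit: the MoE model needs far less compute per token but far more memory to hold all experts.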
The RLHF-to-RL-for-Reasoning Pipeline
The alignment methodology evolved from: (1) supervised fine-tuning on human demonstrations, to (2) RLHF with reward models trained on human preferences, to (3) reinforcement learning directly on verifiable reasoning tasks (math, code). DeepSeek-R1 and the OpenAI o-series showed that training models to "think step by step" via RL produces qualitatively stronger reasoning than SFT or standard RLHF alone.
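What makes stage (3) tractable is that the reward is a program, not a learned model. A toy verifiable reward for integer-answer math problems (the `Answer:` format and extraction regex are illustrative; real pipelines use richer answer parsing plus format rewards):

```python
import re

def math_reward(completion, gold_answer):
    """Binary verifiable reward: 1.0 iff the completion's final answer matches.

    Assumes completions end with a line like 'Answer: 42'. Unlike a learned
    reward model, this checker cannot be gamed by fluent-but-wrong text.
    """
    match = re.search(r"Answer:\s*(-?\d+)\s*$", completion.strip())
    return 1.0 if match and match.group(1) == str(gold_answer) else 0.0

r = math_reward("Let me think... 6*7=42.\nAnswer: 42", 42)
```

Because correctness is checked mechanically, RL can scale to millions of problems without human preference labels.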
Parameter Counts at a Glance
| Model | Year | Parameters | Active Params | Key Innovation |
|---|---|---|---|---|
| GPT-2 | 2019 | 1.5B | 1.5B (dense) | Zero-shot generation |
| GPT-3 | 2020 | 175B | 175B (dense) | In-context learning |
| ChatGPT | 2022 | ~175B | ~175B | RLHF for alignment |
| Llama 1 | 2023 | 7-65B | 7-65B (dense) | Open-weight ecosystem |
| GPT-4 | 2023 | Undisclosed | Undisclosed | Multimodal, frontier reasoning |
| Claude 3 Opus | 2024 | Undisclosed | Undisclosed | Constitutional AI |
| Llama 3.1 | 2024 | 8-405B | 8-405B (dense) | Open-weight at frontier scale |
| DeepSeek-V3 | 2024 | 671B | 37B (MoE) | MoE efficiency, MLA |
| DeepSeek-R1 | 2025 | 671B | 37B (MoE) | RL-trained reasoning |
| Gemini 1.5 | 2024 | Undisclosed | Undisclosed | 1M+ context, native multimodal |
| GLM-5 | 2025 | Undisclosed | Undisclosed | Native multimodal, 128K context, agentic |
| Kimi k1.5 | 2025 | Undisclosed | Undisclosed | Long-context + RL reasoning |
| Claude 4 Opus | 2025 | Undisclosed | Undisclosed | Agentic coding, extended thinking |
| Llama 4 | 2025 | 109B-400B (Scout/Maverick) | 17B (MoE) | Open-weight MoE at scale |
| Qwen 3 | 2025 | 0.6-235B | Dense + MoE | Hybrid thinking modes |
Common Confusions
Parameter count does not equal capability
A 671B MoE model with 37B active parameters can outperform a 70B dense model despite using similar compute per token. Total parameter count is a poor proxy for capability. What matters is effective compute (active parameters times tokens processed), data quality, and training methodology (RLHF, RL for reasoning, data curation). Llama 1's 65B model, trained on 1.4T tokens, underperforms Llama 3's 8B model, trained on more than 15T tokens.
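The Llama comparison can be checked with the common ~6·N·D rule of thumb for dense-Transformer training FLOPs (N parameters, D training tokens); this is an approximation, not an exact accounting:

```python
def train_flops(params_b, tokens_t):
    """Approximate training compute via the ~6 * N * D rule of thumb."""
    return 6 * (params_b * 1e9) * (tokens_t * 1e12)

llama1_65b = train_flops(65, 1.4)  # 65B params on 1.4T tokens
llama3_8b = train_flops(8, 15)     # 8B params on 15T tokens
# The smaller, longer-trained model actually consumed more training compute.
```

So the 8B model's advantage is not mysterious: it saw roughly an order of magnitude more tokens, and its total training compute exceeds the larger model's.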
Open-weight is not open-source
Most "open" models release model weights but not training data, training code, or data processing pipelines. Llama, Qwen, and DeepSeek provide weights and inference code under various licenses, but reproducing the training run from scratch is not feasible without the full data pipeline. True open-source would include data, code, and training recipes.
Summary
- GPT-3 (2020) introduced in-context learning, the key capability of the LLM era
- RLHF (ChatGPT, 2022) turned base models into useful assistants
- Constitutional AI (Claude) made alignment more transparent and principle-based
- MoE architectures (DeepSeek-V3, Gemini) decouple total parameters from inference cost
- RL for reasoning (DeepSeek-R1, o-series) is the current frontier of capability improvement
- Open-weight models (Llama, Qwen, DeepSeek) have reached near-parity with proprietary systems
- Long context (Gemini 1.5, Kimi) enables qualitatively new applications
- Parameter count alone is a poor measure of model capability
Exercises
Problem
Explain the difference between few-shot in-context learning (GPT-3 style) and fine-tuning. When would you prefer each approach?
Problem
DeepSeek-V3 has 671B total parameters but only 37B active per token. Explain why, and state one advantage and one disadvantage compared to a dense 70B model.
Problem
Constitutional AI (Claude) and RLHF (ChatGPT) are both alignment methods. Describe the key difference in how they obtain training signal for the preference/reward model.
References
Foundational Papers:
- Radford et al., "Language Models are Unsupervised Multitask Learners" (GPT-2, 2019)
- Brown et al., "Language Models are Few-Shot Learners" (GPT-3, 2020)
- Ouyang et al., "Training language models to follow instructions with human feedback" (InstructGPT/RLHF, 2022)
- Bai et al., "Constitutional AI: Harmlessness from AI Feedback" (Anthropic, 2022)
Open Models:
- Touvron et al., "Llama: Open and Efficient Foundation Language Models" (2023)
- Touvron et al., "Llama 2: Open Foundation and Fine-Tuned Chat Models" (2023)
Architecture Innovation:
- DeepSeek-AI, "DeepSeek-V3 Technical Report" (2024)
- DeepSeek-AI, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via RL" (2025)
This is a living reference document. As new models emerge, check primary technical reports for authoritative details.
Last reviewed: April 2026
Builds on This
- AI Labs Landscape