Model Timeline
A structured factual timeline of major language and multimodal models from GPT-2 through the current frontier, with parameter counts, key innovations, and the ideas that defined each era.
Why This Matters
The landscape of large language models has changed faster than almost any other area of technology. Understanding the timeline is not just historical trivia: each model introduced ideas (scaling laws, RLHF, mixture-of-experts, multimodal fusion, long context, constitutional AI) that define how the field works today.
This page is a structured factual reference. Dates, parameter counts, and capabilities are stated as precisely as public information allows.
The Scaling Era (2019-2020)
GPT-2 (February 2019)
GPT-2
Organization: OpenAI. Parameters: 1.5 billion. Architecture: Decoder-only Transformer. Training data: WebText (40GB of web pages linked from Reddit posts with at least 3 karma, used as a proxy for quality).
Key contribution: Demonstrated that language models trained at sufficient scale exhibit zero-shot task performance. GPT-2 could generate coherent multi-paragraph text, summarize articles, and perform rudimentary translation without task-specific fine-tuning. OpenAI initially withheld the full model citing misuse concerns, sparking the first major debate about responsible release in the LLM era.
GPT-3 (June 2020)
GPT-3
Organization: OpenAI. Parameters: 175 billion. Architecture: Decoder-only Transformer (96 layers, 96 attention heads, hidden dimension 12,288). Training data: Approximately 300 billion tokens from Common Crawl, WebText2, Books, and Wikipedia.
Key contribution: In-context learning (ICL). GPT-3 showed that by simply placing examples in the prompt (few-shot), a frozen model could perform tasks it was never explicitly trained on. This was a qualitative shift: instead of fine-tuning a model per task, you write a prompt. The paper also showed that few-shot performance improves smoothly with model size, consistent with the scaling laws published by Kaplan et al. earlier in 2020.
In-Context Learning as Implicit Bayesian Inference
Few-shot prompting works by conditioning the model on examples that implicitly specify the task. Theoretical work (Xie et al., 2022) argues ICL can be understood as implicit Bayesian inference: the model maintains a posterior over latent "concepts" (tasks) given the prompt, and generates completions consistent with the inferred concept.
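Mechanically, a few-shot prompt is nothing more than demonstrations concatenated ahead of an unanswered query. A minimal sketch (the translation task and the `Input`/`Output` labels are illustrative, not from the GPT-3 paper):

```python
def build_few_shot_prompt(examples, query, input_label="Input", output_label="Output"):
    """Concatenate (input, output) demonstrations, then the unanswered query."""
    lines = []
    for x, y in examples:
        lines.append(f"{input_label}: {x}")
        lines.append(f"{output_label}: {y}")
    lines.append(f"{input_label}: {query}")
    lines.append(f"{output_label}:")  # the frozen model completes from here
    return "\n".join(lines)

# Three demonstrations implicitly specify the task "translate English to French".
demos = [("sea otter", "loutre de mer"), ("cheese", "fromage"), ("hello", "bonjour")]
prompt = build_few_shot_prompt(demos, "thank you")
```

Under the Bayesian-inference view above, the demonstrations shift the model's posterior toward the latent "translation" concept before it ever sees the query.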
The Alignment Era (2022-2023)
ChatGPT (November 2022)
ChatGPT
Organization: OpenAI. Base model: GPT-3.5 (a GPT-3-series model further trained on code and instruction data). Key innovation: Reinforcement Learning from Human Feedback (RLHF).
Key contribution: The consumer product that brought LLMs to mainstream awareness. The RLHF pipeline (supervised fine-tuning followed by reward model training followed by PPO optimization, drawing on policy gradient methods) made the model substantially more helpful, harmless, and conversational than the base GPT-3.5. Reached 100 million users within two months.
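The reward-model stage of that pipeline can be made concrete. Given scalar scores for a chosen and a rejected response, InstructGPT-style pipelines minimize a pairwise (Bradley-Terry) loss; the sketch below uses plain floats standing in for reward-model outputs:

```python
import math

def pairwise_reward_loss(score_chosen, score_rejected):
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected)."""
    diff = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

# The loss shrinks as the reward model ranks the chosen response higher.
loss_good = pairwise_reward_loss(2.0, -1.0)  # chosen clearly preferred
loss_bad = pairwise_reward_loss(-1.0, 2.0)   # ranking inverted
```

The trained reward model then supplies the scalar signal that PPO optimizes against during the final RL stage.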
GPT-4 (March 2023)
GPT-4
Organization: OpenAI. Parameters: Not officially disclosed (rumored mixture-of-experts with approximately 1.8 trillion total parameters). Architecture: Believed to be a mixture-of-experts Transformer.
Key contribution: Multimodal input (text and images). Substantially improved reasoning, factuality, and instruction following over GPT-3.5. Scored around the 90th percentile on a simulated bar exam. Demonstrated that scaling plus RLHF plus careful data curation could produce models with broad expert-level knowledge across many domains.
Claude Family (2023-present)
Claude (Anthropic)
Organization: Anthropic. Models: Claude 1 (March 2023), Claude 2 (July 2023), Claude 3 family (March 2024: Haiku, Sonnet, Opus), Claude 3.5 Sonnet (June 2024), Claude 4 family (2025).
Key innovation: Constitutional AI (CAI). Instead of relying solely on human preference labels for alignment, CAI uses a set of written principles (a "constitution") to guide the model. The model critiques and revises its own outputs according to these principles, then trains on the revised outputs. This makes the alignment process more transparent and scalable than pure RLHF.
Anthropic has emphasized safety research alongside capability, including work on interpretability (circuit analysis, dictionary learning), honest calibration, and refusal behavior.
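The critique-and-revise loop at the heart of CAI can be sketched in a few lines. Here `generate` is a hypothetical stand-in for any model call (string in, string out); real CAI samples critiques and revisions from the model itself and then fine-tunes on the revised outputs:

```python
def constitutional_revision(generate, prompt, principles):
    """Critique-and-revise loop in the style of Constitutional AI (sketch only).

    `generate` stands in for a model call: str -> str. The revised responses
    would become training data for the next round of fine-tuning.
    """
    response = generate(prompt)
    for principle in principles:
        critique = generate(
            f"Critique this response against the principle '{principle}':\n{response}"
        )
        response = generate(
            f"Revise the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {response}"
        )
    return response
```

The key design point: the supervision signal comes from written principles plus the model's own critiques, rather than from per-example human preference labels.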
The Open Model Revolution (2023-present)
Llama Family (2023-present)
Llama (Meta)
Organization: Meta. Models: Llama 1 (Feb 2023, 7B-65B), Llama 2 (July 2023, 7B-70B), Llama 3 (April 2024, 8B and 70B), Llama 3.1 (July 2024, 8B-405B).
Key contribution: Established the open-weight model ecosystem. Llama 1 showed that smaller models trained on more tokens could match larger models (the Chinchilla scaling insight applied aggressively). Llama 2 added RLHF and a permissive license. Llama 3 pushed quality to near-frontier levels. The 405B Llama 3.1 model demonstrated that open-weight models could compete with proprietary systems on many benchmarks.
The Llama family spawned an enormous ecosystem of fine-tunes, quantizations, and derivative models (Alpaca, Vicuna, Code Llama, and many others).
Qwen Family (2023-present)
Qwen (Alibaba)
Organization: Alibaba Cloud. Models: Qwen 1 (2023), Qwen 1.5 (2024), Qwen 2 (June 2024), Qwen 2.5 (September 2024), with sizes from 0.5B to 72B.
Key contribution: Strong multilingual performance, particularly in Chinese and English. Qwen models are open-weight with permissive licenses and have demonstrated competitive performance across coding, mathematics, and general language tasks. The Qwen-VL series added vision capabilities. The family exemplifies the rapid advancement of models developed outside the United States and Europe.
Architectural Innovation (2024-2025)
DeepSeek Family (2024-2025)
DeepSeek (DeepSeek AI)
Organization: DeepSeek AI (China). Key models: DeepSeek-V2 (May 2024), DeepSeek-V3 (December 2024, 671B total, 37B active per token), DeepSeek-R1 (January 2025).
Key innovations:
- Mixture-of-Experts (MoE): DeepSeek-V3 uses MoE with 671 billion total parameters but only 37 billion active per token, achieving frontier performance at dramatically lower inference cost.
- Multi-head Latent Attention (MLA): A KV-cache compression technique that reduces memory requirements for long sequences.
- Reasoning models: DeepSeek-R1 demonstrated that chain-of-thought reasoning can emerge from reinforcement learning on reasoning tasks, achieving competitive performance with OpenAI o1 on math and coding benchmarks. The model was released open-weight, enabling the community to study reasoning model internals.
DeepSeek demonstrated that architectural innovation (MoE, MLA) and clever training (RL for reasoning) can compensate for having less compute than the largest Western labs.
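The routing idea behind MoE can be illustrated with a toy top-k gate: take the router's logits over experts, keep the k largest, and renormalize their softmax weights. Real routers (including DeepSeek-V3's, which adds shared experts and load-balancing objectives) are more elaborate; this is only a sketch:

```python
import math

def topk_gate(logits, k=2):
    """Select the top-k experts for one token and renormalize their weights.

    Returns a dict mapping expert index -> gate weight (weights sum to 1).
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = {i: math.exp(logits[i]) for i in top}
    z = sum(exps.values())
    return {i: exps[i] / z for i in top}

# Four experts; only the two with the highest router logits fire for this token.
weights = topk_gate([0.1, 2.0, -1.0, 1.5], k=2)
```

The token's output is then the gate-weighted sum of the selected experts' outputs, which is why compute per token tracks active (not total) parameters.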
Gemini Family (2023-present)
Gemini (Google DeepMind)
Organization: Google DeepMind. Models: Gemini 1.0 (December 2023: Ultra, Pro, Nano), Gemini 1.5 (February 2024), Gemini 2.0 (December 2024).
Key innovations:
- Natively multimodal: Trained from the ground up on interleaved text, images, audio, and video, rather than bolting vision onto a text model.
- Long context: Gemini 1.5 Pro demonstrated reliable processing of up to 1 million tokens (later extended to 2 million), enabling processing of entire codebases, hours of video, or thousands of pages of documents in a single context window.
- Mixture-of-Experts: Gemini 1.5 uses a MoE architecture for efficiency.
Kimi (2024-present)
Kimi (Moonshot AI)
Organization: Moonshot AI (China). Key innovation: Pioneered extremely long context windows for consumer applications, initially supporting 200K tokens and extending further. Kimi demonstrated that long context is not just a benchmark metric but enables qualitatively different use patterns: processing entire books, long legal documents, and extended conversation histories.
Key Architectural Trends
Dense Models vs Mixture-of-Experts
Early models (GPT-2, GPT-3, Llama 1-2) used dense architectures where every parameter activates for every token. MoE models (DeepSeek-V3, Gemini 1.5, likely GPT-4) activate only a subset of parameters per token. This means a 671B MoE model can have the quality of a very large dense model but the inference cost of a much smaller one. The trade-off: MoE models require more total memory (all experts must be loaded) even though compute per token is lower.
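The trade-off can be quantified with rough arithmetic: per-token compute scales with active parameters (about 2 FLOPs per active parameter for a forward pass), while weight memory scales with total parameters. A back-of-envelope sketch (fp16 weights assumed; KV cache and activation memory ignored):

```python
def moe_vs_dense(total_params_b, active_params_b, bytes_per_param=2):
    """Rough per-token forward compute and weight memory for a model.

    ~2 FLOPs per active parameter per token; bytes_per_param=2 assumes
    fp16/bf16 weights. Deliberately ignores KV cache and activations.
    """
    flops_per_token = 2 * active_params_b * 1e9
    weight_memory_gb = total_params_b * bytes_per_param
    return flops_per_token, weight_memory_gb

# DeepSeek-V3-like MoE (671B total / 37B active) vs a dense 70B model.
moe_flops, moe_mem = moe_vs_dense(671, 37)
dense_flops, dense_mem = moe_vs_dense(70, 70)
```

The numbers make the asymmetry explicit: the MoE model needs far less compute per token but far more memory to hold all experts.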
The RLHF-to-RL-for-Reasoning Pipeline
The alignment methodology evolved from: (1) supervised fine-tuning on human demonstrations, to (2) RLHF with reward models trained on human preferences, to (3) reinforcement learning directly on verifiable reasoning tasks (math, code). DeepSeek-R1 and the OpenAI o-series showed that training models to "think step by step" via RL produces qualitatively stronger reasoning than SFT or standard RLHF alone.
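What makes stage (3) tractable is that the reward is a program, not a learned model. A toy verifiable reward for integer-answer math problems (the `Answer:` format and extraction regex are illustrative; real pipelines use richer answer parsing plus format rewards):

```python
import re

def math_reward(completion, gold_answer):
    """Binary verifiable reward: 1.0 iff the completion's final answer matches.

    Assumes completions end with a line like 'Answer: 42'. Unlike a learned
    reward model, this checker cannot be gamed by fluent-but-wrong text.
    """
    match = re.search(r"Answer:\s*(-?\d+)\s*$", completion.strip())
    return 1.0 if match and match.group(1) == str(gold_answer) else 0.0

r = math_reward("Let me think... 6*7=42.\nAnswer: 42", 42)
```

Because correctness is checked mechanically, RL can scale to millions of problems without human preference labels.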
Parameter Counts at a Glance
| Model | Year | Parameters | Active Params | Key Innovation |
|---|---|---|---|---|
| GPT-2 | 2019 | 1.5B | 1.5B (dense) | Zero-shot generation |
| GPT-3 | 2020 | 175B | 175B (dense) | In-context learning |
| ChatGPT | 2022 | ~175B | ~175B | RLHF for alignment |
| Llama 1 | 2023 | 7-65B | 7-65B (dense) | Open-weight ecosystem |
| GPT-4 | 2023 | Undisclosed | Undisclosed | Multimodal, frontier reasoning |
| Claude 3 Opus | 2024 | Undisclosed | Undisclosed | Constitutional AI |
| Llama 3.1 | 2024 | 8-405B | 8-405B (dense) | Open-weight at frontier scale |
| DeepSeek-V3 | 2024 | 671B | 37B (MoE) | MoE efficiency, MLA |
| DeepSeek-R1 | 2025 | 671B | 37B (MoE) | RL-trained reasoning |
| Gemini 1.5 | 2024 | Undisclosed | Undisclosed | 1M+ context, native multimodal |
| GLM-5 | 2025 | Undisclosed | Undisclosed | Native multimodal, 128K context, agentic |
| Kimi k1.5 | 2025 | Undisclosed | Undisclosed | Long-context + RL reasoning |
| Claude 4 Opus | 2025 | Undisclosed | Undisclosed | Agentic coding, extended thinking |
| Llama 4 | 2025 | 109B-400B (Scout/Maverick) | 17B (MoE) | Open-weight MoE at scale |
| Qwen 3 | 2025 | 0.6-235B | Dense + MoE | Hybrid thinking modes |
Common Confusions
Parameter count does not equal capability
A 671B MoE model with 37B active parameters can outperform a 70B dense model despite using similar compute per token. Total parameter count is a poor proxy for capability. What matters is effective compute (active parameters times tokens processed), data quality, and training methodology (RLHF, RL for reasoning, data curation). Llama 1's 65B model, trained on 1.4T tokens, underperforms Llama 3's 8B model, trained on more than 15T tokens.
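The Llama comparison can be checked with the common ~6·N·D rule of thumb for dense-Transformer training FLOPs (N parameters, D training tokens); this is an approximation, not an exact accounting:

```python
def train_flops(params_b, tokens_t):
    """Approximate training compute via the ~6 * N * D rule of thumb."""
    return 6 * (params_b * 1e9) * (tokens_t * 1e12)

llama1_65b = train_flops(65, 1.4)  # 65B params on 1.4T tokens
llama3_8b = train_flops(8, 15)     # 8B params on 15T tokens
# The smaller, longer-trained model actually consumed more training compute.
```

So the 8B model's advantage is not mysterious: it saw roughly an order of magnitude more tokens, and its total training compute exceeds the larger model's.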
Open-weight is not open-source
Most "open" models release model weights but not training data, training code, or data processing pipelines. Llama, Qwen, and DeepSeek provide weights and inference code under various licenses, but reproducing the training run from scratch is not feasible without the full data pipeline. True open-source would include data, code, and training recipes.
Summary
- GPT-3 (2020) introduced in-context learning, the key capability of the LLM era
- RLHF (ChatGPT, 2022) turned base models into useful assistants
- Constitutional AI (Claude) made alignment more transparent and principle-based
- MoE architectures (DeepSeek-V3, Gemini) decouple total parameters from inference cost
- RL for reasoning (DeepSeek-R1, o-series) is the current frontier of capability improvement
- Open-weight models (Llama, Qwen, DeepSeek) have reached near-parity with proprietary systems
- Long context (Gemini 1.5, Kimi) enables qualitatively new applications
- Parameter count alone is a poor measure of model capability
Exercises
Problem
Explain the difference between few-shot in-context learning (GPT-3 style) and fine-tuning. When would you prefer each approach?
Problem
DeepSeek-V3 has 671B total parameters but only 37B active per token. Explain why, and state one advantage and one disadvantage compared to a dense 70B model.
Problem
Constitutional AI (Claude) and RLHF (ChatGPT) are both alignment methods. Describe the key difference in how they obtain training signal for the preference/reward model.
References
Foundational Papers:
- Radford et al., "Language Models are Unsupervised Multitask Learners" (GPT-2, 2019)
- Brown et al., "Language Models are Few-Shot Learners" (GPT-3, 2020)
- Ouyang et al., "Training language models to follow instructions with human feedback" (InstructGPT/RLHF, 2022)
- Bai et al., "Constitutional AI: Harmlessness from AI Feedback" (Anthropic, 2022)
Open Models:
- Touvron et al., "Llama: Open and Efficient Foundation Language Models" (2023)
- Touvron et al., "Llama 2: Open Foundation and Fine-Tuned Chat Models" (2023)
Architecture Innovation:
- DeepSeek-AI, "DeepSeek-V3 Technical Report" (2024)
- DeepSeek-AI, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via RL" (2025)
This is a living reference document. As new models emerge, check primary technical reports for authoritative details.
Last reviewed: April 2026
Builds on This
- AI Labs Landscape