Beta. Content is under active construction and has not been peer-reviewed.

Model Comparison Table

Structured comparison of major LLM families as of early 2026: architecture, parameters, context length, open weights, and key strengths, with discussion of what comparison tables cannot tell you.


Why This Matters

Choosing a model for a specific application requires comparing across multiple dimensions: capability, cost, latency, context length, open weights availability, and task-specific performance. No single model dominates on every axis. This page provides a structured factual comparison and explains why naive model comparisons are often misleading.

Comparison Table

The following table summarizes publicly available information as of early 2026. Parameter counts marked "undisclosed" reflect cases where the organization has not published official numbers.

| Model | Organization | Params (Total) | Params (Active) | Architecture | Context | Open Weights | Key Strengths |
|---|---|---|---|---|---|---|---|
| GPT-4o | OpenAI | Undisclosed | Undisclosed | Believed MoE | 128K | No | Multimodal (text, image, audio), strong general reasoning |
| GPT-4o-mini | OpenAI | Undisclosed | Undisclosed | Undisclosed | 128K | No | Low cost, fast, good for simple tasks |
| Claude 3.5 Sonnet | Anthropic | Undisclosed | Undisclosed | Dense | 200K | No | Coding, analysis, long-context, instruction following |
| Claude 4 Opus | Anthropic | Undisclosed | Undisclosed | Dense | 200K | No | Complex reasoning, agentic tasks |
| Gemini 2.0 Flash | Google | Undisclosed | Undisclosed | Believed MoE | 1M+ | No | Native multimodal, long context, fast |
| DeepSeek-V3 | DeepSeek AI | 671B | 37B | MoE | 128K | Yes | Math, coding, cost efficiency |
| DeepSeek-R1 | DeepSeek AI | 671B | 37B | MoE | 128K | Yes | Chain-of-thought reasoning, math |
| Llama 3.1 405B | Meta | 405B | 405B | Dense | 128K | Yes | Largest open-weight dense model, broad capability |
| Qwen 3 235B | Alibaba | ~235B | ~22B | MoE | 128K | Yes | Multilingual, math, coding |
| Mistral Large | Mistral AI | Undisclosed | Undisclosed | Undisclosed | 128K | Partial | European alternative, multilingual |
| Gemma 3n | Google | Varies (E2B, E4B) | Varies | MatFormer (Matryoshka Transformer) | 32K | Yes | On-device, efficient, nested sub-models, multimodal |

Notes on the table:

  • "Believed MoE" indicates strong evidence from external analysis or leaks but no official confirmation.
  • Context lengths are the maximum supported; performance may degrade before the limit.
  • "Partial" open weights for Mistral means some models are open (Mistral 7B, Mixtral) while others (Mistral Large) are API-only.

What This Table Does Not Tell You

Benchmarks are noisy

Proposition

Benchmark Score Variance from Evaluation Design

Statement

If a benchmark has $N$ questions and a model's true accuracy is $p$, the observed accuracy $\hat{p}$ has standard error $\sqrt{p(1-p)/N}$. For $N = 500$ questions and $p = 0.85$, the standard error is $\sqrt{0.85 \times 0.15 / 500} \approx 0.016$. A 95% confidence interval for the true accuracy is approximately $[0.82, 0.88]$. Two models scoring 0.85 and 0.83 on a 500-question benchmark are not meaningfully different.

Intuition

Benchmark scores are sample estimates of a model's capability on a distribution of tasks. With a finite number of questions, there is sampling noise. This connects directly to hypothesis testing: small differences (1-3 percentage points on a few-hundred-question benchmark) are often within the noise margin and should not be interpreted as genuine capability differences.

Proof Sketch

Each question is a Bernoulli trial with success probability $p$. The sample mean of $N$ independent Bernoulli trials has variance $p(1-p)/N$. By the CLT, the distribution of $\hat{p}$ is approximately normal for large $N$. The 95% CI is $\hat{p} \pm 1.96\sqrt{p(1-p)/N}$.
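
The calculation above can be sketched directly (`accuracy_ci` is an illustrative helper, not part of any evaluation library):

```python
import math

def accuracy_ci(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation 95% CI for an observed benchmark accuracy."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

# The 500-question example from the text: observed accuracy 0.85.
lo, hi = accuracy_ci(0.85, 500)
print(f"Model A 95% CI: [{lo:.3f}, {hi:.3f}]")  # [0.819, 0.881]

# A second model at 0.83 has a heavily overlapping interval,
# so the 2-point gap is within sampling noise.
lo_b, hi_b = accuracy_ci(0.83, 500)
print(f"Model B 95% CI: [{lo_b:.3f}, {hi_b:.3f}]")
```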

Why It Matters

Model comparison tables and leaderboards often rank models by benchmark scores with differences of 1-2 percentage points. These differences are frequently within sampling error. Claiming model A is "better" than model B based on a 0.5% difference on MMLU (approximately 14K questions across all subjects, but individual subject subsets have far fewer) is statistically unjustified.

Failure Mode

This bound assumes questions are independent and identically distributed, which they are not in practice. Benchmark questions cluster by topic, difficulty, and format. Models may systematically succeed or fail on certain clusters. The effective sample size is smaller than $N$ when questions are correlated, making the true confidence interval wider than the formula suggests.
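
One way to make this concrete is the design-effect correction from cluster sampling. The cluster size and correlation below are hypothetical illustrations, not measured values for any real benchmark:

```python
import math

def effective_n(n: int, cluster_size: float, rho: float) -> float:
    """Effective sample size under the design effect 1 + (m - 1) * rho,
    where m is the average cluster size and rho the within-cluster
    correlation of correct/incorrect outcomes."""
    deff = 1 + (cluster_size - 1) * rho
    return n / deff

# Hypothetical: 500 questions grouped in topic clusters of 10,
# with within-cluster correlation 0.2.
n_eff = effective_n(500, 10, 0.2)
se_naive = math.sqrt(0.85 * 0.15 / 500)
se_adjusted = math.sqrt(0.85 * 0.15 / n_eff)
print(f"effective N: {n_eff:.0f}")            # ~179, not 500
print(f"SE: {se_naive:.3f} -> {se_adjusted:.3f}")  # noticeably wider
```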

Benchmarks are not your task

A model that scores highest on MMLU may not be best for your specific application. Benchmarks measure performance on standardized question sets. Your task has specific characteristics: domain vocabulary, input format, output requirements, latency constraints, and cost budget. The only reliable way to choose a model is to evaluate candidates on your actual task with your actual data. See model evaluation best practices for a systematic approach.

Contamination

Many benchmarks are partially or fully present in training data. Models may have memorized specific benchmark questions, inflating their scores. Newer benchmarks (like GPQA, SWE-bench, or LiveCodeBench) are more resistant to contamination, but no benchmark is immune once it becomes widely used.

Pricing changes frequently

API pricing is a competitive lever. Providers cut prices regularly. Any specific pricing comparison is outdated within months. The structural insight is more durable: smaller/faster models (Haiku, Flash, GPT-4o-mini) cost 10-30x less per token than flagship models (Opus, Ultra, GPT-4). For many tasks, the cheaper model is sufficient.

Axes of Comparison

Architecture: Dense vs. MoE

Dense models (Llama 3.1 405B, Claude) activate all parameters for every token. MoE models (DeepSeek-V3, Gemini 1.5, likely GPT-4) activate a subset. Both build on the core transformer architecture. The trade-offs:

  • MoE advantage: lower compute per token, so faster and cheaper inference for the same quality level.
  • MoE disadvantage: higher memory footprint (all experts must be loaded), more complex training (load balancing, expert collapse).
  • Dense advantage: simpler to train, deploy, and quantize. No routing overhead.
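
The compute/memory trade-off can be sketched with the common back-of-envelope estimates (~2 FLOPs per active parameter per token, 2 bytes per parameter in float16; these are rough rules of thumb, ignoring KV cache and activation memory):

```python
def inference_profile(total_b: float, active_b: float) -> dict:
    """Rough per-token compute and weights-only fp16 memory for a model,
    given total and active parameter counts in billions."""
    return {
        "tflops_per_token": 2 * active_b * 1e9 / 1e12,  # ~2 FLOPs/active param
        "fp16_weights_gb": 2 * total_b,                 # 2 bytes/param
    }

moe = inference_profile(total_b=671, active_b=37)     # DeepSeek-V3-like
dense = inference_profile(total_b=405, active_b=405)  # Llama 3.1 405B-like

# MoE: ~11x less compute per token, but ~1.7x more weight memory.
print(moe)    # {'tflops_per_token': 0.074, 'fp16_weights_gb': 1342}
print(dense)  # {'tflops_per_token': 0.81, 'fp16_weights_gb': 810}
```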

Context Length

| Model | Max Context |
|---|---|
| Gemini 2.0 Flash | 1M+ tokens |
| Claude 3.5 Sonnet / Claude 4 | 200K tokens |
| GPT-4o | 128K tokens |
| DeepSeek-V3 / R1 | 128K tokens |
| Llama 3.1 405B | 128K tokens |
| Qwen 3 235B | 128K tokens |

Context length matters for tasks that require processing large documents, codebases, or conversation histories. For shorter inputs, a larger context window provides no benefit.

Open Weights

Open weights enable: fine-tuning for specific domains, running inference on your own hardware (no API dependency), inspecting model internals for research, and quantization for deployment on smaller hardware.

| Open Weights | Models |
|---|---|
| Yes | DeepSeek-V3, DeepSeek-R1, Llama 3.1, Qwen 3, Gemma 3n |
| No | GPT-4o, Claude, Gemini (frontier models) |

For production applications with data privacy requirements or high-volume inference, open-weight models can be 10-100x cheaper than API access.
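
The break-even point between API access and self-hosting is simple arithmetic. All figures below are hypothetical placeholders, not real prices:

```python
def monthly_api_cost(requests: int, tokens_per_request: int,
                     usd_per_million_tokens: float) -> float:
    """API spend scales linearly with token volume."""
    return requests * tokens_per_request * usd_per_million_tokens / 1e6

def breakeven_requests(fixed_selfhost_usd: float, tokens_per_request: int,
                       usd_per_million_tokens: float) -> float:
    """Monthly request volume at which self-hosting's fixed cost
    equals the equivalent API spend."""
    return fixed_selfhost_usd * 1e6 / (tokens_per_request * usd_per_million_tokens)

# Hypothetical: $1.00 per 1M input tokens via API, $20,000/month of
# GPU infrastructure for self-hosting, 800 tokens per request.
print(breakeven_requests(20_000, 800, 1.00))  # 25,000,000 requests/month
```

Below the break-even volume the API is cheaper; above it, self-hosting wins, before accounting for operational overhead.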

Reasoning Models

A distinct category emerged in 2024-2025: models trained specifically for multi-step reasoning via reinforcement learning (RL).

| Model | Type | Approach |
|---|---|---|
| OpenAI o1 / o3 | Closed | RL on reasoning tasks, extended thinking |
| DeepSeek-R1 | Open weights | RL on math/code, open chain-of-thought |
| Claude with extended thinking | Closed | Extended thinking mode for complex tasks |
| Gemini 2.0 Flash Thinking | Closed | Reasoning mode with visible thinking |

These models trade latency for accuracy: they generate long internal reasoning chains before producing a final answer. On math and coding benchmarks, they substantially outperform standard models. On simple tasks, they are slower and more expensive with little benefit.

How to Choose a Model

There is no universal best model. The right choice depends on your constraints:

1. What is your task? Coding tasks favor Claude and DeepSeek. Multilingual tasks favor Qwen and Gemini. Multimodal tasks favor Gemini and GPT-4o. Reasoning-heavy tasks favor R1 and o1-class models.

2. What are your cost constraints? For high-volume applications, open-weight models with self-hosted inference are cheapest. For low-volume or prototyping, API access is simpler.

3. Do you need open weights? If you need to fine-tune, run on-premise, or inspect model internals, your options are Llama, Qwen, DeepSeek, Gemma, and Mistral.

4. What context length do you need? If you need to process documents longer than 128K tokens, Gemini is the only production option with 1M+ support.

5. What latency do you need? Smaller models (Haiku, Flash, GPT-4o-mini) are 3-10x faster than flagship models. For real-time applications, latency often matters more than marginal quality.

The most common mistake is choosing based on benchmark rankings. Evaluate on your task. A model that scores 2% lower on MMLU but costs 5x less and is 3x faster may be the correct choice for your application.

Common Confusions

Watch Out

Higher benchmark score does not mean better for your task

MMLU measures broad knowledge. HumanEval measures Python coding. MATH measures competition math. A model that tops one benchmark may underperform on your specific domain. Always evaluate on data representative of your actual use case. Benchmark rankings are starting points for a shortlist, not final decisions.

Watch Out

Parameter count is not a capability measure

DeepSeek-V3 (671B total, 37B active) outperforms Llama 3.1 405B (405B, all active) on many benchmarks despite activating fewer parameters per token. MoE architecture, training data quality, and training methodology all matter more than raw parameter count. Comparing models by parameter count alone is misleading.

Watch Out

Open weights does not mean free

Running a 405B parameter model requires multiple high-end GPUs. The hardware cost for self-hosted inference can exceed API costs at low volume. Open weights are cost-effective only at sufficient scale (typically thousands to millions of requests per day) or when fine-tuning and data privacy are requirements.

Watch Out

Frontier is a moving target

Any specific comparison in this table will be outdated within months. New model releases occur roughly quarterly from each major lab. The structural patterns (MoE vs. dense trade-offs, cost tiers, open vs. closed dynamics) are more durable than specific benchmark numbers.

Exercises

Exercise (Core)

Problem

A benchmark has 1000 questions. Model A scores 87.2% and Model B scores 85.8%. Compute the standard error for each score and determine whether the difference is statistically significant at the 95% confidence level.

Exercise (Core)

Problem

An MoE model has 600B total parameters and activates 40B per token. A dense model has 70B parameters (all active). Compare the per-token compute cost (in terms of FLOPs, proportional to active parameters) and the memory required to load each model in float16.

Exercise (Advanced)

Problem

You are choosing between three models for a customer support chatbot processing 10 million messages per month. Model A (API): 0.50 USD per 1M input tokens, 95% task accuracy. Model B (open weights, self-hosted): 15,000 USD/month fixed infrastructure cost, unlimited tokens, 93% task accuracy. Model C (API, cheaper): 0.05 USD per 1M input tokens, 88% task accuracy. Average message length: 500 tokens. Which model is cheapest? At what accuracy threshold would you switch from the cheapest to a more expensive option?

References

Canonical:

  • Liang et al., "Holistic Evaluation of Language Models" (HELM, Stanford, 2023)
  • Srivastava et al., "Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models" (BIG-bench, 2023)

Current:

  • Chatbot Arena (LMSYS), ongoing Elo-based model comparison from human preferences
  • Artificial Analysis, model speed and pricing benchmarks (updated continuously)

This is a living reference. Consult primary technical reports and evaluation platforms for the latest data.

Last reviewed: April 2026
