Model Comparison Table
Structured comparison of major LLM families as of early 2026: architecture, parameters, context length, open weights, and key strengths, with discussion of what comparison tables cannot tell you.
Why This Matters
Choosing a model for a specific application requires comparing across multiple dimensions: capability, cost, latency, context length, open weights availability, and task-specific performance. No single model dominates on every axis. This page provides a structured factual comparison and explains why naive model comparisons are often misleading.
Comparison Table
The following table summarizes publicly available information as of early 2026. Parameter counts marked "undisclosed" reflect cases where the organization has not published official numbers.
| Model | Organization | Params (Total) | Params (Active) | Architecture | Context | Open Weights | Key Strengths |
|---|---|---|---|---|---|---|---|
| GPT-4o | OpenAI | Undisclosed | Undisclosed | Believed MoE | 128K | No | Multimodal (text, image, audio), strong general reasoning |
| GPT-4o-mini | OpenAI | Undisclosed | Undisclosed | Undisclosed | 128K | No | Low cost, fast, good for simple tasks |
| Claude 3.5 Sonnet | Anthropic | Undisclosed | Undisclosed | Dense | 200K | No | Coding, analysis, long-context, instruction following |
| Claude 4 Opus | Anthropic | Undisclosed | Undisclosed | Dense | 200K | No | Complex reasoning, agentic tasks |
| Gemini 2.0 Flash | Google DeepMind | Undisclosed | Undisclosed | Believed MoE | 1M+ | No | Native multimodal, long context, fast |
| DeepSeek-V3 | DeepSeek AI | 671B | 37B | MoE | 128K | Yes | Math, coding, cost efficiency |
| DeepSeek-R1 | DeepSeek AI | 671B | 37B | MoE | 128K | Yes | Chain-of-thought reasoning, math |
| Llama 3.1 405B | Meta | 405B | 405B | Dense | 128K | Yes | Largest open-weight dense model, broad capability |
| Qwen 3 235B | Alibaba | ~235B | ~22B | MoE | 128K | Yes | Multilingual, math, coding |
| Mistral Large | Mistral AI | Undisclosed | Undisclosed | Undisclosed | 128K | Partial | European alternative, multilingual |
| Gemma 3n | Google | Varies (E2B, E4B) | Varies | MatFormer (Matryoshka Transformer) | 32K | Yes | On-device, efficient, nested sub-models, multimodal |
Notes on the table:
- "Believed MoE" indicates strong evidence from external analysis or leaks but no official confirmation.
- Context lengths are the maximum supported; performance may degrade before the limit.
- "Partial" open weights for Mistral means some models are open (Mistral 7B, Mixtral) while others (Mistral Large) are API-only.
What This Table Does Not Tell You
Benchmarks are noisy
Benchmark Score Variance from Evaluation Design
Statement
If a benchmark has $n$ questions and a model's true accuracy is $p$, the observed accuracy $\hat{p}$ has standard error $\sqrt{p(1-p)/n}$. For $n = 500$ questions and $p = 0.85$, the standard error is approximately $0.016$. A 95% confidence interval for the true accuracy is approximately $\hat{p} \pm 1.96\sqrt{\hat{p}(1-\hat{p})/n}$. Two models scoring 0.85 and 0.83 on a 500-question benchmark are not meaningfully different.
Intuition
Benchmark scores are sample estimates of a model's capability on a distribution of tasks. With a finite number of questions, there is sampling noise. This connects directly to hypothesis testing: small differences (1-3 percentage points on a few-hundred-question benchmark) are often within the noise margin and should not be interpreted as genuine capability differences.
Proof Sketch
Each question is a Bernoulli trial with success probability $p$. The sample mean $\hat{p}$ of $n$ independent Bernoulli trials has variance $p(1-p)/n$. By the CLT, the distribution of $\hat{p}$ is approximately normal for large $n$. The 95% CI is $\hat{p} \pm 1.96\sqrt{\hat{p}(1-\hat{p})/n}$.
Why It Matters
Model comparison tables and leaderboards often rank models by benchmark scores with differences of 1-2 percentage points. These differences are frequently within sampling error. Claiming model A is "better" than model B based on a 0.5% difference on MMLU (approximately 14K questions across all subjects, but individual subject subsets have far fewer) is statistically unjustified.
Failure Mode
This bound assumes questions are independent and identically distributed, which they are not in practice. Benchmark questions cluster by topic, difficulty, and format. Models may systematically succeed or fail on certain clusters. The effective sample size is smaller than $n$ when questions are correlated, making the true confidence interval wider than the formula suggests.
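The normal-approximation arithmetic above can be checked directly. The sketch below computes the standard error, the 95% confidence interval, and a two-proportion z-statistic for the 500-question example; the scores and benchmark size are the illustrative figures from the statement, not measurements of real models.

```python
import math

def standard_error(p, n):
    """Standard error of an observed accuracy p over n independent questions."""
    return math.sqrt(p * (1 - p) / n)

def ci95(p_hat, n):
    """Approximate 95% confidence interval via the normal approximation."""
    se = standard_error(p_hat, n)
    return (p_hat - 1.96 * se, p_hat + 1.96 * se)

# Two models on a 500-question benchmark
n = 500
score_a, score_b = 0.85, 0.83
se_diff = math.sqrt(standard_error(score_a, n) ** 2 + standard_error(score_b, n) ** 2)
z = (score_a - score_b) / se_diff

print(ci95(score_a, n))  # roughly (0.819, 0.881): model B's score lies inside it
print(round(z, 2))       # z well below 1.96: the difference is not significant
```

Note that model B's score of 0.83 falls inside model A's confidence interval, which is the quantitative content of "not meaningfully different."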
Benchmarks are not your task
A model that scores highest on MMLU may not be best for your specific application. Benchmarks measure performance on standardized question sets. Your task has specific characteristics: domain vocabulary, input format, output requirements, latency constraints, and cost budget. The only reliable way to choose a model is to evaluate candidates on your actual task with your actual data. See model evaluation best practices for a systematic approach.
Contamination
Many benchmarks are partially or fully present in training data. Models may have memorized specific benchmark questions, inflating their scores. Newer benchmarks (like GPQA, SWE-bench, or LiveCodeBench) are more resistant to contamination, but no benchmark is immune once it becomes widely used.
Pricing changes frequently
API pricing is a competitive lever. Providers cut prices regularly. Any specific pricing comparison is outdated within months. The structural insight is more durable: smaller/faster models (Haiku, Flash, GPT-4o-mini) cost 10-30x less per token than flagship models (Opus, Ultra, GPT-4). For many tasks, the cheaper model is sufficient.
Axes of Comparison
Architecture: Dense vs. MoE
Dense models (Llama 3.1 405B, Claude) activate all parameters for every token. MoE models (DeepSeek-V3, Gemini 1.5, likely GPT-4) activate a subset. Both build on the core transformer architecture. The trade-offs:
- MoE advantage: lower compute per token, so faster and cheaper inference for the same quality level.
- MoE disadvantage: higher memory footprint (all experts must be loaded), more complex training (load balancing, expert collapse).
- Dense advantage: simpler to train, deploy, and quantize. No routing overhead.
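The compute/memory trade-off can be made concrete with rough arithmetic. The sketch below uses two standard approximations (about 2 FLOPs per active parameter per generated token, and 2 bytes per weight in float16) together with the parameter counts from the table above; treat the outputs as order-of-magnitude estimates, not measured figures.

```python
def per_token_gflops(active_params_b):
    """Forward-pass compute per token ~ 2 FLOPs per active parameter (GFLOPs)."""
    return 2.0 * active_params_b  # params in billions -> GFLOPs

def fp16_weight_memory_gb(total_params_b):
    """Memory to hold all weights in float16: 2 bytes per parameter (GB)."""
    return 2.0 * total_params_b  # params in billions -> GB

# MoE (DeepSeek-V3: 671B total, 37B active) vs. dense (Llama 3.1 405B)
print(per_token_gflops(37), fp16_weight_memory_gb(671))   # ~74 GFLOPs/token, ~1342 GB
print(per_token_gflops(405), fp16_weight_memory_gb(405))  # ~810 GFLOPs/token, ~810 GB
```

The MoE model does roughly 11x less compute per token yet needs more memory to load, which is exactly the trade-off listed above.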
Context Length
| Model | Max Context |
|---|---|
| Gemini 2.0 Flash | 1M+ tokens |
| Claude 3.5 Sonnet / Claude 4 | 200K tokens |
| GPT-4o | 128K tokens |
| DeepSeek-V3 / R1 | 128K tokens |
| Llama 3.1 405B | 128K tokens |
| Qwen 3 235B | 128K tokens |
Context length matters for tasks that require processing large documents, codebases, or conversation histories. For shorter inputs, a larger context window provides no benefit.
Open Weights
Open weights enable: fine-tuning for specific domains, running inference on your own hardware (no API dependency), inspecting model internals for research, and quantization for deployment on smaller hardware.
| Open Weights | Models |
|---|---|
| Yes | DeepSeek-V3, DeepSeek-R1, Llama 3.1, Qwen 3, Gemma 3n |
| No | GPT-4o, Claude, Gemini (frontier models) |
For production applications with data privacy requirements or high-volume inference, open-weight models can be 10-100x cheaper than API access.
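The break-even point between API access and self-hosting is simple arithmetic over traffic volume. The sketch below uses hypothetical figures (an assumed $0.50 per 1M input tokens, 500 tokens per message, and a $15,000/month self-hosted infrastructure cost, loosely mirroring the exercise at the end of this page):

```python
def api_monthly_cost(messages_per_month, tokens_per_message, usd_per_million_tokens):
    """Monthly API spend for a given traffic volume."""
    return messages_per_month * tokens_per_message / 1e6 * usd_per_million_tokens

SELF_HOSTED_FIXED = 15_000.0  # hypothetical monthly infrastructure cost (USD)

for volume in (1_000_000, 10_000_000, 100_000_000):
    api = api_monthly_cost(volume, 500, 0.50)
    winner = "self-hosted" if SELF_HOSTED_FIXED < api else "API"
    print(f"{volume:>11,} msgs/month: API ${api:,.0f} -> {winner} is cheaper")
```

Under these assumptions, self-hosting only wins somewhere between 10M and 100M messages per month, which is why open weights are cost-effective only at sufficient scale.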
Reasoning Models
A distinct category emerged in 2024-2025: models trained specifically for multi-step reasoning via RL.
| Model | Type | Approach |
|---|---|---|
| OpenAI o1 / o3 | Closed | RL on reasoning tasks, extended thinking |
| DeepSeek-R1 | Open weights | RL on math/code, open chain-of-thought |
| Claude with extended thinking | Closed | Extended thinking mode for complex tasks |
| Gemini 2.0 Flash Thinking | Closed | Reasoning mode with visible thinking |
These models trade latency for accuracy: they generate long internal reasoning chains before producing a final answer. On math and coding benchmarks, they substantially outperform standard models. On simple tasks, they are slower and more expensive with little benefit.
How to Choose a Model
There is no universal best model. The right choice depends on your constraints:
1. What is your task? Coding tasks favor Claude and DeepSeek. Multilingual tasks favor Qwen and Gemini. Multimodal tasks favor Gemini and GPT-4o. Reasoning-heavy tasks favor R1 and o1-class models.
2. What are your cost constraints? For high-volume applications, open-weight models with self-hosted inference are cheapest. For low-volume or prototyping, API access is simpler.
3. Do you need open weights? If you need to fine-tune, run on-premise, or inspect model internals, your options are Llama, Qwen, DeepSeek, Gemma, and Mistral.
4. What context length do you need? If you need to process documents longer than 128K tokens, Gemini is the only production option with 1M+ support.
5. What latency do you need? Smaller models (Haiku, Flash, GPT-4o-mini) are 3-10x faster than flagship models. For real-time applications, latency often matters more than marginal quality.
The most common mistake is choosing based on benchmark rankings. Evaluate on your task. A model that scores 2% lower on MMLU but costs 5x less and is 3x faster may be the correct choice for your application.
Common Confusions
Higher benchmark score does not mean better for your task
MMLU measures broad knowledge. HumanEval measures Python coding. MATH measures competition math. A model that tops one benchmark may underperform on your specific domain. Always evaluate on data representative of your actual use case. Benchmark rankings are starting points for a shortlist, not final decisions.
Parameter count is not a capability measure
DeepSeek-V3 (671B total, 37B active) outperforms Llama 3.1 405B (405B, all active) on many benchmarks despite activating fewer parameters per token. MoE architecture, training data quality, and training methodology all matter more than raw parameter count. Comparing models by parameter count alone is misleading.
Open weights does not mean free
Running a 405B parameter model requires multiple high-end GPUs. The hardware cost for self-hosted inference can exceed API costs at low volume. Open weights are cost-effective only at sufficient scale (typically thousands to millions of requests per day) or when fine-tuning and data privacy are requirements.
Frontier is a moving target
Any specific comparison in this table will be outdated within months. New model releases occur roughly quarterly from each major lab. The structural patterns (MoE vs. dense trade-offs, cost tiers, open vs. closed dynamics) are more durable than specific benchmark numbers.
Exercises
Problem
A benchmark has 1000 questions. Model A scores 87.2% and Model B scores 85.8%. Compute the standard error for each score and determine whether the difference is statistically significant at the 95% confidence level.
Problem
An MoE model has 600B total parameters and activates 40B per token. A dense model has 70B parameters (all active). Compare the per-token compute cost (in terms of FLOPs, proportional to active parameters) and the memory required to load each model in float16.
Problem
You are choosing between three models for a customer support chatbot processing 10 million messages per month. Model A (API): 0.50 USD per 1M input tokens, 95% task accuracy. Model B (open weights, self-hosted): 15,000 USD/month fixed infrastructure cost, unlimited tokens, 93% task accuracy. Model C (API, cheaper): 0.05 USD per 1M input tokens, 88% task accuracy. Average message length: 500 tokens. Which model is cheapest? At what accuracy threshold would you switch from the cheapest to a more expensive option?
References
Canonical:
- Liang et al., "Holistic Evaluation of Language Models" (HELM, Stanford, 2023)
- Srivastava et al., "Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models" (BIG-bench, 2023)
Current:
- Chatbot Arena (LMSYS), ongoing Elo-based model comparison from human preferences
- Artificial Analysis, model speed and pricing benchmarks (updated continuously)
This is a living reference. Consult primary technical reports and evaluation platforms for the latest data.
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Transformer Architecture (Layer 4)
- Attention Mechanism Theory (Layer 4)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Softmax and Numerical Stability (Layer 1)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in R^n (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)