Beta. Content is under active construction and has not been peer-reviewed.

Model Comparison Table

Structured comparison of major LLM families as of early 2026: architecture, parameters, context length, open weights, and key strengths, with discussion of what comparison tables cannot tell you.


Why This Matters

Choosing a model for a specific application requires comparing across multiple dimensions: capability, cost, latency, context length, open weights availability, and task-specific performance. No single model dominates on every axis. This page provides a structured factual comparison and explains why naive model comparisons are often misleading.

Comparison Table

The following table summarizes publicly available information as of early 2026. Parameter counts marked "undisclosed" reflect cases where the organization has not published official numbers.

| Model | Organization | Params (Total) | Params (Active) | Architecture | Context | Open Weights | Key Strengths |
|---|---|---|---|---|---|---|---|
| GPT-4o | OpenAI | Undisclosed | Undisclosed | Believed MoE | 128K | No | Multimodal (text, image, audio), strong general reasoning |
| GPT-4o-mini | OpenAI | Undisclosed | Undisclosed | Undisclosed | 128K | No | Low cost, fast, good for simple tasks |
| Claude 3.5 Sonnet | Anthropic | Undisclosed | Undisclosed | Dense | 200K | No | Coding, analysis, long-context, instruction following |
| Claude 4 Opus | Anthropic | Undisclosed | Undisclosed | Dense | 200K | No | Complex reasoning, agentic tasks |
| Gemini 2.0 Flash | Google | Undisclosed | Undisclosed | Believed MoE | 1M+ | No | Native multimodal, long context, fast |
| DeepSeek-V3 | DeepSeek AI | 671B | 37B | MoE | 128K | Yes | Math, coding, cost efficiency |
| DeepSeek-R1 | DeepSeek AI | 671B | 37B | MoE | 128K | Yes | Chain-of-thought reasoning, math |
| Llama 3.1 405B | Meta | 405B | 405B | Dense | 128K | Yes | Largest open-weight dense model, broad capability |
| Qwen 3 235B | Alibaba | ~235B | ~22B | MoE | 128K | Yes | Multilingual, math, coding |
| Mistral Large | Mistral AI | Undisclosed | Undisclosed | Undisclosed | 128K | Partial | European alternative, multilingual |
| Gemma 3n | Google | Varies (E2B, E4B) | Varies | MatFormer (Matryoshka Transformer) | 32K | Yes | On-device, efficient, nested sub-models, multimodal |

Notes on the table:

  • "Believed MoE" indicates strong evidence from external analysis or leaks but no official confirmation.
  • Context lengths are the maximum supported; performance may degrade before the limit.
  • "Partial" open weights for Mistral means some models are open (Mistral 7B, Mixtral) while others (Mistral Large) are API-only.

What This Table Does Not Tell You

Benchmarks are noisy

Proposition

Benchmark Score Variance from Evaluation Design

Statement

If a benchmark has $N$ questions and a model's true accuracy is $p$, the observed accuracy $\hat{p}$ has standard error $\sqrt{p(1-p)/N}$. For $N = 500$ questions and $p = 0.85$, the standard error is $\sqrt{0.85 \times 0.15 / 500} \approx 0.016$. A 95% confidence interval for the true accuracy is approximately $[0.82, 0.88]$. Two models scoring 0.85 and 0.83 on a 500-question benchmark are not meaningfully different.

Intuition

Benchmark scores are sample estimates of a model's capability on a distribution of tasks. With a finite number of questions, there is sampling noise. This connects directly to hypothesis testing: small differences (1-3 percentage points on a few-hundred-question benchmark) are often within the noise margin and should not be interpreted as genuine capability differences.

Proof Sketch

Each question is a Bernoulli trial with success probability $p$. The sample mean of $N$ independent Bernoulli trials has variance $p(1-p)/N$. By the CLT, the distribution of $\hat{p}$ is approximately normal for large $N$. The 95% CI is $\hat{p} \pm 1.96\sqrt{p(1-p)/N}$.
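
The calculation above can be sketched directly (`accuracy_ci` is an illustrative helper, not part of any evaluation library):

```python
import math

def accuracy_ci(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation 95% CI for an observed benchmark accuracy."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

# The 500-question example from the text: observed accuracy 0.85.
lo, hi = accuracy_ci(0.85, 500)
print(f"Model A 95% CI: [{lo:.3f}, {hi:.3f}]")  # [0.819, 0.881]

# A second model at 0.83 has a heavily overlapping interval,
# so the 2-point gap is within sampling noise.
lo_b, hi_b = accuracy_ci(0.83, 500)
print(f"Model B 95% CI: [{lo_b:.3f}, {hi_b:.3f}]")
```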

Why It Matters

Model comparison tables and leaderboards often rank models by benchmark scores with differences of 1-2 percentage points. These differences are frequently within sampling error. Claiming model A is "better" than model B based on a 0.5% difference on MMLU (approximately 14K questions across all subjects, but individual subject subsets have far fewer) is statistically unjustified.

Failure Mode

This bound assumes questions are independent and identically distributed, which they are not in practice. Benchmark questions cluster by topic, difficulty, and format. Models may systematically succeed or fail on certain clusters. The effective sample size is smaller than $N$ when questions are correlated, making the true confidence interval wider than the formula suggests.
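
One way to make this concrete is the design-effect correction from cluster sampling. The cluster size and correlation below are hypothetical illustrations, not measured values for any real benchmark:

```python
import math

def effective_n(n: int, cluster_size: float, rho: float) -> float:
    """Effective sample size under the design effect 1 + (m - 1) * rho,
    where m is the average cluster size and rho the within-cluster
    correlation of correct/incorrect outcomes."""
    deff = 1 + (cluster_size - 1) * rho
    return n / deff

# Hypothetical: 500 questions grouped in topic clusters of 10,
# with within-cluster correlation 0.2.
n_eff = effective_n(500, 10, 0.2)
se_naive = math.sqrt(0.85 * 0.15 / 500)
se_adjusted = math.sqrt(0.85 * 0.15 / n_eff)
print(f"effective N: {n_eff:.0f}")            # ~179, not 500
print(f"SE: {se_naive:.3f} -> {se_adjusted:.3f}")  # noticeably wider
```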

Benchmarks are not your task

A model that scores highest on MMLU may not be best for your specific application. Benchmarks measure performance on standardized question sets. Your task has specific characteristics: domain vocabulary, input format, output requirements, latency constraints, and cost budget. The only reliable way to choose a model is to evaluate candidates on your actual task with your actual data. See model evaluation best practices for a systematic approach.

Contamination

Many benchmarks are partially or fully present in training data. Models may have memorized specific benchmark questions, inflating their scores. Newer benchmarks (like GPQA, SWE-bench, or LiveCodeBench) are more resistant to contamination, but no benchmark is immune once it becomes widely used.

Pricing changes frequently

API pricing is a competitive lever. Providers cut prices regularly. Any specific pricing comparison is outdated within months. The structural insight is more durable: smaller/faster models (Haiku, Flash, GPT-4o-mini) cost 10-30x less per token than flagship models (Opus, Ultra, GPT-4). For many tasks, the cheaper model is sufficient.

Axes of Comparison

Architecture: Dense vs. MoE

Dense models (Llama 3.1 405B, Claude) activate all parameters for every token. MoE models (DeepSeek-V3, Gemini 1.5, likely GPT-4) activate a subset. Both build on the core transformer architecture. The trade-offs:

  • MoE advantage: lower compute per token, so faster and cheaper inference for the same quality level.
  • MoE disadvantage: higher memory footprint (all experts must be loaded), more complex training (load balancing, expert collapse).
  • Dense advantage: simpler to train, deploy, and quantize. No routing overhead.
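
The compute/memory trade-off can be sketched with the common back-of-envelope estimates (~2 FLOPs per active parameter per token, 2 bytes per parameter in float16; these are rough rules of thumb, ignoring KV cache and activation memory):

```python
def inference_profile(total_b: float, active_b: float) -> dict:
    """Rough per-token compute and weights-only fp16 memory for a model,
    given total and active parameter counts in billions."""
    return {
        "tflops_per_token": 2 * active_b * 1e9 / 1e12,  # ~2 FLOPs/active param
        "fp16_weights_gb": 2 * total_b,                 # 2 bytes/param
    }

moe = inference_profile(total_b=671, active_b=37)     # DeepSeek-V3-like
dense = inference_profile(total_b=405, active_b=405)  # Llama 3.1 405B-like

# MoE: ~11x less compute per token, but ~1.7x more weight memory.
print(moe)    # {'tflops_per_token': 0.074, 'fp16_weights_gb': 1342}
print(dense)  # {'tflops_per_token': 0.81, 'fp16_weights_gb': 810}
```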

Context Length

| Model | Max Context |
|---|---|
| Gemini 2.0 Flash | 1M+ tokens |
| Claude 3.5 Sonnet / Claude 4 | 200K tokens |
| GPT-4o | 128K tokens |
| DeepSeek-V3 / R1 | 128K tokens |
| Llama 3.1 405B | 128K tokens |
| Qwen 3 235B | 128K tokens |

Context length matters for tasks that require processing large documents, codebases, or conversation histories. For shorter inputs, a larger context window provides no benefit.

Open Weights

Open weights enable: fine-tuning for specific domains, running inference on your own hardware (no API dependency), inspecting model internals for research, and quantization for deployment on smaller hardware.

| Open Weights | Models |
|---|---|
| Yes | DeepSeek-V3, DeepSeek-R1, Llama 3.1, Qwen 3, Gemma 3n |
| No | GPT-4o, Claude, Gemini (frontier models) |

For production applications with data privacy requirements or high-volume inference, open-weight models can be 10-100x cheaper than API access.
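
The break-even point between API access and self-hosting is simple arithmetic. All figures below are hypothetical placeholders, not real prices:

```python
def monthly_api_cost(requests: int, tokens_per_request: int,
                     usd_per_million_tokens: float) -> float:
    """API spend scales linearly with token volume."""
    return requests * tokens_per_request * usd_per_million_tokens / 1e6

def breakeven_requests(fixed_selfhost_usd: float, tokens_per_request: int,
                       usd_per_million_tokens: float) -> float:
    """Monthly request volume at which self-hosting's fixed cost
    equals the equivalent API spend."""
    return fixed_selfhost_usd * 1e6 / (tokens_per_request * usd_per_million_tokens)

# Hypothetical: $1.00 per 1M input tokens via API, $20,000/month of
# GPU infrastructure for self-hosting, 800 tokens per request.
print(breakeven_requests(20_000, 800, 1.00))  # 25,000,000 requests/month
```

Below the break-even volume the API is cheaper; above it, self-hosting wins, before accounting for operational overhead.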

Reasoning Models

A distinct category emerged in 2024-2025: models trained specifically for multi-step reasoning via reinforcement learning (RL).

| Model | Type | Approach |
|---|---|---|
| OpenAI o1 / o3 | Closed | RL on reasoning tasks, extended thinking |
| DeepSeek-R1 | Open weights | RL on math/code, open chain-of-thought |
| Claude with extended thinking | Closed | Extended thinking mode for complex tasks |
| Gemini 2.0 Flash Thinking | Closed | Reasoning mode with visible thinking |

These models trade latency for accuracy: they generate long internal reasoning chains before producing a final answer. On math and coding benchmarks, they substantially outperform standard models. On simple tasks, they are slower and more expensive with little benefit.

How to Choose a Model

There is no universal best model. The right choice depends on your constraints:

1. What is your task? Coding tasks favor Claude and DeepSeek. Multilingual tasks favor Qwen and Gemini. Multimodal tasks favor Gemini and GPT-4o. Reasoning-heavy tasks favor R1 and o1-class models.

2. What are your cost constraints? For high-volume applications, open-weight models with self-hosted inference are cheapest. For low-volume or prototyping, API access is simpler.

3. Do you need open weights? If you need to fine-tune, run on-premise, or inspect model internals, your options are Llama, Qwen, DeepSeek, Gemma, and Mistral.

4. What context length do you need? If you need to process documents longer than 128K tokens, Gemini is the only production option with 1M+ support.

5. What latency do you need? Smaller models (Haiku, Flash, GPT-4o-mini) are 3-10x faster than flagship models. For real-time applications, latency often matters more than marginal quality.

The most common mistake is choosing based on benchmark rankings. Evaluate on your task. A model that scores 2% lower on MMLU but costs 5x less and is 3x faster may be the correct choice for your application.

Common Confusions

Watch Out

Higher benchmark score does not mean better for your task

MMLU measures broad knowledge. HumanEval measures Python coding. MATH measures competition math. A model that tops one benchmark may underperform on your specific domain. Always evaluate on data representative of your actual use case. Benchmark rankings are starting points for a shortlist, not final decisions.

Watch Out

Parameter count is not a capability measure

DeepSeek-V3 (671B total, 37B active) outperforms Llama 3.1 405B (405B, all active) on many benchmarks despite activating fewer parameters per token. MoE architecture, training data quality, and training methodology all matter more than raw parameter count. Comparing models by parameter count alone is misleading.

Watch Out

Open weights does not mean free

Running a 405B parameter model requires multiple high-end GPUs. The hardware cost for self-hosted inference can exceed API costs at low volume. Open weights are cost-effective only at sufficient scale (typically thousands to millions of requests per day) or when fine-tuning and data privacy are requirements.

Watch Out

Frontier is a moving target

Any specific comparison in this table will be outdated within months. New model releases occur roughly quarterly from each major lab. The structural patterns (MoE vs. dense trade-offs, cost tiers, open vs. closed dynamics) are more durable than specific benchmark numbers.

Exercises

Exercise (Core)

Problem

A benchmark has 1000 questions. Model A scores 87.2% and Model B scores 85.8%. Compute the standard error for each score and determine whether the difference is statistically significant at the 95% confidence level.

Exercise (Core)

Problem

An MoE model has 600B total parameters and activates 40B per token. A dense model has 70B parameters (all active). Compare the per-token compute cost (in terms of FLOPs, proportional to active parameters) and the memory required to load each model in float16.

Exercise (Advanced)

Problem

You are choosing between three models for a customer support chatbot processing 10 million messages per month. Model A (API): 0.50 USD per 1M input tokens, 95% task accuracy. Model B (open weights, self-hosted): 15,000 USD/month fixed infrastructure cost, unlimited tokens, 93% task accuracy. Model C (API, cheaper): 0.05 USD per 1M input tokens, 88% task accuracy. Average message length: 500 tokens. Which model is cheapest? At what accuracy threshold would you switch from the cheapest to a more expensive option?

References

Canonical:

  • Liang et al., "Holistic Evaluation of Language Models" (HELM, Stanford, 2023)
  • Srivastava et al., "Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models" (BIG-bench, 2023)

Current:

  • Chatbot Arena (LMSYS), ongoing Elo-based model comparison from human preferences
  • Artificial Analysis, model speed and pricing benchmarks (updated continuously)

This is a living reference. Consult primary technical reports and evaluation platforms for the latest data.

Last reviewed: April 2026
