
Model Timeline

Mistral Models

The Mistral AI model family: Mistral 7B with sliding-window attention, the Mixtral 8x7B and 8x22B sparse mixture-of-experts releases, the dense Mistral Large/Nemo line, and the specialist Codestral, Pixtral, and Ministral variants.

Core · Tier 2 · Frontier · Frontier watch · ~30 min

Why This Matters

Mistral AI is the European lab that made sparse mixture-of-experts feel routine in the open-weight ecosystem. Mistral 7B (September 2023) was the first widely used open-weight model to ship with sliding-window attention and grouped-query attention together; Mixtral 8x7B (December 2023) was the first open-weight MoE to clearly outperform a leading dense competitor at a lower active parameter count. Since 2024 the lab has split its lineup into two tiers: a permissively licensed open-weight line (Mistral 7B, Mixtral 8x7B, Mixtral 8x22B, Mistral Nemo, Codestral Mamba, Pixtral 12B, Mistral Small 3) and a research-licensed or proprietary line (Mistral Large, Mistral Large 2, Pixtral Large, Codestral 22B, Ministral 3B/8B, Mistral Medium, and the hosted "le Chat" assistant).

The pages on LLaMA, DeepSeek, and Qwen cover their respective families. This page is the standalone Mistral entry.


Mistral model lineage

Open-weight MoE, dense flagships, and specialist variants

Snapshot current to April 25, 2026. License tiers and hosted-model identities change faster than the historical architecture story.

Open-weight foundation line

Mistral 7B

Sep 2023

First Mistral release. Sliding-window attention with W = 4096 plus grouped-query attention. Apache 2.0.

Mixtral 8x7B

Dec 2023

Sparse MoE, 8 experts with top-2 routing. Roughly 47B total and 13B active per token. Apache 2.0.

Mixtral 8x22B

Apr 2024

Sparse MoE scaled up. Roughly 141B total and 39B active per token, 64K context. Apache 2.0.

Closed flagship line

Mistral Large

Feb 2024

Proprietary frontier-tier model on the Mistral API and partner platforms. Parameter count not disclosed.

Mistral Large 2

Jul 2024

Dense 123B parameters, 128K context. Mistral Research License; commercial use requires a paid license.

Specialist variants

Codestral 22B

May 2024

Dense 22B code model, 32K context, more than 80 programming languages. Mistral Non-Production License.

Codestral Mamba 7B

Jul 2024

Mamba state-space architecture for code. Linear-time inference on long contexts. Apache 2.0.

Pixtral 12B

Sep 2024

Open-weight vision-language model on a Nemo backbone with a 400M vision encoder. Apache 2.0.

Pixtral Large

Nov 2024

124B multimodal model on Mistral Large 2 with a 1B vision encoder. Mistral Research License.

Ministral 3B / 8B

Oct 2024

Edge-class models with 128K context. Ministral 8B interleaves full and sliding-attention layers.

Dense small and medium line

Mistral Nemo 12B

Jul 2024

Dense 12B, 128K context, released with NVIDIA. Tekken tokenizer trained on more than 100 languages. Apache 2.0.

Mistral Small 3

Jan 2025

Dense 24B, 32K context. Available on the API and as an Apache 2.0 weight release.

Mistral Medium

Hosted API

Hosted-only API model. Parameter count and architecture not disclosed.

Open-weight default
Mixtral 8x7B / 8x22B

Apache 2.0 sparse MoE releases that the developer community treats as drop-in replacements for larger dense models.

Closed flagship
Mistral Large 2

Dense 123B, 128K context, designed for single-node inference. Research license; paid license for commercial use.

Vision tier
Pixtral 12B / Pixtral Large

Pixtral 12B is Apache 2.0 on a Nemo backbone. Pixtral Large is 124B on Mistral Large 2 under the Research License.

Code tier
Codestral 22B / Codestral Mamba 7B

Codestral 22B is a transformer under the Non-Production License. Codestral Mamba 7B is a state-space model under Apache 2.0.

Current deployment ledger

Open-weight starting point

Mixtral 8x7B

Apache 2.0 sparse MoE. Roughly 47B total parameters and 13B active per token at top-2 routing, 32K context. Treated as a drop-in for a 13B dense model in compute terms.

Memory cost still scales with all 8 experts loaded simultaneously, even though only 2 fire per token.

Larger open-weight option

Mixtral 8x22B

Apache 2.0 sparse MoE with roughly 141B total parameters, 39B active per token, and 64K context.

Larger total parameter count means a larger memory footprint at inference, even at the same active count per token.

Closed frontier API

Mistral Large 2

Dense 123B with 128K context, function calling, and broad multilingual coverage. Available on the Mistral API and partner platforms.

Mistral Research License: research and non-commercial use of weights only. Commercial deployment requires a paid license.

Edge-class deployment

Ministral 3B / 8B

Edge-class models with 128K context. Ministral 8B interleaves full-attention and sliding-attention layers to maintain long-range coherence at low memory budgets.

Ministral 3B is API-only at release. Ministral 8B is research-licensed with a separate commercial license.

Code workflows

Codestral 22B / Codestral Mamba 7B

Codestral 22B is a dense 22B transformer with 32K context. Codestral Mamba 7B is a state-space model with in-principle linear-time inference on long contexts.

Codestral 22B uses the Mistral Non-Production License. Codestral Mamba 7B is Apache 2.0. Do not conflate their licensing or inference characteristics.

Vision-language input

Pixtral 12B

Apache 2.0 vision-language model on a Mistral Nemo backbone with a 400M-parameter vision encoder. Accepts arbitrary numbers and resolutions of images interleaved with text.

Pixtral Large is the larger 124B variant but is released under the Mistral Research License, not Apache 2.0.

Hosted assistant backing

le Chat (rotating model)

The consumer assistant is served by a Mistral model that has rotated over time.

The backing model can change without an API version bump. Do not assume a fixed identity for le Chat.

Mistral 7B (September 2023)

Release. Announced September 27, 2023, with weights on Hugging Face and a technical report on arXiv (2310.06825). Apache 2.0 license. The first model from the lab.

Architecture. Decoder-only transformer with 7.3B parameters. 32 layers, hidden dimension 4096, 32 attention heads, 8 key/value heads (grouped-query attention), 32K vocabulary. SwiGLU activation, RMSNorm, rotary positional embeddings.
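A quick sanity check on the 7.3B figure is to rebuild it from these hyperparameters. The sketch below assumes a standard Llama-style layout with untied input and output embeddings and a SwiGLU intermediate dimension of 14336 (the value in the public Hugging Face config, not stated above):

```python
# Rough parameter count for Mistral 7B from its architecture hyperparameters.
# Assumes intermediate_size = 14336 and untied embeddings (Hugging Face config
# values, not quoted in the paragraph above).
n_layers, d_model, n_heads, n_kv_heads, d_head = 32, 4096, 32, 8, 128
vocab, d_ffn = 32_000, 14_336

attn = d_model * (n_heads * d_head)            # W_q
attn += 2 * d_model * (n_kv_heads * d_head)    # W_k, W_v (GQA: only 8 KV heads)
attn += (n_heads * d_head) * d_model           # W_o

ffn = 3 * d_model * d_ffn                      # SwiGLU: gate, up, and down projections
embeddings = 2 * vocab * d_model               # input embedding + output head (untied)

total = n_layers * (attn + ffn) + embeddings
print(f"{total / 1e9:.2f}B parameters")        # ~7.24B, close to the reported 7.3B
```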

Sliding-window attention. Each token attends to the previous $W = 4096$ tokens rather than the full prefix. Stacking $L$ layers gives an effective receptive field of roughly $L \cdot W$ tokens because information from earlier tokens propagates through successive layers. The Mistral 7B paper reports a 4096-token sliding window with a nominal 8192-token context, achieving long-range information flow without a quadratic attention bill at every layer.

Grouped-query attention (GQA). 32 query heads share 8 key/value heads. This shrinks the KV cache by a factor of 4 relative to standard multi-head attention while empirically losing little quality.

Result. The release notes report that Mistral 7B outperforms Llama 2 13B on the benchmarks the report covers. Treat that as Mistral's evaluation rather than an independent one; it is also the claim that drove the model's adoption.

Mixtral 8x7B (December 2023)

Release. First posted as a magnet link on December 8, 2023, then accompanied by the "Mixtral of Experts" paper (arXiv 2401.04088, January 2024). Apache 2.0 license.

Architecture. Sparse mixture-of-experts. The model has 46.7B total parameters; for each token, a router selects 2 of 8 feed-forward experts, so roughly 12.9B parameters are active per token. The non-expert parameters (attention, embeddings, router) are shared across tokens. Context length: 32K tokens. Same SwiGLU/RMSNorm/RoPE backbone as Mistral 7B, with the dense feed-forward block replaced by 8 experts plus a top-2 router.

Definition

Top-k sparse MoE routing

A router maps each token's hidden state $x$ to a score vector $g(x) \in \mathbb{R}^E$ over $E$ experts. The top $k$ scoring experts are selected; the others contribute zero. The output is $\sum_{i \in \text{top-}k} \mathrm{softmax}(g(x))_i \, \mathrm{Expert}_i(x)$. Mixtral uses $E = 8$, $k = 2$. The compute cost per token is roughly $k/E$ of a dense feed-forward of the same expert size, but the memory cost is $E$ experts loaded simultaneously.
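A minimal NumPy sketch of this routing rule for a single token, with toy dimensions and a single matrix standing in for each expert (names and shapes are illustrative, not Mixtral's implementation; like Mixtral, the softmax here is taken over the selected experts only):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, E, k = 16, 8, 2                       # toy hidden size, number of experts, top-k
W_router = rng.standard_normal((d, E))   # router: hidden state -> E scores
experts = [rng.standard_normal((d, d)) for _ in range(E)]  # toy single-matrix "experts"

def moe_layer(x):
    """Top-k sparse MoE feed-forward for one token vector x of shape (d,)."""
    scores = x @ W_router                # g(x) in R^E
    top = np.argsort(scores)[-k:]        # indices of the k highest-scoring experts
    weights = softmax(scores[top])       # normalize over the selected experts only
    # Only the k selected experts run; the other E - k contribute nothing.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

y = moe_layer(rng.standard_normal(d))
print(y.shape)   # (16,)
```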

Result. Mixtral 8x7B is the first widely deployed open-weight MoE that the developer community treated as a drop-in replacement for a 13B dense model. The paper reports competitive performance with Llama 2 70B and GPT-3.5 across language, code, and math benchmarks, at roughly the inference compute of a 13B dense model.

Mixtral 8x22B (April 2024)

Release. April 17, 2024, again starting with a magnet link. Apache 2.0 license.

Architecture. Sparse MoE with 141B total parameters and roughly 39B active per token (8 experts, top-2 routing). 64K context window per the model card. Same architectural family as Mixtral 8x7B, scaled up.

Why it matters. 8x22B was Mistral's largest open-weight release to date and one of the largest open MoE models at the time. It pushed the open-weight frontier of "low active count, high total count" higher, before DeepSeek-V3 (671B total / 37B active, December 2024) took the same idea further.

Mistral Large and Mistral Large 2

Mistral Large (February 2024). Proprietary frontier-tier model, available only through Mistral's API and partner platforms (Azure AI, Amazon Bedrock). Mistral has not published a parameter count for Mistral Large. The release blog positioned it as Mistral's competitor to GPT-4 and Claude 2 on reasoning and multilingual tasks.

Mistral Large 2 (July 24, 2024). Dense 123B-parameter model, 128K context, designed for single-node inference. Released under the Mistral Research License (research and non-commercial use of weights; commercial use requires a paid license). The release blog highlights code generation, math, and multilingual coverage (including French, German, Spanish, Italian, Portuguese, Arabic, Hindi, Russian, Chinese, Japanese, and Korean) plus function calling.

The Mistral Large line is not part of the open-weight tier. Treat it as the proprietary product end of the Mistral family.

Mistral Small, Medium, and Nemo

Mistral Nemo 12B (July 18, 2024). Released jointly with NVIDIA. Dense 12B model, 128K context, Apache 2.0. Uses a new tokenizer named Tekken (based on Tiktoken) trained on more than 100 languages; Mistral reports Tekken is roughly 30% more efficient than the Llama 3 tokenizer at compressing source code and most non-English languages, and roughly 2x more efficient on Korean and 3x on Arabic.

Mistral Small. A series of smaller hosted/proprietary models. As of April 2026, Mistral Small 3 (released January 30, 2025) is a 24B dense model with a 32K context, available on the API and as an Apache 2.0 weight release. The "Small" name has been reused across releases; check the version when citing.

Mistral Medium. Hosted-only API model. Mistral has not published parameter counts. Treat any specific size claim about Mistral Medium as unverified unless it appears in an official release note.

Specialist Variants: Codestral, Pixtral, Ministral

Codestral 22B (May 29, 2024). Dense 22B code model, 32K context, trained on more than 80 programming languages per the release blog. Released under the Mistral Non-Production License (free for research and personal use; paid license for commercial deployment). Not Apache 2.0.

Codestral Mamba 7B (July 16, 2024). Released alongside Mathstral 7B. A 7B-parameter code model based on the Mamba state-space architecture rather than a transformer, supporting in-principle linear-time inference on long contexts. Apache 2.0 license. This is one of the few Mamba-architecture models from a major lab released with open weights.

Pixtral 12B (September 11, 2024). Mistral's first open-weight vision-language model. 12B parameters built on a Mistral Nemo backbone with a 400M-parameter vision encoder. Apache 2.0. Accepts arbitrary numbers and resolutions of images interleaved with text.

Pixtral Large (November 2024). A 124B-parameter multimodal model built on Mistral Large 2 with a 1B-parameter vision encoder. Released under the Mistral Research License, not Apache 2.0.

Ministral 3B and Ministral 8B (October 16, 2024). Edge-class models with 128K context. Ministral 8B uses a sliding-window attention pattern that interleaves full-attention and sliding-attention layers; the release blog describes this as a way to maintain long-range coherence at low memory budgets. Ministral 3B is API-only at release; Ministral 8B is released for research use under the Mistral Research License, with a separate commercial license available.

Engineering Choices

Sliding-Window Attention

Mistral 7B uses local sliding-window self-attention with window size $W = 4096$. Within a single layer this caps attention cost at $O(n \cdot W)$ instead of $O(n^2)$. Across $L$ layers, information from a token $t$ positions earlier can still reach the current position if $t \le L \cdot W$, because each layer can move information by up to $W$ positions. For Mistral 7B, $L = 32$ layers and $W = 4096$ give an effective range well beyond the 8K nominal context.
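A small sketch of the attention mask this implies, assuming a causal window in which each position sees itself and the previous $W - 1$ positions (toy sizes; real kernels fuse this into the attention computation rather than materializing a mask):

```python
import numpy as np

def sliding_window_mask(n, w):
    """Boolean mask: position i may attend to position j iff i - w < j <= i."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - w)

print(sliding_window_mask(n=8, w=3).astype(int))
# Each row has at most w = 3 ones, so per-layer attention cost is O(n * w), not O(n^2).
# With L = 32 layers and W = 4096, information can in principle travel
# 32 * 4096 = 131,072 positions, far beyond the 8K nominal context.
```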

This is a deliberately simple long-context strategy compared to the sparse-attention schemes in DeepSeek's V3.2 / V4 line or longer-context attention variants. Mistral's later releases moved toward standard dense attention at long context (Mistral Large 2, Nemo) and reserved sliding windows for the edge models.

Grouped-Query Attention

Mistral 7B's 32-query-head, 8-KV-head split is an early adoption of GQA in the open-weight tier. The KV cache savings (4x relative to multi-head) are the same trick later used in Llama 2 70B, Llama 3, and most subsequent decoder-only models. See attention mechanism theory for the structural details.
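A sketch of the sharing pattern at attention time, with Mistral 7B's head counts: each of the 8 cached KV heads is reused by a group of 4 query heads. The explicit repeat below is for clarity only; real implementations avoid materializing the copies, and causal masking is omitted here:

```python
import numpy as np

n_q_heads, n_kv_heads, d_head, seq = 32, 8, 128, 16
group = n_q_heads // n_kv_heads        # 4 query heads share each KV head

rng = np.random.default_rng(0)
q = rng.standard_normal((n_q_heads, seq, d_head))
k = rng.standard_normal((n_kv_heads, seq, d_head))   # only 8 KV heads are cached
v = rng.standard_normal((n_kv_heads, seq, d_head))

# Expand K and V so every query head has a matching KV head (logical repeat only).
k_exp = np.repeat(k, group, axis=0)    # (32, seq, d_head)
v_exp = np.repeat(v, group, axis=0)

scores = q @ k_exp.transpose(0, 2, 1) / np.sqrt(d_head)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = weights @ v_exp                  # (32, seq, d_head), one output per query head

# The KV cache stores k and v (8 heads), not k_exp and v_exp (32): a 4x saving.
print(k.nbytes, k_exp.nbytes)
```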

Sparse MoE Routing

Mixtral 8x7B and 8x22B use the standard top-2 token-choice routing scheme: each token independently selects its top-2 experts. The Mixtral paper reports that experts do not specialize cleanly along human-interpretable axes (for example, different languages or domains do not map to dedicated experts); routing patterns instead correlate with positional and syntactic features.

Proposition

Active Parameters in Top-k Sparse MoE

Statement

Let $S$ be the shared (non-expert) parameter count, $E$ the number of experts, $P_e$ the parameter count of a single expert, and $k$ the number of experts selected per token. Total parameter count is $N_{\text{total}} = S + E \cdot P_e$. Per-token active parameter count is $N_{\text{active}} = S + k \cdot P_e$. The compute per token at the feed-forward layer scales with $k \cdot P_e$ rather than $E \cdot P_e$. For Mixtral 8x7B, public numbers give $N_{\text{total}} \approx 46.7$B and $N_{\text{active}} \approx 12.9$B with $E = 8$, $k = 2$. For Mixtral 8x22B, $N_{\text{total}} \approx 141$B and $N_{\text{active}} \approx 39$B with the same $E = 8$, $k = 2$.

Intuition

Sparse MoE decouples model capacity from per-token compute. You pay memory for all $E$ experts because they all need to be loaded and ready, but you only pay FLOPs for the $k$ that fire on any given token. The shared parameters $S$ run on every token regardless.

Proof Sketch

The non-expert parameters run for every token, contributing $S$ to both total and active counts. The expert parameters contribute $E \cdot P_e$ to the total but only $k \cdot P_e$ to the active count, because routing selects $k$ of $E$ experts per token. Plug in $E = 8$, $k = 2$, and Mixtral's published per-expert sizes to recover the headline numbers.
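Back-solving the two equations from Mixtral 8x7B's published totals illustrates the split (the per-expert and shared counts below are derived estimates, not official figures):

```python
# Back-solve the per-expert and shared counts from Mixtral 8x7B's published totals:
#   N_total  = S + E * P_e
#   N_active = S + k * P_e   =>   P_e = (N_total - N_active) / (E - k)
E, k = 8, 2
n_total, n_active = 46.7, 12.9           # billions, from the Mixtral paper
p_e = (n_total - n_active) / (E - k)     # ~5.6B per expert
s = n_total - E * p_e                    # ~1.6B shared (attention, embeddings, router)
print(f"per expert ~{p_e:.1f}B, shared ~{s:.1f}B, active check: {s + k * p_e:.1f}B")
```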

Why It Matters

This is why "Mixtral 8x7B" does not have 8×7=568 \times 7 = 56B parameters and is not 8x more expensive than a 7B model. It has roughly 47B total parameters (most of the parameter budget is in the 8 experts, but the shared backbone is not duplicated) and runs at roughly the cost of a 13B dense model. Misreading these numbers leads to incorrect cost and capacity estimates.

Failure Mode

The accounting assumes balanced routing: every expert is selected on roughly the same fraction of tokens. If the router collapses (most tokens routing to a few experts), the effective active capacity is smaller than $k \cdot P_e$, and the unused experts waste memory. Standard MoE training adds an auxiliary load-balancing loss to prevent this; Mixtral uses such an auxiliary loss, while later work like DeepSeek-V3 replaces it with auxiliary-loss-free schemes.
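For reference, here is a sketch of the Switch-Transformer-style load-balancing term that most open MoE implementations use; Mixtral's exact auxiliary loss and coefficient are not spelled out above, so treat this as the general form rather than Mixtral's recipe:

```python
import numpy as np

def load_balancing_loss(router_logits, k=2):
    """Switch-style auxiliary loss: E * sum_i f_i * p_i, minimized by uniform routing.

    router_logits: (n_tokens, E) raw router scores.
    f_i: fraction of tokens whose top-k set includes expert i.
    p_i: mean router probability assigned to expert i.
    """
    n_tokens, E = router_logits.shape
    probs = np.exp(router_logits - router_logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    topk = np.argsort(router_logits, axis=-1)[:, -k:]
    chosen = np.zeros_like(probs)
    np.put_along_axis(chosen, topk, 1.0, axis=-1)
    f = chosen.mean(axis=0)          # how often each expert is picked
    p = probs.mean(axis=0)           # average routing probability per expert
    return E * np.sum(f * p)

logits = np.random.default_rng(1).standard_normal((1024, 8))
print(load_balancing_loss(logits))   # ~= k (here ~2) when routing is roughly balanced
```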

Limitations and What Is Not Publicly Disclosed

Several facts are commonly asserted about Mistral models but are not in the official release notes or papers. Be careful with these:

  • Mistral Large parameter count. Not disclosed. Numbers in the range of "around 70B" or "around 120B" appear in third-party blog posts; Mistral has not confirmed a count.
  • Mistral Medium parameter count and architecture. Not disclosed.
  • Training data composition. Mistral has not released a detailed training data breakdown for any model. The Mistral 7B and Mixtral papers describe data quality and language coverage in general terms but do not list sources or proportions.
  • Training compute. Mistral has not published GPU-hour totals for any model.
  • Le Chat backing model identity at any given time. Mistral has rotated which model serves the consumer assistant; assume the backing model can change without an API version bump.

When citing Mistral facts, prefer the official release blog post (mistral.ai/news), the model card on Hugging Face, or one of the two Mistral papers. Be explicit when a claim is from a third-party benchmark rather than from Mistral.

Common Confusions

Watch Out

Mixtral 8x7B does not have 56B parameters

Top-2 sparse MoE shares the non-expert parameters across tokens. Mixtral 8x7B has roughly 47B total parameters (8 experts plus shared attention, embeddings, and router), not 56B. Active parameters per token are about 13B, not 14B, and definitely not 56B. The "8x7B" naming refers to the 8-expert structure, not a multiplication.

Watch Out

Mistral models are not all Apache 2.0

The open-weight tier (Mistral 7B, Mixtral 8x7B, Mixtral 8x22B, Mistral Nemo, Codestral Mamba, Pixtral 12B, Mistral Small 3) is Apache 2.0. The proprietary tier (Mistral Large, Mistral Large 2, Pixtral Large, Mistral Medium, the original Codestral 22B) is not. Codestral 22B uses the Mistral Non-Production License; Mistral Large 2 and Pixtral Large use the Mistral Research License. Check the license on the model card before assuming commercial use is allowed.

Watch Out

Sliding-window attention is not the same as linear attention

Mistral 7B's sliding window keeps standard softmax attention but restricts each query to the previous $W$ keys, giving $O(n \cdot W)$ cost. Linear-attention variants (Performers, Linear Transformers) replace softmax with a kernel-feature decomposition to get $O(n)$ cost with full receptive field. Sliding windows trade receptive-field-per-layer for compute and rely on layer stacking to recover long-range information; linear attention trades attention quality for asymptotic compute.

Watch Out

Codestral and Codestral Mamba are different models

Codestral 22B is a transformer-based code model under the Mistral Non-Production License. Codestral Mamba 7B is a Mamba-architecture (state-space) code model under Apache 2.0. The shared name reflects the product line, not the architecture; do not conflate their licensing or their inference characteristics.

Examples

Example

KV cache for Mistral 7B vs. a hypothetical multi-head variant

Mistral 7B has 32 layers, hidden dimension 4096, 32 query heads, and 8 KV heads. With head dimension $d_h = 128$, the per-token KV cache stores $2 \cdot 8 \cdot 128 = 2048$ values per layer; across 32 layers that is $65{,}536$ values per token. At float16 (2 bytes), one token costs 128 KB of KV cache. A 32K-token context costs $32{,}000 \cdot 128\ \text{KB} \approx 4$ GB. A multi-head variant with 32 KV heads instead of 8 would cost 4x as much: roughly 16 GB at the same context length, which is a large fraction of a single-GPU memory budget.
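The same arithmetic as a short script, so the constants are easy to swap (the 32K context length and 2-byte values are the example's assumptions):

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, context_len, bytes_per_value=2):
    """KV cache size: 2 (K and V) * kv_heads * head_dim values per layer, per token."""
    per_token = 2 * n_kv_heads * d_head * n_layers * bytes_per_value
    return per_token, per_token * context_len

per_token, total = kv_cache_bytes(n_layers=32, n_kv_heads=8, d_head=128, context_len=32_000)
print(per_token / 1024, "KB per token")          # 128.0 KB
print(round(total / 1024**3, 1), "GiB at 32K")   # ~3.9 GiB, i.e. the "~4 GB" above

# A hypothetical multi-head variant with 32 KV heads would cost 4x as much:
_, total_mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, d_head=128, context_len=32_000)
print(round(total_mha / 1024**3, 1), "GiB")      # ~15.6 GiB
```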

Exercises

ExerciseCore

Problem

Mixtral 8x22B has 141B total parameters and 39B active per token with 8 experts and top-2 routing. Estimate the per-expert parameter count and the size of the shared (non-expert) parameters. State any assumption you make.

ExerciseAdvanced

Problem

Sliding-window attention with window $W$ in a transformer of $L$ layers gives an effective receptive field of $L \cdot W$ tokens for the topmost layer's representation of the last position. Suppose you double $W$ but halve $L$, keeping $L \cdot W$ constant. State at least one quality loss and one efficiency change you would expect, and explain why the receptive-field equality alone does not capture the trade-off.

References

Canonical:

  • Jiang et al., "Mistral 7B" (2023), arXiv:2310.06825.
  • Jiang et al., "Mixtral of Experts" (2024), arXiv:2401.04088.

Release notes and model cards:

  • Mistral AI, "Announcing Mistral 7B" (September 27, 2023), mistral.ai/news/announcing-mistral-7b.
  • Mistral AI, "Mixtral of experts" (December 11, 2023), mistral.ai/news/mixtral-of-experts.
  • Mistral AI, "Cheaper, better, faster, stronger" (Mixtral 8x22B announcement, April 17, 2024), mistral.ai/news/mixtral-8x22b.
  • Mistral AI, "Au Large" (Mistral Large announcement, February 26, 2024), mistral.ai/news/mistral-large.
  • Mistral AI, "Mistral NeMo" (July 18, 2024), mistral.ai/news/mistral-nemo.
  • Mistral AI, "Large Enough" (Mistral Large 2 announcement, July 24, 2024), mistral.ai/news/mistral-large-2407.
  • Mistral AI, "Codestral: Hello, World!" (May 29, 2024), mistral.ai/news/codestral.
  • Mistral AI, "Codestral Mamba" (July 16, 2024), mistral.ai/news/codestral-mamba.
  • Mistral AI, "Pixtral 12B" (September 11, 2024), mistral.ai/news/pixtral-12b.
  • Mistral AI, "Pixtral Large" (November 18, 2024), mistral.ai/news/pixtral-large.
  • Mistral AI, "Un Ministral, des Ministraux" (Ministral 3B and 8B announcement, October 16, 2024), mistral.ai/news/ministraux.
  • Mistral AI, "Mistral Small 3" (January 30, 2025), mistral.ai/news/mistral-small-3.
  • Hugging Face model cards for mistralai/Mistral-7B-v0.1, mistralai/Mixtral-8x7B-v0.1, mistralai/Mixtral-8x22B-v0.1, mistralai/Mistral-Nemo-Base-2407, mistralai/Mistral-Large-Instruct-2407, mistralai/Codestral-22B-v0.1, mistralai/Pixtral-12B-2409, mistralai/Ministral-8B-Instruct-2410.

Related context:

  • Shazeer et al., "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" (2017), arXiv:1701.06538. Originating top-k MoE routing scheme.
  • Ainslie et al., "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints" (2023), arXiv:2305.13245. Grouped-query attention.


Last reviewed: April 25, 2026
