
Model Timeline

DeepSeek Models

DeepSeek's model family: MoE architectures with Multi-head Latent Attention, fine-grained expert routing, and RL-trained reasoning in DeepSeek-R1.


Why This Matters

DeepSeek demonstrated that architectural innovation and clever training can compensate for having less compute than the largest Western labs. DeepSeek-V3 achieves frontier performance with a mixture-of-experts architecture that activates only 37B of 671B total parameters per token. DeepSeek-R1 showed that chain-of-thought reasoning can emerge from reinforcement learning alone, without supervised demonstrations of reasoning traces. Both models are open-weight, making their internals available for study.

DeepSeek-V2 (May 2024)

Architecture. MoE transformer with 236B total parameters, 21B active per token. 160 routed experts with 6 active per token, plus 2 shared experts that process every token.

Key innovation: Multi-head Latent Attention (MLA). Standard multi-head attention stores separate key and value vectors for each head in the KV cache, which grows linearly with sequence length and number of heads. MLA compresses the KV cache by projecting keys and values into a shared low-dimensional latent space. This reduces KV cache memory by roughly 93% compared to standard multi-head attention.

Key innovation: DeepSeekMoE. Fine-grained expert segmentation: instead of a few large experts, use many small experts. This allows more precise routing and better expert specialization. Shared experts handle common patterns while routed experts specialize.
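The routing scheme above can be sketched in a few lines of NumPy. Everything here is a toy stand-in (the expert matrices, the router, and all dimensions are illustrative, not DeepSeek's actual implementation), but it shows the key structure: a top-k choice over many small routed experts, plus shared experts that see every token.

```python
import numpy as np

def moe_route(token, routed_experts, shared_experts, k=6):
    """Toy fine-grained MoE layer: top-k routed experts plus always-on shared experts.

    token: (d,) input vector; each expert is a (d, d) matrix standing in for a small FFN.
    """
    # Router: one affinity score per routed expert (here just a fixed projection).
    scores = np.array([e[:, 0] @ token for e in routed_experts])
    topk = np.argsort(scores)[-k:]                             # indices of the k best experts
    gates = np.exp(scores[topk]) / np.exp(scores[topk]).sum()  # normalized gate weights
    out = sum(g * (routed_experts[i] @ token) for g, i in zip(gates, topk))
    # Shared experts process every token; no routing decision involved.
    out += sum(e @ token for e in shared_experts)
    return out, topk

rng = np.random.default_rng(0)
d = 8
routed = [rng.standard_normal((d, d)) / d for _ in range(16)]  # many small routed experts
shared = [rng.standard_normal((d, d)) / d for _ in range(2)]   # 2 shared experts
y, chosen = moe_route(rng.standard_normal(d), routed, shared, k=6)
print(y.shape, len(chosen))  # (8,) 6
```

With 16 tiny experts instead of, say, 4 large ones, the router has many more distinct top-k combinations available, which is the intuition behind "more precise routing."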

Training. 8.1T tokens. Trained on a cluster of Nvidia H800 GPUs (the export-restricted variant of H100, available to Chinese labs).

Result. Competitive with Llama 3 70B and Mixtral 8x22B at significantly lower inference cost due to the low active parameter count.

DeepSeek-V3 (December 2024)

Architecture. MoE transformer with 671B total parameters, 37B active per token. 256 routed experts with 8 active per token, plus 1 shared expert.

Training. 14.8T tokens. Estimated training cost: approximately $5.6M in compute, dramatically lower than comparable frontier models. This cost estimate (from the technical report) generated significant attention because it suggested frontier-quality models could be trained for a fraction of what US labs spend.

Auxiliary-loss-free load balancing. Standard MoE training uses an auxiliary loss to prevent expert collapse (all tokens routing to the same few experts). DeepSeek-V3 introduced a load-balancing method that does not require an auxiliary loss term, reducing interference with the primary training objective.

Multi-token prediction. As an auxiliary training objective, the model predicts several future tokens at each position rather than only the next one. This densifies the training signal and improves downstream performance; the extra prediction modules can be dropped at inference or reused for speculative decoding.
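A minimal sketch of what such a loss looks like, assuming one auxiliary logit vector per future offset (the per-offset weights here are illustrative, not DeepSeek-V3's values):

```python
import numpy as np

def mtp_loss(logits_list, targets, weights=(1.0, 0.3)):
    """Toy multi-token-prediction loss: one logit vector per future offset
    (t+1, t+2, ...), combined as a weighted sum of cross-entropies."""
    total = 0.0
    for logits, tgt, w in zip(logits_list, targets, weights):
        p = np.exp(logits - logits.max())   # stable softmax
        p /= p.sum()
        total += w * -np.log(p[tgt])        # cross-entropy for this offset
    return total

rng = np.random.default_rng(0)
vocab = 10
logits_t1 = rng.standard_normal(vocab)      # head predicting token t+1
logits_t2 = rng.standard_normal(vocab)      # head predicting token t+2
loss = mtp_loss([logits_t1, logits_t2], targets=[3, 7])
print(loss > 0)  # True
```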

Result. Competitive with GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro on most benchmarks. Particularly strong on math and coding. The combination of frontier quality, open weights, and low training cost made V3 one of the most discussed model releases of 2024.

DeepSeek-R1 (January 2025)

Architecture. Same base architecture as DeepSeek-V3 (671B/37B MoE).

Key innovation: reasoning via RL. DeepSeek-R1 was trained to reason using reinforcement learning on tasks with verifiable answers (math problems, coding challenges). The model generates long chain-of-thought reasoning traces before producing a final answer. Critically, the reasoning behavior emerged from RL training alone (DeepSeek-R1-Zero), without first training on human-written reasoning traces.

Definition

DeepSeek-R1-Zero

A variant trained with pure RL (no supervised fine-tuning on reasoning traces). Given a math or coding problem, the model is rewarded for producing correct final answers. Through RL training, the model spontaneously learned to generate chain-of-thought reasoning, self-check its work, and explore multiple solution paths. This demonstrated that explicit reasoning can emerge from outcome-based RL without human demonstrations of the reasoning process.
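Outcome-based rewards of this kind are easy to state concretely. Below is a toy verifiable-reward function in the spirit of that setup; the "Answer:" tag format and the exact-match rule are illustrative assumptions, not R1's actual reward implementation:

```python
from fractions import Fraction

def outcome_reward(completion: str, reference_answer: str) -> float:
    """Toy verifiable reward: 1.0 for a correct final answer, 0.0 otherwise.
    Only the outcome is checked; the reasoning trace itself is never graded."""
    if "Answer:" not in completion:
        return 0.0
    final = completion.rsplit("Answer:", 1)[1].strip()
    try:
        # Compare as exact rationals so "0.5" and "1/2" match.
        return 1.0 if Fraction(final) == Fraction(reference_answer) else 0.0
    except ValueError:
        return 0.0

print(outcome_reward("x=6 so 6*7=42. Answer: 42", "42"))  # 1.0
print(outcome_reward("I think it's 41. Answer: 41", "42"))  # 0.0
```

Note that nothing in this reward mentions chain-of-thought: any intermediate text the model generates is scored only through its effect on the final answer, which is exactly why the emergence of long reasoning traces under such a reward was notable.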

Distilled models. DeepSeek released distilled versions (1.5B to 70B dense models) trained on reasoning traces from R1. These smaller models inherit some of R1's reasoning ability at much lower cost.

Result. Competitive with OpenAI's o1 on math and coding benchmarks (AIME 2024, Codeforces). Open-weight release enabled the research community to study reasoning model internals for the first time.

Multi-head Latent Attention (MLA)

Proposition

KV Cache Compression via Latent Projection

Statement

In standard multi-head attention, the KV cache stores $h \cdot d_h$ values per token per layer for each of keys and values, giving a per-token memory cost of $2 \cdot h \cdot d_h$ per layer. MLA projects all heads into a shared latent vector of dimension $d_c \ll h \cdot d_h$, then recovers per-head keys and values via learned up-projections at inference time. The per-token KV cache memory reduces from $2 \cdot h \cdot d_h$ to $d_c$ per layer. For DeepSeek-V2 with $h = 128$, $d_h = 128$, and $d_c = 512$, this is a compression ratio of approximately $\frac{2 \times 128 \times 128}{512} = 64\times$.

Intuition

Standard attention stores separate key and value vectors for each head. But these vectors are often correlated across heads because they derive from the same input representation. MLA exploits this by storing only a compressed representation and reconstructing per-head keys and values on the fly. You trade a small amount of compute (the up-projection) for a large memory saving.
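The down-projection/up-projection structure can be sketched directly, with random matrices standing in for the learned projections. The dimensions below are DeepSeek-V2-scale but the code is a simplification: the real MLA also caches a small decoupled RoPE key component per token, omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
h, d_h, d_c, d_model = 128, 128, 512, 5120   # DeepSeek-V2-scale dimensions

# Learned projections (random stand-ins): one shared down-projection, and
# up-projections that reconstruct all heads' keys and values from the latent.
W_down = rng.standard_normal((d_model, d_c)) / np.sqrt(d_model)
W_up_k = rng.standard_normal((d_c, h * d_h)) / np.sqrt(d_c)
W_up_v = rng.standard_normal((d_c, h * d_h)) / np.sqrt(d_c)

x = rng.standard_normal(d_model)        # one token's hidden state
c = x @ W_down                          # cached latent: only d_c values per token
k = (c @ W_up_k).reshape(h, d_h)        # per-head keys, recomputed at attention time
v = (c @ W_up_v).reshape(h, d_h)        # per-head values

cache_standard = 2 * h * d_h            # floats cached per token, standard MHA
cache_mla = d_c                         # floats cached per token, MLA
print(cache_standard // cache_mla)      # 64 (the 64x ratio from the proposition)
```

Only `c` is stored per token; `k` and `v` are recomputed on demand, which is the compute-for-memory trade described above.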

Proof Sketch

The compression works if the joint key-value representation across heads lies approximately in a low-dimensional subspace. MLA learns this subspace during training. The up-projection matrices are absorbed into the attention computation, so the only additional cost is matrix multiplications with the projection matrices, which is small relative to the memory savings for long sequences.

Why It Matters

KV cache is the primary memory bottleneck for long-context inference with large models. For a 671B parameter model serving sequences of 100K+ tokens, the KV cache can exceed the model weights in memory. MLA makes long-context inference feasible on hardware that would otherwise be insufficient. This is one of the reasons DeepSeek-V2 and V3 can offer competitive long-context performance at lower cost.

Failure Mode

If the key-value representations across heads are not well-approximated by a low-rank structure, compression introduces approximation error that degrades attention quality. In practice, the learned latent space captures most of the relevant information, but there may be tasks where fine-grained per-head information matters and MLA slightly underperforms standard attention. The compression ratio is also fixed at training time; it cannot adapt to the difficulty of individual sequences.

Why DeepSeek Matters for the Field

Cost efficiency. DeepSeek-V3's reported training cost of ~$5.6M challenges the assumption that frontier models require hundreds of millions in compute. Even if the true fully-loaded cost is higher (the estimate excludes researcher salaries, failed experiments, and infrastructure), it demonstrates that architectural efficiency (MoE, MLA) can substantially reduce the compute needed for frontier quality.

Open weights for reasoning models. Before DeepSeek-R1, reasoning models (OpenAI o1, o1-pro) were closed. R1's open release let researchers study how reasoning emerges from RL, what the chain-of-thought traces look like internally, and how to distill reasoning into smaller models.

Hardware constraints driving innovation. Chinese labs have restricted access to top-tier Nvidia GPUs (H100 vs. H800). This constraint may have incentivized architectural innovations (MoE, MLA) that reduce compute requirements. Constraints sometimes accelerate innovation.

Common Confusions

Watch Out

671B parameters does not mean 671B inference cost

DeepSeek-V3 has 671B total parameters but only activates 37B per token. The inference FLOPs per token are comparable to a ~37B dense model. However, all 671B parameters must be loaded into GPU memory, so the memory footprint is still large. The efficiency gain is in compute per token, not memory.
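The distinction is easy to make concrete with back-of-envelope arithmetic (bf16 weights assumed; the 2-FLOPs-per-active-parameter rule of thumb is an approximation that ignores attention):

```python
# Compute vs. memory for a DeepSeek-V3-style MoE (illustrative arithmetic).
total_params = 671e9       # all of these must sit in accelerator memory
active_params = 37e9       # only these are touched per token -> drives FLOPs

bytes_per_param = 2        # bf16/fp16 weights
weight_memory_gb = total_params * bytes_per_param / 1e9
flops_per_token = 2 * active_params    # ~2 FLOPs per active parameter per token

print(round(weight_memory_gb))         # 1342 GB of weights regardless of routing
print(flops_per_token / (2 * 37e9))    # 1.0 -> per-token compute of a 37B dense model
```

So serving V3 needs a multi-GPU node just to hold the weights, even though each token's forward pass costs roughly what a 37B dense model would.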

Watch Out

R1-Zero reasoning is not the same as prompting for chain-of-thought

When you prompt a standard model with "think step by step", you are using a pattern the model learned during pretraining. DeepSeek-R1-Zero's reasoning emerged from RL training on outcomes. It was never shown examples of reasoning traces. The model discovered that generating intermediate steps helps it get correct answers, purely from the reward signal.

Exercises

Exercise (Core)

Problem

DeepSeek-V3 has 256 routed experts and activates 8 per token, plus 1 shared expert. What fraction of the routed expert parameters are active for any given token? If total parameters are 671B and shared/non-expert parameters account for roughly 37B, estimate the total routed expert parameters.

Exercise (Advanced)

Problem

MLA compresses the KV cache from $2 \cdot h \cdot d_h$ dimensions per token per layer to $d_c$ dimensions. For a model with 60 layers, $h = 128$ heads, $d_h = 128$, sequence length 100K tokens, and $d_c = 512$, compute the KV cache size in GB for both standard attention and MLA (using float16, 2 bytes per value).

References

Canonical:

  • DeepSeek-AI, "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model" (2024)
  • DeepSeek-AI, "DeepSeek-V3 Technical Report" (2024)
  • DeepSeek-AI, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (2025)

Current:

  • Liu et al., "DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models" (2024)


Last reviewed: April 2026
