DeepSeek Models
DeepSeek's model family: MoE architectures with Multi-head Latent Attention, fine-grained expert routing, and RL-trained reasoning in DeepSeek-R1.
Why This Matters
DeepSeek demonstrated that architectural innovation and clever training can compensate for having less compute than the largest Western labs. DeepSeek-V3 achieves frontier performance with a mixture-of-experts architecture that activates only 37B of 671B total parameters per token. DeepSeek-R1 showed that chain-of-thought reasoning can emerge from reinforcement learning alone, without supervised demonstrations of reasoning traces. Both models are open-weight, making their internals available for study.
DeepSeek-V2 (May 2024)
Architecture. MoE transformer with 236B total parameters, 21B active per token. 160 experts with 6 active per token plus 2 shared experts that process every token.
Key innovation: Multi-head Latent Attention (MLA). Standard multi-head attention stores separate key and value vectors for each head in the KV cache, which grows linearly with sequence length and number of heads. MLA compresses the KV cache by projecting keys and values into a shared low-dimensional latent space. This reduces KV cache memory by roughly 93% compared to standard multi-head attention.
Key innovation: DeepSeekMoE. Fine-grained expert segmentation: instead of a few large experts, use many small experts. This allows more precise routing and better expert specialization. Shared experts handle common patterns while routed experts specialize.
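As a rough illustration of fine-grained routing with shared experts, here is a minimal NumPy sketch. The sizes, the softmax router, and the ReLU MLPs are toy assumptions for clarity, not DeepSeek's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_routed, n_shared, top_k = 16, 8, 2, 3  # toy sizes, not DeepSeek's

# Each expert is a tiny 2-layer MLP; shared experts process every token.
def make_expert():
    return (rng.normal(scale=0.1, size=(d_model, 4 * d_model)),
            rng.normal(scale=0.1, size=(4 * d_model, d_model)))

routed = [make_expert() for _ in range(n_routed)]
shared = [make_expert() for _ in range(n_shared)]
W_gate = rng.normal(scale=0.1, size=(d_model, n_routed))  # router weights

def expert_forward(x, w):
    w1, w2 = w
    return np.maximum(x @ w1, 0) @ w2  # ReLU MLP

def moe_forward(x):  # x: (d_model,) — one token
    scores = x @ W_gate
    probs = np.exp(scores - scores.max()); probs /= probs.sum()  # softmax
    top = np.argsort(probs)[-top_k:]                             # top-k routed experts
    gate = probs[top] / probs[top].sum()                         # renormalized gate weights
    out = sum(g * expert_forward(x, routed[i]) for g, i in zip(gate, top))
    out += sum(expert_forward(x, s) for s in shared)             # always-on shared experts
    return out

y = moe_forward(rng.normal(size=d_model))
print(y.shape)  # (16,)
```

Only `top_k` of the routed experts run per token, so compute scales with the active experts while capacity scales with all of them.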
Training. 8.1T tokens. Trained on a cluster of Nvidia H800 GPUs (the export-restricted variant of H100, available to Chinese labs).
Result. Competitive with Llama 3 70B and Mixtral 8x22B at significantly lower inference cost due to the low active parameter count.
DeepSeek-V3 (December 2024)
Architecture. MoE transformer with 671B total parameters, 37B active per token. 256 routed experts with 8 active per token, plus 1 shared expert.
Training. 14.8T tokens. Estimated training cost: approximately $5.6M in compute, dramatically lower than comparable frontier models. This cost estimate (from the technical report) generated significant attention because it suggested frontier-quality models could be trained for a fraction of what US labs spend.
Auxiliary-loss-free load balancing. Standard MoE training uses an auxiliary loss to prevent expert collapse (all tokens routing to the same few experts). DeepSeek-V3 introduced a load-balancing method that does not require an auxiliary loss term, reducing interference with the primary training objective.
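The idea can be simulated in a few lines: a per-expert bias is added to the routing scores for expert selection only (not for the gating weights) and nudged after each batch toward balanced load. The sizes, update rule details, and the artificially skewed router below are illustrative assumptions, not V3's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
n_experts, top_k, gamma = 8, 2, 0.01  # toy sizes; gamma is the bias update speed

bias = np.zeros(n_experts)  # per-expert bias used only for routing, not gating

def route(scores):
    """Select top-k by biased score; gate weights come from unbiased scores."""
    chosen = np.argsort(scores + bias)[-top_k:]
    g = np.exp(scores[chosen]); g /= g.sum()
    return chosen, g

def update_bias(counts, tokens):
    """After each batch, push bias down for overloaded experts, up for underloaded."""
    global bias
    target = tokens * top_k / n_experts
    bias -= gamma * np.sign(counts - target)

# Simulate a skewed router: expert 0 gets systematically higher scores.
for step in range(500):
    counts = np.zeros(n_experts)
    for _ in range(64):
        s = rng.normal(size=n_experts); s[0] += 2.0
        chosen, _ = route(s)
        counts[chosen] += 1
    update_bias(counts, 64)

# The bias has learned to counteract the skew: expert 0 is pushed down.
print(np.round(bias, 2))
```

Because the bias never enters the loss, it cannot pull gradients away from the language-modeling objective the way an auxiliary balancing loss can.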
Multi-token prediction. The model predicts multiple future tokens simultaneously during training, which improves training efficiency and downstream performance.
Result. Competitive with GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro on most benchmarks. Particularly strong on math and coding. The combination of frontier quality, open weights, and low training cost made V3 one of the most discussed model releases of 2024.
DeepSeek-R1 (January 2025)
Architecture. Same base architecture as DeepSeek-V3 (671B/37B MoE).
Key innovation: reasoning via RL. DeepSeek-R1 was trained to reason using reinforcement learning on tasks with verifiable answers (math problems, coding challenges). The model generates long chain-of-thought reasoning traces before producing a final answer. Critically, the reasoning behavior emerged from RL training alone (DeepSeek-R1-Zero), without first training on human-written reasoning traces.
DeepSeek-R1-Zero
A variant trained with pure RL (no supervised fine-tuning on reasoning traces). Given a math or coding problem, the model is rewarded for producing correct final answers. Through RL training, the model spontaneously learned to generate chain-of-thought reasoning, self-check its work, and explore multiple solution paths. This demonstrated that explicit reasoning can emerge from outcome-based RL without human demonstrations of the reasoning process.
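A toy sketch of what an outcome-based reward of this kind can look like: the model is scored only on its final answer and on following a think/answer format, never on the content of the reasoning. The tag names, weights, and format check are illustrative assumptions, not R1's actual reward implementation:

```python
import re

def reward(completion: str, gold_answer: str) -> float:
    """Toy outcome-based reward: format bonus + correctness bonus."""
    r = 0.0
    # Format reward: reasoning inside <think>...</think>, answer inside <answer>...</answer>.
    if re.fullmatch(r"(?s)\s*<think>.*</think>\s*<answer>.*</answer>\s*", completion):
        r += 0.1
    # Accuracy reward: compare only the extracted final answer to the gold answer.
    m = re.search(r"(?s)<answer>(.*?)</answer>", completion)
    if m and m.group(1).strip() == gold_answer.strip():
        r += 1.0
    return r

good = "<think>2+2: add the units digits.</think><answer>4</answer>"
bad  = "<think>hmm</think><answer>5</answer>"
print(reward(good, "4"), reward(bad, "4"))  # 1.1 0.1
```

Nothing in this signal tells the model how to reason; any chain-of-thought behavior that raises the accuracy term emerges on its own, which is the core of the R1-Zero result.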
Distilled models. DeepSeek released distilled versions (1.5B to 70B dense models) trained on reasoning traces from R1. These smaller models inherit some of R1's reasoning ability at much lower cost.
Result. Competitive with OpenAI's o1 on math and coding benchmarks (AIME 2024, Codeforces). Open-weight release enabled the research community to study reasoning model internals for the first time.
Multi-head Latent Attention (MLA)
KV Cache Compression via Latent Projection
Statement
In standard multi-head attention, the KV cache stores a key and a value vector for each of n_h heads per token per layer, a per-token memory cost of 2·n_h·d_h per layer, where d_h is the head dimension. MLA projects the keys and values of all heads into a shared latent vector of dimension d_c, then recovers per-head keys and values via learned up-projections at inference time. The per-token KV cache memory reduces from 2·n_h·d_h to d_c + d_h^R per layer, where d_h^R is the dimension of a small decoupled RoPE key cached alongside the latent. For DeepSeek-V2, with n_h = 128, d_h = 128, d_c = 512, and d_h^R = 64, this is a compression ratio of approximately 57×.
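The arithmetic can be checked directly; this sketch uses DeepSeek-V2-like dimensions and, for simplicity, ignores the small decoupled RoPE key that MLA also caches, so the ratio comes out slightly higher than with it included:

```python
# Per-token, per-layer KV cache: standard MHA vs MLA.
# Dimensions are DeepSeek-V2-like: n_h = 128 heads, d_h = 128, latent dim d_c = 512.
n_h, d_h, d_c = 128, 128, 512

mha = 2 * n_h * d_h          # keys + values, all heads
mla = d_c                    # one shared latent vector

print(mha, mla, mha / mla)   # 32768 512 64.0
```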
Intuition
Standard attention stores separate key and value vectors for each head. But these vectors are often correlated across heads because they derive from the same input representation. MLA exploits this by storing only a compressed representation and reconstructing per-head keys and values on the fly. You trade a small amount of compute (the up-projection) for a large memory saving.
Proof Sketch
The compression works if the joint key-value representation across heads lies approximately in a low-dimensional subspace; MLA learns this subspace during training rather than assuming it. The up-projection matrices can be absorbed into the query and output projections, so per-head keys and values never need to be materialized explicitly. The only additional cost is a small number of extra matrix multiplications per token, which is negligible next to the memory saved on long sequences.
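The absorption trick is a plain linear-algebra identity, q·(W_uk c) = (W_uk^T q)·c, which a few lines of NumPy confirm (dimensions here are toy values, not DeepSeek's):

```python
import numpy as np

rng = np.random.default_rng(2)
d_head, d_c = 8, 4  # toy dimensions

W_uk = rng.normal(size=(d_head, d_c))  # up-projection: latent -> per-head key
q = rng.normal(size=d_head)            # a query vector for one head
c = rng.normal(size=d_c)               # the cached latent for one token

# Naive: reconstruct the full key, then dot it with the query.
score_naive = q @ (W_uk @ c)

# Absorbed: fold W_uk into the query once, then score directly against the latent.
# This is why MLA never needs to materialize per-head keys at inference time.
score_absorbed = (W_uk.T @ q) @ c

print(np.allclose(score_naive, score_absorbed))  # True
```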
Why It Matters
KV cache is the primary memory bottleneck for long-context inference with large models. For a 671B parameter model serving sequences of 100K+ tokens, the KV cache can exceed the model weights in memory. MLA makes long-context inference feasible on hardware that would otherwise be insufficient. This is one of the reasons DeepSeek-V2 and V3 can offer competitive long-context performance at lower cost.
Failure Mode
If the key-value representations across heads are not well-approximated by a low-rank structure, compression introduces approximation error that degrades attention quality. In practice, the learned latent space captures most of the relevant information, but there may be tasks where fine-grained per-head information matters and MLA slightly underperforms standard attention. The compression ratio is also fixed at training time; it cannot adapt to the difficulty of individual sequences.
Why DeepSeek Matters for the Field
Cost efficiency. DeepSeek-V3's reported training cost of ~$5.6M challenges the assumption that frontier models require hundreds of millions in compute. Even if the true fully-loaded cost is higher (the estimate excludes researcher salaries, failed experiments, and infrastructure), it demonstrates that architectural efficiency (MoE, MLA) can substantially reduce the compute needed for frontier quality.
Open weights for reasoning models. Before DeepSeek-R1, reasoning models (OpenAI o1, o1-pro) were closed. R1's open release let researchers study how reasoning emerges from RL, what the chain-of-thought traces look like internally, and how to distill reasoning into smaller models.
Hardware constraints driving innovation. Chinese labs are restricted to export-compliant Nvidia GPUs (H800 rather than H100). This constraint may have incentivized architectural innovations (MoE, MLA) that reduce compute requirements: constraints sometimes accelerate innovation.
Common Confusions
671B parameters does not mean 671B inference cost
DeepSeek-V3 has 671B total parameters but only activates 37B per token. The inference FLOPs per token are comparable to a ~37B dense model. However, all 671B parameters must be loaded into GPU memory, so the memory footprint is still large. The efficiency gain is in compute per token, not memory.
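The distinction is easy to check with back-of-the-envelope arithmetic. In this sketch, the 2-FLOPs-per-active-parameter forward-pass rule of thumb and the 8-bit weight format are illustrative assumptions, not figures from the technical report:

```python
# Rough per-token cost comparison for a DeepSeek-V3-style MoE.
total_params  = 671e9   # all parameters must sit in GPU memory
active_params = 37e9    # parameters actually used per token

flops_per_token = 2 * active_params      # compute scales with ACTIVE params
weight_memory_fp8 = total_params * 1     # memory scales with TOTAL params (1 byte/param at 8-bit)

print(f"{flops_per_token:.2e} FLOPs/token")         # comparable to a 37B dense model
print(f"{weight_memory_fp8 / 1e9:.0f} GB weights")  # still ~671 GB at 8-bit
```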
R1-Zero reasoning is not the same as prompting for chain-of-thought
When you prompt a standard model with "think step by step", you are using a pattern the model learned during pretraining. DeepSeek-R1-Zero's reasoning emerged from RL training on outcomes. It was never shown examples of reasoning traces. The model discovered that generating intermediate steps helps it get correct answers, purely from the reward signal.
Exercises
Problem
DeepSeek-V3 has 256 routed experts and activates 8 per token, plus 1 shared expert. What fraction of the routed expert parameters are active for any given token? If total parameters are 671B and shared/non-expert parameters account for roughly 37B, estimate the total routed expert parameters.
Problem
MLA compresses the KV cache from 2·n_h·d_h dimensions per token per layer to d_c dimensions (ignore the decoupled RoPE key for this exercise). For a model with 60 layers, n_h = 128 heads, d_h = 128, d_c = 512, and sequence length 100K tokens, compute the KV cache size in GB for both standard attention and MLA (using float16, 2 bytes per value).
References
Canonical:
- DeepSeek-AI, "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model" (2024)
- DeepSeek-AI, "DeepSeek-V3 Technical Report" (2024)
- DeepSeek-AI, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (2025)
Current:
- Dai et al., "DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models" (2024)
Next Topics
- Model comparison table: structured comparison across frontier model families
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Transformer Architecture (Layer 4)
- Attention Mechanism Theory (Layer 4)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Softmax and Numerical Stability (Layer 1)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in R^n (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Mixture of Experts (Layer 4)