LLM Construction
Mixture of Experts
Sparse computation via learned routing: replace dense FFN layers with multiple expert networks, activate only a subset per token, and scale capacity without proportional compute.
Why This Matters
The biggest language models are MoE models. GPT-4 is widely rumored to be MoE based on third-party reports (SemiAnalysis 2023), but not confirmed by OpenAI. Mixtral 8x7B demonstrated that a 47B-parameter MoE model with only 13B active parameters per token can match or beat a dense 70B model. DeepSeek-V3 pushes this further with fine-grained experts and auxiliary-loss-free balancing.
MoE is the key architectural idea that decouples total model capacity (how much the model can store) from compute per token (how expensive each forward pass is). This decoupling is why we can build models with hundreds of billions of parameters that are still affordable to serve.
Mental Model
In a standard transformer, every token passes through the same feed-forward network (FFN). In an MoE transformer, the FFN is replaced by N "expert" FFNs plus a small router network. For each token, the router selects the top-k experts (typically k = 1 or k = 2), and only those experts process the token. The other experts are not computed.
The result: the model has N times as many FFN parameters, but each token only activates k of them. More knowledge stored, similar compute cost.
Formal Setup
Expert Layer
An MoE layer replaces a single FFN with N expert networks E_1, ..., E_N and a gating function G. For input token x:

y = \sum_{i=1}^{N} G(x)_i \, E_i(x)

where G(x)_i are the gating weights, with most set to zero.
In modern MoE transformers (Mixtral, DeepSeek-V3, Qwen-MoE), each expert E_i is typically a SwiGLU- or GeGLU-gated FFN with three projection matrices (gate, up, down), not a two-matrix vanilla FFN. This matters for parameter accounting.
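The expert-layer equation above can be sketched in a few lines of numpy. This is a minimal single-token forward pass, not any model's actual implementation; the function names and dimensions are illustrative, and the experts are SwiGLU FFNs as described above.

```python
import numpy as np

def swiglu_ffn(x, W_gate, W_up, W_down):
    """One expert E_i: a SwiGLU FFN with three projections (gate, up, down)."""
    silu = lambda z: z / (1.0 + np.exp(-z))       # SiLU activation
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

def moe_layer(x, experts, W_router, k=2):
    """Sparse MoE forward for a single token vector x.

    experts: list of (W_gate, W_up, W_down) tuples, one per expert.
    Only the top-k experts by router score are actually evaluated;
    the rest contribute nothing and cost nothing.
    """
    logits = x @ W_router                          # (N,) router scores
    topk = np.argsort(logits)[-k:]                 # indices of the top-k experts
    weights = np.exp(logits[topk] - logits[topk].max())
    weights /= weights.sum()                       # softmax over selected experts
    # y = sum_i G(x)_i * E_i(x), with G(x)_i zero outside the top-k
    return sum(w * swiglu_ffn(x, *experts[i]) for w, i in zip(weights, topk))

# Toy usage: d_model=4, d_ff=8, N=4 experts, top-2 routing
rng = np.random.default_rng(0)
d, h, N = 4, 8, 4
experts = [(rng.standard_normal((d, h)), rng.standard_normal((d, h)),
            rng.standard_normal((h, d))) for _ in range(N)]
W_router = rng.standard_normal((d, N))
y = moe_layer(rng.standard_normal(d), experts, W_router, k=2)
assert y.shape == (d,)
```

Note that only k of the N expert matmuls ever run, which is the entire point: the loop over non-selected experts simply does not exist.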
Top-k Routing
The router (or gating network) is typically a learned linear projection followed by top-k selection:

G(x) = \mathrm{softmax}\big(\mathrm{TopK}(W_r x, k)\big)

where TopK keeps the k largest logits and masks the rest to -\infty. Only the k experts with the highest router scores are computed. The gating weights are normalized over the selected experts via softmax.
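At batch level, the gate can be materialized as a sparse (T, N) matrix with exactly k nonzeros per row. A small sketch, assuming the "softmax over selected experts only" convention stated above:

```python
import numpy as np

def top_k_gates(logits, k):
    """Router logits (T, N) -> gating weights (T, N), zero outside the top-k.

    Softmax is taken only over each token's k selected experts, matching
    the normalize-over-selected-experts convention.
    """
    T, N = logits.shape
    gates = np.zeros((T, N))
    topk = np.argsort(logits, axis=1)[:, -k:]           # (T, k) expert indices
    sel = np.take_along_axis(logits, topk, axis=1)      # (T, k) selected logits
    sel = np.exp(sel - sel.max(axis=1, keepdims=True))  # stable softmax
    sel /= sel.sum(axis=1, keepdims=True)
    np.put_along_axis(gates, topk, sel, axis=1)
    return gates

gates = top_k_gates(np.random.default_rng(1).standard_normal((5, 8)), k=2)
assert np.allclose(gates.sum(axis=1), 1.0)              # weights sum to 1 per token
assert (np.count_nonzero(gates, axis=1) == 2).all()     # exactly k=2 active experts
```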
Main Theorems
MoE Routing as Token-Expert Assignment
Statement
The routing problem can be viewed as a linear assignment: given T tokens and N experts each with capacity C = kT/N (the expected number of tokens per expert under uniform routing), the router must assign each token to k experts while approximately satisfying the capacity constraint. The combined objective is:

\max_{\{S_t\}} \sum_{t=1}^{T} \sum_{i \in S_t} G(x_t)_i \quad \text{s.t.} \quad |\{t : i \in S_t\}| \le C \ \ \forall i

where S_t is the set of k experts assigned to token t.
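A greedy sketch of this capacity-constrained assignment makes the tension concrete: each token walks down its preference list, and a full expert bounces the token to its next-best choice. This is an illustration of the constraint, not any production router.

```python
import numpy as np

def capacity_constrained_topk(scores, k, capacity):
    """Greedy token -> expert assignment under a per-expert capacity.

    scores: (T, N) router scores. Each token wants its k best-scoring
    experts, but an expert that has reached `capacity` stops accepting
    tokens, which fall through to their next-best choice.
    """
    T, N = scores.shape
    load = np.zeros(N, dtype=int)
    assign = [[] for _ in range(T)]
    for t in range(T):
        for i in np.argsort(scores[t])[::-1]:     # experts by descending score
            if load[i] < capacity:
                assign[t].append(int(i))
                load[i] += 1
            if len(assign[t]) == k:
                break
    return assign, load

# 8 tokens, 4 experts, top-1 routing, capacity C = kT/N = 2
rng = np.random.default_rng(3)
assign, load = capacity_constrained_topk(rng.standard_normal((8, 4)), k=1, capacity=2)
assert (load <= 2).all() and load.sum() == 8      # no expert exceeds its capacity
```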
Intuition
Without capacity constraints, the router would send all tokens to the single best expert (expert collapse). The capacity constraint forces distribution across experts, ensuring all experts get trained and contribute at inference time. The routing problem is a constrained optimization trading off expert quality per token against load balance across experts.
Why It Matters
This formulation makes explicit the core tension in MoE: letting the router freely choose the best expert per token would waste most parameters, while forcing uniform assignment would ignore token-expert affinity. Every MoE design navigates this tradeoff.
Auxiliary Load Balancing Loss
Statement
The load balancing loss penalizes routing imbalance by encouraging the fraction of tokens sent to each expert to be close to 1/N. Define:

f_i = \frac{1}{kT} \sum_{t=1}^{T} \mathbb{1}[i \in S_t]

as the fraction of tokens routed to expert i, and:

P_i = \frac{1}{T} \sum_{t=1}^{T} p_i(x_t)

as the average router probability for expert i. The auxiliary loss is:

\mathcal{L}_{\text{aux}} = \alpha \, N \sum_{i=1}^{N} f_i P_i

where α is a small coefficient, commonly set on the order of 10^{-2} (Switch Transformer, GShard); different papers use different values. Under perfect load balance with top-k routing, f_i = 1/N and P_i = 1/N, with \sum_i f_i = 1 and \sum_i P_i = 1, so the loss equals α. The top-1 case recovers the Switch Transformer formulation.
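A minimal numeric sketch of this loss, assuming the normalizations above (f_i over dispatch events, P_i over the full softmax); variable names are illustrative:

```python
import numpy as np

def load_balancing_loss(gates_dense, assignments, alpha=0.01):
    """Switch/GShard-style auxiliary loss: alpha * N * sum_i f_i * P_i.

    gates_dense: (T, N) full softmax over all experts (differentiable P_i)
    assignments: (T, N) 0/1 dispatch indicators (non-differentiable f_i)
    """
    T, N = gates_dense.shape
    k = assignments.sum() / T                 # dispatches per token
    f = assignments.sum(axis=0) / (k * T)     # fraction of dispatch events per expert
    P = gates_dense.mean(axis=0)              # average router probability per expert
    return alpha * N * float(f @ P)

# Perfectly balanced top-1 routing over N=4 experts: loss reduces to alpha
T, N = 8, 4
probs = np.full((T, N), 1.0 / N)
assign = np.zeros((T, N))
assign[np.arange(T), np.arange(T) % N] = 1.0  # round-robin: 2 tokens per expert
assert np.isclose(load_balancing_loss(probs, assign, alpha=0.01), 0.01)
```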
Intuition
The product f_i P_i is high when expert i both receives many tokens and has high average probability. Summing over experts and comparing to the uniform case creates a differentiable penalty for imbalance. Crucially, f_i involves a non-differentiable indicator, but P_i is differentiable through the softmax, so gradients flow through P_i to update the router.
Why It Matters
Without some balancing mechanism, MoE training reliably collapses: one or two experts capture all tokens, and the remaining experts receive no gradients and become useless dead parameters. This auxiliary loss is the standard mechanism in Switch Transformer and GShard. Modern alternatives include expert choice routing (Zhou et al. 2022) and auxiliary-loss-free bias-term balancing (DeepSeek-V3). Some balancing mechanism is essential; this specific loss is not the only option.
Failure Mode
If α is too large, the balancing loss dominates the language modeling loss and the router learns to distribute tokens uniformly regardless of content, defeating the purpose of conditional computation. If α is too small, expert collapse occurs. Tuning α is one of the main practical challenges of MoE training. Wang et al. 2024 ("Auxiliary-Loss-Free Load Balancing", arXiv 2408.15664) introduced a bias-term approach that avoids this tuning entirely, with related ideas in DeepSeek-V2 (Liu et al. 2024, arXiv 2405.04434); DeepSeek-V3 used it at scale.
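The bias-term idea can be sketched as follows. This is a simplified illustration of the mechanism (the function name, learning rate, and sign-based update are assumptions for the sketch, not the paper's exact rule): a per-expert bias is added to the router logits for top-k selection only, and is nudged against each expert's recent load.

```python
import numpy as np

def update_router_bias(bias, assignments, lr=0.001):
    """One balancing step in the style of auxiliary-loss-free load balancing.

    bias: (N,) per-expert bias added to router logits for SELECTION only
    (the gating weights still come from the unbiased softmax).
    assignments: (T, N) 0/1 dispatch indicators from the last batch.
    """
    N = bias.shape[0]
    load = assignments.sum(axis=0)        # tokens dispatched to each expert
    target = assignments.sum() / N        # balanced load: k*T/N
    # overloaded experts get their selection bias decreased, underloaded increased
    return bias - lr * np.sign(load - target)

bias = np.zeros(4)
assign = np.zeros((8, 4))
assign[:, 0] = 1.0                        # expert 0 hogging the whole batch
bias = update_router_bias(bias, assign)
assert bias[0] < 0 and (bias[1:] > 0).all()   # hot expert is pushed down
```

Because the bias never enters the gating weights or the loss, balancing pressure does not distort the language modeling gradient, which is the claimed advantage over the auxiliary loss.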
Routing Strategies
Top-1 routing: Each token goes to exactly one expert. Cheapest computation but loses the ability to combine expert outputs. Used in Switch Transformer.
Top-2 routing: Each token goes to two experts, outputs are combined via gating weights. Standard in Mixtral and many production models. Provides redundancy; Shazeer et al. 2017 argued that k > 1 is needed for the router to receive useful gradient signal across experts.
Soft routing (expert choice): Instead of tokens choosing experts, experts choose their top-C tokens from the batch (Zhou et al. 2022, "Mixture-of-Experts with Expert Choice Routing", arXiv 2202.09368). Guarantees perfect load balance by construction but requires batch-level coordination.
Shared + routed experts: Some experts process every token (shared) while others are conditionally routed (specialized). DeepSeek-MoE uses this pattern to maintain a baseline capacity across all tokens.
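Expert-choice routing from the list above is easy to sketch, and the sketch makes its guarantee and its cost visible at once: perfect balance per expert column, but variable expert count per token row.

```python
import numpy as np

def expert_choice_route(logits, capacity):
    """Expert-choice routing sketch: each expert picks its top-`capacity`
    tokens from the batch, guaranteeing perfect load balance.

    logits: (T, N) token-expert affinities. Returns a (T, N) 0/1 dispatch
    matrix in which every expert column has exactly `capacity` ones.
    """
    T, N = logits.shape
    dispatch = np.zeros((T, N))
    for i in range(N):
        chosen = np.argsort(logits[:, i])[-capacity:]   # expert i's favorite tokens
        dispatch[chosen, i] = 1.0
    return dispatch

d = expert_choice_route(np.random.default_rng(2).standard_normal((16, 4)), capacity=8)
assert (d.sum(axis=0) == 8).all()   # perfect load balance by construction
# caveat: per-token expert counts now vary; a token may get 0 or many experts
```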
Expert Collapse and Failure Modes
Expert collapse is the default
Without explicit balancing mechanisms, MoE training will collapse. The rich-get-richer dynamic is strong: an expert that performs slightly better on early batches gets more tokens, gets more gradient updates, improves further, and starves other experts of training signal. This is not a rare failure mode; it happens deterministically without the load balancing loss or equivalent mechanisms.
MoE is not "just ensembling." In an ensemble, all models see all inputs and their outputs are combined. In MoE, different experts see different subsets of tokens, and the routing is learned and dynamic: it changes based on the input. The router is a core part of the model, not a post-hoc aggregation mechanism. MoE is conditional computation, not model averaging.
Scaling Properties
The key scaling relationship: for an MoE model with N experts, top-k routing, and expert size equal to the dense baseline FFN:
- Total parameters: roughly N× the dense baseline FFN parameters
- Active parameters per token: roughly k× the dense baseline
- FLOPs per token: roughly k/N of the total parameters' worth of FLOPs
This means an 8-expert, top-2 MoE model has 8x the FFN parameters but only 2x the compute. In practice, the quality improvement scales with total parameters (knowledge capacity) while cost scales with active parameters.
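The 8-expert, top-2 arithmetic above is a one-liner; a tiny sketch with illustrative numbers (not tied to any specific model):

```python
# Back-of-envelope FFN scaling for an N-expert, top-k MoE versus its dense
# baseline, in units of "one dense FFN's parameters".
def moe_ffn_scaling(ffn_params_dense, N, k):
    total = N * ffn_params_dense     # all expert copies must be stored
    active = k * ffn_params_dense    # only the top-k experts run per token
    return total, active

total, active = moe_ffn_scaling(ffn_params_dense=1.0, N=8, k=2)
assert total == 8.0 and active == 2.0   # 8x FFN parameters, 2x FFN compute
```

Attention and embedding parameters are shared across experts, which is why real models (see the Mixtral exercise below) come in under the naive N× multiple.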
Serving MoE Models
Serving MoE is harder than serving dense models of equivalent active size. Expert parallelism shards experts across devices, so each token's top-k experts may live on different GPUs, forcing all-to-all communication per MoE layer. Load imbalance at inference time causes straggler experts that stall the batch; expert-parallel load balancers (EPLB in DeepSeek-V3) migrate or replicate hot experts. Memory is dominated by the full expert set, not the active subset, so KV cache plus all expert weights must fit. Batching is complicated because different tokens in a batch route to different experts, breaking uniform per-layer compute. Frameworks: vLLM, SGLang, and DeepSpeed-MoE provide expert-parallel kernels and routing-aware schedulers.
Common Confusions
MoE is not free capacity
MoE reduces compute per token but not memory. All expert weights must be loaded into memory (or efficiently swapped). For serving, this means MoE models need more GPU memory than dense models with the same active parameter count. The compute savings are real; the memory savings are not.
Experts do not learn interpretable specializations
Despite the name, experts rarely specialize in clean semantic categories ("one expert for math, one for language") under vanilla top-k routing. Empirically, routing patterns are complex and overlapping. Some tokens route to the same expert regardless of domain. Do not assume you can interpret what each expert "does." Fine-grained designs change this somewhat: DeepSeek-V3 (Liu et al. 2024, arXiv 2412.19437) reports partial specialization when combining shared experts with a large number of fine-grained routed experts, though the interpretations remain noisy.
Summary
- MoE replaces the dense FFN with N expert FFNs + a learned router
- Only k of N experts are active per token: sparse computation
- Total parameters ≫ active parameters, decoupling capacity from compute
- Some balancing mechanism (auxiliary loss, expert choice, or bias terms) is essential to prevent expert collapse
- MoE saves compute per token but not memory: all experts must be loaded
- Routing is learned and dynamic, not ensembling
Exercises
Problem
Mixtral 8x7B has 8 experts with top-2 routing. The "7B" refers to the base single-expert model size, not the per-expert FFN in isolation. Only the FFN sublayer is replicated across experts; attention and embeddings are shared. Given this, explain why the total parameter count is approximately 46.7B (not 56B) and why active parameters per token are approximately 12.9B (not 14B).
Problem
Adopt the convention used above, where f_i = \frac{1}{kT} \sum_t \mathbb{1}[i \in S_t] is the fraction of dispatch events routed to expert i, so \sum_i f_i = 1 (each of the T tokens contributes k dispatches), and P_i = \frac{1}{T} \sum_t p_i(x_t) is the average softmax probability, so \sum_i P_i = 1. Because both f_i and P_i are determined by the same router logits W_r x_t, they are positively associated: experts with higher average logits receive both more tokens and more probability mass. Under this co-sorting constraint, show that \mathcal{L}_{\text{aux}} = \alpha N \sum_i f_i P_i attains its minimum at the balanced point f_i = 1/N, P_i = 1/N, where the loss equals α. Verify that in the top-1 case (k = 1) this reduces to f_i = \frac{1}{T} \sum_t \mathbb{1}[\arg\max_j p_j(x_t) = i], matching the Switch Transformer formulation.
Problem
Wang et al. 2024 ("Auxiliary-Loss-Free Load Balancing", arXiv 2408.15664), building on ideas from DeepSeek-V2 (Liu et al. 2024, arXiv 2405.04434) and deployed at scale in DeepSeek-V3, proposes auxiliary-loss-free load balancing using per-expert bias terms added to router logits, updated via exponential moving average of expert utilization. Why might this work better than the standard auxiliary loss? What assumption about the routing landscape does it make?
Related Comparisons
References
Canonical:
- Shazeer et al., "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" (ICLR 2017)
- Fedus et al., "Switch Transformers: Scaling to Trillion Parameter Models" (JMLR 2022)
Current:
- Jiang et al., "Mixtral of Experts" (2024, arXiv 2401.04088)
- Zhou et al., "Mixture-of-Experts with Expert Choice Routing" (NeurIPS 2022, arXiv 2202.09368)
- Liu et al., "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model" (2024, arXiv 2405.04434)
- Wang et al., "Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts" (2024, arXiv 2408.15664)
- Liu et al., "DeepSeek-V3 Technical Report" (2024, arXiv 2412.19437)
Next Topics
The natural next steps from mixture of experts:
- Speculative decoding and quantization: efficient inference for large (often MoE) models
- Mamba and state-space models: alternative architectures that challenge the transformer-MoE paradigm
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Transformer Architecture (Layer 4)
- Attention Mechanism Theory (Layer 4)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Softmax and Numerical Stability (Layer 1)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in R^n (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
Builds on This
- DeepSeek Models (Layer 5)