

Mixture of Experts

Sparse computation via learned routing: replace dense FFN layers with multiple expert networks, activate only a subset per token, and scale capacity without proportional compute.


Why This Matters

The biggest language models are MoE models. GPT-4 is widely rumored to be MoE based on third-party reports (SemiAnalysis 2023), but not confirmed by OpenAI. Mixtral 8x7B demonstrated that a 47B-parameter MoE model with only 13B active parameters per token can match or beat a dense 70B model. DeepSeek-V3 pushes this further with fine-grained experts and auxiliary-loss-free balancing.

MoE is the key architectural idea that decouples total model capacity (how much the model can store) from compute per token (how expensive each forward pass is). This decoupling is why we can build models with hundreds of billions of parameters that are still affordable to serve.

Mental Model

In a standard transformer, every token passes through the same feed-forward network (FFN). In an MoE transformer, the FFN is replaced by $N$ "expert" FFNs plus a small router network. For each token, the router selects the top-$k$ experts (typically $k = 1$ or $k = 2$), and only those experts process the token. The other $N - k$ experts are not computed.

The result: the model has $N$ times as many FFN parameters, but each token only activates $k/N$ of them. More knowledge stored, similar compute cost.
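This mental model is short enough to express directly in code. A minimal numpy sketch, with toy dimensions and single linear maps standing in for full expert FFNs (everything here is illustrative, not a real model):

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, k = 16, 8, 2   # hidden size, number of experts, experts per token

# Toy stand-ins: each "expert" is a single linear map for brevity
# (real experts are full FFNs); the router is a learned d-by-N projection.
experts = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(N)]
W_g = rng.standard_normal((d, N)) / np.sqrt(d)

def moe_forward(x):
    """Route one token through its top-k experts only."""
    logits = x @ W_g                       # router scores, shape (N,)
    top = np.argsort(logits)[-k:]          # indices of the k best experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                           # softmax over the selected k
    # Only k of the N experts are evaluated; the other N - k cost nothing.
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

y = moe_forward(rng.standard_normal(d))    # same shape as the input token
```

The key property is visible in the loop at the end of `moe_forward`: the token touches exactly $k$ weight matrices, regardless of how large $N$ grows.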

Formal Setup

Definition

Expert Layer

An MoE layer replaces a single FFN with $N$ expert networks $\{E_1, \ldots, E_N\}$ and a gating function $G$. For input token $\mathbf{x}$:

$$\text{MoE}(\mathbf{x}) = \sum_{i=1}^{N} g_i(\mathbf{x}) \cdot E_i(\mathbf{x})$$

where $g_i(\mathbf{x})$ are the gating weights, with most set to zero.

In modern MoE transformers (Mixtral, DeepSeek-V3, Qwen-MoE), each $E_i$ is typically a SwiGLU- or GeGLU-gated FFN with three projection matrices (gate, up, down), not a two-matrix vanilla FFN. This matters for parameter accounting.

Definition

Top-k Routing

The router (or gating network) is typically a learned linear projection $\mathbf{W}_g \in \mathbb{R}^{d \times N}$ followed by top-$k$ selection:

$$\mathbf{h}(\mathbf{x}) = \mathbf{W}_g^\top \mathbf{x} \in \mathbb{R}^N$$

$$g_i(\mathbf{x}) = \begin{cases} \frac{\exp(h_i(\mathbf{x}))}{\sum_{j \in \text{Top-}k} \exp(h_j(\mathbf{x}))} & \text{if } i \in \text{Top-}k(\mathbf{h}) \\ 0 & \text{otherwise} \end{cases}$$

Only the $k$ experts with the highest router scores are computed. The gating weights are normalized over the selected experts via softmax.
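The gating function above translates almost line for line into numpy (a sketch; `topk_gates` is a hypothetical helper name, and the max-subtraction is the usual numerical-stability trick):

```python
import numpy as np

def topk_gates(h, k):
    """Sparse gating weights g from router logits h: softmax over the
    top-k entries, zero everywhere else (the cases formula above)."""
    top = np.argsort(h)[-k:]             # indices of the k largest logits
    g = np.zeros_like(h, dtype=float)
    e = np.exp(h[top] - h[top].max())    # stable softmax over selected experts
    g[top] = e / e.sum()
    return g

g = topk_gates(np.array([1.0, 3.0, 0.5, 2.0]), k=2)
# Only the two highest-scoring experts (indices 1 and 3) get nonzero weight,
# and the nonzero weights sum to 1.
```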

Main Theorems

Proposition

MoE Routing as Token-Expert Assignment

Statement

The routing problem can be viewed as a linear assignment: given $B$ tokens and $N$ experts, each with capacity $C = k \cdot B / N$ (the expected number of tokens per expert under uniform routing), the router must assign each token to $k$ experts while approximately satisfying the capacity constraint. The combined objective is:

$$\max_{\pi} \sum_{j=1}^{B} \sum_{i \in \pi(j)} h_i(\mathbf{x}_j) \quad \text{subject to} \quad |\{j : i \in \pi(j)\}| \leq C \;\;\forall\, i$$

where $\pi(j)$ is the set of experts assigned to token $j$.

Intuition

Without capacity constraints, the router would send all tokens to the single best expert (expert collapse). The capacity constraint forces distribution across experts, ensuring all experts get trained and contribute at inference time. The routing problem is a constrained optimization trading off expert quality per token against load balance across experts.

Why It Matters

This formulation makes explicit the core tension in MoE: letting the router freely choose the best expert per token would waste most parameters, while forcing uniform assignment would ignore token-expert affinity. Every MoE design navigates this tradeoff.
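One simple heuristic for this assignment problem, in the top-1 case, is greedy with fallback: each token claims its best expert, and if that expert is at capacity it moves to its next-best choice. A sketch (the function name and exact fallback rule are illustrative, not from any particular paper):

```python
import numpy as np

def capacity_routed_top1(logits, C):
    """Greedy capacity-constrained top-1 assignment: tokens claim their
    best available expert; a full expert forces fallback to the next one."""
    B, N = logits.shape
    counts = np.zeros(N, dtype=int)
    assign = np.empty(B, dtype=int)
    for j in range(B):
        for i in np.argsort(logits[j])[::-1]:   # experts by preference
            if counts[i] < C:                   # respect the capacity bound
                assign[j] = i
                counts[i] += 1
                break
    return assign

# 16 tokens, 4 experts, capacity C = kB/N = 4: total capacity equals B,
# so every token gets a slot and no expert exceeds C.
a = capacity_routed_top1(np.random.default_rng(2).standard_normal((16, 4)), C=4)
```

Note the order dependence: later tokens get worse experts when capacity fills up, which is exactly the quality-vs-balance tension the proposition describes.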

Proposition

Auxiliary Load Balancing Loss

Statement

The load balancing loss penalizes routing imbalance by encouraging the fraction of tokens sent to each expert to be close to $1/N$. Define:

$$f_i = \frac{1}{B}\sum_{j=1}^{B} \mathbf{1}[i \in \text{Top-}k(\mathbf{h}(\mathbf{x}_j))]$$

as the fraction of tokens routed to expert $i$, and:

$$p_i = \frac{1}{B}\sum_{j=1}^{B} \text{softmax}(\mathbf{h}(\mathbf{x}_j))_i$$

as the average router probability for expert $i$. The auxiliary loss is:

$$\mathcal{L}_{\text{bal}} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot p_i$$

where $\alpha$ is a small coefficient, commonly set on the order of $10^{-2}$ (Switch Transformer, GShard); different papers use different values. Under perfect load balance with top-$k$ routing, $f_i = k/N$ and $p_i = 1/N$, with $\sum_i f_i = k$ and $\sum_i p_i = 1$. The top-1 case recovers $f_i = 1/N$.

Intuition

The product $f_i \cdot p_i$ is high when expert $i$ both receives many tokens and has high average probability. Summing over experts and comparing to the uniform case creates a differentiable penalty for imbalance. Crucially, $f_i$ involves a non-differentiable indicator, but $p_i$ is differentiable through the softmax, so gradients flow through $p_i$ to update the router.

Why It Matters

Without some balancing mechanism, MoE training reliably collapses: one or two experts capture all tokens, and the remaining experts receive no gradients and become useless dead parameters. This auxiliary loss is the standard mechanism in Switch Transformer and GShard. Modern alternatives include expert choice routing (Zhou et al. 2022) and auxiliary-loss-free bias-term balancing (DeepSeek-V3). Some balancing mechanism is essential; this specific loss is not the only option.
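The loss is straightforward to compute from a batch of router logits. A numpy sketch under the definitions above (in training frameworks, $f_i$ typically comes from the dispatch mask rather than a recomputed argsort):

```python
import numpy as np

def load_balance_loss(logits, k, alpha=0.01):
    """Switch/GShard-style auxiliary loss: alpha * N * sum_i f_i * p_i,
    computed from a batch of router logits with shape (B, N)."""
    B, N = logits.shape
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)        # per-token softmax
    top = np.argsort(logits, axis=1)[:, -k:]         # top-k experts per token
    f = np.bincount(top.ravel(), minlength=N) / B    # dispatch fractions, sum k
    p = probs.mean(axis=0)                           # mean router probs, sum 1
    return alpha * N * float(f @ p)

# A uniform router (all-zero logits) gives the balanced value alpha * k.
loss = load_balance_loss(np.zeros((32, 8)), k=2)
```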

Failure Mode

If $\alpha$ is too large, the balancing loss dominates the language modeling loss and the router learns to distribute tokens uniformly regardless of content, defeating the purpose of conditional computation. If $\alpha$ is too small, expert collapse occurs. Tuning $\alpha$ is one of the main practical challenges of MoE training. Wang et al. 2024 ("Auxiliary-Loss-Free Load Balancing", arXiv 2408.15664) introduced a bias-term approach that avoids this tuning entirely, with related ideas in DeepSeek-V2 (Liu et al. 2024, arXiv 2405.04434); DeepSeek-V3 used it at scale.
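The bias-term idea can be sketched in a few lines: add a per-expert bias to the router logits for selection only, and push each bias against that expert's measured load. This is a toy illustration of the mechanism, not the exact published update rule; `gamma`, `decay`, and the EMA form are assumptions:

```python
import numpy as np

N, k, B = 8, 2, 64
bias = np.zeros(N)             # per-expert routing bias (selection only)
load_ema = np.full(N, k / N)   # running estimate of each expert's load
gamma, decay = 0.1, 0.9        # hypothetical update rate / EMA decay

def balance_step(logits):
    """Select experts on biased logits, then nudge each expert's bias
    against its measured load (a sketch, not the exact published rule)."""
    global bias, load_ema
    top = np.argsort(logits + bias, axis=1)[:, -k:]       # biased selection
    frac = np.bincount(top.ravel(), minlength=N) / len(logits)
    load_ema = decay * load_ema + (1 - decay) * frac
    bias -= gamma * (load_ema - k / N)   # overloaded -> down, underloaded -> up
    return top

rng = np.random.default_rng(1)
skew = np.array([2.0] + [0.0] * (N - 1))   # expert 0 gets a big logit head start
for _ in range(200):
    balance_step(rng.standard_normal((B, N)) + skew)
# bias[0] turns negative, offsetting expert 0's built-in advantage
```

Because no term is added to the training loss, there is no $\alpha$ to tune and no gradient interference with the language modeling objective.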

Routing Strategies

Top-1 routing: Each token goes to exactly one expert. Cheapest computation but loses the ability to combine expert outputs. Used in Switch Transformer.

Top-2 routing: Each token goes to two experts, outputs are combined via gating weights. Standard in Mixtral and many production models. Provides redundancy; Shazeer et al. 2017 used top-2 for regularization.

Expert choice routing: Instead of tokens choosing experts, experts choose their top-$k$ tokens from the batch (Zhou et al. 2022, "Mixture-of-Experts with Expert Choice Routing", arXiv 2202.09368). Guarantees perfect load balance by construction but requires batch-level coordination.

Shared + routed experts: Some experts process every token (shared) while others are conditionally routed (specialized). DeepSeek-MoE uses this pattern to maintain a baseline capacity across all tokens.
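Of these strategies, expert choice is the most different mechanically, and its load-balance-by-construction property is visible in one line of numpy (a sketch; the function name and fixed capacity `C` are illustrative):

```python
import numpy as np

def expert_choice(logits, C):
    """Expert-choice routing (Zhou et al. 2022, sketched): each expert takes
    its top-C tokens by router score, so every expert's load is exactly C."""
    # logits: (B, N) router scores. Result: (N, C) token indices per expert.
    return np.argsort(logits, axis=0)[-C:, :].T

picks = expert_choice(np.random.default_rng(4).standard_normal((16, 4)), C=8)
# Every expert processes exactly C tokens; a given token may be picked
# by zero experts (dropped) or by several (extra compute).
```

The final comment shows the tradeoff: perfect expert load comes at the cost of uneven per-token treatment.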

Expert Collapse and Failure Modes

Watch Out

Expert collapse is the default

Without explicit balancing mechanisms, MoE training will collapse. The rich-get-richer dynamic is strong: an expert that performs slightly better on early batches gets more tokens, gets more gradient updates, improves further, and starves other experts of training signal. This is not a rare failure mode: it happens reliably without the load balancing loss or an equivalent mechanism.

Common Fake Understanding

MoE is not "just ensembling." In an ensemble, all models see all inputs and their outputs are combined. In MoE, different experts see different subsets of tokens, and the routing is learned and dynamic: it changes based on the input. The router is a core part of the model, not a post-hoc aggregation mechanism. MoE is conditional computation, not model averaging.

Scaling Properties

The key scaling relationship: for an MoE model with $N$ experts, top-$k$ routing, and expert size equal to the dense baseline FFN:

  • Total parameters: $\sim N\times$ the dense baseline FFN parameters
  • Active parameters per token: $\sim k\times$ the dense baseline
  • FLOPs per token: $\sim k/N \times$ the total parameters' worth of FLOPs

This means an 8-expert, top-2 MoE model has 8x the FFN parameters but only 2x the FFN compute. In practice, the quality improvement scales with total parameters (knowledge capacity) while cost scales with active parameters.
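These ratios are plain arithmetic. A quick check with Mixtral-style layer dimensions (the values below are the commonly reported Mixtral 8x7B config, used here for illustration):

```python
# Back-of-envelope accounting for one Mixtral-style MoE layer.
d_model, d_ff = 4096, 14336   # commonly reported Mixtral 8x7B dimensions
N, k = 8, 2

ffn_params = 3 * d_model * d_ff        # SwiGLU FFN: gate, up, down matrices
total_ffn = N * ffn_params             # parameters stored per MoE layer
active_ffn = k * ffn_params            # parameters touched per token
flops_ratio = active_ffn / total_ffn   # = k / N: compute vs. capacity
```

Per layer: roughly 176M parameters per expert, 1.4B stored, but only about 352M active per token, a 4x gap between capacity and compute.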

Serving MoE Models

Serving MoE is harder than serving dense models of equivalent active size:

  • Expert parallelism shards experts across devices, so each token's top-$k$ experts may live on different GPUs, forcing all-to-all communication per MoE layer.
  • Load imbalance at inference time causes straggler experts that stall the batch; expert-parallel load balancers (EPLB in DeepSeek-V3) migrate or replicate hot experts.
  • Memory is dominated by the full expert set, not the active subset, so the KV cache plus all expert weights must fit.
  • Batching is complicated because different tokens in a batch route to different experts, breaking uniform per-layer compute.

Frameworks: vLLM, SGLang, and DeepSpeed-MoE provide expert-parallel kernels and routing-aware schedulers.

Common Confusions

Watch Out

MoE is not free capacity

MoE reduces compute per token but not memory. All expert weights must be loaded into memory (or efficiently swapped). For serving, this means MoE models need more GPU memory than dense models with the same active parameter count. The compute savings are real; the memory savings are not.

Watch Out

Experts do not learn interpretable specializations

Despite the name, experts rarely specialize in clean semantic categories ("one expert for math, one for language") under vanilla top-$k$ routing. Empirically, routing patterns are complex and overlapping. Some tokens route to the same expert regardless of domain. Do not assume you can interpret what each expert "does." Fine-grained designs change this somewhat: DeepSeek-V3 (Liu et al. 2024, arXiv 2412.19437) reports partial specialization when combining shared experts with a large number of fine-grained routed experts, though the interpretations remain noisy.

Summary

  • MoE replaces the dense FFN with $N$ expert FFNs plus a learned router
  • Only $k$ of $N$ experts are active per token: sparse computation
  • Total parameters $\gg$ active parameters, decoupling capacity from compute
  • Some balancing mechanism (auxiliary loss, expert choice, or bias terms) is essential to prevent expert collapse
  • MoE saves compute per token but not memory: all experts must be loaded
  • Routing is learned and dynamic, not ensembling

Exercises

ExerciseCore

Problem

Mixtral 8x7B has 8 experts with top-2 routing. The "7B" refers to the base single-expert model size, not the per-expert FFN in isolation. Only the FFN sublayer is replicated across experts; attention and embeddings are shared. Given this, explain why the total parameter count is approximately 46.7B (not 56B) and why active parameters per token are approximately 12.9B (not 14B).

ExerciseAdvanced

Problem

Adopt the convention used above, where $f_i$ is the fraction of dispatch events routed to expert $i$, so $\sum_i f_i = k$ (each of the $B$ tokens contributes $k$ dispatches), and $p_i$ is the average softmax probability, so $\sum_i p_i = 1$. Because both $f_i$ and $p_i$ are determined by the same router logits $\mathbf{h}$, they are positively associated: experts with higher average logits receive both more tokens and more probability mass. Under this co-sorting constraint, show that $\mathcal{L}_{\text{bal}} = N \sum_i f_i p_i$ attains its minimum at the balanced point $f_i = k/N$, $p_i = 1/N$, where the loss equals $k$. Verify that in the top-1 case ($k = 1$) this reduces to $f_i = p_i = 1/N$ and $\mathcal{L}_{\text{bal}} = 1$, matching the Switch Transformer formulation.

ExerciseResearch

Problem

Wang et al. 2024 ("Auxiliary-Loss-Free Load Balancing", arXiv 2408.15664) propose auxiliary-loss-free load balancing using per-expert bias terms added to the router logits, updated via an exponential moving average of expert utilization; the approach builds on ideas from DeepSeek-V2 (Liu et al. 2024, arXiv 2405.04434) and was deployed at scale in DeepSeek-V3. Why might this work better than the standard auxiliary loss? What assumption about the routing landscape does it make?


References

Canonical:

  • Shazeer et al., "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" (ICLR 2017)
  • Fedus et al., "Switch Transformers: Scaling to Trillion Parameter Models" (JMLR 2022)

Current:

  • Jiang et al., "Mixtral of Experts" (2024, arXiv 2401.04088)
  • Zhou et al., "Mixture-of-Experts with Expert Choice Routing" (NeurIPS 2022, arXiv 2202.09368)
  • Liu et al., "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model" (2024, arXiv 2405.04434)
  • Wang et al., "Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts" (2024, arXiv 2408.15664)
  • Liu et al., "DeepSeek-V3 Technical Report" (2024, arXiv 2412.19437)


Last reviewed: April 2026
