

Mixture of Experts

Sparse computation via learned routing: replace dense FFN layers with multiple expert networks, activate only a subset per token, and scale capacity without proportional compute.


Why This Matters

The biggest language models are MoE models. GPT-4 is widely rumored to be MoE based on third-party reports (SemiAnalysis 2023), but not confirmed by OpenAI. Mixtral 8x7B demonstrated that a 47B-parameter MoE model with only 13B active parameters per token can match or beat a dense 70B model. DeepSeek-V3 pushes this further with fine-grained experts and auxiliary-loss-free balancing.

MoE is the key architectural idea that decouples total model capacity (how much the model can store) from compute per token (how expensive each forward pass is). This decoupling is why we can build models with hundreds of billions of parameters that are still affordable to serve.

Mental Model

In a standard transformer, every token passes through the same feed-forward network (FFN). In an MoE transformer, the FFN is replaced by $N$ "expert" FFNs plus a small router network. For each token, the router selects the top-$k$ experts (typically $k = 1$ or $k = 2$), and only those experts process the token. The other $N - k$ experts are not computed.

The result: the model has $N$ times as many FFN parameters, but each token only activates $k/N$ of them. More knowledge stored, similar compute cost.
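This mental model is short enough to express directly in code. A minimal numpy sketch, with toy dimensions and single linear maps standing in for full expert FFNs (everything here is illustrative, not a real model):

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, k = 16, 8, 2   # hidden size, number of experts, experts per token

# Toy stand-ins: each "expert" is a single linear map for brevity
# (real experts are full FFNs); the router is a learned d-by-N projection.
experts = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(N)]
W_g = rng.standard_normal((d, N)) / np.sqrt(d)

def moe_forward(x):
    """Route one token through its top-k experts only."""
    logits = x @ W_g                       # router scores, shape (N,)
    top = np.argsort(logits)[-k:]          # indices of the k best experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                           # softmax over the selected k
    # Only k of the N experts are evaluated; the other N - k cost nothing.
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

y = moe_forward(rng.standard_normal(d))    # same shape as the input token
```

The key property is visible in the loop at the end of `moe_forward`: the token touches exactly $k$ weight matrices, regardless of how large $N$ grows.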

Formal Setup

Definition

Expert Layer

An MoE layer replaces a single FFN with $N$ expert networks $\{E_1, \ldots, E_N\}$ and a gating function $G$. For input token $\mathbf{x}$:

$$\text{MoE}(\mathbf{x}) = \sum_{i=1}^{N} g_i(\mathbf{x}) \cdot E_i(\mathbf{x})$$

where $g_i(\mathbf{x})$ are the gating weights, with most set to zero.

In modern MoE transformers (Mixtral, DeepSeek-V3, Qwen-MoE), each $E_i$ is typically a SwiGLU- or GeGLU-gated FFN with three projection matrices (gate, up, down), not a two-matrix vanilla FFN. This matters for parameter accounting.

Definition

Top-k Routing

The router (or gating network) is typically a learned linear projection $\mathbf{W}_g \in \mathbb{R}^{d \times N}$ followed by top-$k$ selection:

$$\mathbf{h}(\mathbf{x}) = \mathbf{W}_g^\top \mathbf{x} \in \mathbb{R}^N$$

$$g_i(\mathbf{x}) = \begin{cases} \frac{\exp(h_i(\mathbf{x}))}{\sum_{j \in \text{Top-}k} \exp(h_j(\mathbf{x}))} & \text{if } i \in \text{Top-}k(\mathbf{h}) \\ 0 & \text{otherwise} \end{cases}$$

Only the $k$ experts with the highest router scores are computed. The gating weights are normalized over the selected experts via softmax.
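The gating function above translates almost line for line into numpy (a sketch; `topk_gates` is a hypothetical helper name, and the max-subtraction is the usual numerical-stability trick):

```python
import numpy as np

def topk_gates(h, k):
    """Sparse gating weights g from router logits h: softmax over the
    top-k entries, zero everywhere else (the cases formula above)."""
    top = np.argsort(h)[-k:]             # indices of the k largest logits
    g = np.zeros_like(h, dtype=float)
    e = np.exp(h[top] - h[top].max())    # stable softmax over selected experts
    g[top] = e / e.sum()
    return g

g = topk_gates(np.array([1.0, 3.0, 0.5, 2.0]), k=2)
# Only the two highest-scoring experts (indices 1 and 3) get nonzero weight,
# and the nonzero weights sum to 1.
```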

Main Theorems

Proposition

MoE Routing as Token-Expert Assignment

Statement

The routing problem can be viewed as a linear assignment: given $B$ tokens and $N$ experts, each with capacity $C = k \cdot B / N$ (the expected number of tokens per expert under uniform routing), the router must assign each token to $k$ experts while approximately satisfying the capacity constraint. The combined objective is:

$$\max_{\pi} \sum_{j=1}^{B} \sum_{i \in \pi(j)} h_i(\mathbf{x}_j) \quad \text{subject to} \quad |\{j : i \in \pi(j)\}| \leq C \;\;\forall\, i$$

where $\pi(j)$ is the set of experts assigned to token $j$.

Intuition

Without capacity constraints, the router would send all tokens to the single best expert (expert collapse). The capacity constraint forces distribution across experts, ensuring all experts get trained and contribute at inference time. The routing problem is a constrained optimization trading off expert quality per token against load balance across experts.

Why It Matters

This formulation makes explicit the core tension in MoE: letting the router freely choose the best expert per token would waste most parameters, while forcing uniform assignment would ignore token-expert affinity. Every MoE design navigates this tradeoff.
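One simple heuristic for this assignment problem, in the top-1 case, is greedy with fallback: each token claims its best expert, and if that expert is at capacity it moves to its next-best choice. A sketch (the function name and exact fallback rule are illustrative, not from any particular paper):

```python
import numpy as np

def capacity_routed_top1(logits, C):
    """Greedy capacity-constrained top-1 assignment: tokens claim their
    best available expert; a full expert forces fallback to the next one."""
    B, N = logits.shape
    counts = np.zeros(N, dtype=int)
    assign = np.empty(B, dtype=int)
    for j in range(B):
        for i in np.argsort(logits[j])[::-1]:   # experts by preference
            if counts[i] < C:                   # respect the capacity bound
                assign[j] = i
                counts[i] += 1
                break
    return assign

# 16 tokens, 4 experts, capacity C = kB/N = 4: total capacity equals B,
# so every token gets a slot and no expert exceeds C.
a = capacity_routed_top1(np.random.default_rng(2).standard_normal((16, 4)), C=4)
```

Note the order dependence: later tokens get worse experts when capacity fills up, which is exactly the quality-vs-balance tension the proposition describes.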

Proposition

Auxiliary Load Balancing Loss

Statement

The load balancing loss penalizes routing imbalance by encouraging the fraction of tokens sent to each expert to be close to $1/N$. Define:

$$f_i = \frac{1}{B}\sum_{j=1}^{B} \mathbf{1}[i \in \text{Top-}k(\mathbf{h}(\mathbf{x}_j))]$$

as the fraction of tokens routed to expert $i$, and:

$$p_i = \frac{1}{B}\sum_{j=1}^{B} \text{softmax}(\mathbf{h}(\mathbf{x}_j))_i$$

as the average router probability for expert $i$. The auxiliary loss is:

$$\mathcal{L}_{\text{bal}} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot p_i$$

where $\alpha$ is a small coefficient, commonly set on the order of $10^{-2}$ (Switch Transformer, GShard); different papers use different values. Under perfect load balance with top-$k$ routing, $f_i = k/N$ and $p_i = 1/N$, with $\sum_i f_i = k$ and $\sum_i p_i = 1$. The top-1 case recovers $f_i = 1/N$.

Intuition

The product $f_i \cdot p_i$ is high when expert $i$ both receives many tokens and has high average probability. Summing over experts and comparing to the uniform case creates a differentiable penalty for imbalance. Crucially, $f_i$ involves a non-differentiable indicator, but $p_i$ is differentiable through the softmax, so gradients flow through $p_i$ to update the router.

Why It Matters

Without some balancing mechanism, MoE training reliably collapses: one or two experts capture all tokens, and the remaining experts receive no gradients and become useless dead parameters. This auxiliary loss is the standard mechanism in Switch Transformer and GShard. Modern alternatives include expert choice routing (Zhou et al. 2022) and auxiliary-loss-free bias-term balancing (DeepSeek-V3). Some balancing mechanism is essential; this specific loss is not the only option.
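The loss is straightforward to compute from a batch of router logits. A numpy sketch under the definitions above (in training frameworks, $f_i$ typically comes from the dispatch mask rather than a recomputed argsort):

```python
import numpy as np

def load_balance_loss(logits, k, alpha=0.01):
    """Switch/GShard-style auxiliary loss: alpha * N * sum_i f_i * p_i,
    computed from a batch of router logits with shape (B, N)."""
    B, N = logits.shape
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)        # per-token softmax
    top = np.argsort(logits, axis=1)[:, -k:]         # top-k experts per token
    f = np.bincount(top.ravel(), minlength=N) / B    # dispatch fractions, sum k
    p = probs.mean(axis=0)                           # mean router probs, sum 1
    return alpha * N * float(f @ p)

# A uniform router (all-zero logits) gives the balanced value alpha * k.
loss = load_balance_loss(np.zeros((32, 8)), k=2)
```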

Failure Mode

If $\alpha$ is too large, the balancing loss dominates the language modeling loss and the router learns to distribute tokens uniformly regardless of content, defeating the purpose of conditional computation. If $\alpha$ is too small, expert collapse occurs. Tuning $\alpha$ is one of the main practical challenges of MoE training. Wang et al. 2024 ("Auxiliary-Loss-Free Load Balancing", arXiv 2408.15664) introduced a bias-term approach that avoids this tuning entirely, with related ideas in DeepSeek-V2 (Liu et al. 2024, arXiv 2405.04434); DeepSeek-V3 used it at scale.
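The bias-term idea can be sketched in a few lines: add a per-expert bias to the router logits for selection only, and push each bias against that expert's measured load. This is a toy illustration of the mechanism, not the exact published update rule; `gamma`, `decay`, and the EMA form are assumptions:

```python
import numpy as np

N, k, B = 8, 2, 64
bias = np.zeros(N)             # per-expert routing bias (selection only)
load_ema = np.full(N, k / N)   # running estimate of each expert's load
gamma, decay = 0.1, 0.9        # hypothetical update rate / EMA decay

def balance_step(logits):
    """Select experts on biased logits, then nudge each expert's bias
    against its measured load (a sketch, not the exact published rule)."""
    global bias, load_ema
    top = np.argsort(logits + bias, axis=1)[:, -k:]       # biased selection
    frac = np.bincount(top.ravel(), minlength=N) / len(logits)
    load_ema = decay * load_ema + (1 - decay) * frac
    bias -= gamma * (load_ema - k / N)   # overloaded -> down, underloaded -> up
    return top

rng = np.random.default_rng(1)
skew = np.array([2.0] + [0.0] * (N - 1))   # expert 0 gets a big logit head start
for _ in range(200):
    balance_step(rng.standard_normal((B, N)) + skew)
# bias[0] turns negative, offsetting expert 0's built-in advantage
```

Because no term is added to the training loss, there is no $\alpha$ to tune and no gradient interference with the language modeling objective.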

Routing Strategies

Top-1 routing: Each token goes to exactly one expert. Cheapest computation but loses the ability to combine expert outputs. Used in Switch Transformer.

Top-2 routing: Each token goes to two experts, outputs are combined via gating weights. Standard in Mixtral and many production models. Provides redundancy; Shazeer et al. 2017 used top-2 for regularization.

Expert choice routing: Instead of tokens choosing experts, experts choose their top-$k$ tokens from the batch (Zhou et al. 2022, "Mixture-of-Experts with Expert Choice Routing", arXiv 2202.09368). Guarantees perfect load balance by construction but requires batch-level coordination.

Shared + routed experts: Some experts process every token (shared) while others are conditionally routed (specialized). DeepSeek-MoE uses this pattern to maintain a baseline capacity across all tokens.
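Of these strategies, expert choice is the most different mechanically, and its load-balance-by-construction property is visible in one line of numpy (a sketch; the function name and fixed capacity `C` are illustrative):

```python
import numpy as np

def expert_choice(logits, C):
    """Expert-choice routing (Zhou et al. 2022, sketched): each expert takes
    its top-C tokens by router score, so every expert's load is exactly C."""
    # logits: (B, N) router scores. Result: (N, C) token indices per expert.
    return np.argsort(logits, axis=0)[-C:, :].T

picks = expert_choice(np.random.default_rng(4).standard_normal((16, 4)), C=8)
# Every expert processes exactly C tokens; a given token may be picked
# by zero experts (dropped) or by several (extra compute).
```

The final comment shows the tradeoff: perfect expert load comes at the cost of uneven per-token treatment.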

Expert Collapse and Failure Modes

Watch Out

Expert collapse is the default

Without explicit balancing mechanisms, MoE training will collapse. The rich-get-richer dynamic is strong: an expert that performs slightly better on early batches gets more tokens, gets more gradient updates, improves further, and starves other experts of training signal. This is not a rare failure mode: it happens reliably without the load balancing loss or an equivalent mechanism.

Common Fake Understanding

MoE is not "just ensembling." In an ensemble, all models see all inputs and their outputs are combined. In MoE, different experts see different subsets of tokens, and the routing is learned and dynamic: it changes based on the input. The router is a core part of the model, not a post-hoc aggregation mechanism. MoE is conditional computation, not model averaging.

Scaling Properties

The key scaling relationship: for an MoE model with $N$ experts, top-$k$ routing, and expert size equal to the dense baseline FFN:

  • Total parameters: $\sim N\times$ the dense baseline FFN parameters
  • Active parameters per token: $\sim k\times$ the dense baseline
  • FLOPs per token: $\sim k/N \times$ the total parameters' worth of FLOPs

This means an 8-expert, top-2 MoE model has 8x the FFN parameters but only 2x the FFN compute. In practice, the quality improvement scales with total parameters (knowledge capacity) while cost scales with active parameters.
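These ratios are plain arithmetic. A quick check with Mixtral-style layer dimensions (the values below are the commonly reported Mixtral 8x7B config, used here for illustration):

```python
# Back-of-envelope accounting for one Mixtral-style MoE layer.
d_model, d_ff = 4096, 14336   # commonly reported Mixtral 8x7B dimensions
N, k = 8, 2

ffn_params = 3 * d_model * d_ff        # SwiGLU FFN: gate, up, down matrices
total_ffn = N * ffn_params             # parameters stored per MoE layer
active_ffn = k * ffn_params            # parameters touched per token
flops_ratio = active_ffn / total_ffn   # = k / N: compute vs. capacity
```

Per layer: roughly 176M parameters per expert, 1.4B stored, but only about 352M active per token, a 4x gap between capacity and compute.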

Serving MoE Models

Serving MoE is harder than serving dense models of equivalent active size:

  • Expert parallelism shards experts across devices, so each token's top-$k$ experts may live on different GPUs, forcing all-to-all communication per MoE layer.
  • Load imbalance at inference time causes straggler experts that stall the batch; expert-parallel load balancers (EPLB in DeepSeek-V3) migrate or replicate hot experts.
  • Memory is dominated by the full expert set, not the active subset, so the KV cache plus all expert weights must fit.
  • Batching is complicated because different tokens in a batch route to different experts, breaking uniform per-layer compute.

Frameworks: vLLM, SGLang, and DeepSpeed-MoE provide expert-parallel kernels and routing-aware schedulers.

Common Confusions

Watch Out

MoE is not free capacity

MoE reduces compute per token but not memory. All expert weights must be loaded into memory (or efficiently swapped). For serving, this means MoE models need more GPU memory than dense models with the same active parameter count. The compute savings are real; the memory savings are not.

Watch Out

Experts do not learn interpretable specializations

Despite the name, experts rarely specialize in clean semantic categories ("one expert for math, one for language") under vanilla top-$k$ routing. Empirically, routing patterns are complex and overlapping. Some tokens route to the same expert regardless of domain. Do not assume you can interpret what each expert "does." Fine-grained designs change this somewhat: DeepSeek-V3 (Liu et al. 2024, arXiv 2412.19437) reports partial specialization when combining shared experts with a large number of fine-grained routed experts, though the interpretations remain noisy.

Summary

  • MoE replaces the dense FFN with $N$ expert FFNs plus a learned router
  • Only $k$ of $N$ experts are active per token: sparse computation
  • Total parameters $\gg$ active parameters, decoupling capacity from compute
  • Some balancing mechanism (auxiliary loss, expert choice, or bias terms) is essential to prevent expert collapse
  • MoE saves compute per token but not memory: all experts must be loaded
  • Routing is learned and dynamic, not ensembling

Exercises

ExerciseCore

Problem

Mixtral 8x7B has 8 experts with top-2 routing. The "7B" refers to the base single-expert model size, not the per-expert FFN in isolation. Only the FFN sublayer is replicated across experts; attention and embeddings are shared. Given this, explain why the total parameter count is approximately 46.7B (not 56B) and why active parameters per token are approximately 12.9B (not 14B).

ExerciseAdvanced

Problem

Adopt the convention used above, where $f_i$ is the fraction of dispatch events routed to expert $i$, so $\sum_i f_i = k$ (each of the $B$ tokens contributes $k$ dispatches), and $p_i$ is the average softmax probability, so $\sum_i p_i = 1$. Because both $f_i$ and $p_i$ are determined by the same router logits $\mathbf{h}$, they are positively associated: experts with higher average logits receive both more tokens and more probability mass. Under this co-sorting constraint, show that $\mathcal{L}_{\text{bal}} = N \sum_i f_i p_i$ attains its minimum at the balanced point $f_i = k/N$, $p_i = 1/N$, where the loss equals $k$. Verify that in the top-1 case ($k = 1$) this reduces to $f_i = p_i = 1/N$ and $\mathcal{L}_{\text{bal}} = 1$, matching the Switch Transformer formulation.

ExerciseResearch

Problem

Wang et al. 2024 ("Auxiliary-Loss-Free Load Balancing", arXiv 2408.15664) propose auxiliary-loss-free load balancing using per-expert bias terms added to the router logits, updated via an exponential moving average of expert utilization; the approach builds on ideas from DeepSeek-V2 (Liu et al. 2024, arXiv 2405.04434) and was deployed at scale in DeepSeek-V3. Why might this work better than the standard auxiliary loss? What assumption about the routing landscape does it make?


References

Canonical:

  • Shazeer et al., "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" (ICLR 2017)
  • Fedus et al., "Switch Transformers: Scaling to Trillion Parameter Models" (JMLR 2022)

Current:

  • Jiang et al., "Mixtral of Experts" (2024, arXiv 2401.04088)
  • Zhou et al., "Mixture-of-Experts with Expert Choice Routing" (NeurIPS 2022, arXiv 2202.09368)
  • Liu et al., "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model" (2024, arXiv 2405.04434)
  • Wang et al., "Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts" (2024, arXiv 2408.15664)
  • Liu et al., "DeepSeek-V3 Technical Report" (2024, arXiv 2412.19437)


Last reviewed: April 2026
