What Each Does
Both architectures process sequences of tokens through layers of attention and feedforward blocks. They differ in how the feedforward computation is organized.
Dense transformers pass every token through every parameter in every layer. If the feedforward block has $P$ parameters, every token uses all $P$ of them.
Mixture-of-experts (MoE) replaces the single feedforward block with $N$ parallel "expert" networks and a router that selects $k$ experts per token. Total parameters grow with $N$, but per-token compute grows only with $k$.
Side-by-Side Architecture
Dense Feedforward Block
For each token representation $x \in \mathbb{R}^{d_{\text{model}}}$:

$$\text{FFN}(x) = W_2\, \sigma(W_1 x)$$

where $W_1 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$, $W_2 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$, and $\sigma$ is the activation function. All tokens share the same weights.
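A minimal NumPy sketch of the dense block (the dimensions are hypothetical, chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32   # hypothetical sizes, not from the text

W1 = rng.standard_normal((d_ff, d_model)) * 0.02
W2 = rng.standard_normal((d_model, d_ff)) * 0.02

def dense_ffn(x):
    """FFN(x) = W2 @ sigma(W1 @ x), with ReLU as the activation."""
    h = np.maximum(W1 @ x, 0.0)
    return W2 @ h

x = rng.standard_normal(d_model)
print(dense_ffn(x).shape)  # (8,) -- same dimensionality as the input
```

Every token passes through the same `W1` and `W2`; there is no routing decision anywhere in the forward pass.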
MoE Feedforward Block
Given $N$ expert networks $E_1, \dots, E_N$ and a router $g$:

$$\text{MoE}(x) = \sum_{i \in \text{top-}k(g(x))} g_i(x)\, E_i(x)$$

Only the top-$k$ experts are evaluated. Typically $k = 1$ or $k = 2$ while $N$ ranges from 8 to 128 or more.
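A minimal NumPy sketch of top-k routing for a single token (expert count, k, and dimensions are hypothetical; a linear router with a softmax over the selected logits stands in for $g$):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, N, k = 8, 32, 4, 2   # hypothetical sizes, not from the text

# One weight pair per expert, plus a linear router.
W1 = rng.standard_normal((N, d_ff, d_model)) * 0.02
W2 = rng.standard_normal((N, d_model, d_ff)) * 0.02
Wr = rng.standard_normal((N, d_model)) * 0.02

def moe_ffn(x):
    """Route x to its top-k experts and combine their outputs by gate weight."""
    logits = Wr @ x
    top = np.argsort(logits)[-k:]                            # indices of the top-k experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over selected experts
    out = np.zeros(d_model)
    for g, i in zip(gates, top):
        h = np.maximum(W1[i] @ x, 0.0)   # expert i's feedforward (ReLU)
        out += g * (W2[i] @ h)
    return out

x = rng.standard_normal(d_model)
print(moe_ffn(x).shape)  # (8,)
```

Note that only `k` of the `N` expert weight pairs are touched per token, which is where the per-token FLOP savings come from; all `N` pairs still occupy memory.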
Where Each Is Stronger
Dense wins on simplicity
Dense models have no routing mechanism, no load balancing loss, and no expert parallelism requirements. Every GPU processes the same computation for every token. Gradient flow is uniform across all parameters. There are no auxiliary losses to tune and no risk of expert collapse (where the router sends all tokens to a single expert).
MoE wins on parameter efficiency per FLOP
A 1.8T-parameter MoE model with only a small fraction of its experts active per token uses roughly the same FLOPs per token as a 70B dense model. The extra parameters provide additional capacity for memorization and rare knowledge without proportional compute cost. Mixtral 8x7B demonstrated this: total parameters of 46.7B, but only 12.9B active per token, matching or exceeding the quality of a 34B dense model.
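A back-of-envelope check of the Mixtral figures above (assuming, for illustration, that everything outside the expert feedforward blocks, i.e. attention, embeddings, and the router, is shared across experts):

```python
# Published Mixtral 8x7B figures from the text: 46.7B total, 12.9B active,
# N = 8 experts with top-2 routing.
total, active, N, k = 46.7e9, 12.9e9, 8, 2

# Assumed decomposition: active = shared + (k/N) * (total - shared).
# Solving for the shared (non-expert) parameters:
shared = (active - (k / N) * total) / (1 - k / N)
print(f"implied shared params: {shared / 1e9:.2f}B")   # ~1.63B
print(f"active fraction:       {active / total:.0%}")  # 28%
```

The point of the arithmetic: because attention and embeddings are shared, the active fraction (about 28%) is somewhat higher than the naive $k/N = 25\%$.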
Dense wins on fine-tuning and transfer
Fine-tuning a dense model is straightforward: all parameters receive gradients from every example. MoE models have uneven parameter utilization during fine-tuning because different experts see different subsets of data. This makes MoE models harder to adapt to narrow downstream tasks where some experts may receive too few updates.
Where Each Fails
Dense fails at extreme scale
Scaling a dense model from 70B to 700B parameters requires 10x more FLOPs per token. Training and inference costs scale linearly with parameter count. At the frontier, this makes dense models prohibitively expensive for a given quality target compared to MoE alternatives.
MoE fails on load balancing
If the router consistently prefers a subset of experts, the remaining experts receive few tokens and learn nothing useful. This wastes capacity. Auxiliary load-balancing losses (penalizing uneven expert utilization) help but add a hyperparameter and can degrade performance if weighted too heavily.
MoE fails on memory and communication
Although per-token FLOPs are low, all expert parameter sets must be stored in memory or distributed across devices. Expert parallelism requires all-to-all communication to route tokens to the correct device, which becomes a bottleneck at large scale. A 128-expert model needs expert parameters spread across many GPUs, and the communication overhead can negate the FLOP savings.
Key Assumptions That Differ
| | Dense | MoE |
|---|---|---|
| Parameters active per token | All | $k/N$ fraction |
| Total parameters | Equal to active parameters | $\approx N/k \times$ active parameters |
| FLOPs per token | Proportional to total params | Proportional to active params |
| Memory | Proportional to active params | Proportional to total params |
| Training stability | Standard | Requires load balancing |
| Communication | Data/tensor parallel | All-to-all for expert routing |
The Routing Problem
Router Collapse and Load Balancing
Statement
Without auxiliary losses, a learned router tends to converge to a degenerate solution where a small number of experts handle most tokens. This happens because popular experts receive more gradient signal, improving faster, which makes them even more popular.
The standard fix is an auxiliary load-balancing loss:

$$\mathcal{L}_{\text{aux}} = \alpha \, N \sum_{i=1}^{N} f_i \, P_i$$

where $f_i$ is the fraction of tokens routed to expert $i$ and $P_i$ is the mean router probability for expert $i$. The coefficient $\alpha$ (typically 0.01 to 0.1) controls the strength of the balancing incentive.
Intuition
The sum $\sum_i f_i P_i$ is minimized when load is uniform ($f_i = 1/N$) and router probabilities are uniform ($P_i = 1/N$). The loss gently pushes toward even utilization without forcing rigid assignment.
Failure Mode
Setting $\alpha$ too high forces uniform routing regardless of input content, which wastes the benefit of specialization. Setting it too low allows collapse. The right value is empirical and varies by model scale.
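A minimal NumPy sketch of a Switch-style load-balancing loss, $\alpha N \sum_i f_i P_i$, evaluated on hypothetical router outputs for a small batch of tokens:

```python
import numpy as np

def load_balancing_loss(router_probs, expert_ids, alpha=0.01):
    """alpha * N * sum_i f_i * P_i, with f_i the fraction of tokens
    hard-routed to expert i and P_i the mean router probability for i."""
    n_tokens, n_experts = router_probs.shape
    f = np.bincount(expert_ids, minlength=n_experts) / n_tokens
    P = router_probs.mean(axis=0)
    return alpha * n_experts * np.dot(f, P)

N, T = 4, 8
# Perfectly balanced routing: the loss hits its minimum value, alpha.
uniform = np.full((T, N), 1 / N)
balanced = np.arange(T) % N
print(load_balancing_loss(uniform, balanced))    # minimum: equals alpha

# Collapsed routing: every token to expert 0, skewed probabilities.
skewed = np.tile([0.7, 0.1, 0.1, 0.1], (T, 1))
collapsed = np.zeros(T, dtype=int)
print(load_balancing_loss(skewed, collapsed))    # larger, roughly 0.028
```

The collapsed case scores higher than the balanced one, which is exactly the gradient signal that discourages the router from favoring a single expert.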
When a Researcher Would Use Each
Training a model under a fixed FLOP budget
Use MoE. If your compute budget allows a 7B-active-parameter model, an MoE with 8 experts and top-1 routing gives you 56B total parameters at the same per-token cost. The extra capacity improves quality on knowledge-heavy benchmarks without increasing training FLOPs proportionally.
Deploying a model on a single GPU
Use dense. A dense 7B model fits in 14GB of memory (FP16). An MoE with the same active compute but 8 experts requires storing all 56B parameters, needing at least 112GB. The memory and communication requirements of MoE make single-device deployment impractical for large expert counts.
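The memory arithmetic above as a quick sketch (decimal GB, FP16 at 2 bytes per parameter; runtime overheads such as activations and the KV cache are ignored):

```python
def fp16_gb(n_params):
    """Weight memory in decimal gigabytes at 2 bytes per FP16 parameter."""
    return n_params * 2 / 1e9

print(f"dense 7B : {fp16_gb(7e9):.0f} GB")       # 14 GB
print(f"MoE 8x7B : {fp16_gb(8 * 7e9):.0f} GB")   # 112 GB
```

The MoE's weights alone exceed any single current GPU, so the model must be sharded even though its per-token compute matches the dense 7B model.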
Fine-tuning for a specialized domain
Use dense or consider expert-specific fine-tuning strategies. Dense models distribute gradient signal evenly across all parameters. MoE models may leave some experts undertrained on narrow domains. If using MoE, consider fine-tuning only the router and a subset of experts.
Common Confusions
MoE does not reduce total compute, only per-token compute
An MoE model with $N$ experts and $k$ active has roughly $N/k$ times more total parameters than a dense model with the same per-token compute. Training still requires processing gradients for the $k$ active experts per token, and the router adds overhead. The savings come from not activating all experts per token, not from having fewer parameters.
More experts does not always help
Increasing $N$ while keeping $k$ fixed adds parameters but also increases routing difficulty, communication cost, and the risk of expert collapse. Empirically, returns diminish beyond a certain expert count. The optimal $N$ depends on the scale of the model and the available hardware topology.
References
Canonical:
- Shazeer et al., Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (ICLR 2017)
- Fedus, Zoph, Shazeer, Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (JMLR 2022)
Current:
- Jiang et al., Mixtral of Experts (2024)
- Dai et al., DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models (2024)