
Comparison

Dense Transformers vs. Mixture-of-Experts

Dense transformers activate all parameters for every token, giving simple training but high compute per token. Mixture-of-experts routes each token to k of N experts, achieving higher total capacity with lower per-token compute, at the cost of routing complexity and load balancing challenges.

What Each Does

Both architectures process sequences of tokens through layers of attention and feedforward blocks. They differ in how the feedforward computation is organized.

Dense transformers pass every token through every parameter in every layer. If the feedforward block has $d_{\text{model}} \times d_{\text{ff}}$ parameters, every token uses all of them.

Mixture-of-experts (MoE) replaces the single feedforward block with $N$ parallel "expert" networks and a router that selects $k \ll N$ experts per token. Total parameters grow with $N$, but per-token compute grows only with $k$.

Side-by-Side Architecture

Definition

Dense Feedforward Block

For each token representation $x \in \mathbb{R}^{d}$:

$$\text{FFN}(x) = W_2 \, \sigma(W_1 x + b_1) + b_2$$

where $W_1 \in \mathbb{R}^{d_{\text{ff}} \times d}$, $W_2 \in \mathbb{R}^{d \times d_{\text{ff}}}$, and $\sigma$ is the activation function. All tokens share the same weights.
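A minimal NumPy sketch of this block, with toy shapes and ReLU standing in for $\sigma$ (both are illustrative choices, not a specific model's configuration):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Dense feedforward block: every token multiplies against every weight."""
    h = np.maximum(0.0, W1 @ x + b1)  # sigma = ReLU here
    return W2 @ h + b2

d, d_ff = 8, 32  # toy sizes; real models use d ~ 4096 and d_ff ~ 4d
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_ff, d)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d, d_ff)), np.zeros(d)
y = ffn(rng.normal(size=d), W1, b1, W2, b2)
print(y.shape)  # (8,)
```

Every entry of `W1` and `W2` participates in computing `y`, which is exactly what makes the dense block's per-token FLOPs proportional to $d \cdot d_{\text{ff}}$.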

Definition

MoE Feedforward Block

Given $N$ expert networks $\{E_1, \ldots, E_N\}$ and a router $g: \mathbb{R}^d \to \mathbb{R}^N$:

$$\text{MoE}(x) = \sum_{i \in \text{TopK}(g(x))} \text{softmax}(g(x))_i \, E_i(x)$$

Only the top-$k$ experts are evaluated. Typically $k = 1$ or $k = 2$, while $N$ ranges from 8 to 128 or more.

Where Each Is Stronger

Dense wins on simplicity

Dense models have no routing mechanism, no load balancing loss, and no expert parallelism requirements. Every GPU processes the same computation for every token. Gradient flow is uniform across all parameters. There are no auxiliary losses to tune and no risk of expert collapse (where the router sends all tokens to a single expert).

MoE wins on parameter efficiency per FLOP

A 1.8T-parameter MoE model with $k = 2$ of $N = 128$ experts uses roughly the same FLOPs per token as a 70B dense model. The extra parameters provide additional capacity for memorization and rare knowledge without proportional compute cost. Mixtral 8x7B demonstrated this: 46.7B total parameters but only 12.9B active per token, matching or exceeding the quality of a 34B dense model.
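The arithmetic behind this trade is simple. Below is a hypothetical parameter budget (the `shared` and `per_expert` numbers are invented for illustration, not Mixtral's actual configuration):

```python
def moe_param_counts(shared, per_expert, N, k):
    """Total vs. active parameters when only k of N experts run per token."""
    total = shared + N * per_expert    # must be stored in memory
    active = shared + k * per_expert   # actually multiplied per token
    return total, active

# hypothetical budget: 2B shared (attention, embeddings), 5B per expert
total, active = moe_param_counts(shared=2e9, per_expert=5e9, N=8, k=2)
print(total / 1e9, active / 1e9)  # 42.0 12.0
```

Note that the shared (non-expert) parameters are always active, so the total-to-active ratio is somewhat below $N/k$ in practice.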

Dense wins on fine-tuning and transfer

Fine-tuning a dense model is straightforward: all parameters receive gradients from every example. MoE models have uneven parameter utilization during fine-tuning because different experts see different subsets of data. This makes MoE models harder to adapt to narrow downstream tasks where some experts may receive too few updates.

Where Each Fails

Dense fails at extreme scale

Scaling a dense model from 70B to 700B parameters requires 10x more FLOPs per token. Training and inference costs scale linearly with parameter count. At the frontier, this makes dense models prohibitively expensive for a given quality target compared to MoE alternatives.

MoE fails on load balancing

If the router consistently prefers a subset of experts, the remaining experts receive few tokens and learn nothing useful. This wastes capacity. Auxiliary load-balancing losses (penalizing uneven expert utilization) help but add a hyperparameter and can degrade performance if weighted too heavily.

MoE fails on memory and communication

Although per-token FLOPs are low, all NN expert parameter sets must be stored in memory or distributed across devices. Expert parallelism requires all-to-all communication to route tokens to the correct device, which becomes a bottleneck at large scale. A 128-expert model needs expert parameters spread across many GPUs, and the communication overhead can negate the FLOP savings.

Key Assumptions That Differ

|  | Dense | MoE |
| --- | --- | --- |
| Parameters active per token | All | $k/N$ fraction |
| Total parameters | = active parameters | $\gg$ active parameters |
| FLOPs per token | $O(d \cdot d_{\text{ff}})$ | $\approx O(k \cdot d \cdot d_{\text{ff}} / N)$ |
| Memory | Proportional to active params | Proportional to total params |
| Training stability | Standard | Requires load balancing |
| Communication | Data/tensor parallel | All-to-all for expert routing |

The Routing Problem

Theorem

Router Collapse and Load Balancing

Statement

Without auxiliary losses, a learned router $g(x)$ tends to converge to a degenerate solution where a small number of experts handle most tokens. This happens because popular experts receive more gradient signal, improving faster, which makes them even more popular.

The standard fix is an auxiliary load-balancing loss:

$$\mathcal{L}_{\text{aux}} = \alpha \cdot N \cdot \sum_{i=1}^N f_i \cdot p_i$$

where $f_i$ is the fraction of tokens routed to expert $i$ and $p_i$ is the mean router probability for expert $i$. The coefficient $\alpha$ (typically 0.01 to 0.1) controls the strength of the balancing incentive.
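A sketch of this loss in NumPy, assuming top-1 assignment when computing $f_i$ (exact bookkeeping varies across implementations):

```python
import numpy as np

def load_balancing_loss(router_probs, alpha=0.01):
    """L_aux = alpha * N * sum_i f_i * p_i, with top-1 routing.

    router_probs: (T, N) softmax probabilities for T tokens over N experts.
    """
    T, N = router_probs.shape
    assignment = router_probs.argmax(axis=1)        # top-1 expert per token
    f = np.bincount(assignment, minlength=N) / T    # f_i: load fraction
    p = router_probs.mean(axis=0)                   # p_i: mean probability
    return alpha * N * float(np.sum(f * p))

# perfectly balanced routing over N=4 experts yields a loss of about alpha
probs = np.eye(4) * 0.7 + 0.075   # each token prefers a different expert
print(load_balancing_loss(probs, alpha=0.01))
```

With uniform load and uniform mean probabilities the sum evaluates to $1/N$, so the loss bottoms out at $\alpha$; concentrating tokens on fewer experts pushes it higher.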

Intuition

The sum $\sum_i f_i \, p_i$ is minimized when load is uniform ($f_i = 1/N$) and probabilities are uniform ($p_i = 1/N$). The loss gently pushes toward even utilization without forcing rigid assignment.

Failure Mode

Setting $\alpha$ too high forces uniform routing regardless of input content, which wastes the benefit of specialization. Setting $\alpha$ too low allows collapse. The right value is empirical and varies by model scale.

When a Researcher Would Use Each

Example

Training a model under a fixed FLOP budget

Use MoE. If your compute budget allows a 7B-active-parameter model, an MoE with 8 experts of that size gives you roughly 56B total parameters at the same per-token cost. The extra capacity improves quality on knowledge-heavy benchmarks without increasing training FLOPs proportionally.

Example

Deploying a model on a single GPU

Use dense. A dense 7B model fits in 14GB of memory (FP16). An MoE with the same active compute but 8 experts requires storing all 56B parameters, needing at least 112GB. The memory and communication requirements of MoE make single-device deployment impractical for large expert counts.
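The memory figures above come straight from bytes-per-parameter arithmetic, sketched here:

```python
def fp16_gigabytes(n_params):
    """Storage for model weights at 2 bytes per parameter (FP16/BF16)."""
    return n_params * 2 / 1e9

print(fp16_gigabytes(7e9))    # 14.0  -> dense 7B fits on one large GPU
print(fp16_gigabytes(56e9))   # 112.0 -> 8-expert MoE needs several GPUs
```

Quantization to 8 or 4 bits shrinks both numbers proportionally, but the dense-to-MoE ratio stays the same: all expert weights must be resident even though most are idle for any given token.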

Example

Fine-tuning for a specialized domain

Use dense or consider expert-specific fine-tuning strategies. Dense models distribute gradient signal evenly across all parameters. MoE models may leave some experts undertrained on narrow domains. If using MoE, consider fine-tuning only the router and a subset of experts.

Common Confusions

Watch Out

MoE does not reduce total compute, only per-token compute

An MoE model with $N$ experts and $k$ active has total parameters roughly $N/k$ times larger than the equivalent dense model. Training still requires processing gradients for the active experts, and the router adds overhead. The savings come from not activating all experts per token, not from having fewer parameters.

Watch Out

More experts does not always help

Increasing $N$ while keeping $k$ fixed adds parameters but also increases routing difficulty, communication cost, and the risk of expert collapse. Empirically, returns diminish beyond a certain expert count. The optimal $N$ depends on the scale of the model and the available hardware topology.
