
Comparison

Dense Transformers vs. Mixture-of-Experts

Dense transformers activate all parameters for every token, giving simple training but high compute per token. Mixture-of-experts routes each token to k of N experts, achieving higher total capacity with lower per-token compute, at the cost of routing complexity and load balancing challenges.

What Each Does

Both architectures process sequences of tokens through layers of attention and feedforward blocks. They differ in how the feedforward computation is organized.

Dense transformers pass every token through every parameter in every layer. If the feedforward block has $d_{\text{model}} \times d_{\text{ff}}$ parameters, every token uses all of them.

Mixture-of-experts (MoE) replaces the single feedforward block with $N$ parallel "expert" networks and a router that selects $k \ll N$ experts per token. Total parameters grow with $N$, but per-token compute grows only with $k$.

Side-by-Side Architecture

Definition

Dense Feedforward Block

For each token representation $x \in \mathbb{R}^{d}$:

$$\text{FFN}(x) = W_2 \, \sigma(W_1 x + b_1) + b_2$$

where $W_1 \in \mathbb{R}^{d_{\text{ff}} \times d}$, $W_2 \in \mathbb{R}^{d \times d_{\text{ff}}}$, and $\sigma$ is the activation function. All tokens share the same weights.
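A minimal NumPy sketch of this block, with toy shapes and ReLU standing in for $\sigma$ (both are illustrative choices, not a specific model's configuration):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Dense feedforward block: every token multiplies against every weight."""
    h = np.maximum(0.0, W1 @ x + b1)  # sigma = ReLU here
    return W2 @ h + b2

d, d_ff = 8, 32  # toy sizes; real models use d ~ 4096 and d_ff ~ 4d
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_ff, d)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d, d_ff)), np.zeros(d)
y = ffn(rng.normal(size=d), W1, b1, W2, b2)
print(y.shape)  # (8,)
```

Every entry of `W1` and `W2` participates in computing `y`, which is exactly what makes the dense block's per-token FLOPs proportional to $d \cdot d_{\text{ff}}$.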

Definition

MoE Feedforward Block

Given $N$ expert networks $\{E_1, \ldots, E_N\}$ and a router $g: \mathbb{R}^d \to \mathbb{R}^N$:

$$\text{MoE}(x) = \sum_{i \in \text{TopK}(g(x))} \text{softmax}(g(x))_i \, E_i(x)$$

Only the top-$k$ experts are evaluated. Typically $k = 1$ or $k = 2$, while $N$ ranges from 8 to 128 or more.

Where Each Is Stronger

Dense wins on simplicity

Dense models have no routing mechanism, no load balancing loss, and no expert parallelism requirements. Every GPU processes the same computation for every token. Gradient flow is uniform across all parameters. There are no auxiliary losses to tune and no risk of expert collapse (where the router sends all tokens to a single expert).

MoE wins on parameter efficiency per FLOP

A 1.8T-parameter MoE model with $k = 2$ of $N = 128$ experts uses roughly the same FLOPs per token as a 70B dense model. The extra parameters provide additional capacity for memorization and rare knowledge without proportional compute cost. Mixtral 8x7B demonstrated this: 46.7B total parameters but only 12.9B active per token, matching or exceeding the quality of a 34B dense model.
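The arithmetic behind this trade is simple. Below is a hypothetical parameter budget (the `shared` and `per_expert` numbers are invented for illustration, not Mixtral's actual configuration):

```python
def moe_param_counts(shared, per_expert, N, k):
    """Total vs. active parameters when only k of N experts run per token."""
    total = shared + N * per_expert    # must be stored in memory
    active = shared + k * per_expert   # actually multiplied per token
    return total, active

# hypothetical budget: 2B shared (attention, embeddings), 5B per expert
total, active = moe_param_counts(shared=2e9, per_expert=5e9, N=8, k=2)
print(total / 1e9, active / 1e9)  # 42.0 12.0
```

Note that the shared (non-expert) parameters are always active, so the total-to-active ratio is somewhat below $N/k$ in practice.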

Dense wins on fine-tuning and transfer

Fine-tuning a dense model is straightforward: all parameters receive gradients from every example. MoE models have uneven parameter utilization during fine-tuning because different experts see different subsets of data. This makes MoE models harder to adapt to narrow downstream tasks where some experts may receive too few updates.

Where Each Fails

Dense fails at extreme scale

Scaling a dense model from 70B to 700B parameters requires 10x more FLOPs per token. Training and inference costs scale linearly with parameter count. At the frontier, this makes dense models prohibitively expensive for a given quality target compared to MoE alternatives.

MoE fails on load balancing

If the router consistently prefers a subset of experts, the remaining experts receive few tokens and learn nothing useful. This wastes capacity. Auxiliary load-balancing losses (penalizing uneven expert utilization) help but add a hyperparameter and can degrade performance if weighted too heavily.

MoE fails on memory and communication

Although per-token FLOPs are low, all NN expert parameter sets must be stored in memory or distributed across devices. Expert parallelism requires all-to-all communication to route tokens to the correct device, which becomes a bottleneck at large scale. A 128-expert model needs expert parameters spread across many GPUs, and the communication overhead can negate the FLOP savings.

Key Assumptions That Differ

|  | Dense | MoE |
| --- | --- | --- |
| Parameters active per token | All | $k/N$ fraction |
| Total parameters | = active parameters | $\gg$ active parameters |
| FLOPs per token | $O(d \cdot d_{\text{ff}})$ | $\approx O(k \cdot d \cdot d_{\text{ff}} / N)$ |
| Memory | Proportional to active params | Proportional to total params |
| Training stability | Standard | Requires load balancing |
| Communication | Data/tensor parallel | All-to-all for expert routing |

The Routing Problem

Theorem

Router Collapse and Load Balancing

Statement

Without auxiliary losses, a learned router $g(x)$ tends to converge to a degenerate solution where a small number of experts handle most tokens. This happens because popular experts receive more gradient signal, improving faster, which makes them even more popular.

The standard fix is an auxiliary load-balancing loss:

$$\mathcal{L}_{\text{aux}} = \alpha \cdot N \cdot \sum_{i=1}^N f_i \cdot p_i$$

where $f_i$ is the fraction of tokens routed to expert $i$ and $p_i$ is the mean router probability for expert $i$. The coefficient $\alpha$ (typically 0.01 to 0.1) controls the strength of the balancing incentive.
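A sketch of this loss in NumPy, assuming top-1 assignment when computing $f_i$ (exact bookkeeping varies across implementations):

```python
import numpy as np

def load_balancing_loss(router_probs, alpha=0.01):
    """L_aux = alpha * N * sum_i f_i * p_i, with top-1 routing.

    router_probs: (T, N) softmax probabilities for T tokens over N experts.
    """
    T, N = router_probs.shape
    assignment = router_probs.argmax(axis=1)        # top-1 expert per token
    f = np.bincount(assignment, minlength=N) / T    # f_i: load fraction
    p = router_probs.mean(axis=0)                   # p_i: mean probability
    return alpha * N * float(np.sum(f * p))

# perfectly balanced routing over N=4 experts yields a loss of about alpha
probs = np.eye(4) * 0.7 + 0.075   # each token prefers a different expert
print(load_balancing_loss(probs, alpha=0.01))
```

With uniform load and uniform mean probabilities the sum evaluates to $1/N$, so the loss bottoms out at $\alpha$; concentrating tokens on fewer experts pushes it higher.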

Intuition

The sum $\sum_i f_i \, p_i$ is minimized when load is uniform ($f_i = 1/N$) and probabilities are uniform ($p_i = 1/N$). The loss gently pushes toward even utilization without forcing rigid assignment.

Failure Mode

Setting $\alpha$ too high forces uniform routing regardless of input content, which wastes the benefit of specialization. Setting $\alpha$ too low allows collapse. The right value is empirical and varies by model scale.

When a Researcher Would Use Each

Example

Training a model under a fixed FLOP budget

Use MoE. If your compute budget allows a 7B-active-parameter model, an MoE with 8 experts of that size gives you roughly 56B total parameters at the same per-token cost. The extra capacity improves quality on knowledge-heavy benchmarks without increasing training FLOPs proportionally.

Example

Deploying a model on a single GPU

Use dense. A dense 7B model fits in 14GB of memory (FP16). An MoE with the same active compute but 8 experts requires storing all 56B parameters, needing at least 112GB. The memory and communication requirements of MoE make single-device deployment impractical for large expert counts.
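The memory figures above come straight from bytes-per-parameter arithmetic, sketched here:

```python
def fp16_gigabytes(n_params):
    """Storage for model weights at 2 bytes per parameter (FP16/BF16)."""
    return n_params * 2 / 1e9

print(fp16_gigabytes(7e9))    # 14.0  -> dense 7B fits on one large GPU
print(fp16_gigabytes(56e9))   # 112.0 -> 8-expert MoE needs several GPUs
```

Quantization to 8 or 4 bits shrinks both numbers proportionally, but the dense-to-MoE ratio stays the same: all expert weights must be resident even though most are idle for any given token.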

Example

Fine-tuning for a specialized domain

Use dense or consider expert-specific fine-tuning strategies. Dense models distribute gradient signal evenly across all parameters. MoE models may leave some experts undertrained on narrow domains. If using MoE, consider fine-tuning only the router and a subset of experts.

Common Confusions

Watch Out

MoE does not reduce total compute, only per-token compute

An MoE model with $N$ experts and $k$ active has total parameters roughly $N/k$ times larger than the equivalent dense model. Training still requires processing gradients for the active experts, and the router adds overhead. The savings come from not activating all experts per token, not from having fewer parameters.

Watch Out

More experts does not always help

Increasing $N$ while keeping $k$ fixed adds parameters but also increases routing difficulty, communication cost, and the risk of expert collapse. Empirically, returns diminish beyond a certain expert count. The optimal $N$ depends on the scale of the model and the available hardware topology.
