
SwiGLU vs. GELU vs. ReLU

ReLU is the simplest activation: zero for negative inputs, identity for positive. GELU applies a smooth, probabilistic gate based on the Gaussian CDF. SwiGLU combines the Swish activation with a gated linear unit, using an extra linear projection to gate the hidden representation. SwiGLU outperforms GELU and ReLU in transformer feed-forward networks at the cost of an extra weight matrix, which is typically offset by shrinking the hidden dimension. LLaMA, PaLM, and Gemma use SwiGLU. GPT-2 and BERT use GELU.

What Each Does

These three activation functions are used in the feed-forward network (FFN) block of transformers. The FFN processes each token independently after the attention layer.

ReLU (Nair and Hinton, 2010):

\text{ReLU}(x) = \max(0, x)

Applied elementwise. The standard FFN is \text{FFN}(x) = W_2 \, \text{ReLU}(W_1 x + b_1) + b_2, where W_1 \in \mathbb{R}^{d_{\text{ff}} \times d} and W_2 \in \mathbb{R}^{d \times d_{\text{ff}}}.
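The standard FFN above can be sketched in a few lines of numpy. The dimensions here are toy values chosen for illustration; real models use, e.g., d = 4096 and d_ff = 4d.

```python
import numpy as np

def relu(x):
    # Elementwise max(0, x)
    return np.maximum(0.0, x)

def ffn_relu(x, W1, b1, W2, b2):
    # Standard transformer FFN: W2 @ ReLU(W1 @ x + b1) + b2
    return W2 @ relu(W1 @ x + b1) + b2

# Toy dimensions for illustration
rng = np.random.default_rng(0)
d, d_ff = 4, 8
W1, b1 = rng.standard_normal((d_ff, d)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d, d_ff)), np.zeros(d)

x = rng.standard_normal(d)
y = ffn_relu(x, W1, b1, W2, b2)
assert y.shape == (d,)  # FFN maps R^d back to R^d
```

The same per-token map is applied to every position in the sequence independently.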

GELU (Hendrycks and Gimpel, 2016):

\text{GELU}(x) = x \cdot \Phi(x)

where \Phi(x) is the Gaussian CDF. The common approximation is \text{GELU}(x) \approx 0.5 x (1 + \tanh[\sqrt{2/\pi}(x + 0.044715 x^3)]). Applied in the same FFN structure as ReLU.
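A quick numeric check of the tanh approximation against the exact form, using the stdlib error function to express the Gaussian CDF (the two agree closely over the practically relevant input range):

```python
import math
import numpy as np

def gelu_exact(x):
    # GELU(x) = x * Phi(x), with Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    erf = np.vectorize(math.erf)
    return x * 0.5 * (1.0 + erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # Common tanh approximation from Hendrycks & Gimpel (2016)
    return 0.5 * x * (1.0 + np.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-5.0, 5.0, 1001)
max_err = np.max(np.abs(gelu_exact(x) - gelu_tanh(x)))
assert max_err < 1e-2  # the approximation tracks the exact form closely
```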

SwiGLU (Shazeer, 2020):

\text{SwiGLU}(x) = W_2 \, (\text{Swish}(W_1 x) \odot W_3 x)

where \text{Swish}(x) = x \cdot \sigma(x) and \sigma is the sigmoid function. This introduces a third weight matrix W_3 \in \mathbb{R}^{d_{\text{ff}} \times d} that produces a gating signal. The elementwise product \odot between the Swish-activated projection and the gate projection is the key operation.
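A minimal numpy sketch of the SwiGLU FFN, with toy dimensions and biases omitted (omitting biases follows common practice in SwiGLU models):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    # Swish(x) = x * sigmoid(x), also known as SiLU
    return x * sigmoid(x)

def ffn_swiglu(x, W1, W2, W3):
    # SwiGLU FFN: W2 @ (Swish(W1 @ x) * (W3 @ x))
    # W1, W3: (d_ff, d) input projections; W2: (d, d_ff) output projection.
    return W2 @ (swish(W1 @ x) * (W3 @ x))

rng = np.random.default_rng(0)
d, d_ff = 4, 8  # toy sizes for illustration
W1 = rng.standard_normal((d_ff, d))
W3 = rng.standard_normal((d_ff, d))  # the extra gating projection
W2 = rng.standard_normal((d, d_ff))

x = rng.standard_normal(d)
y = ffn_swiglu(x, W1, W2, W3)
assert y.shape == (d,)
```

Note the only structural change from the ReLU/GELU FFN is the second input projection W_3 and the elementwise product.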

The Gating Mechanism

The critical difference between SwiGLU and the other two is the gated linear unit (GLU) structure. In a standard FFN, the hidden representation passes through a single nonlinearity. In a GLU variant, the input is projected twice: once through a nonlinear activation and once through a linear (or differently activated) path. The elementwise product of these two projections gates the information flow.

GLU was introduced by Dauphin et al. (2017) for language modeling:

\text{GLU}(x) = (W_1 x + b_1) \odot \sigma(W_2 x + b_2)

SwiGLU replaces the sigmoid gate with Swish, while the other path remains a plain linear projection (biases are typically omitted). The gating allows the network to learn which dimensions of the hidden representation to keep and which to suppress, providing finer control than a fixed nonlinearity applied uniformly.
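The original GLU can be sketched the same way; here the sigmoid path produces a gate in (0, 1) that scales each hidden dimension of the linear path (toy dimensions, for illustration):

```python
import numpy as np

def glu(x, W1, b1, W2, b2):
    # Dauphin et al. (2017) GLU: a linear path scaled by a sigmoid gate
    value = W1 @ x + b1                            # linear path
    gate = 1.0 / (1.0 + np.exp(-(W2 @ x + b2)))    # sigmoid gate in (0, 1)
    return value * gate                            # elementwise gating

rng = np.random.default_rng(1)
d, d_ff = 4, 8
W1, b1 = rng.standard_normal((d_ff, d)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d)), np.zeros(d_ff)

x = rng.standard_normal(d)
out = glu(x, W1, b1, W2, b2)
assert out.shape == (d_ff,)
```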

Side-by-Side Comparison

| Property | ReLU | GELU | SwiGLU |
| --- | --- | --- | --- |
| Formula | \max(0, x) | x \cdot \Phi(x) | \text{Swish}(W_1 x) \odot W_3 x |
| Smoothness | Not smooth at 0 | Smooth everywhere | Smooth everywhere |
| Gated | No | No | Yes (multiplicative gate) |
| Weight matrices in FFN | 2 (W_1, W_2) | 2 (W_1, W_2) | 3 (W_1, W_2, W_3) |
| FFN parameters | 2 \cdot d \cdot d_{\text{ff}} | 2 \cdot d \cdot d_{\text{ff}} | 3 \cdot d \cdot d_{\text{ff}} |
| Common d_{\text{ff}} multiplier | 4d | 4d | \frac{8}{3}d (to match param count) |
| Dead neurons | Yes (zero gradient for x < 0) | Rare (gradient is small but nonzero) | No (gating adapts) |
| Computational cost | Cheapest | Moderate (tanh approximation) | Highest (extra matmul + sigmoid) |
| Used in | Original Transformer (2017) | GPT-2, GPT-3, BERT | LLaMA, PaLM, Gemma, Mistral |
| Year introduced | 2010 | 2016 | 2020 |

Parameter Budget Adjustment

SwiGLU uses three weight matrices instead of two, increasing FFN parameters by 50% at the same hidden dimension d_{\text{ff}}. To keep the total parameter count comparable, practitioners reduce d_{\text{ff}}. The standard approach uses d_{\text{ff}} = \frac{8}{3}d for SwiGLU instead of d_{\text{ff}} = 4d for ReLU/GELU. This gives 3 \cdot d \cdot \frac{8}{3}d = 8d^2 FFN parameters, matching the 2 \cdot d \cdot 4d = 8d^2 of the standard FFN.
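A quick arithmetic check of the budget adjustment, using a hypothetical model width d = 4096 (the counts match exactly up to integer rounding of \frac{8}{3}d):

```python
# Parameter-count check: SwiGLU at d_ff = (8/3) d vs. ReLU/GELU at d_ff = 4 d
d = 4096  # example model width (hypothetical)

relu_params = 2 * d * (4 * d)           # W1 and W2 with d_ff = 4d
swiglu_params = 3 * d * (8 * d // 3)    # W1, W2, W3 with d_ff = (8/3) d, rounded down

# 3 * (8/3) d^2 = 8 d^2 = 2 * 4 d^2, so the counts agree up to rounding
assert abs(relu_params - swiglu_params) <= 3 * d
```

In practice d_{\text{ff}} is then rounded to a hardware-friendly multiple (e.g. of 256), so real counts differ slightly from the ideal 8d^2.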

At equal parameter count, SwiGLU consistently outperforms both ReLU and GELU on language modeling perplexity. Shazeer (2020) showed improvements of 0.05-0.1 perplexity across multiple model sizes, and this advantage has held at much larger scales in LLaMA and PaLM.

Why Gating Helps

The multiplicative interaction in SwiGLU allows different dimensions of the hidden representation to be independently amplified or suppressed based on the input. In a standard ReLU FFN, a dimension is either on (positive pre-activation) or off (negative pre-activation). In GELU, the transition is smoother but still determined by a single scalar. In SwiGLU, the gate W3xW_3 x provides an input-dependent scaling that is independent of the activation path Swish(W1x)\text{Swish}(W_1 x).

This can be understood as conditional computation at the dimension level: the gate learns which features are relevant for the current input and suppresses irrelevant features multiplicatively. The bilinear interaction between two projections of the same input creates a richer function class than a single projection followed by a fixed nonlinearity.

When Each Is Appropriate

ReLU is still used in many vision architectures and in smaller models where its simplicity and speed outweigh the quality gains from gating. ReLU also produces genuinely sparse activations, which can be exploited for faster inference.

GELU is the standard for encoder-only models (BERT family) and for GPT-2/GPT-3 scale models. It provides smooth gradients that help optimization without the parameter overhead of gating.

SwiGLU is the clear choice for large autoregressive language models trained from scratch. The parameter overhead is neutralized by reducing d_{\text{ff}}, and the quality improvement is consistent across scales. Reported comparisons at matched parameter count consistently favor SwiGLU over GELU for language modeling.

Common Confusions

Watch Out

SwiGLU is not just Swish applied to a standard FFN

SwiGLU is a structural change to the FFN, not just swapping the activation function. It adds a third weight matrix and a multiplicative gating interaction. Simply replacing ReLU with Swish in a two-matrix FFN gives a different (and weaker) architecture.

Watch Out

The hidden dimension reduction for SwiGLU is not optional

Without reducing d_{\text{ff}} from 4d to \frac{8}{3}d, SwiGLU uses 50% more parameters in the FFN. Comparisons must be made at matched parameter count. SwiGLU with d_{\text{ff}} = 4d outperforms GELU with d_{\text{ff}} = 4d, but it also uses more parameters. The fair comparison is at equal total parameters.

Watch Out

GELU is not the Gaussian Error function

GELU uses the Gaussian CDF \Phi(x) = P(Z \leq x) for Z \sim \mathcal{N}(0,1), not the error function directly. The relationship is \Phi(x) = \frac{1}{2}[1 + \text{erf}(x/\sqrt{2})]. The approximation used in practice involves \tanh, not \text{erf}.
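The CDF-to-erf relationship can be verified against well-known values of the standard normal CDF using only the stdlib:

```python
import math

def phi(x):
    # Gaussian CDF expressed via the error function:
    # Phi(x) = (1 + erf(x / sqrt(2))) / 2
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Sanity checks against tabulated values of the standard normal CDF
assert abs(phi(0.0) - 0.5) < 1e-12          # median of N(0, 1)
assert abs(phi(1.959963985) - 0.975) < 1e-6  # the familiar 97.5th-percentile z-value
```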

References

  1. Shazeer, N. (2020). "GLU Variants Improve Transformer." arXiv:2002.05202. (SwiGLU and other GLU variants, perplexity comparisons.)
  2. Hendrycks, D. and Gimpel, K. (2016). "Gaussian Error Linear Units (GELUs)." arXiv:1606.08415. (GELU definition and motivation.)
  3. Dauphin, Y. et al. (2017). "Language Modeling with Gated Convolutional Networks." ICML 2017. (Original GLU for language modeling.)
  4. Nair, V. and Hinton, G. E. (2010). "Rectified Linear Units Improve Restricted Boltzmann Machines." ICML 2010. (ReLU introduction.)
  5. Touvron, H. et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971. Section 2.1 (SwiGLU usage with d_{\text{ff}} = \frac{2}{3} \cdot 4d, rounded to a multiple of 256.)
  6. Chowdhery, A. et al. (2022). "PaLM: Scaling Language Modeling with Pathways." arXiv:2204.02311. Section 2 (SwiGLU in PaLM architecture.)
  7. Ramachandran, P. et al. (2017). "Searching for Activation Functions." arXiv:1710.05941. (Swish discovery via neural architecture search.)