
SwiGLU vs. GELU vs. ReLU

ReLU is the simplest activation: zero for negative inputs, identity for positive. GELU applies a smooth, probabilistic gate based on the Gaussian CDF. SwiGLU combines the Swish activation with a gated linear unit, using an extra linear projection to gate the hidden representation. SwiGLU outperforms GELU and ReLU in transformer feed-forward networks at the cost of an extra weight matrix, which is typically offset by shrinking the hidden dimension. LLaMA, PaLM, and Gemma use SwiGLU. GPT-2 and BERT use GELU.

What Each Does

These three activation functions are used in the feed-forward network (FFN) block of transformers. The FFN processes each token independently after the attention layer.

ReLU (Nair and Hinton, 2010):

\text{ReLU}(x) = \max(0, x)

Applied elementwise. The standard FFN is \text{FFN}(x) = W_2 \, \text{ReLU}(W_1 x + b_1) + b_2, where W_1 \in \mathbb{R}^{d_{\text{ff}} \times d} and W_2 \in \mathbb{R}^{d \times d_{\text{ff}}}.
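The standard FFN above can be sketched in a few lines of numpy. The dimensions here are toy values chosen for illustration; real models use, e.g., d = 4096 and d_ff = 4d.

```python
import numpy as np

def relu(x):
    # Elementwise max(0, x)
    return np.maximum(0.0, x)

def ffn_relu(x, W1, b1, W2, b2):
    # Standard transformer FFN: W2 @ ReLU(W1 @ x + b1) + b2
    return W2 @ relu(W1 @ x + b1) + b2

# Toy dimensions for illustration
rng = np.random.default_rng(0)
d, d_ff = 4, 8
W1, b1 = rng.standard_normal((d_ff, d)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d, d_ff)), np.zeros(d)

x = rng.standard_normal(d)
y = ffn_relu(x, W1, b1, W2, b2)
assert y.shape == (d,)  # FFN maps R^d back to R^d
```

The same per-token map is applied to every position in the sequence independently.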

GELU (Hendrycks and Gimpel, 2016):

\text{GELU}(x) = x \cdot \Phi(x)

where \Phi(x) is the Gaussian CDF. The common approximation is \text{GELU}(x) \approx 0.5 x (1 + \tanh[\sqrt{2/\pi}(x + 0.044715 x^3)]). Applied in the same FFN structure as ReLU.
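A quick numeric check of the tanh approximation against the exact form, using the stdlib error function to express the Gaussian CDF (the two agree closely over the practically relevant input range):

```python
import math
import numpy as np

def gelu_exact(x):
    # GELU(x) = x * Phi(x), with Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    erf = np.vectorize(math.erf)
    return x * 0.5 * (1.0 + erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # Common tanh approximation from Hendrycks & Gimpel (2016)
    return 0.5 * x * (1.0 + np.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-5.0, 5.0, 1001)
max_err = np.max(np.abs(gelu_exact(x) - gelu_tanh(x)))
assert max_err < 1e-2  # the approximation tracks the exact form closely
```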

SwiGLU (Shazeer, 2020):

\text{SwiGLU}(x) = W_2 \, (\text{Swish}(W_1 x) \odot W_3 x)

where \text{Swish}(x) = x \cdot \sigma(x) and \sigma is the sigmoid function. This introduces a third weight matrix W_3 \in \mathbb{R}^{d_{\text{ff}} \times d} that produces a gating signal. The elementwise product \odot between the Swish-activated projection and the gate projection is the key operation.
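A minimal numpy sketch of the SwiGLU FFN, with toy dimensions and biases omitted (omitting biases follows common practice in SwiGLU models):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    # Swish(x) = x * sigmoid(x), also known as SiLU
    return x * sigmoid(x)

def ffn_swiglu(x, W1, W2, W3):
    # SwiGLU FFN: W2 @ (Swish(W1 @ x) * (W3 @ x))
    # W1, W3: (d_ff, d) input projections; W2: (d, d_ff) output projection.
    return W2 @ (swish(W1 @ x) * (W3 @ x))

rng = np.random.default_rng(0)
d, d_ff = 4, 8  # toy sizes for illustration
W1 = rng.standard_normal((d_ff, d))
W3 = rng.standard_normal((d_ff, d))  # the extra gating projection
W2 = rng.standard_normal((d, d_ff))

x = rng.standard_normal(d)
y = ffn_swiglu(x, W1, W2, W3)
assert y.shape == (d,)
```

Note the only structural change from the ReLU/GELU FFN is the second input projection W_3 and the elementwise product.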

The Gating Mechanism

The critical difference between SwiGLU and the other two is the gated linear unit (GLU) structure. In a standard FFN, the hidden representation passes through a single nonlinearity. In a GLU variant, the input is projected twice: once through a nonlinear activation and once through a linear (or differently activated) path. The elementwise product of these two projections gates the information flow.

GLU was introduced by Dauphin et al. (2017) for language modeling:

\text{GLU}(x) = (W_1 x + b_1) \odot \sigma(W_2 x + b_2)

SwiGLU replaces the sigmoid gate with Swish, while the other path remains a plain linear projection (biases are typically omitted). The gating allows the network to learn which dimensions of the hidden representation to keep and which to suppress, providing finer control than a fixed nonlinearity applied uniformly.
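The original GLU can be sketched the same way; here the sigmoid path produces a gate in (0, 1) that scales each hidden dimension of the linear path (toy dimensions, for illustration):

```python
import numpy as np

def glu(x, W1, b1, W2, b2):
    # Dauphin et al. (2017) GLU: a linear path scaled by a sigmoid gate
    value = W1 @ x + b1                            # linear path
    gate = 1.0 / (1.0 + np.exp(-(W2 @ x + b2)))    # sigmoid gate in (0, 1)
    return value * gate                            # elementwise gating

rng = np.random.default_rng(1)
d, d_ff = 4, 8
W1, b1 = rng.standard_normal((d_ff, d)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d)), np.zeros(d_ff)

x = rng.standard_normal(d)
out = glu(x, W1, b1, W2, b2)
assert out.shape == (d_ff,)
```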

Side-by-Side Comparison

| Property | ReLU | GELU | SwiGLU |
| --- | --- | --- | --- |
| Formula | \max(0, x) | x \cdot \Phi(x) | \text{Swish}(W_1 x) \odot W_3 x |
| Smoothness | Not smooth at 0 | Smooth everywhere | Smooth everywhere |
| Gated | No | No | Yes (multiplicative gate) |
| Weight matrices in FFN | 2 (W_1, W_2) | 2 (W_1, W_2) | 3 (W_1, W_2, W_3) |
| FFN parameters | 2 \cdot d \cdot d_{\text{ff}} | 2 \cdot d \cdot d_{\text{ff}} | 3 \cdot d \cdot d_{\text{ff}} |
| Common d_{\text{ff}} multiplier | 4d | 4d | \frac{8}{3}d (to match param count) |
| Dead neurons | Yes (zero gradient for x < 0) | Rare (gradient is small but nonzero) | No (gating adapts) |
| Computational cost | Cheapest | Moderate (tanh approximation) | Highest (extra matmul + sigmoid) |
| Used in | Original Transformer (2017) | GPT-2, GPT-3, BERT | LLaMA, PaLM, Gemma, Mistral |
| Year introduced | 2010 | 2016 | 2020 |

Parameter Budget Adjustment

SwiGLU uses three weight matrices instead of two, increasing FFN parameters by 50% at the same hidden dimension d_{\text{ff}}. To keep the total parameter count comparable, practitioners reduce d_{\text{ff}}. The standard approach uses d_{\text{ff}} = \frac{8}{3}d for SwiGLU instead of d_{\text{ff}} = 4d for ReLU/GELU. This gives 3 \cdot d \cdot \frac{8}{3}d = 8d^2 FFN parameters, matching the 2 \cdot d \cdot 4d = 8d^2 of the standard FFN.
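A quick arithmetic check of the budget adjustment, using a hypothetical model width d = 4096 (the counts match exactly up to integer rounding of \frac{8}{3}d):

```python
# Parameter-count check: SwiGLU at d_ff = (8/3) d vs. ReLU/GELU at d_ff = 4 d
d = 4096  # example model width (hypothetical)

relu_params = 2 * d * (4 * d)           # W1 and W2 with d_ff = 4d
swiglu_params = 3 * d * (8 * d // 3)    # W1, W2, W3 with d_ff = (8/3) d, rounded down

# 3 * (8/3) d^2 = 8 d^2 = 2 * 4 d^2, so the counts agree up to rounding
assert abs(relu_params - swiglu_params) <= 3 * d
```

In practice d_{\text{ff}} is then rounded to a hardware-friendly multiple (e.g. of 256), so real counts differ slightly from the ideal 8d^2.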

At equal parameter count, SwiGLU consistently outperforms both ReLU and GELU on language modeling perplexity. Shazeer (2020) showed improvements of 0.05-0.1 perplexity across multiple model sizes, and this advantage has held at much larger scales in LLaMA and PaLM.

Why Gating Helps

The multiplicative interaction in SwiGLU allows different dimensions of the hidden representation to be independently amplified or suppressed based on the input. In a standard ReLU FFN, a dimension is either on (positive pre-activation) or off (negative pre-activation). In GELU, the transition is smoother but still determined by a single scalar. In SwiGLU, the gate W3xW_3 x provides an input-dependent scaling that is independent of the activation path Swish(W1x)\text{Swish}(W_1 x).

This can be understood as conditional computation at the dimension level: the gate learns which features are relevant for the current input and suppresses irrelevant features multiplicatively. The bilinear interaction between two projections of the same input creates a richer function class than a single projection followed by a fixed nonlinearity.

When Each Is Appropriate

ReLU is still used in many vision architectures and in smaller models where its simplicity and speed outweigh the quality gains from gating. ReLU also produces genuinely sparse activations, which can be exploited for faster inference.

GELU is the standard for encoder-only models (BERT family) and for GPT-2/GPT-3 scale models. It provides smooth gradients that help optimization without the parameter overhead of gating.

SwiGLU is the clear choice for large autoregressive language models trained from scratch. The parameter overhead is neutralized by reducing d_{\text{ff}}, and the quality improvement is consistent across scales. Reported comparisons at matched parameter count consistently favor SwiGLU over GELU for language modeling.

Common Confusions

Watch Out

SwiGLU is not just Swish applied to a standard FFN

SwiGLU is a structural change to the FFN, not just swapping the activation function. It adds a third weight matrix and a multiplicative gating interaction. Simply replacing ReLU with Swish in a two-matrix FFN gives a different (and weaker) architecture.

Watch Out

The hidden dimension reduction for SwiGLU is not optional

Without reducing d_{\text{ff}} from 4d to \frac{8}{3}d, SwiGLU uses 50% more parameters in the FFN. Comparisons must be made at matched parameter count. SwiGLU with d_{\text{ff}} = 4d outperforms GELU with d_{\text{ff}} = 4d, but it also uses more parameters. The fair comparison is at equal total parameters.

Watch Out

GELU is not the Gaussian Error function

GELU uses the Gaussian CDF \Phi(x) = P(Z \leq x) for Z \sim \mathcal{N}(0,1), not the error function directly. The relationship is \Phi(x) = \frac{1}{2}[1 + \text{erf}(x/\sqrt{2})]. The approximation used in practice involves \tanh, not \text{erf}.
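The CDF-to-erf relationship can be verified against well-known values of the standard normal CDF using only the stdlib:

```python
import math

def phi(x):
    # Gaussian CDF expressed via the error function:
    # Phi(x) = (1 + erf(x / sqrt(2))) / 2
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Sanity checks against tabulated values of the standard normal CDF
assert abs(phi(0.0) - 0.5) < 1e-12          # median of N(0, 1)
assert abs(phi(1.959963985) - 0.975) < 1e-6  # the familiar 97.5th-percentile z-value
```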

References

  1. Shazeer, N. (2020). "GLU Variants Improve Transformer." arXiv:2002.05202. (SwiGLU and other GLU variants, perplexity comparisons.)
  2. Hendrycks, D. and Gimpel, K. (2016). "Gaussian Error Linear Units (GELUs)." arXiv:1606.08415. (GELU definition and motivation.)
  3. Dauphin, Y. et al. (2017). "Language Modeling with Gated Convolutional Networks." ICML 2017. (Original GLU for language modeling.)
  4. Nair, V. and Hinton, G. E. (2010). "Rectified Linear Units Improve Restricted Boltzmann Machines." ICML 2010. (ReLU introduction.)
  5. Touvron, H. et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971. Section 2.1 (SwiGLU usage with d_{\text{ff}} = \frac{2}{3} \cdot 4d, rounded to a multiple of 256.)
  6. Chowdhery, A. et al. (2022). "PaLM: Scaling Language Modeling with Pathways." arXiv:2204.02311. Section 2 (SwiGLU in PaLM architecture.)
  7. Ramachandran, P. et al. (2017). "Searching for Activation Functions." arXiv:1710.05941. (Swish discovery via neural architecture search.)