What Each Does
These three activation functions are used in the feed-forward network (FFN) block of transformers. The FFN processes each token independently after the attention layer.
ReLU (Nair and Hinton, 2010):
Applied elementwise: $\mathrm{ReLU}(x) = \max(0, x)$. The standard FFN is $\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$, where $W_1 \in \mathbb{R}^{d \times d_{ff}}$ and $W_2 \in \mathbb{R}^{d_{ff} \times d}$.
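As a concrete sketch, here is a minimal NumPy version of this FFN (the `relu_ffn` name and the tiny dimensions are illustrative, not from any particular library):

```python
import numpy as np

def relu_ffn(x, W1, b1, W2, b2):
    """Standard transformer FFN: ReLU between two linear projections."""
    h = np.maximum(0.0, x @ W1 + b1)  # project d -> d_ff, then ReLU elementwise
    return h @ W2 + b2                # project d_ff -> d

# Tiny example: d = 4, d_ff = 16 (the conventional 4x expansion).
rng = np.random.default_rng(0)
d, d_ff = 4, 16
x = rng.standard_normal(d)
W1, b1 = rng.standard_normal((d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d)), np.zeros(d)
y = relu_ffn(x, W1, b1, W2, b2)
```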
GELU (Hendrycks and Gimpel, 2016):
$\mathrm{GELU}(x) = x\,\Phi(x)$, where $\Phi$ is the Gaussian CDF. The common approximation is $\mathrm{GELU}(x) \approx 0.5x\left(1 + \tanh\!\left(\sqrt{2/\pi}\,(x + 0.044715x^3)\right)\right)$. Applied in the same FFN structure as ReLU.
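The exact form and the tanh approximation can be compared numerically; this sketch (function names chosen for illustration) measures their largest disagreement over a range of inputs:

```python
import math
import numpy as np

def gelu_exact(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF (via erf)."""
    return x * 0.5 * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Common tanh approximation from Hendrycks & Gimpel (2016)."""
    return 0.5 * x * (1.0 + np.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

xs = np.linspace(-4.0, 4.0, 101)
max_err = float(np.max(np.abs(gelu_exact(xs) - gelu_tanh(xs))))
```

The approximation is close enough that many frameworks use it by default for speed.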
SwiGLU (Shazeer, 2020):
$\mathrm{SwiGLU}(x) = \mathrm{Swish}(xW_1) \otimes xV$, where $\mathrm{Swish}(x) = x\,\sigma(x)$ and $\sigma$ is the sigmoid function. This introduces a third weight matrix that produces a gating signal; the full FFN is $\mathrm{FFN}_{\mathrm{SwiGLU}}(x) = \left(\mathrm{Swish}(xW_1) \otimes xV\right) W_2$. The elementwise product between the Swish-activated projection and the gate projection is the key operation.
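A minimal NumPy sketch of the three-matrix SwiGLU FFN (the names `swiglu_ffn` and `swish` are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def swish(z):
    """Swish-1 (also called SiLU): z * sigmoid(z)."""
    return z * sigmoid(z)

def swiglu_ffn(x, W1, V, W2):
    """SwiGLU FFN: elementwise product of a Swish-activated projection
    and a linear gate projection, followed by the output projection."""
    return (swish(x @ W1) * (x @ V)) @ W2

rng = np.random.default_rng(1)
d, d_ff = 6, 16
x = rng.standard_normal(d)
W1, V, W2 = (rng.standard_normal(s) for s in [(d, d_ff), (d, d_ff), (d_ff, d)])
y = swiglu_ffn(x, W1, V, W2)
```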
The Gating Mechanism
The critical difference between SwiGLU and the other two is the gated linear unit (GLU) structure. In a standard FFN, the hidden representation passes through a single nonlinearity. In a GLU variant, the input is projected twice: once through a nonlinear activation and once through a linear (or differently activated) path. The elementwise product of these two projections gates the information flow.
GLU was introduced by Dauphin et al. (2017) for language modeling: $\mathrm{GLU}(x) = (xW + b) \otimes \sigma(xV + c)$.
SwiGLU replaces the sigmoid gate with a Swish activation while keeping the second path as a plain linear projection. The gating allows the network to learn which dimensions of the hidden representation to keep and which to suppress, providing finer control than a fixed nonlinearity applied uniformly.
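The gating idea can be demonstrated directly with the original sigmoid-gated GLU; in this sketch (dimensions chosen for illustration), each hidden dimension is scaled by a gate value strictly between 0 and 1:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glu(x, W, b, V, c):
    """GLU (Dauphin et al., 2017): a linear path modulated elementwise
    by a sigmoid gate computed from a second projection of the same input."""
    return (x @ W + b) * sigmoid(x @ V + c)

rng = np.random.default_rng(2)
d, d_ff = 4, 8
x = rng.standard_normal(d)
W, V = rng.standard_normal((d, d_ff)), rng.standard_normal((d, d_ff))
b, c = np.zeros(d_ff), np.zeros(d_ff)
out = glu(x, W, b, V, c)
gate = sigmoid(x @ V + c)  # every entry lies in (0, 1): per-dimension suppression
```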
Side-by-Side Comparison
| Property | ReLU | GELU | SwiGLU |
|---|---|---|---|
| Formula | $\max(0, x)$ | $x\,\Phi(x)$ | $\mathrm{Swish}(xW_1) \otimes xV$ |
| Smoothness | Not smooth at 0 | Smooth everywhere | Smooth everywhere |
| Gated | No | No | Yes (multiplicative gate) |
| Weight matrices in FFN | 2 ($W_1$, $W_2$) | 2 ($W_1$, $W_2$) | 3 ($W_1$, $V$, $W_2$) |
| FFN parameters | $2\,d\,d_{ff}$ | $2\,d\,d_{ff}$ | $3\,d\,d_{ff}$ |
| Common $d_{ff}$ multiplier | $4d$ | $4d$ | $\frac{8}{3}d$ (to match param count) |
| Dead neurons | Yes (zero gradient for $x < 0$) | Rare (gradient is small but nonzero) | No (gating adapts) |
| Computational cost | Cheapest | Moderate (tanh approximation) | Highest (extra matmul + sigmoid) |
| Used in | Original Transformer (2017) | GPT-2, GPT-3, BERT | LLaMA, PaLM, Gemma, Mistral |
| Year introduced | 2010 | 2016 | 2020 |
Parameter Budget Adjustment
SwiGLU uses three weight matrices instead of two, increasing FFN parameters by 50% at the same hidden dimension $d_{ff}$. To keep the total parameter count comparable, practitioners reduce $d_{ff}$. The standard approach uses $d_{ff} = \frac{8}{3}d$ for SwiGLU instead of $d_{ff} = 4d$ for ReLU/GELU. This gives:
- ReLU/GELU FFN parameters: $2 \cdot d \cdot 4d = 8d^2$
- SwiGLU FFN parameters: $3 \cdot d \cdot \frac{8}{3}d = 8d^2$
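A quick arithmetic check of this budget (the `ffn_params` helper and the choice of $d = 4096$ are illustrative; in practice $d_{ff}$ is rounded to an integer, so the counts match only approximately):

```python
def ffn_params(d, d_ff, n_matrices):
    """FFN parameter count, biases ignored: each matrix is d x d_ff or d_ff x d."""
    return n_matrices * d * d_ff

d = 4096  # hypothetical model dimension
relu_gelu_params = ffn_params(d, 4 * d, 2)      # 2 * d * 4d = 8 d^2
swiglu_params = ffn_params(d, (8 * d) // 3, 3)  # 3 * d * floor((8/3)d), roughly 8 d^2
```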
At equal parameter count, SwiGLU consistently outperforms both ReLU and GELU on language modeling perplexity. Shazeer (2020) showed improvements of 0.05-0.1 perplexity across multiple model sizes, and this advantage has held at the 7B-70B scale in LLaMA and PaLM.
Why Gating Helps
The multiplicative interaction in SwiGLU allows different dimensions of the hidden representation to be independently amplified or suppressed based on the input. In a standard ReLU FFN, a dimension is either on (positive pre-activation) or off (negative pre-activation). In GELU, the transition is smoother but still determined by a single scalar. In SwiGLU, the gate $xV$ provides an input-dependent scaling computed through a separate projection from the activation path $\mathrm{Swish}(xW_1)$.
This can be understood as conditional computation at the dimension level: the gate learns which features are relevant for the current input and suppresses irrelevant features multiplicatively. The bilinear interaction between two projections of the same input creates a richer function class than a single projection followed by a fixed nonlinearity.
When Each Is Appropriate
ReLU is still used in vision transformers (ViT, DeiT) and in smaller models where the simplicity and speed of ReLU outweigh the quality gains from gating. ReLU also produces genuinely sparse activations, which can be exploited for faster inference.
GELU is the standard for encoder-only models (BERT family) and for GPT-2/GPT-3 scale models. It provides smooth gradients that help optimization without the parameter overhead of gating.
SwiGLU is the clear choice for large autoregressive language models trained from scratch. The parameter overhead is neutralized by reducing , and the quality improvement is consistent across scales. There is no known case where SwiGLU underperforms GELU at matched parameter count for language modeling.
Common Confusions
SwiGLU is not just Swish applied to a standard FFN
SwiGLU is a structural change to the FFN, not just swapping the activation function. It adds a third weight matrix and a multiplicative gating interaction. Simply replacing ReLU with Swish in a two-matrix FFN gives a different (and weaker) architecture.
The hidden dimension reduction for SwiGLU is not optional
Without reducing $d_{ff}$ from $4d$ to $\frac{8}{3}d$, SwiGLU uses 50% more parameters in the FFN. Comparisons must be made at matched parameter count. SwiGLU with $d_{ff} = 4d$ outperforms GELU with $d_{ff} = 4d$, but it also uses more parameters. The fair comparison is at equal total parameters.
GELU is not the Gaussian Error function
GELU uses the Gaussian CDF $\Phi$, not the error function directly. The relationship is $\Phi(x) = \frac{1}{2}\left(1 + \mathrm{erf}(x/\sqrt{2})\right)$. The approximation used in practice involves $\tanh$, not $\mathrm{erf}$.
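A small pure-Python check of this relationship (helper names assumed for illustration), contrasting the erf-based exact form with the tanh-based approximation:

```python
import math

def phi(x):
    """Standard normal CDF via the error function: Phi(x) = (1 + erf(x/sqrt(2))) / 2."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_exact(x):
    """Exact GELU uses the Gaussian CDF Phi, built from erf as above."""
    return x * phi(x)

def gelu_tanh(x):
    """The practical approximation: note that tanh, not erf, appears here."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))
```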
References
- Shazeer, N. (2020). "GLU Variants Improve Transformer." arXiv:2002.05202. (SwiGLU and other GLU variants, perplexity comparisons.)
- Hendrycks, D. and Gimpel, K. (2016). "Gaussian Error Linear Units (GELUs)." arXiv:1606.08415. (GELU definition and motivation.)
- Dauphin, Y. et al. (2017). "Language Modeling with Gated Convolutional Networks." ICML 2017. (Original GLU for language modeling.)
- Nair, V. and Hinton, G. E. (2010). "Rectified Linear Units Improve Restricted Boltzmann Machines." ICML 2010. (ReLU introduction.)
- Touvron, H. et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971. Section 2.1 (SwiGLU usage with $d_{ff} = \frac{2}{3} \cdot 4d$ rounded to the nearest multiple of 256.)
- Chowdhery, A. et al. (2022). "PaLM: Scaling Language Modeling with Pathways." arXiv:2204.02311. Section 2 (SwiGLU in PaLM architecture.)
- Ramachandran, P. et al. (2017). "Searching for Activation Functions." arXiv:1710.05941. (Swish discovery via neural architecture search.)