

Activation Functions

Nonlinear activation functions in neural networks: sigmoid, tanh, ReLU, Leaky ReLU, GELU, and SiLU. Their gradients, saturation behavior, and impact on trainability.


Why This Matters

[Interactive figure: activation value $f(z)$ and gradient $f'(z)$ for ReLU, Sigmoid, and GELU over $z \in [-3, 3]$. At $z = 1.5$: ReLU $f = 1.500$, $f' = 1.000$; Sigmoid $f = 0.818$, $f' = 0.149$; GELU $f = 1.400$, $f' = 1.128$. ReLU's negative half-line is a dead zone with gradient $0$.]

Without nonlinear activations, a feedforward network of any depth computes only an affine function. A composition of affine maps is affine. The activation function is what gives neural networks their representational power.

The choice of activation directly controls gradient flow during backpropagation. Sigmoid and tanh saturate for large inputs, driving gradients toward zero and stalling learning in deep networks. ReLU solved this for positive inputs but introduced dead neurons. Modern activations like GELU and SiLU provide smooth alternatives with better empirical performance in transformers. Proper weight initialization and batch normalization also help control gradient magnitude.

Formal Setup

Consider a feedforward network with $L$ layers. Layer $l$ computes:

$$z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}, \quad a^{(l)} = \sigma(z^{(l)})$$

where $\sigma$ is the activation function applied elementwise, $W^{(l)}$ is the weight matrix, $b^{(l)}$ is the bias vector, $z^{(l)}$ is the pre-activation, and $a^{(l)}$ is the post-activation.

The gradient of the loss with respect to $z^{(l)}$ depends on $\sigma'(z^{(l)})$. If $\sigma'$ is small, gradients vanish. If $\sigma'$ is large, gradients explode.

Core Definitions

Definition

Sigmoid

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Output range: $(0, 1)$. Derivative: $\sigma'(x) = \sigma(x)(1 - \sigma(x))$. Maximum derivative is $1/4$, at $x = 0$.
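The derivative identity and the $1/4$ bound can be checked numerically; a minimal NumPy sketch (function names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.linspace(-10.0, 10.0, 10001)  # dense grid that includes x = 0
grads = sigmoid_grad(xs)
print(grads.max())  # 0.25, attained at x = 0
```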

Definition

Hyperbolic Tangent

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = 2\sigma(2x) - 1$$

Output range: $(-1, 1)$. Zero-centered. Derivative: $\tanh'(x) = 1 - \tanh^2(x)$. Maximum derivative is $1$, at $x = 0$.
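The identity $\tanh(x) = 2\sigma(2x) - 1$ and the derivative bound can be verified directly; a small NumPy sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 101)  # grid includes x = 0

# identity from the definition: tanh(x) = 2*sigmoid(2x) - 1
assert np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0)

# derivative identity: tanh'(x) = 1 - tanh(x)^2, maximal at x = 0
grad = 1.0 - np.tanh(x) ** 2
print(grad.max())  # 1.0
```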

Definition

ReLU

$$\text{ReLU}(x) = \max(0, x)$$

Output range: $[0, \infty)$. Derivative: $1$ for $x > 0$, $0$ for $x < 0$, undefined at $x = 0$ (conventionally set to $0$).

Definition

Leaky ReLU

$$\text{LeakyReLU}(x) = \max(\alpha x, x)$$

where $\alpha$ is a small positive constant, typically $\alpha = 0.01$. Derivative: $1$ for $x > 0$, $\alpha$ for $x < 0$.
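The gradient difference between ReLU and Leaky ReLU on negative inputs is easy to see numerically (a sketch; the $x = 0$ convention follows the definitions above):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # convention: derivative taken as 0 at x = 0
    return (x > 0).astype(float)

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu(x))              # [0.  0.  0.5 2. ]
print(relu_grad(x))         # [0. 0. 1. 1.]      zero gradient on the negative side
print(leaky_relu(x))        # [-0.02 -0.005 0.5 2.]
print(leaky_relu_grad(x))   # [0.01 0.01 1. 1.]  never exactly zero
```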

Definition

GELU

$$\text{GELU}(x) = x \cdot \Phi(x)$$

where $\Phi(x)$ is the standard normal CDF. A smooth function that weights inputs by their percentile under a Gaussian. Used in BERT, GPT-2, and most modern transformers.

Exact form: $x \cdot \Phi(x)$, with $\Phi$ the Gaussian CDF. Fast approximation (Hendrycks & Gimpel 2016):

$$\text{GELU}(x) \approx 0.5x\left(1 + \tanh\left[\sqrt{2/\pi}\left(x + 0.044715\,x^3\right)\right]\right)$$

Frameworks sometimes default to the approximation for speed; numerical outputs differ around the third decimal place.
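The exact and approximate forms can be compared with only the standard library, using $\Phi(x) = \tfrac{1}{2}\bigl(1 + \mathrm{erf}(x/\sqrt{2})\bigr)$; a sketch:

```python
import math

def gelu_exact(x):
    # x * Phi(x), with Phi the standard normal CDF via erf
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # Hendrycks & Gimpel tanh approximation
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  exact={gelu_exact(x):+.6f}  approx={gelu_tanh(x):+.6f}")
```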

Definition

SiLU / Swish

$$\text{SiLU}(x) = x \cdot \sigma(x)$$

where $\sigma$ is the sigmoid function. Smooth and non-monotone: it dips below zero for negative inputs, reaching its minimum of about $-0.28$ near $x \approx -1.28$. Used in architectures like EfficientNet and LLaMA.
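A grid search confirms the dip (a sketch; the minimizer has no closed form):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))  # x * sigmoid(x)

xs = np.linspace(-5.0, 0.0, 50001)  # fine grid over the negative half-line
ys = silu(xs)
i = ys.argmin()
print(xs[i], ys[i])  # minimum near x = -1.28, value near -0.28
```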

Definition

SwiGLU

$$\text{SwiGLU}(x, W, V) = \text{SiLU}(xW) \otimes (xV)$$

Gated linear unit variant (Shazeer 2020) built from SiLU and an elementwise-product gate. The feedforward block becomes $(\text{SiLU}(xW) \otimes (xV)) W_{\text{out}}$, which adds a third weight matrix relative to a plain MLP of the same hidden width, so implementations typically shrink the gated hidden width to $2/3$ of the standard size to match the parameter count and FLOPs of a standard FFN. Used in LLaMA 1/2/3, Mistral, Gemma, and PaLM. It is included here because it is the FFN nonlinearity in most modern open-weight LLMs.
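A shape-level sketch of the gated block (all dimensions and the $2/3$ sizing rule here are illustrative, not taken from any particular model):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, W, V, W_out):
    # (SiLU(xW) ⊗ (xV)) W_out, with ⊗ the elementwise product
    return (silu(x @ W) * (x @ V)) @ W_out

rng = np.random.default_rng(0)
d_model = 8
d_hidden = int(4 * d_model * 2 / 3)  # shrink the usual 4x width to 2/3 to match FLOPs

W = rng.normal(size=(d_model, d_hidden))
V = rng.normal(size=(d_model, d_hidden))
W_out = rng.normal(size=(d_hidden, d_model))

x = rng.normal(size=(2, d_model))        # a batch of 2 token embeddings
print(swiglu_ffn(x, W, V, W_out).shape)  # (2, 8)
```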

Main Theorems

Theorem

Nonlinearity Is Necessary for Universal Approximation

Statement

If $\sigma(x) = x$ (the identity function), then a feedforward network of any depth $L$ computes an affine function $f(x) = Ax + b$ for some matrix $A$ and vector $b$. The network cannot approximate arbitrary continuous functions.

Intuition

Each layer computes $W^{(l)} a^{(l-1)} + b^{(l)}$. Composing $L$ such maps gives $W^{(L)} W^{(L-1)} \cdots W^{(1)} x + \tilde{b}$, which is a single affine map. Depth without nonlinearity adds no representational power.

Proof Sketch

By induction on $L$. Base case: a single affine layer is affine. Inductive step: if $f_{L-1}(x) = A_{L-1}x + b_{L-1}$, then $f_L(x) = W^{(L)}(A_{L-1}x + b_{L-1}) + b^{(L)} = (W^{(L)} A_{L-1})x + (W^{(L)} b_{L-1} + b^{(L)})$, which is affine.
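The collapse can be checked numerically for three random affine layers (a small sketch of the induction):

```python
import numpy as np

rng = np.random.default_rng(1)
Ws = [rng.normal(size=(4, 4)) for _ in range(3)]
bs = [rng.normal(size=4) for _ in range(3)]

def forward(x):
    # three layers with the identity "activation"
    for W, b in zip(Ws, bs):
        x = W @ x + b
    return x

# the composition is the single affine map A x + c
A = Ws[2] @ Ws[1] @ Ws[0]
c = Ws[2] @ (Ws[1] @ bs[0] + bs[1]) + bs[2]

x = rng.normal(size=4)
print(np.allclose(forward(x), A @ x + c))  # True
```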

Why It Matters

This result explains why activation functions exist at all. The universal approximation theorem (Cybenko 1989, Hornik 1991) shows that a single hidden layer with a non-polynomial activation can approximate any continuous function on a compact set. Without nonlinearity, no amount of depth helps.

Failure Mode

The result says nonlinearity is necessary but says nothing about which nonlinearity is best. All non-polynomial activations yield universal approximation in theory. The practical differences are about gradient flow and optimization, not expressiveness.

Gradient Flow Analysis

Sigmoid and Tanh: Vanishing Gradients

During backpropagation through $L$ layers, the gradient includes a product of activation derivatives:

$$\frac{\partial \mathcal{L}}{\partial z^{(1)}} \propto \prod_{l=1}^{L} \sigma'(z^{(l)})$$

For sigmoid, $\sigma'(x) \leq 1/4$ everywhere. After $L$ layers, the gradient magnitude is at most $(1/4)^L$, which vanishes exponentially. With $L = 10$ layers, the gradient shrinks by a factor of at least $10^6$.
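A quick simulation of the product bound (pre-activations drawn at random; the numbers are illustrative):

```python
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

rng = np.random.default_rng(0)
for L in (5, 10, 20):
    z = rng.normal(size=L)   # one pre-activation per layer
    bound = 0.25 ** L        # the (1/4)^L upper bound
    sampled = float(np.prod(sigmoid_grad(z)))
    print(f"L={L:2d}  bound={bound:.1e}  sampled product={sampled:.1e}")
```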

Tanh is better: $\tanh'(0) = 1$, so near-zero pre-activations pass gradients through without attenuation. But for large $|z|$, $\tanh'(z) \to 0$ and the same saturation problem occurs.

ReLU: Constant Gradients, But Dead Neurons

ReLU has derivative $1$ for positive inputs. No saturation, no vanishing gradient for active neurons. This is why ReLU enabled training of deep networks (Krizhevsky et al. 2012).

The problem: if $z < 0$, the gradient is exactly $0$. A neuron whose pre-activation is negative for all training examples receives zero gradient updates and never recovers. This is the dying ReLU problem.
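A dead neuron in miniature (the weights and bias are contrived to keep every pre-activation negative):

```python
import numpy as np

def relu_grad(z):
    return (z > 0).astype(float)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))    # 100 training examples, 3 features
w = np.array([-0.1, -0.1, -0.1])
b = -5.0                         # large negative bias

z = X @ w + b                    # pre-activation on every example
print((z < 0).all())             # True: negative on the whole dataset
print(relu_grad(z).sum())        # 0.0: the neuron gets no gradient signal at all
```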

Leaky ReLU: No Dead Neurons

By setting $\sigma'(x) = \alpha > 0$ for $x < 0$, Leaky ReLU ensures every neuron receives a nonzero gradient. In practice, $\alpha = 0.01$ is standard. Parametric ReLU (PReLU) learns $\alpha$ from data.

Canonical Examples

Example

Sigmoid saturation in a 5-layer network

Consider a network with sigmoid activations and all pre-activations initialized at $z = 3$. Then $\sigma'(3) = \sigma(3)(1 - \sigma(3)) \approx 0.953 \times 0.047 \approx 0.045$. The gradient through 5 layers is attenuated by a factor of $(0.045)^5 \approx 1.8 \times 10^{-7}$. Training will stall in the early layers.

With ReLU and positive pre-activations, the gradient factor is $1^5 = 1$. No attenuation.
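The arithmetic in this example, checked with the standard library:

```python
import math

sig3 = 1.0 / (1.0 + math.exp(-3.0))  # sigma(3) ~ 0.953
g3 = sig3 * (1.0 - sig3)             # sigma'(3) ~ 0.045
print(g3)
print(g3 ** 5)  # on the order of 2e-07, matching ~1.8e-07 up to rounding
```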

Common Confusions

Watch Out

ReLU is not differentiable at zero, but this does not matter in practice

ReLU has a kink at $x = 0$ where the derivative is undefined. In practice, the probability that any pre-activation is exactly zero is zero (under continuous distributions). Implementations set $\sigma'(0) = 0$ by convention. This has no measurable effect on training.

Watch Out

All common activations yield universal approximation

The universal approximation theorem holds for any non-polynomial continuous activation. ReLU, sigmoid, tanh, GELU, and SiLU all have this property. The choice of activation does not affect what functions a network can represent in the infinite-width limit. It affects what functions gradient-based optimization will find in finite time with finite width.

Watch Out

GELU is not just smooth ReLU

GELU is sometimes described as a smooth approximation to ReLU. This is misleading. GELU weights each input by its probability under a standard Gaussian: $\text{GELU}(x) = x \cdot P(Z \leq x)$ where $Z \sim \mathcal{N}(0,1)$. For large positive $x$, $\text{GELU}(x) \approx x$. For large negative $x$, $\text{GELU}(x) \approx 0$. The transition between these regimes is smooth and depends on the input's value, unlike the hard threshold of ReLU.

Key Takeaways

  • Without nonlinear activations, depth is useless: the network computes an affine map
  • Sigmoid and tanh saturate for large $|x|$, causing vanishing gradients in deep networks
  • ReLU solved vanishing gradients for positive inputs but introduced dead neurons
  • Leaky ReLU, GELU, and SiLU avoid dead neurons while maintaining good gradient flow
  • Activation choice affects optimization dynamics, not theoretical expressiveness
  • Modern transformers use GELU, SiLU, or SwiGLU; SwiGLU is the FFN nonlinearity in LLaMA, Mistral, Gemma, and PaLM

Exercises

ExerciseCore

Problem

Compute $\sigma'(x)$ for the sigmoid function $\sigma(x) = 1/(1 + e^{-x})$ and show that $\max_x \sigma'(x) = 1/4$.

ExerciseCore

Problem

A 10-layer network uses sigmoid activations. Assume all pre-activations are at $x = 0$ (the best case for gradient flow). What is the maximum gradient attenuation factor through all 10 layers from the activation derivatives alone?

ExerciseAdvanced

Problem

Show that for any $\alpha > 0$, Leaky ReLU with slope $\alpha$ for negative inputs has no dead neurons. Precisely: show that for any weight configuration, $\partial\,\text{LeakyReLU}(z)/\partial z \neq 0$ for all $z \neq 0$.


References

Canonical:

  • Goodfellow, Bengio, Courville, Deep Learning (2016), Chapter 6.3
  • Cybenko, "Approximation by Superpositions of a Sigmoidal Function" (1989)
  • Hornik, "Approximation Capabilities of Multilayer Feedforward Networks" (1991)

Current:

  • Hendrycks & Gimpel, "Gaussian Error Linear Units (GELUs)" (2016), arXiv:1606.08415
  • Ramachandran, Zoph, Le, "Searching for Activation Functions" (2017) (Swish/SiLU), arXiv:1710.05941
  • Shazeer, "GLU Variants Improve Transformer" (2020), arXiv:2002.05202 (SwiGLU)
  • Nair & Hinton, "Rectified Linear Units Improve Restricted Boltzmann Machines" (2010)

Last reviewed: April 2026
