ReLU Family Lab
Compare hard hinges, rescue slopes, smooth gates, and rounded thresholds. This lab is about the real tradeoff: what happens to negative evidence and what gradient survives the trip back.
The activation chooses what a neuron does with negative evidence and what gradient survives the trip back
ReLU-family choices are not cosmetic. They decide whether negative pre-activations die, leak, saturate smoothly, or get softly weighted. That changes gradient flow, sparsity, numerical behavior, and which architectures feel stable in practice.
Does the unit die, leak, saturate, or keep a soft weighted gradient?
This is the real comparison zone: hard threshold versus smooth shoulder versus probabilistic gate.
The lower plot tells you whether learning signal survives, not just whether the output looks nice.
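A minimal sketch of that comparison in NumPy, using assumed common defaults (0.01 leak, ELU alpha 1.0, the tanh form of GELU): the forward value and a finite-difference estimate of the local slope at one negative pre-activation.

```python
import numpy as np

# Forward value and local slope of a few ReLU-family activations at one
# negative pre-activation. Hyperparameters (0.01 leak, ELU alpha 1.0, the
# tanh form of GELU) are common defaults, assumed here for illustration.

def relu(x):     return np.maximum(0.0, x)
def leaky(x):    return np.where(x > 0, x, 0.01 * x)
def elu(x):      return np.where(x > 0, x, np.expm1(x))   # alpha = 1.0
def softplus(x): return np.log1p(np.exp(x))
def gelu(x):     # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def local_slope(f, x, eps=1e-5):
    # central finite difference: how much gradient survives the trip back
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

x = -2.0  # clearly negative evidence
for name, f in [("ReLU", relu), ("LeakyReLU", leaky), ("ELU", elu),
                ("GELU", gelu), ("Softplus", softplus)]:
    print(f"{name:>9}: value={float(f(x)):+.4f}  slope={float(local_slope(f, x)):+.4f}")
```

ReLU reports a slope of exactly 0 here, LeakyReLU a small constant, ELU and Softplus a smoothly shrinking one, and GELU a small soft weight: the die / leak / saturate / softly-weight split in four printed lines.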
ReLU keeps the active side brutally simple and kills the negative side completely
ReLU became the default fix for older sigmoid stacks because positive pre-activations pass through with slope 1 instead of saturating. The cost is that negative pre-activations can go entirely silent.
Classic CNNs, MLP baselines, many residual blocks, and any place you want a plain fast hinge before trying fancier modern variants.
On the active side the local derivative is exactly 1, so learning signal flows through this unit cleanly.
ReLU is the fastest baseline because its geometry is easy: off on the left, perfectly linear on the right.
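A tiny sketch of that geometry, assuming a plain NumPy setting: the forward pass is one elementwise max, and the backward pass reuses a 0/1 mask of which units were active.

```python
import numpy as np

# Minimal sketch of why the ReLU hinge is cheap: forward is a single
# elementwise max, backward reuses a 0/1 mask of "who was active".
# `upstream_grad` stands in for whatever gradient arrives from the next layer.

def relu_forward(x):
    mask = x > 0                 # off on the left, linear on the right
    return x * mask, mask

def relu_backward(upstream_grad, mask):
    return upstream_grad * mask  # gradient passes untouched where active, 0 elsewhere

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
out, mask = relu_forward(x)
print(out)                                    # [0.  0.  0.  0.5 3. ]
print(relu_backward(np.ones_like(x), mask))   # [0. 0. 0. 1. 1.]
```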
Active highway
Once the input is positive, ReLU stops being subtle: the unit is just a line of slope 1.
Compare this to ELU, GELU, and Softplus to feel how much geometry you trade away for that simplicity.
The ReLU positive branch preserves gradient magnitude exactly, which is why deep ReLU networks trained so much better than deep sigmoid stacks.
This is the cheapest way to keep deep positive activations trainable. When it works, the network gets sparse gating and clean gradient flow on the active side.
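A rough sketch of that claim, assuming 20 layers and a pre-activation of 0.5 at every layer (both made up for illustration): each active ReLU layer multiplies the backward signal by exactly 1, while each sigmoid layer multiplies it by at most 0.25.

```python
import numpy as np

# Product of per-layer derivatives along the backward path.
# ReLU (active side): d/dz max(0, z) = 1, so the chain stays at 1.
# Sigmoid: d/dz sigmoid(z) = s * (1 - s) <= 0.25, so the chain shrinks fast.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

depth, z = 20, 0.5                          # assumed depth and pre-activation
relu_chain = 1.0 ** depth
s = sigmoid(z)
sigmoid_chain = (s * (1.0 - s)) ** depth

print(f"ReLU chain ({depth} active layers): {relu_chain:.6f}")      # 1.000000
print(f"sigmoid chain ({depth} layers):     {sigmoid_chain:.2e}")   # ~2.6e-13
```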
Start here when you want the simplest strong baseline and you are not already seeing dead-unit problems.
If a unit spends every batch left of zero, its local slope is exactly 0 and it can stop learning entirely.
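A small sketch of such a dead unit, with made-up weights, a made-up batch, and a bias pushed far negative: the pre-activation is below zero for every example, so the mask, and with it every gradient through the unit, is exactly zero.

```python
import numpy as np

# A "dead" ReLU unit: the bias is so negative that the pre-activation is
# below zero for the whole batch, so the local slope is 0 everywhere and the
# weight/bias gradients vanish exactly. All values here are illustrative.

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8))          # batch of 64 inputs, 8 features
w = rng.normal(size=8) * 0.1
b = -10.0                             # bias driven far negative

z = x @ w + b                         # pre-activations, all < 0 here
mask = (z > 0).astype(float)          # local slope of ReLU: 0 or 1

upstream = np.ones_like(z)            # stand-in gradient from the loss
grad_w = x.T @ (upstream * mask)      # chain rule through the dead hinge
grad_b = np.sum(upstream * mask)

print("active fraction:", mask.mean())                              # 0.0
print("grad_w norm:", np.linalg.norm(grad_w), " grad_b:", grad_b)   # both 0.0
```

With every gradient at exactly zero, no optimizer step can move the unit back across the threshold; that is the dead-unit failure mode the leaky and smooth variants exist to avoid.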