Activation functions — ReLU and sigmoid
Before we begin
Without activations, a stack of linear layers collapses into a single linear function, however many layers it has. Activations inject non-linearity so that depth adds expressive power.
Why do we need activation functions? So networks can learn curves, edges, and combinations — not just straight lines.
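A minimal sketch of that collapse, in plain Python. The weight matrices `W1` and `W2`, the input `x`, and the helper names are hypothetical values chosen only for illustration, and biases are left out to keep it short. Two linear layers applied in sequence give the same output as one layer whose weights are the product `W2 · W1`; putting a ReLU between them breaks that equivalence.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply matrices A and B (lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def relu_vec(v):
    """Apply ReLU to every element of a vector."""
    return [max(0.0, vi) for vi in v]


# Hypothetical weights and input, for illustration only (no biases).
W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [1.0, 2.0]

stacked = matvec(W2, matvec(W1, x))              # two layers, no activation
collapsed = matvec(matmul(W2, W1), x)            # one layer with weights W2 @ W1
with_relu = matvec(W2, relu_vec(matvec(W1, x)))  # ReLU between the layers

print("stacked:  ", stacked)
print("collapsed:", collapsed)
print("with ReLU:", with_relu)
```

The first two printed vectors match, while the third differs because the ReLU zeroes the negative hidden value before the second layer sees it.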
Figure: Sigmoid vs ReLU
What you will learn
- Compare ReLU and sigmoid for hidden layers.
- Pick output activations for classification.
- Describe vanishing gradients in plain language.
Sigmoid
σ(z) = 1 / (1 + e⁻ᶻ)
- Output range (0, 1) — nice for probabilities.
- Saturates when |z| is large → derivative ≈ 0 → vanishing gradient in deep stacks.
- Still common on binary output neurons; less common in hidden layers today.
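To make the saturation point concrete, here is a small sketch using only the standard library. The function names `sigmoid` and `sigmoid_grad` are illustrative choices, not part of any particular framework. It relies on the identity σ'(z) = σ(z)(1 − σ(z)), which peaks at 0.25 when z = 0 and approaches 0 as |z| grows.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s), where s = sigmoid(z)."""
    s = sigmoid(z)
    return s * (1.0 - s)


for z in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"z={z:6.1f}  sigmoid={sigmoid(z):.5f}  derivative={sigmoid_grad(z):.5f}")
```

The derivative column is largest at z = 0 and becomes very small at z = ±10, which is the saturation described in the list above.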
ReLU
ReLU(z) = max(0, z)
- 0 for negative z, z for positive z.
- Simple and fast; avoids saturation on the positive side.
- Default choice for hidden layers in most vision/MLP models.
- Dead ReLU: a neuron outputs 0 for every input if its weights and bias keep z negative. Its gradient is then also 0, so it stops learning (usually manageable).
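The properties listed above can be sketched in a few lines. The single neuron here (`w`, `b`, and the input list) is hypothetical and chosen so that z is negative for every input, which illustrates a dead ReLU. Taking the derivative at exactly z = 0 to be 0 is a convention assumed for this sketch.

```python
def relu(z):
    """ReLU: 0 for negative z, z for positive z."""
    return max(0.0, z)


def relu_grad(z):
    """Derivative of ReLU; taken as 0 at z == 0 by convention in this sketch."""
    return 1.0 if z > 0 else 0.0


# Hypothetical neuron z = w * x + b whose large negative bias keeps z below 0.
w, b = 0.5, -10.0
inputs = [0.0, 1.0, 2.0, 5.0]

pre_activations = [w * x + b for x in inputs]
outputs = [relu(z) for z in pre_activations]
grads = [relu_grad(z) for z in pre_activations]

print("z:        ", pre_activations)
print("ReLU(z):  ", outputs)
print("gradient: ", grads)
```

Every output and every gradient is 0 for these inputs, so gradient descent has no signal with which to move `w` or `b` back into the active region.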
Other activations (awareness)
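- Tanh: like sigmoid but zero-centered, with output range (−1, 1).
- GELU / Swish: smooth ReLU-like activations used in transformers (later phases).
- Softmax: not per-neuron; it normalizes the whole output vector into probabilities that sum to 1 (for example, one per digit 0–9).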
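Because softmax acts on a whole vector rather than on one neuron, a short sketch may help. The `scores` values are made up for illustration, and subtracting the maximum score before exponentiating is a common stability trick assumed here, not something this lesson prescribes.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    m = max(scores)                        # shift by the max to avoid overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1]                   # hypothetical raw outputs for 3 classes
probs = softmax(scores)

print("probabilities:", [round(p, 4) for p in probs])
print("sum:", sum(probs))
print("predicted class:", probs.index(max(probs)))
```

The largest score receives the largest probability, and changing any one score changes every probability, which is why softmax cannot be applied neuron by neuron.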
Where to use which
| Layer | Typical choice |
|---|---|
| Hidden | ReLU (or variant) |
| Binary output | Sigmoid |
| Multi-class output (MNIST) | Softmax (with cross-entropy loss) |
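The last row pairs softmax with cross-entropy loss. As a rough sketch of how the two fit together, the loss for one example is −log of the softmax probability assigned to the true class. The function name `softmax_cross_entropy` and the example scores are assumptions for illustration. The function uses the log-sum-exp form, which gives the same value without computing the probabilities explicitly.

```python
import math


def softmax_cross_entropy(scores, true_class):
    """Cross-entropy of softmax(scores) against one true class index.

    Equivalent to -log(softmax(scores)[true_class]), computed via log-sum-exp.
    """
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[true_class]


scores = [0.5, 3.0, -1.0]                  # hypothetical raw outputs for 3 classes

loss_if_class_1 = softmax_cross_entropy(scores, 1)   # true class has the top score
loss_if_class_2 = softmax_cross_entropy(scores, 2)   # true class has the lowest score

print("loss when the top-scoring class is correct:", loss_if_class_1)
print("loss when the lowest-scoring class is correct:", loss_if_class_2)
```

The loss is small when the network gives the true class a high score and grows as that score falls behind the others.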
Vanishing gradient
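In deep sigmoid networks, backprop multiplies one activation derivative per layer, and the sigmoid derivative is never larger than 0.25. The product tends to shrink rapidly with depth, so early layers receive tiny updates and learn slowly.
ReLU helped revive deep learning because its derivative is exactly 1 for active neurons (z > 0), so gradients pass through those neurons without shrinking.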
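A toy calculation can show the effect. This sketch assumes a deliberately simplified network: one unit per layer, every weight fixed at 1, and every bias at 0, so the gradient of the output with respect to the input is just the product of the activation derivatives. Real networks also multiply by weights, so treat the numbers as illustrative only. All function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def relu(z):
    return max(0.0, z)


def relu_grad(z):
    return 1.0 if z > 0 else 0.0


def chain_gradient(activation, derivative, x, depth):
    """d(output)/d(input) for `depth` one-unit layers with weight 1 and bias 0."""
    grad, a = 1.0, x
    for _ in range(depth):
        grad *= derivative(a)   # pre-activation equals a because the weight is 1
        a = activation(a)
    return grad


for depth in (1, 5, 10, 20):
    g_sig = chain_gradient(sigmoid, sigmoid_grad, 1.0, depth)
    g_relu = chain_gradient(relu, relu_grad, 1.0, depth)
    print(f"depth={depth:2d}  sigmoid gradient={g_sig:.3e}  ReLU gradient={g_relu:.1f}")
```

With a positive input the ReLU gradient stays at 1 at every depth, while the sigmoid gradient drops by a factor of at least 4 per layer.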
Checkpoint: Why is ReLU often preferred over sigmoid in hidden layers?
Answer sketch
ReLU does not saturate for positive z, so gradients are less likely to vanish across many layers.