Activation functions — ReLU and sigmoid

Before we begin

Without activations, a stack of layers is still one big linear function, because composing linear maps only ever gives another linear map. Activations inject non-linearity so depth matters.

Why do we need activation functions? So networks can learn curves, edges, and combinations — not just straight lines.
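The claim that layers without activations collapse into one linear function can be checked by hand. The sketch below uses a deliberately tiny setup: each "layer" is a single weight and bias. The parameter values and function names are hypothetical, chosen only for illustration.

```python
def linear(w: float, b: float, x: float) -> float:
    """One 'layer' with a single weight and bias: w * x + b."""
    return w * x + b


def relu(z: float) -> float:
    """ReLU(z) = max(0, z)."""
    return max(0.0, z)


# Hypothetical parameters for two stacked layers.
W1, B1 = 2.0, 1.0
W2, B2 = 3.0, -4.0


def stacked_no_activation(x: float) -> float:
    """Layer 2 applied directly to the output of layer 1."""
    return linear(W2, B2, linear(W1, B1, x))


def collapsed(x: float) -> float:
    """The same function written as ONE layer.

    W2 * (W1 * x + B1) + B2 = (W2 * W1) * x + (W2 * B1 + B2)
    """
    return linear(W2 * W1, W2 * B1 + B2, x)


def stacked_with_relu(x: float) -> float:
    """Same two layers, with a ReLU in between."""
    return linear(W2, B2, relu(linear(W1, B1, x)))


for x in (-2.0, -1.0, 0.0, 1.0):
    print(x, stacked_no_activation(x), collapsed(x), stacked_with_relu(x))
```

Without the activation, the two-layer stack and the single collapsed layer agree on every input. With ReLU in between, the outputs differ for inputs where the first layer's output is negative, so the stack is no longer a single straight line.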

Figure: Sigmoid vs ReLU. Sigmoid saturates at extremes; ReLU stays active for positive z.

σ(z) = 1 / (1 + e⁻ᶻ)

Output range (0, 1) — nice for probabilities.
Saturates when |z| is large → derivative ≈ 0 → vanishing gradient in deep stacks.
Still common on binary output neurons; less common in hidden layers today.
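To make the saturation point concrete, here is a minimal sketch of the sigmoid and its derivative, σ'(z) = σ(z)(1 − σ(z)), in plain Python. The function names are illustrative, and the two-branch form is one common way to avoid overflow for large negative z, not the only one.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # safe: z is negative, so exp(z) is at most 1
    return ez / (1.0 + ez)


def sigmoid_grad(z: float) -> float:
    """Derivative of the sigmoid: s * (1 - s), where s = sigmoid(z)."""
    s = sigmoid(z)
    return s * (1.0 - s)


for z in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"z={z:6.1f}  sigmoid={sigmoid(z):.5f}  grad={sigmoid_grad(z):.5f}")
```

The derivative peaks at 0.25 when z = 0 and shrinks toward 0 as |z| grows, which is the saturation described above.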

ReLU(z) = max(0, z)

0 for negative z, z for positive z.
Simple and fast; avoids saturation on the positive side.
Default choice for hidden layers in most vision/MLP models.
Dead ReLU: if the weights push z negative for every input, the neuron always outputs 0 and its gradient is also 0, so it stops learning (usually manageable).
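The points above can be sketched in a few lines. The weight, bias, and inputs below are hypothetical values picked so that z is negative for every input, which is the dead-ReLU situation. Treating the derivative at exactly z = 0 as 0 is a common convention, assumed here.

```python
def relu(z: float) -> float:
    """ReLU(z) = max(0, z)."""
    return max(0.0, z)


def relu_grad(z: float) -> float:
    """Derivative of ReLU: 1 for z > 0, otherwise 0.

    The derivative is undefined at z = 0; returning 0 there is an
    assumed convention, not the only possible choice.
    """
    return 1.0 if z > 0 else 0.0


# Hypothetical single-input neuron whose bias keeps z negative.
w, b = 1.0, -2.0
inputs = [0.1, 0.5, 0.9]

zs = [w * x + b for x in inputs]
outputs = [relu(z) for z in zs]
grads = [relu_grad(z) for z in zs]

print("z values:", zs)
print("outputs: ", outputs)   # every output is 0.0
print("gradients:", grads)    # every gradient is 0.0, so no update reaches w or b
```

Because every gradient is 0, no learning signal reaches this neuron's weights from these inputs, so it stays stuck unless something else changes its pre-activation.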

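Tanh: like sigmoid but centered at 0, with outputs in (−1, 1); it still saturates when |z| is large.
GELU / Swish: used in transformers (later phases).
Softmax: not per-neuron; normalizes the whole output vector into probabilities that sum to 1 (for example, one probability per digit 0–9).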
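Since softmax acts on a whole vector rather than on one neuron, a short sketch may help. The ten logits below are made-up numbers standing in for the raw scores of the digits 0–9. Subtracting the maximum before exponentiating is a standard stability trick and does not change the result.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Turn a vector of raw scores into probabilities that sum to 1."""
    m = max(logits)                            # shift for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for the digits 0-9.
logits = [1.2, 0.3, -0.8, 4.1, 0.0, 2.2, -1.5, 0.7, 1.9, -0.2]
probs = softmax(logits)

predicted_digit = max(range(len(probs)), key=lambda i: probs[i])
print("probabilities:", [round(p, 3) for p in probs])
print("sum:", sum(probs))
print("predicted digit:", predicted_digit)
```

The largest logit always receives the largest probability, so the predicted class is the same before and after softmax; what softmax adds is a normalized distribution that a cross-entropy loss can work with.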

Layer	Typical choice
Hidden	ReLU (or variant)
Binary output	Sigmoid
Multi-class output (MNIST)	Softmax (with cross-entropy loss)

In deep sigmoid networks, backprop multiplies many small derivatives (the sigmoid's derivative never exceeds 0.25). Early layers receive tiny updates and learn slowly.

ReLU helped revive deep learning because its derivative is exactly 1 for active neurons (z > 0), so gradients pass through them without shrinking.
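A rough back-of-the-envelope comparison of the two paragraphs above: multiply the activation derivatives along one path through a 10-layer network. This sketch deliberately ignores the weight terms that also appear in the real backprop product, so it illustrates the trend rather than modeling an actual network. The depth and the z values are assumptions.

```python
import math


def sigmoid_grad(z: float) -> float:
    """Derivative of the sigmoid at z: s * (1 - s)."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def chain_factor(local_grads: list[float]) -> float:
    """Product of per-layer activation derivatives along one path."""
    product = 1.0
    for g in local_grads:
        product *= g
    return product


depth = 10  # assumed number of layers

# Best case for sigmoid: every neuron sits at z = 0, where the derivative is 0.25.
sigmoid_best = chain_factor([sigmoid_grad(0.0)] * depth)

# A mildly saturated case: every neuron sits at z = 3.
sigmoid_saturated = chain_factor([sigmoid_grad(3.0)] * depth)

# ReLU along a path of active neurons (z > 0): every derivative is 1.
relu_active = chain_factor([1.0] * depth)

print(f"sigmoid, best case: {sigmoid_best:.3e}")
print(f"sigmoid, z = 3:     {sigmoid_saturated:.3e}")
print(f"ReLU, active path:  {relu_active:.3e}")
```

Even in the sigmoid's best case the factor is 0.25 raised to the 10th power, under one millionth, while the active ReLU path keeps a factor of 1. A ReLU path that passes through an inactive neuron contributes 0 instead, which is why the statement is limited to active neurons.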

Checkpoint: Why is ReLU often preferred over sigmoid in hidden layers?

Answer sketch

ReLU does not saturate for positive z, so gradients are less likely to vanish across many layers.