Module 3 — Neural networks basics

Activation functions — ReLU & sigmoid

Why non-linearity matters, comparing ReLU and sigmoid, output activations for classification, and vanishing gradients.

~70 min read + exercises

Before we begin

Without activations, a stack of layers is still one big linear function. Activations inject non-linearity so depth matters.

Why do we need activation functions? So networks can learn curves, edges, and combinations — not just straight lines.
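The collapse of stacked linear layers can be checked with a few lines of plain Python. This is a minimal sketch under stated assumptions: the weights W1, b1, W2, b2 and the helper names are made up for illustration and are not part of the lesson. It shows that two layers without an activation equal a single layer with W = W2·W1 and b = W2·b1 + b2, and that inserting ReLU between them breaks that equivalence.

```python
# Illustrative only: weights and helper names are assumptions, not lesson code.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrices A and B (lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

def relu(v):
    """Element-wise max(0, z)."""
    return [max(0.0, z) for z in v]

W1, b1 = [[1.0, 2.0], [0.0, -1.0]], [0.5, 1.0]   # layer 1: 2 inputs -> 2 units
W2, b2 = [[3.0, 1.0]], [-2.0]                    # layer 2: 2 units -> 1 output

def two_linear_layers(x):
    # No activation between the layers.
    return add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The same function written as ONE linear layer.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)

def one_linear_layer(x):
    return add(matvec(W, x), b)

def two_layers_with_relu(x):
    # ReLU between the layers: no single (W, b) reproduces this for all x.
    return add(matvec(W2, relu(add(matvec(W1, x), b1))), b2)

x = [1.0, 2.0]
print(two_linear_layers(x), one_linear_layer(x))  # [13.5] [13.5]
print(two_layers_with_relu(x))                    # [14.5]
```

For this input the second hidden unit has z = -1, so ReLU zeroes it and the output changes. That clipping is the non-linearity that makes the extra layer matter.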

Figure: Sigmoid vs ReLU (hidden layers). Sigmoid saturates at the extremes (flat tails, tiny gradients); ReLU stays active for positive z (no saturation for z > 0).

What you will learn

  • Compare ReLU and sigmoid for hidden layers.
  • Pick output activations for classification.
  • Describe vanishing gradients in plain language.


Sigmoid

σ(z) = 1 / (1 + e⁻ᶻ)

  • Output range (0, 1) — nice for probabilities.
  • Saturates when |z| is large → derivative ≈ 0 → vanishing gradient in deep stacks.
  • Still common on binary output neurons; less common in hidden layers today.
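To make the saturation point concrete, here is a small sketch of the sigmoid and its derivative σ'(z) = σ(z)(1 − σ(z)) in plain Python. The function names and the sample z values are illustrative assumptions. The two-branch form of sigmoid is one common way to avoid overflow in exp for large |z|.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)          # z < 0, so exp(z) is in (0, 1)
    return e / (1.0 + e)

def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s), largest at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Sample points chosen for illustration.
for z in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"z={z:6.1f}  sigmoid={sigmoid(z):.5f}  derivative={sigmoid_grad(z):.5f}")
```

The derivative peaks at 0.25 when z = 0 and is already below 0.0001 at |z| = 10, which is the flat-tail behavior described in the list above.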

ReLU

ReLU(z) = max(0, z)

  • 0 for negative z, z for positive z.
  • Simple and fast; avoids saturation on the positive side.
  • Default choice for hidden layers in most vision/MLP models.
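  • Dead ReLU: if a neuron's weights and bias keep z negative for every input, it always outputs 0 and its gradient is also 0, so it stops learning. Usually manageable in practice (for example with a moderate learning rate or a variant such as Leaky ReLU).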
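The points above can be sketched in a few lines of plain Python. The neuron below (weight w, bias b, and the input values) is hypothetical and chosen only so that z stays negative; the derivative at exactly z = 0 is set to 0 here by convention.

```python
def relu(z):
    """ReLU(z) = max(0, z)."""
    return max(0.0, z)

def relu_grad(z):
    """1 for z > 0, 0 for z < 0; at z == 0 we pick 0 by convention."""
    return 1.0 if z > 0 else 0.0

# Hypothetical neuron with a large negative bias: z = w * x + b.
w, b = 0.5, -10.0
inputs = [0.0, 1.0, 2.0, 5.0]

pre_activations = [w * x + b for x in inputs]
outputs = [relu(z) for z in pre_activations]
grads = [relu_grad(z) for z in pre_activations]

print(pre_activations)  # all negative for these inputs
print(outputs)          # all 0.0: the neuron is "dead" on this data
print(grads)            # all 0.0: no gradient reaches w or b through this unit
```

Because the gradient is 0 wherever the output is 0, gradient descent alone cannot move this neuron back into the active region for these inputs.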

Other activations (awareness)

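  • Tanh: shaped like sigmoid but with output range (−1, 1), centered at 0; it still saturates for large |z|.
  • GELU / Swish: smooth ReLU-like functions used in transformers (later phases).
  • Softmax: applied to the whole output vector rather than per neuron; turns the scores into probabilities that sum to 1 (for example, one per digit 0–9).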
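Since softmax works on a whole vector rather than one neuron at a time, a short sketch may help. This is plain Python with made-up scores; subtracting the maximum before exponentiating is a common stability trick and does not change the result.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    m = max(scores)                              # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]                         # illustrative raw outputs
probs = softmax(scores)

print(probs)        # largest score gets the largest probability
print(sum(probs))   # approximately 1.0
```

Each output depends on every score in the vector, which is why softmax is listed separately from the per-neuron activations.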

Where to use which

Layer                         Typical choice
Hidden                        ReLU (or variant)
Binary output                 Sigmoid
Multi-class output (MNIST)    Softmax (with cross-entropy loss)
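As a rough illustration of the table, the sketch below applies each choice to made-up numbers in plain Python. All values and function names are assumptions for illustration. The cross-entropy shown is the standard form for a single example, the negative log of the probability assigned to the true class.

```python
import math

def relu(v):
    """Hidden-layer activation, applied element-wise."""
    return [max(0.0, z) for z in v]

def sigmoid(z):
    """Binary-output activation, written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def softmax(scores):
    """Multi-class output activation over the whole score vector."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_class):
    """Loss for one example: -log of the probability of the true class."""
    return -math.log(probs[true_class])

# Hidden layer: negative pre-activations are clipped to 0.
hidden = relu([-1.2, 0.7, 3.0])

# Binary output: a single score becomes P(class = 1).
p_positive = sigmoid(0.8)

# Multi-class output: three illustrative class scores, true class is index 0.
class_scores = [1.5, -0.3, 0.2]
probs = softmax(class_scores)
loss = cross_entropy(probs, true_class=0)

print(hidden, p_positive, probs, loss)
```

For MNIST the score vector would have ten entries, one per digit, but the same two functions apply.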

Vanishing gradient

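In deep sigmoid networks, backprop multiplies one activation derivative per layer, and the sigmoid's derivative is never larger than 0.25. The product shrinks quickly with depth, so early layers receive tiny updates and learn slowly.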

ReLU helped revive deep learning because its derivative is exactly 1 for active neurons (z > 0), so the gradient passes through them without shrinking.
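A back-of-the-envelope sketch of the effect, in plain Python. It is a deliberate simplification: it multiplies only the activation derivatives along one path and ignores the weight matrices, which also scale the gradient in a real network. The depth of 10 and the function names are illustrative assumptions.

```python
import math

def sigmoid_grad(z):
    """Derivative of the sigmoid, s * (1 - s)."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def gradient_scale(local_derivatives):
    """Product of per-layer activation derivatives along one backprop path."""
    product = 1.0
    for d in local_derivatives:
        product *= d
    return product

depth = 10  # illustrative number of stacked layers

# Best case for sigmoid: every unit sits at z = 0, where the derivative is 0.25.
sigmoid_best_case = gradient_scale([sigmoid_grad(0.0)] * depth)

# Saturated sigmoid units (z = 5) shrink the gradient far more.
sigmoid_saturated = gradient_scale([sigmoid_grad(5.0)] * depth)

# Active ReLU units each contribute a derivative of exactly 1.
relu_active = gradient_scale([1.0] * depth)

print(sigmoid_best_case)   # 0.25 ** 10, about 9.5e-07
print(sigmoid_saturated)   # many orders of magnitude smaller still
print(relu_active)         # 1.0
```

Even in the sigmoid's best case, ten layers scale the gradient by less than one millionth, while a path through active ReLU units leaves it unchanged.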

Checkpoint: Why is ReLU often preferred over sigmoid in hidden layers?

Answer sketch

ReLU does not saturate for positive z, so gradients are less likely to vanish across many layers.


What's next

Lesson 3 — Forward propagation