Module 3 — Neural networks basics

Backpropagation — how networks learn

Chain rule intuition, backward gradient flow, what each weight learns, and connection to Module 1 gradient descent.

~80 min read + exercises

Before we begin

Forward pass gives a prediction. Backpropagation answers:

For each weight, how much did it contribute to the loss?

Those values are gradients: each one says how much the loss would change if that weight were nudged slightly. They play the same role as the slope in Module 1, extended to every layer via the chain rule.

Figure: Backward flow

[Diagram: input → hidden → output → loss, with ∂loss/∂w computed on each layer as gradients flow backward from the loss.]
Loss at the end; gradients propagate backward to update each W and b.

What you will learn

  • State what backprop computes (gradients of loss w.r.t. weights).
  • Connect backprop to Module 1 gradient descent.
  • Describe vanishing gradients without heavy calculus.


Training loop (full picture)

  1. Forward — compute prediction.
  2. Loss — compare prediction to true label (e.g. cross-entropy).
  3. Backward — compute ∂loss/∂w for every weight.
  4. Update — w ← w − learning_rate × ∂loss/∂w (and same for biases).
  5. Repeat for many batches and epochs.

In PyTorch, loss.backward() performs step 3 and optimizer.step() performs step 4, but the idea is the same one you learned in Module 1.
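The five steps can be sketched in plain Python for a model with a single weight. This is a minimal illustration, not the course's project code: the toy data, the learning rate, and the use of mean squared error (instead of cross-entropy) are assumptions chosen to keep the gradient easy to write by hand.

```python
# Toy data generated from y = 3x, so the ideal weight is 3.0 (hypothetical example).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

w = 0.0               # single weight, no bias, to keep the sketch small
learning_rate = 0.01  # assumed value; small enough for this data to converge
losses = []

for epoch in range(200):  # 5. Repeat (one full pass over the data per iteration)
    # 1. Forward: compute predictions.
    preds = [w * x for x in xs]
    # 2. Loss: mean squared error against the true labels.
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
    # 3. Backward: d(loss)/dw is the mean of 2 * (pred - y) * x.
    grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    # 4. Update: step against the gradient.
    w = w - learning_rate * grad_w
    losses.append(loss)

print(f"learned w = {w:.3f}, first loss = {losses[0]:.3f}, last loss = {losses[-1]:.6f}")
```

With a real network, step 3 is the only part that gets harder: there are many weights, so the derivative is computed layer by layer instead of by one hand-written formula.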


Chain rule intuition

If loss depends on h, which depends on W₁:

“Tweak W₁ slightly → h changes slightly → loss changes slightly.”

Multiply local effects along the path from loss back to each weight. That product is the gradient for that weight.
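To make "multiply local effects" concrete, here is a small sketch for a two-weight chain: h = tanh(w1 · x), prediction = w2 · h, loss = (prediction − y)². The network shape, the tanh activation, and the numbers are assumptions for illustration. The chain-rule product is compared against a finite-difference estimate, which literally tweaks w1 slightly and measures how the loss changes.

```python
import math

def forward(w1, w2, x, y):
    h = math.tanh(w1 * x)   # hidden value depends on w1
    pred = w2 * h           # prediction depends on h and w2
    loss = (pred - y) ** 2  # squared error
    return h, pred, loss

# Hypothetical values for the weights, input, and label.
w1, w2, x, y = 0.5, -1.5, 2.0, 1.0
h, pred, loss = forward(w1, w2, x, y)

# Local effects along the path loss -> pred -> h -> w1.
dloss_dpred = 2 * (pred - y)   # how loss reacts to the prediction
dpred_dh = w2                  # how the prediction reacts to h
dh_dw1 = (1 - h ** 2) * x      # how h reacts to w1 (derivative of tanh)

# Chain rule: the gradient is the product of the local effects.
grad_w1 = dloss_dpred * dpred_dh * dh_dw1

# Check: nudge w1 in both directions and measure the change in loss.
eps = 1e-6
loss_plus = forward(w1 + eps, w2, x, y)[2]
loss_minus = forward(w1 - eps, w2, x, y)[2]
numeric_grad_w1 = (loss_plus - loss_minus) / (2 * eps)

print(f"chain rule: {grad_w1:.6f}, finite difference: {numeric_grad_w1:.6f}")
```

The two numbers should agree to several decimal places, which is a common sanity check when deriving gradients by hand.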

Deeper networks mean longer chains, so the product can vanish (many factors smaller than 1 shrink it toward zero) or explode (many factors larger than 1 blow it up).
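A rough sketch of why chain length matters: multiply the same local factor once per layer and watch the result. This is a deliberate simplification. In a real network each factor differs and also depends on the weights; the helper name `gradient_scale` and the factor 1.5 are hypothetical. The value 0.25 is the largest slope a sigmoid activation can have.

```python
def gradient_scale(local_factor, depth):
    """Product of `depth` identical local derivatives along one path."""
    scale = 1.0
    for _ in range(depth):
        scale *= local_factor
    return scale

for depth in (2, 10, 50):
    small = gradient_scale(0.25, depth)  # 0.25 = maximum slope of a sigmoid
    large = gradient_scale(1.5, depth)   # any factor above 1 grows instead
    print(f"depth {depth:2d}: factor 0.25 -> {small:.3e}, factor 1.5 -> {large:.3e}")
```

Even at the sigmoid's best-case slope, the early layers of a deep stack receive a gradient that is many orders of magnitude smaller than the one at the output.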


What backprop does not do

  • It does not search random weights until lucky.
  • It does not replace the need for labels.
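  • It does not guarantee a global optimum. Gradient descent usually settles into a local minimum or flat region that is good enough in practice, with no promise that it is the best one.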

Manual vs automatic

Module 3 project optionally sketches NumPy backprop for a tiny network — powerful for understanding. Production work uses PyTorch autograd (automatic differentiation) so you rarely derive by hand.
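To give a feel for what automatic differentiation does behind the scenes, here is a toy scalar version. It is a sketch under heavy assumptions: the `Value` class is hypothetical and is not PyTorch's API, it handles only a few operations on single numbers, and real autograd works on tensors with far more machinery. The idea is the same, though: record each operation's local derivative during the forward pass, then walk the graph backward multiplying as you go.

```python
import math

class Value:
    """A scalar that remembers how it was computed so gradients can flow back."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # Values this one was computed from
        self._local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __sub__(self, other):
        return Value(self.data - other.data, (self, other), (1.0, -1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def tanh(self):
        t = math.tanh(self.data)
        return Value(t, (self,), (1.0 - t * t,))

    def backward(self):
        # Order the graph so each node is handled before the nodes it was built from.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0  # d(loss)/d(loss)
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                # Chain rule: accumulate downstream gradient times local effect.
                parent.grad += node.grad * local

# Tiny network with hypothetical values: h = tanh(w1 * x), loss = (w2 * h - y)^2.
w1, w2 = Value(0.5), Value(-1.5)
x, y = Value(2.0), Value(1.0)
h = (w1 * x).tanh()
err = w2 * h - y
loss = err * err
loss.backward()

print(f"d(loss)/d(w1) = {w1.grad:.6f}, d(loss)/d(w2) = {w2.grad:.6f}")
```

No derivative of the whole network was written by hand here; only the local rule for each operation. That is the reason production code can change architecture freely without re-deriving gradients.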


Checkpoint

What does backpropagation actually compute?

Answer sketch

Gradients of the loss with respect to each parameter — how to nudge each weight to reduce loss.


What's next

Lesson 5 — Loss functions