Derivatives and gradient descent — how learning works

Before we begin

You now know how data becomes numbers, how models compare lists, and why training averages over noisy samples. One question remains — the question that defines learning:

How does the model know which way to adjust its weights to get better?

You could try random guesses, but that becomes impractical once a model has more than a few weights, because the number of combinations to try grows too fast. Instead, almost all training uses gradient descent:

Measure how wrong the model is (error / loss).
Measure which direction increases that error for each weight (slope).
Move each weight a small step in the opposite direction.
Repeat.

Derivative here means slope — how error changes when you nudge one weight. Gradient means “all the slopes at once” when you have many weights.

This lesson stays with one weight until the loop is crystal clear. Your project uses three weights — same loop, three slopes.

Figure: Walking downhill on a loss curve. Each step moves the weight slightly to reduce error. Too big a step overshoots; too small a step crawls.

What you will learn

Describe slope as “which way is uphill on the error graph.”
Read slope sign: increase or decrease the weight?
State the update rule in plain English.
Diagnose learning rate problems from how the error curve behaves.

Before this lesson

Lesson 3 — Probability

Slope in everyday language

Picture a simple graph:

Horizontal axis — one weight (one knob the model can turn).
Vertical axis — error (how wrong predictions are).

The graph often looks like a bowl or valley — high error on the sides, lower error near the bottom.

Uphill — turning the knob increases error (bad direction).
Downhill — turning the knob decreases error (good direction).

The slope at your current knob setting tells you the tilt:

Slope sign	What it means	What to try
Positive	Error rises if you increase the weight	Decrease the weight
Negative	Error falls if you increase the weight	Increase the weight
Zero	Flat — you may be at the bottom (or on a flat plateau)	Stop or adjust carefully

Analogy: You are hiking in fog. You cannot see the valley, but you can feel the ground tilt. Slope is that tilt. You want to step downhill, not uphill.

With many weights, each knob has its own slope. Together they form the gradient — a list of slopes, one per weight.
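
One way to "feel the tilt" numerically is to nudge the weight a tiny amount in each direction and compare the error. The sketch below assumes a made-up bowl-shaped error curve (lowest at weight = 3) and hypothetical helper names (`estimate_slope`, `suggested_move`); it only illustrates the slope-sign table and is not part of the project code.

```python
def error(weight):
    """Made-up bowl-shaped error curve, lowest at weight = 3 (assumption for illustration)."""
    return (weight - 3.0) ** 2


def estimate_slope(error_fn, weight, nudge=1e-6):
    """Estimate the slope by nudging the weight slightly up and slightly down."""
    return (error_fn(weight + nudge) - error_fn(weight - nudge)) / (2 * nudge)


def suggested_move(slope, tolerance=1e-4):
    """Translate the slope sign into the action from the slope-sign table."""
    if slope > tolerance:
        return "decrease the weight"
    if slope < -tolerance:
        return "increase the weight"
    return "flat: stop or adjust carefully"


for w in (1.0, 3.0, 4.5):
    s = estimate_slope(error, w)
    print(f"weight={w}: slope={s:.3f} -> {suggested_move(s)}")
```

Left of the bottom the estimated slope is negative, right of it the slope is positive, and at the bottom it is close to zero.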

Tiny numeric example (follow every step)

Model: prediction = weight × input

One training example: input = 2, true answer = 10

So the correct weight is 5, because 5 × 2 = 10.

Error for this example: take the difference (true − prediction), square it, and halve the result. The half is a convenience: it cancels the factor of 2 that appears when the slope is computed, so the numbers stay neat. You do not need to derive this.

At weight = 1

Prediction = 1 × 2 = 2
Difference = 10 − 2 = 8 (very wrong)
Squared error = 64
Error (half of that) = 32

The slope at weight = 1 is −(difference) × input = −8 × 2 = -16 (negative).

Negative slope → increasing the weight reduces error → you should increase the weight.

At weight = 5

Prediction = 5 × 2 = 10
Difference = 0 — perfect
Slope = 0 — flat bottom of the bowl
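
To check the arithmetic at both weights, here is a minimal sketch of the error and slope for this one training example. The function names `half_squared_error` and `slope` are hypothetical, and the slope formula assumes the half-squared error described above.

```python
def half_squared_error(weight, x=2.0, target=10.0):
    """Half of the squared difference between the true answer and the prediction."""
    difference = target - weight * x
    return 0.5 * difference ** 2


def slope(weight, x=2.0, target=10.0):
    """Slope of the half-squared error: (prediction - true) * input."""
    prediction = weight * x
    return x * (prediction - target)


for w in (1.0, 5.0):
    print(f"weight={w}: error={half_squared_error(w)}, slope={slope(w)}")
```

At weight = 1 this gives an error of 32 and a slope of -16; at weight = 5 both are zero.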

Checkpoint: Slope negative — increase or decrease weight?

Increase the weight.

The gradient descent loop (the training algorithm)

Pick a learning rate — how big each step is. Start small (for example 0.01) if unsure.

Repeat until error stops improving:

Run the model on your data → get predictions.
Compute average error (MSE from the last lesson).
Compute slope for each weight — how error changes if that weight goes up.
Update each weight:
new weight = old weight − learning rate × slope

The minus sign is the whole trick. Slope points uphill; you step downhill.

One manual step (weight = 1, slope = -16, learning rate = 0.01)

new weight = 1 − 0.01 × (-16) = 1 + 0.16 = 1.16

Still far from 5, but closer. Repeat hundreds of times and you approach the best weight.
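
The manual step and the "repeat hundreds of times" claim can be sketched in a few lines. This assumes the same single training example (input = 2, true answer = 10) and the half-squared error; `slope` and `update` are hypothetical helper names.

```python
def slope(weight, x=2.0, target=10.0):
    """Slope of the half-squared error for the single training example."""
    return x * (weight * x - target)


def update(weight, learning_rate):
    """One gradient descent step: new weight = old weight - learning rate * slope."""
    return weight - learning_rate * slope(weight)


weight = 1.0
learning_rate = 0.01

first_step = update(weight, learning_rate)
print(f"after 1 step: {first_step:.2f}")

for _ in range(500):
    weight = update(weight, learning_rate)
print(f"after 500 steps: {weight:.6f}")
```

Each step closes a fixed fraction of the remaining gap to 5, so early steps move the weight the most and later steps make ever smaller corrections.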

Three weights in your project

Same loop — three slopes instead of one:

predicted brightness = w0 + w1 × x + w2 × y

Each of w0, w1, w2 gets its own slope and its own update every pass through the data.
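
A minimal sketch of the three-weight loop in plain Python follows. The data is made up: a 5 × 5 grid of pixel positions scaled to 0–1, with brightness generated from hypothetical "true" weights so the result can be checked. The learning rate of 0.1 and the 2000 passes are also assumptions chosen for this toy data, not values from the project.

```python
# Made-up training data: each row is [1, x, y] with x and y scaled to 0-1.
TRUE_WEIGHTS = [0.2, 0.5, 0.3]  # hypothetical values used only to generate targets
rows = [[1.0, col / 4, row / 4] for row in range(5) for col in range(5)]
targets = [sum(w * f for w, f in zip(TRUE_WEIGHTS, r)) for r in rows]


def predict(weights, features):
    """predicted brightness = w0 * 1 + w1 * x + w2 * y"""
    return sum(w * f for w, f in zip(weights, features))


def mean_squared_error(weights):
    """Average squared difference between predictions and true brightness."""
    return sum((predict(weights, r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def slopes(weights):
    """One slope per weight: the average of 2 * (prediction - true) * feature."""
    n = len(rows)
    result = [0.0, 0.0, 0.0]
    for r, t in zip(rows, targets):
        diff = predict(weights, r) - t
        for i, feature in enumerate(r):
            result[i] += 2 * diff * feature / n
    return result


weights = [0.0, 0.0, 0.0]
learning_rate = 0.1
history = []  # error before each update, ready to plot against epoch
for epoch in range(2000):
    history.append(mean_squared_error(weights))
    grads = slopes(weights)
    weights = [w - learning_rate * g for w, g in zip(weights, grads)]

print("learned weights:", [round(w, 3) for w in weights])
print("first and last error:", history[0], history[-1])
```

The update line is the same rule as before, applied once per weight. Because this sketch uses the plain mean squared error rather than half of it, each slope carries a factor of 2.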

In later phases, PyTorch computes the slopes automatically when you call .backward() on the loss. You still choose the learning rate and watch the error curve; those human choices remain.

Learning rate — the knob that controls step size

Figure: Three learning rates. Too small: slow progress. Just right: smooth drop. Too large: bounce or explode.

What you observe	Likely cause
Error barely moves after many steps	Learning rate too small — steps too tiny
Error jumps, spikes, or becomes NaN	Learning rate too large — overshooting the valley
Smooth downward curve that flattens	Learning rate in a reasonable range
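
The three rows of the table can be reproduced on the one-weight example from earlier (input = 2, true answer = 10, half-squared error). The learning rates 0.001, 0.1, and 0.6 are assumptions picked to show each behavior on this toy problem; other problems will need different values.

```python
def loss(weight, x=2.0, target=10.0):
    """Half-squared error for the single training example."""
    return 0.5 * (target - weight * x) ** 2


def slope(weight, x=2.0, target=10.0):
    """Slope of the half-squared error with respect to the weight."""
    return x * (weight * x - target)


def run(learning_rate, steps=20, start=1.0):
    """Return the error at the start and after each update step."""
    weight = start
    errors = [loss(weight)]
    for _ in range(steps):
        weight = weight - learning_rate * slope(weight)
        errors.append(loss(weight))
    return errors


for name, lr in [("too small", 0.001), ("reasonable", 0.1), ("too large", 0.6)]:
    errors = run(lr)
    print(f"{name:>10} (lr={lr}): start={errors[0]:.3g}, end={errors[-1]:.3g}")
```

With the smallest rate the error barely moves in 20 steps, with the middle rate it drops close to zero, and with the largest rate every step overshoots further and the error grows.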

There is no universal perfect value. Practitioners plot error vs epoch (one epoch is one full pass through the data), then adjust. Normalizing the data (scaling x, y, and brightness to 0–1) often makes tuning easier; your project lesson mentions this.
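
One common way to bring values into the 0–1 range is min-max scaling, sketched below. The helper name `scale_to_unit_range` and the sample values are hypothetical, and the project lesson may use a different scheme (for example, dividing by a known maximum).

```python
def scale_to_unit_range(values):
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]  # all values identical: nothing to spread out
    return [(v - lowest) / (highest - lowest) for v in values]


# Hypothetical pixel columns and 8-bit brightness values.
x_positions = [0, 64, 128, 255]
brightness = [12, 200, 90, 255]

print(scale_to_unit_range(x_positions))
print(scale_to_unit_range(brightness))
```

When every column has a similar range, the slopes for the different weights tend to have similar sizes, so a single learning rate is more likely to suit all of them.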

Exercise: Error over four steps: 2.0 → 0.1 → 5.0 → 20. What happened?

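The error first dropped (2.0 → 0.1), then grew with every step. The updates overshot the bottom of the valley by more each time, so the learning rate is too high.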
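
The same diagnosis can be written as a rough rule of thumb in code. The function `diagnose` and its 1% improvement threshold are hypothetical; real error curves are noisier, and a flat curve can also mean training has already converged, so treat this as a sketch rather than a reliable detector.

```python
import math


def diagnose(errors, min_improvement=0.01):
    """Very rough heuristic mapping an error history to a likely learning rate problem."""
    first, last = errors[0], errors[-1]
    blew_up = any(math.isnan(e) or math.isinf(e) for e in errors)
    if blew_up or last > first:
        return "learning rate likely too large"
    if first > 0 and (first - last) / first < min_improvement:
        return "learning rate likely too small"
    return "learning rate in a reasonable range"


print(diagnose([2.0, 0.1, 5.0, 20.0]))      # the exercise sequence
print(diagnose([2.0, 1.999, 1.998, 1.997]))  # barely moving
print(diagnose([2.0, 1.2, 0.7, 0.5]))        # steady drop
```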

Connect to your upcoming project

You will:

Build a table of pixels — each row [1, x, y] and a true brightness.
Start with weights at zero (or small random values).
Loop: predict → compute average squared error → compute slopes → update weights.
Plot error vs epoch — you should see a downward trend like the figure above.

When that curve drops, you are watching learning happen — the same core process used in models with billions of parameters, just with three knobs instead of billions.

Summary

Piece	Role
Error / loss	One number: how wrong are we on average?
Slope / gradient	Which way is uphill for each weight?
Update rule	new weight = old weight − learning rate × slope
Learning rate	Step size — too small crawls, too large explodes

What's next

Module 1 quiz and review — pause and check understanding before writing code.