Derivatives and gradient descent — how learning works
Before we begin
You now know how data becomes numbers, how models compare lists, and why training averages over noisy samples. One question remains — the question that defines learning:
How does the model know which way to adjust its weights to get better?
You could try random guesses, but that stops working once a model has more than a few weights. Instead, almost all training uses gradient descent:
- Measure how wrong the model is (error / loss).
- Measure which direction increases that error for each weight (slope).
- Move each weight a small step in the opposite direction.
- Repeat.
Derivative here means slope — how error changes when you nudge one weight. Gradient means “all the slopes at once” when you have many weights.
This lesson stays with one weight until the loop is crystal clear. Your project uses three weights — same loop, three slopes.
Figure: Walking downhill on a loss curve
What you will learn
- Describe slope as “which way is uphill on the error graph.”
- Read slope sign: increase or decrease the weight?
- State the update rule in plain English.
- Diagnose learning rate problems from how the error curve behaves.
Slope in everyday language
Picture a simple graph:
- Horizontal axis — one weight (one knob the model can turn).
- Vertical axis — error (how wrong predictions are).
The graph often looks like a bowl or valley — high error on the sides, lower error near the bottom.
- Uphill — turning the knob increases error (bad direction).
- Downhill — turning the knob decreases error (good direction).
The slope at your current knob setting tells you the tilt:
| Slope sign | What it means | What to try |
|---|---|---|
| Positive | Error rises if you increase the weight | Decrease the weight |
| Negative | Error falls if you increase the weight | Increase the weight |
| Zero | Flat — you may be at the bottom (or on a flat plateau) | Stop or adjust carefully |
Analogy: You are hiking in fog. You cannot see the valley, but you can feel the ground tilt. Slope is that tilt. You want to step downhill, not uphill.
With many weights, each knob has its own slope. Together they form the gradient — a list of slopes, one per weight.
Tiny numeric example (follow every step)
Model: prediction = weight × input
One training example: input = 2, true answer = 10
So the correct weight is 5, because 5 × 2 = 10.
Error for this example: take the difference (true − prediction), square it, and (for convenience in the math) use half of that square. You do not need to derive why half is used: it cancels the factor of 2 that appears when you take the slope of a square, which keeps the numbers neat.
At weight = 1
- Prediction = 1 × 2 = 2
- Difference = 10 − 2 = 8 (very wrong)
- Squared error = 64, so half of the squared error = 32
The slope at weight = 1 is −(difference) × input = −8 × 2 = -16 (negative).
Negative slope → increasing the weight reduces error → you should increase the weight.
At weight = 5
- Prediction = 5 × 2 = 10
- Difference = 0 — perfect
- Slope = 0 — flat bottom of the bowl
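The numbers above can be checked with a few lines of Python. This is a minimal sketch, assuming the half squared error described earlier; the function names (`half_squared_error`, `slope`, `nudge_slope`) are made up for illustration. It computes the slope two ways: with the exact formula, and by literally nudging the weight a tiny amount in each direction and watching how the error changes.

```python
def half_squared_error(weight, x=2.0, target=10.0):
    """Half of the squared difference between the true answer and the prediction."""
    prediction = weight * x
    return 0.5 * (target - prediction) ** 2


def slope(weight, x=2.0, target=10.0):
    """Exact slope of half_squared_error with respect to the weight."""
    return (weight * x - target) * x


def nudge_slope(weight, nudge=1e-6):
    """Estimate the slope by nudging the weight slightly up and slightly down."""
    up = half_squared_error(weight + nudge)
    down = half_squared_error(weight - nudge)
    return (up - down) / (2 * nudge)


for w in (1.0, 5.0):
    print("weight:", w)
    print("  error:", half_squared_error(w))
    print("  exact slope:", slope(w))
    print("  nudged slope:", nudge_slope(w))
```

The two slope values should agree closely at both weights, which is a handy sanity check whenever you write a slope formula by hand.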
Checkpoint: Slope negative — increase or decrease weight?
Increase the weight.
The gradient descent loop (the training algorithm)
Pick a learning rate — how big each step is. Start small (for example 0.01) if unsure.
Repeat until error stops improving:
- Run the model on your data → get predictions.
- Compute average error (MSE from the last lesson).
- Compute slope for each weight — how error changes if that weight goes up.
- Update each weight:
new weight = old weight − learning rate × slope
The minus sign is the whole trick. Slope points uphill; you step downhill.
One manual step (weight = 1, slope = -16, learning rate = 0.01)
new weight = 1 − 0.01 × (-16) = 1 + 0.16 = 1.16
Still far from 5, but closer. Repeat hundreds of times and you approach the best weight.
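To see the repetition in action, here is a minimal sketch of the loop for the one-weight example (input = 2, true answer = 10), assuming the half squared error from the worked example. The step count of 300 and the variable names are arbitrary choices for illustration.

```python
def slope(weight, x=2.0, target=10.0):
    """Slope of the half squared error with respect to the weight."""
    return (weight * x - target) * x


weight = 1.0          # starting guess
learning_rate = 0.01  # step size
history = []          # weight after each update, for inspection

for step in range(300):
    # The update rule: step against the slope.
    weight = weight - learning_rate * slope(weight)
    history.append(weight)

print("after 1 step:   ", history[0])
print("after 10 steps: ", history[9])
print("after 300 steps:", history[-1])
```

The first update reproduces the manual step, and the later values creep toward 5. Notice that the steps shrink on their own as the slope flattens near the bottom of the bowl.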
Three weights in your project
Same loop — three slopes instead of one:
predicted brightness = w0 + w1 × x + w2 × y
Each of w0, w1, w2 gets its own slope and its own update every pass through the data.
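One update with three weights might look like the sketch below. The four data rows are invented for illustration, and the error is assumed to be the half squared error averaged over the rows (with plain MSE, each slope would simply be twice as large). The names `predict` and `slopes` are hypothetical helpers, not part of the project code.

```python
def predict(weights, x, y):
    """predicted brightness = w0 + w1 * x + w2 * y"""
    w0, w1, w2 = weights
    return w0 + w1 * x + w2 * y


def slopes(weights, rows):
    """Average slope for each weight; rows holds (x, y, true_brightness) tuples."""
    g0 = g1 = g2 = 0.0
    for x, y, target in rows:
        difference = predict(weights, x, y) - target
        g0 += difference * 1.0  # w0 multiplies a constant 1
        g1 += difference * x    # w1 multiplies x
        g2 += difference * y    # w2 multiplies y
    n = len(rows)
    return [g0 / n, g1 / n, g2 / n]


# Made-up pixels: (x, y, true brightness), all scaled to 0-1.
rows = [(0.0, 0.0, 0.2), (1.0, 0.0, 0.7), (0.0, 1.0, 0.5), (1.0, 1.0, 1.0)]
weights = [0.0, 0.0, 0.0]
learning_rate = 0.1

gradient = slopes(weights, rows)
weights = [w - learning_rate * g for w, g in zip(weights, gradient)]

print("slopes: ", gradient)
print("weights:", weights)
```

With all weights at zero, every prediction is too dark, so all three slopes come out negative and all three weights move up, exactly as the slope-sign table predicts.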
Later phases: PyTorch computes slopes automatically with .backward(). You still choose learning rate and watch the error curve — those human choices remain.
Learning rate — the knob that controls step size
Figure: Three learning rates
| What you observe | Likely cause |
|---|---|
| Error barely moves after many steps | Learning rate too small — steps too tiny |
| Error jumps, spikes, or becomes NaN | Learning rate too large — overshooting the valley |
| Smooth downward curve that flattens | Learning rate in a reasonable range |
There is no universal perfect value. Practitioners plot error vs epoch, then adjust. Normalizing inputs (x, y, brightness to 0–1) often makes tuning easier — your project lesson mentions this.
Exercise: Error over four steps: 2.0 → 0.1 → 5.0 → 20. What happened?
The optimizer overshot the bottom — learning rate too high.
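The three rows of the table can be reproduced on the one-weight example from earlier (input = 2, true answer = 10). This is a sketch under stated assumptions: the specific learning rates 0.0001, 0.1, and 0.6 were picked to show each behavior on this toy problem only, and the thresholds will differ for other data and models.

```python
def run(learning_rate, steps=20, weight=1.0, x=2.0, target=10.0):
    """Return the half squared error recorded before each update."""
    errors = []
    for _ in range(steps):
        errors.append(0.5 * (target - weight * x) ** 2)
        weight -= learning_rate * (weight * x - target) * x
    return errors


for lr in (0.0001, 0.1, 0.6):
    errors = run(lr)
    first_few = [round(e, 3) for e in errors[:4]]
    print("learning rate", lr, "first errors", first_few, "last error", errors[-1])
```

The smallest rate leaves the error almost unchanged after 20 steps, the middle one drives it smoothly toward zero, and the largest one makes the error grow with every step because each update jumps past the bottom of the bowl and lands higher on the other side.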
Connect to your upcoming project
You will:
- Build a table of pixels: each row is [1, x, y] plus a true brightness.
- Start with weights at zero (or small random values).
- Loop: predict → compute average squared error → compute slopes → update weights.
- Plot error vs epoch — you should see a downward trend like the figure above.
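Those steps can be sketched end to end in plain Python. Everything specific here is an assumption for illustration: the 4×4 image, the made-up brightness pattern (0.2 + 0.5 × x + 0.3 × y), the learning rate of 0.1, and the 1000 epochs. The actual project may structure this differently. The sketch prints a few error values instead of plotting them.

```python
# Build the table: one row [1, x, y] per pixel, with x and y scaled to 0-1.
size = 4
rows, targets = [], []
for row_index in range(size):
    for col_index in range(size):
        x = col_index / (size - 1)
        y = row_index / (size - 1)
        rows.append([1.0, x, y])
        targets.append(0.2 + 0.5 * x + 0.3 * y)  # hypothetical true brightness

weights = [0.0, 0.0, 0.0]
learning_rate = 0.1
error_history = []

for epoch in range(1000):
    # Predict: each prediction is w0 * 1 + w1 * x + w2 * y.
    predictions = [sum(w * f for w, f in zip(weights, row)) for row in rows]
    differences = [p - t for p, t in zip(predictions, targets)]

    # Average squared error (MSE) for this epoch.
    error_history.append(sum(d * d for d in differences) / len(rows))

    # Slope of the MSE for each of the three weights.
    slopes = [
        2 * sum(d * row[j] for d, row in zip(differences, rows)) / len(rows)
        for j in range(3)
    ]

    # Update every weight by stepping against its slope.
    weights = [w - learning_rate * s for w, s in zip(weights, slopes)]

print("error at epochs 0, 10, 100, 999:",
      [error_history[i] for i in (0, 10, 100, 999)])
print("learned weights:", weights)
```

Because the made-up brightness is an exact linear pattern, the learned weights should settle near 0.2, 0.5, and 0.3. Real image data will not fit perfectly, so expect the error to flatten at some value above zero.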
When that curve drops, you are watching learning happen — the same core process used in models with billions of parameters, just with three knobs instead of billions.
Summary
| Piece | Role |
|---|---|
| Error / loss | One number: how wrong are we on average? |
| Slope / gradient | Which way is uphill for each weight? |
| Update rule | new weight = old weight − learning rate × slope |
| Learning rate | Step size — too small crawls, too large explodes |
What's next
Module 1 quiz and review — pause and check understanding before writing code.