Probability — when measurements lie a little

Before we begin

Take two photos of the same white wall, one second apart, same phone, same settings. They look identical. Now zoom in until you can read individual pixel values: they will not match exactly. A pixel that reads 142 in the first photo might read 139 or 145 in the second.

That is not a bug in your eyes. Every measurement carries noise — from light, from the sensor, from compression. If you treat a single pixel as absolute truth, you will misunderstand both photography and machine learning.

Probability (used lightly here) gives us language for uncertainty:

Average — what value we expect over many repeats
Spread — how much individual readings jump around
Noise model — a simple story: “true value + random wobble”

Models rarely see one perfect number. They see many noisy examples and learn patterns that stay true even when individual pixels lie a little.

Figure: Same scene, slightly different numbers every time. Each pixel can be written as true brightness plus random noise. More noise means a grainier look.

What you will learn

Describe a pixel reading as signal + noise in plain English.
Compute a simple average from a table of chances.
Explain spread and why grainy photos have more of it.
Connect averaging error over many samples to how models train.

Before this lesson

Lesson 2 — Dot products

Random does not mean “anything goes”

We say a pixel value is random when it can change between trials even if the scene is fixed. Random here does not mean “completely unpredictable” — it means “follows a pattern of likelihoods.”

Example readings at the same wall pixel:

Photo 1: 142
Photo 2: 139
Photo 3: 145

We might summarize: “usually near 140, rarely below 120 or above 160.” That summary is a distribution — a description of which values are common and which are rare.

Checkpoint: Why can’t you trust a single pixel as perfect ground truth?

Sensors, exposure, and processing all add variation. One sample is informative but not exact.

Average and spread: a toy sensor story

Imagine a broken sensor that only outputs 0 (black) or 255 (white), nothing in between:

Value	Chance
0	30%
255	70%

Average (expected value)

If you took millions of readings, what single number would they cluster around?

0 × 30% + 255 × 70% = 178.5

We call 178.5 the average or expected value. Individual readings are still only 0 or 255 — extreme — but the long-run center is 178.5.
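The same arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not part of the lesson's project code: the names chances, expected, and readings are made up here, and the seed and the number of simulated readings are arbitrary choices.

```python
import random

# Toy sensor from the table above: value -> chance (30% written as 0.30).
chances = {0: 0.30, 255: 0.70}

# Expected value: multiply each value by its chance, then add the results.
expected = sum(value * chance for value, chance in chances.items())
print(f"expected value: {expected:.1f}")  # 178.5

# Simulate many readings to watch the long-run average settle near that number.
rng = random.Random(0)  # fixed seed (arbitrary) so reruns give the same readings
readings = rng.choices(list(chances), weights=list(chances.values()), k=100_000)
simulated_average = sum(readings) / len(readings)
print(f"average of {len(readings)} simulated readings: {simulated_average:.1f}")
```

Every simulated reading is still 0 or 255; only their average lands near 178.5.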

Spread

Spread asks: how far do typical readings sit from that average?

Here spread is huge — values jump between extremes. A stable sensor might read 140, 141, 139, 142 — low spread. A grainy night photo might swing more — high spread.

Tools compute standard deviation as one measure of spread; you do not need the formula today. Remember the idea: low spread = trustworthy individual readings; high spread = noisy data.
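If you are curious what such a tool call looks like, here is a small optional sketch. It assumes Python's standard statistics module as the "tool" and compares the stable readings above with the toy 0-or-255 sensor. The variable names are illustrative only.

```python
import math
import statistics

# Stable sensor readings quoted in the text.
stable_readings = [140, 141, 139, 142]
stable_spread = statistics.pstdev(stable_readings)  # population standard deviation

# Toy sensor: value -> chance, same table as before.
chances = {0: 0.30, 255: 0.70}
mean = sum(value * chance for value, chance in chances.items())

# Chance-weighted average of squared distances from the mean, then a square root.
variance = sum(chance * (value - mean) ** 2 for value, chance in chances.items())
toy_spread = math.sqrt(variance)

print(f"stable sensor spread: {stable_spread:.2f}")  # about 1.1
print(f"toy sensor spread:    {toy_spread:.2f}")     # about 117
```

The toy sensor's spread is roughly a hundred times larger, which matches the intuition that its readings jump between extremes.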

Bell-shaped noise (the usual wobble)

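In many real systems, the noise follows a bell curve: plot how often each size of error occurs and you get a hump centered on zero. The noise model itself is simple: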

measurement = true value + random noise

Read it aloud: “What we record equals the real brightness, plus a small random error.”

Most errors are small (near zero).
Large positive or negative errors are rarer.

That is the bell shape — many small wobbles, few big ones.
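One way to see this for yourself is to simulate it. The sketch below assumes a true brightness of 140 and a noise level (standard deviation) of 2; both numbers are made up for illustration, as are the seed and the sample count. It uses random.gauss from Python's standard library to draw bell-shaped noise.

```python
import random

rng = random.Random(42)   # fixed seed (arbitrary) so reruns match
true_value = 140.0        # assumed true brightness
noise_sd = 2.0            # assumed noise level (standard deviation)

# measurement = true value + random noise
readings = [true_value + rng.gauss(0.0, noise_sd) for _ in range(10_000)]
errors = [reading - true_value for reading in readings]

# Fraction of errors that are small (within one noise_sd of zero)
# and fraction that are large (more than three noise_sd away).
small = sum(abs(e) <= noise_sd for e in errors) / len(errors)
large = sum(abs(e) > 3 * noise_sd for e in errors) / len(errors)

print(f"small errors: {small:.1%}")
print(f"large errors: {large:.1%}")
```

With bell-shaped noise you should see roughly two thirds of the errors counted as small and well under 1% counted as large.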

Grain in dark photos (why beginners notice noise at night)

In dim light, each pixel collects fewer photons. Relative to that weak signal, random error looks larger — the image looks grainy. Bright, well-exposed regions often look cleaner because the signal is stronger compared to the noise. Same math story, different feel in the image.
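A rough way to put numbers on this is sketched below. It makes one simplifying assumption that you should treat as hypothetical: the noise level is the same in both regions (real sensors are more complicated). The brightness values and the noise level are invented for illustration.

```python
# Assumed noise level, identical in both regions (a simplification).
noise_sd = 5.0

# Hypothetical average brightness in a dim region and a bright region.
regions = {"dim": 20.0, "bright": 200.0}

# Noise relative to signal: the bigger this fraction, the grainier the region looks.
relative_noise = {name: noise_sd / signal for name, signal in regions.items()}

for name, ratio in relative_noise.items():
    print(f"{name}: noise is {ratio:.1%} of the signal")
```

The wobble is the same size in both regions, but it is a much larger share of the dim signal.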

Histograms: see the distribution without formulas

A histogram counts how often each brightness appears in a patch:

A tall bar at 140 → many pixels near that brightness (flat gray wall).
A wide histogram → many different values (high contrast or heavy noise).
A narrow histogram → values clustered tightly (uniform region or blur).

Histograms let your eyes see average and spread without calculating them — useful when debugging datasets.
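Here is a minimal sketch of a text histogram, in case you want to try one before reaching for a plotting library. The patch is simulated, not real image data: it assumes 500 pixels of a gray wall with true brightness 140 and a noise level of 3. The bin width and bar scale are arbitrary choices.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed (arbitrary) so reruns match

# Simulated patch: 500 wall pixels, assumed true brightness 140, assumed noise sd 3.
patch = [round(140 + rng.gauss(0, 3)) for _ in range(500)]

# Group brightness values into bins of width 2 and count pixels per bin.
bin_width = 2
counts = Counter((value // bin_width) * bin_width for value in patch)

# One row per bin; each '#' stands for 5 pixels.
for start in sorted(counts):
    bar = "#" * (counts[start] // 5)
    print(f"{start:3d}-{start + bin_width - 1:3d} | {bar}")
```

The longest bars should sit near 140, and the bars should shrink quickly on both sides. Raising the assumed noise level makes the histogram wider.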

Why training uses average error

Call the brightness a model predicts “predicted” and the real brightness “true”. One pixel’s squared error is:

(true − predicted)²

Squaring makes big mistakes count more than small ones — a reasonable choice when large errors are worse.

If you stopped at one pixel, noise would dominate: the model might chase random wobble instead of real structure. So training uses the average over many pixels and many images:

average error = (sum of all squared errors) ÷ (number of samples)

This is often called mean squared error (MSE).
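The formula can be written out as a short function. This is a sketch, not the code your project uses: the function name mean_squared_error and the sample values are made up for illustration.

```python
def mean_squared_error(true_values, predicted_values):
    """Average of (true - predicted)^2 over all samples."""
    if len(true_values) != len(predicted_values):
        raise ValueError("both lists must have the same length")
    if not true_values:
        raise ValueError("need at least one sample")
    squared_errors = [
        (true - predicted) ** 2
        for true, predicted in zip(true_values, predicted_values)
    ]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical true and predicted brightness for four pixels.
true_values = [140, 141, 139, 142]
predicted_values = [141, 141, 136, 142]

mse = mean_squared_error(true_values, predicted_values)
print(mse)  # 2.5
```

The errors here are -1, 0, 3, and 0. After squaring, the single error of 3 contributes 9 of the total 10, which shows how squaring makes big mistakes count more.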

Averaging:

Smooths random noise — errors partly cancel instead of steering the model randomly.
Gives one number to improve each step — “how wrong are we overall?”
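To see the smoothing effect, the sketch below measures the error of the same prediction twice: once on a single noisy pixel, once averaged over 10,000 noisy pixels. Everything in it is an assumption for illustration: the true brightness of 140, the noise level of 2, a prediction that matches the true brightness exactly, and the helper names.

```python
import random

rng = random.Random(1)   # fixed seed (arbitrary) so reruns match
true_brightness = 140.0  # assumed true value
noise_sd = 2.0           # assumed noise level
prediction = 140.0       # a prediction that matches the true brightness


def noisy_reading():
    # measurement = true value + random noise
    return true_brightness + rng.gauss(0.0, noise_sd)


def average_squared_error(n):
    # Mean squared error of the fixed prediction against n noisy readings.
    return sum((noisy_reading() - prediction) ** 2 for _ in range(n)) / n


# Repeat each measurement five times to see how much the number jumps around.
single_pixel = [average_squared_error(1) for _ in range(5)]
many_pixels = [average_squared_error(10_000) for _ in range(5)]

print("one pixel:    ", [round(e, 2) for e in single_pixel])
print("10,000 pixels:", [round(e, 2) for e in many_pixels])
```

The single-pixel numbers jump around from run to run, while the averaged numbers stay close to one another. A steady number like that is a far better signal to steer training with.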

Your Module 1 project plots this average error over training steps. When the curve goes down, the model is getting better across all the samples in the average, which is evidence that it is learning the pattern rather than chasing one noisy pixel.

Summary

Term	Plain meaning
Random variable	A number that can change between repeated measurements
Average	Long-run typical value
Spread	How much individual values scatter
Bell-shaped noise	Small errors common, large errors rare
Mean squared error	Average squared difference between prediction and truth

What's next

Derivatives and gradient descent: how learning works. You can now measure error; next you learn how models reduce it automatically.