Dot products — measuring similarity

Before we begin

In the last lesson, you learned that a patch of an image can become a list of numbers — a vector. The next question is one of the most important in all of machine learning:

“How similar are these two lists?”

If two patches look alike, their lists should score high on some similarity measure. If they look different, the score should be low. The dot product is the simplest such measure — and despite its simplicity, it appears everywhere: template matching, linear classifiers, image filters, and (much later) attention in large language models.

This lesson works through three layers:

How to compute it — multiply matching pairs, add the results.
What the result means — alignment and similarity in plain language.
Where it is used — finding patterns in images and making yes/no decisions.

Figure: Dot product = how much two lists line up

When two lists point in a similar direction, the dot product is larger. Think of it as a similarity score.

What you will learn

Compute a dot product by hand on small lists.
Explain the result as “how much two patterns rise and fall together.”
Understand brightness bias — why raw dot products can mislead you on images.
Use cosine similarity as a fairer comparison when brightness changes.
See how dot products become classifier scores and filter outputs.

Before this lesson

Lesson 1 — Vectors, matrices, and image data

The recipe: multiply pairs, then add

Suppose two lists have the same length (same number of entries):

List a: [a1, a2, a3, …]
List b: [b1, b2, b3, …]

The dot product combines them like this:

a1×b1 + a2×b2 + a3×b3 + …

That is the whole mechanical recipe. No magic — just multiply each aligned pair and sum.

Worked example 1

[1, 2, 3] dot [4, 5, 6]

1×4 = 4
2×5 = 10
3×6 = 18
Sum: 4 + 10 + 18 = 32

Worked example 2 (try this yourself)

[2, 0, 1] dot [3, 4, 5]

2×3 = 6
0×4 = 0
1×5 = 5
Sum: 6 + 0 + 5 = 11
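
The recipe and both worked examples can be checked in a few lines of Python. This is a minimal sketch; the function name `dot` is chosen here for illustration and does not come from any library.

```python
def dot(a, b):
    """Multiply aligned pairs, then add the products."""
    total = 0
    for x, y in zip(a, b):
        total += x * y
    return total


print(dot([1, 2, 3], [4, 5, 6]))  # 4 + 10 + 18 = 32
print(dot([2, 0, 1], [3, 4, 5]))  # 6 + 0 + 5 = 11
```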

Why length must match

You cannot dot a 6-number list with a 48-number list: 42 entries of the longer list would have no partner. Length mismatch in code is the same class of bug as flattening with the wrong shape: the operation is undefined.
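
One caveat when writing this in Python: the built-in `zip` stops at the end of the shorter list without complaint, so a plain loop would quietly return a number for mismatched lengths. A hedged sketch of a stricter version (the name `checked_dot` is hypothetical) makes the failure visible instead:

```python
def checked_dot(a, b):
    """Dot product that refuses lists of different lengths."""
    if len(a) != len(b):
        raise ValueError(f"length mismatch: {len(a)} vs {len(b)}")
    return sum(x * y for x, y in zip(a, b))


try:
    checked_dot([1] * 6, [1] * 48)
except ValueError as err:
    print(err)  # length mismatch: 6 vs 48
```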

What the number actually tells you

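Imagine two flattened patches from an image. Each list encodes a pattern of brighter and darker values. The table below assumes the values are centered around zero (for example, by subtracting each patch's mean); raw pixel values are never negative, so an uncentered dot product cannot drop below zero.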

Dot product result	Plain English
Large positive	Where one list is high, the other tends to be high too — patterns move together
Near zero	No simple linear relationship — patterns do not line up
Negative	Where one is high, the other tends to be low — opposite trends

Intuition: If two patches show the same edge or gradient direction, their lists often produce a larger dot product than two random patches.
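
The three rows of the table can be reproduced with small made-up lists. The values below are illustrative only, and they are centered around zero so that all three signs can appear:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


pattern = [1, -1, 1, -1]

print(dot(pattern, [2, -2, 2, -2]))  # 8: rises and falls together
print(dot(pattern, [1, 1, -1, -1]))  # 0: no alignment
print(dot(pattern, [-1, 1, -1, 1]))  # -4: opposite trend
```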

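Checkpoint: Take the dot product of a list with itself. Is the result large or small?

It is the sum of the squared entries, so it is never negative, and it is zero only when every value is zero (an all-black patch). For any patch with bright pixels it is large and positive.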

Finding a pattern in an image (template matching)

Suppose you save a small template — a corner, a logo, a fingerprint pattern — as a list t. You slide a window across a larger image. At each position, you flatten the window into list p and compute t dot p.

High score → “this region looks like the template.”
Low score → “probably not a match.”

Face unlock, document scanners, and quality-control cameras use variations of this idea. They rarely show you the dot product directly, but the math underneath is the same family.
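
The sliding-window idea can be sketched on a tiny made-up image. This is only an illustration of raw dot-product matching, not how any particular product works; the helper names `flatten` and `match_scores` are assumptions introduced here.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def flatten(rows):
    """Turn a list of rows into one flat list."""
    return [value for row in rows for value in row]


def match_scores(image, template):
    """Slide the template over the image; return {(row, col): score}."""
    th, tw = len(template), len(template[0])
    t = flatten(template)
    scores = {}
    for r in range(len(image) - th + 1):
        for c in range(len(image[0]) - tw + 1):
            # Cut out the window whose top-left corner is (r, c).
            window = [row[c:c + tw] for row in image[r:r + th]]
            scores[(r, c)] = dot(t, flatten(window))
    return scores


# Made-up 4x4 image: a bright diagonal pair sits at rows 1-2, cols 1-2.
image = [
    [0, 0, 0, 0],
    [0, 9, 0, 0],
    [0, 0, 9, 0],
    [0, 0, 0, 0],
]
template = [
    [1, 0],
    [0, 1],
]

scores = match_scores(image, template)
best = max(scores, key=scores.get)
print(best, scores[best])  # (1, 1) 18
```

The highest score lands on the window whose top-left corner is at row 1, column 1, which is where the diagonal pattern sits.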

The brightness problem (important)

Raw dot products are sensitive to overall brightness. If patch p is twice as bright as patch q but has the same pattern, every number in p is roughly doubled — and t dot p can be roughly twice t dot q even though the shapes match.

That is unfair if you care about pattern, not brightness. The fix is cosine similarity (next section).
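
A quick numeric check of the brightness effect, using made-up values for the template and the patch:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


t = [1, 0, 1, 0]                # template
q = [3, 1, 3, 1]                # made-up patch
p = [2 * value for value in q]  # same pattern, twice as bright

print(dot(t, q))  # 6
print(dot(t, p))  # 12
```

Nothing about the pattern changed between q and p, yet the score doubled.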

Cosine similarity: compare shape, not brightness

Cosine similarity adjusts for how “long” each list is:

cosine similarity = dot product ÷ (length of a × length of b)

Length of a list (for example [3, 4]):

Square each number and add: 3² + 4² = 9 + 16 = 25
Take the square root: √25 = 5

So [3, 4] has length 5.
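
The same two steps in Python, as a small sketch (the function name `length` is chosen here for illustration):

```python
import math


def length(a):
    """Euclidean length: square each entry, add, take the square root."""
    return math.sqrt(sum(x * x for x in a))


print(length([3, 4]))  # 5.0
```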

Results always fall between -1 and 1 (the score is undefined when either list is all zeros):

Close to 1 → lists point the same direction (similar pattern)
Close to 0 → unrelated
Close to -1 → opposite patterns

Worked comparison

Patch A = [1, 1, 1, 1]
Patch B = [2, 2, 2, 2] (same pattern, twice as bright)
Template T = [1, 0, 1, 0]

Raw dot: T dot A = 2 and T dot B = 4. B’s score is twice A’s, which is brightness bias.
Cosine similarity with T: about 0.707 for both A and B, so pattern shape wins.

When comparing image patches, cosine similarity (or a related normalized score) is often more honest than a raw dot product.
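
The worked comparison can be verified with a short sketch. The function name `cosine_similarity` is an assumption made for this example, and the zero-length check is one possible way to handle all-black patches, not the only one.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the two lengths."""
    length_a = math.sqrt(dot(a, a))
    length_b = math.sqrt(dot(b, b))
    if length_a == 0 or length_b == 0:
        raise ValueError("cosine similarity is undefined for an all-zero list")
    return dot(a, b) / (length_a * length_b)


A = [1, 1, 1, 1]
B = [2, 2, 2, 2]
T = [1, 0, 1, 0]

print(dot(T, A), dot(T, B))               # 2 4
print(round(cosine_similarity(T, A), 3))  # 0.707
print(round(cosine_similarity(T, B), 3))  # 0.707
```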

Dot products in classification

A linear classifier often decides using:

score = (weights dot features) + bias

features — your input list (flattened patch, pixel values, measurements)
weights — learned importance of each feature (positive weight = “this feature pushes toward yes”)
bias — a constant nudge up or down

If score > 0 → predict class A; if score < 0 → predict class B (simplified story). Training (later phases) finds good weights from labeled examples. The decision rule is still multiply-and-add — a dot product.

Example in words: if dark corners in a patch push toward “indoor” and bright sky pushes toward “outdoor,” weights encode those tendencies. New patch → compute score → pick a label.
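
A minimal sketch of that decision rule follows. The weights, bias, and feature values are hypothetical numbers picked for illustration, not learned from data, and sending a score of exactly 0 to class B is an arbitrary choice made here.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def predict(weights, features, bias):
    """Return the score and the label chosen by its sign."""
    score = dot(weights, features) + bias
    # A score of exactly 0 falls to class B here (arbitrary tie rule).
    label = "class A" if score > 0 else "class B"
    return score, label


# Hypothetical numbers, not learned from data.
weights = [0.5, -1.0, 2.0]
bias = -1.0

print(predict(weights, [2, 1, 1], bias))  # (1.0, 'class A')
print(predict(weights, [1, 2, 0], bias))  # (-2.5, 'class B')
```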

Link to image filters

When you blur or sharpen a photo, each output pixel is often computed from a small neighborhood of surrounding pixels. That computation is frequently a dot product between:

a fixed kernel (list of weights like a small matrix), and
the neighborhood’s pixel values (as a list)

Convolutional neural networks learn those weight lists from data instead of relying on hand-designed ones, but each operation is still, at its core, a weighted sum: a dot product.
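
To make the filter connection concrete, here is a hedged sketch of one output pixel of a 3x3 box blur. The helper name `apply_kernel_at` is hypothetical, the image is made up, and the code only handles pixels that are not on the image border.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def apply_kernel_at(image, kernel, r, c):
    """Output value at (r, c): kernel dot the 3x3 neighborhood around it."""
    # Read the neighborhood row by row, matching the kernel's order.
    neighborhood = [
        image[r + dr][c + dc]
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
    ]
    return dot(kernel, neighborhood)


# 3x3 box blur: nine equal weights that sum to 1 (a plain average).
box_blur = [1 / 9] * 9

image = [
    [0, 0, 0],
    [0, 9, 0],
    [0, 0, 0],
]

print(apply_kernel_at(image, box_blur, 1, 1))  # approximately 1.0
```

The single bright pixel of value 9 is averaged over nine positions, which is why the blurred output is about 1.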

Summary

Idea	Remember
Dot product	Multiply aligned pairs, add all products
Meaning	Similarity / co-movement of two lists
Raw dot	Fast but biased by overall scale (brightness)
Cosine similarity	Dot product normalized by list lengths
Classifiers	score = weights dot features + bias

What's next

Probability — when measurements lie a little — real pixel values are noisy; training averages over many samples so models learn stable patterns instead of random wobble.