Vectors, matrices, and image data

Before we begin

When you look at a photo on your phone, you see faces, colors, and memories. When a computer or an AI model looks at the same photo, it sees something else entirely: a grid of numbers.

That gap — between how humans experience images and how machines process them — is where most beginners get lost. Tutorials often jump straight to code and model names. This lesson slows down and builds the mental model first. Once you understand how a picture becomes numbers, and how those numbers become lists a model can learn from, everything later (training, PyTorch, neural networks) will feel connected instead of magical.

We introduce two ideas:

A matrix — a table of numbers arranged in rows and columns. A grayscale image is a matrix.
A vector — a single ordered list of numbers. Models often want data in this form.

We use images as the running example because you can picture them. The same patterns show up in audio, text, and sensor data later — but images make the story easy to follow.

You do not need prior linear algebra. If you can add, multiply, and read a small table, you have enough.

What you will learn

By the end of this lesson you should be able to:

Explain what a digital image actually stores under the hood (not just “a file on disk”).
Describe a matrix and a vector in your own words, with a concrete example.
Flatten a small patch into a list, count how many numbers you get, and explain why the order matters.
Predict what happens when sizes do not match in code — and why that error is a good thing.

Before this lesson

Read Welcome — start here if you have not already.
You only need comfort with basic arithmetic and reading small tables.

A picture is a grid of numbers

What you see vs what the machine stores

Zoom into any digital photo until you see tiny squares. Each square is a pixel (short for “picture element”). One pixel is the smallest piece of the image the computer controls independently.

For a grayscale (black-and-white) image, each pixel holds one number that represents brightness:

0 usually means black (or as dark as the format allows).
255 usually means white (or as bright as the format allows).
Values in between are shades of gray.

So a grayscale image is not “a picture” to the CPU — it is a rectangle of numbers. A 640×480 image contains 640 columns and 480 rows of pixels, which means 640 × 480 = 307,200 numbers. The computer never “sees” a sunset; it reads those 307,200 values and runs math on them.

Color images: three numbers per pixel

A color (RGB) image stores three numbers per pixel — how much red, how much green, and how much blue. Think of each pixel as a tiny mixture of three dimmable lights.

That same 640×480 photo in color has 640 × 480 × 3 = 921,600 numbers. Same scene, three times the data.

Image type	Numbers per pixel	Example size (640×480)
Grayscale	1 (brightness)	307,200
RGB color	3 (R, G, B)	921,600
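
The counts in the table can be checked with a few lines of Python. This is a minimal sketch; `count_values` is a hypothetical helper name chosen for illustration, not part of any library.

```python
def count_values(width, height, channels=1):
    """Return how many numbers an image of this size stores."""
    return width * height * channels

# Grayscale: one brightness value per pixel.
gray_count = count_values(640, 480)

# RGB: three values (red, green, blue) per pixel.
rgb_count = count_values(640, 480, channels=3)

print(gray_count)  # 307200
print(rgb_count)   # 921600
```

The only difference between the two calls is the channel count, which is why the color version is exactly three times larger.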

Why this matters for AI

Every model you will train reads numbers, not JPEG files. Before any learning happens, something (your code, a library, or a camera pipeline) has already turned the image into arrays of integers or decimals. If you do not know that step exists, debugging becomes guesswork.

Height and width tell you how many rows and columns are in the grid. When we arrange those numbers in a table, we call that table a matrix. You already know matrices — you just may not have used that word. A spreadsheet of pixel brightness values is a matrix.

Mini example: a 3×3 patch

Imagine a tiny patch (3 pixels wide, 3 pixels tall):

| 120 | 125 | 130 |
| 118 | 122 | 128 |
| 115 | 120 | 125 |

Each cell is one pixel’s brightness. The whole table is a 3×3 matrix (3 rows, 3 columns). A full photo is the same idea — just vastly larger.
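
One simple way to hold this patch in Python is a list of rows, where each row is itself a list. This is only a sketch using plain lists; later lessons may use a library array instead. Note that Python counts positions from 0, so the top-left pixel is row 0, column 0.

```python
# The 3x3 patch from the table above, stored as a list of rows.
patch = [
    [120, 125, 130],
    [118, 122, 128],
    [115, 120, 125],
]

num_rows = len(patch)      # how many rows the grid has
num_cols = len(patch[0])   # how many values are in the first row

top_left = patch[0][0]     # row 0, column 0
center = patch[1][1]       # row 1, column 1

print(num_rows, num_cols)  # 3 3
print(top_left, center)    # 120 122
```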

Checkpoint: In your own words, what does the computer store for one grayscale pixel?

Answer sketch: One number for brightness, typically in a range like 0–255 (or 0.0–1.0 after normalization in some pipelines).
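
The normalization mentioned in the answer can be as simple as a division. The sketch below assumes 8-bit pixels (0 to 255) and the common convention of dividing by 255; other pipelines may scale differently. `normalize` is a hypothetical helper name.

```python
def normalize(pixel_value):
    """Map an 8-bit brightness (0 to 255) onto the range 0.0 to 1.0."""
    return pixel_value / 255

darkest = normalize(0)
brightest = normalize(255)
middle = normalize(128)

print(darkest)    # 0.0
print(brightest)  # 1.0
print(0.0 < middle < 1.0)  # True
```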

What is a vector?

A vector is an ordered list of numbers. The word sounds technical, but the idea is familiar:

A GPS coordinate [latitude, longitude] is a list of two numbers.
An RGB color [255, 128, 0] is a list of three numbers.
Your last seven daily step counts form a list of seven numbers.

Order matters. [1, 2, 3] is not the same as [3, 2, 1]. If you shuffle the order when feeding data to a model, you destroy the meaning — the model assumes position 1 always means the same thing (for example “top-left pixel brightness”).

The length of a vector is simply how many numbers it contains. A list of three RGB values has length 3.
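
In Python, a plain list already behaves like the vectors described here: it has a length, and two lists with the same numbers in a different order are not equal. A small sketch:

```python
forward = [1, 2, 3]
backward = [3, 2, 1]

# Same numbers, different order: the lists are not equal.
same_order = forward == backward
print(same_order)  # False

# They only match once the order is thrown away by sorting.
same_contents = sorted(forward) == sorted(backward)
print(same_contents)  # True

# Length is just the number of entries.
rgb = [255, 128, 0]
rgb_length = len(rgb)
print(rgb_length)  # 3
```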

Why models prefer lists

Many algorithms are built to accept one fixed-length list per example. They do not naturally accept a 2D grid with arbitrary width and height. So we often convert a patch of an image into one long list — that conversion is called flattening (next section).

Think of it like filling out a form: the model expects a fixed number of boxes (say, 48) filled in a fixed order. The image patch gives you the values; flattening decides which value goes in which box.

Flattening: from grid to list

The problem flattening solves

Suppose you crop an 8×6 patch (8 pixels wide, 6 pixels tall) from a photo. That patch is a small matrix with 6 rows and 8 columns, one number per cell. But your model might expect a single list of 48 numbers, not a table with 6 rows and 8 columns.

Flattening means: pick a rule for reading the grid, then write the numbers one after another into a single list.

The usual rule is row by row, left to right, top to bottom — like reading English text.

Walkthrough: a 2×3 patch

| 10 | 20 | 30 |
| 40 | 50 | 60 |

Row 1 left to right: 10, 20, 30
Row 2 left to right: 40, 50, 60
Combined list: [10, 20, 30, 40, 50, 60] — 6 numbers

If you instead read column by column, you would get [10, 40, 20, 50, 30, 60] — a different list from the same patch. Neither is “wrong,” but you must pick one rule and never change it during training. Mixing orders would confuse the model the same way shuffling exam answer sheets would confuse a grader.
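
Both reading rules from the walkthrough can be sketched in plain Python. The function names `flatten_row_major` and `flatten_column_major` are hypothetical helpers for this illustration, and the code assumes every row has the same number of values.

```python
def flatten_row_major(grid):
    """Read each row left to right, rows top to bottom."""
    return [value for row in grid for value in row]

def flatten_column_major(grid):
    """Read each column top to bottom, columns left to right."""
    num_rows = len(grid)
    num_cols = len(grid[0])
    return [grid[r][c] for c in range(num_cols) for r in range(num_rows)]

patch = [
    [10, 20, 30],
    [40, 50, 60],
]

row_list = flatten_row_major(patch)
col_list = flatten_column_major(patch)

print(row_list)  # [10, 20, 30, 40, 50, 60]
print(col_list)  # [10, 40, 20, 50, 30, 60]
```

Both lists contain the same six numbers, but position 1 holds 20 in one and 40 in the other, which is why the rule has to stay fixed.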

The figure: 4×3 patch → 12 numbers

The diagram below uses a slightly larger patch (4 columns × 3 rows = 12 pixels). Follow the purple tags 1 through 12 on the grid, then see the same values appear in the list below.

Figure: From grid (matrix) to list (vector). Read tags 1→12 on the grid, then see the same values as one list below.

How to read the figure:

Grid (matrix) — 12 cells. Brighter cells have larger numbers (closer to white). Purple tags show read order.
Down arrow (flatten) — “copy these values in order into one list.”
List (vector) — all 12 brightness values in one place. This is what many models consume as one input example (or as part of a larger feature list).

For an 8×6 patch, the same logic gives 8 × 6 = 48 numbers. For a 16×16 grayscale patch: 256 numbers. For 16×16 RGB: 16 × 16 × 3 = 768 numbers — a common beginner mistake is to forget the ×3 for color.
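
The ×3 for color is easier to remember after flattening a tiny RGB patch by hand. The sketch below assumes each pixel is stored as a small [R, G, B] list and that the three channel values of a pixel stay next to each other in the output; some libraries order the channels differently. `flattened_length` is a hypothetical helper name.

```python
# A 2x2 RGB patch: each pixel is [red, green, blue].
rgb_patch = [
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
]

# Row by row, pixel by pixel, channel by channel.
flat = [channel for row in rgb_patch for pixel in row for channel in pixel]
print(len(flat))  # 12

def flattened_length(width, height, channels=1):
    """Predict the list length before flattening anything."""
    return width * height * channels

print(flattened_length(8, 6))       # 48
print(flattened_length(16, 16))     # 256
print(flattened_length(16, 16, 3))  # 768
```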

Checkpoint: An 8×6 grayscale patch — how many numbers after flattening?

Answer: 8 × 6 = 48.

What is a matrix? (the name for what you already saw)

We used the word matrix above without a formal definition. Here it is:

A matrix is a 2D table of numbers with rows and columns.

A grayscale image is a matrix.
A small cropped patch is a smaller matrix cut from the big one.
Later, models also store weights in matrices — tables of learned numbers that transform inputs into outputs.

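If a table has 2 rows and 3 columns, we call it a 2×3 matrix. Saying “2×3” is shorthand for shape: it tells you how many entries exist (2 × 3 = 6) and how they are arranged. Matrix shapes are written rows first, while image sizes are usually written width first, so a 640×480 photo is a matrix with 480 rows and 640 columns.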

Matrix × list (preview only)

Sometimes a matrix transforms a list into another list. Each output number is built from a weighted combination of input numbers — multiply pairs, add them up (the next lesson’s dot product is the building block).

Example — matrix [[1, 0], [0, 2]] applied to list [3, 4]:

First output: 1×3 + 0×4 = 3 (first input unchanged)
Second output: 0×3 + 2×4 = 8 (second input doubled)
Result: [3, 8]

You do not need to master this today. The takeaway: matrices and lists work together to transform data — and deep learning stacks many such transformations.
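
For readers who like to see the arithmetic run, the example above can be sketched in plain Python. `apply_matrix` is a hypothetical helper written for this illustration; real projects typically use a library routine instead.

```python
def apply_matrix(matrix, vector):
    """Multiply each row of the matrix with the vector and sum the products."""
    for row in matrix:
        if len(row) != len(vector):
            raise ValueError("each matrix row must be as long as the vector")
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

matrix = [
    [1, 0],
    [0, 2],
]
inputs = [3, 4]

result = apply_matrix(matrix, inputs)
print(result)  # [3, 8]
```

Each output number comes from one row of the matrix, so a matrix with two rows always produces a list of two numbers.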

Shape mistakes (the most common beginner bug)

When you wire data into a model, sizes must match exactly:

A list of 48 numbers needs a model expecting 48 inputs — not 47, not 49.
An RGB patch needs you to count three channels, not one.

If sizes mismatch, Python (or PyTorch later) typically raises an error. That feels frustrating, but it is protecting you: the math would be nonsense otherwise, and silent nonsense is worse than a clear error.
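
A size check like this can also be written by hand. The sketch below is an illustration only: `check_input` and `EXPECTED_INPUTS` are hypothetical names, and libraries such as PyTorch report their own, differently worded errors.

```python
EXPECTED_INPUTS = 48  # assumed model input size, matching an 8x6 grayscale patch

def check_input(values, expected=EXPECTED_INPUTS):
    """Return the values unchanged, or raise if the length is wrong."""
    if len(values) != expected:
        raise ValueError(f"expected {expected} numbers, got {len(values)}")
    return values

good = check_input([0] * 48)  # passes silently

try:
    check_input([0] * 47)     # one number short
except ValueError as error:
    message = str(error)
    print(message)  # expected 48 numbers, got 47
```

The error message names both sizes, which is usually enough to find the step that dropped or added a value.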

Real scenario: You flatten a 16×16 grayscale patch → 256 numbers. Your friend flattens a 16×16 RGB patch → 768 numbers. Who has the longer list, and by how much?

Answer: RGB is 3 times longer (768 vs 256) because each pixel carries three color values.

Another common mistake: Using row-major flattening during training but column-major during testing. Same patch, different list, broken model. Pick one order and document it.

Why this matters for your project

In the Module 1 project, you will:

Crop a smooth region from a photo (or generate a synthetic gradient).
For each pixel, build a small input list: [1, x, y] where x and y are column and row position, and the 1 is a constant that lets the model learn a baseline brightness.
Predict that pixel’s brightness from those inputs using three weights you train with gradient descent.

Every step is lists and tables of numbers — exactly what this lesson covered. If flattening and shape make sense here, the project code will feel like an implementation of ideas you already understand, not a copy-paste mystery.
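
As a preview, the input lists and the prediction step might look like the sketch below. It is not the project code: `build_features` and `predict` are hypothetical names, the weights are made-up numbers rather than trained values, and the gradient descent step is left out entirely.

```python
def build_features(width, height):
    """One [1, x, y] list per pixel, in row-by-row order."""
    return [[1, x, y] for y in range(height) for x in range(width)]

def predict(weights, features):
    """Weighted sum: baseline + weight_x * x + weight_y * y."""
    return sum(w * f for w, f in zip(weights, features))

# Made-up weights for illustration: baseline 100, brighter to the right,
# darker toward the bottom.
weights = [100.0, 2.0, -1.0]

features = build_features(4, 3)  # a 4-wide, 3-tall region
predictions = [predict(weights, f) for f in features]

print(len(features))    # 12
print(features[0])      # [1, 0, 0]
print(predictions[0])   # 100.0
print(predictions[-1])  # 104.0
```

The last pixel sits at x = 3, y = 2, so its prediction is 100 + 2×3 − 1×2 = 104.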

Summary

Word	Simple meaning
Pixel	One cell in the image grid; stores numeric brightness or color
Matrix	Table of numbers (rows × columns); a grayscale image is a matrix
Vector	One ordered list of numbers; order must stay consistent
Flatten	Read a grid in a fixed order and join into one list
Shape / size	Row count, column count, channels — must match what the model expects

What's next

When this lesson feels solid — not just memorized — continue to Dot products — measuring similarity. That lesson explains how models compare two lists of numbers, which is the heart of matching, scoring, and classification.