CNNs — convolution for images

Before we begin

A plain MLP on MNIST flattens 784 pixels — it works, but ignores who your neighbors are. Convolutional Neural Networks (CNNs) keep the 2D grid and scan local patches with shared filters.

Why CNNs work better for images: they exploit locality, hierarchy, and translation-friendly features.

Figure

Filter sliding over a grid

Same 3×3 weights detect an edge anywhere on the image.

What you will learn

Explain conv, stride, and pooling in plain language.
Describe parameter sharing and why it helps.
Sketch a simple CNN stack for image classification.

Before this lesson

Module 4 welcome
Module 1 — Convolution intuition (optional deeper CV path)

Convolution layer

A filter (kernel) is a small weight grid — e.g. 3×3.

At each position, compute dot product between filter and underlying patch → one number in the feature map.

Same filter slides across the whole image → parameter sharing.

Early filters often learn edges; deeper layers combine them into parts and objects.

Stride and padding

Stride — how many pixels you shift the filter (stride 2 → downsample).
Padding — border of zeros to keep spatial size (same-size conv).

Output size (no padding, stride 1): (W − K + 1) per side for width W and kernel K.

Pooling

Max pooling (2×2, stride 2): take max in each window → halves width/height.

Effects:

Reduces compute
Adds local invariance (small shifts matter less)

Modern architectures sometimes use strided conv instead of pooling — same idea, learned downsampling.

Typical CNN stack

Conv + ReLU
Conv + ReLU
Pool
Repeat
Flatten → fully connected → class logits

MNIST with CNN often beats MLP with fewer parameters and better generalization.

Why not flatten?

MLP on flat pixels	CNN
Each pixel weight is separate	Local patterns reused everywhere
Huge parameter count	Shared filters
No explicit 2D structure	Builds hierarchy of spatial features

Checkpoint

Why is parameter sharing useful?

Answer sketch

One edge detector applies at every location — fewer weights, better sample efficiency, and tolerance to where an object appears in the frame.

What's next

Lesson 2 — Recurrent neural networks