CNNs — convolution for images
Before we begin
A plain MLP on MNIST flattens 784 pixels — it works, but ignores who your neighbors are. Convolutional Neural Networks (CNNs) keep the 2D grid and scan local patches with shared filters.
Why CNNs work better for images: they exploit locality, hierarchy, and translation-friendly features.
Figure
Filter sliding over a grid
What you will learn
- Explain conv, stride, and pooling in plain language.
- Describe parameter sharing and why it helps.
- Sketch a simple CNN stack for image classification.
Before this lesson
- Module 4 welcome
- Module 1 — Convolution intuition (optional deeper CV path)
Convolution layer
A filter (kernel) is a small weight grid — e.g. 3×3.
At each position, compute dot product between filter and underlying patch → one number in the feature map.
Same filter slides across the whole image → parameter sharing.
Early filters often learn edges; deeper layers combine them into parts and objects.
Stride and padding
- Stride — how many pixels you shift the filter (stride 2 → downsample).
- Padding — border of zeros to keep spatial size (same-size conv).
Output size (no padding, stride 1): (W − K + 1) per side for width W and kernel K.
Pooling
Max pooling (2×2, stride 2): take max in each window → halves width/height.
Effects:
- Reduces compute
- Adds local invariance (small shifts matter less)
Modern architectures sometimes use strided conv instead of pooling — same idea, learned downsampling.
Typical CNN stack
- Conv + ReLU
- Conv + ReLU
- Pool
- Repeat
- Flatten → fully connected → class logits
MNIST with CNN often beats MLP with fewer parameters and better generalization.
Why not flatten?
| MLP on flat pixels | CNN |
|---|---|
| Each pixel weight is separate | Local patterns reused everywhere |
| Huge parameter count | Shared filters |
| No explicit 2D structure | Builds hierarchy of spatial features |
Checkpoint
Why is parameter sharing useful?
Answer sketch
One edge detector applies at every location — fewer weights, better sample efficiency, and tolerance to where an object appears in the frame.