← Back to curriculum

Module 4 — Deep learning architectures

CNNs — convolution for images

Filters, feature maps, pooling, parameter sharing, and why conv beats flattening for spatial data.

~75 min read + exercises

CNNs — convolution for images

Before we begin

A plain MLP on MNIST flattens 784 pixels — it works, but ignores who your neighbors are. Convolutional Neural Networks (CNNs) keep the 2D grid and scan local patches with shared filters.

Why CNNs work better for images: they exploit locality, hierarchy, and translation-friendly features.

Figure

Filter sliding over a grid

Conv filter scans local patches — same weights everywhere3×3feature
Same 3×3 weights detect an edge anywhere on the image.

What you will learn

  • Explain conv, stride, and pooling in plain language.
  • Describe parameter sharing and why it helps.
  • Sketch a simple CNN stack for image classification.

Before this lesson


Convolution layer

A filter (kernel) is a small weight grid — e.g. 3×3.

At each position, compute dot product between filter and underlying patch → one number in the feature map.

Same filter slides across the whole image → parameter sharing.

Early filters often learn edges; deeper layers combine them into parts and objects.


Stride and padding

  • Stride — how many pixels you shift the filter (stride 2 → downsample).
  • Padding — border of zeros to keep spatial size (same-size conv).

Output size (no padding, stride 1): (W − K + 1) per side for width W and kernel K.


Pooling

Max pooling (2×2, stride 2): take max in each window → halves width/height.

Effects:

  • Reduces compute
  • Adds local invariance (small shifts matter less)

Modern architectures sometimes use strided conv instead of pooling — same idea, learned downsampling.


Typical CNN stack

  1. Conv + ReLU
  2. Conv + ReLU
  3. Pool
  4. Repeat
  5. Flatten → fully connected → class logits

MNIST with CNN often beats MLP with fewer parameters and better generalization.


Why not flatten?

MLP on flat pixelsCNN
Each pixel weight is separateLocal patterns reused everywhere
Huge parameter countShared filters
No explicit 2D structureBuilds hierarchy of spatial features

Checkpoint

Why is parameter sharing useful?

Answer sketch

One edge detector applies at every location — fewer weights, better sample efficiency, and tolerance to where an object appears in the frame.


What's next

Lesson 2 — Recurrent neural networks