Convolutional networks for images
You are now at the point where hand-designed filters become learned filters. This lesson focuses on the inductive biases of CNNs and how training turns raw pixels into task-relevant representations.
Figure: A classic CNN backbone — shrink spatially, grow channels
Learning objectives
- Map a classic “conv → nonlinearity → pool” stack to receptive fields and translation equivariance.
- Explain parameter sharing and why it matters for images.
- Describe a training loop at a high level: forward pass, loss, backward pass, optimizer step.
Prerequisites
- Convolution lesson (spatial locality and kernels).
- Basic understanding of supervised learning (input → label).
Step 1 — Why fully-connected layers on full images explode
If you flatten an RGB image (say 224×224×3), you have 150,528 inputs. A single dense layer mapping those to 1000 units needs roughly 150 million weights — huge, data-hungry, and blind to spatial structure.
Convolutional layers reuse the same weights at every spatial location: parameter sharing.
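To make the contrast concrete, here is a minimal sketch (assuming PyTorch, which this lesson does not mandate) that counts parameters for the dense layer above versus a single conv layer:

```python
# Parameter counts: a dense layer over a flattened image vs. one conv layer.
import torch.nn as nn

dense = nn.Linear(224 * 224 * 3, 1000)  # one weight per (pixel, unit) pair
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3)  # each 3x3x3 kernel reused at every location

def n_params(module):
    return sum(p.numel() for p in module.parameters())

print(f"dense: {n_params(dense):,}")  # ~150 million
print(f"conv:  {n_params(conv):,}")   # 64 * (3*3*3) + 64 biases = 1,792
```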
Checkpoint: What is the computational intuition behind “same kernel swept across the image”?
Step 2 — Conv blocks build receptive fields
Each successive convolution expands the receptive field: the region of the input image that can influence a particular output unit.
- Stacking small kernels (e.g. 3×3 repeated) can mimic larger effective neighborhoods with fewer parameters than one giant kernel (depending on depth and channels).
- Dilation (later topics) expands receptive field without shrinking resolution as aggressively.
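This growth is easy to compute with the standard receptive-field recurrence. A small sketch in plain Python (no framework needed) that you can use to check your answer to the exercise below:

```python
# Receptive-field recurrence for a chain of convolutions.
# Each layer adds (kernel - 1) * jump to the receptive field,
# where jump is the product of all strides so far.
def receptive_field(layers):
    """layers: sequence of (kernel_size, stride) pairs, applied in order."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field([(3, 1)] * 3))  # three stride-1 3x3 convs, as in the exercise
```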
Figure: Receptive field grows with depth
Exercise: For three stride-1 3×3 convolutions stacked, give a rough receptive field size for a center output neuron (ignore boundaries).
Step 3 — Nonlinearities and depth
A stack of linear operations is still linear. Nonlinear activations (ReLU family, GELU, etc.) let the network represent curved decision boundaries and compose hierarchical features.
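You can verify the "linear stays linear" claim numerically. A quick sketch (assuming PyTorch):

```python
# Two linear layers with no activation collapse into one linear map.
import torch

x = torch.randn(5, 16)
W1 = torch.randn(32, 16)
W2 = torch.randn(8, 32)

deep_linear = x @ W1.T @ W2.T   # "two-layer" network, no nonlinearity
collapsed = x @ (W2 @ W1).T     # the single equivalent layer
print(torch.allclose(deep_linear, collapsed, atol=1e-4))  # True

nonlinear = torch.relu(x @ W1.T) @ W2.T  # with ReLU, no single matrix reproduces this
```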
Typical story (not a law of nature, but useful):
- Early layers: oriented edges and textures.
- Mid layers: parts and local patterns.
- Late layers: object- or class-specific cues.
Checkpoint: Why would a deep linear network be pointless for classification?
Step 4 — Pooling and downsampling
Pooling (max/average) or strided convolutions reduce spatial resolution and increase effective context per parameter.
Trade-offs:
- Downsampling too early can lose small objects.
- Avoiding downsampling costs memory.
Exercise: Compare max pooling vs average pooling on a sharp edge feature map — what differs qualitatively?
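If you would rather run the exercise than reason it out on paper, here is a minimal setup (assuming PyTorch); the edge is deliberately misaligned with the pooling grid so the two operators disagree:

```python
# A sharp vertical edge, pooled two ways. Inspect the column at the boundary.
import torch
import torch.nn.functional as F

edge = torch.zeros(1, 1, 8, 8)
edge[..., :, 3:] = 1.0  # left 3 columns are 0, the rest are 1

print(F.max_pool2d(edge, kernel_size=2))  # windows straddling the edge keep the full response
print(F.avg_pool2d(edge, kernel_size=2))  # the same windows report a washed-out 0.5
```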
Step 5 — Training objective (classification example)
For multi-class classification, cross-entropy compares the softmax of the predicted logits to a one-hot label distribution.
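In code this is a one-liner. A sketch (assuming PyTorch) that also spells out the equivalent log-softmax form:

```python
# Cross-entropy from raw logits, plus the manual log-softmax equivalent.
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])  # unnormalized scores for 3 classes
label = torch.tensor([0])                  # the true class index

loss = F.cross_entropy(logits, label)
manual = -F.log_softmax(logits, dim=1)[0, label.item()]
print(loss.item(), manual.item())  # identical values
```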
The training loop, sketched in runnable form after this list:
- Sample a minibatch of images and labels.
- Forward pass: compute predictions.
- Compute loss.
- Backward pass: gradients via automatic differentiation.
- Optimizer step (SGD, Adam, …) updates weights.
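A minimal runnable version of that loop, assuming PyTorch; the tiny model and the random minibatches are stand-ins for your real architecture and DataLoader:

```python
import torch
import torch.nn as nn

# Stand-in CNN; substitute your own architecture.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 10),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(3):                    # stands in for iterating a DataLoader
    images = torch.randn(8, 3, 32, 32)   # 1. sample a minibatch (random here)
    labels = torch.randint(0, 10, (8,))
    logits = model(images)               # 2. forward pass
    loss = loss_fn(logits, labels)       # 3. compute loss
    optimizer.zero_grad()
    loss.backward()                      # 4. backward pass via autodiff
    optimizer.step()                     # 5. optimizer step updates weights
```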
Checkpoint: What is overfitting, and what is one symptom you would see on your loss curves?
Step 6 — Data augmentation as inductive bias
Common augmentations (crops, flips, color jitter) teach invariance the network might not discover from limited data.
Important nuance: augmentations must respect the label semantics (e.g. don’t rotate digit “6” by 180° into a “9” unless the label changes too).
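A typical pipeline, assuming torchvision (the specific jitter magnitudes are illustrative); note the absence of a vertical flip:

```python
# CIFAR-style training augmentations that respect label semantics.
from torchvision import transforms

train_tf = transforms.Compose([
    transforms.RandomCrop(32, padding=4),  # small random translations
    transforms.RandomHorizontalFlip(),     # label-safe for most natural photos
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
])
```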
Check your understanding
- What does translation equivariance mean for a conv layer?
- Why is parameter sharing appropriate for natural photographs but perhaps less so for tabular data?
- Name one phenomenon that indicates your receptive field is too small for the task.
Lab-style stretch goal (optional)
Train a tiny CNN on CIFAR-10 with and without data augmentation; plot validation accuracy vs epoch and compare.
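A starting point, assuming torchvision (it downloads the dataset on first run); train the same model on each training set and plot the two validation curves:

```python
# Two views of the same training set: plain vs. augmented.
from torchvision import datasets, transforms

plain_tf = transforms.ToTensor()
aug_tf = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

train_plain = datasets.CIFAR10("data", train=True, download=True, transform=plain_tf)
train_aug = datasets.CIFAR10("data", train=True, download=True, transform=aug_tf)
test_set = datasets.CIFAR10("data", train=False, download=True, transform=plain_tf)
```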