Convolutional networks for images
Hand-designed filters become learned filters. This lesson covers CNN inductive biases, how backpropagation trains conv layers, and the training stack (loss, augmentation, normalization) you need before detection and deployment.
Figure: A classic CNN backbone (shrink spatially, grow channels).
Learning objectives
- Explain parameter sharing, local connectivity, and translation equivariance.
- Compute receptive field growth with stride, padding, and dilation.
- Trace backpropagation through one conv layer (weights and input gradients).
- Describe batch normalization, skip connections, and why depth trains.
- Set up a classification training loop with cross-entropy and diagnose overfitting.
- Apply data augmentation without breaking label semantics.
Prerequisites
- Convolution lesson (kernels, padding, stride).
- Basic supervised learning (loss, gradient descent).
Step 1 — Why fully-connected layers on full images fail
Flattening a $224 \times 224 \times 3$ image yields 150,528 inputs. One dense layer to 1000 units needs ≈ 150M weights: huge, data-hungry, and translation-blind (shifting the cat one pixel routes it through entirely different weights).
A conv layer with $C_{out}$ filters of size $k \times k \times C_{in}$ uses roughly $k^2 C_{in} C_{out}$ parameters, shared at every spatial location.
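The gap between the two parameter counts is easy to check by hand. The sketch below assumes the 150,528 inputs come from a 224×224 RGB image; the conv layer sizes (3×3 kernels, 3 input channels, 64 filters) are illustrative choices, not values from this lesson.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Weights in a fully-connected layer (biases ignored)."""
    return n_in * n_out


def conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a conv layer with c_out filters of size k x k x c_in (biases ignored)."""
    return k * k * c_in * c_out


n_in = 224 * 224 * 3                         # flattened RGB image (assumed size)
dense = dense_params(n_in, 1000)
conv = conv_params(k=3, c_in=3, c_out=64)    # illustrative layer sizes

print(n_in)    # 150528
print(dense)   # 150528000
print(conv)    # 1728
```

The conv count does not depend on the image size at all, which is the practical meaning of parameter sharing.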
Checkpoint: What is the computational intuition behind “same kernel swept across the image”?
Each output location sees the same pattern detector; spatial structure is preserved in the feature map grid.
Step 2 — Conv layer mechanics
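Output (no bias, stride 1, sufficient padding), for input $X$ and weights $W$:
$$Y[c_{out}, i, j] = \sum_{c_{in}} \sum_{u, v} W[c_{out}, c_{in}, u, v] \, X[c_{in}, i + u, j + v]$$
- Stride $s$: output size $\lfloor (n + 2p - k)/s \rfloor + 1$ for input size $n$, kernel $k$, padding $p$; downsamples without a separate pooling layer.
- Padding “same”: keeps spatial size for $s = 1$ (use $p = (k - 1)/2$ for odd $k$).
- Dilation $d$: effective kernel size $k + (k - 1)(d - 1)$; expands the receptive field without the downsampling that stride or pooling would cause.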
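The mechanics above can be sketched as a naive NumPy loop. This is a minimal illustration, not an efficient implementation: it assumes square kernels, symmetric zero padding, and the cross-correlation convention (no kernel flip) that deep learning libraries typically use. The helper names `conv_out_size` and `conv2d` are made up for this example.

```python
import numpy as np


def conv_out_size(n, k, stride=1, padding=0, dilation=1):
    """Output length along one spatial axis."""
    k_eff = dilation * (k - 1) + 1           # effective kernel size under dilation
    return (n + 2 * padding - k_eff) // stride + 1


def conv2d(x, w, stride=1, padding=0, dilation=1):
    """x: (C_in, H, W), w: (C_out, C_in, k, k). Cross-correlation, no bias."""
    c_out, _, k, _ = w.shape
    _, h, wd = x.shape
    h_out = conv_out_size(h, k, stride, padding, dilation)
    w_out = conv_out_size(wd, k, stride, padding, dilation)
    xp = np.pad(x, ((0, 0), (padding, padding), (padding, padding)))
    y = np.zeros((c_out, h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            # Dilated k x k window whose top-left corner sits at (i*stride, j*stride)
            rows = i * stride + dilation * np.arange(k)
            cols = j * stride + dilation * np.arange(k)
            patch = xp[:, rows[:, None], cols[None, :]]          # (C_in, k, k)
            y[:, i, j] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    return y


rng = np.random.default_rng(0)
x = rng.normal(size=(3, 32, 32))
w = rng.normal(size=(8, 3, 3, 3))

print(conv2d(x, w, padding=1).shape)             # (8, 32, 32)  "same" padding
print(conv2d(x, w, stride=2, padding=1).shape)   # (8, 16, 16)  strided downsampling
print(conv2d(x, w, dilation=2).shape)            # (8, 28, 28)  effective kernel 5
```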
Figure: Receptive field grows with depth.
Receptive field recurrence (stack of layers with kernel $k_l$, stride $s_l$):
$$r_l = r_{l-1} + (k_l - 1) \prod_{i=1}^{l-1} s_i$$
(start $r_0 = 1$). Exercise: Three stride-1 3×3 layers → $r_3 = 7$. Add a stride-2 pool after layer 1: how does $r$ grow?
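A few lines of Python are enough to evaluate the receptive field for any stack. This sketch assumes the standard form of the recurrence, in which each layer adds (kernel − 1) times the product of all earlier strides. The function name and the `(kernel, stride)` tuple format are choices made for this example.

```python
def receptive_field(layers):
    """layers: list of (kernel, stride) pairs, ordered from input to output."""
    r, jump = 1, 1              # jump = product of strides of the layers seen so far
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
    return r


print(receptive_field([(3, 1), (3, 1), (3, 1)]))   # 7
```

To check the exercise, insert a pooling entry such as `(2, 2)` after the first layer and compare the result with the stride-1 stack.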
Step 3 — Equivariance vs invariance
- Equivariance: if input shifts, output feature map shifts the same way (before pooling).
- Invariance: output unchanged under shift — built via pooling, global average pool, or learned attention.
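Both properties can be verified numerically. The sketch below uses wrap-around (circular) borders so that a shift is exact; with zero padding the same relation holds only away from the image border. `circular_corr2d` is a helper written for this example, not a library function.

```python
import numpy as np


def circular_corr2d(x, w):
    """Cross-correlate x with w using wrap-around borders."""
    y = np.zeros_like(x, dtype=float)
    for u in range(w.shape[0]):
        for v in range(w.shape[1]):
            # np.roll by (-u, -v) places x[i+u, j+v] at position (i, j)
            y += w[u, v] * np.roll(x, shift=(-u, -v), axis=(0, 1))
    return y


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))
w = rng.normal(size=(3, 3))
shift = (2, 3)

y = circular_corr2d(x, w)
y_shifted = circular_corr2d(np.roll(x, shift, axis=(0, 1)), w)

# Equivariance: shifting the input shifts the feature map by the same amount
equivariant = np.allclose(y_shifted, np.roll(y, shift, axis=(0, 1)))
# Invariance: a global max pool discards position, so its output is unchanged
invariant = np.isclose(y.max(), y_shifted.max())

print(equivariant, invariant)   # True True
```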
Checkpoint: Why is a deep linear network useless for classification?
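A composition of linear maps is itself linear, so the whole stack collapses to a single linear layer and cannot bend decision boundaries around classes that are not linearly separable.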
Step 4 — Nonlinearities, normalization, regularization
| Component | Role |
|---|---|
| ReLU | Sparse activations, cheap; dead neurons if lr too high |
| GELU / SiLU | Smoother; common in ViT backbones |
| Batch norm | Stabilize scale of activations; learnable scale $\gamma$ and shift $\beta$ |
| Dropout | Zero random activations at train; reduces co-adaptation |
| Weight decay | L2 penalty on weights; discourages large weights and favors smoother decision boundaries |
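Batch norm (per channel; minibatch statistics $\mu_B, \sigma_B^2$ at train, running averages at eval):
$$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta$$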
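The train/eval difference is easiest to see in code. The class below is a simplified sketch: the name `SimpleBatchNorm`, the momentum value, and the use of the biased variance for the running estimate are assumptions of this example, and library implementations differ in such details.

```python
import numpy as np


class SimpleBatchNorm:
    """Per-channel batch norm for inputs of shape (N, C, H, W)."""

    def __init__(self, channels, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(channels)       # learnable scale
        self.beta = np.zeros(channels)       # learnable shift
        self.running_mean = np.zeros(channels)
        self.running_var = np.ones(channels)
        self.momentum, self.eps = momentum, eps

    def __call__(self, x, training):
        if training:
            # Statistics of the current minibatch, per channel
            mean = x.mean(axis=(0, 2, 3))
            var = x.var(axis=(0, 2, 3))
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Fixed statistics accumulated during training
            mean, var = self.running_mean, self.running_var
        shape = (1, -1, 1, 1)
        x_hat = (x - mean.reshape(shape)) / np.sqrt(var.reshape(shape) + self.eps)
        return self.gamma.reshape(shape) * x_hat + self.beta.reshape(shape)


rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(16, 4, 8, 8))
bn = SimpleBatchNorm(channels=4)

out_train = bn(x, training=True)    # normalized with this batch's statistics
out_eval = bn(x, training=False)    # normalized with the running averages
```

After a single training step the running averages are still far from the data statistics, so `out_eval` differs noticeably from `out_train`. That mismatch is one reason a model can behave differently in eval mode early in training.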
Step 5 — Depth, skip connections, and receptive field
Vanishing gradients in deep stacks motivated ResNet blocks:
$$y = x + F(x)$$
Gradients can flow through the identity shortcut; $F$ learns residual corrections.
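A scalar toy model shows the effect of the shortcut on gradient magnitude. It assumes the usual residual form, output = input + branch(input), and replaces each layer with a fixed slope. Real networks have Jacobian matrices rather than scalars, so this illustrates the mechanism only.

```python
def end_to_end_gradient(depth: int, branch_slope: float, residual: bool) -> float:
    """d(output)/d(input) for a stack of identical scalar layers.

    Plain layer:    y = a * x      -> local derivative a
    Residual layer: y = x + a * x  -> local derivative 1 + a
    """
    local = 1.0 + branch_slope if residual else branch_slope
    return local ** depth           # chain rule: product of local derivatives


plain = end_to_end_gradient(depth=20, branch_slope=0.1, residual=False)
skip = end_to_end_gradient(depth=20, branch_slope=0.1, residual=True)

print(plain)   # about 1e-20: the signal has vanished
print(skip)    # roughly 6.7: the identity path keeps the gradient alive
```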
Typical hierarchy (empirical, not law):
- Early: edges, color blobs.
- Mid: textures, parts.
- Late: object-level semantics.
Pooling / strided conv trade spatial resolution for channel depth and context per parameter.
Exercise: Max pool vs average pool on a thin vertical edge — which preserves peak response better?
Max pool — average dilutes sharp peak.
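The answer can be confirmed on a tiny array. This sketch uses a 4×4 response map with a one-pixel-wide vertical line and non-overlapping 2×2 pooling; `pool2x2` is a helper defined for this example.

```python
import numpy as np


def pool2x2(x, reduce):
    """Non-overlapping 2x2 pooling; reduce is np.max or np.mean."""
    h, w = x.shape
    blocks = x.reshape(h // 2, 2, w // 2, 2)   # split rows and columns into 2x2 tiles
    return reduce(blocks, axis=(1, 3))


edge = np.zeros((4, 4))
edge[:, 1] = 1.0                 # thin vertical edge response

max_out = pool2x2(edge, np.max)
avg_out = pool2x2(edge, np.mean)

print(max_out)   # [[1. 0.] [1. 0.]]     peak preserved
print(avg_out)   # [[0.5 0. ] [0.5 0. ]] peak halved
```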
Step 6 — Training loop and cross-entropy
For $K$ classes, logits $z \in \mathbb{R}^K$, softmax probabilities $p_k = e^{z_k} / \sum_j e^{z_j}$. One-hot label $y$:
$$L = -\sum_{k=1}^{K} y_k \log p_k$$
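With a one-hot label, the cross-entropy reduces to the negative log-probability of the true class. A minimal NumPy sketch follows; subtracting the maximum logit before exponentiating is a standard stability trick that this lesson does not cover, added here so that large logits do not overflow.

```python
import numpy as np


def softmax_cross_entropy(logits, label):
    """logits: array of shape (K,); label: integer index of the true class.

    Returns (loss, probabilities).
    """
    shifted = logits - logits.max()                       # softmax is shift-invariant
    log_probs = shifted - np.log(np.exp(shifted).sum())   # log-softmax
    return -log_probs[label], np.exp(log_probs)


loss, probs = softmax_cross_entropy(np.array([2.0, 1.0, 0.1]), label=0)
uniform_loss, _ = softmax_cross_entropy(np.zeros(10), label=3)   # equals log(10)
```

The uniform case is a useful sanity check: an untrained classifier with $K$ classes should start near a loss of $\log K$.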
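Minibatch SGD / Adam:
- Forward: $z = f_\theta(x)$.
- Loss $L(z, y)$.
- Backward: $\nabla_\theta L$ via autodiff.
- Update $\theta \leftarrow \theta - \eta \nabla_\theta L$ (plain SGD step; Adam adapts it per parameter).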
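The four steps above can be sketched end to end on a toy problem. To stay dependency-free, this example swaps the CNN for a linear softmax classifier and writes the gradient by hand instead of using autodiff. The synthetic blobs, learning rate, and batch size are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 3-class data: one Gaussian blob per class (illustrative only)
K, D, N = 3, 2, 300
centers = np.array([[0.0, 3.0], [3.0, -2.0], [-3.0, -2.0]])
y = rng.integers(0, K, size=N)
X = centers[y] + rng.normal(size=(N, D))

W, b = np.zeros((D, K)), np.zeros(K)
lr, batch_size = 0.1, 32


def forward(X, W, b):
    """Logits followed by a numerically stable softmax."""
    z = X @ W + b
    z = z - z.max(axis=1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)


def mean_loss(X, y, W, b):
    p = forward(X, W, b)
    return -np.log(p[np.arange(len(y)), y]).mean()


initial_loss = mean_loss(X, y, W, b)          # log(3) with zero weights
for epoch in range(20):
    order = rng.permutation(N)                # reshuffle every epoch
    for start in range(0, N, batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        p = forward(xb, W, b)                 # forward
        grad_z = p.copy()                     # backward: dL/dz = p - onehot(y)
        grad_z[np.arange(len(idx)), yb] -= 1.0
        grad_z /= len(idx)
        W -= lr * (xb.T @ grad_z)             # update
        b -= lr * grad_z.sum(axis=0)

final_loss = mean_loss(X, y, W, b)
accuracy = (forward(X, W, b).argmax(axis=1) == y).mean()
```

In a real CNN the hand-written gradient lines are replaced by the framework's backward pass, while the loop structure stays the same.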
Backprop through conv (sketch)
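If $\partial L / \partial Y$ is known, then
$$\frac{\partial L}{\partial W[c_{out}, c_{in}, u, v]} = \sum_{i, j} \frac{\partial L}{\partial Y[c_{out}, i, j]} \, X[c_{in}, i + u, j + v]$$
and $\partial L / \partial X$ is a full convolution of $\partial L / \partial Y$ with the flipped $W$. The backward pass therefore has the same structure as the forward pass, which is why conv nets are efficient on GPUs.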
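Both gradients can be checked against finite differences. The sketch below is restricted to a single channel, stride 1, and no padding to keep the indexing readable. The function names are made up for this example, and the loss is an arbitrary linear function of the output chosen so that the upstream gradient is known.

```python
import numpy as np


def corr2d_valid(x, w):
    """Single-channel forward pass: y[i, j] = sum_uv w[u, v] * x[i+u, j+v]."""
    k = w.shape[0]
    h_out, w_out = x.shape[0] - k + 1, x.shape[1] - k + 1
    y = np.zeros((h_out, w_out))
    for u in range(k):
        for v in range(k):
            y += w[u, v] * x[u:u + h_out, v:v + w_out]
    return y


def conv_backward(x, w, grad_y):
    """Return (dL/dw, dL/dx) given the upstream gradient dL/dy."""
    k = w.shape[0]
    h_out, w_out = grad_y.shape
    grad_w = np.zeros_like(w)
    for u in range(k):
        for v in range(k):
            # Weight gradient: correlate the input with the upstream gradient
            grad_w[u, v] = (grad_y * x[u:u + h_out, v:v + w_out]).sum()
    # Input gradient: pad grad_y by k-1 ("full"), then slide the 180-degree flipped kernel
    grad_x = corr2d_valid(np.pad(grad_y, k - 1), w[::-1, ::-1])
    return grad_w, grad_x


rng = np.random.default_rng(0)
x, w = rng.normal(size=(6, 6)), rng.normal(size=(3, 3))
grad_y = rng.normal(size=(4, 4))              # pretend upstream gradient
grad_w, grad_x = conv_backward(x, w, grad_y)


def loss(x, w):
    return (corr2d_valid(x, w) * grad_y).sum()   # chosen so that dL/dy == grad_y


eps = 1e-6
x_pert, w_pert = x.copy(), w.copy()
x_pert[2, 3] += eps
w_pert[1, 0] += eps
num_dx = (loss(x_pert, w) - loss(x, w)) / eps    # compare with grad_x[2, 3]
num_dw = (loss(x, w_pert) - loss(x, w)) / eps    # compare with grad_w[1, 0]
```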
Checkpoint: What is overfitting?
Train loss ↓ while validation loss ↑ — model memorizes idiosyncrasies (backgrounds, watermarks).
Step 7 — Data augmentation as curriculum
| Augmentation | Teaches |
|---|---|
| Random crop / flip | Translation / mirror invariance |
| Color jitter | Lighting invariance |
| Cutout / CutMix | Robustness to occlusion, label mixing |
| Mixup | Linear behavior between training examples (inputs and labels are blended) |
Rule: augmentations must preserve label meaning (no vertical flip for “6”/“9” unless labels swap).
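One way to respect this rule in code is to make each label-sensitive transform an explicit switch. The sketch below implements random crop and horizontal flip in NumPy; the function name and the `allow_hflip` flag are choices made for this example, not part of any library.

```python
import numpy as np


def random_crop_flip(image, crop, rng, allow_hflip=True):
    """image: (H, W, C). Returns a crop x crop patch, optionally mirrored left-right.

    Set allow_hflip=False when mirroring changes the label (digits, text, road signs).
    """
    h, w, _ = image.shape
    top = rng.integers(0, h - crop + 1)     # upper bound is exclusive
    left = rng.integers(0, w - crop + 1)
    out = image[top:top + crop, left:left + crop]
    if allow_hflip and rng.random() < 0.5:
        out = out[:, ::-1]                  # mirror along the width axis
    return out


rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))             # stand-in for a real photo

photo_patch = random_crop_flip(image, crop=28, rng=rng)                     # flips allowed
digit_patch = random_crop_flip(image, crop=28, rng=rng, allow_hflip=False)  # label-safe
```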
Transfer learning: freeze early layers pretrained on ImageNet; train head on small dataset — fewer labels needed when low-level filters already exist.
Deep dive — what to watch in training curves
| Pattern | Likely issue |
|---|---|
| Train acc 99%, val acc 60% | Overfit — more data, dropout, weight decay |
| Both acc low | Underfit — bigger model, train longer, check labels |
| Loss NaN | lr too high, bad normalization, mixed precision overflow |
| Val improves then degrades | Late overfit — early stopping |
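The last row of the table maps directly onto a patience-based early-stopping rule. The helper below is one common formulation, written for this example; the validation losses are made-up numbers used only to exercise it.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch).

    Training stops once `patience` epochs have passed without a new best validation loss.
    """
    best, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


# Illustrative curve: improves until epoch 3, then degrades
val_losses = [1.0, 0.8, 0.7, 0.65, 0.66, 0.70, 0.75, 0.9]
print(early_stopping_epoch(val_losses, patience=3))   # (3, 6)
```

In practice the weights saved at `best_epoch` are restored, not the weights from the epoch where training stopped.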
Check your understanding
- What does translation equivariance mean for a conv layer?
- Why is parameter sharing appropriate for photos but not for spreadsheet columns?
- Name one symptom of receptive field too small for the task.
- Why does batch norm behave differently at train vs eval?
- How does a residual connection change gradient flow?
Lab-style stretch goals
Train a small CNN on CIFAR-10 with and without augmentation; plot train/val accuracy. Try freezing a pretrained backbone + linear head on 500 images per class.
Debug: Visualize first-layer filters — do you see Gabor-like edge detectors emerge?