Convolutional networks for images

Hand-designed filters become learned filters. This lesson covers CNN inductive biases, how backpropagation trains conv layers, and the training stack (loss, augmentation, normalization) you need before detection and deployment.

Figure

A classic CNN backbone — shrink spatially, grow channels

Each block is a feature map produced by conv + activation (optionally pooled). Resolution drops while channels expand to encode richer concepts.

Learning objectives

Explain parameter sharing, local connectivity, and translation equivariance.
Compute receptive field growth with stride, padding, and dilation.
Trace backpropagation through one conv layer (weights and input gradients).
Describe batch normalization, skip connections, and why they make very deep networks trainable.
Set up a classification training loop with cross-entropy and diagnose overfitting.
Apply data augmentation without breaking label semantics.

Prerequisites

Convolution lesson (kernels, padding, stride).
Basic supervised learning (loss, gradient descent).

Step 1 — Why fully-connected layers on full images fail

Flattening a $224 \times 224 \times 3$ image yields 150,528 inputs. One dense layer to 1000 units needs about 150M weights: huge, data-hungry, and translation-blind (shifting the cat by one pixel sends it through entirely different weights).

A conv layer with $C_{\text{out}}$ filters of size $k \times k$ uses roughly $C_{\text{in}} \cdot k^2 \cdot C_{\text{out}}$ parameters shared at every spatial location.
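To make the comparison concrete, here is a small sketch that counts parameters for both layer types. The 64 filters and 3×3 kernel are illustrative assumptions, not values from this lesson, and both counts include one bias per output unit or channel.

```python
def dense_params(h, w, c_in, units):
    """Weights plus biases for one dense layer on a flattened h x w x c_in image."""
    return h * w * c_in * units + units


def conv_params(c_in, c_out, k):
    """Weights plus biases for one k x k conv layer; shared across all positions."""
    return c_in * k * k * c_out + c_out


print(dense_params(224, 224, 3, 1000))  # 150529000
print(conv_params(3, 64, 3))            # 1792
```

The conv count does not depend on the image size at all, which is the practical meaning of "shared at every spatial location".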

Checkpoint: What is the computational intuition behind “same kernel swept across the image”?

Each output location sees the same pattern detector; spatial structure is preserved in the feature map grid.

Step 2 — Conv layer mechanics

Output (stride 1, sufficient padding, bias $b$):

Y[i,j,c] = \sum_{c',u,v} W[c,c',u,v]\, X[i+u, j+v, c'] + b[c]
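The formula can be read directly as nested loops. The sketch below is a deliberately slow, pure-Python version that assumes no padding (so the output shrinks by $k-1$) and a channels-last layout `X[i][j][c']`; real libraries use vectorized kernels instead. The helper name `conv2d_forward` is hypothetical.

```python
def conv2d_forward(X, W, b):
    """Naive conv forward. X: [H][W][C_in], W: [C_out][C_in][k][k], b: [C_out]."""
    H, Wd, C_in = len(X), len(X[0]), len(X[0][0])
    C_out, k = len(W), len(W[0][0])
    H_out, W_out = H - k + 1, Wd - k + 1  # no padding, stride 1
    # Start every output cell at the bias of its channel.
    Y = [[[b[c] for c in range(C_out)] for _ in range(W_out)] for _ in range(H_out)]
    for i in range(H_out):
        for j in range(W_out):
            for c in range(C_out):
                for cp in range(C_in):
                    for u in range(k):
                        for v in range(k):
                            Y[i][j][c] += W[c][cp][u][v] * X[i + u][j + v][cp]
    return Y


# 3x3 single-channel image with a vertical step edge on the right.
X = [[[0.0], [0.0], [1.0]],
     [[0.0], [0.0], [1.0]],
     [[0.0], [0.0], [1.0]]]
# One 2x2 filter that responds to left-to-right intensity increases.
W = [[[[-1.0, 1.0],
       [-1.0, 1.0]]]]
b = [0.0]

Y = conv2d_forward(X, W, b)
print([[cell[0] for cell in row] for row in Y])  # [[0.0, 2.0], [0.0, 2.0]]
```

The filter fires only in the output column that straddles the edge, and the same four weights produce every output value.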

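Stride $s$: output size $\lfloor (H + 2p - k)/s \rfloor + 1$ with padding $p$ per side (so $\lfloor (H-k)/s \rfloor + 1$ without padding); $s > 1$ downsamples without a separate pooling layer.
Padding “same”: pad so the output keeps the input's spatial size at $s=1$ (for odd $k$, $p = (k-1)/2$).
Dilation $d$: effective kernel size $d(k-1)+1$; widens the receptive field without extra parameters and without the downsampling that stride or pooling would introduce.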
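A quick way to check layer shapes is to combine the three rules into one helper. This is a sketch using the standard output-size formula with symmetric padding `padding` per side; the function name is hypothetical.

```python
def conv_output_size(n, k, stride=1, padding=0, dilation=1):
    """Output length along one spatial axis of size n."""
    k_eff = dilation * (k - 1) + 1  # effective kernel size under dilation
    return (n + 2 * padding - k_eff) // stride + 1


print(conv_output_size(224, 3))                          # 222
print(conv_output_size(224, 3, padding=1))               # 224
print(conv_output_size(224, 3, stride=2, padding=1))     # 112
print(conv_output_size(224, 3, padding=2, dilation=2))   # 224
```

The second and fourth calls are both "same" padding at stride 1: the padding has to grow with the effective kernel size.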

Figure

Receptive field grows with depth

One 3×3 layer sees a 3×3 region; two stacked see 5×5; three see 7×7 — with far fewer parameters than a single 7×7 kernel.

Receptive field recurrence (stack of $L$ layers with kernel $k_\ell$, stride $s_\ell$):

r_\ell = r_{\ell-1} + (k_\ell - 1) \prod_{i=1}^{\ell-1} s_i

(starting from $r_0=1$). Exercise: three stride-1 3×3 layers give $r=7$. Add a stride-2 pool after layer 1: how does $r_3$ grow?
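The recurrence is easy to evaluate in code. The sketch below tracks the running product of strides (called `jump` here, a name chosen for illustration) and returns the receptive field after each layer; pooling layers can be passed as (kernel, stride) pairs like any other layer.

```python
def receptive_field(layers):
    """layers: list of (kernel, stride) pairs from input to output.

    Returns the receptive field size after each layer, starting from r_0 = 1.
    """
    r, jump = 1, 1  # jump = product of the strides of all earlier layers
    sizes = []
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
        sizes.append(r)
    return sizes


print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # [3, 5, 7]
print(receptive_field([(3, 2), (3, 1)]))          # [3, 7]
```

The second call shows the effect of an early stride: two layers already cover 7 input pixels. The exercise configuration can be checked the same way.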

Step 3 — Equivariance vs invariance

Equivariance: if input shifts, output feature map shifts the same way (before pooling).
Invariance: output unchanged under shift — built via pooling, global average pool, or learned attention.
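Both properties can be seen on a 1D toy signal. This sketch assumes stride 1, no padding, and a shift that keeps the pattern away from the borders (equivariance only holds up to edge effects); a global max stands in for pooling.

```python
def correlate1d(x, w):
    """Valid 1D cross-correlation (stride 1, no padding)."""
    k = len(w)
    return [sum(w[u] * x[i + u] for u in range(k)) for i in range(len(x) - k + 1)]


def shift_right(x, n):
    """Shift by n samples, filling with zeros on the left."""
    return [0] * n + x[:len(x) - n]


x = [0, 0, 1, 3, 1, 0, 0, 0]
w = [1, 0, -1]

y = correlate1d(x, w)
y_shifted_input = correlate1d(shift_right(x, 2), w)

print(y)                # [-1, -3, 0, 3, 1, 0]
print(y_shifted_input)  # [0, 0, -1, -3, 0, 3]

# Equivariance: shifting the input shifts the feature map by the same amount.
print(y_shifted_input == shift_right(y, 2))  # True

# Invariance: a global max over the feature map ignores where the peak sits.
print(max(y), max(y_shifted_input))  # 3 3
```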

Checkpoint: Why does a deep network without nonlinearities gain nothing from its depth?

Composition of linear maps is linear, so the whole stack collapses to a single linear layer and can only draw linear decision boundaries.

Step 4 — Nonlinearities, normalization, regularization

Component	Role
ReLU $\max(0,x)$	Sparse activations, cheap; dead neurons if lr too high
GELU / SiLU	Smoother than ReLU; common in transformer (ViT) backbones and many recent CNNs
Batch norm	Stabilize scale of activations; $\gamma,\beta$ learnable
Dropout	Zero random activations at train; reduces co-adaptation
Weight decay	L2 penalty on weights; discourages large weights, which tends to give smoother decision functions

Batch norm (per channel, minibatch statistics at train, running average at eval):

\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \quad y = \gamma \hat{x} + \beta
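The train/eval split is easiest to see for a single channel. The sketch below is a minimal version under stated assumptions: biased minibatch variance, an exponential running average with `momentum=0.1`, and a plain dict for the running statistics. These are illustrative choices; frameworks differ in details.

```python
import math


def batchnorm_train(xs, gamma, beta, running, momentum=0.1, eps=1e-5):
    """Normalize one channel with minibatch statistics and update running stats."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)  # biased variance
    running["mean"] = (1 - momentum) * running["mean"] + momentum * mu
    running["var"] = (1 - momentum) * running["var"] + momentum * var
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in xs]


def batchnorm_eval(xs, gamma, beta, running, eps=1e-5):
    """Normalize with the stored running statistics; no dependence on the batch."""
    scale = math.sqrt(running["var"] + eps)
    return [gamma * (x - running["mean"]) / scale + beta for x in xs]


running = {"mean": 0.0, "var": 1.0}
batch = [2.0, 4.0, 6.0, 8.0]

out = batchnorm_train(batch, gamma=1.0, beta=0.0, running=running)
print([round(v, 3) for v in out])  # [-1.342, -0.447, 0.447, 1.342]
print(round(running["mean"], 3), round(running["var"], 3))  # 0.5 1.4

# At eval time the same inputs give different outputs until the running
# statistics have caught up with the data distribution.
print([round(v, 3) for v in batchnorm_eval(batch, 1.0, 0.0, running)])
```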

Step 5 — Depth, skip connections, and receptive field

Very deep plain stacks are hard to optimize (gradients vanish or degrade on the way back), which motivated ResNet blocks:

\mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x}

Gradients can flow through the identity shortcut; $\mathcal{F}$ learns residual corrections.
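A scalar toy model shows the effect of the shortcut. With $y = \mathcal{F}(x) + x$ the local derivative is $\mathcal{F}'(x) + 1$ instead of $\mathcal{F}'(x)$, so the product over layers no longer collapses when each $\mathcal{F}'$ is small. The value 0.1 and the depth of 20 below are arbitrary assumptions for illustration.

```python
def chain_gradient(local_grads, residual):
    """d(output)/d(input) for a scalar stack.

    local_grads[l] is dF_l/dx at layer l; with residual=True each layer
    computes F_l(x) + x, so its local derivative is F_l'(x) + 1.
    """
    g = 1.0
    for f_prime in local_grads:
        g *= (1.0 + f_prime) if residual else f_prime
    return g


local_grads = [0.1] * 20
print(f"{chain_gradient(local_grads, residual=False):.1e}")  # 1.0e-20
print(f"{chain_gradient(local_grads, residual=True):.2f}")   # 6.73
```

Real networks are not scalar and the local derivatives vary, so this is only an intuition for why the identity path keeps the signal alive.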

Typical hierarchy (empirical, not law):

Early: edges, color blobs.
Mid: textures, parts.
Late: object-level semantics.

Pooling and strided convolutions give up spatial resolution in exchange for wider context per unit; networks usually add channels at the same stage to keep capacity.

Exercise: Max pool vs average pool on a thin vertical edge — which preserves peak response better?

Max pool — average dilutes sharp peak.
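The answer can be checked on a one-row toy example. The sketch assumes non-overlapping windows of size 2 and a single-pixel edge response.

```python
def pool1d(xs, size, reduce):
    """Non-overlapping 1D pooling; reduce maps a window to one value."""
    return [reduce(xs[i:i + size]) for i in range(0, len(xs) - size + 1, size)]


def mean(window):
    return sum(window) / len(window)


row = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]  # thin edge: one strong response

print(pool1d(row, 2, max))   # [0.0, 1.0, 0.0]
print(pool1d(row, 2, mean))  # [0.0, 0.5, 0.0]
```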

Step 6 — Training loop and cross-entropy

For $K$ classes, logits $\mathbf{z}$ , softmax probabilities $p_k = e^{z_k}/\sum_j e^{z_j}$ . One-hot label $y$ :

\mathcal{L} = -\sum_k y_k \log p_k
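With a one-hot label the sum keeps only the true class, so $\mathcal{L} = -\log p_y = \log\sum_j e^{z_j} - z_y$. The sketch below computes that form directly and subtracts the maximum logit first, a common trick (assumed here, not stated in the lesson) to avoid overflow in the exponentials.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy for one example; label is the index of the true class."""
    m = max(logits)  # shift for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


print(round(softmax_cross_entropy([2.0, 1.0, 0.1], 0), 4))  # 0.417
print(round(softmax_cross_entropy([0.0, 0.0, 0.0], 1), 4))  # 1.0986
print(round(softmax_cross_entropy([1000.0, 0.0], 0), 4))    # 0.0
```

The middle case is the untrained baseline: uniform predictions over $K$ classes give a loss of $\log K$, a useful sanity check for the first training step.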

Minibatch SGD / Adam:

Forward: $\mathbf{z} = f_\theta(\mathbf{x})$ .
Loss $\mathcal{L}$ .
Backward: $\partial \mathcal{L}/\partial \theta$ via autodiff.
Update $\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}$ (plain SGD; Adam additionally rescales the step per parameter).
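The four steps can be sketched end to end on a toy problem. To stay dependency-free, the model here is a linear softmax classifier on hand-made 2D points rather than a CNN, and the gradient $\partial \mathcal{L}/\partial z_k = p_k - y_k$ is written by hand where a framework would use autodiff. The data, learning rate, and batch size are all illustrative assumptions.

```python
import math
import random


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def train(data, num_classes, dim, lr=0.5, epochs=50, batch_size=4, seed=0):
    rng = random.Random(seed)
    data = list(data)  # do not reorder the caller's list
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    history = []
    for _ in range(epochs):
        rng.shuffle(data)
        epoch_loss = 0.0
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            gW = [[0.0] * dim for _ in range(num_classes)]
            gb = [0.0] * num_classes
            for x, y in batch:
                # 1. Forward: logits z = Wx + b.
                z = [sum(W[k][d] * x[d] for d in range(dim)) + b[k]
                     for k in range(num_classes)]
                p = softmax(z)
                # 2. Loss: cross-entropy for the true class.
                epoch_loss += -math.log(p[y])
                # 3. Backward: dL/dz_k = p_k - y_k, averaged over the batch.
                for k in range(num_classes):
                    dz = (p[k] - (1.0 if k == y else 0.0)) / len(batch)
                    gb[k] += dz
                    for d in range(dim):
                        gW[k][d] += dz * x[d]
            # 4. Update: theta <- theta - lr * gradient.
            for k in range(num_classes):
                b[k] -= lr * gb[k]
                for d in range(dim):
                    W[k][d] -= lr * gW[k][d]
        history.append(epoch_loss / len(data))
    return W, b, history


toy = [((1.0, 0.1), 0), ((0.9, -0.1), 0), ((1.1, 0.0), 0), ((0.8, 0.2), 0),
       ((0.0, 1.0), 1), ((0.1, 0.9), 1), ((-0.1, 1.1), 1), ((0.2, 0.8), 1),
       ((-1.0, -1.0), 2), ((-0.9, -1.1), 2), ((-1.1, -0.9), 2), ((-0.8, -1.0), 2)]

W, b, history = train(toy, num_classes=3, dim=2)
print(f"first epoch loss {history[0]:.3f}, last epoch loss {history[-1]:.3f}")
```

Swapping the linear model for a CNN changes only the forward and backward steps; the loop structure stays the same.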

Backprop through conv (sketch)

If $\partial \mathcal{L}/\partial Y$ is known, then

\frac{\partial \mathcal{L}}{\partial W[c,c',u,v]} = \sum_{i,j} \frac{\partial \mathcal{L}}{\partial Y[i,j,c]} X[i+u,j+v,c']

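and $\partial \mathcal{L}/\partial X$ is a “full” (zero-padded) correlation of $\partial \mathcal{L}/\partial Y$ with the spatially flipped $W$, summed over output channels $c$. Both gradients have the same sliding-window structure as the forward pass, so they run on the same kind of efficient GPU kernels.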
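Both gradient formulas can be verified numerically. The sketch below uses a 1D, single-channel layer to keep the indices readable (an assumption for brevity; the 2D multi-channel case adds loops but no new ideas) and compares the analytic gradients with central finite differences. The upstream gradient `dy` is an arbitrary made-up vector.

```python
def conv1d(x, w):
    """Forward: y[i] = sum_u w[u] * x[i + u] (stride 1, no padding)."""
    k = len(w)
    return [sum(w[u] * x[i + u] for u in range(k)) for i in range(len(x) - k + 1)]


def conv1d_backward(x, w, dy):
    """Return (dL/dw, dL/dx) given the upstream gradient dy = dL/dy."""
    k, n_out = len(w), len(dy)
    # Weight gradient: correlate the input with the upstream gradient.
    dw = [sum(dy[i] * x[i + u] for i in range(n_out)) for u in range(k)]
    # Input gradient: every y[i] touched x[i + u] through w[u]; scattering
    # dy[i] * w[u] back is the full correlation with the flipped kernel.
    dx = [0.0] * len(x)
    for i in range(n_out):
        for u in range(k):
            dx[i + u] += dy[i] * w[u]
    return dw, dx


def numeric_grad(f, v, eps=1e-6):
    """Central finite differences of scalar f with respect to list v."""
    grads = []
    for i in range(len(v)):
        up = v[:i] + [v[i] + eps] + v[i + 1:]
        down = v[:i] + [v[i] - eps] + v[i + 1:]
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads


x = [1.0, 2.0, -1.0, 0.5, 3.0]
w = [0.5, -1.0, 2.0]
dy = [1.0, -2.0, 0.5]  # pretend this came from the layers above


def loss(x_, w_):
    # A loss whose gradient with respect to y is exactly dy.
    return sum(g * y for g, y in zip(dy, conv1d(x_, w_)))


dw, dx = conv1d_backward(x, w, dy)
print(dw)  # [-3.5, 4.25, -0.5]
print(dx)  # [0.5, -2.0, 4.25, -4.5, 1.0]

dw_num = numeric_grad(lambda v: loss(x, v), w)
dx_num = numeric_grad(lambda v: loss(v, w), x)
print(all(abs(a - b) < 1e-6 for a, b in zip(dw, dw_num)))  # True
print(all(abs(a - b) < 1e-6 for a, b in zip(dx, dx_num)))  # True
```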

Checkpoint: What is overfitting?

Train loss ↓ while validation loss ↑ — model memorizes idiosyncrasies (backgrounds, watermarks).

Step 7 — Data augmentation as curriculum

Augmentation	Teaches
Random crop / flip	Translation / mirror invariance
Color jitter	Lighting invariance
Cutout / CutMix	Robustness to occlusion, label mixing
Mixup	Linear behavior between training examples (inputs and labels are interpolated together)

Rule: augmentations must preserve label meaning (no vertical flip for “6”/“9” unless labels swap).
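One way to enforce the rule in code is to gate each transform on the label. The sketch below uses nested lists as images and a hypothetical `flip_safe_labels` set that the dataset owner would have to define; both are assumptions for illustration, not part of any library.

```python
import random


def random_crop(img, size, rng):
    """Crop a size x size window at a random position. img is a list of rows."""
    h, w = len(img), len(img[0])
    top = rng.randrange(h - size + 1)
    left = rng.randrange(w - size + 1)
    return [row[left:left + size] for row in img[top:top + size]]


def horizontal_flip(img):
    return [row[::-1] for row in img]


def augment(img, label, crop_size, rng, flip_safe_labels):
    """Always crop; mirror only when the label keeps its meaning under a flip."""
    out = random_crop(img, crop_size, rng)
    if label in flip_safe_labels and rng.random() < 0.5:
        out = horizontal_flip(out)
    return out, label


rng = random.Random(0)
img = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 toy image

patch, label = augment(img, "cat", 3, rng, flip_safe_labels={"cat", "dog"})
print(len(patch), len(patch[0]), label)  # 3 3 cat
```

A label that is missing from `flip_safe_labels`, such as a digit or a left/right arrow, is still cropped but never mirrored.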

Transfer learning: freeze early layers pretrained on ImageNet; train head on small dataset — fewer labels needed when low-level filters already exist.

Deep dive — what to watch in training curves

Pattern	Likely issue
Train acc 99%, val acc 60%	Overfit — more data, dropout, weight decay
Both acc low	Underfit — bigger model, train longer, check labels
Loss NaN	lr too high, bad normalization, mixed precision overflow
Val improves then degrades	Late overfit — early stopping
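The last row of the table is usually handled with a patience rule. The sketch below is one simple variant: stop once the validation loss has failed to improve for `patience` consecutive epochs and keep the checkpoint from the best epoch. The loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) as 0-based indices."""
    best, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch  # patience exhausted: stop here
    return best_epoch, len(val_losses) - 1  # never triggered


val = [0.90, 0.70, 0.60, 0.58, 0.61, 0.65, 0.70, 0.80]
print(early_stopping_epoch(val))  # (3, 6)
```

Here training would stop after epoch 6 and restore the weights saved at epoch 3.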

Check your understanding

What does translation equivariance mean for a conv layer?
Why is parameter sharing appropriate for photos but not for spreadsheet columns?
Name one symptom of a receptive field that is too small for the task.
Why does batch norm behave differently at train vs eval?
How does a residual connection change gradient flow?

Lab-style stretch goals

Train a small CNN on CIFAR-10 with and without augmentation; plot train/val accuracy. Try freezing a pretrained backbone + linear head on 500 images per class.

Debug: Visualize first-layer filters — do you see Gabor-like edge detectors emerge?