Module 3 — Deep learning for vision

Convolutional networks for images

Conv mechanics, receptive-field recurrence, batch norm and ResNet skips, backprop through conv, cross-entropy training, and augmentation.

~90 min read + exercises

Hand-designed filters become learned filters. This lesson covers CNN inductive biases, how backpropagation trains conv layers, and the training stack (loss, augmentation, normalization) you need before detection and deployment.

Figure: A classic CNN backbone (shrink spatially, grow channels). Stages: Input H×W×3 → Conv 3×3 + ReLU (32 ch) → Conv + Pool (64 ch) → Conv + Pool (128 ch) → Deeper conv (256 ch) → FC/Head (logits). Each block is a feature map produced by conv + activation (optionally pooled). Resolution drops while channels expand to encode richer concepts.

Learning objectives

  • Explain parameter sharing, local connectivity, and translation equivariance.
  • Compute receptive field growth with stride, padding, and dilation.
  • Trace backpropagation through one conv layer (weights and input gradients).
  • Describe batch normalization, skip connections, and why depth trains.
  • Set up a classification training loop with cross-entropy and diagnose overfitting.
  • Apply data augmentation without breaking label semantics.

Prerequisites

  • Convolution lesson (kernels, padding, stride).
  • Basic supervised learning (loss, gradient descent).

Step 1 — Why fully-connected layers on full images fail

Flattening a $224 \times 224 \times 3$ image yields 150,528 inputs. One dense layer to 1000 units needs ≈ 150M weights: huge, data-hungry, and translation blind (shifting the cat one pixel reweights entirely different connections).

A conv layer with $C_{\text{out}}$ filters of size $k \times k$ uses roughly $C_{\text{in}} \cdot k^2 \cdot C_{\text{out}}$ parameters shared at every spatial location.
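The two parameter counts above can be compared directly. This is a minimal sketch; the helper names `dense_params` and `conv_params` are illustrative, and the 64-filter 3×3 layer is an assumed example rather than a layer from this lesson.

```python
def dense_params(h, w, c_in, units):
    """Weights plus biases for one dense layer on a flattened h x w x c_in image."""
    return h * w * c_in * units + units


def conv_params(c_in, c_out, k):
    """Weights plus biases for one k x k conv layer; independent of image size."""
    return c_in * k * k * c_out + c_out


print(dense_params(224, 224, 3, 1000))  # 150529000
print(conv_params(3, 64, 3))            # 1792
```

The conv count does not depend on the image height or width, which is the practical payoff of parameter sharing.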

Checkpoint: What is the computational intuition behind “same kernel swept across the image”?

Each output location sees the same pattern detector; spatial structure is preserved in the feature map grid.


Step 2 — Conv layer mechanics

Output (with bias, stride 1; pad the input first if the spatial size should be kept):

$$Y[i,j,c] = \sum_{c',u,v} W[c,c',u,v]\, X[i+u, j+v, c'] + b[c]$$

  • Stride $s$: output size $\lfloor (H-k)/s \rfloor + 1$ without padding, or $\lfloor (H+2p-k)/s \rfloor + 1$ with padding $p$ per side; downsamples without a separate pooling layer.
  • Padding “same”: keeps the spatial size for $s=1$ (use $p=(k-1)/2$ for odd $k$).
  • Dilation $d$: effective kernel size $d(k-1)+1$; expands the receptive field without downsampling the feature map.
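The output formula and the size rules above can be sketched in NumPy. This is a deliberately slow reference loop, not how a framework implements convolution. The layout `(H, W, C_in)` for the input and `(C_out, C_in, k, k)` for the weights is an assumption chosen to match the index order in the equation, and the function names are illustrative.

```python
import numpy as np


def conv_out_size(n, k, s=1, p=0, d=1):
    """Output length along one axis: input n, kernel k, stride s, padding p, dilation d."""
    k_eff = d * (k - 1) + 1          # effective kernel size under dilation
    return (n + 2 * p - k_eff) // s + 1


def conv2d(x, w, b, stride=1):
    """Naive unpadded cross-correlation. x: (H, W, C_in), w: (C_out, C_in, k, k), b: (C_out,)."""
    height, width, _ = x.shape
    c_out, _, k, _ = w.shape
    h_out = conv_out_size(height, k, stride)
    w_out = conv_out_size(width, k, stride)
    y = np.zeros((h_out, w_out, c_out))
    for i in range(h_out):
        for j in range(w_out):
            top, left = i * stride, j * stride
            patch = x[top:top + k, left:left + k, :]            # (k, k, C_in)
            # Sum over input channel c and kernel offsets u, v for every output channel o.
            y[i, j, :] = np.einsum('ocuv,uvc->o', w, patch) + b
    return y


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 3))
w = rng.normal(size=(4, 3, 3, 3))
b = np.zeros(4)
print(conv2d(x, w, b).shape)             # (6, 6, 4)
print(conv2d(x, w, b, stride=2).shape)   # (3, 3, 4)
```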

Figure: Receptive field grows with depth. Stacking small kernels grows the effective receptive field: one 3×3 layer sees a 3×3 region, two stacked see 5×5, and three see 7×7, roughly the area of a single 7×7 kernel with far fewer parameters.

Receptive field recurrence (stack of $L$ layers with kernel $k_\ell$, stride $s_\ell$):

$$r_\ell = r_{\ell-1} + (k_\ell - 1) \prod_{i=1}^{\ell-1} s_i$$

(start with $r_0 = 1$). Exercise: Three stride-1 3×3 layers → $r=7$. Add a stride-2 pool after layer 1. How does $r_3$ grow?
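The recurrence is easy to evaluate in a few lines. In this sketch the function name `receptive_field` and the `(kernel, stride)` tuple format are illustrative choices; pooling layers can be passed in the same format.

```python
def receptive_field(layers):
    """layers: list of (kernel, stride) from input to output.

    Returns the receptive field size after each layer, starting from r_0 = 1.
    """
    r, jump = 1, 1          # jump = product of the strides of all earlier layers
    sizes = []
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
        sizes.append(r)
    return sizes


print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # [3, 5, 7]
```

For the exercise, insert a pooling entry such as `(2, 2)` after the first layer and compare the resulting list with a hand calculation.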


Step 3 — Equivariance vs invariance

  • Equivariance: if input shifts, output feature map shifts the same way (before pooling).
  • Invariance: output unchanged under shift — built via pooling, global average pool, or learned attention.
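Both properties can be checked numerically on a toy input. The sketch below assumes a single-channel image, an unpadded cross-correlation, and a small blob kept far enough from the border that the shift does not push its response off the map; the Sobel-like kernel and the helper name are illustrative.

```python
import numpy as np


def correlate_valid(x, w):
    """Unpadded, stride-1 cross-correlation of a 2D image with a k x k kernel."""
    k = w.shape[0]
    h_out, w_out = x.shape[0] - k + 1, x.shape[1] - k + 1
    return np.array([[np.sum(x[i:i + k, j:j + k] * w) for j in range(w_out)]
                     for i in range(h_out)])


x = np.zeros((12, 12))
x[3:5, 4:6] = 1.0                                   # small blob away from the border
w = np.array([[1., 0., -1.], [2., 0., -2.], [1., 0., -1.]])

shift = (2, 3)                                      # move the blob down 2, right 3
x_shifted = np.roll(x, shift, axis=(0, 1))

y = correlate_valid(x, w)
y_shifted = correlate_valid(x_shifted, w)

# Equivariance: the feature map moves by the same shift as the input.
equivariant = np.allclose(y_shifted, np.roll(y, shift, axis=(0, 1)))
# Invariance: a global max pool over the map ignores where the response sits.
invariant = np.isclose(y.max(), y_shifted.max())
print(equivariant, invariant)                       # True True
```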

Checkpoint: Why is a deep linear network useless for classification?

A composition of linear maps is itself linear, so extra depth adds no expressive power: decision boundaries stay flat, with no curvature to separate classes.


Step 4 — Nonlinearities, normalization, regularization

| Component | Role |
|---|---|
| ReLU $\max(0,x)$ | Sparse activations, cheap; neurons can die if the learning rate is too high |
| GELU / SiLU | Smoother than ReLU; GELU is common in ViT backbones |
| Batch norm | Stabilizes the scale of activations; $\gamma, \beta$ learnable |
| Dropout | Zeros random activations at train time; reduces co-adaptation |
| Weight decay | L2 penalty on weights; favors smaller weights and simpler decision boundaries |

Batch norm (per channel, minibatch statistics at train, running average at eval):

$$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \quad y = \gamma \hat{x} + \beta$$
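The train/eval split in the formula above can be made concrete with a small NumPy class. This is a forward-only sketch: the class name, the `(N, H, W, C)` layout, and the exponential running-average rule with `momentum=0.1` are assumptions for illustration, and no gradients are computed.

```python
import numpy as np


class BatchNorm2d:
    """Forward-only batch norm over the channel axis of an (N, H, W, C) array."""

    def __init__(self, channels, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(channels)           # learnable scale
        self.beta = np.zeros(channels)           # learnable shift
        self.running_mean = np.zeros(channels)
        self.running_var = np.ones(channels)
        self.momentum, self.eps = momentum, eps

    def __call__(self, x, training):
        if training:
            # Per-channel statistics of the current minibatch.
            mu = x.mean(axis=(0, 1, 2))
            var = x.var(axis=(0, 1, 2))
            # Update the running averages that eval mode will use.
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mu
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta


rng = np.random.default_rng(0)
bn = BatchNorm2d(channels=2)
x = rng.normal(5.0, 3.0, size=(16, 4, 4, 2))
train_out = bn(x, training=True)    # normalized with this batch's statistics
eval_out = bn(x, training=False)    # normalized with the running averages
```

After a single training step the running averages are still far from the batch statistics, so `train_out` and `eval_out` differ noticeably; the gap shrinks as more batches are seen.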

Step 5 — Depth, skip connections, and receptive field

Vanishing gradients in deep stacks motivated ResNet blocks:

$$\mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x}$$

Gradients can flow through the identity shortcut; $\mathcal{F}$ learns residual corrections.
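The gradient claim follows from differentiating the block equation. As a short worked step, assuming gradients are written as row vectors and $I$ is the identity matrix:

```latex
\frac{\partial \mathcal{L}}{\partial \mathbf{x}}
  = \frac{\partial \mathcal{L}}{\partial \mathbf{y}}
    \left( \frac{\partial \mathcal{F}}{\partial \mathbf{x}} + I \right)
  = \frac{\partial \mathcal{L}}{\partial \mathbf{y}}\,
    \frac{\partial \mathcal{F}}{\partial \mathbf{x}}
    + \frac{\partial \mathcal{L}}{\partial \mathbf{y}}
```

The second term passes the upstream gradient through unchanged, so even when $\partial \mathcal{F}/\partial \mathbf{x}$ is small the signal reaching earlier blocks does not vanish.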

Typical hierarchy (empirical, not law):

  • Early: edges, color blobs.
  • Mid: textures, parts.
  • Late: object-level semantics.

Pooling and strided convs trade spatial resolution for more context per parameter, usually alongside a growing channel count.

Exercise: Max pool vs average pool on a thin vertical edge — which preserves peak response better?

Max pool — average dilutes sharp peak.
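The answer can be verified on a toy feature map. This sketch assumes a one-pixel-wide edge response, non-overlapping 2×2 windows, and even map dimensions; `pool2x2` is an illustrative helper.

```python
import numpy as np


def pool2x2(x, reduce):
    """Non-overlapping 2x2 pooling of a 2D map with even height and width."""
    h, w = x.shape
    blocks = x.reshape(h // 2, 2, w // 2, 2)   # blocks[i, a, j, b] = x[2i + a, 2j + b]
    return reduce(blocks, axis=(1, 3))


edge = np.zeros((4, 6))
edge[:, 2] = 1.0                               # thin vertical edge response

print(pool2x2(edge, np.max))                   # peak stays at 1.0
print(pool2x2(edge, np.mean))                  # peak is diluted to 0.5
```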


Step 6 — Training loop and cross-entropy

For $K$ classes with logits $\mathbf{z}$, the softmax probabilities are $p_k = e^{z_k}/\sum_j e^{z_j}$. With one-hot label $y$:

$$\mathcal{L} = -\sum_k y_k \log p_k$$
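With a one-hot label the sum collapses to the negative log-probability of the true class. The sketch below computes it through a log-softmax with the usual max-subtraction trick for numerical stability; the function name and the example logits are illustrative.

```python
import numpy as np


def softmax_cross_entropy(z, label):
    """z: logits of shape (K,), label: integer class index. Returns (loss, probabilities)."""
    z = z - z.max()                         # shifting logits leaves softmax unchanged
    log_p = z - np.log(np.exp(z).sum())     # log-softmax
    return -log_p[label], np.exp(log_p)


loss, p = softmax_cross_entropy(np.array([2.0, 1.0, 0.1]), label=0)
print(round(float(loss), 3), p.round(3))    # loss is roughly 0.417

# Very large logits would overflow a naive exp(); the shifted version stays finite.
big_loss, _ = softmax_cross_entropy(np.array([1000.0, 0.0]), label=0)
```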

Minibatch SGD / Adam:

  1. Forward: $\mathbf{z} = f_\theta(\mathbf{x})$.
  2. Loss $\mathcal{L}$.
  3. Backward: $\partial \mathcal{L}/\partial \theta$ via autodiff.
  4. Update $\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}$ (plain SGD; Adam rescales this step per parameter).
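The four steps can be sketched end to end without a framework. To keep the gradient short enough to write by hand, this example swaps the CNN for a linear softmax classifier on synthetic 2D clusters; the data, learning rate, batch size, and epoch count are all assumed values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 300, 2, 3
centers = np.array([[0., 2.], [2., -1.], [-2., -1.]])
y = rng.integers(0, K, size=N)                      # synthetic labels
x = centers[y] + 0.5 * rng.normal(size=(N, D))      # one noisy cluster per class

W = np.zeros((D, K))
b = np.zeros(K)
lr, batch = 0.1, 32


def forward(xb):
    """Softmax probabilities of the linear model (stand-in for f_theta)."""
    z = xb @ W + b
    z = z - z.max(axis=1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)


def mean_loss(xb, yb):
    p = forward(xb)
    return -np.log(p[np.arange(len(yb)), yb]).mean()


initial_loss = mean_loss(x, y)
for epoch in range(20):
    order = rng.permutation(N)                      # reshuffle every epoch
    for start in range(0, N, batch):
        idx = order[start:start + batch]
        xb, yb = x[idx], y[idx]
        p = forward(xb)                             # 1. forward
        # 2-3. For mean cross-entropy, dL/dz = (p - onehot) / batch_size.
        g = p.copy()
        g[np.arange(len(yb)), yb] -= 1.0
        g /= len(yb)
        W -= lr * (xb.T @ g)                        # 4. SGD update
        b -= lr * g.sum(axis=0)

final_loss = mean_loss(x, y)
accuracy = (forward(x).argmax(axis=1) == y).mean()
print(initial_loss, final_loss, accuracy)
```

With a real CNN the only structural change is that step 3 is handled by autodiff instead of a hand-derived gradient.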

Backprop through conv (sketch)

If $\partial \mathcal{L}/\partial Y$ is known, then

$$\frac{\partial \mathcal{L}}{\partial W[c,c',u,v]} = \sum_{i,j} \frac{\partial \mathcal{L}}{\partial Y[i,j,c]}\, X[i+u,j+v,c']$$

and $\partial \mathcal{L}/\partial X$ is a full convolution of $\partial \mathcal{L}/\partial Y$ with the spatially flipped $W$, summed over output channels $c$. Both gradients have the same sliding-window structure as the forward pass, which is why conv nets train efficiently on GPUs.
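Both gradients can be checked against finite differences. The sketch below assumes an unpadded, stride-1, bias-free layer with the same array layout as the equations; the input gradient is written in scatter form (each output position adds its contribution back to the patch it read), which is equivalent to the full convolution with the flipped kernel. Function names are illustrative.

```python
import numpy as np


def conv_forward(x, w):
    """Unpadded, stride-1 cross-correlation. x: (H, W, C_in), w: (C_out, C_in, k, k)."""
    c_out, _, k, _ = w.shape
    h_out, w_out = x.shape[0] - k + 1, x.shape[1] - k + 1
    y = np.zeros((h_out, w_out, c_out))
    for i in range(h_out):
        for j in range(w_out):
            y[i, j] = np.einsum('ocuv,uvc->o', w, x[i:i + k, j:j + k])
    return y


def conv_backward(x, w, dy):
    """Given dy = dL/dY, return (dL/dW, dL/dX)."""
    k = w.shape[2]
    dw, dx = np.zeros_like(w), np.zeros_like(x)
    for i in range(dy.shape[0]):
        for j in range(dy.shape[1]):
            patch = x[i:i + k, j:j + k]                                    # (k, k, C_in)
            dw += np.einsum('o,uvc->ocuv', dy[i, j], patch)                # weight gradient
            dx[i:i + k, j:j + k] += np.einsum('o,ocuv->uvc', dy[i, j], w)  # input gradient
    return dw, dx


rng = np.random.default_rng(0)
x = rng.normal(size=(6, 6, 2))
w = rng.normal(size=(3, 2, 3, 3))
r = rng.normal(size=(4, 4, 3))        # define L = sum(Y * r), so dL/dY = r
dw, dx = conv_backward(x, w, r)

# Finite-difference check on one weight entry and one input entry.
eps = 1e-6
base = np.sum(conv_forward(x, w) * r)
w_pert = w.copy()
w_pert[1, 0, 2, 1] += eps
numeric_dw = (np.sum(conv_forward(x, w_pert) * r) - base) / eps
x_pert = x.copy()
x_pert[2, 3, 1] += eps
numeric_dx = (np.sum(conv_forward(x_pert, w) * r) - base) / eps
print(np.isclose(numeric_dw, dw[1, 0, 2, 1], atol=1e-4),
      np.isclose(numeric_dx, dx[2, 3, 1], atol=1e-4))
```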

Checkpoint: What is overfitting?

Training loss keeps falling while validation loss rises: the model is memorizing idiosyncrasies of the training set (backgrounds, watermarks) instead of features that generalize.


Step 7 — Data augmentation as curriculum

| Augmentation | Teaches |
|---|---|
| Random crop / flip | Translation / mirror invariance |
| Color jitter | Lighting invariance |
| Cutout / CutMix | Robustness to occlusion, label mixing |
| Mixup | Linear regions between classes |
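As one concrete row from the table, Mixup blends two training examples and their one-hot labels with the same weight. This is a minimal sketch of the standard formulation; the function signature and the default `alpha=0.2` are illustrative assumptions.

```python
import numpy as np


def mixup(x1, y1, x2, y2, rng, alpha=0.2):
    """Blend two images and their one-hot labels with lam ~ Beta(alpha, alpha)."""
    lam = rng.beta(alpha, alpha)
    x_mix = lam * x1 + (1 - lam) * x2
    y_mix = lam * y1 + (1 - lam) * y2      # soft label: still sums to 1
    return x_mix, y_mix, lam


rng = np.random.default_rng(0)
img_a, img_b = rng.random((32, 32, 3)), rng.random((32, 32, 3))
label_a, label_b = np.array([1., 0., 0.]), np.array([0., 0., 1.])
x_mix, y_mix, lam = mixup(img_a, label_a, img_b, label_b, rng)
```

The mixed label is a soft target, so the cross-entropy must accept a full probability vector rather than a single class index.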

Rule: augmentations must preserve label meaning (no vertical flip for “6”/“9” unless labels swap).
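A label-preserving pipeline for natural images might look like the sketch below: reflect-pad, take a random crop, and flip left-right half the time. The function name, the 4-pixel padding, and the 32×32 crop are assumptions in the style of common CIFAR-10 recipes, not settings prescribed by this lesson.

```python
import numpy as np


def random_crop_flip(img, crop, rng, pad=4):
    """img: (H, W, C). Reflect-pad, crop a random crop x crop window, maybe flip left-right."""
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode='reflect')
    top = rng.integers(0, padded.shape[0] - crop + 1)
    left = rng.integers(0, padded.shape[1] - crop + 1)
    out = padded[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        # Horizontal flip only: a vertical flip could change the label (6 vs 9).
        out = out[:, ::-1]
    return out


rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
augmented = random_crop_flip(image, crop=32, rng=rng)
print(augmented.shape)   # (32, 32, 3)
```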

Transfer learning: freeze early layers pretrained on ImageNet; train head on small dataset — fewer labels needed when low-level filters already exist.


Deep dive — what to watch in training curves

| Pattern | Likely issue |
|---|---|
| Train acc 99%, val acc 60% | Overfit: more data, dropout, weight decay |
| Both acc low | Underfit: bigger model, train longer, check labels |
| Loss NaN | Learning rate too high, bad normalization, mixed precision overflow |
| Val improves then degrades | Late overfit: early stopping |
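The last row of the table can be automated with a patience rule. This is one simple variant among several in use; the function name, the default `patience=3`, and the validation losses in the example are made up for illustration.

```python
def best_epoch_with_patience(val_losses, patience=3):
    """Return (best_epoch, stop_epoch).

    Training stops once validation loss has failed to improve for `patience` epochs in a row.
    """
    best, best_epoch = float('inf'), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


val = [1.20, 0.90, 0.75, 0.70, 0.72, 0.76, 0.81, 0.90]   # hypothetical curve
print(best_epoch_with_patience(val))                      # (3, 6)
```

In practice the weights saved at `best_epoch` are the ones to keep, not those from the epoch where training stopped.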

Check your understanding

  1. What does translation equivariance mean for a conv layer?
  2. Why is parameter sharing appropriate for photos but not for spreadsheet columns?
  3. Name one symptom of receptive field too small for the task.
  4. Why does batch norm behave differently at train vs eval?
  5. How does a residual connection change gradient flow?

Lab-style stretch goals

Train a small CNN on CIFAR-10 with and without augmentation; plot train/val accuracy. Try freezing a pretrained backbone + linear head on 500 images per class.

Debug: Visualize first-layer filters — do you see Gabor-like edge detectors emerge?