
Learning-based vision

Convolutional networks for images

From fully-connected layers to conv blocks, receptive fields, pooling, and training objectives.

~75 min read + exercises


You are now at the point where hand-designed filters become learned filters. This lesson focuses on the inductive biases of CNNs and how training turns raw pixels into task-relevant representations.

Figure: A classic CNN backbone — shrink spatially, grow channels. Pipeline: Input (H×W×3) → Conv 3×3 + ReLU (32 ch) → Conv + Pool (64 ch) → Conv + Pool (128 ch) → Deeper conv (256 ch) → FC/Head (logits).
Each block is a feature map produced by conv + activation (optionally pooled). Resolution drops while channels expand to encode richer concepts.

Learning objectives

  • Map a classic “conv → nonlinearity → pool” stack to receptive fields and translation equivariance.
  • Explain parameter sharing and why it matters for images.
  • Describe a training loop at a high level: forward pass, loss, backward pass, optimizer step.

Prerequisites

  • Convolution lesson (spatial locality and kernels).
  • Basic understanding of supervised learning (input → label).

Step 1 — Why fully-connected layers on full images explode

If you flatten a 224 × 224 RGB image, you have 224 × 224 × 3 = 150,528 inputs. A single dense layer mapping those to 1000 units needs roughly 150 million weights — huge, data-hungry, and blind to spatial structure.

Convolutional layers reuse the same weights at every spatial location: parameter sharing.
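
To make the contrast concrete, here is a minimal sketch (assuming PyTorch; the layer sizes are illustrative) that counts parameters for a dense layer versus a conv layer on the same input:

import torch.nn as nn

# Dense layer: every input pixel gets its own weight per output unit.
dense = nn.Linear(224 * 224 * 3, 1000)

# Conv layer: one 3x3 kernel per (input channel, output channel) pair,
# reused at every spatial location.
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3)

def count(m):
    return sum(p.numel() for p in m.parameters())

print(f"dense: {count(dense):,}")  # 150,529,000
print(f"conv:  {count(conv):,}")   # 1,792 (= 3*3*3*64 + 64)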

Checkpoint: What is the computational intuition behind “same kernel swept across the image”?


Step 2 — Conv blocks build receptive fields

Each successive convolution expands the receptive field: the region of the input image that can influence a particular output unit.

  • Stacking small kernels (e.g. 3×3 repeated) can mimic larger effective neighborhoods with fewer parameters than one giant kernel, depending on depth and channels (see the sketch after this list).
  • Dilation (a later topic) expands the receptive field without shrinking resolution as aggressively.
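
A quick way to see the growth is to run the standard receptive-field recurrence by hand. A minimal plain-Python sketch (the layer list is illustrative):

# Receptive field recurrence: each layer adds (kernel - 1) * jump input
# pixels, where jump is the product of all strides seen so far.
layers = [(3, 1), (3, 1), (3, 1)]  # (kernel_size, stride) per conv layer

rf, jump = 1, 1
for k, s in layers:
    rf += (k - 1) * jump
    jump *= s
    print(f"kernel {k}x{k}, stride {s} -> receptive field {rf}x{rf}")
# Prints 3x3, then 5x5, then 7x7 for three stride-1 3x3 convs.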

Figure: Stacking small kernels grows the effective receptive field. Panels: Layer 1 (3×3) → ~3×3 receptive field; Layer 2 (3×3) → ~5×5 receptive field; Layer 3 (3×3) → ~7×7 receptive field.
One 3×3 layer sees a 3×3 region; two stacked see 5×5; three see 7×7 — with far fewer parameters than a single 7×7 kernel.
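
The caption's parameter claim is quick to verify. With C channels in and C channels out at every layer (ignoring biases), three stacked 3×3 convolutions cost 3 · (3 · 3 · C · C) = 27C² weights, while a single 7×7 convolution costs 7 · 7 · C · C = 49C², nearly twice as many for the same coverage. The stack also interleaves nonlinearities, which the single big kernel cannot.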

Exercise: For three stride-1 3×3 convolutions stacked, give a rough receptive field size for a center output neuron (ignore boundaries).


Step 3 — Nonlinearities and depth

A stack of linear operations is still linear. Nonlinear activations (ReLU family, GELU, etc.) let the network represent curved decision boundaries and compose hierarchical features.
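
To see why depth without nonlinearity buys nothing, here is a minimal NumPy sketch (shapes are illustrative) showing two stacked linear layers collapsing into one:

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((16, 8))   # layer 1: 8 -> 16
W2 = rng.standard_normal((4, 16))   # layer 2: 16 -> 4
x = rng.standard_normal(8)

# Two linear layers in sequence equal one linear layer whose
# weight matrix is the product W2 @ W1.
deep = W2 @ (W1 @ x)
shallow = (W2 @ W1) @ x
print(np.allclose(deep, shallow))  # True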

Typical story (not a law of nature, but useful):

  • Early layers: oriented edges and textures.
  • Mid layers: parts and local patterns.
  • Late layers: object- or class-specific cues.

Checkpoint: Why would a deep linear network be pointless for classification?


Step 4 — Pooling and downsampling

Pooling (max/average) or strided convolutions reduce spatial resolution and increase effective context per parameter.
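
Here is a minimal NumPy sketch of non-overlapping 2×2 pooling (the feature map values are illustrative); you can adapt it for the exercise at the end of this step:

import numpy as np

# A 4x4 feature map with a sharp edge in its right half.
fmap = np.array([[0, 0, 0, 9],
                 [0, 0, 0, 9],
                 [0, 0, 0, 9],
                 [0, 0, 0, 9]], dtype=float)

def pool2x2(x, op):
    h, w = x.shape
    return op(x.reshape(h // 2, 2, w // 2, 2), axis=(1, 3))

print(pool2x2(fmap, np.max))   # right blocks -> 9.0: the peak survives
print(pool2x2(fmap, np.mean))  # right blocks -> 4.5: the peak is diluted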

Trade-offs:

  • Downsampling too early can lose small objects.
  • Avoiding downsampling costs memory.

Exercise: Compare max pooling vs average pooling on a sharp edge feature map — what differs qualitatively?


Step 5 — Training objective (classification example)

For multi-class classification, cross-entropy compares the softmax of the predicted logits to a one-hot label distribution.
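
Written out (this is the standard definition, not specific to this lesson), for logits z and true class index y:

L(z, y) = −log( exp(z_y) / Σ_k exp(z_k) )

i.e. the negative log of the probability the softmax assigns to the correct class.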

The training loop (a code sketch follows the list):

  1. Sample a minibatch of images and labels.
  2. Forward pass: compute predictions.
  3. Compute loss.
  4. Backward pass: gradients via automatic differentiation.
  5. Optimizer step (SGD, Adam, …) updates weights.
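
A minimal sketch of those five steps, assuming PyTorch (the model is a toy, and loader is a placeholder DataLoader you would define yourself):

import torch
import torch.nn as nn

model = nn.Sequential(               # tiny illustrative CNN
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 10),     # assumes 32x32 inputs (e.g. CIFAR-10)
)
loss_fn = nn.CrossEntropyLoss()      # takes raw logits + integer labels
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for images, labels in loader:        # 1. sample a minibatch
    logits = model(images)           # 2. forward pass
    loss = loss_fn(logits, labels)   # 3. compute loss
    optimizer.zero_grad()
    loss.backward()                  # 4. backward pass via autodiff
    optimizer.step()                 # 5. optimizer step updates weights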

Checkpoint: What is overfitting, and what is one symptom you would see on your loss curves?


Step 6 — Data augmentation as inductive bias

Common augmentations (crops, flips, color jitter) teach invariance the network might not discover from limited data.

Important nuance: augmentations must respect the label semantics (e.g. don’t vertical-flip digit “6” into “9” unless labels swap too).
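
As a concrete example, a typical pipeline for natural photographs might look like this (assuming torchvision; the specific transforms and magnitudes are illustrative):

from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),    # small translations
    transforms.RandomHorizontalFlip(),       # left/right symmetry
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # lighting
    transforms.ToTensor(),
])
# Note the absence of vertical flips: for digits or text they would
# change the label semantics, as discussed above.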


Check your understanding

  1. What does translation equivariance mean for a conv layer?
  2. Why is parameter sharing appropriate for natural photographs but less obviously so for spreadsheet-style tabular data?
  3. Name one phenomenon that indicates your receptive field is too small for the task.

Lab-style stretch goal (optional)

Train a tiny CNN on CIFAR-10 with and without data augmentation; plot validation accuracy vs epoch and compare.