Encoder–decoder & dense prediction
Before we begin
In Lesson 1 you saw that segmentation outputs an H×W map of labels. A standard image classifier does the opposite of what we need at the end:
CNN → feature maps → global average pool → one vector → softmax → "cat"
That destroys spatial resolution on purpose, because the classifier only needs one summary. For segmentation we must preserve location: the prediction at pixel (42, 17) must describe that pixel in the input.
The standard solution is an encoder–decoder network:
- Encoder — shrink spatial size, grow channels → “what is where, roughly?”
- Decoder — grow spatial size back → “label every pixel.”
Figure: Dense prediction goal
What you will learn
- Trace spatial dimensions through encoder and decoder stages.
- Explain receptive field and why downsampling helps context.
- Compare upsampling methods used in decoders.
- Apply joint augmentation rules for images and masks.
- Articulate the bottleneck problem that U-Net solves next lesson.
Before this lesson
Why not just use a big fully connected layer?
Module 3 flattened MNIST to 784 inputs. For a 256×256 RGB image:
256 × 256 × 3 = 196,608 input dimensions
→ predict 256 × 256 = 65,536 outputs
That fully connected map would need 196,608 × 65,536 ≈ 12.9 billion weights, ignore local structure, and fail to generalize. Convolutions share filters across space, which is the right inductive bias for images.
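The arithmetic is easy to check. The sketch below counts the weights of that hypothetical fully connected map and compares them with a single 3×3 convolution; the 3 → 64 channel choice and the float32 storage size are assumptions made for illustration.

```python
# Weight count of a dense layer mapping every input value to every output pixel.
in_features = 256 * 256 * 3      # flattened RGB image
out_features = 256 * 256         # one output per pixel
fc_weights = in_features * out_features

# A 3x3 conv from 3 to 64 channels (assumed sizes), weights plus biases.
conv_params = 3 * 64 * 3 * 3 + 64

# Storage for the dense weights at 4 bytes each (float32 assumed).
fc_gib = fc_weights * 4 / 2**30

print(f"fully connected: {fc_weights:,} weights (~{fc_gib:.0f} GiB)")
print(f"3x3 conv, 3->64: {conv_params:,} parameters")
```

The conv reuses the same 1,792 parameters at every location, which is why its size does not depend on the image resolution.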
We still need a head that outputs per-pixel logits. An encoder–decoder provides one by keeping features as spatial maps throughout: unlike a classifier, it never flattens the whole image into a single vector.
Encoder (contracting path)
Repeat blocks of:
Conv 3×3 → ReLU → (optional second conv) → MaxPool 2×2
Each pool halves height and width; the convs in each block increase channel depth (feature richness).
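One way such a block might look in PyTorch is sketched below. The class name `EncoderBlock`, the use of `padding=1` (so the convs keep H×W unchanged), and the choice to include the second conv are assumptions, not the course's starter code.

```python
import torch
from torch import nn


class EncoderBlock(nn.Module):
    """Conv 3x3 -> ReLU -> Conv 3x3 -> ReLU -> MaxPool 2x2 (illustrative sketch)."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.features = nn.Sequential(
            # padding=1 keeps height and width unchanged through each 3x3 conv
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # 2x2 max pool with stride 2 halves height and width
        self.pool = nn.MaxPool2d(kernel_size=2)

    def forward(self, x):
        return self.pool(self.features(x))


x = torch.randn(1, 3, 256, 256)
y = EncoderBlock(3, 64)(x)
print(tuple(y.shape))  # (1, 64, 128, 128)
```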
Worked example: 256×256 input
Assume input (batch, 3, 256, 256):
| Stage | After block | Spatial H×W | Channels (example) | What it tends to represent |
|---|---|---|---|---|
| Input | — | 256×256 | 3 | RGB |
| Enc 1 | conv, pool | 128×128 | 64 | edges, color blobs |
| Enc 2 | conv, pool | 64×64 | 128 | parts, textures |
| Enc 3 | conv, pool | 32×32 | 256 | object-level context |
| Enc 4 | conv, pool | 16×16 | 512 | scene layout |
| Bottleneck | conv | 16×16 | 1024 | “what” without fine “where” |
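The table can be reproduced with a few lines of shape bookkeeping. This is a plain-Python sketch; the helper name `trace_encoder` is hypothetical and the channel widths are the example values from the table.

```python
def trace_encoder(height, width, widths=(64, 128, 256, 512), bottleneck=1024):
    """Return (stage, H, W, channels) rows for a pool-per-stage encoder."""
    rows = [("Input", height, width, 3)]
    for i, channels in enumerate(widths, start=1):
        # each stage ends in a 2x2 pool, halving both spatial dimensions
        height, width = height // 2, width // 2
        rows.append((f"Enc {i}", height, width, channels))
    # the bottleneck convs change channels but not spatial size
    rows.append(("Bottleneck", height, width, bottleneck))
    return rows


rows = trace_encoder(256, 256)
for name, h, w, c in rows:
    print(f"{name:<10} {h}x{w:<4} channels={c}")
```

Calling `trace_encoder` with your own training resolution shows quickly whether the input size divides cleanly by 16.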
Receptive field: a neuron at 16×16 “sees” a large patch of the original image — good for knowing there is a dog somewhere in the frame. Bad for knowing exactly which pixel is the ear tip unless we recover resolution in the decoder.
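To put a number on "a large patch", the receptive field can be computed layer by layer: each layer adds (kernel − 1) × the current spacing between neighboring outputs, and each stride multiplies that spacing. The sketch below assumes one 3×3 conv plus one 2×2 pool per stage; a second conv per block would make the result larger.

```python
def receptive_field(layers):
    """layers: iterable of (kernel_size, stride). Returns the RF side length in input pixels."""
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


stage = [(3, 1), (2, 2)]          # 3x3 conv (stride 1), then 2x2 pool (stride 2)
encoder = stage * 4               # Enc 1 to Enc 4
print(receptive_field(encoder))               # 46
print(receptive_field(encoder + [(3, 1)]))    # 78, after one bottleneck conv
```

Under these assumptions a single bottleneck neuron covers a 78×78 input patch, but it reports on that patch with one feature vector.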
Checkpoint: After two stride-2 pools from 256, what is H×W?
256 → 128 → 64. Spatial size 64×64.
Decoder (expanding path)
Mirror the encoder in reverse:
Upsample 2× → Conv blocks → (repeat) → 1×1 conv to num_classes
Goal: climb back from 16×16 to 256×256 (or your training resolution).
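A minimal decoder along these lines is sketched below. The names `UpBlock` and `TinyDecoder` are hypothetical, the channel widths are scaled down from the worked example to keep it light, and bilinear upsampling is one choice among several.

```python
import torch
from torch import nn


class UpBlock(nn.Module):
    """One decoder stage: 2x bilinear upsample, then 3x3 conv + ReLU."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.conv(self.up(x)))


class TinyDecoder(nn.Module):
    """Four 2x stages (16 -> 256) followed by a 1x1 classification head."""

    def __init__(self, in_ch=64, num_classes=3, stages=4):
        super().__init__()
        blocks, ch = [], in_ch
        for _ in range(stages):
            blocks.append(UpBlock(ch, ch // 2))  # halve channels as resolution grows
            ch //= 2
        self.blocks = nn.Sequential(*blocks)
        self.head = nn.Conv2d(ch, num_classes, kernel_size=1)

    def forward(self, x):
        return self.head(self.blocks(x))


bottleneck = torch.randn(1, 64, 16, 16)   # stand-in for encoder output
logits = TinyDecoder()(bottleneck)
print(tuple(logits.shape))  # (1, 3, 256, 256)
```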
Upsampling options
| Method | How it works | Pros / cons |
|---|---|---|
| Bilinear / nearest upsample + conv | F.interpolate then 3×3 conv | Simple, smooth; common in modern U-Nets |
| Transposed convolution | Learned upsampling kernel | Flexible; can cause checkerboard artifacts when the kernel size is not divisible by the stride (uneven overlap) |
| Pixel shuffle | Channels → spatial rearrangement | Popular in super-resolution |
Your project uses transposed conv in the starter U-Net — if masks look grid-like, switch to bilinear + conv.
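The three options from the table can be compared side by side on a dummy feature map. The channel counts here are arbitrary, and the transposed conv uses `kernel_size=2, stride=2` as one common non-overlapping configuration, not necessarily what the starter code uses.

```python
import torch
import torch.nn.functional as F
from torch import nn

x = torch.randn(1, 64, 16, 16)

# 1) Bilinear upsample, then a 3x3 conv to mix and reduce channels.
bilinear = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
bilinear = nn.Conv2d(64, 32, kernel_size=3, padding=1)(bilinear)

# 2) Transposed conv: learned upsampling. kernel_size equal to stride means
#    output patches do not overlap.
transposed = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)(x)

# 3) Pixel shuffle: rearranges channels into space, so C drops by factor**2.
shuffled = nn.PixelShuffle(upscale_factor=2)(x)

print(tuple(bilinear.shape))    # (1, 32, 32, 32)
print(tuple(transposed.shape))  # (1, 32, 32, 32)
print(tuple(shuffled.shape))    # (1, 16, 32, 32)
```

All three double H×W. They differ in whether the upsampling itself has learnable weights and in what happens to the channel count.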
Final head
# logits shape: (batch, num_classes, H, W)
self.head = nn.Conv2d(base_channels, num_classes, kernel_size=1)
A 1×1 conv is a per-pixel linear classifier: at each (h, w) it maps the channel vector to num_classes logits.
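That equivalence can be checked numerically. The sketch below copies the weights of a 1×1 conv into an `nn.Linear` and applies the linear layer to each pixel's channel vector; the sizes (8 channels, 3 classes, 5×5 map) are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)
channels, num_classes = 8, 3

head = nn.Conv2d(channels, num_classes, kernel_size=1)
linear = nn.Linear(channels, num_classes)
with torch.no_grad():
    # conv weight is (num_classes, channels, 1, 1); drop the two 1x1 spatial dims
    linear.weight.copy_(head.weight.view(num_classes, channels))
    linear.bias.copy_(head.bias)

features = torch.randn(2, channels, 5, 5)
conv_logits = head(features)
# move channels last so Linear acts on each pixel's channel vector, then move back
linear_logits = linear(features.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

same = torch.allclose(conv_logits, linear_logits, atol=1e-5)
print(same)
```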
The bottleneck problem (motivation for U-Net)
If all fine detail must pass through the smallest layer (e.g. 16×16):
- Object boundaries get blobby.
- Thin structures (hair, spokes, fingers) disappear.
- Small objects merge with background.
The decoder upsamples, but it only has coarse feature maps to work from — it must hallucinate sharp edges.
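A toy experiment illustrates the loss. Here average pooling down to 16×16 followed by nearest upsampling stands in for the encoder and decoder. This is a deliberate simplification (a trained network is not literally an average pool), but it shows what a 16× coarser grid can and cannot represent.

```python
import torch
import torch.nn.functional as F

mask = torch.zeros(1, 1, 256, 256)
mask[..., :, 100] = 1.0               # thin structure: 1-pixel-wide vertical line
mask[..., 32:160, 160:224] = 1.0      # large object: 128x64 block

# Squeeze through a 16x16 grid, then blow back up and threshold.
coarse = F.avg_pool2d(mask, kernel_size=16)              # (1, 1, 16, 16)
restored = F.interpolate(coarse, size=(256, 256), mode="nearest") > 0.5

line_survives = bool(restored[0, 0, :, 100].any())
block_survives = bool(restored[0, 0, 32:160, 160:224].all())
print(line_survives, block_survives)  # False True
```

The line covers only 1/16 of each coarse cell it passes through, so it falls below the threshold and vanishes, while the large block comes back intact.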
Next lesson: U-Net skip connections copy high-resolution encoder features directly to the decoder so borders do not pass only through the bottleneck.
Alignment: images and masks must stay married
Every spatial transform on the image must hit the mask identically:
| Transform | Image | Mask |
|---|---|---|
| Resize 256×256 | bilinear or bicubic | nearest neighbor (preserve class IDs) |
| Horizontal flip | yes | yes |
| Random crop | yes | same crop box |
| Color jitter | yes | no (mask has no color) |
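The rules in the table can be combined into one function that draws each random decision once and applies it to both tensors. This is a sketch using only core PyTorch; `joint_augment` and its parameters are hypothetical names, and the brightness scaling is a simple stand-in for full color jitter.

```python
import torch
import torch.nn.functional as F


def joint_augment(image, mask, size=128, crop=96):
    """image: (3, H, W) float in [0, 1]; mask: (H, W) integer class IDs."""
    # Resize: bilinear for the image, nearest for the mask (keeps class IDs intact).
    image = F.interpolate(image[None], size=(size, size),
                          mode="bilinear", align_corners=False)[0]
    mask = F.interpolate(mask[None, None].float(), size=(size, size),
                         mode="nearest")[0, 0].long()

    # Horizontal flip: one coin toss, applied to both.
    if torch.rand(1).item() < 0.5:
        image = torch.flip(image, dims=[-1])
        mask = torch.flip(mask, dims=[-1])

    # Random crop: one box, applied to both.
    top = int(torch.randint(0, size - crop + 1, (1,)))
    left = int(torch.randint(0, size - crop + 1, (1,)))
    image = image[:, top:top + crop, left:left + crop]
    mask = mask[top:top + crop, left:left + crop]

    # Brightness jitter: image only, the mask is left untouched.
    factor = 0.8 + 0.4 * torch.rand(1).item()
    image = (image * factor).clamp(0.0, 1.0)
    return image, mask


image = torch.rand(3, 256, 256)
mask = torch.randint(0, 3, (256, 256))
aug_image, aug_mask = joint_augment(image, mask)
print(tuple(aug_image.shape), tuple(aug_mask.shape), aug_mask.dtype)
```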
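# import torch.nn.functional as F
# mask: (batch, 1, H, W); F.interpolate expects batch and channel dimensions
# WRONG: bilinear blends neighboring class IDs into fractional labels
mask = F.interpolate(mask.float(), scale_factor=0.5, mode="bilinear") # don't
# RIGHT: nearest keeps integer classes; cast back to long for the loss
mask = F.interpolate(mask.float(), scale_factor=0.5, mode="nearest").long()
Historical note (optional)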
Early semantic segmentation used FCN (Fully Convolutional Networks, 2015): take a classification CNN, replace fully connected layers with convs, upsample the output. U-Net (same year, medical imaging) extended FCN's limited skip fusion into a symmetric decoder with a skip connection at every resolution level, which often works better on small datasets and sharp boundaries.
You do not need to implement FCN — but knowing segmentation = conv all the way down + upsample helps papers make sense.
Encoder–decoder vs classifier — summary
| | Image classifier | Segmentation encoder–decoder |
|---|---|---|
| End spatial size | 1×1 (pooled) | H×W (same as input or target) |
| Output | One vector | Grid of logits |
| Loses pixel locations? | Yes, by design | Must not lose alignment |
| Typical loss | Cross-entropy once per image | Cross-entropy per pixel, averaged over the map |
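The last row of the table is visible in the tensor shapes passed to the loss. The sketch below uses `F.cross_entropy`, which accepts both layouts; the batch size, class count, and 8×8 map are arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_classes = 3

# Classifier: one logit vector and one label per image.
cls_logits = torch.randn(2, num_classes)
cls_target = torch.randint(0, num_classes, (2,))
cls_loss = F.cross_entropy(cls_logits, cls_target)

# Segmentation: a grid of logits (N, C, H, W) and a grid of labels (N, H, W).
seg_logits = torch.randn(2, num_classes, 8, 8)
seg_target = torch.randint(0, num_classes, (2, 8, 8))
seg_loss = F.cross_entropy(seg_logits, seg_target)

# Same value as treating every pixel as its own classification example.
flat_logits = seg_logits.permute(0, 2, 3, 1).reshape(-1, num_classes)
flat_loss = F.cross_entropy(flat_logits, seg_target.reshape(-1))

print(cls_loss.item(), seg_loss.item(), flat_loss.item())
```

With the default mean reduction, the segmentation loss is the average of 2 × 8 × 8 = 128 per-pixel classification losses.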
Checkpoint
- Why does the encoder increase channels while shrinking H×W?
- What does the decoder restore that the bottleneck alone lacks?
- Why use nearest interpolation for mask resize?
Hints: (1) Trade spatial resolution for richer feature codes / larger receptive field. (2) Full-resolution spatial layout for per-pixel labels. (3) Class IDs must stay integers 0, 1, 2 — not blended floats.