Encoder–decoder & dense prediction

Before we begin

In Lesson 1 you saw that segmentation outputs an H×W map of labels. A standard image classifier ends by doing the opposite:

text

CNN → feature maps → global average pool → one vector → softmax → "cat"

That destroys spatial resolution on purpose — the network only needs one summary. For segmentation we must preserve location: the prediction at pixel (42, 17) must describe that pixel in the input.
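To make the loss of resolution concrete, here is a minimal sketch of what global average pooling does to tensor shapes. The feature-map size (512 channels at 8×8) is an assumed example, not taken from any particular backbone.

```python
import torch
import torch.nn as nn

# Hypothetical feature maps from a classifier backbone: (batch, channels, H, W)
features = torch.randn(2, 512, 8, 8)

# Global average pooling collapses every 8×8 map to a single number per channel
pooled = nn.AdaptiveAvgPool2d(1)(features).flatten(1)

print(features.shape)  # torch.Size([2, 512, 8, 8])
print(pooled.shape)    # torch.Size([2, 512])
```

After pooling there is no height or width axis left, so nothing downstream can say where in the image a feature came from.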

The standard solution is an encoder–decoder network:

Encoder — shrink spatial size, grow channels → “what is where, roughly?”
Decoder — grow spatial size back → “label every pixel.”

Figure: Dense prediction goal. Same spatial grid in and out: H×W×3 RGB → H×W×C class logits.

What you will learn

Trace spatial dimensions through encoder and decoder stages.
Explain receptive field and why downsampling helps context.
Compare upsampling methods used in decoders.
Apply joint augmentation rules for images and masks.
Articulate the bottleneck problem that U-Net solves next lesson.

Why not just use a big fully connected layer?

Module 3 flattened MNIST to 784 inputs. For a 256×256 RGB image:

text

256 × 256 × 3 = 196,608 input dimensions
→ predict 256 × 256 = 65,536 outputs

That fully connected map would have about 12.9 billion weights (196,608 × 65,536), ignore local structure, and generalize poorly. Convolutions share filters across space, which is the right inductive bias for images.
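The arithmetic can be checked in a few lines. The conv layer used for comparison (3×3 kernel, 3 → 64 channels) is an illustrative choice, not something fixed by the lesson.

```python
# Dense layer mapping every input value to every output pixel (one logit per pixel)
in_features = 256 * 256 * 3      # 196,608
out_features = 256 * 256         # 65,536
fc_weights = in_features * out_features

# For comparison: a single 3×3 conv from 3 to 64 channels (64 is an illustrative choice)
conv_weights = 3 * 64 * 3 * 3 + 64   # weights + biases

print(f"{fc_weights:,}")    # 12,884,901,888
print(f"{conv_weights:,}")  # 1,792
```

The conv layer reuses the same 1,792 parameters at every image position, which is why its size does not depend on the image resolution.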

We still need a head that outputs per-pixel logits. An encoder–decoder provides one by keeping features as spatial maps from input to output: the image is never flattened into a single vector.

Encoder (contracting path)

Repeat blocks of:

text

Conv 3×3 → ReLU → (optional second conv) → MaxPool 2×2

Each 2×2 pool (stride 2) halves height and width. The 3×3 convs keep H×W when padded by 1 and typically widen the channel depth (feature richness) from block to block.
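One way to write such a block in PyTorch is sketched below. `EncoderBlock` is a hypothetical helper name, and using two convs with `padding=1` is an assumption; your starter code may differ.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Conv 3×3 → ReLU → Conv 3×3 → ReLU → MaxPool 2×2 (hypothetical helper)."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),  # padding=1 keeps H×W
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool2d(kernel_size=2)  # stride defaults to 2: halves H and W

    def forward(self, x):
        return self.pool(self.convs(x))

x = torch.randn(1, 3, 256, 256)
y = EncoderBlock(3, 64)(x)
print(y.shape)  # torch.Size([1, 64, 128, 128])
```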

Worked example: 256×256 input

Assume input (batch, 3, 256, 256):

Stage	Block ops	Output H×W	Channels (example)	What it tends to represent
Input	—	256×256	3	RGB
Enc 1	conv, pool	128×128	64	edges, color blobs
Enc 2	conv, pool	64×64	128	parts, textures
Enc 3	conv, pool	32×32	256	object-level context
Enc 4	conv, pool	16×16	512	scene layout
Bottleneck	conv	16×16	1024	“what” without fine “where”

Receptive field: each position in the 16×16 map “sees” a large patch of the original image. That is good for knowing there is a dog somewhere in the frame, but bad for knowing exactly which pixel is the ear tip unless the decoder recovers resolution.
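The size of that patch can be estimated with the standard receptive-field recurrence. The sketch below assumes one 3×3 conv (padding 1) per block, as in the table; a second conv per block would make the field larger.

```python
# (name, kernel, stride) for each layer; one 3×3 conv per block is an assumption
layers = []
for stage in range(1, 5):
    layers += [(f"enc{stage}.conv", 3, 1), (f"enc{stage}.pool", 2, 2)]
layers.append(("bottleneck.conv", 3, 1))

size, rf, jump = 256, 1, 1   # spatial size, receptive field, input pixels per step
for name, k, s in layers:
    rf += (k - 1) * jump     # each layer widens the field by (k - 1) steps
    jump *= s                # stride-2 pooling doubles the step between neurons
    size //= s
    print(f"{name:16s} size={size:3d}  receptive field={rf}")
```

Under these assumptions the last line reports a 16×16 map whose positions each cover a 78×78 pixel window of the input.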

Checkpoint: After two stride-2 pools from 256, what is H×W?

256 → 128 → 64. Spatial size 64×64.

Decoder (expanding path)

Mirror the encoder in reverse:

text

Upsample 2× → Conv blocks → (repeat) → 1×1 conv to num_classes

Goal: climb back from 16×16 to 256×256 (or your training resolution).
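A minimal sketch of that climb, using bilinear upsampling followed by a conv. `DecoderBlock` is a hypothetical helper, the channel widths mirror the encoder example, and 5 classes is an arbitrary demo value. There are no skip connections here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    """Upsample 2× → Conv 3×3 → ReLU (hypothetical helper, no skip connections)."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        return F.relu(self.conv(x))

# Channel widths mirror the encoder example: 1024 → 512 → 256 → 128 → 64
widths = [1024, 512, 256, 128, 64]
decoder = nn.Sequential(*[DecoderBlock(i, o) for i, o in zip(widths, widths[1:])])
head = nn.Conv2d(64, 5, kernel_size=1)  # 5 classes, chosen only for the demo

x = torch.randn(1, 1024, 16, 16)        # bottleneck features
logits = head(decoder(x))
print(logits.shape)  # torch.Size([1, 5, 256, 256])
```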

Upsampling options

Method	How it works	Pros / cons
Bilinear / nearest upsample + conv	`F.interpolate` then 3×3 conv	Simple, smooth; common in modern U-Nets
Transposed convolution	Learned upsampling kernel	Flexible; can cause checkerboard artifacts when kernel size is not divisible by stride
Pixel shuffle	Channels → spatial rearrangement	Popular in super-resolution

Your project uses transposed conv in the starter U-Net — if masks look grid-like, switch to bilinear + conv.
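For reference, the three options from the table can be written as drop-in 2× upsamplers. This is a sketch with assumed channel counts (64 in, 32 out); all three produce the same output shape, so they are interchangeable inside a decoder block.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 16, 16)

# 1) Fixed bilinear upsample followed by a learned 3×3 conv
bilinear = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
    nn.Conv2d(64, 32, kernel_size=3, padding=1),
)

# 2) Transposed conv; kernel_size divisible by stride avoids uneven kernel overlap
transposed = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)

# 3) Pixel shuffle: conv produces 32 * 2² channels, then rearranges them into space
shuffle = nn.Sequential(
    nn.Conv2d(64, 32 * 4, kernel_size=3, padding=1),
    nn.PixelShuffle(upscale_factor=2),
)

options = [("bilinear+conv", bilinear), ("transposed", transposed), ("pixel shuffle", shuffle)]
for name, layer in options:
    print(name, tuple(layer(x).shape))  # each prints (1, 32, 32, 32)
```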

Final head

python

# logits shape: (batch, num_classes, H, W)
self.head = nn.Conv2d(base_channels, num_classes, kernel_size=1)

A 1×1 conv is a per-pixel linear classifier: at each (h, w) it maps channel vector → num_classes logits.
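The equivalence can be checked numerically. The sketch below copies the weights of a 1×1 conv into an `nn.Linear` and applies the linear layer to each pixel's channel vector; the sizes (64 channels, 5 classes) are demo values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(64, 5, kernel_size=1)   # 64 channels, 5 classes: demo values
linear = nn.Linear(64, 5)

# Copy the conv's weights into the linear layer: (5, 64, 1, 1) → (5, 64)
with torch.no_grad():
    linear.weight.copy_(conv.weight.flatten(1))
    linear.bias.copy_(conv.bias)

x = torch.randn(2, 64, 32, 32)
out_conv = conv(x)                                   # (2, 5, 32, 32)
# Move channels last so Linear acts on each pixel's channel vector, then move back
out_linear = linear(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

print(torch.allclose(out_conv, out_linear, atol=1e-5))
```

The two outputs should agree up to floating-point tolerance, because both compute the same weighted sum over channels at every (h, w).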

The bottleneck problem (motivation for U-Net)

If all fine detail must pass through the smallest layer (e.g. 16×16):

Object boundaries get blobby.
Thin structures (hair, spokes, fingers) disappear.
Small objects merge with background.

The decoder upsamples, but it only has coarse feature maps to work from — it must hallucinate sharp edges.
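A toy experiment shows the effect without any training. It is only a sketch: it uses plain max pooling and nearest upsampling instead of learned layers, so it isolates the information lost to resolution alone.

```python
import torch
import torch.nn.functional as F

# Binary 256×256 "image" with a single 1-pixel-wide vertical line
img = torch.zeros(1, 1, 256, 256)
img[..., 100] = 1.0

# Four 2×2 max pools (256 → 16), then upsample straight back to 256
coarse = img
for _ in range(4):
    coarse = F.max_pool2d(coarse, kernel_size=2)
restored = F.interpolate(coarse, size=(256, 256), mode="nearest")

print(coarse.shape[-2:])                 # torch.Size([16, 16])
print(int(img[0, 0, 0].sum()))           # 1  (line width in the input)
print(int(restored[0, 0, 0].sum()))      # 16 (line width after the round trip)
```

The 16×16 map still records that a line exists, but not which of the 16 columns in that cell it occupied. A learned decoder can guess better than nearest upsampling, yet it faces the same missing information.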

Next lesson: U-Net skip connections copy high-resolution encoder features directly to the decoder so borders do not pass only through the bottleneck.

Alignment: images and masks must stay married

Every spatial transform on the image must hit the mask identically:

Transform	Image	Mask
Resize 256×256	bilinear or bicubic	nearest neighbor (preserve class IDs)
Horizontal flip	yes	yes
Random crop	yes	same crop box
Color jitter	yes	no (mask has no color)

python

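import torch.nn.functional as F

# mask: integer class IDs, shape (N, H, W); interpolate needs float (N, 1, H, W)
m = mask[:, None].float()

# WRONG: bilinear blends neighboring class IDs into fractional labels
soft = F.interpolate(m, scale_factor=0.5, mode="bilinear", align_corners=False)

# RIGHT: nearest keeps integer classes
mask = F.interpolate(m, scale_factor=0.5, mode="nearest")[:, 0].long()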
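The flip and crop rows of the table can be handled by drawing the random parameters once and applying them to both tensors. `joint_augment` is a hypothetical helper, and the brightness scaling is a simple stand-in for color jitter; a real pipeline would likely use an augmentation library instead.

```python
import random
import torch

def joint_augment(image, mask, crop=128):
    """Apply one random flip and one random crop to both tensors (hypothetical helper).

    image: float tensor (3, H, W) in [0, 1]; mask: integer tensor (H, W).
    """
    # Draw the random parameters once, then reuse them for image and mask
    if random.random() < 0.5:
        image, mask = image.flip(-1), mask.flip(-1)   # horizontal flip
    _, h, w = image.shape
    top = random.randint(0, h - crop)
    left = random.randint(0, w - crop)
    image = image[:, top:top + crop, left:left + crop]
    mask = mask[top:top + crop, left:left + crop]
    # Photometric change on the image only (brightness scale stands in for color jitter)
    image = (image * random.uniform(0.8, 1.2)).clamp(0, 1)
    return image, mask

image = torch.rand(3, 256, 256)
mask = torch.randint(0, 3, (256, 256))
aug_image, aug_mask = joint_augment(image, mask)
print(aug_image.shape, aug_mask.shape, aug_mask.dtype)
```

Flipping and slicing never interpolate, so the mask keeps its integer dtype and class IDs without any special handling.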

Historical note (optional)

Early deep-learning semantic segmentation used FCN (Fully Convolutional Networks, 2015): take a classification CNN, replace the fully connected layers with convs, and upsample the output. FCN already fused predictions from a few intermediate layers; U-Net (same year, medical imaging) paired every encoder stage with a decoder stage and concatenated the encoder's feature maps across, which often works better on small datasets and sharp boundaries.

You do not need to implement FCN, but knowing that a segmentation network is convolutional throughout and then upsampled makes papers easier to read.

Encoder–decoder vs classifier — summary

	Image classifier	Segmentation encoder–decoder
End spatial size	1×1 (pooled)	H×W (same as input or target)
Output	One vector	Grid of logits
Loses pixel locations?	Yes, by design	Must not lose alignment
Typical loss	Cross-entropy, one term per image	Cross-entropy, averaged over pixels
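The last row can be sketched with `F.cross_entropy`, which accepts `(batch, C, H, W)` logits with `(batch, H, W)` integer targets. The tensor sizes and class count below are arbitrary demo values.

```python
import torch
import torch.nn.functional as F

num_classes = 3
logits = torch.randn(2, num_classes, 64, 64)          # (batch, C, H, W) from the 1×1 head
target = torch.randint(0, num_classes, (2, 64, 64))   # (batch, H, W) integer class IDs

# cross_entropy accepts the spatial layout directly and averages over all pixels
loss = F.cross_entropy(logits, target)

# Same value computed by hand: flatten pixels into one big batch of classifications
flat_logits = logits.permute(0, 2, 3, 1).reshape(-1, num_classes)
flat_loss = F.cross_entropy(flat_logits, target.reshape(-1))

print(torch.allclose(loss, flat_loss))
```

Seen this way, segmentation training is ordinary classification with one example per pixel, which is why the mask must hold integer class IDs.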

Checkpoint

1. Why does the encoder increase channels while shrinking H×W?
2. What does the decoder restore that the bottleneck alone lacks?
3. Why use nearest interpolation for mask resize?

Hints: (1) Trade spatial resolution for richer feature codes / larger receptive field. (2) Full-resolution spatial layout for per-pixel labels. (3) Class IDs must stay integers 0, 1, 2 — not blended floats.

What's next

Lesson 3 — U-Net architecture