
Module 5 — Image segmentation

Encoder–decoder & dense prediction

Spatial size trace through encoder/decoder, receptive field, upsampling choices, mask alignment rules, and bottleneck motivation.

~80 min read + exercises

Before we begin

In Lesson 1 you saw that segmentation outputs an H×W map of labels. A standard image classifier does the opposite of what we need at the end:

text
CNN → feature maps → global average pool → one vector → softmax → "cat"

That destroys spatial resolution on purpose — the network only needs one summary. For segmentation we must preserve location: the prediction at pixel (42, 17) must describe that pixel in the input.
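The collapse in a classifier head can be seen directly in the tensor shapes. This is a minimal sketch assuming PyTorch; the feature-map size (512 channels, 8×8) and the 10 classes are illustrative values, not taken from the course project.

```python
import torch
import torch.nn as nn

# Hypothetical feature maps from the last conv stage of a classifier.
features = torch.randn(1, 512, 8, 8)

# Global average pool: every 8x8 map becomes a single number.
pooled = nn.AdaptiveAvgPool2d(1)(features)   # shape (1, 512, 1, 1)
vector = pooled.flatten(1)                   # shape (1, 512)

# One linear layer plus softmax gives one distribution for the whole image.
classifier = nn.Linear(512, 10)              # 10 classes is an assumption
probs = classifier(vector).softmax(dim=1)    # shape (1, 10)

print(features.shape, "->", pooled.shape, "->", probs.shape)
```

After the pooling step there is no height or width axis left, so nothing downstream can say where in the image the evidence came from.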

The standard solution is an encoder–decoder network:

  1. Encoder — shrink spatial size, grow channels → “what is where, roughly?”
  2. Decoder — grow spatial size back → “label every pixel.”

Figure: Dense prediction goal. A classifier outputs 1 label (e.g. cat); segmentation outputs an H×W mask with a label per pixel.
Same spatial grid in and out: H×W×3 RGB → H×W×C class logits.

What you will learn

  • Trace spatial dimensions through encoder and decoder stages.
  • Explain receptive field and why downsampling helps context.
  • Compare upsampling methods used in decoders.
  • Apply joint augmentation rules for images and masks.
  • Articulate the bottleneck problem that U-Net solves next lesson.


Why not just use a big fully connected layer?

Module 3 flattened MNIST to 784 inputs. For a 256×256 RGB image:

text
256 × 256 × 3 = 196,608 input dimensions
→ predict 256 × 256 = 65,536 outputs

That fully connected layer would need 196,608 × 65,536 ≈ 12.9 billion weights, would ignore local structure, and would generalize poorly. Convs share filters across space, which is the right inductive bias for images.
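The arithmetic is easy to check. The sketch below counts the weights of the dense layer and, for contrast, of a single 3×3 conv layer; the 64 output channels are an illustrative choice and biases are ignored.

```python
height, width, channels = 256, 256, 3

# Dense layer: every input value connects to every output pixel.
inputs = height * width * channels        # 196,608
outputs = height * width                  # 65,536
fc_weights = inputs * outputs

# One 3x3 conv layer, 3 -> 64 channels (64 is an assumption).
conv_weights = 3 * 3 * channels * 64

print(f"fully connected: {fc_weights:,} weights")   # 12,884,901,888
print(f"3x3 conv:        {conv_weights:,} weights") # 1,728
```

The conv layer's weight count does not depend on the image size at all, because the same filters slide over every location.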

We still need a head that outputs per-pixel logits. An encoder–decoder provides one by keeping a spatial grid at every layer: the feature maps are never flattened into a single vector.


Encoder (contracting path)

Repeat blocks of:

text
Conv 3×3 → ReLU → (optional second conv) → MaxPool 2×2

Each pool halves height and width; the first conv in each block increases channel depth (feature richness).
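One such block might look like the sketch below, assuming PyTorch. The class name `EncoderBlock` is hypothetical, and `padding=1` is an assumption that keeps H×W unchanged through the convs so that only the pool changes the spatial size (the original U-Net paper used unpadded convs instead).

```python
import torch
import torch.nn as nn


class EncoderBlock(nn.Module):
    """Conv 3x3 -> ReLU -> Conv 3x3 -> ReLU -> MaxPool 2x2 (illustrative)."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Sequential(
            # First conv changes the channel count; padding=1 keeps H and W.
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            # Optional second conv keeps the channel count.
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool2d(kernel_size=2)  # halves H and W

    def forward(self, x):
        return self.pool(self.conv(x))


block = EncoderBlock(3, 64)
out = block(torch.randn(1, 3, 256, 256))
print(out.shape)  # torch.Size([1, 64, 128, 128])
```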

Worked example: 256×256 input

Assume input (batch, 3, 256, 256):

| Stage | After block | Spatial H×W | Channels (example) | What it tends to represent |
| --- | --- | --- | --- | --- |
| Input | | 256×256 | 3 | RGB |
| Enc 1 | conv, pool | 128×128 | 64 | edges, color blobs |
| Enc 2 | conv, pool | 64×64 | 128 | parts, textures |
| Enc 3 | conv, pool | 32×32 | 256 | object-level context |
| Enc 4 | conv, pool | 16×16 | 512 | scene layout |
| Bottleneck | conv | 16×16 | 1024 | “what” without fine “where” |
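The same trace can be reproduced with a few lines of plain Python. This sketch assumes size-preserving convs, so only the four 2×2 pools change H×W; the channel counts are the example values from the table.

```python
size, channels = 256, 3
trace = [("Input", size, channels)]

# Four encoder stages: each pool halves the spatial size.
for stage, out_channels in enumerate([64, 128, 256, 512], start=1):
    size //= 2
    channels = out_channels
    trace.append((f"Enc {stage}", size, channels))

# The bottleneck conv changes channels but not spatial size.
trace.append(("Bottleneck", size, 1024))

for name, s, c in trace:
    print(f"{name:<10} {s}x{s}  {c} channels")
```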

Receptive field: a neuron at 16×16 “sees” a large patch of the original image — good for knowing there is a dog somewhere in the frame. Bad for knowing exactly which pixel is the ear tip unless we recover resolution in the decoder.
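How large that patch is can be estimated with the standard receptive-field recurrence: each layer adds (kernel − 1) × jump input pixels, where jump is the product of all earlier strides. The sketch below assumes one 3×3 conv per encoder block followed by a 2×2 stride-2 pool, plus one 3×3 bottleneck conv; with two convs per block the number would be larger.

```python
def receptive_field(layers):
    """Return the receptive field, in input pixels, of a stack of layers.

    layers: list of (kernel_size, stride) tuples, in forward order.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # widen by the kernel's reach at this scale
        jump *= stride             # later layers step further in input pixels
    return rf


encoder_block = [(3, 1), (2, 2)]           # conv 3x3, then pool 2x2 stride 2
layers = encoder_block * 4 + [(3, 1)]      # four blocks plus bottleneck conv
print(receptive_field(layers))             # 78
```

Under these assumptions a single bottleneck neuron covers a 78×78 patch of the 256×256 input, while the grid it lives on has only 16×16 positions.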

Checkpoint: After two stride-2 pools from 256, what is H×W?

256 → 128 → 64. Spatial size 64×64.


Decoder (expanding path)

Mirror the encoder in reverse:

text
Upsample 2× → Conv blocks → (repeat) → 1×1 conv to num_classes

Goal: climb back from 16×16 to 256×256 (or your training resolution).
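A plain decoder without skip connections could be sketched as follows, assuming PyTorch. `DecoderBlock` is a hypothetical name, bilinear upsampling is one of several possible choices, and the channel widths are scaled down from the encoder example (128 instead of 1024 at the bottleneck) so the snippet runs quickly.

```python
import torch
import torch.nn as nn


class DecoderBlock(nn.Module):
    """Upsample 2x -> Conv 3x3 -> ReLU (illustrative)."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.conv(self.up(x))


num_classes = 5  # assumption for the example
decoder = nn.Sequential(
    DecoderBlock(128, 64),   # 16x16   -> 32x32
    DecoderBlock(64, 32),    # 32x32   -> 64x64
    DecoderBlock(32, 16),    # 64x64   -> 128x128
    DecoderBlock(16, 8),     # 128x128 -> 256x256
    nn.Conv2d(8, num_classes, kernel_size=1),  # per-pixel logits
)

with torch.no_grad():
    logits = decoder(torch.randn(1, 128, 16, 16))
print(logits.shape)  # torch.Size([1, 5, 256, 256])
```

Four 2× upsampling steps undo the four 2× pools of the encoder, which is why the two paths are usually built as mirror images.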

Upsampling options

| Method | How it works | Pros / cons |
| --- | --- | --- |
| Bilinear / nearest upsample + conv | F.interpolate, then a 3×3 conv | Simple, smooth; common in modern U-Nets |
| Transposed convolution | Learned upsampling kernel | Flexible; can cause checkerboard artifacts when the kernel size is not divisible by the stride (uneven overlap) |
| Pixel shuffle | Channels → spatial rearrangement | Popular in super-resolution |

Your project uses transposed conv in the starter U-Net — if masks look grid-like, switch to bilinear + conv.
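The three options can be compared side by side. The sketch below assumes PyTorch and uses illustrative channel counts (64 in, 32 out); it is not the starter U-Net's code. Each path doubles the spatial size from 16×16 to 32×32.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 64, 16, 16)

# 1) Fixed interpolation, then a learned 3x3 conv.
conv = nn.Conv2d(64, 32, kernel_size=3, padding=1)
up_interp = conv(F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False))

# 2) Transposed conv. kernel_size=2 with stride=2 is divisible, so output
#    pixels receive an even number of contributions.
tconv = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
up_tconv = tconv(x)

# 3) Pixel shuffle: produce 32 * 2^2 channels, then rearrange them into space.
pre_shuffle = nn.Conv2d(64, 32 * 4, kernel_size=3, padding=1)
up_shuffle = nn.PixelShuffle(upscale_factor=2)(pre_shuffle(x))

for name, t in [("interp+conv", up_interp), ("transposed", up_tconv), ("pixel shuffle", up_shuffle)]:
    print(f"{name:<14} {tuple(t.shape)}")
```

All three produce the same output shape, so they can usually be swapped inside a decoder block without touching the rest of the network.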

Final head

python
# logits shape: (batch, num_classes, H, W)
self.head = nn.Conv2d(base_channels, num_classes, kernel_size=1)

A 1×1 conv is a per-pixel linear classifier: at each (h, w) it maps channel vector → num_classes logits.
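That equivalence can be verified numerically. The sketch below, assuming PyTorch and illustrative sizes (64 channels, 3 classes), applies the head's weights by hand at one pixel and compares the result with the conv output at the same location.

```python
import torch
import torch.nn as nn

base_channels, num_classes = 64, 3   # illustrative values
head = nn.Conv2d(base_channels, num_classes, kernel_size=1)

features = torch.randn(2, base_channels, 8, 8)
logits = head(features)              # shape (2, 3, 8, 8)

# Apply the same weights as a plain linear map at pixel (h, w) of sample 0.
h, w = 5, 2
weight = head.weight.view(num_classes, base_channels)   # drop the 1x1 dims
manual = features[0, :, h, w] @ weight.T + head.bias

matches = torch.allclose(logits[0, :, h, w], manual, atol=1e-4)
print(matches)
```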


The bottleneck problem (motivation for U-Net)

If all fine detail must pass through the smallest layer (e.g. 16×16):

  • Object boundaries get blobby.
  • Thin structures (hair, spokes, fingers) disappear.
  • Small objects merge with background.

The decoder upsamples, but it only has coarse feature maps to work from — it must hallucinate sharp edges.
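A toy experiment makes the loss of detail concrete. The sketch below is only a stand-in for a real network: it squeezes a mask with a 1-pixel-wide line through a 16×16 grid using max pooling and then upsamples it with nearest interpolation, assuming PyTorch.

```python
import torch
import torch.nn.functional as F

# A 256x256 mask containing a vertical line that is one pixel wide.
mask = torch.zeros(1, 1, 256, 256)
mask[:, :, :, 100] = 1.0

# Squeeze through a 16x16 "bottleneck", then blow it back up.
coarse = F.max_pool2d(mask, kernel_size=16)                        # (1, 1, 16, 16)
restored = F.interpolate(coarse, scale_factor=16, mode="nearest")  # (1, 1, 256, 256)

line_width_before = int(mask[0, 0, 0].sum())
line_width_after = int(restored[0, 0, 0].sum())
print(line_width_before, "->", line_width_after)  # 1 -> 16
```

The coarse grid can only say which 16-pixel cell the line falls in, so the restored line is 16 pixels wide. A learned decoder can do better than nearest upsampling, but it faces the same missing information.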

Next lesson: U-Net skip connections copy high-resolution encoder features directly to the decoder so borders do not pass only through the bottleneck.


Alignment: images and masks must stay married

Every spatial transform on the image must hit the mask identically:

| Transform | Image | Mask |
| --- | --- | --- |
| Resize to 256×256 | bilinear or bicubic | nearest neighbor (preserve class IDs) |
| Horizontal flip | yes | yes |
| Random crop | yes | same crop box |
| Color jitter | yes | no (mask has no color) |

python
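import torch.nn.functional as F

# mask: integer class IDs with shape (batch, 1, H, W)
# WRONG: bilinear blends neighboring class IDs into soft fractional labels
# (F.interpolate defaults to mode="nearest", so the blend needs an explicit mode)
soft = F.interpolate(mask.float(), scale_factor=0.5, mode="bilinear", align_corners=False)  # don't

# RIGHT: nearest keeps integer classes; cast back to long for the loss
mask = F.interpolate(mask.float(), scale_factor=0.5, mode="nearest").long()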
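The flip and crop rows of the table can be sketched as one function that draws the random parameters once and applies them to both tensors. The function name `joint_flip_and_crop` and the tensor layouts (image as (3, H, W), mask as (H, W)) are assumptions for this example, written with plain PyTorch indexing.

```python
import random

import torch


def joint_flip_and_crop(image, mask, crop_size, rng=random):
    """Apply the same random horizontal flip and crop box to image and mask.

    image: float tensor (3, H, W); mask: integer tensor (H, W).
    """
    # One coin toss decides the flip for both tensors.
    if rng.random() < 0.5:
        image = torch.flip(image, dims=[-1])
        mask = torch.flip(mask, dims=[-1])

    # One crop box, sampled once, is used for both tensors.
    _, height, width = image.shape
    top = rng.randint(0, height - crop_size)
    left = rng.randint(0, width - crop_size)
    image = image[:, top:top + crop_size, left:left + crop_size]
    mask = mask[top:top + crop_size, left:left + crop_size]
    return image, mask


# Usage: build an image whose channels copy the mask, so alignment is checkable.
mask = torch.arange(64 * 64).reshape(64, 64)
image = mask.float().unsqueeze(0).repeat(3, 1, 1)
image_aug, mask_aug = joint_flip_and_crop(image, mask, crop_size=32)
print(torch.equal(image_aug[0], mask_aug.float()))  # True when alignment holds
```

Sampling the flip and the crop box separately for the image and the mask is the usual way this goes wrong, which is why both are drawn once inside a single function.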

Historical note (optional)

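Early deep-learning semantic segmentation used FCN (Fully Convolutional Networks, 2015): take a classification CNN, replace the fully connected layers with convs, and upsample the output. FCN already fused coarse predictions with finer encoder features through skip links (the FCN-16s and FCN-8s variants). U-Net (same year, medical imaging) extended the idea with a symmetric decoder that concatenates encoder feature maps at every resolution, which often works better on small datasets and sharp boundaries.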

You do not need to implement FCN, but remembering that segmentation = convs all the way down + upsampling makes the papers easier to read.


Encoder–decoder vs classifier — summary

| | Image classifier | Segmentation encoder–decoder |
| --- | --- | --- |
| End spatial size | 1×1 (pooled) | H×W (same as input or target) |
| Output | One vector | Grid of logits |
| Loses pixel locations? | Yes, by design | Must not lose alignment |
| Typical loss | Cross-entropy once | Cross-entropy per pixel |
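The last row of the table maps onto a single PyTorch call. In this sketch the sizes (2 images, 3 classes, 4×4 pixels) are illustrative; `F.cross_entropy` accepts logits of shape (batch, num_classes, H, W) with integer targets of shape (batch, H, W) and averages over every pixel by default.

```python
import torch
import torch.nn.functional as F

batch, num_classes, height, width = 2, 3, 4, 4
logits = torch.randn(batch, num_classes, height, width)
target = torch.randint(0, num_classes, (batch, height, width))  # int64 class IDs

# Per-pixel cross-entropy, averaged over batch * H * W pixels.
loss = F.cross_entropy(logits, target)

# The same loss before averaging: one value per pixel.
per_pixel = F.cross_entropy(logits, target, reduction="none")   # (2, 4, 4)

# Predicted label map: the highest-scoring class at each pixel.
pred = logits.argmax(dim=1)                                     # (2, 4, 4)
print(loss.item(), per_pixel.shape, pred.shape)
```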

Checkpoint

  1. Why does the encoder increase channels while shrinking H×W?
  2. What does the decoder restore that the bottleneck alone lacks?
  3. Why use nearest interpolation for mask resize?

Hints: (1) Trade spatial resolution for richer feature codes / larger receptive field. (2) Full-resolution spatial layout for per-pixel labels. (3) Class IDs must stay integers 0, 1, 2 — not blended floats.


What's next

Lesson 3 — U-Net architecture