
Module 5 — Segmentation & instance masks

Semantic segmentation & U-Net

Per-pixel classification, encoder–decoder architectures, skip connections, U-Net blocks, and when segmentation beats bounding boxes.

~80 min read + exercises

Before we begin

Semantic segmentation assigns a class label to every pixel — sky, road, person — without distinguishing two people as separate instances.


Learning objectives

  • Contrast segmentation vs classification vs detection.
  • Explain encoder–decoder and skip connections.
  • Walk through U-Net architecture.
  • Know when dense prediction needs full-resolution output.

Dense prediction

Input: $H \times W$ image. Output: $H \times W$ label map (or $H \times W \times C$ logits, one score per class per pixel).
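To make the output shapes concrete, the sketch below turns an array of per-pixel logits into a label map by taking the argmax over the class axis. The sizes and the random logits are made up purely for illustration.

```python
import numpy as np

H, W, C = 4, 6, 3  # toy sizes, chosen only for illustration

rng = np.random.default_rng(0)
logits = rng.normal(size=(H, W, C))  # one score per pixel per class

# The predicted label of each pixel is the class with the highest score.
label_map = logits.argmax(axis=-1)   # shape (H, W), integer values in [0, C)

print(logits.shape, label_map.shape)
```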

Encoder: downsamples (max pooling or strided convolution) → larger receptive field, smaller spatial size.
Decoder: upsamples (transposed convolution, or bilinear interpolation followed by a convolution) → recovers the input resolution.
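The shape changes on each side can be checked directly. This is a small PyTorch sketch with arbitrary channel counts and untrained layers; it only illustrates how each operation halves or doubles the spatial size.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 64, 128, 128)  # (batch, channels, height, width), toy sizes

# Encoder side: two common ways to halve the spatial size.
pooled = F.max_pool2d(x, kernel_size=2)
strided = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)(x)

# Decoder side: two common ways to double it again.
up_transpose = nn.ConvTranspose2d(64, 64, kernel_size=2, stride=2)(pooled)
up_bilinear = F.interpolate(pooled, scale_factor=2, mode="bilinear", align_corners=False)
up_bilinear = nn.Conv2d(64, 64, kernel_size=3, padding=1)(up_bilinear)

print(pooled.shape, strided.shape)          # spatial size 64x64
print(up_transpose.shape, up_bilinear.shape)  # spatial size 128x128
```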

Figure

Encoder–decoder with skips

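[Diagram: U-Net with the contracting path on the left (encoder, ↓), the expanding path on the right (decoder, ↑), and skip connections drawn dashed. Encoder stages: 256×256, 3 ch → 128×128, 64 ch → 64×64, 128 ch → 32×32, 256 ch → 16×16, 512 ch (bottleneck).]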
U-Net copies high-resolution encoder features into the decoder.

U-Net

Proposed by Ronneberger et al. (2015) for biomedical image segmentation; now a standard baseline for dense prediction in many domains.

  • Contracting path: repeated conv + pool.
  • Expanding path: upsample + concat skip + conv.
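  • Skips: carry high-resolution encoder features across to the decoder, so fine boundaries lost to downsampling (organ outlines, pet fur edges) can be recovered.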
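The three ingredients above fit into a few dozen lines. What follows is a minimal two-level sketch in PyTorch, not the original architecture: `TinyUNet`, `double_conv`, the channel widths, and the number of classes are hypothetical choices for illustration, and the convolutions are padded so the skip tensors line up without cropping (the original paper used unpadded convolutions and cropped the skips).

```python
import torch
import torch.nn as nn


def double_conv(in_ch: int, out_ch: int) -> nn.Sequential:
    """Two 3x3 convs with ReLU; padding=1 keeps the spatial size unchanged."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
    )


class TinyUNet(nn.Module):
    """Two-level U-Net sketch (hypothetical name and sizes, for illustration)."""

    def __init__(self, in_ch: int = 3, num_classes: int = 4, base: int = 16):
        super().__init__()
        # Contracting path: conv block, then 2x2 max pool.
        self.enc1 = double_conv(in_ch, base)
        self.enc2 = double_conv(base, base * 2)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = double_conv(base * 2, base * 4)
        # Expanding path: upsample, concatenate the skip, then conv block.
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, kernel_size=2, stride=2)
        self.dec2 = double_conv(base * 4, base * 2)  # input = upsampled + skip
        self.up1 = nn.ConvTranspose2d(base * 2, base, kernel_size=2, stride=2)
        self.dec1 = double_conv(base * 2, base)
        self.head = nn.Conv2d(base, num_classes, kernel_size=1)  # per-pixel logits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s1 = self.enc1(x)                   # full resolution
        s2 = self.enc2(self.pool(s1))       # 1/2 resolution
        b = self.bottleneck(self.pool(s2))  # 1/4 resolution
        d2 = self.dec2(torch.cat([self.up2(b), s2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), s1], dim=1))
        return self.head(d1)


model = TinyUNet()
logits = model(torch.randn(2, 3, 64, 64))
print(logits.shape)  # (batch, num_classes, H, W)
```

Concatenating along the channel axis (`dim=1`) is what distinguishes a U-Net skip from a residual connection, which adds the tensors instead. With two pooling steps, the input height and width should be divisible by 4 so that the upsampled tensors match the skips.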

Training

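Loss: per-pixel cross-entropy over the classes, averaged over pixels. If the dataset marks void (unlabeled) pixels, assign them an ignore index so they are excluded from the loss and its gradient.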
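In PyTorch this loss is a single call. The sketch below assumes the void label is 255, which is only one common convention; use whatever value your dataset defines. It also recomputes the loss by hand to show that ignored pixels are left out of the average.

```python
import torch
import torch.nn.functional as F

IGNORE = 255  # assumed void label; depends on the dataset

logits = torch.randn(2, 3, 4, 4)         # (batch, classes, H, W)
target = torch.randint(0, 3, (2, 4, 4))  # (batch, H, W), integer class ids
target[:, 0, :] = IGNORE                 # pretend the top row is unlabeled

loss = F.cross_entropy(logits, target, ignore_index=IGNORE)

# Same value by hand: mean of -log p(true class) over the valid pixels only.
valid = target != IGNORE
log_probs = F.log_softmax(logits, dim=1)
safe_target = target.masked_fill(~valid, 0)  # any in-range id; masked out below
picked = log_probs.gather(1, safe_target.unsqueeze(1)).squeeze(1)
manual = -picked[valid].mean()

print(loss.item(), manual.item())
```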

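Augmentation: flips, scales, color jitter. Geometric transforms (flips, scales, crops) must be applied identically to the image and its label map; photometric transforms (color jitter) change only the image and leave the labels untouched.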
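A minimal sketch of paired augmentation, assuming the image is a float array in [0, 1] and the mask is an integer array of class ids. The `augment` helper and its jitter range are hypothetical, chosen only to show which transforms touch the labels.

```python
import numpy as np


def augment(image, mask, rng):
    """Randomly flip image and mask together; jitter brightness on the image only.

    image: float array (H, W, 3) in [0, 1]; mask: int array (H, W) of class ids.
    """
    if rng.random() < 0.5:
        # Geometric transform: apply the same flip to both arrays.
        image = image[:, ::-1]
        mask = mask[:, ::-1]
    # Photometric transform: changes pixel values, so the labels stay as they are.
    gain = rng.uniform(0.8, 1.2)
    image = np.clip(image * gain, 0.0, 1.0)
    return image, mask


rng = np.random.default_rng(0)
image = rng.random((4, 6, 3))
mask = rng.integers(0, 3, size=(4, 6))
aug_image, aug_mask = augment(image, mask, rng)
print(aug_image.shape, aug_mask.shape)
```

When a transform resamples pixels (scaling, rotation), the mask should be resampled with nearest-neighbor interpolation so that class ids are never blended into values that do not correspond to any class.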


Beyond U-Net

FCN, DeepLab (atrous conv), SegFormer (transformer encoder) — same dense prediction goal, different encoders.
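As one illustration of how these designs differ, an atrous (dilated) convolution enlarges the receptive field without downsampling. The sketch below uses arbitrary channel counts and untrained layers; it only shows that the dilated layer keeps the spatial size and the parameter count of a standard 3×3 convolution.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 32, 32)  # toy feature map

# Standard 3x3 conv: each output sees a 3x3 input window.
standard = nn.Conv2d(8, 8, kernel_size=3, padding=1)
# Atrous 3x3 conv with dilation 2: same 9 weights, spread over a 5x5 window.
atrous = nn.Conv2d(8, 8, kernel_size=3, padding=2, dilation=2)

out_standard = standard(x)
out_atrous = atrous(x)
n_standard = sum(p.numel() for p in standard.parameters())
n_atrous = sum(p.numel() for p in atrous.parameters())

print(out_standard.shape, out_atrous.shape, n_standard, n_atrous)
```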


What's next

Lesson 2 — Instance segmentation & Mask R-CNN