Semantic segmentation & U-Net

Before we begin

Semantic segmentation assigns a class label (sky, road, person, ...) to every pixel. It does not separate instances: two people in the same image both receive the label "person".

Learning objectives

Contrast segmentation with classification and detection.
Explain encoder–decoder and skip connections.
Walk through U-Net architecture.
Know when dense prediction needs full-resolution output.

Dense prediction

Input: an $H \times W$ image. Output: an $H \times W$ label map, usually taken as the per-pixel argmax over $H \times W \times C$ logits, where $C$ is the number of classes.
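As a minimal sketch of the logits-to-label-map step, the snippet below takes toy $H \times W \times C$ logits (here $H = 2$, $W = 3$, $C = 3$ classes) and keeps the highest-scoring class at each pixel. The values and the helper names `argmax` and `logits_to_label_map` are made up for illustration; a real pipeline would do this with a tensor library.

```python
# Toy logits for a 2 x 3 image with C = 3 classes (illustrative values only).
logits = [
    [[2.0, 0.1, -1.0], [0.3, 1.5, 0.2], [0.0, 0.2, 3.1]],
    [[1.2, 1.1, 0.0], [-0.5, 0.4, 0.9], [0.7, 2.2, 0.1]],
]


def argmax(scores):
    """Return the index of the largest score in a list."""
    return max(range(len(scores)), key=lambda c: scores[c])


def logits_to_label_map(logits):
    """Collapse H x W x C logits into an H x W map of class indices."""
    return [[argmax(pixel) for pixel in row] for row in logits]


label_map = logits_to_label_map(logits)
print(label_map)  # [[0, 1, 2], [0, 2, 1]]
```

The output has the same height and width as the input, which is what makes the prediction "dense".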

Encoder downsamples (max pooling / strided conv) → larger receptive field, smaller spatial size.
Decoder upsamples (transposed conv / bilinear upsampling + conv) → recovers the input resolution.
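The effect on spatial size can be checked with the standard output-size formulas for pooling and transposed convolution. This sketch assumes a 2 x 2 kernel with stride 2 in both directions, no padding, and a 256-pixel input side; `down_size` and `up_size` are illustrative helper names.

```python
def down_size(n, kernel=2, stride=2, padding=0):
    """Output side length of a pooling or strided-conv layer."""
    return (n + 2 * padding - kernel) // stride + 1


def up_size(n, kernel=2, stride=2, padding=0):
    """Output side length of a transposed conv (no output padding, dilation 1)."""
    return (n - 1) * stride - 2 * padding + kernel


size = 256  # assumed input side length
encoder_sizes = [size]
for _ in range(4):  # four downsampling steps
    size = down_size(size)
    encoder_sizes.append(size)

decoder_sizes = [size]
for _ in range(4):  # four matching upsampling steps
    size = up_size(size)
    decoder_sizes.append(size)

print(encoder_sizes)  # [256, 128, 64, 32, 16]
print(decoder_sizes)  # [16, 32, 64, 128, 256]
```

Each downsampling step halves the side length, and each upsampling step doubles it, so four of each bring the output back to the input resolution.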

Figure: Encoder–decoder with skips. U-Net copies high-resolution encoder features into the decoder.

U-Net

Proposed by Ronneberger et al. (2015) for biomedical image segmentation; now widely used for dense prediction in many domains.

Contracting path: repeated conv + pool.
Expanding path: upsample + concat skip + conv.
Skips: carry spatial detail lost to pooling back into the decoder, preserving fine boundaries (organs, pet fur edges).
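The two paths can be traced as shape bookkeeping, without any real convolutions. This is a sketch under stated assumptions: padded convolutions (so each skip and its upsampled map have the same size and no cropping is needed), 64 base channels doubling at each level, four levels, and a hypothetical helper name `unet_shapes`. The original U-Net used unpadded convolutions and cropped the skips instead.

```python
def unet_shapes(height, width, base=64, depth=4, num_classes=2):
    """Trace (name, channels, height, width) through a U-Net-style network."""
    if height % 2**depth or width % 2**depth:
        raise ValueError("height and width must be divisible by 2**depth")

    trace, skips = [], []
    h, w, c = height, width, base

    # Contracting path: conv block, remember the skip, then pool by 2.
    for level in range(depth):
        trace.append((f"enc{level}", c, h, w))
        skips.append(c)
        h, w, c = h // 2, w // 2, c * 2

    trace.append(("bottleneck", c, h, w))

    # Expanding path: upsample by 2 (halving channels), concat skip, conv block.
    for level in reversed(range(depth)):
        h, w, c = h * 2, w * 2, c // 2
        trace.append((f"dec{level} concat", c + skips[level], h, w))
        trace.append((f"dec{level}", c, h, w))

    # Final 1 x 1 conv maps features to per-pixel class logits.
    trace.append(("head", num_classes, h, w))
    return trace


shapes = unet_shapes(256, 256)
for name, channels, h, w in shapes:
    print(f"{name:12s} {channels:5d} x {h} x {w}")
```

The concatenation doubles the channel count at each decoder level, which is why the conv block that follows it reduces the channels again.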

Training

Per-pixel cross-entropy over classes, averaged over pixels. If the dataset marks void (unlabeled) pixels, set an ignore index so they are excluded from the loss.
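A minimal sketch of this loss in plain Python, assuming void pixels carry the label 255 (a common convention, but check the dataset) and that the loss is averaged over the remaining pixels. The names `pixel_cross_entropy` and `segmentation_loss` are illustrative.

```python
import math

IGNORE_INDEX = 255  # assumed void label; depends on the dataset


def pixel_cross_entropy(logits, target):
    """Compute -log softmax(logits)[target] with a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def segmentation_loss(logits_map, label_map, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over all pixels whose label is not ignore_index."""
    total, count = 0.0, 0
    for logit_row, label_row in zip(logits_map, label_map):
        for logits, target in zip(logit_row, label_row):
            if target == ignore_index:
                continue  # void pixel: contributes nothing to the loss
            total += pixel_cross_entropy(logits, target)
            count += 1
    return total / count if count else 0.0


# 1 x 3 image, 2 classes; the last pixel is void.
logits_map = [[[0.0, 0.0], [2.0, 0.0], [5.0, -5.0]]]
label_map = [[0, 0, IGNORE_INDEX]]
loss = segmentation_loss(logits_map, label_map)
print(loss)
```

Because the third pixel is skipped, changing its logits leaves the loss unchanged.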

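Augmentation: flips, scales, color jitter. Geometric transforms (flips, scales) must be applied identically to the image and its label map, with nearest-neighbor interpolation for labels so class IDs are not blended. Photometric changes such as color jitter apply to the image only.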
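The sketch below shows one way to keep image and labels in sync: the same random horizontal flip is applied to both, while a brightness change (standing in for color jitter) touches only the image. It assumes a grayscale image stored as nested lists with values in [0, 1]; `hflip` and `augment` are illustrative names.

```python
import random


def hflip(grid):
    """Mirror a 2D grid left to right."""
    return [row[::-1] for row in grid]


def augment(image, labels, rng):
    """Apply one random flip to both inputs and a brightness gain to the image."""
    # Geometric transform: one random decision shared by image and labels.
    if rng.random() < 0.5:
        image, labels = hflip(image), hflip(labels)
    # Photometric transform: image only, class IDs are left untouched.
    gain = rng.uniform(0.8, 1.2)
    image = [[min(1.0, px * gain) for px in row] for row in image]
    return image, labels


image = [[0.1, 0.1, 0.9]]   # the bright pixel is the object
labels = [[0, 0, 1]]        # class 1 marks the same pixel
aug_image, aug_labels = augment(image, labels, random.Random(0))
print(aug_image, aug_labels)
```

Whether or not the flip fires, the brightest pixel and the class-1 label end up at the same position.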

Beyond U-Net

FCN, DeepLab (atrous, i.e. dilated, convolution), SegFormer (transformer encoder): same dense prediction goal, different encoders and decoders.
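To make the atrous idea concrete, here is a small 1D sketch: a dilated convolution spaces its kernel taps `dilation` samples apart, so a kernel of size $k$ covers $k + (k - 1)(d - 1)$ inputs without extra weights or pooling. The function names and the toy signal are illustrative, and real models apply this in 2D.

```python
def effective_kernel(kernel, dilation):
    """Number of input samples spanned by a dilated kernel."""
    return kernel + (kernel - 1) * (dilation - 1)


def dilated_conv1d(signal, weights, dilation=1):
    """1D cross-correlation without padding, taps spaced `dilation` apart."""
    span = effective_kernel(len(weights), dilation)
    return [
        sum(w * signal[start + i * dilation] for i, w in enumerate(weights))
        for start in range(len(signal) - span + 1)
    ]


signal = [1, 2, 3, 4, 5, 6, 7]
weights = [1, 0, -1]  # simple difference filter

print(effective_kernel(3, 1))                       # 3
print(effective_kernel(3, 2))                       # 5
print(dilated_conv1d(signal, weights, dilation=1))  # [-2, -2, -2, -2, -2]
print(dilated_conv1d(signal, weights, dilation=2))  # [-4, -4, -4]
```

With dilation 2 the same three weights compare samples four positions apart, which is how atrous convolution enlarges the receptive field while keeping the feature map resolution.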

What's next

Lesson 2 — Instance segmentation & Mask R-CNN