Semantic segmentation & U-Net
Before we begin
Semantic segmentation assigns a class label to every pixel (sky, road, person) without distinguishing instances: two people standing side by side both receive the same "person" label. Separating them is the job of instance segmentation.
Learning objectives
- Contrast segmentation vs classification vs detection.
- Explain encoder–decoder and skip connections.
- Walk through the U-Net architecture.
- Know when dense prediction needs full-resolution output.
Dense prediction
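Input: image (H × W). Output: label map (H × W class ids), or logits (one score per class per pixel, C × H × W) from which the label map follows by per-pixel argmax.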
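To make the relation between logits and a label map concrete, here is a minimal plain-Python sketch. The class names, the 2 x 3 image size, and the logit values are made up for illustration, and the helper `logits_to_label_map` is a hypothetical name; a real pipeline would do the same thing with a tensor library's argmax.

```python
# Hypothetical class names and logits, chosen only for illustration.
CLASSES = ["sky", "road", "person"]

# logits[c][y][x]: one score per class per pixel, for a 2 x 3 image.
logits = [
    [[2.0, 1.5, 0.1], [0.2, 0.1, 0.0]],  # sky
    [[0.5, 0.3, 0.2], [1.9, 2.2, 0.4]],  # road
    [[0.1, 0.2, 1.7], [0.3, 0.5, 2.5]],  # person
]


def logits_to_label_map(logits):
    """Per-pixel argmax over the class axis: (C, H, W) -> (H, W)."""
    num_classes = len(logits)
    height, width = len(logits[0]), len(logits[0][0])
    return [
        [max(range(num_classes), key=lambda c: logits[c][y][x])
         for x in range(width)]
        for y in range(height)
    ]


label_map = logits_to_label_map(logits)
named = [[CLASSES[c] for c in row] for row in label_map]
```

The output has the same height and width as the input and holds one class id per pixel, which is what "dense prediction" refers to.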
Encoder downsamples (max pool / strided conv) → larger receptive field, smaller spatial size.
Decoder upsamples (transposed conv / bilinear + conv) → recovers resolution.
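The effect on spatial size can be sketched on a single 4 x 4 feature map. This is an illustration under simplifying assumptions: 2 x 2 max pooling stands in for the encoder step, and nearest-neighbour upsampling stands in for the decoder step (a swapped-in, parameter-free substitute for the transposed convolution or bilinear + conv named above). The function names and values are hypothetical.

```python
def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 on an H x W grid (H and W even)."""
    return [
        [max(fmap[y][x], fmap[y][x + 1], fmap[y + 1][x], fmap[y + 1][x + 1])
         for x in range(0, len(fmap[0]), 2)]
        for y in range(0, len(fmap), 2)
    ]


def upsample_nearest_2x(fmap):
    """Nearest-neighbour upsampling by a factor of 2 in each dimension."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column
        out.append(wide)
        out.append(list(wide))                     # repeat each row
    return out


feature_map = [
    [1, 2, 0, 1],
    [3, 4, 1, 0],
    [0, 1, 5, 6],
    [2, 0, 7, 8],
]
pooled = max_pool_2x2(feature_map)      # 4 x 4 -> 2 x 2
restored = upsample_nearest_2x(pooled)  # 2 x 2 -> 4 x 4
```

The restored map has the original size but only four distinct values: the resolution comes back, the detail does not. That loss is what skip connections are meant to compensate for.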
Figure: Encoder–decoder with skip connections.
U-Net
Proposed for biomedical image segmentation (Ronneberger et al., 2015); now a standard baseline for dense prediction well beyond that domain.
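- Contracting path: repeated conv + pool; channels grow as spatial size shrinks.
- Expanding path: upsample, concatenate the matching encoder feature map (the skip) along the channel axis, then conv.
- Skips: carry high-resolution encoder features to the decoder, preserving fine boundaries (organ contours, pet fur edges) that pooling would otherwise blur.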
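The bookkeeping behind the two paths can be followed by tracing tensor shapes alone, without any weights. The sketch below is not the published architecture in full detail. It assumes 'same'-padded convolutions (the original paper uses unpadded ones, so its output is smaller than its input), channel doubling per level, a final 1x1 conv to class logits, and a hypothetical helper name `trace_unet_shapes` with illustrative defaults.

```python
def trace_unet_shapes(in_channels, height, width, num_classes,
                      base_channels=64, depth=4):
    """Return (stage name, (channels, height, width)) pairs for a U-Net-like net.

    Assumes 'same'-padded convs (conv keeps spatial size), 2x2 pooling,
    2x upsampling that halves the channel count, channel doubling per level.
    """
    if height % 2 ** depth or width % 2 ** depth:
        raise ValueError("height and width must be divisible by 2**depth")

    shapes = [("input", (in_channels, height, width))]
    skips = []
    c, h, w = base_channels, height, width

    # Contracting path: conv block, keep its output as a skip, then pool.
    for level in range(depth):
        shapes.append((f"enc{level} conv", (c, h, w)))
        skips.append((c, h, w))
        h, w = h // 2, w // 2
        shapes.append((f"enc{level} pool", (c, h, w)))
        c *= 2

    shapes.append(("bottleneck conv", (c, h, w)))

    # Expanding path: upsample, concatenate the matching skip, then conv.
    for level in reversed(range(depth)):
        c, h, w = c // 2, h * 2, w * 2
        skip_c, skip_h, skip_w = skips[level]
        assert (skip_h, skip_w) == (h, w)  # concat needs equal spatial size
        shapes.append((f"dec{level} concat", (c + skip_c, h, w)))
        shapes.append((f"dec{level} conv", (c, h, w)))

    # Final 1x1 conv maps features to one logit per class per pixel.
    shapes.append(("logits", (num_classes, h, w)))
    return shapes


for name, shape in trace_unet_shapes(3, 64, 64, num_classes=3):
    print(f"{name:16s} {shape}")
```

Each concat stage has twice the channels of the conv that follows it, because the upsampled decoder features and the encoder skip are stacked along the channel axis before being mixed.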
Training
Per-pixel cross-entropy over classes, averaged over labeled pixels. If the dataset marks void pixels with an ignore index, exclude those pixels from the loss.
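A minimal plain-Python sketch of this loss follows. The ignore value 255 is an assumption (use whatever the dataset defines), the function name `pixelwise_cross_entropy` is hypothetical, and the nested-list layout is chosen only to keep the example free of dependencies; frameworks provide an equivalent built-in loss.

```python
import math

IGNORE_INDEX = 255  # assumed void label; use whatever the dataset defines


def pixelwise_cross_entropy(logits, target, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over labelled pixels.

    logits: nested list indexed [class][row][col]
    target: nested list indexed [row][col], class ids or ignore_index
    """
    num_classes = len(logits)
    total, counted = 0.0, 0
    for y, row in enumerate(target):
        for x, label in enumerate(row):
            if label == ignore_index:
                continue  # void pixel: contributes nothing to the loss
            scores = [logits[c][y][x] for c in range(num_classes)]
            # Stable log-sum-exp: subtract the max before exponentiating.
            m = max(scores)
            log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
            total += log_norm - scores[label]  # -log softmax(scores)[label]
            counted += 1
    return total / counted if counted else 0.0


# Uniform scores over 3 classes on a 2 x 2 image, with one void pixel.
logits = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(3)]
target = [[0, 1], [IGNORE_INDEX, 2]]
loss = pixelwise_cross_entropy(logits, target)
```

With uniform logits every labelled pixel costs log(3), and the void pixel is excluded from both the sum and the count, so it cannot pull the average in either direction.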
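Augmentation: flips, scales, color jitter. Geometric transforms (flips, scales, crops) must be applied identically to the image and its label map, using nearest-neighbor interpolation for the labels so no new class ids are invented. Photometric transforms such as color jitter change only the image.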
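One way to keep an image and its label map aligned is to draw the random decision once and apply it to both. The sketch below shows this for a horizontal flip, next to a brightness shift that touches the image only because it does not move any pixel. Function names, the tiny 2 x 2 example, and the jitter range are hypothetical.

```python
import random


def random_hflip(image, mask, rng, p=0.5):
    """Flip image and mask together with probability p.

    image: [channel][row][col], mask: [row][col]
    """
    if rng.random() < p:  # one draw decides for both image and mask
        image = [[row[::-1] for row in channel] for channel in image]
        mask = [row[::-1] for row in mask]
    return image, mask


def brightness_jitter(image, rng, max_delta=0.1):
    """Photometric change: shifts pixel values and never touches the mask."""
    delta = rng.uniform(-max_delta, max_delta)
    return [[[v + delta for v in row] for row in channel] for channel in image]


rng = random.Random(0)  # seeded only to make the sketch repeatable
image = [[[0.1, 0.9], [0.2, 0.8]]]  # 1 channel, 2 x 2
mask = [[0, 1], [0, 1]]

flipped_image, flipped_mask = random_hflip(image, mask, rng, p=1.0)
jittered = brightness_jitter(flipped_image, rng)
```

Flipping the image without the mask would leave every label attached to the wrong pixel, which silently corrupts training rather than raising an error.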
Beyond U-Net
FCN, DeepLab (atrous, i.e. dilated, convolution), SegFormer (transformer encoder): the same dense prediction goal, with different encoder and decoder designs.
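As a rough illustration of the atrous idea, the 1-D sketch below spaces the kernel taps `dilation` samples apart, so the same three weights cover a wider span of the input without any pooling. This is a toy under stated assumptions (1-D, 'valid' boundaries, a hand-picked kernel, a hypothetical function name), not DeepLab's actual layer.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """'Valid' 1-D cross-correlation with gaps of `dilation` between taps."""
    span = (len(kernel) - 1) * dilation + 1  # input samples covered per output
    return [
        sum(k * signal[i + j * dilation] for j, k in enumerate(kernel))
        for i in range(len(signal) - span + 1)
    ]


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]

dense = dilated_conv1d(signal, kernel, dilation=1)   # each output sees 3 samples
atrous = dilated_conv1d(signal, kernel, dilation=2)  # each output sees 5 samples
```

With dilation 2 the span grows from 3 to 5 samples while the number of weights stays at 3, which is how atrous convolution enlarges the receptive field without reducing resolution.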