
Module 5 — Image segmentation

Beyond U-Net — FCN, DeepLab, SegFormer

FCN dense prediction, atrous conv and ASPP, pyramid pooling, transformer decoders, and when to pick each family.

~90 min read + exercises

Beyond U-Net — other segmentation models

Before we begin

U-Net is the right first model to implement — small data, clear skips, trains in an afternoon. Production and research stacks rarely stop there. Street-scene parsers use DeepLab; mobile apps use lightweight encoders; papers benchmark SegFormer and Mask2Former; instance tasks use Mask R-CNN.

This lesson maps the landscape: what each family optimizes for, how they relate to encoder–decoder ideas you already know, and when to reach past U-Net.

Figure: Segmentation model families

Semantic segmentation families, all producing dense H×W class maps: FCN (conv-only dense), U-Net (skip encoder–decoder), DeepLab (atrous + ASPP), SegFormer (transformer + MLP decoder). Instance / panoptic models (Mask R-CNN, Mask2Former) are covered in a separate lesson.
All build on dense prediction — they differ in context, scale, and output type.

What you will learn

  • Place FCN, U-Net, DeepLab, PSPNet, and SegFormer on one timeline.
  • Explain atrous (dilated) convolution and ASPP in plain language.
  • Choose a model family for a scenario (medical, driving, mobile, instances).
  • Know what pretrained segmentation heads buy you in practice.


FCN — fully convolutional networks (2015)

FCN tackles the same problem U-Net does: classification CNNs end in fully connected layers that collapse the image into a single vector. FCN replaces those FC layers with convolutions so the network outputs a spatial map of class scores, then upsamples the coarse predictions to the input size.

| Idea | Detail |
| --- | --- |
| Skip connections | FCN added skips from shallow layers (similar in spirit to U-Net) |
| Coarse heatmaps | Early versions upsampled low-res class scores, giving blobby borders |
| Historical role | Proved end-to-end trainable per-pixel labeling on Pascal VOC |

Takeaway: FCN = “make classification CNNs output grids.” U-Net = FCN-style idea + symmetric decoder + stronger skips for sharper masks on small data.
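
The "replace FC layers with convs" step can be sketched in plain Python. This toy example applies one small classifier head at every spatial location, which is what a 1×1 convolution does. All names, weights, and feature values are hypothetical and chosen only for illustration; the upsampling stage is left out.

```python
def fc(weights, bias, x):
    """Fully connected layer: weights is [out][in], x is a vector of length in."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def conv1x1(weights, bias, fmap):
    """1x1 convolution: fmap is [H][W][in]; the same FC weights run at every pixel."""
    return [[fc(weights, bias, pixel) for pixel in row] for row in fmap]


# Hypothetical classifier head: 3 input channels -> 2 classes.
W = [[1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5]]
b = [0.0, 1.0]

# Toy 2x2 feature map with 3 channels per location.
fmap = [[[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]],
        [[2.0, 2.0, 2.0], [4.0, 1.0, 0.0]]]

# Class scores per location: shape [2][2][2] instead of one vector per image.
scores = conv1x1(W, b, fmap)

# Per-location argmax turns the score grid into a coarse label map.
label_map = [[max(range(len(px)), key=px.__getitem__) for px in row] for row in scores]
```

The weights are shared across locations, so the same head works on any input size; that is the property that lets a classification network produce a grid.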


U-Net family (recap + extensions)

You built the baseline in the project. Common extensions:

| Variant | What it adds |
| --- | --- |
| U-Net++ | Nested skip pathways: features fuse at multiple scales |
| Attention U-Net | Gating on skip connections: suppresses irrelevant encoder features |
| ResNet / EfficientNet encoder | Replaces the vanilla conv stack with an ImageNet-pretrained backbone |

When U-Net is enough: limited labels (hundreds–few thousand images), binary or few-class semantic masks, teaching and prototyping.

When to upgrade: need SOTA on Cityscapes / ADE20K, very large objects + fine boundaries at once, or production latency targets.


DeepLab (Google) — context at multiple scales

Core problem: one receptive field size cannot capture both small objects and wide context (road + sky + distant cars).

Atrous (dilated) convolution

Standard 3×3 conv on a downsampled feature map “sees” a small image region. Dilated conv inserts gaps between kernel weights — same spatial resolution, larger effective field without extra pooling.

```text
Normal 3×3:           sees 3×3 patch
Dilated 3×3 (rate=2): sees 5×5 patch, still H×W feature map size
```
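
The 3×3 versus 5×5 numbers follow from the effective kernel size k + (k − 1)(rate − 1). Below is a minimal 1D sketch, assuming zero padding and stride 1; the function names are illustrative, not a library API.

```python
def effective_kernel(k, rate):
    """Span covered by a k-tap kernel whose taps are `rate` samples apart."""
    return k + (k - 1) * (rate - 1)


def dilated_conv1d(signal, kernel, rate):
    """'Same'-size 1D dilated convolution (cross-correlation) with zero padding."""
    k = len(kernel)
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + (j - k // 2) * rate  # taps are spaced `rate` samples apart
            if 0 <= idx < len(signal):     # out-of-range taps read zero padding
                acc += w * signal[idx]
        out.append(acc)
    return out


impulse = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
normal = dilated_conv1d(impulse, [1.0, 1.0, 1.0], rate=1)
dilated = dilated_conv1d(impulse, [1.0, 1.0, 1.0], rate=2)
```

With rate 1 the impulse influences three adjacent outputs; with rate 2 it influences outputs two samples apart across a span of five. In both cases the output has the same length as the input, and the kernel still has three weights.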

ASPP — Atrous Spatial Pyramid Pooling

Run parallel branches at different dilation rates (and often global average pooling), then concatenate — multi-scale context in one layer.
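
As a rough illustration of the parallel-branch idea, here is a 1D toy version in plain Python. It is a sketch under simplifying assumptions: one input channel, a shared fixed kernel, and hypothetical rates (1, 2, 4). Real ASPP uses learned 2D convolutions with many channels and typically a 1×1 convolution after the concatenation.

```python
def dilated_conv1d(signal, kernel, rate):
    """'Same'-size 1D dilated convolution (cross-correlation) with zero padding."""
    k = len(kernel)
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + (j - k // 2) * rate
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out


def aspp1d(signal, kernel, rates):
    """Toy ASPP: one dilated branch per rate plus a global-average branch."""
    branches = [dilated_conv1d(signal, kernel, r) for r in rates]
    # Global average pooling branch: one value broadcast to every position.
    global_avg = sum(signal) / len(signal)
    branches.append([global_avg] * len(signal))
    # "Concatenate along channels": one feature vector per position.
    return [list(feats) for feats in zip(*branches)]


impulse = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
features = aspp1d(impulse, [1.0, 1.0, 1.0], rates=(1, 2, 4))
```

Position 0 sees nothing of the impulse through the rate-1 and rate-2 branches, but the rate-4 branch and the global branch both respond. That is the multi-scale context the concatenated features carry.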

DeepLabv3+ also uses an encoder–decoder structure: a strong encoder (often ResNet or Xception) plus a lightweight decoder that refines borders.

| Strength | Tradeoff |
| --- | --- |
| Excellent on street scenes (Cityscapes) | Heavier than plain U-Net |
| Strong benchmarks with pretrained backbones | More hyperparameters (dilation rates, output stride) |
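
Output stride is the ratio of input resolution to encoder feature resolution. The sketch below shows one common way to reach a target output stride: once the cumulative stride would exceed the target, a stage keeps stride 1 and multiplies its dilation instead. The helper name and the ResNet-like stride list are assumptions for illustration, not DeepLab's actual configuration code.

```python
def plan_stages(stage_strides, output_stride):
    """Return (stride, dilation) per stage without exceeding output_stride."""
    total_stride = 1
    dilation = 1
    plan = []
    for stride in stage_strides:
        if total_stride * stride > output_stride:
            # Too much downsampling: trade this stage's stride for dilation.
            dilation *= stride
            plan.append((1, dilation))
        else:
            total_stride *= stride
            plan.append((stride, dilation))
    return plan


# Assumed ResNet-like layout: stem conv, max-pool, then four stages.
resnet_like = [2, 2, 1, 2, 2, 2]  # plain backbone: overall stride 32

plan_os16 = plan_stages(resnet_like, 16)
plan_os8 = plan_stages(resnet_like, 8)

# Feature-map side length for a 512x512 input at each output stride.
feature_side_os16 = 512 // 16
feature_side_os8 = 512 // 8
```

A smaller output stride keeps a larger feature map (64×64 instead of 32×32 here), which helps boundaries but costs more memory and compute in the later stages.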

Checkpoint: Why is atrous conv preferable to another max-pool for “seeing more context”?

Pooling throws away resolution; dilation keeps H×W while expanding receptive field.
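
The answer can be checked with a little receptive-field arithmetic. This sketch uses the standard recursion for stacked layers; the helper name and the two toy stacks are illustrative assumptions.

```python
def receptive_field(layers):
    """layers: list of (kernel, stride, dilation). Returns (receptive field, total stride)."""
    rf, jump = 1, 1
    for kernel, stride, dilation in layers:
        k_eff = kernel + (kernel - 1) * (dilation - 1)
        rf += (k_eff - 1) * jump  # each extra tap reaches `jump` input pixels further
        jump *= stride            # distance between neighboring outputs, in input pixels
    return rf, jump


# Option A: conv 3x3 -> 2x2 max-pool (stride 2) -> conv 3x3
pooled = receptive_field([(3, 1, 1), (2, 2, 1), (3, 1, 1)])

# Option B: conv 3x3 -> conv 3x3 with dilation 2, no pooling
dilated = receptive_field([(3, 1, 1), (3, 1, 2)])
```

Both stacks reach a similar receptive field (8 versus 7 input pixels per side), but the pooled stack ends at half resolution while the dilated stack keeps the full H×W grid.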


PSPNet — pyramid pooling module

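PSPNet (Pyramid Scene Parsing Network) average-pools the feature map at several grid scales (1×1, 2×2, 3×3, 6×6), upsamples each pooled map back to the feature resolution, and concatenates the results with the original features. It is another multi-scale context trick, like ASPP but pooling-based.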
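
A minimal single-channel sketch of the pyramid pooling idea, in plain Python. It assumes average pooling and nearest-neighbor upsampling for simplicity, and it omits the per-branch convolutions a real module would include. Function names are hypothetical.

```python
def adaptive_avg_pool(fmap, bins):
    """Average-pool a 2D map (list of rows) into a bins x bins grid."""
    h, w = len(fmap), len(fmap[0])
    pooled = []
    for bi in range(bins):
        r0 = (bi * h) // bins             # floor of the bin's start row
        r1 = -(-((bi + 1) * h) // bins)   # ceil of the bin's end row
        row = []
        for bj in range(bins):
            c0 = (bj * w) // bins
            c1 = -(-((bj + 1) * w) // bins)
            cells = [fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


def upsample_nearest(grid, h, w):
    """Stretch a bins x bins grid back to h x w by repeating values."""
    bins = len(grid)
    return [[grid[r * bins // h][c * bins // w] for c in range(w)] for r in range(h)]


def pyramid_pool(fmap, bin_sizes=(1, 2, 3, 6)):
    """Stack the original map with one upsampled pooled map per bin size."""
    h, w = len(fmap), len(fmap[0])
    channels = [fmap] + [upsample_nearest(adaptive_avg_pool(fmap, b), h, w)
                         for b in bin_sizes]
    return [[[ch[r][c] for ch in channels] for c in range(w)] for r in range(h)]


# Toy 6x6 feature map whose values increase row by row.
fmap = [[float(r * 6 + c) for c in range(6)] for r in range(6)]
features = pyramid_pool(fmap)
```

Each pixel ends up with its own value plus averages over progressively smaller neighborhoods, from the whole map (1×1 bin) down to itself (6×6 bins on a 6×6 map).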

Often compared head-to-head with DeepLab on scene parsing benchmarks. Conceptually: “look at the scene globally and locally before labeling each pixel.”


SegFormer & transformer encoders

SegFormer (and similar) swap the CNN encoder for a hierarchical transformer (mixing local + global attention), with a lightweight MLP decoder.

| | CNN U-Net / DeepLab | SegFormer-style |
| --- | --- | --- |
| Inductive bias | Locality via conv | Attention: flexible long-range context |
| Data | Works with modest data + pretrained CNN | Benefits from scale; usually starts from a pretrained transformer backbone (MiT for SegFormer) |
| Speed | Mature mobile optimizations | Heavier at full resolution |

You do not need transformer math here — only that modern leaderboards often use attention encoders + simple decoders for semantic segmentation.
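
Some shape bookkeeping still helps when reading SegFormer-style diagrams. The sketch below assumes a four-stage hierarchical encoder at strides 4, 8, 16, and 32 with a decoder that fuses everything at the stride-4 resolution. The function name is hypothetical, and the pair count assumes naive full self-attention, which practical encoders reduce.

```python
def hierarchical_shapes(height, width, strides=(4, 8, 16, 32)):
    """Feature-map sizes and token counts for a hierarchical encoder."""
    stages = [(height // s, width // s) for s in strides]
    tokens = [h * w for h, w in stages]
    decoder_resolution = stages[0]  # all stages are upsampled to the finest one
    return stages, tokens, decoder_resolution


stages, tokens, decoder_resolution = hierarchical_shapes(512, 512)

# Naive full self-attention compares every token with every other token,
# so its cost grows with the square of the token count.
tokens_1024 = hierarchical_shapes(1024, 1024)[1]
pair_growth = (tokens_1024[0] ** 2) // (tokens[0] ** 2)
```

Doubling the input side quadruples the token count at each stage and multiplies the naive attention pair count by 16, which is the intuition behind "heavier at full resolution."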


Pretrained backbones and libraries

In practice few teams train from random init. Typical pattern:

```text
ImageNet-pretrained encoder (ResNet, EfficientNet, MiT, …)
  → segmentation head (U-Net decoder, ASPP, MLP head)
  → fine-tune on your masks
```

Libraries like segmentation_models.pytorch (imported as segmentation_models_pytorch) expose Unet, DeepLabV3, and FPN as one-line model constructors, which is useful once you have trained U-Net from scratch in the project.


Decision guide — which model when?

| Scenario | Reasonable starting point |
| --- | --- |
| Course project / <5k masks | U-Net from scratch |
| Medical 2D slices, small data | U-Net or Attention U-Net |
| Driving / street scenes | DeepLabv3+ or SegFormer + pretrained |
| Mobile portrait mask | Small encoder + U-Net decoder; INT8 deploy |
| Separate mask per person | Instance path, covered in the next lesson (Mask R-CNN) |
| Need quick baseline on custom data | Pretrained DeepLab / FPN fine-tune |

What you are not expected to implement here

  • Full DeepLab with all dilation ablations
  • Transformer encoder from scratch
  • Panoptic multi-task training

You are expected to recognize names, compare design goals, and justify building U-Net first, then fine-tuning a pretrained DeepLab as a stretch goal.


Checkpoint

  1. What does atrous convolution enlarge without downsampling further?
  2. How is ASPP similar in purpose to PSPNet’s pyramid?
  3. Why is FCN historically important if U-Net is more common in medical courses?

What's next

Lesson 5 — Instance segmentation & Mask R-CNN