Beyond U-Net — other segmentation models
Before we begin
U-Net is the right first model to implement: it copes with small datasets, its skip connections are easy to follow, and it trains in an afternoon. Production and research stacks rarely stop there. Street-scene parsers often use DeepLab; mobile apps use lightweight encoders; papers benchmark SegFormer and Mask2Former; instance tasks use Mask R-CNN.
This lesson maps the landscape: what each family optimizes for, how they relate to encoder–decoder ideas you already know, and when to reach past U-Net.
Figure: Segmentation model families
What you will learn
- Place FCN, U-Net, DeepLab, PSPNet, and SegFormer on one timeline.
- Explain atrous (dilated) convolution and ASPP in plain language.
- Choose a model family for a scenario (medical, driving, mobile, instances).
- Know what pretrained segmentation heads buy you in practice.
FCN — fully convolutional networks (2015)
The problem FCN tackles (and U-Net after it): classification CNNs end with fully connected layers, which collapse the image into one vector. FCN replaces the FC layers with convolutions so the network outputs a spatial map of class scores, then upsamples those coarse predictions to the input size.
| Idea | Detail |
|---|---|
| Skip connections | FCN added skips from shallow layers (similar spirit to U-Net) |
| Coarse heatmaps | Early versions upsampled low-res class scores — blobby borders |
| Historical role | Proved end-to-end trainable per-pixel labels on Pascal VOC |
Takeaway: FCN = “make classification CNNs output grids.” U-Net = FCN-style idea + symmetric decoder + stronger skips for sharper masks on small data.
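The "classification CNN that outputs a grid" idea can be sketched in a few lines of PyTorch. This is a toy illustration, not the original FCN architecture: the `TinyFCN` name, the layer widths, and the single bilinear upsampling step are assumptions chosen for brevity, and the skip fusion mentioned in the table is left out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyFCN(nn.Module):
    """Toy fully convolutional net: conv encoder, 1x1 conv classifier, upsample."""

    def __init__(self, in_channels: int = 3, num_classes: int = 21):
        super().__init__()
        # Two stride-2 stages: the feature map ends up at 1/4 of the input size.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        # A 1x1 conv plays the role of the old fully connected classifier,
        # but it is applied at every spatial location.
        self.classifier = nn.Conv2d(32, num_classes, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        height, width = x.shape[-2:]
        coarse_scores = self.classifier(self.encoder(x))  # (N, C, H/4, W/4)
        # Upsample the coarse class scores back to the input resolution.
        return F.interpolate(coarse_scores, size=(height, width),
                             mode="bilinear", align_corners=False)


model = TinyFCN()
logits = model(torch.randn(1, 3, 64, 96))
print(logits.shape)  # per-pixel class scores with the same H x W as the input
```

Because there is no fully connected layer, the same weights accept any input size. The price is that the scores are computed at 1/4 resolution and only interpolated afterwards, which is where the blobby borders come from.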
U-Net family (recap + extensions)
You built the baseline in the project. Common extensions:
| Variant | What it adds |
|---|---|
| U-Net++ | Nested skip pathways — features fuse at multiple scales |
| Attention U-Net | Gating on skip connections — suppress irrelevant encoder features |
| ResNet / EfficientNet encoder | Replace vanilla conv stack with ImageNet-pretrained backbone |
When U-Net is enough: limited labels (hundreds–few thousand images), binary or few-class semantic masks, teaching and prototyping.
When to upgrade: need SOTA on Cityscapes / ADE20K, very large objects + fine boundaries at once, or production latency targets.
DeepLab (Google) — context at multiple scales
Core problem: one receptive field size cannot capture both small objects and wide context (road + sky + distant cars).
Atrous (dilated) convolution
Standard 3×3 conv on a downsampled feature map “sees” a small image region. Dilated conv inserts gaps between kernel weights — same spatial resolution, larger effective field without extra pooling.
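The arithmetic behind "larger field, same resolution" is short enough to check by hand. The sketch below uses two hypothetical helper names (`effective_kernel_size`, `same_padding`) and assumes stride 1 and an odd kernel size.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span, in input cells, covered by one dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def same_padding(kernel_size: int, dilation: int) -> int:
    """Padding that keeps H x W unchanged for a stride-1, odd-sized kernel."""
    return dilation * (kernel_size - 1) // 2


for rate in (1, 2, 4):
    span = effective_kernel_size(3, rate)
    pad = same_padding(3, rate)
    print(f"rate={rate}: 3x3 kernel spans {span}x{span}, padding={pad}")
```

The kernel still has nine weights at every rate; only the spacing between them changes, so the parameter count stays flat while the covered span grows.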
Normal 3×3: sees 3×3 patch
Dilated 3×3 (rate=2): sees 5×5 patch — still H×W feature map size
ASPP — Atrous Spatial Pyramid Pooling
Run parallel branches at different dilation rates (and often global average pooling), then concatenate — multi-scale context in one layer.
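A minimal PyTorch sketch of that layout follows. The class name `SimpleASPP`, the channel counts, and the rates (6, 12, 18) are assumptions for illustration; normalization layers are omitted to keep the block short, so this is not a drop-in copy of any DeepLab release.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleASPP(nn.Module):
    """Parallel dilated branches plus a global-pooling branch, then fuse."""

    def __init__(self, in_channels: int, out_channels: int, rates=(6, 12, 18)):
        super().__init__()
        # One 1x1 branch plus one 3x3 branch per dilation rate.
        # padding == dilation keeps H x W unchanged for a 3x3 kernel.
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_channels, out_channels, kernel_size=1)]
            + [nn.Conv2d(in_channels, out_channels, kernel_size=3,
                         padding=rate, dilation=rate) for rate in rates]
        )
        # Image-level context: squeeze to 1x1, project, upsample in forward().
        self.global_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_channels, out_channels, kernel_size=1),
        )
        num_branches = len(self.branches) + 1
        self.project = nn.Conv2d(num_branches * out_channels, out_channels,
                                 kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        size = x.shape[-2:]
        outputs = [F.relu(branch(x)) for branch in self.branches]
        pooled = F.relu(self.global_branch(x))
        outputs.append(F.interpolate(pooled, size=size, mode="bilinear",
                                     align_corners=False))
        # Concatenate along channels, then mix the scales with a 1x1 conv.
        return F.relu(self.project(torch.cat(outputs, dim=1)))


aspp = SimpleASPP(in_channels=64, out_channels=32)
features = torch.randn(2, 64, 33, 33)
context = aspp(features)
print(context.shape)  # spatial size matches the input feature map
```

Every branch sees the same feature map and returns the same H×W, which is what makes plain channel concatenation possible.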
DeepLabv3+ also uses an encoder–decoder structure: strong encoder (often ResNet or Xception) + lightweight decoder refines borders.
| Strength | Tradeoff |
|---|---|
| Excellent on street scenes (Cityscapes) | Heavier than plain U-Net |
| Strong benchmarks with pretrained backbones | More hyperparameters (dilation rates, output stride) |
Checkpoint: Why is atrous conv preferable to another max-pool for “seeing more context”?
Pooling throws away resolution; dilation keeps H×W while expanding receptive field.
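The checkpoint answer can be confirmed with a quick shape check. The tensor sizes here are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

features = torch.randn(1, 8, 32, 32)

# Max-pooling widens context for later layers by halving the grid.
pool = nn.MaxPool2d(kernel_size=2)
# A dilated 3x3 conv widens context while keeping the grid (padding == dilation).
dilated = nn.Conv2d(8, 8, kernel_size=3, padding=2, dilation=2)

pooled_shape = tuple(pool(features).shape)
dilated_shape = tuple(dilated(features).shape)
print("after max-pool:    ", pooled_shape)
print("after dilated conv:", dilated_shape)
```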
PSPNet — pyramid pooling module
PSPNet (Pyramid Scene Parsing) applies pooling at several grid scales (1×1, 2×2, 3×3, 6×6), upsamples and concatenates — another multi-scale context trick, like ASPP but pooling-based.
Often compared head-to-head with DeepLab on scene parsing benchmarks. Conceptually: “look at the scene globally and locally before labeling each pixel.”
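The pooling-based pyramid can be sketched as below. The `PyramidPooling` name and the choice to give each branch `in_channels // len(bins)` channels are assumptions for this example; normalization is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PyramidPooling(nn.Module):
    """Pool to several grid sizes, project, upsample, and concatenate."""

    def __init__(self, in_channels: int, bins=(1, 2, 3, 6)):
        super().__init__()
        branch_channels = in_channels // len(bins)
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(bin_size),  # bin_size x bin_size summary
                nn.Conv2d(in_channels, branch_channels, kernel_size=1),
                nn.ReLU(inplace=True),
            )
            for bin_size in bins
        ])
        self.out_channels = in_channels + branch_channels * len(bins)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        size = x.shape[-2:]
        pyramid = [x]  # keep the original local features
        for stage in self.stages:
            pooled = stage(x)
            pyramid.append(F.interpolate(pooled, size=size, mode="bilinear",
                                         align_corners=False))
        return torch.cat(pyramid, dim=1)


ppm = PyramidPooling(in_channels=64)
scene_features = ppm(torch.randn(1, 64, 24, 24))
print(scene_features.shape, ppm.out_channels)
```

The 1×1 bin is a whole-image summary and the 6×6 bin a coarse regional one, so each pixel ends up carrying both its own features and several views of the scene around it.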
SegFormer & transformer encoders
SegFormer (and similar models) swap the CNN encoder for a hierarchical transformer that produces multi-scale features and mixes local and global context, paired with a lightweight MLP decoder.
| | CNN U-Net / DeepLab | SegFormer-style |
|---|---|---|
| Inductive bias | Locality via conv | Attention — flexible long range |
| Data | Works with modest data + pretrained CNN | Benefits from scale; usually starts from a pretrained transformer encoder (e.g. MiT) |
| Speed | Mature mobile optimizations | Heavier at full resolution |
You do not need transformer math here — only that modern leaderboards often use attention encoders + simple decoders for semantic segmentation.
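To make "simple decoder" concrete, here is a sketch of an MLP-style head that consumes four feature maps from a hierarchical encoder. The encoder is replaced by random stand-in tensors, and the `MLPDecoderSketch` name, channel widths, embedding size, and class count are illustrative assumptions rather than values taken from a specific release.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MLPDecoderSketch(nn.Module):
    """Project each stage to a common width, upsample, concatenate, classify."""

    def __init__(self, stage_channels=(32, 64, 160, 256), embed_dim=128,
                 num_classes=19):
        super().__init__()
        # A 1x1 conv is a linear layer applied independently at each pixel.
        self.project = nn.ModuleList(
            [nn.Conv2d(channels, embed_dim, kernel_size=1)
             for channels in stage_channels]
        )
        self.fuse = nn.Conv2d(embed_dim * len(stage_channels), embed_dim,
                              kernel_size=1)
        self.classify = nn.Conv2d(embed_dim, num_classes, kernel_size=1)

    def forward(self, stage_features):
        target_size = stage_features[0].shape[-2:]  # highest-resolution stage
        upsampled = [
            F.interpolate(proj(feat), size=target_size, mode="bilinear",
                          align_corners=False)
            for proj, feat in zip(self.project, stage_features)
        ]
        fused = F.relu(self.fuse(torch.cat(upsampled, dim=1)))
        return self.classify(fused)


# Stand-in encoder outputs at strides 4, 8, 16, 32 for a 128x128 image.
stage_features = [
    torch.randn(1, 32, 32, 32),
    torch.randn(1, 64, 16, 16),
    torch.randn(1, 160, 8, 8),
    torch.randn(1, 256, 4, 4),
]
decoder = MLPDecoderSketch()
stride4_logits = decoder(stage_features)
print(stride4_logits.shape)  # class scores at 1/4 of the image resolution
```

There are no 3×3 convolutions or dilation rates to tune in this head; the multi-scale context is expected to come from the encoder.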
Pretrained backbones and libraries
In practice few teams train from random init. Typical pattern:
ImageNet-pretrained encoder (ResNet, EfficientNet, MiT, …)
→ segmentation head (U-Net decoder, ASPP, MLP head)
→ fine-tune on your masks
Libraries like segmentation_models.pytorch expose Unet, DeepLabV3, and FPN in one line each, which is useful after you train U-Net from scratch in the project.
Decision guide — which model when?
| Scenario | Reasonable starting point |
|---|---|
| Course project / <5k masks | U-Net from scratch |
| Medical 2D slices, small data | U-Net or Attention U-Net |
| Driving / street scenes | DeepLabv3+ or SegFormer + pretrained |
| Mobile portrait mask | Small encoder + U-Net decoder; INT8 deploy |
| Separate mask per person | Instance path — next lesson (Mask R-CNN) |
| Need quick baseline on custom data | Pretrained DeepLab / FPN fine-tune |
What you are not expected to implement here
- Full DeepLab with all dilation ablations
- Transformer encoder from scratch
- Panoptic multi-task training
You are expected to recognize names, compare design goals, and justify building U-Net first, then fine-tuning a pretrained DeepLab as a stretch goal.
Checkpoint
- What does atrous convolution enlarge without downsampling further?
- How is ASPP similar in purpose to PSPNet’s pyramid?
- Why is FCN historically important if U-Net is more common in medical courses?