Beyond U-Net — other segmentation models

Before we begin

U-Net is the right first model to implement: it works with small datasets, has clear skip connections, and trains in an afternoon. Production and research stacks rarely stop there. Street-scene parsers often use DeepLab; mobile apps use lightweight encoders; papers benchmark SegFormer and Mask2Former; instance tasks use Mask R-CNN.

This lesson maps the landscape: what each family optimizes for, how they relate to encoder–decoder ideas you already know, and when to reach past U-Net.

Figure: Segmentation model families. All build on dense prediction; they differ in context, scale, and output type.

What you will learn

Place FCN, U-Net, DeepLab, PSPNet, and SegFormer on one timeline.
Explain atrous (dilated) convolution and ASPP in plain language.
Choose a model family for a scenario (medical, driving, mobile, instances).
Know what pretrained segmentation heads buy you in practice.

FCN — fully convolutional networks (2015)

Problem U-Net also solves: classification CNNs end with fully connected layers → one vector. FCN replaces FC layers with convs so the network outputs a spatial map, then upsamples coarse predictions to input size.

Idea	Detail
Skip connections	FCN added skips from shallow layers (similar spirit to U-Net)
Coarse heatmaps	Early versions upsampled low-res class scores — blobby borders
Historical role	Proved end-to-end trainable per-pixel labels on Pascal VOC

Takeaway: FCN = “make classification CNNs output grids.” U-Net = FCN-style idea + symmetric decoder + stronger skips for sharper masks on small data.
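
The "output a grid" step can be sketched in PyTorch. This is a minimal illustration, not FCN's actual architecture: the 512-channel 7×7 feature map, the 21 classes (Pascal VOC's 20 plus background), the 224×224 input size, and all variable names are assumptions chosen for the example. It copies a fully connected classifier's weights into a 1×1 convolution, so the same classifier runs at every spatial position, then upsamples the coarse score map. Fixed bilinear interpolation stands in here for FCN's learnable upsampling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

num_classes = 21   # assumed: Pascal VOC, 20 classes + background
channels = 512     # assumed width of the encoder's last feature map

# A classification head: one feature vector in, one score vector out.
fc = nn.Linear(channels, num_classes)

# The same weights viewed as a 1x1 convolution, applied at every position.
conv = nn.Conv2d(channels, num_classes, kernel_size=1)
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(num_classes, channels, 1, 1))
    conv.bias.copy_(fc.bias)

features = torch.randn(1, channels, 7, 7)  # coarse encoder feature map

with torch.no_grad():
    coarse_scores = conv(features)          # (1, 21, 7, 7): a grid of class scores
    # Upsample the coarse scores to the assumed input size.
    full_scores = F.interpolate(coarse_scores, size=(224, 224),
                                mode="bilinear", align_corners=False)
    mask = full_scores.argmax(dim=1)        # (1, 224, 224): per-pixel class ids
    # The FC layer applied to the feature vector at one grid position.
    fc_at_pixel = fc(features[0, :, 3, 4])
```

The conv output at any grid position matches the FC layer applied to that position's feature vector, which is why a pretrained classifier can be reused this way. Upsampling a 7×7 score grid by a factor of 32 is also where the blobby borders of early FCN variants come from.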

U-Net family (recap + extensions)

You built the baseline in the project. Common extensions:

Variant	What it adds
U-Net++	Nested skip pathways — features fuse at multiple scales
Attention U-Net	Gating on skip connections — suppress irrelevant encoder features
ResNet / EfficientNet encoder	Replace vanilla conv stack with ImageNet-pretrained backbone
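
To illustrate the gating row, here is a simplified attention gate in PyTorch. It is a sketch, not the Attention U-Net reference implementation: the class name `AttentionGateSketch`, the channel counts, and the assumption that the decoder (gating) feature has already been resized to the skip feature's resolution are all choices made for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionGateSketch(nn.Module):
    """Scales a skip feature by a per-pixel weight in [0, 1] (hypothetical helper)."""

    def __init__(self, skip_ch: int, gate_ch: int, inter_ch: int):
        super().__init__()
        self.project_skip = nn.Conv2d(skip_ch, inter_ch, kernel_size=1)
        self.project_gate = nn.Conv2d(gate_ch, inter_ch, kernel_size=1)
        self.to_weight = nn.Conv2d(inter_ch, 1, kernel_size=1)

    def forward(self, skip: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
        # Combine encoder (skip) and decoder (gate) evidence in a shared space.
        combined = F.relu(self.project_skip(skip) + self.project_gate(gate))
        # One weight per pixel; values near 0 suppress that location.
        weight = torch.sigmoid(self.to_weight(combined))
        return skip * weight


torch.manual_seed(0)
gate_module = AttentionGateSketch(skip_ch=64, gate_ch=128, inter_ch=32)
skip = torch.randn(1, 64, 32, 32)    # encoder feature on the skip path
gate = torch.randn(1, 128, 32, 32)   # decoder feature, assumed already resized
with torch.no_grad():
    gated = gate_module(skip, gate)
```

The gated tensor has the same shape as the skip feature, so it can be concatenated into the decoder exactly where the plain skip would have gone.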

When U-Net is enough: limited labels (hundreds to a few thousand images), binary or few-class semantic masks, teaching and prototyping.

When to upgrade: need SOTA on Cityscapes / ADE20K, very large objects + fine boundaries at once, or production latency targets.

DeepLab (Google) — context at multiple scales

Core problem: one receptive field size cannot capture both small objects and wide context (road + sky + distant cars).

Atrous (dilated) convolution

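A standard 3×3 conv on a downsampled feature map “sees” only a small region of that map. A dilated conv inserts gaps between the kernel weights: with matching padding, the output keeps the same spatial resolution while the effective receptive field grows, with no extra pooling and no extra weights.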

Normal 3×3:  sees 3×3 patch
Dilated 3×3 (rate=2): sees 5×5 patch — still H×W feature map size
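
The 3×3 and 5×5 figures above follow from simple arithmetic, sketched here in plain Python. The function names are hypothetical helpers for this example, and the 64-pixel feature map is an arbitrary assumption.

```python
def effective_kernel_size(kernel_size: int, rate: int) -> int:
    """Span covered by a dilated kernel: (rate - 1) gaps between adjacent taps."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def conv_output_size(size: int, kernel_size: int, rate: int,
                     padding: int, stride: int = 1) -> int:
    """Standard convolution output-size formula for one spatial dimension."""
    return (size + 2 * padding - rate * (kernel_size - 1) - 1) // stride + 1


for rate in (1, 2, 6):
    span = effective_kernel_size(3, rate)
    # For a 3x3 kernel, padding equal to the rate keeps the size unchanged.
    out = conv_output_size(64, kernel_size=3, rate=rate, padding=rate)
    print(f"rate={rate}: covers {span}x{span}, feature map 64 -> {out}")
```

In PyTorch the same setting corresponds to `nn.Conv2d(..., kernel_size=3, dilation=rate, padding=rate)`: the kernel still has nine weights, only their spacing changes.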

ASPP — Atrous Spatial Pyramid Pooling

Run parallel branches at different dilation rates (and often global average pooling), then concatenate — multi-scale context in one layer.
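
The parallel-branch idea can be sketched in PyTorch as follows. This is a simplified illustration rather than DeepLab's exact module: the class name `ASPPSketch`, the channel counts, and the dilation rates (6, 12, 18) are assumptions for the example, and the batch normalization and ReLU that real implementations place after each convolution are omitted to keep it short.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ASPPSketch(nn.Module):
    """Parallel dilated branches plus global pooling, concatenated and projected."""

    def __init__(self, in_ch: int, out_ch: int, rates=(6, 12, 18)):
        super().__init__()
        # Branch 0: plain 1x1 conv (no extra context).
        self.branches = nn.ModuleList([nn.Conv2d(in_ch, out_ch, kernel_size=1)])
        for rate in rates:
            # padding == rate keeps H x W unchanged for a 3x3 kernel.
            self.branches.append(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=rate, dilation=rate)
            )
        # Image-level branch: global average pool, then 1x1 conv.
        self.image_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, kernel_size=1),
        )
        # Fuse all branches back to out_ch channels.
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        feats = [branch(x) for branch in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=(h, w),
                               mode="bilinear", align_corners=False)
        feats.append(pooled)
        return self.project(torch.cat(feats, dim=1))


torch.manual_seed(0)
aspp = ASPPSketch(in_ch=64, out_ch=32)
x = torch.randn(2, 64, 33, 33)   # assumed encoder output
with torch.no_grad():
    y = aspp(x)                  # same H x W, multi-scale context mixed in
```

Every branch sees the same input at the same resolution; only the spacing of the kernel taps differs, which is what makes the concatenation possible without any resizing of the dilated branches.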

DeepLabv3+ also uses an encoder–decoder structure: a strong encoder (often ResNet or Xception) with ASPP on top, plus a lightweight decoder that refines borders using low-level encoder features.

Strength	Tradeoff
Excellent on street scenes (Cityscapes)	Heavier than plain U-Net
Strong benchmarks with pretrained backbones	More hyperparameters (dilation rates, output stride)

Checkpoint: Why is atrous conv preferable to another max-pool for “seeing more context”?

Pooling throws away resolution; dilation keeps H×W while expanding receptive field.

PSPNet — pyramid pooling module

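PSPNet (Pyramid Scene Parsing Network) average-pools the feature map into grids of several sizes (1×1, 2×2, 3×3, 6×6), upsamples each pooled map back to the feature-map size, and concatenates them with the original features. It is another multi-scale context trick, like ASPP but pooling-based.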

Often compared head-to-head with DeepLab on scene parsing benchmarks. Conceptually: “look at the scene globally and locally before labeling each pixel.”
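
The pyramid pooling step can be sketched in PyTorch as below. Treat it as an illustration under stated assumptions, not PSPNet's full module: the class name `PyramidPoolingSketch` and the 64-channel input are made up for the example, and normalization and activation layers are left out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PyramidPoolingSketch(nn.Module):
    """Pools to several grid sizes, upsamples, and concatenates with the input."""

    def __init__(self, in_ch: int, bins=(1, 2, 3, 6)):
        super().__init__()
        branch_ch = in_ch // len(bins)   # each branch gets 1/N of the channels
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),                 # b x b grid summary
                nn.Conv2d(in_ch, branch_ch, kernel_size=1),
            )
            for b in bins
        ])
        self.out_channels = in_ch + branch_ch * len(bins)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        pooled = [
            F.interpolate(stage(x), size=(h, w), mode="bilinear", align_corners=False)
            for stage in self.stages
        ]
        # Original features (local detail) plus pooled context (global view).
        return torch.cat([x] + pooled, dim=1)


torch.manual_seed(0)
ppm = PyramidPoolingSketch(in_ch=64)
x = torch.randn(1, 64, 30, 30)   # assumed encoder output
with torch.no_grad():
    y = ppm(x)
```

The 1×1 bin is a global summary of the whole scene, while the 6×6 bin keeps coarse spatial layout; concatenating both with the input is the "globally and locally" idea in code form.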

SegFormer & transformer encoders

SegFormer (and similar) swap the CNN encoder for a hierarchical transformer (mixing local + global attention), with a lightweight MLP decoder.

	CNN U-Net / DeepLab	SegFormer-style
Inductive bias	Locality via conv	Attention — flexible long range
Data	Works with modest data + pretrained CNN	Benefits from scale; typically starts from a pretrained transformer encoder (MiT for SegFormer, ViT-style for others)
Speed	Mature mobile optimizations	Heavier at full resolution

You do not need transformer math here — only that modern leaderboards often use attention encoders + simple decoders for semantic segmentation.

Pretrained backbones and libraries

In practice few teams train from random init. Typical pattern:

ImageNet-pretrained encoder (ResNet, EfficientNet, MiT, …)
  → segmentation head (U-Net decoder, ASPP, MLP head)
  → fine-tune on your masks

Libraries like segmentation_models.pytorch expose U-Net, DeepLabV3, and FPN models (classes Unet, DeepLabV3, FPN), each built in one line. They are most useful after you have trained U-Net from scratch in the project.

Decision guide — which model when?

Scenario	Reasonable starting point
Course project / <5k masks	U-Net from scratch
Medical 2D slices, small data	U-Net or Attention U-Net
Driving / street scenes	DeepLabv3+ or SegFormer + pretrained
Mobile portrait mask	Small encoder + U-Net decoder; deploy with INT8 quantization
Separate mask per person	Instance path — next lesson (Mask R-CNN)
Need quick baseline on custom data	Pretrained DeepLab / FPN fine-tune

What you are not expected to implement here

Full DeepLab with all dilation ablations
Transformer encoder from scratch
Panoptic multi-task training

You are expected to recognize names, compare design goals, and justify building U-Net first, then fine-tuning a pretrained DeepLab as a stretch goal.

Checkpoint

What does atrous convolution enlarge without downsampling further?
How is ASPP similar in purpose to PSPNet’s pyramid?
Why is FCN historically important if U-Net is more common in medical courses?

What's next

Lesson 5 — Instance segmentation & Mask R-CNN