Module 5 — Image segmentation

Instance segmentation & Mask R-CNN

Per-instance masks, two-stage detectors, RoIAlign, mask head, and panoptic segmentation overview.

~85 min read + exercises

Before we begin

Semantic segmentation labels every pixel with a class — all “person” pixels share one ID. Instance segmentation must answer: which pixels belong to person A vs person B?

That requires detecting objects and drawing a mask per instance. The workhorse architecture for years has been Mask R-CNN (He et al., 2017) — built on Faster R-CNN with a small mask head per region.

Figure: Mask R-CNN pipeline. Detect regions, then predict one mask per instance.
Backbone + FPN → RPN proposals → RoIAlign → Box + class | Mask head
Two-stage: propose regions → classify + box refine → predict mask inside each box.

What you will learn

  • Contrast semantic, instance, and panoptic outputs with examples.
  • Walk through Mask R-CNN stages: backbone, FPN, RPN, RoIAlign, mask head.
  • Explain why RoIAlign beats RoI Pool for mask quality.
  • Know one-stage and query-based alternatives at a high level.


Why semantic models fail at instances

Two overlapping cups on a table:

Output type     | What happens
Semantic        | All cup pixels → class cup (one blob)
Instance        | Cup 1 mask + Cup 2 mask (separable)
Detection only  | Two boxes; pixels inside each box still include background

Apps that count, track, or edit one object need instance or panoptic pipelines.
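
The two-cups case can be made concrete with a toy label map. This is a minimal sketch using nested lists; the maps and the helper names (`count_blobs`, `count_instances`) are made up for illustration. It shows that a semantic map of two touching cups collapses into one connected blob, while an instance map keeps the count.

```python
from collections import deque

# Toy scene: two cups that touch. 0 = background.
# instance_map gives each cup its own ID; semantic_map only knows class "cup" (1).
instance_map = [
    [0, 1, 1, 2, 2, 0],
    [0, 1, 1, 2, 2, 0],
    [0, 0, 0, 0, 0, 0],
]
semantic_map = [[1 if v > 0 else 0 for v in row] for row in instance_map]


def count_blobs(label_map):
    """Count 4-connected foreground regions, the best a semantic map can offer."""
    h, w = len(label_map), len(label_map[0])
    seen = [[False] * w for _ in range(h)]
    blobs = 0
    for y in range(h):
        for x in range(w):
            if label_map[y][x] == 0 or seen[y][x]:
                continue
            blobs += 1
            seen[y][x] = True
            queue = deque([(y, x)])
            while queue:  # breadth-first flood fill of this blob
                cy, cx = queue.popleft()
                for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and label_map[ny][nx] != 0 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return blobs


def count_instances(id_map):
    """Count distinct non-background instance IDs."""
    return len({v for row in id_map for v in row if v != 0})


semantic_count = count_blobs(semantic_map)      # touching cups merge into one blob
instance_count = count_instances(instance_map)  # IDs keep the cups apart
print(semantic_count, instance_count)
```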


Two-stage instance segmentation (Mask R-CNN)

Stage 1 — Region proposals

Backbone + FPN extract multi-scale features (same pyramid idea as detection lessons in CV foundations).

Region Proposal Network (RPN) slides anchors on the feature map, predicts objectness and box deltas → candidate boxes.
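
To make "anchors plus box deltas" concrete, here is a small sketch of how candidate boxes can be generated and decoded. It assumes the box parameterization commonly used in the Faster R-CNN family (centre offsets scaled by anchor size, log-scale width and height); the scales, ratios, and function names are illustrative choices, not values from this lesson.

```python
import math


def make_anchors(cx, cy, scales=(32, 64), ratios=(0.5, 1.0, 2.0)):
    """Return anchors (x1, y1, x2, y2) centred at (cx, cy).

    Each anchor has area scale**2 and aspect ratio = height / width.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


def decode(anchor, deltas):
    """Apply predicted deltas (dx, dy, dw, dh) to an anchor box."""
    x1, y1, x2, y2 = anchor
    wa, ha = x2 - x1, y2 - y1
    xa, ya = x1 + wa / 2, y1 + ha / 2
    dx, dy, dw, dh = deltas
    x, y = xa + dx * wa, ya + dy * ha            # shift the centre
    w, h = wa * math.exp(dw), ha * math.exp(dh)  # rescale width and height
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)


anchors = make_anchors(100, 100)
shifted = decode((0, 0, 10, 10), (0.1, 0.0, 0.0, 0.0))          # centre moves right by 1
widened = decode((0, 0, 10, 10), (0.0, 0.0, math.log(2), 0.0))  # width doubles
print(len(anchors), shifted, widened)
```

The RPN would predict one objectness score and one set of deltas per anchor at every feature-map location; the highest-scoring decoded boxes become the proposals.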

Stage 2 — Per-region prediction

For each proposed region:

  1. RoIAlign crops a fixed-size feature patch using bilinear sampling at fractional coordinates (unlike RoI Pool, nothing is rounded to the feature grid).
  2. Box head: class label + refined box.
  3. Mask head: a small FCN outputs K binary masks of size m×m, one per class (typically 28×28). Only one of them is used: the ground-truth class's mask in the training loss, the predicted class's mask at inference.
Image → ResNet+FPN → RPN proposals
  → RoIAlign per box → parallel heads: class | box | mask

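Training: multi-task loss L = L_cls + L_box + L_mask (classification + box regression + mask pixel loss). The mask term is a per-pixel binary cross-entropy, computed only on positive regions and only on the mask channel of the ground-truth class, so classes do not compete inside the mask head.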
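
The mask term can be sketched in a few lines. This is a pure-Python illustration, assuming the Mask R-CNN formulation (sigmoid per pixel, binary cross-entropy averaged over the m×m grid, ground-truth class channel only); the function names and the nested-list layout are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mask_loss(mask_logits, gt_class, gt_mask):
    """Average per-pixel binary cross-entropy on the ground-truth class channel.

    mask_logits: K x m x m nested lists of raw scores (one channel per class).
    gt_mask:     m x m nested lists of 0/1 targets for this positive region.
    """
    logits = mask_logits[gt_class]  # the other K-1 channels contribute nothing
    total, n = 0.0, 0
    for row_logits, row_gt in zip(logits, gt_mask):
        for z, t in zip(row_logits, row_gt):
            p = sigmoid(z)
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
            n += 1
    return total / n


def total_loss(l_cls, l_box, l_mask):
    """Multi-task loss as an unweighted sum of the three terms."""
    return l_cls + l_box + l_mask


# K = 2 classes, m = 2. Channel 0 is undecided, channel 1 matches the target.
logits = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[5.0, -5.0], [5.0, -5.0]],
]
target = [[1, 0], [1, 0]]
loss_if_class0 = mask_loss(logits, 0, target)  # every pixel at p = 0.5
loss_if_class1 = mask_loss(logits, 1, target)  # confident and correct
print(loss_if_class0, loss_if_class1)
```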


RoIAlign vs RoI Pool (why masks got better)

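RoI Pool rounds region boundaries, and the bins inside them, to whole feature-map cells. On a stride-16 feature map, a rounding error of half a cell is already 8 image pixels of misalignment: tolerable for coarse boxes, harmful for pixel masks.

RoIAlign skips the rounding and samples features at continuous locations with bilinear interpolation, so predicted masks stay aligned with object edges. This is critical for instance mask quality.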
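
The difference is easy to see on a toy feature map. The sketch below is a simplification: it takes one bilinear sample per output bin (RoIAlign implementations typically average several samples per bin), treats feature values as sitting at integer coordinates, and uses made-up function names.

```python
def bilinear_sample(feat, y, x):
    """Read a 2D feature map at a fractional (y, x) location."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = feat[y0][x0] * (1 - wx) + feat[y0][x1] * wx
    bottom = feat[y1][x0] * (1 - wx) + feat[y1][x1] * wx
    return top * (1 - wy) + bottom * wy


def snapped_sample(feat, y, x):
    """RoI Pool style read: snap the location to the nearest grid cell."""
    return feat[int(round(y))][int(round(x))]


def roi_align(feat, box, out_size):
    """Crop box = (y1, x1, y2, x2) in feature-map units to out_size x out_size.

    Simplification: one sample at the centre of each bin.
    """
    y1, x1, y2, x2 = box
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    return [
        [bilinear_sample(feat, y1 + (i + 0.5) * bin_h, x1 + (j + 0.5) * bin_w)
         for j in range(out_size)]
        for i in range(out_size)
    ]


# Feature map whose value grows by 10 per column, so misalignment is visible.
feat = [[0.0, 10.0, 20.0, 30.0] for _ in range(4)]
aligned = bilinear_sample(feat, 0.0, 1.25)  # interpolates between columns 1 and 2
snapped = snapped_sample(feat, 0.0, 1.25)   # falls back to column 1
patch = roi_align(feat, (0.0, 0.5, 2.0, 2.5), out_size=2)
print(aligned, snapped, patch)
```

The snapped read ignores the fractional part of the coordinate, and that lost offset is exactly what shows up as shifted mask edges once the patch is mapped back to the image.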


Outputs at inference

Per detected instance you get:

  • Bounding box
  • Class score
  • Binary mask (the m×m mask is resized to the box, thresholded, typically at 0.5, and pasted into the image at the box location)
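
One way to turn the low-resolution mask into a full-image binary mask is sketched below. It assumes integer box coordinates with exclusive right and bottom edges, bilinear resizing, and a 0.5 threshold; `paste_mask` and its arguments are hypothetical names, not a library API.

```python
def _bilinear(grid, y, x):
    """Bilinear read of a square grid, clamped at the borders."""
    m = len(grid)
    y = min(max(y, 0.0), m - 1.0)
    x = min(max(x, 0.0), m - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, m - 1), min(x0 + 1, m - 1)
    wy, wx = y - y0, x - x0
    top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
    bottom = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
    return top * (1 - wy) + bottom * wy


def paste_mask(mask_probs, box, image_h, image_w, threshold=0.5):
    """Resize an m x m probability mask to `box` and paste it into a full-size mask.

    box = (x1, y1, x2, y2) in integer pixels, x2 and y2 exclusive.
    """
    m = len(mask_probs)
    x1, y1, x2, y2 = box
    box_w, box_h = x2 - x1, y2 - y1
    full = [[0] * image_w for _ in range(image_h)]
    for y in range(max(y1, 0), min(y2, image_h)):
        for x in range(max(x1, 0), min(x2, image_w)):
            # Map the pixel centre back into mask coordinates.
            my = (y - y1 + 0.5) * m / box_h - 0.5
            mx = (x - x1 + 0.5) * m / box_w - 0.5
            full[y][x] = 1 if _bilinear(mask_probs, my, mx) >= threshold else 0
    return full


# 2x2 mask: left half is object, right half is background.
probs = [[0.9, 0.1], [0.9, 0.1]]
full_mask = paste_mask(probs, box=(2, 1, 6, 5), image_h=6, image_w=8)
for row in full_mask:
    print(row)
```

Pixels outside the box stay 0, so each instance's mask can be stored as a full-size binary array or as the box plus the small mask.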

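Post-processing: NMS (non-max suppression) removes duplicate boxes on the same object. In Mask R-CNN the mask head then runs only on the highest-scoring boxes that survive, which keeps inference cheap.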
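
Greedy NMS is short enough to write out. The sketch below is a plain-Python version for a single class, with boxes as (x1, y1, x2, y2) tuples and an illustrative IoU threshold of 0.5. Detection libraries usually apply NMS per class and ship an optimized routine, so treat this as a reading aid rather than production code.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # box 1 overlaps box 0 heavily and is dropped
print(kept)
```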


One-stage and query-based instance methods (awareness)

Family                       | Idea
YOLACT / YOLO-mask variants  | Predict masks in one pass; faster, often lower mask quality
SOLO / SOLOv2                | Assign instances to grid cells directly
Mask2Former (query-based)    | Transformer queries each predict a class and a mask; one architecture covers semantic, instance, and panoptic segmentation

Mask R-CNN remains the teaching default because the two-stage story is explicit: detect → segment inside box.


Panoptic segmentation — tying it together

Panoptic = semantic labels for stuff (sky, road) + instance masks for things (people, cars).

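Panoptic models either combine a semantic branch with an instance branch (e.g. Panoptic FPN) or predict both with one unified model (e.g. Mask2Former). Either way, every pixel ends up with exactly one label, so overlapping predictions must be resolved. Production driving stacks care deeply about not confusing road vs sidewalk (semantic) and not merging two pedestrians (instance).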
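
For the two-branch case, the merge step can be sketched as a simple heuristic: paste instance masks in order of confidence, then fill the remaining pixels with stuff labels from the semantic map. The code below is a toy version under those assumptions; the class IDs, the `VOID` label, and `merge_panoptic` are invented for the example, and real pipelines add further rules (for example, dropping tiny segments).

```python
VOID = -1  # label for pixels no branch can explain (assumed convention)


def merge_panoptic(semantic_map, instances, thing_classes):
    """Combine a semantic map and instance masks into one panoptic map.

    instances: list of (class_id, score, mask), mask being H x W lists of 0/1.
    Returns an H x W map of (class_id, instance_id); instance_id is 0 for stuff.
    """
    h, w = len(semantic_map), len(semantic_map[0])
    panoptic = [[None] * w for _ in range(h)]

    # 1. Things: higher-scoring instances claim contested pixels first.
    ranked = sorted(instances, key=lambda inst: inst[1], reverse=True)
    for instance_id, (class_id, _score, mask) in enumerate(ranked, start=1):
        for y in range(h):
            for x in range(w):
                if mask[y][x] and panoptic[y][x] is None:
                    panoptic[y][x] = (class_id, instance_id)

    # 2. Stuff: fill what is left from the semantic branch.
    for y in range(h):
        for x in range(w):
            if panoptic[y][x] is None:
                c = semantic_map[y][x]
                # A thing class without an instance mask cannot be given an ID.
                panoptic[y][x] = (VOID, 0) if c in thing_classes else (c, 0)
    return panoptic


ROAD, PERSON = 0, 1
semantic = [[PERSON, PERSON, PERSON, ROAD]]
people = [
    (PERSON, 0.6, [[0, 1, 1, 0]]),  # overlaps the other person at x = 1
    (PERSON, 0.9, [[1, 1, 0, 0]]),
]
result = merge_panoptic(semantic, people, thing_classes={PERSON})
print(result)
```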


When to use instance vs semantic U-Net

Use semantic U-Net / DeepLab        | Use instance / Mask R-CNN
Portrait background (person vs bg)  | Counting objects on a shelf
Road / sky parsing                  | Robotics manipulation per object
Organ segmentation in CT slice      | Interactive photo editing per person
Faster, simpler labels              | Needs box + mask or instance ID labels

Your course project is semantic (pet vs background vs border trimap) — the right first step. Instance is the natural Module 5 extension if you label separate object IDs.


Practical notes

  • Annotation cost: instance > semantic — each object needs its own mask ID.
  • Compute: Mask R-CNN is heavier than U-Net — often GPU-only at reasonable resolution.
  • Libraries: torchvision.models.detection.maskrcnn_resnet50_fpn — fine-tune on custom instance datasets.

Checkpoint

  1. What does the mask head predict relative to the box head?
  2. Why does RoIAlign matter for mask edges?
  3. Name one task that needs instance segmentation, not semantic only.

What's next

Lesson 6 — Losses & metrics