Instance segmentation & Mask R-CNN

Before we begin

Semantic segmentation labels every pixel with a class — all “person” pixels share one ID. Instance segmentation must answer: which pixels belong to person A vs person B?

That requires detecting objects and drawing a mask per instance. The workhorse architecture for years has been Mask R-CNN (He et al., 2017) — built on Faster R-CNN with a small mask head per region.

Figure: Mask R-CNN pipeline. Two-stage: propose regions → classify + box refine → predict mask inside each box.

What you will learn

Contrast semantic, instance, and panoptic outputs with examples.
Walk through Mask R-CNN stages: backbone, FPN, RPN, RoIAlign, mask head.
Explain why RoIAlign beats RoI Pool for mask quality.
Know one-stage and query-based alternatives at a high level.

Why semantic models fail at instances

Two overlapping cups on a table:

Output type	What happens
Semantic	All cup pixels → class `cup` — one blob
Instance	Cup 1 mask + Cup 2 mask — separable
Detection only	Two boxes — pixels inside box still include background

Apps that count, track, or edit one object need instance or panoptic pipelines.
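
The difference between the first two rows of the table can be sketched with two tiny label maps. This is an illustrative toy, not output from a model: the grid, the class id `CUP`, and the helper names are all made up for the example.

```python
# Toy 4x6 scene: two cups that touch, so their pixels form one connected blob.
# 0 = background; the class id and instance ids are invented for illustration.
CUP = 1

# Semantic map: every cup pixel carries the same class id.
semantic = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]

# Instance map: same foreground pixels, but each cup has its own id.
instance = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 2, 2, 0],
    [0, 1, 1, 2, 2, 0],
    [0, 0, 0, 0, 0, 0],
]

def distinct_labels(label_map):
    """Return the set of non-background labels present in a 2D map."""
    return {v for row in label_map for v in row if v != 0}

def pixel_area(label_map, label):
    """Count the pixels carrying a given label."""
    return sum(v == label for row in label_map for v in row)

semantic_classes = distinct_labels(semantic)  # one class, cups cannot be counted
instance_ids = distinct_labels(instance)      # one id per cup

print("classes in semantic map:", semantic_classes)
print("cups in instance map:", len(instance_ids))
print("area of cup 2:", pixel_area(instance, 2))
```

Because the cups touch, running connected components on the semantic map would still return a single blob, which is why counting needs instance IDs rather than post-processing of class labels.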

Two-stage instance segmentation (Mask R-CNN)

Stage 1 — Region proposals

Backbone + FPN extract multi-scale features (same pyramid idea as detection lessons in CV foundations).

Region Proposal Network (RPN) slides anchors on the feature map, predicts objectness and box deltas → candidate boxes.
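
To make "anchors" and "box deltas" concrete, here is a minimal sketch that places anchors on a small feature map and decodes one predicted delta into a box. It assumes the standard Faster R-CNN box parameterisation; the function names, the stride of 16, and the single anchor size are choices made for the example, and the objectness scoring and filtering steps are left out.

```python
import math

def make_anchors(feat_h, feat_w, stride, sizes=(32,), ratios=(0.5, 1.0, 2.0)):
    """Place anchors (cx, cy, w, h) centred on every feature-map cell."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for size in sizes:
                for ratio in ratios:  # ratio = height / width
                    w = size / math.sqrt(ratio)
                    h = size * math.sqrt(ratio)
                    anchors.append((cx, cy, w, h))
    return anchors

def decode(anchor, deltas):
    """Apply predicted deltas (dx, dy, dw, dh) to one anchor.

    Centre shifts are scaled by the anchor size; width and height are
    scaled exponentially. Returns corners (x1, y1, x2, y2).
    """
    cx, cy, w, h = anchor
    dx, dy, dw, dh = deltas
    new_cx, new_cy = cx + dx * w, cy + dy * h
    new_w, new_h = w * math.exp(dw), h * math.exp(dh)
    return (new_cx - new_w / 2, new_cy - new_h / 2,
            new_cx + new_w / 2, new_cy + new_h / 2)

anchors = make_anchors(feat_h=2, feat_w=3, stride=16)
square = anchors[1]  # the ratio-1.0 anchor in the first cell
unchanged = decode(square, (0.0, 0.0, 0.0, 0.0))
shifted = decode(square, (0.25, 0.0, math.log(2.0), 0.0))

print(len(anchors), "anchors")
print("zero deltas:", unchanged)
print("shift right and double width:", shifted)
```

Predicting deltas relative to an anchor, rather than absolute coordinates, keeps the regression targets small and comparable across object sizes.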

Stage 2 — Per-region prediction

For each proposed region:

RoIAlign: crops a fixed-size feature patch by bilinear sampling at fractional coordinates, avoiding the coordinate rounding that RoI Pool applies.
Box head: class label + refined box.
Mask head: a small FCN outputs K masks of size m×m, one per class (typically 28×28 with an FPN backbone); at inference only the mask for the predicted class is kept.

Image → ResNet+FPN → RPN proposals
  → RoIAlign per box → parallel heads: class | box | mask

Training: the multi-task loss sums classification, box regression, and a per-pixel mask loss; the mask loss is computed only on positive regions (proposals matched to a ground-truth object).
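
Written out following the Mask R-CNN paper (He et al., 2017), the loss for one sampled region is a plain sum of the three terms, and the mask term is the average binary cross-entropy over the m×m grid. The symbols $y_{ij}$ (ground-truth mask pixel) and $\hat{y}^{k}_{ij}$ (sigmoid output of the mask channel for ground-truth class $k$) are notation chosen here for illustration.

```latex
L = L_{\text{cls}} + L_{\text{box}} + L_{\text{mask}},
\qquad
L_{\text{mask}} = -\frac{1}{m^{2}} \sum_{i=1}^{m} \sum_{j=1}^{m}
  \left[ y_{ij} \log \hat{y}^{k}_{ij}
       + (1 - y_{ij}) \log\!\left(1 - \hat{y}^{k}_{ij}\right) \right]
```

Only channel $k$ contributes to the mask loss, so the K mask channels do not compete with each other; deciding the class is left to the box head.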

RoIAlign vs RoI Pool (why masks got better)

RoI Pool snaps region boundaries to discrete grid cells → misalignment — fine for coarse boxes, bad for pixel masks.

RoIAlign samples at continuous locations → masks align with object edges — critical for instance quality.
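
The effect of snapping versus continuous sampling can be seen on a toy feature map. The sketch below is a simplification under stated assumptions: it compares a single sample point, whereas real RoI Pool quantizes region and bin boundaries before max-pooling and real RoIAlign averages several bilinear samples per bin. The function names are invented for the example.

```python
import math

def bilinear_sample(feat, y, x):
    """Interpolate a 2D feature map at a fractional (y, x) location."""
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, len(feat) - 1), min(x0 + 1, len(feat[0]) - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feat[y0][x0] + wx * feat[y0][x1]
    bottom = (1 - wx) * feat[y1][x0] + wx * feat[y1][x1]
    return (1 - wy) * top + wy * bottom

def quantized_sample(feat, y, x):
    """RoI Pool style: snap the location down to a whole grid cell."""
    return feat[int(math.floor(y))][int(math.floor(x))]

# Feature map with a horizontal ramp, so each value encodes its x position.
feat = [[float(10 * c) for c in range(5)] for _ in range(5)]

# Two sample points that differ by a fraction of a cell.
a = (2.0, 1.2)
b = (2.0, 1.9)

aligned = [bilinear_sample(feat, *p) for p in (a, b)]
snapped = [quantized_sample(feat, *p) for p in (a, b)]

print("bilinear:", aligned)   # values track the sub-cell shift
print("quantized:", snapped)  # both points collapse onto the same cell
```

With quantization the two points read the same feature value, so a sub-cell shift of the box is invisible to the mask head. One feature cell often spans 16 or more image pixels, so that lost offset shows up as visibly misplaced mask edges. In PyTorch code the batched operation is available as `torchvision.ops.roi_align`.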

Outputs at inference

Per detected instance you get:

Bounding box
Class score
Binary mask (the m×m prediction for the predicted class is resized to the box, thresholded, and pasted into a full-image mask)
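
How the small per-class grid becomes a full-image binary mask can be sketched as follows. This is a simplified illustration, not the torchvision implementation: it uses nearest-neighbour lookup where bilinear resizing is the usual choice, assumes integer box coordinates, and `paste_mask` is a hypothetical helper name.

```python
def paste_mask(class_probs, label, box, image_h, image_w, threshold=0.5):
    """Turn a per-class m x m probability grid into a full-image binary mask.

    class_probs: list of K grids, each m x m, with values in [0, 1].
    label: index of the class chosen by the box head.
    box: (x1, y1, x2, y2) in integer pixel coordinates.
    """
    probs = class_probs[label]  # keep only the predicted class
    m = len(probs)
    x1, y1, x2, y2 = box
    box_w, box_h = x2 - x1, y2 - y1
    full = [[0] * image_w for _ in range(image_h)]
    for y in range(max(y1, 0), min(y2, image_h)):
        for x in range(max(x1, 0), min(x2, image_w)):
            # Nearest-neighbour lookup into the m x m grid (a simplification).
            src_y = min(int((y - y1) * m / box_h), m - 1)
            src_x = min(int((x - x1) * m / box_w), m - 1)
            full[y][x] = int(probs[src_y][src_x] >= threshold)
    return full

# K = 2 classes and m = 2 for readability (28 x 28 is the typical size).
class_probs = [
    [[0.9, 0.9], [0.9, 0.9]],  # class 0: everything foreground
    [[0.8, 0.2], [0.7, 0.1]],  # class 1: left half foreground
]
mask = paste_mask(class_probs, label=1, box=(1, 1, 5, 3), image_h=4, image_w=6)
for row in mask:
    print(row)
```

Pixels outside the box stay 0 by construction, which is why a box that is too tight clips the mask no matter how good the mask head is.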

Post-processing: NMS (non-max suppression) removes duplicate boxes on the same object.
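
A minimal greedy NMS in plain Python shows the idea. The 0.5 IoU threshold and the sample boxes are assumptions for the example; detection libraries typically run this per class and in vectorised form (for instance `torchvision.ops.nms`).

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score, keep a box only if it
    does not overlap an already kept box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

# Two near-duplicate detections of one object plus a separate object.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.90, 0.75, 0.80]
kept = nms(boxes, scores)
print("kept indices:", kept)
```

The lower-scoring duplicate (index 1) is dropped because its IoU with box 0 is about 0.82, while the distant box survives. Heavily overlapping real objects, such as people in a crowd, can be suppressed the same way, which is a known weakness of box-level NMS.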

One-stage instance methods (awareness)

Family	Idea
YOLACT / YOLO-mask variants	Predict masks in one pass — faster, often lower mask quality
SOLO / SOLOv2	Assign instances to grid cells directly
Mask2Former (query-based)	Transformer queries each predict a mask and a class; one architecture covers semantic, instance, and panoptic segmentation

Mask R-CNN remains the teaching default because the two-stage story is explicit: detect → segment inside box.

Panoptic segmentation — tying it together

Panoptic = semantic labels for stuff (sky, road) + instance masks for things (people, cars).

Implementations either combine a semantic branch with an instance branch (e.g. Panoptic FPN) or use a single unified model (e.g. Mask2Former). Production driving stacks care deeply about not confusing road vs sidewalk (semantic) and not merging two pedestrians (instance).
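
For the two-branch case, the outputs have to be merged into a single map in which every pixel has one class and, for things, one instance ID. The sketch below is a simple heuristic merge written for illustration: higher-scoring instances win overlaps, and things take priority over stuff. The class ids, the `-1` unlabeled marker, and the function name are assumptions, not part of any specific library.

```python
def merge_panoptic(semantic, instances, thing_classes):
    """Combine a semantic map and instance masks into one panoptic map.

    semantic: 2D grid of class ids.
    instances: list of (class_id, score, mask), mask being a 2D 0/1 grid.
    Returns a 2D grid of (class_id, instance_id); stuff uses instance_id 0.
    """
    h, w = len(semantic), len(semantic[0])
    panoptic = [[None] * w for _ in range(h)]
    # Higher-scoring instances claim pixels first, so overlaps go to them.
    ranked = sorted(instances, key=lambda inst: inst[1], reverse=True)
    for instance_id, (class_id, _, mask) in enumerate(ranked, start=1):
        for y in range(h):
            for x in range(w):
                if mask[y][x] and panoptic[y][x] is None:
                    panoptic[y][x] = (class_id, instance_id)
    # Remaining pixels take their stuff label from the semantic map.
    for y in range(h):
        for x in range(w):
            if panoptic[y][x] is None:
                cls = semantic[y][x]
                # A thing pixel with no instance is left unlabeled (-1 here).
                panoptic[y][x] = (-1, 0) if cls in thing_classes else (cls, 0)
    return panoptic

ROAD, SIDEWALK, PERSON = 0, 1, 2  # hypothetical class ids
semantic = [
    [1, 1, 1, 1, 1],
    [0, 2, 2, 2, 0],
    [0, 0, 0, 0, 0],
]
person_a = [[0, 0, 0, 0, 0], [0, 1, 1, 0, 0], [0, 0, 0, 0, 0]]
person_b = [[0, 0, 0, 0, 0], [0, 0, 1, 1, 0], [0, 0, 0, 0, 0]]
panoptic = merge_panoptic(
    semantic,
    instances=[(PERSON, 0.7, person_b), (PERSON, 0.9, person_a)],
    thing_classes={PERSON},
)
for row in panoptic:
    print(row)
```

In the middle row the contested pixel goes to the higher-scoring person, the two pedestrians keep distinct IDs, and road and sidewalk pixels keep their semantic labels with instance ID 0.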

When to use instance vs semantic U-Net

Use semantic U-Net / DeepLab	Use instance / Mask R-CNN
Portrait background (person vs bg)	Counting objects in a shelf
Road / sky parsing	Robotics manipulation per object
Organ segmentation in CT slice	Interactive photo editing per person
Faster, simpler labels	Needs box + mask or instance ID labels

Your course project is semantic (pet vs background vs border trimap) — the right first step. Instance is the natural Module 5 extension if you label separate object IDs.

Practical notes

Annotation cost: instance > semantic — each object needs its own mask ID.
Compute: Mask R-CNN is heavier than U-Net; training and interactive-speed inference at typical image resolutions generally need a GPU.
Libraries: torchvision.models.detection.maskrcnn_resnet50_fpn — fine-tune on custom instance datasets.

Checkpoint

What does the mask head predict relative to the box head?
Why does RoIAlign matter for mask edges?
Name one task that needs instance segmentation, not semantic only.

What's next

Lesson 6 — Losses & metrics