Instance segmentation & Mask R-CNN
Before we begin
Semantic segmentation labels every pixel with a class — all “person” pixels share one ID. Instance segmentation must answer: which pixels belong to person A vs person B?
That requires detecting objects and drawing a mask per instance. The workhorse architecture for years has been Mask R-CNN (He et al., 2017) — built on Faster R-CNN with a small mask head per region.
Figure: Mask R-CNN pipeline
What you will learn
- Contrast semantic, instance, and panoptic outputs with examples.
- Walk through Mask R-CNN stages: backbone, FPN, RPN, RoIAlign, mask head.
- Explain why RoIAlign beats RoI Pool for mask quality.
- Know one-stage and query-based alternatives at a high level.
Why semantic models fail at instances
Two overlapping cups on a table:
| Output type | What happens |
|---|---|
| Semantic | All cup pixels → class cup — one blob |
| Instance | Cup 1 mask + Cup 2 mask — separable |
| Detection only | Two boxes — pixels inside box still include background |
Apps that count, track, or edit one object need instance or panoptic pipelines.
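The two-cups case can be sketched with a toy one-row "image". The class IDs and instance IDs below are illustrative assumptions, not a dataset convention. The point is only that counting objects from a semantic map merges touching cups, while an instance map keeps them apart.

```python
# Toy 1x8 image row: two touching cups next to background.
# Illustrative class IDs: 0 = background, 1 = cup.
semantic = [0, 1, 1, 1, 1, 1, 0, 0]

# Instance map: 0 = background, 1 = cup A, 2 = cup B (hypothetical labels).
instance = [0, 1, 1, 1, 2, 2, 0, 0]

def count_blobs(row, target):
    """Count runs of consecutive pixels equal to `target`."""
    runs, inside = 0, False
    for value in row:
        if value == target and not inside:
            runs += 1
        inside = value == target
    return runs

# Connected components on the semantic map see one merged blob.
cups_from_semantic = count_blobs(semantic, 1)

# The instance map keeps one ID per object, so counting is a set lookup.
cups_from_instance = len(set(instance) - {0})

print(cups_from_semantic, cups_from_instance)
```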
Two-stage instance segmentation (Mask R-CNN)
Stage 1 — Region proposals
Backbone + FPN extract multi-scale features (same pyramid idea as detection lessons in CV foundations).
Region Proposal Network (RPN) slides anchors on the feature map, predicts objectness and box deltas → candidate boxes.
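As a rough sketch of what "anchors plus box deltas" means, the snippet below builds anchors at one feature-map location and decodes a set of deltas with the usual Faster R-CNN parameterization (centre shifts scaled by anchor size, log-space width and height). The scales, ratios, and delta values are illustrative assumptions, and `make_anchors` and `apply_deltas` are hypothetical helper names.

```python
import math

def make_anchors(cx, cy, scales=(64, 128), ratios=(0.5, 1.0, 2.0)):
    """Anchors (x1, y1, x2, y2) centred on one feature-map location."""
    anchors = []
    for s in scales:
        for r in ratios:
            # r = height / width, with the area kept at s * s.
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

def apply_deltas(anchor, deltas):
    """Decode (dx, dy, dw, dh) relative to an anchor into a box."""
    x1, y1, x2, y2 = anchor
    wa, ha = x2 - x1, y2 - y1
    xa, ya = x1 + wa / 2, y1 + ha / 2
    dx, dy, dw, dh = deltas
    cx, cy = xa + dx * wa, ya + dy * ha
    w, h = wa * math.exp(dw), ha * math.exp(dh)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

anchors = make_anchors(cx=100, cy=100)
square = anchors[1]  # the 64x64 anchor with ratio 1.0

# Hypothetical RPN output: shift the box right by a quarter of its width.
proposal = apply_deltas(square, (0.25, 0.0, 0.0, 0.0))
print(square, proposal)
```

In the real network the objectness score decides which decoded boxes survive as proposals; that ranking step is left out here.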
Stage 2 — Per-region prediction
For each proposed region:
- RoIAlign: crops a fixed-size feature patch using bilinear sampling at fractional coordinates (no harsh quantization like RoI Pool).
- Box head: class label + refined box.
- Mask head: a small FCN outputs K masks of size m×m, one per class (typically 28×28); at inference only the mask for the predicted class is kept.
Image → ResNet+FPN → RPN proposals
→ RoIAlign per box → parallel heads: class | box | mask
Training: multi-task loss = classification + box regression + mask pixel loss (on positive regions).
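The mask term of the multi-task loss can be sketched in plain Python. This assumes the formulation from the Mask R-CNN paper: a per-pixel sigmoid with binary cross-entropy, averaged over the m×m grid and computed only on the channel of the ground-truth class. The logits and the two placeholder loss values are made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mask_loss(mask_logits, gt_class, gt_mask):
    """Mean per-pixel binary cross-entropy on the ground-truth class channel.

    mask_logits: K channels, each an m x m grid of logits.
    gt_class:    index of the ground-truth class for this positive RoI.
    gt_mask:     m x m grid of 0/1 targets.
    """
    logits = mask_logits[gt_class]  # the other class channels are ignored
    total, count = 0.0, 0
    for row_logits, row_gt in zip(logits, gt_mask):
        for z, t in zip(row_logits, row_gt):
            p = sigmoid(z)
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
            count += 1
    return total / count

# Toy values: K = 2 classes, m = 2.
mask_logits = [
    [[5.0, 5.0], [5.0, 5.0]],    # class 0 channel (ignored for this RoI)
    [[4.0, -4.0], [4.0, -4.0]],  # class 1 channel
]
gt_mask = [[1, 0], [1, 0]]

l_mask = mask_loss(mask_logits, gt_class=1, gt_mask=gt_mask)
l_cls, l_box = 0.30, 0.10  # placeholder values for the other two terms
total_loss = l_cls + l_box + l_mask
print(round(l_mask, 4), round(total_loss, 4))
```

Because only one channel contributes, classes do not compete inside the mask head; the class decision is left to the box head.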
RoIAlign vs RoI Pool (why masks got better)
RoI Pool snaps region boundaries to discrete grid cells → misalignment — fine for coarse boxes, bad for pixel masks.
RoIAlign samples at continuous locations → masks align with object edges — critical for instance quality.
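A minimal sketch of the difference, on a toy feature map whose values ramp from left to right. It reads a single point, whereas real RoIAlign samples several points per output bin and aggregates them. The truncation used for the RoI Pool case is a simplification, and the coordinate 1.75 is an arbitrary example.

```python
import math

def bilinear(feature, y, x):
    """Sample a 2D grid at a fractional (y, x) location."""
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1 = min(y0 + 1, len(feature) - 1)
    x1 = min(x0 + 1, len(feature[0]) - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feature[y0][x0] + wx * feature[y0][x1]
    bottom = (1 - wx) * feature[y1][x0] + wx * feature[y1][x1]
    return (1 - wy) * top + wy * bottom

# Toy feature map: value grows by 10 per column.
feature = [
    [0.0, 10.0, 20.0, 30.0],
    [0.0, 10.0, 20.0, 30.0],
]

x = 1.75  # RoI edge at a fractional feature-map coordinate (hypothetical)

pooled = feature[0][int(x)]          # RoI Pool style: quantize, read cell 1
aligned = bilinear(feature, 0.0, x)  # RoIAlign style: interpolate at 1.75
print(pooled, aligned)
```

The quantized read is off by three quarters of a feature cell. With a stride-16 feature map that is roughly 12 image pixels, which is tolerable for a box but visible on a mask edge.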
Outputs at inference
Per detected instance you get:
- Bounding box
- Class score
- Binary mask (the m×m output is resized to the box, pasted into the image at the box location, and thresholded)
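The mask step can be sketched as follows. This is a simplified version that uses nearest-neighbour resizing and integer box coordinates to stay short; real implementations typically interpolate bilinearly. `paste_mask` is a hypothetical helper name, and the 0.5 threshold is the common default rather than a requirement.

```python
def paste_mask(mask_probs, box, image_h, image_w, threshold=0.5):
    """Resize an m x m probability mask to `box` and paste it into an empty image.

    box is (x1, y1, x2, y2) in integer pixel coordinates, x2 and y2 exclusive.
    """
    m = len(mask_probs)
    x1, y1, x2, y2 = box
    full = [[0] * image_w for _ in range(image_h)]
    for y in range(max(y1, 0), min(y2, image_h)):
        for x in range(max(x1, 0), min(x2, image_w)):
            # Map the image pixel back to a cell of the m x m mask.
            my = min(int((y - y1) * m / (y2 - y1)), m - 1)
            mx = min(int((x - x1) * m / (x2 - x1)), m - 1)
            full[y][x] = 1 if mask_probs[my][mx] >= threshold else 0
    return full

# Toy 2x2 mask (a real head would give 28x28) pasted into a 4x6 image.
mask_probs = [[0.9, 0.2],
              [0.8, 0.7]]
full_mask = paste_mask(mask_probs, box=(1, 0, 5, 4), image_h=4, image_w=6)
for row in full_mask:
    print(row)
```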
Post-processing: NMS (non-max suppression) removes duplicate boxes on the same object.
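Greedy NMS is short enough to write out. The sketch below is class-agnostic and uses an IoU threshold of 0.5; detectors usually run it per class, and the threshold is a tunable assumption. The boxes and scores are invented.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap an already kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.90, 0.75, 0.80]
kept = nms(boxes, scores)
print(kept)
```

The second box overlaps the first with IoU of about 0.82, so it is dropped as a duplicate; the third box is far away and survives.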
One-stage instance methods (awareness)
| Family | Idea |
|---|---|
| YOLACT / YOLO-mask variants | Predict masks in one pass — faster, often lower mask quality |
| SOLO / SOLOv2 | Assign instances to grid cells directly |
| Mask2Former (query-based) | Transformer queries each predict a mask and a class; one architecture covers semantic, instance, and panoptic tasks |
Mask R-CNN remains the teaching default because the two-stage story is explicit: detect → segment inside box.
Panoptic segmentation — tying it together
Panoptic = semantic labels for stuff (sky, road) + instance masks for things (people, cars).
Often two branches or unified models (e.g. Panoptic FPN, Mask2Former). Production driving stacks care deeply about not confusing road vs sidewalk (semantic) and not merging two pedestrians (instance).
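One way to picture the two-branch approach is a merge step that overlays instance masks on a semantic map. The sketch below is a simplified heuristic, not the exact procedure of any particular paper. The class IDs, the `(class_id, instance_id)` encoding, and the `merge_panoptic` helper are assumptions made for illustration.

```python
# Semantic map with illustrative class IDs: 0 = sky, 1 = road, 2 = person.
semantic = [
    [0, 0, 0, 0],
    [1, 2, 2, 1],
    [1, 2, 2, 1],
]
THING_CLASSES = {2}

# Two instance masks for class "person", each a set of (row, col) pixels,
# listed in descending score order (hypothetical detector output).
instances = [
    {"class_id": 2, "pixels": {(1, 1), (2, 1)}},
    {"class_id": 2, "pixels": {(1, 2), (2, 2)}},
]

def merge_panoptic(semantic, instances, thing_classes):
    """Return a grid of (class_id, instance_id); instance_id 0 marks stuff."""
    panoptic = [[(c, 0) for c in row] for row in semantic]
    taken = set()
    for instance_id, inst in enumerate(instances, start=1):
        for (r, c) in inst["pixels"] - taken:  # earlier instances win overlaps
            panoptic[r][c] = (inst["class_id"], instance_id)
        taken |= inst["pixels"]
    # Thing pixels that no instance claimed have no valid ID; mark them void.
    for r, row in enumerate(panoptic):
        for c, (cls, inst_id) in enumerate(row):
            if cls in thing_classes and inst_id == 0:
                panoptic[r][c] = (None, 0)
    return panoptic

panoptic = merge_panoptic(semantic, instances, THING_CLASSES)
for row in panoptic:
    print(row)
```

Every pixel ends up with exactly one label: stuff pixels carry only a class, and the two pedestrians share a class but keep distinct instance IDs.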
When to use instance vs semantic U-Net
| Use semantic U-Net / DeepLab | Use instance / Mask R-CNN |
|---|---|
| Portrait background (person vs bg) | Counting objects in a shelf |
| Road / sky parsing | Robotics manipulation per object |
| Organ segmentation in CT slice | Interactive photo editing per person |
| Faster, simpler labels | Needs box + mask or instance ID labels |
Your course project is semantic (pet vs background vs border trimap) — the right first step. Instance is the natural Module 5 extension if you label separate object IDs.
Practical notes
- Annotation cost: instance > semantic — each object needs its own mask ID.
- Compute: Mask R-CNN is heavier than U-Net; in practice it usually needs a GPU for training and for inference at useful resolutions.
- Libraries: torchvision.models.detection.maskrcnn_resnet50_fpn can be fine-tuned on custom instance datasets.
Checkpoint
- What does the mask head predict relative to the box head?
- Why does RoIAlign matter for mask edges?
- Name one task that needs instance segmentation, not semantic only.