What is image segmentation?
Before we begin
So far in this course, most models output one decision per input:
- Regression → one number
- Spam / MNIST → one class label
Image segmentation breaks that pattern. The model outputs a label for every pixel — a dense map aligned with the image grid. You are teaching the network to paint regions: this pixel is pet, this pixel is background, this pixel is border.
That single change — from sparse to dense output — drives different architectures (encoder–decoder, U-Net), different metrics (IoU, Dice), and much more expensive labeling.
Figure: From one label to per-pixel labels
What you will learn
- Contrast classification, detection, and segmentation with output shapes.
- Define semantic, instance, and panoptic segmentation.
- Walk through a portrait-mode example end to end.
- Explain why masks must be pixel-aligned with images.
- Understand why segmentation labels cost more than classification labels.
The vision task ladder
Computer vision tasks are often ordered by how much spatial detail the output requires:
| Task | Output | Shape (typical) | Question |
|---|---|---|---|
| Classification | One class | (batch, num_classes) logits | “What is it?” |
| Detection | Boxes + classes | (N, 4) boxes + (N,) labels | “Where are they?” (rectangles) |
| Semantic segmentation | Class per pixel | (batch, C, H, W) logits | “What class is each pixel?” |
| Instance segmentation | Mask per object | N masks, each (H, W) | “Which pixels belong to object k?” |
Figure: Output shape changes
Key idea: segmentation is dense prediction. If the input is 256×256 RGB, the output is often 256×256 class IDs (or 256×256×C logits before argmax).
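To make the shapes concrete, here is a minimal NumPy sketch of going from per-pixel logits to a map of class IDs. The batch size, class count, and random logits are assumptions chosen for illustration; a real network would produce the logits.

```python
import numpy as np

# Hypothetical sizes for illustration: 2 images, 3 classes, 256x256 pixels.
batch, num_classes, height, width = 2, 3, 256, 256

rng = np.random.default_rng(0)
# Stand-in for a network's raw output: one score per class per pixel,
# in the channel-first layout (batch, C, H, W) used in the table above.
logits = rng.normal(size=(batch, num_classes, height, width))

# Pick the highest-scoring class at every pixel (axis 1 is the class axis).
pred_mask = logits.argmax(axis=1)

print(logits.shape)     # (2, 3, 256, 256)
print(pred_mask.shape)  # (2, 256, 256)
```

The class axis disappears after argmax, so the prediction has one integer per pixel, the same layout as a ground-truth mask.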
Worked example: tiny 4×4 “image”
Imagine a 4×4 grayscale scene — a white blob on black (like a small MNIST digit):
Input pixels (intensity):
0 0 0 0
0 9 9 0
0 9 9 0
0 0 0 0
Classification label: "blob"
Detection: box (1,1)-(2,2), class "blob"
Semantic segmentation mask:
0 0 0 0
0 1 1 0
0 1 1 0
0 0 0 0
(0=background, 1=foreground)

The mask has the same height and width as the input. Row 2, column 2 in the image corresponds to row 2, column 2 in the mask, always.
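The same toy scene can be written as arrays. This is only a sketch: the rule "any nonzero intensity is foreground" is an assumption that works for this example, not how real masks are produced (those come from annotators or a trained model).

```python
import numpy as np

image = np.array([
    [0, 0, 0, 0],
    [0, 9, 9, 0],
    [0, 9, 9, 0],
    [0, 0, 0, 0],
])

# Assumed rule for this toy scene only: nonzero intensity means foreground.
mask = (image > 0).astype(np.int64)

# The mask shares the image's height and width, so indices line up.
assert mask.shape == image.shape
print(mask)
# [[0 0 0 0]
#  [0 1 1 0]
#  [0 1 1 0]
#  [0 0 0 0]]
```

Index `[1, 1]` (row 2, column 2 when counting from 1) is a bright pixel in `image` and a 1 in `mask`.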
Checkpoint: If you resize the image to 8×8 but leave the mask at 4×4, what goes wrong?
Every pixel label points at the wrong location — the model learns misaligned nonsense. Image and mask must undergo the same spatial transforms.
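A small sketch of keeping image and mask in sync. The helper `upscale_nearest` is a hypothetical name written for this example; it uses nearest-neighbor repetition so that mask values stay valid class IDs. The scene is made asymmetric on purpose so a flip is visible.

```python
import numpy as np

def upscale_nearest(array, factor):
    """Nearest-neighbor upscaling: repeat each pixel `factor` times per axis."""
    return np.repeat(np.repeat(array, factor, axis=0), factor, axis=1)

# Asymmetric toy scene so that flips are visible.
image = np.array([
    [0, 0, 0, 0],
    [0, 9, 9, 9],
    [0, 9, 0, 0],
    [0, 0, 0, 0],
])
mask = (image > 0).astype(np.int64)

# Correct: resize and flip the image and the mask together.
image_8 = upscale_nearest(image, 2)
mask_8 = upscale_nearest(mask, 2)
image_flipped = image_8[:, ::-1]
mask_flipped = mask_8[:, ::-1]
aligned = np.array_equal(image_flipped > 0, mask_flipped == 1)

# Incorrect: flip only the image and keep the unflipped mask.
misaligned = np.array_equal(image_flipped > 0, mask_8 == 1)

print(image_8.shape, mask_8.shape)  # (8, 8) (8, 8)
print(aligned, misaligned)          # True False
```

Applying a spatial transform to only one of the pair breaks the pixel-to-label correspondence, which is exactly the failure described in the checkpoint answer.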
Semantic segmentation
Semantic = “what class is this pixel?” — not “which instance?”
- All pixels labeled person share the same ID, even if three people stand in frame.
- Sky, road, grass are usually stuff classes: large regions without instance IDs.
- Common datasets: Cityscapes (driving), ADE20K (scenes), medical organ scans.
When it is enough: blur everything that is not person; colorize road vs sidewalk; measure tumor area.
When it fails: you need to count individuals or track person A separately from person B — that requires instance segmentation.
Instance segmentation
Instance = separate mask per object, even for the same class.
- Person 1 → purple mask, person 2 → cyan mask (see figure above).
- Classic pipeline: Mask R-CNN — detect boxes with Faster R-CNN, then a small mask head predicts a binary mask inside each box.
- More GPU memory and annotation time than semantic-only.
| | Semantic | Instance |
|---|---|---|
| Two overlapping people | All their pixels = person (one region) | Two distinct masks |
| Annotation | Paint by class color | Paint + separate object IDs |
| Typical use | Scene parsing | Robotics, counting, AR |
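
The difference can be sketched with arrays. Assume two hypothetical person instances stored as a stack of binary masks; collapsing them into a semantic mask keeps the class but loses the count. The class ID `PERSON = 1` is an assumption for this example.

```python
import numpy as np

# Two hypothetical "person" instances in a 4x6 scene, one binary mask each.
instance_masks = np.zeros((2, 4, 6), dtype=bool)
instance_masks[0, 1:3, 0:3] = True  # person 1
instance_masks[1, 1:3, 3:6] = True  # person 2, touching person 1

PERSON = 1  # assumed class ID; 0 is background

# Semantic view: any pixel covered by any instance gets the class ID.
semantic = np.where(instance_masks.any(axis=0), PERSON, 0)

print(len(instance_masks))              # 2
print(np.unique(semantic).tolist())     # [0, 1]
print(int((semantic == PERSON).sum()))  # 12
```

The semantic mask is a single 12-pixel person region. Nothing in it says whether that region is one person or two, which is why counting needs instance masks.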
Panoptic segmentation
Real scenes mix stuff and things:
- Stuff — sky, water, road (no clear instances).
- Things — cars, people, dogs (countable instances).
Panoptic segmentation assigns:
- A semantic label to every pixel, including amorphous stuff regions, and
- An instance ID to each countable object.
Autonomous driving benchmarks care because the two kinds of error are distinct failure modes: confusing road with sidewalk is a semantic error, while merging two pedestrians into one blob is an instance error.
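One way to picture a panoptic label is as two aligned maps, a class ID and an instance ID per pixel. The sketch below is illustrative only: the class IDs and the `class_id * 1000 + instance_id` packing are assumptions made for this example, and real datasets define their own encodings.

```python
import numpy as np

# Assumed class IDs: 0 = sky (stuff), 1 = road (stuff), 2 = car (thing).
semantic = np.array([
    [0, 0, 0, 0, 0, 0],
    [2, 2, 0, 0, 2, 2],
    [2, 2, 1, 1, 2, 2],
    [1, 1, 1, 1, 1, 1],
])
# Instance IDs: 0 for stuff pixels, 1..K for each countable object.
instance = np.array([
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 2, 2],
    [1, 1, 0, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
])

# Assumed packing for illustration: one integer per pixel.
panoptic = semantic * 1000 + instance

segments = np.unique(panoptic).tolist()
num_cars = len(np.unique(instance[semantic == 2]))

print(segments)  # [0, 1000, 2001, 2002]
print(num_cars)  # 2
```

Sky and road each form one segment, while the two cars share a class but keep separate IDs.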
Real-world walkthrough: portrait mode
When a phone blurs the background behind a person:
- Camera frame enters a segmentation model (often person vs background, sometimes hair/subject refinement).
- Model outputs H×W mask — high values = person, low = background.
- App blurs pixels where mask ≈ background; keeps person sharp.
- Runs on-device for privacy (your face never leaves the phone for this step).
That is binary semantic segmentation in production — same family of models you will train in the project, different dataset and resolution.
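The blur step can be sketched as a mask-weighted blend of the sharp frame and a blurred copy. Everything here is a simplified assumption: a grayscale frame of random values, a hand-drawn mask standing in for model output, and a hypothetical `box_blur` helper instead of a production-quality blur.

```python
import numpy as np

def box_blur(image, radius=1):
    """Average each pixel over a (2r+1)x(2r+1) window, with edge padding."""
    padded = np.pad(image, radius, mode="edge")
    size = 2 * radius + 1
    height, width = image.shape
    out = np.zeros((height, width), dtype=np.float64)
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + height, dx:dx + width]
    return out / (size * size)

rng = np.random.default_rng(0)
frame = rng.uniform(0, 255, size=(6, 6))  # stand-in grayscale camera frame

# Stand-in for the model's output: 1.0 = person, 0.0 = background.
person_mask = np.zeros((6, 6))
person_mask[2:5, 2:4] = 1.0

blurred = box_blur(frame)
# Keep the person sharp, take the blurred pixel everywhere else.
portrait = person_mask * frame + (1.0 - person_mask) * blurred
```

With a soft mask (values between 0 and 1) the same formula blends the two images near the boundary, which tends to look less harsh than a hard cutout.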
| App feature | Likely segmentation type |
|---|---|
| Portrait blur | Person vs background (semantic) |
| Object cutout in editor | Instance or semantic + refinement |
| Style transfer on face only | Segmentation-aware masking |
| Medical CT organ outline | Semantic (organ class per voxel) |
Detection vs segmentation (do not confuse them)
| | Detection | Segmentation |
|---|---|---|
| Output | Rectangles | Pixel masks |
| Shape | Coarse — box includes background pixels | Tight — follows object boundary |
| Good for | Counting, tracking init, fast preview | Editing, blur, precise area |
Many pipelines use both: detector proposes regions, segmenter refines boundaries (Mask R-CNN).
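To see how much background a box can include, the sketch below derives the tightest box around a hypothetical triangular object and compares areas. The mask is made up for illustration.

```python
import numpy as np

# A hypothetical triangular object: 1 = object, 0 = background.
mask = np.array([
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
])

# Tightest axis-aligned box that contains every object pixel.
rows, cols = np.nonzero(mask)
top, bottom = int(rows.min()), int(rows.max())
left, right = int(cols.min()), int(cols.max())

box_area = (bottom - top + 1) * (right - left + 1)
mask_area = int(mask.sum())

print((top, left, bottom, right))  # (1, 1, 3, 3)
print(box_area, mask_area)         # 9 6
```

Even the tightest box covers 9 pixels for a 6-pixel object, so a third of the box is background. The mask has no such slack.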
Annotation cost — why this module matters for careers
| Task | Labels per image (order of magnitude) |
|---|---|
| Classification | 1 |
| Detection | 5–50 boxes × 5 numbers |
| Segmentation | H × W (millions for HD) |
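
The orders of magnitude in the table can be checked with a few lines of arithmetic. The 1920×1080 frame size is an assumed example of "HD", and "5 numbers" per box is read here as 4 coordinates plus 1 class label.

```python
# Rough label counts per image, following the table above.
classification_labels = 1

numbers_per_box = 4 + 1  # 4 box coordinates + 1 class label
detection_labels_low = 5 * numbers_per_box
detection_labels_high = 50 * numbers_per_box

# Assumed HD frame size: one class ID per pixel.
segmentation_labels = 1920 * 1080

print(classification_labels)                        # 1
print(detection_labels_low, detection_labels_high)  # 25 250
print(segmentation_labels)                          # 2073600
```

Annotators do not click each pixel individually, but the mask they paint still has to be correct at every one of those positions.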
Labeling tools (Labelbox, CVAT, Roboflow) let annotators “paint” masks. Still slow. That is why practitioners use:
- Pretrained encoders with fine-tuned heads
- Smaller crops (256×256) for learning
- Architectures like U-Net that work with hundreds, not millions, of training images
Common beginner mistakes
| Mistake | Symptom |
|---|---|
| Treating mask as independent of image | Striped or shifted predictions |
| Using bilinear resize on mask labels | Blended invalid class IDs (e.g. 0.7 = not a class) |
| Reporting pixel accuracy only | “99% accurate” model that never finds the pet |
| Expecting instance behavior from semantic head | Two people merged into one blob |
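
The pixel-accuracy trap is easy to reproduce. In this sketch the pet covers an assumed 1% of a 100×100 image and a deliberately useless model predicts background everywhere; the `iou` helper is a plain intersection-over-union written for this example.

```python
import numpy as np

# Hypothetical ground truth: the pet covers a 10x10 patch (1% of pixels).
target = np.zeros((100, 100), dtype=np.int64)
target[45:55, 45:55] = 1

# A useless model that predicts background (class 0) everywhere.
prediction = np.zeros_like(target)

def iou(pred, true, class_id):
    """Intersection over union for one class; NaN if the class is absent from both."""
    pred_c = pred == class_id
    true_c = true == class_id
    union = np.logical_or(pred_c, true_c).sum()
    if union == 0:
        return float("nan")
    return float(np.logical_and(pred_c, true_c).sum() / union)

pixel_accuracy = float((prediction == target).mean())
pet_iou = iou(prediction, target, class_id=1)

print(pixel_accuracy)  # 0.99
print(pet_iou)         # 0.0
```

Accuracy rewards the model for the easy background pixels, while IoU for the pet class exposes that not a single pet pixel was found.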
Checkpoint
- For a 128×128 RGB image and 3-class segmentation, what is the shape of the target mask tensor (N, ?, ?)?
- Semantic vs instance: for two dogs in one photo, how many dog classes vs instances?
- Why is portrait blur a segmentation problem, not classification?
Answers (check yourself): (1) (N, 128, 128) integer class IDs. (2) One class dog, two instances. (3) You need per-pixel person vs background, not one label for the whole frame.