Module 5 — Image segmentation

What is image segmentation?

Task ladder, 4×4 worked example, semantic vs instance vs panoptic, portrait-mode walkthrough, and annotation cost.

~85 min read + exercises

Before we begin

So far in this course, most models output one decision per input:

  • Regression → one number
  • Spam / MNIST → one class label

Image segmentation breaks that pattern. The model outputs a label for every pixel — a dense map aligned with the image grid. You are teaching the network to paint regions: this pixel is pet, this pixel is background, this pixel is border.

That single change — from sparse to dense output — drives different architectures (encoder–decoder, U-Net), different metrics (IoU, Dice), and much more expensive labeling.

Figure

From one label to per-pixel labels

[Figure panels, street scene: Classification (1 label per image) → Detection (boxes + labels: person, person, car) → Semantic seg. (class per pixel) → Instance seg. (mask per object)]
Each step adds spatial detail: one tag → boxes → class-colored pixels → separate instance colors.

What you will learn

  • Contrast classification, detection, and segmentation with output shapes.
  • Define semantic, instance, and panoptic segmentation.
  • Walk through a portrait-mode example end to end.
  • Explain why masks must be pixel-aligned with images.
  • Understand why segmentation labels cost more than classification labels.


The vision task ladder

Computer vision tasks are often ordered by how much spatial detail the output requires:

| Task | Output | Shape (typical) | Question |
|---|---|---|---|
| Classification | One class | (batch, num_classes) logits | “What is it?” |
| Detection | Boxes + classes | (N, 4) boxes + (N,) labels | “Where are they?” (rectangles) |
| Semantic segmentation | Class per pixel | (batch, C, H, W) logits | “What class is each pixel?” |
| Instance segmentation | Mask per object | N masks, each (H, W) | “Which pixels belong to object k?” |

Figure

Output shape changes

[Figure panels: classification → 1 label (e.g. cat); segmentation → H×W mask (label per pixel)]
Segmentation keeps the 2D grid — every spatial location gets its own prediction.

Key idea: segmentation is dense prediction. If the input is 256×256 RGB, the output is often a 256×256 map of class IDs (or C×256×256 logits, one score per class per pixel, before argmax over the class axis).
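To make the shapes concrete, here is a minimal sketch in plain Python (nested lists instead of tensors) of turning per-pixel logits into a class-ID map. The 2×2 image, the three classes, the logit values, and the helper name `argmax_per_pixel` are all made up for illustration, and the sketch assumes a channel-first layout (C, H, W).

```python
def argmax_per_pixel(logits):
    """logits: nested list shaped [C][H][W]. Returns an [H][W] map of class IDs."""
    num_classes = len(logits)
    height = len(logits[0])
    width = len(logits[0][0])
    return [
        [
            # Pick the class whose score is highest at this pixel.
            max(range(num_classes), key=lambda c: logits[c][y][x])
            for x in range(width)
        ]
        for y in range(height)
    ]


# Hypothetical logits for a 2x2 image and 3 classes (C=3, H=2, W=2).
logits = [
    [[2.0, 0.1], [0.3, 0.2]],  # class 0 scores
    [[0.5, 1.9], [0.1, 0.4]],  # class 1 scores
    [[0.1, 0.2], [1.5, 0.9]],  # class 2 scores
]

class_map = argmax_per_pixel(logits)
print(class_map)  # [[0, 1], [2, 2]]
```

The class axis disappears after the argmax: C×H×W scores become one H×W grid of integer IDs, the same spatial size as the input.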


Worked example: tiny 4×4 “image”

Imagine a 4×4 grayscale scene — a white blob on black (like a small MNIST digit):

```text
Input pixels (intensity):
0  0  0  0
0  9  9  0
0  9  9  0
0  0  0  0

Classification label:        "blob"
Detection:                   box (1,1)-(2,2), class "blob"
Semantic segmentation mask:  0 0 0 0
                             0 1 1 0
                             0 1 1 0
                             0 0 0 0
                             (0=background, 1=foreground)
```

The mask has the same height and width as the input. Row 2, column 2 in the image corresponds to row 2, column 2 in the mask — always.

Checkpoint: If you resize the image to 8×8 but leave the mask at 4×4, what goes wrong?

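The shapes no longer match, so a label no longer lines up with the pixel it describes. Depending on the framework, training either fails with a shape error or, worse, silently learns from misaligned labels. Image and mask must undergo the same spatial transforms.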
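As a rough illustration of "same spatial transforms", the sketch below upscales the 4×4 image and its mask together with one shared function. It uses plain Python lists and a hypothetical helper `upscale_nearest`; a real pipeline would use an image library, but the principle is the same.

```python
def upscale_nearest(grid, factor):
    """Nearest-neighbor upscale of a 2D list by an integer factor."""
    return [
        [grid[y // factor][x // factor] for x in range(len(grid[0]) * factor)]
        for y in range(len(grid) * factor)
    ]


image = [
    [0, 0, 0, 0],
    [0, 9, 9, 0],
    [0, 9, 9, 0],
    [0, 0, 0, 0],
]
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

# Apply the SAME transform to both, so pixel (y, x) still matches label (y, x).
image_8 = upscale_nearest(image, 2)
mask_8 = upscale_nearest(mask, 2)

# Every bright pixel is labeled foreground and every dark pixel background.
aligned = all(
    (image_8[y][x] > 0) == (mask_8[y][x] == 1)
    for y in range(8)
    for x in range(8)
)
print(len(mask_8), len(mask_8[0]), aligned)  # 8 8 True

# Nearest-neighbor only copies existing values, so no new label IDs appear.
print(sorted({v for row in mask_8 for v in row}))  # [0, 1]
```

Nearest-neighbor is used for the mask because it only copies existing labels; an averaging interpolation would produce fractional values that are not valid class IDs.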


Semantic segmentation

Semantic = “what class is this pixel?” — not “which instance?”

  • All pixels labeled person share the same ID, even if three people stand in frame.
  • Sky, road, grass are usually stuff classes — large regions without instance IDs.
  • Common datasets: Cityscapes (driving), ADE20K (scenes), and medical organ-scan datasets.

When it is enough: blur everything that is not person; colorize road vs sidewalk; measure tumor area.

When it fails: you need to count individuals or track person A separately from person B — that requires instance segmentation.
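The "measure tumor area" use case reduces to counting pixels per class. The sketch below is a minimal illustration on the 4×4 mask from the worked example; the helper name `class_pixel_counts` and the physical pixel size are hypothetical values chosen only to show the arithmetic.

```python
from collections import Counter


def class_pixel_counts(mask):
    """Count how many pixels carry each class ID in a 2D mask."""
    return Counter(v for row in mask for v in row)


mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

counts = class_pixel_counts(mask)

# Assumption for illustration: each pixel covers 0.25 square millimeters.
pixel_area_mm2 = 0.25
foreground_area_mm2 = counts[1] * pixel_area_mm2

print(counts[1], foreground_area_mm2)  # 4 1.0
```

Note that the count says nothing about how many separate objects make up those pixels, which is exactly the limitation described above.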


Instance segmentation

Instance = separate mask per object, even for the same class.

  • Person 1 → purple mask, person 2 → cyan mask (see figure above).
  • Classic pipeline: Mask R-CNN — detect boxes with Faster R-CNN, then a small mask head predicts a binary mask inside each box.
  • Costs more GPU memory and annotation time than semantic-only segmentation.

| Aspect | Semantic | Instance |
|---|---|---|
| Two overlapping people | Both people's pixels = person | Two distinct masks |
| Annotation | Paint by class color | Paint + separate object IDs |
| Typical use | Scene parsing | Robotics, counting, AR |
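The difference can be seen by collapsing an instance map into a semantic mask. This is a minimal sketch with a made-up 3×5 scene of two touching people; the ID convention (0 = background, 1 and 2 = person instances) is an assumption for the example, not a standard.

```python
# Hypothetical instance map: 0 = background, 1 and 2 = two different people.
instance_map = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 2, 2],
    [0, 0, 0, 2, 2],
]

PERSON = 1  # class ID used in the semantic mask

# Semantic view: every person pixel gets the same class ID.
semantic_mask = [[PERSON if v > 0 else 0 for v in row] for row in instance_map]

# Instance view: distinct nonzero IDs can be counted.
num_instances = len({v for row in instance_map for v in row if v > 0})
semantic_labels = sorted({v for row in semantic_mask for v in row})

print(num_instances)    # 2
print(semantic_labels)  # [0, 1]
print(semantic_mask[1])  # [0, 1, 1, 1, 1]
```

In the middle row the two people touch, so the semantic mask shows one unbroken run of person pixels and the count of individuals can no longer be recovered from it.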

Panoptic segmentation

Real scenes mix stuff and things:

  • Stuff — sky, water, road (no clear instances).
  • Things — cars, people, dogs (countable instances).

Panoptic segmentation assigns:

  • A semantic label to amorphous regions, and
  • An instance ID to each countable object.

Autonomous driving benchmarks care because confusing road vs sidewalk (semantic) and merging two pedestrians into one blob (instance) cause different failure modes.
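One way to picture a panoptic output is a (class, instance ID) pair per pixel. The sketch below is an illustrative encoding only, not the format of any particular benchmark; the class IDs, the `THING_CLASSES` set, and the helper `to_panoptic` are hypothetical.

```python
# Hypothetical class IDs for a tiny driving scene.
ROAD, SKY, CAR = 0, 1, 2
THING_CLASSES = {CAR}  # countable classes; everything else is "stuff"

semantic = [
    [SKY, SKY, SKY, SKY],
    [ROAD, CAR, CAR, CAR],
    [ROAD, CAR, CAR, CAR],
]
# Instance IDs are only meaningful where the class is a "thing".
instance = [
    [0, 0, 0, 0],
    [0, 1, 2, 2],
    [0, 1, 2, 2],
]


def to_panoptic(semantic, instance):
    """Pair each pixel's class with an instance ID (0 for stuff classes)."""
    return [
        [
            (cls, inst if cls in THING_CLASSES else 0)
            for cls, inst in zip(sem_row, inst_row)
        ]
        for sem_row, inst_row in zip(semantic, instance)
    ]


panoptic = to_panoptic(semantic, instance)
segments = sorted({seg for row in panoptic for seg in row})
print(segments)  # [(0, 0), (1, 0), (2, 1), (2, 2)]
```

The four segments are road, sky, and two separate cars: stuff classes contribute one segment each, while the thing class contributes one segment per instance.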


Real-world walkthrough: portrait mode

When a phone blurs the background behind a person:

  1. The camera frame enters a segmentation model (often person vs background, sometimes with hair/subject refinement).
  2. The model outputs an H×W mask: high values = person, low values = background.
  3. The app blurs pixels where the mask ≈ background and keeps the person sharp.
  4. The step typically runs on-device for privacy (your face never leaves the phone for this step).

That is binary semantic segmentation in production — same family of models you will train in the project, different dataset and resolution.
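The blur-and-composite step can be sketched on a tiny grayscale image. This is a simplified illustration under stated assumptions: a 3×3 mean filter stands in for the real blur, a hard threshold of 0.5 stands in for whatever blending a phone actually uses, and the function names and pixel values are hypothetical.

```python
def box_blur(image):
    """3x3 mean filter; border pixels average over the neighbors that exist."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
            ]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def portrait_blur(image, mask, threshold=0.5):
    """Keep pixels where mask >= threshold; use the blurred value elsewhere."""
    blurred = box_blur(image)
    return [
        [
            image[y][x] if mask[y][x] >= threshold else blurred[y][x]
            for x in range(len(image[0]))
        ]
        for y in range(len(image))
    ]


# Hypothetical grayscale frame and model output (high = person).
image = [
    [10, 200, 10, 200],
    [10, 90, 90, 200],
    [10, 90, 90, 10],
]
mask = [
    [0.1, 0.0, 0.1, 0.0],
    [0.1, 0.9, 0.8, 0.1],
    [0.0, 0.9, 0.9, 0.2],
]

result = portrait_blur(image, mask)
print(result[1][1], result[1][2])  # 90 90   (person pixels unchanged)
print(result[0][0])                # 77.5    (background pixel averaged)
```

Because the mask decides pixel by pixel which value to keep, a mask that is shifted even slightly produces sharp background halos or blurred edges on the subject.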

| App feature | Likely segmentation type |
|---|---|
| Portrait blur | Person vs background (semantic) |
| Object cutout in editor | Instance or semantic + refinement |
| Style transfer on face only | Segmentation-aware masking |
| Medical CT organ outline | Semantic (organ class per voxel) |

Detection vs segmentation (do not confuse them)

| Aspect | Detection | Segmentation |
|---|---|---|
| Output | Rectangles | Pixel masks |
| Shape | Coarse: box includes background pixels | Tight: follows object boundary |
| Good for | Counting, tracking init, fast preview | Editing, blur, precise area |

Many pipelines use both: detector proposes regions, segmenter refines boundaries (Mask R-CNN).
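To see why a box is coarser than a mask, the sketch below derives the tightest box around a small mask and counts how many pixels inside the box are actually background. The mask and the helper `mask_to_box` are made up for illustration; boxes are given as inclusive (row, col) corners, as in the 4×4 worked example.

```python
def mask_to_box(mask):
    """Tightest box around nonzero pixels: (top, left, bottom, right), inclusive."""
    ys = [y for y, row in enumerate(mask) for v in row if v]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    return min(ys), min(xs), max(ys), max(xs)


# Hypothetical triangular object.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

top, left, bottom, right = mask_to_box(mask)
box_area = (bottom - top + 1) * (right - left + 1)
object_pixels = sum(v for row in mask for v in row)
background_in_box = box_area - object_pixels

print((top, left, bottom, right))  # (1, 1, 3, 3)
print(box_area, object_pixels, background_in_box)  # 9 6 3
```

Here a third of the box is background, which is harmless for counting but visible as an artifact if the box were used for blur or cutout.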


Annotation cost — why this module matters for careers

| Task | Labels per image (order of magnitude) |
|---|---|
| Classification | 1 |
| Detection | 5–50 boxes × 5 numbers |
| Segmentation | H × W (millions for HD) |
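The orders of magnitude in the table can be checked with a few lines of arithmetic. The sketch assumes "HD" means a 1920×1080 frame and takes the upper end of the detection range; both are assumptions for illustration.

```python
# Labels a human annotator must produce per image, per task.
classification_labels = 1

# Upper end of the table's range: 50 boxes, 5 numbers per box.
detection_labels = 50 * 5

# Assumption: "HD" means a 1920x1080 frame, one label per pixel.
segmentation_labels_hd = 1920 * 1080

# A 256x256 training crop, for comparison.
segmentation_labels_crop = 256 * 256

print(detection_labels)          # 250
print(segmentation_labels_hd)    # 2073600
print(segmentation_labels_crop)  # 65536
print(segmentation_labels_hd // detection_labels)  # 8294
```

Annotators paint regions rather than clicking every pixel, so the real effort does not scale linearly with these counts, but the gap of several thousand times per image explains why mask labeling dominates project budgets.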

Labeling tools (Labelbox, CVAT, Roboflow) let annotators “paint” masks, but the work is still slow. That is why practitioners use:

  • Pretrained encoders and fine-tune heads
  • Smaller crops (256×256) for learning
  • Architectures like U-Net that work with hundreds, not millions, of training images

Common beginner mistakes

| Mistake | Symptom |
|---|---|
| Treating mask as independent of image | Striped or shifted predictions |
| Using bilinear resize on mask labels | Blended invalid class IDs (e.g. 0.7 = not a class) |
| Reporting pixel accuracy only | “99% accurate” model that never finds the pet |
| Expecting instance behavior from semantic head | Two people merged into one blob |
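The pixel-accuracy trap is easy to reproduce. In this sketch a made-up 10×10 target contains a single pet pixel and the "model" predicts background everywhere; the function names are hypothetical and IoU is computed for one class as intersection over union of the predicted and true pixel sets.

```python
def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted class equals the target class."""
    total = sum(len(row) for row in target)
    correct = sum(
        p == t for pr, tr in zip(pred, target) for p, t in zip(pr, tr)
    )
    return correct / total


def iou(pred, target, cls):
    """Intersection over union for one class; NaN if the class never appears."""
    inter = union = 0
    for pr, tr in zip(pred, target):
        for p, t in zip(pr, tr):
            inter += (p == cls) and (t == cls)
            union += (p == cls) or (t == cls)
    return inter / union if union else float("nan")


# Target: one "pet" pixel (class 1) on a 10x10 background (class 0).
target = [[0] * 10 for _ in range(10)]
target[4][5] = 1

# A model that always predicts background.
pred = [[0] * 10 for _ in range(10)]

print(pixel_accuracy(pred, target))  # 0.99
print(iou(pred, target, cls=1))      # 0.0
```

Accuracy rewards the model for the 99 easy background pixels, while the pet-class IoU of zero shows it found nothing, which is why overlap metrics such as IoU and Dice are reported for segmentation.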

Checkpoint

  1. For a 128×128 RGB image and 3-class segmentation, what is the shape of the target mask tensor (N, ?, ?)?
  2. Semantic vs instance: two dogs in one photo — how many dog classes vs instances?
  3. Why is portrait blur a segmentation problem, not classification?

Answers (check yourself): (1) (N, 128, 128) integer class IDs. (2) One class dog, two instances. (3) You need per-pixel person vs background, not one label for the whole frame.


What's next

Lesson 2 — Encoder–decoder & dense prediction