What is image segmentation?

Before we begin

So far in this course, most models output one decision per input:

Regression → one number
Spam / MNIST → one class label

Image segmentation breaks that pattern. The model outputs a label for every pixel — a dense map aligned with the image grid. You are teaching the network to paint regions: this pixel is pet, this pixel is background, this pixel is border.

That single change, from one output per image to one output per pixel, drives different architectures (encoder–decoder, U-Net), different metrics (IoU, Dice), and much more expensive labeling.

Figure

From one label to per-pixel labels

Each step adds spatial detail: one tag → boxes → class-colored pixels → separate instance colors.

What you will learn

Contrast classification, detection, and segmentation with output shapes.
Define semantic, instance, and panoptic segmentation.
Walk through a portrait-mode example end to end.
Explain why masks must be pixel-aligned with images.
Understand why segmentation labels cost more than classification labels.

The vision task ladder

Computer vision tasks are often ordered by how much spatial detail the output requires:

Task	Output	Shape (typical)	Question
Classification	One class	`(batch, num_classes)` logits	“What is it?”
Detection	Boxes + classes	`(N, 4)` boxes + `(N,)` labels	“Where are they?” (rectangles)
Semantic segmentation	Class per pixel	`(batch, C, H, W)` logits	“What class is each pixel?”
Instance segmentation	Mask per object	N masks, each `(H, W)`	“Which pixels belong to object k?”

Figure

Output shape changes

Segmentation keeps the 2D grid — every spatial location gets its own prediction.

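Key idea: segmentation is dense prediction. If the input is a 256×256 RGB image, the output is typically a 256×256 grid of class IDs, obtained by taking the argmax over the class axis of C×256×256 logits (the `(batch, C, H, W)` layout in the table above).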
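To make the logits-to-class-IDs step concrete, here is a minimal sketch in plain Python. The logit values and the helper name `argmax_over_classes` are made up for illustration; a real pipeline would typically do this with a single argmax call in NumPy or PyTorch.

```python
# Hypothetical logits for C=3 classes on a 2x2 image, laid out as [C][H][W].
logits = [
    [[2.0, 0.1], [0.3, 0.2]],  # class 0 (background)
    [[0.5, 1.5], [0.1, 0.4]],  # class 1 (pet)
    [[0.1, 0.2], [0.9, 0.3]],  # class 2 (border)
]


def argmax_over_classes(logits):
    """Collapse [C][H][W] scores into an [H][W] grid of class IDs."""
    num_classes = len(logits)
    height, width = len(logits[0]), len(logits[0][0])
    return [
        [max(range(num_classes), key=lambda c: logits[c][y][x]) for x in range(width)]
        for y in range(height)
    ]


class_ids = argmax_over_classes(logits)
print(class_ids)  # [[0, 1], [2, 1]]
```

The class axis disappears and the spatial grid survives, which is the defining property of dense prediction.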

Worked example: tiny 4×4 “image”

Imagine a 4×4 grayscale scene — a white blob on black (like a small MNIST digit):

Input pixels (intensity):
0  0  0  0
0  9  9  0
0  9  9  0
0  0  0  0
 
Classification label:        "blob"
Detection:                   box (1,1)-(2,2), class "blob"
Semantic segmentation mask:  0 0 0 0
                             0 1 1 0
                             0 1 1 0
                             0 0 0 0
                             (0=background, 1=foreground)

The mask has the same height and width as the input. Row 2, column 2 in the image corresponds to row 2, column 2 in the mask — always.
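The three outputs above can be reproduced from the 4×4 grid in a few lines. This is only a sketch: it derives the answers by thresholding the known pixel values (the threshold of 5 is an assumption), whereas a trained model would have to learn them.

```python
image = [
    [0, 0, 0, 0],
    [0, 9, 9, 0],
    [0, 9, 9, 0],
    [0, 0, 0, 0],
]

# Semantic mask: 1 where the pixel is bright, 0 elsewhere (threshold is an assumption).
THRESHOLD = 5
mask = [[1 if value > THRESHOLD else 0 for value in row] for row in image]

# Detection-style box: smallest rectangle covering all foreground pixels,
# written as ((top, left), (bottom, right)) with zero-based indices.
rows = [y for y, row in enumerate(mask) if any(row)]
cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
box = ((min(rows), min(cols)), (max(rows), max(cols)))

# Classification-style label: a single decision for the whole image.
label = "blob" if rows else "empty"

print(label, box)  # blob ((1, 1), (2, 2))
```

Note how the amount of output grows at each step: one string, then four numbers, then one value per pixel.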

Checkpoint: If you resize the image to 8×8 but leave the mask at 4×4, what goes wrong?

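The mask no longer lines up with the image. Most frameworks will raise a shape error in the loss; if the shapes are forced to match incorrectly, every pixel label points at the wrong location and the model learns misaligned nonsense. Image and mask must undergo the same spatial transforms.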
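One way to keep the pair aligned is to route both through a single function. The sketch below uses hypothetical helpers `resize_nearest` and `resize_pair` written in plain Python; libraries provide their own paired transforms, and this is not meant to mirror any of them.

```python
def resize_nearest(grid, new_h, new_w):
    """Nearest-neighbor resize: each output cell copies one input cell."""
    old_h, old_w = len(grid), len(grid[0])
    return [
        [grid[y * old_h // new_h][x * old_w // new_w] for x in range(new_w)]
        for y in range(new_h)
    ]


def resize_pair(image, mask, new_h, new_w):
    """Apply the same spatial transform to image and mask so labels stay aligned."""
    return resize_nearest(image, new_h, new_w), resize_nearest(mask, new_h, new_w)


image = [[0, 0, 0, 0], [0, 9, 9, 0], [0, 9, 9, 0], [0, 0, 0, 0]]
mask = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]

big_image, big_mask = resize_pair(image, mask, 8, 8)

# Every bright pixel is still labeled foreground after the resize.
aligned = all(
    (big_image[y][x] == 9) == (big_mask[y][x] == 1)
    for y in range(8)
    for x in range(8)
)
print(aligned)  # True
```

Nearest-neighbor is used for the mask because it only copies existing labels. Interpolating schemes such as bilinear average neighboring labels and can produce values like 0.5 that are not valid class IDs.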

Semantic segmentation

Semantic = “what class is this pixel?” — not “which instance?”

All pixels labeled person share the same ID, even if three people stand in frame.
Sky, road, grass are usually stuff classes — large regions without instance IDs.
Common data sources: Cityscapes (driving), ADE20K (scenes), and organ-labeled medical scans.

When it is enough: blur everything that is not person; colorize road vs sidewalk; measure tumor area.

When it fails: you need to count individuals or track person A separately from person B — that requires instance segmentation.

Instance segmentation

Instance = separate mask per object, even for the same class.

Person 1 → purple mask, person 2 → cyan mask (see figure above).
Classic pipeline: Mask R-CNN — detect boxes with Faster R-CNN, then a small mask head predicts a binary mask inside each box.
More GPU memory and annotation time than semantic-only.

	Semantic	Instance
Two overlapping people	One merged `person` region	Two distinct masks
Annotation	Paint by class color	Paint + separate object IDs
Typical use	Scene parsing	Robotics, counting, AR
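The difference between the two columns shows up clearly in data. The following sketch uses a made-up 3×6 scene with two people standing side by side; the class ID `PERSON = 1` and the instance numbering are assumptions for illustration.

```python
# Instance map: 0 = background, 1 = person A, 2 = person B (hypothetical scene).
instance_map = [
    [0, 1, 1, 2, 2, 0],
    [0, 1, 1, 2, 2, 0],
    [0, 1, 1, 2, 2, 0],
]
PERSON = 1  # assumed class ID for "person"

# Semantic view: every person pixel collapses to the same class ID.
semantic_map = [[PERSON if inst > 0 else 0 for inst in row] for row in instance_map]

# Instance view: one binary mask per object ID.
instance_ids = sorted({inst for row in instance_map for inst in row if inst > 0})
instance_masks = {
    inst: [[1 if cell == inst else 0 for cell in row] for row in instance_map]
    for inst in instance_ids
}

num_people = len(instance_ids)
print(semantic_map[0])  # [0, 1, 1, 1, 1, 0]
print(num_people)       # 2
```

Going from the instance map to the semantic map is a simple lookup. Going the other way is not possible here: the two people touch, so the semantic map contains one connected `person` region and the count is lost.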

Panoptic segmentation

Real scenes mix stuff and things:

Stuff — sky, water, road (no clear instances).
Things — cars, people, dogs (countable instances).

Panoptic segmentation assigns:

A semantic label to every pixel, stuff and things alike, and
An instance ID to each pixel that belongs to a countable object.

Autonomous driving benchmarks care about both because confusing road with sidewalk (a semantic error) and merging two pedestrians into one blob (an instance error) are different failure modes.
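A panoptic label can be pictured as a (class, instance) pair per pixel. The sketch below combines a semantic map and an instance map for a made-up street scene; the class IDs and the convention that stuff pixels carry instance 0 are assumptions for illustration, and real datasets each define their own encoding.

```python
SKY, ROAD, CAR = 0, 1, 2   # assumed class IDs
THING_CLASSES = {CAR}      # countable classes; everything else is stuff

semantic_map = [
    [SKY, SKY, SKY, SKY],
    [CAR, CAR, CAR, CAR],
    [ROAD, ROAD, ROAD, ROAD],
]
instance_map = [
    [0, 0, 0, 0],
    [1, 1, 2, 2],  # two cars parked bumper to bumper
    [0, 0, 0, 0],
]

# Panoptic label per pixel: (class, instance). Stuff pixels always get instance 0.
panoptic_map = [
    [(cls, inst if cls in THING_CLASSES else 0) for cls, inst in zip(sem_row, inst_row)]
    for sem_row, inst_row in zip(semantic_map, instance_map)
]

segments = sorted({pair for row in panoptic_map for pair in row})
print(segments)  # [(0, 0), (1, 0), (2, 1), (2, 2)]
```

The scene ends up with four segments: sky, road, and two separate cars, even though the cars share a class and touch each other.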

Real-world walkthrough: portrait mode

When a phone blurs the background behind a person:

Camera frame enters a segmentation model (often person vs background, sometimes hair/subject refinement).
Model outputs H×W mask — high values = person, low = background.
App blurs pixels where mask ≈ background; keeps person sharp.
Runs on-device for privacy (your face never leaves the phone for this step).

That is binary semantic segmentation in production — same family of models you will train in the project, different dataset and resolution.
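The compositing step of the walkthrough can be sketched on a tiny grayscale frame. Everything here is illustrative: the 3×3 `box_blur` stands in for whatever blur a phone actually applies, the 0.5 threshold is an assumption, and the mask values are invented rather than produced by a model.

```python
def box_blur(image):
    """3x3 mean filter with edge clamping; a stand-in for a real background blur."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            neighbors = [
                image[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            row.append(sum(neighbors) / 9)
        out.append(row)
    return out


def portrait_blur(image, person_mask, threshold=0.5):
    """Keep pixels where the mask says 'person'; use blurred pixels elsewhere."""
    blurred = box_blur(image)
    return [
        [
            image[y][x] if person_mask[y][x] >= threshold else blurred[y][x]
            for x in range(len(image[0]))
        ]
        for y in range(len(image))
    ]


frame = [
    [10, 200, 10, 200],
    [200, 90, 90, 10],
    [10, 90, 90, 200],
    [200, 10, 200, 10],
]
person_mask = [  # high values = person, low values = background
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.8, 0.0],
    [0.0, 0.9, 0.9, 0.1],
    [0.0, 0.0, 0.1, 0.0],
]

result = portrait_blur(frame, person_mask)
print(result[1][1], result[2][2])  # 90 90
```

The four person pixels pass through unchanged while every background pixel is replaced by a local average. A hard threshold gives a visible seam in practice, which is one reason production apps add the hair and subject refinement mentioned above.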

App feature	Likely segmentation type
Portrait blur	Person vs background (semantic)
Object cutout in editor	Instance or semantic + refinement
Style transfer on face only	Segmentation-aware masking
Medical CT organ outline	Semantic (organ class per voxel)

Detection vs segmentation (do not confuse them)

	Detection	Segmentation
Output	Rectangles	Pixel masks
Shape	Coarse — box includes background pixels	Tight — follows object boundary
Good for	Counting, tracking initialization, fast preview	Editing, blur, precise area

Many pipelines use both: detector proposes regions, segmenter refines boundaries (Mask R-CNN).
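The "coarse versus tight" row of the table can be measured directly. This sketch takes a made-up triangular mask, computes its bounding box with a hypothetical `bounding_box` helper, and counts how many pixels inside the box are actually background.

```python
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]


def bounding_box(mask):
    """Return (top, left, bottom, right), inclusive, around all nonzero pixels."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    return min(rows), min(cols), max(rows), max(cols)


top, left, bottom, right = bounding_box(mask)
box_area = (bottom - top + 1) * (right - left + 1)
mask_area = sum(sum(row) for row in mask)
background_in_box = box_area - mask_area

print(box_area, mask_area, background_in_box)  # 9 6 3
```

A third of the box is background in this example. For counting objects that does not matter; for blurring or measuring area it does.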

Annotation cost — why this module matters for careers

Task	Labels per image (order of magnitude)
Classification	1
Detection	5–50 boxes × 5 numbers
Segmentation	H × W (about 2 million for 1920×1080)
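The gap between the rows is easy to check with arithmetic. The sketch below assumes a 1920×1080 frame and 20 boxes per image; `labels_per_image` is a hypothetical helper, not part of any labeling tool.

```python
def labels_per_image(task, height=1080, width=1920, num_boxes=20):
    """Rough count of numbers an annotator must supply for one image."""
    if task == "classification":
        return 1
    if task == "detection":
        return num_boxes * 5  # 4 box coordinates + 1 class per box
    if task == "segmentation":
        return height * width  # one class ID per pixel
    raise ValueError(f"unknown task: {task}")


for task in ("classification", "detection", "segmentation"):
    print(task, labels_per_image(task))
# classification 1
# detection 100
# segmentation 2073600
```

Annotators do not click every pixel individually, since tools offer polygons and brushes, but the count still indicates how much more information each labeled image has to carry.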

Labeling tools (Labelbox, CVAT, Roboflow) let annotators “paint” masks. Still slow. That is why practitioners use:

Pretrained encoders and fine-tune heads
Smaller crops (256×256) for learning
Architectures like U-Net that work with hundreds, not millions, of training images

Common beginner mistakes

Mistake	Symptom
Transforming the image (flip, crop, resize) without applying the same transform to the mask	Striped or shifted predictions
Using bilinear resize on mask labels	Blended invalid class IDs (e.g. 0.7 = not a class)
Reporting pixel accuracy only	“99% accurate” model that never finds the pet
Expecting instance behavior from semantic head	Two people merged into one blob
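The pixel-accuracy trap in the table is worth seeing with numbers. The sketch below builds a made-up 10×10 target containing a single pet pixel and scores a model that predicts background everywhere. The metric helpers are written from the standard definitions (IoU as intersection over union, Dice as twice the intersection over the sum of the two areas) and are not taken from a particular library.

```python
def _pairs(pred, target):
    return [(p, t) for prow, trow in zip(pred, target) for p, t in zip(prow, trow)]


def pixel_accuracy(pred, target):
    pairs = _pairs(pred, target)
    return sum(p == t for p, t in pairs) / len(pairs)


def iou(pred, target, cls):
    pairs = _pairs(pred, target)
    intersection = sum(p == cls and t == cls for p, t in pairs)
    union = sum(p == cls or t == cls for p, t in pairs)
    return intersection / union if union else float("nan")


def dice(pred, target, cls):
    pairs = _pairs(pred, target)
    intersection = sum(p == cls and t == cls for p, t in pairs)
    total = sum(p == cls for p, _ in pairs) + sum(t == cls for _, t in pairs)
    return 2 * intersection / total if total else float("nan")


BACKGROUND, PET = 0, 1
target = [[BACKGROUND] * 10 for _ in range(10)]
target[4][4] = PET                                  # one pet pixel in 100
pred = [[BACKGROUND] * 10 for _ in range(10)]       # model never predicts pet

acc = pixel_accuracy(pred, target)
pet_iou = iou(pred, target, PET)
pet_dice = dice(pred, target, PET)
print(acc, pet_iou, pet_dice)  # 0.99 0.0 0.0
```

Pixel accuracy rewards the model for the 99 easy background pixels, while the per-class IoU and Dice for `PET` are both zero. Reporting per-class overlap metrics exposes the failure that accuracy hides.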

Checkpoint

For a 128×128 RGB image and 3-class segmentation, what is the shape of the target mask tensor (N, ?, ?)?
Semantic vs instance: two dogs in one photo — how many dog classes vs instances?
Why is portrait blur a segmentation problem, not classification?

Answers (check yourself): (1) (N, 128, 128) integer class IDs. (2) One class dog, two instances. (3) You need per-pixel person vs background, not one label for the whole frame.

What's next

Lesson 2 — Encoder–decoder & dense prediction