Module 5 — Image segmentation

Segmentation losses & metrics

Per-pixel CE wiring, IoU/Dice worked examples, class imbalance traps, pet trimap classes, and what to log each epoch.

~80 min read + exercises

Before we begin

Whether you train U-Net, DeepLab, or a fine-tuned Mask R-CNN mask head, two choices drive the result: the loss you optimize and the metric you evaluate with. A forward pass on a single 256×256 image produces 65,536 × C logits (about 197,000 for C = 3). Training needs a loss that scores every pixel; evaluation needs overlap metrics, not a misleading “99% accuracy” on mostly empty background.

This lesson covers:

  • Per-pixel cross-entropy — default for semantic segmentation
  • IoU — standard validation metric
  • Dice — common in medical imaging and imbalanced foreground
  • What to log so you catch broken models early

Figure: IoU intuition. IoU = area(A ∩ B) ÷ area(A ∪ B). Used to match predicted boxes to ground truth during training and evaluation. Labels: A (prediction), B (ground truth), A ∩ B.

Overlap divided by union: 1.0 is perfect, 0.0 is no overlap.

What you will learn

  • Wire CrossEntropyLoss for (N, C, H, W) logits and (N, H, W) targets.
  • Compute IoU and Dice by hand and in code.
  • Explain why pixel accuracy fails on imbalanced masks.
  • Build a minimal evaluation loop for mIoU.


Per-pixel cross-entropy

This is the same cross-entropy formula used for MNIST classification, applied independently at each pixel.

For pixel i with target class y_i and logits z_i (length C):

```text
loss_i = -log( softmax(z_i)[y_i] )
total loss = mean (or sum) over all pixels
```
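To make the formula concrete, here is a minimal sketch that evaluates it by hand for a hypothetical two-pixel "image" with C = 3 classes. It uses plain Python lists rather than tensors, and the logits and targets are made-up illustration values, not data from the project.

```python
import math

def pixel_ce(logits, target):
    """Cross-entropy for one pixel: -log(softmax(logits)[target])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# Hypothetical values: one list of C = 3 logits per pixel.
pixel_logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
pixel_targets = [0, 2]

losses = [pixel_ce(z, y) for z, y in zip(pixel_logits, pixel_targets)]
mean_loss = sum(losses) / len(losses)  # the "mean over all pixels" reduction
```

The second pixel has uniform logits, so its loss is log(3) whatever the target is. That is a handy reference value: an untrained 3-class model should start near it.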

PyTorch expects:

```python
import torch.nn as nn

criterion = nn.CrossEntropyLoss()  # optional: ignore_index=255 to skip void pixels

logits = model(images)   # (N, C, H, W) float, raw scores (no softmax applied)
targets = masks.long()   # (N, H, W) int64, values in 0 .. C-1

loss = criterion(logits, targets)  # scalar: mean over all non-ignored pixels
```

Common bugs:

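| Bug | Symptom or fix |
| --- | --- |
| Target shape (N, 1, H, W) without squeeze | Runtime error in the loss, or silent wrong broadcasting in metric code |
| Target floats 0.0–1.0 | CE expects integer class indices; a mask rescaled to 0–1 collapses to class 0 under .long() |
| Logits (N, H, W, C) wrong dim order | Use (N, C, H, W), as produced by Conv2d heads |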

Instance note: Mask R-CNN applies BCE or CE on mask pixels inside each positive RoI — same per-pixel idea, smaller spatial crops.


Class imbalance — the accuracy trap

Pet / portrait masks are mostly background.

Example: 90% background, 10% pet.

A model that predicts background everywhere gets 90% pixel accuracy, which looks great in a spreadsheet, but its pet IoU is 0: it never finds a single pet pixel.

| Metric | What it rewards |
| --- | --- |
| Pixel accuracy | Majority class (background) |
| Foreground IoU | Overlap on the class you care about |
| Mean IoU (mIoU) | Average IoU across classes (fairer) |

Rule for this course: always report mIoU or per-class IoU on validation; treat pixel accuracy as optional.
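The 90/10 example can be reproduced in a few lines. This is a toy sketch on a flat list of 100 pixels in pure Python; the helper names `pixel_accuracy` and `binary_iou` are illustrative, not functions from the project.

```python
def pixel_accuracy(preds, targets):
    """Fraction of pixels whose predicted class equals the target class."""
    correct = sum(p == t for p, t in zip(preds, targets))
    return correct / len(targets)

def binary_iou(preds, targets, cls):
    """IoU for one class: 'pixel is cls' vs 'pixel is not cls'."""
    inter = sum(p == cls and t == cls for p, t in zip(preds, targets))
    union = sum(p == cls or t == cls for p, t in zip(preds, targets))
    return inter / union if union else float("nan")

PET, BACKGROUND = 0, 1                      # class IDs chosen for this toy example
targets = [BACKGROUND] * 90 + [PET] * 10    # 90% background, 10% pet
preds = [BACKGROUND] * 100                  # model predicts background everywhere

acc = pixel_accuracy(preds, targets)              # 0.9
pet_iou = binary_iou(preds, targets, PET)         # 0.0
bg_iou = binary_iou(preds, targets, BACKGROUND)   # 0.9
miou = (pet_iou + bg_iou) / 2                     # about 0.45
```

Pixel accuracy reports 0.9 while mIoU drops to roughly 0.45, and the per-class view shows the pet class at exactly 0.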


IoU (Intersection over Union)

Also called Jaccard index. For binary prediction mask P and ground truth G:

$$\text{IoU} = \frac{|P \cap G|}{|P \cup G|}$$

  • 1.0 — perfect overlap
  • 0.0 — no overlap
  • Union in denominator penalizes both missed pixels and false alarm pixels

Worked example (counts of pixels)

| Region | Count |
| --- | --- |
| Predicted foreground only | 10 |
| Ground truth foreground only | 10 |
| Both foreground | 40 |

Intersection = 40. Union = 10 + 10 + 40 = 60.

IoU = 40/60 ≈ 0.67
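The count arithmetic above can be checked with a small helper. The function name and argument names are illustrative; the three counts are the ones from the worked example.

```python
def iou_from_counts(both, pred_only, gt_only):
    """IoU from region counts: intersection over (intersection + both error types)."""
    union = both + pred_only + gt_only
    return both / union if union else float("nan")

iou = iou_from_counts(both=40, pred_only=10, gt_only=10)
print(round(iou, 2))  # 0.67
```

Note that false alarms (`pred_only`) and misses (`gt_only`) enter the denominator symmetrically, so either kind of error lowers IoU by the same amount.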

Multi-class mIoU

Compute IoU for each class c separately (binary: “pixel is class c” vs not), then:

```text
mIoU = mean( IoU_c for c in classes )
```

Void / ignore_index pixels are usually excluded from the average.

Code sketch

```python
def iou_per_class(preds, targets, num_classes):
    """preds, targets: (N, H, W) int64"""
    ious = []
    for c in range(num_classes):
        p = preds == c
        t = targets == c
        inter = (p & t).sum().item()
        union = (p | t).sum().item()
        if union == 0:
            ious.append(float("nan"))  # class absent in batch
        else:
            ious.append(inter / union)
    return ious  # nanmean for mIoU
```
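One way to turn the per-class idea into a minimal evaluation loop is sketched below. It is written in pure Python on flat lists of class IDs (standing in for flattened prediction and target tensors), and it makes two assumptions that are not stated in the lesson: intersections and unions are accumulated over the whole validation set instead of averaging per-batch IoUs, and void pixels are marked with `ignore_index=255`.

```python
import math

def evaluate_miou(batches, num_classes, ignore_index=255):
    """batches: iterable of (preds, targets), each a flat list of class IDs."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for preds, targets in batches:
        for p, t in zip(preds, targets):
            if t == ignore_index:
                continue           # void pixels count for no class
            if p == t:
                inter[t] += 1
                union[t] += 1
            else:
                union[p] += 1      # false positive for class p
                union[t] += 1      # false negative for class t
    ious = [i / u if u else float("nan") for i, u in zip(inter, union)]
    present = [x for x in ious if not math.isnan(x)]   # nanmean: skip absent classes
    miou = sum(present) / len(present) if present else float("nan")
    return ious, miou

# Toy validation set: two batches of four pixels each, three classes.
val_batches = [
    ([0, 0, 1, 1], [0, 1, 1, 255]),
    ([2, 2, 0, 1], [2, 1, 0, 1]),
]
ious, miou = evaluate_miou(val_batches, num_classes=3)
```

Accumulating counts before dividing avoids the per-batch NaN problem for rare classes: a class only yields NaN if it never appears in predictions or targets anywhere in the validation set.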

Dice coefficient

$$\text{Dice} = \frac{2|P \cap G|}{|P| + |G|}$$

For binary masks, Dice is the same as F1 score on pixels.

| IoU | Dice (binary) |
| --- | --- |
| Emphasizes union | Emphasizes overlap vs total pred+GT mass |
| Standard in detection/seg benchmarks | Very common in medical segmentation |
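Reusing the counts from the IoU worked example (40 pixels in both masks, 10 predicted only, 10 ground truth only) shows how the two scores relate. The sketch below also checks the identity Dice = 2·IoU / (1 + IoU), which follows from |P| + |G| = |P ∪ G| + |P ∩ G|. The function name is illustrative.

```python
def dice_from_counts(both, pred_only, gt_only):
    """Dice from region counts: 2 * intersection / (|P| + |G|)."""
    pred_total = both + pred_only   # |P|
    gt_total = both + gt_only       # |G|
    denom = pred_total + gt_total
    return 2 * both / denom if denom else float("nan")

dice = dice_from_counts(both=40, pred_only=10, gt_only=10)  # 80 / 100 = 0.8
iou = 40 / 60                                               # from the IoU example
dice_from_iou = 2 * iou / (1 + iou)                         # same value via the identity
```

For the same prediction, Dice (0.8) is higher than IoU (about 0.67). The two always rank predictions in the same order, so the numbers are not comparable across metrics but conclusions usually are.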

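Dice loss: 1 - Dice, computed on predicted probabilities rather than thresholded masks so that it stays differentiable. Its gradients push directly for overlap, and it is sometimes mixed with CE:

```text
loss = CE + λ * (1 - Dice)
```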

For the pet project, CE alone is enough to start; add Dice if foreground IoU plateaus.
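If you do add Dice later, a soft version of the loss might look like the sketch below. It is pure Python on flat lists for readability; the smoothing constant `eps`, the weight `lam` (standing in for λ), and the function names are assumptions for illustration, not values prescribed by this course.

```python
def soft_dice_loss(probs, targets, eps=1e-6):
    """probs: foreground probabilities in [0, 1]; targets: 0/1 labels (flat lists)."""
    inter = sum(p * t for p, t in zip(probs, targets))   # soft |P ∩ G|
    total = sum(probs) + sum(targets)                    # soft |P| + |G|
    dice = (2 * inter + eps) / (total + eps)             # eps avoids 0/0 on empty masks
    return 1 - dice

def combined_loss(ce, probs, targets, lam=1.0):
    """loss = CE + lam * (1 - Dice); ce is a precomputed cross-entropy value."""
    return ce + lam * soft_dice_loss(probs, targets)

perfect = soft_dice_loss([1.0, 0.0, 1.0], [1, 0, 1])   # close to 0
missed = soft_dice_loss([0.0, 0.0, 0.0], [1, 0, 1])    # close to 1
total = combined_loss(0.5, [1.0, 0.0], [1, 0], lam=0.5)
```

In a real training loop the same arithmetic would run on tensors so that gradients flow through `probs`.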


Oxford-IIIT Pet — three classes in the project

Trimap labels (after remapping in the project):

| ID | Meaning |
| --- | --- |
| 0 | Pet (foreground) |
| 1 | Background |
| 2 | Border / trimap transition |

Border pixels form a thin band, so they are easy to miss. A model can score high IoU on background and pet while border IoU stays low. A per-class IoU table in the README tells the full story.
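The raw Oxford-IIIT Pet trimaps store the values 1 (pet), 2 (background), and 3 (border / not classified), so one plausible remap to the IDs above is a lookup that shifts each value down by one. The project's actual remapping code is not shown in this lesson; treat the mapping and the helper name below as an assumption to verify against it.

```python
# Assumption: raw trimap values are 1 (pet), 2 (background), 3 (border).
RAW_TO_PROJECT = {1: 0, 2: 1, 3: 2}

def remap_trimap(raw_pixels):
    """Map raw trimap values to the 0..2 class IDs expected by CrossEntropyLoss."""
    # A KeyError here means the mask contains a value the mapping does not expect.
    return [RAW_TO_PROJECT[v] for v in raw_pixels]

remapped = remap_trimap([1, 2, 3, 3])  # [0, 1, 2, 2]
```

Using an explicit lookup instead of subtracting 1 makes unexpected values fail loudly rather than silently producing an out-of-range class ID.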


What to log every epoch

| Log | Why |
| --- | --- |
| train_loss | Optimization progress |
| val_mIoU | Generalization; pick best checkpoint |
| Per-class val IoU | Spot weak classes (often border) |
| 3–5 overlay PNGs | Human eyes catch failures metrics miss |

```text
Overlay = 0.6 * RGB image + 0.4 * colorized pred mask
```

If loss drops but overlays look worse, suspect a bug, overfitting, or an incorrectly computed metric.
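The overlay blend is simple enough to sketch per pixel. The palette below is hypothetical (any three distinguishable colors work), and the helper operates on plain RGB tuples instead of image arrays to stay dependency-free.

```python
# Hypothetical palette: class ID -> RGB color used to colorize the predicted mask.
PALETTE = {0: (255, 0, 0), 1: (0, 0, 0), 2: (255, 255, 0)}

def blend_pixel(rgb, class_id, alpha=0.4):
    """Overlay = (1 - alpha) * RGB image + alpha * colorized pred mask, per channel."""
    color = PALETTE[class_id]
    return tuple(round((1 - alpha) * c + alpha * m) for c, m in zip(rgb, color))

pet_pixel = blend_pixel((100, 100, 100), 0)   # (162, 60, 60): gray tinted red
bg_pixel = blend_pixel((200, 200, 200), 1)    # (120, 120, 120): gray darkened
```

With `alpha=0.4` the image still dominates, so you can see whether mask edges line up with the animal's outline.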


Train vs val discipline

  • Tune thresholds / early stopping on validation mIoU.
  • Touch test set once at the end for honest reporting.
  • Same rule as Module 2 spam project — segmentation is no different.

Failure modes checklist

| Observation | Likely cause |
| --- | --- |
| mIoU ≈ 0 always | Wrong mask encoding; CE on wrong value range |
| mIoU high, overlays wrong | Metric on subset; color map bug |
| All-background predictions | Class imbalance (check foreground IoU) |
| Striped masks | Augmentation misalignment |
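The first row (wrong mask encoding) is cheap to rule out before training. The sketch below is a hypothetical sanity check, not part of the project code: it verifies that every mask value is either a valid class ID or the ignore index, assumed here to be 255.

```python
def check_mask_values(mask_values, num_classes, ignore_index=255):
    """Raise if the mask contains anything other than 0..num_classes-1 or ignore_index."""
    allowed = set(range(num_classes)) | {ignore_index}
    unexpected = sorted(set(mask_values) - allowed)
    if unexpected:
        raise ValueError(
            f"unexpected mask values {unexpected}; "
            f"expected 0..{num_classes - 1} or {ignore_index}"
        )
    return True

ok = check_mask_values([0, 1, 2, 255], num_classes=3)  # True

try:
    check_mask_values([1, 2, 3], num_classes=3)  # e.g. a trimap that was never remapped
    caught = False
except ValueError:
    caught = True
```

Running a check like this on a handful of masks from the data loader catches off-by-one encodings and 0/255 binary masks before they show up as a flat mIoU curve.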

Checkpoint

  1. Logits shape (4, 3, 128, 128) — what is 3?
  2. Ground truth has 1000 foreground pixels; the model predicts none. What is the foreground IoU?
  3. Why log overlays if loss already decreases?

Answers: (1) The channel dimension: num_classes = 3. (2) 0: the intersection is empty while the union is 1000. (3) Loss can improve on easy background pixels while borders stay wrong; overlays reveal that.


What's next

Module 5 quiz — then the U-Net project.