IoU, NMS, mAP & evaluation

Before we begin

Training loss going down does not guarantee a shippable detector. Products care about: Did we find the objects? Are boxes tight? Are there duplicate false alarms?

IoU, NMS, and mAP are how the field answers those questions consistently. This lesson treats them in the same depth as the Module 2 classification metrics, but for boxes.

What you will learn

Compute IoU by hand and in code.
Implement NMS step-by-step and interpret failures.
Build precision–recall for one class at multiple score thresholds.
Explain AP and COCO mAP@0.5:0.95.
Tune score threshold and NMS IoU on validation data.

Before this lesson

Lesson 3 — Training detectors

IoU — definition and worked example

\mathrm{IoU}(A,B) = \frac{|A \cap B|}{|A \cup B|}

Figure

IoU geometry

Intersection over union — 0 to 1.

Pred xyxy: [10, 10, 50, 50] → area $40^2=1600$
GT xyxy: [30, 30, 70, 70] → area $1600$
Intersection: [30,30,50,50] → $20^2=400$
Union: $1600+1600-400=2800$
IoU $= 400/2800 \approx 0.143$

At the COCO-style IoU threshold of 0.5, this prediction does not match: it counts as a false positive, and the ground-truth box becomes a false negative unless another prediction matches it.

python

def iou_xyxy(box_a, box_b):
    xa1, ya1, xa2, ya2 = box_a
    xb1, yb1, xb2, yb2 = box_b
    ix1, iy1 = max(xa1, xb1), max(ya1, yb1)
    ix2, iy2 = min(xa2, xb2), min(ya2, yb2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (xa2 - xa1) * (ya2 - ya1)
    area_b = (xb2 - xb1) * (yb2 - yb1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

GIoU / DIoU (preview)

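When boxes do not overlap, IoU = 0 no matter how far apart they are, so an IoU-based loss gives no gradient to pull the prediction toward the target. GIoU subtracts a penalty based on the smallest box enclosing both boxes, and DIoU penalizes the distance between box centers. IoU-family losses of this kind are used for box regression in many YOLO variants.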
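To make the GIoU idea concrete, here is a minimal pure-Python sketch for two xyxy boxes. The function name `giou_xyxy` and the choice to return 0.0 for degenerate boxes are assumptions made for illustration; they are not taken from any particular library.

```python
def giou_xyxy(box_a, box_b):
    """Generalized IoU for two [x1, y1, x2, y2] boxes; values lie in (-1, 1]."""
    xa1, ya1, xa2, ya2 = box_a
    xb1, yb1, xb2, yb2 = box_b
    # Intersection and union, exactly as for plain IoU.
    inter_w = max(0.0, min(xa2, xb2) - max(xa1, xb1))
    inter_h = max(0.0, min(ya2, yb2) - max(ya1, yb1))
    inter = inter_w * inter_h
    union = (xa2 - xa1) * (ya2 - ya1) + (xb2 - xb1) * (yb2 - yb1) - inter
    # Area of the smallest axis-aligned box C that encloses both A and B.
    enclose = (max(xa2, xb2) - min(xa1, xb1)) * (max(ya2, yb2) - min(ya1, yb1))
    if union <= 0 or enclose <= 0:
        return 0.0  # degenerate boxes; a convention assumed for this sketch
    # Penalty term: the fraction of C that the union does not cover.
    return inter / union - (enclose - union) / enclose


overlapping = giou_xyxy([10, 10, 50, 50], [30, 30, 70, 70])   # worked example boxes
disjoint_near = giou_xyxy([0, 0, 10, 10], [20, 0, 30, 10])    # IoU = 0
disjoint_far = giou_xyxy([0, 0, 10, 10], [90, 0, 100, 10])    # IoU = 0, farther apart
print(overlapping, disjoint_near, disjoint_far)
```

Both disjoint pairs have IoU = 0, but the farther pair gets a lower GIoU, so a loss built on it still tells the optimizer which direction to move the box.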

Matching predictions to ground truth (per image)

For class $c$, at score threshold $t$:

Filter preds with score ≥ $t$ and class $c$.
Sort preds by score descending.
Greedy match: each pred → highest IoU unmatched GT if IoU ≥ threshold.
Count TP, FP; unmatched GT → FN.

Outcome	Meaning
TP	Pred matched GT, correct class, IoU ok
FP	Pred with no valid match: IoU too low, wrong class, or a duplicate of an already-matched GT
FN	GT with no matching pred
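The matching steps above can be sketched in plain Python for one class in one image. The function name `match_image`, the `(score, box)` tuple layout, and the example boxes are all hypothetical, chosen only for illustration.

```python
def iou_xyxy(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_image(preds, gts, score_thresh=0.0, iou_thresh=0.5):
    """preds: list of (score, box); gts: list of boxes (same class, same image).

    Returns (flags, fn): flags is a list of (score, is_tp) in descending
    score order, fn is the number of ground-truth boxes left unmatched.
    """
    kept = [p for p in preds if p[0] >= score_thresh]        # step 1: filter
    kept.sort(key=lambda p: p[0], reverse=True)              # step 2: sort
    matched = set()
    flags = []
    for score, box in kept:                                  # step 3: greedy match
        best_iou, best_j = 0.0, None
        for j, gt in enumerate(gts):
            if j in matched:
                continue  # each GT can be claimed by at most one prediction
            overlap = iou_xyxy(box, gt)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        is_tp = best_j is not None and best_iou >= iou_thresh
        if is_tp:
            matched.add(best_j)
        flags.append((score, is_tp))
    return flags, len(gts) - len(matched)                    # step 4: count FN


gts = [[30, 30, 70, 70], [100, 100, 140, 140]]
preds = [
    (0.9, [32, 28, 72, 68]),   # close to the first GT
    (0.8, [30, 30, 70, 70]),   # duplicate on the same GT
    (0.6, [10, 10, 50, 50]),   # the loose box from the worked example
]
flags, fn = match_image(preds, gts)
print(flags, fn)
```

In this example the second prediction overlaps the first GT perfectly but still counts as an FP, because that GT was already claimed by a higher-scoring prediction. This is why duplicates hurt precision even when they are well localized.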

Non-maximum suppression (NMS)

Problem: dense detectors emit 5–50 overlapping boxes on one person.

Figure

NMS intuition

Keep best score; suppress overlapping duplicates.

python

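import torch

def nms_xyxy(boxes, scores, iou_thresh=0.5):
    """Greedy NMS. boxes: (N, 4) xyxy tensor; scores: (N,) tensor. Returns kept indices."""
    order = scores.argsort(descending=True)  # best score first
    keep = []
    while order.numel() > 0:
        i = order[0].item()
        keep.append(i)  # the highest-scoring remaining box always survives
        if order.numel() == 1:
            break
        # IoU of the kept box against every remaining candidate (uses iou_xyxy above)
        ious = torch.tensor(
            [float(iou_xyxy(boxes[i].tolist(), boxes[j].tolist())) for j in order[1:]]
        )
        # Survivors overlap the kept box by at most iou_thresh
        remaining = (ious <= iou_thresh).nonzero().squeeze(1)
        order = order[1:][remaining]
    return keep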

Soft-NMS: reduce scores of overlapping boxes instead of deleting — better in crowds.

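Failure: two people hugging → high mutual IoU → one person suppressed. Mitigations: raise the NMS IoU threshold (at the cost of more duplicates), use Soft-NMS or specialized crowd NMS, or increase input resolution. Lowering the threshold makes suppression more aggressive and makes this failure worse.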
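As a rough illustration of the Soft-NMS idea, the sketch below uses the Gaussian variant, which multiplies an overlapping box's score by exp(-IoU² / sigma) instead of deleting the box. The function name `soft_nms_gaussian`, the defaults `sigma=0.5` and `score_floor=0.001`, and the example boxes are assumptions for illustration.

```python
import math


def iou_xyxy(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms_gaussian(boxes, scores, sigma=0.5, score_floor=0.001):
    """Return [(index, decayed_score)] in the order boxes are selected."""
    scores = list(scores)                 # copy so the caller's list is untouched
    remaining = list(range(len(boxes)))
    kept = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        if scores[best] < score_floor:
            break                         # everything left has decayed to noise
        remaining.remove(best)
        kept.append((best, scores[best]))
        for i in remaining:
            overlap = iou_xyxy(boxes[best], boxes[i])
            # Heavy overlap -> strong decay; zero overlap -> score unchanged.
            scores[i] *= math.exp(-(overlap ** 2) / sigma)
    return kept


boxes = [[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]]
scores = [0.9, 0.8, 0.7]
result = soft_nms_gaussian(boxes, scores)
print(result)
```

The near-duplicate box (index 1) is not removed. Its score drops below that of the distant box, so a downstream score threshold decides whether it survives. In a crowd, that gives a genuinely separate person with high overlap a chance to be kept.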

Precision and recall (detection)

At fixed score threshold:

\mathrm{precision} = \frac{TP}{TP+FP}, \quad \mathrm{recall} = \frac{TP}{TP+FN}

Raise score threshold	Usually
Precision	↑ fewer junk boxes
Recall	↓ missed objects

Product mapping:

Safety-critical (miss = bad) → favor recall — lower threshold
Spammy UI overlays → favor precision — higher threshold
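The trade-off can be checked numerically with a small sketch. It assumes you already have, for one class, a list of `(score, is_tp)` pairs produced by greedy matching at a fixed IoU threshold, plus the total number of ground-truth boxes. The helper name `precision_recall_at` and the example values are hypothetical.

```python
def precision_recall_at(flags, num_gt, score_thresh):
    """flags: (score, is_tp) pairs for one class across the validation set."""
    kept = [is_tp for score, is_tp in flags if score >= score_thresh]
    tp = sum(kept)
    fp = len(kept) - tp
    fn = num_gt - tp
    # With no predictions kept, precision is undefined; 1.0 is a convention
    # assumed for this sketch.
    precision = tp / (tp + fp) if kept else 1.0
    recall = tp / (tp + fn) if num_gt else 0.0
    return precision, recall


# Made-up example: 7 predictions, 5 ground-truth boxes.
flags = [(0.95, True), (0.9, True), (0.8, False), (0.7, True),
         (0.6, False), (0.4, True), (0.3, False)]
num_gt = 5
for t in (0.9, 0.5, 0.3):
    p, r = precision_recall_at(flags, num_gt, t)
    print(f"score_thresh={t:.1f}  precision={p:.3f}  recall={r:.3f}")
```

Filtering by score after matching is consistent with the greedy procedure, because a prediction's match depends only on predictions with higher scores.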

Average Precision (AP)

Vary score threshold → sweep precision/recall → curve.

Figure

PR curve

AP = shaded area under curve (per VOC/COCO interpolation rules).

AP (one class): area under PR curve (COCO uses 101-point interpolation).

mAP: mean AP over classes. COCO mAP also averages over the ten IoU thresholds $0.5, 0.55, \ldots, 0.95$.
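A minimal sketch of single-class AP at one IoU threshold, sampling the precision envelope at 101 recall points in the COCO style. It is a simplification for intuition, not a replacement for the official evaluator (it ignores crowd annotations, area ranges, and the detection cap). The name `average_precision_101` and the input layout are assumptions.

```python
def average_precision_101(flags, num_gt):
    """flags: (score, is_tp) pairs for one class at one IoU threshold."""
    if num_gt == 0:
        return 0.0  # convention assumed here; real evaluators skip such classes
    ordered = sorted(flags, key=lambda f: f[0], reverse=True)
    precisions, recalls = [], []
    tp = 0
    for rank, (_, is_tp) in enumerate(ordered, start=1):
        tp += is_tp
        precisions.append(tp / rank)      # precision after the top `rank` preds
        recalls.append(tp / num_gt)       # recall after the top `rank` preds
    # Precision envelope: walking right to left, keep the running maximum
    # so the curve never increases as recall grows.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Sample at recall = 0.00, 0.01, ..., 1.00; unreachable recall scores 0.
    total = 0.0
    k = 0
    for step in range(101):
        r = step / 100
        while k < len(recalls) and recalls[k] < r:
            k += 1
        total += precisions[k] if k < len(recalls) else 0.0
    return total / 101


flags = [(0.95, True), (0.9, True), (0.8, False), (0.7, True),
         (0.6, False), (0.4, True), (0.3, False)]
ap = average_precision_101(flags, num_gt=5)
print(round(ap, 3))
```

Under these assumptions, mAP for a single IoU threshold is the mean of this value over classes, and COCO mAP repeats the matching at each IoU threshold before averaging again.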

Metric	Strictness
AP@0.5	Loose boxes OK
AP@0.75	Tight localization
mAP@[.5:.95]	Industry standard for papers

Never compare your AP@0.5 number to someone's COCO mAP — different scales.

Computing AP in the project (simplified)

python

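# torchmetrics (extension); recent versions need a COCO backend such as pycocotools
from torchmetrics.detection import MeanAveragePrecision

metric = MeanAveragePrecision(iou_type="bbox")  # boxes default to xyxy format
# preds: one dict per image with "boxes" (N, 4), "scores" (N,), "labels" (N,) tensors
# targets: one dict per image with "boxes" (M, 4) and "labels" (M,) tensors
metric.update(preds, targets)
out = metric.compute()
# "map" is COCO mAP@[.5:.95]; "map_50" and "map_75" are AP at IoU 0.5 and 0.75
print(out["map"], out["map_50"], out["map_75"])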

For the course project, reporting mAP@0.5 on the validation set is acceptable; state which metric you used.

Tuning inference knobs (validation only)

Knob	Default-ish	Effect
`score_thresh`	0.5–0.7	FP vs FN trade-off
`nms_thresh`	0.5	duplicate removal aggressiveness
`max_detections`	100	cap boxes per image

python

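# torchvision inference: the model applies its own NMS internally
import torch

model.eval()
with torch.no_grad():
    pred = model([img])[0]  # img: float tensor (C, H, W) scaled to [0, 1]
mask = pred["scores"] > 0.6  # score threshold chosen on validation data
boxes = pred["boxes"][mask]
labels = pred["labels"][mask]
scores = pred["scores"][mask]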

Log precision/recall on val while sweeping threshold — plot curve in project README.
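One possible way to turn a logged sweep into a deployment threshold is to pick the row with the highest recall that still meets a precision floor. The helper `pick_score_thresh`, the row layout, and the numbers below are placeholders for illustration, not measured results.

```python
def pick_score_thresh(sweep, min_precision):
    """sweep: list of (score_thresh, precision, recall) rows from validation.

    Returns the row with the highest recall among rows whose precision meets
    the floor (ties broken toward the higher threshold), or None if no row
    qualifies.
    """
    ok = [row for row in sweep if row[1] >= min_precision]
    if not ok:
        return None
    return max(ok, key=lambda row: (row[2], row[0]))


# Placeholder sweep values; replace with numbers logged on your own val set.
sweep = [
    (0.3, 0.57, 0.80),
    (0.5, 0.60, 0.60),
    (0.7, 0.75, 0.60),
    (0.9, 1.00, 0.40),
]
print(pick_score_thresh(sweep, min_precision=0.7))   # precision-first product
print(pick_score_thresh(sweep, min_precision=0.5))   # recall-first product
```

A safety-critical product would swap the roles: fix a recall floor and take the highest precision that satisfies it.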

Qualitative evaluation (mandatory)

Build a failure gallery:

Failure type	What to look for
Missed small objects	need FPN / higher res
Duplicate boxes	NMS / threshold
Class confusion	more data / hard negatives
Jittery boxes on video	temporal smoothing (Module 6)

mAP aggregates — images teach.

Check your understanding

IoU 0.45 at threshold 0.5 — TP or FP?
Lower NMS IoU threshold — more or fewer boxes kept?
Why report AP@0.75 in addition to AP@0.5?

Answer sketches: (1) FP (IoU is below 0.5). (2) Fewer (more aggressive suppression). (3) Tight localization matters for tasks such as grasping and measurement.

What's next

Lesson 5 — On-device detection