IoU, NMS, mAP & evaluation
Before we begin
Training loss going down does not guarantee a shippable detector. Products care about: Did we find the objects? Are boxes tight? Are there duplicate false alarms?
IoU, NMS, and mAP are how the field answers those questions consistently. This lesson is as detailed as Module 2 metrics for classification — but for boxes.
What you will learn
- Compute IoU by hand and in code.
- Implement NMS step-by-step and interpret failures.
- Build precision–recall for one class at multiple score thresholds.
- Explain AP and COCO mAP@0.5:0.95.
- Tune score threshold and NMS IoU on validation data.
Before this lesson
IoU — definition and worked example
Figure: IoU geometry
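Pred xyxy: [10, 10, 50, 50] → area $= 40 \times 40 = 1600$
GT xyxy: [30, 30, 70, 70] → area $= 40 \times 40 = 1600$
Intersection: [30, 30, 50, 50] → $20 \times 20 = 400$
Union: $1600 + 1600 - 400 = 2800$
IoU $= 400 / 2800 \approx 0.14$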
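At an IoU threshold of 0.5 (the loosest COCO threshold), this prediction does not match the GT box. The prediction counts as a false positive, and the GT box becomes a false negative unless another prediction matches it.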
def iou_xyxy(box_a, box_b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    xa1, ya1, xa2, ya2 = box_a
    xb1, yb1, xb2, yb2 = box_b
    # Corners of the intersection rectangle
    ix1, iy1 = max(xa1, xb1), max(ya1, yb1)
    ix2, iy2 = min(xa2, xb2), min(ya2, yb2)
    # Clamp at 0 so disjoint boxes give zero overlap
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (xa2 - xa1) * (ya2 - ya1)
    area_b = (xb2 - xb1) * (yb2 - yb1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

GIoU / DIoU (preview)
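When boxes do not overlap, IoU = 0 no matter how far apart they are, so a loss built on plain IoU gives no gradient to pull the prediction toward the target. GIoU adds a penalty based on the smallest box enclosing both; DIoU instead penalizes the distance between box centers. IoU-family losses of this kind are still used in YOLO training losses.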
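To make the enclosing-box penalty concrete, here is a minimal sketch of GIoU for a single pair of boxes, using the standard definition GIoU = IoU − (|C| − |A ∪ B|) / |C|, where C is the smallest axis-aligned box containing both. The function name `giou_xyxy` and the handling of degenerate boxes are assumptions for illustration, and a training loss would use a batched, differentiable version rather than this scalar one.

```python
def giou_xyxy(box_a, box_b):
    """Generalized IoU for two [x1, y1, x2, y2] boxes; values lie in (-1, 1]."""
    xa1, ya1, xa2, ya2 = box_a
    xb1, yb1, xb2, yb2 = box_b

    # Plain IoU terms
    inter_w = max(0.0, min(xa2, xb2) - max(xa1, xb1))
    inter_h = max(0.0, min(ya2, yb2) - max(ya1, yb1))
    inter = inter_w * inter_h
    union = (xa2 - xa1) * (ya2 - ya1) + (xb2 - xb1) * (yb2 - yb1) - inter

    # Area of the smallest axis-aligned box C that encloses both inputs
    enclose = (max(xa2, xb2) - min(xa1, xb1)) * (max(ya2, yb2) - min(ya1, yb1))
    if union <= 0 or enclose <= 0:
        return 0.0  # degenerate boxes: an assumption, not part of the definition

    iou = inter / union
    return iou - (enclose - union) / enclose


# The worked example from above, then two disjoint pairs at different distances
overlapping = giou_xyxy([10, 10, 50, 50], [30, 30, 70, 70])
near_miss = giou_xyxy([0, 0, 10, 10], [20, 0, 30, 10])
far_miss = giou_xyxy([0, 0, 10, 10], [90, 0, 100, 10])
print(overlapping, near_miss, far_miss)
```

Both disjoint pairs have IoU = 0, but the farther pair gets the more negative GIoU, which is the signal plain IoU cannot provide.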
Matching predictions to ground truth (per image)
For class $c$, at score threshold $t$:
- Filter preds with score ≥ $t$ and class $c$.
- Sort preds by score descending.
- Greedy match: each pred → highest IoU unmatched GT if IoU ≥ threshold.
- Count TP, FP; unmatched GT → FN.
| Outcome | Meaning |
|---|---|
| TP | Pred matched GT, correct class, IoU ok |
| FP | Pred unmatched or wrong class |
| FN | GT with no matching pred |
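The matching steps above can be sketched for one image and one class as follows. The function name `match_image` and the tuple layouts for predictions and ground truth are assumptions made for this example. A wrong-class prediction is handled implicitly: it is filtered out here and shows up as an FP when its own predicted class is evaluated.

```python
def iou_xyxy(box_a, box_b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_image(preds, gts, cls, score_thresh, iou_thresh=0.5):
    """Greedy matching for one image and one class.

    preds: list of (box, score, label); gts: list of (box, label).
    Returns (tp, fp, fn).
    """
    # Filter by class and score, then sort by score descending
    candidates = sorted(
        (p for p in preds if p[2] == cls and p[1] >= score_thresh),
        key=lambda p: p[1],
        reverse=True,
    )
    gt_boxes = [box for box, label in gts if label == cls]
    matched = [False] * len(gt_boxes)
    tp = fp = 0
    for box, _score, _label in candidates:
        # Find the still-unmatched GT box with the highest IoU
        best_iou, best_j = 0.0, -1
        for j, gt_box in enumerate(gt_boxes):
            if matched[j]:
                continue
            iou = iou_xyxy(box, gt_box)
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j >= 0 and best_iou >= iou_thresh:
            matched[best_j] = True  # each GT box can be claimed only once
            tp += 1
        else:
            fp += 1
    fn = matched.count(False)  # GT boxes nobody claimed
    return tp, fp, fn


gts = [([10, 10, 50, 50], "person"), ([100, 100, 140, 140], "person")]
preds = [
    ([12, 12, 50, 50], 0.9, "person"),  # good box on the first person
    ([11, 11, 51, 51], 0.8, "person"),  # duplicate on the same person
    ([30, 30, 70, 70], 0.3, "person"),  # low-score box, filtered at 0.5
]
counts = match_image(preds, gts, "person", score_thresh=0.5)
print(counts)
```

The duplicate becomes an FP because the GT box it overlaps is already taken, and the second person, never detected, is the FN.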
Non-maximum suppression (NMS)
Problem: dense detectors emit 5–50 overlapping boxes on one person.
Figure: NMS intuition
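import torch

def nms_xyxy(boxes, scores, iou_thresh=0.5):
    """Greedy NMS; returns indices of the kept boxes, highest score first."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0].item()  # highest-scoring remaining box
        keep.append(i)
        if order.numel() == 1:
            break
        # IoU of the kept box against every remaining candidate
        ious = torch.tensor(
            [iou_xyxy(boxes[i].tolist(), boxes[j].tolist()) for j in order[1:]]
        )
        # Keep only candidates that do not overlap the kept box too much
        remaining = (ious <= iou_thresh).nonzero().squeeze(1)
        order = order[1:][remaining]
    return keep

Soft-NMS: reduce the scores of overlapping boxes instead of deleting them, which tends to work better in crowds.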
Failure: two people hugging → high mutual IoU → one person suppressed. Mitigations: raise the NMS IoU threshold (a lower threshold suppresses more, at the cost here of losing real objects), Soft-NMS or specialized crowd NMS, higher input resolution.
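As an illustration of the Soft-NMS idea, the sketch below uses the Gaussian decay variant, where a neighbor's score is multiplied by exp(−IoU² / σ) rather than being removed. The function name `soft_nms_gaussian` and the defaults for `sigma` and `score_floor` are assumptions for this example, not values from the course project.

```python
import math


def iou_xyxy(box_a, box_b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def soft_nms_gaussian(boxes, scores, sigma=0.5, score_floor=0.001):
    """Return (index, decayed_score) pairs in the order boxes are selected."""
    scores = list(scores)  # copy so the caller's scores are untouched
    candidates = list(range(len(boxes)))
    keep = []
    while candidates:
        # Pick the candidate with the highest current (possibly decayed) score
        best = max(candidates, key=lambda k: scores[k])
        candidates.remove(best)
        if scores[best] < score_floor:
            break  # everything left is effectively suppressed
        keep.append((best, scores[best]))
        # Decay neighbors by overlap instead of deleting them
        for j in candidates:
            iou = iou_xyxy(boxes[best], boxes[j])
            scores[j] *= math.exp(-(iou * iou) / sigma)
    return keep


boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]]
scores = [0.9, 0.8, 0.7]
result = soft_nms_gaussian(boxes, scores)
for index, score in result:
    print(index, round(score, 3))
```

The heavily overlapping second box survives with a reduced score and drops below the isolated third box, so a later score threshold decides whether it is shown.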
Precision and recall (detection)
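At a fixed score threshold, with TP, FP, and FN counted as above:
$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$
| Metric | Usual effect of raising the score threshold |
|---|---|
| Precision | ↑ fewer junk boxes |
| Recall | ↓ more missed objects |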
Product mapping:
- Safety-critical (miss = bad) → favor recall — lower threshold
- Spammy UI overlays → favor precision — higher threshold
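The trade-off can be checked numerically with a small sketch. It assumes each detection of one class has already been labeled TP or FP by greedy matching at a fixed IoU threshold; because that matching runs in descending score order, the labels stay valid when low-score detections are filtered out. The function name `precision_recall_at` and the toy records are made up for illustration.

```python
def precision_recall_at(records, n_gt, score_thresh):
    """records: (score, is_tp) pairs for one class; n_gt: number of GT boxes."""
    kept = [is_tp for score, is_tp in records if score >= score_thresh]
    tp = sum(kept)
    fp = len(kept) - tp
    fn = n_gt - tp
    # Convention (an assumption): precision is 1.0 when nothing is predicted
    precision = tp / (tp + fp) if kept else 1.0
    recall = tp / (tp + fn) if n_gt else 0.0
    return precision, recall


# Toy validation detections, not real results
records = [
    (0.95, True), (0.90, True), (0.80, True), (0.70, False),
    (0.60, True), (0.40, False), (0.30, True), (0.20, False),
]
n_gt = 6
sweep = {t: precision_recall_at(records, n_gt, t) for t in (0.25, 0.5, 0.75)}
for t, (precision, recall) in sweep.items():
    print(f"thresh={t:.2f} precision={precision:.2f} recall={recall:.2f}")
```

On this toy data precision rises and recall falls as the threshold goes up; on real data precision can dip locally, which is why the table says "usually".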
Average Precision (AP)
Vary score threshold → sweep precision/recall → curve.
Figure
PR curve
AP (one class): area under PR curve (COCO uses 101-point interpolation).
mAP: mean AP over classes. COCO mAP also averages over IoU thresholds $\{0.50, 0.55, \ldots, 0.95\}$.
| Metric | Strictness |
|---|---|
| AP@0.5 | Loose boxes OK |
| AP@0.75 | Tight localization |
| mAP@[.5:.95] | Averages loose through very tight IoU; the standard headline metric in papers |
Never compare your AP@0.5 number to someone's COCO mAP — different scales.
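For intuition about what the AP number measures, here is a minimal sketch of 101-point interpolated AP for one class at one IoU threshold. It follows the COCO procedure in spirit only: the official evaluator also handles crowd annotations, object-size ranges, and detection caps, so this sketch need not reproduce its numbers. The function name `average_precision_101` and the `(score, is_tp)` record format are assumptions for this example.

```python
def average_precision_101(records, n_gt):
    """records: (score, is_tp) pairs for one class at one IoU threshold."""
    if n_gt == 0:
        return 0.0
    ordered = sorted(records, key=lambda r: r[0], reverse=True)

    # Precision and recall after each detection, walking down the score ranking
    precisions, recalls = [], []
    tp = 0
    for rank, (_score, is_tp) in enumerate(ordered, start=1):
        tp += int(is_tp)
        precisions.append(tp / rank)
        recalls.append(tp / n_gt)

    # Precision envelope: make precision non-increasing as recall grows
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])

    # Sample the envelope at recall = 0.00, 0.01, ..., 1.00
    total = 0.0
    k = 0
    for step in range(101):
        r = step / 100
        while k < len(recalls) and recalls[k] < r:
            k += 1
        total += precisions[k] if k < len(recalls) else 0.0  # unreachable recall scores 0
    return total / 101


# Toy example: two GT boxes, three detections (hit, miss, hit)
ap = average_precision_101([(0.9, True), (0.8, False), (0.7, True)], n_gt=2)
print(round(ap, 4))

# COCO-style mAP repeats this for every class and for each IoU threshold
# 0.50, 0.55, ..., 0.95, then takes the mean of all resulting AP values.
coco_iou_thresholds = [0.5 + 0.05 * i for i in range(10)]
```

Because matching is redone at each IoU threshold, the same detections yield lower AP at 0.75 than at 0.5 whenever boxes are loose, which is why the two scales are not comparable.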
Computing AP in the project (simplified)
# torchmetrics (extension)
from torchmetrics.detection import MeanAveragePrecision

metric = MeanAveragePrecision(iou_type="bbox")
# preds: one dict per image with "boxes" (xyxy), "scores", "labels" tensors
# targets: one dict per image with "boxes" and "labels" tensors
metric.update(preds, targets)
out = metric.compute()
print(out["map"], out["map_50"], out["map_75"])

For the course project, reporting mAP@0.5 on val is acceptable; note which metric you used.
Tuning inference knobs (validation only)
| Knob | Default-ish | Effect |
|---|---|---|
| score_thresh | 0.5–0.7 | FP vs FN trade-off |
| nms_thresh | 0.5 | duplicate removal aggressiveness |
| max_detections | 100 | cap boxes per image |
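# torchvision inference: the model applies NMS internally
# (assumes `model` is a loaded detection model and `img` is an image tensor)
import torch

model.eval()
with torch.no_grad():
    pred = model([img])[0]
# Keep only detections above the chosen score threshold
mask = pred["scores"] > 0.6
boxes = pred["boxes"][mask]

Log precision/recall on val while sweeping the score threshold, and plot the curve in the project README.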
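One way to turn such a sweep into a concrete operating point is sketched below: among candidate thresholds, keep those that meet a minimum recall on validation data and pick the one with the best precision. The function name `pick_score_threshold`, the recall-floor criterion, and the toy records are assumptions for illustration; a project might instead maximize F1 or fix a precision floor.

```python
def pick_score_threshold(records, n_gt, min_recall, candidates):
    """Pick the candidate threshold with the best precision subject to a recall floor.

    records: (score, is_tp) pairs from validation, matched at a fixed IoU threshold.
    Returns (threshold, precision, recall), or None if no candidate qualifies.
    """
    if n_gt == 0:
        return None
    best = None
    for t in sorted(candidates):
        kept = [is_tp for score, is_tp in records if score >= t]
        if not kept:
            continue  # nothing predicted at this threshold
        tp = sum(kept)
        precision = tp / len(kept)
        recall = tp / n_gt
        # Strict '>' keeps the lowest threshold among ties, which favors recall
        if recall >= min_recall and (best is None or precision > best[1]):
            best = (t, precision, recall)
    return best


# Toy validation detections, not real results
records = [
    (0.95, True), (0.90, True), (0.80, True), (0.70, False),
    (0.60, True), (0.40, False), (0.30, True), (0.20, False),
]
candidates = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9
choice = pick_score_threshold(records, n_gt=6, min_recall=0.6, candidates=candidates)
print(choice)
```

Choosing the threshold on validation data and then reporting on a held-out test split keeps the reported numbers honest.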
Qualitative evaluation (mandatory)
Build a failure gallery:
| Failure type | Likely cause or fix |
|---|---|
| Missed small objects | need FPN / higher res |
| Duplicate boxes | NMS / threshold |
| Class confusion | more data / hard negatives |
| Jittery boxes on video | temporal smoothing (Module 6) |
mAP aggregates — images teach.
Check your understanding
- IoU 0.45 at threshold 0.5 — TP or FP?
- Lower NMS IoU threshold — more or fewer boxes kept?
- Why report AP@0.75 in addition to AP@0.5?
Sketches: (1) FP (below 0.5). (2) fewer (more aggressive suppression). (3) tight localization matters for grasping/measurement.