
Module 4 — Object detection

Project: train and evaluate an object detector

Fine-tune Faster R-CNN on Penn-Fudan, plot PR curves, compute mAP, tune thresholds, visualize failures, and export ONNX.

~360 min read + exercises

Before we begin

This is your first object detection project. You will fine-tune a Faster R-CNN detector on the Penn-Fudan pedestrian dataset, track training losses, compute mAP on a validation split, sweep score thresholds, build a failure gallery, and export to ONNX for Module 7.

You are not inventing a new architecture — you are learning the full detection workflow: labels → collate → loss dict → eval → metrics → deployment.

Figure: What good evaluation looks like

Precision–recall curve for one class (x-axis: recall, y-axis: precision). AP is the area under this curve; lowering the score threshold moves you right, toward higher recall.
mAP summarizes precision–recall across IoU thresholds; always pair it with visual failure analysis.
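
To make "AP is the area under the curve" concrete, here is a minimal sketch in plain Python. The function name `average_precision` and the sample points are illustrative assumptions. It uses all-point interpolation with a precision envelope, which is one common convention; torchmetrics follows the COCO-style 101-point variant, so its numbers can differ slightly.

```python
def average_precision(recalls, precisions):
    """Area under a PR curve using the monotone precision envelope.

    `recalls` must be sorted in increasing order; both lists have equal length.
    """
    # Pad so the curve starts at recall 0.
    r = [0.0] + list(recalls)
    p = [0.0] + list(precisions)
    # Envelope: precision at recall r becomes the best precision at any recall >= r.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangle areas between consecutive recall points.
    return sum((r[i] - r[i - 1]) * p[i] for i in range(1, len(r)))


# Hypothetical sweep: one (recall, precision) point per score threshold.
recalls = [0.2, 0.4, 0.6, 0.8]
precisions = [1.0, 1.0, 0.75, 0.5]
ap = average_precision(recalls, precisions)
print(round(ap, 3))
```

The curve stops at recall 0.8 in this example, so the missing 20% of pedestrians contribute zero area. That is why a detector that never finds some objects cannot reach AP = 1, however precise it is.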

How this connects to Module 4

Lesson | Where you use it
Classification → detection | Box format xyxy, one class (person), variable object count
Architectures | Faster R-CNN: backbone + FPN + RPN + RoI head
Training | model(images, targets) → loss dict, label 0 = background
IoU / NMS / mAP | torchmetrics mAP, score threshold tuning
On-device | ONNX export, fixed input size, post-process on CPU

Folder layout:

text
pedestrian-detector/
  data/                          # Penn-Fudan images and masks
  train_detector.py              # train, evaluate, plot, export
  outputs/
    loss_curves.png
    detections_val.png           # green GT, red preds
    pr_threshold_sweep.png
    failure_gallery.png
    metrics.json                 # mAP@0.5, mAP@0.5:0.95
    best_detector.pt
    detector.onnx
  README.md                      # metrics table + 3 failure reflections

Estimated time: 4–6 hours (first time); ~2.5 hours if comfortable with Module 3 project.


What you will build

  1. Convert Penn-Fudan instance masks → torchvision-style targets (xyxy boxes + labels).
  2. Fine-tune fasterrcnn_resnet50_fpn (COCO-pretrained) for 1 foreground class.
  3. Log RPN and RoI losses every epoch — spot collapse early.
  4. Compute validation mAP@0.5 and mAP@0.5:0.95 with torchmetrics.
  5. Sweep confidence thresholds and plot precision vs recall.
  6. Visualize predictions vs ground truth and a failure gallery.
  7. Export ONNX with a fixed 800×800 input (deployment pattern from Lesson 5).

Before you start

bash
mkdir pedestrian-detector && cd pedestrian-detector
python -m venv .venv
# Windows: .venv\Scripts\activate
# macOS/Linux: source .venv/bin/activate
pip install torch torchvision matplotlib numpy torchmetrics onnx onnxruntime tqdm
  • GPU strongly recommended — Faster R-CNN fine-tuning on CPU works but is slow (~1–2 hours for 5 epochs vs ~15 min on GPU).

Step 1 — Penn-Fudan → detection targets

Goal: Each __getitem__ returns (tensor_image, target_dict) where:

python
target = {
    "boxes": FloatTensor[N, 4],   # xyxy absolute pixels
    "labels": Int64Tensor[N],     # all 1 for person (0 = background, unused in GT)
}

Penn-Fudan ships instance masks (pixel IDs per pedestrian). Use torchvision.ops.masks_to_boxes:

python
# train_detector.py
import json
import random
from pathlib import Path
 
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset, Subset
from torchvision import transforms
from PIL import Image
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.ops import masks_to_boxes
from tqdm import tqdm

try:
    from torchmetrics.detection import MeanAveragePrecision
except ImportError:
    from torchmetrics.detection.mean_ap import MeanAveragePrecision

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device:", device)

ROOT = Path(__file__).resolve().parent
DATA_DIR = ROOT / "data"
OUT_DIR = ROOT / "outputs"
OUT_DIR.mkdir(exist_ok=True)

NUM_CLASSES = 2          # background + person
PERSON_LABEL = 1
SCORE_THRESH_DEFAULT = 0.5

Dataset wrapper. torchvision.datasets does not ship a Penn-Fudan class, so download and unzip PennFudanPed.zip into data/ first. The wrapper reads data/PennFudanPed/PNGImages and data/PennFudanPed/PedMasks directly:

python
class PennFudanDetection(Dataset):
    """Wrap Penn-Fudan masks as Faster R-CNN targets."""

    def __init__(self, root: Path, transforms_fn=None):
        # Expects root/PennFudanPed/{PNGImages,PedMasks}; sorting keeps images and masks aligned.
        base = Path(root) / "PennFudanPed"
        self.img_paths = sorted((base / "PNGImages").glob("*.png"))
        self.mask_paths = sorted((base / "PedMasks").glob("*.png"))
        if not self.img_paths or len(self.img_paths) != len(self.mask_paths):
            raise FileNotFoundError(f"Penn-Fudan data missing or incomplete under {base}")
        self.transforms_fn = transforms_fn

    def __len__(self):
        return len(self.img_paths)

    def __getitem__(self, idx):
        img = Image.open(self.img_paths[idx]).convert("RGB")
        # Mask pixels hold instance IDs: 0 = background, 1..K = pedestrians.
        mask = torch.from_numpy(np.array(Image.open(self.mask_paths[idx]))).long()
        img = transforms.ToTensor()(img)  # float tensor [3, H, W] in [0, 1]

        obj_ids = torch.unique(mask)
        obj_ids = obj_ids[obj_ids != 0]
        if obj_ids.numel() == 0:
            boxes = torch.zeros((0, 4), dtype=torch.float32)
            labels = torch.zeros((0,), dtype=torch.int64)
        else:
            # One boolean mask per pedestrian: [K, H, W].
            masks = mask == obj_ids[:, None, None]
            boxes = masks_to_boxes(masks)
            labels = torch.ones((boxes.shape[0],), dtype=torch.int64) * PERSON_LABEL

        target = {"boxes": boxes, "labels": labels}
        if self.transforms_fn:
            img, target = self.transforms_fn(img, target)
        return img, target

Checkpoint — run a smoke test:

python
if __name__ == "__main__" and False:  # flip to True once to verify
    ds = PennFudanDetection(DATA_DIR)
    img, tgt = ds[0]
    print("image:", img.shape, "boxes:", tgt["boxes"].shape)

You should see boxes: torch.Size([N, 4]) with N ≥ 1 on most images.
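
If the mask-to-box step feels opaque, the sketch below reproduces the idea on a tiny nested list, without torch. The helper name `mask_to_boxes` and the toy mask are assumptions for illustration only. Like torchvision's masks_to_boxes, it takes the minimum and maximum pixel coordinates of each instance ID.

```python
def mask_to_boxes(mask):
    """Return {object_id: [x1, y1, x2, y2]} for a 2D list of integer IDs (0 = background)."""
    boxes = {}
    for y, row in enumerate(mask):
        for x, obj_id in enumerate(row):
            if obj_id == 0:
                continue  # background pixel
            if obj_id not in boxes:
                boxes[obj_id] = [x, y, x, y]
            else:
                b = boxes[obj_id]
                b[0], b[1] = min(b[0], x), min(b[1], y)
                b[2], b[3] = max(b[2], x), max(b[3], y)
    return boxes


# Toy mask with two "pedestrians" (IDs 1 and 2).
tiny_mask = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 2],
    [0, 0, 0, 0, 2],
]
boxes = mask_to_boxes(tiny_mask)
print(boxes)
```

Object 2 is a single pixel wide, so its box has x1 == x2. torchvision's Faster R-CNN rejects training boxes with non-positive width or height, so a one-pixel-wide mask would need to be filtered out. Real Penn-Fudan pedestrians are much larger than that.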


Step 2 — Train / val split and collate

Goal: ~80% train, 20% val — same split every run via SEED.

Detection cannot use default collate_fn — images have different sizes and different object counts:

python
def collate_fn(batch):
    return tuple(zip(*batch))
 
 
def split_indices(n: int, val_ratio: float = 0.2):
    idx = list(range(n))
    random.shuffle(idx)
    n_val = max(1, int(n * val_ratio))
    return idx[n_val:], idx[:n_val]
 
 
full_ds = PennFudanDetection(DATA_DIR)
train_idx, val_idx = split_indices(len(full_ds))
train_ds = Subset(full_ds, train_idx)
val_ds = Subset(full_ds, val_idx)
 
train_loader = DataLoader(
    train_ds, batch_size=2, shuffle=True, num_workers=0, collate_fn=collate_fn
)
val_loader = DataLoader(
    val_ds, batch_size=2, shuffle=False, num_workers=0, collate_fn=collate_fn
)
print(f"train {len(train_ds)} / val {len(val_ds)} images")

Choice | Value | Why
batch_size=2 | Small | Detection memory scales with objects + image size
num_workers=0 | Windows-safe | Avoid multiprocessing spawn issues while learning
No heavy aug yet | Baseline | Get mAP > 0 before adding RandomHorizontalFlip
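
To see why the one-line collate_fn is enough, the sketch below runs it on stand-in samples. The strings replace image tensors, and the sizes and box values are made up for illustration. The point is that `zip(*batch)` transposes a list of (image, target) pairs into a tuple of images and a tuple of targets, with no stacking, so different image sizes and object counts are fine.

```python
def collate_fn(batch):
    # Same one-liner as in the training script: transpose a list of pairs.
    return tuple(zip(*batch))


# Stand-ins for (image, target) samples; real samples hold tensors.
batch = [
    ("img_a (3x536x559)", {"boxes": [[10, 20, 50, 90]], "labels": [1]}),
    ("img_b (3x438x567)", {"boxes": [[5, 5, 40, 80], [60, 10, 95, 85]], "labels": [1, 1]}),
]
images, targets = collate_fn(batch)
print(images)
print([len(t["boxes"]) for t in targets])
```

The default collate would try to stack the images into one [B, 3, H, W] tensor and fail as soon as two images differ in size.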

Step 3 — Load pretrained Faster R-CNN

Goal: Replace the classification head for 2 classes (background + person).

python
def build_model(num_classes: int = NUM_CLASSES):
    model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model
 
 
model = build_model().to(device)
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005, momentum=0.9, weight_decay=1e-4)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)

What each piece does:

Component | Role
ResNet50-FPN backbone | Multi-scale features for small and large pedestrians
RPN | Proposes ~1000 region candidates per image at inference (torchvision default)
RoI Align + head | Classifies each proposal and refines box coordinates
FastRCNNPredictor | New cls_score and bbox_pred layers sized for your class count
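
The optimizer setup in this step pairs SGD with StepLR(step_size=3, gamma=0.1). The sketch below works out which learning rate each of the 5 epochs sees, assuming the scheduler is stepped once at the end of every epoch as in this project. The helper `step_lr` is a hypothetical stand-in that mirrors the StepLR formula; it is not a torch API.

```python
def step_lr(base_lr, epoch, step_size=3, gamma=0.1):
    """Learning rate used during 1-indexed `epoch`, with one scheduler step per finished epoch."""
    steps_taken = epoch - 1  # scheduler.step() calls made before this epoch starts
    return base_lr * gamma ** (steps_taken // step_size)


schedule = {epoch: step_lr(0.005, epoch) for epoch in range(1, 6)}
for epoch, lr in schedule.items():
    print(f"epoch {epoch}: lr = {lr:.4f}")
```

Epochs 1 to 3 train at 0.005 and epochs 4 to 5 at 0.0005, so most of the learning happens in the first three epochs and the last two mainly refine boxes.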

Step 4 — Training loop with loss breakdown

Goal: Sum loss dict scalars, backprop, save best checkpoint by val mAP@0.5.

python
def train_one_epoch(model, loader, optimizer, epoch):
    model.train()
    running = {}
    for images, targets in tqdm(loader, desc=f"train e{epoch}"):
        images = [im.to(device) for im in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
 
        loss_dict = model(images, targets)
        losses = sum(loss_dict.values())
 
        optimizer.zero_grad()
        losses.backward()
        optimizer.step()
 
        for k, v in loss_dict.items():
            running[k] = running.get(k, 0.0) + v.item()
    n = len(loader)
    return {k: v / n for k, v in running.items()}

Typical loss keys (torchvision Faster R-CNN):

  • loss_classifier, loss_box_reg — RoI head
  • loss_objectness, loss_rpn_box_reg — RPN

If all losses → 0 in epoch 1 but mAP stays 0, you likely have a label bug (wrong class indices or empty targets). Re-read Lesson 3.
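
A quick way to rule out the label bug described above is to validate a few targets before training. The sketch below is a hypothetical helper (`check_target` is not part of torchvision) that works on plain lists; with tensors you could pass `target["boxes"].tolist()` and `target["labels"].tolist()`. An occasional empty target is legitimate, so treat that message as a warning unless it shows up on most images.

```python
def check_target(boxes, labels, num_classes=2):
    """Return a list of problems found in one image's target (empty list = looks fine)."""
    problems = []
    if len(boxes) != len(labels):
        problems.append("boxes and labels differ in length")
    if len(boxes) == 0:
        problems.append("no ground-truth boxes")
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        if x2 <= x1 or y2 <= y1:
            problems.append(f"box {i} has non-positive width or height")
    for i, label in enumerate(labels):
        # Foreground labels start at 1; 0 is reserved for background.
        if not 1 <= label < num_classes:
            problems.append(f"label {i} is {label}, expected 1..{num_classes - 1}")
    return problems


good = check_target([[10.0, 20.0, 50.0, 90.0]], [1])
bad = check_target([[10.0, 20.0, 50.0, 90.0]], [0])
print(good)
print(bad)
```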


Step 5 — Validation mAP with torchmetrics

Goal: Honest detection metrics — not pixel accuracy.

python
@torch.no_grad()
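def evaluate_map(model, loader, score_thresh: float = 0.0):
    # mAP integrates over all score thresholds, so by default keep every detection
    # the model returns; filtering at 0.5 here would truncate the PR curve.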
    model.eval()
    metric = MeanAveragePrecision(box_format="xyxy", iou_type="bbox")
    metric.warn_on_many_detections = False
 
    for images, targets in tqdm(loader, desc="eval"):
        images = [im.to(device) for im in images]
        outputs = model(images)
 
        preds = []
        gts = []
        for out, tgt in zip(outputs, targets):
            keep = out["scores"] >= score_thresh
            preds.append({
                "boxes": out["boxes"][keep].cpu(),
                "scores": out["scores"][keep].cpu(),
                "labels": out["labels"][keep].cpu(),
            })
            gts.append({
                "boxes": tgt["boxes"].cpu(),
                "labels": tgt["labels"].cpu(),
            })
        metric.update(preds, gts)
 
    stats = metric.compute()
    return {
        "map_50": float(stats["map_50"]),
        "map": float(stats["map"]),
    }

Metric | Meaning
map_50 | AP at IoU ≥ 0.5: “did you find the person roughly?”
map | COCO-style average over IoU 0.5:0.95, rewards tight boxes

Target after 5 epochs (GPU): map_50 ≥ 0.75 on this tiny dataset. map is often lower (0.4–0.6); that is normal, because it also averages over stricter IoU thresholds.
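
The gap between map_50 and map comes down to how IoU treats loose boxes. The sketch below uses made-up boxes and a hand-written `iou_xyxy` helper (an illustrative assumption, not the torchmetrics internals) to count how many of the ten COCO IoU thresholds a prediction clears.

```python
def iou_xyxy(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical boxes: a 100x200 pedestrian, a padded prediction, a tight one.
gt = [100, 100, 200, 300]
loose = [90, 90, 220, 320]
tight = [102, 98, 201, 303]

iou_thresholds = [0.5 + 0.05 * i for i in range(10)]  # 0.50, 0.55, ..., 0.95
iou_loose = iou_xyxy(gt, loose)
iou_tight = iou_xyxy(gt, tight)
passed_loose = sum(iou_loose >= t for t in iou_thresholds)
passed_tight = sum(iou_tight >= t for t in iou_thresholds)
print(round(iou_loose, 3), passed_loose)
print(round(iou_tight, 3), passed_tight)
```

Both predictions count as hits for map_50, but the loose box clears only 4 of the 10 thresholds while the tight one clears 9. That difference is what the stricter map score picks up.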


Step 6 — Full train + save best

python
EPOCHS = 5
best_map50 = -1.0
history = {"train_loss": [], "map_50": [], "map": []}
 
for epoch in range(1, EPOCHS + 1):
    losses = train_one_epoch(model, train_loader, optimizer, epoch)
    metrics = evaluate_map(model, val_loader)
    lr_scheduler.step()
 
    total_loss = sum(losses.values())
    history["train_loss"].append(total_loss)
    history["map_50"].append(metrics["map_50"])
    history["map"].append(metrics["map"])
 
    print(f"epoch {epoch} loss={total_loss:.4f} mAP@50={metrics['map_50']:.3f} mAP={metrics['map']:.3f}")
 
    if metrics["map_50"] > best_map50:
        best_map50 = metrics["map_50"]
        torch.save(model.state_dict(), OUT_DIR / "best_detector.pt")
 
with open(OUT_DIR / "metrics.json", "w") as f:
    json.dump({"best_map_50": best_map50, "history": history}, f, indent=2)

Plot loss + mAP:

python
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
ax[0].plot(history["train_loss"], marker="o")
ax[0].set_title("Train loss (sum)")
ax[0].set_xlabel("epoch")
ax[1].plot(history["map_50"], marker="o", label="mAP@0.5")
ax[1].plot(history["map"], marker="o", label="mAP@0.5:0.95")
ax[1].legend()
ax[1].set_xlabel("epoch")
fig.tight_layout()
fig.savefig(OUT_DIR / "loss_curves.png", dpi=150)
plt.close()

Step 7 — Visualize predictions vs ground truth

Goal: Green = GT, red = predictions above threshold.

python
def draw_boxes(ax, boxes, color, label_prefix=""):
    for i, box in enumerate(boxes):
        x1, y1, x2, y2 = box.tolist()
        rect = patches.Rectangle(
            (x1, y1), x2 - x1, y2 - y1, linewidth=2, edgecolor=color, facecolor="none"
        )
        ax.add_patch(rect)
 
 
@torch.no_grad()
def visualize_detections(model, dataset, indices, score_thresh=0.5, save_path=None):
    model.eval()
    n = len(indices)
    fig, axes = plt.subplots(1, n, figsize=(4 * n, 4))
    if n == 1:
        axes = [axes]
 
    for ax, idx in zip(axes, indices):
        img, tgt = dataset[idx]
        pred = model([img.to(device)])[0]
        keep = pred["scores"] >= score_thresh
 
        ax.imshow(img.permute(1, 2, 0).cpu().numpy())
        draw_boxes(ax, tgt["boxes"], "lime")
        draw_boxes(ax, pred["boxes"][keep].cpu(), "red")
        ax.set_title(f"#{idx}")
        ax.axis("off")
 
    fig.tight_layout()
    if save_path:
        fig.savefig(save_path, dpi=150)
    plt.close()
 
 
model.load_state_dict(torch.load(OUT_DIR / "best_detector.pt", map_location=device))
visualize_detections(model, full_ds, val_idx[:3], save_path=OUT_DIR / "detections_val.png")

Step 8 — Threshold sweep (precision vs recall)

Goal: Pick a deployment threshold — not always 0.5.

python
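@torch.no_grad()
def collect_scores_and_matches(model, loader, iou_thresh=0.5):
    """PR sweep inputs: one (score, is_tp) pair per detection.

    Detections are matched greedily in descending score order and each GT box
    can be claimed once, so duplicate hits count as false positives and recall
    cannot exceed 1.
    """
    from torchvision.ops import box_iou

    model.eval()
    all_scores = []
    all_tps = []

    for images, targets in loader:
        images = [im.to(device) for im in images]
        outputs = model(images)
        for out, tgt in zip(outputs, targets):
            gt_boxes = tgt["boxes"]
            scores = out["scores"].cpu()
            boxes = out["boxes"].cpu()
            if boxes.numel() == 0:
                continue  # nothing detected; misses are counted through total_gt
            order = scores.argsort(descending=True)
            scores, boxes = scores[order], boxes[order]
            is_tp = torch.zeros(len(boxes), dtype=torch.int64)
            # Images without GT keep is_tp all zero: every detection there is a false positive.
            if gt_boxes.numel() > 0:
                ious = box_iou(boxes, gt_boxes)
                gt_taken = torch.zeros(len(gt_boxes), dtype=torch.bool)
                for d in range(len(boxes)):
                    iou_d = ious[d].clone()
                    iou_d[gt_taken] = -1.0  # already-matched GT boxes are unavailable
                    best_iou, best_gt = iou_d.max(dim=0)
                    if best_iou >= iou_thresh:
                        is_tp[d] = 1
                        gt_taken[best_gt] = True
            all_scores.extend(scores.tolist())
            all_tps.extend(is_tp.tolist())
    return np.array(all_scores), np.array(all_tps)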
 
 
scores, tps = collect_scores_and_matches(model, val_loader)
thresholds = np.linspace(0.05, 0.95, 19)
precisions, recalls = [], []
total_gt = sum(len(full_ds[i][1]["boxes"]) for i in val_idx)
 
for t in thresholds:
    mask = scores >= t
    tp = tps[mask].sum()
    fp = mask.sum() - tp
    fn = total_gt - tp
    precisions.append(tp / (tp + fp + 1e-9))
    recalls.append(tp / (total_gt + 1e-9))
 
plt.figure(figsize=(6, 4))
plt.plot(recalls, precisions, marker="o")
for t, r, p in zip(thresholds[::3], recalls[::3], precisions[::3]):
    plt.annotate(f"{t:.2f}", (r, p), fontsize=8)
plt.xlabel("recall")
plt.ylabel("precision")
plt.title("PR sweep (val, IoU≥0.5)")
plt.tight_layout()
plt.savefig(OUT_DIR / "pr_threshold_sweep.png", dpi=150)
plt.close()

Use the sweep to justify SCORE_THRESH_DEFAULT in your README.
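
One defensible way to turn the sweep into a single number is to pick the threshold with the best F1 score. The sketch below does that on hypothetical sweep values, which are not results from this project. F1 is only one possible criterion and is an assumption here: an application where a missed pedestrian is costly would weight recall more heavily.

```python
def best_f1_threshold(thresholds, precisions, recalls):
    """Return (threshold, f1) for the sweep point with the highest F1 score."""
    best_t, best_f1 = None, -1.0
    for t, p, r in zip(thresholds, precisions, recalls):
        f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Hypothetical sweep values for illustration only.
thresholds = [0.3, 0.5, 0.7, 0.9]
precisions = [0.70, 0.85, 0.93, 0.98]
recalls = [0.95, 0.90, 0.78, 0.50]

best_t, best_f1 = best_f1_threshold(thresholds, precisions, recalls)
print(best_t, round(best_f1, 3))
```

With the real `thresholds`, `precisions`, and `recalls` lists from the sweep, the same call gives you a threshold plus a one-line justification for the README.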


Step 9 — Failure gallery

Goal: Three categories — missed (FN), false alarm (FP), loose box (low IoU).

Manually pick indices from val where:

  • No pred box overlaps GT (missed pedestrian).
  • High-score pred on background (false alarm).
  • Pred overlaps but visibly too large/small (localization error).
python
visualize_detections(
    model, full_ds, val_idx[5:8], score_thresh=0.3, save_path=OUT_DIR / "failure_gallery.png"
)

In README.md, write one sentence per failure: what happened and which module lesson explains it.
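
If scanning validation images by eye is slow, a rough automatic pre-filter can suggest candidates for each category. The sketch below is an assumption-laden helper: `categorize`, the 0.5 "hit" cutoff, and the 0.75 "tight" cutoff are illustrative choices, not part of torchvision or this project's script. It works on plain lists, so convert tensors with `.tolist()` first.

```python
def iou_xyxy(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def categorize(gt_boxes, pred_boxes, hit_iou=0.5, tight_iou=0.75):
    """Return the sorted failure types present in one image."""
    failures = set()
    for gt in gt_boxes:
        best = max((iou_xyxy(gt, p) for p in pred_boxes), default=0.0)
        if best < hit_iou:
            failures.add("missed")        # no prediction covers this pedestrian
        elif best < tight_iou:
            failures.add("loose_box")     # found, but poorly localized
    for p in pred_boxes:
        best = max((iou_xyxy(p, gt) for gt in gt_boxes), default=0.0)
        if best < hit_iou:
            failures.add("false_alarm")   # prediction on background
    return sorted(failures)


# Hypothetical image: one pedestrian, one padded prediction, one stray prediction.
gt_boxes = [[100, 100, 200, 300]]
pred_boxes = [[90, 90, 220, 320], [400, 50, 460, 200]]
result = categorize(gt_boxes, pred_boxes)
print(result)
```

Run it over the validation indices after applying your score threshold, then still look at the images: the README reflections should describe what you actually see.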


Step 10 — ONNX export (deployment handoff)

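Goal: Fixed-size tensor in → final boxes, scores, and labels out. torchvision's Faster R-CNN runs NMS inside the model, so the exported graph already includes it; only score thresholding stays in app code (Lesson 5).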

python
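class DetectorWrapper(nn.Module):
    """Return plain tensors instead of a list of dicts so ONNX gets named outputs."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, x):
        out = self.model(x)[0]  # batch of one image
        return out["boxes"], out["scores"], out["labels"]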
 
 
model.eval()
wrapper = DetectorWrapper(model).cpu()
dummy = torch.randn(1, 3, 800, 800)
torch.onnx.export(
    wrapper,
    dummy,
    OUT_DIR / "detector.onnx",
    input_names=["image"],
    output_names=["boxes", "scores", "labels"],
    dynamic_axes=None,
    opset_version=17,
)
print("exported:", OUT_DIR / "detector.onnx")

Note: Real photos must be letterboxed or resized to 800×800 and scaled to [0, 1] floats, as ToTensor does in training. ImageNet mean/std normalization happens inside the model, so do not apply it again. Module 7 covers full pipeline integration.
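
Letterboxing changes the coordinate system, so boxes predicted on the 800×800 canvas have to be mapped back to the original photo. The sketch below shows the arithmetic under one common convention (scale to fit, then pad symmetrically). The helper names and the 1280×720 example are assumptions for illustration; Module 7 may use a different layout.

```python
def letterbox_params(width, height, target=800):
    """Scale factor and padding that fit a width x height image into a target x target canvas."""
    scale = min(target / width, target / height)
    new_w, new_h = round(width * scale), round(height * scale)
    pad_x = (target - new_w) // 2  # left padding in pixels
    pad_y = (target - new_h) // 2  # top padding in pixels
    return scale, pad_x, pad_y


def box_to_original(box, scale, pad_x, pad_y):
    """Map an xyxy box from canvas coordinates back to original image pixels."""
    x1, y1, x2, y2 = box
    return [(x1 - pad_x) / scale, (y1 - pad_y) / scale,
            (x2 - pad_x) / scale, (y2 - pad_y) / scale]


# Hypothetical 1280x720 photo and one box predicted on the 800x800 canvas.
scale, pad_x, pad_y = letterbox_params(1280, 720)
restored = box_to_original([100, 275, 200, 475], scale, pad_x, pad_y)
print(scale, pad_x, pad_y)
print(restored)
```

Forgetting to undo the padding is a common source of boxes that look shifted vertically on wide photos.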


Deliverables checklist

  • outputs/best_detector.pt — best val mAP@0.5 checkpoint
  • outputs/metrics.json — includes best_map_50
  • outputs/loss_curves.png — loss decreasing, mAP increasing
  • outputs/detections_val.png — green GT + red preds
  • outputs/pr_threshold_sweep.png — justified threshold
  • outputs/failure_gallery.png — 3 annotated failure modes
  • outputs/detector.onnx — loads in onnxruntime
  • README.md — metrics table + reflections

Troubleshooting

Symptom | Likely cause | Fix
mAP = 0 after training | Labels all 0 or wrong num_classes | Use PERSON_LABEL = 1, NUM_CLASSES = 2
CUDA OOM | Batch too large / huge images | batch_size=1, shorter side resize
Loss NaN | LR too high | Try lr=0.002 or freeze backbone first epoch
Thousands of red boxes | Threshold too low | Raise to 0.5+; check NMS is inside model (it is for Faster R-CNN)
Great mAP, bad visuals | Loose boxes at IoU 0.5 | Report map (0.5:0.95); inspect failure gallery

Extensions (optional)

  1. Horizontal flip aug — flip image + flip box x coordinates.
  2. Freeze backbone for 1 epoch, then unfreeze — stabilizes early RPN.
  3. Compare score_thresh 0.3 vs 0.7 on the same images — write precision/recall tradeoff.
  4. Try retinanet_resnet50_fpn — one-stage baseline on same split.
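
For extension 1, the box arithmetic is the part that usually goes wrong. Here is a minimal sketch on plain lists; `hflip_boxes` is a hypothetical helper name. The image itself would be mirrored separately, for example with `img.flip(-1)` on a [3, H, W] tensor.

```python
def hflip_boxes(boxes, image_width):
    """Mirror xyxy boxes horizontally; x1 and x2 swap roles so x1 < x2 still holds."""
    return [[image_width - x2, y1, image_width - x1, y2] for x1, y1, x2, y2 in boxes]


# Hypothetical box in a 200-pixel-wide image.
boxes = [[10.0, 20.0, 50.0, 90.0]]
flipped = hflip_boxes(boxes, image_width=200)
print(flipped)

# Flipping twice should return the original boxes.
print(hflip_boxes(flipped, image_width=200) == boxes)
```

Computing only `image_width - x1` and `image_width - x2` without swapping them produces boxes with x1 > x2, which torchvision's detection models reject during training.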

What's next

Module 5 — Segmentation and instance masks — pixel-level labels, U-Net, and masks instead of boxes.