Project: train and evaluate an object detector

Before we begin

This is your first object detection project. You will fine-tune a Faster R-CNN detector on the Penn-Fudan pedestrian dataset, track training losses, compute mAP on a validation split, sweep score thresholds, build a failure gallery, and export to ONNX for Module 7.

You are not inventing a new architecture — you are learning the full detection workflow: labels → collate → loss dict → eval → metrics → deployment.

Figure

What good evaluation looks like

AP summarizes the precision–recall curve at one IoU threshold, and COCO-style mAP averages AP over IoU thresholds from 0.5 to 0.95. Always pair the number with visual failure analysis.

How this connects to Module 4

Lesson	Where you use it
Classification → detection	Box format `xyxy`, one class (`person`), variable object count
Architectures	Faster R-CNN: backbone + FPN + RPN + RoI head
Training	`model(images, targets)` loss dict, label 0 = background
IoU / NMS / mAP	`torchmetrics` mAP, score threshold tuning
On-device	ONNX export, fixed input size, post-process on CPU

Folder layout:

text

pedestrian-detector/
  data/                          # torchvision download cache
  train_detector.py              # train, evaluate, plot, export
  outputs/
    loss_curves.png
    detections_val.png           # green GT, red preds
    pr_threshold_sweep.png
    failure_gallery.png
    metrics.json                 # mAP@0.5, mAP@0.5:0.95
    best_detector.pt
    detector.onnx
  README.md                      # metrics table + 3 failure reflections

Estimated time: 4–6 hours the first time; about 2.5 hours if you are comfortable with the Module 3 project.

What you will build

Convert Penn-Fudan instance masks → COCO-style boxes + labels targets.
Fine-tune fasterrcnn_resnet50_fpn (COCO-pretrained) for 1 foreground class.
Log RPN and RoI losses every epoch — spot collapse early.
Compute validation mAP@0.5 and mAP@0.5:0.95 with torchmetrics.
Sweep confidence thresholds and plot precision vs recall.
Visualize predictions vs ground truth and a failure gallery.
Export ONNX with a fixed 800×800 input (deployment pattern from Lesson 5).

Before you start

Finish the Module 4 quiz.
Python 3.10+:

bash

mkdir pedestrian-detector && cd pedestrian-detector
python -m venv .venv
# Windows: .venv\Scripts\activate
# macOS/Linux: source .venv/bin/activate
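# pycocotools is the COCO evaluation backend that recent torchmetrics releases need for MeanAveragePrecision
pip install torch torchvision matplotlib numpy torchmetrics pycocotools onnx onnxruntime tqdm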

GPU strongly recommended — Faster R-CNN fine-tuning on CPU works but is slow (~1–2 hours for 5 epochs vs ~15 min on GPU).

Step 1 — Penn-Fudan → detection targets

Goal: Each __getitem__ returns (tensor_image, target_dict) where:

python

target = {
    "boxes": FloatTensor[N, 4],   # xyxy absolute pixels
    "labels": Int64Tensor[N],     # all 1 for person (0 = background, unused in GT)
}

Penn-Fudan ships instance masks (pixel IDs per pedestrian). Use torchvision.ops.masks_to_boxes:

python

# train_detector.py
import json
import random
from pathlib import Path
 
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset, Subset
from torchvision import transforms
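# Assumes your torchvision build provides PennFudanPed; if this import fails,
# read PNGImages/ and PedMasks/ from the extracted PennFudanPed archive instead.
from torchvision.datasets import PennFudanPed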
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.ops import masks_to_boxes
from tqdm import tqdm
 
try:
    from torchmetrics.detection import MeanAveragePrecision
except ImportError:
    from torchmetrics.detection.mean_ap import MeanAveragePrecision
 
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
 
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device:", device)
 
ROOT = Path(__file__).resolve().parent
DATA_DIR = ROOT / "data"
OUT_DIR = ROOT / "outputs"
OUT_DIR.mkdir(exist_ok=True)
 
NUM_CLASSES = 2          # background + person
PERSON_LABEL = 1
SCORE_THRESH_DEFAULT = 0.5

Dataset wrapper:

python

class PennFudanDetection(Dataset):
    """Wrap Penn-Fudan masks as Faster R-CNN targets."""
 
    def __init__(self, root: Path, transforms_fn=None):
        self.base = PennFudanPed(str(root), download=True)
        self.transforms_fn = transforms_fn
 
    def __len__(self):
        return len(self.base)
 
    def __getitem__(self, idx):
        img, mask = self.base[idx]  # PIL, LongTensor [H, W]
        img = transforms.ToTensor()(img)
 
        obj_ids = torch.unique(mask)
        obj_ids = obj_ids[obj_ids != 0]
        if obj_ids.numel() == 0:
            boxes = torch.zeros((0, 4), dtype=torch.float32)
            labels = torch.zeros((0,), dtype=torch.int64)
        else:
            masks = mask == obj_ids[:, None, None]
            boxes = masks_to_boxes(masks)
            labels = torch.ones((boxes.shape[0],), dtype=torch.int64) * PERSON_LABEL
 
        target = {"boxes": boxes, "labels": labels}
        if self.transforms_fn:
            img, target = self.transforms_fn(img, target)
        return img, target
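
To make the mask-to-box step concrete, here is a minimal plain-Python sketch of what the conversion computes for a tiny instance-ID mask. The helper `mask_to_xyxy` and the toy mask are illustrative assumptions, not part of the project script, where `masks_to_boxes` does this work on tensors.

```python
def mask_to_xyxy(mask):
    """Return {instance_id: [x1, y1, x2, y2]} for a 2-D instance-ID mask.

    `mask` is a list of rows; 0 is background and every other integer
    marks the pixels of one pedestrian. Coordinates are the smallest and
    largest pixel indices covered by each instance.
    """
    boxes = {}
    for y, row in enumerate(mask):
        for x, obj_id in enumerate(row):
            if obj_id == 0:
                continue  # background pixel
            if obj_id not in boxes:
                boxes[obj_id] = [x, y, x, y]
            else:
                b = boxes[obj_id]
                b[0], b[1] = min(b[0], x), min(b[1], y)
                b[2], b[3] = max(b[2], x), max(b[3], y)
    return boxes


# Hypothetical 3 x 5 mask with two instances (IDs 1 and 2).
toy_mask = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 2],
    [0, 0, 0, 0, 2],
]
toy_boxes = mask_to_xyxy(toy_mask)
print(toy_boxes)
```

Instance 2 is a single pixel wide, so its box has `x1 == x2`. torchvision detection models reject training boxes with non-positive width or height, so it is worth filtering such boxes out if your own data contains very thin masks.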

Checkpoint — run a smoke test:

python

if __name__ == "__main__" and False:  # flip to True once to verify
    ds = PennFudanDetection(DATA_DIR)
    img, tgt = ds[0]
    print("image:", img.shape, "boxes:", tgt["boxes"].shape)

You should see boxes: torch.Size([N, 4]) with N ≥ 1 on most images.
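
If the smoke test fails on the `PennFudanPed` import, one fallback is to read the extracted archive yourself. The sketch below only pairs image files with mask files. It assumes the archive layout `PNGImages/<name>.png` with masks at `PedMasks/<name>_mask.png`, and `pair_image_mask_paths` is a hypothetical helper name; loading the pixels (for example with PIL) is left out.

```python
from pathlib import Path
import tempfile


def pair_image_mask_paths(root):
    """Pair each image in PNGImages/ with `<stem>_mask.png` in PedMasks/."""
    root = Path(root)
    pairs = []
    for img_path in sorted((root / "PNGImages").glob("*.png")):
        mask_path = root / "PedMasks" / f"{img_path.stem}_mask.png"
        if not mask_path.exists():
            raise FileNotFoundError(f"no mask for {img_path.name}")
        pairs.append((img_path, mask_path))
    return pairs


# Self-check on an empty stand-in directory tree (no real images needed).
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    (demo_root / "PNGImages").mkdir()
    (demo_root / "PedMasks").mkdir()
    for stem in ["FudanPed00002", "FudanPed00001"]:
        (demo_root / "PNGImages" / f"{stem}.png").touch()
        (demo_root / "PedMasks" / f"{stem}_mask.png").touch()
    demo_pairs = [(i.name, m.name) for i, m in pair_image_mask_paths(demo_root)]

print(demo_pairs)
```

Sorting the image paths keeps dataset indices stable across machines, which matters because the train/val split is defined by index.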

Step 2 — Train / val split and collate

Goal: ~80% train, 20% val — same split every run via SEED.

Detection cannot use the default collate_fn: images differ in size and targets differ in object count, so they cannot be stacked into a single tensor.

python

def collate_fn(batch):
    return tuple(zip(*batch))
 
 
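def split_indices(n: int, val_ratio: float = 0.2, seed: int = SEED):
    idx = list(range(n))
    # A local RNG keeps the split identical even if other code consumes random numbers first.
    random.Random(seed).shuffle(idx)
    n_val = max(1, int(n * val_ratio))
    return idx[n_val:], idx[:n_val]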
 
 
full_ds = PennFudanDetection(DATA_DIR)
train_idx, val_idx = split_indices(len(full_ds))
train_ds = Subset(full_ds, train_idx)
val_ds = Subset(full_ds, val_idx)
 
train_loader = DataLoader(
    train_ds, batch_size=2, shuffle=True, num_workers=0, collate_fn=collate_fn
)
val_loader = DataLoader(
    val_ds, batch_size=2, shuffle=False, num_workers=0, collate_fn=collate_fn
)
print(f"train {len(train_ds)} / val {len(val_ds)} images")

Choice	Value	Why
`batch_size=2`	Small	Detection memory scales with objects + image size
`num_workers=0`	Windows-safe	Avoid multiprocessing spawn issues while learning
No heavy aug yet	Baseline	Get mAP > 0 before adding `RandomHorizontalFlip`
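
To see what the collate function returns, here is a small sketch with stand-in data. The "images" are just made-up shape tuples and the targets are plain lists, so the block runs without torch; the real loader passes tensors through the same function.

```python
def collate_fn(batch):
    # Same one-liner as in the script: transpose a list of (image, target) pairs.
    return tuple(zip(*batch))


# Hypothetical batch: two images of different sizes with 1 and 2 boxes.
batch = [
    ((3, 536, 559), {"boxes": [[10, 20, 50, 120]], "labels": [1]}),
    ((3, 414, 455), {"boxes": [[5, 5, 40, 90], [60, 10, 95, 100]], "labels": [1, 1]}),
]
images, targets = collate_fn(batch)

print(images)                                  # tuple of per-image entries, not a stacked tensor
print([len(t["boxes"]) for t in targets])      # per-image object counts stay separate
```

Because nothing is stacked, each image keeps its own size and each target keeps its own box count, which is the input format `model(images, targets)` expects.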

Step 3 — Load pretrained Faster R-CNN

Goal: Replace the box predictor (classification and box-regression layers) for 2 classes (background + person).

python

def build_model(num_classes: int = NUM_CLASSES):
    model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model
 
 
model = build_model().to(device)
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005, momentum=0.9, weight_decay=1e-4)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)

What each piece does:

Component	Role
ResNet50-FPN backbone	Multi-scale features — small and large pedestrians
RPN	Proposes up to ~1000 region candidates per image at inference (torchvision's default keeps up to 2000 during training)
RoI Align + head	Classifies each proposal and refines box coordinates
`FastRCNNPredictor`	New `cls_score` and `bbox_pred` layers sized for your class count
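
The optimizer setup in this step pairs `lr=0.005` with `StepLR(step_size=3, gamma=0.1)`. The plain-Python sketch below works out which learning rate each epoch would see, assuming the scheduler is stepped once at the end of every epoch; `step_lr` is a hypothetical helper that mirrors the step-decay formula rather than calling PyTorch.

```python
def step_lr(base_lr, epoch, step_size=3, gamma=0.1):
    """Learning rate used during `epoch` (1-indexed) under step decay.

    Assumes the scheduler is stepped once after every epoch, so the rate
    is multiplied by `gamma` after each block of `step_size` completed epochs.
    """
    completed = epoch - 1
    return base_lr * gamma ** (completed // step_size)


schedule = {epoch: step_lr(0.005, epoch) for epoch in range(1, 6)}
for epoch, lr in schedule.items():
    print(f"epoch {epoch}: lr={lr:.5f}")
```

With 5 epochs, the first three run at 0.005 and the last two at 0.0005, so most of the learning happens before the drop and the final epochs mainly refine the boxes.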

Step 4 — Training loop with loss breakdown

Goal: Sum loss dict scalars, backprop, save best checkpoint by val mAP@0.5.

python

def train_one_epoch(model, loader, optimizer, epoch):
    model.train()
    running = {}
    for images, targets in tqdm(loader, desc=f"train e{epoch}"):
        images = [im.to(device) for im in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
 
        loss_dict = model(images, targets)
        losses = sum(loss_dict.values())
 
        optimizer.zero_grad()
        losses.backward()
        optimizer.step()
 
        for k, v in loss_dict.items():
            running[k] = running.get(k, 0.0) + v.item()
    n = len(loader)
    return {k: v / n for k, v in running.items()}

Typical loss keys (torchvision Faster R-CNN):

loss_classifier, loss_box_reg — RoI head
loss_objectness, loss_rpn_box_reg — RPN

If all losses → 0 in epoch 1 but mAP stays 0, you likely have a label bug (wrong class indices or empty targets). Re-read Lesson 3.
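
One way to catch such label bugs before training is a quick sanity check on a few targets. The sketch below is an assumption about what is worth checking, not part of the course script; `check_target` is a hypothetical helper that works on plain lists (call `.tolist()` on the tensors first).

```python
def check_target(boxes, labels, num_classes=2):
    """Return a list of human-readable problems found in one target.

    `boxes` is a list of [x1, y1, x2, y2] and `labels` a list of ints.
    Foreground labels must lie in 1..num_classes-1 (0 is background).
    """
    problems = []
    if len(boxes) != len(labels):
        problems.append(f"{len(boxes)} boxes but {len(labels)} labels")
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        if x2 <= x1 or y2 <= y1:
            problems.append(f"box {i} has non-positive width or height")
    for i, label in enumerate(labels):
        if not 1 <= label < num_classes:
            problems.append(f"label {i} = {label} is outside 1..{num_classes - 1}")
    return problems


good = check_target([[10, 20, 50, 120]], [1])
bad = check_target([[10, 20, 10, 120]], [0])   # zero-width box, background label
print(good)
print(bad)
```

Running this over the first few dataset items takes seconds and rules out the most common cause of a stuck mAP.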

Step 5 — Validation mAP with torchmetrics

Goal: Honest detection metrics — not pixel accuracy.

python

@torch.no_grad()
def evaluate_map(model, loader, score_thresh: float = SCORE_THRESH_DEFAULT):
    model.eval()
    metric = MeanAveragePrecision(box_format="xyxy", iou_type="bbox")
    metric.warn_on_many_detections = False
 
    for images, targets in tqdm(loader, desc="eval"):
        images = [im.to(device) for im in images]
        outputs = model(images)
 
        preds = []
        gts = []
        for out, tgt in zip(outputs, targets):
            keep = out["scores"] >= score_thresh
            preds.append({
                "boxes": out["boxes"][keep].cpu(),
                "scores": out["scores"][keep].cpu(),
                "labels": out["labels"][keep].cpu(),
            })
            gts.append({
                "boxes": tgt["boxes"].cpu(),
                "labels": tgt["labels"].cpu(),
            })
        metric.update(preds, gts)
 
    stats = metric.compute()
    return {
        "map_50": float(stats["map_50"]),
        "map": float(stats["map"]),
    }

Metric	Meaning
`map_50`	AP at IoU ≥ 0.5 — “did you find the person roughly?”
`map`	COCO-style average over IoU 0.5:0.95 — rewards tight boxes

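Target after 5 epochs (GPU): map_50 ≥ 0.75 on this tiny dataset. map is often lower (0.4–0.6), which is normal. Note that evaluate_map drops predictions below score_thresh before scoring, so these numbers describe the thresholded detector; pass score_thresh=0.0 to score every detection the model returns, which is closer to the standard COCO-style protocol.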
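
The gap between map_50 and map comes down to how tightly boxes fit. As a worked example, the sketch below computes IoU for a tight and a loose prediction against one ground-truth box and counts how many of the ten COCO IoU thresholds each would pass. The boxes are made up for illustration and `box_iou_xyxy` is a hypothetical helper, not the torchmetrics implementation.

```python
def box_iou_xyxy(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) in absolute pixels."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


gt = (100, 100, 200, 300)       # 100 x 200 ground-truth box
tight = (105, 110, 195, 295)    # slightly inside the GT
loose = (90, 80, 220, 320)      # generous box around the GT

iou_tight = box_iou_xyxy(gt, tight)
iou_loose = box_iou_xyxy(gt, loose)

coco_thresholds = [0.5 + 0.05 * i for i in range(10)]   # 0.50, 0.55, ..., 0.95
passes_tight = sum(iou_tight >= t for t in coco_thresholds)
passes_loose = sum(iou_loose >= t for t in coco_thresholds)

print(f"tight IoU={iou_tight:.3f}, passes {passes_tight}/10 thresholds")
print(f"loose IoU={iou_loose:.3f}, passes {passes_loose}/10 thresholds")
```

Both predictions count as hits at IoU 0.5, but the loose one stops counting well before 0.75, which is why a detector with loose boxes can post a strong map_50 and a much weaker map.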

Step 6 — Full train + save best

python

EPOCHS = 5
best_map50 = -1.0
history = {"train_loss": [], "map_50": [], "map": []}
 
for epoch in range(1, EPOCHS + 1):
    losses = train_one_epoch(model, train_loader, optimizer, epoch)
    metrics = evaluate_map(model, val_loader)
    lr_scheduler.step()
 
    total_loss = sum(losses.values())
    history["train_loss"].append(total_loss)
    history["map_50"].append(metrics["map_50"])
    history["map"].append(metrics["map"])
 
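    # Print the per-component losses too, so an RPN or RoI-head collapse is visible early.
    breakdown = " ".join(f"{k}={v:.3f}" for k, v in losses.items())
    print(f"epoch {epoch} loss={total_loss:.4f} ({breakdown}) mAP@50={metrics['map_50']:.3f} mAP={metrics['map']:.3f}")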
 
    if metrics["map_50"] > best_map50:
        best_map50 = metrics["map_50"]
        torch.save(model.state_dict(), OUT_DIR / "best_detector.pt")
 
with open(OUT_DIR / "metrics.json", "w") as f:
    json.dump({"best_map_50": best_map50, "history": history}, f, indent=2)

Plot loss + mAP:

python

fig, ax = plt.subplots(1, 2, figsize=(10, 4))
ax[0].plot(history["train_loss"], marker="o")
ax[0].set_title("Train loss (sum)")
ax[0].set_xlabel("epoch")
ax[1].plot(history["map_50"], marker="o", label="mAP@0.5")
ax[1].plot(history["map"], marker="o", label="mAP@0.5:0.95")
ax[1].legend()
ax[1].set_xlabel("epoch")
fig.tight_layout()
fig.savefig(OUT_DIR / "loss_curves.png", dpi=150)
plt.close()

Step 7 — Visualize predictions vs ground truth

Goal: Green = GT, red = predictions above threshold.

python

def draw_boxes(ax, boxes, color):
    """Draw each xyxy box as an unfilled rectangle on `ax`."""
    for box in boxes:
        x1, y1, x2, y2 = box.tolist()
        rect = patches.Rectangle(
            (x1, y1), x2 - x1, y2 - y1, linewidth=2, edgecolor=color, facecolor="none"
        )
        ax.add_patch(rect)
 
 
@torch.no_grad()
def visualize_detections(model, dataset, indices, score_thresh=0.5, save_path=None):
    model.eval()
    n = len(indices)
    fig, axes = plt.subplots(1, n, figsize=(4 * n, 4))
    if n == 1:
        axes = [axes]
 
    for ax, idx in zip(axes, indices):
        img, tgt = dataset[idx]
        pred = model([img.to(device)])[0]
        keep = pred["scores"] >= score_thresh
 
        ax.imshow(img.permute(1, 2, 0).cpu().numpy())
        draw_boxes(ax, tgt["boxes"], "lime")
        draw_boxes(ax, pred["boxes"][keep].cpu(), "red")
        ax.set_title(f"#{idx}")
        ax.axis("off")
 
    fig.tight_layout()
    if save_path:
        fig.savefig(save_path, dpi=150)
    plt.close()
 
 
model.load_state_dict(torch.load(OUT_DIR / "best_detector.pt", map_location=device))
visualize_detections(model, full_ds, val_idx[:3], save_path=OUT_DIR / "detections_val.png")

Step 8 — Threshold sweep (precision vs recall)

Goal: Pick a deployment threshold — not always 0.5.

python

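@torch.no_grad()
def collect_scores_and_matches(model, loader, iou_thresh=0.5):
    """Simple PR sweep: greedy one-to-one matching of detections to GT at IoU ≥ thresh."""
    from torchvision.ops import box_iou

    model.eval()
    all_scores = []
    all_tps = []

    for images, targets in loader:
        images = [im.to(device) for im in images]
        outputs = model(images)
        for out, tgt in zip(outputs, targets):
            gt_boxes = tgt["boxes"]
            scores = out["scores"].cpu()
            boxes = out["boxes"].cpu()
            if boxes.numel() == 0:
                continue
            order = scores.argsort(descending=True)  # match confident detections first
            scores, boxes = scores[order], boxes[order]
            gt_taken = torch.zeros(len(gt_boxes), dtype=torch.bool)
            ious = box_iou(boxes, gt_boxes)  # [num_det, num_gt]
            for score, row in zip(scores.tolist(), ious):
                is_tp = 0  # detections on images without GT stay false positives
                if len(gt_boxes) > 0:
                    row = row.masked_fill(gt_taken, -1.0)  # each GT can be matched only once
                    best = int(row.argmax())
                    if row[best] >= iou_thresh:
                        gt_taken[best] = True
                        is_tp = 1
                all_scores.append(score)
                all_tps.append(is_tp)
    return np.array(all_scores), np.array(all_tps)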
 
 
scores, tps = collect_scores_and_matches(model, val_loader)
thresholds = np.linspace(0.05, 0.95, 19)
precisions, recalls = [], []
total_gt = sum(len(full_ds[i][1]["boxes"]) for i in val_idx)
 
for t in thresholds:
    mask = scores >= t
    tp = tps[mask].sum()
    fp = mask.sum() - tp
    precisions.append(tp / (tp + fp + 1e-9))
    recalls.append(tp / (total_gt + 1e-9))  # total_gt = tp + fn
 
plt.figure(figsize=(6, 4))
plt.plot(recalls, precisions, marker="o")
for t, r, p in zip(thresholds[::3], recalls[::3], precisions[::3]):
    plt.annotate(f"{t:.2f}", (r, p), fontsize=8)
plt.xlabel("recall")
plt.ylabel("precision")
plt.title("PR sweep (val, IoU≥0.5)")
plt.tight_layout()
plt.savefig(OUT_DIR / "pr_threshold_sweep.png", dpi=150)
plt.close()

Use the sweep to justify SCORE_THRESH_DEFAULT in your README.
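
If you want a starting point for that justification, one common heuristic is to pick the threshold with the highest F1 score. The sketch below assumes equal weight on precision and recall, which may not match your deployment needs; `best_f1_threshold` is a hypothetical helper and the sweep values are made up for illustration.

```python
def best_f1_threshold(thresholds, precisions, recalls):
    """Return (threshold, f1) with the highest F1 among the swept points."""
    best_t, best_f1 = None, -1.0
    for t, p, r in zip(thresholds, precisions, recalls):
        f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Made-up sweep values for illustration only, not results from this project.
demo_thresholds = [0.3, 0.5, 0.7, 0.9]
demo_precisions = [0.70, 0.85, 0.95, 1.00]
demo_recalls = [0.95, 0.90, 0.75, 0.40]

chosen_t, chosen_f1 = best_f1_threshold(demo_thresholds, demo_precisions, demo_recalls)
print(f"threshold={chosen_t} F1={chosen_f1:.3f}")
```

In the script you would pass `thresholds`, `precisions`, and `recalls` from the sweep. If missed pedestrians are costlier than false alarms, choose a lower threshold than the F1 optimum and say so in the README.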

Step 9 — Failure gallery

Goal: Three categories — missed (FN), false alarm (FP), loose box (low IoU).

Manually pick indices from val where:

No pred box overlaps GT (missed pedestrian).
High-score pred on background (false alarm).
Pred overlaps but visibly too large/small (localization error).

python

# val_idx[5:8] is only a placeholder: replace it with the indices you picked by hand.
visualize_detections(
    model, full_ds, val_idx[5:8], score_thresh=0.3, save_path=OUT_DIR / "failure_gallery.png"
)

In README.md, write one sentence per failure: what happened and which module lesson explains it.
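
To shortlist candidates before picking by eye, you could count the three failure types per image. The sketch below is one possible rule set, not the course's definition: the 0.75 cut-off for "loose" is an assumption, and `iou` and `categorize_image` are hypothetical helpers working on plain tuples.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def categorize_image(gt_boxes, pred_boxes, hit_iou=0.5, tight_iou=0.75):
    """Count missed / false-alarm / loose boxes for one image.

    A GT with no prediction at IoU >= hit_iou is "missed"; a prediction with
    no GT at IoU >= hit_iou is a "false_alarm"; a GT whose best prediction
    falls in [hit_iou, tight_iou) is "loose".
    """
    counts = {"missed": 0, "false_alarm": 0, "loose": 0}
    for g in gt_boxes:
        best = max((iou(g, p) for p in pred_boxes), default=0.0)
        if best < hit_iou:
            counts["missed"] += 1
        elif best < tight_iou:
            counts["loose"] += 1
    for p in pred_boxes:
        best = max((iou(p, g) for g in gt_boxes), default=0.0)
        if best < hit_iou:
            counts["false_alarm"] += 1
    return counts


# Made-up boxes: one loose hit, one missed pedestrian, one false alarm.
demo_gt = [(100, 100, 200, 300), (300, 100, 380, 280)]
demo_pred = [(90, 80, 220, 320), (500, 50, 560, 200)]
demo_counts = categorize_image(demo_gt, demo_pred)
print(demo_counts)
```

Sorting validation images by these counts gives a shortlist; the final choice for the gallery should still come from looking at the images.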

Step 10 — ONNX export (deployment handoff)

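Goal: Fixed-size tensor in → boxes, scores, and labels out. torchvision's Faster R-CNN applies NMS inside the model, so the exported outputs are already de-duplicated; score thresholding and drawing stay in app code (Lesson 5).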

python

class DetectorWrapper(nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, x):
        out = self.model(x)[0]
        return out["boxes"], out["scores"], out["labels"]
 
 
model.eval()
wrapper = DetectorWrapper(model).cpu()
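# Values in [0, 1] match ToTensor output; a real validation image resized to 800x800 is a more representative tracing input.
dummy = torch.rand(1, 3, 800, 800)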
torch.onnx.export(
    wrapper,
    dummy,
    OUT_DIR / "detector.onnx",
    input_names=["image"],
    output_names=["boxes", "scores", "labels"],
    dynamic_axes=None,
    opset_version=17,
)
print("exported:", OUT_DIR / "detector.onnx")

Note: Real photos must be letterboxed or resized to 800×800 with the same normalization you used in training (ToTensor → [0,1]). Module 7 covers full pipeline integration.
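
The letterboxing arithmetic can be sketched as follows. This assumes centered padding and a single uniform scale; `letterbox_params` and `unletterbox_box` are hypothetical helper names, and the actual resizing and padding of pixels is left to your image library.

```python
def letterbox_params(width, height, size=800):
    """Scale and padding that fit a (width x height) image into size x size.

    Returns (scale, pad_x, pad_y): resize by `scale`, then pad `pad_x`
    pixels on the left and `pad_y` on top (the remainder goes right/bottom).
    """
    scale = min(size / width, size / height)
    new_w, new_h = round(width * scale), round(height * scale)
    pad_x = (size - new_w) // 2
    pad_y = (size - new_h) // 2
    return scale, pad_x, pad_y


def unletterbox_box(box, scale, pad_x, pad_y):
    """Map an xyxy box from the padded square back to original pixels."""
    x1, y1, x2, y2 = box
    return (
        (x1 - pad_x) / scale,
        (y1 - pad_y) / scale,
        (x2 - pad_x) / scale,
        (y2 - pad_y) / scale,
    )


# A 1600 x 1200 photo shrinks to 800 x 600 and gets 100 px of padding top and bottom.
scale, pad_x, pad_y = letterbox_params(1600, 1200)
original_box = unletterbox_box((100, 200, 300, 500), scale, pad_x, pad_y)
print(scale, pad_x, pad_y)
print(original_box)
```

Boxes coming out of the ONNX model are in the 800×800 frame, so the inverse mapping is needed before drawing them on the original photo.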

Deliverables checklist

outputs/best_detector.pt — best val mAP@0.5 checkpoint
outputs/metrics.json — includes best_map_50
outputs/loss_curves.png — loss decreasing, mAP increasing
outputs/detections_val.png — green GT + red preds
outputs/pr_threshold_sweep.png — justified threshold
outputs/failure_gallery.png — 3 annotated failure modes
outputs/detector.onnx — loads in onnxruntime
README.md — metrics table + reflections

Troubleshooting

Symptom	Likely cause	Fix
mAP = 0 after training	Labels all 0 or wrong `num_classes`	Use `PERSON_LABEL = 1`, `NUM_CLASSES = 2`
CUDA OOM	Batch too large / huge images	`batch_size=1`, shorter side resize
Loss NaN	LR too high	Try `lr=0.002` or freeze backbone first epoch
Thousands of red boxes	Threshold too low	Raise to 0.5+; check NMS is inside model (it is for Faster R-CNN)
Great mAP, bad visuals	Loose boxes at IoU 0.5	Report `map` (0.5:0.95); inspect failure gallery

Extensions (optional)

Horizontal flip aug — flip image + flip box x coordinates.
Freeze backbone for 1 epoch, then unfreeze — stabilizes early RPN.
Compare score_thresh 0.3 vs 0.7 on the same images — write precision/recall tradeoff.
Try retinanet_resnet50_fpn — one-stage baseline on same split.
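
For the horizontal flip extension, the box update is easy to get wrong because x1 and x2 swap roles. Here is a minimal sketch of the coordinate math on plain lists; `hflip_boxes` is a hypothetical helper, and flipping the image tensor itself (for example with `img.flip(-1)`) has to happen in the same transform.

```python
def hflip_boxes(boxes, image_width):
    """Mirror xyxy boxes left-to-right for an image `image_width` pixels wide.

    After mirroring, the old right edge becomes the new left edge, so the
    new x1 is computed from the old x2. The y coordinates are unchanged.
    """
    return [[image_width - x2, y1, image_width - x1, y2] for x1, y1, x2, y2 in boxes]


flipped = hflip_boxes([[10, 20, 50, 120]], image_width=200)
restored = hflip_boxes(flipped, image_width=200)
print(flipped)
print(restored)
```

The classic `torchvision.transforms.RandomHorizontalFlip` only sees the image, so the boxes need a paired update like this one to stay aligned with the pixels.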

What's next

Module 5 — Segmentation and instance masks — pixel-level labels, U-Net, and masks instead of boxes.