Project: train and evaluate an object detector
Before we begin
This is your first object detection project. You will fine-tune a Faster R-CNN detector on the Penn-Fudan pedestrian dataset, track training losses, compute mAP on a validation split, sweep score thresholds, build a failure gallery, and export to ONNX for Module 7.
You are not inventing a new architecture — you are learning the full detection workflow: labels → collate → loss dict → eval → metrics → deployment.
Figure: What good evaluation looks like
How this connects to Module 4
| Lesson | Where you use it |
|---|---|
| Classification → detection | Box format xyxy, one class (person), variable object count |
| Architectures | Faster R-CNN: backbone + FPN + RPN + RoI head |
| Training | model(images, targets) returns a loss dict; label 0 = background |
| IoU / NMS / mAP | torchmetrics mAP, score threshold tuning |
| On-device | ONNX export, fixed input size, post-process on CPU |
Folder layout:
pedestrian-detector/
  data/                      # torchvision download cache
  train_detector.py          # train, evaluate, plot, export
  outputs/
    loss_curves.png
    detections_val.png       # green GT, red preds
    pr_threshold_sweep.png
    failure_gallery.png
    metrics.json             # mAP@0.5, mAP@0.5:0.95
    best_detector.pt
    detector.onnx
  README.md                  # metrics table + 3 failure reflections

Estimated time: 4–6 hours (first time); ~2.5 hours if you are comfortable with the Module 3 project.
What you will build
- Convert Penn-Fudan instance masks → COCO-style boxes + labels targets.
- Fine-tune fasterrcnn_resnet50_fpn (COCO-pretrained) for 1 foreground class.
- Log RPN and RoI losses every epoch to spot collapse early.
- Compute validation mAP@0.5 and mAP@0.5:0.95 with torchmetrics.
- Sweep confidence thresholds and plot precision vs recall.
- Visualize predictions vs ground truth and a failure gallery.
- Export ONNX with a fixed 800×800 input (deployment pattern from Lesson 5).
Before you start
- Finish the Module 4 quiz.
- Python 3.10+:
mkdir pedestrian-detector && cd pedestrian-detector
python -m venv .venv
# Windows: .venv\Scripts\activate
# macOS/Linux: source .venv/bin/activate
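# pycocotools is included because recent torchmetrics releases use it as the backend for MeanAveragePrecision
pip install torch torchvision matplotlib numpy torchmetrics pycocotools onnx onnxruntime tqdm
- GPU strongly recommended: Faster R-CNN fine-tuning on CPU works but is slow (~1–2 hours for 5 epochs vs ~15 min on GPU).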
Step 1 — Penn-Fudan → detection targets
Goal: Each __getitem__ returns (tensor_image, target_dict) where:
target = {
    "boxes": FloatTensor[N, 4],   # xyxy absolute pixels
    "labels": Int64Tensor[N],     # all 1 for person (0 = background, unused in GT)
}
Penn-Fudan ships instance masks (pixel IDs per pedestrian). Use torchvision.ops.masks_to_boxes:
# train_detector.py
import json
import random
from pathlib import Path
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset, Subset
from torchvision import transforms
from torchvision.datasets import PennFudanPed
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.ops import masks_to_boxes
from tqdm import tqdm
try:
    from torchmetrics.detection import MeanAveragePrecision
except ImportError:
    # Older torchmetrics releases expose the class from this module path
    from torchmetrics.detection.mean_ap import MeanAveragePrecision
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device:", device)
ROOT = Path(__file__).resolve().parent
DATA_DIR = ROOT / "data"
OUT_DIR = ROOT / "outputs"
OUT_DIR.mkdir(exist_ok=True)
NUM_CLASSES = 2 # background + person
PERSON_LABEL = 1
SCORE_THRESH_DEFAULT = 0.5
Dataset wrapper:
class PennFudanDetection(Dataset):
    """Wrap Penn-Fudan masks as Faster R-CNN targets."""

    def __init__(self, root: Path, transforms_fn=None):
        self.base = PennFudanPed(str(root), download=True)
        self.transforms_fn = transforms_fn

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        img, mask = self.base[idx]  # PIL, LongTensor [H, W]
        img = transforms.ToTensor()(img)
        obj_ids = torch.unique(mask)
        obj_ids = obj_ids[obj_ids != 0]  # ID 0 is background
        if obj_ids.numel() == 0:
            # Image without pedestrians: empty but correctly shaped targets
            boxes = torch.zeros((0, 4), dtype=torch.float32)
            labels = torch.zeros((0,), dtype=torch.int64)
        else:
            masks = mask == obj_ids[:, None, None]  # [N, H, W], one boolean mask per pedestrian
            boxes = masks_to_boxes(masks)
            labels = torch.ones((boxes.shape[0],), dtype=torch.int64) * PERSON_LABEL
        target = {"boxes": boxes, "labels": labels}
        if self.transforms_fn:
            img, target = self.transforms_fn(img, target)
        return img, target

Checkpoint. Run a smoke test:
if __name__ == "__main__" and False:  # flip to True once to verify
    ds = PennFudanDetection(DATA_DIR)
    img, tgt = ds[0]
    print("image:", img.shape, "boxes:", tgt["boxes"].shape)

You should see boxes: torch.Size([N, 4]) with N ≥ 1 on most images.
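To make the mask-to-box conversion concrete, here is a minimal pure-Python sketch on a tiny hand-made instance mask. The helper name `instance_mask_to_boxes` and the toy mask are hypothetical; the project itself uses torchvision's `masks_to_boxes`, and this sketch only mirrors the idea of taking the tight min/max pixel bounds of each instance ID.

```python
def instance_mask_to_boxes(mask):
    """Return {instance_id: [x1, y1, x2, y2]} for a 2D list of integer IDs.

    ID 0 is treated as background. Coordinates are the min/max pixel
    indices covered by each instance, in xyxy order.
    """
    bounds = {}
    for y, row in enumerate(mask):
        for x, obj_id in enumerate(row):
            if obj_id == 0:
                continue
            if obj_id not in bounds:
                bounds[obj_id] = [x, y, x, y]
            else:
                b = bounds[obj_id]
                b[0] = min(b[0], x)
                b[1] = min(b[1], y)
                b[2] = max(b[2], x)
                b[3] = max(b[3], y)
    return bounds


# Hypothetical 3x5 mask with two "pedestrians" (IDs 1 and 2)
toy_mask = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 2],
    [0, 0, 0, 0, 2],
]
toy_boxes = instance_mask_to_boxes(toy_mask)
print(toy_boxes)
```

Note that in this sketch an instance that is one pixel wide gets `x1 == x2`, a zero-width box; it is worth checking your real targets for such degenerate boxes before training.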
Step 2 — Train / val split and collate
Goal: ~80% train, 20% val — same split every run via SEED.
Detection cannot use the default collate_fn, which tries to stack samples into a single tensor: images have different sizes and different object counts. Return tuples instead:
def collate_fn(batch):
    # [(img, tgt), (img, tgt), ...] -> ((img, img, ...), (tgt, tgt, ...))
    return tuple(zip(*batch))


def split_indices(n: int, val_ratio: float = 0.2):
    idx = list(range(n))
    random.shuffle(idx)  # reproducible because random.seed(SEED) ran above
    n_val = max(1, int(n * val_ratio))
    return idx[n_val:], idx[:n_val]


full_ds = PennFudanDetection(DATA_DIR)
train_idx, val_idx = split_indices(len(full_ds))
train_ds = Subset(full_ds, train_idx)
val_ds = Subset(full_ds, val_idx)
train_loader = DataLoader(
    train_ds, batch_size=2, shuffle=True, num_workers=0, collate_fn=collate_fn
)
val_loader = DataLoader(
    val_ds, batch_size=2, shuffle=False, num_workers=0, collate_fn=collate_fn
)
print(f"train {len(train_ds)} / val {len(val_ds)} images")

| Choice | Value | Why |
|---|---|---|
| batch_size=2 | Small | Detection memory scales with objects + image size |
| num_workers=0 | Windows-safe | Avoid multiprocessing spawn issues while learning |
| No heavy aug yet | Baseline | Get mAP > 0 before adding RandomHorizontalFlip |
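The `tuple(zip(*batch))` idiom is easy to misread, so here is a small stand-alone illustration. It uses strings and plain dicts as stand-ins for image tensors and targets (these toy values are made up); the point is only that the batch is transposed, never stacked, so differing sizes and box counts are fine.

```python
def collate_fn(batch):
    # Transpose a list of (image, target) pairs into (images, targets).
    return tuple(zip(*batch))


# Stand-ins: strings for image tensors, dicts with differing box counts for targets
toy_batch = [
    ("image_A_300x400", {"boxes": [[10, 20, 50, 80]], "labels": [1]}),
    ("image_B_500x350", {"boxes": [[5, 5, 40, 90], [60, 10, 120, 200]], "labels": [1, 1]}),
]
images, targets = collate_fn(toy_batch)
print(images)
print([len(t["boxes"]) for t in targets])
```

This is why the training loop later iterates over `images` and `targets` as sequences and moves each element to the device individually.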
Step 3 — Load pretrained Faster R-CNN
Goal: Replace the classification head for 2 classes (background + person).
def build_model(num_classes: int = NUM_CLASSES):
    model = fasterrcnn_resnet50_fpn(weights="DEFAULT")  # COCO-pretrained
    # Swap the 91-class COCO head for one sized to this dataset
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model


model = build_model().to(device)
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005, momentum=0.9, weight_decay=1e-4)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)

What each piece does:
| Component | Role |
|---|---|
| ResNet50-FPN backbone | Multi-scale features for small and large pedestrians |
| RPN | Proposes region candidates per image (torchvision defaults keep up to 2000 in training, 1000 in eval) |
| RoI Align + head | Classifies each proposal and refines box coordinates |
| FastRCNNPredictor | New cls_score and bbox_pred layers sized for your class count |
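The scheduler settings in the code above (lr=0.005, step_size=3, gamma=0.1) imply a simple step decay. The sketch below reproduces that arithmetic in plain Python so you can see which learning rate each of the 5 epochs gets; `step_lr` is a hypothetical helper, not the PyTorch scheduler, and it assumes the scheduler is stepped once per epoch.

```python
def step_lr(base_lr, epoch_index, step_size=3, gamma=0.1):
    """Learning rate used during a 0-based epoch under step decay."""
    return base_lr * gamma ** (epoch_index // step_size)


schedule = [step_lr(0.005, e) for e in range(5)]
for epoch, lr in enumerate(schedule, start=1):
    print(f"epoch {epoch}: lr = {lr:.6f}")
```

Under these assumptions, epochs 1 to 3 train at 0.005 and epochs 4 to 5 at roughly 0.0005, so most of the coarse learning happens before the decay.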
Step 4 — Training loop with loss breakdown
Goal: Sum loss dict scalars, backprop, save best checkpoint by val mAP@0.5.
def train_one_epoch(model, loader, optimizer, epoch):
    model.train()
    running = {}
    for images, targets in tqdm(loader, desc=f"train e{epoch}"):
        images = [im.to(device) for im in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        loss_dict = model(images, targets)  # in train mode the model returns losses
        losses = sum(loss_dict.values())
        optimizer.zero_grad()
        losses.backward()
        optimizer.step()
        for k, v in loss_dict.items():
            running[k] = running.get(k, 0.0) + v.item()
    n = len(loader)
    return {k: v / n for k, v in running.items()}  # per-batch mean of each loss term

Typical loss keys (torchvision Faster R-CNN):
- loss_classifier, loss_box_reg: RoI head
- loss_objectness, loss_rpn_box_reg: RPN
If all losses → 0 in epoch 1 but mAP stays 0, you likely have a label bug (wrong class indices or empty targets). Re-read Lesson 3.
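Since label bugs are the usual culprit, a quick check on a handful of targets before training can save an epoch. The sketch below works on plain nested lists (call `.tolist()` on your tensors first); `check_target` is a hypothetical helper, and the rules it enforces are the conventions from Step 1: xyxy boxes with positive width and height, and foreground labels in 1..num_classes-1.

```python
def check_target(boxes, labels, num_classes=2):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if len(boxes) != len(labels):
        problems.append(f"{len(boxes)} boxes but {len(labels)} labels")
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        if x2 <= x1 or y2 <= y1:
            problems.append(f"box {i} has non-positive width or height: {[x1, y1, x2, y2]}")
    for i, label in enumerate(labels):
        # Label 0 is reserved for background and should not appear in ground truth
        if not 1 <= label < num_classes:
            problems.append(f"label {i} is {label}, expected 1..{num_classes - 1}")
    return problems


good = check_target([[10, 20, 50, 80]], [1])
bad = check_target([[50, 20, 10, 80]], [0])  # x1 > x2 and a background label
print(good)
print(bad)
```

In the project you might run it as `check_target(tgt["boxes"].tolist(), tgt["labels"].tolist())` over the first few dataset items.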
Step 5 — Validation mAP with torchmetrics
Goal: Honest detection metrics — not pixel accuracy.
@torch.no_grad()
def evaluate_map(model, loader, score_thresh: float = SCORE_THRESH_DEFAULT):
    # Note: mAP ranks detections by score, so dropping low-score boxes here
    # truncates the precision-recall curve and can lower the reported mAP.
    # Pass score_thresh=0.0 to score everything the model returns.
    model.eval()
    metric = MeanAveragePrecision(box_format="xyxy", iou_type="bbox")
    metric.warn_on_many_detections = False
    for images, targets in tqdm(loader, desc="eval"):
        images = [im.to(device) for im in images]
        outputs = model(images)  # in eval mode the model returns predictions
        preds = []
        gts = []
        for out, tgt in zip(outputs, targets):
            keep = out["scores"] >= score_thresh
            preds.append({
                "boxes": out["boxes"][keep].cpu(),
                "scores": out["scores"][keep].cpu(),
                "labels": out["labels"][keep].cpu(),
            })
            gts.append({
                "boxes": tgt["boxes"].cpu(),
                "labels": tgt["labels"].cpu(),
            })
        metric.update(preds, gts)
    stats = metric.compute()
    return {
        "map_50": float(stats["map_50"]),
        "map": float(stats["map"]),
    }

| Metric | Meaning |
|---|---|
| map_50 | AP at IoU ≥ 0.5: "did you find the person roughly?" |
| map | COCO-style average over IoU 0.5:0.95; rewards tight boxes |
Target after 5 epochs (GPU): map_50 ≥ 0.75 on this tiny dataset. map is often lower (0.4–0.6) — that is normal.
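The gap between map_50 and map comes down to how strict the IoU cutoff is. As a worked illustration (plain Python, with made-up box coordinates; `box_iou_xyxy` is a hypothetical helper rather than the torchvision function), a prediction padded on every side can count as a hit at IoU 0.5 and a miss at 0.75:

```python
def box_iou_xyxy(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) in absolute pixels."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


gt = (100, 50, 200, 350)          # hypothetical ground-truth pedestrian
loose_pred = (90, 40, 220, 370)   # hypothetical prediction, padded on every side
iou = box_iou_xyxy(gt, loose_pred)
print(f"IoU = {iou:.3f}")
for thresh in (0.5, 0.75):
    print(f"counts as a match at IoU >= {thresh}: {iou >= thresh}")
```

Here the ground truth covers 30000 px² and the prediction 42900 px², so the IoU is about 0.70: good enough for map_50, but it drags down the stricter thresholds averaged into map.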
Step 6 — Full train + save best
EPOCHS = 5
best_map50 = -1.0
history = {"train_loss": [], "map_50": [], "map": []}
for epoch in range(1, EPOCHS + 1):
    losses = train_one_epoch(model, train_loader, optimizer, epoch)
    metrics = evaluate_map(model, val_loader)
    lr_scheduler.step()
    total_loss = sum(losses.values())
    history["train_loss"].append(total_loss)
    history["map_50"].append(metrics["map_50"])
    history["map"].append(metrics["map"])
    print(f"epoch {epoch} loss={total_loss:.4f} mAP@50={metrics['map_50']:.3f} mAP={metrics['map']:.3f}")
    if metrics["map_50"] > best_map50:
        # Keep only the checkpoint with the best validation mAP@0.5
        best_map50 = metrics["map_50"]
        torch.save(model.state_dict(), OUT_DIR / "best_detector.pt")
with open(OUT_DIR / "metrics.json", "w") as f:
    json.dump({"best_map_50": best_map50, "history": history}, f, indent=2)

Plot loss + mAP:
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
ax[0].plot(history["train_loss"], marker="o")
ax[0].set_title("Train loss (sum)")
ax[0].set_xlabel("epoch")
ax[1].plot(history["map_50"], marker="o", label="mAP@0.5")
ax[1].plot(history["map"], marker="o", label="mAP@0.5:0.95")
ax[1].legend()
ax[1].set_xlabel("epoch")
fig.tight_layout()
fig.savefig(OUT_DIR / "loss_curves.png", dpi=150)
plt.close(fig)

Step 7 — Visualize predictions vs ground truth
Goal: Green = GT, red = predictions above threshold.
def draw_boxes(ax, boxes, color, label_prefix=""):
    for box in boxes:
        x1, y1, x2, y2 = box.tolist()
        rect = patches.Rectangle(
            (x1, y1), x2 - x1, y2 - y1, linewidth=2, edgecolor=color, facecolor="none"
        )
        ax.add_patch(rect)


@torch.no_grad()
def visualize_detections(model, dataset, indices, score_thresh=0.5, save_path=None):
    model.eval()
    n = len(indices)
    fig, axes = plt.subplots(1, n, figsize=(4 * n, 4))
    if n == 1:
        axes = [axes]  # plt.subplots returns a bare Axes when n == 1
    for ax, idx in zip(axes, indices):
        img, tgt = dataset[idx]
        pred = model([img.to(device)])[0]
        keep = pred["scores"] >= score_thresh
        ax.imshow(img.permute(1, 2, 0).cpu().numpy())  # CHW -> HWC for matplotlib
        draw_boxes(ax, tgt["boxes"], "lime")
        draw_boxes(ax, pred["boxes"][keep].cpu(), "red")
        ax.set_title(f"#{idx}")
        ax.axis("off")
    fig.tight_layout()
    if save_path:
        fig.savefig(save_path, dpi=150)
    plt.close(fig)


model.load_state_dict(torch.load(OUT_DIR / "best_detector.pt", map_location=device))
visualize_detections(model, full_ds, val_idx[:3], save_path=OUT_DIR / "detections_val.png")

Step 8 — Threshold sweep (precision vs recall)
Goal: Pick a deployment threshold — not always 0.5.
@torch.no_grad()
def collect_scores_and_matches(model, loader, iou_thresh=0.5):
    """Simple PR sweep: a detection is a TP if it claims a not-yet-matched GT box with IoU ≥ thresh."""
    from torchvision.ops import box_iou

    model.eval()
    all_scores = []
    all_tps = []
    for images, targets in loader:
        images = [im.to(device) for im in images]
        outputs = model(images)
        for out, tgt in zip(outputs, targets):
            gt_boxes = tgt["boxes"]
            scores = out["scores"].cpu()
            boxes = out["boxes"].cpu()
            if boxes.numel() == 0:
                continue
            # Highest score first, so confident detections claim GT boxes before duplicates do
            order = scores.argsort(descending=True)
            scores, boxes = scores[order], boxes[order]
            tps = torch.zeros(len(boxes), dtype=torch.int64)
            if gt_boxes.numel() > 0:
                ious = box_iou(boxes, gt_boxes)
                gt_used = torch.zeros(len(gt_boxes), dtype=torch.bool)
                for d in range(len(boxes)):
                    row = ious[d].clone()
                    row[gt_used] = -1.0  # each GT box can be matched only once
                    best_iou, best_gt = row.max(dim=0)
                    if best_iou >= iou_thresh:
                        tps[d] = 1
                        gt_used[best_gt] = True
            # Detections on images without GT stay in the list as false positives
            all_scores.extend(scores.tolist())
            all_tps.extend(tps.tolist())
    return np.array(all_scores), np.array(all_tps)


scores, tps = collect_scores_and_matches(model, val_loader)
thresholds = np.linspace(0.05, 0.95, 19)
precisions, recalls = [], []
total_gt = sum(len(full_ds[i][1]["boxes"]) for i in val_idx)
for t in thresholds:
    mask = scores >= t
    tp = tps[mask].sum()
    fp = mask.sum() - tp
    precisions.append(tp / (tp + fp + 1e-9))
    recalls.append(tp / (total_gt + 1e-9))
plt.figure(figsize=(6, 4))
plt.plot(recalls, precisions, marker="o")
for t, r, p in zip(thresholds[::3], recalls[::3], precisions[::3]):
    plt.annotate(f"{t:.2f}", (r, p), fontsize=8)
plt.xlabel("recall")
plt.ylabel("precision")
plt.title("PR sweep (val, IoU≥0.5)")
plt.tight_layout()
plt.savefig(OUT_DIR / "pr_threshold_sweep.png", dpi=150)
plt.close()

Use the sweep to justify SCORE_THRESH_DEFAULT in your README.
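One way to turn the sweep into a concrete choice is to pick the threshold with the highest F1. This is only one possible criterion (an assumption of this sketch, not something the module prescribes), and `pick_threshold_by_f1` is a hypothetical helper; the sweep values below are made up for illustration and are not measured results.

```python
def pick_threshold_by_f1(thresholds, precisions, recalls):
    """Return (threshold, f1) with the highest F1; ties go to the lower threshold."""
    best_t, best_f1 = None, -1.0
    for t, p, r in zip(thresholds, precisions, recalls):
        f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Made-up sweep values for illustration only
toy_thresholds = [0.3, 0.5, 0.7, 0.9]
toy_precisions = [0.60, 0.80, 0.90, 1.00]
toy_recalls = [0.95, 0.90, 0.70, 0.30]
best_t, best_f1 = pick_threshold_by_f1(toy_thresholds, toy_precisions, toy_recalls)
print(f"best threshold {best_t} with F1 {best_f1:.3f}")
```

If false alarms are costlier than misses in your application (or the reverse), a precision floor or recall floor may be a better rule than F1; state whichever rule you used in the README.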
Step 9 — Failure gallery
Goal: Three categories — missed (FN), false alarm (FP), loose box (low IoU).
Manually pick indices from val where:
- No pred box overlaps GT (missed pedestrian).
- High-score pred on background (false alarm).
- Pred overlaps but visibly too large/small (localization error).
# val_idx[5:8] is a placeholder: substitute the three indices you picked by hand
visualize_detections(
    model, full_ds, val_idx[5:8], score_thresh=0.3, save_path=OUT_DIR / "failure_gallery.png"
)

In README.md, write one sentence per failure: what happened and which module lesson explains it.
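If scanning images by eye is slow, a rough triage can shortlist candidates for each of the three categories. The sketch below is plain Python over lists of xyxy tuples; `failure_modes` and the 0.1 "any overlap" cutoff are assumptions made for illustration, and the final pick should still be confirmed visually.

```python
def iou_xyxy(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def failure_modes(gt_boxes, pred_boxes, match_iou=0.5, overlap_iou=0.1):
    """Label an image's failure modes from GT boxes and score-thresholded predictions."""
    modes = set()
    for gt in gt_boxes:
        best = max((iou_xyxy(gt, p) for p in pred_boxes), default=0.0)
        if best < overlap_iou:
            modes.add("missed")        # no prediction overlaps this pedestrian
        elif best < match_iou:
            modes.add("loose box")     # overlaps, but too loosely to count as a match
    for p in pred_boxes:
        best = max((iou_xyxy(p, gt) for gt in gt_boxes), default=0.0)
        if best < overlap_iou:
            modes.add("false alarm")   # prediction sits on background
    return sorted(modes)


# Hypothetical image: two pedestrians, one loose prediction, one prediction on background
gt_boxes = [(0, 0, 100, 200), (300, 0, 400, 200)]
pred_boxes = [(0, 0, 100, 80), (600, 0, 700, 200)]
modes = failure_modes(gt_boxes, pred_boxes)
print(modes)
```

Running this over the validation indices and keeping the first image found for each label gives three candidates to inspect and swap into the gallery call.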
Step 10 — ONNX export (deployment handoff)
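Goal: Fixed-size tensor in → boxes, scores, and labels out. torchvision's Faster R-CNN runs NMS inside the model, so the exported graph already includes it; score thresholding and other post-processing stay in app code (Lesson 5).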
class DetectorWrapper(nn.Module):
    """Unpack torchvision's list-of-dicts output into plain tensors for export."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, x):
        out = self.model(x)[0]  # batch of 1: take the only result dict
        return out["boxes"], out["scores"], out["labels"]


model.eval()
wrapper = DetectorWrapper(model).cpu()
dummy = torch.randn(1, 3, 800, 800)
torch.onnx.export(
    wrapper,
    dummy,
    str(OUT_DIR / "detector.onnx"),  # str() in case your torch version does not accept a Path
    input_names=["image"],
    output_names=["boxes", "scores", "labels"],
    dynamic_axes=None,
    opset_version=17,
)
print("exported:", OUT_DIR / "detector.onnx")

Note: Real photos must be letterboxed or resized to 800×800 with the same normalization you used in training (ToTensor → [0,1]). Module 7 covers full pipeline integration.
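Letterboxing changes the coordinate frame, so boxes predicted on the 800×800 input have to be mapped back to the original photo. The geometry can be sketched in plain Python as follows; `letterbox_params` and `box_to_original` are hypothetical helpers, the 1600×1200 photo is a made-up example, and the sketch assumes centered padding and that the exported model returns boxes in the 800×800 input frame.

```python
def letterbox_params(width, height, target=800):
    """Scale and padding that fit a width x height image into a target x target square."""
    scale = min(target / width, target / height)
    new_w, new_h = round(width * scale), round(height * scale)
    pad_x = (target - new_w) // 2   # left padding in pixels
    pad_y = (target - new_h) // 2   # top padding in pixels
    return scale, pad_x, pad_y


def box_to_original(box, scale, pad_x, pad_y):
    """Map an xyxy box from letterboxed coordinates back to the original image."""
    x1, y1, x2, y2 = box
    return ((x1 - pad_x) / scale, (y1 - pad_y) / scale,
            (x2 - pad_x) / scale, (y2 - pad_y) / scale)


scale, pad_x, pad_y = letterbox_params(1600, 1200)   # hypothetical 1600x1200 photo
restored = box_to_original((100, 200, 300, 600), scale, pad_x, pad_y)
print(scale, pad_x, pad_y)
print(restored)
```

A plain resize without padding would instead need separate x and y scale factors and distorts the aspect ratio, which is why letterboxing is often preferred for people-shaped objects.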
Deliverables checklist
- outputs/best_detector.pt: best val mAP@0.5 checkpoint
- outputs/metrics.json: includes best_map_50
- outputs/loss_curves.png: loss decreasing, mAP increasing
- outputs/detections_val.png: green GT + red preds
- outputs/pr_threshold_sweep.png: justified threshold
- outputs/failure_gallery.png: 3 annotated failure modes
- outputs/detector.onnx: loads in onnxruntime
- README.md: metrics table + reflections
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| mAP = 0 after training | Labels all 0 or wrong num_classes | Use PERSON_LABEL = 1, NUM_CLASSES = 2 |
| CUDA OOM | Batch too large / huge images | batch_size=1, shorter side resize |
| Loss NaN | LR too high | Try lr=0.002 or freeze backbone first epoch |
| Thousands of red boxes | Threshold too low | Raise to 0.5+; check NMS is inside model (it is for Faster R-CNN) |
| Great mAP, bad visuals | Loose boxes at IoU 0.5 | Report map (0.5:0.95); inspect failure gallery |
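The "thousands of red boxes" row relies on NMS having already removed near-duplicate boxes inside the model. For reference, greedy NMS can be sketched in a few lines of plain Python; this is an illustrative version with made-up boxes, not torchvision's implementation, and `greedy_nms` is a hypothetical name.

```python
def iou_xyxy(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_thresh=0.5):
    """Return indices of boxes kept by greedy NMS, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any box already kept
        if all(iou_xyxy(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep


# Two near-duplicate boxes on one pedestrian plus one separate box (made-up values)
nms_boxes = [(100, 50, 200, 350), (105, 55, 205, 355), (400, 60, 480, 300)]
nms_scores = [0.90, 0.80, 0.70]
kept = greedy_nms(nms_boxes, nms_scores)
print(kept)
```

The second box overlaps the first with an IoU of roughly 0.88, so it is suppressed and only the first and third boxes survive. If you still see stacks of overlapping boxes after this step in the real model, the score threshold is the more likely culprit.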
Extensions (optional)
- Horizontal flip aug: flip the image and flip the box x coordinates.
- Freeze backbone for 1 epoch, then unfreeze; this stabilizes the early RPN.
- Compare score_thresh 0.3 vs 0.7 on the same images and write up the precision/recall tradeoff.
- Try retinanet_resnet50_fpn as a one-stage baseline on the same split.
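For the horizontal flip extension, the box update is the part that is easy to get wrong: the old right edge becomes the new left edge. A minimal sketch in plain Python, with a made-up box and image width (`hflip_boxes` is a hypothetical helper; in the project you would apply the same arithmetic to the boxes tensor and flip the image tensor along its width axis):

```python
def hflip_boxes(boxes, image_width):
    """Mirror xyxy boxes left-right for an image of the given width."""
    flipped = []
    for x1, y1, x2, y2 in boxes:
        # x1 and x2 swap roles so the result still satisfies x1 < x2
        flipped.append((image_width - x2, y1, image_width - x1, y2))
    return flipped


orig_boxes = [(10, 20, 50, 80)]
flipped = hflip_boxes(orig_boxes, image_width=200)
print(flipped)
```

Flipping twice should return the original boxes, which makes a convenient unit test for whatever transform you plug into transforms_fn.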
What's next
Module 5 — Segmentation and instance masks — pixel-level labels, U-Net, and masks instead of boxes.