Module 4 — Object detection

Training detectors — losses, labels & data pipelines

Classification + box regression losses, RPN and RoI training, hard negative mining, COCO/YOLO dataset loading, and common training bugs.

~90 min read + exercises

Before we begin

Detection training is not "cross-entropy on one label." You optimize multiple losses on matched predictions — classification, box regression, objectness — with heavy class imbalance (thousands of background anchors vs few objects).

This lesson explains what loss_dict means in PyTorch, how labels flow from disk to the model, and how to debug the most common training failures.


What you will learn

  • Decompose Faster R-CNN training losses and what each term optimizes.
  • Explain Smooth L1 / GIoU box regression objectives.
  • Build a valid torchvision target dict from masks or COCO JSON.
  • Apply hard negative mining intuition.
  • Diagnose broken boxes, label ids, and empty targets.


Faster R-CNN loss dictionary

In training mode (model.train()), calling the model with targets returns a dict of losses:

python
loss_dict = model(images, targets)
# Example keys:
# loss_classifier, loss_box_reg, loss_objectness, loss_rpn_box_reg
losses = sum(loss_dict.values())
losses.backward()

| Loss | What it trains | Intuition |
| --- | --- | --- |
| loss_rpn_box_reg | RPN box deltas | Proposals overlap objects |
| loss_objectness | RPN fg/bg | Proposals are object vs background |
| loss_classifier | RoI class | Correct category per proposal |
| loss_box_reg | RoI box refine | Tight boxes after RoI Align |

Inference: after model.eval(), call the model without targets; it returns a list of dicts (one per image) with boxes, labels, and scores.
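
When the summed loss misbehaves, it usually helps to log each term on its own. The following is a minimal sketch, assuming a loss_dict with the four keys listed above; the helper name `summarize_losses` and the stand-in values are hypothetical and only there so the snippet runs without a model.

```python
import torch

def summarize_losses(loss_dict):
    """Return (total loss tensor for backward, per-term floats for logging)."""
    total = sum(loss_dict.values())
    # detach() so that logging never holds on to the autograd graph
    per_term = {name: value.detach().item() for name, value in loss_dict.items()}
    return total, per_term

# Stand-in values using the same keys as the Faster R-CNN loss dict above
fake_loss_dict = {
    "loss_classifier": torch.tensor(0.50, requires_grad=True),
    "loss_box_reg": torch.tensor(0.25, requires_grad=True),
    "loss_objectness": torch.tensor(0.125, requires_grad=True),
    "loss_rpn_box_reg": torch.tensor(0.125, requires_grad=True),
}

total, per_term = summarize_losses(fake_loss_dict)
total.backward()  # gradients flow into every term of the sum
print(per_term, float(total))
```

If one term dominates or goes to NaN first, that points at the stage (RPN or RoI head) to inspect.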


Classification loss on proposals

Matched positive proposals are trained with cross-entropy toward their true class.
Background proposals get label 0.

Imbalance: with roughly 200k anchors and perhaps 10 objects, well over 99% of candidates are background.
Mitigations:

  • Sample balanced mini-batches of RoIs (e.g. 25% fg, 75% bg)
  • Focal loss in one-stage detectors
  • Hard negative mining (keep bg examples with high loss)
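
The hard negative mining idea in the last bullet can be sketched as follows. This is an illustrative version, not the code of any particular detector: the function name `hard_negative_mining_loss` and the 3:1 negative-to-positive ratio are assumptions (3:1 is a common choice).

```python
import torch
import torch.nn.functional as F

def hard_negative_mining_loss(logits, labels, neg_pos_ratio=3):
    """Cross-entropy over all positives plus only the hardest negatives.

    logits: (N, C) class scores; labels: (N,) int64 with 0 = background.
    Returns (loss, keep) where keep is a bool mask of the examples used.
    """
    per_example = F.cross_entropy(logits, labels, reduction="none")
    pos_mask = labels > 0
    num_pos = int(pos_mask.sum())
    num_neg = min(int((~pos_mask).sum()), max(neg_pos_ratio * num_pos, 1))

    # Rank background examples by loss; positives are excluded from the ranking
    neg_loss = per_example.detach().clone()
    neg_loss[pos_mask] = float("-inf")
    hard_neg_idx = neg_loss.topk(num_neg).indices

    keep = pos_mask.clone()
    keep[hard_neg_idx] = True
    loss = per_example[keep].sum() / max(num_pos, 1)
    return loss, keep

torch.manual_seed(0)
logits = torch.randn(1000, 2)
labels = torch.zeros(1000, dtype=torch.int64)
labels[:10] = 1  # 10 objects, 990 background candidates
loss, keep = hard_negative_mining_loss(logits, labels)
print(int(keep.sum()))  # 10 positives + 30 hardest negatives = 40
```

The easy background examples contribute nothing to the gradient, so the few positives are no longer drowned out.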

Box regression targets

Given anchor $A$ and ground truth $G$, predict deltas (schematic):

$$t_x = \frac{G_{cx} - A_{cx}}{A_w}, \quad t_w = \log\frac{G_w}{A_w}$$
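
To make the delta parameterization concrete, here is a small encode/decode sketch. It assumes boxes in [x1, y1, x2, y2] pixel format and treats the y and height terms as analogous to the x and width terms shown above; the function names are hypothetical. Real implementations often also scale the deltas by fixed weights, which is omitted here.

```python
import torch

def _to_cxcywh(boxes):
    """(N, 4) [x1, y1, x2, y2] -> centers and sizes."""
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    return boxes[:, 0] + 0.5 * w, boxes[:, 1] + 0.5 * h, w, h

def encode_deltas(anchors, gt):
    """Regression targets [tx, ty, tw, th] for matched anchor / ground-truth rows."""
    acx, acy, aw, ah = _to_cxcywh(anchors)
    gcx, gcy, gw, gh = _to_cxcywh(gt)
    return torch.stack(
        [(gcx - acx) / aw, (gcy - acy) / ah, torch.log(gw / aw), torch.log(gh / ah)],
        dim=1,
    )

def decode_deltas(anchors, deltas):
    """Inverse of encode_deltas: apply predicted deltas to anchors."""
    acx, acy, aw, ah = _to_cxcywh(anchors)
    cx = deltas[:, 0] * aw + acx
    cy = deltas[:, 1] * ah + acy
    w = torch.exp(deltas[:, 2]) * aw
    h = torch.exp(deltas[:, 3]) * ah
    return torch.stack([cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h], dim=1)

anchors = torch.tensor([[0.0, 0.0, 10.0, 10.0]])
gt = torch.tensor([[5.0, 5.0, 25.0, 25.0]])
deltas = encode_deltas(anchors, gt)
print(deltas)                          # tx = ty = 1.0, tw = th = log(2)
print(decode_deltas(anchors, deltas))  # recovers gt
```

Dividing by the anchor size and taking the log of the size ratio keeps the targets on a similar scale for small and large objects.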

Loss: Smooth L1 (less sensitive to outliers than L2):

python
# conceptually
loss_box = smooth_l1_loss(pred_deltas, target_deltas, beta=1.0)

Modern systems also use GIoU / DIoU / CIoU losses, which score the box as a whole and handle non-overlapping boxes better than plain L1 on coordinates.
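
The following sketch shows why GIoU helps when boxes do not overlap. It is a plain PyTorch illustration written for this lesson (the name `giou_loss` and the `eps` value are assumptions), not the implementation of any specific library.

```python
import torch

def giou_loss(pred, target, eps=1e-7):
    """Row-wise GIoU loss for (N, 4) boxes in [x1, y1, x2, y2] format."""
    # Overlap rectangle, clamped to zero when the boxes are disjoint
    inter_w = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(min=0)
    inter_h = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(min=0)
    inter = inter_w * inter_h

    area_pred = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_target = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_pred + area_target - inter
    iou = inter / (union + eps)

    # Smallest axis-aligned box that encloses both boxes
    enc_w = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    enc_h = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    enclose = enc_w * enc_h

    giou = iou - (enclose - union) / (enclose + eps)
    return 1.0 - giou

target = torch.tensor([[0.0, 0.0, 10.0, 10.0]] * 3)
pred = torch.tensor([
    [0.0, 0.0, 10.0, 10.0],    # perfect match
    [12.0, 0.0, 22.0, 10.0],   # disjoint, close to the target
    [50.0, 0.0, 60.0, 10.0],   # disjoint, far from the target
])
losses = giou_loss(pred, target)
print(losses)
```

Both disjoint predictions have IoU 0, yet the farther one receives the larger loss, so the gradient still says which way to move the box.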


Label id convention (torchvision)

python
# num_classes = 2 means: background + 1 foreground class
labels = torch.tensor([1, 1, 1])  # all "person"
# NEVER use 0 for person — 0 is background

When replacing FastRCNNPredictor:

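python
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
 
# Swap the box head for one sized to this dataset (background + 1 class)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=2)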

num_classes includes background.
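
One way to keep this convention consistent is to derive both the label ids and num_classes from the list of class names. A minimal sketch; `build_label_map` is a hypothetical helper, and sorting the names is just one way to make the mapping reproducible.

```python
def build_label_map(class_names):
    """Map foreground class names to ids 1..K; id 0 stays reserved for background."""
    unique_names = sorted(set(class_names))
    label_map = {name: idx for idx, name in enumerate(unique_names, start=1)}
    num_classes = len(label_map) + 1  # +1 for the background class
    return label_map, num_classes

label_map, num_classes = build_label_map(["person", "car", "person"])
print(label_map, num_classes)  # {'car': 1, 'person': 2} 3
```

Passing the returned num_classes to the predictor avoids the off-by-one that appears when background is forgotten.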


Building targets from instance masks (project pattern)

Penn-Fudan provides one mask image per photo; each pixel value is an instance id, with 0 for background:

python
import torch
 
def mask_to_boxes_labels(mask):
    """mask: 2D int tensor, 0=background, instance ids 1..N"""
    obj_ids = torch.unique(mask)
    obj_ids = obj_ids[obj_ids != 0]  # drop the background id
    boxes, labels = [], []
    for obj_id in obj_ids:
        ys, xs = torch.where(mask == obj_id)
        # Inclusive pixel bounds of this instance as [xmin, ymin, xmax, ymax]
        boxes.append([xs.min().item(), ys.min().item(), xs.max().item(), ys.max().item()])
        labels.append(1)  # person class
    if not boxes:
        # No instances: return correctly shaped empty tensors
        return torch.zeros((0, 4), dtype=torch.float32), torch.zeros((0,), dtype=torch.int64)
    return torch.as_tensor(boxes, dtype=torch.float32), torch.as_tensor(labels, dtype=torch.int64)

Edge case: image with zero objects — return empty tensors shape (0,4) — still valid.
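
A related edge case: because the bounds are taken from inclusive pixel indices, an instance that is a single pixel wide or tall yields xmax == xmin or ymax == ymin, and torchvision's detection models reject zero-area boxes in training mode. The sketch below filters such boxes; `drop_degenerate` and its `min_size` threshold are hypothetical choices.

```python
import torch

def drop_degenerate(boxes, labels, min_size=1.0):
    """Keep only boxes whose width and height are at least min_size pixels."""
    widths = boxes[:, 2] - boxes[:, 0]
    heights = boxes[:, 3] - boxes[:, 1]
    keep = (widths >= min_size) & (heights >= min_size)
    return boxes[keep], labels[keep]

boxes = torch.tensor([
    [10.0, 20.0, 50.0, 80.0],   # valid
    [30.0, 30.0, 30.0, 45.0],   # zero width: would break training
])
labels = torch.tensor([1, 1])
clean_boxes, clean_labels = drop_degenerate(boxes, labels)
print(clean_boxes, clean_labels)

# Empty inputs pass through with their shapes intact
empty_boxes, empty_labels = drop_degenerate(
    torch.zeros((0, 4)), torch.zeros((0,), dtype=torch.int64)
)
```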


DataLoader and collate

python
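from torch.utils.data import DataLoader
 
def collate_fn(batch):
    # Transpose [(img, tgt), ...] into ((img, ...), (tgt, ...)) without stacking
    return tuple(zip(*batch))
 
loader = DataLoader(ds, batch_size=2, shuffle=True, collate_fn=collate_fn)

Each batch: images is a tuple of 3×H×W tensors; targets is a tuple of dicts (box counts vary per image).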

Do not torch.stack images of different sizes without resize/pad policy.
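
If a single batch tensor is needed, one possible pad policy is to zero-pad every image to the largest height and width in the batch. This is a sketch under that assumption; `pad_and_stack` is a hypothetical helper. torchvision's detection models accept a list of differently sized images and batch them internally, so this is only needed for custom models.

```python
import torch

def pad_and_stack(images, fill=0.0):
    """images: list of (C, H_i, W_i) tensors -> (B, C, H_max, W_max) batch and original sizes."""
    max_h = max(img.shape[1] for img in images)
    max_w = max(img.shape[2] for img in images)
    channels = images[0].shape[0]
    batch = images[0].new_full((len(images), channels, max_h, max_w), fill)
    for i, img in enumerate(images):
        # Top-left alignment keeps box coordinates valid without any shift
        batch[i, :, : img.shape[1], : img.shape[2]] = img
    sizes = [tuple(img.shape[1:]) for img in images]
    return batch, sizes

images = [torch.ones(3, 4, 6), torch.ones(3, 5, 2)]
batch, sizes = pad_and_stack(images)
print(batch.shape, sizes)  # torch.Size([2, 3, 5, 6]) [(4, 6), (5, 2)]
```

Keeping the original sizes lets later code ignore predictions that fall in the padded region.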

Recommended transforms (detection)

python
import torch
import torchvision.transforms.v2 as T
 
train_tf = T.Compose([
    T.RandomPhotometricDistort(p=0.5),
    T.RandomHorizontalFlip(p=0.5),
    T.ToImage(),
    T.ToDtype(torch.float32, scale=True),
])

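v2 transforms update boxes when flipping, which is critical, but only if the boxes are passed as tv_tensors.BoundingBoxes (or the dataset is wrapped with torchvision.datasets.wrap_dataset_for_transforms_v2). Converting a PIL image with plain ToTensor() gives no box sync, so any geometric augmentation applied to the image alone will misalign labels.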
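
To see what "box sync" means, here is a hand-written horizontal flip that transforms the image and its boxes together. It is a teaching sketch with a hypothetical function name, assuming pixel boxes in [x1, y1, x2, y2] format; in practice the v2 transforms do this for you.

```python
import torch

def hflip_image_and_boxes(image, boxes):
    """image: (C, H, W) tensor; boxes: (N, 4) [x1, y1, x2, y2] in pixels."""
    width = image.shape[-1]
    flipped_image = image.flip(-1)
    flipped_boxes = boxes.clone()
    # x1 and x2 swap roles after mirroring so that x1 < x2 still holds
    flipped_boxes[:, 0] = width - boxes[:, 2]
    flipped_boxes[:, 2] = width - boxes[:, 0]
    return flipped_image, flipped_boxes

image = torch.zeros(1, 4, 8)
image[0, 1:3, 1:3] = 1.0                       # a 2x2 "object"
boxes = torch.tensor([[1.0, 1.0, 3.0, 3.0]])   # box around that object

flipped_image, flipped_boxes = hflip_image_and_boxes(image, boxes)
print(flipped_boxes)  # tensor([[5., 1., 7., 3.]])
```

Flipping only the image would leave the box at x = 1..3 while the object has moved to x = 5..7, which is exactly the misalignment described above.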


Optimizer and schedule (Faster R-CNN fine-tune)

Common recipe from torchvision references:

python
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005, momentum=0.9, weight_decay=0.0005)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)

| Situation | Strategy |
| --- | --- |
| Epochs 1–2 | Train all layers, moderate LR |
| Small data | Freeze backbone early layers, lower LR |
| Overfit | Stronger aug, fewer epochs, weight decay |
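
The freezing strategy and the step schedule can be tried on a toy model before touching a real detector. In this sketch the two-layer model and its module names `backbone` and `head` are stand-ins invented for illustration; the optimizer and scheduler settings are the ones from the recipe above.

```python
import torch
from torch import nn

# Toy stand-in for a detector: a "backbone" followed by a "head"
model = nn.Sequential()
model.add_module("backbone", nn.Conv2d(3, 8, kernel_size=3))
model.add_module("head", nn.Linear(8, 2))

# Freeze the backbone so only the head is updated
for p in model.backbone.parameters():
    p.requires_grad = False

params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005, momentum=0.9, weight_decay=0.0005)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)

lrs = []
for epoch in range(7):
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()     # stands in for one epoch of training
    lr_scheduler.step()  # called once per epoch
print(len(params), lrs)
```

With step_size=3 and gamma=0.1 the learning rate stays at 0.005 for three epochs, then drops by a factor of ten every three epochs.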

Debugging checklist

| Symptom | Likely cause |
| --- | --- |
| Loss NaN | Invalid boxes; LR too high |
| Loss flat, no boxes after train | All labels 0 (background) |
| Boxes in corner | Normalized coords treated as pixels |
| Perfect train, zero val detections | Forgot model.eval() or wrong score thresh |
| CUDA OOM | batch_size>1 on 800px images |
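
The "boxes in corner" symptom typically appears when YOLO-style labels, stored as normalized [cx, cy, w, h], are fed to a model that expects pixel [x1, y1, x2, y2]. A conversion sketch follows; both function names are hypothetical, and `looks_normalized` is only a heuristic.

```python
import torch

def yolo_to_xyxy(boxes, width, height):
    """boxes: (N, 4) normalized [cx, cy, w, h] in [0, 1] -> pixel [x1, y1, x2, y2]."""
    cx, cy, w, h = boxes.unbind(dim=1)
    x1 = (cx - w / 2) * width
    y1 = (cy - h / 2) * height
    x2 = (cx + w / 2) * width
    y2 = (cy + h / 2) * height
    return torch.stack([x1, y1, x2, y2], dim=1)

def looks_normalized(boxes):
    """Heuristic only: every coordinate inside [0, 1] suggests normalized boxes."""
    return boxes.numel() > 0 and bool((boxes >= 0).all() and (boxes <= 1).all())

yolo_boxes = torch.tensor([[0.5, 0.5, 0.2, 0.4]])
pixel_boxes = yolo_to_xyxy(yolo_boxes, width=640, height=480)
print(pixel_boxes)  # tensor([[256., 144., 384., 336.]])
print(looks_normalized(yolo_boxes), looks_normalized(pixel_boxes))  # True False
```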

Sanity script

python
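img, tgt = dataset[0]
print(tgt["boxes"], tgt["labels"])
# Boxes must have positive width and positive height
assert (tgt["boxes"][:, 2] > tgt["boxes"][:, 0]).all()
assert (tgt["boxes"][:, 3] > tgt["boxes"][:, 1]).all()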

Visualize GT boxes on image before any training.
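
A slightly fuller check covers most rows of the debugging table in one pass. This is a sketch; `check_target` is a hypothetical helper, and the exact rules (for example, boxes must lie inside the image) are assumptions you may want to relax.

```python
import torch

def check_target(target, height, width, num_classes):
    """Raise AssertionError with a readable message if the target dict looks malformed."""
    boxes, labels = target["boxes"], target["labels"]
    assert boxes.dtype == torch.float32, f"boxes dtype is {boxes.dtype}"
    assert labels.dtype == torch.int64, f"labels dtype is {labels.dtype}"
    assert boxes.ndim == 2 and boxes.shape[1] == 4, f"boxes shape is {tuple(boxes.shape)}"
    assert labels.shape == (boxes.shape[0],), "one label per box expected"
    if boxes.numel() == 0:
        return  # empty targets are valid
    assert (boxes[:, 2] > boxes[:, 0]).all(), "box with non-positive width"
    assert (boxes[:, 3] > boxes[:, 1]).all(), "box with non-positive height"
    assert boxes.min() >= 0, "negative coordinate"
    assert boxes[:, [0, 2]].max() <= width, "x coordinate outside image"
    assert boxes[:, [1, 3]].max() <= height, "y coordinate outside image"
    assert labels.min() >= 1, "label 0 is background and must not appear in targets"
    assert labels.max() < num_classes, "label id >= num_classes"

good = {"boxes": torch.tensor([[10.0, 20.0, 50.0, 80.0]]), "labels": torch.tensor([1])}
check_target(good, height=100, width=100, num_classes=2)

empty = {"boxes": torch.zeros((0, 4)), "labels": torch.zeros((0,), dtype=torch.int64)}
check_target(empty, height=100, width=100, num_classes=2)
```

Running it over the whole dataset once before training is cheap compared with a failed run.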


COCO training path (extension)

python
from torchvision.datasets import CocoDetection
 
ds = CocoDetection(root="images/", annFile="annotations.json", transforms=...)
# CocoDetection returns (image, list_of_ann_dicts) — wrap to tensors
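
The wrapping step mentioned in the comment above might look like the sketch below. It relies on the COCO annotation format, where bbox is [x, y, width, height]; the helper name `coco_anns_to_target`, the choice to skip crowd annotations, and the `cat_id_to_label` mapping are assumptions. The mapping is needed because COCO category ids are not guaranteed to be contiguous.

```python
import torch

def coco_anns_to_target(anns, cat_id_to_label):
    """anns: list of COCO annotation dicts -> torchvision-style target dict."""
    boxes, labels = [], []
    for ann in anns:
        if ann.get("iscrowd", 0):
            continue  # assumption: crowd regions are skipped for training
        x, y, w, h = ann["bbox"]  # COCO stores [x, y, width, height]
        if w <= 0 or h <= 0:
            continue  # drop degenerate boxes
        boxes.append([x, y, x + w, y + h])
        labels.append(cat_id_to_label[ann["category_id"]])
    return {
        # reshape keeps the (0, 4) shape when no annotation survives
        "boxes": torch.tensor(boxes, dtype=torch.float32).reshape(-1, 4),
        "labels": torch.tensor(labels, dtype=torch.int64),
    }

anns = [
    {"bbox": [10, 20, 30, 40], "category_id": 18, "iscrowd": 0},
    {"bbox": [0, 0, 50, 50], "category_id": 18, "iscrowd": 1},
    {"bbox": [5, 5, 0, 10], "category_id": 18, "iscrowd": 0},
]
target = coco_anns_to_target(anns, cat_id_to_label={18: 1})
print(target)
```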

Use pycocotools for official mAP evaluation on val.
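
pycocotools expects detections as a list of dicts in COCO result format, so model outputs have to be converted back from [x1, y1, x2, y2] to [x, y, width, height]. A minimal sketch, assuming outputs shaped like the inference dicts described earlier; `to_coco_results` and `label_to_cat_id` are hypothetical names.

```python
import torch

def to_coco_results(image_id, output, label_to_cat_id):
    """Convert one image's detections into COCO result dicts."""
    results = []
    rows = zip(output["boxes"].tolist(), output["labels"].tolist(), output["scores"].tolist())
    for (x1, y1, x2, y2), label, score in rows:
        results.append({
            "image_id": image_id,
            "category_id": label_to_cat_id[label],
            "bbox": [x1, y1, x2 - x1, y2 - y1],  # back to COCO [x, y, width, height]
            "score": score,
        })
    return results

output = {
    "boxes": torch.tensor([[10.0, 20.0, 40.0, 60.0]]),
    "labels": torch.tensor([1]),
    "scores": torch.tensor([0.9]),
}
results = to_coco_results(image_id=42, output=output, label_to_cat_id={1: 18})
print(results)
```

The collected list can then be loaded with `COCO.loadRes` and scored with `COCOeval(coco_gt, coco_dt, iouType="bbox")` followed by `evaluate()`, `accumulate()`, and `summarize()`.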


What's next

Lesson 4 — IoU, NMS & mAP