Training detectors — losses, labels & data pipelines

Before we begin

Detection training is not "cross-entropy on one label." You optimize multiple losses on matched predictions — classification, box regression, objectness — with heavy class imbalance (thousands of background anchors vs few objects).

This lesson explains what loss_dict means in PyTorch, how labels flow from disk to the model, and how to debug the most common training failures.

What you will learn

Decompose Faster R-CNN training losses and what each term optimizes.
Explain Smooth L1 / GIoU box regression objectives.
Build a valid torchvision target dict from masks or COCO JSON.
Apply hard negative mining intuition.
Diagnose broken boxes, label ids, and empty targets.

Before this lesson

Lesson 2 — Detector architectures

Faster R-CNN loss dictionary

In training mode (model.train()), calling the model with images and targets returns a dict of losses instead of predictions:

python

loss_dict = model(images, targets)
# Example keys:
# loss_classifier, loss_box_reg, loss_objectness, loss_rpn_box_reg
losses = sum(loss_dict.values())
losses.backward()

Loss	What it trains	Intuition
`loss_rpn_box_reg`	RPN box deltas	Proposals overlap objects
`loss_objectness`	RPN fg/bg	Proposals are object vs background
`loss_classifier`	RoI class	Correct category per proposal
`loss_box_reg`	RoI box refine	Tight boxes after RoI Align
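The summed loss can hide a single term that stalls or explodes, so it helps to log each term on its own. Below is a minimal sketch, assuming every value in loss_dict can be converted with float() (scalar tensors qualify). summarize_losses is a hypothetical helper, not part of torchvision, and the numbers are stand-ins for illustration.

```python
def summarize_losses(loss_dict):
    """Return (total, per-term floats) for logging. Hypothetical helper."""
    terms = {name: float(value) for name, value in loss_dict.items()}
    total = sum(terms.values())
    return total, terms


# Stand-in numbers for illustration only; real values come from model(images, targets).
example = {
    "loss_classifier": 0.5,
    "loss_box_reg": 0.25,
    "loss_objectness": 0.125,
    "loss_rpn_box_reg": 0.125,
}
total, terms = summarize_losses(example)
print(f"total={total:.3f}", {name: round(value, 3) for name, value in terms.items()})
```

Backpropagate through the tensor sum as in the snippet above; the float copies are for logging only.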

Inference: model.eval() + no targets → returns list of dicts with boxes, labels, scores.
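Low-scoring detections are usually discarded before visualizing or counting results. The sketch below filters one output dict by score. The dict is hand-built to mimic the structure described above, and both filter_detections and the 0.5 threshold are assumptions chosen for illustration.

```python
import torch


def filter_detections(output, score_thresh=0.5):
    """Keep detections whose score is at least score_thresh."""
    keep = output["scores"] >= score_thresh
    return {key: value[keep] for key, value in output.items()}


# Hand-built stand-in for one element of the list returned in eval mode.
fake_output = {
    "boxes": torch.tensor([[10.0, 20.0, 50.0, 80.0], [0.0, 0.0, 5.0, 5.0]]),
    "labels": torch.tensor([1, 1]),
    "scores": torch.tensor([0.92, 0.08]),
}
kept = filter_detections(fake_output, score_thresh=0.5)
print(kept["boxes"].shape, kept["labels"].tolist())
```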

Classification loss on proposals

Matched positive proposals get cross-entropy toward true class.
Background proposals are label 0.

Imbalance: with roughly 200k anchors and perhaps 10 objects, well over 99% of candidates are background.
Mitigations:

Sample balanced mini-batches of RoIs (e.g. 25% fg, 75% bg)
Focal loss in one-stage detectors
Hard negative mining (keep bg examples with high loss)
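The hard negative mining idea from the list above can be sketched in plain Python: keep every positive, then keep only the background examples with the highest loss. The function name and the 3:1 negative-to-positive cap are assumptions for illustration, not torchvision behavior.

```python
def select_hard_negatives(losses, is_positive, neg_pos_ratio=3):
    """Return sorted indices of all positives plus the highest-loss negatives.

    losses: per-example loss values (floats).
    is_positive: parallel list of booleans.
    neg_pos_ratio: assumed cap on negatives kept per positive.
    """
    positives = [i for i, pos in enumerate(is_positive) if pos]
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    # Hardest negatives are the background examples the model gets most wrong.
    negatives.sort(key=lambda i: losses[i], reverse=True)
    num_neg = neg_pos_ratio * max(len(positives), 1)
    return sorted(positives + negatives[:num_neg])


losses = [0.9, 0.05, 0.7, 0.01, 0.3, 0.02, 0.6]
is_positive = [True, False, False, False, False, False, False]
selected = select_hard_negatives(losses, is_positive)
print(selected)
```

Easy background examples (loss near zero) contribute almost nothing to the gradient, so dropping them rebalances the batch without discarding much signal.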

Box regression targets

Given anchor $A$ and ground truth $G$, predict deltas (schematic):

$$t_x = \frac{G_{cx} - A_{cx}}{A_w}, \quad t_w = \log\frac{G_w}{A_w}$$
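The $y$ and height deltas follow the same pattern as $t_x$ and $t_w$. As a worked example, here is a plain-Python sketch of encoding and decoding all four deltas for boxes in [xmin, ymin, xmax, ymax] format. The helper names are hypothetical, and real implementations often also scale the deltas by fixed weights, which is omitted here.

```python
import math


def to_cxcywh(box):
    """Convert [xmin, ymin, xmax, ymax] to (cx, cy, w, h)."""
    xmin, ymin, xmax, ymax = box
    w, h = xmax - xmin, ymax - ymin
    return xmin + 0.5 * w, ymin + 0.5 * h, w, h


def encode(anchor, gt):
    """Deltas (t_x, t_y, t_w, t_h) that move anchor onto gt."""
    acx, acy, aw, ah = to_cxcywh(anchor)
    gcx, gcy, gw, gh = to_cxcywh(gt)
    return ((gcx - acx) / aw, (gcy - acy) / ah, math.log(gw / aw), math.log(gh / ah))


def decode(anchor, deltas):
    """Invert encode: apply deltas to anchor and return an xyxy box."""
    acx, acy, aw, ah = to_cxcywh(anchor)
    tx, ty, tw, th = deltas
    cx, cy = acx + tx * aw, acy + ty * ah
    w, h = aw * math.exp(tw), ah * math.exp(th)
    return [cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h]


anchor = [0.0, 0.0, 100.0, 100.0]
gt = [10.0, 20.0, 110.0, 220.0]
deltas = encode(anchor, gt)
print(deltas)
print(decode(anchor, deltas))
```

Dividing by the anchor size and taking logs keeps the targets in a similar numeric range for small and large objects.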

Loss: Smooth L1 (less sensitive to outliers than L2):

python

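import torch.nn.functional as F

# conceptually: pred_deltas and target_deltas are (N, 4) tensors for matched positives
loss_box = F.smooth_l1_loss(pred_deltas, target_deltas, beta=1.0)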
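To see why Smooth L1 is less sensitive to outliers, here is the per-element definition written out in plain Python: quadratic below beta, linear above it. This is a scalar sketch for intuition, not the library implementation.

```python
def smooth_l1(pred, target, beta=1.0):
    """Element-wise Smooth L1: quadratic for small errors, linear for large ones."""
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


for err in (0.1, 0.5, 1.0, 4.0):
    half_l2 = 0.5 * err * err
    print(f"error={err}: smooth_l1={smooth_l1(err, 0.0):.3f}, half_l2={half_l2:.3f}")
```

For large errors the loss grows linearly rather than quadratically, so one badly matched box cannot dominate the gradient.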

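Modern systems also use IoU-based losses (GIoU / DIoU / CIoU). They optimize overlap directly, which tracks the evaluation metric more closely than per-coordinate L1, and unlike a plain IoU loss they still provide a gradient when the predicted and target boxes do not overlap.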
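As an illustration of the first of these, GIoU subtracts from IoU the fraction of the smallest enclosing box that is not covered by the union. A plain-Python sketch, assuming both boxes are [xmin, ymin, xmax, ymax] with positive area:

```python
def box_area(box):
    """Area of an xyxy box; negative extents (no overlap) count as zero."""
    return max(box[2] - box[0], 0.0) * max(box[3] - box[1], 0.0)


def giou(a, b):
    """Generalized IoU of two xyxy boxes; the value lies in (-1, 1]."""
    inter_box = [max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])]
    inter = box_area(inter_box)
    union = box_area(a) + box_area(b) - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = [min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3])]
    hull_area = box_area(hull)
    return inter / union - (hull_area - union) / hull_area


same = giou([0, 0, 2, 2], [0, 0, 2, 2])
apart = giou([0, 0, 1, 1], [2, 0, 3, 1])
print(same, apart, "loss for the disjoint pair:", 1 - apart)
```

The disjoint pair has IoU 0 regardless of distance, but its GIoU becomes more negative as the boxes move apart, which is what gives the loss a direction to improve in.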

Label id convention (torchvision)

python

# num_classes = 2 means: background + 1 foreground class
labels = torch.tensor([1, 1, 1])  # all "person"
# NEVER use 0 for person — 0 is background

When replacing FastRCNNPredictor:

python

from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=2)

num_classes includes background.
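With several foreground classes, the off-by-one between the class list and num_classes is easy to get wrong. A small sketch, using a hypothetical helper build_label_map and made-up class names:

```python
def build_label_map(class_names):
    """Map foreground class names to ids 1..K; id 0 stays reserved for background."""
    label_map = {name: idx for idx, name in enumerate(class_names, start=1)}
    num_classes = len(class_names) + 1  # +1 for background
    return label_map, num_classes


label_map, num_classes = build_label_map(["person", "bicycle", "car"])
print(label_map, num_classes)
```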

Building targets from instance masks (project pattern)

Penn-Fudan provides one mask image per photo, in which each pixel value is an instance id (0 for background):

python

import torch
 
def mask_to_boxes_labels(mask):
    """mask: 2D int tensor, 0=background, instance ids 1..N"""
    obj_ids = torch.unique(mask)
    obj_ids = obj_ids[obj_ids != 0]  # drop the background id
    boxes, labels = [], []
    for obj_id in obj_ids:
        # Row (y) and column (x) indices of every pixel in this instance
        ys, xs = torch.where(mask == obj_id)
        boxes.append([xs.min().item(), ys.min().item(), xs.max().item(), ys.max().item()])
        labels.append(1)  # person class
    if not boxes:
        # No instances: return correctly shaped empty tensors
        return torch.zeros((0, 4), dtype=torch.float32), torch.zeros((0,), dtype=torch.int64)
    return torch.as_tensor(boxes, dtype=torch.float32), torch.as_tensor(labels, dtype=torch.int64)

Edge case: for an image with zero objects, return empty tensors (boxes of shape (0, 4), labels of shape (0,)). This is still a valid target.
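A second edge case is a one-pixel-wide or one-pixel-tall instance, which yields xmax == xmin or ymax == ymin in the function above; torchvision detection models reject boxes without positive width and height. The sketch below packs boxes and labels into a target dict and drops such boxes. make_target is a hypothetical helper, and dropping (rather than padding by one pixel) is just one possible policy.

```python
import torch


def make_target(boxes, labels):
    """Build a target dict, dropping boxes with non-positive width or height."""
    boxes = torch.as_tensor(boxes, dtype=torch.float32).reshape(-1, 4)
    labels = torch.as_tensor(labels, dtype=torch.int64).reshape(-1)
    valid = (boxes[:, 2] > boxes[:, 0]) & (boxes[:, 3] > boxes[:, 1])
    return {"boxes": boxes[valid], "labels": labels[valid]}


# Second box has zero width and is removed.
target = make_target([[10, 20, 50, 80], [30, 30, 30, 60]], [1, 1])
empty = make_target([], [])
print(target["boxes"], target["labels"])
print(empty["boxes"].shape, empty["labels"].shape)
```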

DataLoader and collate

python

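from torch.utils.data import DataLoader

def collate_fn(batch):
    # Keep images and targets as tuples; default collation would try to stack them
    return tuple(zip(*batch))

loader = DataLoader(ds, batch_size=2, shuffle=True, collate_fn=collate_fn)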

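Each batch: images is a tuple of 3×H×W tensors (convert it to a list when moving to the device); targets is a tuple of dicts with variable box counts.

Do not torch.stack images of different sizes without a resize/pad policy. torchvision detection models accept a list of differently sized images and handle resizing and batching internally.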
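To make the effect of zip(*batch) concrete, here is a dependency-free sketch in which strings stand in for image tensors. The sample data is made up for illustration.

```python
def collate_fn(batch):
    # Transpose [(img, tgt), (img, tgt), ...] into ((img, img, ...), (tgt, tgt, ...)).
    return tuple(zip(*batch))


# Strings stand in for image tensors so the example needs no dependencies.
batch = [
    ("image_a", {"boxes": [[0, 0, 10, 10]], "labels": [1]}),
    ("image_b", {"boxes": [], "labels": []}),
]
images, targets = collate_fn(batch)
print(images)
print(len(targets), [len(t["boxes"]) for t in targets])
```

Nothing is stacked, so samples with different image sizes and different numbers of boxes can share a batch.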

Recommended transforms (detection)

python

import torch
import torchvision.transforms.v2 as T
 
train_tf = T.Compose([
    T.RandomPhotometricDistort(p=0.5),
    T.RandomHorizontalFlip(p=0.5),
    T.ToImage(),
    T.ToDtype(torch.float32, scale=True),
])

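v2 transforms update boxes when flipping, which is critical, but only when the image and target are passed through the transform together and the boxes are wrapped as tv_tensors.BoundingBoxes. Plain ToTensor() on a PIL image without box sync will misalign labels.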
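The arithmetic a horizontal flip has to apply to each box is small but easy to forget. A plain-Python sketch for [xmin, ymin, xmax, ymax] boxes in pixel coordinates (hflip_box is a hypothetical helper, shown only to illustrate what box sync means):

```python
def hflip_box(box, image_width):
    """Mirror an [xmin, ymin, xmax, ymax] box across the vertical centre line."""
    xmin, ymin, xmax, ymax = box
    # Left and right edges swap roles after mirroring.
    return [image_width - xmax, ymin, image_width - xmin, ymax]


box = [10, 20, 50, 80]
flipped = hflip_box(box, image_width=200)
print(flipped)
print(hflip_box(flipped, image_width=200))  # flipping twice restores the original
```

If the image is flipped but the box is not, the box ends up on the opposite side of the image from the object it is supposed to label.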

Optimizer and schedule (Faster R-CNN fine-tune)

Common recipe from torchvision references:

python

params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005, momentum=0.9, weight_decay=0.0005)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)
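To see what this schedule does, the sketch below computes the learning rate per epoch in closed form, assuming lr_scheduler.step() is called once at the end of each epoch. step_lr is a hypothetical helper, not the scheduler itself.

```python
def step_lr(base_lr, epoch, step_size=3, gamma=0.1):
    """Learning rate StepLR would use during the given 0-indexed epoch."""
    return base_lr * gamma ** (epoch // step_size)


for epoch in range(7):
    print(epoch, f"{step_lr(0.005, epoch):.6f}")
```

With step_size=3 and gamma=0.1 the rate drops tenfold every three epochs, so most of the learning happens in the first few epochs of a short fine-tune.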

Situation	Strategy
Epochs 1–2	Train all layers, moderate LR
Small dataset	Freeze early backbone layers, lower LR
Overfitting	Stronger augmentation, fewer epochs, weight decay

Debugging checklist

Symptom	Likely cause
Loss NaN	Invalid boxes; LR too high
Loss flat, no boxes after training	All labels are 0 (background)
Boxes bunched in a corner	Normalized (0 to 1) coords treated as pixels
Perfect train, zero val detections	Forgot `model.eval()` before validation, or score threshold too high
CUDA OOM	Batch size too large for ~800 px inputs; lower batch_size or image size

Sanity script

python

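img, tgt = dataset[0]
print(tgt["boxes"], tgt["labels"])
# Boxes are [xmin, ymin, xmax, ymax]; both width and height must be positive
assert (tgt["boxes"][:, 2] > tgt["boxes"][:, 0]).all()
assert (tgt["boxes"][:, 3] > tgt["boxes"][:, 1]).all()
assert (tgt["labels"] >= 1).all()  # 0 is reserved for background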

Visualize GT boxes on image before any training.
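Several rows of the debugging checklist can be checked automatically before training. The sketch below works on plain lists (for example tgt["boxes"].tolist() and tgt["labels"].tolist()). find_target_problems and its messages are hypothetical, and the "all coordinates <= 1.0" rule is only a heuristic for spotting normalized boxes.

```python
def find_target_problems(boxes, labels, width, height, num_classes):
    """Return a list of readable problems; an empty list means the target looks sane."""
    problems = []
    if len(boxes) != len(labels):
        problems.append(f"{len(boxes)} boxes but {len(labels)} labels")
    for i, (xmin, ymin, xmax, ymax) in enumerate(boxes):
        if xmax <= xmin or ymax <= ymin:
            problems.append(f"box {i} has non-positive width or height")
        if xmin < 0 or ymin < 0 or xmax > width or ymax > height:
            problems.append(f"box {i} lies outside the {width}x{height} image")
    if boxes and all(max(b) <= 1.0 for b in boxes):
        problems.append("all coordinates <= 1.0: boxes may be normalized, not pixels")
    for i, label in enumerate(labels):
        if not 1 <= label < num_classes:
            problems.append(f"label {i} is {label}, expected 1..{num_classes - 1}")
    return problems


good = find_target_problems([[10, 20, 50, 80]], [1], 100, 100, num_classes=2)
bad = find_target_problems([[0.1, 0.2, 0.5, 0.8]], [0], 100, 100, num_classes=2)
print(good)
print(bad)
```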

COCO training path (extension)

python

from torchvision.datasets import CocoDetection
 
ds = CocoDetection(root="images/", annFile="annotations.json", transforms=...)
# CocoDetection returns (image, list_of_ann_dicts); wrap to tensors.
# Each ann["bbox"] is [x, y, width, height] and must be converted to [xmin, ymin, xmax, ymax].
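The wrapping step might look like the following plain-Python sketch, which converts COCO's [x, y, width, height] boxes to corner format and remaps category ids to contiguous labels. coco_to_target and cat_id_to_label are hypothetical names, the sample annotations are made up, and skipping iscrowd regions and zero-size boxes is one reasonable policy rather than a requirement.

```python
def coco_to_target(annotations, cat_id_to_label):
    """Convert COCO annotation dicts to xyxy boxes and contiguous labels.

    cat_id_to_label: hypothetical mapping from COCO category_id to ids 1..K.
    """
    boxes, labels = [], []
    for ann in annotations:
        x, y, w, h = ann["bbox"]  # COCO stores [x, y, width, height]
        if ann.get("iscrowd", 0) or w <= 0 or h <= 0:
            continue  # skip crowd regions and degenerate boxes
        boxes.append([x, y, x + w, y + h])
        labels.append(cat_id_to_label[ann["category_id"]])
    return boxes, labels


# Made-up annotations for illustration.
anns = [
    {"bbox": [10, 20, 30, 40], "category_id": 18, "iscrowd": 0},
    {"bbox": [5, 5, 0, 10], "category_id": 18, "iscrowd": 0},
    {"bbox": [0, 0, 50, 50], "category_id": 44, "iscrowd": 1},
]
boxes, labels = coco_to_target(anns, cat_id_to_label={18: 1, 44: 2})
print(boxes, labels)
```

The resulting lists can then be passed to torch.as_tensor with dtype float32 for boxes and int64 for labels to form the target dict.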

Use pycocotools for official mAP evaluation on val.

What's next

Lesson 4 — IoU, NMS & mAP