Training detectors — losses, labels & data pipelines
Before we begin
Detection training is not "cross-entropy on one label." You optimize multiple losses on matched predictions — classification, box regression, objectness — with heavy class imbalance (thousands of background anchors vs few objects).
This lesson explains what loss_dict means in PyTorch, how labels flow from disk to the model, and how to debug the most common training failures.
What you will learn
- Decompose Faster R-CNN training losses and what each term optimizes.
- Explain Smooth L1 / GIoU box regression objectives.
- Build a valid torchvision target dict from masks or COCO JSON.
- Apply hard negative mining intuition.
- Diagnose broken boxes, label ids, and empty targets.
Faster R-CNN loss dictionary
In model.train() with targets:
loss_dict = model(images, targets)
# Example keys:
# loss_classifier, loss_box_reg, loss_objectness, loss_rpn_box_reg
losses = sum(loss_dict.values())
losses.backward()
| Loss | What it trains | Intuition |
|---|---|---|
| loss_rpn_box_reg | RPN box deltas | Proposals overlap objects |
| loss_objectness | RPN fg/bg | Proposals are object vs background |
| loss_classifier | RoI class | Correct category per proposal |
| loss_box_reg | RoI box refine | Tight boxes after RoI Align |
Inference: model.eval() + no targets → returns list of dicts with boxes, labels, scores.
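When logging, it helps to track each loss term separately rather than only the sum. A minimal sketch, assuming a hypothetical helper `summarize_losses` (not part of torchvision) that calls `float()` on each value, which works for 0-dimensional tensors as well as plain numbers. The values below are made up.

```python
def summarize_losses(loss_dict):
    """Return (total, per-term floats) for logging.

    Hypothetical helper: float() accepts 0-d tensors and plain numbers alike.
    """
    per_term = {name: float(value) for name, value in loss_dict.items()}
    total = sum(per_term.values())
    return total, per_term


# Made-up values standing in for the tensors returned by model(images, targets).
example = {
    "loss_classifier": 0.5,
    "loss_box_reg": 0.25,
    "loss_objectness": 0.125,
    "loss_rpn_box_reg": 0.125,
}
total, per_term = summarize_losses(example)
print(f"total={total:.3f}", per_term)
```

The float total is for logging only; call backward() on the tensor sum, not on this number.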
Classification loss on proposals
Matched positive proposals get cross-entropy toward true class.
Background proposals are label 0.
Imbalance: with ~200k anchors and maybe 10 objects, well over 99% of anchors are background.
Mitigations:
- Sample balanced mini-batches of RoIs (e.g. 25% fg, 75% bg)
- Focal loss in one-stage detectors
- Hard negative mining (keep bg examples with high loss)
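The first and last mitigations can be sketched in plain Python. This is an illustration under simplifying assumptions, not how any particular library implements it: `sample_rois` and `hardest_negatives` are hypothetical names, labels are a flat list with 0 for background, and the batch size of 8 is chosen only to keep the numbers small.

```python
import random


def sample_rois(labels, batch_size=8, fg_fraction=0.25, seed=0):
    """Pick RoI indices with at most `fg_fraction` foreground (label > 0)."""
    rng = random.Random(seed)
    fg = [i for i, lab in enumerate(labels) if lab > 0]
    bg = [i for i, lab in enumerate(labels) if lab == 0]
    num_fg = min(len(fg), int(batch_size * fg_fraction))
    num_bg = min(len(bg), batch_size - num_fg)
    return rng.sample(fg, num_fg) + rng.sample(bg, num_bg)


def hardest_negatives(losses, labels, k):
    """Return indices of the k background examples with the highest loss."""
    bg = [i for i, lab in enumerate(labels) if lab == 0]
    return sorted(bg, key=lambda i: losses[i], reverse=True)[:k]


labels = [0] * 97 + [1, 2, 1]        # 3 foreground proposals out of 100
picked = sample_rois(labels)          # 2 foreground + 6 background
losses = [0.01] * 100
losses[5], losses[40] = 0.9, 0.7      # two confidently wrong background proposals
print(hardest_negatives(losses, labels, k=2))  # [5, 40]
```

Random sampling fixes the ratio; hard negative mining instead chooses which background examples are worth a gradient step.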
Box regression targets
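Given anchor $(x_a, y_a, w_a, h_a)$ and ground truth $(x, y, w, h)$, each written as center, width, and height, predict deltas (schematic):
$$t_x = \frac{x - x_a}{w_a}, \quad t_y = \frac{y - y_a}{h_a}, \quad t_w = \log\frac{w}{w_a}, \quad t_h = \log\frac{h}{h_a}$$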
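A common parameterization, the one used in the Faster R-CNN paper, divides the center shift by the anchor's width and height and takes the log of the size ratios. The sketch below assumes boxes given as (cx, cy, w, h); `encode_deltas` and `decode_deltas` are illustrative names, and library implementations may additionally scale the deltas by fixed weights.

```python
import math


def encode_deltas(anchor, gt):
    """Both boxes are (cx, cy, w, h). Returns (tx, ty, tw, th)."""
    xa, ya, wa, ha = anchor
    x, y, w, h = gt
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))


def decode_deltas(anchor, deltas):
    """Invert encode_deltas: apply predicted deltas to an anchor."""
    xa, ya, wa, ha = anchor
    tx, ty, tw, th = deltas
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))


anchor = (50.0, 50.0, 20.0, 40.0)
gt = (55.0, 40.0, 40.0, 40.0)
deltas = encode_deltas(anchor, gt)   # (0.25, -0.25, log 2, 0.0)
print(deltas, decode_deltas(anchor, deltas))
```

Because the shifts are relative to the anchor size, a 5-pixel error counts for more on a small anchor than on a large one.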
Loss: Smooth L1 (less sensitive to outliers than L2):
# conceptually
loss_box = smooth_l1_loss(pred_deltas, target_deltas, beta=1.0)
Modern systems also use GIoU / DIoU / CIoU losses, which penalize non-overlapping boxes better than plain L1 on coordinates.
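To make both objectives concrete, here is a scalar sketch of Smooth L1 and of GIoU for a single pair of boxes. It assumes beta > 0 and (xmin, ymin, xmax, ymax) boxes with positive area; the function names are illustrative, and a GIoU loss would typically be 1 - giou.

```python
def smooth_l1(diff, beta=1.0):
    """Quadratic for |diff| < beta, linear beyond it."""
    a = abs(diff)
    return 0.5 * a * a / beta if a < beta else a - 0.5 * beta


def giou(box_a, box_b):
    """Generalized IoU for two (xmin, ymin, xmax, ymax) boxes with positive area."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (hull - union) / hull


print(smooth_l1(0.5), smooth_l1(3.0))     # 0.125 2.5
print(giou((0, 0, 2, 2), (1, 1, 3, 3)))   # overlapping boxes
print(giou((0, 0, 1, 1), (2, 0, 3, 1)))   # disjoint: IoU is 0, GIoU is negative
```

For the disjoint pair, plain IoU is 0 no matter how far apart the boxes are, while GIoU still changes with the gap and so still provides a training signal.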
Label id convention (torchvision)
# num_classes = 2 means: background + 1 foreground class
labels = torch.tensor([1, 1, 1]) # all "person"
# NEVER use 0 for person: 0 is background
When replacing FastRCNNPredictor:
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=2)
num_classes includes background.
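With more than one foreground class, the same convention can be kept by numbering classes from 1. A small sketch, with `build_label_map` as a hypothetical helper and the class names chosen only for illustration:

```python
def build_label_map(class_names):
    """Map foreground class names to ids 1..K; id 0 stays reserved for background."""
    label_map = {name: idx for idx, name in enumerate(class_names, start=1)}
    num_classes = len(class_names) + 1  # +1 for background
    return label_map, num_classes


label_map, num_classes = build_label_map(["person", "bicycle"])
print(label_map, num_classes)  # {'person': 1, 'bicycle': 2} 3
```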
Building targets from instance masks (project pattern)
Penn-Fudan provides one mask image per photo; each pixel value is an instance id:
import torch
def mask_to_boxes_labels(mask):
    """mask: 2D int tensor, 0=background, instance ids 1..N"""
    obj_ids = torch.unique(mask)
    obj_ids = obj_ids[obj_ids != 0]  # drop the background id
    boxes, labels = [], []
    for obj_id in obj_ids:
        pos = torch.where(mask == obj_id)  # (row indices, column indices)
        ymin, ymax = pos[0].min(), pos[0].max()
        xmin, xmax = pos[1].min(), pos[1].max()
        boxes.append([xmin.item(), ymin.item(), xmax.item(), ymax.item()])
        labels.append(1)  # person class
    if not boxes:
        return torch.zeros((0, 4), dtype=torch.float32), torch.zeros((0,), dtype=torch.int64)
    return torch.as_tensor(boxes, dtype=torch.float32), torch.as_tensor(labels, dtype=torch.int64)
Edge case: an image with zero objects returns empty tensors of shape (0, 4) and (0,), which is still a valid target.
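Once boxes exist, they still have to be packed into a target dict. The sketch below uses plain lists to stay dependency-free; in practice each field would be converted with torch.as_tensor (float32 for boxes, int64 for labels). `build_target` is a hypothetical helper, and the `image_id`, `area`, and `iscrowd` keys are an assumption based on what COCO-style evaluation code commonly expects; the model itself reads `boxes` and `labels`.

```python
def build_target(boxes, image_id, label=1):
    """Assemble target fields from (xmin, ymin, xmax, ymax) boxes.

    Boxes with zero width or height are dropped, since a one-pixel-wide
    instance gives xmin == xmax under the min/max rule.
    """
    kept = [b for b in boxes if b[2] > b[0] and b[3] > b[1]]
    return {
        "boxes": kept,
        "labels": [label] * len(kept),
        "image_id": image_id,
        "area": [(b[2] - b[0]) * (b[3] - b[1]) for b in kept],
        "iscrowd": [0] * len(kept),
    }


target = build_target([[10, 20, 50, 80], [30, 30, 30, 60]], image_id=0)
print(target["boxes"], target["area"])  # [[10, 20, 50, 80]] [2400]
```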
DataLoader and collate
from torch.utils.data import DataLoader
def collate_fn(batch):
    return tuple(zip(*batch))
loader = DataLoader(ds, batch_size=2, shuffle=True, collate_fn=collate_fn)
Each batch: images is a tuple of 3×H×W tensors; targets is a tuple of dicts (variable box counts).
Do not torch.stack images of different sizes without resize/pad policy.
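To see what the collate function does, it can be run on a toy batch. In this sketch strings stand in for image tensors of different sizes, and the box values are made up.

```python
def collate_fn(batch):
    # Same one-liner as above: transpose a list of pairs into a pair of tuples.
    return tuple(zip(*batch))


# Strings stand in for image tensors of different sizes.
batch = [
    ("img_a (3x480x640)", {"boxes": [[1, 2, 3, 4]], "labels": [1]}),
    ("img_b (3x600x800)", {"boxes": [], "labels": []}),
]
images, targets = collate_fn(batch)
print(images)
print([len(t["boxes"]) for t in targets])  # [1, 0]
```

Nothing is stacked, so images of different sizes and targets with different box counts pass through unchanged.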
Recommended transforms (detection)
import torch
import torchvision.transforms.v2 as T
train_tf = T.Compose([
    T.RandomPhotometricDistort(p=0.5),
    T.RandomHorizontalFlip(p=0.5),
    T.ToImage(),
    T.ToDtype(torch.float32, scale=True),
])
v2 transforms update boxes when flipping, which is critical; this only happens when the boxes are passed alongside the image as tv_tensors.BoundingBoxes. Plain ToTensor() on PIL without box sync will misalign labels.
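What "box sync" means for a horizontal flip can be written out by hand. A minimal sketch, assuming pixel-coordinate (xmin, ymin, xmax, ymax) boxes; `hflip_boxes` is an illustrative name, not a torchvision function.

```python
def hflip_boxes(boxes, image_width):
    """Mirror (xmin, ymin, xmax, ymax) boxes around the vertical center line."""
    return [[image_width - xmax, ymin, image_width - xmin, ymax]
            for xmin, ymin, xmax, ymax in boxes]


flipped = hflip_boxes([[10, 20, 50, 80]], image_width=100)
print(flipped)  # [[50, 20, 90, 80]]
```

Note that xmin and xmax swap roles: the new xmin comes from the old xmax. Flipping the image without applying this to the boxes leaves every box on the wrong side.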
Optimizer and schedule (Faster R-CNN fine-tune)
Common recipe from torchvision references:
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005, momentum=0.9, weight_decay=0.0005)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)
| Phase | Strategy |
|---|---|
|---|---|
| Epochs 1–2 | Train all layers, moderate LR |
| Small data | Freeze backbone early layers, lower LR |
| Overfit | Stronger aug, fewer epochs, weight decay |
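The step schedule in the recipe above is simple arithmetic: the rate is multiplied by gamma every step_size epochs. A sketch of that rule, assuming the scheduler is stepped once per epoch and epochs are counted from 0; `step_lr` is an illustrative function, not the PyTorch class.

```python
def step_lr(base_lr, epoch, step_size=3, gamma=0.1):
    """Learning rate in effect during `epoch` (0-based) under step decay."""
    return base_lr * gamma ** (epoch // step_size)


for epoch in range(7):
    print(epoch, f"{step_lr(0.005, epoch):.6f}")
```

With lr=0.005, epochs 0 to 2 run at 0.005, epochs 3 to 5 at 0.0005, and epoch 6 onward at 0.00005, so a 10-epoch run spends most of its time at the lower rates.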
Debugging checklist
| Symptom | Likely cause |
|---|---|
| Loss NaN | Invalid boxes; LR too high |
| Loss flat, no boxes after train | All labels 0 (background) |
| Boxes in corner | Normalized coords treated as pixels |
| Perfect train, zero val detections | Forgot model.eval() or wrong score thresh |
| CUDA OOM | batch_size>1 on 800px images |
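For the "zero val detections" row, it is worth checking how many predictions survive the score threshold before blaming the model. A sketch on plain lists with made-up values; `filter_predictions` is a hypothetical helper, and with tensors the same idea is a boolean mask on the scores.

```python
def filter_predictions(pred, score_thresh=0.5):
    """Keep detections whose score is at least `score_thresh`."""
    keep = [i for i, s in enumerate(pred["scores"]) if s >= score_thresh]
    return {key: [values[i] for i in keep] for key, values in pred.items()}


# Made-up prediction in the boxes/labels/scores layout, as plain lists.
pred = {
    "boxes": [[10, 10, 50, 90], [12, 8, 48, 88], [200, 40, 230, 60]],
    "labels": [1, 1, 1],
    "scores": [0.92, 0.31, 0.04],
}
print(filter_predictions(pred)["scores"])        # [0.92]
print(filter_predictions(pred, 0.95)["scores"])  # [] (threshold too strict)
```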
Sanity script
img, tgt = dataset[0]
print(tgt["boxes"], tgt["labels"])
assert (tgt["boxes"][:, 2] > tgt["boxes"][:, 0]).all()  # xmax > xmin
assert (tgt["boxes"][:, 3] > tgt["boxes"][:, 1]).all()  # ymax > ymin
Visualize GT boxes on image before any training.
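The same idea can be extended to cover several rows of the debugging checklist at once. A sketch on plain lists, assuming pixel-coordinate (xmin, ymin, xmax, ymax) boxes; `check_target` and its messages are hypothetical, and the "looks normalized" test is only a heuristic.

```python
def check_target(boxes, labels, width, height, num_classes):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if len(boxes) != len(labels):
        problems.append("boxes and labels differ in length")
    for i, (xmin, ymin, xmax, ymax) in enumerate(boxes):
        if xmax <= xmin or ymax <= ymin:
            problems.append(f"box {i}: non-positive width or height")
        if xmin < 0 or ymin < 0 or xmax > width or ymax > height:
            problems.append(f"box {i}: outside the image")
    if boxes and max(max(b) for b in boxes) <= 1.0:
        problems.append("coordinates look normalized; pixels expected")
    for i, lab in enumerate(labels):
        if not 1 <= lab < num_classes:
            problems.append(f"label {i}: {lab} not in 1..{num_classes - 1}")
    return problems


print(check_target([[10, 20, 50, 80]], [1], 100, 100, num_classes=2))  # []
print(check_target([[0.1, 0.2, 0.5, 0.8]], [0], 100, 100, num_classes=2))
```

The second call reports two problems: normalized-looking coordinates and a foreground box labeled 0.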
COCO training path (extension)
from torchvision.datasets import CocoDetection
ds = CocoDetection(root="images/", annFile="annotations.json", transforms=...)
# CocoDetection returns (image, list_of_ann_dicts); wrap to tensors
Use pycocotools for official mAP evaluation on val.
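The wrapping step mostly amounts to converting COCO's [x, y, width, height] boxes to corner format and remapping category ids. A sketch on plain dicts and lists; `coco_to_target` and the `cat_to_label` mapping are hypothetical, skipping `iscrowd` annotations is an assumption about the training setup, and the results would still need torch.as_tensor before reaching the model.

```python
def coco_to_target(anns, cat_to_label):
    """Convert COCO annotation dicts to xyxy boxes and contiguous labels."""
    boxes, labels = [], []
    for ann in anns:
        if ann.get("iscrowd", 0):
            continue  # assumption: skip crowd regions during training
        x, y, w, h = ann["bbox"]  # COCO stores [x, y, width, height]
        if w <= 0 or h <= 0:
            continue  # drop degenerate boxes
        boxes.append([x, y, x + w, y + h])
        labels.append(cat_to_label[ann["category_id"]])
    return boxes, labels


anns = [
    {"bbox": [10, 20, 30, 40], "category_id": 18, "iscrowd": 0},
    {"bbox": [5, 5, 0, 10], "category_id": 1, "iscrowd": 0},
]
print(coco_to_target(anns, cat_to_label={1: 1, 18: 2}))
# ([[10, 20, 40, 60]], [2])
```

The explicit mapping matters because COCO category ids are not guaranteed to be contiguous, while the classifier head expects labels in 1..num_classes-1.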