From classification to detection
Before we begin
A classifier collapses the whole image into one decision. A detector must find all objects and localize each one. That sounds like a small step, but it is not: the output is a variable-size set, training needs matching between predictions and ground truth, and evaluation relies on IoU-based metrics rather than simple accuracy.
This lesson builds the vocabulary and math you need before architectures and training.
What you will learn
- Contrast fixed-length classifier outputs vs variable detection sets.
- Read, convert, and validate bounding box formats.
- Explain anchors, positive/negative assignment, and box deltas.
- Parse COCO and YOLO label files.
- State why permutation-invariant matching is required in training.
Step 1 — Two questions, two output types
Classification:
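One vector of length $K$ (the number of classes), with the same shape for every image.
Detection: a set $\{(b_i, c_i, s_i)\}_{i=1}^{N}$, where
- $b_i$: box parameters
- $c_i$: class (background handled separately in many APIs)
- $s_i$: confidence
- $N$: number of detections, which depends on the image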
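To make the contrast concrete, here is a minimal sketch in plain Python. The class count `NUM_CLASSES`, the `Detection` record, the stand-in `classify` and `detect` functions, and all scores are illustrative assumptions, not the output of a real model.

```python
from dataclasses import dataclass

NUM_CLASSES = 3  # hypothetical number of classes for this sketch


@dataclass
class Detection:
    box: tuple    # (x1, y1, x2, y2) in pixels
    label: int    # class id
    score: float  # confidence in [0, 1]


def classify(image_name):
    """Stand-in classifier: always returns exactly NUM_CLASSES scores."""
    return [1.0 / NUM_CLASSES] * NUM_CLASSES


def detect(raw_candidates, score_threshold=0.5):
    """Stand-in detector head: keep candidates at or above a score threshold.

    The returned list can have a different length for every image.
    """
    return [d for d in raw_candidates if d.score >= score_threshold]


empty_shelf = []
busy_shelf = [
    Detection((10, 20, 60, 90), label=1, score=0.92),
    Detection((70, 20, 120, 90), label=1, score=0.81),
    Detection((130, 25, 180, 95), label=2, score=0.30),  # below threshold
]

print(len(classify("a.jpg")), len(classify("b.jpg")))     # fixed length
print(len(detect(empty_shelf)), len(detect(busy_shelf)))  # varies per image
```

The classifier output has the same length for both images, while the detector returns zero boxes for one image and two for the other.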
Figure: Fixed vs variable outputs
Product example: A shelf camera needs one box per product facing, so the number of boxes changes every frame.
Checkpoint: Why is padding "always 100 boxes" a bad training target?
Answer sketch: Most slots would be empty sentinels; the model wastes capacity learning padding semantics. Better: set prediction, dynamic NMS output, or objectness scores with thresholding.
Step 2 — Bounding box formats (worked example)
Image size: width=400, height=300 (pixels).
Object: top-left corner (100, 50), bottom-right (220, 180).
| Format | Values | Notes |
|---|---|---|
| xyxy | [100, 50, 220, 180] | PyTorch / torchvision default |
| xywh (COCO file) | [100, 50, 120, 130] | w = 220 - 100 = 120, h = 180 - 50 = 130 |
| cxcywh | [160, 115, 120, 130] | center (160, 115) |
| normalized cxcywh | [0.4, 0.383, 0.3, 0.433] | divide by image W,H |
```python
import torch
from torchvision.ops import box_convert

# One box in xyxy pixel coordinates
boxes_xyxy = torch.tensor([[100., 50., 220., 180.]])
boxes_cxcywh = box_convert(boxes_xyxy, in_fmt="xyxy", out_fmt="cxcywh")
print(boxes_cxcywh)  # tensor([[160., 115., 120., 130.]])
```
Validation rules (catch bugs early)
```python
def validate_xyxy(boxes, img_w, img_h):
    """Raise AssertionError if any xyxy box is degenerate or out of bounds."""
    assert (boxes[:, 2] > boxes[:, 0]).all(), "x2 must exceed x1"
    assert (boxes[:, 3] > boxes[:, 1]).all(), "y2 must exceed y1"
    assert (boxes[:, 0] >= 0).all() and (boxes[:, 2] <= img_w).all(), "x out of bounds"
    assert (boxes[:, 1] >= 0).all() and (boxes[:, 3] <= img_h).all(), "y out of bounds"
```
Invalid boxes (zero area, inverted coords) break IoU and training.
Step 3 — Anchors and assignment
Early and many modern detectors use anchor boxes — predefined shapes at each feature map location.
At a feature-map cell with stride 16, you might have anchors built from:
- three scales (side lengths in pixels)
- three aspect ratios
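As an illustration of how such anchors can be generated, here is a minimal sketch. The scales (32, 64, 128), the ratios (0.5, 1.0, 2.0), the cell indices, and the convention of centering anchors at `(col + 0.5) * stride` are all assumptions chosen for this example; real detectors differ in these details.

```python
import math


def anchors_at_cell(row, col, stride, scales, ratios):
    """Return xyxy anchors centered on one feature-map cell.

    ratio is height / width; each anchor keeps an area of scale**2.
    """
    cx = (col + 0.5) * stride
    cy = (row + 0.5) * stride
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


# Assumed values, for illustration only.
anchors = anchors_at_cell(row=5, col=8, stride=16,
                          scales=(32, 64, 128), ratios=(0.5, 1.0, 2.0))
print(len(anchors))  # 9 anchors: 3 scales x 3 ratios
print(anchors[1])    # scale 32, ratio 1.0 -> (120.0, 72.0, 152.0, 104.0)
```

Every cell gets the same set of shapes, shifted to its own center, which is why the number of anchors grows quickly with feature-map resolution.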
Each anchor is labeled by its best IoU with any ground-truth box:
| IoU with ground truth | Typical label |
|---|---|
| ≥ 0.5 | Positive — predict object |
| < 0.4 | Negative — background |
| 0.4 – 0.5 | Ignore — no loss |
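The thresholds in the table can be applied as in the sketch below. The helper names `iou_xyxy` and `label_anchor` and the example anchors are hypothetical; the 0.5 and 0.4 thresholds are the ones from the table.

```python
def iou_xyxy(a, b):
    """IoU of two boxes in xyxy format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_anchor(anchor, gt_boxes, pos_thresh=0.5, neg_thresh=0.4):
    """Return (label, index of matched ground-truth box or None)."""
    if not gt_boxes:
        return "negative", None
    ious = [iou_xyxy(anchor, gt) for gt in gt_boxes]
    best = max(range(len(ious)), key=ious.__getitem__)
    if ious[best] >= pos_thresh:
        return "positive", best
    if ious[best] < neg_thresh:
        return "negative", None
    return "ignore", None


gt = [(100, 50, 220, 180)]
print(label_anchor((100, 50, 220, 170), gt))   # IoU ~0.92
print(label_anchor((300, 200, 380, 280), gt))  # IoU 0
print(label_anchor((100, 50, 220, 110), gt))   # IoU ~0.46
```

Some implementations additionally force each ground-truth box to claim its single best anchor so that no object is left without a positive; that step is omitted here.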
Network targets for a positive anchor:
- Classification: class id of the matched ground-truth box
- Regression: deltas mapping the anchor to the matched ground-truth box
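One common way to define the regression deltas is the Faster R-CNN style parametrization: center offsets scaled by the anchor size, plus log ratios of width and height. The sketch below assumes that parametrization; the function names and the 128x128 anchor are hypothetical, and real implementations often add scaling weights and clamping that are left out here.

```python
import math


def xyxy_to_cxcywh(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)


def encode_deltas(anchor, gt):
    """Deltas (tx, ty, tw, th) that map the anchor onto the ground-truth box."""
    ax, ay, aw, ah = xyxy_to_cxcywh(anchor)
    gx, gy, gw, gh = xyxy_to_cxcywh(gt)
    return ((gx - ax) / aw, (gy - ay) / ah,
            math.log(gw / aw), math.log(gh / ah))


def decode_deltas(anchor, deltas):
    """Inverse of encode_deltas: apply deltas to the anchor, return xyxy."""
    ax, ay, aw, ah = xyxy_to_cxcywh(anchor)
    tx, ty, tw, th = deltas
    cx, cy = ax + tx * aw, ay + ty * ah
    w, h = aw * math.exp(tw), ah * math.exp(th)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


anchor = (96.0, 56.0, 224.0, 184.0)  # hypothetical 128x128 anchor
gt = (100.0, 50.0, 220.0, 180.0)     # box from the worked example
deltas = encode_deltas(anchor, gt)
recovered = decode_deltas(anchor, deltas)
print(deltas)
print(recovered)
```

Encoding then decoding recovers the ground-truth box up to floating-point error, which is a useful unit test for any delta implementation.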
Figure: IoU for assignment
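Anchor-free detectors (FCOS, CenterNet) predict box geometry directly at each location. FCOS, for example, regresses the distances from a location to the four sides of the box. This removes hand-tuned anchor grids, but still needs center-ness or similar quality estimates.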
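A minimal sketch of FCOS-style targets for a single location follows. The function names are hypothetical, and the center-ness formula is the square root of the product of the left/right and top/bottom distance ratios, as used in FCOS; treat it as an illustration rather than a full target assigner.

```python
import math


def fcos_targets(point, box):
    """Distances (l, t, r, b) from a location to the sides of an xyxy box.

    Returns None when the location is not strictly inside the box.
    """
    px, py = point
    x1, y1, x2, y2 = box
    l, t, r, b = px - x1, py - y1, x2 - px, y2 - py
    if min(l, t, r, b) <= 0:
        return None
    return l, t, r, b


def centerness(l, t, r, b):
    """1.0 at the box center, approaching 0 near the edges."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))


box = (100, 50, 220, 180)
center = fcos_targets((160, 115), box)  # (60, 65, 60, 65)
corner = fcos_targets((110, 60), box)   # (10, 10, 110, 120)
print(center, centerness(*center))
print(corner, round(centerness(*corner), 3))
```

Locations near the box center get a center-ness close to 1, while locations near a corner get a low value and can be down-weighted at inference time.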
Step 4 — Set matching (why order does not matter)
Suppose the model outputs 100 boxes but the image has 3 objects. Which output is "box 1"?
Predictions are an unordered set, so training must first decide which prediction is responsible for which ground-truth box:
- Faster R-CNN / YOLO: assign by IoU to anchors or by grid cell
- DETR: one-to-one bipartite matching with the Hungarian algorithm, where cost = class loss + L1 box distance (the original DETR also adds a generalized IoU term)
Without matching, you cannot define a stable loss: swapping prediction indices should not change the optimum.
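The sketch below illustrates one-to-one matching on a toy example. It is not the Hungarian algorithm: it brute-forces every assignment with `itertools.permutations`, which returns the same optimal matching on small inputs but does not scale. The cost uses only an L1 box distance (no class term), and the boxes are made up for illustration. In practice, `scipy.optimize.linear_sum_assignment` is a common choice for this step.

```python
from itertools import permutations


def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def match(preds, gts):
    """Return (assignment, cost), where assignment[g] is the prediction
    index matched to ground-truth box g.

    Brute force over all one-to-one assignments; assumes
    len(preds) >= len(gts). Fine for a toy example only.
    """
    best_cost, best_assignment = float("inf"), None
    for assignment in permutations(range(len(preds)), len(gts)):
        cost = sum(l1_distance(preds[p], gts[g])
                   for g, p in enumerate(assignment))
        if cost < best_cost:
            best_cost, best_assignment = cost, assignment
    return best_assignment, best_cost


gts = [(10, 10, 50, 50), (100, 50, 220, 180)]
preds = [(98, 52, 221, 179), (300, 300, 340, 340), (12, 9, 49, 52)]

assignment, cost = match(preds, gts)
print(assignment, cost)  # (2, 0) 12

# Reordering the predictions changes the indices but not the matched cost.
shuffled = [preds[2], preds[0], preds[1]]
assignment_shuffled, cost_shuffled = match(shuffled, gts)
print(assignment_shuffled, cost_shuffled)  # (0, 1) 12
```

The matched cost is identical for both orderings, which is the permutation invariance the loss needs. The unmatched prediction would be trained toward "no object".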
Step 5 — Annotation formats in the wild
COCO JSON (research & torchvision)
```json
{
  "images": [{"id": 1, "file_name": "img001.jpg", "width": 640, "height": 480}],
  "categories": [{"id": 1, "name": "person"}],
  "annotations": [{
    "id": 10,
    "image_id": 1,
    "category_id": 1,
    "bbox": [100, 50, 120, 130],
    "area": 15600,
    "iscrowd": 0
  }]
}
```
`bbox` is xywh top-left format. `iscrowd=1` uses different IoU rules for crowded regions.
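A minimal sketch of reading this structure with the standard library follows. It groups annotations by `image_id` and converts each `bbox` from xywh to xyxy. The function name `load_boxes_xyxy` and the inline JSON string are assumptions for the example; real projects commonly use the `pycocotools` package instead.

```python
import json
from collections import defaultdict

coco_text = """
{
  "images": [{"id": 1, "file_name": "img001.jpg", "width": 640, "height": 480}],
  "categories": [{"id": 1, "name": "person"}],
  "annotations": [{"id": 10, "image_id": 1, "category_id": 1,
                   "bbox": [100, 50, 120, 130], "area": 15600, "iscrowd": 0}]
}
"""


def load_boxes_xyxy(coco):
    """Group annotations by image_id and convert xywh boxes to xyxy."""
    per_image = defaultdict(list)
    for ann in coco["annotations"]:
        x, y, w, h = ann["bbox"]
        per_image[ann["image_id"]].append({
            "box": [x, y, x + w, y + h],
            "category_id": ann["category_id"],
            "iscrowd": ann["iscrowd"],
        })
    return dict(per_image)


coco = json.loads(coco_text)
boxes = load_boxes_xyxy(coco)
print(boxes[1][0]["box"])  # [100, 50, 220, 180]
```

Images without annotations do not appear in the result, so a data loader built on this would need to handle missing keys explicitly.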
YOLO txt (one file per image)
```text
# class_id cx cy w h (all normalized 0–1)
0 0.400 0.383 0.300 0.433
```
Folder layout:
```text
dataset/
  images/train/*.jpg
  labels/train/*.txt   # same stem as image
  data.yaml            # class names, paths (Ultralytics)
```
torchvision detection target dict
```python
import torch

target = {
    "boxes": torch.tensor([[100., 50., 220., 180.]]),  # xyxy, float32
    "labels": torch.tensor([1]),                        # int64; 0 is reserved for background
    "image_id": torch.tensor([42]),
    "area": torch.tensor([15600.]),
    "iscrowd": torch.tensor([0]),
}
```
Critical: class 0 is background in Faster R-CNN, so your first real class is often label 1.
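The label offset matters when moving between formats. As a sketch, the function below parses one YOLO txt line (normalized cxcywh, 0-based class ids) into a pixel xyxy box and shifts the label by one so that 0 stays free for background. The function name and the `label_offset` parameter are hypothetical, and the 400x300 image size is the one from the worked example in Step 2.

```python
def yolo_line_to_xyxy(line, img_w, img_h, label_offset=1):
    """Parse 'class_id cx cy w h' (normalized) into a pixel xyxy box.

    label_offset=1 shifts YOLO's 0-based class ids so that label 0
    remains reserved for background.
    """
    parts = line.split()
    class_id = int(parts[0])
    cx, cy, w, h = (float(v) for v in parts[1:5])
    x1 = (cx - w / 2) * img_w
    y1 = (cy - h / 2) * img_h
    x2 = (cx + w / 2) * img_w
    y2 = (cy + h / 2) * img_h
    return [x1, y1, x2, y2], class_id + label_offset


box, label = yolo_line_to_xyxy("0 0.400 0.383 0.300 0.433", img_w=400, img_h=300)
print(box, label)
```

The recovered box is close to [100, 50, 220, 180] but not exact, because the label file stores only three decimals. Small discrepancies like this are expected after a round trip through normalized coordinates.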
Step 6 — When detection vs classification
| Choose classification | Choose detection |
|---|---|
| Single dominant object centered | Multiple objects |
| Only presence matters | Position matters for action (pick, avoid) |
| Tiny data, simple deploy | Need counts or tracking (Module 6) |
Hybrid pipelines: detect then classify crops (two-stage product design).
Mini exercise (paper and pencil)
Box A (pred): xyxy [10, 10, 50, 50]
Box B (GT): xyxy [30, 30, 70, 70]
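Intersection: [30, 30] to [50, 50] → area $20 \times 20 = 400$
Area A $= 40 \times 40 = 1600$, Area B $= 40 \times 40 = 1600$
Union $= 1600 + 1600 - 400 = 2800$
IoU $= 400 / 2800 \approx 0.143$
Would this match as TP at IoU 0.5? No. Since 0.143 < 0.5, the prediction counts as a FP, and B becomes a FN unless another prediction matches it.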
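To check the arithmetic, here is a short sketch that returns the intermediate quantities as well as the IoU. The function name `iou_breakdown` is hypothetical, and it assumes both boxes are valid xyxy boxes with positive area. For batches of tensors, `torchvision.ops.box_iou` computes pairwise IoU directly.

```python
def iou_breakdown(a, b):
    """Return (intersection, union, iou) for two valid xyxy boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter, union, inter / union


inter, union, iou = iou_breakdown([10, 10, 50, 50], [30, 30, 70, 70])
print(inter, union, round(iou, 3))  # 400 2800 0.143
print(iou >= 0.5)                   # False: not a match at IoU 0.5
```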
Common beginner mistakes
| Mistake | Symptom |
|---|---|
| xywh vs xyxy confusion | Boxes shifted, huge mAP drop |
| Normalized vs pixel coords | Boxes clustered in corner |
| Label 0 for person | Silent training bug (background) |
| No train/val split for threshold tuning | Overfit score threshold |