From classification to detection

Before we begin

A classifier collapses the whole image into one decision. A detector must find every object and localize each one. That sounds like a small step, but it is not: the output is a variable-size set, training needs a matching between predictions and ground truth, and evaluation relies on IoU-based metrics such as mAP instead of simple accuracy.

This lesson builds the vocabulary and math you need before architectures and training.

What you will learn

Contrast fixed-length classifier outputs vs variable detection sets.
Read, convert, and validate bounding box formats.
Explain anchors, positive/negative assignment, and box deltas.
Parse COCO and YOLO label files.
State why permutation-invariant matching is required in training.

Step 1 — Two questions, two output types

Classification:

$$f_\theta(\mathbf{x}) = \mathbf{p} \in \mathbb{R}^K, \quad \sum_k p_k = 1$$

One vector of length $K$ — same shape for every image.

Detection:

$$f_\theta(\mathbf{x}) = \big\{(b_i, c_i, s_i)\big\}_{i=1}^{N(\mathbf{x})}$$

$b_i \in \mathbb{R}^4$ — box parameters
$c_i \in \{1,\ldots,K\}$ — class (background handled separately in many APIs)
$s_i \in [0,1]$ — confidence
$N(\mathbf{x})$ — depends on the image
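The contrast between the two output types can be sketched in plain Python. The logits, boxes, classes, and scores below are made-up illustrative values, and `softmax` is a small helper written for this sketch rather than a library call.

```python
import math

def softmax(logits):
    """Turn K raw scores into K probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Classification: always one vector of length K (here K = 3).
probs = softmax([2.0, 0.5, -1.0])

# Detection: a list of (box, class, score) triples whose length varies per image.
# All values here are illustrative, not model outputs.
detections_per_image = [
    [],                                          # no objects: N = 0
    [((100.0, 50.0, 220.0, 180.0), 1, 0.92)],    # one object: N = 1
    [((10.0, 10.0, 50.0, 50.0), 2, 0.81),
     ((30.0, 30.0, 70.0, 70.0), 1, 0.64)],       # two objects: N = 2
]

print(len(probs))                                # K, the same for every image
print([len(d) for d in detections_per_image])    # N(x), different per image
```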

Figure: Fixed vs variable outputs. Detection must handle zero objects, one object, or dozens.

Product example: A shelf camera needs one box per product facing — $N$ changes every frame.

Checkpoint: Why is padding "always 100 boxes" a bad training target?

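Answer sketch: Most slots would hold empty sentinel boxes, so the model spends capacity learning padding semantics, and the loss would depend on which slot each object happens to be written to. Better options: set prediction with matching, variable-length output after NMS, or per-candidate objectness scores with a threshold.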
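As a rough illustration of the thresholding option, the sketch below filters a fixed pool of scored candidates down to a variable-length result. The helper name `keep_confident`, the candidate values, and the 0.5 threshold are all assumptions made for this example.

```python
def keep_confident(candidates, score_threshold=0.5):
    """Return only the candidates whose score reaches the threshold."""
    return [c for c in candidates if c["score"] >= score_threshold]

# Hypothetical fixed pool of candidate slots, as a dense prediction head might emit.
candidates = [
    {"box": (100.0, 50.0, 220.0, 180.0), "label": 1, "score": 0.92},
    {"box": (12.0, 8.0, 40.0, 60.0), "label": 2, "score": 0.07},
    {"box": (300.0, 200.0, 380.0, 290.0), "label": 1, "score": 0.55},
    {"box": (0.0, 0.0, 5.0, 5.0), "label": 3, "score": 0.01},
]

kept = keep_confident(candidates, score_threshold=0.5)
print(len(candidates), "->", len(kept))  # fixed pool in, variable-length set out
```

The number of surviving boxes depends on the image content and the threshold, not on the size of the candidate pool.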

Step 2 — Bounding box formats (worked example)

Image size: width=400, height=300 (pixels).

Object: top-left corner $(100, 50)$, bottom-right $(220, 180)$.

Format	Values	Notes
xyxy	`[100, 50, 220, 180]`	PyTorch / torchvision default
xywh (COCO file)	`[100, 50, 120, 130]`	$w=220-100$, $h=180-50$
cxcywh	`[160, 115, 120, 130]`	center $c_x=(100+220)/2$, $c_y=(50+180)/2$
normalized cxcywh	`[0.4, 0.383, 0.3, 0.433]`	divide $c_x$, $w$ by image width and $c_y$, $h$ by image height

python

import torch
from torchvision.ops import box_convert
 
boxes_xyxy = torch.tensor([[100., 50., 220., 180.]])
boxes_cxcywh = box_convert(boxes_xyxy, in_fmt="xyxy", out_fmt="cxcywh")
print(boxes_cxcywh)  # tensor([[160., 115., 120., 130.]])
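For readers who want to check the table by hand, here is a dependency-free sketch of the same conversions, including the normalization step. The function names are chosen for this example and are not part of torchvision.

```python
def xyxy_to_xywh(box):
    """Corner format to top-left plus width/height (the COCO file layout)."""
    x1, y1, x2, y2 = box
    return [x1, y1, x2 - x1, y2 - y1]

def xyxy_to_cxcywh(box):
    """Corner format to center plus width/height."""
    x1, y1, x2, y2 = box
    return [(x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1]

def normalize_cxcywh(box, img_w, img_h):
    """Scale x-like values by image width and y-like values by image height."""
    cx, cy, w, h = box
    return [cx / img_w, cy / img_h, w / img_w, h / img_h]

box = [100.0, 50.0, 220.0, 180.0]
xywh = xyxy_to_xywh(box)
cxcywh = xyxy_to_cxcywh(box)
norm = normalize_cxcywh(cxcywh, img_w=400, img_h=300)

print(xywh)                          # [100.0, 50.0, 120.0, 130.0]
print(cxcywh)                        # [160.0, 115.0, 120.0, 130.0]
print([round(v, 3) for v in norm])   # [0.4, 0.383, 0.3, 0.433]
```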

Validation rules (catch bugs early)

python

def validate_xyxy(boxes, img_w, img_h):
    """Raise AssertionError if any xyxy box is degenerate or outside the image."""
    assert (boxes[:, 2] > boxes[:, 0]).all(), "x2 must exceed x1"
    assert (boxes[:, 3] > boxes[:, 1]).all(), "y2 must exceed y1"
    assert (boxes[:, 0] >= 0).all() and (boxes[:, 2] <= img_w).all(), "x out of image bounds"
    assert (boxes[:, 1] >= 0).all() and (boxes[:, 3] <= img_h).all(), "y out of image bounds"

Invalid boxes (zero area, inverted coordinates) make IoU meaningless and can produce NaN or infinite losses during training.
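Rejecting bad boxes is one option; another is to repair what can be repaired and drop the rest. The sketch below clips boxes to the image and discards degenerate ones. The helper name `clip_and_filter` and the `min_size` of one pixel are assumptions for illustration, and whether silent repair is acceptable depends on how much you trust the labels.

```python
def clip_and_filter(boxes, img_w, img_h, min_size=1.0):
    """Clip xyxy boxes to the image, then drop boxes narrower or shorter than min_size."""
    kept = []
    for x1, y1, x2, y2 in boxes:
        # Clamp every coordinate into the image rectangle.
        x1, x2 = max(0.0, min(x1, img_w)), max(0.0, min(x2, img_w))
        y1, y2 = max(0.0, min(y1, img_h)), max(0.0, min(y2, img_h))
        # Zero-area and inverted boxes fail this size check and are dropped.
        if x2 - x1 >= min_size and y2 - y1 >= min_size:
            kept.append((x1, y1, x2, y2))
    return kept

boxes = [
    (100.0, 50.0, 220.0, 180.0),    # valid
    (380.0, 250.0, 420.0, 310.0),   # spills past the 400x300 image
    (50.0, 60.0, 50.0, 90.0),       # zero width
    (200.0, 100.0, 150.0, 140.0),   # inverted x coordinates
]
cleaned = clip_and_filter(boxes, img_w=400, img_h=300)
print(cleaned)
```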

Step 3 — Anchors and assignment

Early and many modern detectors use anchor boxes — predefined shapes at each feature map location.

At cell $(i,j)$ with stride 16, you might have anchors:

sizes $32\times32$, $64\times64$, $128\times128$ pixels
aspect ratios $1:1$, $1:2$, $2:1$
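A minimal sketch of how such anchors could be generated for one cell follows. It assumes the anchor center sits at the cell center, that a ratio means height divided by width, and that each anchor keeps the area of its square size; real detectors differ on all three conventions, and `anchors_at_cell` is a name invented for this example.

```python
import math

def anchors_at_cell(i, j, stride=16, sizes=(32, 64, 128), ratios=(1.0, 0.5, 2.0)):
    """Return xyxy anchors centered on cell (row i, column j).

    Assumptions of this sketch: center = cell center, ratio = height / width,
    and every anchor keeps the area size * size.
    """
    cx = (j + 0.5) * stride
    cy = (i + 0.5) * stride
    anchors = []
    for size in sizes:
        for ratio in ratios:
            w = size / math.sqrt(ratio)
            h = size * math.sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = anchors_at_cell(i=5, j=10)
print(len(anchors))   # 3 sizes x 3 ratios = 9
print(anchors[0])     # the 32x32 square anchor: (152.0, 72.0, 184.0, 104.0)
```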

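Each anchor $A$ is labeled by its highest IoU with any ground-truth box $G$. The thresholds below follow RetinaNet; other detectors use different values (the Faster R-CNN region proposal network uses 0.7 and 0.3):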

IoU with $G$	Typical label
≥ 0.5	Positive — predict object
< 0.4	Negative — background
0.4 – 0.5	Ignore — no loss
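The labeling rule in the table can be sketched as follows. The return convention (1 positive, 0 negative, -1 ignore) and the function names are choices made for this example. Real implementations often add a step that forces the best anchor for each ground-truth box to be positive, which is left out here.

```python
def iou_xyxy(a, b):
    """Intersection over union of two xyxy boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_anchor(anchor, gt_boxes, pos_thr=0.5, neg_thr=0.4):
    """Return (label, gt_index): label 1 = positive, 0 = negative, -1 = ignore."""
    best_iou, best_idx = 0.0, -1
    for idx, gt in enumerate(gt_boxes):
        value = iou_xyxy(anchor, gt)
        if value > best_iou:
            best_iou, best_idx = value, idx
    if best_iou >= pos_thr:
        return 1, best_idx
    if best_iou < neg_thr:
        return 0, -1
    return -1, -1   # between the thresholds: contributes no loss

gts = [(100.0, 50.0, 220.0, 180.0)]
positive = label_anchor((110.0, 60.0, 230.0, 190.0), gts)    # IoU about 0.73
ignored = label_anchor((150.0, 50.0, 270.0, 180.0), gts)     # IoU about 0.41
negative = label_anchor((300.0, 200.0, 380.0, 290.0), gts)   # IoU 0
print(positive, ignored, negative)
```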

Network targets for a positive anchor:

Classification: class id of $G$
Regression: deltas $(\Delta x, \Delta y, \Delta w, \Delta h)$ mapping anchor → $G$
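One common way to define the deltas is the R-CNN-style parameterization: center offsets scaled by the anchor size, and log ratios for width and height. The sketch below assumes that parameterization (detectors vary, and many also rescale the deltas by fixed weights); the anchor values are illustrative.

```python
import math

def to_cxcywh(box):
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1

def encode_deltas(anchor, gt):
    """Deltas that map the anchor onto the ground-truth box."""
    ax, ay, aw, ah = to_cxcywh(anchor)
    gx, gy, gw, gh = to_cxcywh(gt)
    return ((gx - ax) / aw, (gy - ay) / ah, math.log(gw / aw), math.log(gh / ah))

def decode_deltas(anchor, deltas):
    """Inverse of encode_deltas: apply predicted deltas to the anchor."""
    ax, ay, aw, ah = to_cxcywh(anchor)
    dx, dy, dw, dh = deltas
    cx, cy = ax + dx * aw, ay + dy * ah
    w, h = aw * math.exp(dw), ah * math.exp(dh)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

anchor = (104.0, 56.0, 232.0, 184.0)   # a 128x128 anchor, illustrative
gt = (100.0, 50.0, 220.0, 180.0)       # the box from the worked example
deltas = encode_deltas(anchor, gt)
recovered = decode_deltas(anchor, deltas)
print([round(d, 4) for d in deltas])
print([round(v, 3) for v in recovered])
```

The log terms explain why zero-width or zero-height boxes must be filtered out before training: `math.log(0)` is undefined.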

Figure: IoU for assignment. The same IoU definition is used in training and evaluation.

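Anchor-free detectors remove the hand-tuned anchor grid. FCOS predicts, at each location inside a box, the distances to the four sides of that box, and adds a center-ness score to down-weight predictions made far from the object center. CenterNet predicts a heatmap of object centers and regresses width and height at each peak.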
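A small sketch of FCOS-style regression targets for a single location: the four side distances, and the center-ness value computed as the square root of the product of the min/max ratios of opposite distances. The function names and sample points are assumptions for this example.

```python
import math

def ltrb_targets(px, py, box):
    """Distances from point (px, py) to the left, top, right, bottom sides of an xyxy box."""
    x1, y1, x2, y2 = box
    return (px - x1, py - y1, x2 - px, y2 - py)

def centerness(l, t, r, b):
    """FCOS-style center-ness: 1.0 at the box center, approaching 0 near the edges.

    Only defined for points strictly inside the box (all four distances positive).
    """
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

box = (100.0, 50.0, 220.0, 180.0)

center_targets = ltrb_targets(160.0, 115.0, box)   # the exact box center
corner_targets = ltrb_targets(120.0, 60.0, box)    # a point near the top-left corner

print(center_targets, centerness(*center_targets))
print(corner_targets, round(centerness(*corner_targets), 4))
```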

Step 4 — Set matching (why order does not matter)

Suppose the model outputs 100 boxes but the image has 3 objects. Which output is "box 1"?

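Predictions form an unordered set, so training first needs an assignment between predictions and ground-truth boxes:

Faster R-CNN / YOLO: assign by IoU to anchors or by grid cell; several anchors may be positive for the same object
DETR: one-to-one bipartite matching with the Hungarian algorithm, on a cost that combines class probability, L1 box distance, and generalized IoU

Without matching there is no stable loss: permuting the prediction indices must not change the loss value or its optimum.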
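To make the permutation-invariance point concrete, the sketch below finds the minimum-cost one-to-one assignment by brute force over permutations. This is a stand-in for the Hungarian algorithm that is only practical for tiny examples, and the cost is a simplified toy version (class term plus L1 on normalized boxes). All predictions and ground-truth values are made up.

```python
from itertools import permutations

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def match_cost(pred, gt):
    """Toy cost: low probability on the true class plus L1 box distance."""
    return (1.0 - pred["probs"][gt["label"]]) + l1(pred["box"], gt["box"])

def best_matching(preds, gts):
    """Brute-force minimum-cost assignment of distinct predictions to ground truths.

    Returns (total_cost, assign) where assign[g] is the prediction index for gts[g].
    Requires len(preds) >= len(gts).
    """
    best_total, best_assign = float("inf"), None
    for perm in permutations(range(len(preds)), len(gts)):
        total = sum(match_cost(preds[p], gts[g]) for g, p in enumerate(perm))
        if total < best_total:
            best_total, best_assign = total, perm
    return best_total, best_assign

preds = [
    {"box": (0.40, 0.38, 0.30, 0.43), "probs": [0.1, 0.9]},
    {"box": (0.70, 0.60, 0.20, 0.20), "probs": [0.8, 0.2]},
    {"box": (0.10, 0.10, 0.05, 0.05), "probs": [0.5, 0.5]},
]
gts = [
    {"box": (0.72, 0.58, 0.22, 0.18), "label": 0},
    {"box": (0.40, 0.383, 0.30, 0.433), "label": 1},
]

total, assign = best_matching(preds, gts)
# Reversing the prediction order changes the indices but not the matched pairs or the cost.
total_rev, assign_rev = best_matching(preds[::-1], gts)
print(assign, round(total, 3))
print(assign_rev, round(total_rev, 3))
```

The unmatched prediction would be trained toward "no object" in a DETR-style loss.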

Step 5 — Annotation formats in the wild

COCO JSON (research & torchvision)

json

{
  "images": [{"id": 1, "file_name": "img001.jpg", "width": 640, "height": 480}],
  "categories": [{"id": 1, "name": "person"}],
  "annotations": [{
    "id": 10,
    "image_id": 1,
    "category_id": 1,
    "bbox": [100, 50, 120, 130],
    "area": 15600,
    "iscrowd": 0
  }]
}

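bbox is `[x, y, width, height]` in pixels, with `(x, y)` the top-left corner. iscrowd=1 marks a region that covers a group of objects; COCO evaluation treats such regions specially, and detections matched to them are not counted as false positives.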
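A minimal sketch of reading this structure with the standard `json` module and converting each `bbox` to xyxy, grouped by image. It embeds the sample from above as a string so it runs on its own; `load_boxes_xyxy` and the output field names are invented for this example, and real projects often use a dedicated COCO loader instead.

```python
import json
from collections import defaultdict

coco_text = """
{
  "images": [{"id": 1, "file_name": "img001.jpg", "width": 640, "height": 480}],
  "categories": [{"id": 1, "name": "person"}],
  "annotations": [{
    "id": 10, "image_id": 1, "category_id": 1,
    "bbox": [100, 50, 120, 130], "area": 15600, "iscrowd": 0
  }]
}
"""

def load_boxes_xyxy(coco):
    """Group annotations by image_id and convert COCO xywh boxes to xyxy."""
    by_image = defaultdict(list)
    for ann in coco["annotations"]:
        x, y, w, h = ann["bbox"]
        by_image[ann["image_id"]].append({
            "box_xyxy": [x, y, x + w, y + h],
            "category_id": ann["category_id"],
            "iscrowd": ann["iscrowd"],
        })
    return dict(by_image)

coco = json.loads(coco_text)
boxes_by_image = load_boxes_xyxy(coco)
print(boxes_by_image[1][0]["box_xyxy"])   # [100, 50, 220, 180]
```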

YOLO txt (one file per image)

text

# class_id cx cy w h   (all normalized 0–1)
0 0.400 0.383 0.300 0.433
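Going back from a YOLO label to pixel xyxy needs the image size, since the file stores only normalized values. The sketch below assumes a 400x300 image, matching the Step 2 example. Skipping lines that start with `#` is a convenience of this sketch for the annotated sample above; real label files normally contain only data rows.

```python
def parse_yolo_line(line, img_w, img_h):
    """Parse 'class_id cx cy w h' (normalized) into (class_id, pixel xyxy box)."""
    parts = line.split()
    class_id = int(parts[0])
    cx, cy, w, h = (float(v) for v in parts[1:5])
    cx, w = cx * img_w, w * img_w
    cy, h = cy * img_h, h * img_h
    return class_id, (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def parse_yolo_text(text, img_w, img_h):
    """Parse a whole label file, skipping blank lines and '#' comment lines."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rows.append(parse_yolo_line(line, img_w, img_h))
    return rows

label_text = "# class_id cx cy w h\n0 0.400 0.383 0.300 0.433\n"
rows = parse_yolo_text(label_text, img_w=400, img_h=300)
class_id, box = rows[0]
print(class_id, [round(v, 2) for v in box])
```

The recovered box is within a fraction of a pixel of `[100, 50, 220, 180]`; the small difference comes from rounding the normalized values to three decimals.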

Folder layout:

text

dataset/
  images/train/*.jpg
  labels/train/*.txt   # same stem as image
  data.yaml            # class names, paths (Ultralytics)

torchvision detection target dict

python

target = {
    "boxes": torch.tensor([[100., 50., 220., 180.]]),  # xyxy float32
    "labels": torch.tensor([1]),                        # int64, 0 = background reserved
    "image_id": torch.tensor([42]),
    "area": torch.tensor([15600.]),
    "iscrowd": torch.tensor([0]),
}

Critical: torchvision's Faster R-CNN reserves label 0 for background, so your first real class must be label 1. YOLO label files, by contrast, start class ids at 0.
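One way to avoid the background-label bug is to build an explicit mapping from dataset ids to contiguous labels that start at 1. The sketch below uses a hypothetical category list with gaps in its ids; `build_label_map` and `yolo_to_background_first` are names invented here.

```python
def build_label_map(categories):
    """Map dataset category ids to contiguous labels 1..K, leaving 0 for background."""
    ids = sorted(c["id"] for c in categories)
    return {cat_id: idx + 1 for idx, cat_id in enumerate(ids)}

def yolo_to_background_first(class_id):
    """YOLO ids start at 0; shift by one when 0 must mean background."""
    return class_id + 1

# Hypothetical categories whose ids are not contiguous.
categories = [
    {"id": 1, "name": "person"},
    {"id": 3, "name": "car"},
    {"id": 18, "name": "dog"},
]

label_map = build_label_map(categories)
num_classes = len(label_map) + 1   # +1 for the background class
print(label_map, num_classes)
print(yolo_to_background_first(0))
```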

Step 6 — When detection vs classification

Choose classification	Choose detection
Single dominant object centered	Multiple objects
Only presence matters	Position matters for action (pick, avoid)
Tiny data, simple deploy	Need counts or tracking (Module 6)

Hybrid pipelines: detect then classify crops (two-stage product design).

Mini exercise (paper and pencil)

Box A (pred): xyxy [10, 10, 50, 50]
Box B (GT): xyxy [30, 30, 70, 70]

Intersection: [30,30] to [50,50] → area $20\times20=400$
Area A $= 40\times40=1600$, Area B $= 40\times40=1600$
Union $= 1600+1600-400=2800$
IoU $= 400/2800 \approx 0.143$

Would this match as a TP at IoU 0.5? No. The prediction counts as an FP, and B counts as an FN unless another prediction matches it.
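The exercise can be checked with a simplified greedy matcher: predictions are visited in descending score order and each ground-truth box can be matched once. This is a sketch of the general idea rather than the exact COCO evaluation procedure, and the scores and the second prediction are made up for illustration.

```python
def iou_xyxy(a, b):
    """Intersection over union of two xyxy boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def count_tp_fp_fn(preds, gts, iou_thr=0.5):
    """Greedy matching of (box, score) predictions to ground-truth boxes."""
    matched = set()
    tp = fp = 0
    for box, _score in sorted(preds, key=lambda p: p[1], reverse=True):
        best_iou, best_idx = 0.0, -1
        for idx, gt in enumerate(gts):
            if idx in matched:
                continue   # each ground-truth box is matched at most once
            value = iou_xyxy(box, gt)
            if value > best_iou:
                best_iou, best_idx = value, idx
        if best_idx >= 0 and best_iou >= iou_thr:
            matched.add(best_idx)
            tp += 1
        else:
            fp += 1
    return tp, fp, len(gts) - len(matched)

box_a = (10.0, 10.0, 50.0, 50.0)
box_b = (30.0, 30.0, 70.0, 70.0)
print(round(iou_xyxy(box_a, box_b), 3))                  # 0.143

alone = count_tp_fp_fn([(box_a, 0.9)], [box_b])
with_second = count_tp_fp_fn([(box_a, 0.9), ((32.0, 28.0, 70.0, 72.0), 0.6)], [box_b])
print(alone)         # (0, 1, 1): A is an FP, B is an FN
print(with_second)   # (1, 1, 0): the second prediction matches B
```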

Common beginner mistakes

Mistake	Symptom
xywh vs xyxy confusion	Boxes shifted, huge mAP drop
Normalized vs pixel coords	Boxes clustered in corner
Label 0 for a real class (torchvision)	Silent training bug: the class is treated as background
No train/val split for threshold tuning	Overfit score threshold

What's next

Lesson 2 — Detector architectures