Module 4 — Object detection

From classification to detection

Variable-size outputs, box formats, anchors, assignment, set matching, and annotation formats with worked numeric examples.

~95 min read + exercises

Before we begin

A classifier collapses the whole image into one decision. A detector must find all objects and localize each one. That sounds like a small step — it is not. The output is a set (variable size), training needs matching between predictions and ground truth, and evaluation uses IoU instead of simple accuracy.

This lesson builds the vocabulary and math you need before architectures and training.


What you will learn

  • Contrast fixed-length classifier outputs vs variable detection sets.
  • Read, convert, and validate bounding box formats.
  • Explain anchors, positive/negative assignment, and box deltas.
  • Parse COCO and YOLO label files.
  • State why permutation-invariant matching is required in training.

Before this lesson


Step 1 — Two questions, two output types

Classification:

$$f_\theta(\mathbf{x}) = \mathbf{p} \in \mathbb{R}^K, \quad \sum_k p_k = 1$$

One vector of length $K$, the same shape for every image.

Detection:

$$f_\theta(\mathbf{x}) = \big\{(b_i, c_i, s_i)\big\}_{i=1}^{N(\mathbf{x})}$$

  • $b_i \in \mathbb{R}^4$: box parameters
  • $c_i \in \{1,\ldots,K\}$: class (background handled separately in many APIs)
  • $s_i \in [0,1]$: confidence
  • $N(\mathbf{x})$: number of detections, which depends on the image

Figure: Fixed vs variable outputs. Same image, different output shapes.

  • Classifier: p(cat)=0.02, p(person)=0.91, p(car)=0.07 (always K numbers)
  • Detector: person @ (120,40,180,200) score=0.94; person @ (310,55,360,190) score=0.88; car @ (400,130,510,200) score=0.81 (N varies per image)

Detection must handle zero objects, one object, or dozens.

Product example: A shelf camera needs one box per product facing, and $N$ changes every frame.

Checkpoint: Why is padding "always 100 boxes" a bad training target?

Answer sketch: Most slots would be empty sentinels; the model wastes capacity learning padding semantics. Better: set prediction, dynamic NMS output, or objectness scores with thresholding.
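
The "objectness scores with thresholding" option can be sketched as follows. This is a minimal illustration, not a specific library's post-processing: the helper name `keep_confident`, the five-slot list, the class ids (1 = person, 2 = car) and the 0.5 threshold are all assumptions made for the example.

```python
from typing import List, Tuple

# (box_xyxy, class_id, score); class ids here are hypothetical: 1 = person, 2 = car
Detection = Tuple[Tuple[float, float, float, float], int, float]


def keep_confident(slots: List[Detection], score_thresh: float = 0.5) -> List[Detection]:
    """Turn a fixed number of output slots into a variable-size detection set."""
    return [d for d in slots if d[2] >= score_thresh]


# Five fixed slots for brevity; most of them carry low scores.
slots = [
    ((120, 40, 180, 200), 1, 0.94),
    ((310, 55, 360, 190), 1, 0.88),
    ((400, 130, 510, 200), 2, 0.81),
    ((5, 5, 20, 20), 1, 0.07),
    ((0, 0, 640, 480), 2, 0.02),
]

detections = keep_confident(slots, score_thresh=0.5)
print(len(detections))  # 3
```

The network still emits a fixed-size tensor, but the low-score slots are never treated as box targets; the score decides how many detections survive for a given image.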


Step 2 — Bounding box formats (worked example)

Image size: width=400, height=300 (pixels).

Object: top-left corner $(100, 50)$, bottom-right $(220, 180)$.

| Format | Values | Notes |
| --- | --- | --- |
| xyxy | [100, 50, 220, 180] | PyTorch / torchvision default |
| xywh (COCO file) | [100, 50, 120, 130] | $w = 220 - 100$, $h = 180 - 50$ |
| cxcywh | [160, 115, 120, 130] | center $x = (100 + 220)/2$ |
| normalized cxcywh | [0.4, 0.383, 0.3, 0.433] | divide by image W, H |
python
import torch
from torchvision.ops import box_convert
 
boxes_xyxy = torch.tensor([[100., 50., 220., 180.]])
boxes_cxcywh = box_convert(boxes_xyxy, in_fmt="xyxy", out_fmt="cxcywh")
print(boxes_cxcywh)  # tensor([[160., 115., 120., 130.]])
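
The normalized row of the table needs the image size, which a pure format conversion does not know about. A plain-Python sketch of that step and its inverse is shown here; the function names are hypothetical helpers, not library calls.

```python
def xyxy_to_normalized_cxcywh(box, img_w, img_h):
    """Pixel xyxy -> normalized (cx, cy, w, h), each value in [0, 1]."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + w / 2, y1 + h / 2
    # x-like quantities are divided by width, y-like quantities by height
    return [cx / img_w, cy / img_h, w / img_w, h / img_h]


def normalized_cxcywh_to_xyxy(box, img_w, img_h):
    """Inverse of the function above: normalized cxcywh -> pixel xyxy."""
    cx, cy, w, h = box
    cx, w = cx * img_w, w * img_w
    cy, h = cy * img_h, h * img_h
    return [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2]


norm = xyxy_to_normalized_cxcywh([100, 50, 220, 180], img_w=400, img_h=300)
print([round(v, 3) for v in norm])  # [0.4, 0.383, 0.3, 0.433]
```

Converting back with the same image size recovers the original pixel box, which makes a cheap round-trip test for any data loader.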

Validation rules (catch bugs early)

python
def validate_xyxy(boxes, img_w, img_h):
    """Fail fast on malformed boxes; boxes is an (N, 4) tensor in pixel xyxy."""
    assert (boxes[:, 2] > boxes[:, 0]).all(), "x2 must exceed x1"
    assert (boxes[:, 3] > boxes[:, 1]).all(), "y2 must exceed y1"
    assert (boxes[:, 0] >= 0).all() and (boxes[:, 2] <= img_w).all(), "x out of image bounds"
    assert (boxes[:, 1] >= 0).all() and (boxes[:, 3] <= img_h).all(), "y out of image bounds"

Invalid boxes (zero area, inverted coords) break IoU and training.
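
When a dataset is known to contain a few bad boxes, repairing them can be more practical than asserting. The sketch below clips boxes to the image and drops anything that ends up degenerate. The helper `clip_and_filter` and the 1-pixel minimum side are assumptions for illustration.

```python
def clip_and_filter(boxes, img_w, img_h, min_size=1.0):
    """Clip xyxy boxes to the image, then drop boxes that end up degenerate."""
    kept = []
    for x1, y1, x2, y2 in boxes:
        # Clamp every coordinate into [0, img_w] or [0, img_h]
        x1, x2 = max(0.0, min(x1, img_w)), max(0.0, min(x2, img_w))
        y1, y2 = max(0.0, min(y1, img_h)), max(0.0, min(y2, img_h))
        # Inverted boxes give a negative side length and are rejected here too
        if x2 - x1 >= min_size and y2 - y1 >= min_size:
            kept.append([x1, y1, x2, y2])
    return kept


raw = [
    [100, 50, 220, 180],   # valid
    [380, 280, 450, 320],  # spills past the 400x300 image, gets clipped
    [50, 50, 50, 90],      # zero width, dropped
    [200, 120, 150, 160],  # inverted x, dropped
]
clean = clip_and_filter(raw, img_w=400, img_h=300)
print(len(clean))  # 2
```

For tensors, `torchvision.ops.clip_boxes_to_image` and `torchvision.ops.remove_small_boxes` cover the same two steps.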


Step 3 — Anchors and assignment

Early and many modern detectors use anchor boxes — predefined shapes at each feature map location.

At cell $(i, j)$ with stride 16, you might have anchors:

  • $32\times32$, $64\times64$, $128\times128$ pixels
  • aspect ratios $1:1$, $1:2$, $2:1$
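
To make the anchor grid concrete, here is a sketch that lists the anchors for one cell. Two conventions are assumed because the text does not fix them: the anchor center is the center of the cell in image pixels, and a ratio means width / height with the area held at size × size. Real detectors differ on both points.

```python
import math


def anchors_at_cell(i, j, stride=16, sizes=(32, 64, 128), ratios=(1.0, 0.5, 2.0)):
    """Return xyxy anchors centered on feature-map cell (row i, column j).

    Assumed conventions: cell-center placement, ratio = width / height,
    area preserved at size * size.
    """
    cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
    anchors = []
    for size in sizes:
        for ratio in ratios:
            w = size * math.sqrt(ratio)
            h = size / math.sqrt(ratio)
            anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return anchors


anchors = anchors_at_cell(i=3, j=5)
print(len(anchors))  # 9 (3 sizes x 3 ratios)
print(anchors[0])    # [72.0, 40.0, 104.0, 72.0]
```

Three sizes times three ratios gives nine anchors per cell, so the total anchor count grows quickly with feature-map resolution.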

Ground-truth box $G$ is assigned to anchor $A$ by IoU:

| IoU with $G$ | Typical label |
| --- | --- |
| ≥ 0.5 | Positive: predict object |
| < 0.4 | Negative: background |
| 0.4 – 0.5 | Ignore: no loss |
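
The threshold table translates almost directly into code. The sketch below labels a single anchor against a list of ground-truth boxes. `assign_anchor` is a hypothetical helper, and the rule "use the ground-truth box with the highest IoU" is an assumption about how ties between several objects are resolved.

```python
def iou_xyxy(a, b):
    """Intersection over union of two xyxy boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def assign_anchor(anchor, gt_boxes, pos_thresh=0.5, neg_thresh=0.4):
    """Return ("positive", gt_index), ("negative", None) or ("ignore", None)."""
    if not gt_boxes:
        return ("negative", None)
    ious = [iou_xyxy(anchor, g) for g in gt_boxes]
    best = max(range(len(ious)), key=ious.__getitem__)
    if ious[best] >= pos_thresh:
        return ("positive", best)
    if ious[best] < neg_thresh:
        return ("negative", None)
    return ("ignore", None)  # between the thresholds: contributes no loss


gt = [[100, 50, 220, 180]]
print(assign_anchor([110, 60, 220, 180], gt))   # ('positive', 0)
print(assign_anchor([100, 50, 220, 110], gt))   # ('ignore', None)
print(assign_anchor([300, 200, 360, 260], gt))  # ('negative', None)
```

The middle anchor covers about 46% of the union, so it falls in the ignore band and is excluded from both the classification and the regression loss.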

Network targets for a positive anchor:

  • Classification: class id of $G$
  • Regression: deltas $(\Delta x, \Delta y, \Delta w, \Delta h)$ mapping anchor → $G$
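
The lesson does not fix a formula for the deltas. The sketch below assumes the widely used R-CNN-style parameterization: center offsets scaled by the anchor size, and log ratios for width and height. The function names are illustrative.

```python
import math


def to_cxcywh(box):
    """xyxy -> (cx, cy, w, h)."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1


def encode_deltas(anchor, gt):
    """Regression target that maps the anchor onto the ground-truth box."""
    ax, ay, aw, ah = to_cxcywh(anchor)
    gx, gy, gw, gh = to_cxcywh(gt)
    return [(gx - ax) / aw, (gy - ay) / ah, math.log(gw / aw), math.log(gh / ah)]


def decode_deltas(anchor, deltas):
    """Apply predicted deltas to the anchor to recover an xyxy box."""
    ax, ay, aw, ah = to_cxcywh(anchor)
    dx, dy, dw, dh = deltas
    cx, cy = ax + dx * aw, ay + dy * ah
    w, h = aw * math.exp(dw), ah * math.exp(dh)
    return [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2]


anchor = [96, 51, 224, 179]  # 128x128 anchor centered at (160, 115)
gt = [100, 50, 220, 180]     # the worked example box from Step 2
deltas = encode_deltas(anchor, gt)
recovered = decode_deltas(anchor, deltas)
print([round(v, 6) for v in recovered])
```

Because the anchor and the box share a center, the first two deltas are zero and only the log-scale terms are non-zero. Decoding the encoded deltas returns the ground-truth box up to floating-point error.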

Figure: IoU for assignment. IoU = area(A ∩ B) ÷ area(A ∪ B), where A is the prediction and B is the ground truth. The same IoU definition is used to match predicted boxes to ground truth during training and evaluation.

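Anchor-free detectors (FCOS, CenterNet) remove the hand-tuned anchor grid. FCOS predicts the distances from each location to the four sides of the box, while CenterNet predicts a center heatmap plus the box size. Both still need a center-ness or other quality estimate to rank predictions.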


Step 4 — Set matching (why order does not matter)

Suppose the model outputs 100 boxes but the image has 3 objects. Which output is "box 1"?

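Predictions are an unordered set, so training must first decide which prediction is responsible for which ground-truth box:

  • Faster R-CNN / YOLO: assign by IoU to anchors or grid cells (several anchors may share one object)
  • DETR: one-to-one bipartite matching with the Hungarian algorithm on cost = class loss + L1 box distance (the original DETR also adds a generalized IoU term)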

Without matching, you cannot define a stable loss — swapping prediction indices should not change the optimum.
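
A tiny example shows why matching makes the loss independent of output order. The sketch below is a brute-force stand-in for the Hungarian algorithm: it tries every one-to-one assignment, which is only feasible for a handful of boxes. The cost is L1 box distance alone; the class term is left out for brevity. In practice `scipy.optimize.linear_sum_assignment` solves the same problem efficiently.

```python
from itertools import permutations


def l1(a, b):
    """L1 distance between two boxes given as 4-element sequences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def match(preds, gts):
    """Minimum-cost one-to-one matching by exhaustive search.

    Returns (total_cost, pairs) with pairs as (pred_index, gt_index).
    Assumes len(preds) >= len(gts); unmatched predictions are background.
    """
    best_cost, best_pairs = float("inf"), []
    for pred_idx in permutations(range(len(preds)), len(gts)):
        cost = sum(l1(preds[p], gts[g]) for g, p in enumerate(pred_idx))
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(p, g) for g, p in enumerate(pred_idx)]
    return best_cost, best_pairs


gts = [[100, 50, 220, 180], [300, 60, 360, 190]]
preds = [[298, 62, 361, 188], [0, 0, 10, 10], [102, 49, 219, 183]]

cost, pairs = match(preds, gts)
print(cost, pairs)  # 14 [(2, 0), (0, 1)]

# Reordering the predictions changes the indices but not the matched cost.
cost_shuffled, _ = match(list(reversed(preds)), gts)
print(cost_shuffled)  # 14
```

Prediction 1 stays unmatched in both orderings, so it would be trained toward "no object".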


Step 5 — Annotation formats in the wild

COCO JSON (research & torchvision)

json
{
  "images": [{"id": 1, "file_name": "img001.jpg", "width": 640, "height": 480}],
  "categories": [{"id": 1, "name": "person"}],
  "annotations": [{
    "id": 10,
    "image_id": 1,
    "category_id": 1,
    "bbox": [100, 50, 120, 130],
    "area": 15600,
    "iscrowd": 0
  }]
}

bbox is [x, y, width, height] in pixels, with (x, y) the top-left corner. Annotations with iscrowd=1 mark crowd regions, which are matched with different IoU rules during evaluation.
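
Reading such a file takes only the standard library. The sketch below parses the JSON from the example, groups annotations by image, and converts each bbox from xywh to xyxy. `load_coco_boxes` and its output layout are hypothetical choices for illustration, not the pycocotools API.

```python
import json
from collections import defaultdict

coco_text = """
{
  "images": [{"id": 1, "file_name": "img001.jpg", "width": 640, "height": 480}],
  "categories": [{"id": 1, "name": "person"}],
  "annotations": [{
    "id": 10, "image_id": 1, "category_id": 1,
    "bbox": [100, 50, 120, 130], "area": 15600, "iscrowd": 0
  }]
}
"""


def load_coco_boxes(coco):
    """Group annotations by image_id and convert bbox from xywh to xyxy."""
    by_image = defaultdict(list)
    for ann in coco["annotations"]:
        x, y, w, h = ann["bbox"]
        by_image[ann["image_id"]].append({
            "box_xyxy": [x, y, x + w, y + h],
            "category_id": ann["category_id"],
            "iscrowd": ann.get("iscrowd", 0),
        })
    return dict(by_image)


coco = json.loads(coco_text)
boxes = load_coco_boxes(coco)
print(boxes[1][0]["box_xyxy"])  # [100, 50, 220, 180]
```

Images without annotations do not appear in the result, so a loader built on this should look them up with a default of an empty list.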

YOLO txt (one file per image)

text
# class_id cx cy w h   (all normalized 0–1)
0 0.400 0.383 0.300 0.433
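
Going from a YOLO label line back to pixel coordinates requires the image size, which the label file does not store. A minimal parser, assuming the five-field layout shown above (`parse_yolo_line` is a hypothetical helper):

```python
def parse_yolo_line(line, img_w, img_h):
    """Parse 'class_id cx cy w h' (normalized) into (class_id, pixel xyxy box)."""
    parts = line.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    class_id = int(parts[0])
    cx, cy, w, h = (float(p) for p in parts[1:])
    # Scale x-like values by width and y-like values by height
    cx, w = cx * img_w, w * img_w
    cy, h = cy * img_h, h * img_h
    return class_id, [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2]


class_id, box = parse_yolo_line("0 0.400 0.383 0.300 0.433", img_w=400, img_h=300)
print(class_id, box)
```

The recovered box is close to [100, 50, 220, 180] but not exact, because the label file rounds to three decimals. Sub-pixel differences like this are expected after a format round trip.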

Folder layout:

text
dataset/
  images/train/*.jpg
  labels/train/*.txt   # same stem as image
  data.yaml            # class names, paths (Ultralytics)

torchvision detection target dict

python
target = {
    "boxes": torch.tensor([[100., 50., 220., 180.]]),  # xyxy float32
    "labels": torch.tensor([1]),                        # int64, 0 = background reserved
    "image_id": torch.tensor([42]),
    "area": torch.tensor([15600.]),
    "iscrowd": torch.tensor([0]),
}

Critical: class 0 is background in Faster R-CNN — your first real class is often label 1.
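
One way to avoid the label-0 trap is to build an explicit mapping from dataset category ids to contiguous training labels that start at 1. The sketch below assumes COCO-style category dicts. The ids 1, 3 and 17 are example values chosen to show that source ids need not be contiguous, and `build_label_map` is a hypothetical helper.

```python
def build_label_map(categories):
    """Map dataset category ids to contiguous labels 1..K, keeping 0 for background."""
    ordered = sorted(c["id"] for c in categories)
    return {cat_id: idx + 1 for idx, cat_id in enumerate(ordered)}


categories = [
    {"id": 1, "name": "person"},
    {"id": 3, "name": "car"},
    {"id": 17, "name": "cat"},
]
label_map = build_label_map(categories)
print(label_map)  # {1: 1, 3: 2, 17: 3}

# YOLO class ids start at 0, so a shift by one keeps 0 free for background.
yolo_class_id = 0
training_label = yolo_class_id + 1
print(training_label)  # 1
```

Storing the mapping next to the model checkpoint makes it possible to translate predictions back to the original category ids at inference time.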


Step 6 — When detection vs classification

| Choose classification | Choose detection |
| --- | --- |
| Single dominant object centered | Multiple objects |
| Only presence matters | Position matters for action (pick, avoid) |
| Tiny data, simple deploy | Need counts or tracking (Module 6) |

Hybrid pipelines: detect then classify crops (two-stage product design).


Mini exercise (paper and pencil)

Box A (pred): xyxy [10, 10, 50, 50]
Box B (GT): xyxy [30, 30, 70, 70]

Intersection: [30,30] to [50,50] → area $20\times20=400$
Area A $= 40\times40=1600$, Area B $= 40\times40=1600$
Union $= 1600+1600-400=2800$
IoU $= 400/2800 \approx 0.143$

Would this match as TP at IoU 0.5? No. Prediction A counts as a false positive, and B counts as a false negative unless another prediction matches it.
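
The hand calculation can be checked with a few lines of plain Python. This is a sketch for single boxes; for batches of tensors, `torchvision.ops.box_iou` computes the same quantity pairwise.

```python
def iou_xyxy(a, b):
    """Intersection over union of two xyxy boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes give an empty intersection
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


iou = iou_xyxy([10, 10, 50, 50], [30, 30, 70, 70])
print(round(iou, 3))  # 0.143
print(iou >= 0.5)     # False
```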


Common beginner mistakes

| Mistake | Symptom |
| --- | --- |
| xywh vs xyxy confusion | Boxes shifted, huge mAP drop |
| Normalized vs pixel coords | Boxes clustered in corner |
| Label 0 for person | Silent training bug (background) |
| No train/val split for threshold tuning | Overfit score threshold |

What's next

Lesson 2 — Detector architectures