On-device detection & deployment

Before we begin

A detector with 45 mAP on a workstation is only useful on a phone if it runs at acceptable latency, holds a stable frame rate, and stays within thermal limits. This lesson connects model choices to milliseconds, megabytes, and ship/no-ship product calls.

What you will learn

Budget latency, memory, and thermal headroom for detection pipelines.
Compare FP32, FP16, INT8 for detector backbones and heads.
Export torchvision / YOLO → ONNX → TensorRT / TFLite.
Place NMS on CPU vs GPU vs in-graph.
Define a minimal production checklist before launch.

Before this lesson

Lesson 4 — IoU, NMS & mAP

Where time goes (typical pipeline)

Stage	ms (illustrative mobile)	Notes
Preprocess (resize, norm)	2–8	Often CPU
Backbone + heads	15–80	GPU/NPU
Decode boxes	1–5
NMS	2–20	CPU if many candidates
Draw / tracking	1–5

Rule: profile end-to-end — optimizing backbone while NMS dominates is wasted effort.
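
The rule above can be checked with a simple per-stage timer. This is a minimal sketch, assuming each stage is a plain Python callable; `profile_pipeline`, `dominant_stage`, and the stand-in stages are hypothetical names for illustration, not part of any library.

```python
import time

def profile_pipeline(stages, frame):
    """Run each (name, fn) stage in order and record wall-clock ms per stage."""
    timings_ms = {}
    data = frame
    for name, fn in stages:
        start = time.perf_counter()
        data = fn(data)
        timings_ms[name] = (time.perf_counter() - start) * 1000.0
    return data, timings_ms

def dominant_stage(timings_ms):
    """Return (stage name, fraction of total time) for the slowest stage."""
    total = sum(timings_ms.values())
    name = max(timings_ms, key=timings_ms.get)
    return name, (timings_ms[name] / total if total > 0 else 0.0)

# Stand-in stages (hypothetical); swap in the real preprocess / model / NMS calls.
stages = [
    ("preprocess", lambda x: x),
    ("backbone", lambda x: x),
    ("nms", lambda x: x),
]
_, timings = profile_pipeline(stages, frame=[0.0] * 16)
print(dominant_stage(timings))
```

On GPU or NPU runtimes a call can return before the work finishes, so synchronize with the device before reading the clock; otherwise the time is attributed to whichever later stage happens to wait.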

Resolution vs accuracy vs cost

Input	Relative compute	Small object AP
320×320	1×	Lower
640×640	~4×	Baseline
1280×1280	~16×	Higher

Doubling the side length quadruples the pixel count, so compute and activation memory in the early conv layers grow by about 4×.
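
The scaling in the table is plain arithmetic on pixel counts. The sketch below assumes cost is proportional to the number of input pixels, which is an approximation; `relative_compute` and `feature_map_bytes` are hypothetical helpers, and the channel count and stride are illustrative values rather than those of any specific model.

```python
def relative_compute(side, base_side=320):
    """Cost relative to base_side, assuming cost scales with pixel count."""
    return (side / base_side) ** 2

def feature_map_bytes(side, channels, stride, bytes_per_value):
    """Size of one square feature map: (side / stride)^2 * channels * bytes."""
    out_side = side // stride
    return out_side * out_side * channels * bytes_per_value

for side in (320, 640, 1280):
    # Illustrative early layer: 32 channels at stride 2, stored as FP16 (2 bytes).
    mem = feature_map_bytes(side, channels=32, stride=2, bytes_per_value=2)
    print(f"{side}x{side}: ~{relative_compute(side):.0f}x compute, {mem} bytes")
```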

Product pattern: preview stream at 320, capture high-res on demand for detail.

Figure: Deployment constraints. Latency, memory, thermal, precision: pick two to optimize aggressively.

Model selection for edge

Model	Params (order)	Typical role
YOLOv8n	~3M	Phone real-time
MobileNet-SSD	~5M	Classic mobile
Faster R-CNN R50-FPN	~40M+	Server / batch
YOLOv8x	~68M	Offline accuracy

Knowledge distillation: train a small student to mimic a large teacher's class logits and box outputs.
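
One common way to express "mimic the teacher" as a loss is a temperature-softened KL term on the class logits plus an L1 term on the box coordinates. The sketch below assumes that formulation for a single, already-matched prediction; it is not necessarily the recipe a particular detector uses, and the function names, temperature, and weighting are all illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, student_box, teacher_box,
                      temperature=2.0, box_weight=1.0):
    """KL(teacher || student) on softened class probabilities plus mean L1 on boxes."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s) if pt > 0)
    l1 = sum(abs(s - t) for s, t in zip(student_box, teacher_box)) / len(teacher_box)
    # The T^2 factor (an assumption borrowed from classifier distillation)
    # keeps the KL gradient scale comparable across temperatures.
    return (temperature ** 2) * kl + box_weight * l1

loss = distillation_loss([0.0, 0.0, 0.0], [2.0, 0.5, -1.0],
                         [12, 20, 50, 80], [10, 20, 50, 80])
print(loss)
```

In a real detector the student and teacher produce different sets of candidate boxes, so a matching step is needed before a per-prediction loss like this applies.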

Quantization

Post-training quantization (PTQ)

Run 100–500 calibration images through model.
Record activation ranges per layer.
Map to INT8 scale/zero-point.
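
The three steps above can be sketched for a single tensor as follows. This assumes simple min/max calibration and an affine (asymmetric) INT8 mapping; real toolchains often use percentile or entropy calibration and per-channel scales, and every function name here is hypothetical.

```python
def calibrate_range(batches):
    """Track the min/max activation seen across calibration batches."""
    lo, hi = float("inf"), float("-inf")
    for batch in batches:
        lo = min(lo, min(batch))
        hi = max(hi, max(batch))
    return lo, hi

def int8_params(lo, hi, qmin=-128, qmax=127):
    """Affine mapping real = scale * (q - zero_point); the range is widened to include 0."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for constant tensors
    zero_point = int(round(qmin - lo / scale))
    return scale, max(qmin, min(qmax, zero_point))

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Round to the nearest INT8 step and clamp to the representable range."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))

def dequantize(q, scale, zero_point):
    return scale * (q - zero_point)

# Two tiny stand-in "calibration batches" of activations (illustrative values).
lo, hi = calibrate_range([[-1.0, 0.5, 2.0], [0.1, 3.0, -0.5]])
scale, zp = int8_params(lo, hi)
print(scale, zp, dequantize(quantize(1.9, scale, zp), scale, zp))
```

Values inside the calibrated range come back with an error of at most half a quantization step; values outside it are clamped, which is why an unrepresentative calibration set hurts accuracy.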

Risk: small objects and fine boxes degrade first — always re-run mAP@0.5 on val.

Quantization-aware training (QAT)

Insert fake-quant nodes during fine-tuning so the weights adapt to INT8 rounding; this typically recovers 1–3 mAP points compared with PTQ.

Precision ladder

Dtype	Speed	Accuracy
FP32	Baseline	Best
FP16	~1.5–2× GPU	Usually fine
INT8	2–4×	Validate per class
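
"Validate per class" can be as simple as comparing per-class AP before and after quantization. A minimal sketch, assuming per-class AP is already available as dictionaries; `flag_regressions`, the 1-point tolerance, and the AP numbers are made up for illustration and are not measured results.

```python
def flag_regressions(ap_fp32, ap_int8, max_drop=1.0):
    """Return {class: drop} for classes whose AP fell by more than max_drop points."""
    flagged = {}
    for cls, base in ap_fp32.items():
        drop = base - ap_int8.get(cls, 0.0)
        if drop > max_drop:
            flagged[cls] = round(drop, 2)
    return flagged

# Made-up per-class AP values, for illustration only.
ap_fp32 = {"person": 62.0, "bicycle": 41.0, "traffic light": 30.0}
ap_int8 = {"person": 61.6, "bicycle": 40.5, "traffic light": 25.0}
print(flag_regressions(ap_fp32, ap_int8))
```

An aggregate mAP can hide a large drop on one class, so the check runs per class rather than on the mean.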

Export paths

PyTorch → ONNX

python

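import torch

# `model` is assumed to be an already-loaded detector (e.g. a torchvision model).
model.eval()  # switch dropout / batch norm to inference behavior before export
dummy = torch.randn(1, 3, 640, 640)  # NCHW example input at the deploy resolution
torch.onnx.export(
    model, dummy, "detector.onnx",
    input_names=["images"],
    output_names=["boxes", "labels", "scores"],
    opset_version=17,
    # Only the batch dimension is dynamic; height and width stay fixed at 640.
    dynamic_axes={"images": {0: "batch"}},
)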

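Verify with ONNX Runtime: run the same input tensor through both the ONNX model and the PyTorch model, then compare the maximum absolute difference between their outputs.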

YOLO → ONNX / TFLite (Ultralytics)

bash

yolo export model=best.pt format=onnx simplify=True
yolo export model=best.pt format=tflite int8=True

NMS placement

Option	Pros	Cons
In model graph	Single runtime call	Export complexity
App code (CPU)	Flexible thresholds	Extra latency
GPU NMS plugin	Fast for many boxes	Platform-specific
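
For the "App code (CPU)" row, a minimal greedy NMS sketch in plain Python shows why thresholds stay flexible: they are ordinary function arguments. It assumes `[x1, y1, x2, y2]` boxes and class-agnostic suppression; the function names and default thresholds are illustrative, and a production app would use a vectorized or native implementation.

```python
def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5, score_thresh=0.25):
    """Greedy NMS; returns indices of kept boxes, highest score first."""
    # Drop low-score candidates first: cost grows with the number of candidates.
    order = sorted((i for i, s in enumerate(scores) if s >= score_thresh),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already-kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep

boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60], [0, 0, 10, 10]]
scores = [0.9, 0.8, 0.7, 0.1]
print(nms(boxes, scores))
```

Raising `score_thresh` before NMS is usually the cheapest way to cut the CPU cost, since the pairwise comparisons scale with the number of surviving candidates.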

Serving pattern (server)

Client uploads image or frame.
Server preprocesses (same as training!).
ONNX Runtime / TensorRT inference.
NMS + filter by score_thresh.
Return JSON: {"detections": [{label, score, bbox}, ...], "latency_ms": ...}.

json

{
  "detections": [
    {"label": "person", "score": 0.91, "bbox": [120, 40, 220, 200]}
  ],
  "latency_ms": 28
}
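
The last two server steps (score filtering and serialization) can be sketched as below. It assumes NMS has already produced parallel lists of boxes, integer labels, and scores; `build_response`, the `class_names` list, and the rounding choices are hypothetical.

```python
import json
import time

def build_response(boxes, labels, scores, class_names, score_thresh, started_at):
    """Filter by score and serialize detections in the response shape shown above."""
    detections = [
        {"label": class_names[label], "score": round(float(score), 2),
         "bbox": [int(v) for v in box]}
        for box, label, score in zip(boxes, labels, scores)
        if score >= score_thresh
    ]
    # Latency is measured from request start, so it covers preprocess through NMS.
    latency_ms = int(round((time.perf_counter() - started_at) * 1000))
    return json.dumps({"detections": detections, "latency_ms": latency_ms})

t0 = time.perf_counter()
body = build_response([[120, 40, 220, 200], [5, 5, 9, 9]], [0, 1], [0.91, 0.10],
                      ["person", "dog"], score_thresh=0.5, started_at=t0)
print(body)
```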

Module 7 project extends classifier serving to batched inference — same discipline applies.

Monitoring in production

Log per request:

Latency p50/p95
Detection count distribution
Max score histogram
Input resolution

Alert on distribution shift (suddenly zero detections on busy camera) — see Module 7 drift lesson.
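
A minimal sketch of turning the logged fields into p50/p95 latency and a zero-detection alert. It assumes logs are aggregated over a fixed window of requests; the nearest-rank percentile, the `should_alert` rule, and its tolerance are illustrative choices, not a prescribed method.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile (pct in 1..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(latencies_ms, detection_counts):
    """Aggregate one window of per-request logs."""
    zero = sum(1 for c in detection_counts if c == 0)
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "zero_detection_rate": zero / len(detection_counts),
    }

def should_alert(summary, baseline_zero_rate, tolerance=0.2):
    """Flag a window whose share of empty frames exceeds the baseline by `tolerance`."""
    return summary["zero_detection_rate"] > baseline_zero_rate + tolerance

# Illustrative window: 20 requests, most of which returned no detections.
latencies = [float(i) for i in range(1, 21)]
counts = [0] * 16 + [3] * 4
summary = summarize(latencies, counts)
print(summary, should_alert(summary, baseline_zero_rate=0.05))
```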

Ship checklist

Val mAP documented with IoU threshold
score_thresh chosen on val, not test
Preprocess documented (RGB order, mean/std, resize)
Worst-case latency measured (thermal throttling after 5 min)
Failure gallery reviewed by team

What's next

Module 4 quiz — then detector project.