Edge deployment & optimization

Before we begin

Mobile robots, phones, and embedded cameras need low latency and low power, which often means running INT8 models on NPUs.

Learning objectives

Explain post-training quantization (PTQ) vs quantization-aware training (QAT).
Name the roles of TensorRT, TFLite, and Core ML.
Describe pruning and distillation at a high level.
Profile latency and memory on target hardware.

INT8 quantization

Map float32 weights and activations to 8-bit integers using a scale and a zero-point. Storage shrinks about 4× (32 bits down to 8), and inference runs faster on hardware with native INT8 support.
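
The scale/zero-point mapping can be sketched in a few lines of plain Python. This is a minimal illustration, not any framework's implementation: the function names, the example weights, and the symmetric range [-1, 1] are all assumptions chosen for readability.

```python
def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to INT8 codes: q = clamp(round(x / scale) + zero_point)."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(codes, scale, zero_point):
    """Recover approximate floats: x ≈ scale * (q - zero_point)."""
    return [scale * (q - zero_point) for q in codes]


# Hypothetical weights; a symmetric range of [-1, 1] is assumed for illustration.
weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
scale, zero_point = 1.0 / 127, 0

codes = quantize(weights, scale, zero_point)
restored = dequantize(codes, scale, zero_point)

# In-range values come back within half a quantization step of the original.
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, worst_error)
```

Values outside the representable range are clamped, which is why choosing the scale well (the job of calibration) matters for accuracy.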

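PTQ: calibrate activation ranges on representative batches after training. Fast and needs no retraining, but may lose 1–2% accuracy.
QAT: simulate quantization during training so the weights adapt to the rounding error. Better accuracy at INT8, at the cost of a training run.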
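
Both approaches can be illustrated with one small sketch, assuming simple min/max calibration (real toolkits also offer other range estimators, such as percentile or entropy based ones). PTQ would run `calibrate` once over representative data; QAT would apply something like `fake_quantize` in the forward pass during training, with gradient handling that is not shown here. All names and the sample batches are hypothetical.

```python
def calibrate(batches, qmin=-128, qmax=127):
    """Derive scale and zero-point from the min/max seen across calibration batches."""
    lo = min(min(b) for b in batches)
    hi = max(max(b) for b in batches)
    lo, hi = min(lo, 0.0), max(hi, 0.0)        # keep 0.0 exactly representable
    scale = (hi - lo) / (qmax - qmin) or 1.0   # avoid a zero scale for all-zero data
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def fake_quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Quantize then dequantize: output stays float but carries INT8 rounding error."""
    out = []
    for v in values:
        q = max(qmin, min(qmax, round(v / scale) + zero_point))
        out.append(scale * (q - zero_point))
    return out


# Hypothetical activation batches standing in for representative calibration data.
batches = [[0.0, 0.4, 1.2], [0.1, 2.55, 0.9]]
scale, zero_point = calibrate(batches)

# 5.0 lies outside the calibrated range, so it saturates at the top of the range.
simulated = fake_quantize([0.0, 1.0, 5.0], scale, zero_point)
print(scale, zero_point, simulated)
```

The saturated last value shows why calibration batches need to be representative: activations beyond the observed range are clipped at inference time.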

ONNX → TensorRT

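TensorRT builds an engine for a specific GPU, fuses layers, and lets you pick FP16 or INT8 precision. Because the engine is tuned to the GPU it was built on, rebuild it for each target device. Profile with trtexec or NVIDIA's profiling tools.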
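
As a rough sketch of the build step, the helper below assembles a trtexec command line from an ONNX file. The helper name and the file paths are hypothetical; the flags used (`--onnx`, `--saveEngine`, `--fp16`, `--int8`) are standard trtexec options, but check them against the TensorRT version on your device.

```python
import shlex


def build_trtexec_cmd(onnx_path, engine_path, precision="fp16"):
    """Assemble a trtexec invocation that builds and times an engine."""
    cmd = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if precision == "fp16":
        cmd.append("--fp16")
    elif precision == "int8":
        cmd.append("--int8")
    elif precision != "fp32":
        raise ValueError(f"unsupported precision: {precision}")
    return cmd


# Hypothetical file names; run the printed command on the target GPU itself.
cmd = build_trtexec_cmd("model.onnx", "model.engine", precision="fp16")
print(shlex.join(cmd))
```

The command is only printed here, so the sketch runs without TensorRT installed. Note that an INT8 run without calibration data or a QAT model is typically useful for timing only, not for judging accuracy.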

Mobile

Run TFLite with the GPU or NNAPI delegate. Check operator support first: ops the delegate cannot handle fall back to the CPU, which can be much slower.

Pruning & distillation

Pruning: remove low-magnitude weights or whole channels, then retrain briefly to recover accuracy.
Distillation: train a small student to mimic a large teacher's logits, producing a compact model for edge devices.
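
Both ideas can be sketched on plain lists. This is an illustration under stated assumptions: unstructured magnitude pruning over a flat weight list, and a common distillation loss (KL divergence between temperature-softened teacher and student distributions, scaled by T²). The function names, weights, and logits are hypothetical.

```python
import math


def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest absolute value."""
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


pruned = magnitude_prune([0.9, -0.05, 0.4, 0.01], sparsity=0.5)
loss_far = distillation_loss([0.0, 0.0, 0.0], [2.0, 1.0, 0.0])
loss_same = distillation_loss([2.0, 1.0, 0.0], [2.0, 1.0, 0.0])
print(pruned, loss_far, loss_same)
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge. In practice it is usually combined with the ordinary label loss.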

Trade-off table

Technique	Latency	Accuracy risk
FP16	Moderate gain	Usually low
INT8 PTQ	Large gain	Medium
Smaller backbone	Large gain	Architecture-dependent
Input resolution ↓	Large gain	Can hurt small objects
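
To fill in a table like this for your own model, measure on the target device rather than on a workstation. The harness below is a generic sketch: `fake_inference` is a stand-in for a real model call, and `tracemalloc` only sees Python heap allocations, so native or accelerator memory needs the vendor's own tools.

```python
import statistics
import time
import tracemalloc


def profile(fn, warmup=5, runs=50):
    """Return median and p95 latency in ms, plus peak Python heap use in KiB."""
    for _ in range(warmup):          # warm caches before timing
        fn()

    times_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - start) * 1000.0)
    times_ms.sort()

    # Measure memory in a separate pass so tracing overhead does not skew timings.
    tracemalloc.start()
    fn()
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    p95_index = min(len(times_ms) - 1, int(0.95 * len(times_ms)))
    return {
        "median_ms": statistics.median(times_ms),
        "p95_ms": times_ms[p95_index],
        "peak_kib": peak_bytes / 1024,
    }


def fake_inference():
    """Hypothetical workload standing in for a single model forward pass."""
    return sum(i * i for i in range(10_000))


stats = profile(fake_inference, warmup=2, runs=20)
print(stats)
```

Reporting a tail percentile alongside the median matters on edge devices, where thermal throttling and background tasks can make occasional runs much slower than the typical one.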

What's next

Lesson 3 — Monitoring, drift & retraining