Edge deployment & optimization
Before we begin
Mobile robots, phones, and embedded cameras need low latency and low power draw, which often means running INT8 models on NPUs.
Learning objectives
- Explain post-training quantization (PTQ) vs quantization-aware training (QAT).
- Name the roles of TensorRT, TFLite, and Core ML.
- Describe pruning and distillation at a high level.
- Profile latency and memory on target hardware.
INT8 quantization
Map float32 weights and activations to 8-bit integers using a scale and a zero-point. Storage shrinks by about 4× (32 bits down to 8 per value), and inference runs faster on hardware with INT8 support.
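The scale/zero-point mapping can be sketched in a few lines of plain Python. This is a minimal illustration of asymmetric (affine) quantization to signed INT8, not the implementation of any particular framework; the function names `quant_params`, `quantize`, and `dequantize` are made up for this example.

```python
def quant_params(x_min, x_max, qmin=-128, qmax=127):
    """Return (scale, zero_point) for affine quantization of [x_min, x_max]."""
    # Widen the range to include 0.0 so that zero maps to an exact integer.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    if scale == 0.0:  # all values are zero; any positive scale works
        return 1.0, 0
    zero_point = round(qmin - x_min / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """float -> int8: q = clamp(round(x / scale) + zero_point)."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """int8 -> float: x is approximately scale * (q - zero_point)."""
    return [scale * (q - zero_point) for q in q_values]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
scale, zero_point = quant_params(min(weights), max(weights))
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)
# Each restored value differs from the original by at most one step (scale).
print(q, restored)
```

The round trip is lossy: every value snaps to the nearest representable step, which is where the accuracy risk of INT8 comes from.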
PTQ: calibrate an already-trained model on representative batches to choose the quantization ranges. Fast to apply, but may lose 1–2% accuracy.
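Calibration amounts to watching the value ranges that real inputs produce and deriving quantization parameters from them. The sketch below assumes the simplest strategy, a running min/max; production toolchains often use percentile or entropy-based calibration instead. `MinMaxObserver` is a hypothetical name for this example.

```python
class MinMaxObserver:
    """Track the running min/max of activations seen during calibration."""

    def __init__(self):
        self.lo = float("inf")
        self.hi = float("-inf")

    def observe(self, batch):
        # batch: a flat list of float activations from one calibration batch
        self.lo = min(self.lo, min(batch))
        self.hi = max(self.hi, max(batch))

    def quant_params(self, qmin=-128, qmax=127):
        # Include 0.0 in the range so zero is exactly representable.
        lo, hi = min(self.lo, 0.0), max(self.hi, 0.0)
        scale = (hi - lo) / (qmax - qmin) or 1.0  # guard against a zero range
        zero_point = max(qmin, min(qmax, round(qmin - lo / scale)))
        return scale, zero_point


observer = MinMaxObserver()
for batch in [[0.1, 0.9], [-0.3, 2.0]]:  # stand-in for representative data
    observer.observe(batch)
scale, zero_point = observer.quant_params()
print(observer.lo, observer.hi, scale, zero_point)
```

If the calibration batches miss the value range seen in deployment, activations get clamped at inference time, which is one reason the batches need to be representative.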
QAT: simulate quantization during training so the weights adapt to rounding error. Usually gives better INT8 accuracy than PTQ, at the cost of a training run.
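A common way to simulate quantization during training is "fake quantization": quantize and immediately dequantize in the forward pass, and treat the rounding as identity in the backward pass (a straight-through estimator). The scalar sketch below assumes that approach; `fake_quant` and `ste_grad` are illustrative names, and real frameworks apply the same idea to whole tensors.

```python
def fake_quant(x, scale, zero_point, qmin=-128, qmax=127):
    """Forward pass: x stays a float but carries INT8 rounding and clamping."""
    q = max(qmin, min(qmax, round(x / scale) + zero_point))
    return scale * (q - zero_point)


def ste_grad(x, scale, zero_point, upstream, qmin=-128, qmax=127):
    """Backward pass (straight-through estimator).

    Rounding has zero gradient almost everywhere, so pass the upstream
    gradient through unchanged inside the representable range and block
    it where the value was clamped.
    """
    lo = scale * (qmin - zero_point)
    hi = scale * (qmax - zero_point)
    return upstream if lo <= x <= hi else 0.0


print(fake_quant(0.34, scale=0.1, zero_point=0))     # snaps to the nearest step
print(fake_quant(100.0, scale=0.1, zero_point=0))    # clamped to the top of the range
print(ste_grad(0.34, 0.1, 0, upstream=1.0))          # gradient passes through
print(ste_grad(100.0, 0.1, 0, upstream=1.0))         # gradient blocked
```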
ONNX → TensorRT
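Build an engine for the specific target GPU, since an engine is tuned to the hardware it is built on. TensorRT fuses layers during the build and lets you choose FP16 or INT8 precision. Profile with trtexec or NVIDIA's profiling tools.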
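As a rough sketch, the build step can be driven from Python by assembling a trtexec command line. The helper name `trtexec_command` and the file names `model.onnx` and `model.engine` are placeholders for this example; the flags shown (`--onnx`, `--saveEngine`, `--fp16`, `--int8`) are standard trtexec options, but check `trtexec --help` for the version installed on your device.

```python
import shlex


def trtexec_command(onnx_path, engine_path, fp16=True, int8=False):
    """Assemble a trtexec invocation that builds and times a TensorRT engine."""
    cmd = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if fp16:
        cmd.append("--fp16")  # allow FP16 kernels
    if int8:
        cmd.append("--int8")  # allow INT8 kernels
    return cmd


cmd = trtexec_command("model.onnx", "model.engine", fp16=True)
print(shlex.join(cmd))
```

Running it requires trtexec on the PATH of the target machine, for example via `subprocess.run(cmd, check=True)`. Build on the same GPU model you deploy to.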
Mobile
Run TFLite with the GPU or NNAPI delegate. Watch operator support: ops that a delegate cannot handle fall back to slower CPU paths.
Pruning & distillation
Pruning: remove low-magnitude weights or whole channels, then retrain briefly to recover accuracy.
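Both flavors can be illustrated on plain lists. This is a minimal sketch, assuming magnitude as the importance score for individual weights and the L1 norm for channels; the function names are invented for the example, and the brief retraining step is not shown.

```python
def magnitude_prune(weights, sparsity):
    """Unstructured pruning: zero the `sparsity` fraction of smallest-magnitude weights."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    k = int(len(weights) * sparsity)  # number of weights to remove
    # Indices ordered by magnitude, smallest first; the first k are pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:k])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


def prune_channels(channels, keep):
    """Structured pruning: keep the `keep` channels with the largest L1 norm."""
    ranked = sorted(
        range(len(channels)),
        key=lambda c: sum(abs(w) for w in channels[c]),
        reverse=True,
    )
    kept = sorted(ranked[:keep])  # preserve the original channel order
    return [channels[c] for c in kept]


print(magnitude_prune([0.9, -0.05, 0.3, 0.01, -0.7, 0.2], sparsity=0.5))
print(prune_channels([[0.1, -0.1], [1.0, 2.0], [0.5, 0.5]], keep=2))
```

Zeroed individual weights only speed things up when the runtime exploits sparsity, whereas removing whole channels shrinks the dense tensors directly.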
Distillation: train a small student to mimic a large teacher's logits, yielding a compact model suited to edge devices.
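One common way to make the student "mimic" the teacher is to minimize the KL divergence between temperature-softened output distributions. The sketch below assumes that formulation, with the usual T² scaling; the temperature of 2.0 is an arbitrary illustrative choice, and in practice this term is typically mixed with the ordinary label loss.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl


teacher = [4.0, 1.0, 0.2]
print(distillation_loss([3.5, 1.2, 0.1], teacher))  # student close to teacher: small loss
print(distillation_loss([0.0, 0.0, 3.0], teacher))  # student far from teacher: larger loss
```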
Trade-off table
| Technique | Latency | Accuracy risk |
|---|---|---|
| FP16 | Moderate gain | Usually low |
| INT8 PTQ | Large gain | Medium |
| Smaller backbone | Large gain | Architecture-dependent |
| Input resolution ↓ | Large gain | Can hurt small objects |
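
The gains in this table are qualitative, so measure them on your own target hardware. A minimal latency harness might look like the following sketch: `benchmark` is a hypothetical helper, and the lambda stands in for a real inference call. On GPUs and NPUs that execute asynchronously, the timed call must block until the result is ready, or the numbers will be misleadingly small.

```python
import statistics
import time


def benchmark(fn, warmup=10, runs=100):
    """Call fn() repeatedly and return latency statistics in milliseconds."""
    for _ in range(warmup):  # let caches, clocks, and lazy init settle
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    samples.sort()
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "mean_ms": statistics.fmean(samples),
        "p50_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
    }


# Stand-in workload; replace with a call into your model runtime.
stats = benchmark(lambda: sum(i * i for i in range(10_000)))
print(stats)
```

Report a tail percentile alongside the mean, since real-time edge workloads usually care about worst-case frames more than the average.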