Model serving for vision
Before we begin
Serving turns a trained weights file into a service other apps call. Vision adds heavy preprocessing and large payloads.
Learning objectives
- Compare REST vs gRPC for image inference.
- Design preprocessing parity between train and serve.
- Use batching and warm-up for stable latency.
- Outline an ONNX Runtime deployment path.
Preprocessing contract
Document and test:
- Resize dimensions and crop policy
- RGB vs BGR
- Normalization mean/std
- NCHW vs NHWC tensor layout
Even a small mismatch, such as an off-by-one resize or a different crop policy, can sharply reduce detection mAP in production without raising any error.
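One way to make the contract testable is to record it as data and compare fingerprints at service start-up. The sketch below is illustrative only: `PreprocessContract`, `assert_parity`, and the field names are hypothetical, and the default values simply mirror the ImageNet-style settings used in this lesson.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PreprocessContract:
    """Hypothetical record of every choice listed in the contract."""
    resize: tuple = (224, 224)        # (width, height) after resizing
    crop: str = "none"                # e.g. "none" or "center"
    channel_order: str = "RGB"        # "RGB" or "BGR"
    mean: tuple = (0.485, 0.456, 0.406)
    std: tuple = (0.229, 0.224, 0.225)
    layout: str = "NCHW"              # "NCHW" or "NHWC"

    def fingerprint(self) -> str:
        # Canonical JSON (sorted keys) so equal contracts hash identically.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def assert_parity(train: PreprocessContract, serve: PreprocessContract) -> None:
    """Fail fast if the training and serving contracts differ."""
    if train.fingerprint() != serve.fingerprint():
        serve_fields = asdict(serve)
        diff = {
            key: (value, serve_fields[key])
            for key, value in asdict(train).items()
            if serve_fields[key] != value
        }
        raise ValueError(f"preprocessing mismatch (train, serve): {diff}")


train_contract = PreprocessContract()
serve_contract = PreprocessContract(channel_order="BGR")  # deliberate mismatch

try:
    assert_parity(train_contract, serve_contract)
except ValueError as err:
    print(err)
```

A fingerprint only checks the declared settings. A golden-image test that compares the actual tensors produced by the training and serving pipelines would complement it.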
REST API sketch (FastAPI)
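```python
import io

import numpy as np
import onnxruntime as ort
from fastapi import FastAPI, File, UploadFile
from PIL import Image

app = FastAPI()
sess = ort.InferenceSession("classifier.onnx", providers=["CPUExecutionProvider"])
# Read the input name from the model instead of hard-coding it.
INPUT_NAME = sess.get_inputs()[0].name

# float32 constants keep the tensor float32; NumPy's float64 default would
# upcast it, and ONNX Runtime rejects a double tensor for a float input.
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def preprocess(img: Image.Image) -> np.ndarray:
    img = img.convert("RGB").resize((224, 224))
    x = np.array(img).astype(np.float32) / 255.0  # HWC, scaled to [0, 1]
    x = (x - MEAN) / STD
    # HWC -> CHW, then add a batch axis: (1, 3, 224, 224)
    return np.ascontiguousarray(x.transpose(2, 0, 1)[None, ...])


@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    img = Image.open(io.BytesIO(await file.read()))
    logits = sess.run(None, {INPUT_NAME: preprocess(img)})[0]
    return {"class_id": int(logits.argmax())}
```
Batching & warm-up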
- Dynamic batching: queue requests for a few ms, run batch on GPU.
- Warm-up: run dummy inference on deploy to avoid cold-start p99 spikes.
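The two ideas above can be sketched with the standard library alone. This is a minimal illustration, not a production batcher: `MicroBatcher`, `warm_up`, and `fake_model` are hypothetical names, and `fake_model` stands in for a real batched model call such as an ONNX Runtime session.

```python
import asyncio
from typing import Callable, List, Sequence


class MicroBatcher:
    """Collects requests for up to max_wait_ms, then runs them as one batch."""

    def __init__(self, run_batch: Callable[[Sequence], List],
                 max_batch: int = 8, max_wait_ms: float = 5.0):
        self.run_batch = run_batch          # stand-in for a batched model call
        self.max_batch = max_batch
        self.max_wait = max_wait_ms / 1000.0
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, item):
        """Called once per request; resolves when the item's batch has run."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def worker(self):
        loop = asyncio.get_running_loop()
        while True:
            # Block until the first request arrives, then open the window.
            batch = [await self.queue.get()]
            deadline = loop.time() + self.max_wait
            while len(batch) < self.max_batch:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            items = [item for item, _ in batch]
            try:
                # Run in a thread so the event loop keeps accepting requests.
                outputs = await loop.run_in_executor(None, self.run_batch, items)
            except Exception as exc:
                for _, fut in batch:
                    if not fut.done():
                        fut.set_exception(exc)
                continue
            for (_, fut), out in zip(batch, outputs):
                if not fut.done():
                    fut.set_result(out)


def warm_up(run_batch: Callable[[Sequence], List], dummy_item, runs: int = 3) -> None:
    """Run a few throwaway inferences before accepting traffic."""
    for _ in range(runs):
        run_batch([dummy_item])


calls = []  # batch sizes seen by the fake model


def fake_model(items):
    calls.append(len(items))
    return [x * 2 for x in items]


async def demo():
    batcher = MicroBatcher(fake_model, max_batch=4, max_wait_ms=20)
    task = asyncio.create_task(batcher.worker())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(4)))
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return results


warm_up(fake_model, dummy_item=0)
warm_sizes = list(calls)
calls.clear()
results = asyncio.run(demo())
print(results, calls)
```

The window length trades latency for throughput: a longer wait fills larger batches but adds delay to every request, so the value is usually tuned against the latency target.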
Health checks
GET /health returns the model version and readiness status. Kubernetes probes and load balancers poll an endpoint like this to decide when an instance can receive traffic.
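A minimal sketch of the logic behind such an endpoint is shown here, kept framework-free so it can be tested in isolation. `ServiceState`, `health`, and the version string are hypothetical placeholders, and returning 503 while not ready is one common convention rather than a requirement.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class ServiceState:
    """Hypothetical holder for what the health endpoint reports."""
    model_version: str
    ready: bool = False   # set to True once the model is loaded and warmed up


def health(state: ServiceState) -> Tuple[int, Dict[str, object]]:
    """Return (HTTP status, JSON body) for GET /health."""
    body = {"model_version": state.model_version, "ready": state.ready}
    # A non-2xx status tells a probe or load balancer to hold traffic.
    return (200 if state.ready else 503), body


state = ServiceState(model_version="example-v1")  # placeholder version string
print(health(state))   # before warm-up
state.ready = True
print(health(state))   # after warm-up
```

In the FastAPI app from this lesson, a `@app.get("/health")` handler could call `health(state)` and return the body through `fastapi.responses.JSONResponse` with the matching `status_code`.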