Model serving & deployment
Before we begin
Training happens offline. Serving happens on every user request — your app calls a model and returns an answer under time and cost constraints.
Figure: Typical serving stack
What you will learn
- Define model serving vs training vs fine-tuning.
- Choose hosted API vs self-hosted inference.
- Deploy GenAI routes in Next.js with streaming and timeouts.
Hosted API vs self-hosting
| | Hosted (OpenAI, Anthropic, etc.) | Self-hosted (vLLM, Ollama, TGI) |
|---|---|---|
| Setup | Minutes | Days–weeks (GPU, ops) |
| Cost model | Per token | Fixed GPU + engineering |
| Best for | MVPs, most products | High volume, strict privacy |
| Scaling | Provider handles | You handle replicas |
Most teams start hosted, measure unit economics, then evaluate self-hosting.
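To make "measure unit economics" concrete, a rough break-even calculation might look like the sketch below. Every price and cost figure is a hypothetical placeholder rather than a real provider rate, and the type and function names are illustrative only.

```typescript
// Rough break-even sketch: hosted per-token pricing vs. a fixed self-hosting bill.
// Every number below is a hypothetical placeholder, not a real provider price.

interface HostedPricing {
  usdPerMillionTokens: number; // blended input + output rate (assumed)
}

interface SelfHostedCosts {
  gpuUsdPerMonth: number;         // fixed GPU rental or amortized hardware (assumed)
  engineeringUsdPerMonth: number; // ops and on-call time (assumed)
}

// Hosted cost grows linearly with token volume.
function hostedMonthlyCost(tokensPerMonth: number, p: HostedPricing): number {
  return (tokensPerMonth / 1_000_000) * p.usdPerMillionTokens;
}

// Self-hosted cost is treated as flat, which only holds until the GPUs saturate.
function selfHostedMonthlyCost(c: SelfHostedCosts): number {
  return c.gpuUsdPerMonth + c.engineeringUsdPerMonth;
}

// Monthly token volume at which both options cost the same.
function breakEvenTokensPerMonth(p: HostedPricing, c: SelfHostedCosts): number {
  return (selfHostedMonthlyCost(c) / p.usdPerMillionTokens) * 1_000_000;
}

const pricing: HostedPricing = { usdPerMillionTokens: 1 };
const costs: SelfHostedCosts = { gpuUsdPerMonth: 2_000, engineeringUsdPerMonth: 3_000 };

console.log(hostedMonthlyCost(500_000_000, pricing)); // 500
console.log(breakEvenTokensPerMonth(pricing, costs)); // 5000000000
```

With these made-up inputs, hosted stays cheaper until roughly five billion tokens per month. Substitute your own measured volume and quoted prices before drawing any conclusion.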
Next.js API route pattern
Keep secrets server-side. Never expose API keys in client bundles.
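// app/api/chat/route.ts
export async function POST(req: Request) {
  const { messages } = await req.json();
  // Abort the upstream call if no response arrives within 30 seconds.
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), 30_000);
  try {
    const res = await fetch("https://api.openai.com/v1/chat/completions", {
      method: "POST",
      headers: {
        // Read on the server only, so the key never reaches the client bundle.
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model: "gpt-4o-mini", messages, stream: true }),
      signal: controller.signal,
    });
    // Report upstream failures as JSON instead of relabeling them as a stream.
    if (!res.ok || !res.body) {
      return Response.json({ error: "Upstream request failed" }, { status: 502 });
    }
    // Pass the provider's server-sent event stream through to the client.
    return new Response(res.body, { headers: { "Content-Type": "text/event-stream" } });
  } catch (err) {
    // fetch rejects with an AbortError when the timeout fires.
    const timedOut = err instanceof Error && err.name === "AbortError";
    return Response.json(
      { error: timedOut ? "Upstream timeout" : "Upstream error" },
      { status: timedOut ? 504 : 502 },
    );
  } finally {
    // Runs as soon as headers arrive (or the call fails), so the timer
    // bounds time to first byte, not the whole stream.
    clearTimeout(timeout);
  }
}
Streaming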
Autoregressive models generate one token at a time. Streaming sends tokens as they arrive — better perceived latency even when total time is similar.
Client: use fetch + ReadableStream or provider SDK stream helpers.
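As a minimal sketch of the fetch + ReadableStream option, the client can read the body chunk by chunk and emit text as it arrives. The sketch assumes the server passes through OpenAI-style server-sent events (lines of `data: {json}` ending with `data: [DONE]`); `deltaFromLine` and `streamChatText` are hypothetical helper names, not part of any SDK.

```typescript
// Client-side sketch: read a streamed SSE body and surface tokens as they arrive.
// Assumes OpenAI-style chunks: lines of `data: {json}` ending with `data: [DONE]`.

// Extracts the text delta from one SSE line; returns "" for anything else.
function deltaFromLine(line: string): string {
  if (!line.startsWith("data:")) return "";
  const payload = line.slice("data:".length).trim();
  if (payload === "" || payload === "[DONE]") return "";
  const parsed = JSON.parse(payload);
  return parsed?.choices?.[0]?.delta?.content ?? "";
}

// Reads the body chunk by chunk, buffering partial lines between reads.
async function streamChatText(
  body: ReadableStream<Uint8Array>,
  onToken: (token: string) => void,
): Promise<string> {
  const reader = body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  let full = "";
  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      // stream: true keeps multi-byte characters intact across chunk boundaries.
      buffer += decoder.decode(value, { stream: true });
      const lines = buffer.split("\n");
      buffer = lines.pop() ?? ""; // last element is an incomplete line (or "")
      for (const line of lines) {
        const token = deltaFromLine(line.trim());
        if (token) {
          full += token;
          onToken(token);
        }
      }
    }
    // Any unterminated trailing line is dropped, as SSE discards incomplete events.
  } finally {
    reader.releaseLock();
  }
  return full;
}
```

In a component, this could be called as `streamChatText(res.body, appendToUi)` after `const res = await fetch("/api/chat", { method: "POST", body: JSON.stringify({ messages }) })`, where `appendToUi` stands in for whatever state update your UI uses.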
Timeouts and retries
- Set hard timeouts on every upstream LLM call.
- Retry transient failures (429, 503) with exponential backoff, and cap the number of attempts.
- Do not blindly retry non-idempotent tool side effects (payments, deletes).
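The first two rules above can be sketched as a small wrapper around fetch. This is one possible shape, not a library API: `fetchWithRetry`, `RetryOptions`, `FetchLike`, and the option values are all illustrative assumptions, and the wrapper should only be used for calls that are harmless to repeat.

```typescript
// Retry sketch: per-attempt hard timeout plus capped exponential backoff.
// All names and defaults here are illustrative, not a library API.

interface RetryOptions {
  maxAttempts: number; // total tries, including the first
  baseDelayMs: number; // delay before the first retry
  maxDelayMs: number;  // upper bound on any single delay
  timeoutMs: number;   // hard timeout per attempt
}

type FetchLike = (url: string, init?: RequestInit) => Promise<Response>;

const RETRYABLE_STATUS = new Set([429, 503]);

// Delay before retry number `retry` (1-based): base * 2^(retry - 1), capped.
function backoffDelayMs(retry: number, opts: RetryOptions): number {
  return Math.min(opts.baseDelayMs * 2 ** (retry - 1), opts.maxDelayMs);
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// One attempt with its own abort timer.
async function fetchOnce(
  url: string,
  init: RequestInit,
  timeoutMs: number,
  doFetch: FetchLike,
): Promise<Response> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await doFetch(url, { ...init, signal: controller.signal });
  } finally {
    clearTimeout(timer);
  }
}

// Retries only on 429/503; thrown errors (network, timeout) propagate to the caller.
async function fetchWithRetry(
  url: string,
  init: RequestInit,
  opts: RetryOptions,
  doFetch: FetchLike = fetch, // injectable so the logic can be exercised offline
): Promise<Response> {
  for (let attempt = 1; ; attempt++) {
    const res = await fetchOnce(url, init, opts.timeoutMs, doFetch);
    const canRetry = RETRYABLE_STATUS.has(res.status) && attempt < opts.maxAttempts;
    if (!canRetry) return res;
    await sleep(backoffDelayMs(attempt, opts));
  }
}
```

Adding random jitter to each delay is a common refinement when many clients retry at once; it is left out here to keep the sketch short.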
Deployment checklist
- Env vars in host dashboard — not in git
- Health check route (/api/health)
- Edge vs Node runtime: LLM routes usually need Node
- Version pin for model name in config
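Several checklist items can be combined in one small route. The sketch below is an assumption-laden example rather than a prescribed layout: the `LLM_MODEL` variable, the `modelName` and `checkHealth` helpers, and the response shape are hypothetical, while `export const runtime` is the standard Next.js route segment option.

```typescript
// Deployment sketch: a health check route plus a model name pinned in config.
// `LLM_MODEL`, `modelName`, and `checkHealth` are illustrative assumptions.

// app/api/health/route.ts
export const runtime = "nodejs"; // opt into the Node.js runtime explicitly

type Env = Record<string, string | undefined>;

// Pin the model in one place; an env var can override it per environment.
// Swap in a dated snapshot identifier if your provider offers one.
function modelName(env: Env): string {
  return env.LLM_MODEL ?? "gpt-4o-mini";
}

// Reports readiness without leaking any secret values.
function checkHealth(env: Env): { status: number; body: { ok: boolean; model: string } } {
  const hasKey = Boolean(env.OPENAI_API_KEY);
  return { status: hasKey ? 200 : 503, body: { ok: hasKey, model: modelName(env) } };
}

export async function GET(): Promise<Response> {
  const { status, body } = checkHealth(process.env);
  return Response.json(body, { status });
}
```

Keeping `checkHealth` separate from the handler means the readiness logic can be unit-tested with a plain object instead of real environment variables.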
Checkpoint
You should be able to explain why the LLM call lives in a server route, not the browser, and what streaming improves for users.