Model serving & deployment

Before we begin

Training happens offline. Serving happens on every user request — your app calls a model and returns an answer under time and cost constraints.

Figure: Typical serving stack

UI → API route → orchestration (RAG/agent) → model provider or GPU server.

What you will learn

Define model serving vs training vs fine-tuning.
Choose hosted API vs self-hosted inference.
Deploy GenAI routes in Next.js with streaming and timeouts.

Hosted API vs self-hosting

Criterion	Hosted (OpenAI, Anthropic, etc.)	Self-hosted (vLLM, Ollama, TGI)
Setup time	Minutes	Days to weeks (GPU provisioning, ops)
Cost model	Per token	Fixed GPU cost plus engineering time
Best for	MVPs, most products	High volume, strict privacy requirements
Scaling	Handled by the provider	You manage replicas

Most teams start hosted, measure unit economics, then evaluate self-hosting.
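
To make "measure unit economics" concrete, here is a minimal sketch of a break-even calculation between per-token pricing and a fixed self-hosting cost. Every number and name in it (`HostedPricing`, `SelfHostedCost`, the dollar amounts) is a hypothetical placeholder, not a real provider or GPU rate; substitute your own measured figures.

```typescript
// All prices below are hypothetical placeholders, not real provider or GPU rates.
interface HostedPricing {
  usdPerMillionTokens: number; // blended input + output price
}

interface SelfHostedCost {
  gpuUsdPerMonth: number; // fixed GPU rental or amortized hardware
  engineeringUsdPerMonth: number; // ops and maintenance time
}

// Monthly hosted bill at a given token volume.
function hostedMonthlyCost(tokensPerMonth: number, p: HostedPricing): number {
  return (tokensPerMonth / 1_000_000) * p.usdPerMillionTokens;
}

// Token volume at which the fixed self-hosted cost equals the hosted bill.
function breakEvenTokensPerMonth(p: HostedPricing, s: SelfHostedCost): number {
  const fixed = s.gpuUsdPerMonth + s.engineeringUsdPerMonth;
  return (fixed / p.usdPerMillionTokens) * 1_000_000;
}

const hosted: HostedPricing = { usdPerMillionTokens: 1 };
const selfHosted: SelfHostedCost = { gpuUsdPerMonth: 2_000, engineeringUsdPerMonth: 3_000 };

console.log(hostedMonthlyCost(500_000_000, hosted)); // 500
console.log(breakEvenTokensPerMonth(hosted, selfHosted)); // 5000000000
```

With these made-up inputs, the hosted API stays cheaper until roughly five billion tokens per month. The sketch ignores GPU throughput limits and utilization, which a real evaluation would need to include.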

Next.js API route pattern

Keep secrets server-side. Never expose API keys in client bundles.

typescript

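// app/api/chat/route.ts
export async function POST(req: Request) {
  const { messages } = await req.json();

  // Abort the upstream request if no response has started within 30 seconds.
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), 30_000);

  try {
    const res = await fetch("https://api.openai.com/v1/chat/completions", {
      method: "POST",
      headers: {
        // The key is read on the server only, so it never reaches the client bundle.
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model: "gpt-4o-mini", messages, stream: true }),
      signal: controller.signal,
    });

    // Surface upstream errors (401, 429, 5xx) instead of streaming them as events.
    if (!res.ok || !res.body) {
      return Response.json({ error: "Upstream model request failed" }, { status: 502 });
    }

    // Pipe the provider's server-sent events straight through to the client.
    return new Response(res.body, {
      headers: { "Content-Type": "text/event-stream", "Cache-Control": "no-cache" },
    });
  } catch (err) {
    // fetch rejects with an AbortError when the timeout fires.
    const timedOut = (err as Error)?.name === "AbortError";
    return Response.json(
      { error: timedOut ? "Upstream timeout" : "Upstream request failed" },
      { status: timedOut ? 504 : 502 },
    );
  } finally {
    // Runs once headers arrive, so the timer covers time to first byte, not the whole stream.
    clearTimeout(timeout);
  }
}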

Streaming

Autoregressive models generate one token at a time. Streaming sends tokens to the client as they are generated, which improves perceived latency even when total generation time is similar.

Client: use fetch + ReadableStream or provider SDK stream helpers.
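
As a rough illustration of the fetch + ReadableStream option, the sketch below reads a streamed body chunk by chunk. `readStream` and `streamChat` are hypothetical helper names, and the request shape assumes the `/api/chat` route shown earlier. Because that route passes the provider's server-sent events through unchanged, each chunk is raw `data: ...` text; parsing those lines into tokens is left out here.

```typescript
// Reads a streamed body chunk by chunk, reporting each decoded piece as it arrives.
async function readStream(
  body: ReadableStream<Uint8Array>,
  onChunk: (text: string) => void,
): Promise<string> {
  const reader = body.getReader();
  const decoder = new TextDecoder();
  let full = "";
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters intact across chunk boundaries.
    const text = decoder.decode(value, { stream: true });
    full += text;
    onChunk(text);
  }
  // Flush any bytes the decoder was still holding.
  const tail = decoder.decode();
  if (tail) {
    full += tail;
    onChunk(tail);
  }
  return full;
}

// Hypothetical usage against the /api/chat route.
async function streamChat(messages: { role: string; content: string }[]): Promise<string> {
  const res = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ messages }),
  });
  if (!res.ok || !res.body) throw new Error(`Chat request failed: ${res.status}`);
  return readStream(res.body, (text) => console.log(text));
}
```

In a UI, `onChunk` would typically append to component state rather than log. Provider SDK stream helpers handle the event parsing for you, which is the main reason to prefer them when they fit.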

Timeouts and retries

Set hard timeouts on every upstream LLM call.
Retry transient failures (429, 503) with exponential backoff, but only for requests that are safe to repeat, and cap the number of attempts.
Do not blindly retry non-idempotent tool side effects (payments, deletes).
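
The retry guidance above can be sketched as a small wrapper. `fetchWithRetry`, its default attempt count, and its base delay are illustrative assumptions rather than recommended values. The wrapper takes a function that issues the request, so it should only be given calls that are safe to repeat.

```typescript
// Status codes treated as transient, per the list above.
const RETRYABLE_STATUS = new Set([429, 503]);

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Calls `request` until it returns a non-retryable status or attempts run out.
// maxAttempts and baseDelayMs defaults are placeholder values.
async function fetchWithRetry(
  request: () => Promise<Response>,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<Response> {
  let res = await request();
  for (let attempt = 1; attempt < maxAttempts && RETRYABLE_STATUS.has(res.status); attempt++) {
    // Discard the failed response body before trying again.
    await res.body?.cancel();
    // Exponential backoff (base, 2x base, 4x base, ...) plus random jitter.
    const delay = baseDelayMs * 2 ** (attempt - 1) + Math.random() * baseDelayMs;
    await sleep(delay);
    res = await request();
  }
  return res;
}
```

A caller would pass something like `() => fetch(url, { ...options, signal })`, creating a fresh timeout per attempt. If the final attempt still returns 429 or 503, that response is returned as-is so the caller can decide how to report it.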

Deployment checklist

Env vars in host dashboard — not in git
Health check route (/api/health)
Edge vs Node runtime: choose deliberately; LLM routes usually need Node, since the Edge runtime supports only a subset of Node APIs
Version pin for model name in config
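
Two of the checklist items, the health check route and the pinned model name, might look like the sketch below. The file path, the `LLM_MODEL` environment variable, and the response shape are assumptions made for illustration. The route reports configuration only and deliberately does not call the model, so health probes stay fast and free.

```typescript
// app/api/health/route.ts (hypothetical file)

// LLM_MODEL is an assumed env var name; the fallback pins an explicit model name.
const MODEL = process.env.LLM_MODEL ?? "gpt-4o-mini";

// Next.js route segment config: run this handler on the Node runtime.
export const runtime = "nodejs";

export async function GET() {
  return Response.json({
    status: "ok",
    model: MODEL,
    // Report only whether the key is set, never the key itself.
    keyConfigured: Boolean(process.env.OPENAI_API_KEY),
  });
}
```

In a real project the `MODEL` constant would more likely live in a shared config module that the chat route imports as well, so the model name is changed in one place.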

Checkpoint

You should be able to explain why the LLM call lives in a server route rather than the browser, and what streaming improves for users.

What's next

Lesson 2 — Caching for LLM apps