
Module 10 — Production & scaling

Cost optimization & token budgets

Input vs output pricing, context trimming, model routing, prompt caching, and setting per-request budgets.

~70 min read + exercises


Before we begin

GenAI bills are metered by the token. A little prompt bloat, multiplied by production traffic, turns into a surprise invoice. Production engineers measure context size and trim it deliberately.

Figure: Where tokens go

Typical bill: input tokens (prompt + RAG) + output tokens. Example split: System 15%, Retrieved 45%, History 20%, Output 20%.
Retrieved chunks and chat history often dominate input cost; output adds on top.

What you will learn

  • Reduce token cost (roadmap quiz topic).
  • Set per-request and per-user budgets.
  • Route tasks to right-sized models.


Input vs output pricing

Providers charge separately for input tokens (prompt + RAG context) and output tokens (the completion). Output is often more expensive per token than input.
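To make the arithmetic concrete, here is a minimal sketch of a per-request cost estimate. The prices are made-up placeholders (not any provider's real rates), and the `Pricing`, `Usage`, and `requestCostUsd` names are hypothetical.

```typescript
// Hypothetical prices in USD per million tokens; substitute your provider's rates.
interface Pricing {
  inputPerMillion: number;
  outputPerMillion: number;
}

interface Usage {
  inputTokens: number;
  outputTokens: number;
}

// Input and output are billed separately, so cost is a weighted sum.
function requestCostUsd(usage: Usage, pricing: Pricing): number {
  return (
    (usage.inputTokens / 1e6) * pricing.inputPerMillion +
    (usage.outputTokens / 1e6) * pricing.outputPerMillion
  );
}

// Placeholder rates where output costs 4x input per token.
const examplePricing: Pricing = { inputPerMillion: 1, outputPerMillion: 4 };

// 4,000 input tokens + 500 output tokens per request.
const perRequest = requestCostUsd({ inputTokens: 4000, outputTokens: 500 }, examplePricing);

// Scale by traffic: 100,000 requests per month.
const monthly = perRequest * 100000;
```

With these placeholder rates a single request costs about $0.006, which looks negligible until it is multiplied by traffic: 100,000 requests come to roughly $600.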

Rough levers:

  1. Fewer or smaller retrieved chunks: for example, top-k=3 instead of 10.
  2. Summarize history: use a sliding window or periodic compression.
  3. Shorter system prompt: every word is resent on every call.
  4. Lower max_tokens: cap runaway answers.
  5. Cheaper model for classification and routing; premium model only for the final answer.
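The sliding-window lever can be sketched as follows. This is an illustration under stated assumptions, not a production implementation: `approxTokens` uses a rough 4-characters-per-token heuristic rather than a real tokenizer, and the `ChatMessage` shape and `trimHistory` name are hypothetical.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Assumption: roughly 4 characters per token for English text.
function approxTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep all system messages, then as many of the most recent turns as fit.
function trimHistory(messages: ChatMessage[], maxTokens: number): ChatMessage[] {
  const system = messages.filter((m) => m.role === "system");
  const turns = messages.filter((m) => m.role !== "system");

  let used = system.reduce((sum, m) => sum + approxTokens(m.content), 0);
  const kept: ChatMessage[] = [];

  // Walk backwards from the newest turn and stop at the first one that overflows.
  for (let i = turns.length - 1; i >= 0; i--) {
    const cost = approxTokens(turns[i].content);
    if (used + cost > maxTokens) break;
    kept.unshift(turns[i]);
    used += cost;
  }
  return [...system, ...kept];
}
```

Dropping the oldest turns is the simplest policy. Periodic compression would instead replace the dropped turns with a short summary message, at the price of an extra model call.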

Model routing

text
User message → small model: "needs_rag? needs_tools?"
  → if FAQ + cache hit → return cached
  → if RAG → retrieve → medium model answer
  → if agent → tool loop with step cap

Sending every request to a single GPT-4-class model is rarely cost-optimal.
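The flow above can be sketched as a pure routing function. Everything here is illustrative: the `Classification` fields stand in for whatever the small classifier model returns, the model names are placeholders, the step cap of 5 is arbitrary, and the "direct" fallback and the agent's model tier are assumptions the diagram does not specify.

```typescript
type Route = "cached" | "rag" | "agent" | "direct";

// Hypothetical output of the small classifier model.
interface Classification {
  isFaq: boolean;
  needsRag: boolean;
  needsTools: boolean;
}

interface Plan {
  route: Route;
  model: string | null; // null means no LLM call is needed
  maxSteps: number; // upper bound on model calls for this request
}

const MAX_AGENT_STEPS = 5; // arbitrary cap for illustration

// Checks follow the order of the diagram: cache, then RAG, then agent.
function planRequest(c: Classification, cacheHit: boolean): Plan {
  if (c.isFaq && cacheHit) {
    return { route: "cached", model: null, maxSteps: 0 };
  }
  if (c.needsRag) {
    return { route: "rag", model: "medium-model", maxSteps: 1 };
  }
  if (c.needsTools) {
    return { route: "agent", model: "medium-model", maxSteps: MAX_AGENT_STEPS };
  }
  // Assumed fallback: simple requests go straight to the small model.
  return { route: "direct", model: "small-model", maxSteps: 1 };
}
```

Keeping the decision in a pure function makes it easy to unit test and to log which route each request took, which in turn shows where the spend actually goes.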


Prompt caching (provider feature)

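When many requests share a long static prefix (system prompt + document bundle), some providers cache the prefix's KV states and bill the repeated prefix tokens at a discounted input rate.

This requires a stable prefix with identical content in identical order. Changing or reordering one chunk typically loses the cache benefit from that point onward, so put static content first and variable content last.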
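One way to keep the prefix stable is to assemble it deterministically. The sketch below only shows that ordering discipline; whether and how a prefix is actually cached depends on the provider. The `Doc` shape and both function names are hypothetical.

```typescript
interface Doc {
  id: string;
  text: string;
}

// Static content first, in a deterministic order, so repeated requests
// produce an identical prefix regardless of retrieval order.
function buildStablePrefix(systemPrompt: string, docs: Doc[]): string {
  const ordered = [...docs].sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0));
  const bundle = ordered.map((d) => `[${d.id}]\n${d.text}`).join("\n\n");
  return `${systemPrompt}\n\n${bundle}`;
}

// Variable content (the user's question) goes last, after the shared prefix.
function buildPrompt(systemPrompt: string, docs: Doc[], question: string): string {
  return `${buildStablePrefix(systemPrompt, docs)}\n\nQuestion: ${question}`;
}
```

Sorting by id means the same document bundle yields the same prefix even if the retriever returns the documents in a different order. Anything that varies per request, such as timestamps or user names, should stay out of the prefix.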


Token budget middleware

Before calling the LLM:

typescript
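// estimateTokens and MAX_INPUT_TOKENS are defined by your application.
const estimated = estimateTokens(messages);
if (estimated > MAX_INPUT_TOKENS) {
  // Reject before spending any tokens on an over-budget request.
  return Response.json({ error: "Context too large" }, { status: 400 });
}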

Log the actual usage reported in the API response and use it to refine your estimates over time.
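A rough sketch of what an estimator and its feedback loop might look like. It assumes a starting ratio of about 4 characters per token (a heuristic, not a real tokenizer) and nudges that ratio toward observed usage with an exponential moving average. The `ChatMessage` shape, `recordActualUsage`, and the `alpha` default are illustrative choices, and this `estimateTokens` is only one possible implementation.

```typescript
interface ChatMessage {
  role: string;
  content: string;
}

// Assumption: start at ~4 characters per token and calibrate from real usage.
let charsPerToken = 4;

function totalChars(messages: ChatMessage[]): number {
  return messages.reduce((sum, m) => sum + m.content.length, 0);
}

// Cheap estimate used for the pre-flight budget check.
function estimateTokens(messages: ChatMessage[]): number {
  return Math.ceil(totalChars(messages) / charsPerToken);
}

// After each call, blend the observed ratio into the running estimate.
// actualInputTokens is the input token count reported by the provider.
function recordActualUsage(
  messages: ChatMessage[],
  actualInputTokens: number,
  alpha: number = 0.1
): void {
  if (actualInputTokens <= 0) return;
  const observed = totalChars(messages) / actualInputTokens;
  charsPerToken = (1 - alpha) * charsPerToken + alpha * observed;
}
```

When exact counts matter, a real tokenizer for the target model is the better choice. The heuristic is mainly useful as a fast guard that runs on every request.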


Latency bottleneck (quiz topic)

In GenAI apps, latency usually comes from:

  1. Retrieval (embed + vector search)
  2. LLM time to first token (TTFT)
  3. Long autoregressive output
  4. Sequential agent tool calls — each round trip adds seconds

Mitigations: parallel retrieval, streaming UI, smaller models, cache hits, fewer agent steps.
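As an illustration of the first mitigation, independent retrieval sources can be queried concurrently rather than one after another. The `Retriever` type and `retrieveInParallel` name are hypothetical; real retrievers would wrap an embedding call plus a vector search.

```typescript
// A retriever takes a query and resolves to a list of text chunks.
type Retriever = (query: string) => Promise<string[]>;

// Start every retriever at once and wait for all of them, so total latency
// is roughly that of the slowest source rather than the sum of all sources.
async function retrieveInParallel(
  query: string,
  retrievers: Retriever[]
): Promise<string[]> {
  const results = await Promise.all(retrievers.map((retrieve) => retrieve(query)));
  // Flatten the per-source results while preserving retriever order.
  return results.reduce<string[]>((all, chunks) => all.concat(chunks), []);
}
```

`Promise.all` rejects as soon as any source fails. If partial results are acceptable, `Promise.allSettled` lets the request continue with whichever sources succeeded.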


Checkpoint

List three concrete ways to cut token spend without removing RAG entirely.


What's next

Lesson 5 — Monitoring, logging & error handling