Cost optimization & token budgets

Before we begin

GenAI bills are metered by the token. A little prompt bloat multiplied by high traffic produces surprise invoices, so production engineers measure and trim context deliberately.

Figure: Where tokens go. Retrieved chunks and chat history often dominate input cost; output adds on top.

What you will learn

Reduce token cost (roadmap quiz topic).
Set per-request and per-user budgets.
Route tasks to right-sized models.

Input vs output pricing

Providers charge separately for input tokens (the prompt plus any retrieved RAG context) and output tokens (the completion). Output is often more expensive per token.
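
To make the input/output split concrete, here is a minimal sketch of a per-request cost calculation. The prices are placeholder values chosen for easy arithmetic, not any provider's real rates, and the names `Pricing`, `Usage`, and `requestCostUsd` are hypothetical.

```typescript
// Hypothetical prices in USD per 1M tokens; substitute your provider's real rates.
interface Pricing {
  inputPerMTok: number;
  outputPerMTok: number;
}

interface Usage {
  inputTokens: number;
  outputTokens: number;
}

// Input and output are metered separately, then summed.
function requestCostUsd(usage: Usage, pricing: Pricing): number {
  return (
    (usage.inputTokens / 1e6) * pricing.inputPerMTok +
    (usage.outputTokens / 1e6) * pricing.outputPerMTok
  );
}

// Placeholder rates: output priced at 4x input.
const examplePricing: Pricing = { inputPerMTok: 1, outputPerMTok: 4 };

// A RAG-style request: large prompt, short answer.
const perRequest = requestCostUsd(
  { inputTokens: 6000, outputTokens: 500 },
  examplePricing
);

// Small per-request numbers add up once multiplied by traffic.
const perDay = perRequest * 100_000;
console.log(perRequest.toFixed(4), perDay.toFixed(2));
```

Under these assumed rates, the 6,000 input tokens still cost more than the 500 output tokens even though output is priced higher per token, which is why trimming context is usually the first lever to try.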

Rough levers:

Fewer retrieved chunks: top-k=3, not 10.
Summarize history: use a sliding window or periodic compression.
Shorter system prompt: every word is resent on every call.
Lower max_tokens: caps runaway answers.
Cheaper model for classification/routing; premium model only for the final answer.
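
As one illustration of the history lever, a sliding window can be sketched as follows. This is a minimal sketch under stated assumptions: `roughTokens` uses a crude "about 4 characters per token" heuristic rather than a real tokenizer, and `ChatMessage` and `trimHistory` are hypothetical names.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Crude heuristic (assumption): roughly 4 characters per token for English text.
function roughTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep every system message, then keep the most recent turns that fit the budget.
function trimHistory(messages: ChatMessage[], maxTokens: number): ChatMessage[] {
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");

  let used = system.reduce((sum, m) => sum + roughTokens(m.content), 0);
  const kept: ChatMessage[] = [];

  // Walk from newest to oldest so the latest turns survive.
  for (const m of [...rest].reverse()) {
    const cost = roughTokens(m.content);
    if (used + cost > maxTokens) break;
    used += cost;
    kept.unshift(m); // restore chronological order
  }
  return [...system, ...kept];
}
```

Periodic compression would replace the dropped turns with a model-written summary instead of discarding them; the window above is the cheaper of the two options because it needs no extra model call.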

Model routing

text

User message → small model: "needs_rag? needs_tools?"
  → if FAQ + cache hit → return cached
  → if RAG → retrieve → medium model answer
  → if agent → tool loop with step cap

Using one GPT-4-class model for every step is rarely cost-optimal.
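
The routing flow above might be expressed in code roughly like this. It is a sketch, not a prescribed design: the `Classification` shape stands in for whatever the small model returns, the tier names in `MODEL_FOR_ROUTE` are placeholders, the `direct` fallback route and the step cap value are assumptions that do not appear in the diagram.

```typescript
type Route = "cached" | "rag" | "agent" | "direct";

// Assumed output of the small classifier model.
interface Classification {
  isFaq: boolean;
  needsRag: boolean;
  needsTools: boolean;
}

// Placeholder tier names; map them to the models you actually deploy.
const MODEL_FOR_ROUTE: Record<Exclude<Route, "cached">, string> = {
  direct: "small-model",
  rag: "medium-model",
  agent: "large-model",
};

// Step cap for the agent tool loop (assumed value).
const MAX_AGENT_STEPS = 5;

// Mirrors the order of the diagram: cache first, then RAG, then agent.
function chooseRoute(c: Classification, cacheHit: boolean): Route {
  if (c.isFaq && cacheHit) return "cached";
  if (c.needsRag) return "rag";
  if (c.needsTools) return "agent";
  return "direct"; // assumption: simple messages go straight to the small model
}

const route = chooseRoute({ isFaq: false, needsRag: true, needsTools: false }, false);
console.log(route, route === "cached" ? "(no model call)" : MODEL_FOR_ROUTE[route]);
```

Keeping the decision in a pure function like `chooseRoute` makes the routing policy easy to unit-test and to tune separately from the model calls themselves.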

Prompt caching (provider feature)

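When many requests share a long static prefix (system prompt + document bundle), providers may cache the prefix's KV states and bill the repeated prefix tokens at a discounted input rate.

This requires a stable prefix ordering: the cache matches from the start of the prompt, so changing one chunk loses the cache benefit for that chunk and everything after it.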
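
One way to keep the prefix stable is to serialize the static parts deterministically and append per-request content last. The sketch below assumes documents carry a unique `id`; `Doc` and `buildPrompt` are hypothetical names, and no provider-specific caching API is shown because those differ between providers.

```typescript
interface Doc {
  id: string;
  text: string;
}

function buildPrompt(systemPrompt: string, docs: Doc[], question: string) {
  // Sort by id so the same bundle always serializes to the same string,
  // regardless of the order the retriever or database returned it in.
  const ordered = [...docs].sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0));

  const staticPrefix =
    systemPrompt + "\n\n" + ordered.map((d) => `[${d.id}]\n${d.text}`).join("\n\n");

  // Per-request content goes last so it never disturbs the cacheable prefix.
  return { staticPrefix, fullPrompt: staticPrefix + "\n\nQuestion: " + question };
}

const bundle: Doc[] = [
  { id: "pricing", text: "Plans start at the basic tier." },
  { id: "faq", text: "Refunds are available for 30 days." },
];

const first = buildPrompt("You are a support assistant.", bundle, "How do refunds work?");
const second = buildPrompt("You are a support assistant.", [...bundle].reverse(), "What plans exist?");
console.log(first.staticPrefix === second.staticPrefix);
```

Anything that varies per request, such as timestamps or user names, belongs after the static prefix for the same reason.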

Token budget middleware

Before calling the LLM:

typescript

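// Reject oversized prompts before spending any tokens on them.
// estimateTokens and MAX_INPUT_TOKENS are defined by your app; the estimate is approximate.
const estimated = estimateTokens(messages);
if (estimated > MAX_INPUT_TOKENS) {
  // Fail fast with a client error instead of forwarding the request to the provider.
  return Response.json({ error: "Context too large" }, { status: 400 });
}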
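
The same idea extends from per-request limits to per-user budgets. The following is a minimal in-memory sketch; `UserTokenBudget` is a hypothetical class, the daily window is an assumed policy, and a multi-instance deployment would likely need a shared store instead of a local `Map`.

```typescript
// Tracks tokens spent per user within the current budget window.
class UserTokenBudget {
  private used = new Map<string, number>();
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  // Records the spend and returns true if the user has room; otherwise returns false.
  tryConsume(userId: string, tokens: number): boolean {
    const current = this.used.get(userId) ?? 0;
    if (current + tokens > this.limit) return false;
    this.used.set(userId, current + tokens);
    return true;
  }

  // Tokens the user can still spend in this window.
  remaining(userId: string): number {
    return this.limit - (this.used.get(userId) ?? 0);
  }

  // Call on a schedule (for example once a day) to start a new window.
  reset(): void {
    this.used.clear();
  }
}

const budget = new UserTokenBudget(1000);
console.log(budget.tryConsume("user-1", 600), budget.tryConsume("user-1", 500));
```

A request that fails `tryConsume` can be rejected up front or downgraded to a cheaper model, depending on the product.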

Log the actual usage reported in the API response and use it to refine your estimates over time.
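
Refining estimates from logged usage could look like the sketch below. It assumes a character-based estimator whose characters-per-token ratio starts at a guess of 4 and is nudged toward observed values with an exponential moving average; `TokenEstimator` and its methods are hypothetical, and the actual token count would come from whatever usage field your provider returns.

```typescript
class TokenEstimator {
  // Starting guess (assumption): about 4 characters per token.
  private charsPerToken = 4;

  estimate(text: string): number {
    return Math.ceil(text.length / this.charsPerToken);
  }

  // Blend the observed ratio into the running value.
  // weight controls how quickly new observations override the old estimate.
  observe(text: string, actualTokens: number, weight = 0.2): void {
    if (actualTokens <= 0 || text.length === 0) return;
    const observed = text.length / actualTokens;
    this.charsPerToken = (1 - weight) * this.charsPerToken + weight * observed;
  }
}

const estimator = new TokenEstimator();
const prompt = "a".repeat(400);
const before = estimator.estimate(prompt);

// Suppose the API reported 200 input tokens for this prompt.
estimator.observe(prompt, 200, 0.5);
const after = estimator.estimate(prompt);
console.log(before, after);
```

If exact counts matter more than speed, a real tokenizer for your model family is the more accurate option; the heuristic is only a cheap pre-check.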

Latency bottleneck (quiz topic)

In GenAI apps, latency usually comes from:

Retrieval (embed + vector search)
LLM time to first token (TTFT)
Long autoregressive output
Sequential agent tool calls — each round trip adds seconds

Mitigations: parallel retrieval, streaming UI, smaller models, cache hits, fewer agent steps.
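
Parallel retrieval, the first mitigation, can be sketched with `Promise.all`. The two search functions below are stand-ins that simulate I/O with a timer; in a real app they would be calls to your vector store or other indexes, and the names are hypothetical.

```typescript
// Stand-in for real I/O: resolves with the given value after ms milliseconds.
function delay<T>(ms: number, value: T): Promise<T> {
  return new Promise((resolve) => setTimeout(() => resolve(value), ms));
}

// Hypothetical retrieval sources.
const searchDocs = (q: string) => delay(50, [`doc hit for "${q}"`]);
const searchFaq = (q: string) => delay(50, [`faq hit for "${q}"`]);

// Sequential: total latency is the sum of both lookups.
async function retrieveSequential(q: string): Promise<string[]> {
  const docs = await searchDocs(q);
  const faq = await searchFaq(q);
  return [...docs, ...faq];
}

// Parallel: independent lookups start together, so latency is roughly the slower one.
async function retrieveParallel(q: string): Promise<string[]> {
  const [docs, faq] = await Promise.all([searchDocs(q), searchFaq(q)]);
  return [...docs, ...faq];
}
```

This only helps when the lookups are independent. Agent tool calls that depend on each other's results stay sequential, which is why capping the number of steps matters there.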

Checkpoint

List three concrete ways to cut token spend without removing RAG entirely.

What's next

Lesson 5 — Monitoring, logging & error handling