Cost optimization & token budgets
Before we begin
GenAI bills are token meters. Small prompt bloat × traffic = surprise invoices. Production engineers measure and trim context deliberately.
Figure: Where tokens go
What you will learn
- Reduce token cost (roadmap quiz topic).
- Set per-request and per-user budgets.
- Route tasks to right-sized models.
Input vs output pricing
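Providers charge separately for input tokens (system prompt, chat history, and retrieved RAG context) and output tokens (the completion). Output is often more expensive per token, but input usually dominates volume because the whole context is resent on every call.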
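To make the input/output split concrete, here is a small cost calculator. The prices and traffic figures are made-up assumptions for illustration, not any provider's real rates; `requestCostUsd`, `Pricing`, and `Usage` are hypothetical names.

```typescript
// Hypothetical per-million-token prices in USD; real prices vary by provider and model.
interface Pricing {
  inputPerMillion: number;
  outputPerMillion: number;
}

interface Usage {
  inputTokens: number;
  outputTokens: number;
}

// Cost of a single request under the given pricing.
function requestCostUsd(usage: Usage, pricing: Pricing): number {
  return (
    (usage.inputTokens * pricing.inputPerMillion +
      usage.outputTokens * pricing.outputPerMillion) /
    1000000
  );
}

// Assumed prices: output costs 5x input per token.
const examplePricing: Pricing = { inputPerMillion: 3, outputPerMillion: 15 };

// A RAG request with 4,000 input tokens (prompt + chunks) and 500 output tokens.
const perRequest = requestCostUsd({ inputTokens: 4000, outputTokens: 500 }, examplePricing);

// Scale to an assumed 100,000 requests per day.
const perDay = perRequest * 100000;
console.log(perRequest.toFixed(4), perDay.toFixed(2));
```

Under these assumed numbers a request costs about two cents, and input accounts for more of the bill than output even though output is pricier per token. That is why trimming context is usually the first lever to pull.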
Rough levers:
- Fewer and smaller retrieved chunks, e.g. top-k=3 instead of 10.
- Summarize history — sliding window or periodic compression.
- Shorter system prompt — every word repeats on every call.
- Lower max_tokens — stop runaway answers.
- Cheaper model for classification/routing; premium model only for final answer.
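As one illustration of the "summarize history" lever, the sketch below implements a simple sliding window: it keeps the system message and as many of the most recent turns as fit in a token budget. The `approxTokens` heuristic (about 4 characters per token) and the names `trimHistory` and `ChatMessage` are assumptions for this example; a real app would use the provider's tokenizer.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Crude heuristic: roughly 4 characters per token for English text (an assumption).
function approxTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep the system message(s) plus the most recent turns that fit in the budget.
function trimHistory(messages: ChatMessage[], maxTokens: number): ChatMessage[] {
  const system = messages.filter((m) => m.role === "system");
  const turns = messages.filter((m) => m.role !== "system");

  let used = system.reduce((sum, m) => sum + approxTokens(m.content), 0);
  const kept: ChatMessage[] = [];

  // Walk backwards from the newest turn and stop once the budget is exhausted.
  for (let i = turns.length - 1; i >= 0; i--) {
    const cost = approxTokens(turns[i].content);
    if (used + cost > maxTokens) break;
    kept.unshift(turns[i]);
    used += cost;
  }
  return [...system, ...kept];
}
```

Dropping old turns loses information, so periodic compression (summarizing the dropped turns into one short message) is the usual complement to this window.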
Model routing
User message → small model: "needs_rag? needs_tools?"
  → if FAQ + cache hit → return cached
  → if RAG → retrieve → medium model answer
  → if agent → tool loop with step cap
One gpt-4-class model for everything is rarely optimal.
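The routing flow above can be sketched as a small decision function. Everything here is illustrative: `classify` stands in for the small-model call and uses keyword rules only so the example runs offline, while the model tiers, the `direct` fallback, and the step cap of 5 are assumptions rather than recommendations.

```typescript
type Route = "cached" | "rag" | "agent" | "direct";

interface Classification {
  isFaq: boolean;
  needsRag: boolean;
  needsTools: boolean;
}

interface Plan {
  route: Route;
  modelTier: "none" | "small" | "medium" | "large";
  maxSteps: number;
}

// Stand-in for the small-model classifier. A real system would prompt a cheap
// model and parse a structured reply; keyword rules keep this sketch runnable.
function classify(message: string): Classification {
  const text = message.toLowerCase();
  return {
    isFaq: text.includes("opening hours"),
    needsRag: text.includes("policy") || text.includes("docs"),
    needsTools: text.includes("book") || text.includes("cancel"),
  };
}

const MAX_AGENT_STEPS = 5; // hypothetical cap on tool-loop iterations

// Same branch order as the flow: cache hit, then RAG, then agent.
function planRequest(message: string, faqCache: Map<string, string>): Plan {
  const c = classify(message);
  if (c.isFaq && faqCache.has(message)) {
    return { route: "cached", modelTier: "none", maxSteps: 0 };
  }
  if (c.needsRag) {
    return { route: "rag", modelTier: "medium", maxSteps: 1 };
  }
  if (c.needsTools) {
    return { route: "agent", modelTier: "large", maxSteps: MAX_AGENT_STEPS };
  }
  // Assumed fallback: answer simple chat directly with the small model.
  return { route: "direct", modelTier: "small", maxSteps: 1 };
}
```

The point of the design is that the expensive tier is reached only when the cheap classifier says it is needed, and the agent path always carries an explicit step cap.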
Prompt caching (provider feature)
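When many requests share a long static prefix (system prompt + document bundle), some providers cache the computed KV state for that prefix and bill the repeated input tokens at a discount.
This requires an identical prefix in a stable order: changing one chunk invalidates the cache from that point onward, so keep static content first and per-request content last.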
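A minimal sketch of cache-friendly prompt assembly, assuming the provider matches on an exact leading span: static parts go first in a deterministic order, and the per-request question goes last. `buildPrompt` and `sharedPrefixLength` are hypothetical helpers; real caches work on tokens and have provider-specific rules (minimum lengths, expiry) that this does not model.

```typescript
interface PromptParts {
  systemPrompt: string;
  documents: { id: string; text: string }[];
  userQuestion: string;
}

// Static content first, in a deterministic order; per-request content last.
function buildPrompt(parts: PromptParts): { prefix: string; suffix: string } {
  // Sort by id so the same document bundle always serializes identically.
  const sortedDocs = [...parts.documents].sort((a, b) =>
    a.id < b.id ? -1 : a.id > b.id ? 1 : 0
  );
  const prefix = [
    parts.systemPrompt,
    ...sortedDocs.map((d) => `[${d.id}]\n${d.text}`),
  ].join("\n\n");
  return { prefix, suffix: parts.userQuestion };
}

// Length of the shared leading span between two prompts, in characters.
// Useful in tests or logs to spot accidental prefix drift.
function sharedPrefixLength(a: string, b: string): number {
  let i = 0;
  while (i < a.length && i < b.length && a[i] === b[i]) i++;
  return i;
}
```

Two requests that pass the same documents in a different order produce the same `prefix` here, which is the property a prefix cache depends on. Injecting a timestamp or user name near the top of the system prompt would break it.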
Token budget middleware
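Before calling the LLM:
// estimateTokens and MAX_INPUT_TOKENS are defined by your app.
const estimated = estimateTokens(messages);
if (estimated > MAX_INPUT_TOKENS) {
  // Reject early instead of paying for an oversized prompt.
  return Response.json({ error: "Context too large" }, { status: 400 });
}
Log actual usage from the API response and refine estimates over time.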
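One way to refine estimates from logged usage is to calibrate a characters-per-token ratio against the input token counts the provider reports. The sketch below is an assumption-laden example: `TokenEstimator` is a hypothetical class, the starting ratio of 4 is a rough guess for English text, and the usage field that supplies `actualInputTokens` is named differently by each provider.

```typescript
class TokenEstimator {
  // Start from a rough 4 characters per token (an assumption for English text).
  private charsPerToken = 4;

  private countChars(messages: { content: string }[]): number {
    return messages.reduce((sum, m) => sum + m.content.length, 0);
  }

  // Estimate input tokens before the call.
  estimate(messages: { content: string }[]): number {
    return Math.ceil(this.countChars(messages) / this.charsPerToken);
  }

  // After the call, blend the observed ratio into the running one
  // (exponential moving average; alpha controls how fast it adapts).
  record(messages: { content: string }[], actualInputTokens: number, alpha = 0.2): void {
    const chars = this.countChars(messages);
    if (actualInputTokens <= 0 || chars === 0) return;
    const observed = chars / actualInputTokens;
    this.charsPerToken = (1 - alpha) * this.charsPerToken + alpha * observed;
  }
}

// Usage: estimate, call the model, then feed back the reported input tokens.
const estimator = new TokenEstimator();
const sample = [{ content: "x".repeat(400) }];
const before = estimator.estimate(sample);
estimator.record(sample, 200); // pretend the API reported 200 input tokens
const after = estimator.estimate(sample);
console.log(before, after);
```

The estimate moves toward the observed ratio gradually, so a single unusual request (for example, one full of code or non-English text) does not swing the budget check.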
Latency bottleneck (quiz topic)
In GenAI apps, latency usually comes from:
- Retrieval (embed + vector search)
- LLM time to first token (TTFT)
- Long autoregressive output
- Sequential agent tool calls — each round trip adds seconds
Mitigations: parallel retrieval, streaming UI, smaller models, cache hits, fewer agent steps.
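A back-of-the-envelope latency model helps show which mitigation pays off. The function below adds up the four sources listed above; all timings in the usage example are invented for illustration, and `estimateLatencyMs` is a hypothetical helper, not a measurement tool.

```typescript
interface LatencyInputs {
  retrievalLegsMs: number[]; // one entry per retrieval call (embed + vector search)
  parallelRetrieval: boolean; // run the legs concurrently or one after another
  ttftMs: number; // LLM time to first token
  outputTokens: number;
  tokensPerSecond: number; // decode speed of the chosen model
  agentSteps: number; // sequential tool round trips before the final answer
  msPerAgentStep: number;
}

function estimateLatencyMs(x: LatencyInputs): { firstTokenMs: number; totalMs: number } {
  // Parallel legs cost as much as the slowest one; sequential legs add up.
  const retrieval = x.parallelRetrieval
    ? Math.max(0, ...x.retrievalLegsMs)
    : x.retrievalLegsMs.reduce((a, b) => a + b, 0);
  const agent = x.agentSteps * x.msPerAgentStep;
  // With a streaming UI, this is what the user perceives as the wait.
  const firstTokenMs = retrieval + agent + x.ttftMs;
  const generationMs = (x.outputTokens / x.tokensPerSecond) * 1000;
  return { firstTokenMs, totalMs: firstTokenMs + generationMs };
}

// Assumed baseline: two sequential retrievals, three agent steps, long answer.
const baseline = estimateLatencyMs({
  retrievalLegsMs: [300, 400],
  parallelRetrieval: false,
  ttftMs: 600,
  outputTokens: 500,
  tokensPerSecond: 50,
  agentSteps: 3,
  msPerAgentStep: 2000,
});

// Same request with parallel retrieval, one agent step, and a shorter answer.
const tuned = estimateLatencyMs({
  retrievalLegsMs: [300, 400],
  parallelRetrieval: true,
  ttftMs: 600,
  outputTokens: 200,
  tokensPerSecond: 50,
  agentSteps: 1,
  msPerAgentStep: 2000,
});
console.log(baseline, tuned);
```

With these made-up numbers, agent round trips and output length dominate while retrieval is a minor term. Plugging in measurements from your own traces tells you whether the same holds for your app.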
Checkpoint
List three concrete ways to cut token spend without removing RAG entirely.