Caching for LLM apps
Before we begin
The same user question, or a semantically similar one, should not trigger the full embed + retrieve + LLM pipeline every time. Caching cuts both cost and latency.
Figure: Cache-aside flow
What you will learn
- Why caching matters for LLM apps (roadmap quiz topic).
- Cache embeddings, retrieval results, and final answers.
- Design keys and TTLs that stay correct when data changes.
What to cache
| Layer | Cache what | Saves |
|---|---|---|
| Embedding | Vector for normalized query text | Embedding API cost |
| Retrieval | Top-k chunk IDs for query + index version | FAISS / DB work |
| Completion | Final answer for query + context hash | LLM cost (the largest share) |
Start with a completion cache for FAQ-style traffic; add an embedding cache when repeat queries dominate.
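The completion layer can be sketched as a small cache-aside wrapper: check the cache, run the expensive path only on a miss, then store the result. This is a minimal sketch, not a definitive implementation. `AnswerStore`, `MemoryStore`, and `answerWithCache` are hypothetical names introduced for illustration, and the in-memory store only stands in for a real cache so the example runs without external services.

```typescript
// Minimal store interface (hypothetical); a Redis-backed store could implement it too.
interface AnswerStore {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSec: number): Promise<void>;
}

// In-memory stand-in for demos and tests; entries expire after ttlSec seconds.
class MemoryStore implements AnswerStore {
  private entries = new Map<string, { value: string; expiresAt: number }>();

  async get(key: string): Promise<string | null> {
    const entry = this.entries.get(key);
    if (!entry) return null;
    if (Date.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired: treat as a miss
      return null;
    }
    return entry.value;
  }

  async set(key: string, value: string, ttlSec: number): Promise<void> {
    this.entries.set(key, { value, expiresAt: Date.now() + ttlSec * 1000 });
  }
}

// Cache-aside: read the cache first, compute only on a miss, then store the result.
async function answerWithCache(
  store: AnswerStore,
  key: string,
  compute: () => Promise<string>, // the expensive embed + retrieve + LLM path
  ttlSec = 3600,
): Promise<{ answer: string; cacheHit: boolean }> {
  const cached = await store.get(key);
  if (cached !== null) return { answer: cached, cacheHit: true };

  const answer = await compute();
  await store.set(key, answer, ttlSec);
  return { answer, cacheHit: false };
}
```

Returning `cacheHit` alongside the answer makes it easy to log the hit rate, which is the number that tells you whether a cache layer is paying for itself.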
Redis cache-aside (Node)
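```typescript
import { createClient } from "redis";

// Connect once at startup and reuse the client.
const redis = createClient({ url: process.env.REDIS_URL });
await redis.connect();

// Returns the cached answer, or null on a cache miss.
async function getCachedAnswer(key: string) {
  return redis.get(key);
}

// Stores the answer with a TTL in seconds (default: one hour).
async function setCachedAnswer(key: string, value: string, ttlSec = 3600) {
  await redis.set(key, value, { EX: ttlSec });
}
```

Cache keys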
Include everything that affects the answer:
```
rag:v3:gpt-4o-mini:sha256(normalized_question + index_version)
```

Invalidate when:
- Blog/docs re-indexed → bump `index_version`
- System prompt changes → include prompt hash in key
- User-specific prefs matter → include `userId`
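The key format and the rules above could be combined into one key builder. The sketch below is an assumption-laden illustration: `KeyParts`, `normalizeQuestion`, and `buildCacheKey` are hypothetical names, the normalization (trim, lowercase, collapse whitespace) is one plausible choice, and the prompt hash and user ID are folded into the SHA-256 digest rather than kept as separate key segments.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape: every field here affects the answer, so every field goes into the key.
interface KeyParts {
  schemaVersion: string; // e.g. "v3"; bump to abandon all old entries at once
  model: string;         // e.g. "gpt-4o-mini"
  question: string;
  indexVersion: string;  // bump after blog/docs are re-indexed
  promptHash: string;    // hash of the system prompt
  userId?: string;       // only when user-specific prefs affect the answer
}

// Assumed normalization: trim, lowercase, collapse runs of whitespace.
function normalizeQuestion(question: string): string {
  return question.trim().toLowerCase().replace(/\s+/g, " ");
}

function buildCacheKey(parts: KeyParts): string {
  // Serializing an array keeps field boundaries unambiguous before hashing.
  const payload = JSON.stringify([
    normalizeQuestion(parts.question),
    parts.indexVersion,
    parts.promptHash,
    parts.userId ?? "",
  ]);
  const digest = createHash("sha256").update(payload).digest("hex");
  return `rag:${parts.schemaVersion}:${parts.model}:${digest}`;
}
```

With this shape, re-indexing never requires deleting entries: bumping `indexVersion` changes every key, and the stale entries simply age out through their TTL.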
Never cache personal medical/legal answers with a shared key unless policy allows.
Semantic cache (optional)
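Exact-match keys miss paraphrases. A semantic cache stores the embedding of each cached question; on a new query, if its cosine similarity to a cached entry exceeds a threshold, the stored answer is returned.
Trade-off: a threshold that is too loose returns the wrong answer for a merely similar question, so tune the threshold and TTL carefully.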
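The lookup described above might look like the following in-memory sketch. `SemanticCache` and `cosineSimilarity` are hypothetical names, the default threshold of 0.95 is only a placeholder to be tuned against real traffic, and the linear scan assumes a small cache; a larger one would likely need a vector index. TTL handling is omitted to keep the example short.

```typescript
type SemanticEntry = { embedding: number[]; answer: string };

// Cosine similarity of two equal-length vectors; returns 0 if either is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

class SemanticCache {
  private entries: SemanticEntry[] = [];
  private threshold: number;

  // The threshold is an assumption to tune: higher means fewer wrong hits, fewer hits overall.
  constructor(threshold = 0.95) {
    this.threshold = threshold;
  }

  add(embedding: number[], answer: string): void {
    this.entries.push({ embedding, answer });
  }

  // Returns the answer of the most similar entry above the threshold, or null on a miss.
  lookup(queryEmbedding: number[]): string | null {
    let best: SemanticEntry | null = null;
    let bestScore = this.threshold;
    for (const entry of this.entries) {
      const score = cosineSimilarity(queryEmbedding, entry.embedding);
      if (score > bestScore) {
        best = entry;
        bestScore = score;
      }
    }
    return best ? best.answer : null;
  }
}
```

Picking the best match rather than the first match above the threshold reduces wrong hits when several cached questions are close to the new query.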
Checkpoint
Explain why caching is important in LLM apps: duplicate work is expensive; cache hits improve p95 latency and lower the monthly bill.