
Module 10 — Production & scaling

Caching for LLM apps

Redis cache-aside, embedding and completion caches, cache keys, TTLs, and invalidation when indexes change.

~65 min read + exercises


Before we begin

A repeated user question, or a semantically similar one, should not trigger the full embed + retrieve + LLM pipeline every time. Caching cuts both cost and latency.

Figure: Cache-aside flow (Request → Redis? → hit: skip to Response; miss: Embed/LLM → store → Response)
Check Redis first; on miss, compute, store with TTL, then respond.

What you will learn

  • Why caching matters for LLM apps (roadmap quiz topic).
  • Cache embeddings, retrieval results, and final answers.
  • Design keys and TTLs that stay correct when data changes.


What to cache

Layer       | Cache what                                  | Saves
Embedding   | Vector for normalized query text            | Embedding API $
Retrieval   | Top-k chunk IDs for query + index version   | FAISS / DB work
Completion  | Final answer for query + context hash       | Largest LLM $

Start with a completion cache for FAQ-style traffic; add an embedding cache when repeat queries dominate.
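As a rough illustration of the embedding layer in the table, the sketch below memoizes an embedding call by normalized query text. The names `EmbedFn`, `normalizeQuery`, and `withEmbeddingCache` are hypothetical, the normalization rule (trim, lowercase, collapse whitespace) is an assumption, and the in-process `Map` stands in for Redis to keep the example self-contained.

```typescript
// Hypothetical signature for whatever embedding client the app uses.
type EmbedFn = (text: string) => Promise<number[]>;

// Assumed normalization: trim, lowercase, collapse runs of whitespace.
function normalizeQuery(text: string): string {
  return text.trim().toLowerCase().replace(/\s+/g, " ");
}

// Wraps an embed function so each normalized query is embedded at most once.
function withEmbeddingCache(embed: EmbedFn): EmbedFn {
  // Caching the promise also deduplicates concurrent requests for the same text.
  const cache = new Map<string, Promise<number[]>>();
  return (text: string) => {
    const key = normalizeQuery(text);
    let pending = cache.get(key);
    if (!pending) {
      pending = embed(key);
      cache.set(key, pending);
      // Drop failed calls so a later request can retry.
      pending.catch(() => cache.delete(key));
    }
    return pending;
  };
}
```

With this wrapper, "  Hello   World " and "hello world" share one cache entry, so the underlying embedding call runs once for both.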


Redis cache-aside (Node)

typescript
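import { createClient } from "redis";

// REDIS_URL is read from the environment, e.g. redis://localhost:6379
const redis = createClient({ url: process.env.REDIS_URL });
// node-redis reports connection problems on the client's "error" event
redis.on("error", (err) => console.error("Redis client error", err));
await redis.connect(); // top-level await requires an ES module

// Returns the cached answer, or null on a cache miss
async function getCachedAnswer(key: string): Promise<string | null> {
  return redis.get(key);
}

// Stores the answer with a TTL in seconds (EX) so stale entries expire
async function setCachedAnswer(key: string, value: string, ttlSec = 3600): Promise<void> {
  await redis.set(key, value, { EX: ttlSec });
}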
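The get and set helpers above are the two halves of cache-aside; one way to join them into the check, compute, store, respond flow is sketched here. `CacheStore`, `MemoryStore`, and `cacheAside` are hypothetical names, and the in-memory store is only a stand-in so the sketch runs without a Redis server.

```typescript
// Minimal store shape (an assumption); a Redis-backed version would wrap get/set.
interface CacheStore {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSec: number): Promise<void>;
}

// In-memory stand-in with TTL handling, for local runs and tests.
class MemoryStore implements CacheStore {
  private entries = new Map<string, { value: string; expiresAt: number }>();

  async get(key: string): Promise<string | null> {
    const entry = this.entries.get(key);
    if (!entry) return null;
    if (Date.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired: treat as a miss
      return null;
    }
    return entry.value;
  }

  async set(key: string, value: string, ttlSec: number): Promise<void> {
    this.entries.set(key, { value, expiresAt: Date.now() + ttlSec * 1000 });
  }
}

// Cache-aside: check the store first; on a miss, compute, store with a TTL, then respond.
async function cacheAside(
  store: CacheStore,
  key: string,
  compute: () => Promise<string>,
  ttlSec = 3600,
): Promise<{ value: string; hit: boolean }> {
  const cached = await store.get(key);
  if (cached !== null) return { value: cached, hit: true };
  const value = await compute(); // the expensive embed + retrieve + LLM work
  await store.set(key, value, ttlSec);
  return { value, hit: false };
}
```

To back this with Redis, a `CacheStore` whose `set` calls `redis.set(key, value, { EX: ttlSec })` would likely be enough. Returning the `hit` flag makes it easy to log a hit rate, which helps when deciding whether the cache is worth its complexity.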

Cache keys

Include everything that affects the answer:

text
rag:v3:gpt-4o-mini:sha256(normalized_question + index_version)

Invalidate or partition entries when:

  • Blog/docs re-indexed → bump index_version
  • System prompt changes → include prompt hash in key
  • User-specific prefs matter → include userId

Never cache personal medical/legal answers with a shared key unless policy allows.
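The key format and the rules above can be combined in a small builder. This is a sketch under assumptions: `KeyParts` and `buildCacheKey` are hypothetical names, the normalization rule is a guess at what "normalized" means here, and the hashed fields are JSON-encoded rather than concatenated so that field boundaries stay unambiguous.

```typescript
import { createHash } from "node:crypto";

// Everything that can change the answer (field names are illustrative).
interface KeyParts {
  question: string;
  indexVersion: string; // bump when blog/docs are re-indexed
  model: string;        // e.g. "gpt-4o-mini"
  promptHash: string;   // hash of the system prompt
  userId?: string;      // only when user-specific prefs affect the answer
}

function buildCacheKey(parts: KeyParts, schemaVersion = "v3"): string {
  // Assumed normalization: trim, lowercase, collapse whitespace.
  const normalized = parts.question.trim().toLowerCase().replace(/\s+/g, " ");
  // JSON array encoding avoids collisions such as "ab" + "c" vs "a" + "bc".
  const payload = JSON.stringify([
    normalized,
    parts.indexVersion,
    parts.promptHash,
    parts.userId ?? "",
  ]);
  const digest = createHash("sha256").update(payload, "utf8").digest("hex");
  return `rag:${schemaVersion}:${parts.model}:${digest}`;
}
```

Because `indexVersion` is part of the hash, bumping it after a re-index makes every old key unreachable without an explicit delete; the old entries simply age out through their TTL.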


Semantic cache (optional)

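An exact-match cache misses paraphrases. A semantic cache stores the embedding of each cached question; for a new query, if its cosine similarity to a cached entry exceeds a threshold, the stored answer is returned.

Trade-off: a threshold set too low returns a stored answer for a question that only looks similar, so tune the threshold and TTL carefully.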
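A minimal sketch of the semantic lookup, assuming embeddings are plain number arrays and that a linear scan over cached entries is acceptable. `SemanticEntry` and `semanticLookup` are hypothetical names, and the default threshold of 0.95 is only a placeholder to tune against real traffic.

```typescript
interface SemanticEntry {
  embedding: number[]; // embedding of the cached question
  answer: string;      // stored answer for that question
}

// Cosine similarity: dot(a, b) / (|a| * |b|); returns 0 for a zero vector.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns the answer of the most similar entry if it clears the threshold, else null.
function semanticLookup(
  entries: SemanticEntry[],
  query: number[],
  threshold = 0.95,
): string | null {
  let best: SemanticEntry | null = null;
  let bestScore = -Infinity;
  for (const entry of entries) {
    const score = cosineSimilarity(entry.embedding, query);
    if (score > bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  return best !== null && bestScore > threshold ? best.answer : null;
}
```

A linear scan is fine for a few thousand entries; beyond that, a vector index would probably be needed. Logging `bestScore` for near-misses is one way to pick a threshold from data rather than by guesswork.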


Checkpoint

Explain why caching is important in LLM apps: duplicate work is expensive, and cache hits lower both p95 latency and the monthly bill.


What's next

Lesson 3 — Rate limiting & guardrails