Caching for LLM apps

Before we begin

The same user question, or a semantically similar one, should not trigger the full embed, retrieve, and LLM pipeline every time. Caching cuts both cost and latency.

Figure: Cache-aside flow. Check Redis first; on a miss, compute, store with a TTL, then respond.

What you will learn

Why caching matters for LLM apps (roadmap quiz topic).
Cache embeddings, retrieval results, and final answers.
Design keys and TTLs that stay correct when data changes.

What to cache

Layer	Cache what	Saves
Embedding	Vector for normalized query text	Embedding API $
Retrieval	Top-k chunk IDs for query + index version	FAISS / DB work
Completion	Final answer for query + context hash	Largest LLM $

Start with completion cache for FAQ-style traffic; add embedding cache when repeat queries dominate.
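
The embedding row of the table caches by normalized query text, so that trivially different spellings of one question share a single entry. A minimal sketch of such a normalizer follows. The name `normalizeQuery` and the specific rules (Unicode NFKC folding, lowercasing, whitespace collapsing) are assumptions for illustration; the lesson does not prescribe them.

```typescript
// Hypothetical normalizer: the rules below are illustrative assumptions.
function normalizeQuery(raw: string): string {
  return raw
    .normalize("NFKC")      // fold Unicode compatibility forms (e.g. full-width letters)
    .toLowerCase()
    .trim()
    .replace(/\s+/g, " ");  // collapse runs of whitespace into one space
}

// Both spellings produce the same string, so they would share one cache entry.
console.log(normalizeQuery("  What is   RAG? "));
console.log(normalizeQuery("what is rag?"));
```

Normalizing more aggressively (for example, stripping punctuation) raises the hit rate but also raises the chance that two different questions collide, so it is usually safer to start conservative.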

Redis cache-aside (Node)

typescript

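import { createClient } from "redis";

// Connect once at startup and reuse the client across requests.
const redis = createClient({ url: process.env.REDIS_URL });
redis.on("error", (err) => console.error("Redis client error", err));
await redis.connect();

// Returns the cached answer, or null on a cache miss.
async function getCachedAnswer(key: string) {
  return redis.get(key);
}

// Stores the answer with a TTL in seconds (EX) so stale entries expire on their own.
async function setCachedAnswer(key: string, value: string, ttlSec = 3600) {
  await redis.set(key, value, { EX: ttlSec });
}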
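
The get and set helpers above are the two halves of cache-aside; the read path that ties them together might look like the sketch below. To keep it runnable without a Redis server, it is written against a small `AnswerStore` interface with an in-memory `MemoryStore` stand-in. `AnswerStore`, `MemoryStore`, `getOrCompute`, and `demo` are hypothetical names introduced here, not part of the `redis` package.

```typescript
// Minimal store interface; a Redis-backed version would wrap redis.get / redis.set.
interface AnswerStore {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSec: number): Promise<void>;
}

// In-memory stand-in for Redis (hypothetical, for local runs and tests only).
class MemoryStore implements AnswerStore {
  private entries = new Map<string, { value: string; expiresAt: number }>();

  async get(key: string): Promise<string | null> {
    const entry = this.entries.get(key);
    if (!entry) return null;
    if (Date.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired: treat as a miss
      return null;
    }
    return entry.value;
  }

  async set(key: string, value: string, ttlSec: number): Promise<void> {
    this.entries.set(key, { value, expiresAt: Date.now() + ttlSec * 1000 });
  }
}

// Cache-aside: check the store, compute on a miss, store with a TTL, then return.
async function getOrCompute(
  store: AnswerStore,
  key: string,
  compute: () => Promise<string>,
  ttlSec = 3600,
): Promise<string> {
  const cached = await store.get(key);
  if (cached !== null) return cached;
  const fresh = await compute();
  await store.set(key, fresh, ttlSec);
  return fresh;
}

// Usage: the second call with the same key never reaches the (fake) model.
async function demo(): Promise<number> {
  const store = new MemoryStore();
  let llmCalls = 0;
  const askModel = async () => {
    llmCalls += 1;
    return "stub answer";
  };
  await getOrCompute(store, "rag:v3:demo", askModel);
  await getOrCompute(store, "rag:v3:demo", askModel);
  return llmCalls;
}
```

One known gap in this sketch: two concurrent misses for the same key will both call `compute`. Whether that matters depends on traffic; deduplicating in-flight requests is a possible refinement.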

Cache keys

Include everything that affects the answer:

text

rag:v3:gpt-4o-mini:sha256(normalized_question + index_version)

Invalidate when:

Blog/docs re-indexed → bump index_version
System prompt changes → include prompt hash in key
User-specific prefs matter → include userId

Never cache personal medical/legal answers with a shared key unless policy allows.
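
The key rules above can be sketched as a small builder that hashes every input affecting the answer. This is one possible layout under stated assumptions: the `KeyParts` field names and the `buildCacheKey` function are hypothetical, and the prefix simply mirrors the `rag:v3:gpt-4o-mini:` pattern shown earlier. It uses Node's built-in `crypto` module for SHA-256.

```typescript
import { createHash } from "node:crypto";

// Inputs that change the answer; field names are illustrative assumptions.
interface KeyParts {
  schemaVersion: string;       // e.g. "v3"
  model: string;               // e.g. "gpt-4o-mini"
  normalizedQuestion: string;
  indexVersion: string;        // bump when the blog/docs are re-indexed
  promptHash: string;          // hash of the system prompt
  userId?: string;             // include only when answers depend on the user
}

function buildCacheKey(p: KeyParts): string {
  // JSON-encode the parts as an array so field boundaries stay unambiguous
  // (plain concatenation would make "ab" + "c" collide with "a" + "bc").
  const payload = JSON.stringify([
    p.normalizedQuestion,
    p.indexVersion,
    p.promptHash,
    p.userId ?? "",
  ]);
  const digest = createHash("sha256").update(payload).digest("hex");
  return `rag:${p.schemaVersion}:${p.model}:${digest}`;
}
```

Because the index version and prompt hash are part of the hashed payload, re-indexing or editing the system prompt produces new keys, and the old entries simply age out through their TTL instead of needing explicit deletion.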

Semantic cache (optional)

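Exact-match keys miss paraphrases. A semantic cache stores the embedding of each cached question; for a new query, if its cosine similarity to a cached entry exceeds a threshold, the stored answer is returned.

Trade-off: two questions can be close in embedding space yet need different answers, so a wrong hit is possible. Tune the threshold and TTL carefully.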
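
A minimal in-memory sketch of the lookup described above, assuming embeddings are already available as plain number arrays. `SemanticEntry`, `cosineSimilarity`, and `semanticLookup` are hypothetical names, and the default threshold of 0.9 is only an assumed starting point, not a recommended value.

```typescript
interface SemanticEntry {
  embedding: number[];
  answer: string;
}

// Cosine similarity: dot(a, b) / (|a| * |b|); returns 0 for a zero vector.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Linear scan: returns the most similar cached answer whose similarity is
// strictly above the threshold, or null when nothing qualifies.
function semanticLookup(
  entries: SemanticEntry[],
  queryEmbedding: number[],
  threshold = 0.9, // assumed starting point; tune on real traffic
): string | null {
  let best: SemanticEntry | null = null;
  let bestScore = threshold;
  for (const entry of entries) {
    const score = cosineSimilarity(entry.embedding, queryEmbedding);
    if (score > bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  return best ? best.answer : null;
}
```

A linear scan is fine for a few hundred entries; a larger cache would likely need a vector index instead. Logging the similarity score of every hit makes it easier to see where the threshold produces wrong answers.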

Checkpoint

Explain why caching is important in LLM apps: duplicate work is expensive, and every cache hit skips it, which lowers both p95 latency and the monthly bill.

What's next

Lesson 3 — Rate limiting & guardrails