Rate limiting & guardrails
Before we begin
One abusive client or runaway agent loop can burn through your LLM budget in minutes. Rate limiting is non-negotiable in production.
Figure: Rate limit gate
What you will learn
- Limit by user, IP, or API key.
- Return proper 429 responses with Retry-After.
- Combine limits with agent iteration caps from Module 8.
What to limit
| Dimension | Example |
|---|---|
| Requests / minute | 20 chat messages per user |
| Tokens / day | 100k input tokens per API key |
| Agent steps | Max 10 tool calls per session |
| Concurrent streams | 3 open SSE connections |
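Derive each limit from expected UX (how much a legitimate user realistically needs) and worst-case cost (what you pay if every client sits at its limit all day).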
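As a rough illustration of the worst-case side, the arithmetic can be sketched as below. The number of active keys and the price per million tokens are hypothetical placeholders, not real pricing; only the 100k tokens per key per day figure comes from the table above. The sketch covers input tokens only.

```typescript
// Worst-case daily spend if every API key uses its full token allowance.
// All field names are illustrative; plug in your own provider's pricing.
interface DailyBudgetInput {
  tokensPerKeyPerDay: number;        // the per-key daily token limit
  activeKeys: number;                // hypothetical number of active keys
  pricePerMillionTokensUsd: number;  // hypothetical price, NOT a real quote
}

function worstCaseDailyCostUsd(input: DailyBudgetInput): number {
  const totalTokens = input.tokensPerKeyPerDay * input.activeKeys;
  return (totalTokens / 1_000_000) * input.pricePerMillionTokensUsd;
}

// 100k tokens/key/day across 500 keys at a made-up $2 per million tokens.
const worstCase = worstCaseDailyCostUsd({
  tokensPerKeyPerDay: 100_000,
  activeKeys: 500,
  pricePerMillionTokensUsd: 2,
});
console.log(worstCase); // 100
```

If the result is more than you are willing to lose on a bad day, lower the limit or add a tighter tier.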
Token bucket (concept)
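The bucket holds up to a fixed number of tokens, which is the burst capacity. Tokens refill at a steady rate. Each request consumes one token; if the bucket is empty, the request is rejected.
This suits chat traffic: short bursts pass, while sustained abuse is throttled down to the refill rate.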
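A minimal in-memory sketch of the idea, assuming a single process and one bucket per client. The class and method names are illustrative, and the clock is passed in so the behavior is easy to check; a multi-instance deployment would need shared state instead.

```typescript
// In-memory token bucket (illustrative names, single-process only).
class TokenBucket {
  private readonly capacity: number;     // maximum burst size
  private readonly refillPerSec: number; // steady refill rate
  private tokens: number;
  private lastRefillMs: number;

  constructor(capacity: number, refillPerSec: number, nowMs: number = Date.now()) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.tokens = capacity; // start full so an initial burst is allowed
    this.lastRefillMs = nowMs;
  }

  // Consumes one token and returns true if the request is allowed.
  tryConsume(nowMs: number = Date.now()): boolean {
    if (nowMs > this.lastRefillMs) {
      const elapsedSec = (nowMs - this.lastRefillMs) / 1000;
      // Refill for the elapsed time, never exceeding capacity.
      this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
      this.lastRefillMs = nowMs;
    }
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// Burst of 5, refilling 1 token per second; timestamps are in milliseconds.
const bucket = new TokenBucket(5, 1, 0);
const burst = Array.from({ length: 6 }, () => bucket.tryConsume(0));
const afterTwoSeconds = bucket.tryConsume(2000);
console.log(burst, afterTwoSeconds);
```

The first five calls succeed, the sixth is rejected, and a request two seconds later succeeds again because two tokens have refilled.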
Simple Redis counter
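```typescript
import Redis from "ioredis"; // assumption: any client exposing incr/expire works

const redis = new Redis();

// Fixed-window counter: one Redis key per user per time window.
async function rateLimit(userId: string, limit = 30, windowSec = 60) {
  const nowSec = Math.floor(Date.now() / 1000);
  const key = `rl:${userId}:${Math.floor(nowSec / windowSec)}`;
  const count = await redis.incr(key);
  // The first hit in a window sets the TTL so old keys clean themselves up.
  if (count === 1) await redis.expire(key, windowSec);
  if (count > limit) {
    // Seconds left until the current window rolls over.
    return { ok: false, retryAfterSec: windowSec - (nowSec % windowSec) };
  }
  return { ok: true };
}
```

When the limiter rejects a request, return a 429 with a JSON body and a Retry-After header.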
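One way to shape that rejection, sketched as a framework-agnostic object. `HttpReply` and `tooManyRequests` are hypothetical names, and the `rate_limited` error code is an assumed convention; map the result onto whatever response type your server framework uses.

```typescript
// Framework-agnostic reply shape (hypothetical; adapt to your server).
interface HttpReply {
  status: number;
  headers: Record<string, string>;
  body: string;
}

// Builds a 429 reply from the limiter's retryAfterSec value.
function tooManyRequests(retryAfterSec: number): HttpReply {
  // Retry-After in its delay form is a whole number of seconds,
  // so round up and never tell the client to retry in 0 seconds.
  const seconds = Math.max(1, Math.ceil(retryAfterSec));
  return {
    status: 429,
    headers: {
      "Content-Type": "application/json",
      "Retry-After": String(seconds),
    },
    body: JSON.stringify({ error: "rate_limited", retryAfterSec: seconds }),
  };
}

const reply = tooManyRequests(12.3);
console.log(reply.status, reply.headers["Retry-After"]);
```

Repeating the delay in the JSON body lets clients that cannot read headers still back off correctly.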
Guardrails beyond rate limits
- Input length cap — reject prompts over N tokens before LLM call.
- Output max_tokens — bound completion cost.
- Tool allowlist — agents only call approved functions.
- Auth — anonymous tiers get lower limits than signed-in users.
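The checks above could be combined into one pre-flight function, sketched below. Everything here is illustrative: the config shape, the tier values, and the tool names are made up, and `estimateTokens` uses a rough four-characters-per-token guess that you would replace with your provider's tokenizer.

```typescript
// Hypothetical guardrail config; one instance per auth tier.
interface GuardrailConfig {
  maxInputTokens: number;           // input length cap
  maxOutputTokens: number;          // passed to the LLM call as max_tokens
  allowedTools: ReadonlySet<string>;
}

type GuardrailResult =
  | { ok: true; maxTokens: number }
  | { ok: false; reason: string };

// Assumption: ~4 characters per token. Use a real tokenizer in production.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Runs before any LLM call, so rejected requests cost nothing.
function checkRequest(prompt: string, tools: string[], cfg: GuardrailConfig): GuardrailResult {
  if (estimateTokens(prompt) > cfg.maxInputTokens) {
    return { ok: false, reason: "input_too_long" };
  }
  const blocked = tools.find((name) => !cfg.allowedTools.has(name));
  if (blocked !== undefined) {
    return { ok: false, reason: `tool_not_allowed:${blocked}` };
  }
  return { ok: true, maxTokens: cfg.maxOutputTokens };
}

// Anonymous users get tighter caps than signed-in users (made-up values).
const anonymous: GuardrailConfig = {
  maxInputTokens: 10,
  maxOutputTokens: 256,
  allowedTools: new Set(["search"]),
};

const accepted = checkRequest("hello", ["search"], anonymous);
const tooLong = checkRequest("x".repeat(100), [], anonymous);
const badTool = checkRequest("hi", ["shell"], anonymous);
console.log(accepted, tooLong, badTool);
```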
Checkpoint
You should be able to articulate why rate limiting protects your costs and provider quotas, rather than being mere “security theater.”