Rate limiting & guardrails

Before we begin

One abusive client or runaway agent loop can burn through your LLM budget in minutes. Rate limiting is non-negotiable in production.

Figure: Rate limit gate. Every request passes through limits before hitting the paid LLM API.

What you will learn

Limit by user, IP, or API key.
Return proper 429 responses with Retry-After.
Combine limits with agent iteration caps from Module 8.

What to limit

Dimension	Example
Requests / minute	20 chat messages per user
Tokens / day	100k input tokens per API key
Agent steps	Max 10 tool calls per session
Concurrent streams	3 open SSE connections

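Derive each limit from two numbers: what a legitimate user needs for a smooth experience, and what it costs you if every client hits the cap at once.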
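To make the worst-case side of that choice concrete, here is a minimal sketch of the arithmetic. The function name, the number of active keys, and the price per million tokens are all hypothetical placeholders, not real provider pricing; only the 100k tokens per day figure comes from the table above.

```typescript
// Worst-case daily input cost if every key uses its full daily token quota.
// All names and the example inputs below are illustrative assumptions.
function worstCaseDailyCostUsd(
  tokensPerDayPerKey: number,
  activeKeys: number,
  usdPerMillionInputTokens: number,
): number {
  const totalTokens = tokensPerDayPerKey * activeKeys;
  return (totalTokens / 1_000_000) * usdPerMillionInputTokens;
}

// Example: 100k tokens/day per key, 500 keys (assumed), $3 per million tokens (assumed).
const exampleCost = worstCaseDailyCostUsd(100_000, 500, 3);
console.log(`Worst case: $${exampleCost} per day`);
```

If the result is more than you are willing to lose in a day, lower the per-key quota before launch rather than after the first incident.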

Token bucket (concept)

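A bucket holds up to a fixed number of tokens (the burst capacity), and tokens refill at a steady rate. Each request consumes one token; if the bucket is empty, the request is rejected. These are rate-limit tokens, not LLM tokens.

This allows short chat bursts without permitting sustained abuse.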
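The token bucket idea can be sketched in a few lines. This is a minimal in-memory version for a single process, assuming one bucket per user; the class and method names are hypothetical, and a multi-server deployment would need shared state (for example in Redis) instead.

```typescript
// Minimal in-memory token bucket (illustrative; single process only).
class TokenBucket {
  private readonly capacity: number; // maximum burst size
  private readonly refillPerSec: number; // steady refill rate
  private tokens: number;
  private lastRefillMs: number;

  constructor(capacity: number, refillPerSec: number, nowMs: number = Date.now()) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.tokens = capacity; // start full so an initial burst is allowed
    this.lastRefillMs = nowMs;
  }

  // Returns true and consumes one token if the request is allowed.
  tryConsume(nowMs: number = Date.now()): boolean {
    // Refill lazily, based on the time elapsed since the last call.
    const elapsedSec = Math.max(0, (nowMs - this.lastRefillMs) / 1000);
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefillMs = nowMs;

    if (this.tokens < 1) return false; // empty bucket: reject
    this.tokens -= 1;
    return true;
  }
}

// Example: bursts of up to 5 messages, refilling one token every 2 seconds.
const chatBucket = new TokenBucket(5, 0.5);
console.log(chatBucket.tryConsume());
```

Passing the clock in as a parameter is a design choice made here so the refill logic can be tested without waiting for real time to pass.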

Simple Redis counter

typescript

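import Redis from "ioredis"; // assumption: ioredis; any client with incr() and expire() works

const redis = new Redis();

// Fixed-window counter: one Redis key per user per window.
async function rateLimit(userId: string, limit = 30, windowSec = 60) {
  const nowSec = Math.floor(Date.now() / 1000);
  const windowId = Math.floor(nowSec / windowSec);
  const key = `rl:${userId}:${windowId}`;
  const count = await redis.incr(key);
  // First hit in this window: set a TTL so the key expires on its own.
  if (count === 1) await redis.expire(key, windowSec);
  if (count > limit) {
    // Seconds left until this window ends and the counter resets.
    return { ok: false, retryAfterSec: (windowId + 1) * windowSec - nowSec };
  }
  return { ok: true };
}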

When ok is false, return HTTP 429 (Too Many Requests) with a JSON body and a Retry-After header set to retryAfterSec.
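One way to shape that response is sketched below. It is framework-agnostic: the HttpReply type and the tooManyRequests helper are hypothetical names, and the JSON field names are an assumption rather than a standard. Map the result onto whatever response object your server framework uses.

```typescript
// Plain description of an HTTP response (hypothetical type for this sketch).
interface HttpReply {
  status: number;
  headers: Record<string, string>;
  body: string;
}

// Builds a 429 reply from the limiter's retryAfterSec value.
function tooManyRequests(retryAfterSec: number): HttpReply {
  // Retry-After takes a whole number of seconds; round up and never send 0.
  const seconds = Math.max(1, Math.ceil(retryAfterSec));
  return {
    status: 429,
    headers: {
      "Content-Type": "application/json",
      "Retry-After": String(seconds),
    },
    body: JSON.stringify({
      error: "rate_limited", // assumed error code, not a standard
      message: `Too many requests. Retry in ${seconds} seconds.`,
      retryAfterSec: seconds,
    }),
  };
}

console.log(tooManyRequests(12.3));
```

Repeating the delay in the JSON body is optional, but it saves clients that only parse the body from having to read headers.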

Guardrails beyond rate limits

Input length cap — reject prompts over N tokens before LLM call.
Output max_tokens — bound completion cost.
Tool allowlist — agents only call approved functions.
Auth — anonymous tiers get lower limits than signed-in users.
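The four guardrails above can be gathered into a single config and checked before any LLM call. The sketch below is illustrative only: every name and number is a hypothetical placeholder, and the token estimate uses a rough 4-characters-per-token heuristic, so use your provider's tokenizer for real enforcement.

```typescript
type Tier = "anonymous" | "signedIn";

interface GuardrailConfig {
  maxInputTokens: number; // input length cap
  maxOutputTokens: number; // value to pass as max_tokens
  allowedTools: ReadonlySet<string>; // tool allowlist
  requestsPerMinute: Record<Tier, number>; // lower limit for anonymous users
}

// Placeholder values, chosen only for illustration.
const guardrails: GuardrailConfig = {
  maxInputTokens: 4000,
  maxOutputTokens: 1000,
  allowedTools: new Set(["search_docs", "get_weather"]),
  requestsPerMinute: { anonymous: 5, signedIn: 20 },
};

// Rough estimate only (assumes about 4 characters per token).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Rejects oversized prompts before any paid API call is made.
function checkPrompt(prompt: string, cfg: GuardrailConfig): { ok: boolean; reason?: string } {
  const estimated = estimateTokens(prompt);
  if (estimated > cfg.maxInputTokens) {
    return { ok: false, reason: `Prompt is about ${estimated} tokens; limit is ${cfg.maxInputTokens}.` };
  }
  return { ok: true };
}

// Agents may only call tools that appear on the allowlist.
function isToolAllowed(toolName: string, cfg: GuardrailConfig): boolean {
  return cfg.allowedTools.has(toolName);
}

console.log(checkPrompt("Hello!", guardrails), isToolAllowed("delete_database", guardrails));
```

Keeping these values in one config object makes it easier to review them together and to vary them per tier later.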

Checkpoint

You should be able to explain why rate limiting protects your costs and provider quotas, and why it is more than “security theater.”

What's next

Lesson 4 — Cost optimization & token budgets