Rate limiters protect your API from abuse, runaway clients, and accidental DDoS from a buggy cron job. Every system design interview eventually asks: how do you throttle fairly at scale?

Requirements (say these first)

Functional: cap requests per user/IP/API key over a time window; return 429 Too Many Requests when exceeded
Non-functional: low latency on the hot path; distributed (many API gateway instances); configurable limits per tier

High-level architecture

Algorithm trade-offs

Algorithm	Pros	Cons
Fixed window	Simple, fast	Burst at window boundary
Sliding window log	Accurate	Memory per request
Token bucket	Smooth bursts	Slightly more state
Sliding window counter	Good balance	Approximation at edges
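
To make the "burst at window boundary" row concrete, here is a minimal sketch comparing a fixed window with a sliding window counter. The class names, the `burst_at_boundary` helper, and the explicit `now` argument (passed in so the timing is reproducible) are assumptions for illustration, not part of any particular library.

```python
import math


class FixedWindow:
    """Counts requests in aligned windows of `window` seconds."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.current = -1  # index of the window being counted
        self.count = 0

    def allow(self, now: float) -> bool:
        idx = math.floor(now / self.window)
        if idx != self.current:
            # New window: the counter resets and forgets recent traffic.
            self.current = idx
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


class SlidingWindowCounter:
    """Weights the previous window's count by how much of it still overlaps."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.current = -1
        self.count = 0
        self.prev_count = 0

    def allow(self, now: float) -> bool:
        idx = math.floor(now / self.window)
        if idx != self.current:
            # Keep the last count only if it belongs to the adjacent window.
            self.prev_count = self.count if idx == self.current + 1 else 0
            self.current = idx
            self.count = 0
        # Fraction of the sliding window that still covers the previous window.
        overlap = 1.0 - (now / self.window - idx)
        estimated = self.prev_count * overlap + self.count
        if estimated + 1 <= self.limit:
            self.count += 1
            return True
        return False


def burst_at_boundary(limiter) -> int:
    """Send 10 requests just before t=60 and 10 just after; count admitted."""
    before = [59.0 + i * 0.09 for i in range(10)]
    after = [60.0 + i * 0.09 for i in range(10)]
    return sum(limiter.allow(t) for t in before + after)
```

With a limit of 10 per 60 seconds, the fixed window admits all 20 requests in under two seconds, because the counter resets at t=60. The sliding window counter admits only the first 10, at the cost of estimating the previous window's contribution rather than tracking each request.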

Token bucket (common interview answer)

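python
import time


class TokenBucket:
    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = float(capacity)  # start full; becomes fractional after refills
        self.refill_per_sec = refill_per_sec
        self.last_refill = time.monotonic()  # monotonic: unaffected by clock changes

    def allow(self) -> bool:
        # Credit tokens for the time elapsed since the last call, capped at capacity.
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(
            self.capacity,
            self.tokens + elapsed * self.refill_per_sec,
        )
        self.last_refill = now
        # Spend one token per request; reject when less than one token is left.
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False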

Distributed gotcha

Gateway instances can't each keep their own counter, because the effective limit then multiplies with the fleet: a user capped at 100 req/s per node could reach 1000 req/s by fanning across 10 nodes. Centralize state in Redis (or use a gossip/coordinated counter with careful consistency trade-offs).
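
One way to picture the shared state is a fixed-window counter that every gateway node increments in the same store. This is a sketch under assumptions: `CounterStore`, `allow_request`, and the `ratelimit:{key}:{window}` key layout are hypothetical names, and the store is any client exposing `incr` and `expire` (redis-py's `redis.Redis` client provides both).

```python
from typing import Protocol


class CounterStore(Protocol):
    """The two commands this sketch needs from the shared store."""

    def incr(self, name: str) -> int: ...
    def expire(self, name: str, time: int) -> object: ...


def allow_request(store: CounterStore, key: str, limit: int,
                  window_s: int, now: float) -> bool:
    """Fixed-window check against a counter shared by every gateway node."""
    window_id = int(now // window_s)
    name = f"ratelimit:{key}:{window_id}"  # hypothetical key layout
    count = store.incr(name)  # INCR is atomic on the Redis server
    if count == 1:
        # First hit in this window: let the key expire once the window is over.
        store.expire(name, window_s)
    return count <= limit
```

In practice `now` would come from a wall clock such as `time.time()` so that all nodes agree on the window. The `INCR` and `EXPIRE` calls are two separate round trips here; if a node dies between them the key never expires, which is why production setups often wrap both in a transaction or a Lua script.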

What to say in the room

Clarify who is limited (user, IP, API key) and what counts (read vs write)
Pick an algorithm and name its weakness
Put Redis on the diagram — interviewers want to see shared state
Mention Retry-After header and idempotency for write endpoints
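
For the Retry-After point, a token bucket makes the wait easy to compute: it is the time needed to refill up to one whole token. The sketch below assumes the `tokens` and `refill_per_sec` values from the token bucket above; the function names and the `(status, headers)` return shape are hypothetical, not tied to any web framework.

```python
import math


def retry_after_seconds(tokens: float, refill_per_sec: float) -> int:
    """Whole seconds until a token bucket holds at least one token."""
    if tokens >= 1:
        return 0
    return math.ceil((1 - tokens) / refill_per_sec)


def too_many_requests(tokens: float, refill_per_sec: float) -> tuple[int, dict[str, str]]:
    """Build a (status, headers) pair for a throttled request."""
    # Retry-After takes a delay in whole seconds; never advertise less than 1.
    wait = max(1, retry_after_seconds(tokens, refill_per_sec))
    return 429, {"Retry-After": str(wait)}
```

Rounding up means a well-behaved client that honors the header should find a token waiting when it retries.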

The diagram + one algorithm + Redis is usually enough for a 35-minute system design slot.