The Question
Design

Scalable Distributed Rate Limiting

Design a distributed system to enforce API rate limits across a global fleet of microservices. The system must support multi-tenant configurations, handle hundreds of thousands of requests per second with sub-millisecond overhead, and ensure high availability even during partial infrastructure failures.
Redis Cluster
Lua Script
Sliding Window Counter
API Gateway Middleware
Questions & Insights

Clarifying Questions

Scale and Traffic: What is the expected Peak Requests Per Second (RPS) and the number of unique entities (users/IPs) to track?
Assumption: 100,000 peak RPS with 10 million daily active users.
Precision Requirements: Is a small margin of error acceptable (e.g., ±5%), or must the limit be strictly enforced?
Assumption: High precision is required for billing/SLA purposes; a Sliding Window Counter algorithm is preferred.
Granularity: Are limits applied per User ID, API Key, IP address, or a combination?
Assumption: Limits are primarily per API Key/User ID, with fallback to IP for anonymous traffic.
Latency Budget: What is the maximum acceptable overhead added to each API request?
Assumption: The rate-limiting check must complete in <5ms to minimize impact on the end-user experience.
Availability vs. Consistency: If the rate-limiting service is down, should we "fail open" (allow traffic) or "fail closed" (block traffic)?
Assumption: Fail open. Availability of the API is more critical than 100% strict limit enforcement during a partial system outage.

Thinking Process

Core Bottleneck: The primary challenge is performing atomic "read-modify-write" operations on counters at high frequency without introducing race conditions or significant latency.
Strategy Steps:
Algorithm Selection: Use a Sliding Window Counter implemented via Redis Lua scripts to ensure atomicity and memory efficiency.
Distributed State: Centralize counters in a high-performance, in-memory store (Redis) to support multiple API Gateway instances.
Local Optimization: Implement a two-tier check where a local cache (in-memory) handles very high-frequency "hot" keys to reduce Redis pressure.
Resiliency: Implement a "Fail Open" mechanism using a circuit breaker to bypass the rate limiter if it becomes a bottleneck.

Bonus Points

Redis Lua Atomicity: Use Lua scripts to combine "increment," "expire," and "threshold check" into a single RTT (Round Trip Time) to avoid race conditions between concurrent requests.
Tiered/Hierarchical Limiting: Support multiple buckets simultaneously (e.g., 10 req/sec AND 5,000 req/hour) using a single Redis call.
Dynamic Configuration: Use a Push-based model (e.g., via Etcd or ConfigMap) to update rate limit thresholds across the fleet in real-time without restarts.
Global Footprint: For multi-region deployments, use "Local Rate Limiting" (per-region) with asynchronous "Global Sync" to avoid cross-region latency penalties.
Design Breakdown

Functional Requirements

Enforce Limits: Block requests that exceed a defined threshold (e.g., 100 requests per minute).
Client Feedback: Return standard HTTP 429 "Too Many Requests" headers (X-Ratelimit-Remaining, X-Ratelimit-Reset).
Identifier Flexibility: Support limiting by User ID, IP, or specific API Resource paths.

Non-Functional Requirements

Low Latency: Limit check overhead must be <5ms.
High Scalability: Must handle horizontal scaling of the API fleet.
High Availability: The rate limiter must not be a single point of failure (SPOF); if it fails, the API continues to function.
Memory Efficiency: Storage of counters must be optimized to handle millions of users.

Estimation

Traffic: 100k RPS.
Storage:
Key: rl:{user_id}:{minute_timestamp} (~40 bytes).
Value: 8-byte integer.
Total per user/min: ~50 bytes.
1 Million active users/min: 1M 50 bytes = 50 MB**.
Network: 100k RPS * 1KB (metadata/headers) ≈ 100MB/s bandwidth to Redis cluster.
Throughput: A single Redis node can handle ~100k operations/sec; a small 3-node Redis Cluster is sufficient for the MVP.

Blueprint

Concise Summary: A distributed middleware-based rate limiter using an API Gateway to intercept requests and a Redis Cluster to track counts using a Sliding Window algorithm.
Major Components:
API Gateway/Middleware: Intercepts incoming requests and communicates with the rate limiter logic.
Rate Limiter Service: A stateless service (or library) that executes the logic to allow/deny requests.
Redis Cluster: An in-memory store providing atomic increments for window-based counters.
Configuration Store (DB): Stores the specific limit policies (e.g., "Tier 1: 500 req/min") for different users.
Simplicity Audit: This is the simplest "Staff-level" design because it avoids complex stream processing or heavy persistence, relying on Redis's native speed and Lua for atomicity.
Architecture Decision Rationale:
Why this architecture?: Redis provides the sub-millisecond latency required, and Lua scripts solve the concurrency problem without needing distributed locks.
Functional Satisfaction: Directly supports 429 responses and header injection.
Non-functional Satisfaction: Scalable by adding API Gateway nodes or Redis shards; highly available via Redis Sentinel/Cluster.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling: The Rate Limiter Logic is integrated as a Middleware Plugin within the API Gateway (e.g., Kong, Nginx, or a Go-based sidecar). This reduces one network hop compared to a standalone gRPC service.
API Spec:
GET /check: Internal endpoint (if using a service) taking user_id, api_key, and cost.
Headers:
X-RateLimit-Limit: Maximum requests permitted.
X-RateLimit-Remaining: Remaining requests in the current window.
X-RateLimit-Reset: Time until the limit resets.

Storage

Data Model:
Policy DB: Table rate_limits { client_id (PK), tier, rps_limit, rpm_limit }.
Database Logic: Policies are cached locally in the API Gateway memory with a 5-minute TTL to avoid hitting the Policy DB on every request.

Cache

Data Structure: Redis Hash or String keys.
Sliding Window Counter (Lua):
Logic: current_minute = floor(now/60), previous_minute = current_minute - 1.
Key: user:{id}:min:{timestamp}.
Formula: count = current_min_count + (prev_min_count * (1 - (seconds_into_minute / 60))).
TTL: Keys are set with a 2-minute expiration to automatically clean up old data.
Wrap Up

Advanced Topics

Monitoring:
Metrics: rate_limit_exceeded_total (429 count), redis_latency, policy_cache_hit_rate.
Tools: Prometheus for metrics collection and Grafana for alerting on spikes in blocked requests.
Trade-offs:
Consistency vs. Latency: We choose "Eventual Consistency" for global limits across regions (local limits are strict) to favor low latency.
Accuracy: The Sliding Window Counter formula is an approximation but much more accurate than Fixed Window and uses 90% less memory than Sliding Window Logs.
Bottlenecks: Redis CPU usage can spike with Lua scripts. Optimization: Use Redis Cluster to shard by user_id to distribute the load.
Failure Handling:
Circuit Breaker: If Redis latency > 50ms, the middleware enters "fail-open" mode, allowing all traffic and logging the failure.
Alternatives:
Token Bucket: Better for handling bursts but slightly more complex to implement precisely in a distributed manner compared to counters.