The Question
Design

Scalable Distributed Rate Limiter

Design a distributed rate limiting system capable of handling 1 million requests per second. The system must support various limiting strategies (e.g., sliding window, token bucket) and allow for per-user or per-API-key quotas. Key constraints include sub-5ms latency overhead, high availability with fail-open semantics, and the ability to scale horizontally as traffic grows. Explain your choice of storage, consistency models, and how you would handle 'hot keys' for extremely popular users.
Redis
Lua
gRPC
API Gateway
Kubernetes
NoSQL
Questions & Insights

Clarifying Questions

Scale and Throughput: What is the expected peak traffic? (Assumption: 1 million requests per second (RPS) globally).
Limiting Granularity: Are we limiting by User ID, IP address, or API Key? (Assumption: Multi-tenant support, limiting by API Key).
Latency Budget: What is the maximum acceptable overhead added to the request? (Assumption: < 5ms at p99).
Accuracy Requirement: Is "soft" limiting acceptable, or is "hard" precision required? (Assumption: High accuracy required, using a Sliding Window Counter algorithm).
Failure Mode: Should the system fail-open or fail-closed? (Assumption: Fail-open to prioritize availability over strict quota enforcement).

Thinking Process

Core Bottleneck: The primary challenge is maintaining a globally synchronized count without introducing massive network latency or race conditions.
Progressive Logic:
How do we store counts for millions of users efficiently in memory? (Redis).
How do we ensure atomic updates to prevent "double-counting" or race conditions? (Redis Lua scripts).
How do we handle high-frequency writes without overwhelming the database? (Sliding Window Counter instead of Sliding Window Log).
How do we ensure the system scales linearly as traffic grows? (Redis Cluster with sharding based on the rate-limit key).

Bonus Points

Local Batching (Hybrid Approach): Mentioning a "Local + Global" strategy where nodes maintain a local counter and sync with Redis every N milliseconds or X requests to reduce network overhead.
Clock Drift Resilience: Discussing how to handle time-sensitive windowing in a distributed environment using Redis server time rather than application server time.
Shadow Mode: Implementing a "dry-run" mode to test new rate-limit rules against production traffic without actually blocking users.
Dynamic Policy Engine: Decoupling the rules from the code using a configuration service (e.g., Etcd) to update limits in real-time without restarts.
Design Breakdown

Functional Requirements

Core Use Cases:
Evaluate incoming requests against defined rules (e.g., 100 requests/minute).
Return 429 Too Many Requests when limits are exceeded.
Support multiple rules per key (e.g., 5 per second AND 1000 per hour).
Scope Control:
In-Scope: Distributed counting logic, rule evaluation, and response header injection (X-RateLimit-Remaining).
Out-of-Scope: User authentication, billing, or complex fraud detection.

Non-Functional Requirements

Scale: Must handle 1M+ RPS with horizontal scaling.
Latency: Sub-5ms processing time per request.
Availability & Reliability: 99.99% availability; if the limiter fails, traffic should pass through (Fail-Open).
Consistency: Eventual consistency is acceptable for global counts, but atomic updates are required for local increments within a shard.
Security: Prevent "noisy neighbor" issues where one user's traffic spikes affect the limiter's performance for others.

Estimation

Traffic: 1M RPS.
Storage:
1M active users.
Each key (API Key) + Metadata + Counter \approx 100 bytes.
Total Storage = 1M * 100B = 100MB (Fits easily in a small Redis cluster).
Bandwidth:
1M RPS * 1KB request size = 1GB/s (Total traffic through the Gateway).
Limiter communication: 1M RPS * 200B (Redis command/response) = 200MB/s.

Blueprint

Concise Summary: A centralized, high-performance rate limiting service using an API Gateway integrated with a Redis Cluster to track windowed counters atomically.
Major Components:
API Gateway: Acts as the entry point, intercepting requests to check quota before routing to backend services.
Rate Limiter Service: A stateless sidecar or microservice that executes the limiting logic and communicates with the cache.
Redis Cluster: Provides the distributed, in-memory storage for counters using sharding for horizontal scale.
Simplicity Audit: This design avoids complex messaging queues or heavy stream processing (YAGNI), opting for Redis's native atomic operations which provide the best balance of speed and consistency for an MVP.
Architecture Decision Rationale:
Why this architecture?: Redis Lua scripts ensure atomicity without the overhead of distributed locks.
Functional Satisfaction: Meets the need for per-key limiting and real-time blocking.
Non-functional Satisfaction: High throughput is handled by Redis sharding; low latency is achieved by in-memory operations.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: DNS redirects traffic to the nearest regional Load Balancer to minimize latency.
Security & Perimeter: API Gateway handles SSL termination and extracts the API-Key or User-ID used for rate limiting.
Rate Limiting Integration: The Gateway performs a synchronous call to the Rate Limiter Service. For extreme performance, the Limiter logic can be implemented as a Lua script directly in an Nginx/OpenResty gateway or an Envoy filter.

Service

Topology & Scaling: Stateless Go or Java services deployed in a Multi-AZ Kubernetes cluster. Scaled based on CPU and request latency.
API Schema Design:
POST /v1/check (Internal API)
Request: { "key": "user_123", "cost": 1 }
Response: { "allowed": true, "remaining": 99, "reset": 1625091200 }
Protocol: gRPC for low-latency internal communication.
Resilience & Reliability:
Circuit Breaker: If the Rate Limiter Service or Redis has high latency (> 50ms), the Gateway trips the circuit and allows all traffic (Fail-Open).
Retries: No retries for rate-limit checks (it's better to miss a limit than to double the latency).

Storage

Access Pattern: Heavy write (increment) and heavy read (check).
Technical Selection: Redis is the industry standard for this use case.
Algorithm: Sliding Window Counter.
We map the current time to buckets (e.g., 1-minute buckets).
We calculate: count = current_bucket_count + last_bucket_count * (1 - weight_of_current_bucket_time).
Distribution Logic: Sharding by hash(rate_limit_key). This ensures all requests for the same user land on the same Redis node, avoiding cross-slot overhead.

Cache

Purpose: Redis acts as both the primary store and the cache for counters.
Key-Value Schema:
Key: ratelimit:{api_key}:{timestamp_minute}
Value: Integer (Counter)
TTL: 2 minutes (to allow for sliding window calculation of the previous minute).
Atomic Operations: Uses INCR and EXPIRE or a Lua script to check and increment in one RTT (Round Trip Time).
Wrap Up

Advanced Topics

Trade-offs (Consistency vs. Availability): We choose AP (Availability) over CP (Consistency). In a network partition, we allow traffic even if we can't sync the global count perfectly.
Reliability & Failure Handling:
Redis Cluster: Uses master-slave replication. If a master fails, a slave is promoted.
Thundering Herd: To prevent millions of requests from hitting the DB if Redis fails, the Gateway uses a local cache of "allow-all" flags.
Bottleneck Analysis:
Hot Keys: A single celebrity user could overwhelm one Redis shard.
Optimization: Use "Local L1 Cache" (In-memory on the service node) for extremely hot keys, syncing with Redis every 100ms.
Security: The X-RateLimit-Limit and X-RateLimit-Remaining headers are returned to the client to help well-behaved clients back off voluntarily.
Distinguishing Insight: To handle "Burstiness", we can implement a Token Bucket algorithm instead of Sliding Window. However, Sliding Window is generally preferred for "hard" time-window quotas (e.g., 100/min) because it is easier for users to understand and debug.