The Question
Design

Distributed Rate Limiter Design

Design a globally scalable, low-latency distributed rate limiting system that can enforce fine-grained request quotas across millions of users and multiple microservices while ensuring high availability even during partial infrastructure failures.
Redis
Lua
gRPC
Sidecar
L7 Load Balancer
Questions & Insights

Clarifying Questions

Scale and Throughput: What is the expected peak traffic (QPS) the rate limiter needs to handle?
Assumption: Support 1M+ QPS across multiple services.
Accuracy requirements: Is a "hard limit" required (strictly accurate), or is "soft limiting" (eventual consistency/approximation) acceptable?
Assumption: High accuracy is preferred, but latency is the priority.
Granularity: What are we rate limiting? (e.g., User ID, IP address, API Key, or a combination?)
Assumption: Flexible keys (primarily User ID and API Key).
Latency Budget: What is the maximum overhead the rate limiter can add to a request?
Assumption: < 5ms at p99.
Fail-open vs. Fail-closed: If the rate limiter service is unavailable, should we allow the request or block it?
Assumption: Fail-open to prioritize system availability over strict enforcement.

Thinking Process

The Bottleneck: Centralized state management is the core challenge. A single database cannot handle 1M+ increments per second without significant latency.
Algorithm Selection: Move from Fixed Window (bursting issues) to Sliding Window Counter using Redis LUA scripts to ensure atomicity and minimize network round-trips.
Local vs. Global Strategy: How do we balance accuracy and speed? We use a Two-Tier approach: Local in-memory caching (short-lived) to absorb bursts, synchronized with a global Redis cluster.
Resilience: Implement a fail-open mechanism using circuit breakers at the client/sidecar level.

Bonus Points

LUA Scripting for Atomicity: Perform GET, CHECK, and INCR in a single atomic operation within Redis to prevent race conditions without complex locking.
Client-Side Batching: For extremely high-volume keys (e.g., public APIs), use local "thick clients" that batch increments and sync to the central store every 50-100ms.
Hierarchical Rate Limiting: Implement a policy engine that can evaluate multiple rules (e.g., 100 req/sec per user AND 5000 req/sec per service).
Global Clock Synchronization Alternatives: Discussing why we use TTL-based windowing instead of relying on synchronized system clocks to avoid clock drift issues.
Design Breakdown

Functional Requirements

Limit Enforcement: Throttle requests based on defined thresholds (e.g., N requests per T time unit).
Policy Management: Ability to update rate limit configurations dynamically without service restarts.
Status Feedback: Return standard HTTP 429 (Too Many Requests) with Retry-After headers.

Non-Functional Requirements

Low Latency: Minimal impact on the end-to-end request lifecycle.
High Availability: The rate limiter itself should not be a single point of failure.
Scalability: Must scale horizontally as the number of protected services grows.
Accuracy: Prevent "double-counting" or "under-counting" significantly in a distributed environment.

Estimation

Traffic: 1,000,000 QPS.
State Size: 100 bytes per user key (Key + Counter + Window Metadata).
Active Users: 10,000,000.
Total Storage: 10M * 100 bytes ≈ 1GB (Easily fits in Redis).
Bandwidth: 1M QPS * 100 bytes per request ≈ 100 MB/s.
Redis Throughput: A single Redis node can handle ~100k-150k QPS. For 1M QPS, we need a Redis Cluster with ~10 shards.

Blueprint

Concise Summary: A sidecar-based rate limiter that uses a Redis Cluster for distributed state management, employing an atomic Sliding Window Counter algorithm via LUA scripts.
Major Components:
Rate Limiter Sidecar/Middleware: Intercepts requests, checks the local cache, and queries Redis to decide if a request should be throttled.
Redis Cluster: Distributed, in-memory store providing the single source of truth for request counters.
Configuration Service: A simple control plane to push rate limit rules (e.g., "User X: 100/min") to the sidecars.
Simplicity Audit: This architecture avoids complex stream processing (Kafka/Flink) and heavy persistent databases, using Redis's native speed and atomic primitives to solve the problem with minimal moving parts.
Architecture Decision Rationale:
Why this architecture?: Redis is industry-standard for high-speed counters. LUA scripts solve race conditions without the overhead of distributed locks (like Etcd/Zookeeper).
Functional Satisfaction: Directly implements throttling and feedback via HTTP 429.
Non-functional Satisfaction: Redis Cluster handles horizontal scaling; Sidecar deployment ensures low-latency "fail-open" logic is close to the application.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling:
The rate limiter is deployed as a middleware library or sidecar within the application pod. This ensures the check happens in-process or via localhost, minimizing network hops.
Scaling is tied to the application services (Stateless).
API Schema Design:
Internal Check (Internal gRPC/Function call):
checkLimit(key, limit, windowSize)
Response: { allowed: boolean, remaining: int, resetTime: long }
External Response:
Status: 429 Too Many Requests
Headers: X-RateLimit-Limit, X-RateLimit-Remaining, Retry-After.
Resilience & Reliability:
Fail-Open: If Redis returns an error or times out (> 2ms), the middleware logs the error and allows the request.
Circuit Breaker: If Redis latency spikes, the sidecar stops calling Redis and allows all traffic for a "cooldown" period.
Observability:
Metrics: rate_limit_allowed_count, rate_limit_blocked_count, redis_latency_ms.
Tracing: Inject rate_limit_id into distributed traces to identify which policy caused a throttle.

Cache

Purpose & Justification:
Local Cache: Small, short-lived (500ms) in-memory cache in the sidecar to mitigate "Hot Key" issues (e.g., a single user attacking the system).
Key-Value Schema:
Key: rate_limit:{service_id}:{key_id}:{window_timestamp}
Value: Integer (count)
TTL: Equal to windowSize.
Failure Handling: If the local cache is exhausted, it defaults to the Redis Cluster.
Wrap Up

Advanced Topics

Monitoring: Key metrics include Redis CPU usage and the ratio of 429 vs 200 status codes.
Trade-offs:
Accuracy vs. Latency: We choose Redis over a RDBMS to favor latency, accepting that if a Redis shard fails and fails over, counters might be slightly reset (eventual consistency during failover).
Bottlenecks: The primary bottleneck is the network throughput of the Redis Cluster. This is mitigated by sharding.
Alternatives:
Token Bucket: Simpler but harder to implement "Sliding Window" accuracy across multiple distributed nodes without high lock contention.
Sticky Sessions at LB: Could allow local-only rate limiting, but breaks if the LB rebalances or a node scales down.