The Question
Design

Scalable Distributed Rate Limiter

Design a high-performance distributed rate-limiting system capable of handling millions of requests per second across a global microservices architecture. The system must support various limiting algorithms (e.g., Sliding Window, Token Bucket) and offer different granularities such as per-user, per-IP, and per-API key. Key constraints include sub-millisecond latency overhead, high availability even during storage failures, and the ability to handle 'hot' keys effectively. Explain your choice of state management, how you handle race conditions in a distributed environment, and your strategy for fail-soft operations.
Redis
Lua
API Gateway
Consistent Hashing
Circuit Breaker
L1/L2 Caching
Questions & Insights

Clarifying Questions

Scale & Performance: What is the expected peak traffic (QPS), and what is the maximum allowable latency overhead for the rate-limiting check?
Granularity: Are we limiting by User ID, IP Address, API Key, or a combination of these?
Accuracy: Is a small margin of error acceptable (e.g., ±5%), or do we require strict accuracy for billing or high-security purposes?
Architecture: Is this a centralized global rate limiter, or should it be distributed across multiple regional data centers?
Action: When a limit is exceeded, do we simply drop the request (429 Too Many Requests), or is there a requirement for request queuing/shaping?
Assumptions for MVP:
Scale: Support up to 100,000 QPS.
Latency: Must add < 5ms to the total request round-trip.
Granularity: User ID and IP-based.
Accuracy: Highly accurate but prioritizes availability (eventual consistency in multi-region is acceptable).
Environment: Distributed microservices environment.

Thinking Process

Core Bottleneck: The primary challenge is the "Race Condition" in a distributed environment when multiple application servers try to update the same counter simultaneously.
Step 1: Choose the Algorithm: For an MVP that balances accuracy and memory, the Sliding Window Counter algorithm is superior to Fixed Window (boundary issues) or Token Bucket (locking complexity).
Step 2: State Management: Use a high-performance, in-memory data store (Redis) with atomic operations to handle shared state across distributed nodes.
Step 3: Optimization: Implement local in-memory caching for "hot" keys to reduce Redis hits and improve latency for heavily throttled users.
Step 4: Resilience: Establish a "Fail-Open" strategy so that if the rate limiter service or Redis goes down, traffic is allowed through to ensure system availability.

Bonus Points

Redis Lua Scripts: Use Lua scripting to group "Read-and-Update" operations into a single atomic execution, eliminating race conditions without distributed locks.
Hybrid Tiering: Implement a two-tier rate limiting strategy—Global (Redis) for strict enforcement and Local (In-memory) for high-frequency burst protection.
Client Hinting: Return headers like X-RateLimit-Remaining and Retry-After to allow well-behaved clients to self-throttle.
Dynamic Config: Use a configuration service (e.g., Etcd or Consul) to update rate limits in real-time without redeploying services.
Design Breakdown

Functional Requirements

Core Use Cases:
Intercept incoming requests and determine if they exceed defined limits.
Return HTTP 429 status code for throttled requests.
Support multiple bucket definitions (e.g., 100 req/min and 5000 req/hour).
Scope Control:
In-Scope: Distributed counter management, basic algorithm implementation, and headers.
Out-of-Scope: Request queuing (Leaky Bucket), sophisticated fraud detection, and complex billing integration.

Non-Functional Requirements

Scale: Horizontal scaling of the rate-limiting service.
Latency: Sub-5ms overhead; Redis reads must be optimized.
Availability & Reliability: 99.99% availability; the system must fail-open.
Consistency: Strong consistency per key within a single region via Redis.
Fault Tolerance: Handle Redis cluster partitions gracefully.
Security: Prevent attackers from bypassing the limiter via header spoofing.

Estimation

Traffic: 100k QPS (Peak).
Storage:
Each counter key: user_id:api_endpoint (~64 bytes).
Value: Integer (~8 bytes).
10 Million active users/keys.
Total RAM: 10M 72 bytes ≈ 720 MB** (Well within a single Redis node, but clustered for HA).
Bandwidth:
Request/Response to Redis: ~100 bytes.
100k QPS 100 bytes = 10 MB/s** internal network overhead.

Blueprint

Concise Summary: A sidecar/middleware-based rate limiter that uses a Redis-backed Sliding Window Counter to track request frequencies across a distributed cluster.
Major Components:
API Gateway / Middleware: The entry point that intercepts requests and queries the Rate Limiter Service.
Rate Limiter Service: A stateless service that calculates if a request is allowed based on the configured policy.
Redis Cache: The centralized state store for counters, using Lua scripts for atomicity.
Config Store: A simple repository (ConfigMap or S3) to store rate limit rules (e.g., tier_1: 100/min).
Simplicity Audit: This design avoids complex distributed locking and heavy-duty messaging by leveraging Redis's native atomic increments and TTL features.
Architecture Decision Rationale:
Why this architecture?: Redis is industry-standard for this use case because its data structures (Hashes/Sorted Sets) and atomic operations provide the performance needed to stay in the critical request path.
Functional Requirement Satisfaction: Meets all requirements for distributed tracking and 429 response handling.
Non-functional Requirement Satisfaction: Scaling is achieved by adding Redis shards; low latency is achieved by keeping state in-memory.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling: Stateless instances of the Rate Limiter Service deployed across multiple Availability Zones (AZs). Scaling is triggered by CPU or Request Count.
API Schema Design:
Internal Endpoint: POST /v1/check
Request: { "key": "user_123", "weight": 1, "context": "login_api" }
Response: { "allowed": true, "remaining": 99, "reset_time": 1625000000 }
SLA: < 2ms p99 latency.
Resilience & Reliability:
Circuit Breaker: If the Rate Limiter Service or Redis latency exceeds 10ms, the Gateway trips the circuit and allows all traffic (Fail-Open).
Retry Policy: No retries for the check itself (to save time); if the first call fails, fail-open immediately.

Storage

Access Pattern: Extremely high-frequency Read-Modify-Write (RMW) cycle.
Database Table Design (Config Store):
policy_id: UUID (Primary Key)
resource_path: String (Index)
limit_count: Integer
window_size: Seconds
Technical Selection: Redis (specifically Redis Sorted Sets or Hashes for Sliding Window).
Distribution Logic: Sharded by key (UserID/IP) using consistent hashing to ensure all requests for a single user hit the same Redis shard, maintaining counter accuracy.

Cache

Purpose & Justification: Redis is* the primary state store here, but we implement an L1 Local Cache** (Guava or Caffeine) inside the Rate Limiter Service to store "Known Banned" keys for 1-5 seconds to protect Redis from targeted DDoS.
Key-Value Schema: key: {allowed_until: timestamp}.
Failure Handling: If Redis is unreachable, the service defaults to "Allow" to prevent a total site outage.
Wrap Up

Advanced Topics

Trade-offs: We choose Eventual Consistency for multi-region setups. If a user moves between regions quickly, they might get a fresh quota. This is a PACELC trade-off: preferring Latency (L) over Consistency (C).
Reliability & Failure Handling:
Thundering Herd: Use a jittered TTL on counter keys in Redis so they don't all expire at the exact same second.
Bottleneck Analysis:
Hot Shards: A single celebrity user could overwhelm one Redis shard. Mitigation: Use a sub-key (e.g., user_id:minute_interval) to spread load or use local caching for top-1% users.
Security:
Ensure the X-Forwarded-For header is trusted only from the Load Balancer to prevent IP spoofing.
Distinguishing Insights:
Sliding Window Counter Optimization: Instead of storing every timestamp (which is O(N) memory), use a hybrid approach: Count(current_window) + Count(previous_window) * overlap_weight. This provides high accuracy with O(1) memory.