The Question

Scalable Distributed Rate Limiter

Design a high-performance distributed rate-limiting system capable of handling millions of requests per second across a global microservices architecture. The system must support various limiting algorithms (e.g., Sliding Window, Token Bucket) and offer different granularities such as per-user, per-IP, and per-API key. Key constraints include sub-millisecond latency overhead, high availability even during storage failures, and the ability to handle 'hot' keys effectively. Explain your choice of state management, how you handle race conditions in a distributed environment, and your strategy for fail-soft operations.

Redis

Lua

API Gateway

Consistent Hashing

Circuit Breaker

L1/L2 Caching

Questions & Insights

Clarifying Questions

Scale & Performance: What is the expected peak traffic (QPS), and what is the maximum allowable latency overhead for the rate-limiting check?

Granularity: Are we limiting by User ID, IP Address, API Key, or a combination of these?

Accuracy: Is a small margin of error acceptable (e.g., ±5%), or do we require strict accuracy for billing or high-security purposes?

Architecture: Is this a centralized global rate limiter, or should it be distributed across multiple regional data centers?

Action: When a limit is exceeded, do we simply drop the request (429 Too Many Requests), or is there a requirement for request queuing/shaping?

Assumptions for MVP:

Scale: Support up to 100,000 QPS.

Latency: Must add < 5ms to the total request round-trip.

Granularity: User ID and IP-based.

Accuracy: Highly accurate but prioritizes availability (eventual consistency in multi-region is acceptable).

Environment: Distributed microservices environment.

Thinking Process

Core Bottleneck: The primary challenge is the "Race Condition" in a distributed environment when multiple application servers try to update the same counter simultaneously.

Step 1: Choose the Algorithm: For an MVP that balances accuracy and memory, the Sliding Window Counter algorithm is superior to Fixed Window (boundary issues) or Token Bucket (locking complexity).

Step 2: State Management: Use a high-performance, in-memory data store (Redis) with atomic operations to handle shared state across distributed nodes.

Step 3: Optimization: Implement local in-memory caching for "hot" keys to reduce Redis hits and improve latency for heavily throttled users.

Step 4: Resilience: Establish a "Fail-Open" strategy so that if the rate limiter service or Redis goes down, traffic is allowed through to ensure system availability.

Bonus Points

Redis Lua Scripts: Use Lua scripting to group "Read-and-Update" operations into a single atomic execution, eliminating race conditions without distributed locks.

Hybrid Tiering: Implement a two-tier rate limiting strategy—Global (Redis) for strict enforcement and Local (In-memory) for high-frequency burst protection.

Client Hinting: Return headers like X-RateLimit-Remaining and Retry-After to allow well-behaved clients to self-throttle.

Dynamic Config: Use a configuration service (e.g., Etcd or Consul) to update rate limits in real-time without redeploying services.

Design Breakdown

Functional Requirements

Core Use Cases:

Intercept incoming requests and determine if they exceed defined limits.

Return HTTP 429 status code for throttled requests.

Support multiple bucket definitions (e.g., 100 req/min and 5000 req/hour).

Scope Control:

In-Scope: Distributed counter management, basic algorithm implementation, and headers.

Out-of-Scope: Request queuing (Leaky Bucket), sophisticated fraud detection, and complex billing integration.

Non-Functional Requirements

Scale: Horizontal scaling of the rate-limiting service.

Latency: Sub-5ms overhead; Redis reads must be optimized.

Availability & Reliability: 99.99% availability; the system must fail-open.

Consistency: Strong consistency per key within a single region via Redis.

Fault Tolerance: Handle Redis cluster partitions gracefully.

Security: Prevent attackers from bypassing the limiter via header spoofing.

Estimation

Traffic: 100k QPS (Peak).

Storage:

Each counter key: user_id:api_endpoint (~64 bytes).

Value: Integer (~8 bytes).

10 Million active users/keys.

Total RAM: 10M 72 bytes ≈ 720 MB** (Well within a single Redis node, but clustered for HA).

Bandwidth:

Request/Response to Redis: ~100 bytes.

100k QPS 100 bytes = 10 MB/s** internal network overhead.

Blueprint

Concise Summary: A sidecar/middleware-based rate limiter that uses a Redis-backed Sliding Window Counter to track request frequencies across a distributed cluster.

Major Components:

API Gateway / Middleware: The entry point that intercepts requests and queries the Rate Limiter Service.

Rate Limiter Service: A stateless service that calculates if a request is allowed based on the configured policy.

Redis Cache: The centralized state store for counters, using Lua scripts for atomicity.

Config Store: A simple repository (ConfigMap or S3) to store rate limit rules (e.g., tier_1: 100/min).

Simplicity Audit: This design avoids complex distributed locking and heavy-duty messaging by leveraging Redis's native atomic increments and TTL features.

Architecture Decision Rationale:

Why this architecture?: Redis is industry-standard for this use case because its data structures (Hashes/Sorted Sets) and atomic operations provide the performance needed to stay in the critical request path.

Functional Requirement Satisfaction: Meets all requirements for distributed tracking and 429 response handling.

Non-functional Requirement Satisfaction: Scaling is achieved by adding Redis shards; low latency is achieved by keeping state in-memory.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling: Stateless instances of the Rate Limiter Service deployed across multiple Availability Zones (AZs). Scaling is triggered by CPU or Request Count.

API Schema Design:

Internal Endpoint: POST /v1/check

Request: { "key": "user_123", "weight": 1, "context": "login_api" }

Response: { "allowed": true, "remaining": 99, "reset_time": 1625000000 }

SLA: < 2ms p99 latency.

Resilience & Reliability:

Circuit Breaker: If the Rate Limiter Service or Redis latency exceeds 10ms, the Gateway trips the circuit and allows all traffic (Fail-Open).

Retry Policy: No retries for the check itself (to save time); if the first call fails, fail-open immediately.

Storage

Access Pattern: Extremely high-frequency Read-Modify-Write (RMW) cycle.

Database Table Design (Config Store):

policy_id: UUID (Primary Key)

resource_path: String (Index)

limit_count: Integer

window_size: Seconds

Technical Selection: Redis (specifically Redis Sorted Sets or Hashes for Sliding Window).

Distribution Logic: Sharded by key (UserID/IP) using consistent hashing to ensure all requests for a single user hit the same Redis shard, maintaining counter accuracy.

Cache

Purpose & Justification: Redis is* the primary state store here, but we implement an L1 Local Cache** (Guava or Caffeine) inside the Rate Limiter Service to store "Known Banned" keys for 1-5 seconds to protect Redis from targeted DDoS.

Key-Value Schema: key: {allowed_until: timestamp}.

Failure Handling: If Redis is unreachable, the service defaults to "Allow" to prevent a total site outage.

Wrap Up

Advanced Topics

Trade-offs: We choose Eventual Consistency for multi-region setups. If a user moves between regions quickly, they might get a fresh quota. This is a PACELC trade-off: preferring Latency (L) over Consistency (C).

Reliability & Failure Handling:

Thundering Herd: Use a jittered TTL on counter keys in Redis so they don't all expire at the exact same second.

Bottleneck Analysis:

Hot Shards: A single celebrity user could overwhelm one Redis shard. Mitigation: Use a sub-key (e.g., user_id:minute_interval) to spread load or use local caching for top-1% users.

Security:

Ensure the X-Forwarded-For header is trusted only from the Load Balancer to prevent IP spoofing.

Distinguishing Insights:

Sliding Window Counter Optimization: Instead of storing every timestamp (which is

O(N)

memory), use a hybrid approach: Count(current_window) + Count(previous_window) * overlap_weight. This provides high accuracy with

O(1)

memory.