The Question

Scalable Distributed Rate Limiter Design

Design a high-performance, distributed rate limiting service capable of handling millions of requests per second across a global API infrastructure. The system must support flexible windowing algorithms (e.g., sliding window), ensure sub-millisecond latency overhead, and remain resilient to component failures. Detail the trade-offs between consistency and availability, the choice of storage for real-time counters, and how to handle global synchronization challenges.

Redis

PostgreSQL

gRPC

Kafka

Lua

Kubernetes

NoSQL

Questions & Insights

Clarifying Questions

Scale and Traffic: What is the expected peak throughput (QPS) and the number of unique entities (users/IPs) we are limiting?

Granularity: Are we limiting by User ID, API Key, IP address, or a combination of these?

Latency Budget: What is the maximum acceptable latency overhead added by the rate limiting check (e.g., < 2ms)?

Strictness: Is "soft" rate limiting (eventual consistency) acceptable to improve performance, or do we need "hard" atomic limits?

Deployment: Is this a global service across multiple regions, or localized to a single data center?

Assumptions for this design:

Scale: 1,000,000+ QPS at peak.

Entities: 100 million active keys (users/IPs).

Latency: Sub-millisecond local overhead; < 5ms total overhead.

Consistency: High accuracy required, utilizing a distributed counter.

Scope: Internal high-performance service used by an API Gateway.

Thinking Process

Core Bottleneck: The primary challenge is the "Read-Modify-Write" race condition in a distributed environment and the network latency of checking a central store.

Progressive Approach:

How do we track counts across multiple server nodes without losing accuracy? (Centralized Cache vs. Local Sticky Sessions).

Which algorithm optimizes memory vs. precision? (Token Bucket vs. Sliding Window Counter).

How do we ensure the rate limiter doesn't become a Single Point of Failure (SPOF) or a latency bottleneck? (Fail-open strategy + Local caching).

How do we handle global synchronization without destroying performance? (Local-regional counters with async global sync).

Bonus Points

Token Bucket with Redis Lua: Using Lua scripts to execute the "check-and-decrement" logic atomically inside Redis to avoid race conditions without heavy locks.

Hierarchical Rate Limiting: Implementing tiered limits (e.g., 100/sec, 5000/hour) using multiple buckets to prevent burst abuse while allowing long-term usage.

Client-Side Throttling Signals: Leveraging HTTP headers (X-RateLimit-Retry-After) and 429 status codes to push the "waiting" logic back to the client, reducing server load.

Hybrid Local-Remote State: Implementing "In-memory bursting" where a fraction of the quota is cached locally on the service node to reduce Redis hits for high-volume keys.

Design Breakdown

Functional Requirements

Core Use Cases:

Deny requests exceeding a defined threshold for a specific key.

Support multiple time windows (second, minute, hour).

Provide real-time remaining quota info in response headers.

Scope Control:

In-scope: Distributed counting logic, configuration management, and the decision engine.

Out-of-scope: User authentication (assumed pre-validated), IP geo-location, billing/monetization logic.

Non-Functional Requirements

Scale: Horizontal scalability to handle millions of requests.

Latency: Ultra-low overhead (< 2ms) to ensure the API remains responsive.

Availability: 99.99% availability; the system must "fail-open" if the rate limiter service is down.

Consistency: Distributed atomic consistency for strict tiers; eventual consistency for free tiers.

Fault Tolerance: Resilience against Redis partition failures or node crashes.

Estimation

Traffic: 1M QPS.

Storage: 100M active keys.

Each key: user_id (8 bytes) + timestamp/counter (8 bytes) + overhead

\approx

32 bytes.

Total RAM:

100M \times 32 \text{ bytes} \approx 3.2 \text{ GB}

Bandwidth:

Request: Key (32 bytes).

Response: Boolean + Headers (approx 100 bytes).

Total network overhead is minimal compared to standard API payloads.

Blueprint

Concise Summary: A sidecar or middleware-based Rate Limiter that leverages a distributed Redis cluster using the Sliding Window Counter algorithm to ensure atomic, low-latency request budgeting.

Major Components:

API Gateway/Middleware: Intercepts incoming requests and queries the Rate Limiter.

Rate Limiter Service: Stateless logic layer that executes the counting algorithms.

Redis Cluster: High-performance, in-memory store for real-time counters.

Config Store (Postgres): Persistent storage for per-tier or per-customer limit rules.

Simplicity Audit: This design avoids complex stream processing or heavy locking by using Redis Lua scripts, keeping the architecture lean and high-performance.

Architecture Decision Rationale:

Best for this problem: Redis provides the sub-millisecond latency required for a "blocking" check in the request path.

Functional Satisfaction: Supports flexible keys and windows via Redis-based logic.

Non-functional Satisfaction: Scalable via Redis Sharding and stateless service nodes.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: For global APIs, initial rate limiting (DDoS protection) happens at the CDN/WAF edge (e.g., Cloudflare) to drop volumetric attacks before they hit our infra.

Security & Perimeter: The API Gateway performs SSL termination and initial JWT validation before calling the Rate Limiter to ensure we only count authenticated/valid traffic.

Service

Topology & Scaling: Stateless Go or Rust services (for low GC pause) deployed in a Multi-AZ Kubernetes cluster. Scaled based on CPU and request latency.

API Schema Design:

POST /v1/check-limit

Protocol: gRPC (for performance) or REST.

Request: { "key": "user_123", "service": "payments", "cost": 1 }

Response: { "allowed": true, "remaining": 99, "reset_at": 1625000000 }

Resilience:

Fail-Open: If the Rate Limiter gRPC call times out (> 10ms) or returns a 5xx, the API Gateway defaults to allowed: true to prevent blocking legitimate users during a system outage.

Circuit Breaker: API Gateway uses a circuit breaker to stop calling the Rate Limiter if it's consistently failing.

Storage

Access Pattern: Extremely high write-heavy (incrementing counters) and high read-heavy (checking limits).

Database Table Design (Config DB):

policy_id: UUID (PK)

target_id: String (User_ID or Tier_ID)

limit_count: BigInt

window_size: Integer (seconds)

Technical Selection: Redis for counters (speed); PostgreSQL for configurations (relational integrity).

Distribution Logic: Redis Cluster with sharding based on the Rate-Limit-Key to avoid hot spots.

Cache

Purpose & Justification: Redis is the "source of truth" for real-time counts.

Key-Value Schema:

Key: rl:{user_id}:{window_timestamp}

Structure: Hash or Sorted Set (for sliding window).

TTL: Set to window_size * 2 to ensure auto-cleanup.

Technical Selection: Redis.

Failure Handling: Use Redis Sentinel or Cluster for high availability. In extreme cases, local in-memory caches (LRU) on the service nodes store the last 1 second of "Allow" decisions to mitigate Redis latency spikes.

Messaging

Purpose & Decoupling: Asynchronously log rate-limiting events (denials/approvals) for billing and analytics without blocking the request path.

Technical Selection: Kafka or SQS.

Failure Handling: Dead-letter queues for failed analytics events.

Data Processing

Processing Model: Batch processing of audit logs.

Purpose: Aggregate usage data to update "Config DB" (e.g., automatically upgrading/downgrading tiers based on usage patterns).

Technical Selection: Spark or simple Cron jobs (for MVP).

Wrap Up

Advanced Topics

Trade-offs: We choose Availability over Consistency (AP) by implementing a fail-open strategy. It is better to occasionally let an over-limit request through than to take down the entire API because the rate limiter is slow.

Sliding Window vs. Fixed Window: Fixed windows suffer from "bursting" at window boundaries (e.g.,

2 \times

limit at 11:59 and 12:01). We use Sliding Window Counter (approximated) by summing the current window and a percentage of the previous window to stay memory-efficient while being accurate.

Optimization: Batching. For extremely high-volume keys (e.g., a popular public bot), the middleware can batch increments (count +10) and sync to Redis once every 100ms to reduce network IO.

Security: Prevent "Key Exhaustion" attacks where an attacker sends random keys to fill up Redis memory. Implement a "Default Policy" and reject requests without valid Auth tokens before the rate limiter is reached.