The Question

Distributed API Rate Limiter for Large-Scale AI Services

Design a distributed rate-limiting system capable of handling 1M+ requests per minute for an AI provider like OpenAI. The system must support complex metrics including Requests Per Minute (RPM) and Tokens Per Minute (TPM) across hierarchical levels (Organizations and API Keys). Focus on achieving sub-5ms latency, high availability with fail-open capabilities, and handling the unique challenge of 'token-based' limiting where the exact cost is only known after the request completes.

Redis

gRPC

PostgreSQL

Lua

Kubernetes

API Gateway

LRU Cache

Questions & Insights

Clarifying Questions

Scale and Throughput: What is the expected peak throughput?

Assumption: 1,000,000 Requests Per Minute (RPM) with high burstiness.

Limiting Metrics: Are we limiting only by request count, or also by payload size/tokens (e.g., GPT-4 TPM - Tokens Per Minute)?

Assumption: We must support both RPM (Requests) and TPM (Tokens).

Accuracy vs. Latency: Is strict accuracy required, or is "eventual consistency" acceptable to reduce latency?

Assumption: Low latency (<5ms overhead) is critical. High accuracy is required, but "fail-open" is preferred over blocking traffic if the rate limiter itself fails.

Granularity: Are limits applied at the User, Organization, or API Key level?

Assumption: Hierarchical limiting (Org-level global limits + individual API Key overrides).

Thinking Process

Core Bottleneck: The primary challenge is maintaining a globally consistent counter across thousands of distributed application nodes without introducing significant network RTT (Round Trip Time).

Algorithm Selection: Why choose Sliding Window Counter? It provides a smooth rate-limiting experience and handles edge cases of "boundary bursts" better than Fixed Window, while being more memory-efficient than Sliding Window Log.

Atomic Operations: How do we prevent race conditions? We use Redis Lua Scripts so that each "check-and-increment" operation runs atomically on the server in a single round trip.

Fail-Open Mechanism: If the Rate Limit service or Redis cluster becomes unreachable, the system must allow the request to pass to ensure OpenAI services remain available (Availability over Consistency).
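The fail-open rule above can be sketched as a thin wrapper around the limiter call. This is a minimal illustration, not the service's actual code: `check_with_fail_open`, `healthy_deny`, and `redis_down` are hypothetical names, and the choice to treat only connection and timeout errors as "limiter unavailable" is an assumption.

```python
from typing import Callable


def check_with_fail_open(check: Callable[[], bool]) -> bool:
    """Return the limiter's decision, or allow the request if the limiter fails."""
    try:
        return check()
    except (ConnectionError, TimeoutError):
        # Limiter or Redis unreachable: fail open (Availability over Consistency).
        return True


def healthy_deny() -> bool:
    # Hypothetical stand-in: limiter is reachable and the quota is exhausted.
    return False


def redis_down() -> bool:
    # Hypothetical stand-in: simulates an unreachable Redis cluster.
    raise ConnectionError("redis unreachable")


print(check_with_fail_open(healthy_deny))  # limiter's own decision is respected
print(check_with_fail_open(redis_down))    # outage results in "allow"
```

A real deployment would also bound the call with a short timeout so that a slow limiter cannot consume the 5ms latency budget.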

Bonus Points

Token-Aware Limiting: Unlike standard REST APIs, LLM APIs require post-request accounting. We implement a "reserve and reconcile" pattern where we estimate tokens upfront and reconcile the actual usage after the stream completes.

Redis Cell / Cell-based Isolation: Using Redis Cluster hash tags (wrapping the Organization ID in braces inside the key) so that all keys for a specific Organization land on the same shard. This keeps multi-key Lua scripts on a single node and minimizes cross-node communication.

Client-Side Throttling Guidance: Return Retry-After (a standard HTTP header) together with X-RateLimit-Reset, following the de facto X-RateLimit-* convention and the IETF RateLimit header drafts, so that "polite" clients can self-throttle.

Local L1 Cache: Using a small in-memory cache (5-10 second TTL) for "hot" Organizations to reject obvious over-limit traffic without even hitting Redis.
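The "reserve and reconcile" pattern from the Token-Aware Limiting point above can be illustrated with an in-memory counter. This is a sketch under stated assumptions: `TokenBudget` is a hypothetical stand-in for a per-window TPM counter that would live in Redis, and refunding over-estimates immediately is one possible policy among several.

```python
class TokenBudget:
    """Hypothetical in-memory stand-in for a per-window TPM counter."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.used = 0

    def reserve(self, estimated_tokens: int) -> bool:
        """Charge the estimate upfront; reject if it would exceed the limit."""
        if self.used + estimated_tokens > self.limit:
            return False
        self.used += estimated_tokens
        return True

    def reconcile(self, estimated_tokens: int, actual_tokens: int) -> None:
        """Replace the estimate with the actual cost once the stream completes."""
        # A positive delta charges the shortfall; a negative delta refunds.
        self.used = max(0, self.used + (actual_tokens - estimated_tokens))


budget = TokenBudget(limit=1000)
if budget.reserve(600):           # estimate before the LLM runs
    budget.reconcile(600, 450)    # actual usage known after completion
print(budget.used)
```

Note that reconciliation can push usage above the limit when the estimate was too low; the next reservation is then rejected until the window rolls over.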

Design Breakdown

Functional Requirements

Core Use Cases:

Check if an API Key has exceeded its RPM or TPM quota.

Increment usage counters atomically.

Return remaining quota and reset time in response headers.

Support dynamic limit updates via an Admin Dashboard.

Scope Control:

In-scope: Distributed rate limiting, hierarchical quota management, and low-latency validation.

Out-of-scope: Billing/Payment processing, User Authentication (assumed handled by Auth service), and long-term analytics.

Non-Functional Requirements

Scale: Support millions of concurrent users and billions of daily requests.

Latency: P99 overhead added to the API request must be < 5ms.

Availability & Reliability: 99.99% availability. Use a "Fail-Open" strategy.

Consistency: Distributed consistency for counters within a specific region.

Fault Tolerance: Handle Redis shard failures gracefully.

Estimation

Traffic: 1M RPM = ~16,600 QPS.

Storage:

Key size: OrgID:KeyID:Metric (~50 bytes).

Value: Counter + Timestamp (~16 bytes).

10M active keys × 66 bytes ≈ 660 MB.

Even with metadata, the state fits easily in a standard 16GB Redis node.

Bandwidth: 16.6k QPS × 200 bytes/request ≈ 3.3 MB/s (negligible for modern NICs).
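The back-of-envelope numbers above can be reproduced in a few lines. The variable names are illustrative, and decimal megabytes (10^6 bytes) are assumed; the text rounds QPS down slightly, so the unrounded figures differ in the last digit.

```python
requests_per_minute = 1_000_000
qps = requests_per_minute / 60                 # average queries per second

active_keys = 10_000_000
bytes_per_entry = 50 + 16                      # key (~50 B) + counter and timestamp (~16 B)
storage_mb = active_keys * bytes_per_entry / 1_000_000

bytes_per_request = 200
bandwidth_mb_s = qps * bytes_per_request / 1_000_000

print(f"{qps:,.0f} QPS, {storage_mb:.0f} MB, {bandwidth_mb_s:.2f} MB/s")
```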

Blueprint

Concise Summary: A sidecar-compatible Rate Limit Service that utilizes a distributed Redis Cluster for atomic counter management using the Sliding Window Counter algorithm.

Major Components:

API Gateway: Entry point that extracts metadata (API Key, Org ID) and calls the Rate Limiter.

Rate Limit Service: A stateless Go-based service that executes the limiting logic.

Redis Cluster: The source of truth for all counters, using Lua scripts for atomicity.

Config Store: A persistent DB (PostgreSQL) to store the static limits (e.g., Free Tier = 3 RPM).

Simplicity Audit: This design avoids complex message queues or stream processing for the critical path to keep latency at a minimum.

Architecture Decision Rationale:

Why this architecture?: Redis is industry-standard for high-speed counters. Go provides the necessary concurrency and low memory footprint for the service layer.

Functional Satisfaction: Hierarchical keys in Redis (org:{id}:key:{id}) support multi-level limiting.

Non-functional Satisfaction: Redis Cluster provides horizontal scaling and high availability.
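To make the hierarchical key layout concrete, the sketch below builds counter keys in which only the Organization ID is wrapped in braces, so it acts as the Redis Cluster hash tag, and then computes the slot the way the Redis Cluster specification describes (CRC16 of the hash tag, modulo 16384). The helper names `counter_key` and `cluster_slot` are hypothetical, and the exact key layout is an assumption based on the schema in this design.

```python
import binascii


def counter_key(org_id: str, key_id: str, metric: str, window_start: int) -> str:
    # Only org_id sits inside braces, so it alone is the cluster hash tag.
    return f"ratelimit:{{{org_id}}}:{key_id}:{metric}:{window_start}"


def cluster_slot(key: str) -> int:
    """Compute the Redis Cluster hash slot for a key, honoring hash tags."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        # A non-empty tag between the first "{" and the next "}" replaces the key.
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    # binascii.crc_hqx with initial value 0 is CRC16/XMODEM, as used by Redis Cluster.
    return binascii.crc_hqx(key.encode(), 0) % 16384


rpm_key = counter_key("org_42", "key_a", "rpm", 1712345640)
tpm_key = counter_key("org_42", "key_b", "tpm", 1712345640)
print(rpm_key, cluster_slot(rpm_key))
print(tpm_key, cluster_slot(tpm_key))  # same slot, because the hash tag matches
```

Because every key of one Organization maps to a single slot, a very large Organization can become a hot shard; that is the trade-off of this layout.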

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: Global Anycast IP routes traffic to the nearest regional API Gateway.

API Gateway:

Authentication: Validates JWT/API Key before calling the Rate Limiter.

Headers: Injects X-RateLimit-Remaining into the response sent back to the client.

Circuit Breaking: If the Rate Limit Service is down, the Gateway allows traffic through (Fail-Open).
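As an illustration of the header injection step, the sketch below turns a limiter decision into response headers. The function name `rate_limit_headers` and the extra `X-RateLimit-Limit` header are assumptions; the design itself only names X-RateLimit-Remaining, X-RateLimit-Reset, and Retry-After.

```python
from typing import Dict


def rate_limit_headers(limit: int, remaining: int, reset_at: int,
                       now: int, allowed: bool) -> Dict[str, str]:
    """Build client-facing rate limit headers from a limiter decision."""
    headers = {
        "X-RateLimit-Limit": str(limit),               # assumed extra header
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_at),            # Unix seconds
    }
    if not allowed:
        # Retry-After is expressed in seconds until the window resets.
        headers["Retry-After"] = str(max(0, reset_at - now))
    return headers


print(rate_limit_headers(5000, 4500, 1712345678, 1712345650, allowed=True))
print(rate_limit_headers(5000, 0, 1712345678, 1712345650, allowed=False))
```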

Service

Topology & Scaling: Stateless Go microservices deployed in K8s, scaling based on CPU and Request Count.

API Schema Design:

Internal Endpoint: POST /v1/check-limit

Protocol: gRPC (for internal speed and low serialization overhead); the check above is exposed as a single unary call.

Request: { key_id, org_id, metric: "tokens", increment: 500 }

Response: { allowed: true, remaining: 4500, reset_at: 1712345678 }

Resilience: Use a local cache (LRU) for limit definitions (e.g., Tier 1 = 50,000 TPM) to avoid hitting the Config DB on every request.
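A local cache for limit definitions might look like the following sketch: a small LRU with a TTL in front of a loader that stands in for the Config DB query. The class name `LimitCache`, the default capacity, and the 30-second TTL are all assumptions chosen for illustration.

```python
import time
from collections import OrderedDict
from typing import Callable, Tuple


class LimitCache:
    """LRU cache with per-entry TTL for limit definitions (hypothetical sketch)."""

    def __init__(self, loader: Callable[[str], int], capacity: int = 1024,
                 ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader          # stands in for the Config DB lookup
        self._capacity = capacity
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: "OrderedDict[str, Tuple[int, float]]" = OrderedDict()

    def __len__(self) -> int:
        return len(self._entries)

    def get(self, tier: str) -> int:
        now = self._clock()
        entry = self._entries.get(tier)
        if entry is not None and entry[1] > now:
            self._entries.move_to_end(tier)        # mark as most recently used
            return entry[0]
        value = self._loader(tier)                 # miss or expired: reload
        self._entries[tier] = (value, now + self._ttl)
        self._entries.move_to_end(tier)
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)      # evict least recently used
        return value


cache = LimitCache(lambda tier: {"tier1": 50_000}.get(tier, 0))
print(cache.get("tier1"))  # first call loads, later calls hit the cache
```

The TTL bounds how long a limit change made in the Admin Dashboard can go unnoticed by a service node.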

Storage

Access Pattern: 100% Key-Value lookups.

Technical Selection: PostgreSQL for limit configurations.

Database Table:

limit_configs: id (PK), entity_type (Org/Key), entity_id, metric (RPM/TPM), limit_value.

Distribution: Read-replicas for the Config DB since writes (changing a user's limit) are rare compared to reads.
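One way to resolve the effective limit from limit_configs rows is sketched below. It assumes that a Key-level row overrides the Org-level value for that API Key and that the Org-level value is the fallback; the design does not spell out the precedence rule, so treat `effective_limit` and the `LimitConfig` tuple as hypothetical.

```python
from typing import Iterable, NamedTuple, Optional


class LimitConfig(NamedTuple):
    entity_type: str   # "org" or "key"
    entity_id: str
    metric: str        # "RPM" or "TPM"
    limit_value: int


def effective_limit(rows: Iterable[LimitConfig], org_id: str,
                    key_id: str, metric: str) -> Optional[int]:
    """Return the Key-level override if present, else the Org-level limit."""
    org_limit: Optional[int] = None
    key_limit: Optional[int] = None
    for row in rows:
        if row.metric != metric:
            continue
        if row.entity_type == "key" and row.entity_id == key_id:
            key_limit = row.limit_value
        elif row.entity_type == "org" and row.entity_id == org_id:
            org_limit = row.limit_value
    return key_limit if key_limit is not None else org_limit


rows = [
    LimitConfig("org", "org_42", "TPM", 50_000),
    LimitConfig("key", "key_a", "TPM", 10_000),
]
print(effective_limit(rows, "org_42", "key_a", "TPM"))  # Key-level override
print(effective_limit(rows, "org_42", "key_b", "TPM"))  # falls back to Org level
```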

Cache

Purpose: Redis Cluster acts as the distributed "counter" store.

Key-Value Schema:

Key: ratelimit:{org_id}:{key_id}:{metric}:{window_timestamp}

Data Structure: Redis String counters, one per window. (A Sorted Set would only be needed for a Sliding Window Log.)

Algorithm: Sliding Window Counter.

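Use INCRBY and EXPIRE on per-window counters.

Calculation: previous_window_count * (1 - weight) + current_window_count, where weight is the fraction of the current window that has already elapsed.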
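A worked example of the Sliding Window Counter estimate: the previous window's count is weighted by the share of it that still overlaps the sliding window, and the current window's count is added in full. The function names here are illustrative, not part of the design.

```python
def sliding_window_estimate(previous_count: int, current_count: int,
                            elapsed_fraction: float) -> float:
    """Estimate usage over the last full window.

    elapsed_fraction is how far we are into the current window (0.0 to 1.0).
    """
    return previous_count * (1.0 - elapsed_fraction) + current_count


def is_allowed(previous_count: int, current_count: int,
               elapsed_fraction: float, amount: int, limit: int) -> bool:
    estimate = sliding_window_estimate(previous_count, current_count, elapsed_fraction)
    return estimate + amount <= limit


# 15 seconds into a 60-second window: 75% of the previous window still counts.
# Estimate = 80 * 0.75 + 30 = 90.
print(sliding_window_estimate(80, 30, 0.25))
print(is_allowed(80, 30, 0.25, amount=10, limit=100))
print(is_allowed(80, 30, 0.25, amount=11, limit=100))
```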

Lua Script:

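  -- KEYS[1] = current-window counter, KEYS[2] = previous-window counter
  -- (in Redis Cluster both keys must share a hash tag so they map to one slot)
  -- ARGV[1] = limit, ARGV[2] = amount, ARGV[3] = fraction of the current window elapsed (0..1)
  local current_key, previous_key = KEYS[1], KEYS[2]
  local limit = tonumber(ARGV[1])
  local amount = tonumber(ARGV[2])
  local elapsed = tonumber(ARGV[3])
  -- GET returns false for a missing key, so fall back to 0
  local current = tonumber(redis.call("GET", current_key) or 0)
  local previous = tonumber(redis.call("GET", previous_key) or 0)
  -- Sliding Window Counter estimate of usage over the last 60 seconds
  local estimated = previous * (1 - elapsed) + current
  if estimated + amount > limit then
    return 0
  else
    redis.call("INCRBY", current_key, amount)
    -- Keep the key for two windows so it can later serve as the previous window
    redis.call("EXPIRE", current_key, 120)
    return 1
  end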

Wrap Up

Advanced Topics

Trade-offs: Keeping the counters in a central Redis Cluster favors Consistency at the cost of one extra network hop per request, while the Fail-Open mechanism prioritizes Availability over Strictness.

Reliability: If Redis becomes a bottleneck, we can implement Local Batching: a service node accumulates usage locally (for example, 10 units) before sending a single INCRBY 10 to Redis, reducing Redis round trips by up to 10x at the cost of slight inaccuracy, since other nodes do not see the buffered usage until it is flushed.

Token Estimation: For OpenAI-specific TPM limiting, we run a lightweight tokenizer (for example, a BPE tokenizer) in the Rate Limit service to estimate the cost before the LLM executes, then correct the counter with the exact count after the response is generated.

Security: Prevent "Rate Limit DoS" by applying an L4 rate limit at the Cloud Load Balancer level (limiting requests per IP) before they reach the more expensive L7 rate limiter.
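The Local Batching idea from the Reliability point above can be sketched as a small per-node accumulator. `LocalBatcher` is a hypothetical helper: the `flush` callback stands in for the INCRBY call to Redis, and the threshold of 10 mirrors the example in the text. A production version would also flush on a timer so that low-traffic keys do not hold usage indefinitely.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


class LocalBatcher:
    """Accumulate increments per key and flush them in batches (hypothetical sketch)."""

    def __init__(self, flush: Callable[[str, int], None], threshold: int = 10) -> None:
        self._flush = flush            # stands in for redis INCRBY key amount
        self._threshold = threshold
        self._pending: Dict[str, int] = defaultdict(int)

    def add(self, key: str, amount: int = 1) -> None:
        self._pending[key] += amount
        if self._pending[key] >= self._threshold:
            # Send one combined increment instead of many small ones.
            self._flush(key, self._pending.pop(key))

    def pending(self, key: str) -> int:
        """Usage buffered locally and not yet visible to other nodes."""
        return self._pending.get(key, 0)


sent: List[Tuple[str, int]] = []
batcher = LocalBatcher(lambda key, amount: sent.append((key, amount)))
for _ in range(25):
    batcher.add("ratelimit:{org_42}:key_a:rpm")
print(sent)                                             # two flushes of 10
print(batcher.pending("ratelimit:{org_42}:key_a:rpm"))  # 5 still buffered
```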