The Question
Design

Universal LLM Gateway & Multi-Tenant Proxy Service

Design a high-scale, external-facing LLM platform that provides a unified API for multiple foundation models (e.g., OpenAI, Anthropic, Llama). The system must support multi-tenant API key management, streaming responses via SSE/WebSockets, real-time token-based quota enforcement, and asynchronous usage billing. Address the specific challenges of provider rate-limit management, model-agnostic routing, and minimizing latency overhead for streaming traffic at a scale of 10M+ daily requests.
Redis
Kafka
PostgreSQL
ClickHouse
Flink
SSE
Envoy
vLLM
gRPC
Prometheus
Questions & Insights

Clarifying Questions

What is the primary persona and scale? (Assumption: Internal developers and external B2B clients. Target: 10M requests/day, peak 500 QPS).
Does the platform host models or proxy to providers (OpenAI/Anthropic)? (Assumption: MVP focuses on a unified proxy/gateway to external providers and self-hosted models via vLLM/Triton).
What are the latency requirements? (Assumption: Focus on Time-To-First-Token (TTFT) for streaming responses, aiming for <200ms overhead).
Are we handling fine-tuning or just inference? (Assumption: Inference-only for MVP; fine-tuning is out of scope).
How is billing handled? (Assumption: Token-based usage tracking with per-tenant quotas).

Thinking Process

Core Bottleneck: LLM requests are long-lived (streaming) and expensive. The system must handle high-concurrency persistent connections while tracking token usage accurately for billing.
Progressive Walkthrough:
How do we provide a single unified API while abstracting different provider schemas?
How do we ensure "fair use" so one tenant doesn't saturate the provider's rate limits (TPM/RPM)?
How do we minimize costs through caching and intelligent model routing?
How do we capture usage metrics asynchronously without adding latency to the inference path?

Bonus Points

Semantic Caching: Using a Vector Database (e.g., Pinecone/Milvus) to cache responses for semantically similar prompts, reducing costs by up to 30%.
Fallback & Circuit Breaking: Implementing "Model Fallback" logic (e.g., if GPT-4 is down/slow, fallback to Claude 3 or a local Llama-3 instance).
Dynamic Cost Routing: Routing requests to different providers based on real-time spot pricing or remaining monthly commit quotas.
Privacy Scrubbing: Automated PII (Personally Identifiable Information) detection and masking before sending data to external providers.
Design Breakdown

Functional Requirements

Core Use Cases:
Unified API for Chat Completions (REST/Streaming).
API Key management and Tenant isolation.
Real-time token usage tracking and quota enforcement.
Model fallback and load balancing across multiple provider keys.
Scope Control:
In-scope: Inference proxy, Rate limiting, Usage logging, Multi-provider support.
Out-of-scope: Model training, RAG (Retrieval Augmented Generation) pipeline, Fine-tuning orchestration.

Non-Functional Requirements

Scale: Support 100M+ tokens per day; horizontally scalable proxy layer.
Latency: Minimal overhead (<50ms) on top of the provider's intrinsic latency.
Availability: 99.99% uptime; the proxy must not be a single point of failure.
Consistency: Eventual consistency for usage/billing metrics; strong consistency for API key validation.
Fault Tolerance: Automatic retries with exponential backoff for provider 429/5xx errors.
Security: TLS 1.3, API Key encryption at rest, and request/response logging for audit trails.
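
The retry policy named under Fault Tolerance could be sketched roughly as follows. This is a minimal illustration rather than the design's actual implementation: the function names, the 0.5 s base delay, the 8 s cap, the attempt count, and the choice of full jitter are all assumptions, and send is a hypothetical callable standing in for the provider call.

```python
import random
import time

# Status codes treated as retryable (429 plus common transient 5xx).
RETRYABLE = {429, 500, 502, 503, 504}

def backoff_delay(attempt, base=0.5, cap=8.0, rng=random.random):
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))

def call_with_retries(send, max_attempts=4, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    send is a hypothetical callable returning (status_code, body).
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE:
            return status, body
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt))
    # Out of attempts: surface the last provider response to the caller.
    return status, body
```

For streaming requests, a retry like this is probably only safe before the first token has been forwarded to the client; after that point the proxy would have to surface the error instead.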

Estimation

Traffic: 10M requests/day ≈ 116 requests/sec (avg). Peak ≈ 500 QPS.
Storage:
Metadata (Keys/Users): <10 GB.
Usage Logs: 10M requests/day * 1 KB/log = 10 GB/day (about 3.65 TB/year).
Bandwidth:
Ingress: 500 QPS * 2KB (prompt) = 1 MB/s.
Egress (Streaming): 500 QPS * 10KB (avg response) = 5 MB/s.
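
The figures in this section can be reproduced with a few lines of arithmetic. The sketch below uses only the inputs stated in the estimate and assumes decimal units (1 KB = 1,000 bytes); the variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400

requests_per_day = 10_000_000
peak_qps = 500
log_bytes = 1_000        # 1 KB per usage log
prompt_bytes = 2_000     # 2 KB average prompt
response_bytes = 10_000  # 10 KB average response

avg_qps = requests_per_day / SECONDS_PER_DAY
log_gb_per_day = requests_per_day * log_bytes / 1e9
log_tb_per_year = log_gb_per_day * 365 / 1e3
ingress_mb_s = peak_qps * prompt_bytes / 1e6
egress_mb_s = peak_qps * response_bytes / 1e6

print(f"avg QPS        : {avg_qps:.1f}")
print(f"usage logs     : {log_gb_per_day:.1f} GB/day, {log_tb_per_year:.2f} TB/year")
print(f"peak ingress   : {ingress_mb_s:.1f} MB/s")
print(f"peak egress    : {egress_mb_s:.1f} MB/s")
```

Bandwidth is small at this scale; the numbers suggest the real pressure is on concurrent long-lived connections rather than on bytes per second.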

Blueprint

Concise Summary: A multi-tenant LLM Gateway that abstracts model providers, enforces quotas, and captures streaming usage metrics via an asynchronous pipeline.
Major Components:
API Gateway: Handles authentication, SSL termination, and global rate limiting.
Model Proxy Service: The core logic engine that transforms requests, manages streaming (SSE), and performs model routing.
Usage Aggregator: Asynchronously processes token counts from a message bus to update tenant balances.
Redis Cache: Stores active API keys and real-time rate limit counters.
Simplicity Audit: The design avoids complex orchestration of model weights and focuses on a "Gateway Pattern," which is the fastest way to deliver value.
Architecture Decision Rationale:
Why this architecture?: The Proxy pattern allows for a unified SDK (e.g., OpenAI-compatible) regardless of the underlying model, enabling "provider-swapping" without client-side changes.
Functional Satisfaction: Meets the need for multi-tenancy and unified access.
Non-functional Satisfaction: Scalable through stateless proxies and high availability through redundant provider paths.
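
To make the "provider-swapping" idea concrete, here is a rough sketch of one adapter inside the Model Proxy Service. It assumes the target provider expects system prompts as a top-level "system" string and requires "max_tokens", which matches the publicly documented shape of Anthropic's Messages API as far as I know but should be checked against current provider documentation. The function name, the target_model parameter, and the default of 1024 are hypothetical.

```python
def to_anthropic_request(openai_req, target_model, default_max_tokens=1024):
    """Translate an OpenAI-style chat request into an Anthropic-style one.

    Assumptions: system prompts move to a top-level "system" field, and
    "max_tokens" must always be present on the target side.
    """
    system_parts = []
    messages = []
    for msg in openai_req["messages"]:
        if msg["role"] == "system":
            system_parts.append(msg["content"])
        else:
            messages.append({"role": msg["role"], "content": msg["content"]})

    out = {
        "model": target_model,
        "messages": messages,
        "max_tokens": openai_req.get("max_tokens", default_max_tokens),
        "stream": openai_req.get("stream", False),
    }
    if system_parts:
        out["system"] = "\n".join(system_parts)
    return out
```

The client keeps sending the OpenAI-compatible schema; only the adapter chosen by the router changes when traffic moves between providers.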

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: Global Load Balancing (GSLB) to route traffic to the nearest regional Model Proxy.
Security & Perimeter:
API Gateway: Uses Kong or Envoy to validate JWTs/API Keys.
Rate Limiting: Enforces "Hard Limits" (e.g., 50 QPS per user) at the edge to protect the Proxy Service.

Service

Topology & Scaling: Stateless Model Proxy deployed in Kubernetes across multiple Availability Zones. Scaling is based on "Active Connections" rather than CPU, given the long-lived nature of LLM streaming.
API Schema Design:
POST /v1/chat/completions: OpenAI-compatible schema.
GET /v1/models: Returns available models across all providers.
GET /v1/usage: Returns current billing period token consumption.
Resilience & Reliability:
Streaming Stability: Uses Server-Sent Events (SSE). The proxy maintains the connection to the client while consuming the provider stream.
Circuit Breakers: If OpenAI returns 503s consistently, the Router's circuit breaker for that provider trips and traffic shifts to Anthropic.
Observability:
RED Metrics: Request rate, Error rate, Duration (TTFT).
Traces: Distributed tracing (Jaeger/OpenTelemetry) to separate latency added by the proxy from latency of the provider.
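
The circuit-breaker behaviour described under Resilience & Reliability might look something like the sketch below. It is a simplified consecutive-failure breaker with a cool-down; the class name, the threshold of 5 failures, the 30 s reset window, and the pick_provider helper are assumptions for illustration, not part of the original design.

```python
import time

class CircuitBreaker:
    """Opens after N consecutive failures; allows a probe after a cool-down."""

    def __init__(self, failure_threshold=5, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: let a probe through once the cool-down has elapsed.
        return self.clock() - self.opened_at >= self.reset_after

    def record(self, ok):
        if ok:
            self.failures = 0
            self.opened_at = None
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()

def pick_provider(preference_order, breakers):
    """Return the first provider whose breaker allows traffic, else None."""
    for name in preference_order:
        if breakers[name].allow():
            return name
    return None
```

Because the proxies are stateless, each instance would hold its own breaker state in this sketch; sharing that state across instances is a separate design decision.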

Storage

Access Pattern:
High read for API keys and configurations.
High write for usage logs and audit trails.
Database Table Design:
Tenants: tenant_id, api_key_hash, plan_id, status.
Usage: tenant_id, model_id, tokens_in, tokens_out, timestamp.
Technical Selection:
PostgreSQL: For relational metadata (Tenants, API Keys).
ClickHouse: For high-volume usage analytics and billing data.
Distribution Logic: Sharded by tenant_id to ensure isolation and performance.
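
Since the Tenants table stores api_key_hash rather than the key itself, issuance and validation could be sketched as below. The key prefix, the key length, and the use of an unsalted SHA-256 digest are assumptions; an unsalted hash is only reasonable here because the keys are long random strings, and it keeps the hash usable as an indexed lookup column.

```python
import hashlib
import hmac
import secrets

def hash_api_key(key):
    """Hex SHA-256 digest of the key; this is what would be stored."""
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

def new_api_key(prefix="sk"):
    """Return (plaintext_key, key_hash). Only the hash is persisted."""
    key = f"{prefix}-{secrets.token_urlsafe(32)}"
    return key, hash_api_key(key)

def verify_api_key(presented, stored_hash):
    """Constant-time comparison of the presented key's hash with the stored one."""
    return hmac.compare_digest(hash_api_key(presented), stored_hash)
```

The plaintext key is shown to the tenant once at creation time; on each request the gateway hashes the presented key and looks it up (typically via the cache first).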

Cache

Purpose & Justification:
Rate Limiting: Redis stores sliding-window counters for Tokens-Per-Minute (TPM).
Semantic/Exact Cache: To avoid re-generating the same LLM response for identical prompts (e.g., common customer support questions).
Key-Value Schema:
ratelimit:{tenant_id}:{model_id} -> counter.
cache:{prompt_hash} -> response_json (TTL: 24h).
Failure Handling: If Redis is unavailable, the system fails open ("Allow All") to preserve availability, while each proxy instance falls back to a local in-memory counter for coarse emergency rate limiting.
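
The sliding-window TPM counter can be illustrated with an in-memory stand-in for Redis, which is also roughly what the emergency local limiter would need. The class and method names are hypothetical; the key format follows the ratelimit:{tenant_id}:{model_id} schema above. A Redis-backed version would need the check and the increment to happen atomically, which this sketch does not address.

```python
import time
from collections import defaultdict, deque

class SlidingWindowTPM:
    """Per-key token budget over a sliding time window (in-memory sketch)."""

    def __init__(self, limit_tokens, window_s=60.0, clock=time.monotonic):
        self.limit = limit_tokens
        self.window = window_s
        self.clock = clock
        self.events = defaultdict(deque)  # key -> deque of (timestamp, tokens)
        self.totals = defaultdict(int)    # key -> tokens currently in window

    def _evict(self, key, now):
        # Drop entries that have aged out of the window.
        q = self.events[key]
        while q and q[0][0] <= now - self.window:
            _, tokens = q.popleft()
            self.totals[key] -= tokens

    def try_consume(self, tenant_id, model_id, tokens):
        key = f"ratelimit:{tenant_id}:{model_id}"
        now = self.clock()
        self._evict(key, now)
        if self.totals[key] + tokens > self.limit:
            return False
        self.events[key].append((now, tokens))
        self.totals[key] += tokens
        return True
```

Counters are isolated per tenant and model, so one tenant exhausting its budget does not affect another tenant's requests to the same model.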

Messaging

Purpose & Decoupling: Decouples the critical inference path from the billing/analytics path.
Event / Topic Schema: llm.usage.events: {tenant_id, model, tokens_used, cost_est, latency_ms}.
Throughput & Partitioning: Partitioned by tenant_id so that usage records for a single tenant land on one partition and are processed in order.
Technical Selection: Kafka. Required for its durability and replayability in case the Usage Aggregator fails.
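
A rough sketch of how the proxy might shape a usage event and why keying by tenant_id preserves per-tenant order. The UsageEvent fields mirror the llm.usage.events schema above; the partition_for function is only an illustration of key-based partitioning (Kafka's default partitioner uses a different hash), and no Kafka client is called here.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class UsageEvent:
    tenant_id: str
    model: str
    tokens_used: int
    cost_est: float
    latency_ms: int

def partition_for(tenant_id, num_partitions):
    """Illustrative stable hash of the key onto a partition index."""
    digest = hashlib.md5(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

def encode(event):
    """Return (key_bytes, value_bytes) as a producer would send them."""
    key = event.tenant_id.encode("utf-8")
    value = json.dumps(asdict(event), sort_keys=True).encode("utf-8")
    return key, value
```

Every event for a given tenant maps to the same partition, so a single consumer sees that tenant's usage in the order it was produced. The trade-off is that a very large tenant can create a hot partition.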

Data Processing

Processing Model: Streaming aggregation.
Processing DAG: Source (Kafka) -> Enrich (Lookup tenant pricing) -> Aggregate (Sum tokens) -> Sink (ClickHouse).
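Technical Selection: Flink, or a simple Go consumer group for the MVP. Flink is preferred for Staff-level designs because its checkpointing provides exactly-once state semantics, which matters for billing accuracy; end-to-end exactly-once still depends on an idempotent or transactional sink.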
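
The Enrich and Aggregate stages of the DAG can be sketched in plain Python, roughly what a simple MVP consumer would do per batch. The event_id field used for de-duplication is not in the event schema above and is an assumption, as are the plan names and per-1K-token prices.

```python
from collections import defaultdict

# Hypothetical plan pricing, in currency units per 1,000 tokens.
PRICE_PER_1K = {"basic": 0.002, "pro": 0.0015}

def aggregate(events, tenant_plan, seen_ids=None):
    """Sum tokens and cost per (tenant_id, model), skipping replayed events.

    events      : iterable of dicts with event_id, tenant_id, model, tokens_used
    tenant_plan : dict mapping tenant_id -> plan name (the "Enrich" lookup)
    seen_ids    : optional set carried across batches for idempotency
    """
    seen = set() if seen_ids is None else seen_ids
    totals = defaultdict(lambda: {"tokens": 0, "cost": 0.0})
    for ev in events:
        if ev["event_id"] in seen:
            continue  # duplicate delivery after a replay; already counted
        seen.add(ev["event_id"])
        price = PRICE_PER_1K[tenant_plan[ev["tenant_id"]]]
        row = totals[(ev["tenant_id"], ev["model"])]
        row["tokens"] += ev["tokens_used"]
        row["cost"] += ev["tokens_used"] / 1000 * price
    return dict(totals)
```

De-duplicating on a stable identifier is one way to keep billing totals correct when Kafka replays events after an aggregator failure.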

Infrastructure (Optional)

Observability: Prometheus for metrics, Grafana for dashboards comparing token consumption against provider costs.
Platform Security: Secrets managed via HashiCorp Vault. Provider API keys are encrypted with AES-256 and never logged.
Wrap Up

Advanced Topics

Trade-offs:
Latency vs. Accuracy: Usage tracking is asynchronous, so a tenant may keep sending requests for a few seconds after crossing their quota, until the Aggregator catches up. The small overshoot is accepted in exchange for never blocking the inference path on billing writes.
Provider Neutrality: Abstracting providers adds a small maintenance cost as provider APIs evolve, but it prevents vendor lock-in.
Reliability: Uses "Hedging" for high-priority requests: the same request is sent to two providers and the first response wins. This cuts tail latency at up to double the provider cost per hedged request.
Security: Request signing is used for internal service-to-service communication to prevent unauthorized model access.
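
The hedging trade-off above can be sketched with asyncio. This is one possible shape, not the design's implementation: primary and secondary are hypothetical zero-argument coroutine functions wrapping provider calls, and hedge_delay (how long to wait before firing the second request) is an added assumption that lets the cost be tuned.

```python
import asyncio

async def hedged(primary, secondary, hedge_delay=0.0):
    """Run two provider calls and return the first successful result."""

    async def delayed_secondary():
        await asyncio.sleep(hedge_delay)
        return await secondary()

    pending = {
        asyncio.create_task(primary()),
        asyncio.create_task(delayed_secondary()),
    }
    try:
        while pending:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                if task.exception() is None:
                    return task.result()
        raise RuntimeError("all hedged calls failed")
    finally:
        # Cancel whichever call lost the race so it stops consuming resources.
        for task in pending:
            task.cancel()
```

Cancelling the losing call locally does not necessarily stop the provider from billing for tokens it has already generated, which is why hedging is reserved for high-priority traffic.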
Distinguishing Insights:
Token Estimation: Before sending to the provider, use a local tokenizer (e.g., Tiktoken) to perform a "Pre-flight Check" to see if the user has enough remaining quota for the estimated max tokens.
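
A minimal sketch of the pre-flight check follows. To keep it self-contained, estimate_tokens is a crude stand-in that assumes roughly four characters per token; a real gateway would pass in a counter backed by the model's actual tokenizer (such as Tiktoken). The function names and return shape are hypothetical.

```python
import math

def estimate_tokens(text, chars_per_token=4.0):
    """Crude stand-in for a real tokenizer (assumption: ~4 chars per token)."""
    return math.ceil(len(text) / chars_per_token)

def preflight(messages, max_tokens, remaining_quota, count=estimate_tokens):
    """Return (allowed, worst_case_tokens) for a chat request.

    worst_case_tokens = estimated prompt tokens + the requested max_tokens,
    i.e. the most the request could consume if the model uses its full budget.
    """
    prompt_tokens = sum(count(m["content"]) for m in messages)
    worst_case = prompt_tokens + max_tokens
    return worst_case <= remaining_quota, worst_case
```

Rejecting on the worst case is conservative: most responses stop well short of max_tokens, so a gateway might instead reserve the worst case up front and refund the difference once the actual usage event arrives.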