The Question

Universal LLM Gateway & Multi-Tenant Proxy Service

Design a high-scale, external-facing LLM platform that provides a unified API for multiple foundation models (e.g., OpenAI, Anthropic, Llama). The system must support multi-tenant API key management, streaming responses via SSE/WebSockets, real-time token-based quota enforcement, and asynchronous usage billing. Address the specific challenges of provider rate-limit management, model-agnostic routing, and minimizing latency overhead for streaming traffic at a scale of 10M+ daily requests.

Redis

Kafka

PostgreSQL

ClickHouse

Flink

SSE

Envoy

vLLM

gRPC

Prometheus

Questions & Insights

Clarifying Questions

What is the primary persona and scale? (Assumption: Internal developers and external B2B clients. Target: 10M requests/day, peak 500 QPS).

Does the platform host models or proxy to providers (OpenAI/Anthropic)? (Assumption: MVP focuses on a unified proxy/gateway to external providers and self-hosted models via vLLM/Triton).

What are the latency requirements? (Assumption: Focus on Time-To-First-Token (TTFT) for streaming responses, aiming for <200ms overhead).

Are we handling fine-tuning or just inference? (Assumption: Inference-only for MVP; fine-tuning is out of scope).

How is billing handled? (Assumption: Token-based usage tracking with per-tenant quotas).

Thinking Process

Core Bottleneck: LLM requests are long-lived (streaming) and expensive. The system must handle high-concurrency persistent connections while tracking token usage accurately for billing.

Progressive Walkthrough:

How do we provide a single unified API while abstracting different provider schemas?

How do we ensure "fair use" so one tenant doesn't saturate the provider's rate limits (TPM/RPM)?

How do we minimize costs through caching and intelligent model routing?

How do we capture usage metrics asynchronously without adding latency to the inference path?

Bonus Points

Semantic Caching: Using a Vector Database (e.g., Pinecone/Milvus) to cache responses for semantically similar prompts, reducing costs by up to 30%.

Fallback & Circuit Breaking: Implementing "Model Fallback" logic (e.g., if GPT-4 is down/slow, fallback to Claude 3 or a local Llama-3 instance).

Dynamic Cost Routing: Routing requests to different providers based on real-time spot pricing or remaining monthly commit quotas.

Privacy Scrubbing: Automated PII (Personally Identifiable Information) detection and masking before sending data to external providers.
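
As a rough illustration of the Privacy Scrubbing idea above, the sketch below masks a few common PII shapes with regular expressions before a prompt leaves the gateway. The patterns, the PII_PATTERNS table, and the scrub_pii name are assumptions made for this example; a production system would likely need locale-aware rules or a dedicated detection model.

```python
import re

# Hypothetical patterns for illustration only; they cover US-style formats
# and will miss many real-world variants.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def scrub_pii(text: str) -> str:
    """Replace each matched span with a typed placeholder such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    print(scrub_pii("Mail jane.doe@example.com or call 555-123-4567"))
```

Typed placeholders (rather than plain deletion) keep the prompt readable for the model while still hiding the original value.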

Design Breakdown

Functional Requirements

Core Use Cases:

Unified API for Chat Completions (REST/Streaming).

API Key management and Tenant isolation.

Real-time token usage tracking and quota enforcement.

Model fallback and load balancing across multiple provider keys.

Scope Control:

In-scope: Inference proxy, Rate limiting, Usage logging, Multi-provider support.

Out-of-scope: Model training, RAG (Retrieval Augmented Generation) pipeline, Fine-tuning orchestration.

Non-Functional Requirements

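Scale: Support 10M+ requests per day (on the order of 30B tokens per day, assuming the ~12 KB per request from Estimation and roughly 4 bytes per token); horizontally scalable proxy layer.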

Latency: Minimal overhead (<50ms) on top of the provider's intrinsic latency.

Availability: 99.99% uptime; the proxy must not be a single point of failure.

Consistency: Eventual consistency for usage/billing metrics; strong consistency for API key validation.

Fault Tolerance: Automatic retries with exponential backoff for provider 429/5xx errors.

Security: TLS 1.3, API Key encryption at rest, and request/response logging for audit trails.
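
The Fault Tolerance requirement above (retries with exponential backoff on provider 429/5xx) could be sketched as follows. The send callable, the retryable status set, and the delay parameters are assumptions for illustration, and the jitter strategy shown ("full jitter") is one common choice rather than something the design prescribes.

```python
import random
import time

RETRYABLE_STATUS = {429, 500, 502, 503, 504}


def call_with_backoff(send, max_attempts=4, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    send is a hypothetical zero-argument callable returning (status, body).
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE_STATUS or attempt == max_attempts - 1:
            return status, body
        # Full jitter: sleep a random amount up to the capped exponential delay.
        cap = min(max_delay, base_delay * (2 ** attempt))
        sleep(random.uniform(0, cap))
```

For streaming requests, a retry like this is only safe before the first byte has been forwarded to the client; after that point the proxy would have to surface the error instead.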

Estimation

Traffic: 10M requests/day ≈ 115 requests/sec (avg). Peak ≈ 500 QPS.

Storage:

Metadata (Keys/Users): <10 GB.

Usage Logs: 10M requests/day * 1KB/log = 10 GB/day (≈ 3.65 TB/year).

Bandwidth:

Ingress: 500 QPS * 2KB (prompt) = 1 MB/s.

Egress (Streaming): 500 QPS * 10KB (avg response) = 5 MB/s.
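
The estimates above can be reproduced with a few lines of arithmetic. This sketch assumes decimal units (1 KB = 1,000 bytes) and uses the per-request sizes stated in this section; the variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400

requests_per_day = 10_000_000
peak_qps = 500
log_bytes = 1_000        # 1 KB per usage log entry
prompt_bytes = 2_000     # 2 KB average prompt
response_bytes = 10_000  # 10 KB average streamed response

avg_qps = requests_per_day / SECONDS_PER_DAY
log_gb_per_day = requests_per_day * log_bytes / 1e9
log_tb_per_year = log_gb_per_day * 365 / 1e3
ingress_mb_s = peak_qps * prompt_bytes / 1e6
egress_mb_s = peak_qps * response_bytes / 1e6

print(f"avg QPS: {avg_qps:.1f}")
print(f"logs: {log_gb_per_day:.0f} GB/day, {log_tb_per_year:.2f} TB/year")
print(f"peak ingress: {ingress_mb_s:.0f} MB/s, peak egress: {egress_mb_s:.0f} MB/s")
```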

Blueprint

Concise Summary: A multi-tenant LLM Gateway that abstracts model providers, enforces quotas, and captures streaming usage metrics via an asynchronous pipeline.

Major Components:

API Gateway: Handles authentication, TLS termination, and global rate limiting.

Model Proxy Service: The core logic engine that transforms requests, manages streaming (SSE), and performs model routing.

Usage Aggregator: Asynchronously processes token counts from a message bus to update tenant balances.

Redis Cache: Stores active API keys and real-time rate limit counters.

Simplicity Audit: The design avoids complex orchestration of model weights and focuses on a "Gateway Pattern," which is the fastest way to deliver value.

Architecture Decision Rationale:

Why this architecture?: The Proxy pattern allows for a unified SDK (e.g., OpenAI-compatible) regardless of the underlying model, enabling "provider-swapping" without client-side changes.

Functional Satisfaction: Meets the need for multi-tenancy and unified access.

Non-functional Satisfaction: Scalable through stateless proxies and high availability through redundant provider paths.
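
To make the "provider-swapping" point concrete, here is a minimal sketch of a request adapter that accepts an OpenAI-style chat payload and produces an Anthropic-style one. It only covers the fields shown (system prompt hoisting, max_tokens, stream) and passes the model name through unchanged; a real gateway would also map model names, tools, and sampling parameters. The function name and the default_max_tokens value are assumptions.

```python
def openai_to_anthropic(request: dict, default_max_tokens: int = 1024) -> dict:
    """Translate an OpenAI-style chat request into an Anthropic-style one.

    Simplified: system messages are hoisted into a top-level "system" field,
    and max_tokens is always set because the target schema requires it.
    """
    system_parts, messages = [], []
    for msg in request["messages"]:
        if msg["role"] == "system":
            system_parts.append(msg["content"])
        else:
            messages.append({"role": msg["role"], "content": msg["content"]})

    translated = {
        "model": request["model"],  # assumption: model-name mapping happens elsewhere
        "messages": messages,
        "max_tokens": request.get("max_tokens", default_max_tokens),
        "stream": request.get("stream", False),
    }
    if system_parts:
        translated["system"] = "\n".join(system_parts)
    return translated
```

Keeping each provider behind a small adapter like this is what lets the client-facing schema stay fixed while providers change.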

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: Global Load Balancing (GSLB) to route traffic to the nearest regional Model Proxy.

Security & Perimeter:

API Gateway: Uses Kong or Envoy to validate JWTs/API Keys.

Rate Limiting: Enforces "Hard Limits" (e.g., 50 QPS per user) at the edge to protect the Proxy Service.

Service

Topology & Scaling: Stateless Model Proxy deployed in Kubernetes across multiple Availability Zones. Scaling is based on "Active Connections" rather than CPU, given the long-lived nature of LLM streaming.

API Schema Design:

POST /v1/chat/completions: OpenAI-compatible schema.

GET /v1/models: Returns available models across all providers.

GET /v1/usage: Returns current billing period token consumption.
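
For the streaming variant of POST /v1/chat/completions, the proxy re-frames provider output as Server-Sent Events. The sketch below shows the framing only, in the OpenAI-compatible style where each event is a "data:" line carrying a JSON chunk and the stream ends with a [DONE] sentinel. The relay_stream name and the reduced chunk shape are assumptions for illustration.

```python
import json


def sse_event(payload: dict) -> str:
    """Serialize one SSE frame: a data line followed by a blank line."""
    return f"data: {json.dumps(payload)}\n\n"


def relay_stream(provider_chunks):
    """Yield one SSE frame per text delta, then the terminating [DONE] frame.

    provider_chunks is a hypothetical iterable of text fragments already
    normalized from whichever provider served the request.
    """
    for text in provider_chunks:
        yield sse_event({"choices": [{"delta": {"content": text}}]})
    yield "data: [DONE]\n\n"


if __name__ == "__main__":
    for frame in relay_stream(["Hel", "lo"]):
        print(repr(frame))
```

Because the function is a generator, each frame can be flushed to the client as soon as the provider emits it, which is what keeps the added time-to-first-token small.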

Resilience & Reliability:

Streaming Stability: Uses Server-Sent Events (SSE). The proxy maintains the connection to the client while consuming the provider stream.

Circuit Breakers: If OpenAI returns 503s consistently, the Router trips and shifts traffic to Anthropic.
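
The circuit-breaker behavior described above might look like the following sketch. The thresholds, the CircuitBreaker class, and the pick_provider helper are assumptions for illustration; the half-open handling is deliberately simplified (after the cool-down, requests are allowed until the next failure re-opens the breaker).

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: let traffic probe the provider once the cool-down elapsed.
        return self.clock() - self.opened_at >= self.reset_timeout

    def record(self, success: bool) -> None:
        if success:
            self.failures, self.opened_at = 0, None
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def pick_provider(preferred_order, breakers):
    """Return the first provider whose breaker allows traffic, else None."""
    for name in preferred_order:
        if breakers[name].allow():
            return name
    return None
```

The Router would call record() with the outcome of every provider call (for example, treating 503s as failures) and pick_provider() before each request.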

Observability:

RED Metrics: Request rate, Error rate, Duration (TTFT).

Traces: Distributed tracing (Jaeger/OpenTelemetry) to measure latency added by the proxy vs. the provider.

Storage

Access Pattern:

High read for API keys and configurations.

High write for usage logs and audit trails.

Database Table Design:

Tenants: tenant_id, api_key_hash, plan_id, status.

Usage: tenant_id, model_id, tokens_in, tokens_out, timestamp.
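
The api_key_hash column implies that raw tenant keys are never stored. A minimal sketch of how keys could be issued, hashed, and verified is shown below; the "sk" prefix, the function names, and the choice of unsalted SHA-256 (reasonable only because the keys are long random strings, unlike passwords) are assumptions for illustration.

```python
import hashlib
import hmac
import secrets


def generate_api_key(prefix: str = "sk") -> str:
    """Create a high-entropy key; it is shown to the tenant once and not stored."""
    return f"{prefix}_{secrets.token_urlsafe(32)}"


def hash_api_key(api_key: str) -> str:
    """Value for the api_key_hash column; also usable as a cache lookup key."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


def verify_api_key(presented_key: str, stored_hash: str) -> bool:
    """Constant-time comparison of the presented key's hash with the stored one."""
    return hmac.compare_digest(hash_api_key(presented_key), stored_hash)
```

Because the hash is deterministic, the gateway can look a tenant up by hash directly, which keeps key validation to a single indexed read or cache hit.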

Technical Selection:

PostgreSQL: For relational metadata (Tenants, API Keys).

ClickHouse: For high-volume usage analytics and billing data.

Distribution Logic: Sharded by tenant_id to ensure isolation and performance.

Cache

Purpose & Justification:

Rate Limiting: Redis stores sliding-window counters for Tokens-Per-Minute (TPM).

Semantic/Exact Cache: Avoids re-generating an LLM response for identical (or, with semantic matching, near-identical) prompts, e.g., common customer support questions.

Key-Value Schema:

ratelimit:{tenant_id}:{model_id} -> counter.

cache:{prompt_hash} -> response_json (TTL: 24h).

Failure Handling: If Redis fails, the system fails open ("Allow All") to maintain availability, while each proxy instance falls back to local in-memory counters for coarse emergency rate limiting.
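
The sliding-window TPM counter can be illustrated with an in-process version that uses the same ratelimit:{tenant_id}:{model_id} key shape. This is a sketch of the technique, not the Redis implementation: a Redis-backed version would need the check-and-increment to be atomic (for example via a server-side script). An in-memory limiter like this one could also serve as the local emergency fallback mentioned under Failure Handling. The class and method names are assumptions.

```python
import time
from collections import defaultdict, deque


class SlidingWindowTokenLimiter:
    """Sliding-window log of token usage per tenant and model."""

    def __init__(self, tokens_per_minute, window_seconds=60.0, clock=time.monotonic):
        self.limit = tokens_per_minute
        self.window = window_seconds
        self.clock = clock
        self.events = defaultdict(deque)  # key -> deque of (timestamp, tokens)
        self.totals = defaultdict(int)    # key -> tokens currently in the window

    def try_consume(self, tenant_id: str, model_id: str, tokens: int) -> bool:
        key = f"ratelimit:{tenant_id}:{model_id}"
        now = self.clock()
        window = self.events[key]
        # Evict entries that have slid out of the window.
        while window and now - window[0][0] >= self.window:
            _, expired = window.popleft()
            self.totals[key] -= expired
        if self.totals[key] + tokens > self.limit:
            return False
        window.append((now, tokens))
        self.totals[key] += tokens
        return True
```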

Messaging

Purpose & Decoupling: Decouples the critical inference path from the billing/analytics path.

Event / Topic Schema: llm.usage.events: {tenant_id, model, tokens_used, cost_est, latency_ms}.

Throughput & Partitioning: Partitioned by tenant_id so that usage records for a single tenant are processed in order.

Technical Selection: Kafka. Required for its durability and replayability in case the Usage Aggregator fails.
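
The event schema and tenant-based partitioning above can be sketched without a Kafka client. In practice the producer library picks the partition when a message key is set; the SHA-256 based partition_for function below is only a stand-in to show why keying by tenant_id keeps one tenant's events on one partition. The UsageEvent class and helper names are assumptions.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class UsageEvent:
    tenant_id: str
    model: str
    tokens_used: int
    cost_est: float
    latency_ms: int


def partition_for(tenant_id: str, num_partitions: int) -> int:
    """Stable key-to-partition mapping (stand-in for the client's partitioner)."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def to_record(event: UsageEvent) -> tuple[bytes, bytes]:
    """Return (key, value) bytes for the llm.usage.events topic."""
    key = event.tenant_id.encode("utf-8")
    value = json.dumps(asdict(event)).encode("utf-8")
    return key, value
```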

Data Processing

Processing Model: Streaming aggregation.

Processing DAG: Source (Kafka) -> Enrich (Lookup tenant pricing) -> Aggregate (Sum tokens) -> Sink (ClickHouse).

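Technical Selection: Flink, or a simple Golang consumer group for the MVP. Flink is preferred for Staff-level designs because its checkpointing provides exactly-once state semantics, which matters for billing accuracy; end-to-end exactly-once additionally requires an idempotent or transactional sink.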
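
The Enrich and Aggregate stages of the DAG can be sketched as a plain function. To keep totals correct when the message bus redelivers an event, this version deduplicates on an event_id field; that field is an assumption added for the example (it is not part of the event schema above), and the price table is hypothetical.

```python
from collections import defaultdict

# Hypothetical per-model prices in USD per 1,000 tokens, for illustration only.
PRICE_PER_1K_TOKENS = {"gpt-4": 0.03, "llama-3": 0.001}


def aggregate_usage(events, seen_event_ids=None):
    """Sum tokens and estimated cost per (tenant_id, model), skipping duplicates."""
    seen = set() if seen_event_ids is None else seen_event_ids
    totals = defaultdict(lambda: {"tokens": 0, "cost": 0.0})
    for event in events:
        if event["event_id"] in seen:
            continue  # redelivered event: already counted
        seen.add(event["event_id"])
        bucket = totals[(event["tenant_id"], event["model"])]
        bucket["tokens"] += event["tokens_used"]
        # Enrich: look up the price for the model and add the estimated cost.
        bucket["cost"] += event["tokens_used"] / 1000 * PRICE_PER_1K_TOKENS[event["model"]]
    return dict(totals)
```

A simple consumer group could run this per batch and write the result to ClickHouse; passing a shared seen_event_ids set lets deduplication span batches.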

Infrastructure (Optional)

Observability: Prometheus for metrics, Grafana for dashboards comparing token consumption against provider costs.

Platform Security: Secrets managed via HashiCorp Vault. Provider API keys are encrypted with AES-256 and never logged.

Wrap Up

Advanced Topics

Trade-offs:

Latency vs. Accuracy: We use asynchronous usage tracking. A tenant may therefore overshoot their quota slightly, by whatever they consume in the few seconds before the Aggregator catches up, but the inference request is never blocked on billing.

Provider Neutrality: Abstracting providers adds a small maintenance cost as provider APIs evolve, but it prevents vendor lock-in.

Reliability: Uses "Hedging" for high-priority requests: the same request is sent to two providers and the first response wins (roughly doubles cost, but cuts tail latency).

Security: Request signing is used for internal service-to-service communication to prevent unauthorized model access.
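
The "Hedging" trade-off listed under Reliability can be sketched with asyncio: start the same request against several providers, return the first successful result, and cancel the rest. The hedged_request name and the shape of the calls argument (zero-argument coroutine functions) are assumptions for illustration.

```python
import asyncio


async def hedged_request(calls):
    """Run all provider calls concurrently; return the first successful result."""
    tasks = [asyncio.create_task(call()) for call in calls]
    try:
        for finished in asyncio.as_completed(tasks):
            try:
                return await finished
            except Exception:
                continue  # this provider failed; wait for the next one
        raise RuntimeError("all hedged providers failed")
    finally:
        for task in tasks:
            task.cancel()  # stop the losers so they do not keep generating tokens


if __name__ == "__main__":
    async def slow():
        await asyncio.sleep(0.2)
        return "slow provider"

    async def fast():
        await asyncio.sleep(0.01)
        return "fast provider"

    print(asyncio.run(hedged_request([slow, fast])))
```

Cancelling the losing call matters for cost: a provider may still bill for tokens generated before the cancellation takes effect, which is why hedging is reserved for high-priority traffic.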

Distinguishing Insights:

Token Estimation: Before sending to the provider, use a local tokenizer (e.g., tiktoken) to run a "Pre-flight Check" that verifies the tenant has enough remaining quota for the prompt tokens plus the requested max output tokens.
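
A minimal sketch of the pre-flight check is shown below. To stay self-contained it uses a crude heuristic of about 4 characters per token instead of a real tokenizer; estimate_tokens and preflight_check are hypothetical names, and the count_tokens parameter is where a real tokenizer would be plugged in.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token); a stand-in for a real tokenizer."""
    return max(1, (len(text) + 3) // 4)


def preflight_check(prompt: str, max_output_tokens: int, remaining_quota: int,
                    count_tokens=estimate_tokens):
    """Return (allowed, estimated_total) for a request before it is forwarded."""
    estimated_total = count_tokens(prompt) + max_output_tokens
    return estimated_total <= remaining_quota, estimated_total


if __name__ == "__main__":
    allowed, estimate = preflight_check("Summarize this ticket." * 20, 256, 500)
    print(allowed, estimate)
```

Since the estimate assumes the full max output is used, the actual token count reported by the provider should still be what gets billed; the pre-flight result only gates admission.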