The Question
Design a Shared LLM Inference Platform
Design a shared, external-facing platform for serving large language model inference. The system should support multi-tenancy with per-tenant rate limiting and cost attribution, ensure low-latency responses under high concurrency, and maintain high availability with graceful degradation.
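One way to approach the per-tenant rate-limiting requirement is a token bucket keyed by tenant ID. A minimal in-memory sketch is below; in the real system this check would typically run atomically in Redis (e.g. via a Lua script) so all gateway replicas share one view of each tenant's budget. The class names, default limits, and `cost` parameter are illustrative assumptions, not a prescribed design.

```python
import time
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    """Per-tenant token bucket: refills `rate` tokens/sec, bursts up to `capacity`."""
    capacity: float
    rate: float
    tokens: float = 0.0
    last_refill: float = field(default_factory=time.monotonic)

    def __post_init__(self) -> None:
        self.tokens = self.capacity  # start full so tenants get their burst allowance

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

class TenantRateLimiter:
    """Maps each tenant ID to its own bucket; limits here are placeholders."""
    def __init__(self, capacity: float = 10.0, rate: float = 5.0):
        self.capacity, self.rate = capacity, rate
        self.buckets: dict[str, TokenBucket] = {}

    def allow(self, tenant_id: str, cost: float = 1.0) -> bool:
        bucket = self.buckets.setdefault(tenant_id, TokenBucket(self.capacity, self.rate))
        return bucket.allow(cost)
```

Charging a request-dependent `cost` (e.g. proportional to requested tokens) rather than a flat 1 also gives a hook for the cost-attribution requirement: the same per-tenant counter that limits traffic can be exported for billing.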
Tags: API Gateway, Server-Sent Events (SSE), Redis Rate Limiting, Kafka, PostgreSQL
February 8, 2026