The Question
Design

High-Throughput GPU Inference Batching System

Design a scalable infrastructure to wrap a fixed-endpoint inference API. The system must support high-concurrency requests and optimize GPU utilization via a server-side batching mechanism that balances latency and throughput.
Redis Streams
Redis Cluster
Kubernetes
Envoy
gRPC
Prometheus
Jaeger
JWT
mTLS
Questions & Insights

Clarifying Questions

What is the peak QPS and the target latency SLO? (Assumption: 10,000 QPS with a p99 latency requirement of < 500ms).
What is the maximum batch size supported by the fixed inference API? (Assumption: Max batch size is 64 requests).
What is the payload size for input and output? (Assumption: Text-based, ~2KB per request).
Is the client communication synchronous or asynchronous? (Assumption: Clients expect a synchronous-like experience, but we will use an async-polling or long-polling pattern internally to handle high concurrency).
Do we need to handle request priorities (e.g., premium vs. free users)? (Assumption: No, FIFO for the MVP).

Thinking Process

Core Bottleneck: GPU under-utilization and memory overhead when processing requests individually.
Key Strategy: Implement a Dynamic Batching Service that acts as a buffer between the high-concurrency API and the fixed GPU workers.
Progressive Logic:
How do we ingest 10k+ requests without blocking? (Use a distributed message queue).
How do we group them efficiently? (The Batcher service implements a "Wait-or-Full" logic).
How do we deliver results back to the user? (Result Store with Polling/WebSockets).
How do we scale the Batcher? (Partition-based batching to avoid global locks).

Bonus Points

Adaptive Batching: Dynamically adjust the wait_time based on current traffic volume to minimize latency during low-traffic periods and maximize throughput during spikes.
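Zero-Copy Serialization: Use a zero-copy format such as Arrow or FlatBuffers for internal data transfer to reduce CPU overhead during batch construction and deconstruction. Protobuf is compact and fast but still requires a decode step, so it is a fallback rather than true zero-copy.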
GPU Backpressure Propagation: Implement a feedback loop where the Batcher slows down ingestion if the GPU Worker's internal queue/memory utilization exceeds 90%.
Locality-Aware Batching: If the system scales across regions, ensure batching happens at the edge or within the same AZ to minimize cross-region data transfer costs.
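The Adaptive Batching idea above can be sketched as a small policy function. This is a minimal illustration, not a tuned algorithm: the function name adaptive_wait_ms, the 2ms floor, and the linear scaling rule are all assumptions; only the 64-request batch limit and the 50ms ceiling come from this design.

```python
def adaptive_wait_ms(arrival_qps, max_batch=64, min_wait_ms=2.0, max_wait_ms=50.0):
    """Pick a batch wait window from the observed arrival rate (hypothetical policy).

    Low traffic -> short wait (a full batch is unlikely, so do not hold requests).
    High traffic -> wait up to max_wait_ms (the size trigger usually fires first).
    """
    if arrival_qps <= 0:
        return min_wait_ms
    # How many requests would arrive during the longest allowed window?
    expected_in_window = arrival_qps * max_wait_ms / 1000.0
    # Fraction of a full batch we can expect to collect, capped at 1.0.
    fill_ratio = min(1.0, expected_in_window / max_batch)
    return min_wait_ms + (max_wait_ms - min_wait_ms) * fill_ratio


if __name__ == "__main__":
    for qps in (0, 100, 1_000, 10_000):
        print(qps, adaptive_wait_ms(qps))
```

In practice arrival_qps would come from a smoothed measurement (for example an exponential moving average) so the window does not oscillate on short bursts.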
Design Breakdown

Functional Requirements

Core Use Cases:
Users submit inference requests via REST API.
Requests are batched and processed by the GPU model.
Users retrieve the inference result.
Scope Control:
In-Scope: API Gateway, Request Queue, Batching Logic, Result Storage.
Out-of-Scope: Model training, Model optimization (TensorRT/ONNX), User authentication service.

Non-Functional Requirements

Scale: Must handle 10k QPS and scale horizontally.
Latency: Batching overhead should be at most 50ms (the Batcher's maximum wait window); total E2E latency < 500ms at p99.
Availability & Reliability: 99.9% uptime; requests should not be lost if a worker fails (at-least-once delivery).
Consistency: Eventual consistency for results; strict ordering is not required within a batch.
Fault Tolerance: Dead-letter queues (DLQ) for failed inference attempts.

Estimation

Traffic: 10,000 QPS.
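Storage: 10k requests/sec * 2KB/request = 20MB/sec. For 1 hour of retention: ~72GB. With the 10-minute result TTL used in the Storage section, the live working set is closer to ~12GB.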
Bandwidth:
Ingress: 10,000 * 2KB = 20 MB/s.
Egress (Results): ~20 MB/s.
GPU Workers: If one batch of 64 takes 200ms, one worker handles 320 QPS. We need ~32 GPU workers to handle 10k QPS.
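The arithmetic above can be reproduced with a short script, which makes it easy to rerun when an assumption changes. The function name estimate and its parameter names are illustrative; the inputs are the assumptions already stated in this section, using decimal units (1 MB = 1000 KB) as the text does.

```python
import math


def estimate(qps=10_000, payload_kb=2, batch_size=64,
             batch_latency_ms=200, retention_s=3600):
    """Back-of-the-envelope capacity numbers for the batching pipeline."""
    ingress_mb_s = qps * payload_kb / 1000           # KB/s -> MB/s
    storage_gb = ingress_mb_s * retention_s / 1000   # MB -> GB over the retention window
    per_worker_qps = batch_size * 1000 / batch_latency_ms
    workers = math.ceil(qps / per_worker_qps)        # round up: no fractional GPUs
    return {
        "ingress_mb_s": ingress_mb_s,
        "storage_gb": storage_gb,
        "per_worker_qps": per_worker_qps,
        "gpu_workers": workers,
    }


if __name__ == "__main__":
    print(estimate())
```

The worker count assumes every batch is full and leaves no headroom, so a real deployment would provision above this floor.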

Blueprint

Concise Summary: A high-throughput pipeline using a distributed queue to decouple request ingestion from GPU execution, featuring a dedicated Batcher Service for optimal GPU utilization.
Major Components:
API Gateway: Entry point for TLS termination and request validation.
Request Queue (Redis Streams): Fast, in-memory buffer for incoming inference tasks.
Batcher Service: The core logic that aggregates N messages or waits T milliseconds before calling the GPU API.
Result Store (Redis): Short-lived storage for finished inference results.
Simplicity Audit: This architecture avoids complex stream processing frameworks (like Flink) in favor of a lightweight consumer-group-based batcher, which is easier to deploy and scale for an MVP.
Architecture Decision Rationale:
Why this architecture?: Distributed queues provide the necessary buffer to handle bursts without crashing the fixed-rate GPU workers.
Functional Requirement Satisfaction: Batching logic directly addresses the "group requests" requirement.
Non-functional Requirement Satisfaction: Redis provides sub-millisecond latency for queuing, and horizontal scaling of the Batcher ensures high availability.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: Global Load Balancer (GSLB) routes traffic to the nearest regional API Gateway.
Security & Perimeter: API Gateway handles JWT validation and per-user Rate Limiting (1000 requests per user/min) so that no single client can monopolize the GPU cluster.
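One common way to enforce a per-user limit like the one above is a token bucket. The sketch below is an in-process illustration only: the class name TokenBucket is hypothetical, and a real gateway would keep one bucket per user in shared storage. Note that a token bucket allows short bursts up to its capacity, so it approximates rather than exactly enforces "1000 per minute".

```python
import time


class TokenBucket:
    """Per-user token bucket: capacity 1000, refilled at 1000 tokens per minute."""

    def __init__(self, capacity=1000, refill_per_sec=1000 / 60, now=None):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)  # start full
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        """Return True and consume a token if the request may proceed."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket()
    print(sum(bucket.allow() for _ in range(1200)))  # roughly 1000 allowed in a burst
```

The optional now argument exists so the clock can be injected in tests.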

Service

Topology & Scaling: Stateless API instances deployed in K8s, auto-scaling on CPU (70% threshold).
API Schema Design:
POST /v1/inference: { "input": "...", "client_id": "..." } -> Returns task_id.
GET /v1/result/{task_id}: Returns status (PENDING/SUCCESS/FAILED) and data. FAILED covers requests that exhausted their retries and were routed to the DLQ.
Resilience: 3 retries with exponential backoff for the Batcher calling the GPU API.
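The retry policy above could look like the following sketch. The helper name call_with_retries, the 50ms base delay, the 1s cap, and the use of full jitter are assumptions chosen for illustration; the source only fixes the retry count at 3.

```python
import random
import time


def call_with_retries(fn, retries=3, base_delay_s=0.05, max_delay_s=1.0,
                      sleep=time.sleep):
    """Call fn(); on failure retry up to `retries` times with exponential backoff.

    Delays grow as base * 2**attempt, capped at max_delay_s, with full jitter
    so that many Batchers do not retry against the GPU API in lockstep.
    """
    attempt = 0
    while True:
        try:
            return fn()
        except Exception:
            if attempt >= retries:
                raise  # retries exhausted: let the caller route the batch to the DLQ
            delay = min(max_delay_s, base_delay_s * (2 ** attempt))
            sleep(random.uniform(0, delay))
            attempt += 1


if __name__ == "__main__":
    print(call_with_retries(lambda: "ok"))
```

With these defaults the worst-case added delay is 50 + 100 + 200 = 350ms, which already consumes most of the 500ms budget, so the base delay is a value to tune against the SLO.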

Storage

Access Pattern: High write/read (1:1 ratio). Data is transient (TTL = 10 mins).
Database Table Design:
Result Table (Redis Hash): Key: task_id, Fields: status, output, timestamp.
Technical Selection: Redis (In-memory KV) for the Result Store.
Distribution Logic: Partitioning by task_id using Redis Cluster to handle 20k+ ops/sec.
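To make the partitioning concrete: Redis Cluster assigns each key to one of 16384 hash slots using CRC16 of the key (or of the hash tag inside braces, if present). The sketch below reproduces that mapping with the standard library; the function name key_hash_slot and the example key names are illustrative only.

```python
import binascii


def key_hash_slot(key: str) -> int:
    """Compute the Redis Cluster hash slot for a key (CRC16/XMODEM mod 16384)."""
    data = key.encode("utf-8")
    # Hash tags: if the key contains a non-empty {...} section, only that part is hashed.
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    # binascii.crc_hqx with initial value 0 is the CRC-CCITT (XMODEM) variant.
    return binascii.crc_hqx(data, 0) % 16384


if __name__ == "__main__":
    for task_id in ("task-1", "task-2", "task-3"):
        print(task_id, key_hash_slot(f"result:{task_id}"))
```

Because task_id values are effectively random, result keys spread evenly across slots without any extra routing logic in the application.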

Cache

Purpose & Justification: Deduplicate identical inference requests (e.g., same prompt) to save GPU cycles.
Key-Value Schema: Key: SHA256(input_payload), Value: task_id or cached_result. TTL: 5 minutes.
Failure Handling: If Redis fails, bypass cache and go straight to the queue.
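A minimal sketch of the deduplication lookup and its bypass-on-failure behavior follows. The names cache_key and lookup_or_enqueue, the "dedupe:" key prefix, and the injected cache_get/enqueue callables are assumptions made so the example runs without a Redis server.

```python
import hashlib


def cache_key(input_payload: str) -> str:
    """Key = SHA256 of the input payload, as in the schema above (prefix is illustrative)."""
    return "dedupe:" + hashlib.sha256(input_payload.encode("utf-8")).hexdigest()


def lookup_or_enqueue(payload, cache_get, enqueue):
    """Return a cached task_id/result if present; otherwise enqueue a new task.

    cache_get and enqueue are injected callables (assumptions for this sketch).
    If the cache is unreachable, bypass it and go straight to the queue.
    """
    try:
        hit = cache_get(cache_key(payload))
    except ConnectionError:
        # Assumption: the cache client surfaces outages as ConnectionError;
        # a real Redis client raises its own exception type.
        hit = None
    if hit is not None:
        return hit
    return enqueue(payload)


if __name__ == "__main__":
    store = {}
    print(lookup_or_enqueue("hello", store.get, lambda p: "task-123"))
```

Hashing the raw payload only collapses byte-identical inputs, so any normalization (whitespace, casing) would have to happen before the key is computed.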

Messaging

Purpose & Decoupling: Decouples synchronous HTTP requests from asynchronous GPU processing.
Event / Topic Schema: inference-requests topic. Payload: { "task_id": "...", "payload": "...", "ts": "..." }.
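Throughput & Partitioning: A single Redis Stream is one key with no built-in partitions, so the queue is split into 16 shard streams (for example inference-requests:0 through inference-requests:15, chosen by hashing task_id) to allow parallel Batcher consumers.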
Technical Selection: Redis Streams for low latency and simplicity.
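Since Redis Streams does not partition a single stream, one way to obtain the 16 shards is to write to 16 separate stream keys and pick one per request. The sketch below shows a stable shard choice; the function name stream_key_for and the "inference-requests:N" naming are assumptions.

```python
import zlib

NUM_SHARDS = 16  # shard count from the design above


def stream_key_for(task_id: str, num_shards: int = NUM_SHARDS) -> str:
    """Map a task_id to one of the shard stream keys.

    zlib.crc32 is used instead of the built-in hash() because hash() is
    randomized per process, which would route the same id differently
    on different API instances.
    """
    shard = zlib.crc32(task_id.encode("utf-8")) % num_shards
    return f"inference-requests:{shard}"


if __name__ == "__main__":
    for tid in ("task-1", "task-2", "task-3"):
        print(tid, "->", stream_key_for(tid))
```

Each Batcher instance can then own a fixed subset of these keys, which keeps batching local to one consumer per shard.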

Data Processing

Processing Model: The Batcher Service uses a hybrid trigger:
Size Trigger: 64 messages reached.
Time Trigger: 50ms elapsed since the first message in the current window.
Processing DAG: Read from Stream -> Accumulate in Memory -> Call GPU API -> Disperse Results to Redis -> ACK Stream.
Scalability: Multiple Batcher instances each consume from their own subset of the stream shards, so batches are formed without any global lock.
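The hybrid size-or-time trigger can be sketched with an in-memory queue standing in for the Redis Stream. The function name collect_batch is illustrative; in the real system the blocking get would be a blocking stream read, and the returned batch would be sent to the GPU API and acknowledged afterwards.

```python
import queue
import time


def collect_batch(q, max_batch=64, max_wait_s=0.05):
    """Block for the first message, then gather until the batch is full or
    max_wait_s has elapsed since that first message arrived."""
    batch = [q.get()]  # the window starts when the first message arrives
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:          # size trigger
        remaining = deadline - time.monotonic()
        if remaining <= 0:                 # time trigger
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break                          # window expired while waiting
    return batch


if __name__ == "__main__":
    q = queue.Queue()
    for i in range(100):
        q.put({"task_id": f"t{i}", "payload": "..."})
    print(len(collect_batch(q)), len(collect_batch(q)))
```

Starting the timer at the first message, rather than on a fixed tick, bounds the extra wait for any single request to max_wait_s.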

Infrastructure (Optional)

Observability:
Metrics: Track batch_size_distribution, gpu_worker_latency, and queue_depth.
Tracing: Jaeger for end-to-end tracing from API Gateway to GPU API.
Wrap Up

Advanced Topics

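Trade-offs: We trade latency for throughput. A single request may wait up to 50ms in the Batcher, but each GPU call then serves up to 64 requests instead of one, so the same fleet handles far more load. The actual gain depends on how GPU latency grows with batch size.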
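Reliability: Redis Streams consumer groups track every delivered but unacknowledged message in a Pending Entries List (PEL). If a Batcher instance dies before sending XACK, its messages stay pending. Redis has no NACK or automatic redelivery, so another instance must reclaim entries that have been idle too long with XAUTOCLAIM (or XPENDING plus XCLAIM) and reprocess them.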
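Bottleneck Analysis: The "Fixed GPU API" is the ultimate bottleneck. If it slows down, the Request Queue will grow. Redis Streams has no per-entry TTL, so stale requests must be dropped explicitly: cap the stream length with XADD MAXLEN or XTRIM, and have the Batcher discard any message whose ts is older than the latency budget before it calls the GPU.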
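Dropping stale requests in the Batcher can be as simple as checking each message's age before batching. In this sketch the function name split_stale is hypothetical, and the ts field is assumed to hold epoch milliseconds; the 500ms default mirrors the end-to-end latency target.

```python
def split_stale(messages, now_ms, max_age_ms=500):
    """Separate messages that can still meet the SLO from those that cannot.

    Assumes each message is a dict with an integer "ts" in epoch milliseconds.
    """
    fresh, stale = [], []
    for msg in messages:
        if now_ms - msg["ts"] > max_age_ms:
            stale.append(msg)   # too old: skip the GPU, report failure to the client
        else:
            fresh.append(msg)
    return fresh, stale


if __name__ == "__main__":
    now = 10_000
    msgs = [{"task_id": "a", "ts": 9_900}, {"task_id": "b", "ts": 9_000}]
    fresh, stale = split_stale(msgs, now)
    print([m["task_id"] for m in fresh], [m["task_id"] for m in stale])
```

Stale messages should still be acknowledged and their result entries updated, so clients polling for them get a terminal answer instead of waiting indefinitely.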
Security: All internal communication (Batcher to GPU API) uses mTLS. Input sanitization is performed at the API Gateway to prevent injection attacks into the model prompts.
Distinguishing Insights: To handle Hot Keys (many users asking for the same inference), we can implement a "request collapsing" mechanism in the Batcher where it identifies duplicate tasks within the same batch and only sends one to the GPU, then replicates the result for all original task IDs.
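The request-collapsing step described above can be sketched as two small helpers: one that deduplicates a batch before the GPU call, and one that fans results back out. The names collapse and fan_out and the dict-shaped tasks are assumptions; the GPU API is assumed to return one output per input, in order.

```python
import hashlib


def collapse(batch):
    """Deduplicate a batch by payload hash.

    Returns (unique_payloads, owners) where owners[i] lists every task_id
    that asked for unique_payloads[i].
    """
    unique_payloads, owners, index_by_hash = [], [], {}
    for task in batch:
        digest = hashlib.sha256(task["payload"].encode("utf-8")).hexdigest()
        if digest not in index_by_hash:
            index_by_hash[digest] = len(unique_payloads)
            unique_payloads.append(task["payload"])
            owners.append([])
        owners[index_by_hash[digest]].append(task["task_id"])
    return unique_payloads, owners


def fan_out(outputs, owners):
    """Replicate each GPU output to every task_id that requested that input."""
    return {tid: out for out, tids in zip(outputs, owners) for tid in tids}


if __name__ == "__main__":
    batch = [
        {"task_id": "t1", "payload": "hello"},
        {"task_id": "t2", "payload": "world"},
        {"task_id": "t3", "payload": "hello"},
    ]
    payloads, owners = collapse(batch)
    fake_outputs = [p.upper() for p in payloads]  # stand-in for the GPU call
    print(fan_out(fake_outputs, owners))
```

Collapsing only helps when duplicates land in the same batch, which is more likely if the shard is chosen by payload hash rather than task_id; that is a routing trade-off rather than something this sketch decides.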