The Question

High-Concurrency GPU Inference Batching System

Design a scalable infrastructure for a high-concurrency inference API. The system must use a fixed, black-box GPU inference endpoint. Your primary focus should be the architectural components required to implement a high-performance dynamic batching service that aggregates incoming user requests into optimal groups before calling the GPU worker. Address the challenges of request-to-result mapping in a distributed environment, handling backpressure during peak loads (10k+ QPS), and minimizing the latency overhead introduced by the batching logic.

Redis

gRPC

Redis Streams

Micro-batching

Python Asyncio

JWT

VPC

Pub/Sub

Questions & Insights

Clarifying Questions

What is the target throughput and latency SLA? (e.g., 10,000 QPS with a p99 < 200ms).

What is the optimal batch size and maximum batch window for the model? (e.g., Batch size of 32, max wait time of 50ms).

What are the typical request/response payload sizes? (e.g., 1KB text input vs. 5MB image).

Should the system support request prioritization? (e.g., Premium users vs. free tier).

How are results returned to the client? (Synchronous blocking call, WebSockets, or Asynchronous polling/webhooks).

Assumptions

QPS: 5,000 Average, 10,000 Peak.

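Latency: The batching logic itself (dequeue, grouping, and result routing) should add < 10ms of overhead, not counting the time a request spends waiting for its batch window to close; model inference takes ~100-200ms per call.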

Batching: Fixed maximum batch size (e.g., 64) and a time-out trigger (e.g., 50ms).

Model: Large Language Model or Vision model where GPU utilization is the primary cost/bottleneck.

API: Synchronous REST/gRPC for the end-user (client waits for the result).

Thinking Process

Core Bottleneck: GPU idle time and memory fragmentation. Individual requests waste TFLOPS; batching maximizes throughput but introduces "queuing delay."

Key Strategy: Implement a "Wait-Notify" pattern. The API Gateway accepts a request, assigns a unique request_id, pushes it to a high-speed queue, and parks the request thread. The Batcher drains the queue, groups requests, executes the GPU call, and writes results to a pub/sub channel.

Progressive Walkthrough:

How do we decouple the web arrival from the GPU processing? (High-speed Distributed Queue).

How do we group requests efficiently without starvation? (Window-based Micro-batching).

How do we return the result to the correct parked connection? (Redis Pub/Sub or Distributed Map).

How do we handle backpressure if the GPU workers are overwhelmed? (Adaptive Rate Limiting).

Bonus Points

Adaptive Batching: Dynamically adjust batch size based on incoming traffic volume to balance throughput and latency (PID controller approach).

Zero-Copy Data Transfer: Using Shared Memory (e.g., the Plasma object store from Apache Arrow) or RDMA between the Batcher and GPU Workers to minimize serialization overhead for large payloads (images/videos).

Locality-Aware Routing: Routing requests for the same model version/weights to specific worker groups to avoid GPU "cold starts" or weight-swapping latency.

Speculative Batching: If resources allow, beginning processing early if a batch is "mostly full" and predicted traffic suggests the remaining slots won't fill in time.

Design Breakdown

Functional Requirements

Core Use Cases:

Users submit inference requests and receive predictions.

System automatically groups multiple requests into a single batch for the GPU.

System handles the coordination of mapping batch results back to individual requests.

Scope Control:

In-Scope: Batching logic, request/result synchronization, worker scaling, and the API facade.

Out-of-Scope: Model training, model weight optimization (quantization), and the internal logic of the black-box inference API.

Non-Functional Requirements

Scale: Must handle 10k peak QPS.

Latency: Time spent waiting in the batch window is capped at 50ms (the max wait time); the queuing and grouping logic itself should add < 10ms on top of that.

Availability & Reliability: 99.9% uptime. If a worker fails, the batch must be retried or individual requests failed gracefully.

Consistency: Not applicable (stateless inference), but "Result-to-Request" mapping must be 100% accurate.

Fault Tolerance: Use of Dead Letter Queues (DLQ) for poisoned requests (e.g., malformed input that crashes the model).

Estimation

Traffic: 10,000 QPS.

Batch Size: 50 (average).

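GPU Throughput: Each GPU can handle ~20 batches/sec. This implies ~50ms per batch or several batches in flight at once; with strictly serial calls at the 100-200ms inference time assumed earlier, a GPU would reach only 5-10 batches/sec and the fleet would need to grow accordingly.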

Worker Fleet: 10,000 / (50 * 20) = 10 GPUs required.

Storage: Temporary storage in Redis for results (10k req/s * 1KB * 10s retention) = ~100MB RAM.

Bandwidth: 10,000 req/s * 1KB = 10MB/s Inbound.
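
The arithmetic above can be checked with a short script. This is only a sketch of the estimate: the function and variable names are illustrative, and 1KB is taken as 1,000 bytes to match the rounded figures.

```python
import math

# Inputs copied from the estimate above.
PEAK_QPS = 10_000            # peak requests per second
AVG_BATCH_SIZE = 50          # requests per GPU call
BATCHES_PER_GPU_PER_S = 20   # assumed sustained batches per GPU per second
PAYLOAD_BYTES = 1_000        # 1KB per request (decimal units, an assumption)
RETENTION_S = 10             # seconds a result is kept in Redis


def gpus_required(qps, batch_size, batches_per_s):
    """GPUs needed to serve `qps`, rounded up to a whole GPU."""
    per_gpu_qps = batch_size * batches_per_s
    return math.ceil(qps / per_gpu_qps)


def result_buffer_bytes(qps, payload_bytes, retention_s):
    """Bytes of results held in memory at steady state."""
    return qps * payload_bytes * retention_s


def inbound_bytes_per_s(qps, payload_bytes):
    """Inbound request bandwidth in bytes per second."""
    return qps * payload_bytes


print(gpus_required(PEAK_QPS, AVG_BATCH_SIZE, BATCHES_PER_GPU_PER_S))  # 10
print(result_buffer_bytes(PEAK_QPS, PAYLOAD_BYTES, RETENTION_S))       # 100000000
print(inbound_bytes_per_s(PEAK_QPS, PAYLOAD_BYTES))                    # 10000000
```

Note that 10 GPUs covers the peak exactly, with no headroom for failed workers or partially filled batches.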

Blueprint

Concise Summary: A high-performance inference wrapper utilizing a distributed queue (Redis) to decouple the request arrival from GPU execution, with a dedicated Batching Service that acts as a consumer and orchestrator.

Major Components:

Inference Gateway: A stateless API layer that generates Request IDs and parks client connections using a Wait/Notify pattern.

Request Queue: A high-throughput Redis Stream for buffering incoming inference tasks.

Batching Service: The core logic that pulls tasks from Redis, aggregates them into batches based on size/time, and calls the GPU API.

Result Store (Redis Pub/Sub): A low-latency mechanism to broadcast results back to the Gateway.

Simplicity Audit: This architecture avoids complex service meshes and uses Redis as both a queue and a coordinator, which is sufficient for MVP scale and provides sub-millisecond coordination.

Architecture Decision Rationale:

Why this architecture?: GPU inference is significantly faster per-request when batched. Decoupling ensures that a slow GPU call doesn't block the entire API frontend.

Functional Satisfaction: Users get a standard Request/Response experience while the backend gains the efficiency of batching.

Non-functional Satisfaction: Redis provides the necessary low-latency state management to keep the "batching tax" minimal.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling:

Inference Gateway: Stateless nodes scaled via CPU/Request count. Each node maintains a local map of RequestID -> ResponseChannel.

Batching Service: Scaled based on "Queue Depth." If the Redis Stream length grows, more Batcher instances are spawned.
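
The Gateway's local RequestID -> ResponseChannel map can be sketched with asyncio Futures. This is a minimal illustration, not the definitive implementation: PendingRequests, handle_request, and the enqueue callback are hypothetical names, and the enqueue step stands in for writing the task to the request queue.

```python
import asyncio
import uuid


class PendingRequests:
    """Per-Gateway map of request_id -> Future that will carry the result."""

    def __init__(self):
        self._waiters = {}

    def register(self):
        """Create a request_id and the Future its handler will wait on."""
        request_id = str(uuid.uuid4())
        future = asyncio.get_running_loop().create_future()
        self._waiters[request_id] = future
        return request_id, future

    def resolve(self, request_id, output):
        """Called by the result listener; returns False for unknown or finished IDs."""
        future = self._waiters.pop(request_id, None)
        if future is None or future.done():
            return False
        future.set_result(output)
        return True

    def discard(self, request_id):
        """Drop the entry so timed-out requests do not leak memory."""
        self._waiters.pop(request_id, None)


async def handle_request(pending, enqueue, timeout_s=30.0):
    """Park the caller until its result arrives or the timeout expires."""
    request_id, future = pending.register()
    try:
        await enqueue(request_id)  # hypothetical: push the task to the queue
        output = await asyncio.wait_for(future, timeout=timeout_s)
        return 200, output
    except asyncio.TimeoutError:
        return 504, b""
    finally:
        pending.discard(request_id)
```

Because the handler awaits a Future rather than blocking a thread, one Gateway process can park many connections at once. Late results for requests that already timed out are ignored by resolve.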

API Schema Design:

Endpoint: POST /v1/predict

Protocol: gRPC (preferred for low overhead) or REST.

Request: { "model_id": "string", "input_data": "bytes" }

Response: { "request_id": "uuid", "output": "bytes", "latency_ms": "int" }

Resilience & Reliability:

Timeout: The Gateway has a 30s timeout. If no result is received via Pub/Sub, it returns a 504.

Backpressure: If Redis memory usage > 80%, Gateway returns 429 (Too Many Requests).
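
The backpressure rule can be expressed as a small admission check. This sketch assumes the Gateway periodically reads the used_memory and maxmemory fields that Redis reports in INFO memory; the function name and the dictionary shape are illustrative.

```python
def admission_status(memory_info, threshold=0.8):
    """Return 429 when Redis memory use exceeds `threshold`, otherwise 200.

    `memory_info` is assumed to be a dict with integer `used_memory` and
    `maxmemory` values, as reported by the Redis INFO memory section.
    """
    used = memory_info["used_memory"]
    limit = memory_info["maxmemory"]
    if limit <= 0:
        # maxmemory of 0 means no limit is configured, so no ratio exists.
        return 200
    return 429 if used / limit > threshold else 200


print(admission_status({"used_memory": 850, "maxmemory": 1000}))  # 429
print(admission_status({"used_memory": 500, "maxmemory": 1000}))  # 200
```

Polling this value on a timer and caching it keeps the check off the request hot path. Stream length could serve as an additional signal, since memory use alone reacts slowly to a growing queue.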

Security:

JWT-based AuthN at the Gateway.

Internal traffic between Gateway and Batcher is via VPC-private IPs.

Cache

Purpose & Justification: Redis is used for Request Buffering and Result Notification. This solves the latency bottleneck of cross-service coordination.

Key-Value Schema:

Request Queue: Redis Stream inference_tasks.

Result PubSub: One channel per Gateway, results:{gateway_id}, or a single global channel on which each Gateway filters for its own request IDs.

Technical Selection: Redis (Cluster mode).

Failure Handling: If Redis fails, the Gateway enters "Direct Mode" (calling the GPU API individually without batching) to maintain availability at the cost of performance, or fails fast if the GPU API cannot handle single-request load.

Messaging

Purpose & Decoupling: Redis Streams provide the async boundary.

Event Schema: { "req_id": "uuid", "gateway_id": "string", "payload": "..." }.

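Throughput: A single Redis node can typically sustain more than 100k simple ops/sec. Each request costs several operations (an XADD, a PUBLISH, and a share of the batched XREADGROUP and XACK calls), so the 10k QPS requirement still fits with headroom.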

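Failure Handling: Use Redis Consumer Groups with explicit ACKs (XACK). If a Batching Service instance crashes mid-inference, its unacknowledged messages stay in the group's pending entries list. Redis Streams have no built-in visibility timeout, so another worker must claim entries that have been idle longer than a threshold using XAUTOCLAIM (or XPENDING plus XCLAIM).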
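
Recovery of messages left behind by a crashed Batcher might look like the sketch below. It assumes a redis-py style client whose xautoclaim method wraps the XAUTOCLAIM command and returns the next cursor followed by the claimed entries; the function name, the group and consumer names, and the 5-second idle threshold are all hypothetical.

```python
def reclaim_stale(client, stream, group, consumer, min_idle_ms=5_000, count=64):
    """Claim entries that have sat unacknowledged for at least `min_idle_ms`.

    `client` is assumed to expose xautoclaim(name, groupname, consumername,
    min_idle_time, start_id=..., count=...) as redis-py does.
    """
    claimed = []
    cursor = "0-0"
    while True:
        reply = client.xautoclaim(
            stream, group, consumer, min_idle_ms, start_id=cursor, count=count
        )
        cursor, entries = reply[0], reply[1]
        claimed.extend(entries)
        # A cursor of 0-0 signals that the whole pending list was scanned.
        if cursor in ("0-0", b"0-0"):
            break
    return claimed
```

A reclaimed request may be executed twice if the original worker was only slow rather than dead. That is tolerable here because inference is stateless and the Gateway can drop results for request IDs it is no longer waiting on.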

Data Processing

Processing Model: Micro-batching.

Processing DAG:

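Pull: Batcher pulls up to max_batch_size (64) entries from Redis using XREADGROUP with a COUNT limit.

Wait: If fewer than 64 items have been collected, keep reading for up to max_wait_ms (50ms), then dispatch the partial batch so that no request waits indefinitely.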

Execute: Call the Blackbox GPU API with the collected array.

Fan-out: Iterate through the API response, publishing each result to the Result PubSub channel using the req_id.

ACK: Acknowledge the processed messages in the Redis Stream (XACK).
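
The steps above can be sketched as a single asyncio loop iteration. To keep the example self-contained, an in-memory asyncio.Queue stands in for the Redis Stream, and infer, publish, and ack are hypothetical callbacks for the GPU call, the Pub/Sub publish, and XACK. The sketch assumes the black-box API returns outputs in the same order as its inputs, and the entry_id field is an illustrative stand-in for the stream entry ID.

```python
import asyncio
import time


async def collect_batch(queue, max_batch_size=64, max_wait_ms=50):
    """Wait for a first task, then fill the batch until full or the window closes."""
    batch = [await queue.get()]
    deadline = time.monotonic() + max_wait_ms / 1000
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break
    return batch


async def run_once(queue, infer, publish, ack, max_batch_size=64, max_wait_ms=50):
    """Pull, wait, execute, fan out, and acknowledge one batch."""
    batch = await collect_batch(queue, max_batch_size, max_wait_ms)
    outputs = await infer([task["payload"] for task in batch])
    if len(outputs) != len(batch):
        # Refuse to guess: a length mismatch would corrupt the mapping.
        raise RuntimeError("GPU API returned a different number of results")
    for task, output in zip(batch, outputs):
        await publish(task["gateway_id"], task["req_id"], output)
    await ack([task["entry_id"] for task in batch])
    return len(batch)
```

Starting the wait window at the first arrival, rather than on a fixed tick, means an idle system adds no delay until a request actually shows up. Acknowledging only after the fan-out ensures a crash mid-batch leaves the entries pending for another worker.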

Technical Selection: Python with Asyncio or Go is preferred for high-concurrency coordination and low memory overhead.

Wrap Up

Advanced Topics

Trade-offs (Latency vs. Throughput): Increasing max_wait_ms improves GPU efficiency (lower cost) but increases p99 latency for users. 50ms is a reasonable starting point for an interactive feel and should be tuned against the measured p99 and batch fill rate.

Reliability:

Circuit Breaker: If the GPU API returns 5xx consistently, the Batcher stops pulling from the queue to allow the GPU to recover.
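
A minimal circuit breaker for the Batcher could look like the following. The class name, the threshold of 5 consecutive failures, and the 10-second cooldown are illustrative assumptions; the source only states that the Batcher should stop pulling while the GPU API keeps failing.

```python
import time


class CircuitBreaker:
    """Stops queue pulls after repeated GPU API failures, then probes again."""

    def __init__(self, failure_threshold=5, cooldown_s=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def allow_pull(self):
        """True if the Batcher may pull the next batch from the queue."""
        if self._opened_at is None:
            return True
        # After the cooldown, allow probe batches until an outcome is recorded.
        return self._clock() - self._opened_at >= self.cooldown_s

    def record_success(self):
        """A successful GPU call closes the circuit and resets the count."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        """Count a 5xx response; open (or re-open) the circuit at the threshold."""
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

While the circuit is open, requests accumulate in the stream, so this mechanism works together with the Gateway's 429 backpressure rather than replacing it.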

Bottleneck Analysis:

Redis Centralization: Redis could become a SPOF. Mitigation: Use Redis Sentinel or Cluster with high-availability failover.

Result Fan-out: If 100 Gateway nodes are all listening to one Pub/Sub channel, they each process every message. Optimization: Each Gateway node listens to its own unique channel (e.g., results:gateway_001) and the Batcher routes results accordingly.
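
The per-Gateway routing can be sketched as a grouping step that runs before publishing. The function names and the message shape are illustrative; only the results:{gateway_id} channel pattern comes from the design.

```python
from collections import defaultdict


def channel_for(gateway_id):
    """Channel name that a single Gateway subscribes to."""
    return f"results:{gateway_id}"


def group_by_channel(results):
    """Group (task, output) pairs into {channel: [message, ...]}.

    Each task is assumed to carry the gateway_id and req_id fields
    from the event schema.
    """
    routed = defaultdict(list)
    for task, output in results:
        message = {"req_id": task["req_id"], "output": output}
        routed[channel_for(task["gateway_id"])].append(message)
    return dict(routed)


batch_results = [
    ({"req_id": "a", "gateway_id": "gateway_001"}, b"cat"),
    ({"req_id": "b", "gateway_id": "gateway_002"}, b"dog"),
    ({"req_id": "c", "gateway_id": "gateway_001"}, b"bird"),
]
print(sorted(group_by_channel(batch_results)))
# ['results:gateway_001', 'results:gateway_002']
```

With this grouping, each Gateway receives only its own results, so the number of messages a node handles no longer grows with the size of the Gateway fleet.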

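Security: Data is ephemeral; Redis is configured without persistence (RDB snapshots and AOF disabled) to maximize speed and ensure no PII leaks into disk snapshots. The trade-off is that queued requests are lost if the Redis node restarts, so affected clients must retry.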