High-Concurrency GPU Inference Batching System
Design a scalable infrastructure for a high-concurrency inference API. The system must use a fixed, black-box GPU inference endpoint. Your primary focus should be the architectural components required to implement a high-performance dynamic batching service that aggregates incoming user requests into optimal groups before calling the GPU worker. Address the challenges of request-to-result mapping in a distributed environment, handling backpressure during peak loads (10k+ QPS), and minimizing the latency overhead introduced by the batching logic.
Redis, gRPC, Redis Streams, Micro-batching, Python Asyncio, Go, JWT, VPC, Pub/Sub
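
As a starting point, the core batching loop might look like the minimal single-process sketch below, written with Python asyncio. It illustrates three of the concerns named above: request-to-result mapping (each request carries its own future), backpressure (a bounded queue that rejects work when full), and bounded latency overhead (a batch is flushed when it reaches `max_batch_size` or when `max_wait_s` has elapsed since its first request). All names here (`MicroBatcher`, `QueueFullError`, `fake_gpu`, and the default parameter values) are hypothetical, and `fake_gpu` merely stands in for the black-box GPU endpoint. Distributed mapping across multiple API nodes, for example via Redis Streams or Pub/Sub, is not shown.

```python
import asyncio
from typing import Any, Awaitable, Callable, List


class QueueFullError(Exception):
    """Raised when the batcher sheds load instead of queueing more work."""


class MicroBatcher:
    """Hypothetical in-process dynamic batcher (a sketch, not a full service)."""

    def __init__(
        self,
        infer_batch: Callable[[List[Any]], Awaitable[List[Any]]],
        max_batch_size: int = 32,     # assumed value, tune per model
        max_wait_s: float = 0.005,    # assumed latency budget for batching
        max_pending: int = 1024,      # assumed bound used for backpressure
    ) -> None:
        self._infer_batch = infer_batch
        self._max_batch_size = max_batch_size
        self._max_wait_s = max_wait_s
        # Construct the batcher inside a running event loop.
        self._queue: asyncio.Queue = asyncio.Queue(maxsize=max_pending)

    async def submit(self, payload: Any) -> Any:
        """Enqueue one request and wait for its own result."""
        fut = asyncio.get_running_loop().create_future()
        try:
            # Fail fast when the queue is full so callers can return 429/503.
            self._queue.put_nowait((payload, fut))
        except asyncio.QueueFull:
            raise QueueFullError("too many pending requests") from None
        return await fut

    async def run(self) -> None:
        """Collect requests into batches and dispatch them until cancelled."""
        loop = asyncio.get_running_loop()
        while True:
            # Block until at least one request exists, then start the timer.
            batch = [await self._queue.get()]
            deadline = loop.time() + self._max_wait_s
            while len(batch) < self._max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    item = await asyncio.wait_for(self._queue.get(), remaining)
                except asyncio.TimeoutError:
                    break
                batch.append(item)

            payloads = [payload for payload, _ in batch]
            try:
                results = await self._infer_batch(payloads)
            except Exception as exc:
                # Propagate a failed batch call to every waiting caller.
                for _, fut in batch:
                    if not fut.done():
                        fut.set_exception(exc)
                continue
            # Results are assumed to come back in request order.
            for (_, fut), result in zip(batch, results):
                if not fut.done():
                    fut.set_result(result)


batch_sizes: List[int] = []


async def fake_gpu(payloads: List[int]) -> List[int]:
    """Stand-in for the black-box GPU endpoint: doubles each input."""
    batch_sizes.append(len(payloads))
    await asyncio.sleep(0.01)
    return [p * 2 for p in payloads]


async def demo() -> List[int]:
    batcher = MicroBatcher(fake_gpu, max_batch_size=8, max_wait_s=0.005)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(20)))
    worker.cancel()
    return results
```

Calling `asyncio.run(demo())` submits 20 concurrent requests and returns their results in submission order, with no batch larger than 8. Rejecting on a full queue is only one possible backpressure policy; blocking the caller or shedding the oldest requests are alternatives with different latency trade-offs.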