The Question
Design: Inventory Reservation System with TTL Auto-Release
Design a high-concurrency inventory management system that supports atomic stock reservation for a 5-minute window. The system must expose an interface to block inventory for an order, confirm the order, and query current stock levels. If an order is not confirmed within 5 minutes, the reserved stock must be released back to the available pool automatically. Discuss how you would handle race conditions between confirmation and expiration, ensure strong consistency under high load (e.g., flash sales), and maintain system reliability in the event of worker or cache failures.
Redis
PostgreSQL
Lua
SQS
RabbitMQ
Transactional Outbox
Optimistic Locking
Questions & Insights
Clarifying Questions
Scale & Performance: What is the expected peak QPS for blockInventory (e.g., flash sale scenarios) and the total number of SKUs?
Durability: Should the inventory state survive a total system crash, or is an in-memory solution with periodic snapshots sufficient for the MVP?
Consistency: Is "at-least-once" release acceptable for the 5-minute timeout, or must it be "exactly 5 minutes" with hard real-time guarantees?
Network Partitions: How should the system behave if the confirmation arrives exactly as the 5-minute timer expires?
Assumptions for Design:
Scale: 10k Write QPS, 50k Read QPS (standard high-scale e-commerce).
Consistency: Strong consistency is required for inventory counts to prevent over-selling.
Persistence: Inventory must be persisted in an RDBMS, but performance is optimized via Redis.
Timeout: The 5-minute window is a business rule; a slight delay (seconds) in releasing is acceptable.
Thinking Process
To build a modular, extensible, and thread-safe inventory system, we focus on atomic state transitions and reliable asynchronous triggers.
Atomic Decrement: How do we prevent over-selling under high concurrency?
Strategy: Use Redis DECRBY inside a Lua script so that the check-and-decrement operation is atomic.
State Management: How do we track the lifecycle of a "Reserved" item?
Strategy: Use a relational database to track OrderInventory status (RESERVED, CONFIRMED, EXPIRED).
The 5-Minute Trigger: How do we reliably release inventory without polling the entire database?
Strategy: Use a Delayed Message Queue (e.g., SQS or Redis TTL keyspace notifications) to trigger a "Check & Release" worker after 300 seconds.
Concurrency Control: How do we handle race conditions between a late confirmOrder and a releaseInventory task?
Strategy: Use database-level optimistic locking (versioning) or a state-machine transition check.
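The state-machine transition check mentioned above can be sketched as a small lookup table. This is only an illustration: the names Status, ALLOWED, and can_transition are hypothetical, and the released state is called EXPIRED here to follow the Reservations table design later in the document.

```python
from enum import Enum


class Status(Enum):
    RESERVED = "RESERVED"
    CONFIRMED = "CONFIRMED"
    EXPIRED = "EXPIRED"


# RESERVED is the only state that may change; CONFIRMED and EXPIRED are terminal.
ALLOWED = {
    Status.RESERVED: {Status.CONFIRMED, Status.EXPIRED},
    Status.CONFIRMED: set(),
    Status.EXPIRED: set(),
}


def can_transition(current: Status, target: Status) -> bool:
    """Return True only if the lifecycle permits moving from current to target."""
    return target in ALLOWED[current]
```

Because both confirmOrder and the release task start from RESERVED, whichever one commits first moves the reservation into a terminal state, and the other is refused.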
Bonus Points
Transactional Outbox Pattern: Ensure that the inventory decrement in the DB and the scheduling of the 5-minute timer happen within a single atomic transaction to prevent orphaned reservations.
Idempotency Keys: Use orderId as an idempotency token to ensure that retried blockInventory calls do not decrement stock multiple times.
Distributed Sharding: Partition inventory by productId to avoid hot-spot contention on a single database node during high-traffic events.
Write-Back Caching: Use Redis as the primary counter for performance, with an asynchronous process syncing counts back to the RDBMS to reduce disk I/O pressure.
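The first two points (transactional outbox and orderId as an idempotency key) can be sketched together. The sketch below uses an in-memory SQLite database as a stand-in for the RDBMS; the outbox table, its columns, the return strings, and the function names are assumptions made for illustration, not part of the original design. A separate relay process (not shown) would read unpublished outbox rows and hand them to the delayed queue.

```python
import json
import sqlite3


class InsufficientStock(Exception):
    """Raised inside the transaction so that the reservation row is rolled back."""


def open_db() -> sqlite3.Connection:
    # In-memory stand-in for the RDBMS; schema names are illustrative.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE products (id TEXT PRIMARY KEY, available_stock INTEGER NOT NULL);
        CREATE TABLE reservations (
            order_id TEXT PRIMARY KEY, product_id TEXT NOT NULL,
            count INTEGER NOT NULL, status TEXT NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL,
            delay_seconds INTEGER NOT NULL, published INTEGER NOT NULL DEFAULT 0);
    """)
    return conn


def block_inventory(conn: sqlite3.Connection, order_id: str, product_id: str,
                    count: int, delay_seconds: int = 300) -> str:
    """Reserve stock and record the release timer in one atomic transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            cur = conn.execute(
                "INSERT OR IGNORE INTO reservations VALUES (?, ?, ?, 'RESERVED')",
                (order_id, product_id, count))
            if cur.rowcount == 0:
                return "DUPLICATE"  # retried call: order_id is already known
            cur = conn.execute(
                "UPDATE products SET available_stock = available_stock - ? "
                "WHERE id = ? AND available_stock >= ?",
                (count, product_id, count))
            if cur.rowcount == 0:
                raise InsufficientStock(product_id)
            # Same transaction as the decrement, so no reservation lacks a timer.
            conn.execute(
                "INSERT INTO outbox (payload, delay_seconds) VALUES (?, ?)",
                (json.dumps({"orderId": order_id}), delay_seconds))
    except InsufficientStock:
        return "INSUFFICIENT_STOCK"
    return "RESERVED"
```

A retried call with the same orderId hits the primary-key conflict and returns without touching stock, which is the property the idempotency bullet asks for.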
Design Breakdown
Functional Requirements
Core Use Cases:
blockInventory: Atomically reserve a specific quantity for an order.
confirmOrder: Permanently commit the reservation.
getInventory: Retrieve current available stock.
Auto-Release: Automatically return reserved stock to "available" if not confirmed within 5 minutes.
Scope Control:
In-Scope: Atomic stock management, timeout logic, basic persistence.
Out-of-Scope: Payment processing, shipping logistics, user authentication.
Non-Functional Requirements
Scale: Support up to 1M SKUs and high-concurrency bursts.
Latency: getInventory and blockInventory must return in < 50ms (p99).
Availability: 99.99% availability; inventory must be available even if the background cleanup worker is temporarily down.
Consistency: Strong consistency for stock levels (no overselling).
Fault Tolerance: If the timer service fails, a secondary batch job should recover "stuck" reservations.
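One way the secondary batch job could find "stuck" reservations is a periodic query for rows that are still RESERVED well past the 5-minute window. The sketch below assumes a reservations table whose created_at is stored as epoch seconds, and a 60-second grace period so the sweeper does not compete with the normal timer; both are assumptions made for illustration.

```python
import sqlite3

GRACE_SECONDS = 60  # assumed slack on top of the 5-minute window


def find_stuck_reservations(conn: sqlite3.Connection, now_epoch: int,
                            ttl_seconds: int = 300) -> list:
    """Return order_ids still RESERVED after ttl_seconds plus the grace period."""
    cutoff = now_epoch - ttl_seconds - GRACE_SECONDS
    rows = conn.execute(
        "SELECT order_id FROM reservations "
        "WHERE status = 'RESERVED' AND created_at < ? "
        "ORDER BY created_at",
        (cutoff,))
    return [row[0] for row in rows]
```

Each returned order would then go through the same conditional release path as a normal timer message, so the sweeper stays safe to run alongside the primary mechanism.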
Estimation
Traffic: 10,000 block requests/sec.
Storage: 1M SKUs × 100 bytes = 100 MB (RAM/Redis); 10M orders/day × 200 bytes = 2 GB/day (DB).
Bandwidth: Negligible for metadata (approx. 5-10 Mbps).
Timer Load: 10,000 delayed messages per second.
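The storage figures above can be checked with a few lines of arithmetic. The last value, the number of timers pending at once, is not stated in the estimate; it is derived here by assuming the peak rate is sustained for a full 5-minute window.

```python
SKU_COUNT = 1_000_000
BYTES_PER_SKU = 100
ORDERS_PER_DAY = 10_000_000
BYTES_PER_ORDER = 200
PEAK_BLOCK_QPS = 10_000
RESERVATION_TTL_SECONDS = 300

# Redis footprint for the counters: 100,000,000 bytes = 100 MB.
redis_bytes = SKU_COUNT * BYTES_PER_SKU

# Daily growth of the reservations table: 2,000,000,000 bytes = 2 GB.
db_bytes_per_day = ORDERS_PER_DAY * BYTES_PER_ORDER

# Derived: delayed messages in flight if the peak lasts a whole TTL window.
pending_timers_at_peak = PEAK_BLOCK_QPS * RESERVATION_TTL_SECONDS

print(redis_bytes, db_bytes_per_day, pending_timers_at_peak)
```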
Blueprint
The design uses a Cache-Aside + Delayed Worker architecture. Redis provides the speed and atomicity for high-concurrency inventory counts, while a Relational Database provides the source of truth for order states. A Message Queue handles the 5-minute delayed execution for inventory release.
Inventory Service: The core logic coordinator for blocking and confirming.
Redis Cache: Stores atomic counters for each productId.
Relational DB: Stores the permanent record of inventory and reservation states.
Delayed Message Queue: Orchestrates the 5-minute "Time-to-Live" for reservations.
Simplicity Audit: This architecture avoids complex distributed locking (like ZooKeeper/Etcd) by relying on Redis's single-threaded atomic operations and simple queue-based asynchronous processing.
Architecture Decision Rationale:
Redis is used because RDBMS locking (SELECT FOR UPDATE) scales poorly for high-concurrency flash sales.
Delayed Queue is chosen over a "Scanning Cron Job" because scanning a DB table every second is inefficient and doesn't scale as the number of active orders grows.
High Level Architecture
Sub-system Deep Dive
Service
Topology & Scaling: Stateless microservices deployed across multiple Availability Zones. Scaling is based on Request QPS.
API Schema Design:
POST /v1/inventory/block:
Request: { productId, count, orderId }
Response: 200 OK or 409 Conflict (Insufficient Stock)
POST /v1/inventory/confirm:
Request: { orderId }
GET /v1/inventory/{productId}:
Response: { count }
Resilience:
Dead Letter Queues (DLQ): If the Cleanup Worker fails to release stock, the message is sent to a DLQ for manual intervention.
Idempotency: confirmOrder is idempotent. If called twice, the second call does nothing.
Storage
Access Pattern: Heavy read-modify-write for blockInventory.
Database Table Design:
Products: id (PK), sku, available_stock, total_stock, version (for optimistic locking)
Reservations: order_id (PK), product_id, count, status (RESERVED|CONFIRMED|EXPIRED), created_at
Technical Selection: PostgreSQL. Support for ACID transactions and row-level locking ensures data integrity.
Distribution Logic: Sharded by product_id to distribute write load.
Cache
Purpose: Performance and Atomic Counters.
Key-Value Schema:
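Key: stock:{productId}, Value: integer.
Logic: Use a Lua script that converts both operands to numbers before comparing. GET returns a string and ARGV values are strings, so a raw comparison would be lexicographic ("9" >= "10" holds) and would fail outright on a missing key: local stock = tonumber(redis.call('get', KEYS[1]) or '0'); if stock >= tonumber(ARGV[1]) then return redis.call('decrby', KEYS[1], ARGV[1]) else return -1 end.
Failure Handling: If Redis goes down, the service falls back to the RDBMS (degraded performance). On Redis recovery, it is re-hydrated from the RDBMS.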
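A minimal sketch of how the service might call the check-and-decrement script follows. It assumes a client exposing eval(script, numkeys, *keys_and_args), which is the shape of the redis-py eval method. The script converts values with tonumber because Redis hands strings to Lua. try_reserve and FakeRedis are hypothetical names; FakeRedis only mirrors the script's semantics in Python so the example runs without a server.

```python
RESERVE_LUA = """
local stock = tonumber(redis.call('get', KEYS[1]) or '0')
if stock >= tonumber(ARGV[1]) then
  return redis.call('decrby', KEYS[1], ARGV[1])
else
  return -1
end
"""


def try_reserve(client, product_id: str, qty: int) -> bool:
    """Return True if qty units were atomically taken from stock:{product_id}."""
    if qty <= 0:
        raise ValueError("qty must be positive")
    remaining = client.eval(RESERVE_LUA, 1, f"stock:{product_id}", qty)
    # The script returns the new stock level (>= 0) on success and -1 on refusal.
    return int(remaining) >= 0


class FakeRedis:
    """Test double that imitates the script's behavior; not a real Redis client."""

    def __init__(self, stock: dict):
        self.stock = dict(stock)

    def eval(self, script, numkeys, *keys_and_args):
        key, qty = keys_and_args[0], int(keys_and_args[1])
        current = int(self.stock.get(key, 0))
        if current >= qty:
            self.stock[key] = current - qty
            return self.stock[key]
        return -1


client = FakeRedis({"stock:p1": 10})
print(try_reserve(client, "p1", 9), try_reserve(client, "p1", 2))
```

With a real server, the same function would receive a redis-py client instead of FakeRedis; registering the script once and invoking it by hash would avoid resending the source on each call.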
Messaging
Purpose: Implements the 5-minute delay.
Technical Selection: AWS SQS (with DelaySeconds=300) or RabbitMQ (Delayed Exchange).
Failure Handling: Consumers use a "Visibility Timeout." If the worker crashes mid-processing, the message reappears for another worker.
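For the SQS option, scheduling the release could look roughly like the sketch below. It assumes a boto3 SQS client is passed in (its send_message call accepts QueueUrl, MessageBody, and DelaySeconds); the function names and the queue URL are placeholders, not part of the original design.

```python
import json

# The 5-minute window from the requirements. SQS caps DelaySeconds at 900.
RESERVATION_TTL_SECONDS = 300


def build_release_message(queue_url: str, order_id: str) -> dict:
    """Build the keyword arguments for send_message; the body carries the orderId."""
    return {
        "QueueUrl": queue_url,
        "MessageBody": json.dumps({"orderId": order_id}),
        "DelaySeconds": RESERVATION_TTL_SECONDS,
    }


def schedule_release(sqs_client, queue_url: str, order_id: str) -> None:
    """Enqueue a message that becomes visible to the Cleanup Worker after the delay."""
    sqs_client.send_message(**build_release_message(queue_url, order_id))
```

Keeping the message construction separate from the send makes the payload easy to unit test without AWS credentials.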
Data Processing
Cleanup Worker:
Consumes message: { orderId }.
Check DB: SELECT status, product_id, count FROM Reservations WHERE order_id = ?.
If status == 'RESERVED':
Update status to EXPIRED.
Increment Redis: INCRBY stock:{productId} {count}.
Increment DB: UPDATE Products SET available_stock = available_stock + {count} WHERE id = {productId}.
Wrap Up
Advanced Topics
Trade-offs: We choose Consistency over Availability (CP) for the inventory count. It is better to fail a request than to sell the same item twice.
Race Condition: A user confirms an order at 4m 59s, but the cleanup worker starts at 5m 00s.
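Resolution: Both paths run inside a DB transaction and guard their state change with WHERE status = 'RESERVED'. If confirmOrder commits first, the worker's UPDATE affects 0 rows and the worker terminates gracefully without releasing stock. If the worker commits first, the late confirmOrder matches 0 rows and is rejected.
Performance Optimization: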
Batching: Syncing Redis to DB can be done in batches every 1-5 seconds to reduce DB write IOPS.
Read-Through: getInventory checks Redis first; if missing, it loads from DB.
Security: Service-to-service communication via mTLS. RBAC to ensure only the Order Service can call blockInventory.
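The race-condition resolution described earlier in this section (a guarded UPDATE that only fires while the row is still RESERVED) can be sketched as follows. SQLite stands in for PostgreSQL, the function names are hypothetical, and the Redis INCRBY step of the Cleanup Worker is omitted to keep the example self-contained.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    # In-memory stand-in for the Products and Reservations tables.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE products (id TEXT PRIMARY KEY, available_stock INTEGER NOT NULL);
        CREATE TABLE reservations (
            order_id TEXT PRIMARY KEY, product_id TEXT NOT NULL,
            count INTEGER NOT NULL, status TEXT NOT NULL);
    """)
    return conn


def confirm_order(conn: sqlite3.Connection, order_id: str) -> bool:
    """Return True only if this call moved the reservation to CONFIRMED."""
    with conn:
        cur = conn.execute(
            "UPDATE reservations SET status = 'CONFIRMED' "
            "WHERE order_id = ? AND status = 'RESERVED'", (order_id,))
        return cur.rowcount == 1


def release_if_expired(conn: sqlite3.Connection, order_id: str) -> bool:
    """Expire the reservation and return its stock, unless it already left RESERVED."""
    with conn:
        cur = conn.execute(
            "UPDATE reservations SET status = 'EXPIRED' "
            "WHERE order_id = ? AND status = 'RESERVED'", (order_id,))
        if cur.rowcount == 0:
            return False  # confirmed, already expired, or unknown: nothing to do
        product_id, count = conn.execute(
            "SELECT product_id, count FROM reservations WHERE order_id = ?",
            (order_id,)).fetchone()
        conn.execute(
            "UPDATE products SET available_stock = available_stock + ? WHERE id = ?",
            (count, product_id))
    return True
```

Because the status change and the stock increment commit together, a redelivered timer message for the same order finds 0 matching rows and cannot return the stock twice.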