The Question
DesignLow-Latency Stock Exchange Design
Design a high-performance electronic trading platform capable of matching buy and sell orders with microsecond-level determinism. The system should support standard order types, ensure strict price-time priority, and remain resilient to component failures while maintaining a complete audit trail of all transactions.
LMAX Disruptor
In-Memory Matching Engine
Kafka
Redis
Event Sourcing
Questions & Insights
Clarifying Questions
What is the scale of the system? (Assumed: 100,000 peak orders per second, 10,000 active symbols).
What order types must be supported for the MVP? (Assumed: Limit and Market orders only).
What are the latency requirements? (Assumed: < 10ms end-to-end latency for order acknowledgment; deterministic matching).
Is clearing and settlement part of this design? (Assumed: The exchange handles matching and generates trade reports; clearing/settlement is an asynchronous downstream process).
What is the availability requirement? (Assumed: 99.99% availability during trading hours; zero data loss for executed trades).
Thinking Process
The core challenge of a stock exchange is deterministic matching at high speed.
How do we ensure fairness and order? Use a Sequencer to assign a monotonic sequence number to every incoming order before it hits the matching engine.
How do we achieve low latency matching? Implement an In-Memory Matching Engine using a Price-Time Priority (FIFO) algorithm.
How do we maintain durability without sacrificing speed? Use an Event Sourcing pattern where the sequence of orders is persisted to a write-ahead log (WAL) before matching.
How do we scale? Partition (shard) the Matching Engines by Symbol/Ticker to allow parallel processing across different stocks.
Bonus Points
LMAX Disruptor Pattern: Utilizing ring buffers and lock-free concurrency for the sequencer to handle millions of events per second with minimal GC pressure.
Kernel Bypass (UDP/Multicast): For high-frequency trading (HFT) environments, using technologies like Solarflare/Mellanox with DPDK to bypass the Linux TCP/IP stack for market data distribution.
Deterministic Replay: Designing the matching engine as a pure function of its inputs (the sequenced order stream), allowing a backup node to achieve the exact same state by replaying the log.
Hardware Acceleration (FPGA): Mentioning that the most critical matching paths in top-tier exchanges are often moved to FPGA for nanosecond-level consistency.
Design Breakdown
Functional Requirements
Order Management: Users can place, update, and cancel Limit/Market orders.
Order Matching: Match buy and sell orders based on Price-Time priority.
Trade Generation: Create trade execution reports when a match occurs.
Market Data Feed: Provide real-time L1 (best bid/offer) and L2 (order book depth) updates.
Risk Checks: Basic validation of user balance and symbol permissions.
Non-Functional Requirements
Ultra-low Latency: Critical for matching and order status updates.
High Throughput: Must handle bursts during market open/close.
Determinism: The same sequence of orders must always result in the same trade executions.
Durability: Every acknowledged order and executed trade must be recoverable.
Estimation
Throughput: 100k orders/sec.
Payload Size: ~200 bytes per order.
Ingress Bandwidth: 100k * 200B = 20 MB/s (easily handled by 1Gbps network).
Storage: 100k orders/sec 6.5 hours/day 200B ≈ 468 GB per day for raw order logs.
Memory: 10k symbols 1000 orders/book 100 bytes/order ≈ 1 GB (Memory is not the bottleneck; CPU cache misses are).
Blueprint
Concise Summary: A partitioned, event-driven architecture where an Order Gateway validates requests, a Sequencer ensures strict ordering, and Sharded Matching Engines process orders in-memory.
Major Components:
Order Gateway: Handles client connections (FIX/Websocket), authentication, and pre-trade risk checks.
Sequencer: Assigns global sequence IDs to orders to guarantee deterministic processing.
Matching Engine: In-memory data structure sharded by ticker symbol that executes the FIFO matching logic.
Market Data Service: Aggregates matches and order book changes to broadcast to subscribers.
Simplicity Audit: This design avoids complex distributed transactions by using ticker-based sharding and single-threaded matching logic per shard, which is the standard for high-performance exchanges.
Architecture Decision Rationale:
Why this architecture?: Partitioning by symbol eliminates the need for cross-node coordination during matching.
Functional Satisfaction: Covers the full lifecycle from order entry to trade execution.
Non-functional Satisfaction: Event sourcing ensures durability and determinism; in-memory matching ensures the required low latency.
High Level Architecture
Sub-system Deep Dive
Service
Topology & Scaling:
Order Gateway: Stateless, scaled horizontally behind an L7 Load Balancer.
Matching Engine: Stateful, sharded by
SymbolID. Active-Passive pair per shard for high availability.API Schema Design:
POST /v1/orders: (REST or FIX protocol)symbol, side (buy/sell), type, price, quantity, client_order_id.DELETE /v1/orders/{order_id}: Cancel order.Idempotency:
client_order_id used to prevent duplicate submissions.Resilience:
Circuit Breaker: Triggered if the Sequencer lag exceeds a threshold.
Failover: If an active Matching Engine fails, the passive node replays the Sequencer log to rebuild the state.
Security:
Mutual TLS (mTLS) for institutional clients; HMAC signatures for retail API keys.
Storage
Access Pattern:
High-frequency sequential writes (Logging).
Low-frequency point lookups (Audit/History).
Database Table Design:
Trades Table:
trade_id (PK), order_id_buy, order_id_sell, symbol, price, quantity, timestamp.Orders Table:
order_id (PK), user_id, symbol, status (Filled, Cancelled), remaining_qty.Technical Selection:
TimescaleDB or Standard PostgreSQL: For trade history and order audit trails.
Raft-based Log (or Kafka with idempotency): For the Sequencer.
Distribution Logic: Sharded by
symbol.Cache
Purpose & Justification: Provide sub-millisecond access to the current "Top of Book" (Best Bid/Offer) for the UI without hitting the database.
Key-Value Schema:
Key:
ticker:{symbol}:l1, Value: {bid: 100.1, ask: 100.2, volume: 500}.Key:
ticker:{symbol}:book, Value: Sorted list of price levels (L2 data).Technical Selection: Redis. High throughput, supports Sorted Sets which are perfect for representing price levels.
Failure Handling: If Redis fails, the Market Data Service continues to stream live updates; users may see a temporary lag in "snapshot" views.
Messaging
Purpose & Decoupling: The Sequencer acts as the decoupling point between the Gateway and the Matching Engines.
Event / Topic Schema:
Topic:
order_input_{shard_id}: All incoming order requests.Topic:
trade_output: All successful matches for downstream clearing.Throughput & Partitioning: Partitioned by
SymbolID to ensure all orders for "AAPL" go to the same matching engine instance.Technical Selection: Kafka or Redpanda. Chosen for high-throughput persistence and replayability.
Wrap Up
Advanced Topics
Monitoring:
Order-to-Ack Latency: P99 measurements.
Sequencer Lag: Critical metric for system health.
Trade-offs:
Consistency vs. Availability: We choose Consistency (CP in CAP). It is better to stop trading a specific symbol than to allow non-deterministic matching or double-spending of assets.
Bottlenecks:
Single-threaded matching per symbol. While this ensures FIFO, a single extremely active symbol (e.g., TSLA) can be limited by the clock speed of a single CPU core.
Optimization:
Zero-Copy: Use direct byte buffers for passing messages between the Gateway and the Sequencer to avoid expensive object serialization.