The Question
DesignStock Exchange Matching Engine Design
Design a high-performance stock exchange system capable of handling 100,000 orders per second with sub-millisecond matching latency. The system must ensure strict price-time priority, deterministic execution, and zero data loss for accepted orders. Detail the architecture for order sequencing, in-memory matching, and high-fanout market data distribution while addressing fault tolerance and recovery mechanisms.
C++
Java
Kafka
Aeron
LMAX Disruptor
UDP Multicast
Cassandra
PostgreSQL
Redis
FIX Protocol
SBE
Questions & Insights
Clarifying Questions
Traffic Scale: What is the peak number of orders per second (throughput) and the total number of traded symbols (stocks)?
Latency Targets: What is the acceptable end-to-end latency for order matching (e.g., sub-millisecond, microseconds)?
Order Types: Does the MVP need to support complex orders (Stop-Loss, Fill-or-Kill), or just simple Limit and Market orders?
Data Retention: How long do we need to store order history for audit and compliance (e.g., 7 years)?
Connectivity: Will clients connect via standard REST/WebSockets, or industry-standard protocols like FIX (Financial Information eXchange)?
Assumptions for this design:
Scale: 100,000 orders per second at peak.
Latency: Sub-10ms P99 (ambitious for a general system, standard for exchanges).
Core Assets: Supporting 10,000 tickers.
Protocols: Using FIX for institutional traders and WebSockets for retail/market data.
Thinking Process
Deterministic Sequencing: How do we ensure every node in the system sees orders in the exact same order? (Solution: A centralized Sequencer).
In-Memory Matching: How do we achieve ultra-low latency? (Solution: Single-threaded, in-memory Limit Order Book).
Reliable Recovery: How do we prevent data loss if the matching engine crashes? (Solution: Write-Ahead Logging (WAL) and Snapshotted state).
Market Data Fan-out: How do we broadcast price updates to millions of users without slowing down the matching engine? (Solution: Async Market Data Bus).
Bonus Points
LMAX Disruptor Pattern: Using a high-performance inter-thread messaging library to handle 100k+ TPS on a single thread with minimal garbage collection overhead.
Hardware Acceleration: Potential for FPGA-based matching engines for sub-microsecond execution in high-frequency trading (HFT) environments.
Clock Synchronization: Implementing PTP (Precision Time Protocol) instead of NTP for microsecond-level accuracy across distributed logs.
Zero-Copy Networking: Utilizing technologies like DPDK or kernel bypass to reduce the CPU overhead of processing network packets.
Design Breakdown
Functional Requirements
Core Use Cases:
Users can place/cancel Limit and Market orders.
The system matches buy and sell orders based on price-time priority.
The system generates trade execution reports.
The system broadcasts real-time market data (Price/Volume).
Scope Control:
In-Scope: Order Gateway, Sequencer, Matching Engine, Market Data Broadcaster.
Out-of-Scope: Settlement/Clearing (post-trade), User Wallet Management, KYC/Onboarding.
Non-Functional Requirements
Scale: Support 100k TPS and millions of concurrent market data subscribers.
Latency: P99 < 10ms from gateway to execution report.
Availability & Reliability: 99.999% availability; zero data loss for accepted orders.
Consistency: Strict serializability for the order book (no race conditions on the same stock).
Fault Tolerance: Automatic failover to a hot-standby matching engine.
Security: Strict authentication for FIX sessions; encryption of data at rest.
Estimation
Traffic Estimation:
100k orders/sec.
Each order is ~200 bytes.
100,000 * 200 = 20 MB/sec ingress.
Storage Estimation:
20 MB/sec 3600s 8 hours/trading day = ~576 GB per day for raw logs.
7 years of history = ~1.4 PB (with compression and indexing).
Bandwidth Estimation:
Outgoing Market Data: High fan-out. If 1k updates/sec and 100k subscribers, bandwidth can spike to multiple Gbps (mitigated by multicast or tiered distribution).
Blueprint
Concise Summary: The system utilizes an "Order Gateway" to receive requests, a "Sequencer" to timestamp and order them, and an in-memory "Matching Engine" to execute trades against a Limit Order Book.
Major Components:
Order Gateway: Entry point for FIX/WebSocket connections; handles auth and validation.
Sequencer: Assigns a global monotonically increasing ID to every event to ensure deterministic replay.
Matching Engine: In-memory component that maintains the Limit Order Book (LOB) and executes trades.
Market Data Bus: Asynchronously broadcasts trade events and price updates to the public.
Trade Ledger: Durable storage for executed trades and audit trails.
Simplicity Audit: This design avoids complex distributed locking by using a single-threaded execution model per symbol partition, which is the industry standard for performance.
Architecture Decision Rationale:
Why?: Determinism is the most critical requirement; a Sequencer-based architecture ensures that a secondary matching engine will produce the same result if the primary fails.
Functional: Meets all order lifecycle and market data requirements.
Non-functional: Provides the lowest possible latency by keeping the active order book in RAM.
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery & Traffic Routing: Not applicable for core order flow, but CDN is used for static retail trading UI assets.
Security & Perimeter:
API Gateway: Validates FIX signatures and JWTs for WebSockets.
Rate Limiting: Per-account limits to prevent algorithmic "spamming" that could degrade matching engine performance.
Termination: SSL/TLS termination happens at the Gateway to keep internal latency low.
Service
Topology & Scaling:
Matching Engine: Partitioned by Ticker Symbol (e.g., Tickers A-M in Engine 1, N-Z in Engine 2) to allow horizontal scaling.
Stateless Gateways: Can scale horizontally behind a Load Balancer.
API Schema Design:
Protocol: Binary (SBE) or FIX.
NewOrder:
{AccountID, Symbol, Side, Price, Quantity, OrderType, ClOrdID}.ExecutionReport:
{OrderID, Status, ExecutedQty, RemainingQty, AvgPrice}.Resilience & Reliability:
Sequencer: Active-Passive pair using Raft or Paxos to ensure a single source of truth for order ID assignment.
Matching Engine Recovery: On startup, it loads the latest snapshot of the Order Book and replays events from the Sequenced Event Bus.
Observability: High-frequency metrics for "Tick-to-Trade" latency (time from gateway receipt to execution report).
Storage
Access Pattern:
Write-heavy for logs and trade history.
Extremely low-latency read/write for the active Order Book.
Database Table Design:
Trades Table:
trade_id (PK), buy_order_id, sell_order_id, symbol, price, quantity, timestamp.Orders Table:
order_id (PK), account_id, symbol, side, original_qty, price, status.Technical Selection:
WAL: High-performance local SSD or specialized journals (e.g., Chronicle Queue).
Trade Ledger: TimescaleDB or Cassandra for scalable, time-series storage of historical trades.
Distribution Logic: Sharded by
symbol to ensure all orders for "AAPL" hit the same matching partition.Cache
Purpose & Justification: The Live Order Book is the ultimate cache. It must reside in-memory for microsecond matching.
Key-Value Schema: A specialized
TreeMap or Double-Ended Queue structured by Price Level. Key: Price. Value: List of orders at that price (FIFO).Technical Selection: Custom C++ or Java (with Off-heap memory) data structures. Redis is generally too slow for the core matching engine loop but can be used in the Market Data Service to store the current "Last Traded Price" (LTP).
Failure Handling: If the in-memory state is lost, it is rebuilt from the Sequenced Event Bus logs.
Messaging
Purpose & Decoupling:
Sequenced Event Bus: Decouples the Gateway from the Matching Engine and provides persistence for recovery.
Market Data Bus: Decouples matching from broadcasting to millions of clients.
Technical Selection:
Sequenced Bus: Aeron or Kafka (with low-latency tuning). Aeron is preferred for sub-millisecond needs.
Market Data: UDP Multicast (for institutional) or NATS/Redis PubSub (for retail WebSockets).
Failure Handling: Dead-letter queues for malformed orders identified after sequencing.
Data Processing
Processing Model: Real-time stream processing.
Processing DAG:
Input: Raw Trade Events.
Transform: Aggregate into OHLC (Open, High, Low, Close) candles (1m, 5m, 1h).
Sink: NoSQL storage and Market Data Bus.
Technical Selection: Flink or custom lightweight aggregators for low-latency candle generation.
Wrap Up
Advanced Topics
Trade-offs: We prioritize Consistency and Latency over Horizontal Scalability of a single symbol. A single stock cannot be easily split across multiple machines without complex distributed locks, so we scale by partitioning symbols.
Reliability: We use a Hot Standby pattern. The standby matching engine consumes the same Sequenced Event Bus as the primary but does not send output reports. If the primary dies, the standby takes over.
Bottleneck Analysis: The Sequencer can become a single point of failure and a bottleneck. We use hardware-level time-stamping and high-performance ring buffers (Disruptor) to maximize its throughput.
Security: FIX sessions often use dedicated leased lines (Direct Connect) to bypass the public internet for both security and latency.