The Question
Design

Real-time Online Chess Platform Design

Design a high-concurrency online chess platform supporting 1 million daily active users. Focus on the low-latency requirements for move propagation in speed chess (Bullet/Blitz), ELO-based matchmaking algorithms, and a robust strategy for synchronizing game clocks and game state across distributed WebSocket servers.
WebSocket
Redis
PostgreSQL
Kafka
FEN
PGN
ELO Rating
JWT
Anycast
Questions & Insights

Clarifying Questions

Scale and Traffic: What are the target DAU (Daily Active Users) and Peak Concurrent Users (PCU)?
Assumption: 1M DAU, 100k PCU.
Game Modes: Do we need to support variants (e.g., 960, Chess.com-specific modes) or just standard chess?
Assumption: Standard chess only for MVP.
Latency Sensitivity: What is the acceptable latency for move propagation?
Assumption: Under 200ms for move delivery to ensure fairness in "Bullet" (1 min) games.
Anti-Cheat: Is a sophisticated server-side anti-cheat engine required for the MVP?
Assumption: Basic move validation (legal moves only) is required; deep analysis is out-of-scope for MVP.
Social Features: Are we building chat, friends lists, or leaderboards?
Assumption: Global leaderboard and game history are in-scope; real-time chat is out-of-scope for MVP.

Thinking Process

The State Problem: How do we manage the live game state across two players while ensuring moves are validated and clocks are synchronized?
The Matchmaking Bottleneck: How do we pair 100k concurrent players based on ELO ratings quickly without causing high wait times?
The Communication Layer: Should we use Long Polling, WebSockets, or gRPC-web for real-time move updates?
Data Persistence: How do we store millions of historical games (PGN format) while keeping the database performant?

Bonus Points

Time Synchronization: Use a "Network Time Protocol" (NTP) style handshake to calculate RTT (Round Trip Time) and adjust player clocks to prevent "latency advantage."
WASM-based Validation: Execute the chess engine (e.g., Stockfish) in the browser via WebAssembly for instant UI feedback, while mirroring validation on the server for security.
Connection Resiliency: Implement session resumption tokens to allow players to reconnect to a live game within 30 seconds without losing their state or timing out.
Geo-aware Game Sharding: Host the Game Service in multiple regions and match players to the closest shared regional cluster to minimize latency.
Design Breakdown

Functional Requirements

Core Use Cases:
User Registration and Profile/ELO management.
Matchmaking (Pairing players with similar ELO).
Real-time gameplay (Move validation, clock management, win/loss/draw detection).
View game history and PGN exports.
Real-time global leaderboard.
Scope Control:
In-Scope: Standard chess, ELO-based matchmaking, real-time moves, basic leaderboards.
Out-of-Scope: Tournaments, engine-based anti-cheat, spectator mode (large scale), in-game chat, variations (960).

Non-Functional Requirements

Scale: Support 100k concurrent games (200k players).
Latency: Move delivery < 100ms (P99) excluding network transit.
Availability & Reliability: 99.9% availability; games should not "disappear" if a server crashes.
Consistency: Strong consistency for game state (a move cannot be undone or duplicated).
Security: Prevent unauthorized moves (moving opponent's pieces) and basic clock tampering.

Estimation

Traffic:
100k PCU = 50k active games.
Average game length: 40 moves per player (80 total).
Average move time: 5 seconds.
Peak Move QPS: 100k / 5s = 20k moves/sec.
Storage:
Game Record (PGN): ~1KB per game.
1M games/day = 1GB/day.
365GB/year (easily handled by a single Relational DB, though sharding recommended for growth).
Bandwidth:
Inbound: 20k moves/sec * 500 bytes ≈ 10 MB/s.
Outbound: 20k moves/sec * 500 bytes (to opponent) ≈ 10 MB/s.

Blueprint

Concise Summary: A WebSocket-driven architecture centered around a distributed Game State Manager and an ELO-based Matchmaking service.
Major Components:
API Gateway: Handles authentication, rate limiting, and routing.
Matchmaking Service: Uses a Redis-based queue to pair players by ELO range.
Game Service: Stateful WebSocket servers that manage the live chess board, clock, and validation.
State Store (Redis): Tracks active game metadata and connection mapping.
Archive DB (PostgreSQL): Stores user profiles, ELO, and completed game history (PGN).
Simplicity Audit: This design avoids complex message buses for move delivery, opting for direct WebSocket routing to minimize latency, which is the primary UX killer in chess.
Architecture Decision Rationale:
Why this architecture?: WebSockets are essential for the bidirectional, low-latency nature of chess moves. Redis provides the sub-millisecond state lookups needed for clock synchronization.
Functional Satisfaction: Covers matchmaking, live play, and history.
Non-functional Satisfaction: Horizontally scalable Game Services handle the 100k PCU; Redis ensures low latency.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: Use a Global Accelerator (Anycast IP) to route players to the nearest regional data center.
API Gateway:
Auth: JWT-based authentication.
Rate Limiting: Limit move requests to 10 per second to prevent DOS attacks on the validation engine.
WebSocket Termination: Gateway handles the initial handshake and upgrades the connection to the Game Service.

Service

Topology & Scaling:
Matchmaking Service: Stateless, scales on QPS.
Game Service: Stateful. Use sticky sessions or a consistent hashing mechanism where both players in a game are pinned to the same server instance to simplify clock management.
API Schema Design:
POST /matchmake: Join queue (JSON: {user_id, rank, time_control}).
WS /game/{game_id}: Persistent move stream.
GET /history/{user_id}: Paginated game list.
Resilience & Reliability:
Heartbeats: 5-second interval Web Socket pings.
Reconnection: If a client disconnects, the Game Service keeps the game "alive" for a grace period (e.g., 60s) before declaring a forfeit.

Storage

Access Pattern:
High frequency read/write for active games (Redis).
Heavy write for game completion (Postgres).
Periodic read for history.
Database Table Design:
Users: id, username, password_hash, current_elo.
Games: id, white_id, black_id, result, pgn_data, created_at.
Technical Selection:
PostgreSQL: Strong ACID for ELO updates and complex querying of game history.
Redis: For the Matchmaking queue (Sorted Sets) and active game state (Hashes).
Distribution Logic: Partition Games table by user_id or created_at (month) to ensure the index stays in memory.

Cache

Purpose: Store active game states (Board position in FEN string, Clocks, Player IDs).
Key-Value Schema:
Key: game:{game_id} -> Value: {fen, white_time, black_time, last_move_ts}.
Key: player_loc:{user_id} -> Value: {server_ip} (for routing).
Technical Selection: Redis.
Failure Handling: Use Redis Sentinel or Cluster for high availability. If the cache fails, active games may be interrupted, but historical data remains safe in Postgres.

Messaging

Purpose: Decouple the Game Service from the Persistence Layer. Writing to the DB is slower than move validation; an async worker handles the PGN generation and DB insertion.
Event Schema: game_id, final_pgn, winner_id, end_reason.
Technical Selection: Kafka or SQS. Kafka allows for potential downstream consumers like "Cheat Detection" or "Statistics Aggregator" in the future.

Infrastructure (Optional)

Observability:
Prometheus/Grafana for monitoring "Games per Server" and "Matchmaking Wait Time."
ELK Stack for move logs (crucial for debugging "illegal move" disputes).
Wrap Up

Advanced Topics

Trade-offs (Consistency vs Availability): In chess, Consistency is paramount. We cannot have two different board states for the same game. Therefore, we use a single leader (Game Server) per game.
Matchmaking Logic:
Players are added to a Redis Sorted Set where the score is their ELO.
The Matchmaker searches for players with a score within +/- X of the requester.
X expands every few seconds to prevent infinite wait times.
Clock Synchronization:
When Player A moves, the client sends the timestamp.
The Server calculates the delta using its own clock (to prevent client-side time-freezing cheats).
Total time = Server_Now - Start_Turn_Timestamp - Network_RTT/2.
Bottleneck Analysis:
Hot Shards: The Leaderboard is a potential hot spot. Use a cached "Top 100" refreshed every 60 seconds rather than real-time DB queries.
Zombie Connections: High PCU results in many idle WebSockets. Implement aggressive timeouts for inactive users.