The Question
Design

Real-Time Chess Platform Design

Design a high-concurrency, real-time chess platform similar to Chess.com. The system must support ELO-based matchmaking, sub-200ms move latency for blitz play, server-side move validation to prevent cheating, and move-reversion (undo) for casual play. Address how you would handle 100,000 concurrent games, player disconnections, and persistent storage of millions of historical matches in PGN format.
WebSockets
Redis
PostgreSQL
Kafka
FEN
PGN
Anycast IP
CDC
Questions & Insights

Clarifying Questions

What is the expected scale of the platform? (Assumption: 1 million Daily Active Users (DAU) and 100,000 Concurrent Users (CCU) at peak).
What are the latency requirements for move delivery? (Assumption: End-to-end latency should be < 200ms for a responsive feel, critical for "Bullet" or "Blitz" formats).
How does the matchmaking work? (Assumption: ELO-based matchmaking where users are paired with others within a specific rating range).
Is "Undo" available for all games? (Assumption: No. Undo is restricted to casual/private games to maintain the integrity of rated matches).
Do we need to store full game histories? (Assumption: Yes, in PGN (Portable Game Notation) format for post-game analysis and user profiles).

Thinking Process

The Connection Bottleneck: How do we maintain 100k persistent connections efficiently? Use a distributed WebSocket architecture with a Pub/Sub backbone to route messages to the correct server where the opponent is connected.
The Matchmaking Logic: How do we pair players fast without "starvation"? Use a sliding window ELO range in a dedicated Matchmaking Service backed by an in-memory store like Redis.
The State Consistency: How do we ensure both players see the exact same board state? Implement a "Server-as-Source-of-Truth" model where the server validates every move before updating the official state and broadcasting it.
The Undo/History Mechanism: How do we revert state reliably? Store moves as an ordered log (Event Sourcing lite). Undo involves popping the last N moves from the stack and re-broadcasting the previous state.

Bonus Points

Anti-Cheat Integration: Discussing an asynchronous pipeline where game PGNs are sent to a pool of Stockfish engines to detect high Centipawn Loss (CPL) correlations with engine moves.
Disconnection Resiliency: Implementing a "Grace Period" where the game clock pauses briefly or continues while the server holds the WebSocket session in a "waiting" state for a 30-60 second reconnection window.
Clock Drift Mitigation: Using server-side timestamps for "move received" rather than client-side, combined with RTT (Round Trip Time) calculation to adjust the remaining time fairly.
Global Latency Optimization: Utilizing an Edge Network (Anycast IP) to terminate WebSockets closer to the user, reducing the TCP handshake penalty.
Design Breakdown

Functional Requirements

Core Use Cases:
Users can join a matchmaking queue based on time control (e.g., 3+0, 5+5).
Users can play moves in real-time with move validation.
Users can "Undo" moves in casual games.
Users can view a global leaderboard.
Users can view their game history.
Scope Control:
In-Scope: Matchmaking, Real-time play, Leaderboards, Basic Undo.
Out-of-Scope: Live streaming (Twitch integration), AI personalities, Tournament brackets, Social feed.

Non-Functional Requirements

Scale: Handle 100k concurrent games.
Latency: P99 move-to-display latency < 200ms.
Availability & Reliability: 99.9% uptime; active games should not be lost if a single node fails.
Consistency: Strong consistency for game state (both players must see the same board).
Security & Privacy: Protect against move-spoofing and unauthorized state changes.

Estimation

Traffic Estimation:
100k CCU = 50k active games.
Avg moves per second: 1 move every 5 seconds per player -> 100k / 5 = 20k Move QPS.
Storage Estimation:
PGN for one game: ~2KB.
1M games/day: 1M * 2KB = 2GB/day.
1 year: ~730GB (highly manageable for RDBMS).
Bandwidth Estimation:
Move payload: ~500 bytes.
20k QPS * 500B = 10MB/s (Low bandwidth, high packet count).

Blueprint

Concise Summary: A microservices architecture leveraging WebSockets for real-time bi-directional communication, with Redis serving as the high-speed state store for active games and matchmaking.
Major Components:
WebSocket Gateway: Manages persistent connections and routes moves to the Game Service.
Matchmaking Service: Pools players by ELO and time control to create game sessions.
Game Service: The "Brain" that validates moves using a chess engine library and manages the move stack (Undo functionality).
Redis: Stores active game states and the matchmaking queue.
PostgreSQL: Persists completed games (PGN) and user profiles.
Simplicity Audit: This design avoids complex distributed locking by pinning a specific game ID to a specific Game Service instance or using Redis atomic operations for state updates.
Architecture Decision Rationale:
Why this architecture is the best for this problem?: WebSockets are essential for the sub-200ms latency requirement. A centralized Game Service ensures move validation occurs before any player sees a move, preventing cheating.
Functional Requirement Satisfaction: Handles matchmaking via Redis Sorted Sets, real-time play via WebSockets, and move history/undo via a List structure in Redis.
Non-functional Requirement Satisfaction: Scalable horizontally by adding more WebSocket and Game Service nodes.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: Use a Global Load Balancer with Anycast IP to route users to the nearest regional data center to minimize WebSocket latency.
Security & Perimeter:
API Gateway: Handles AuthN (JWT validation) before upgrading to WebSocket.
Rate Limiting: Limits move attempts to prevent DoS on the move validation engine.

Service

Topology & Scaling: Stateless Game Services that fetch state from Redis. This allows any Game Service instance to process any move if a specific instance dies.
API Schema Design:
POST /matchmaking/join: { userId, timeControl, elo }
WS /game/{gameId}: Bi-directional move exchange.
POST /game/{gameId}/undo: Request undo (Casual games only).
Resilience & Reliability:
Heartbeats: Client sends pings every 5s to keep the WebSocket alive.
Graceful Reconnection: If the WebSocket drops, the client sends the last_move_id to catch up on missed moves from the Redis buffer.

Storage

Access Pattern:
High frequency write/read for active games (Redis).
High volume, low frequency write for completed games (PostgreSQL).
Database Table Design:
Games Table: game_id (PK), white_user_id, black_user_id, pgn_data (TEXT), result, created_at.
Moves Table (Optional for analytics): move_id, game_id, san_move (e.g., "e4"), timestamp.
Technical Selection:
Redis: Used for the "Hot State" (current FEN board string, move list, clocks).
PostgreSQL: Used for "Cold Storage" (User data, game history).

Cache

Purpose & Justification: Redis stores the live board state (FEN string) and the remaining time for both players. This prevents expensive DB hits during the game.
Key-Value Schema:
game:state:{id}: Hash containing { board: "fen_string", white_time: 300, black_time: 300, moves: ["e4", "e5"] }.
TTL: 24 hours (enough for any game to finish).
Failure Handling: If Redis loses a node, the board state is ephemeral but critical. We use Redis Sentinel/Cluster for high availability.

Messaging

Purpose & Decoupling: Kafka is used to decouple the "Game Finished" event from the persistence and leaderboard updates.
Event Schema: GameFinishedEvent: { game_id, winner_id, pgn, duration }.
Technical Selection: Kafka for its replayability and high throughput.

Data Processing

Processing Model: A Background Worker (Consumer) consumes GameFinishedEvents.
Tasks:
Update PostgreSQL with the final PGN.
Update the ELO ratings of both players in the Users table.
Update the Redis Sorted Set for the Leaderboard.
Technical Selection: Custom Python/Go workers for flexibility.
Wrap Up

Advanced Topics

Trade-offs (Consistency vs Availability): During a game, we favor Consistency (CP). It is better to reject a move or pause the game than to allow divergent board states.
Undo Implementation: The Game Service maintains a List of moves in Redis. When undo is called, the service pops the last two moves (one for each player), generates the new board state (FEN), and broadcasts the update to both clients.
Matchmaking Optimization: Use a "Bucket" approach in Redis. Users are placed in buckets based on ELO (e.g., 1200-1250). If no match is found in 5 seconds, the service expands the search to adjacent buckets (1150-1300).
Security: Move validation is performed using a library like chess.js on the server. The client only sends the move (e.g., e2-e4). The server validates if the move is legal for the current board state before updating Redis.