The Question
Design: Massively Multiplayer Online Game (MMO) Backend
Design a scalable backend architecture for a Massively Multiplayer Online (MMO) game supporting 100,000 concurrent users. The system must handle real-time spatial state synchronization, persistent player progress, and seamless transitions between world zones. Address specific challenges such as the 'N^2' communication problem, stateful server scaling, and maintaining high-frequency game ticks (20Hz+) while ensuring transactional integrity for player inventories.
WebSockets
Redis
PostgreSQL
Kafka
Agones
Kubernetes
UDP
ECS
Anycast
Protobuf
Questions & Insights
Clarifying Questions
What is the target concurrency and world scale?
Assumption: 1,000,000 DAU, 100,000 Concurrent Users (CCU), with up to 2,000 players per "World Instance" or "Zone."
What is the gameplay latency requirement (Tick Rate)?
Assumption: Action-oriented gameplay requiring a 20Hz tick rate (50ms processing window) and sub-100ms end-to-end network latency.
How is the game world structured?
Assumption: A seamless world partitioned into zones. Players transition between zones via a handoff mechanism.
What is the persistence strategy for player progress?
Assumption: Player inventory and stats must be durable (ACID-compliant for transactions), while position can be eventually consistent.
Thinking Process
The Statefulness Paradox: Unlike standard web apps, MMOs are stateful. How do we maintain a high-frequency game loop while ensuring state isn't lost during a crash?
The N^2 Broadcast Problem: If 1,000 players are in one spot, broadcasting every move to everyone creates traffic that grows quadratically with crowd size (roughly 1,000 x 999 updates per tick). How do we implement "Interest Management" to bound it?
Zone Handover: How does a player move from one server (Zone A) to another (Zone B) without a loading screen or losing connection?
Consistency vs. Performance: Where do we draw the line between in-memory speed (gameplay) and disk durability (items/currency)?
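One common answer to the Interest Management question is a uniform grid. The sketch below is a minimal illustration, not a production structure: the class name AoiGrid and the cell size are assumptions made for the example. Each entity is bucketed by cell, and a player is only sent updates for entities in its own cell and the eight surrounding ones.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

Cell = Tuple[int, int]


class AoiGrid:
    """Uniform 2D grid: an entity sees only its own cell and the 8 neighbours."""

    def __init__(self, cell_size: float) -> None:
        self.cell_size = cell_size
        self.cells: Dict[Cell, Set[int]] = defaultdict(set)  # cell -> entity ids
        self.where: Dict[int, Cell] = {}                      # entity id -> cell

    def _cell(self, x: float, y: float) -> Cell:
        # Floor division keeps negative coordinates in the correct cell.
        return (int(x // self.cell_size), int(y // self.cell_size))

    def move(self, entity_id: int, x: float, y: float) -> None:
        """Insert the entity or re-bucket it if it crossed a cell border."""
        new = self._cell(x, y)
        old = self.where.get(entity_id)
        if old == new:
            return
        if old is not None:
            self.cells[old].discard(entity_id)
        self.cells[new].add(entity_id)
        self.where[entity_id] = new

    def visible_to(self, entity_id: int) -> Set[int]:
        """Entities whose updates should be sent to entity_id."""
        cx, cy = self.where[entity_id]
        seen: Set[int] = set()
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                # .get() avoids creating empty buckets in the defaultdict.
                seen |= self.cells.get((cx + dx, cy + dy), set())
        seen.discard(entity_id)
        return seen
```

A grid bounds fan-out by local density rather than zone population, but it does not help when the whole crowd stands in the same few cells; that case still needs a separate mitigation.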
Bonus Points
Spatial Partitioning (Quadtrees/Grids): Using spatial indexing to dynamically calculate "Area of Interest" (AOI) to reduce network egress.
Delta Compression & Bit-packing: Custom binary protocols (moving away from JSON/Protobuf) to keep each 20Hz update small enough to fit well within a single MTU-sized packet.
Client-Side Prediction & Server Reconciliation: Techniques to mask latency by simulating movement locally and correcting via authoritative server snapshots.
Deterministic Lockstep vs. State Sync: Choosing State Synchronization for MMOs to allow late-joining and better anti-cheat control.
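To make the bit-packing and delta ideas concrete, here is a small sketch using Python's struct module. The wire layout, the 0.1 m resolution, and the function names are all assumptions chosen for illustration; a real protocol would pick its own field widths and ranges.

```python
import struct
from typing import Sequence, Tuple

# Hypothetical wire layout (little-endian, no padding):
# u32 entity id, 3 x u16 quantized position, u32 server tick = 14 bytes.
MOVE_FMT = "<IHHHI"
RESOLUTION = 0.1  # metres per quantization step (assumed)


def quantize(value: float) -> int:
    """Map a coordinate in metres to a 16-bit integer step count."""
    q = round(value / RESOLUTION)
    if not 0 <= q <= 0xFFFF:
        raise ValueError("coordinate outside the encodable range")
    return q


def encode_move(entity_id: int, x: float, y: float, z: float, tick: int) -> bytes:
    return struct.pack(MOVE_FMT, entity_id, quantize(x), quantize(y), quantize(z), tick)


def decode_move(buf: bytes) -> Tuple[int, float, float, float, int]:
    eid, qx, qy, qz, tick = struct.unpack(MOVE_FMT, buf)
    return eid, qx * RESOLUTION, qy * RESOLUTION, qz * RESOLUTION, tick


def encode_delta(prev_q: Sequence[int], cur_q: Sequence[int]) -> bytes:
    """Send a 1-byte change mask followed by only the axes that changed."""
    mask = 0
    body = b""
    for i, (p, c) in enumerate(zip(prev_q, cur_q)):
        if p != c:
            mask |= 1 << i
            body += struct.pack("<H", c)
    return bytes([mask]) + body
```

With this layout a full position update is 14 bytes, and a step along a single axis is 3 bytes as a delta. The trade-off is that quantization loses precision below the chosen resolution.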
Design Breakdown
Functional Requirements
Core Use Cases:
Player authentication and character selection.
Real-time movement and interaction (combat/looting).
Persistence of player attributes, inventory, and progression.
Global and Zone-based chat.
Scope Control:
In-Scope: MVP architecture for stateful game servers, spatial partitioning, and persistence.
Out-of-Scope: Voice-over-IP (VoIP), complex auction house physics, anti-cheat heuristics (ML-based).
Non-Functional Requirements
Scale: Support 100k CCU across multiple regional clusters.
Latency: Game loop processing < 50ms; Network RTT < 100ms.
Availability: 99.9% availability; "World" state must recover quickly from server failure.
Consistency: Strong consistency for inventory/transactions; Eventual consistency for world position.
Security: TLS for auth; custom encrypted binary UDP/WebSockets for gameplay to prevent packet sniffing/injection.
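The strong-consistency requirement for inventory amounts to "both sides of a trade change owner, or neither does." The sketch below shows that pattern with a single database transaction. It uses Python's built-in sqlite3 purely as a stand-in so the example runs without a server; the table is a simplified assumption, and the function names are hypothetical.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """In-memory stand-in for the inventory table (simplified columns)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE inventory ("
        " item_id INTEGER PRIMARY KEY,"
        " player_id INTEGER NOT NULL,"
        " item_type TEXT NOT NULL)"
    )
    return conn


def trade(conn: sqlite3.Connection, item_a: int, owner_a: int,
          item_b: int, owner_b: int) -> bool:
    """Swap two items atomically; returns False and changes nothing on failure."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            for item, seller, buyer in ((item_a, owner_a, owner_b),
                                        (item_b, owner_b, owner_a)):
                cur = conn.execute(
                    "UPDATE inventory SET player_id = ? "
                    "WHERE item_id = ? AND player_id = ?",
                    (buyer, item, seller),
                )
                # The ownership check in WHERE guards against trading an item
                # the seller no longer holds (e.g. a replayed request).
                if cur.rowcount != 1:
                    raise LookupError("seller does not own item")
        return True
    except LookupError:
        return False
```

The same shape applies to PostgreSQL through its client library: one transaction, an ownership condition in each UPDATE, and a rollback if any row count is unexpected.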
Estimation
Traffic: 100k CCU * 20 packets/sec = 2M packets/sec globally.
Bandwidth: 2M packets/sec * 100 bytes/packet ≈ 200 MB/s (1.6 Gbps) egress.
Storage: 1M DAU * 50 KB character data ≈ 50 GB (Small, but high IOPS for frequent saves).
Compute: 100k CCU / 500 players per core ≈ 200 high-performance CPU cores for game logic.
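The estimates above can be reproduced with a few lines of arithmetic. The constants are taken from the figures in this section; the last line is an added illustration of why naive broadcast is a problem, using the 1,000-player crowd as the example.

```python
CCU = 100_000              # concurrent users
TICK_HZ = 20               # packets per player per second
BYTES_PER_PACKET = 100
DAU = 1_000_000
CHAR_DATA_KB = 50
PLAYERS_PER_CORE = 500

packets_per_sec = CCU * TICK_HZ                          # 2,000,000
egress_bytes_per_sec = packets_per_sec * BYTES_PER_PACKET  # 200,000,000 B/s
egress_gbps = egress_bytes_per_sec * 8 / 1e9             # 1.6 Gbps
storage_gb = DAU * CHAR_DATA_KB / 1e6                    # 50 GB (decimal units)
cores = CCU // PLAYERS_PER_CORE                          # 200

# Naive all-to-all broadcast for a crowd of n players in one spot:
# every player's move is sent to every other player on every tick.
n = 1_000
naive_fanout_per_sec = n * (n - 1) * TICK_HZ             # 19,980,000 messages/s
```

Note that the 2M packets/sec figure assumes one batched packet per player per tick. The naive fan-out line shows that a single 1,000-player crowd without interest management would already generate about ten times that message rate on its own.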
Blueprint
The design utilizes a Zone-based Stateful Architecture. Players connect to a Gateway which routes them to a specific Zone Server. The Zone Server maintains the "Source of Truth" in memory for the duration of the session, periodically flushing state to a persistent store.
Gateway (Agones/Sidecar): Manages persistent WebSocket/UDP connections and proxies traffic.
Zone Server (Stateful): Runs the game loop (ECS pattern) and handles spatial partitioning.
World Registry: A service that tracks which player is in which zone and handles handoffs.
Redis (Session Store): Stores transient "hot" state like player location and session tokens.
PostgreSQL: Authoritative store for player metadata and inventory.
Simplicity Audit: We avoid a "Seamless Single World" mesh (complex) in favor of "Zoned Instances" which is easier to scale and debug for an MVP.
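The World Registry handoff is not specified in detail here, so the following is only one possible shape for it. The sketch assumes the registry issues a short-lived, HMAC-signed ticket that the destination Zone Server verifies before admitting the player. The shared secret, the field names, and both function names are hypothetical.

```python
import hashlib
import hmac
import json
from typing import Optional

# Hypothetical shared secret; in practice it would come from a secret store.
SECRET = b"registry-zone-shared-secret"


def issue_ticket(player_id: int, src_zone: str, dst_zone: str,
                 now: float, ttl: float = 5.0) -> bytes:
    """World Registry side: sign a short-lived transfer claim."""
    body = json.dumps(
        {"player": player_id, "src": src_zone, "dst": dst_zone, "exp": now + ttl},
        sort_keys=True,
    ).encode()
    sig = hmac.new(SECRET, body, hashlib.sha256).hexdigest().encode()
    return body + b"." + sig


def accept_ticket(ticket: bytes, zone_id: str, now: float) -> Optional[int]:
    """Destination Zone Server side: return the player id, or None if rejected."""
    # The hex signature contains no dots, so the last dot is the separator.
    body, _, sig = ticket.rpartition(b".")
    expected = hmac.new(SECRET, body, hashlib.sha256).hexdigest().encode()
    if not hmac.compare_digest(sig, expected):
        return None  # forged or tampered
    claims = json.loads(body)
    if claims["dst"] != zone_id or now > claims["exp"]:
        return None  # wrong zone or expired
    return claims["player"]
```

A signed ticket lets the destination server admit the player without a synchronous call back to the registry. A real implementation would also need single-use enforcement so a ticket cannot be replayed within its lifetime.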
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery: Game assets (textures/models) are served via CDN (CloudFront/Akamai).
Traffic Routing: AWS Global Accelerator provides Anycast IP addresses to route traffic to the nearest regional Gateway over the AWS backbone, reducing jitter.
Gateway: Implements TLS termination and packet validation. For the MVP, we use WebSockets for bi-directional communication, transitioning later to a UDP-based transport (e.g., KCP) when lower latency is needed.
Service
Zone Server (Stateful):
Uses an Entity Component System (ECS).
Interest Management: Divides the zone into a 2D grid. A player only receives updates for entities in their grid cell and adjacent cells.
Tick Rate: Fixed 20Hz loop.
Handoff: When a player crosses a boundary, the World Registry facilitates a ticket-based transfer to the neighboring Zone Server.
API Schema:
POST /auth/login: REST (HTTPS) - Returns JWT and Gateway IP.
Stream: MoveEntity: WebSocket (Binary) - [EntityID, X, Y, Z, Timestamp].
Stream: Action: WebSocket (Binary) - [ActionID, TargetID, SkillID].
Resilience: If a Zone Server crashes, the World Registry detects the heartbeat failure and restarts the zone on a new node. Players reconnect and the server reloads the last checkpoint from Redis/DB.
Storage
Access Pattern:
Read: Heavy at login.
Write: Constant periodic "checkpoints" (every 30s) and critical "event-driven" writes (looting a rare item).
Database Table Design:
Players:
player_id (PK), username, last_zone_id, position_blob.
Inventory:
item_id (PK), player_id (FK), item_type, stats_json.
Technical Selection: PostgreSQL with a JSONB column for flexible item attributes. It provides the ACID guarantees necessary for virtual economies.
Distribution: Sharded by player_id to handle scale beyond 1M users.
Cache
Purpose: Acts as a high-speed "Checkpointer" and Session Store.
Schema:
Session:<Token> -> PlayerID, GatewayID, ZoneID.
ZoneState:<ZoneID> -> Protobuf-encoded binary of all non-player entities (NPCs, dropped loot).
Technical Selection: Redis (Cluster mode).
Failure Handling: If Redis fails, Zone Servers fall back to the last DB save, accepting the loss of any transient state written since that save.
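The fallback order in Failure Handling can be sketched as a small recovery function. The two loaders are passed in as plain callables so the example stays independent of any particular Redis or PostgreSQL client; their names and the use of ConnectionError to signal an unavailable cache are assumptions.

```python
from typing import Callable, Optional

# A loader takes a zone id and returns the serialized snapshot, or None if absent.
Loader = Callable[[str], Optional[bytes]]


def load_zone_checkpoint(zone_id: str, from_cache: Loader,
                         from_db: Loader) -> Optional[bytes]:
    """Prefer the hot checkpoint; fall back to the durable save."""
    try:
        snapshot = from_cache(zone_id)
        if snapshot is not None:
            return snapshot
    except ConnectionError:
        pass  # cache unavailable: fall through to the durable store
    return from_db(zone_id)
```

The function treats a cache miss and a cache outage the same way, which keeps the restart path simple at the cost of restoring an older snapshot in both cases.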
Messaging
Purpose: Decouples game logic from heavy DB writes and provides an audit log for analytics/anti-cheat.
Event Schema:
[PlayerID, EventType, Payload, Timestamp].
Technical Selection: Kafka.
Rationale: High throughput allows every "kill" or "trade" to be logged without blocking the 50ms game tick.
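One way to keep event logging off the tick's critical path is a bounded in-memory queue drained by a background thread. The sketch below is an assumption about how that could look, not a description of a Kafka client: the send callable stands in for whatever producer call is actually used, and the class name EventLog is hypothetical.

```python
import json
import queue
import threading
from typing import Any, Callable


class EventLog:
    """Buffers game events so the game loop never waits on the message bus."""

    def __init__(self, send: Callable[[bytes], None], max_pending: int = 10_000) -> None:
        self._send = send  # hypothetical wrapper around a producer's send call
        self._q: "queue.Queue" = queue.Queue(maxsize=max_pending)
        self.dropped = 0
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def emit(self, player_id: int, event_type: str, payload: Any,
             timestamp: float) -> None:
        """Called from the game loop; returns immediately even if the bus is slow."""
        record = json.dumps([player_id, event_type, payload, timestamp]).encode()
        try:
            self._q.put_nowait(record)
        except queue.Full:
            self.dropped += 1  # shed load rather than stall the tick

    def _drain(self) -> None:
        while True:
            item = self._q.get()
            if item is None:  # shutdown sentinel
                break
            self._send(item)

    def close(self) -> None:
        """Flush everything queued so far, then stop the worker."""
        self._q.put(None)
        self._worker.join()
```

Dropping on overflow is acceptable only for analytics-style events. Anything that must not be lost, such as a trade, should still be committed through the transactional store rather than relying on this path.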
Infrastructure (Optional)
Observability: Prometheus metrics for "Tick Latency" (the most important metric). If tick latency > 50ms, the server is overloaded.
Orchestration: Agones (built on Kubernetes) to manage the lifecycle of stateful game server pods.
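The tick-latency signal mentioned under Observability can be measured directly inside a fixed-timestep loop. The sketch below is a minimal version with an injectable clock and sleep function so it can be tested without waiting in real time; the function names are assumptions, and exporting the durations to Prometheus is left out.

```python
import time
from typing import Callable, List


def run_fixed_loop(step: Callable[[int, float], None], ticks: int, hz: int = 20,
                   clock: Callable[[], float] = time.perf_counter,
                   sleep: Callable[[float], None] = time.sleep) -> List[float]:
    """Run `ticks` iterations at a fixed rate and return each tick's duration."""
    budget = 1.0 / hz  # 50 ms at 20 Hz
    durations: List[float] = []
    next_deadline = clock()
    for tick in range(ticks):
        started = clock()
        step(tick, budget)  # simulate one tick with a fixed delta time
        durations.append(clock() - started)
        next_deadline += budget
        remaining = next_deadline - clock()
        if remaining > 0:
            sleep(remaining)         # idle until the next tick boundary
        else:
            next_deadline = clock()  # overran the budget: resync, do not burst
    return durations


def is_overloaded(durations: List[float], hz: int = 20) -> bool:
    """True if any tick took longer than its time budget."""
    return max(durations) > 1.0 / hz
```

Scheduling against an absolute deadline rather than sleeping a fixed 50 ms avoids drift, since time spent in the step itself is subtracted from the wait.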
Wrap Up
Advanced Topics
Consistency vs. Availability: We choose Consistency (CP) for character data (if the DB is down, you can't trade) but Availability (AP) for movement.
Bottleneck Analysis: A "Hot Spot" forms when thousands of players gather in one zone for a "World Boss," concentrating load on the single Zone Server that owns it.
Optimization: Implement "Layering" (instancing the same zone multiple times) when a population threshold is hit.
Security: All logic is Server-Authoritative. The client only sends "Intent" (e.g., "I want to move to X"), and the server validates if that move is possible given the player's speed.
Distinguishing Insight: Clock Synchronization. Use a simplified NTP-like handshake at the start of a session to calculate RTT and clock offset. This is critical for the server to "rewind" state to validate a hit-scan shot from a player with 100ms lag.
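The NTP-like handshake reduces to two formulas over four timestamps: client send (t0), server receive (t1), server send (t2), and client receive (t3). The sketch below applies the standard NTP offset and delay calculation; the function names and the choice to keep the lowest-RTT sample are assumptions for illustration.

```python
from typing import Iterable, Tuple

Sample = Tuple[float, float, float, float]  # (t0, t1, t2, t3)


def estimate(t0: float, t1: float, t2: float, t3: float) -> Tuple[float, float]:
    """Return (rtt, offset), where offset = server clock minus client clock.

    t0 and t3 are read on the client clock, t1 and t2 on the server clock.
    """
    rtt = (t3 - t0) - (t2 - t1)           # total time minus server processing
    offset = ((t1 - t0) + (t2 - t3)) / 2  # assumes symmetric network delay
    return rtt, offset


def best_estimate(samples: Iterable[Sample]) -> Tuple[float, float]:
    """Use the sample with the lowest RTT, which has the least queuing noise."""
    return min((estimate(*s) for s in samples), key=lambda r: r[0])


def to_server_time(client_timestamp: float, offset: float) -> float:
    """Translate a client-stamped action (e.g. a shot) onto the server clock."""
    return client_timestamp + offset
```

With the offset known, the server can map a shot's client timestamp to its own timeline and look up the world state at that moment. The estimate is only as good as the symmetric-delay assumption, so asymmetric routes introduce an error of up to half the RTT.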