The Question
Design

Massively Multiplayer Online Game (MMO) Backend

Design a scalable backend architecture for a Massively Multiplayer Online (MMO) game supporting 100,000 concurrent users. The system must handle real-time spatial state synchronization, persistent player progress, and seamless transitions between world zones. Address specific challenges such as the 'N^2' communication problem, stateful server scaling, and maintaining high-frequency game ticks (20Hz+) while ensuring transactional integrity for player inventories.
WebSockets
Redis
PostgreSQL
Kafka
Agones
Kubernetes
UDP
ECS
Anycast
Protobuf
Questions & Insights

Clarifying Questions

What is the target concurrency and world scale?
Assumption: 1,000,000 DAU, 100,000 Concurrent Users (CCU), with up to 2,000 players per "World Instance" or "Zone."
What is the gameplay latency requirement (Tick Rate)?
Assumption: Action-oriented gameplay requiring a 20Hz tick rate (50ms processing window) and sub-100ms end-to-end network latency.
How is the game world structured?
Assumption: A seamless world partitioned into zones. Players transition between zones via a handoff mechanism.
What is the persistence strategy for player progress?
Assumption: Player inventory and stats must be durable (ACID-compliant for transactions), while position can be eventually consistent.

Thinking Process

The Statefulness Paradox: Unlike standard web apps, MMOs are stateful. How do we maintain a high-frequency game loop while ensuring state isn't lost during a crash?
The N^2 Broadcast Problem: If 1,000 players are in one spot, broadcasting every move to everyone creates quadratic traffic (1,000 senders * 1,000 receivers ≈ 1M messages per tick). How do we implement "Interest Management"?
Zone Handover: How does a player move from one server (Zone A) to another (Zone B) without a loading screen or losing connection?
Consistency vs. Performance: Where do we draw the line between in-memory speed (gameplay) and disk durability (items/currency)?

Bonus Points

Spatial Partitioning (Quadtrees/Grids): Using spatial indexing to dynamically calculate "Area of Interest" (AOI) to reduce network egress.
Delta Compression & Bit-packing: Custom binary protocols (moving away from JSON/Protobuf) to shrink each 20Hz update so it fits comfortably within a single MTU-sized packet.
Client-Side Prediction & Server Reconciliation: Techniques to mask latency by simulating movement locally and correcting via authoritative server snapshots.
Deterministic Lockstep vs. State Sync: Choosing State Synchronization for MMOs to allow late-joining and better anti-cheat control.
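
To make the delta compression point concrete, here is a minimal sketch of a quantised position delta. The wire layout (a 32-bit entity ID plus three 16-bit centimetre offsets), the constant names, and the metre-based world units are all assumptions for illustration, and the encoding is byte-aligned rather than true bit-packing.

```python
import struct

# Hypothetical wire layout (an assumption, not part of the design above):
# entity_id as uint32, then dx/dy/dz as int16 centimetres, little-endian.
DELTA_FORMAT = "<Ihhh"
CM_PER_UNIT = 100  # assumed: world units are metres, quantised to 1 cm


def encode_delta(entity_id, prev_pos, new_pos):
    """Pack the position change since the last acknowledged snapshot."""
    deltas = [round((n - p) * CM_PER_UNIT) for p, n in zip(prev_pos, new_pos)]
    if any(not -32768 <= d <= 32767 for d in deltas):
        # The move does not fit in 16 bits; the caller should send a full snapshot.
        raise ValueError("delta too large for int16")
    return struct.pack(DELTA_FORMAT, entity_id, *deltas)


def decode_delta(payload, prev_pos):
    """Rebuild the new position from the previous position and a packed delta."""
    entity_id, *deltas = struct.unpack(DELTA_FORMAT, payload)
    new_pos = tuple(p + d / CM_PER_UNIT for p, d in zip(prev_pos, deltas))
    return entity_id, new_pos
```

Under these assumptions a movement update costs 10 bytes, compared with 16 bytes for an ID plus three 32-bit floats.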
Design Breakdown

Functional Requirements

Core Use Cases:
Player authentication and character selection.
Real-time movement and interaction (combat/looting).
Persistence of player attributes, inventory, and progression.
Global and Zone-based chat.
Scope Control:
In-Scope: MVP architecture for stateful game servers, spatial partitioning, and persistence.
Out-of-Scope: Voice-over-IP (VOIP), complex auction house physics, anti-cheat heuristics (ML-based).

Non-Functional Requirements

Scale: Support 100k CCU across multiple regional clusters.
Latency: Game loop processing < 50ms; Network RTT < 100ms.
Availability: 99.9% availability; "World" state must recover quickly from server failure.
Consistency: Strong consistency for inventory/transactions; Eventual consistency for world position.
Security: TLS for auth; custom encrypted binary UDP/WebSockets for gameplay to prevent packet sniffing/injection.

Estimation

Traffic: 100k CCU * 20 packets/sec = 2M packets/sec globally.
Bandwidth: 2M packets/sec * 100 bytes/packet ≈ 200 MB/s (1.6 Gbps) egress.
Storage: 1M DAU * 50 KB character data ≈ 50 GB (Small, but high IOPS for frequent saves).
Compute: 100k CCU / 500 players per core ≈ 200 high-performance CPU cores for game logic.
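
The arithmetic behind these estimates can be checked with a short script. The inputs are the figures stated above; the function and key names are illustrative, and decimal units (1 GB = 10^9 bytes) are assumed.

```python
def estimate(ccu=100_000, tick_hz=20, bytes_per_packet=100,
             dau=1_000_000, character_kb=50, players_per_core=500):
    """Back-of-envelope capacity numbers, using decimal units throughout."""
    packets_per_sec = ccu * tick_hz
    egress_bytes_per_sec = packets_per_sec * bytes_per_packet
    return {
        "packets_per_sec": packets_per_sec,
        "egress_mb_per_sec": egress_bytes_per_sec / 1e6,
        "egress_gbps": egress_bytes_per_sec * 8 / 1e9,
        "storage_gb": dau * character_kb / 1e6,  # KB -> GB
        "cpu_cores": ccu // players_per_core,
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value}")
```

Changing a single input, such as `tick_hz=30`, shows how quickly egress grows with tick rate.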

Blueprint

The design utilizes a Zone-based Stateful Architecture. Players connect to a Gateway which routes them to a specific Zone Server. The Zone Server maintains the "Source of Truth" in memory for the duration of the session, periodically flushing state to a persistent store.
Gateway (Agones/Sidecar): Manages persistent WebSocket/UDP connections and proxies traffic.
Zone Server (Stateful): Runs the game loop (ECS pattern) and handles spatial partitioning.
World Registry: A service that tracks which player is in which zone and handles handoffs.
Redis (Session Store): Stores transient "hot" state like player location and session tokens.
PostgreSQL: Authoritative store for player metadata and inventory.
Simplicity Audit: We avoid a "Seamless Single World" mesh (complex) in favor of "Zoned Instances" which is easier to scale and debug for an MVP.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery: Game assets (textures/models) are served via CDN (CloudFront/Akamai).
Traffic Routing: AWS Global Accelerator provides Anycast IP addresses to route traffic to the nearest regional Gateway over the AWS backbone, reducing jitter.
Gateway: Implements TLS termination and packet validation. For the MVP, we use WebSockets for bidirectional communication, and move to UDP (KCP) later if performance requires it.

Service

Zone Server (Stateful):
Uses an Entity Component System (ECS).
Interest Management: Divides the zone into a 2D grid. A player only receives updates for entities in their grid cell and adjacent cells.
Tick Rate: Fixed 20Hz loop.
Handoff: When a player crosses a boundary, the World Registry facilitates a ticket-based transfer to the neighboring Zone Server.
API Schema:
POST /auth/login: REST (HTTPS) - Returns JWT and Gateway IP.
Stream: MoveEntity: WebSocket (Binary) - [EntityID, X, Y, Z, Timestamp].
Stream: Action: WebSocket (Binary) - [ActionID, TargetID, SkillID].
Resilience: If a Zone Server crashes, the World Registry detects the heartbeat failure and restarts the zone on a new node. Players reconnect and the server reloads the last checkpoint from Redis/DB.
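
The grid-based interest management described above could be sketched as follows. The class name, method names, and the 50-unit cell size are assumptions; a real zone server would likely keep this inside its ECS rather than in a standalone class.

```python
from collections import defaultdict

CELL_SIZE = 50.0  # assumed cell edge length in world units


class InterestGrid:
    """Uniform 2D grid mapping cells to the entities currently inside them."""

    def __init__(self, cell_size=CELL_SIZE):
        self.cell_size = cell_size
        self.cells = defaultdict(set)  # (cx, cy) -> set of entity ids
        self.entity_cell = {}          # entity id -> (cx, cy)

    def _cell_of(self, x, y):
        return (int(x // self.cell_size), int(y // self.cell_size))

    def update(self, entity_id, x, y):
        """Record an entity's position, moving it between cells if needed."""
        new_cell = self._cell_of(x, y)
        old_cell = self.entity_cell.get(entity_id)
        if old_cell == new_cell:
            return
        if old_cell is not None:
            self.cells[old_cell].discard(entity_id)
            if not self.cells[old_cell]:
                del self.cells[old_cell]  # keep the dict free of empty cells
        self.cells[new_cell].add(entity_id)
        self.entity_cell[entity_id] = new_cell

    def visible_to(self, entity_id):
        """Entities in the same cell or one of the 8 adjacent cells."""
        cx, cy = self.entity_cell[entity_id]
        visible = set()
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                visible |= self.cells.get((cx + dx, cy + dy), set())
        visible.discard(entity_id)
        return visible
```

Each tick, the server would send a player only the updates for `visible_to(player_id)`, so per-player fan-out depends on local density rather than on total zone population.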

Storage

Access Pattern:
Read: Heavy at login.
Write: Constant periodic "checkpoints" (every 30s) and critical "event-driven" writes (looting a rare item).
Database Table Design:
Players: player_id (PK), username, last_zone_id, position_blob.
Inventory: item_id (PK), player_id (FK), item_type, stats_json.
Technical Selection: PostgreSQL with a JSONB column for flexible item attributes. It provides the ACID guarantees necessary for virtual economies.
Distribution: Sharded by player_id to handle scale beyond 1M users.
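
To illustrate why ACID guarantees matter for inventory, the sketch below moves an item between two players inside a single transaction. It uses Python's built-in sqlite3 module purely as a runnable stand-in for PostgreSQL; the column types, the TEXT column in place of JSONB, and the function names are assumptions.

```python
import sqlite3


def open_demo_db():
    """In-memory stand-in for the Players and Inventory tables above."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE players (
            player_id     INTEGER PRIMARY KEY,
            username      TEXT NOT NULL,
            last_zone_id  INTEGER,
            position_blob BLOB
        );
        CREATE TABLE inventory (
            item_id    INTEGER PRIMARY KEY,
            player_id  INTEGER NOT NULL REFERENCES players(player_id),
            item_type  TEXT NOT NULL,
            stats_json TEXT
        );
    """)
    return conn


def transfer_item(conn, item_id, from_player, to_player):
    """Reassign one item; the change either fully commits or fully rolls back."""
    with conn:  # commits on success, rolls back if an exception escapes
        cur = conn.execute(
            "UPDATE inventory SET player_id = ? "
            "WHERE item_id = ? AND player_id = ?",
            (to_player, item_id, from_player),
        )
        if cur.rowcount != 1:
            # The sender no longer owns the item (for example, a duplicate request).
            raise ValueError("item not owned by sender")
```

The ownership check in the WHERE clause is what prevents item duplication when the same trade is submitted twice. Note that once the data is sharded by player_id, two trading players may live on different shards, in which case a single-database transaction like this one is no longer sufficient on its own.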

Cache

Purpose: Acts as a high-speed "Checkpointer" and Session Store.
Schema:
Session:<Token> -> PlayerID, GatewayID, ZoneID.
ZoneState:<ZoneID> -> Protobuf encoded binary of all non-player entities (NPCs, dropped loot).
Technical Selection: Redis (Cluster mode).
Failure Handling: If Redis fails, Zone Servers fall back to the last DB save.
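
The failure-handling path can be sketched as a cache-first read that degrades to the database. The `cache_get` and `db_get` callables and the `CacheUnavailable` exception are hypothetical stand-ins for real Redis and PostgreSQL clients, which are deliberately left out here.

```python
class CacheUnavailable(Exception):
    """Hypothetical error raised by the cache client when Redis is unreachable."""


def load_zone_state(zone_id, cache_get, db_get):
    """Return (state, source), preferring the cache checkpoint over the DB save."""
    try:
        state = cache_get(f"ZoneState:{zone_id}")
        if state is not None:
            return state, "cache"
    except CacheUnavailable:
        pass  # fall through to the older but durable DB save
    return db_get(zone_id), "db"
```

Returning the source alongside the state lets the caller log how stale the recovered zone might be.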

Messaging

Purpose: Decouples game logic from heavy DB writes and provides an audit log for analytics/anti-cheat.
Event Schema: [PlayerID, EventType, Payload, Timestamp].
Technical Selection: Kafka.
Rationale: High throughput allows every "kill" or "trade" to be logged without blocking the 50ms game tick.
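
One way to keep event logging off the tick's critical path is a bounded in-process buffer that the game loop writes to without blocking, with a separate worker draining it to the broker. The sketch below assumes that pattern; `EventBuffer`, its methods, and the `send` callback (standing in for a real Kafka producer) are hypothetical.

```python
import queue
import time


class EventBuffer:
    """Bounded buffer between the game loop and an event publisher."""

    def __init__(self, max_events=10_000):
        self._queue = queue.Queue(maxsize=max_events)
        self.dropped = 0

    def emit(self, player_id, event_type, payload):
        """Called from the tick; never blocks. Returns False if the buffer is full."""
        event = (player_id, event_type, payload, time.time())
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, send, max_batch=500):
        """Called from a worker; forwards up to max_batch events to `send`."""
        sent = 0
        while sent < max_batch:
            try:
                event = self._queue.get_nowait()
            except queue.Empty:
                break
            send(event)
            sent += 1
        return sent
```

Dropping on overflow is only reasonable for analytics-grade events. Events that affect the economy, such as trades, would still need the durable write path described under Storage.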

Infrastructure (Optional)

Observability: Prometheus metrics for "Tick Latency" (the most important metric). If tick latency > 50ms, the server is overloaded.
Orchestration: Agones (built on Kubernetes) to manage the lifecycle of stateful game server pods.
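
The tick latency metric can be measured inside the game loop itself. Below is a minimal sketch of a fixed 20Hz loop that records per-tick processing time; the function names and the injectable `clock` and `sleep` parameters are assumptions made so the loop can be exercised without real waiting, and exporting the values to Prometheus is left out.

```python
import time

TICK_HZ = 20
TICK_BUDGET = 1.0 / TICK_HZ  # 0.05 s per tick


def run_ticks(step, num_ticks, clock=time.monotonic, sleep=time.sleep):
    """Call step(tick) at a fixed rate; return each tick's processing time in seconds."""
    latencies = []
    next_deadline = clock()
    for tick in range(num_ticks):
        started = clock()
        step(tick)
        latencies.append(clock() - started)

        next_deadline += TICK_BUDGET
        remaining = next_deadline - clock()
        if remaining > 0:
            sleep(remaining)  # finished early: wait out the rest of the budget
        else:
            next_deadline = clock()  # overran: reset rather than try to catch up
    return latencies


def overloaded(latencies):
    """True if any tick exceeded the 50 ms budget."""
    return any(latency > TICK_BUDGET for latency in latencies)
```

Scheduling against a running deadline instead of sleeping a fixed 50ms keeps the loop from drifting when individual ticks take different amounts of time.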
Wrap Up

Advanced Topics

Consistency vs. Availability: We choose Consistency (CP) for character data (if the DB is down, you can't trade) but Availability (AP) for movement.
Bottleneck Analysis: The "Hot Spot" shard occurs when thousands of players gather for a "World Boss."
Optimization: Implement "Layering" (instancing the same zone multiple times) when a population threshold is hit.
Security: All logic is Server-Authoritative. The client only sends "Intent" (e.g., "I want to move to X"), and the server validates if that move is possible given the player's speed.
Distinguishing Insight: Clock Synchronization. Use a simplified NTP-like handshake at the start of a session to calculate RTT and clock offset. This is critical for the server to "rewind" state to validate a hit-scan shot from a player with 100ms lag.
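
The NTP-like handshake reduces to two formulas over four timestamps. The sketch below assumes the usual convention (t0 = client send, t1 = server receive, t2 = server send, t3 = client receive) and symmetric network delay; the function names and the snapshot history layout are illustrative.

```python
import bisect


def estimate_clock(t0, t1, t2, t3):
    """Return (rtt, offset), where offset = server clock minus client clock.

    t0 and t3 are read from the client clock, t1 and t2 from the server clock.
    """
    rtt = (t3 - t0) - (t2 - t1)           # time on the wire, excluding server work
    offset = ((t1 - t0) + (t2 - t3)) / 2  # assumes equal delay in both directions
    return rtt, offset


def to_server_time(client_timestamp, offset):
    """Translate a client-stamped action into the server's timeline."""
    return client_timestamp + offset


def snapshot_at(history, server_time):
    """Latest snapshot taken at or before server_time.

    history is a list of (server_timestamp, state) pairs sorted by timestamp.
    Returns None if server_time predates the whole history.
    """
    times = [t for t, _ in history]
    index = bisect.bisect_right(times, server_time) - 1
    return history[index][1] if index >= 0 else None
```

To validate a shot, the server would convert the client's fire timestamp with `to_server_time` and test the hit against `snapshot_at(history, that_time)` rather than against the current world state.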