The Question
DesignScalable Real-time Chat System Design
Design a real-time messaging platform similar to Discord that supports millions of concurrent users. The system must handle persistent message history, real-time delivery with low latency, and a presence system (online/offline status). Specifically address how the architecture handles massive group servers with over 100,000 concurrent members, focusing on storage partitioning, real-time fan-out challenges, and maintaining high availability during peak traffic spikes.
WebSockets
ScyllaDB
PostgreSQL
Redis
NATS
Snowflake ID
Anycast IP
Questions & Insights
Clarifying Questions
Scale: What is the target scale for Daily Active Users (DAU) and Peak Concurrent Users (PCU)?
Server Size: What is the maximum number of members a single "Server" (Guild) can hold? (e.g., 1 million+ like official game servers).
Persistence: Do we need to store message history indefinitely, and should it be searchable?
Features: Should we focus strictly on text-based messaging for the MVP, or are Voice/Video and Screen Sharing required?
Presence: How real-time does the "Online/Idle/DND" status need to be across large servers?
Assumptions:
DAU: 100 million; PCU: 15 million.
Max Server Size: 1 million members.
Message History: Persistent forever.
MVP Focus: Real-time text chat, server/channel management, and presence. Voice/Video is out of scope for the MVP.
Thinking Process
Core Bottleneck: Real-time delivery to millions of concurrent users and the "Fan-out" problem (sending one message to 100k+ online members in a large server).
Step 1: How do we maintain persistent connections? We use a WebSocket Gateway to maintain bi-directional stateful connections for real-time delivery.
Step 2: How do we handle massive message storage? Relational databases fail at this scale; we need a wide-column NoSQL store (ScyllaDB/Cassandra) for message partitioning.
Step 3: How do we scale presence? A distributed Pub/Sub mechanism is needed, but for massive servers, we must use "Lazy Loading" (only showing presence for users in the current UI viewport) to avoid O(N^2) traffic.
Bonus Points
Message IDs: Use Snowflake-like IDs (k-sortable) to ensure chronological ordering across distributed shards without a central lock.
Read-states Optimization: Instead of a "read" flag per message/user, store a
last_read_message_id per channel/user to drastically reduce write load.Gateway Sharding: Implement consistent hashing on Gateway servers to ensure users from the same guild land on the same "Manifold" or "Session" worker to optimize internal pub/sub routing.
Data Locality: Discuss ScyllaDB's shard-per-core architecture to minimize CPU cache misses during high-volume chat spikes.
Design Breakdown
Functional Requirements
Core Use Cases:
Users can join/create Servers and Channels.
Users can send and receive real-time messages.
Users can see the online/offline status (Presence) of friends/server members.
Users can view message history (infinite scroll).
Scope Control:
In-scope: Text chat, Presence, Guild/Channel metadata.
Out-of-scope: Voice/Video, File attachments (beyond URLs), Search, Rich Embeds, Nitro/Billing.
Non-Functional Requirements
Scale: Support 100M+ DAU and billions of messages per day.
Latency: End-to-end message delivery < 200ms.
Availability: 99.99% (Chat is "always-on" utility).
Consistency: Eventual consistency for message history is acceptable, but message ordering within a channel must be strictly preserved.
Fault Tolerance: No single point of failure; Gateway servers must handle graceful reconnection.
Estimation
Traffic:
100M DAU * 20 messages/day = 2 Billion messages/day.
Average QPS: ~23,000 writes/sec.
Peak QPS (10x): ~230,000 writes/sec.
Storage:
2B messages * 100 bytes/msg = 200 GB/day.
73 TB/year (before replication).
Bandwidth:
23k writes/sec * 100 bytes = 2.3 MB/s (Inbound).
Outbound is much higher due to Fan-out (e.g., avg 100 users per channel) = 230 MB/s.
Blueprint
Concise Summary: A WebSocket-based real-time architecture utilizing a stateful Gateway for message delivery, backed by ScyllaDB for high-throughput message persistence and Redis for session/presence tracking.
Major Components:
WebSocket Gateway: Maintains persistent TCP connections to clients for real-time bi-directional events.
Chat Service: Handles incoming messages, validates permissions, and persists them to storage.
Presence Service: Tracks user heartbeats and broadcasts status changes.
Metadata DB (Postgres): Stores relational data like Guilds, Channels, and Roles.
Message Store (ScyllaDB): High-performance wide-column store for chat history.
Simplicity Audit: This design avoids complex stream processing (Flink) or heavy service meshes for the MVP, focusing on direct WebSocket-to-Service communication.
Architecture Decision Rationale:
ScyllaDB over Postgres: Postgres cannot handle the write-throughput and partitioning required for billions of messages.
WebSocket over Long Polling: Reduced header overhead and lower latency for bi-directional communication.
Redis for Presence: High-speed, TTL-based storage is perfect for ephemeral heartbeats.
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery & Traffic Routing: Use a Global Anycast IP to route users to the nearest regional Data Center.
Security & Perimeter:
API Gateway: Handles TLS termination and JWT validation.
Rate Limiting: Applied at the UserID level to prevent spamming messages.
Service
Topology & Scaling:
Gateway Servers: Stateful. Scaled horizontally based on concurrent connection count (usually 50k-100k per instance).
API Servers: Stateless. Scaled based on CPU/Request Latency.
API Schema Design:
POST /v1/channels/{id}/messages: REST for sending (provides ACK).WS /gateway: WebSocket for receiving events.Idempotency: Clients generate a
nonce to prevent duplicate messages during retries.Resilience & Reliability:
Exponential Backoff: For client reconnections to avoid a "Thundering Herd" after a Gateway restart.
Message ACKs: Clients must ACK a message ID; if not received, the Gateway retries or the client fetches missing data on reconnect.
Storage
Access Pattern:
Writes: Very high (new messages).
Reads: High (loading channel history, infinite scroll).
Database Table Design:
Message Table (ScyllaDB):
channel_id (Partition Key): All messages for a channel live together.message_id (Clustering Key): K-sortable (Snowflake) for chronological ordering.author_id, content, timestamp.Metadata (Postgres):
guilds table, channels table, members table (roles/permissions).Technical Selection: ScyllaDB. Chosen for its C++ implementation of Cassandra's model, offering lower p99 latency and better resource utilization for the massive write volume.
Cache
Purpose & Justification:
Presence: Redis stores
user_id -> {status, last_active}.Session: Mapping
user_id -> gateway_id to route messages to the correct stateful server.Key-Value Schema:
Key:
presence:{user_id}, Value: Hash of status. TTL: 60s (heartbeat based).Failure Handling: If Redis fails, presence defaults to "Offline". This is acceptable for an MVP as it's non-critical data.
Messaging
Purpose & Decoupling: Used as the internal backbone for fan-out. When a message is persisted, it is published to the broker.
Throughput & Partitioning: Partitioned by
channel_id to ensure message ordering within a specific chat.Technical Selection: Redis Pub/Sub (for low latency presence) or NATS/Kafka (for message fan-out). For a Discord-scale MVP, a distributed Pub/Sub like NATS is preferred for low-latency delivery over persistent storage like Kafka.
Wrap Up
Advanced Topics
Trade-offs (PACELC): We choose Availability over Consistency (AP). If a message is slightly delayed in history but delivered real-time, it's better than the system being down.
Reliability:
Gateway Sharding: Users are assigned to Gateway "buckets" to limit the blast radius if one node crashes.
ScyllaDB Replication: 3x replication across Availability Zones.
Bottleneck Analysis (The 1M Member Server):
If 100k people are online in a server, one message generates 100k outgoing packets.
Optimization: The Gateway should not send presence updates for all 1M members. It only sends updates for members visible in the user's UI "member list" (Client-side Viewport).
Security:
End-to-Transit Encryption: TLS 1.3 for all connections.
Permissions: Every message write checks the Postgres
members table (cached in Redis) to ensure the user has SEND_MESSAGES permission for that channel_id.