The Question
DesignDesign a Scalable Real-time Messaging Platform
Design a real-time communication system similar to Discord that supports massive persistent connections (1M+ PCU), instantaneous message delivery across channels, and a scalable presence system. The system must handle 'megaservers' with hundreds of thousands of members and ensure message durability and ordering. Discuss your strategy for WebSocket management, storage selection for high-write chat logs, and the architectural trade-offs involved in presence broadcasting at scale.
ScyllaDB
PostgreSQL
Redis
Kafka
WebSockets
gRPC
Snowflake ID
CDN
Questions & Insights
Clarifying Questions
Scale: What is the expected scale for the MVP? (Assumed: 100M Registered Users, 10M DAU, 1M Peak Concurrent Users (PCU)).
Core Features: Are we including Voice/Video or just Text and Presence? (Assumed: Text chat, Servers/Channels, and Real-time Presence only for MVP).
Persistence: Do messages need to be stored forever or is there a TTL? (Assumed: Permanent storage with horizontal scalability).
Group Size: What is the maximum size of a "Server"? (Assumed: Support for "Megaservers" up to 500k members).
Client Support: Mobile and Desktop? (Assumed: Cross-platform support via WebSockets).
Thinking Process
The WebSocket Problem: How do we maintain 1M+ persistent connections and route messages to the correct user sessions?
The Presence Bottleneck: How do we broadcast "Online/Offline" status to thousands of friends/server-mates without creating an O(N^2) traffic storm?
The Storage Choice: Which database can handle heavy write throughput and provide low-latency retrieval by
channel_id + timestamp?The Delivery Guarantee: How do we ensure messages are delivered in order and handle offline synchronization?
Bonus Points
Snowflake IDs: Using decentralized k-sortable unique IDs to ensure message ordering across distributed shards without a central lock.
Presence Fan-out Optimization: Implementing a "Lazy Load" or "View-port" based presence strategy for large servers to prevent thundering herds.
ScyllaDB/Cassandra Compaction: Tuning the storage engine (Time Window Compaction Strategy) specifically for chat access patterns to minimize disk I/O.
Gateway Sharding: Using consistent hashing for Gateway nodes to minimize the impact of node failures on persistent connections.
Design Breakdown
Functional Requirements
Core Use Cases:
Users can create servers and channels.
Users can send and receive real-time messages in channels.
Users can see the real-time online/offline status of friends.
Users can view message history in channels.
Scope Control:
In-scope: Text messaging, Presence, Server/Channel management, Auth.
Out-of-scope: Voice/Video calls, File uploads (CDN), Search functionality, Message reactions.
Non-Functional Requirements
Scale: Support 1M PCU and 100k message writes per second.
Latency: End-to-end message delivery < 200ms.
Availability & Reliability: 99.99% availability; messages must not be lost once acknowledged.
Consistency: Eventual consistency for presence; strong ordering for messages within a channel.
Fault Tolerance: System must survive the failure of individual Gateway or Database nodes.
Estimation
Traffic Estimation:
1M PCU. If 10% of users send a message every 10 seconds: 1M * 0.1 / 10 = 10k Messages/sec (Average).
Peak: 100k Messages/sec.
Storage Estimation:
Average message size: 100 bytes.
100k \text{ msg/sec} * 86,400 \text{ sec/day} \approx 8.6 \text{ billion msgs/day}.
8.6B * 100 \text{ bytes} \approx 860 \text{ GB/day}.
Yearly storage: \approx 310 \text{ TB}.
Bandwidth Estimation:
Incoming: 100k * 100 \text{ bytes} = 10 \text{ MB/s}.
Outgoing (Fan-out): If average channel has 20 active users, 10 \text{ MB/s} * 20 = 200 \text{ MB/s}.
Blueprint
This architecture focuses on a high-concurrency Gateway layer to manage persistent connections and a distributed wide-column store for message persistence.
Gateway Service (WebSocket): Manages stateful connections and bi-directional real-time communication.
Chat Service: Handles message validation, persistence, and fan-out logic.
Presence Service: Tracks user heartbeats and broadcasts status updates.
ScyllaDB/Cassandra: Stores messages using a partition key of
channel_id for fast sequential reads.Redis: Stores transient session mapping and presence state.
Simplicity Audit: We avoid complex service meshes or global transactions. We use ScyllaDB because it scales linearly, which is a requirement even for a "Discord MVP" due to the inherent viral nature of the product.
Architecture Decision Rationale:
Why this architecture?: WebSockets are essential for low-latency bi-directional updates. A wide-column store (ScyllaDB) is chosen over RDBMS because chat history is write-heavy and grows indefinitely.
Functional Satisfaction: Covers real-time delivery via Gateway and history via ScyllaDB.
Non-functional Satisfaction: Redis provides sub-millisecond presence checks; ScyllaDB ensures no single point of failure for data.
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery: Use CDN for static assets (CSS/JS) and emojis.
Load Balancing: L4 (TCP) Load Balancer for WebSocket connections using round-robin or least-connections. SSL termination happens here.
Gateway Cluster: Nodes maintain a mapping of
User_ID -> WebSocket_Connection. Heartbeats every 30s to prune dead connections.Service
Chat Service:
Protocol: gRPC for internal service communication.
Idempotency: Clients attach a
nonce or client_msg_id.Ordering: Generates a Snowflake ID (64-bit) for every message to ensure total ordering within a channel.
Presence Service:
Handles "Online", "Idle", "Do Not Disturb", and "Offline".
Optimization: For servers > 1000 members, presence is only sent for users currently looking at the member list (viewport-based presence).
Observability: Prometheus metrics for "Connection Count" and "Message E2E Latency".
Storage
Access Pattern:
Write: High volume inserts (Messages).
Read: Fetch last N messages for a
channel_id.Database Table Design:
Messages (ScyllaDB):
Partition Key:
channel_idClustering Key:
message_id (Snowflake, descending)Fields:
author_id, content, timestamp.Metadata (PostgreSQL):
Tables for
Servers, Channels, Server_Members.Indexed on
user_id and server_id.Technical Selection: ScyllaDB (NoSQL Wide-column). It handles high write-throughput better than Postgres and supports partitioning by
channel_id, allowing message history to be retrieved in a single disk seek.Cache
Purpose: Store session metadata and user presence state.
Key-Value Schema:
user:presence:{id} -> {status, last_active_ts}.user:gateway:{id} -> gateway_node_ip (used for routing incoming messages to the correct socket).Technical Selection: Redis. High performance for TTL-based presence heartbeats.
Messaging
Purpose: Decouples the Chat Service from the Gateway. When a message is sent to a channel, it is published to a topic.
Event Schema:
channel_id, message_payload, recipient_list.Throughput: Kafka handles the fan-out. Each Gateway node subscribes to a subset of partitions.
Technical Selection: Kafka. Required for high-throughput message distribution and ability to "replay" if a Gateway node restarts.
Wrap Up
Advanced Topics
Trade-offs: We choose Availability over Consistency (AP) for Presence. It is acceptable if a user sees a friend as "online" for a few seconds longer than they actually are.
Reliability:
Gateway Failure: If a Gateway node dies, clients reconnect to another node. Redis is updated with the new
User -> Gateway mapping.Backpressure: If Kafka is slow, the Chat Service can buffer messages locally or return a 503 to the client.
Bottleneck Analysis:
Hot Channels: A channel with 100k active users (e.g., a major gaming event) creates a massive fan-out.
Optimization: Use "Sub-groups" for fan-out or limit the broadcast only to the "Top 100" active members in massive channels.
Security:
All messages encrypted in transit via TLS.
Member permissions checked in the Chat Service via Postgres
Server_Members table before allowing a write.