The Question
DesignDesign a Distributed Real-time Messaging System (Slack)
Design a real-time messaging platform capable of supporting millions of concurrent users across multiple workspaces. The system must handle persistent 1-on-1 and group channel communications, real-time presence indicators, and seamless message history retrieval. Discuss the trade-offs between WebSocket state management and stateless scalability, your choice of storage for high-volume write-heavy workloads, and how you would mitigate 'thundering herd' problems during presence updates in large channels.
WebSockets
Cassandra
Redis
Kafka
NoSQL
JWT
TLS
LSM-tree
Questions & Insights
Clarifying Questions
Scale: What is the target Daily Active User (DAU) count and what is the peak concurrent connection count?
Entity Model: Do we need to support both 1-on-1 Direct Messages (DM) and large multi-tenant Public/Private Channels?
Data Retention: Is the requirement for infinite message history, or is there a "free tier" vs. "paid tier" retention policy (e.g., 90 days vs. infinite)?
Presence: How real-time does the "Online/Offline" status need to be, and what is the maximum channel size for presence updates?
Media: Is the MVP focused on text-only, or must we support file uploads and link previews immediately?
Assumptions:
DAU: 10 Million.
Message Volume: 1 Billion messages per day (~12k avg QPS, 60k peak QPS).
Persistence: Infinite history is required for the MVP.
Presence: Required for all users.
Latency: End-to-end message delivery under 200ms (99th percentile).
Thinking Process
The Real-time Core: How do we maintain persistent connections for millions of users to ensure sub-second delivery? (Answer: WebSockets + Stateful Gateway).
The Write Heavy Storage: How do we store billions of messages such that they are instantly searchable and retrievable by channel? (Answer: NoSQL/LSM-tree based storage like Cassandra/DynamoDB).
Presence Scalability: How do we handle the "thundering herd" of status updates when a 10k-user channel sees someone go online? (Answer: Heartbeat-based presence with intelligent fan-out).
Push & Notify: How do we reach users who are not currently connected? (Answer: Decoupled Push Notification Service via Message Queue).
Bonus Points
Flannel-style Edge Caching: Implementing an edge-cache layer (similar to Slack's "Flannel") that caches workspace metadata (users, channels) closer to the user to reduce load on the core DB and decrease startup latency.
Message Ordering with Vector Clocks: Ensuring causal consistency in a distributed system where clocks might drift, preventing messages from appearing out of order.
Presence Optimization: Using a "Pull-on-Demand" or "Lazy Fan-out" model for presence in very large channels (>2000 users) to avoid O(N^2) update storms.
Operational Sharding: Sharding by Workspace ID (
team_id) as the primary partition key to ensure that a single workspace's traffic is isolated and failures are contained.Design Breakdown
Functional Requirements
Core Use Cases:
Real-time 1-on-1 and Group/Channel messaging.
User Presence (Online, Offline, DND).
Message Persistence (History retrieval).
Push notifications for offline users.
Scope Control:
In-scope: Text messaging, presence, history, basic workspace/channel structure.
Out-of-scope: Video/Voice calls (WebRTC), complex search indexing (ElasticSearch), and App Integrations/Bots.
Non-Functional Requirements
Scale: Support 10M DAU and 1M+ concurrent WebSocket connections.
Latency: <200ms message delivery; <500ms history loading.
Availability & Reliability: 99.99% availability (Highly available gateway and storage).
Consistency: Causal consistency for message ordering within a channel.
Fault Tolerance: Handle "Gateway" server crashes without losing message delivery (automatic reconnection).
Security: TLS for all transit; Workspaces must be logically isolated.
Estimation
Traffic: 1B messages/day / 86,400s ≈ 11,500 messages/sec (Avg). Peak is ~60k QPS.
Storage: Avg message size 200 bytes. 1B messages/day = 200 GB/day. 73 TB/year (without replication). With 3x replication = ~220 TB/year.
Bandwidth: 12k msgs/sec * 200 bytes = 2.4 MB/s (Avg). Peak = ~12 MB/s. This is manageable for a standard backbone.
Connections: 10M DAU, assuming 20% concurrent = 2M concurrent WebSocket connections. If one server handles 50k connections, we need ~40 Gateway servers.
Blueprint
Concise Summary: A WebSocket-based real-time architecture where a stateful Gateway manages connections, a NoSQL store handles high-volume message persistence, and Redis manages ephemeral presence state.
Major Components:
WebSocket Gateway: Stateful service maintaining persistent TCP connections to clients for bidirectional real-time communication.
Chat Service: Stateless logic coordinator that routes messages, validates permissions, and interacts with storage.
Presence Service: High-throughput service using heartbeats to track user online/offline status.
Message Store: LSM-tree optimized database (Cassandra) for high-write message ingestion and time-ordered retrieval.
Redis Cache: Ephemeral store for user presence and session mapping.
Message Queue (Kafka): Decouples message delivery from side effects like push notifications and analytics.
Simplicity Audit: This design avoids complex "Micro-services" for every sub-feature, grouping logic into a "Chat Service" while separating the stateful "Gateway" to maximize resource efficiency.
Architecture Decision Rationale:
Why this architecture?: WebSockets are the industry standard for low-latency chat. NoSQL is chosen over SQL because chat history is write-heavy and naturally partitioned by Channel ID.
Functional Satisfaction: Covers real-time delivery, presence, and history.
Non-functional Satisfaction: Scaling is achieved by horizontally adding Gateways and using a distributed DB.
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery: Use CDN for static assets (icons, JS/CSS bundles).
Load Balancing: L4 LB (NLB) for WebSocket Gateways to handle long-lived TCP connections; L7 LB (ALB) for the RESTful Chat Service.
API Gateway: Handles SSL termination and JWT-based authentication before routing to internal services.
Service
WebSocket Gateway (Stateful):
Keeps a mapping of
User_ID -> Connection_Object. When a message arrives for a user, the Chat Service finds which Gateway the user is connected to (via Redis) and pushes the message.
Chat Service (Stateless):
API:
POST /v1/messages (REST/gRPC), GET /v1/channels/{id}/history.Logic: Verifies if user belongs to the
channel_id, timestamps the message, and writes to Cassandra.Presence Service:
Clients send a heartbeat every 30 seconds.
Service updates a TTL-based key in Redis:
status:user_123 = "online".If the key expires, the user is considered offline.
Resilience:
Exponential backoff for client reconnections.
If a Gateway node dies, clients detect the socket drop and reconnect to a different node.
Storage
Access Pattern: Write-heavy (every message). Read-heavy (when users open channels or scroll back).
Database Table Design (Cassandra):
Table: `messages
channel_id (Partition Key)timestamp (Clustering Key, Descending)message_id (UUID)user_id (UUID)content (Text)Table: `channel_members
channel_id (Partition Key)user_id (Clustering Key)Technical Selection: Cassandra or ScyllaDB.
Rationale: Support for high-write throughput and excellent performance for "Get last N messages in a channel" queries using clustering keys.
Cache
Purpose: To store the
User_ID -> Gateway_Node_IP mapping (Session Sticky Info) and the Presence status.Key-Value Schema:
user:session:{user_id} -> gateway_ip_address (TTL 24h)user:presence:{user_id} -> status_string (TTL 60s)Technical Selection: Redis Cluster.
Rationale: Sub-millisecond latency for presence lookups. Using Redis Pub/Sub for small-scale presence fan-out.
Messaging
Purpose: Decouple the critical path of message delivery from secondary tasks.
Event Schema:
{ "type": "NEW_MESSAGE", "channel_id": "...", "sender_id": "...", "content": "..." }.Failure Handling: Kafka provides durability. If the Notification Service is down, it can replay messages from the offset to ensure no push notifications are missed.
Technical Selection: Kafka.
Rationale: High throughput and message retention allow multiple consumers (Analytics, Search Indexer, Notifications) to consume the same stream.
Wrap Up
Advanced Topics
Trade-offs (PACELC): We prioritize Availability and Partition Tolerance (AP). In a network partition, some users might see messages slightly out of order or delayed (Eventual Consistency), which is acceptable for chat compared to the system being unavailable.
Reliability:
Message Acknowledgment: The client should receive an
ACK from the server. If not received, the client retries with the same client_msg_id (Idempotency).Bottleneck Analysis:
Hot Channels: A channel like
#general in a 50k-person company. We optimize this by using "Lazy Loading" (don't push the message to all 50k sockets if they aren't active) and batching presence updates.Security:
E2EE (End-to-End Encryption): While Slack doesn't traditionally use E2EE, it could be implemented by encrypting payloads on the client-side, but this breaks server-side search. For MVP, we stick to Encryption at Rest and mTLS between services.
Distinguishing Insights:
Presence Fan-out Optimization: Instead of pushing "User A is online" to 10,000 people, we only push that update to users who currently have the chat window open for a channel where User A is a member. This "Viewport-based" update significantly reduces the number of WebSocket messages sent.