DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
The Question
Design

Scalable Real-time Messaging System (Slack)

Design a high-scale real-time messaging platform like Slack that supports millions of concurrent users. The system must handle private and group channels (up to 100k members), provide instant message delivery with strict ordering within channels, and track user presence (online/offline). Discuss how you would handle massive fan-out for popular channels, ensure high availability despite stateful connections, and optimize storage for billions of daily messages.
WebSockets
Kafka
Cassandra
Redis
gRPC
Snowflake ID
Anycast IP
JWT
Questions & Insights

Clarifying Questions

Scale and Traffic: What is the target Daily Active User (DAU) count and the average number of messages sent per user?
Assumption: 20M DAU, 100 messages per user/day (~2B messages total/day).
Channel Dynamics: What is the maximum number of users in a single channel?
Assumption: Support up to 100k users per channel, but 99% of channels have < 500 users.
Message Types: Does the MVP focus on text, or are rich media (images/videos) and threads required?
Assumption: Focus on text, basic threading, and online/offline presence.
Consistency vs. Availability: In a partition scenario, do we prioritize message order or system uptime?
Assumption: High availability is critical, but message ordering within a channel must be strictly maintained.

Thinking Process

Core Bottleneck: Handling persistent stateful connections for millions of concurrent users while ensuring low-latency delivery.
Key Progressive Questions:
How do we maintain 20M+ concurrent duplex connections efficiently? (WebSocket Gateways).
How do we route a message from User A on Server 1 to User B on Server 50? (Pub/Sub Messaging Bus).
How do we handle "thundering herds" when a large channel (100k users) receives a message? (Lazy loading and fan-out optimization).
How do we store trillions of messages while allowing fast "scroll-back" queries? (LSM-tree based NoSQL).

Bonus Points

Presence Optimization: Instead of heartbeat-per-friend, use a "lazy pull" or "delta-based" presence update for massive channels to reduce bandwidth by 80%.
Intelligent Last-Read: Use a "Read-Marker" service with a sidecar pattern to update unread counts asynchronously, preventing write-amplification on the main message path.
Conflict-Free Replicated Data Types (CRDTs): Mentioning CRDTs for collaborative features like "User is typing..." or shared canvas if the scope expands.
Edge Terminated WebSockets: Using Anycast IP and Global Accelerator to terminate TLS/TCP at the edge, reducing the handshake RTT for mobile users.
Design Breakdown

Functional Requirements

Core Use Cases:
Real-time 1-on-1 and Group/Channel messaging.
Message persistence and history retrieval.
Presence indicators (Online/Offline/Away).
Channel discovery and management.
Scope Control:
In-scope: Real-time delivery, message storage, presence, basic threading.
Out-of-scope: Voice/Video calls, advanced file search/indexing, third-party app integrations (Slack Apps).

Non-Functional Requirements

Scale: Support 20M DAU and 100k users per channel.
Latency: End-to-end message delivery < 200ms (P99).
Availability: 99.99% (High availability via multi-AZ deployment).
Consistency: Sequential consistency for messages within a specific channel.
Fault Tolerance: Automatic reconnection and message retries for flaky mobile networks.
Security: TLS in-transit, encryption at rest, and strict RBAC for private channels.

Estimation

Traffic Estimation:
Write QPS: 2B messages / 86,400s \approx 23k QPS (Avg). Peak (3x) \approx 70k QPS.
Read QPS: (Scrollback + Presence + Initial Load) \approx 10x Write QPS \approx 230k QPS.
Storage Estimation:
2B messages/day * 500 bytes (avg) \approx 1TB/day.
1 Year storage \approx 365TB (before replication/indexing).
Bandwidth Estimation:
Inbound: 23k QPS * 500 bytes \approx 11.5 MB/s.
Outbound (Fan-out): Assuming avg. 20 users per channel \approx 230 MB/s.

Blueprint

Concise Summary: A stateful WebSocket-based architecture using a distributed Pub/Sub (Kafka) to route messages between users, backed by a wide-column NoSQL database (Cassandra) for high-write message persistence.
Major Components:
WebSocket Gateway: Maintains persistent connections and handles real-time push to clients.
Message Service: Stateless service that validates, persists, and orchestrates message delivery.
Presence Service: Tracks user heartbeats and manages online status using a fast In-Memory store.
Kafka (Messaging Layer): Acts as the decoupling buffer and message router for channel-based fan-out.
Simplicity Audit: This design avoids complex distributed locking by relying on Kafka's partition-key (ChannelID) to maintain message order.
Architecture Decision Rationale:
Why this architecture?: WebSockets are the industry standard for low-latency duplex communication. Cassandra's LSM-tree architecture is optimized for the high-write volume inherent in chat systems.
Functional Satisfaction: Covers real-time delivery and history.
Non-functional Satisfaction: Horizontally scalable across all layers.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: Use a Global Accelerator for Anycast IP routing to the nearest AWS/GCP region to minimize TCP handshake latency.
Security & Perimeter:
API Gateway: Handles JWT validation and rate limiting (e.g., 50 messages/sec per user) to prevent spam.
SSL/TLS: Terminated at the Load Balancer to offload compute from internal services.

Service

Topology & Scaling:
WebSocket Gateway: Stateful, scaled based on concurrent connections (max 50k per instance). Uses a consistent hashing ring to map UserID to a specific server.
Message Service: Stateless, scaled by CPU/QPS.
API Schema Design:
POST /v1/messages: Protocol (gRPC). Request: {channel_id, sender_id, text, type}. Response: {message_id, timestamp}.
GET /v1/channels/{id}/messages: Protocol (REST). For history retrieval.
Resilience & Reliability:
Heartbeats: Clients send heartbeats every 30s to keep the WS connection alive.
Sequence IDs: Use a Snowflake-style ID generator to ensure K-sortable message IDs for ordering.

Storage

Access Pattern: Heavy writes (new messages) and localized reads (scrollback in a specific channel).
Database Table Design:
messages: channel_id (Partition Key), created_at (Clustering Key), message_id, sender_id, content.
This design ensures all messages for a channel are stored together on disk, making "fetch last 50 messages" extremely fast.
Technical Selection: Cassandra / ScyllaDB.
Rationale: Excellent write throughput and the partition/clustering key model perfectly fits time-series chat data.
Distribution Logic: Partition by channel_id. Large channels are spread via sub-partitioning if they exceed size limits (e.g., channel_id_YYYY_MM).

Cache

Purpose & Justification: Reduce DB load for frequently accessed metadata (Channel member lists) and user sessions.
Key-Value Schema:
chan_mem:{channel_id} -> Set of UserIDs.
user_session:{user_id} -> WebSocket Server ID (to know where to route messages).
Failure Handling: If Redis fails, the system falls back to DB reads (performance degradation, not outage).

Messaging

Purpose & Decoupling: Kafka decouples the "Message Received" event from the "Deliver to 100k Users" process.
Throughput & Partitioning: Partition Kafka topics by channel_id. This ensures all messages for one channel are processed by one consumer, maintaining order.
Technical Selection: Kafka.
Rationale: High throughput, durability (if a WS server crashes, we can replay messages), and built-in partitioning.
Wrap Up

Advanced Topics

Trade-offs (PACELC): We choose Availability and Partition Tolerance (AP). If a presence node goes down, showing a user as "Offline" (Eventual Consistency) is better than blocking the whole chat.
Reliability:
Message Acknowledgments: The client must ACK a message. If the WS Gateway doesn't receive an ACK, it retries or sends a Push Notification (FCM/APNS).
Bottleneck Analysis:
Hot Shards: A "Breaking News" channel with 1M users creates a hot shard.
Optimization: For very large channels, skip the Pub/Sub fan-out to the UI and use a "Pull" model where the client fetches updates every 2 seconds when the window is active.
Distinguishing Insights:
To minimize "Presence Storms" (everyone logging in at 9 AM), use a jittered heartbeat and only broadcast presence updates to active friends, not the entire workspace.
For mobile, implement a "Gateway-to-Gateway" handoff to maintain sessions during IP changes (switching from WiFi to LTE).