The Question
DesignScalable Enterprise Messaging Platform (Slack-style)
Design a real-time messaging system similar to Slack that supports millions of concurrent users. The system must handle persistent 1-to-1 and group channels, real-time presence indicators, and message history search. Focus on solving the challenge of high-volume message fan-out in large channels and ensuring message ordering and low-latency delivery under peak load. Discuss trade-offs between consistency and availability in a global deployment.
WebSockets
Cassandra
Redis
Kafka
Elasticsearch
PostgreSQL
FCM
APNS
JWT
Prometheus
Snowflake ID
Questions & Insights
Clarifying Questions
Scale: What is the target Daily Active Users (DAU) and the average number of messages sent per user?
Assumption: 20M DAU, 50 messages/user/day, resulting in 1B messages per day.
Group Size: What is the maximum number of users in a single channel?
Assumption: Support up to 100,000 users in a single channel, though most have < 500.
Message Types: Does the MVP support file uploads, or just text and emojis?
Assumption: Text and basic metadata (emojis/reactions) are priority. Media involves Object Storage (S3).
Latency: What is the target end-to-end delivery latency?
Assumption: < 200ms for real-time delivery to online users.
Search: Is full-text search required for the MVP?
Assumption: Yes, users must be able to search their message history.
Thinking Process
The core challenge of Slack is the Fan-out problem: one message sent to a channel must be delivered to thousands of concurrent users instantly while maintaining a consistent order.
How do we maintain a persistent real-time connection? Use WebSockets at the Gateway layer to enable bi-directional, low-latency communication.
How do we handle message ordering and persistence? Assign a monotonically increasing sequence ID (or Snowflake ID) per channel and write to a high-throughput NoSQL store before acknowledgment.
How do we efficiently update Presence (online/offline) for thousands of peers? Implement a Pub/Sub model where presence changes are broadcasted only to "active" observers to avoid O(N^2) traffic.
How do we scale for large channels? Distinguish between "small" and "large" channels; use a pulling/lazy-loading mechanism for presence in massive channels to prevent notification storms.
Bonus Points
CRDTs for Message States: Use Conflict-free Replicated Data Types for reactions and threading to handle concurrent edits without complex locking.
Presence Optimization: Implement "Lazy Presence" for large organizations where status is only fetched when a user is actually visible on the screen.
Intelligent Last-Read Tracking: Use a separate, highly optimized atomic counter service to track "Unread" counts across multiple devices without heavy DB scans.
Storage Tiering: Use TTLs or archival strategies to move 2-year-old Slack data to cheaper Cold Storage while keeping the last 30 days in high-performance SSD-backed partitions.
Design Breakdown
Functional Requirements
Core Use Cases:
Real-time 1-to-1 and group channel messaging.
Presence tracking (Online, Away, Busy).
Persistent message history and search.
Push notifications for offline users.
Scope Control:
In-scope: Text messaging, presence, message history, basic search.
Out-of-scope: Video/Voice calls, complex App Integrations, Enterprise Grid (multi-org) federation.
Non-Functional Requirements
Scale: 20M DAU, 1B messages/day, 100k peak write QPS.
Latency: < 200ms message delivery; Presence updates < 2s.
Availability & Reliability: 99.99% availability (CAP: prioritize Availability over Consistency).
Consistency: Sequential consistency per channel (messages must appear in the same order for everyone).
Fault Tolerance: Horizontal scaling of WebSocket gateways; no single point of failure.
Security: TLS in transit, encryption at rest, and strict RBAC for channel access.
Estimation
Traffic Estimation:
Write QPS: 1B messages / 86,400s \approx 12k/s. Peak: 25k - 30k/s.
Read QPS (Fan-out): If avg channel has 20 active users, Read QPS \approx 200k - 300k/s.
Storage Estimation:
1B messages/day * 200 bytes \approx 200 GB/day.
5 years storage \approx 365 TB (excluding replicas).
Bandwidth Estimation:
Ingress: 12k/s * 200 bytes \approx 2.4 MB/s.
Egress: 240k/s * 200 bytes \approx 48 MB/s.
Blueprint
The architecture uses a WebSocket Gateway for real-time delivery, a Message Service for persistence into a wide-column NoSQL store (Cassandra), and a Presence Service backed by Redis for ephemeral state.
WebSocket Gateway: Maintains persistent TCP connections to clients for push-based delivery.
Message Service: Validates and persists messages, ensuring ordering within a channel partition.
Presence Service: Tracks heartbeats and broadcasts status changes via Pub/Sub.
Search Service: Ingests message streams into an inverted index for history lookup.
Storage Layer: Uses Cassandra for message history (high write throughput) and RDBMS for user/channel metadata.
Simplicity Audit: This design avoids complex distributed locking by relying on Cassandra's LWW (Last Write Wins) and a single source of truth for message IDs. It prioritizes the "Messaging" path as the critical hot path.
Architecture Decision Rationale:
WebSocket over Long Polling: Critical for Slack's "instant" feel and reduced header overhead.
Cassandra: Ideal for message history where the primary access pattern is "messages within a channel sorted by time."
Redis for Presence: Presence is transient; persistence isn't required, but sub-millisecond lookups are.
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery & Traffic Routing: Use a global CDN (e.g., Cloudflare) for static assets (UI, emojis). DNS uses latency-based routing to the nearest AWS/GCP region.
Security & Perimeter:
API Gateway: Handles JWT validation and request throttling (e.g., 100 requests/sec per user).
WebSocket Gateway: Handles TLS termination and connection stickiness. Connections are mapped to specific gateway instances in Redis to allow targeted push.
Service
Topology & Scaling:
Stateless services (Message, User) scale on CPU/Request count.
WebSocket Gateways are stateful (track connections) and scale based on open socket count (limit ~50k per instance).
API Schema Design:
POST /v1/messages: Sends a message. Protocol: REST (initially) or WebSocket.GET /v1/channels/{id}/messages: Paginated history.Resilience & Reliability:
Retry with Jitter: For message delivery failures.
Sequence IDs: Clients attach a
client_msg_id for idempotency to prevent duplicate messages during retries.Storage
Access Pattern:
Write-heavy: 1B writes/day.
Read-heavy: Fetching history and real-time fan-out.
Database Table Design:
Cassandra (Messages):
Partition Key: channel_idClustering Key: message_timestamp (DESC), message_idAllows O(1) lookup for the latest messages in a channel.
RDS (Metadata):
Users: user_id, email, workspace_id.Channels: channel_id, workspace_id, name, type (public/private).Technical Selection: Cassandra is chosen over RDBMS for messages due to its linear scalability and ability to handle high-write volume across multiple nodes.
Cache
Purpose & Justification: Redis is used for Presence Tracking and Session Mapping.
Key-Value Schema:
presence:{user_id}: Value online/away/timestamp. TTL of 60s, refreshed by heartbeats.user_conn:{user_id}: Value gateway_ip. Used to route incoming messages to the correct WebSocket server.Failure Handling: If Redis fails, presence defaults to "Offline."
Messaging
Purpose & Decoupling: Kafka acts as the backbone for asynchronous tasks.
Event / Topic Schema:
message_events: Contains the message payload, sender, and channel.Throughput & Partitioning: Partitioned by
channel_id to ensure that workers (Search, Push) process messages for the same channel in order.Technical Selection: Kafka for high-throughput and message replayability (critical for rebuilding search indices).
Data Processing
Processing Model: Stream processing using Kafka Consumers (Go/Java).
Processing DAG:
Kafka -> Search Worker -> Elasticsearch and Kafka -> Push Worker -> FCM/APNS.Technical Selection: Custom Go-based workers for low memory footprint and high concurrency.
Infrastructure (Optional)
Distributed Coordination: Not heavily needed for MVP; service discovery handled by Kubernetes/Consul.
Observability: Prometheus for metrics (active connections, message latency) and Jaeger for tracing message flow across services.
Wrap Up
Advanced Topics
Trade-offs: We choose Eventual Consistency for message history across regions to maintain high availability. A user might see a message slightly later than another, but the order within a channel is preserved by the
channel_id partition.Reliability: To handle Hot Channels (e.g., #general with 100k users), the system switches from a push-model to a pull-lazy-load model for presence to avoid crashing the WebSocket gateways.
Bottleneck Analysis: The main bottleneck is the Fan-out. If one message goes to 100k people, the Message Service must generate 100k tasks.
Optimization: Use a "Channel-based Pub/Sub" within the Gateway layer. Each Gateway instance subscribes to a Redis/NATS topic for a channel only if it has at least one active user for that channel. This reduces redundant message processing.
Security: Strict isolation at the storage level ensures Workspace A cannot access Workspace B data through shard-level partitioning.