DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
The Question
Design

Scalable Enterprise Messaging Platform (Slack-style)

Design a real-time messaging system similar to Slack that supports millions of concurrent users. The system must handle persistent 1-to-1 and group channels, real-time presence indicators, and message history search. Focus on solving the challenge of high-volume message fan-out in large channels and ensuring message ordering and low-latency delivery under peak load. Discuss trade-offs between consistency and availability in a global deployment.
WebSockets
Cassandra
Redis
Kafka
Elasticsearch
PostgreSQL
FCM
APNS
JWT
Prometheus
Snowflake ID
Questions & Insights

Clarifying Questions

Scale: What is the target Daily Active Users (DAU) and the average number of messages sent per user?
Assumption: 20M DAU, 50 messages/user/day, resulting in 1B messages per day.
Group Size: What is the maximum number of users in a single channel?
Assumption: Support up to 100,000 users in a single channel, though most have < 500.
Message Types: Does the MVP support file uploads, or just text and emojis?
Assumption: Text and basic metadata (emojis/reactions) are priority. Media involves Object Storage (S3).
Latency: What is the target end-to-end delivery latency?
Assumption: < 200ms for real-time delivery to online users.
Search: Is full-text search required for the MVP?
Assumption: Yes, users must be able to search their message history.

Thinking Process

The core challenge of Slack is the Fan-out problem: one message sent to a channel must be delivered to thousands of concurrent users instantly while maintaining a consistent order.
How do we maintain a persistent real-time connection? Use WebSockets at the Gateway layer to enable bi-directional, low-latency communication.
How do we handle message ordering and persistence? Assign a monotonically increasing sequence ID (or Snowflake ID) per channel and write to a high-throughput NoSQL store before acknowledgment.
How do we efficiently update Presence (online/offline) for thousands of peers? Implement a Pub/Sub model where presence changes are broadcasted only to "active" observers to avoid O(N^2) traffic.
How do we scale for large channels? Distinguish between "small" and "large" channels; use a pulling/lazy-loading mechanism for presence in massive channels to prevent notification storms.

Bonus Points

CRDTs for Message States: Use Conflict-free Replicated Data Types for reactions and threading to handle concurrent edits without complex locking.
Presence Optimization: Implement "Lazy Presence" for large organizations where status is only fetched when a user is actually visible on the screen.
Intelligent Last-Read Tracking: Use a separate, highly optimized atomic counter service to track "Unread" counts across multiple devices without heavy DB scans.
Storage Tiering: Use TTLs or archival strategies to move 2-year-old Slack data to cheaper Cold Storage while keeping the last 30 days in high-performance SSD-backed partitions.
Design Breakdown

Functional Requirements

Core Use Cases:
Real-time 1-to-1 and group channel messaging.
Presence tracking (Online, Away, Busy).
Persistent message history and search.
Push notifications for offline users.
Scope Control:
In-scope: Text messaging, presence, message history, basic search.
Out-of-scope: Video/Voice calls, complex App Integrations, Enterprise Grid (multi-org) federation.

Non-Functional Requirements

Scale: 20M DAU, 1B messages/day, 100k peak write QPS.
Latency: < 200ms message delivery; Presence updates < 2s.
Availability & Reliability: 99.99% availability (CAP: prioritize Availability over Consistency).
Consistency: Sequential consistency per channel (messages must appear in the same order for everyone).
Fault Tolerance: Horizontal scaling of WebSocket gateways; no single point of failure.
Security: TLS in transit, encryption at rest, and strict RBAC for channel access.

Estimation

Traffic Estimation:
Write QPS: 1B messages / 86,400s \approx 12k/s. Peak: 25k - 30k/s.
Read QPS (Fan-out): If avg channel has 20 active users, Read QPS \approx 200k - 300k/s.
Storage Estimation:
1B messages/day * 200 bytes \approx 200 GB/day.
5 years storage \approx 365 TB (excluding replicas).
Bandwidth Estimation:
Ingress: 12k/s * 200 bytes \approx 2.4 MB/s.
Egress: 240k/s * 200 bytes \approx 48 MB/s.

Blueprint

The architecture uses a WebSocket Gateway for real-time delivery, a Message Service for persistence into a wide-column NoSQL store (Cassandra), and a Presence Service backed by Redis for ephemeral state.
WebSocket Gateway: Maintains persistent TCP connections to clients for push-based delivery.
Message Service: Validates and persists messages, ensuring ordering within a channel partition.
Presence Service: Tracks heartbeats and broadcasts status changes via Pub/Sub.
Search Service: Ingests message streams into an inverted index for history lookup.
Storage Layer: Uses Cassandra for message history (high write throughput) and RDBMS for user/channel metadata.
Simplicity Audit: This design avoids complex distributed locking by relying on Cassandra's LWW (Last Write Wins) and a single source of truth for message IDs. It prioritizes the "Messaging" path as the critical hot path.
Architecture Decision Rationale:
WebSocket over Long Polling: Critical for Slack's "instant" feel and reduced header overhead.
Cassandra: Ideal for message history where the primary access pattern is "messages within a channel sorted by time."
Redis for Presence: Presence is transient; persistence isn't required, but sub-millisecond lookups are.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: Use a global CDN (e.g., Cloudflare) for static assets (UI, emojis). DNS uses latency-based routing to the nearest AWS/GCP region.
Security & Perimeter:
API Gateway: Handles JWT validation and request throttling (e.g., 100 requests/sec per user).
WebSocket Gateway: Handles TLS termination and connection stickiness. Connections are mapped to specific gateway instances in Redis to allow targeted push.

Service

Topology & Scaling:
Stateless services (Message, User) scale on CPU/Request count.
WebSocket Gateways are stateful (track connections) and scale based on open socket count (limit ~50k per instance).
API Schema Design:
POST /v1/messages: Sends a message. Protocol: REST (initially) or WebSocket.
GET /v1/channels/{id}/messages: Paginated history.
Resilience & Reliability:
Retry with Jitter: For message delivery failures.
Sequence IDs: Clients attach a client_msg_id for idempotency to prevent duplicate messages during retries.

Storage

Access Pattern:
Write-heavy: 1B writes/day.
Read-heavy: Fetching history and real-time fan-out.
Database Table Design:
Cassandra (Messages):
Partition Key: channel_id
Clustering Key: message_timestamp (DESC), message_id
Allows O(1) lookup for the latest messages in a channel.
RDS (Metadata):
Users: user_id, email, workspace_id.
Channels: channel_id, workspace_id, name, type (public/private).
Technical Selection: Cassandra is chosen over RDBMS for messages due to its linear scalability and ability to handle high-write volume across multiple nodes.

Cache

Purpose & Justification: Redis is used for Presence Tracking and Session Mapping.
Key-Value Schema:
presence:{user_id}: Value online/away/timestamp. TTL of 60s, refreshed by heartbeats.
user_conn:{user_id}: Value gateway_ip. Used to route incoming messages to the correct WebSocket server.
Failure Handling: If Redis fails, presence defaults to "Offline."

Messaging

Purpose & Decoupling: Kafka acts as the backbone for asynchronous tasks.
Event / Topic Schema:
message_events: Contains the message payload, sender, and channel.
Throughput & Partitioning: Partitioned by channel_id to ensure that workers (Search, Push) process messages for the same channel in order.
Technical Selection: Kafka for high-throughput and message replayability (critical for rebuilding search indices).

Data Processing

Processing Model: Stream processing using Kafka Consumers (Go/Java).
Processing DAG: Kafka -> Search Worker -> Elasticsearch and Kafka -> Push Worker -> FCM/APNS.
Technical Selection: Custom Go-based workers for low memory footprint and high concurrency.

Infrastructure (Optional)

Distributed Coordination: Not heavily needed for MVP; service discovery handled by Kubernetes/Consul.
Observability: Prometheus for metrics (active connections, message latency) and Jaeger for tracing message flow across services.
Wrap Up

Advanced Topics

Trade-offs: We choose Eventual Consistency for message history across regions to maintain high availability. A user might see a message slightly later than another, but the order within a channel is preserved by the channel_id partition.
Reliability: To handle Hot Channels (e.g., #general with 100k users), the system switches from a push-model to a pull-lazy-load model for presence to avoid crashing the WebSocket gateways.
Bottleneck Analysis: The main bottleneck is the Fan-out. If one message goes to 100k people, the Message Service must generate 100k tasks.
Optimization: Use a "Channel-based Pub/Sub" within the Gateway layer. Each Gateway instance subscribes to a Redis/NATS topic for a channel only if it has at least one active user for that channel. This reduces redundant message processing.
Security: Strict isolation at the storage level ensures Workspace A cannot access Workspace B data through shard-level partitioning.