The Question
Design a Scalable Real-time Messaging System
Design a high-concurrency real-time chat platform capable of supporting 100M DAU. The system must handle persistent WebSocket connections, ensure ordered message delivery, provide online/offline presence status, and support efficient retrieval of message history for both 1-on-1 and group conversations.
WebSocket
Redis
Cassandra
Kafka
Snowflake ID
Questions & Insights
Clarifying Questions
Scale and Traffic: What is the target Daily Active User (DAU) count and peak concurrent connections?
Feature Set: Are we supporting 1-on-1 chats only, or do we need Large Group chats (e.g., 100k+ members)?
Persistence: Do we need infinite message history, or is it a transient "ephemeral" chat?
Media: Should we handle file/image attachments in the MVP, or is this text-only?
Presence: Is real-time "online/offline" status a hard requirement for the MVP?
Assumptions for this design:
DAU: 100 Million.
Concurrent Connections: 10 Million.
Features: 1-on-1 and medium-sized group chats (up to 500 people).
Persistence: Permanent message storage required.
Connectivity: WebSockets for bi-directional real-time communication.
Thinking Process
Core Bottleneck: Maintaining millions of persistent WebSocket connections and efficiently routing messages to the specific server holding the recipient's connection.
Strategy:
Connection Management: Use a distributed WebSocket Gateway layer to handle long-lived connections.
Session Mapping: Use a Global Session Store (Redis) to track which user is connected to which Gateway instance.
Message Routing: Decouple message ingestion from delivery using a Message Bus to handle fan-out and asynchronous processing.
Storage Strategy: Use a NoSQL Wide-column store (Cassandra/ScyllaDB) to handle high-write volume and provide efficient chronological retrieval via partition keys.
Bonus Points
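Causal Ordering: Using Lamport Timestamps or Snowflake IDs to keep message ordering consistent despite clock skew between servers. Snowflake IDs are only roughly time-sorted across machines, so ordering within a conversation still depends on a single sequencing point.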
Presence Optimization: Implementing a "Last Seen" heartbeat with a pull-based model for large groups to avoid the "N^2 fan-out" problem.
Push Notification Bridge: A seamless transition strategy from WebSocket delivery to APNS/FCM when a user's connection drops.
Hot Partition Mitigation: Strategies for handling "Celebrity" group chats where thousands of messages are sent/received per second in a single thread.
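To make the Snowflake ID idea above concrete, here is a minimal sketch of a generator. The bit layout (10 worker bits, 12 sequence bits), the custom epoch, and the choice to borrow the next logical millisecond when the sequence overflows are all assumptions for illustration, not details from this design.

```python
import threading
import time

EPOCH_MS = 1_600_000_000_000  # hypothetical custom epoch (assumption)
WORKER_BITS = 10              # assumed layout: timestamp | worker | sequence
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


class SnowflakeGenerator:
    """Generates 64-bit style IDs that increase monotonically per worker."""

    def __init__(self, worker_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self.clock = clock
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = self.clock()
            if now < self.last_ms:
                # Clock went backwards: reuse the last timestamp so IDs never decrease.
                now = self.last_ms
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & MAX_SEQUENCE
                if self.sequence == 0:
                    # Sequence exhausted for this millisecond: borrow the next one.
                    now = self.last_ms + 1
            else:
                self.sequence = 0
            self.last_ms = now
            timestamp = now - EPOCH_MS
            return (
                (timestamp << (WORKER_BITS + SEQUENCE_BITS))
                | (self.worker_id << SEQUENCE_BITS)
                | self.sequence
            )


gen = SnowflakeGenerator(worker_id=1)
first, second = gen.next_id(), gen.next_id()
print(first < second)
```

IDs from one worker are strictly increasing. IDs from different workers sort only as well as their clocks agree, which is why a Lamport-style counter or a per-conversation sequencer is the stricter option.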
Design Breakdown
Functional Requirements
Users can send and receive 1-on-1 messages in real-time.
Users can participate in group chats (up to 500 members).
Users can see the online/offline status of friends.
Message history is persisted and searchable.
Delivery status (Sent, Delivered, Read) notifications.
Non-Functional Requirements
Low Latency: Sub-100ms delivery for online users.
High Availability: 99.99% uptime; no single point of failure.
Scalability: Horizontal scaling of connection handlers and storage.
Reliability: At-least-once delivery guarantee.
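At-least-once delivery means a retried message can arrive more than once, so the receiving side typically deduplicates. The sketch below assumes the client attaches an idempotency_key to each message; the class name, the in-memory dictionary, and the integer message IDs are hypothetical stand-ins for a real store.

```python
class IdempotentReceiver:
    """Stores each message once and re-ACKs duplicates with the original ID."""

    def __init__(self):
        self._seen = {}    # idempotency_key -> message_id assigned on first receipt
        self.stored = []   # stand-in for the persistent message table

    def receive(self, idempotency_key, payload):
        if idempotency_key in self._seen:
            # Duplicate caused by a client retry: acknowledge, do not store again.
            return self._seen[idempotency_key]
        message_id = len(self.stored) + 1
        self.stored.append((message_id, payload))
        self._seen[idempotency_key] = message_id
        return message_id


receiver = IdempotentReceiver()
ack1 = receiver.receive("key-123", "hello")
ack2 = receiver.receive("key-123", "hello")  # retry after a lost ACK
print(ack1 == ack2, len(receiver.stored))
```

A production version would bound the key table, for example by expiring keys after the client's maximum retry window; that window is not specified in this design.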
Estimation
DAU: 100M.
Messages/Day: 100M users * 50 msgs = 5 Billion msgs/day.
Write QPS: 5B / 86,400s ≈ 60k msgs/sec. Peak ≈ 120k/sec.
Storage: 5B msgs * 100 bytes ≈ 500GB/day ≈ 180TB/year (excluding replication).
Connections: 10M concurrent WebSockets. Each server (16GB RAM) can handle ~50k-100k connections → ~100-200 Gateway nodes.
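The arithmetic behind these estimates can be checked with a short script. The 2x peak factor is inferred from the 60k and 120k figures above; the node counts are the raw division of connections by per-server capacity, before any headroom.

```python
import math

DAU = 100_000_000
MSGS_PER_USER_PER_DAY = 50
BYTES_PER_MSG = 100
CONCURRENT_CONNECTIONS = 10_000_000
PEAK_FACTOR = 2  # assumption inferred from the 60k average vs 120k peak

msgs_per_day = DAU * MSGS_PER_USER_PER_DAY
avg_qps = msgs_per_day / 86_400
peak_qps = avg_qps * PEAK_FACTOR
storage_per_day_gb = msgs_per_day * BYTES_PER_MSG / 1e9
storage_per_year_tb = storage_per_day_gb * 365 / 1e3

# Gateway nodes at the optimistic and conservative per-server capacities.
nodes_low = math.ceil(CONCURRENT_CONNECTIONS / 100_000)
nodes_high = math.ceil(CONCURRENT_CONNECTIONS / 50_000)

print(f"messages/day: {msgs_per_day:,}")            # 5,000,000,000
print(f"average QPS:  {avg_qps:,.0f}")              # 57,870
print(f"peak QPS:     {peak_qps:,.0f}")             # 115,741
print(f"storage/day:  {storage_per_day_gb:.0f} GB")   # 500 GB
print(f"storage/year: {storage_per_year_tb:.1f} TB")  # 182.5 TB
print(f"gateway nodes: {nodes_low}-{nodes_high}")   # 100-200
```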
Blueprint
Concise Summary: A WebSocket-based microservices architecture utilizing a distributed session store for routing and a wide-column NoSQL database for performant message sequencing.
Major Components:
WebSocket Gateway: Manages persistent bi-directional connections and heartbeats.
Session Store: A distributed K-V store mapping UserID to GatewayID.
Chat Service: Handles message logic, validation, and persistence.
Presence Service: Tracks user online/offline status via heartbeats.
Message Bus: Handles asynchronous fan-out for group messages and background tasks.
Simplicity Audit: This design avoids complex Service Mesh or Global Transaction coordinators, focusing on horizontal scaling of stateless services and a battle-tested NoSQL storage layer.
Architecture Decision Rationale:
Why this architecture?: Decoupling the connection layer (Gateway) from the business logic (Chat Service) allows independent scaling of I/O-bound and CPU-bound tasks.
Functional Satisfaction: Covers real-time delivery, history, and presence.
Non-functional Satisfaction: Redis provides sub-millisecond session lookups; Cassandra provides high-throughput writes for message history.
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery & Traffic Routing:
Latency-based DNS routing to direct users to the nearest regional data center.
L7 Load Balancer handles SSL termination and WebSocket upgrades.
Security:
WAF for DDoS protection.
JWT-based authentication validated at the API Gateway.
Service
Topology & Scaling:
WS Gateway: Stateful (holds connections), but any instance can be located through the Session Store, so instances can be added or removed freely. Scaled based on open file descriptors and memory.
Chat Service: Stateless, scales based on CPU/Request count.
API Schema Design:
POST /v1/messages: Sends a message (REST/WebSocket).
GET /v1/history?convoId=xxx&cursor=yyy: Fetches paginated history.
PUT /v1/presence: Manual status update.
Resilience:
Exponential backoff for client reconnections.
Circuit breakers on the Chat Service to prevent cascading failure if the DB slows down.
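A minimal sketch of the client reconnection delay schedule follows. Adding random jitter is an assumption on top of the plain exponential backoff named above (it keeps clients from reconnecting in lockstep after a Gateway restart), and the base and cap values are illustrative.

```python
import random


def backoff_delays(max_attempts, base=1.0, cap=30.0, rng=random.random):
    """Yield 'full jitter' delays: uniform in [0, min(cap, base * 2**attempt)]."""
    for attempt in range(max_attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng() * ceiling


# With rng fixed at 1.0 the delays are the un-jittered ceilings.
print(list(backoff_delays(6, rng=lambda: 1.0)))  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

A client would sleep for each yielded delay before its next reconnect attempt and reset the schedule once a connection succeeds.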
Observability:
Metrics: Connection count per Gateway, Message E2E latency.
Storage
Access Pattern:
90% writes (sending messages), 10% reads (loading history).
Sequential reads by conversation_id.
Database Table Design:
Messages Table:
conversation_id (Partition Key), message_id (Clustering Key - TimeUUID), sender_id, content, metadata.
This ensures all messages for one chat are physically co-located on disk, making history fetches extremely fast.
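The effect of this key layout can be modelled in memory: one sorted list per conversation_id, read backwards from a cursor. This is an illustrative model only, not Cassandra client code; the class name, integer message IDs, and page size are assumptions.

```python
import bisect
from collections import defaultdict


class MessageStore:
    """Models partition key = conversation_id, clustering key = message_id."""

    def __init__(self):
        # conversation_id -> rows kept sorted by message_id
        self._partitions = defaultdict(list)

    def insert(self, conversation_id, message_id, sender_id, content):
        bisect.insort(self._partitions[conversation_id], (message_id, sender_id, content))

    def history(self, conversation_id, cursor=None, limit=2):
        """Return up to `limit` rows older than `cursor`, newest first, plus the next cursor."""
        rows = self._partitions.get(conversation_id, [])
        # (cursor,) sorts before any row with the same message_id, so the slice excludes it.
        end = len(rows) if cursor is None else bisect.bisect_left(rows, (cursor,))
        page = rows[max(0, end - limit):end][::-1]
        next_cursor = page[-1][0] if page and end - limit > 0 else None
        return page, next_cursor


store = MessageStore()
for mid in (3, 1, 5, 2, 4):  # arrival order does not matter
    store.insert("convo-1", mid, "alice", f"msg {mid}")

page, cursor = store.history("convo-1")
print([row[0] for row in page], cursor)
```

Each page touches one partition and reads a contiguous range, which is the access pattern the clustering key is meant to serve.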
Technical Selection:
Cassandra / ScyllaDB: Chosen for its masterless architecture and ability to handle massive write volumes with tunable consistency (Local Quorum for balance).
Distribution Logic:
Sharded by conversation_id. For very large groups, a composite key conversation_id + bucket_id (e.g., by month) is used to keep any single partition from growing too large.
Cache
Purpose: Session management and Presence tracking.
Key-Value Schema:
sess:{user_id} -> {gateway_id} (TTL: 30s, refreshed by heartbeat).
pres:{user_id} -> {status, last_seen}.
Technical Selection: Redis Cluster.
Rationale: High throughput, supports Pub/Sub for presence updates, and provides TTL mechanisms for automatic session expiration.
Failure Handling: If Redis fails, users may appear offline, but the Chat Service can fall back to broadcasting to all Gateways (expensive) or simply reject the message until failover completes.
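The heartbeat-refreshed session entry described in this section can be sketched with an in-memory stand-in for Redis. The class and method names are hypothetical; in Redis itself the same behaviour would come from writing the key with an expiry (for example SET with the EX option) on every heartbeat.

```python
import time


class SessionStore:
    """In-memory model of sess:{user_id} -> gateway_id with a TTL."""

    def __init__(self, ttl_seconds=30, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # user_id -> (gateway_id, expires_at)

    def heartbeat(self, user_id, gateway_id):
        # Each heartbeat rewrites the entry and pushes the expiry forward.
        self._entries[user_id] = (gateway_id, self.clock() + self.ttl)

    def lookup(self, user_id):
        entry = self._entries.get(user_id)
        if entry is None:
            return None
        gateway_id, expires_at = entry
        if self.clock() >= expires_at:
            # Missed heartbeats: treat the user as disconnected.
            del self._entries[user_id]
            return None
        return gateway_id


now = [0.0]  # fake clock so the example is deterministic
sessions = SessionStore(ttl_seconds=30, clock=lambda: now[0])
sessions.heartbeat("user-1", "gw-7")
now[0] = 29.0
print(sessions.lookup("user-1"))  # gw-7
now[0] = 31.0
print(sessions.lookup("user-1"))  # None
```

The TTL bounds how long a crashed connection can keep looking online: at most one TTL period after the last heartbeat.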
Messaging
Purpose: Decoupling the delivery to offline users and handling group fan-out.
Event Schema:
{type: MSG_SEND, sender: uid, convo_id: cid, payload: {...}}.
Throughput & Partitioning:
Kafka: Partitioned by conversation_id to ensure message ordering within a single chat.
Failure Handling: Dead Letter Queues (DLQ) for messages that fail to be stored or delivered after retries.
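Keying by conversation_id preserves per-chat order because every event with the same key lands in the same partition, and a partition is an append-only log. The sketch below models that with plain lists; CRC32 is a stand-in hash chosen for illustration and is not claimed to be the hash a Kafka producer uses.

```python
import zlib


def partition_for(conversation_id: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (CRC32 is an illustrative choice)."""
    return zlib.crc32(conversation_id.encode("utf-8")) % num_partitions


def route(events, num_partitions):
    """Append each event to the partition chosen by its convo_id."""
    partitions = [[] for _ in range(num_partitions)]
    for event in events:
        partitions[partition_for(event["convo_id"], num_partitions)].append(event)
    return partitions


# Two conversations interleaved on the wire.
events = [{"convo_id": c, "seq": i} for i in range(3) for c in ("chat-a", "chat-b")]
for index, log in enumerate(route(events, num_partitions=4)):
    if log:
        print(index, [(e["convo_id"], e["seq"]) for e in log])
```

Ordering holds only within a conversation; events from different chats in different partitions have no defined relative order, which is acceptable here.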
Wrap Up
Advanced Topics
Trade-offs (PACELC): Under a network partition we favor Availability over Consistency (PA); in normal operation we favor low latency over strict consistency (EL). In a network partition, we allow messages to be sent even if presence is slightly stale. We use Eventual Consistency for delivery markers.
Reliability: To guarantee "at-least-once" delivery, the client expects an ACK from the Gateway. If no ACK is received, the client retries with the same idempotency_key.
Bottleneck - Group Fan-out: For a group of 500, one message results in 500 writes/lookups. For the MVP, we do this serially via workers. For "Celebrity" scale (1M members), we would move to a "Pull-on-Demand" model where only active users fetch the message.
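The per-member lookups in the MVP fan-out can be sketched as a planning step that resolves each recipient to a Gateway. Grouping recipients by Gateway so each instance receives one batch is an assumption added here, as is routing users with no session to an offline list for the push path.

```python
from collections import defaultdict


def plan_fanout(member_ids, sender_id, sessions):
    """Split group members into per-Gateway batches and an offline list.

    `sessions` is a plain dict of user_id -> gateway_id standing in for the
    session store; users missing from it are treated as offline.
    """
    by_gateway = defaultdict(list)
    offline = []
    for user_id in member_ids:
        if user_id == sender_id:
            continue  # the sender already has the message
        gateway_id = sessions.get(user_id)
        if gateway_id is None:
            offline.append(user_id)
        else:
            by_gateway[gateway_id].append(user_id)
    return dict(by_gateway), offline


sessions = {"u2": "gw-a", "u3": "gw-a", "u5": "gw-b"}
batches, offline = plan_fanout(["u1", "u2", "u3", "u4", "u5"], "u1", sessions)
print(batches)  # {'gw-a': ['u2', 'u3'], 'gw-b': ['u5']}
print(offline)  # ['u4']
```

For a 500-member group this is still 500 lookups, but the number of Gateway calls drops to at most the number of distinct Gateways holding members.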
Security: End-to-End Encryption (E2EE) using the Signal Protocol could be added, but it requires the Chat Service to only store encrypted blobs, moving key management to the clients.