The Question
DesignReal-time Scalable Messaging System Design
Design a globally distributed, real-time messaging platform similar to Facebook Messenger. The system must support persistent connections for low-latency delivery, multimodal content storage, and multi-device synchronization. Address how you would handle message ordering, delivery guarantees (Sent/Delivered/Read statuses), and efficient storage for billions of messages while maintaining high availability and security.
WebSocket
Cassandra
Redis
Kafka
S3
Questions & Insights
Clarifying Questions
Scale and Traffic: What is the target Daily Active User (DAU) count and peak message throughput?
Assumption: 500M DAU, averaging 20 messages/user/day (~10B messages/day). Peak QPS is ~250k.
Group Chats: Should the MVP support group chats, and if so, what is the size limit?
Assumption: Yes, group chats up to 250 participants are supported.
Encryption: Does "encryption" imply End-to-End Encryption (E2EE) like Signal, or Encryption-at-Rest/Transit?
Assumption: TLS for transit and AES-256 for rest. E2EE is a premium/toggle feature, but standard storage is used for multi-device sync history.
Message Retention: Is message history stored indefinitely or ephemeral?
Assumption: Indefinitely, requiring a scalable cold storage strategy.
Device Sync: How many concurrent devices per user?
Assumption: Max 5-10 devices (e.g., Phone, Tablet, Laptop).
Thinking Process
Connection Management: Utilize persistent WebSockets to maintain full-duplex communication for real-time delivery and status indicators.
State Synchronization: How does a device know it's missing messages? Use monotonically increasing sequence IDs (Sequence Numbers) per conversation rather than just timestamps to ensure gap detection.
Delivery Flow: A message is not considered "sent" until persisted in the database, and not "delivered" until the recipient's gateway acknowledges receipt.
Multi-Device Fan-out: When User A sends a message to User B, the system must push that message to all of User B’s active sessions and all of User A’s other active sessions for synchronization.
Bonus Points
WhatsApp-style E2EE with Signal Protocol: Implementing Double Ratchet algorithm for multi-device E2EE consistency.
Write Path Optimization: Using a Wide-Column store (Cassandra) with a composite key
(conversation_id, bucket_id, sequence_id) to ensure sequential disk I/O and fast range scans for history.Presence Optimization: For massive group chats, don't push presence updates for every user. Use a "lazy-loading" presence model or only show presence for the top 10 most active participants to save bandwidth.
Edge Terminated WebSockets: Terminate SSL/WS connections at edge PoPs (Points of Presence) to reduce handshake latency.
Design Breakdown
Functional Requirements
Real-time 1-on-1 and Group Messaging.
Multimodal support (Images/Videos/Voice via Object Storage).
Message Status: Sent, Delivered, Read.
Real-time sync across multiple logged-in devices.
Searchable message history.
Non-Functional Requirements
Low Latency: < 100ms for message delivery in optimal conditions.
High Availability: 99.99% (Global distribution).
Reliability: No message loss; guaranteed delivery via retries and ACKs.
Security: TLS 1.3 in transit; Encryption at rest.
Estimation
DAU: 500M.
Messages/Day: 10 Billion.
Avg Message Size: 100 bytes (text) + metadata.
Daily Storage: 10B * 150 bytes ≈ 1.5 TB/day.
Annual Storage: ~550 TB (without media).
Media: If 5% of messages are media (avg 500KB), that's 500M * 500KB = 250 TB/day. Requires aggressive TTL or cold storage for old media.
Bandwidth: 115k msgs/sec * 150 bytes ≈ 17 MB/s (Inbound), much higher for media.
Blueprint
Concise Summary: A WebSocket-based gateway architecture that bridges real-time clients to a distributed message service backed by a wide-column NoSQL database for high-throughput writes and sequential reads.
Major Components:
WebSocket Gateway: Manages persistent connections and maps UserIDs to specific server nodes.
Message Service: Handles business logic, message persistence, and fan-out orchestration.
Media Service: Handles multi-part uploads to S3 and generates CDN-friendly URLs.
Presence/Sync Service: Tracks user heartbeats and manages session-specific delivery queues.
Simplicity Audit: This architecture avoids complex distributed locking by using conversation-level sequencing and relies on a "Push-Pull" hybrid for sync, which is the industry standard for reliability.
Architecture Decision Rationale:
Why this architecture?: WebSockets are essential for the "Real-time" requirement. Cassandra is chosen because chat is write-heavy and naturally ordered by time/sequence.
Functional Satisfaction: Direct message routing meets 1-on-1/Group needs; Object storage handles media.
Non-functional Satisfaction: Decoupling the Gateway from the Message Service allows scaling connection handling independently from business logic.
High Level Architecture
Sub-system Deep Dive
Service
WebSocket Gateway:
Statefull nodes. Uses a "Sticky Session" approach.
Responsibilities: Heartbeats, TLS termination, message protocol (Protobuf/Thrift) serialization.
Message Service:
Stateless. Orchestrates the flow.
API:
sendMessage(recipient_id, content_type, body)fetchHistory(conversation_id, cursor, limit)updateStatus(message_id, status)Multi-Device Sync: Every device maintains a
last_synced_sequence_id. Upon reconnection, the client asks for all messages where sequence_id > last_synced_sequence_id.Storage
Data Model (Cassandra):
Table:
messages_by_conversationPartition Key:
conversation_idClustering Column:
sequence_id (Descending), message_idColumns:
sender_id, payload, timestamp, metadata_refIndexing: We do not index message content in Cassandra; ElasticSearch is used if full-text search is required (post-MVP).
Sequence ID: Uses a central service (e.g., Redis
INCR or a Snowflake-variant) per conversation to ensure no gaps in message ordering.Cache
Redis (Presence): Stores
user_id -> {status: "online/offline", last_active: timestamp, gateway_node_id: "ip-10-0-0-1"}.TTL: Presence keys expire after 30-60 seconds if no heartbeat is received from the Gateway.
Session Mapping: Crucial for routing. When User A sends to User B, Message Service looks up User B's
gateway_node_id in Redis.Messaging
Message Queue (Kafka):
Used for "Fan-out". When a message is sent to a group of 100, the Message Service pushes 1 event to Kafka.
A Consumer group handles the fan-out: identifying all 100 members, checking their online status, and pushing to their respective Gateways.
Guarantees: At-least-once delivery. Idempotency keys on the client handle duplicates.
Wrap Up
Advanced Topics
Monitoring:
Metrics: Socket count per gateway, Message E2E latency, Cassandra write throughput.
Tools: Prometheus for metrics, Jaeger for tracing message flow across microservices.
Trade-offs:
Consistency vs. Availability: We choose Availability (AP). In a partition, a user might see messages slightly out of order if the sequence service is unreachable, but the system remains functional.
Bottlenecks: The sequence ID generator can become a hotspot.
Optimization: Use coarse-grained timestamps with a tie-breaker for the MVP, or range-based ID allocation.
Failure Handling:
Gateway Failure: If a Gateway node dies, clients detect the socket drop and reconnect to a different node via the Load Balancer. Redis presence is updated via heartbeat timeout.
Alternatives:
PostgreSQL vs. Cassandra: For a smaller scale, Postgres with Partitions is easier to manage. At 10B messages/day, Cassandra’s LSM-tree architecture is superior for write-heavy workloads.