The Question

Scalable Real-Time Messaging System

Design a globally distributed messaging platform similar to Facebook Messenger. The system must support real-time delivery to billions of users, multimodal attachments (images/video), and seamless synchronization across multiple devices (mobile, web). Key requirements include low-latency message delivery, tracking delivery/read status, and high availability. Discuss how you handle millions of persistent connections, ensure message ordering, and manage storage for trillions of historical messages while maintaining reliability during network partitions.

WebSockets

Cassandra

Redis

Kafka

CDN

gRPC

TLS

AES-256

FCM

APNS

Questions & Insights

Clarifying Questions

Scale and Traffic: What is the target scale in terms of Daily Active Users (DAU) and average messages per user?

Assumption: 1 Billion DAU, average 50 messages/user/day (Total 50B msgs/day).

Multi-device Sync: Should messages be synchronized in real-time across all active sessions, and how do we handle "catch-up" for devices that were offline?

Assumption: Real-time sync via persistent connections for online devices; sequence-number based "catch-up" for offline devices.

Multimodal Data: What is the size limit for media (images/videos) and how are they handled relative to the text message?

Assumption: Media is uploaded to object storage; the message contains a metadata reference/URL.

Encryption: Is this End-to-End Encryption (E2EE) like WhatsApp, or server-side encryption like standard Messenger?

Assumption: MVP will use Server-side encryption (AES-256 at rest, TLS in transit) to simplify multi-device sync and server-side search.

Message Status: Do we need to track "Sent", "Delivered", and "Read" statuses?

Assumption: Yes, these are critical for the user experience.

Thinking Process

Core Bottleneck: Maintaining millions of concurrent persistent connections while ensuring low-latency delivery and reliable synchronization across multiple devices.

Key Questions for Architecture:

How do we manage the lifecycle of millions of WebSocket connections efficiently? (Session Gateway + Service Discovery).

How do we ensure message ordering and exactly-once delivery across multiple devices? (Sequence IDs + Idempotency keys).

How do we handle high-write throughput (500k+ QPS) for message persistence? (LSM-tree based NoSQL).

How do we decouple real-time delivery from background tasks like push notifications and analytics? (Message Queue/Event Bus).

Bonus Points

Presence Optimization: Using a "Lazy Pull" or "Gossip Protocol" for presence to avoid

O(N^2)

broadcast storms in large groups.

Operational Transformation (OT) or CRDTs: While usually for collaborative editing, using version vectors for conflict-free message state resolution across devices.

Edge Terminated WebSockets: Using Global Accelerator or Anycast IP to terminate WebSocket connections closer to the user to reduce TCP/TLS handshake latency.

Hot-Partition Mitigation: Implementing sub-sharding or caching for extremely active group chats (e.g., celebrity accounts or public channels).

Design Breakdown

Functional Requirements

Core Use Cases:

One-on-one and Group real-time messaging.

Multi-device synchronization (Phone, Tablet, Web).

Multimodal support (Images, Video, Audio).

Message status indicators (Sent, Delivered, Read).

Presence indicators (Online/Offline).

Scope Control:

In-scope: Real-time delivery, persistence, media metadata, sync logic.

Out-of-scope: Voice/Video calling (WebRTC), Message Search, Story features.

Non-Functional Requirements

Scale: Support 1B DAU and 1M+ Peak QPS.

Latency: End-to-end message delivery < 200ms (P99) under normal conditions.

Availability & Reliability: 99.99% availability; messages must not be lost once "Sent" status is acknowledged by the server.

Consistency: Eventual consistency for message history across devices, but strict ordering within a single conversation.

Fault Tolerance: Automatic failover of WebSocket Gateways without losing message state.

Security: TLS 1.3 for data-in-transit; AES-256 for data-at-rest.

Estimation

Traffic Estimation:

50B messages/day

\approx

580k Average QPS.

Peak QPS

\approx

1.2M (assuming 2x average).

Storage Estimation:

Avg message size: 100 bytes (text + metadata).

50B msgs * 100 bytes = 5 TB/day.

1.8 PB/year.

Bandwidth Estimation:

Ingress: 5 TB/day

\approx

460 Mbps (text only). Media adds significantly more (e.g., 500 MB/day/user

\approx

massive scale requiring CDN).

Blueprint

Concise Summary: A WebSocket-based architecture using a distributed Gateway layer for real-time duplex communication, backed by a high-performance NoSQL store (Cassandra) for message persistence and Redis for session/presence management.

Major Components:

WebSocket Gateway: Manages persistent connections and maps User IDs to specific server instances.

Message Service: Handles message logic, validation, and triggers persistence.

Presence Service: Tracks user online/offline status using heartbeats.

Media Service: Handles multimodal uploads to S3 and generates CDN-optimized URLs.

Notification Service: Sends out-of-band push notifications (FCM/APNS) for offline users.

Simplicity Audit: This design avoids complex E2EE key management and focus on a robust "Store-and-Forward" mechanism which is the most reliable way to handle multi-device sync for an MVP.

Architecture Decision Rationale:

Cassandra: Chosen for its high-write throughput and linear scalability, ideal for message logs.

Redis: Necessary for sub-millisecond lookups of session locations and presence.

Kafka: Decouples the critical path of message delivery from heavy processing tasks.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing:

CDN: Used for caching media (images/videos).

L7 Load Balancer: Handles SSL termination and routes traffic based on protocol (HTTP/gRPC for API, WebSockets for Messaging).

Security & Perimeter:

API Gateway: Handles AuthN/AuthZ (JWT validation), rate limiting, and request logging.

Service

Topology & Scaling: Stateless microservices deployed in Multi-AZ clusters. WebSocket Gateways are "sticky" but stateless; session info is stored in Redis.

API Schema Design:

sendMessage(receiver_id, content_type, body, idempotency_key): gRPC/WebSocket.

fetchMessages(conversation_id, last_seq_id): REST.

updateStatus(message_id, status): WebSocket.

Resilience & Reliability: Exponential backoff on client retries. Gateways send "Server ACKs" to clients immediately upon receiving a message to confirm persistence.

Storage

Access Pattern: Heavy writes (new messages), heavy reads for new sessions (syncing history), very few updates (status changes).

Database Table Design (Cassandra):

messages: partition_key: conversation_id, clustering_key: sequence_id (desc).

Fields: message_id, sender_id, content, media_url, timestamp.

Technical Selection: Cassandra.

Rationale: LSM-tree architecture handles massive write volume better than B-Tree (RDBMS). Wide-column store allows efficient retrieval of the latest

N

messages.

Distribution Logic: Partitioned by conversation_id to ensure all messages for a chat are co-located on a node, ensuring strict ordering during reads.

Cache

Purpose & Justification:

Presence Cache: Stores user_id -> last_active_timestamp.

Session Store: Stores user_id -> gateway_id so the Message Service knows where to route incoming real-time messages.

Technical Selection: Redis.

Rationale: High-speed KV lookups and TTL support for heartbeat management.

Messaging

Purpose & Decoupling: Kafka acts as the backbone to ensure that even if the Storage Worker is slow, the real-time delivery path remains unaffected.

Event / Topic Schema: message-events: {sender_id, receiver_id, payload, timestamp}.

Technical Selection: Kafka.

Rationale: High throughput, durability, and ability to replay events if a downstream service fails.

Wrap Up

Advanced Topics

Trade-offs:

Consistency vs Availability: Choosing AP (Availability/Partition Tolerance). During network partitions, some devices might see messages slightly out of order until the sync process reconciles them.

Reliability & Failure Handling:

If a WebSocket Gateway fails, the client detects the disconnect and reconnects to a new instance. The new Gateway fetches the "missed" messages from the Message DB based on the last sequence_id held by the client.

Multi-Device Sync Strategy:

Every message is assigned a monotonically increasing sequence_id per user/conversation.

When a device comes online, it sends its last_received_id. The server sends all messages where id > last_received_id.

Security:

mTLS: For service-to-service communication.

Encryption at rest: Database volumes are encrypted; sensitive fields (message body) can be encrypted with application-level keys.