The Question

Real-Time Messaging System for Millions of Concurrent Users

Design a globally scalable real-time messaging application similar to WhatsApp or Slack. The system must support millions of concurrent WebSocket connections, ensure message persistence with high write throughput, and provide real-time presence tracking (online/offline status). Detail how you would handle message delivery for users who are currently offline, ensure message ordering, and scale the session management layer to handle millions of active users while maintaining low latency.

WebSocket

Redis

Cassandra

ScyllaDB

Kafka

gRPC

NoSQL

Pub/Sub

JWT

TLS

Questions & Insights

Clarifying Questions

What is the peak number of concurrent users (CCU)? Knowing if it's 1M or 10M changes the connection management strategy significantly. Assumption: 2 million CCU.

What is the message volume and average size?Assumption: Average message is 500 bytes; peak throughput is 100k messages/second.

Are group chats supported, and if so, what is the maximum group size? Large groups cause "write amplification" challenges. Assumption: Support groups up to 500 members; 1-to-1 is the primary use case.

What are the requirements for message delivery guarantees?Assumption: At-least-once delivery with "delivered" and "read" receipts.

How long should we persist message history?Assumption: Indefinite persistence for the MVP.

Thinking Process

Connection Management: How do we maintain millions of long-lived WebSocket connections across a distributed fleet without overwhelming the system?

Message Routing: How does the system efficiently find which server a specific recipient is connected to (the "Where is Alice?" problem)?

Presence Scalability: How do we update and broadcast "Online" status without creating a quadratic traffic storm (

O(N^2)

Reliable Offline Handling: How do we transition seamlessly from real-time WebSocket delivery to asynchronous Push Notifications when a user disconnects?

Bonus Points

Message Sequence Guarantees: Using client-side monotonic sequence IDs and server-side vector clocks or Lamport timestamps to ensure message ordering despite network jitter.

Connection Draining: Implementing "GoAway" frames in WebSockets to gracefully migrate users during deployments without causing a massive "thundering herd" reconnect.

Presence Optimization: Using a "Pull-on-Demand" or "Lazy Presence" model for large groups/rosters to avoid broadcasting state changes to millions of idle users.

Hot Partition Mitigation: Implementing bucketing or caching for extremely active group chats to prevent single-node database bottlenecks.

Design Breakdown

Functional Requirements

Core Use Cases:

One-to-one and small group real-time messaging.

Message persistence (chat history).

Real-time presence (Online/Offline/Last Seen).

Reliable delivery to offline users via Push Notifications.

Scope Control:

In-scope: Text messaging, presence, delivery receipts.

Out-of-scope: End-to-end encryption (E2EE) key management, video/voice calls, large file attachments (handled via S3/CDN).

Non-Functional Requirements

Scale: Support 10M Daily Active Users (DAU) and 2M Concurrent Connections.

Latency: End-to-end message delivery < 200ms (99th percentile).

Availability & Reliability: 99.99% availability; zero message loss once acknowledged by the server.

Consistency: High availability over strong consistency (eventual consistency for history).

Security & Privacy: TLS for all connections; authentication via JWT.

Estimation

Traffic Estimation:

10M DAU, each sending 20 msgs/day = 200M msgs/day.

Average QPS: ~2,300 msgs/sec.

Peak QPS (5x): ~11,500 msgs/sec.

Storage Estimation:

200M msgs/day * 500 bytes = 100 GB/day.

1 Year = 36.5 TB (excluding replication/indexing).

Bandwidth Estimation:

11,500 msgs/sec * 500 bytes = 5.75 MB/s (Inbound). Outbound is higher due to delivery receipts and presence updates.

Blueprint

The architecture uses a Gateway + WebSocket Service pattern to maintain persistent connections, coupled with a Message Service for business logic and a Presence Service backed by an in-memory store.

WebSocket Gateway: Manages stateful TCP connections and maps UserIDs to specific Gateway instances.

Message Service: Orchestrates message validation, persistence, and routing.

Presence Service: Tracks heartbeats and broadcasts status via Pub/Sub.

Simplicity Audit: We avoid complex distributed locking by using a "last-write-wins" approach for status and standard NoSQL partitioning for messages.

Architecture Decision Rationale:

Why?: WebSockets are essential for low-latency bi-directional communication. NoSQL (Cassandra/ScyllaDB) is chosen for its superior write-throughput and easy horizontal scaling for message history.

Functional Satisfaction: Covers real-time, history, and presence.

Non-functional Satisfaction: Scalable via horizontal service addition; Redis provides the low latency needed for presence.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: DNS with latency-based routing (GeoDNS) to direct users to the nearest regional data center.

Security & Perimeter: L7 Load Balancer (e.g., NGINX/Envoy) performing SSL termination.

Rate Limiting: IP-based and User-ID based rate limiting at the Gateway to prevent API abuse and DoS attacks.

Service

WS Gateway Service:

Topology: Stateful fleet. Each node maintains a hash map of UserID -> WebSocketObject.

Scaling: Scales on "Number of open connections" and "Memory consumption."

Session Registry: A distributed Redis store mapping UserID -> GatewayNodeID. When User A sends a message to User B, the system queries this registry to find which Gateway node holds User B's connection.

Message Service:

API Schema:

sendMessage(recipient_id, content, client_msg_id) -> status: 202 Accepted.

Protocol: gRPC for internal service communication.

Idempotency: client_msg_id ensures retries don't create duplicate messages.

Presence Service:

Tracks user status via heartbeats sent over the WebSocket every 30 seconds.

If a heartbeat is missed (TTL expires), the user is marked as "Offline."

Storage

Access Pattern: High write (new messages), high read (fetching history).

Database Table Design (NoSQL):

messages Table:

channel_id (Partition Key): ID of the 1-to-1 chat or group.

message_id (Clustering Key): Time-UUID for chronological ordering.

sender_id, content, metadata.

Technical Selection: Cassandra / ScyllaDB. Optimized for LSM-tree based writes, perfect for immutable chat logs.

Distribution Logic: Partitioning by channel_id ensures all messages for a specific conversation are co-located, making history fetches extremely fast.

Cache

Purpose: Presence Tracking and Session Registry.

Key-Value Schema:

presence:{user_id} -> status: online, last_seen: timestamp.

session:{user_id} -> gateway_node_id.

Technical Selection: Redis. Use Redis Pub/Sub for broadcasting status changes to the friends/roster of an online user.

Failure Handling: If Redis fails, presence defaults to "Offline." On recovery, users' next heartbeats will repopulate the state.

Messaging

Purpose: Decoupling the "Message Delivery" flow from the "Push Notification" flow.

Event Schema: topic: offline-notifications, payload: {recipient_id, sender_name, snippet, msg_id}.

Technical Selection: Kafka or AWS SQS.

Failure Handling: Dead-letter queues (DLQ) for notifications that fail after multiple retries to third-party providers (APNS/FCM).

Data Processing

Push Worker: A fleet of consumers reading from the Messaging Layer.

Logic: It checks if the user is truly offline (double-checks Presence Service) before triggering a third-party Push Notification request.

Technical Selection: Custom Golang/Java microservice for high-concurrency I/O.

Infrastructure (Optional)

Observability: Prometheus metrics for "Connection Count" and "Message Latency." Jaeger for tracing a message from Sender -> Gateway -> Message Service -> Recipient Gateway -> Receiver.

Distributed Coordination: Not strictly needed for MVP; Gateway nodes are autonomous, and session state is in Redis.

Wrap Up

Advanced Topics

Trade-offs (PACELC): We choose Availability over Consistency (AP). If a presence update is delayed by 1 second, it's acceptable. If a message history fetch is eventually consistent, it's fine as long as the real-time delivery is fast.

Reliability:

Ack/Retry: Client sends message -> Server writes to DB -> Server returns "Server-Ack" -> Server pushes to recipient -> Recipient sends "Client-Ack."

If "Server-Ack" is not received, the client retries with the same client_msg_id.

Bottleneck Analysis:

Hot Partitions: A celebrity with millions of followers. Presence updates for them would be "Lazy" (only fetched when someone opens the chat).

Redis Load: Using Redis Clusters to shard the Presence/Session data by user_id.

Security: All traffic over WSS (WebSocket Secure). JWT tokens validated at the Gateway.