The Question

Design a Scalable Real-time Instant Messaging System

Design an end-to-end instant messaging system capable of supporting 10 million daily active users with sub-100ms message delivery latency. The system must handle 1-on-1 and small group chats, persistent message history, and real-time presence (online/offline status). Focus your design on how to maintain millions of concurrent connections, ensure reliable message delivery even when users are offline, and select a storage strategy that scales with billions of messages per month. Discuss the trade-offs between consistency and availability in the context of message ordering.

WebSockets

Redis

Cassandra

Kafka

gRPC

NoSQL

JWT

TLS 1.3

Prometheus

Questions & Insights

Clarifying Questions

Scale and Traffic: What is the target scale? (Assumption: 10M DAU, average 100 messages/user/day, totaling 1B messages daily).

Communication Type: Should we support 1-on-1, group chats, or both? (Assumption: 1-on-1 and small group chats up to 100 members).

Persistence: Do we need historical message sync across devices? (Assumption: Yes, permanent storage is required).

Real-time Requirements: What is the delivery latency threshold? (Assumption: < 100ms for message delivery).

Media Support: Are we handling images/videos for the MVP? (Assumption: Text only for MVP to prioritize core messaging reliability).

Thinking Process

Connection Management: How do we maintain millions of persistent connections efficiently? (Focus: WebSockets and Gateway Layer).

Message Delivery: How do we ensure a message reaches a recipient who is online vs. offline? (Focus: Presence Service and Push Notifications).

Storage Strategy: How do we store trillions of messages while keeping read latency low for chat history? (Focus: NoSQL/Wide-column stores).

Consistency & Ordering: How do we guarantee that messages appear in the same order for all participants? (Focus: Sequence IDs and Client-side sequencing).

Bonus Points

E2EE (End-to-End Encryption): Implementing the Signal Protocol for privacy, where the server only acts as a relay for encrypted blobs.

Connection Draining: Using a "Graceful Shutdown" mechanism for WebSocket servers to migrate millions of connections without causing a "Thundering Herd" on the Load Balancer.

Last-mile Reliability: Implementing an "ACK-ACK" protocol (Client-to-Server ACK and Server-to-Client ACK) to ensure 100% delivery under flaky network conditions.

Write-path Optimization: Utilizing a Log-Structured Merge (LSM) tree-based database (like Cassandra) to handle massive write bursts without locking.

Design Breakdown

Functional Requirements

Core Use Cases:

Users can send/receive 1-on-1 messages in real-time.

Users can see the online/offline status of friends (Presence).

Users can sync chat history when logging in from a new device.

Push notifications for offline users.

Scope Control:

In-scope: Text messaging, Presence, Delivery receipts (Sent/Delivered).

Out-of-scope: Video/Voice calls, large group chats (>1k), read receipts (MVP simplification), and media processing.

Non-Functional Requirements

Scale: Support 10M DAU and 1B messages per day.

Latency: End-to-end message delivery under 100ms for online users.

Availability & Reliability: 99.99% uptime; messages must not be lost once "Sent" status is confirmed.

Consistency: High availability over strong consistency (AP in CAP), but strict message ordering within a single conversation.

Fault Tolerance: Automatic reconnection and session resumption after server failure.

Estimation

Traffic Estimation:

1B messages / 86,400s

\approx

11,500 messages per second (Avg QPS).

Peak QPS (3x)

\approx

35,000 writes/sec.

Storage Estimation:

1B messages * 100 bytes

\approx

100 GB/day.

36.5 TB per year. NoSQL sharding is mandatory.

Bandwidth Estimation:

Incoming: 11,500 * 100 bytes

\approx

1.15 MB/s.

Outgoing (Fan-out for groups):

\approx

5-10 MB/s.

Blueprint

Concise Summary: A WebSocket-based architecture using a distributed Gateway layer for real-time duplex communication, backed by a NoSQL wide-column store for message persistence and Redis for presence tracking.

Major Components:

API Gateway: Handles authentication and routing for RESTful operations (e.g., login, profile).

WebSocket (WS) Service: Manages persistent connections and facilitates real-time message relay.

Presence Service: Tracks user heartbeats and online status in a low-latency cache.

Chat Service: Processes business logic, message ID generation, and fan-out for groups.

Message Store: Distributed database optimized for time-series write-heavy workloads.

Simplicity Audit: This design avoids complex microservices by grouping message routing and presence into focused, high-performance clusters. It uses Kafka for async task offloading (notifications) to keep the write path fast.

Architecture Decision Rationale:

Why this architecture?: WebSockets are the industry standard for low-latency bidirectional communication. Cassandra/DynamoDB provides the horizontal scalability needed for chat history.

Functional Satisfaction: Covers real-time delivery, history, and status.

Non-functional Satisfaction: High write throughput handled by NoSQL; low latency achieved via persistent WS connections.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: DNS uses latency-based routing to connect users to the nearest regional Data Center.

Security & Perimeter:

API Gateway: Handles JWT validation and TLS termination.

Rate Limiting: Applied at the user ID level to prevent spamming (e.g., 20 msgs/sec).

L7 Load Balancing: Sticky sessions are NOT required because the system uses a distributed mapping of UserID -> WebSocket Server in Redis.

Service

Topology & Scaling:

WS Gateway: Stateless regarding business logic but stateful regarding TCP connections. Scaled based on open file descriptors and memory.

Chat Service: Stateless, scales based on CPU/Request count.

API Schema Design:

Endpoint: POST /v1/messages (Internal/Fallback)

Protocol: gRPC for internal service-to-service; WebSockets for client-to-server.

Idempotency: Client generates a request_id (UUID) to prevent duplicate messages during retries.

Resilience & Reliability:

Backoff: Clients use exponential backoff with jitter for WS reconnection.

Circuit Breaker: Implemented on the Chat Service when the Message Store latency exceeds 500ms.

Observability: RED metrics (Rate, Error, Duration) for all gRPC calls. Tracking "Time to Delivered" as a key SLI.

Storage

Access Pattern:

Write:Read ratio is roughly 1:1.

Primary access is by chat_id and sorted by timestamp.

Database Table Design:

Table: messages

chat_id (Partition Key): UUID

message_id (Clustering Key): Time-ordered UUID (TimeUUID)

sender_id: UUID

content: Text

created_at: Timestamp

Technical Selection: Cassandra.

Rationale: Excellent for write-heavy workloads; wide-column store naturally supports "Message History" where rows are stored physically together by chat_id.

Distribution Logic:

Data is sharded by chat_id.

For very large groups (hot partitions), an extra shard factor can be added to the partition key (e.g., chat_id_N).

Cache

Purpose & Justification: Presence tracking requires sub-millisecond lookups. Using a database for "Last Seen" heartbeats every 30s would overwhelm the disk.

Key-Value Schema:

Key: presence:{user_id}, Value: online|offline|timestamp, TTL: 60s.

Key: location:{user_id}, Value: ws_server_id (To route incoming messages to the right server).

Technical Selection: Redis.

Failure Handling: If Redis fails, presence is temporarily unknown (fail-open). Users are treated as offline.

Messaging

Purpose & Decoupling: Offloads non-critical tasks like Push Notifications and Analytics from the real-time message path.

Event / Topic Schema:

Topic: message_events. Payload: {sender_id, recipient_id, content_preview, timestamp}.

Throughput & Partitioning: Kafka partitioned by recipient_id to ensure ordered notification processing.

Technical Selection: Kafka.

Rationale: High throughput and message replayability if the Notification Engine goes down.

Infrastructure (Optional)

Observability: Prometheus for metrics, Jaeger for tracing message flow across the Gateway and Chat Service.

Distributed Coordination: Not used for the MVP to maintain simplicity. Mapping is managed in Redis rather than a complex Zookeeper setup.

Wrap Up

Advanced Topics

Trade-offs (Consistency vs Availability): This is an AP system. In a network partition, we prioritize being able to send messages even if "Presence" status is slightly stale.

Reliability:

Sequence IDs: To handle out-of-order delivery over different network paths, the Chat Service assigns a monotonically increasing sequence number per chat_id.

Bottleneck Analysis:

WebSocket Max Connections: A single server can handle ~65k-100k connections. For 10M users, we need ~100-150 nodes.

Hot Partitions: Popular users or large groups are handled by sharding the group ID.

Security: TLS 1.3 for all connections. Database encryption at rest (AES-256).

Optimization:

Message Pull vs Push: Clients "Push" to the server via WS. Server "Pushes" to recipient. If the recipient is offline, the client "Pulls" history on reconnection using the last received message_id.