The Question
Design

Design a Scalable Real-time Instant Messaging System

Design an end-to-end instant messaging system capable of supporting 10 million daily active users with sub-100ms message delivery latency. The system must handle 1-on-1 and small group chats, persistent message history, and real-time presence (online/offline status). Focus your design on how to maintain millions of concurrent connections, ensure reliable message delivery even when users are offline, and select a storage strategy that scales with billions of messages per month. Discuss the trade-offs between consistency and availability in the context of message ordering.
WebSockets
Redis
Cassandra
Kafka
gRPC
NoSQL
JWT
TLS 1.3
Prometheus
Questions & Insights

Clarifying Questions

Scale and Traffic: What is the target scale? (Assumption: 10M DAU, average 100 messages/user/day, totaling 1B messages daily).
Communication Type: Should we support 1-on-1, group chats, or both? (Assumption: 1-on-1 and small group chats up to 100 members).
Persistence: Do we need historical message sync across devices? (Assumption: Yes, permanent storage is required).
Real-time Requirements: What is the delivery latency threshold? (Assumption: < 100ms for message delivery).
Media Support: Are we handling images/videos for the MVP? (Assumption: Text only for MVP to prioritize core messaging reliability).

Thinking Process

Connection Management: How do we maintain millions of persistent connections efficiently? (Focus: WebSockets and Gateway Layer).
Message Delivery: How do we ensure a message reaches a recipient who is online vs. offline? (Focus: Presence Service and Push Notifications).
Storage Strategy: How do we store trillions of messages while keeping read latency low for chat history? (Focus: NoSQL/Wide-column stores).
Consistency & Ordering: How do we guarantee that messages appear in the same order for all participants? (Focus: Sequence IDs and Client-side sequencing).

Bonus Points

E2EE (End-to-End Encryption): Implementing the Signal Protocol for privacy, where the server only acts as a relay for encrypted blobs.
Connection Draining: Using a "Graceful Shutdown" mechanism for WebSocket servers to migrate millions of connections without causing a "Thundering Herd" on the Load Balancer.
Last-mile Reliability: Implementing an "ACK-ACK" protocol (Client-to-Server ACK and Server-to-Client ACK) to ensure 100% delivery under flaky network conditions.
Write-path Optimization: Utilizing a Log-Structured Merge (LSM) tree-based database (like Cassandra) to handle massive write bursts without locking.
Design Breakdown

Functional Requirements

Core Use Cases:
Users can send/receive 1-on-1 messages in real-time.
Users can see the online/offline status of friends (Presence).
Users can sync chat history when logging in from a new device.
Push notifications for offline users.
Scope Control:
In-scope: Text messaging, Presence, Delivery receipts (Sent/Delivered).
Out-of-scope: Video/Voice calls, large group chats (>1k), read receipts (MVP simplification), and media processing.

Non-Functional Requirements

Scale: Support 10M DAU and 1B messages per day.
Latency: End-to-end message delivery under 100ms for online users.
Availability & Reliability: 99.99% uptime; messages must not be lost once "Sent" status is confirmed.
Consistency: High availability over strong consistency (AP in CAP), but strict message ordering within a single conversation.
Fault Tolerance: Automatic reconnection and session resumption after server failure.

Estimation

Traffic Estimation:
1B messages / 86,400s \approx 11,500 messages per second (Avg QPS).
Peak QPS (3x) \approx 35,000 writes/sec.
Storage Estimation:
1B messages * 100 bytes \approx 100 GB/day.
36.5 TB per year. NoSQL sharding is mandatory.
Bandwidth Estimation:
Incoming: 11,500 * 100 bytes \approx 1.15 MB/s.
Outgoing (Fan-out for groups): \approx 5-10 MB/s.

Blueprint

Concise Summary: A WebSocket-based architecture using a distributed Gateway layer for real-time duplex communication, backed by a NoSQL wide-column store for message persistence and Redis for presence tracking.
Major Components:
API Gateway: Handles authentication and routing for RESTful operations (e.g., login, profile).
WebSocket (WS) Service: Manages persistent connections and facilitates real-time message relay.
Presence Service: Tracks user heartbeats and online status in a low-latency cache.
Chat Service: Processes business logic, message ID generation, and fan-out for groups.
Message Store: Distributed database optimized for time-series write-heavy workloads.
Simplicity Audit: This design avoids complex microservices by grouping message routing and presence into focused, high-performance clusters. It uses Kafka for async task offloading (notifications) to keep the write path fast.
Architecture Decision Rationale:
Why this architecture?: WebSockets are the industry standard for low-latency bidirectional communication. Cassandra/DynamoDB provides the horizontal scalability needed for chat history.
Functional Satisfaction: Covers real-time delivery, history, and status.
Non-functional Satisfaction: High write throughput handled by NoSQL; low latency achieved via persistent WS connections.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: DNS uses latency-based routing to connect users to the nearest regional Data Center.
Security & Perimeter:
API Gateway: Handles JWT validation and TLS termination.
Rate Limiting: Applied at the user ID level to prevent spamming (e.g., 20 msgs/sec).
L7 Load Balancing: Sticky sessions are NOT required because the system uses a distributed mapping of UserID -> WebSocket Server in Redis.

Service

Topology & Scaling:
WS Gateway: Stateless regarding business logic but stateful regarding TCP connections. Scaled based on open file descriptors and memory.
Chat Service: Stateless, scales based on CPU/Request count.
API Schema Design:
Endpoint: POST /v1/messages (Internal/Fallback)
Protocol: gRPC for internal service-to-service; WebSockets for client-to-server.
Idempotency: Client generates a request_id (UUID) to prevent duplicate messages during retries.
Resilience & Reliability:
Backoff: Clients use exponential backoff with jitter for WS reconnection.
Circuit Breaker: Implemented on the Chat Service when the Message Store latency exceeds 500ms.
Observability: RED metrics (Rate, Error, Duration) for all gRPC calls. Tracking "Time to Delivered" as a key SLI.

Storage

Access Pattern:
Write:Read ratio is roughly 1:1.
Primary access is by chat_id and sorted by timestamp.
Database Table Design:
Table: messages
chat_id (Partition Key): UUID
message_id (Clustering Key): Time-ordered UUID (TimeUUID)
sender_id: UUID
content: Text
created_at: Timestamp
Technical Selection: Cassandra.
Rationale: Excellent for write-heavy workloads; wide-column store naturally supports "Message History" where rows are stored physically together by chat_id.
Distribution Logic:
Data is sharded by chat_id.
For very large groups (hot partitions), an extra shard factor can be added to the partition key (e.g., chat_id_N).

Cache

Purpose & Justification: Presence tracking requires sub-millisecond lookups. Using a database for "Last Seen" heartbeats every 30s would overwhelm the disk.
Key-Value Schema:
Key: presence:{user_id}, Value: online|offline|timestamp, TTL: 60s.
Key: location:{user_id}, Value: ws_server_id (To route incoming messages to the right server).
Technical Selection: Redis.
Failure Handling: If Redis fails, presence is temporarily unknown (fail-open). Users are treated as offline.

Messaging

Purpose & Decoupling: Offloads non-critical tasks like Push Notifications and Analytics from the real-time message path.
Event / Topic Schema:
Topic: message_events. Payload: {sender_id, recipient_id, content_preview, timestamp}.
Throughput & Partitioning: Kafka partitioned by recipient_id to ensure ordered notification processing.
Technical Selection: Kafka.
Rationale: High throughput and message replayability if the Notification Engine goes down.

Infrastructure (Optional)

Observability: Prometheus for metrics, Jaeger for tracing message flow across the Gateway and Chat Service.
Distributed Coordination: Not used for the MVP to maintain simplicity. Mapping is managed in Redis rather than a complex Zookeeper setup.
Wrap Up

Advanced Topics

Trade-offs (Consistency vs Availability): This is an AP system. In a network partition, we prioritize being able to send messages even if "Presence" status is slightly stale.
Reliability:
Sequence IDs: To handle out-of-order delivery over different network paths, the Chat Service assigns a monotonically increasing sequence number per chat_id.
Bottleneck Analysis:
WebSocket Max Connections: A single server can handle ~65k-100k connections. For 10M users, we need ~100-150 nodes.
Hot Partitions: Popular users or large groups are handled by sharding the group ID.
Security: TLS 1.3 for all connections. Database encryption at rest (AES-256).
Optimization:
Message Pull vs Push: Clients "Push" to the server via WS. Server "Pushes" to recipient. If the recipient is offline, the client "Pulls" history on reconnection using the last received message_id.