The Question
Design

Global Instant Messaging System

Design the backend for a global instant messaging application like WhatsApp. The system must support real-time 1:1 and group chat for 500 million daily active users. Key requirements include low-latency message delivery, presence indicators (online/offline status), delivery and read receipts, and support for offline users via push notifications. Discuss how you handle millions of concurrent persistent connections, message ordering, and data persistence for high-volume writes.
WebSockets
Cassandra
Redis
Kafka
S3
gRPC
API Gateway
Push Notifications
Questions & Insights

Clarifying Questions

Scale and Traffic: What is the expected Daily Active User (DAU) count and average messages sent per user per day?
Feature Set: Does the MVP include 1:1 chat, group chat, and presence (online/offline status)? Should we handle media (images/video) or just text?
Data Retention: Should messages be stored permanently or deleted once delivered?
Network Constraints: Are we designing for high-latency mobile networks?
Assumptions for this design:
Scale: 500 million DAU, 20 billion messages per day.
Features: 1:1 messaging, group messaging (up to 512 members), delivery/read receipts, and presence status.
Media: MVP will support basic image/video uploads via an object store.
E2EE: End-to-end encryption is required (handled client-side, but the system must transport the keys/payloads).

Thinking Process

The core challenge is maintaining millions of persistent connections while ensuring low-latency delivery and handling massive write volumes.
Persistent Connectivity: How do we maintain open pipes for real-time delivery? (WebSockets for bi-directional communication).
Message Routing: How does the system know which server User B is connected to when User A sends a message? (Global Connection Registry/Presence Service).
Offline Delivery: How do we handle messages for users with no active connection? (Persistent storage + Push Notification triggers).
Group Fan-out: How do we efficiently send one message to 512 people without blocking the sender? (Asynchronous expansion via Message Queues).

Bonus Points

Consistency vs. Availability: Use a PACELC-aware design that favors availability over consistency for presence (AP under partition, low latency otherwise) while guaranteeing at-least-once delivery for messages.
Connection Draining: Implementing a graceful way to migrate millions of WebSocket connections during deployments without causing a thundering herd.
Conflict-Free Replicated Data Types (CRDTs): Using CRDTs for message ordering and synchronization across multiple devices for the same user.
Last-mile Latency: Using Edge Locations/PoPs to terminate TLS/WebSocket connections closer to the user to reduce handshake latency.
Design Breakdown

Functional Requirements

Core Use Cases:
One-to-one messaging (Text and Media).
Group messaging (Fan-out).
Delivery and Read receipts (Ack/Nack).
Presence status (Online, Offline, Last Seen).
Scope Control:
In-scope: Real-time messaging, status, push notifications.
Out-of-scope: Voice/Video calling (Signaling only), message search, "Stories/Status" updates.

Non-Functional Requirements

Scale: Support 500M DAU and 10M+ concurrent connections per cluster.
Latency: Message delivery < 200ms for 99th percentile (p99) when both users are online.
Availability & Reliability: 99.99% availability; zero message loss for delivered/stored messages.
Consistency: Eventual consistency for presence; causal consistency for message ordering within a thread.
Security: TLS 1.3 for transport; client-side End-to-End Encryption (E2EE).

Estimation

Traffic Estimation:
20B messages / 86,400 seconds ≈ 231,500 Average QPS.
Peak QPS (3x): ~700k QPS.
Storage Estimation:
Avg message size: 100 bytes (text/metadata).
20B msgs/day * 100 bytes = 2 TB/day.
5-year retention: 2 TB/day * 365 days * 5 years ≈ 3.65 PB.
Bandwidth Estimation:
Ingress: 231k QPS * 100 bytes ≈ 23 MB/s (text only).
Egress: Multiplying by group fan-out (assume avg 3x fan-out) ≈ 70 MB/s.
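
The arithmetic above can be reproduced with a short script. This is only a back-of-envelope sketch: the inputs are the assumptions stated in this design (20B messages/day, 100-byte messages, 3x peak factor, 3x average fan-out, 5-year retention), and the variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400

# Inputs taken from the stated assumptions of this design.
messages_per_day = 20_000_000_000   # 20B messages/day
avg_message_bytes = 100             # text + metadata only
peak_factor = 3                     # peak-to-average traffic ratio
fanout_factor = 3                   # assumed average group fan-out
retention_years = 5

avg_qps = messages_per_day / SECONDS_PER_DAY
peak_qps = avg_qps * peak_factor

storage_per_day_tb = messages_per_day * avg_message_bytes / 1e12
storage_total_pb = storage_per_day_tb * 365 * retention_years / 1e3

ingress_mb_s = avg_qps * avg_message_bytes / 1e6
egress_mb_s = ingress_mb_s * fanout_factor

print(f"avg QPS      ~ {avg_qps:,.0f}")
print(f"peak QPS     ~ {peak_qps:,.0f}")
print(f"storage/day  ~ {storage_per_day_tb:.1f} TB")
print(f"5-year total ~ {storage_total_pb:.2f} PB")
print(f"ingress      ~ {ingress_mb_s:.1f} MB/s")
print(f"egress       ~ {egress_mb_s:.1f} MB/s")
```

These figures cover text and metadata only; media stored in the object store would need its own estimate.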

Blueprint

Concise Summary: A WebSocket-based architecture where a Connection Manager tracks active users, a Chat Service routes messages, and Cassandra provides high-throughput persistent storage for undelivered or historical messages.
Major Components:
Connection Manager: Maintains stateful WebSocket sessions for real-time delivery.
Presence Service: Tracks online/offline status in a low-latency cache.
Message Service: Handles message logic, receipts, and persistent storage.
Message DB: Distributed NoSQL store for high-volume message persistence.
Message Queue: Decouples group message fan-out and offline push notifications.
Object Store: Stores encrypted media files (images/videos).
Simplicity Audit: This architecture avoids complex distributed locking by using a "Push-on-Connect" and "Store-and-Forward" model, a well-proven way to handle intermittent mobile connectivity.
Architecture Decision Rationale:
WebSockets over HTTP: Essential for low-overhead, real-time bi-directional updates.
Cassandra: Chosen for its linear scalability and high write throughput, which fits the massive message volume.
Redis: Chosen for presence tracking because status changes are frequent and ephemeral.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing:
Use Global Server Load Balancing (GSLB) to route users to the nearest regional data center.
CDN used for caching static assets and media downloads.
Security & Perimeter:
API Gateway: Handles authentication via JWT and terminates TLS.
Rate Limiting: Applied per UserID to prevent spam/DDoS.
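
One common way to implement per-UserID rate limiting is a token bucket. The sketch below is an in-memory illustration only; the class name, the capacity and refill values, and the explicit `now` argument (used instead of a real clock so the behavior is deterministic) are all assumptions, not part of the original design. A production gateway would typically keep this state in a shared store.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class _Bucket:
    tokens: float       # tokens currently available
    last_refill: float  # timestamp of the last refill, in seconds


class PerUserRateLimiter:
    """Token-bucket limiter keyed by user ID (hypothetical helper)."""

    def __init__(self, capacity: float = 5, refill_per_sec: float = 1.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self._buckets: Dict[str, _Bucket] = {}

    def allow(self, user_id: str, now: float) -> bool:
        """Return True if the request may proceed, consuming one token."""
        bucket = self._buckets.get(user_id)
        if bucket is None:
            # New users start with a full bucket.
            bucket = _Bucket(tokens=self.capacity, last_refill=now)
            self._buckets[user_id] = bucket

        # Refill proportionally to elapsed time, capped at capacity.
        elapsed = max(0.0, now - bucket.last_refill)
        bucket.tokens = min(self.capacity,
                            bucket.tokens + elapsed * self.refill_per_sec)
        bucket.last_refill = now

        if bucket.tokens >= 1:
            bucket.tokens -= 1
            return True
        return False


limiter = PerUserRateLimiter(capacity=2, refill_per_sec=1.0)
burst = [limiter.allow("user-1", now=0.0) for _ in range(3)]
print(burst)  # the third request in the same instant is rejected
```

The bucket allows short bursts up to `capacity` while bounding the sustained rate at `refill_per_sec`.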

Service

Topology & Scaling:
Connection Managers: Stateful, horizontally scaled. Each node tracks which UserIDs are connected to it in local memory and updates the Presence Cache.
Chat Service: Stateless, scales based on CPU/QPS.
API Schema Design:
sendMessage(to_id, content_type, payload): gRPC for internal service communication; WebSocket for client-to-server.
getMessages(conversation_id, cursor): REST for historical sync.
Resilience & Reliability:
Heartbeats: Clients send heartbeats every 30s to keep the WebSocket alive and update presence.
Sequence Numbers: Every message has a client-generated monotonically increasing sequence number to handle deduplication and ordering.
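
To illustrate how sequence numbers support deduplication and ordering, here is a minimal receiver-side sketch. It assumes each sender's sequence numbers are contiguous (1, 2, 3, ...) within a conversation, which is stricter than "monotonically increasing"; the class name and that assumption are illustrative, not part of the original design.

```python
from typing import Any, Dict, List


class InOrderReceiver:
    """Releases one sender's messages in sequence order and drops duplicates.

    Assumes contiguous sequence numbers starting at `first_seq`.
    """

    def __init__(self, first_seq: int = 1):
        self.next_seq = first_seq          # next sequence number to release
        self._pending: Dict[int, Any] = {}  # out-of-order messages held back

    def receive(self, seq: int, payload: Any) -> List[Any]:
        """Accept one message; return the payloads that are now deliverable."""
        if seq < self.next_seq or seq in self._pending:
            return []  # already delivered or already buffered: a duplicate

        self._pending[seq] = payload
        released = []
        # Drain the buffer while the next expected number is available.
        while self.next_seq in self._pending:
            released.append(self._pending.pop(self.next_seq))
            self.next_seq += 1
        return released


receiver = InOrderReceiver()
print(receiver.receive(1, "hello"))   # delivered immediately
print(receiver.receive(3, "world"))   # held until 2 arrives
print(receiver.receive(1, "hello"))   # retransmission, ignored
print(receiver.receive(2, "there"))   # releases 2 and then 3
```

Because retransmitted messages are ignored, this pairs naturally with at-least-once delivery: the sender can retry freely without producing duplicates on screen.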

Storage

Access Pattern: 90% writes (new messages/status) and 10% reads (historical sync).
Database Table Design:
Table: Messages
conversation_id (Partition Key)
message_id (Clustering Key - TimeUUID)
sender_id, content, timestamp, status (Sent/Delivered/Read)
Technical Selection: Cassandra.
Rationale: High write availability and efficient clustering keys for time-series message retrieval.
Distribution Logic: Sharded by conversation_id. For large groups, a bucket component is added to the partition key alongside conversation_id to prevent single-partition hotspots.
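
The bucketing idea can be sketched as follows. This is one possible scheme, assumed for illustration: the bucket is derived from a hash of the message ID, so writes to a busy conversation spread across `num_buckets` partitions. The function names and the choice of hash-based (rather than time-based) buckets are assumptions, not something the design specifies.

```python
import hashlib
from typing import List, Tuple


def partition_key(conversation_id: str, message_id: str,
                  num_buckets: int = 1) -> Tuple[str, int]:
    """Return the (conversation_id, bucket) pair a message is written to."""
    if num_buckets < 1:
        raise ValueError("num_buckets must be >= 1")
    # Use a stable hash so every writer computes the same bucket.
    digest = hashlib.sha256(message_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % num_buckets
    return (conversation_id, bucket)


def read_partitions(conversation_id: str,
                    num_buckets: int = 1) -> List[Tuple[str, int]]:
    """Return every partition a reader must query for this conversation."""
    return [(conversation_id, b) for b in range(num_buckets)]


# A 1:1 chat keeps a single partition; a large group spreads over several.
print(partition_key("conv-42", "msg-0001", num_buckets=1))
print(read_partitions("group-7", num_buckets=4))
```

The trade-off is on the read path: a historical sync for a bucketed conversation has to query each bucket and merge the results by message_id, so the bucket count is best kept small and applied only to conversations that need it.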

Cache

Purpose & Justification: Presence tracking (Online/Offline/Last Seen).
Key-Value Schema:
Key: user:presence:{user_id}, Value: {status: "online", last_seen: timestamp, server_id: "conn-mgr-01"}.
TTL: Set to 60 seconds (expires if heartbeat fails).
Technical Selection: Redis.
Rationale: High IOPS for status updates and support for Pub/Sub (optional for friend status updates).
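
The heartbeat-plus-TTL behavior can be illustrated with a small in-memory stand-in for the cache. This is a sketch, not Redis client code: the `PresenceStore` class and its methods are hypothetical, and time is passed in explicitly so expiry is easy to follow. With Redis, the equivalent write would be a SET with a 60-second expiry on each heartbeat.

```python
from typing import Dict, Optional, Tuple


class PresenceStore:
    """In-memory stand-in for the presence cache (hypothetical helper)."""

    def __init__(self, ttl_seconds: float = 60):
        self.ttl = ttl_seconds
        # key -> (value, absolute expiry time)
        self._entries: Dict[str, Tuple[dict, float]] = {}

    @staticmethod
    def _key(user_id: str) -> str:
        return f"user:presence:{user_id}"

    def heartbeat(self, user_id: str, server_id: str, now: float) -> None:
        """Record that the user is online and reset the TTL."""
        value = {"status": "online", "last_seen": now, "server_id": server_id}
        self._entries[self._key(user_id)] = (value, now + self.ttl)

    def lookup(self, user_id: str, now: float) -> Optional[dict]:
        """Return the presence record, or None if it has expired."""
        entry = self._entries.get(self._key(user_id))
        if entry is None:
            return None
        value, expires_at = entry
        if now >= expires_at:
            # No heartbeat arrived within the TTL: treat the user as offline.
            del self._entries[self._key(user_id)]
            return None
        return value


store = PresenceStore(ttl_seconds=60)
store.heartbeat("alice", "conn-mgr-01", now=0)
print(store.lookup("alice", now=30))   # still online
print(store.lookup("alice", now=60))   # expired after a missed heartbeat
```

With 30-second heartbeats and a 60-second TTL, a client can miss one heartbeat before it is shown as offline.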

Messaging

Purpose & Decoupling: Kafka is used for:
Group Fan-out: Chat service produces a "GroupMsg" event; workers expand this into 1:1 messages for each member.
Offline Notifications: If a recipient is offline (checked via Presence Cache), an event is published to the Push Notification Queue.
Failure Handling: Dead-letter queues (DLQ) for failed push notifications (e.g., invalid device tokens).
Technical Selection: Kafka.
Rationale: Message durability and high throughput for event-driven fan-out.
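
The fan-out and offline-notification paths described above can be sketched as a single worker function. Everything here is a simplified stand-in: plain `deque` objects play the role of Kafka topics, a dict plays the role of the Presence Cache, and the field names on the event are assumptions made for illustration.

```python
from collections import deque
from typing import Deque, Dict, Iterable, Tuple


def fan_out(group_msg: dict,
            members: Iterable[str],
            presence: Dict[str, str],
            delivery_queue: Deque[Tuple[str, dict]],
            push_queue: Deque[dict]) -> None:
    """Expand one group event into per-recipient deliveries.

    presence maps user_id -> connection manager ID for online users only.
    """
    for member_id in members:
        if member_id == group_msg["sender_id"]:
            continue  # the sender already has the message

        envelope = {
            "to": member_id,
            "conversation_id": group_msg["conversation_id"],
            "message_id": group_msg["message_id"],
            "payload": group_msg["payload"],  # opaque encrypted blob
        }
        server_id = presence.get(member_id)
        if server_id is not None:
            # Online: route to the connection manager holding the socket.
            delivery_queue.append((server_id, envelope))
        else:
            # Offline: hand off to the push-notification pipeline.
            push_queue.append(envelope)


delivery: Deque[Tuple[str, dict]] = deque()
push: Deque[dict] = deque()
event = {"sender_id": "alice", "conversation_id": "group-7",
         "message_id": "m-1", "payload": b"ciphertext"}
fan_out(event, ["alice", "bob", "carol"], {"bob": "conn-mgr-01"},
        delivery, push)
print(len(delivery), len(push))
```

Running the expansion in a worker keeps the sender's request path constant-time regardless of group size; the sender only waits for the single group event to be accepted.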
Wrap Up

Advanced Topics

Trade-offs: We chose Availability over Consistency (AP) for presence. If Redis is slightly out of sync, a user might see someone as "online" who just disconnected. This is acceptable for UX.
Reliability:
Retry Logic: Clients implement exponential backoff for WebSocket reconnection.
Acknowledgement: The delivery-receipt flow is Client A -> Server (server acks to A: "Sent") -> Client B -> Server (B acks receipt) -> Client A ("Delivered" status).
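
Because acknowledgements can be retried or arrive out of order under at-least-once delivery, it helps to make status updates idempotent. The sketch below is one way to do that, assuming the Sent/Delivered/Read statuses are strictly ordered; the `Status` enum and `apply_receipt` function are illustrative names, not part of the original design.

```python
from enum import IntEnum


class Status(IntEnum):
    """Message status, ordered so that later stages compare greater."""
    SENT = 1
    DELIVERED = 2
    READ = 3


def apply_receipt(current: Status, receipt: Status) -> Status:
    """Merge an incoming receipt into the stored status.

    Taking the maximum makes the update idempotent and order-insensitive:
    a duplicate or late "Delivered" ack never downgrades a "Read" message.
    """
    return max(current, receipt)


status = Status.SENT
status = apply_receipt(status, Status.DELIVERED)  # B's device acked
status = apply_receipt(status, Status.READ)       # B opened the chat
status = apply_receipt(status, Status.DELIVERED)  # retried ack arrives late
print(status.name)
```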
Bottleneck Analysis:
Hot Shards: Large group chats (up to 512 members) can cause write spikes.
Solution: Use a separate asynchronous "Group Worker" to handle fan-out rather than doing it in the request-response cycle.
Security & Privacy:
PII Protection: Media stored in S3 is encrypted. Only the URL is stored in Cassandra.
End-to-End Encryption: The server never sees the raw message content, only the encrypted blob and routing metadata.