DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
The Question
Design

Real-Time Chat & Presence System

Design a globally scalable real-time chat system supporting 100 million daily active users. The system must handle 1:1 and group conversations, provide persistent message history, and maintain a real-time 'online/offline' presence status. Address the challenges of maintaining millions of concurrent connections, minimizing message delivery latency, and efficiently managing the high-frequency write load associated with presence heartbeats.
WebSocket
Redis
Cassandra
Kafka
gRPC
NoSQL
Pub/Sub
Questions & Insights

Clarifying Questions

Scale: What is the expected scale (DAU, peak concurrent connections)?
Features: Do we need to support 1:1 chat only, or are group chats and media (images/video) in scope?
Online Status (Presence): Should the status be real-time, or is a slight delay (e.g., 30s) acceptable? Does "Online" mean app-open, or active-typing?
Retention: Should message history be stored forever or for a limited period?
Assumptions for MVP: 100M DAU, 1:1 and small group chats (up to 100 people), 500-byte average message size, real-time presence via heartbeats (30s interval), messages stored indefinitely.

Thinking Process

Core Bottleneck: How do we maintain millions of persistent WebSocket connections while tracking presence status across multiple server nodes?
Data Flow: How is a message routed from User A to User B, ensuring low latency and reliable delivery?
Presence Scaling: How do we handle the massive write volume of "heartbeats" (3.3M QPS if 100M users heartbeat every 30s)?
Architecture Flow: Start with a WebSocket Gateway for real-time duplex communication, use a Distributed Cache (Redis) for volatile presence state, and a NoSQL Database (Cassandra) for high-write message storage.

Bonus Points

Last Seen Optimization: Instead of updating "Last Seen" every 30s for every user, only update it on specific events (message sent, app closed) and use a coarser-grained timestamp to reduce DB write pressure.
WebSocket Session Affinity: Use a "Session Service" or consistent hashing to quickly locate which gateway node User B is connected to, avoiding broadcast storms.
Causal Ordering: Discuss using Lamport Timestamps or Client-side sequence numbers to ensure messages appear in the correct order despite network jitter.
Fan-out Strategy: For group chats, differentiate between "push-on-write" for small groups and "pull-on-read" for massive channels (broadcasts).
Design Breakdown

Functional Requirements

Core Use Cases:
1:1 and Group messaging.
Real-time delivery via WebSockets.
Online/Offline status and "Last Seen" timestamps.
Message history retrieval.
Scope Control:
In-scope: Text messaging, presence, delivery/read receipts.
Out-of-scope: End-to-end encryption (E2EE) key exchange, voice/video calls, file storage (CDN-side), and message search.

Non-Functional Requirements

Scale: Support 100M DAU and 10M+ concurrent connections.
Latency: Sub-200ms end-to-end message delivery.
Availability & Reliability: 99.99% availability; messages must not be lost once acknowledged by the server.
Consistency: Strong ordering for messages within a single conversation; eventual consistency for presence status.
Security: TLS for all connections; authentication via JWT.

Estimation

Traffic Estimation:
100M DAU, each sending 10 messages/day = 1B messages/day.
Average Message QPS: 1,000,000,000 / 86,400 \approx 11.5k QPS.
Peak Message QPS (3x): \approx 35k QPS.
Presence Heartbeats: 100M users / 30s heartbeat = 3.3M QPS (High write load).
Storage Estimation:
1B messages/day * 500 bytes = 500 GB/day.
Yearly storage: \approx 180 TB.
Bandwidth Estimation:
Ingress: 35k pk QPS * 500 bytes \approx 17.5 MB/s.

Blueprint

Concise Summary: A WebSocket-based real-time architecture utilizing a distributed session store (Redis) for presence and a wide-column store (Cassandra) for message persistence.
Major Components:
WebSocket Gateway: Manages long-lived TCP connections and routes incoming/outgoing messages.
Presence Service: Tracks online status via heartbeat TTLs in Redis.
Chat Service: Handles message logic, persistence, and fan-out to recipients.
Redis: Stores active session locations and presence status for low-latency lookups.
Cassandra: Persists chat history with optimized schema for chronological retrieval.
Kafka: Decouples message delivery from heavy async tasks like Push Notifications.
Simplicity Audit: This design avoids complex distributed locking by using TTLs for presence and relies on NoSQL horizontal scaling for storage.
Architecture Decision Rationale:
Why?: WebSockets are essential for bi-directional low latency. Cassandra is chosen because chat is write-heavy and naturally maps to partition keys (ChatID) and clustering keys (MessageID/Timestamp).
Functional Satisfaction: Meets all real-time and history requirements.
Non-functional Satisfaction: Scalable via sharding (Cassandra/Redis Cluster) and highly available via stateless service layers.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing:
DNS uses latency-based routing to the nearest regional Data Center.
L7 Load Balancer (e.g., Nginx/AWS ALB) handles SSL termination and distributes WebSocket connections based on least-connections.
Security & Perimeter:
API Gateway: Performs JWT validation before upgrading the connection to WebSocket.
Rate Limiting: Applied at the Gateway level to prevent heartbeat spam (e.g., max 1 heartbeat per 20s).

Service

Topology & Scaling:
WebSocket Gateway: Stateful in terms of TCP connections, but stateless regarding business logic. Scaled based on memory (max open files) and CPU.
Chat/Presence Services: Stateless microservices, scaled via HPA (Horizontal Pod Autoscaler) on CPU/Request count.
API Schema Design:
Send Message: POST /v1/messages (Internal/RPC)
Payload: {senderId, receiverId, groupId, content, type, tempId}
Get History: GET /v1/messages/{conversationId}?limit=50&cursor={ts}
Resilience & Reliability:
Retries: Clients use exponential backoff for WebSocket reconnections.
Ack Mechanism: Gateway waits for Chat Service to persist message in Cassandra before sending an "ACK" to the sender.

Storage

Access Pattern: High write (new messages) and high read (loading history). Presence is extreme write/read.
Database Table Design (Cassandra):
Table: messages_by_conversation
conversation_id (Partition Key)
created_at (Clustering Key, Descending)
message_id (TimeUUID)
sender_id, content
Technical Selection: Cassandra is chosen for its linear scalability and performance in time-series-like workloads.
Distribution Logic: Partitioning by conversation_id ensures all messages for a single chat are on the same physical node/partition, making history fetches extremely fast.

Cache

Purpose & Justification: Redis handles two critical bottlenecks:
Presence: Storing user_id -> status (Online/Offline/LastSeen).
Session Map: Storing user_id -> gateway_node_id to route messages to the correct WebSocket server.
Key-Value Schema:
presence:{user_id}: Value online, TTL 60s (updated by 30s heartbeats).
session:{user_id}: Value gateway_ip.
Failure Handling: If Redis fails, presence defaults to "Offline." Use Redis Cluster for high availability.

Messaging

Purpose & Decoupling: Kafka is used to ingest sent messages for downstream consumers.
Event Schema: MessageCreatedEvent containing the full message payload.
Throughput & Partitioning: Partitioned by conversation_id to maintain ordering for any specific chat.
Failure Handling: Dead-letter queue (DLQ) for failed push notification attempts.

Infrastructure (Optional)

Observability: Prometheus for metrics (active WebSocket count, message latency), ELK for logs, and Jaeger for tracing message flow across microservices.
Wrap Up

Advanced Topics

Trade-offs (PACELC): For Presence, we favor Availability over Consistency (PA/EL). It is okay if a user sees a friend "Online" for 30s after they disconnect. For Messages, we favor Consistency and Partition Tolerance (CP/EC) to ensure order.
Reliability: If the WebSocket Gateway node crashes, clients reconnect to a different node. Redis session mapping is updated upon reconnection.
Bottleneck Analysis: The biggest bottleneck is the Redis write volume for presence. Optimization: Implement "Presence Interest Groups." Only send presence updates to "friends" who are currently online and looking at their chat list, rather than broadcasting to all followers.
Security: Use TLS 1.3 for all traffic. Sanitize all message content on the server side to prevent XSS/Injection.