The Question
Design: Real-Time Chat & Presence System
Design a globally scalable real-time chat system supporting 100 million daily active users. The system must handle 1:1 and group conversations, provide persistent message history, and maintain a real-time 'online/offline' presence status. Address the challenges of maintaining millions of concurrent connections, minimizing message delivery latency, and efficiently managing the high-frequency write load associated with presence heartbeats.
WebSocket
Redis
Cassandra
Kafka
gRPC
NoSQL
Pub/Sub
Questions & Insights
Clarifying Questions
Scale: What is the expected scale (DAU, peak concurrent connections)?
Features: Do we need to support 1:1 chat only, or are group chats and media (images/video) in scope?
Online Status (Presence): Should the status be real-time, or is a slight delay (e.g., 30s) acceptable? Does "Online" mean app-open, or active-typing?
Retention: Should message history be stored forever or for a limited period?
Assumptions for MVP: 100M DAU, 1:1 and small group chats (up to 100 people), 500-byte average message size, real-time presence via heartbeats (30s interval), messages stored indefinitely.
Thinking Process
Core Bottleneck: How do we maintain millions of persistent WebSocket connections while tracking presence status across multiple server nodes?
Data Flow: How is a message routed from User A to User B, ensuring low latency and reliable delivery?
Presence Scaling: How do we handle the massive write volume of "heartbeats" (about 3.3M QPS in the worst case, where all 100M users are online and heartbeat every 30s)?
Architecture Flow: Start with a WebSocket Gateway for real-time duplex communication, use a Distributed Cache (Redis) for volatile presence state, and a NoSQL Database (Cassandra) for high-write message storage.
Bonus Points
Last Seen Optimization: Instead of updating "Last Seen" every 30s for every user, only update it on specific events (message sent, app closed) and use a coarser-grained timestamp to reduce DB write pressure.
WebSocket Session Affinity: Use a "Session Service" or consistent hashing to quickly locate which gateway node User B is connected to, avoiding broadcast storms.
Causal Ordering: Discuss using Lamport Timestamps or Client-side sequence numbers to ensure messages appear in the correct order despite network jitter.
Fan-out Strategy: For group chats, differentiate between "push-on-write" for small groups and "pull-on-read" for massive channels (broadcasts).
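To make the causal-ordering point above concrete, here is a minimal sketch of a Lamport clock. The class and variable names (LamportClock, alice, bob) are illustrative assumptions, not part of the design; a real client would carry the timestamp inside each message.

```python
from dataclasses import dataclass


@dataclass
class LamportClock:
    """Logical clock: a counter that only moves forward."""
    time: int = 0

    def tick(self) -> int:
        # Local event (e.g. sending a message): advance by one.
        self.time += 1
        return self.time

    def receive(self, remote_time: int) -> int:
        # On receipt, move past the sender's timestamp.
        self.time = max(self.time, remote_time) + 1
        return self.time


alice, bob = LamportClock(), LamportClock()

t1 = alice.tick()    # Alice sends "hi", stamped 1
bob.receive(t1)      # Bob's clock becomes 2
t2 = bob.tick()      # Bob replies "hello", stamped 3
alice.receive(t2)    # Alice's clock becomes 4
t3 = alice.tick()    # Alice sends "ok", stamped 5

# Messages may arrive out of order; sorting by (timestamp, sender)
# yields a total order that respects causality.
arrived = [(t3, "alice", "ok"), (t2, "bob", "hello"), (t1, "alice", "hi")]
ordered = [text for _, _, text in sorted(arrived)]
print(ordered)  # ['hi', 'hello', 'ok']
```

The sender ID acts only as a tie-breaker for equal timestamps; it does not imply that one of two concurrent messages "happened first".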
Design Breakdown
Functional Requirements
Core Use Cases:
1:1 and Group messaging.
Real-time delivery via WebSockets.
Online/Offline status and "Last Seen" timestamps.
Message history retrieval.
Scope Control:
In-scope: Text messaging, presence, delivery/read receipts.
Out-of-scope: End-to-end encryption (E2EE) key exchange, voice/video calls, file storage (CDN-side), and message search.
Non-Functional Requirements
Scale: Support 100M DAU and 10M+ concurrent connections.
Latency: Sub-200ms end-to-end message delivery.
Availability & Reliability: 99.99% availability; messages must not be lost once acknowledged by the server.
Consistency: Strong ordering for messages within a single conversation; eventual consistency for presence status.
Security: TLS for all connections; authentication via JWT.
Estimation
Traffic Estimation:
100M DAU, each sending 10 messages/day = 1B messages/day.
Average Message QPS: 1,000,000,000 / 86,400 ≈ 11.6k QPS.
Peak Message QPS (3x): ≈ 35k QPS.
Presence Heartbeats: 100M users / 30s heartbeat interval ≈ 3.3M QPS (high write load; an upper bound that assumes every daily user is online at once).
Storage Estimation:
1B messages/day * 500 bytes = 500 GB/day.
Yearly storage: 500 GB/day * 365 ≈ 180 TB.
Bandwidth Estimation:
Ingress: 35k peak QPS * 500 bytes ≈ 17.5 MB/s.
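The estimates above can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch using only the stated assumptions (100M DAU, 10 messages per user per day, 500-byte messages, 30s heartbeats, 3x peak factor). The variable names are illustrative, and the 10M concurrent figure is taken from the non-functional requirements.

```python
DAU = 100_000_000
CONCURRENT_USERS = 10_000_000      # from the non-functional requirements
MESSAGES_PER_USER_PER_DAY = 10
AVG_MESSAGE_BYTES = 500
HEARTBEAT_INTERVAL_S = 30
PEAK_FACTOR = 3
SECONDS_PER_DAY = 86_400

messages_per_day = DAU * MESSAGES_PER_USER_PER_DAY
avg_message_qps = messages_per_day / SECONDS_PER_DAY
peak_message_qps = avg_message_qps * PEAK_FACTOR

# Upper bound: every daily user online at once.
heartbeat_qps_worst = DAU / HEARTBEAT_INTERVAL_S
# Steadier figure if only the concurrent users send heartbeats.
heartbeat_qps_concurrent = CONCURRENT_USERS / HEARTBEAT_INTERVAL_S

storage_gb_per_day = messages_per_day * AVG_MESSAGE_BYTES / 1e9
storage_tb_per_year = storage_gb_per_day * 365 / 1e3
peak_ingress_mb_s = peak_message_qps * AVG_MESSAGE_BYTES / 1e6

print(f"avg message QPS:    {avg_message_qps:,.0f}")
print(f"peak message QPS:   {peak_message_qps:,.0f}")
print(f"heartbeat QPS:      {heartbeat_qps_worst:,.0f} (worst case), "
      f"{heartbeat_qps_concurrent:,.0f} (10M concurrent)")
print(f"storage:            {storage_gb_per_day:,.0f} GB/day, "
      f"{storage_tb_per_year:,.1f} TB/year")
print(f"peak ingress:       {peak_ingress_mb_s:.2f} MB/s")
```

Note the gap between the two heartbeat figures: sizing presence writes by concurrent connections rather than DAU lowers the estimate by an order of magnitude.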
Blueprint
Concise Summary: A WebSocket-based real-time architecture utilizing a distributed session store (Redis) for presence and a wide-column store (Cassandra) for message persistence.
Major Components:
WebSocket Gateway: Manages long-lived TCP connections and routes incoming/outgoing messages.
Presence Service: Tracks online status via heartbeat TTLs in Redis.
Chat Service: Handles message logic, persistence, and fan-out to recipients.
Redis: Stores active session locations and presence status for low-latency lookups.
Cassandra: Persists chat history with optimized schema for chronological retrieval.
Kafka: Decouples message delivery from heavy async tasks like Push Notifications.
Simplicity Audit: This design avoids complex distributed locking by using TTLs for presence and relies on NoSQL horizontal scaling for storage.
Architecture Decision Rationale:
Why?: WebSockets provide low-latency, bidirectional communication over a single long-lived connection. Cassandra is chosen because chat is write-heavy and maps naturally onto a partition key (ChatID) and clustering keys (MessageID/Timestamp).
Functional Satisfaction: Meets all real-time and history requirements.
Non-functional Satisfaction: Scalable via sharding (Cassandra/Redis Cluster) and highly available via stateless service layers.
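The partition-key and clustering-key mapping mentioned in the rationale can be illustrated with a small in-memory model. This is not Cassandra code: MessageStore and its methods are hypothetical stand-ins that mimic one partition per conversation, rows kept newest-first, and cursor-based paging as in the history API.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Message(NamedTuple):
    created_at: int   # clustering key (epoch milliseconds)
    sender_id: str
    content: str


class MessageStore:
    """In-memory model: one 'partition' per conversation, newest rows first."""

    def __init__(self) -> None:
        self._partitions = defaultdict(list)

    def append(self, conversation_id: str, msg: Message) -> None:
        rows = self._partitions[conversation_id]
        rows.append(msg)
        # Keep rows sorted newest-first, like a descending clustering order.
        rows.sort(key=lambda m: m.created_at, reverse=True)

    def history(self, conversation_id: str, limit: int = 50,
                cursor: Optional[int] = None) -> list:
        # A history read touches exactly one partition.
        rows = self._partitions.get(conversation_id, [])
        if cursor is not None:
            # Only rows strictly older than the cursor timestamp.
            rows = [m for m in rows if m.created_at < cursor]
        return rows[:limit]


store = MessageStore()
store.append("conv-1", Message(1000, "alice", "first"))
store.append("conv-1", Message(2000, "bob", "second"))
store.append("conv-1", Message(3000, "alice", "third"))
store.append("conv-2", Message(1500, "carol", "elsewhere"))

page1 = store.history("conv-1", limit=2)
page2 = store.history("conv-1", limit=2, cursor=page1[-1].created_at)
print([m.content for m in page1])  # ['third', 'second']
print([m.content for m in page2])  # ['first']
```

A timestamp-only cursor skips messages that share the same millisecond as the page boundary; pairing it with a unique message ID as a tie-breaker avoids that.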
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery & Traffic Routing:
DNS uses latency-based routing to the nearest regional Data Center.
L7 Load Balancer (e.g., Nginx/AWS ALB) handles TLS termination and distributes WebSocket connections using a least-connections policy.
Security & Perimeter:
API Gateway: Performs JWT validation before upgrading the connection to WebSocket.
Rate Limiting: Applied at the Gateway level to prevent heartbeat spam (e.g., max 1 heartbeat per 20s).
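The heartbeat limit described above (at most one heartbeat per 20s per user) can be sketched as a per-user minimum-interval check. HeartbeatLimiter is a hypothetical name, and the in-process dictionary is an assumption for illustration; a real gateway would need shared or per-connection state.

```python
class HeartbeatLimiter:
    """Accept at most one heartbeat per user every `min_interval_s` seconds."""

    def __init__(self, min_interval_s: float = 20.0) -> None:
        self.min_interval_s = min_interval_s
        self._last_accepted = {}  # user_id -> time of last accepted heartbeat

    def allow(self, user_id: str, now: float) -> bool:
        last = self._last_accepted.get(user_id)
        if last is not None and now - last < self.min_interval_s:
            # Too soon: drop it before it reaches the presence store.
            return False
        self._last_accepted[user_id] = now
        return True


limiter = HeartbeatLimiter(min_interval_s=20.0)
results = [
    limiter.allow("u1", now=0.0),    # first heartbeat: accepted
    limiter.allow("u1", now=10.0),   # 10s later: rejected
    limiter.allow("u1", now=20.0),   # 20s after the last accepted one: accepted
    limiter.allow("u2", now=10.0),   # different user: independent limit
]
print(results)  # [True, False, True, True]
```

The clock is passed in as `now` so the behavior is deterministic; in a running service it would typically come from a monotonic clock.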
Service
Topology & Scaling:
WebSocket Gateway: Stateful in terms of TCP connections, but stateless regarding business logic. Scaled based on per-connection memory, file-descriptor limits (max open files), and CPU.
Chat/Presence Services: Stateless microservices, scaled via HPA (Horizontal Pod Autoscaler) on CPU/Request count.
API Schema Design:
Send Message:
POST /v1/messages (Internal/RPC)
Payload: {senderId, receiverId, groupId, content, type, tempId}
Get History:
GET /v1/messages/{conversationId}?limit=50&cursor={ts}
Resilience & Reliability:
Retries: Clients use exponential backoff for WebSocket reconnections.
Ack Mechanism: Gateway waits for Chat Service to persist message in Cassandra before sending an "ACK" to the sender.
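The ack-after-persist rule can be sketched as follows. This is an illustration under assumptions: ChatService here is an in-memory stand-in for the real service and Cassandra, and using tempId from the payload to deduplicate client retries is an assumed use of that field, not something the design states.

```python
from dataclasses import dataclass, field


@dataclass
class ChatService:
    log: list = field(default_factory=list)             # stand-in for Cassandra
    seen_temp_ids: dict = field(default_factory=dict)   # (sender, tempId) -> messageId

    def persist(self, sender_id: str, temp_id: str, content: str) -> int:
        key = (sender_id, temp_id)
        if key in self.seen_temp_ids:
            # Client retry of a message that was already stored.
            return self.seen_temp_ids[key]
        self.log.append((sender_id, content))
        message_id = len(self.log)
        self.seen_temp_ids[key] = message_id
        return message_id


def handle_send(chat: ChatService, sender_id: str, temp_id: str, content: str) -> dict:
    """Gateway-side handler: ACK only after the write has succeeded."""
    try:
        message_id = chat.persist(sender_id, temp_id, content)
    except Exception:
        # Not persisted: tell the client so it can retry with the same tempId.
        return {"type": "NACK", "tempId": temp_id}
    return {"type": "ACK", "tempId": temp_id, "messageId": message_id}


chat = ChatService()
first = handle_send(chat, "alice", "tmp-1", "hi")
retry = handle_send(chat, "alice", "tmp-1", "hi")   # e.g. the first ACK was lost
print(first)          # {'type': 'ACK', 'tempId': 'tmp-1', 'messageId': 1}
print(len(chat.log))  # 1
```

Because the retry maps to the same stored message, a lost ACK does not produce a duplicate in the conversation.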
Storage
Access Pattern: High write (new messages) and high read (loading history). Presence is extremely write- and read-heavy.
Database Table Design (Cassandra):
Table: messages_by_conversation
conversation_id (Partition Key)
created_at (Clustering Key, Descending)
message_id (TimeUUID)
sender_id, content
Technical Selection: Cassandra is chosen for its linear scalability and performance in time-series-like workloads.
Distribution Logic: Partitioning by conversation_id ensures all messages for a single chat are on the same physical node/partition, making history fetches extremely fast.
Cache
Purpose & Justification: Redis handles two critical bottlenecks:
Presence: Storing user_id -> status (Online/Offline/LastSeen).
Session Map: Storing user_id -> gateway_node_id to route messages to the correct WebSocket server.
Key-Value Schema:
presence:{user_id}: Value online, TTL 60s (updated by 30s heartbeats).
session:{user_id}: Value gateway_ip.
Failure Handling: If Redis fails, presence defaults to "Offline." Use Redis Cluster for high availability.
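The TTL-based presence scheme can be sketched without a Redis server. PresenceStore below is a hypothetical in-memory model of a key that is rewritten with a 60s expiry on each heartbeat; in Redis this would correspond to setting the presence:{user_id} key with an expiry on every heartbeat.

```python
class PresenceStore:
    """In-memory model of presence keys that expire after `ttl_s` seconds."""

    def __init__(self, ttl_s: float = 60.0) -> None:
        self.ttl_s = ttl_s
        self._expires_at = {}  # key -> expiry time

    def heartbeat(self, user_id: str, now: float) -> None:
        # Each heartbeat rewrites the key and pushes the expiry forward.
        self._expires_at[f"presence:{user_id}"] = now + self.ttl_s

    def is_online(self, user_id: str, now: float) -> bool:
        expires_at = self._expires_at.get(f"presence:{user_id}")
        # A missing or expired key means "Offline"; no explicit delete is needed.
        return expires_at is not None and now < expires_at


presence = PresenceStore(ttl_s=60.0)
presence.heartbeat("u1", now=0.0)
presence.heartbeat("u1", now=30.0)       # regular 30s heartbeat
print(presence.is_online("u1", now=80.0))  # True: key expires at t=90
print(presence.is_online("u1", now=90.0))  # False: heartbeats stopped
print(presence.is_online("u2", now=0.0))   # False: never seen
```

With a 60s TTL and 30s heartbeats, one missed heartbeat does not flip the user to offline, which is what makes the scheme tolerant of brief network hiccups.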
Messaging
Purpose & Decoupling: Kafka is used to ingest sent messages for downstream consumers.
Event Schema: MessageCreatedEvent containing the full message payload.
Throughput & Partitioning: Partitioned by conversation_id to maintain ordering for any specific chat.
Failure Handling: Dead-letter queue (DLQ) for failed push notification attempts.
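Why keying by conversation_id preserves per-chat ordering can be shown with a small simulation. This sketch does not use a Kafka client: partition_for is a hypothetical stand-in that uses SHA-256 for a stable hash (Kafka's own partitioner uses a different hash function), and NUM_PARTITIONS is an assumed value.

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 8  # assumed partition count, for illustration only


def partition_for(conversation_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Same key -> same partition, every time.
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


# Interleaved events from two conversations, in send order.
events = [("conv-a", 1), ("conv-b", 1), ("conv-a", 2), ("conv-b", 2), ("conv-a", 3)]

partitions = defaultdict(list)  # partition index -> append-only log
for conversation_id, seq in events:
    partitions[partition_for(conversation_id)].append((conversation_id, seq))


def sequence_seen(conversation_id: str) -> list:
    """What a consumer of this conversation's partition reads, in log order."""
    log = partitions[partition_for(conversation_id)]
    return [seq for cid, seq in log if cid == conversation_id]


print(sequence_seen("conv-a"))  # [1, 2, 3]
print(sequence_seen("conv-b"))  # [1, 2]
```

Ordering is only guaranteed within a partition, so events for different conversations may be consumed in any relative order; that is acceptable here because ordering is only required within a single chat.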
Infrastructure (Optional)
Observability: Prometheus for metrics (active WebSocket count, message latency), ELK for logs, and Jaeger for tracing message flow across microservices.
Wrap Up
Advanced Topics
Trade-offs (PACELC): For Presence, we favor Availability over Consistency (PA/EL). It is okay if a user sees a friend "Online" for 30s after they disconnect. For Messages, we favor Consistency both during a partition and in normal operation (PC/EC) to preserve ordering.
Reliability: If the WebSocket Gateway node crashes, clients reconnect to a different node. Redis session mapping is updated upon reconnection.
Bottleneck Analysis: The biggest bottleneck is the Redis write volume for presence. Optimization: Implement "Presence Interest Groups." Only send presence updates to "friends" who are currently online and looking at their chat list, rather than broadcasting to all followers.
Security: Use TLS 1.3 for all traffic. Sanitize all message content on the server side to prevent XSS/Injection.
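The "Presence Interest Groups" optimization from the bottleneck analysis can be sketched as a subscription registry. PresenceInterest and its methods are hypothetical names; the idea shown is only that a status change is delivered to users who have registered interest (for example, while their chat list is open), rather than to every friend.

```python
from collections import defaultdict


class PresenceInterest:
    """Tracks which online users currently care about whose presence."""

    def __init__(self) -> None:
        self._watchers = defaultdict(set)  # watched user -> set of watcher ids

    def subscribe(self, watcher_id: str, friend_ids: list) -> None:
        # Called when the watcher opens their chat list.
        for friend_id in friend_ids:
            self._watchers[friend_id].add(watcher_id)

    def unsubscribe(self, watcher_id: str) -> None:
        # Called when the watcher leaves the chat list or disconnects.
        for watchers in self._watchers.values():
            watchers.discard(watcher_id)

    def recipients(self, user_id: str) -> set:
        # Only these users receive a presence update for `user_id`.
        return set(self._watchers.get(user_id, ()))


interest = PresenceInterest()
interest.subscribe("alice", ["bob", "carol"])
interest.subscribe("dave", ["bob"])

print(sorted(interest.recipients("bob")))    # ['alice', 'dave']
interest.unsubscribe("alice")
print(sorted(interest.recipients("bob")))    # ['dave']
print(sorted(interest.recipients("carol")))  # []
```

The fan-out per status change is bounded by the number of active watchers, not by the size of the user's friend list.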