The Question
Real-time Chat Platform Design
Design a globally scalable real-time chat system similar to WhatsApp or Slack. The system must support 100 million daily active users, 1:1 and group messaging, and real-time presence indicators. Focus on handling persistent connections at scale, ensuring high-write throughput for message history, and providing low-latency delivery across diverse geographic regions while maintaining message ordering and reliability.
WebSockets
Redis
ScyllaDB
PostgreSQL
gRPC
JWT
APNs
FCM
NoSQL
Questions & Insights
Clarifying Questions
What is the scale of the system? (Assumption: 100M Daily Active Users (DAU), with each user sending an average of 50 messages per day, resulting in 5B messages/day).
What are the core communication types? (Assumption: 1:1 private chats and group chats with up to 500 members for the MVP).
What types of content must we support? (Assumption: Text, emojis, and small image metadata/links. Heavy video streaming is out of scope for MVP).
What are the delivery guarantees? (Assumption: At-least-once delivery with message sequencing/ordering within a conversation).
Is multi-device support required? (Assumption: Yes, users should see synchronized message history across multiple active sessions).
Thinking Process
Core Bottleneck 1: Real-time Bi-directional Communication. How do we maintain millions of persistent connections without exhausting server resources?
Core Bottleneck 2: Presence Management. How do we track "Online/Offline" status for 100M users with low latency and high update frequency?
Core Bottleneck 3: Message Persistence and Retrieval. How do we store 5B messages daily while ensuring sub-100ms read latency for chat history?
Core Bottleneck 4: Fan-out for Group Chats. How do we efficiently deliver a single message to 500 recipients without overloading the system?
Bonus Points
CRDTs (Conflict-Free Replicated Data Types): Implementing logic for seamless merging of chat states across multi-device sync or offline editing.
End-to-End Encryption (E2EE): Designing a Double Ratchet Algorithm-based key exchange (Signal Protocol) to ensure privacy even if the backend is compromised.
Push Notification Fallback: Intelligent integration with APNs/FCM using a dedicated notification service to wake up idle mobile clients.
Operational Cost Optimization: Using TTLs (Time-To-Live) on messages in hot storage and tiered archiving to S3 for old chat history.
Design Breakdown
Functional Requirements
Core Use Cases:
1:1 and Group messaging (text/emojis).
Real-time message delivery (Push).
Online/Offline presence status.
Message status (Sent, Delivered, Read receipts).
Chat history retrieval.
Scope Control:
In-scope: Core text messaging, group management, basic presence, multi-device sync.
Out-of-scope: Voice/Video calls, large file storage (CDN delivery), message search (Elasticsearch).
Non-Functional Requirements
Scale: Support 100M DAU and peak loads of 200k+ connections per second.
Latency: End-to-end message delivery under 200ms (P99).
Availability & Reliability: 99.99% availability; no message loss (Persistence is key).
Consistency: Strong ordering of messages within a single conversation; eventual consistency for presence.
Security: TLS in-transit; OAuth2/JWT for authentication; media URL signing.
Estimation
Traffic Estimation:
100M DAU * 50 messages = 5B messages/day.
Average QPS: 5B / 86,400s ≈ 60k QPS.
Peak QPS (3x): 180k QPS.
Storage Estimation:
5B messages * 100 bytes/message ≈ 500GB/day.
1 Year storage ≈ 180TB.
Bandwidth Estimation:
Ingress: 60k QPS * 100 bytes ≈ 6MB/s.
Egress (assuming 1:1 average): ~12MB/s (Fan-out increases this).
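The arithmetic above can be reproduced with a short script. This is a sketch using only the assumptions already stated (100M DAU, 50 messages per user per day, 100 bytes per message, 3x peak factor); the function and key names are illustrative. The exact values come out slightly below the rounded figures quoted in the estimate.

```python
# Back-of-envelope capacity numbers from the stated assumptions.
DAU = 100_000_000
MESSAGES_PER_USER_PER_DAY = 50
BYTES_PER_MESSAGE = 100      # assumed average message size
PEAK_FACTOR = 3              # assumed peak-to-average ratio
SECONDS_PER_DAY = 86_400


def estimate():
    messages_per_day = DAU * MESSAGES_PER_USER_PER_DAY
    avg_qps = messages_per_day / SECONDS_PER_DAY
    storage_per_day_gb = messages_per_day * BYTES_PER_MESSAGE / 1e9
    return {
        "messages_per_day": messages_per_day,
        "avg_qps": avg_qps,
        "peak_qps": avg_qps * PEAK_FACTOR,
        "storage_per_day_gb": storage_per_day_gb,
        "storage_per_year_tb": storage_per_day_gb * 365 / 1e3,
        "ingress_mb_per_s": avg_qps * BYTES_PER_MESSAGE / 1e6,
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value:,.1f}")
```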
Blueprint
Concise Summary: A microservices-based architecture centered around a horizontally scalable WebSocket Gateway cluster for real-time delivery, paired with a NoSQL wide-column store for high-write message persistence.
Major Components:
WebSocket Gateway: Maintains persistent TCP connections and routes incoming/outgoing messages.
Message Service: Handles business logic, sequencing, and persistence of chat messages.
Presence Service: Tracks user connectivity status using a high-performance distributed cache.
Redis: Stores transient presence data and acts as a pub/sub for cross-gateway message routing.
NoSQL (ScyllaDB/Cassandra): Provides high-throughput storage for billion-scale message logs.
Simplicity Audit: This design avoids complex service meshes and heavy streaming frameworks (like Flink) in favor of a direct WebSocket + Redis Pub/Sub approach, which is sufficient for 100M DAU with text.
Architecture Decision Rationale:
Why this architecture? WebSockets are the industry standard for low-latency bi-directional communication. NoSQL is chosen over SQL for message storage because chat logs are write-heavy and append-only, a workload that LSM-tree storage engines handle well.
Functional Satisfaction: Covers real-time delivery, persistence, and presence.
Non-functional Satisfaction: Scalable via gateway partitioning and NoSQL sharding.
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery & Traffic Routing:
Global Load Balancing: Use Latency-based DNS (e.g., Route53) to route users to the nearest regional data center.
L7 Load Balancer: Terminates SSL/TLS and handles WebSocket upgrades (Upgrade: websocket header).
Security:
API Gateway: Standard JWT validation for REST requests.
WAF: Protects against Layer 7 DDoS attacks on the signaling endpoints.
Service
Topology & Scaling:
WebSocket Gateway: Statefully maintains connections. Scaled horizontally based on connection count (~50k-100k per instance).
Stateless Services: Message and Presence services scale based on CPU/Request count.
API Schema Design:
SendMessage: POST /v1/messages (Protocol: gRPC internal, WS external).
GetHistory: GET /v1/history/{convId}?limit=50&cursor={ts}.
Idempotency: Client generates a client_msg_id (UUID) to prevent duplicates on retries.
Resilience:
Heartbeats: Clients send a ping to the WS Gateway every 30s and expect a pong in reply; this keeps connections alive and detects silent disconnects.
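The idempotency rule can be illustrated with a minimal in-memory sketch. The MessageStore class, its send method, and the per-store seq counter are hypothetical names introduced for illustration; a real Message Service would keep the dedup index in durable storage and would likely sequence per conversation.

```python
from dataclasses import dataclass, field
import uuid


@dataclass
class MessageStore:
    """Hypothetical stand-in for the Message Service's dedup logic."""
    _by_client_id: dict = field(default_factory=dict)
    _next_seq: int = 1

    def send(self, sender_id, conv_id, client_msg_id, content):
        # The (sender, client_msg_id) pair identifies one logical send.
        key = (sender_id, client_msg_id)
        if key in self._by_client_id:
            # Retry of an already accepted message: return the original
            # record instead of storing a duplicate.
            return self._by_client_id[key]
        record = {
            "seq": self._next_seq,
            "conv_id": conv_id,
            "sender_id": sender_id,
            "client_msg_id": client_msg_id,
            "content": content,
        }
        self._next_seq += 1
        self._by_client_id[key] = record
        return record


if __name__ == "__main__":
    store = MessageStore()
    msg_id = str(uuid.uuid4())
    first = store.send("alice", "conv-1", msg_id, "hello")
    retry = store.send("alice", "conv-1", msg_id, "hello")
    print(first is retry)
```

Because the client chooses the ID before the first attempt, a retry after a timeout maps to the same record, which is what makes at-least-once delivery safe on the write path.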
Storage
Access Pattern: 90% write (sending messages), 10% read (history fetch).
Database Table Design:
Messages Table (ScyllaDB):
Partition Key: conversation_id.
Clustering Key: message_timestamp (descending), message_id.
Fields: sender_id, content, type, metadata.
Conversations Table (PostgreSQL):
Fields: id, name, type (1:1/group), created_at, participant_list.
Technical Selection:
ScyllaDB: High-performance NoSQL for the heavy write load. It handles the linear growth of message logs better than RDBMS.
PostgreSQL: For relational metadata (user profiles, group memberships) where ACID is preferred for consistency.
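To show how the partition key and descending clustering key serve the GetHistory query, here is an in-memory sketch. MessageLog is a hypothetical stand-in, not a ScyllaDB client, and timestamps are assumed to be integers (for example, milliseconds).

```python
from bisect import insort
from collections import defaultdict


class MessageLog:
    """In-memory stand-in for the Messages table (not a ScyllaDB client)."""

    def __init__(self):
        # Partition key -> sorted clustering keys (-timestamp, message_id).
        # Negating the timestamp keeps the newest message first.
        self._clustering = defaultdict(list)
        self._rows = {}

    def append(self, conversation_id, message_timestamp, message_id,
               sender_id, content):
        insort(self._clustering[conversation_id],
               (-message_timestamp, message_id))
        self._rows[(conversation_id, message_id)] = {
            "message_id": message_id,
            "message_timestamp": message_timestamp,
            "sender_id": sender_id,
            "content": content,
        }

    def history(self, conversation_id, limit=50, cursor=None):
        # Only one partition is read; rows are already newest first.
        keys = self._clustering.get(conversation_id, [])
        # With a cursor, return only messages strictly older than it.
        page = [k for k in keys if cursor is None or -k[0] < cursor][:limit]
        return [self._rows[(conversation_id, mid)] for _, mid in page]


if __name__ == "__main__":
    log = MessageLog()
    for ts in range(1, 6):
        log.append("conv-1", ts, f"m{ts}", "alice", f"message {ts}")
    first_page = log.history("conv-1", limit=2)
    print([row["message_id"] for row in first_page])
```

One caveat of a timestamp-only cursor: messages that share a timestamp at a page boundary can be skipped, so a composite cursor over (message_timestamp, message_id) may be preferable.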
Cache
Purpose & Justification: Presence tracking requires sub-millisecond updates and lookups.
Key-Value Schema:
Key: user_status:{user_id}.
Value: {"status": "online", "last_active": "timestamp", "gateway_id": "ws-001"}.
TTL: 60 seconds (requires periodic client heartbeats to renew).
Failure Handling: If Redis fails, the system defaults to "offline" status (Graceful degradation).
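The TTL-based presence scheme and its failure behavior can be sketched in memory as follows. PresenceCache, its injectable clock, and the available flag are assumptions made for illustration and testing; with Redis, the write would roughly correspond to a SET with an EX expiry of 60 seconds.

```python
import time


class PresenceCache:
    """In-memory stand-in for the Redis presence keys."""

    def __init__(self, ttl_seconds=60, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}       # key -> (value, expires_at)
        self.available = True    # set to False to simulate a cache outage

    def heartbeat(self, user_id, gateway_id):
        # Each heartbeat rewrites the key and renews its TTL.
        if not self.available:
            raise ConnectionError("presence cache unavailable")
        now = self._clock()
        value = {"status": "online", "last_active": now,
                 "gateway_id": gateway_id}
        self._entries[f"user_status:{user_id}"] = (value, now + self._ttl)

    def status(self, user_id):
        # Graceful degradation: any cache failure reads as "offline".
        try:
            if not self.available:
                raise ConnectionError("presence cache unavailable")
            entry = self._entries.get(f"user_status:{user_id}")
        except ConnectionError:
            return "offline"
        if entry is None or entry[1] <= self._clock():
            return "offline"     # never seen, or the TTL has expired
        return entry[0]["status"]


if __name__ == "__main__":
    cache = PresenceCache()
    cache.heartbeat("42", "ws-001")
    print(cache.status("42"))
```

With a 30s heartbeat and a 60s TTL, a client can miss one heartbeat before it is shown as offline.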
Messaging
Purpose & Decoupling: Redis Pub/Sub is used for cross-gateway message routing.
Mechanism:
When User A (on Gateway 1) sends a message to User B (on Gateway 2), Gateway 1 publishes the message to a Redis channel user:B.
Gateway 2, having subscribed to user:B upon User B's login, receives the message and pushes it via the open WebSocket.
Technical Selection: Redis Pub/Sub is chosen for its extremely low latency and simplicity for transient real-time routing.
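The routing mechanism can be sketched with an in-process broker standing in for Redis Pub/Sub. Broker and Gateway are hypothetical classes, and a plain list stands in for each open WebSocket; only the user:{id} channel naming comes from the design above.

```python
from collections import defaultdict


class Broker:
    """In-process stand-in for Redis Pub/Sub (fire-and-forget delivery)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, channel, callback):
        self._subscribers[channel].append(callback)

    def publish(self, channel, message):
        # Deliver to current subscribers only; nothing is stored.
        callbacks = list(self._subscribers.get(channel, []))
        for callback in callbacks:
            callback(message)
        return len(callbacks)   # number of subscribers reached


class Gateway:
    def __init__(self, gateway_id, broker):
        self.gateway_id = gateway_id
        self.broker = broker
        self.sockets = {}       # user_id -> list standing in for a WebSocket

    def on_login(self, user_id):
        # Subscribe to the user's channel when the connection opens.
        self.sockets[user_id] = []
        self.broker.subscribe(
            f"user:{user_id}",
            lambda message, uid=user_id: self.sockets[uid].append(message),
        )

    def on_message(self, recipient_id, message):
        # The sending gateway does not need to know where the recipient is.
        return self.broker.publish(f"user:{recipient_id}", message)


if __name__ == "__main__":
    broker = Broker()
    gateway_1, gateway_2 = Gateway("ws-001", broker), Gateway("ws-002", broker)
    gateway_1.on_login("A")
    gateway_2.on_login("B")
    gateway_1.on_message("B", {"from": "A", "content": "hi"})
    print(gateway_2.sockets["B"])
```

A publish that reaches zero subscribers means no gateway currently holds the recipient's connection, which is a natural trigger for the push notification fallback. Since Pub/Sub does not retain messages, the message should already be persisted before it is published.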
Infrastructure (Optional)
Observability:
Metrics: Monitor "Active WebSocket Connections" and "Message End-to-End Latency."
Logging: Trace unique correlation_id from the sender's WS Gateway through the Message Service to the receiver's WS Gateway.
Wrap Up
Advanced Topics
Trade-offs:
Consistency vs. Availability (CAP): Presence is AP (Availability/Partition tolerance); it's okay if a user's "online" status is slightly delayed. Message persistence is CP to ensure no messages are lost.
Reliability:
Retry Mechanism: If a WS delivery fails, the Message Service sends a Push Notification (APNs/FCM) so the user receives it even if the app is in the background.
Bottleneck Analysis:
Hot Partitions: Large group chats (e.g., 500 members) can create write spikes. We mitigate this by using asynchronous fan-out via worker queues for group deliveries.
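A minimal sketch of asynchronous fan-out through a worker queue is shown below, assuming an in-process queue and threads in place of a real job queue. FanOutWorkers and the deliver callback are hypothetical names; the point is that the send path only enqueues one task per recipient and returns, while workers perform the deliveries.

```python
import queue
import threading


class FanOutWorkers:
    """Hypothetical worker pool that delivers group messages off the send path."""

    def __init__(self, deliver, num_workers=4):
        self._tasks = queue.Queue()
        self._deliver = deliver
        for _ in range(num_workers):
            threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            recipient_id, message = self._tasks.get()
            try:
                self._deliver(recipient_id, message)
            except Exception:
                pass  # a real worker would retry or dead-letter the task
            finally:
                self._tasks.task_done()

    def enqueue(self, message, member_ids, sender_id):
        # Called on the send path: one task per recipient, no delivery here.
        count = 0
        for member_id in member_ids:
            if member_id != sender_id:
                self._tasks.put((member_id, message))
                count += 1
        return count

    def wait_idle(self):
        # Block until every queued delivery has been processed.
        self._tasks.join()


if __name__ == "__main__":
    delivered = []
    workers = FanOutWorkers(lambda rid, msg: delivered.append(rid))
    members = [f"u{i}" for i in range(500)]
    queued = workers.enqueue({"content": "hi"}, members, sender_id="u0")
    workers.wait_idle()
    print(queued, len(delivered))
```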
Security:
mTLS: For service-to-service communication within the VPC.
Sanitization: All incoming message content is sanitized on the gateway to prevent XSS/Injection.