The Question

Real-time Chat Platform Design

Design a globally scalable real-time chat system similar to WhatsApp or Slack. The system must support 100 million daily active users, 1:1 and group messaging, and real-time presence indicators. Focus on handling persistent connections at scale, ensuring high-write throughput for message history, and providing low-latency delivery across diverse geographic regions while maintaining message ordering and reliability.

WebSockets

Redis

ScyllaDB

PostgreSQL

gRPC

JWT

APNs

FCM

NoSQL

Questions & Insights

Clarifying Questions

What is the scale of the system? (Assumption: 100M Daily Active Users (DAU), with each user sending an average of 50 messages per day, resulting in 5B messages/day).

What are the core communication types? (Assumption: 1:1 private chats and group chats with up to 500 members for the MVP).

What types of content must we support? (Assumption: Text, emojis, and small image metadata/links. Heavy video streaming is out of scope for MVP).

What are the delivery guarantees? (Assumption: At-least-once delivery with message sequencing/ordering within a conversation).

Is multi-device support required? (Assumption: Yes, users should see synchronized message history across multiple active sessions).

Thinking Process

Core Bottleneck 1: Real-time Bi-directional Communication. How do we maintain millions of persistent connections without exhausting server resources?

Core Bottleneck 2: Presence Management. How do we track "Online/Offline" status for 100M users with low latency and high update frequency?

Core Bottleneck 3: Message Persistence and Retrieval. How do we store 5B messages daily while ensuring sub-100ms read latency for chat history?

Core Bottleneck 4: Fan-out for Group Chats. How do we efficiently deliver a single message to 500 recipients without overloading the system?

Bonus Points

CRDTs (Conflict-Free Replicated Data Types): Implementing logic for seamless merging of chat states across multi-device sync or offline editing.

End-to-End Encryption (E2EE): Designing a Double Ratchet Algorithm-based key exchange (Signal Protocol) to ensure privacy even if the backend is compromised.

Push Notification Fallback: Intelligent integration with APNs/FCM using a dedicated notification service to wake up idle mobile clients.

Operational Cost Optimization: Using TTLs (Time-To-Live) on messages in hot storage and tiered archiving to S3 for old chat history.

Design Breakdown

Functional Requirements

Core Use Cases:

1:1 and Group messaging (text/emojis).

Real-time message delivery (Push).

Online/Offline presence status.

Message status (Sent, Delivered, Read receipts).

Chat history retrieval.

Scope Control:

In-scope: Core text messaging, group management, basic presence, multi-device sync.

Out-of-scope: Voice/Video calls, large file storage (CDN delivery), message search (Elasticsearch).

Non-Functional Requirements

Scale: Support 100M DAU and peak loads of 200k+ connections per second.

Latency: End-to-end message delivery under 200ms (P99).

Availability & Reliability: 99.99% availability; no message loss (Persistence is key).

Consistency: Strong ordering of messages within a single conversation; eventual consistency for presence.

Security: TLS in-transit; OAuth2/JWT for authentication; media URL signing.

Estimation

Traffic Estimation:

100M DAU * 50 messages = 5B messages/day.

Average QPS: 5B / 86,400s ≈ 60k QPS.

Peak QPS (3x): 180k QPS.

Storage Estimation:

5B messages * 100 bytes/message ≈ 500GB/day.

1 Year storage ≈ 180TB.

Bandwidth Estimation:

Ingress: 60k QPS * 100 bytes ≈ 6MB/s.

Egress (assuming 1:1 average): ~12MB/s (Fan-out increases this).

Blueprint

Concise Summary: A microservices-based architecture centered around a horizontally scalable WebSocket Gateway cluster for real-time delivery, paired with a NoSQL wide-column store for high-write message persistence.

Major Components:

WebSocket Gateway: Maintains persistent TCP connections and routes incoming/outgoing messages.

Message Service: Handles business logic, sequencing, and persistence of chat messages.

Presence Service: Tracks user connectivity status using a high-performance distributed cache.

Redis: Stores transient presence data and acts as a pub/sub for cross-gateway message routing.

NoSQL (ScyllaDB/Cassandra): Provides high-throughput storage for billion-scale message logs.

Simplicity Audit: This design avoids complex service meshes and heavy streaming frameworks (like Flink) in favor of a direct WebSocket + Redis Pub/Sub approach, which is sufficient for 100M DAU with text.

Architecture Decision Rationale:

Why this architecture?: WebSockets are the industry standard for low-latency bi-directional communication. NoSQL is chosen over SQL for message storage because chat logs are write-heavy and append-only, fitting the LSM-tree model perfectly.

Functional Satisfaction: Covers real-time delivery, persistence, and presence.

Non-functional Satisfaction: Scalable via gateway partitioning and NoSQL sharding.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing:

Global Load Balancing: Use Latency-based DNS (e.g., Route53) to route users to the nearest regional data center.

L7 Load Balancer: Terminates SSL/TLS and handles WebSocket upgrades (Upgrade: websocket header).

Security:

API Gateway: Standard JWT validation for REST requests.

WAF: Protects against Layer 7 DDoS attacks on the signaling endpoints.

Service

Topology & Scaling:

WebSocket Gateway: Statefully maintains connections. Scaled horizontally based on connection count (~50k-100k per instance).

Stateless Services: Message and Presence services scale based on CPU/Request count.

API Schema Design:

SendMessage: POST /v1/messages (Protocol: gRPC internal, WS external).

GetHistory: GET /v1/history/{convId}?limit=50&cursor={ts}.

Idempotency: Client generates a client_msg_id (UUID) to prevent duplicates on retries.

Resilience:

Heartbeats: Clients send ping/pong every 30s to the WS Gateway to keep connections alive and detect silent disconnects.

Storage

Access Pattern: 90% write (sending messages), 10% read (history fetch).

Database Table Design:

Messages Table (ScyllaDB):

Partition Key: conversation_id.

Clustering Key: message_timestamp (descending), message_id.

Fields: sender_id, content, type, metadata.

Conversations Table (PostgreSQL):

Fields: id, name, type (1:1/group), created_at, participant_list.

Technical Selection:

ScyllaDB: High-performance NoSQL for the heavy write load. It handles the linear growth of message logs better than RDBMS.

PostgreSQL: For relational metadata (user profiles, group memberships) where ACID is preferred for consistency.

Cache

Purpose & Justification: Presence tracking requires sub-millisecond updates and lookups.

Key-Value Schema:

Key: user_status:{user_id}.

Value: {"status": "online", "last_active": "timestamp", "gateway_id": "ws-001"}.

TTL: 60 seconds (requires periodic client heartbeats to renew).

Failure Handling: If Redis fails, the system defaults to "offline" status (Graceful degradation).

Messaging

Purpose & Decoupling: Redis Pub/Sub is used for cross-gateway message routing.

Mechanism:

When User A (on Gateway 1) sends a message to User B (on Gateway 2), Gateway 1 publishes the message to a Redis channel user:B.

Gateway 2, having subscribed to user:B upon User B's login, receives the message and pushes it via the open WebSocket.

Technical Selection: Redis Pub/Sub is chosen for its extremely low latency and simplicity for transient real-time routing.

Infrastructure (Optional)

Observability:

Metrics: Monitor "Active WebSocket Connections" and "Message End-to-End Latency."

Logging: Trace unique correlation_id from the sender's WS Gateway through the Message Service to the receiver's WS Gateway.

Wrap Up

Advanced Topics

Trade-offs:

Consistency vs. Availability (CAP): Presence is AP (Availability/Partition tolerance); it's okay if a user's "online" status is slightly delayed. Message persistence is CP to ensure no messages are lost.

Reliability:

Retry Mechanism: If a WS delivery fails, the Message Service sends a Push Notification (APNs/FCM) so the user receives it even if the app is in the background.

Bottleneck Analysis:

Hot Partitions: Large group chats (e.g., 500 members) can create write spikes. We mitigate this by using asynchronous fan-out via worker queues for group deliveries.

Security:

mTLS: For service-to-service communication within the VPC.

Sanitization: All incoming message content is sanitized on the gateway to prevent XSS/Injection.