The Question
Design

Real-time Chat Platform Design

Design a globally scalable real-time chat system similar to WhatsApp or Slack. The system must support 100 million daily active users, 1:1 and group messaging, and real-time presence indicators. Focus on handling persistent connections at scale, ensuring high-write throughput for message history, and providing low-latency delivery across diverse geographic regions while maintaining message ordering and reliability.
WebSockets
Redis
ScyllaDB
PostgreSQL
gRPC
JWT
APNs
FCM
NoSQL
Questions & Insights

Clarifying Questions

What is the scale of the system? (Assumption: 100M Daily Active Users (DAU), with each user sending an average of 50 messages per day, resulting in 5B messages/day).
What are the core communication types? (Assumption: 1:1 private chats and group chats with up to 500 members for the MVP).
What types of content must we support? (Assumption: Text, emojis, and small image metadata/links. Heavy video streaming is out of scope for MVP).
What are the delivery guarantees? (Assumption: At-least-once delivery with message sequencing/ordering within a conversation).
Is multi-device support required? (Assumption: Yes, users should see synchronized message history across multiple active sessions).

Thinking Process

Core Bottleneck 1: Real-time Bi-directional Communication. How do we maintain millions of persistent connections without exhausting server resources?
Core Bottleneck 2: Presence Management. How do we track "Online/Offline" status for 100M users with low latency and high update frequency?
Core Bottleneck 3: Message Persistence and Retrieval. How do we store 5B messages daily while ensuring sub-100ms read latency for chat history?
Core Bottleneck 4: Fan-out for Group Chats. How do we efficiently deliver a single message to 500 recipients without overloading the system?

Bonus Points

CRDTs (Conflict-Free Replicated Data Types): Implementing logic for seamless merging of chat states across multi-device sync or offline editing.
End-to-End Encryption (E2EE): Designing a Double Ratchet Algorithm-based key exchange (Signal Protocol) to ensure privacy even if the backend is compromised.
Push Notification Fallback: Intelligent integration with APNs/FCM using a dedicated notification service to wake up idle mobile clients.
Operational Cost Optimization: Using TTLs (Time-To-Live) on messages in hot storage and tiered archiving to S3 for old chat history.
Design Breakdown

Functional Requirements

Core Use Cases:
1:1 and Group messaging (text/emojis).
Real-time message delivery (Push).
Online/Offline presence status.
Message status (Sent, Delivered, Read receipts).
Chat history retrieval.
Scope Control:
In-scope: Core text messaging, group management, basic presence, multi-device sync.
Out-of-scope: Voice/Video calls, large file storage (CDN delivery), message search (Elasticsearch).

Non-Functional Requirements

Scale: Support 100M DAU and peak loads of roughly 200k messages per second (see Estimation).
Latency: End-to-end message delivery under 200ms (P99).
Availability & Reliability: 99.99% availability; no message loss (Persistence is key).
Consistency: Strong ordering of messages within a single conversation; eventual consistency for presence.
Security: TLS in-transit; OAuth2/JWT for authentication; media URL signing.

Estimation

Traffic Estimation:
100M DAU * 50 messages = 5B messages/day.
Average QPS: 5B / 86,400s ≈ 60k QPS.
Peak QPS (3x): 180k QPS.
Storage Estimation:
5B messages * 100 bytes/message ≈ 500GB/day.
1 Year storage ≈ 180TB.
Bandwidth Estimation:
Ingress: 60k QPS * 100 bytes ≈ 6MB/s.
Egress (assuming 1:1 average): ~12MB/s (Fan-out increases this).
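The arithmetic above can be reproduced with a short script. This is only a back-of-envelope sketch: the constant names are illustrative, and the unrounded results come out slightly below the rounded figures in the list (about 58k average QPS rather than 60k).

```python
# Inputs taken from the clarifying assumptions and the estimation above.
DAU = 100_000_000
MESSAGES_PER_USER_PER_DAY = 50
BYTES_PER_MESSAGE = 100
SECONDS_PER_DAY = 86_400
PEAK_FACTOR = 3

messages_per_day = DAU * MESSAGES_PER_USER_PER_DAY            # 5 billion
average_qps = messages_per_day / SECONDS_PER_DAY              # about 57.9k before rounding
peak_qps = average_qps * PEAK_FACTOR                          # about 174k before rounding
storage_per_day_gb = messages_per_day * BYTES_PER_MESSAGE / 1e9
storage_per_year_tb = storage_per_day_gb * 365 / 1e3
ingress_mb_per_s = average_qps * BYTES_PER_MESSAGE / 1e6

if __name__ == "__main__":
    print(f"messages/day : {messages_per_day:,}")
    print(f"average QPS  : {average_qps:,.0f}")
    print(f"peak QPS     : {peak_qps:,.0f}")
    print(f"storage/day  : {storage_per_day_gb:,.0f} GB")
    print(f"storage/year : {storage_per_year_tb:,.1f} TB")
    print(f"ingress      : {ingress_mb_per_s:.2f} MB/s")
```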

Blueprint

Concise Summary: A microservices-based architecture centered around a horizontally scalable WebSocket Gateway cluster for real-time delivery, paired with a NoSQL wide-column store for high-write message persistence.
Major Components:
WebSocket Gateway: Maintains persistent TCP connections and routes incoming/outgoing messages.
Message Service: Handles business logic, sequencing, and persistence of chat messages.
Presence Service: Tracks user connectivity status using a high-performance distributed cache.
Redis: Stores transient presence data and acts as a pub/sub for cross-gateway message routing.
NoSQL (ScyllaDB/Cassandra): Provides high-throughput storage for billion-scale message logs.
Simplicity Audit: This design avoids complex service meshes and heavy streaming frameworks (like Flink) in favor of a direct WebSocket + Redis Pub/Sub approach, which is sufficient for 100M DAU with text.
Architecture Decision Rationale:
Why this architecture? WebSockets are the industry standard for low-latency bi-directional communication. NoSQL is chosen over SQL for message storage because chat logs are write-heavy and append-only, which suits LSM-tree storage engines well.
Functional Satisfaction: Covers real-time delivery, persistence, and presence.
Non-functional Satisfaction: Scalable via gateway partitioning and NoSQL sharding.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing:
Global Load Balancing: Use latency-based DNS (e.g., Amazon Route 53) to route users to the nearest regional data center.
L7 Load Balancer: Terminates SSL/TLS and handles WebSocket upgrades (Upgrade: websocket header).
Security:
API Gateway: Standard JWT validation for REST requests.
WAF: Protects against Layer 7 DDoS attacks on the signaling endpoints.

Service

Topology & Scaling:
WebSocket Gateway: Statefully maintains connections. Scaled horizontally based on connection count (~50k-100k per instance).
Stateless Services: Message and Presence services scale based on CPU/Request count.
API Schema Design:
SendMessage: POST /v1/messages (protocol: WebSocket for external clients, gRPC between internal services).
GetHistory: GET /v1/history/{convId}?limit=50&cursor={ts}.
Idempotency: Client generates a client_msg_id (UUID) to prevent duplicates on retries.
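The idempotency rule can be sketched as a lookup keyed by the sender and the client-generated ID. This is a minimal in-memory illustration, not the service's actual implementation: `MessageStore`, its `send` method, and the use of a sequence number as the return value are all assumptions made for the example.

```python
import uuid
from typing import Dict, Tuple


class MessageStore:
    """Hypothetical in-memory stand-in for the Message Service's dedup check."""

    def __init__(self) -> None:
        # (sender_id, client_msg_id) -> sequence number assigned on first receipt
        self._seen: Dict[Tuple[str, str], int] = {}
        self._next_seq = 1

    def send(self, sender_id: str, client_msg_id: str, content: str) -> int:
        key = (sender_id, client_msg_id)
        if key in self._seen:
            # A retry of a message we already accepted: return the original
            # sequence number instead of storing a duplicate.
            return self._seen[key]
        seq = self._next_seq
        self._next_seq += 1
        self._seen[key] = seq
        return seq


if __name__ == "__main__":
    store = MessageStore()
    msg_id = str(uuid.uuid4())  # generated once by the client, reused on retry
    first = store.send("alice", msg_id, "hello")
    retry = store.send("alice", msg_id, "hello")
    print(first, retry)
```

In a real deployment the seen-set would need an expiry window and shared storage; those details are outside what the design states.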
Resilience:
Heartbeats: Clients send a ping to the WS Gateway every 30s and expect a pong in reply; this keeps connections alive and lets either side detect silent disconnects.
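One way the gateway could detect silent disconnects from the heartbeat rule above is to record the last ping time per connection and periodically reap stale entries. The sketch below is an assumption-laden illustration: `HeartbeatTracker` is a hypothetical name, and the timeout of two heartbeat intervals (tolerating one missed ping) is a choice made for the example, not something the design specifies.

```python
from typing import Dict, List

HEARTBEAT_INTERVAL_S = 30                  # from the design
TIMEOUT_S = 2 * HEARTBEAT_INTERVAL_S       # assumption: tolerate one missed ping


class HeartbeatTracker:
    """Tracks the last ping per connection and reports connections gone silent."""

    def __init__(self) -> None:
        self._last_seen: Dict[str, float] = {}

    def on_ping(self, conn_id: str, now: float) -> None:
        self._last_seen[conn_id] = now

    def reap(self, now: float) -> List[str]:
        """Remove and return connections whose last ping is older than TIMEOUT_S."""
        dead = [c for c, t in self._last_seen.items() if now - t > TIMEOUT_S]
        for conn_id in dead:
            del self._last_seen[conn_id]
        return dead


if __name__ == "__main__":
    tracker = HeartbeatTracker()
    tracker.on_ping("conn-a", now=0)
    tracker.on_ping("conn-b", now=0)
    tracker.on_ping("conn-a", now=30)      # conn-b never pings again
    print(tracker.reap(now=61))
```

Timestamps are passed in explicitly so the logic can be exercised without waiting on a real clock.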

Storage

Access Pattern: 90% write (sending messages), 10% read (history fetch).
Database Table Design:
Messages Table (ScyllaDB):
Partition Key: conversation_id.
Clustering Key: message_timestamp (descending), message_id.
Fields: sender_id, content, type, metadata.
Conversations Table (PostgreSQL):
Fields: id, name, type (1:1/group), created_at, participant_list.
Technical Selection:
ScyllaDB: High-performance NoSQL for the heavy write load. It handles the linear growth of message logs better than RDBMS.
PostgreSQL: For relational metadata (user profiles, group memberships) where ACID is preferred for consistency.
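To illustrate how the Messages table layout supports GetHistory, the sketch below models one partition per conversation_id with rows ordered by (message_timestamp, message_id) and pages through them newest first. It is an in-memory approximation, not a ScyllaDB client: `MessageLog` is a hypothetical class, and the cursor is a (timestamp, message_id) pair rather than the bare timestamp in the API schema, an assumption made here so that messages sharing a timestamp are not skipped.

```python
from bisect import insort
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# (timestamp_ms, message_id, sender_id, content)
Message = Tuple[int, str, str, str]
Cursor = Tuple[int, str]


class MessageLog:
    def __init__(self) -> None:
        # Partition key: conversation_id. Rows are kept sorted ascending by
        # (timestamp, message_id) and read in reverse for newest-first order.
        self._partitions: Dict[str, List[Message]] = defaultdict(list)

    def append(self, conv_id: str, ts: int, message_id: str,
               sender_id: str, content: str) -> None:
        insort(self._partitions[conv_id], (ts, message_id, sender_id, content))

    def get_history(self, conv_id: str, limit: int = 50,
                    cursor: Optional[Cursor] = None
                    ) -> Tuple[List[Message], Optional[Cursor]]:
        page: List[Message] = []
        for row in reversed(self._partitions.get(conv_id, [])):
            if cursor is not None and (row[0], row[1]) >= cursor:
                continue  # already returned on an earlier page
            page.append(row)
            if len(page) == limit:
                break
        # A full page may have more rows behind it, so hand back a cursor.
        next_cursor = (page[-1][0], page[-1][1]) if len(page) == limit else None
        return page, next_cursor


if __name__ == "__main__":
    log = MessageLog()
    for i in range(5):
        log.append("conv-1", 1000 + i, f"m{i}", "alice", f"message {i}")
    page, cursor = log.get_history("conv-1", limit=2)
    print([m[1] for m in page], cursor)
```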

Cache

Purpose & Justification: Presence tracking requires sub-millisecond updates and lookups.
Key-Value Schema:
Key: user_status:{user_id}.
Value: {"status": "online", "last_active": "timestamp", "gateway_id": "ws-001"}.
TTL: 60 seconds (requires periodic client heartbeats to renew).
Failure Handling: If Redis fails, the system defaults to "offline" status (Graceful degradation).
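The presence scheme above (a user_status key renewed by heartbeats and expiring after 60 seconds) can be sketched with an in-memory stand-in for Redis. `PresenceCache` and its methods are hypothetical names for illustration; in the described design the expiry would be handled by Redis key TTLs rather than by application code.

```python
import json
from typing import Dict, Tuple

PRESENCE_TTL_S = 60  # from the design


class PresenceCache:
    """In-memory approximation of presence keys with a TTL (not a Redis client)."""

    def __init__(self) -> None:
        # key -> (JSON value, absolute expiry time)
        self._data: Dict[str, Tuple[str, float]] = {}

    def heartbeat(self, user_id: str, gateway_id: str, now: float) -> None:
        value = json.dumps(
            {"status": "online", "last_active": now, "gateway_id": gateway_id}
        )
        # Each heartbeat rewrites the key and resets its expiry.
        self._data[f"user_status:{user_id}"] = (value, now + PRESENCE_TTL_S)

    def status(self, user_id: str, now: float) -> str:
        entry = self._data.get(f"user_status:{user_id}")
        if entry is None or entry[1] <= now:
            # Missing or expired key: treat the user as offline, which is also
            # the degraded behavior described for a cache failure.
            return "offline"
        return json.loads(entry[0])["status"]


if __name__ == "__main__":
    cache = PresenceCache()
    cache.heartbeat("u1", "ws-001", now=0)
    print(cache.status("u1", now=30), cache.status("u1", now=60))
```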

Messaging

Purpose & Decoupling: Redis Pub/Sub is used for cross-gateway message routing.
Mechanism:
When User A (on Gateway 1) sends a message to User B (on Gateway 2), Gateway 1 publishes the message to a Redis channel user:B.
Gateway 2, having subscribed to user:B upon User B's login, receives the message and pushes it via the open WebSocket.
Technical Selection: Redis Pub/Sub is chosen for its extremely low latency and simplicity for transient real-time routing.
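The cross-gateway routing mechanism above can be sketched with a tiny in-memory broker in place of Redis Pub/Sub. `InMemoryPubSub` and `Gateway` are hypothetical classes, and the `delivered` dictionary stands in for pushing over an open WebSocket. The sketch mirrors one real property of Redis: PUBLISH reports how many subscribers received the message, which a sender can use to tell whether the recipient is connected anywhere.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class InMemoryPubSub:
    """Minimal stand-in for a pub/sub broker."""

    def __init__(self) -> None:
        self._subs: Dict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Callable[[str], None]) -> None:
        self._subs[channel].append(handler)

    def publish(self, channel: str, message: str) -> int:
        handlers = self._subs.get(channel, [])
        for handler in handlers:
            handler(message)
        return len(handlers)  # number of subscribers that received the message


class Gateway:
    def __init__(self, gateway_id: str, broker: InMemoryPubSub) -> None:
        self.gateway_id = gateway_id
        self.broker = broker
        # user_id -> messages "pushed" to that user's WebSocket
        self.delivered: Dict[str, List[str]] = defaultdict(list)

    def on_login(self, user_id: str) -> None:
        # Subscribe to the user's channel when their connection is established.
        self.broker.subscribe(
            f"user:{user_id}", lambda msg: self.delivered[user_id].append(msg)
        )

    def on_send(self, recipient_id: str, message: str) -> bool:
        # False means no gateway holds a connection for the recipient.
        return self.broker.publish(f"user:{recipient_id}", message) > 0


if __name__ == "__main__":
    broker = InMemoryPubSub()
    gateway_1, gateway_2 = Gateway("ws-001", broker), Gateway("ws-002", broker)
    gateway_2.on_login("B")
    print(gateway_1.on_send("B", "hello from A"), gateway_2.delivered["B"])
```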

Infrastructure (Optional)

Observability:
Metrics: Monitor "Active WebSocket Connections" and "Message End-to-End Latency."
Logging: Trace unique correlation_id from the sender's WS Gateway through the Message Service to the receiver's WS Gateway.
Wrap Up

Advanced Topics

Trade-offs:
Consistency vs. Availability (CAP): Presence is AP (Availability/Partition tolerance); it's okay if a user's "online" status is slightly delayed. Message persistence is CP to ensure no messages are lost.
Reliability:
Retry Mechanism: If a WS delivery fails (for example, because the recipient has no live connection), the Message Service falls back to a Push Notification (APNs/FCM) so the user still receives it even if the app is in the background.
Bottleneck Analysis:
Hot Partitions: Large group chats (e.g., 500 members) can create load spikes, because each message written becomes up to 499 deliveries. We mitigate this by using asynchronous fan-out via worker queues for group deliveries.
Security:
mTLS: For service-to-service communication within the VPC.
Sanitization: All incoming message content is sanitized on the gateway to prevent XSS/Injection.
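As an illustration of the asynchronous fan-out mentioned under Bottleneck Analysis, the sketch below enqueues one delivery task per group member and lets a small worker pool drain the queue, so the sender's request returns without waiting for every delivery. `FanOutWorkers`, the worker count, and the `deliver` callback are assumptions for the example; a production system would more likely use a durable queue than in-process threads.

```python
import queue
import threading
from typing import Callable, List, Tuple


class FanOutWorkers:
    """Hypothetical worker pool that delivers group messages off the request path."""

    def __init__(self, deliver: Callable[[str, str], None], num_workers: int = 4) -> None:
        self._tasks: "queue.Queue[Tuple[str, str]]" = queue.Queue()
        self._deliver = deliver
        for _ in range(num_workers):
            threading.Thread(target=self._run, daemon=True).start()

    def enqueue_group_message(self, message: str, member_ids: List[str],
                              sender_id: str) -> None:
        # One task per recipient; the sender does not receive their own message.
        for member_id in member_ids:
            if member_id != sender_id:
                self._tasks.put((member_id, message))

    def _run(self) -> None:
        while True:
            recipient_id, message = self._tasks.get()
            try:
                self._deliver(recipient_id, message)
            finally:
                self._tasks.task_done()

    def wait_idle(self) -> None:
        """Block until every queued delivery has been processed (useful in tests)."""
        self._tasks.join()


if __name__ == "__main__":
    delivered: List[str] = []
    lock = threading.Lock()

    def deliver(recipient_id: str, message: str) -> None:
        with lock:
            delivered.append(recipient_id)

    workers = FanOutWorkers(deliver)
    members = [f"user-{i}" for i in range(500)]
    workers.enqueue_group_message("hello group", members, sender_id="user-0")
    workers.wait_idle()
    print(len(delivered))
```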