The Question
Design

Scalable Real-time Chat Architecture

Design a highly available and scalable real-time messaging platform capable of supporting millions of concurrent users. The system should facilitate instantaneous 1:1 communication, persistent message storage for historical retrieval, and real-time user presence tracking, while ensuring low-latency delivery and fault tolerance.
WebSocket
Redis
Cassandra
Kafka
JWT
FCM
LSM-Tree
Questions & Insights

Clarifying Questions

What is the expected scale (DAU and Concurrent Users)?
Assumption: 10M DAU, 1M concurrent connections (PCU).
What are the core features for the MVP?
Assumption: 1:1 private messaging, persistent chat history, and real-time presence (online/offline status). Group chat and media attachments are secondary.
What are the latency requirements for message delivery?
Assumption: End-to-end latency should be < 200ms for the 99th percentile.
Do we need to support message delivery guarantees?
Assumption: "At-least-once" delivery is required. We will handle duplicates at the client level using message IDs.
What is the data retention policy?
Assumption: Chat history is stored indefinitely for the MVP.

Thinking Process

Core Bottleneck: Persistent Connections. Handling 1M+ concurrent WebSockets requires a distributed Gateway layer that maps users to specific server nodes.
Message Routing Strategy. How does Node A (User 1) find Node B (User 2)? We need a "Presence/Session Store" to track which Gateway node holds which user's connection.
Storage Choice. Chat messages are write-heavy and grow linearly. A wide-column NoSQL database (Cassandra/DynamoDB) is superior to RDBMS for horizontal scaling and sequential write performance.
Decoupling non-critical paths. Use a Message Queue (Kafka) to handle side effects like Push Notifications and Analytics without blocking the core message delivery path.

Bonus Points

Connection Draining: Implementation of "GoAway" frames or gradual connection rebalancing to prevent "thundering herd" issues during Gateway deployments.
Last-Write-Wins (LWW) vs. Vector Clocks: Discussion on handling message ordering in a distributed environment where clocks may drift.
Operational Excellence: Using "Consistent Hashing" for the session store to minimize remapping during cluster resizing.
Push Notification Fallback: Logic to detect if a WebSocket is dead and immediately trigger a mobile push notification (APNs/FCM).
Design Breakdown

Functional Requirements

Users can send and receive 1:1 text messages in real-time.
Users can see the online/offline status of their contacts.
Users can fetch historical messages (pagination).
Messages must be delivered even if the recipient is offline (stored for later).

Non-Functional Requirements

Low Latency: Real-time feel is critical.
High Availability: The system must remain operational even if a data center zone fails.
Scalability: Must handle spikes in traffic (e.g., during major events).
Reliability: No message loss once the server acknowledges receipt.

Estimation

Connections: 1M concurrent users. If one server handles 50k connections, we need ~20 Gateway nodes.
Storage: 10M DAU * 20 messages/day = 200M messages/day.
Average Message Size: 200 bytes.
Daily Storage: 200M * 200B = ~40 GB/day.
Yearly Storage: ~14.6 TB. Cassandra handles this easily with sharding.
Bandwidth: 200M msgs / 86400s \approx 2,300 msgs/sec (average). Peak could be 10x (23k msgs/sec).

Blueprint

Concise Summary: A WebSocket-based architecture using a distributed Gateway layer for real-time bi-directional communication, backed by a NoSQL store for persistence and Redis for session tracking.
Major Components:
WebSocket Gateway: Maintains persistent TCP connections with clients to enable low-latency pushes.
Presence/Session Service: A Redis-backed service that tracks which user is connected to which Gateway node.
Chat Service: Orchestrates message validation, persistence, and routing logic.
Cassandra: Stores the actual chat messages, partitioned by conversation ID.
Kafka: Buffers messages for asynchronous tasks like push notifications for offline users.
Simplicity Audit: This design avoids complex service meshes or global locking, focusing on stateless application logic and a battle-tested NoSQL storage layer.
Architecture Decision Rationale:
Why this architecture?: WebSockets are the industry standard for real-time. Cassandra scales writes linearly, which is the primary bottleneck in chat.
Functional Satisfaction: Covers 1:1, presence, and history.
Non-functional Satisfaction: Scalable via horizontal addition of Gateway/Service nodes; low latency via persistent connections.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling:
WebSocket Gateways: Stateful but horizontally scalable. Scaling is based on connection count and memory usage.
Chat/Presence Services: Stateless, deployed in Multi-AZ clusters. Scaling based on CPU and Request-Per-Second (RPS).
API Schema Design:
POST /v1/messages: REST (for message sending if not using WS for upstream) or WS Frame.
GET /v1/history/{convId}: REST, returns paginated messages.
Idempotency: Client generates a client_msg_id (UUID) to prevent duplicates on retries.
Resilience & Reliability:
Retries: Exponential backoff for client reconnections.
Heartbeats: Clients send ping/pong every 30s to keep WS connections alive and update presence.
Security:
JWT-based Auth passed during WS handshake.
TLS 1.3 for all data in transit.

Storage

Access Pattern:
Heavy writes (new messages).
High volume of "read recent" (fetching history).
Database Table Design (Cassandra):
Table: messages
conversation_id (Partition Key): Groups messages of a chat.
created_at (Clustering Key, Descending): Allows fast retrieval of latest messages.
message_id, sender_id, content.
Technical Selection: Cassandra.
Rationale: Optimized for LSM-tree based writes. No global locks. Easy multi-region replication.
Distribution Logic:
Sharding by conversation_id. For very large groups (future-proofing), use conversation_id + bucket_id.

Cache

Purpose & Justification: Presence tracking and Session mapping. We need to know: Is User A online? and If yes, which Gateway node is User A on?
Key-Value Schema:
user_session:{user_id} -> {gateway_id, last_active_timestamp}.
Technical Selection: Redis.
Rationale: Sub-millisecond latency for heartbeat updates. Supports TTL for auto-expiring offline users.

Messaging

Purpose & Decoupling: Offloading non-critical tasks from the main message delivery path.
Event / Topic Schema: message_events topic containing sender_id, receiver_id, preview_text.
Failure Handling: Dead-letter queues (DLQ) for failed push notification attempts to FCM/APNs.
Technical Selection: Kafka.
Rationale: High throughput, allows multiple consumers (Push Service, Analytics, Search Indexer) to process the same message stream.
Wrap Up

Advanced Topics

Monitoring:
Metrics: Number of active WS connections, message delivery latency, Redis memory usage, Kafka lag.
Trade-offs:
Consistency vs. Availability: We choose Availability (AP in CAP). In rare network partitions, a user might see messages slightly out of order, which is acceptable in chat vs. losing messages.
Bottlenecks: The Redis Presence store could become a hotspot. Optimization: Use Redis Cluster with consistent hashing.
Failure Handling: If a Gateway node dies, clients detect the socket closure and reconnect to another node via the Load Balancer.