The Question

Real-time Chat Platform Design

Design a real-time chat application capable of supporting 10 million daily active users and 1 million concurrent connections. The system must support instant 1-on-1 messaging, group chats with up to 500 members, and real-time user presence (online/offline status). Ensure the design addresses message persistence, low-latency delivery, and synchronization across multiple devices, while discussing the trade-offs between consistency and availability in a distributed environment.

WebSockets

Cassandra

Redis

Kafka

Protobuf

Snowflake ID

JWT

Questions & Insights

Clarifying Questions

What is the scale of the system (DAU, Peak Concurrent Users)?

Assumption: 10 million DAU, 1 million Peak Concurrent Users (PCU).

What is the maximum group size for chat rooms?

Assumption: Up to 500 members for the MVP to avoid massive fan-out complexity.

Do we need to support message history and synchronization across multiple devices?

Assumption: Yes, users expect to see their chat history when logging in from a new device.

Is "Presence" (Online/Offline status) a requirement for the MVP?

Assumption: Yes, it is a core expectation for real-time chat platforms.

What types of messages are supported?

Assumption: Text and small metadata for images (actual image storage handled via S3/CDN URLs).

Thinking Process

To design an end-to-end real-time chat system, we must solve for connection persistence, message routing, and reliable storage.

How do we maintain millions of persistent connections? We use the WebSocket protocol for bi-directional communication, managed by a dedicated fleet of stateful "Connection Servers."

How does a message from User A reach User B who is connected to a different server? We implement a "Message Router" and a "Presence Service" backed by a Pub/Sub mechanism to locate and forward messages to the correct connection node.

How do we ensure messages are never lost and stay in order? We use a distributed NoSQL database (like Cassandra) with a unique, monotonically increasing ID (Snowflake) to maintain chronological timelines.

How do we handle group chat efficiency? For the MVP, we use a fan-out-on-write approach for small groups, pushing messages to a message queue for delivery to all active members.

Bonus Points

Causality and Ordering: Using Vector Clocks or Logical Sequence Numbers (LSN) to handle message interleaving in distributed environments where physical clocks drift.

Last-Write-Wins (LWW) resolution: Applying LWW in Cassandra for eventual consistency in message status updates (e.g., read receipts).

Binary Protocols: Implementing Protocol Buffers (Protobuf) over WebSockets instead of JSON to reduce packet size and serialization latency on mobile devices.

Multi-Region Pinning: Pinning users to the geographically closest data center and using a global backbone (e.g., AWS Global Accelerator) to minimize RTT.

Design Breakdown

Functional Requirements

Core Use Cases:

One-to-one real-time messaging.

Group chat messaging (up to 500 members).

Online/Offline presence status.

Message delivery and read receipts.

Persistent message history.

Scope Control:

In-scope: Text messaging, delivery status, presence.

Out-of-scope: End-to-end encryption (E2EE) key management, video/voice calls, message search, stickers/GIF library.

Non-Functional Requirements

Scale: Support 1M concurrent WebSocket connections.

Latency: End-to-end message delivery under 200ms (P99).

Availability & Reliability: 99.99% availability; zero message loss.

Consistency: Eventual consistency for presence; strong ordering for message history within a single conversation.

Fault Tolerance: Automatic reconnection and session resumption for mobile clients on flaky networks.

Security: TLS for all connections; OAuth2 for authentication.

Estimation

Traffic Estimation:

10M DAU x 50 messages/day = 500M messages/day.

Average QPS: ~5,800 msgs/sec.

Peak QPS: ~12,000 msgs/sec.

Storage Estimation:

Average message size: 200 bytes.

Daily storage: 500M * 200B = 100 GB.

5-year storage: 100 GB 365 5 ≈ 182 TB (including replicas).

Bandwidth Estimation:

Outgoing: 12,000 msgs/sec * 200B = 2.4 MB/s (easily handled by standard instances).

Blueprint

Concise Summary: The system utilizes a fleet of WebSocket Gateways to maintain persistent connections, a stateless Chat Service to handle business logic, and Cassandra for high-write message persistence.

Major Components:

WebSocket Gateway: Stateful fleet managing persistent bi-directional connections and session mapping.

Presence Service: Redis-backed service tracking user heartbeats and online status.

Chat Service: Stateless service for message validation, ID generation, and routing.

Message Store: Distributed wide-column store (Cassandra) for storing conversation timelines.

Pub/Sub (Redis/Kafka): Bridges messages between different WebSocket Gateway nodes.

Simplicity Audit: This design avoids complex distributed locking by relying on Cassandra's LWW and local connection pinning, keeping the MVP highly performant.

Architecture Decision Rationale:

Why this architecture?: WebSockets provide the lowest latency for bi-directional traffic compared to long-polling. NoSQL (Cassandra) is chosen for its linear scalability and high-write throughput which suits chat logs.

Functional Requirement Satisfaction: Handles 1:1 and Group chats via message routing; Presence is handled by a fast K/V store.

Non-functional Requirement Satisfaction: Scalability is achieved by adding stateless Chat Services and horizontal sharding in Cassandra.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing

Use Anycast DNS for routing users to the nearest regional PoP.

L7 Load Balancer (ALB) handles SSL/TLS termination and distributes traffic to WebSocket Gateways using sticky sessions or consistent hashing based on UserID.

Security & Perimeter

API Gateway: Performs JWT validation and basic rate limiting (e.g., 20 requests/sec per user) to prevent spam.

Service

Topology & Scaling

WebSocket Gateway (Stateful): Horizontally scaled based on memory/connection count. Since connections are stateful, we use a Session Registry (Redis) to map UserID -> ServerID.

Chat Service (Stateless): Scales based on CPU/Request count.

API Schema Design

POST /v1/messages: Sends a message (REST/JSON for initial send, or Protobuf over WS).

GET /v1/history/{convId}: Paginated history retrieval using a cursor (timestamp + ID).

Resilience & Reliability

Exponential Backoff: Clients use jittered retries to reconnect if a WebSocket Gateway fails.

Sequence Numbers: Every message is assigned a monotonically increasing ID by the service to ensure clients can detect gaps.

Storage

Access Pattern

High write-to-read ratio (every message is a write; reads happen during catch-up or scroll-back).

Queries: SELECT * FROM messages WHERE conversation_id = ? AND message_id < ? ORDER BY message_id DESC LIMIT 50.

Database Table Design

Table: messages

conversation_id: UUID (Partition Key)

message_id: TimeUUID/Snowflake (Clustering Key - Descending)

sender_id: UUID

content: Text

created_at: Timestamp

Technical Selection

Cassandra: Chosen for its wide-column model which allows efficient storage of time-series message data and easy horizontal scaling without downtime.

Distribution Logic

Partitioned by conversation_id. This ensures all messages for a single chat (1:1 or small group) are co-located on the same shard for fast reads.

Cache

Purpose & Justification: Presence tracking requires extremely low latency and high frequency updates (heartbeats every 30s).

Key-Value Schema:

user:presence:{userId} -> status: "online", last_active: timestamp.

user:session:{userId} -> gateway_node_id.

Technical Selection: Redis. Support for TTL (Time-To-Live) allows presence to naturally expire to "offline" if a heartbeat is missed.

Messaging

Purpose & Decoupling: Acts as the "glue" between WebSocket nodes. If User A is on Gateway 1 and User B is on Gateway 2, the message travels: G1 -> Kafka -> G2 -> User B.

Event / Topic Schema: Topic chat_messages partitioned by receiver_id to ensure ordered delivery to the same consumer.

Technical Selection: Kafka. High throughput and the ability to replay messages if a WebSocket node crashes.

Data Processing

Processing Model: Stream processing for push notifications.

Processing DAG: Kafka Source -> Push Filter (if recipient offline) -> Notification Formatter -> SQS Push Queue.

Technical Selection: Custom Go/Java Workers. For an MVP, complex frameworks like Spark are overkill; lightweight consumers are sufficient.

Wrap Up

Advanced Topics

Trade-offs (PACELC): In the event of a network partition, we favor Availability (AP). A chat system is better off allowing users to send messages that might arrive slightly out of order than preventing them from sending messages at all.

Reliability & Failure Handling:

Zombie Connections: WebSocket Gateways use "Keep-Alive" pings. If the client doesn't pong, the connection is closed and presence is updated.

Message Delivery Guarantee: "At-least-once" delivery is achieved through application-level ACKs. If the client doesn't ACK a message, the server retries.

Bottleneck Analysis:

Hot Partitions: Large group chats (e.g., 100k members) would create hot partitions in Cassandra and Kafka. The MVP limits groups to 500 to mitigate this. For larger groups, we would implement "sharded partitions."

Distinguishing Insights:

Pull vs Push: While messages are pushed via WebSockets, the "catch-up" on app launch should be a Pull mechanism. This prevents a "thundering herd" of pushes when a user reconnects after being offline for hours.