The Question

Real-Time Chat & Presence System

Design a globally scalable real-time chat system supporting 100 million daily active users. The system must handle 1:1 and group conversations, provide persistent message history, and maintain a real-time 'online/offline' presence status. Address the challenges of maintaining millions of concurrent connections, minimizing message delivery latency, and efficiently managing the high-frequency write load associated with presence heartbeats.

WebSocket

Redis

Cassandra

Kafka

gRPC

NoSQL

Pub/Sub

Questions & Insights

Clarifying Questions

Scale: What is the expected scale (DAU, peak concurrent connections)?

Features: Do we need to support 1:1 chat only, or are group chats and media (images/video) in scope?

Online Status (Presence): Should the status be real-time, or is a slight delay (e.g., 30s) acceptable? Does "Online" mean app-open, or active-typing?

Retention: Should message history be stored forever or for a limited period?

Assumptions for MVP: 100M DAU, 1:1 and small group chats (up to 100 people), 500-byte average message size, real-time presence via heartbeats (30s interval), messages stored indefinitely.

Thinking Process

Core Bottleneck: How do we maintain millions of persistent WebSocket connections while tracking presence status across multiple server nodes?

Data Flow: How is a message routed from User A to User B, ensuring low latency and reliable delivery?

Presence Scaling: How do we handle the massive write volume of "heartbeats" (3.3M QPS if 100M users heartbeat every 30s)?

Architecture Flow: Start with a WebSocket Gateway for real-time duplex communication, use a Distributed Cache (Redis) for volatile presence state, and a NoSQL Database (Cassandra) for high-write message storage.

Bonus Points

Last Seen Optimization: Instead of updating "Last Seen" every 30s for every user, only update it on specific events (message sent, app closed) and use a coarser-grained timestamp to reduce DB write pressure.

WebSocket Session Affinity: Use a "Session Service" or consistent hashing to quickly locate which gateway node User B is connected to, avoiding broadcast storms.

Causal Ordering: Discuss using Lamport Timestamps or Client-side sequence numbers to ensure messages appear in the correct order despite network jitter.

Fan-out Strategy: For group chats, differentiate between "push-on-write" for small groups and "pull-on-read" for massive channels (broadcasts).

Design Breakdown

Functional Requirements

Core Use Cases:

1:1 and Group messaging.

Real-time delivery via WebSockets.

Online/Offline status and "Last Seen" timestamps.

Message history retrieval.

Scope Control:

In-scope: Text messaging, presence, delivery/read receipts.

Out-of-scope: End-to-end encryption (E2EE) key exchange, voice/video calls, file storage (CDN-side), and message search.

Non-Functional Requirements

Scale: Support 100M DAU and 10M+ concurrent connections.

Latency: Sub-200ms end-to-end message delivery.

Availability & Reliability: 99.99% availability; messages must not be lost once acknowledged by the server.

Consistency: Strong ordering for messages within a single conversation; eventual consistency for presence status.

Security: TLS for all connections; authentication via JWT.

Estimation

Traffic Estimation:

100M DAU, each sending 10 messages/day = 1B messages/day.

Average Message QPS:

1,000,000,000 / 86,400 \approx 11.5k

QPS.

Peak Message QPS (3x):

\approx 35k

QPS.

Presence Heartbeats: 100M users / 30s heartbeat = 3.3M QPS (High write load).

Storage Estimation:

1B messages/day * 500 bytes = 500 GB/day.

Yearly storage:

\approx 180

TB.

Bandwidth Estimation:

Ingress: 35k pk QPS * 500 bytes

\approx 17.5

MB/s.

Blueprint

Concise Summary: A WebSocket-based real-time architecture utilizing a distributed session store (Redis) for presence and a wide-column store (Cassandra) for message persistence.

Major Components:

WebSocket Gateway: Manages long-lived TCP connections and routes incoming/outgoing messages.

Presence Service: Tracks online status via heartbeat TTLs in Redis.

Chat Service: Handles message logic, persistence, and fan-out to recipients.

Redis: Stores active session locations and presence status for low-latency lookups.

Cassandra: Persists chat history with optimized schema for chronological retrieval.

Kafka: Decouples message delivery from heavy async tasks like Push Notifications.

Simplicity Audit: This design avoids complex distributed locking by using TTLs for presence and relies on NoSQL horizontal scaling for storage.

Architecture Decision Rationale:

Why?: WebSockets are essential for bi-directional low latency. Cassandra is chosen because chat is write-heavy and naturally maps to partition keys (ChatID) and clustering keys (MessageID/Timestamp).

Functional Satisfaction: Meets all real-time and history requirements.

Non-functional Satisfaction: Scalable via sharding (Cassandra/Redis Cluster) and highly available via stateless service layers.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing:

DNS uses latency-based routing to the nearest regional Data Center.

L7 Load Balancer (e.g., Nginx/AWS ALB) handles SSL termination and distributes WebSocket connections based on least-connections.

Security & Perimeter:

API Gateway: Performs JWT validation before upgrading the connection to WebSocket.

Rate Limiting: Applied at the Gateway level to prevent heartbeat spam (e.g., max 1 heartbeat per 20s).

Service

Topology & Scaling:

WebSocket Gateway: Stateful in terms of TCP connections, but stateless regarding business logic. Scaled based on memory (max open files) and CPU.

Chat/Presence Services: Stateless microservices, scaled via HPA (Horizontal Pod Autoscaler) on CPU/Request count.

API Schema Design:

Send Message: POST /v1/messages (Internal/RPC)

Payload: {senderId, receiverId, groupId, content, type, tempId}

Get History: GET /v1/messages/{conversationId}?limit=50&cursor={ts}

Resilience & Reliability:

Retries: Clients use exponential backoff for WebSocket reconnections.

Ack Mechanism: Gateway waits for Chat Service to persist message in Cassandra before sending an "ACK" to the sender.

Storage

Access Pattern: High write (new messages) and high read (loading history). Presence is extreme write/read.

Database Table Design (Cassandra):

Table: messages_by_conversation

conversation_id (Partition Key)

created_at (Clustering Key, Descending)

message_id (TimeUUID)

sender_id, content

Technical Selection: Cassandra is chosen for its linear scalability and performance in time-series-like workloads.

Distribution Logic: Partitioning by conversation_id ensures all messages for a single chat are on the same physical node/partition, making history fetches extremely fast.

Cache

Purpose & Justification: Redis handles two critical bottlenecks:

Presence: Storing user_id -> status (Online/Offline/LastSeen).

Session Map: Storing user_id -> gateway_node_id to route messages to the correct WebSocket server.

Key-Value Schema:

presence:{user_id}: Value online, TTL 60s (updated by 30s heartbeats).

session:{user_id}: Value gateway_ip.

Failure Handling: If Redis fails, presence defaults to "Offline." Use Redis Cluster for high availability.

Messaging

Purpose & Decoupling: Kafka is used to ingest sent messages for downstream consumers.

Event Schema: MessageCreatedEvent containing the full message payload.

Throughput & Partitioning: Partitioned by conversation_id to maintain ordering for any specific chat.

Failure Handling: Dead-letter queue (DLQ) for failed push notification attempts.

Infrastructure (Optional)

Observability: Prometheus for metrics (active WebSocket count, message latency), ELK for logs, and Jaeger for tracing message flow across microservices.

Wrap Up

Advanced Topics

Trade-offs (PACELC): For Presence, we favor Availability over Consistency (PA/EL). It is okay if a user sees a friend "Online" for 30s after they disconnect. For Messages, we favor Consistency and Partition Tolerance (CP/EC) to ensure order.

Reliability: If the WebSocket Gateway node crashes, clients reconnect to a different node. Redis session mapping is updated upon reconnection.

Bottleneck Analysis: The biggest bottleneck is the Redis write volume for presence. Optimization: Implement "Presence Interest Groups." Only send presence updates to "friends" who are currently online and looking at their chat list, rather than broadcasting to all followers.

Security: Use TLS 1.3 for all traffic. Sanitize all message content on the server side to prevent XSS/Injection.