The Question
DesignScalable Real-time Messaging & Presence System
Design a highly available 1:1 chat system capable of supporting millions of concurrent users. The system must handle real-time message delivery, provide accurate user online/offline status indicators, and allow users to retrieve their message history efficiently. Focus on managing long-lived connections and optimizing presence update propagation.
WebSocket
Redis
Kafka
Cassandra
Microservices
Questions & Insights
Clarifying Questions
What is the expected scale? (Assumption: 10M DAU, 1B messages/day, peak 100k concurrent connections).
What are the core features for the MVP? (Assumption: 1:1 private messaging and real-time online/offline status only. Group chats and media are out of scope for MVP).
What is the message retention policy? (Assumption: Messages are stored permanently for history).
How "real-time" must the online status be? (Assumption: A lag of 30-60 seconds for status updates is acceptable to prevent "thundering herd" issues on the database).
Thinking Process
Core Bottleneck: Maintaining stateful WebSocket (WS) connections for millions of users while efficiently propagating presence updates without saturating the network.
Strategy:
Use a WebSocket Gateway layer to maintain persistent connections.
Implement a Presence Service backed by an in-memory TTL-based store (Redis) to track heartbeats.
Decouple message delivery from storage using a Message Queue to ensure high availability and prevent loss during spikes.
Use a Wide-column NoSQL database for chat history to handle high-write throughput and time-series-like access patterns.
Bonus Points
Presence Optimization: Instead of pushing every status change to all "friends," use a Pull-on-Demand + WebSocket Push hybrid. Only push status changes to active conversation participants to save bandwidth.
Last-Write-Wins (LWW) Conflict Resolution: Use client-side sequence numbers combined with server-side timestamps to handle message ordering in a distributed environment.
Connection Draining: Implement graceful shutdown for WebSocket nodes by notifying clients to reconnect with an exponential backoff to avoid a "reconnection storm."
Geo-sharding: Discussing the use of Latency-based DNS routing to connect users to the nearest regional data center to minimize chat RTT (Round Trip Time).
Design Breakdown
Functional Requirements
Users can send and receive 1:1 messages in real-time.
Users can see the "Online/Offline" status of their contacts.
Users can retrieve message history (pagination).
Messages must be delivered even if the recipient is offline (stored for later).
Non-Functional Requirements
Low Latency: Message delivery < 200ms.
High Availability: 99.99% uptime for the messaging backbone.
Scalability: Must handle 100k+ concurrent WebSocket connections per node.
Consistency: Messages must be ordered chronologically within a conversation.
Estimation
Storage: 1B messages/day * 200 bytes/msg ≈ 200 GB/day. ~73 TB/year.
Presence Traffic: 10M DAU sending heartbeats every 30s = ~333k requests per second (RPS) to the Presence Service.
WebSocket Memory: 100k connections per server * 10KB/connection ≈ 1GB RAM per WS node (well within modern server limits).
Blueprint
Concise Summary: A microservices architecture utilizing WebSockets for real-time delivery, Redis for ephemeral presence state, and Cassandra for persistent chat logs.
Major Components:
WebSocket Gateway: Manages long-lived bi-directional connections and routes incoming messages to the internal bus.
Presence Service: Tracks user heartbeats in Redis and provides status information to the UI.
Message Service: Handles message persistence, history retrieval, and business logic (e.g., block lists).
Message Queue (Kafka): Buffers messages to decouple the real-time ingestion from the slower database write operations.
Simplicity Audit: This design avoids complex "service meshes" or "global locks," relying on horizontal scaling and standard asynchronous patterns.
Architecture Decision Rationale:
Why this architecture is the best for this problem?: WebSockets provide the lowest latency for bi-directional communication, and Redis's TTL mechanism is the industry standard for lightweight presence tracking.
Functional Requirement Satisfaction: WebSocket handles real-time delivery; Redis handles status; Cassandra handles history.
Non-functional Requirement Satisfaction: Using Kafka ensures that even if the DB is slow, the messaging service remains responsive.
High Level Architecture
Sub-system Deep Dive
Service
Topology: WebSocket nodes are stateless regarding business logic but stateful regarding connections. They use a Pub/Sub (Redis or Kafka) to route messages to the specific node where a recipient is connected.
API Spec:
POST /v1/messages: Send a message (fallback if WS fails).GET /v1/history/{user_id}?limit=50&offset=X: Fetch paginated chat history.WS /v1/chat: WebSocket upgrade endpoint.Storage
Data Model: Cassandra Table
messages: partition_key: conversation_id, clustering_key: message_id (time-uuid).Database Logic: Partitioning by
conversation_id ensures all messages between two people are stored on the same node for fast retrieval.Cache
Presence Storage:
SET user_id:status "online" EX 60. If the heartbeat isn't received within 60s, the key expires and the user is considered offline.Metadata Cache: Store
user_id -> ws_node_id mapping to route messages to the correct server.Messaging
Kafka Topics:
incoming-messages (partitioned by conversation_id).Guarantees: At-least-once delivery. Deduplication is handled at the client-side using
message_id.Wrap Up
Advanced Topics
Monitoring:
Metrics: WebSocket connection count, message latency (P99), Kafka consumer lag.
Tools: Prometheus for metrics, Grafana for visualization.
Trade-offs: Eventual Consistency for Presence. A user might show as "online" for up to 30-60 seconds after losing connection. This is preferred over "Strong Consistency" which would require expensive distributed locking.
Bottlenecks: The WebSocket Gateway is a stateful bottleneck. If a node crashes, all connected users must reconnect simultaneously.
Failure Handling:
WS Node Failure: Clients detect disconnect and attempt reconnection to a different node via the Load Balancer.
Redis Failure: Use Redis Sentinel or Cluster for high availability.
Alternatives:
Firebase Realtime DB: Good for very small apps, but lacks control over data sharding and costs for 10M DAU.
DynamoDB: An alternative to Cassandra if the system is hosted on AWS, offering managed scaling.