The Question
DesignReal-time Messaging System Architecture
Design a high-scale, real-time communication platform similar to Discord. The system must support millions of concurrent users, provide persistent message history, manage user presence states, and handle large-scale message broadcasting within partitioned groups (servers/channels) with sub-second latency.
WebSocket
ScyllaDB
Redis Pub/Sub
PostgreSQL
Snowflake ID
Questions & Insights
Clarifying Questions
Scale: What are the target DAU (Daily Active Users) and PCU (Peak Concurrent Users)?
Feature Scope: Should the MVP include Voice/Video, or focus strictly on real-time text and presence?
Server Size: What is the maximum number of members in a single "Server" (Guild)?
Data Retention: Do we need to store message history indefinitely, or is there a TTL?
Consistency vs. Availability: In the event of a partition, is it more important that the system stays up (AP) or that message ordering is strictly enforced across all replicas (CP)?
Assumptions for MVP:
Scale: 10M DAU, 1M PCU.
Scope: Real-time text messaging, Server/Channel organization, and Presence (online/offline status). Voice/Video is out of scope for the MVP.
Server Size: Support up to 100,000 members per server.
Retention: Permanent storage of messages.
Latency: Message delivery < 200ms globally.
Thinking Process
Core Strategy: Use a persistent WebSocket-based Gateway to handle real-time bi-directional communication, paired with a high-throughput NoSQL database for message persistence.
Bottleneck 1: Connection Management. How do we handle 1M+ concurrent WebSocket connections?
Solution: Stateless Gateway nodes behind a Load Balancer, using a distributed Pub/Sub (Redis) to route messages to the correct node.
Bottleneck 2: Message Fan-out. How do we notify 50,000 members in a channel when one person speaks?
Solution: Implement a Fan-out pattern where the Message Service publishes to a Topic, and the Gateway nodes subscribe only to the channels their currently connected users are watching.
Bottleneck 3: Presence Scaling. How do we update everyone's status without O(N^2) complexity?
Solution: Use "Lazy Loading" for member lists and a Heartbeat mechanism in Redis to track sessions.
Bonus Points
ScyllaDB for Message Storage: Leveraging ScyllaDB (C++ rewrite of Cassandra) for predictable P99 latency and high-density storage, which Discord famously uses to handle trillions of messages.
Discord’s "Manifold" Pattern: Implementing a specialized service to aggregate presence updates to prevent "thundering herd" issues when a high-profile user goes online.
Causal Consistency: Ensuring that while the system is highly available, message dependencies (replies) maintain logical ordering via Snowflake IDs (64-bit distributed sequence numbers).
Edge PoPs: Deploying Gateway nodes at the edge to terminate TLS closer to the user, reducing the RTT (Round Trip Time) for the WebSocket handshake.
Design Breakdown
Functional Requirements
Users can create Servers and Channels.
Users can join/leave Servers.
Real-time text messaging within Channels.
Message history (scrolling back in time).
Presence (Online, Idle, DND, Offline status).
Non-Functional Requirements
Low Latency: Real-time feel for chat.
High Availability: System must be 99.9% available; chat is mission-critical.
Scalability: Must handle spikes when big gaming events occur.
Persistence: Messages must never be lost once acknowledged.
Estimation
Storage: 10M DAU * 20 messages/day = 200M messages/day. At ~200 bytes/message ≈ 40GB/day.
Connections: 1M PCU requires ~50-100 Gateway servers (assuming 10k-20k connections per node).
Pub/Sub Traffic: 200M messages/day. If average channel size is 50, that's 10B fan-out events/day.
Blueprint
Concise Summary: A microservices architecture centered around a WebSocket Gateway for real-time delivery, using PostgreSQL for relational metadata and ScyllaDB for high-volume message history.
Major Components:
Gateway Service: Manages persistent WebSocket connections and pushes real-time events to clients.
API Service (Rest): Handles CRUD operations for servers, channels, and user profiles.
Message Service: Orchestrates message validation, persistence, and fan-out triggering.
Presence Service: Tracks user heartbeats and broadcasts status changes.
Simplicity Audit: This design avoids complex stream processing (Flink) and heavy message brokers (Kafka) in favor of Redis Pub/Sub, which is sufficient for real-time fan-out at the target MVP scale.
Architecture Decision Rationale:
Why this architecture is the best?: It separates the high-frequency real-time path (WebSockets/Redis) from the persistent storage path (ScyllaDB), allowing independent scaling of connections vs. storage.
Functional Requirement Satisfaction: WebSocket Gateway ensures real-time delivery; ScyllaDB ensures history.
Non-functional Requirement Satisfaction: Horizontal scaling of Gateway nodes ensures availability and low latency.
High Level Architecture
Sub-system Deep Dive
Service
Topology:
Gateway Nodes: Deployed in multiple regions. They are stateful regarding the connection but stateless regarding user data.
API/Message Services: Standard Kubernetes deployments, auto-scaled based on CPU/Request count.
API Spec:
POST /v1/channels/{id}/messages: Send a message.GET /v1/channels/{id}/messages: Fetch history (paginated).WS /gateway: Upgrade to WebSocket for real-time stream.Storage
Data Model:
PostgreSQL:
Servers (ID, Name, Owner), Channels (ID, ServerID, Type), Members (UserID, ServerID, Role).ScyllaDB:
Messages table partitioned by channel_id with message_id (Snowflake) as the clustering key for ordered retrieval.Database Logic: Use Snowflake IDs for messages to ensure K-sortable ordering across distributed partitions without a centralized lock.
Cache
Data Structures:
Presence: Redis Hash mapping
user_id -> {status, last_active}.Session: Redis String mapping
session_token -> user_id for fast Gateway authentication.TTLs: 30-60 second TTL for heartbeats; if the heartbeat expires, the user is marked "Offline".
Messaging
Topic Structure: One Redis Pub/Sub topic per "Channel ID".
Delivery: When a message is sent, the Message Service writes to ScyllaDB and then publishes to the Redis Topic. Gateway nodes subscribe to topics for all channels currently visible to their connected users.
Guarantees: At-least-once delivery. Clients handle deduplication via Message IDs.
Wrap Up
Advanced Topics
Monitoring:
Prometheus/Grafana: Track WebSocket connection count, message lag (time from publish to receive), and ScyllaDB write latencies.
Trade-offs:
Consistency vs Availability: We choose AP (Availability) for presence. It’s okay if a user sees a "Online" status that is 5 seconds stale, but the system must not hang.
Bottlenecks:
Redis Pub/Sub: At massive scale (100M+ users), a single Redis cluster for Pub/Sub might become a bottleneck. Optimization: Cluster by Channel ID or migrate to a distributed bus like NATS.
Failure Handling:
Gateway Reconnection: If a Gateway node dies, clients perform exponential backoff to reconnect to a healthy node.
Alternatives:
MongoDB: Could be used for messages, but ScyllaDB offers better performance for the specific write-heavy, time-series-like nature of chat history.