The Question
Design

Real-Time Chat Platform at Scale

Design a real-time community chat platform similar to Discord supporting text messaging, voice channels, and presence features. The system should handle 100,000 daily active users, deliver messages with low latency, and maintain server/channel organization with fine-grained permission controls.
Tech stack: WebSocket, Redis Pub/Sub, PostgreSQL, Redis Cache
Questions & Insights

Clarifying Questions

Q1: What is the expected message volume per user and the peak concurrency?
Assumption: Each user sends ~20 messages/day. 100k DAU translates to ~2M messages/day. Peak concurrent users (PCU) is ~10k.
Q2: Does the MVP require Voice/Video or just Text and Presence?
Assumption: Focus on Text messaging, Presence (Online status), and basic Media (Images/Files).
Q3: What are the data retention requirements?
Assumption: Messages are stored indefinitely.
Q4: How many "Servers" (Guilds) does a typical user belong to?
Assumption: Average 10 servers per user, with some large public servers (thousands of members).

Thinking Process

Core Bottleneck: Real-time message delivery to multiple concurrent clients across different servers.
Step 1: How do we maintain a persistent connection for low-latency delivery? (WebSockets).
Step 2: How do we route a message from User A to all members of Channel X who are currently online? (Pub/Sub pattern).
Step 3: How do we track who is "Online" without overwhelming the database? (Heartbeats in an in-memory store).
Step 4: How do we store and retrieve history efficiently as the message count grows? (Database indexing on channel_id + timestamp).

Bonus Points

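Causal Ordering: Using Lamport timestamps or Snowflake IDs so that every client sees the same message order. Lamport timestamps are logical counters and do not depend on wall clocks at all; Snowflake IDs embed a wall-clock timestamp, so they are only roughly time-ordered across nodes and stay monotonic per generator only while that generator's clock never moves backwards.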
Presence Optimization: Implementing "Lazy Loading" for presence lists in large servers (only show status for the top 100 members or those currently active) to save bandwidth.
Connection Draining: Strategy for WebSocket server updates—gracefully migrating 10k connections without a "thundering herd" effect on the handshake process.
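The Causal Ordering point above can be illustrated with a Lamport clock. This is a minimal sketch, not a production component; the class name and the two example nodes are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class LamportClock:
    """Logical clock: a counter that only moves forward."""
    time: int = 0

    def tick(self) -> int:
        """Advance before a local event such as sending a message."""
        self.time += 1
        return self.time

    def receive(self, remote_time: int) -> int:
        """Merge the timestamp carried by an incoming message."""
        self.time = max(self.time, remote_time) + 1
        return self.time


# Two hypothetical nodes; their wall clocks are never consulted.
node_a, node_b = LamportClock(), LamportClock()
sent_at = node_a.tick()      # A sends a message stamped 1
node_b.receive(sent_at)      # B receives it and jumps to 2
reply_at = node_b.tick()     # B's reply is stamped 3, after A's message
```

A reply always carries a larger timestamp than the message it responds to, which is the property a wall clock cannot guarantee under drift.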
Design Breakdown

Functional Requirements

Users can join/create Servers and Channels.
Real-time text messaging in channels.
Presence tracking (Online/Offline/Idle).
Basic media uploads (Images).
Persistent message history.

Non-Functional Requirements

Low Latency: Message delivery < 200ms.
Availability: 99.9% uptime.
Scalability: Support 100k DAU with a path to 1M.
Consistency: Messages must appear in the same order for all users in a channel.

Estimation

DAU: 100,000.
Writes (Messages): 100k * 20 = 2M msgs/day ≈ 23 msgs/second (Average). Peak ≈ 100-200 QPS.
Reads (History/Polling): Significantly higher due to channel switching, ~1k-2k QPS.
Storage: 2M msgs * 500 bytes ≈ 1 GB/day ≈ 365 GB/year.
Presence: 10k PCU sending heartbeats every 30s ≈ 333 QPS to Redis.
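The figures above can be reproduced with a few lines of arithmetic. The inputs are the assumptions already stated in this section; only the variable names are new.

```python
SECONDS_PER_DAY = 86_400

dau = 100_000
msgs_per_user_per_day = 20
avg_msg_bytes = 500
peak_concurrent_users = 10_000
heartbeat_interval_s = 30

msgs_per_day = dau * msgs_per_user_per_day                     # 2,000,000
avg_write_qps = msgs_per_day / SECONDS_PER_DAY                 # about 23.1
storage_gb_per_day = msgs_per_day * avg_msg_bytes / 1e9        # 1.0
storage_gb_per_year = storage_gb_per_day * 365                 # 365.0
heartbeat_qps = peak_concurrent_users / heartbeat_interval_s   # about 333.3
```

The quoted peak of 100-200 QPS corresponds to roughly 4x to 9x the average write rate.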

Blueprint

Concise Summary: A WebSocket-based real-time architecture using a distributed Pub/Sub (Redis) for message routing and a relational database (PostgreSQL) for persistence.
Major Components:
API Gateway/Load Balancer: Entry point for REST and WebSocket upgrades.
Gateway Service (WebSockets): Manages persistent connections and real-time pushes.
Chat Service: Handles business logic, permissions, and message persistence.
Presence Service: Tracks user status using heartbeats and Redis.
PostgreSQL: Stores relational data (users, servers) and message history.
Redis: Acts as a session store, presence cache, and Pub/Sub bus.
Simplicity Audit: This design avoids complex distributed actors (like Erlang/Elixir) or heavy stream processors (Kafka) in favor of standard Pub/Sub and a robust SQL DB, which is more than sufficient for 100k DAU.
Architecture Decision Rationale:
Why this architecture?: WebSockets are the industry standard for bidirectional, low-latency communication between browser and server. Redis Pub/Sub is a simple, in-memory way to bridge messages between Gateway nodes without adding a heavier broker.
Functional Satisfaction: Meets real-time delivery and history requirements.
Non-functional Satisfaction: High availability via redundant services; low latency via in-memory status/routing.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling:
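Gateway Service: Holds no state beyond its open socket connections, so it scales horizontally. An established WebSocket stays pinned to one node for its lifetime; the LB only needs "sticky sessions" or consistent hashing if a handshake spans several HTTP requests. Because every node receives channel traffic through Redis Pub/Sub, a client can connect to any node and still get its messages.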
API Service: Standard Node.js/Go/Python service for CRUD operations (joining servers, updating profiles).
API Spec:
POST /v1/channels/{id}/messages: Send message (REST fallback).
GET /v1/channels/{id}/messages: Fetch history (paginated).
WebSocket OpCodes: READY, MESSAGE_CREATE, PRESENCE_UPDATE, HEARTBEAT.
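The paginated history endpoint above could be backed by keyset ("cursor") pagination, which avoids the cost of large OFFSET scans. The sketch below uses SQLite as a stand-in for PostgreSQL so that it runs anywhere; the before_id cursor parameter, the index name, and the small integer IDs (standing in for Snowflake IDs) are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages ("
    " id INTEGER PRIMARY KEY,"        # stand-in for a Snowflake ID
    " channel_id INTEGER NOT NULL,"
    " content TEXT NOT NULL)"
)
# Composite index so one channel's history is read in ID order.
conn.execute(
    "CREATE INDEX idx_messages_channel_id ON messages (channel_id, id DESC)"
)
rows = [(i, 7, f"msg {i}") for i in range(1, 11)] + [(11, 8, "msg 11"), (12, 8, "msg 12")]
conn.executemany(
    "INSERT INTO messages (id, channel_id, content) VALUES (?, ?, ?)", rows
)


def fetch_history(conn, channel_id, before_id=None, limit=50):
    """Return up to `limit` messages older than `before_id`, newest first."""
    if before_id is None:
        cursor = conn.execute(
            "SELECT id, content FROM messages WHERE channel_id = ?"
            " ORDER BY id DESC LIMIT ?",
            (channel_id, limit),
        )
    else:
        cursor = conn.execute(
            "SELECT id, content FROM messages WHERE channel_id = ? AND id < ?"
            " ORDER BY id DESC LIMIT ?",
            (channel_id, before_id, limit),
        )
    return cursor.fetchall()


page1 = fetch_history(conn, 7, limit=3)
# The last ID of one page becomes the cursor for the next page.
page2 = fetch_history(conn, 7, before_id=page1[-1][0], limit=3)
```

Each page costs the same index range scan no matter how deep the client scrolls, which matters for channels with long histories.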

Storage

Data Model:
Users: id, username, password_hash, status, created_at.
Servers/Guilds: id, name, owner_id.
Channels: id, server_id, name, type (text/voice).
Messages: id (Snowflake), channel_id, author_id, content, type, created_at.
Database Logic:
Index on channel_id + created_at (descending) is critical for message history performance.
Use Snowflake IDs (64-bit) for messages to ensure time-based ordering across distributed systems without DB auto-increment bottlenecks.
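A minimal Snowflake-style generator might look like the sketch below. The bit layout (41-bit millisecond timestamp, 10-bit worker ID, 12-bit sequence) follows the commonly described Twitter-style scheme and is an assumption here, as are the custom epoch and all names.

```python
import threading
import time

EPOCH_MS = 1_420_070_400_000   # assumed custom epoch: 2015-01-01 00:00:00 UTC
WORKER_BITS = 10
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


class SnowflakeGenerator:
    def __init__(self, worker_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self._clock = clock            # injectable so the logic can be tested
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                # Refuse to issue IDs rather than break ordering.
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # 4096 IDs used in this millisecond: wait for the next one.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            return (
                ((now - EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS))
                | (self.worker_id << SEQUENCE_BITS)
                | self._sequence
            )


def timestamp_ms(snowflake):
    """Recover the creation time (Unix milliseconds) from an ID."""
    return (snowflake >> (WORKER_BITS + SEQUENCE_BITS)) + EPOCH_MS


gen = SnowflakeGenerator(worker_id=1)
ids = [gen.next_id() for _ in range(5)]   # increasing within one generator
```

IDs from a single generator are strictly increasing; IDs from different workers are ordered only as well as their clocks agree, so sorting by ID gives a consistent order rather than a causally exact one.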

Cache

Data Structures:
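Presence: One Redis key per user, presence:{user_id} -> {status, last_seen}, so that each entry carries its own TTL. (Fields inside a single shared Hash such as user_status cannot expire individually; per-field expiry only arrived in Redis 7.4.)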
Session: Redis String session:token -> user_id.
TTL/Eviction:
Presence keys expire if no heartbeat is received within 60-90 seconds.
Pub/Sub: Redis channels named channel:{channel_id}. Gateway nodes subscribe to channels where their connected users are active.
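The heartbeat-plus-TTL behaviour described for presence can be sketched without a Redis server. The class below is an in-memory stand-in, not Redis client code; the class name, the 90-second TTL, and the injectable clock are assumptions for illustration.

```python
import time


class PresenceStore:
    """In-memory stand-in for per-user Redis keys that carry a TTL."""

    def __init__(self, ttl_seconds=90, clock=time.monotonic):
        self.ttl = ttl_seconds
        self._clock = clock
        self._entries = {}   # user_id -> (status, expires_at)

    def heartbeat(self, user_id, status="online"):
        """Refresh the entry, comparable to SET presence:<id> <status> EX <ttl>."""
        self._entries[user_id] = (status, self._clock() + self.ttl)

    def status(self, user_id):
        """Return the stored status, or "offline" once the TTL has lapsed."""
        entry = self._entries.get(user_id)
        if entry is None or entry[1] <= self._clock():
            self._entries.pop(user_id, None)
            return "offline"
        return entry[0]


# A fake clock makes the expiry visible without sleeping.
now = [0.0]
store = PresenceStore(ttl_seconds=90, clock=lambda: now[0])
store.heartbeat("alice")
now[0] = 30.0
store.heartbeat("alice", "idle")        # refreshed at t=30, expires at t=120
now[0] = 100.0
status_at_100 = store.status("alice")   # still within the TTL
now[0] = 121.0
status_at_121 = store.status("alice")   # TTL lapsed, treated as offline
```

No explicit "user went offline" write is needed: a client that crashes or loses its network simply stops refreshing its key.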

Messaging

Topic Structure: One Redis Pub/Sub channel per Discord Channel ID.
Delivery:
When User A sends a message to Channel X:
Gateway receives via WS.
Persistence: Message saved to Postgres.
Fan-out: Message published to Redis channel:X.
Delivery: All Gateway nodes subscribed to channel:X receive the message and push to their local connected clients.
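Guarantees: The real-time push is at-most-once, because Redis Pub/Sub does not buffer messages for subscribers that are disconnected. Since every message is persisted before fan-out, a client can fetch anything it missed on reconnect, which gives effectively at-least-once delivery via sync.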
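The delivery flow above can be sketched with an in-memory bus standing in for Redis Pub/Sub and a list standing in for Postgres. All class and function names are hypothetical, and each client "socket" is modelled as a plain list.

```python
from collections import defaultdict


class InMemoryBus:
    """Stand-in for Redis Pub/Sub: fire-and-forget, no replay."""

    def __init__(self):
        self._subscribers = defaultdict(list)   # topic -> callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self._subscribers[topic]:
            callback(message)


class GatewayNode:
    def __init__(self, name, bus):
        self.name = name
        self.bus = bus
        self.local_clients = defaultdict(dict)   # channel_id -> {user_id: inbox}

    def connect(self, user_id, channel_id):
        """Register a client; subscribe once per channel per node."""
        first_local_client = channel_id not in self.local_clients
        inbox = []
        self.local_clients[channel_id][user_id] = inbox
        if first_local_client:
            self.bus.subscribe(
                f"channel:{channel_id}",
                lambda msg, c=channel_id: self._push(c, msg),
            )
        return inbox

    def _push(self, channel_id, message):
        for inbox in self.local_clients[channel_id].values():
            inbox.append(message)


def send_message(bus, history, channel_id, message):
    history.append((channel_id, message))           # persist first
    bus.publish(f"channel:{channel_id}", message)   # then fan out


def sync_since(history, channel_id, last_seen_id):
    """Reconnect path: read back what the real-time push may have missed."""
    return [m for c, m in history if c == channel_id and m["id"] > last_seen_id]


bus, history = InMemoryBus(), []
gw1, gw2 = GatewayNode("gw-1", bus), GatewayNode("gw-2", bus)
alice = gw1.connect("alice", "X")
bob = gw2.connect("bob", "X")
carol = gw2.connect("carol", "Y")
send_message(bus, history, "X", {"id": 1, "content": "hello"})
```

After the send, alice and bob each hold the message while carol's inbox stays empty; a client that was offline would recover it through sync_since.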
Wrap Up

Advanced Topics

Monitoring:
Prometheus/Grafana: Track WebSocket connection count, message latency (p99), and Redis memory usage.
ELK Stack: For log aggregation on failed message deliveries.
Trade-offs:
Consistency vs Availability: We prioritize Availability for presence (it's okay if status lags by a few seconds) but Consistency for message ordering within a channel via Snowflake IDs.
Bottlenecks:
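A single Redis instance executes commands on one main thread, so Pub/Sub becomes a bottleneck at extreme scale (>1M concurrent connections); at 100k DAU (~10k concurrent) it has ample headroom.
Large servers (e.g., 50k members) create a fan-out amplification problem: one message sent can turn into tens of thousands of WebSocket pushes.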
Failure Handling:
DB Replication: Primary-replica setup for Postgres with automated failover.
Client Reconnection: Exponential backoff for clients reconnecting to the Gateway to prevent slamming the server.
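Client reconnection with exponential backoff is usually combined with random jitter so that clients dropped at the same moment do not retry in lockstep. A minimal sketch, assuming a 1-second base delay, a 60-second cap, and the "full jitter" variant; the function name is hypothetical.

```python
import random


def backoff_delays(max_attempts, base=1.0, cap=60.0, rng=random.random):
    """Yield one randomized delay in seconds per reconnect attempt."""
    for attempt in range(max_attempts):
        ceiling = min(cap, base * 2 ** attempt)
        # Full jitter: pick uniformly between 0 and the exponential ceiling.
        yield rng() * ceiling


delays = list(backoff_delays(8))
```

A client would sleep for each delay in turn before retrying the WebSocket handshake, and reset the attempt counter after a successful connection.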
Alternatives & Optimization:
Alternative Storage: Cassandra or ScyllaDB would suit message storage if the scale were 100x larger, but they add operational complexity that is unnecessary at 100k DAU.
Optimization: Use a Content Delivery Network (CDN) like CloudFront for S3-stored images to reduce latency for media.