The Question
Design a Massive-Scale Instant Messaging and Social Platform
Design a highly available and scalable communication system similar to WeChat. The system must support real-time 1:1 and group messaging for over 1 billion users, a social feed (Moments) with a complex social graph, and real-time presence tracking. Address the challenges of maintaining hundreds of millions of concurrent persistent connections, ensuring message delivery guarantees (no loss, correct ordering), and managing high-throughput write-heavy workloads for social updates. Discuss the trade-offs between push and pull models for feed distribution and the protocols used for mobile network resilience.
Cassandra
Redis
Kafka
WebSocket
gRPC
Flink
ZooKeeper
CDN
TLS
Anycast DNS
Questions & Insights
Clarifying Questions
Scale and Traffic: What is the expected Daily Active User (DAU) count and peak concurrent users (PCU)?
Assumption: 1 Billion DAU, 200 Million PCU, 100 Billion messages per day.
Core Features: Should we focus on the entire "Super App" or specific pillars?
Assumption: Focus on Instant Messaging (1:1 and Group Chat) and "Moments" (Social Feed). Payments and Mini-programs are out-of-scope for the MVP.
Message Types: What types of content are supported?
Assumption: Text, images, and short videos.
Data Retention: Are messages stored forever or for a specific duration?
Assumption: Messages are stored permanently but cold-storage strategies apply for older data.
Latency Requirements: What is the acceptable delay for message delivery?
Assumption: Under 200ms for 95th percentile (P95) in the same region.
Thinking Process
Core Bottleneck 1: Real-time Connection Management: How do we maintain 200M+ persistent connections efficiently without exhausting server resources?
Core Bottleneck 2: Message Ordering & Reliability: How do we ensure messages arrive in the correct order and are never lost, so that delivery feels exactly-once to the user (at-least-once delivery combined with deduplication)?
Core Bottleneck 3: Social Graph Fan-out: How do we handle "Moments" updates when a user has 5,000 friends (Write-heavy vs. Read-heavy tradeoffs)?
Progressive Path:
Establish a Gateway Layer to handle long-lived connections (WebSockets/TCP).
Implement a "Sync-Check" or "Mailbox" mechanism for message delivery.
Design a distributed ID generator for global causal ordering.
Optimize the "Moments" feed using a hybrid push/pull model based on user activity.
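Step 3 of the path mentions a distributed ID generator without naming a scheme. One common shape is a Snowflake-style generator; the sketch below is an assumption rather than something this design specifies, and the bit widths, the custom epoch, and the `SnowflakeIdGenerator` name are all illustrative. IDs of this kind are only roughly time-ordered across machines, so they are not a causal order on their own; a per-mailbox sequence number would still have to define the order in which a user sees messages.

```python
import threading
import time


class SnowflakeIdGenerator:
    """Snowflake-style 64-bit IDs: timestamp | worker id | per-ms sequence."""

    EPOCH_MS = 1_600_000_000_000  # assumption: arbitrary custom epoch
    WORKER_BITS = 10              # assumption: up to 1024 generator nodes
    SEQ_BITS = 12                 # assumption: up to 4096 IDs per ms per node

    def __init__(self, worker_id, clock=None):
        if not 0 <= worker_id < (1 << self.WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self._clock = clock or (lambda: int(time.time() * 1000))
        self._lock = threading.Lock()
        self._last_ms = -1
        self._seq = 0

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                # Clock went backwards: keep using the last timestamp so
                # IDs from this node never decrease.
                now = self._last_ms
            if now == self._last_ms:
                self._seq = (self._seq + 1) & ((1 << self.SEQ_BITS) - 1)
                if self._seq == 0:
                    # Sequence exhausted for this millisecond: advance the
                    # logical timestamp instead of spinning.
                    now = self._last_ms + 1
            else:
                self._seq = 0
            self._last_ms = now
            return (
                ((now - self.EPOCH_MS) << (self.WORKER_BITS + self.SEQ_BITS))
                | (self.worker_id << self.SEQ_BITS)
                | self._seq
            )
```

Each node needs a unique `worker_id`; how that is assigned (for example through the coordination service) is left open here.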
Bonus Points
Protocol Optimization: Mentioning the use of a custom binary protocol (like WeChat's Mars) over TCP/UDP to handle mobile network volatility (tailored heartbeat, quick reconnect).
Intelligent Diff-Sync: Instead of sending full message lists, use state synchronization (SyncKey) to only fetch the delta between the client's last known state and the server's current state.
Shared Timeline for Group Chats: Use a per-group "Timeline" rather than per-member inboxes for group chats to avoid N-write amplification (writing once per group rather than once per member); each member then only tracks a read position in the group timeline.
Multi-region Geo-local Storage: Storing message data close to the user's geographical location to comply with data sovereignty (e.g., GDPR) and reduce latency.
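The group-chat point above (one write per group instead of one per member) can be illustrated with a small in-memory sketch. The `GroupTimeline` class and its method names are hypothetical, and a real system would persist both the log and the per-member read positions.

```python
from collections import defaultdict


class GroupTimeline:
    """One append-only log per group plus a read cursor per member."""

    def __init__(self):
        self._log = defaultdict(list)    # group_id -> [(seq, sender_id, body)]
        self._cursor = defaultdict(int)  # (group_id, member_id) -> messages read
        self.writes = 0                  # storage writes, for illustration

    def post(self, group_id, sender_id, body):
        """Store the message once, regardless of group size."""
        log = self._log[group_id]
        seq = len(log) + 1
        log.append((seq, sender_id, body))
        self.writes += 1
        return seq

    def fetch_unread(self, group_id, member_id):
        """Return messages after the member's cursor and advance it."""
        key = (group_id, member_id)
        unread = self._log[group_id][self._cursor[key]:]
        self._cursor[key] += len(unread)
        return unread
```

With a 500-member group, a per-member inbox model would perform 500 writes per message, while this shape performs one write and moves the cost to reads.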
Design Breakdown
Functional Requirements
Core Use Cases:
One-to-one instant messaging (Text/Media).
Group chats (up to 500 members).
User status (Online/Offline/Presence).
"Moments" feed (Post, Like, Comment).
Scope Control:
In-scope: IM core, Moments core, Presence, Media upload.
Out-of-scope: Payments, Official Accounts, Mini-programs, Voice/Video calls.
Non-Functional Requirements
Scale: Support 1 Billion users and high-throughput write operations (1M+ QPS).
Latency: End-to-end message delivery < 200ms.
Availability & Reliability: 99.99% availability; zero message loss.
Consistency: Causal consistency for chat (messages from A must appear in order to B); Eventual consistency for Moments.
Security & Privacy: End-to-end encryption for 1:1 chats; TLS for all transit.
Estimation
Traffic Estimation:
100B messages/day ≈ 1.15M messages/sec (Avg QPS).
Peak QPS (3x Avg) ≈ 3.5M QPS.
Storage Estimation:
Avg message size: 100 bytes (text/metadata).
100B messages × 100 bytes = 10 TB/day.
10 TB/day × 365 ≈ 3.65 PB/year (excluding media). Media requires Object Storage (S3/OSS).
Bandwidth Estimation:
1.15M msg/s × 100 bytes ≈ 115 MB/s ingress (Average).
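The arithmetic above can be reproduced with a short script. It is only a back-of-the-envelope check that uses the stated assumptions (100B messages/day, 100-byte messages, a 3x peak factor); the function name `estimate` is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def estimate(messages_per_day=100e9, avg_msg_bytes=100, peak_factor=3):
    """Back-of-the-envelope capacity numbers from the stated assumptions."""
    avg_qps = messages_per_day / SECONDS_PER_DAY
    storage_tb_per_day = messages_per_day * avg_msg_bytes / 1e12
    return {
        "avg_qps": avg_qps,
        "peak_qps": avg_qps * peak_factor,
        "storage_tb_per_day": storage_tb_per_day,
        "storage_pb_per_year": storage_tb_per_day * 365 / 1000,
        "ingress_mb_per_s": avg_qps * avg_msg_bytes / 1e6,
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value:,.2f}")
```

The results (about 1.16M average QPS, 3.47M peak QPS, 10 TB/day, 3.65 PB/year, and roughly 116 MB/s of ingress) are in line with the rounded figures used in this section.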
Blueprint
Concise Summary: A microservices architecture leveraging persistent WebSocket/TCP connections through a Gateway, using a "Timeline" (Mailbox) storage model for IM and a Push-Pull hybrid for Moments.
Major Components:
Connection Gateway: Manages millions of persistent TCP/WebSocket connections and handles heartbeat/session state.
Sync Service: The core logic engine that orchestrates message delivery via the "SyncKey" mechanism.
Chat Storage: A high-write NoSQL store (Wide-column) to hold the message timeline per user.
Moments Service: Manages the social feed using a fan-out pattern to distribute posts to friend timelines.
Push Notification (APNs/FCM): Handles delivery for offline users.
Simplicity Audit: This design avoids complex distributed transactions by using idempotent "SyncKey" sequences and eventual consistency for social feeds.
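To make the idempotent SyncKey idea concrete, here is a minimal in-memory sketch. It assumes each mailbox assigns consecutive sequence numbers starting at 1 and that retries reuse the same `client_msg_id`; the `SyncService` and `Mailbox` classes are illustrative stand-ins for the real service and its wide-column store.

```python
from dataclasses import dataclass, field


@dataclass
class Mailbox:
    messages: list = field(default_factory=list)  # ordered by sequence_id
    seen: dict = field(default_factory=dict)      # (sender, client_msg_id) -> seq


class SyncService:
    def __init__(self):
        self._mailboxes = {}  # user_id -> Mailbox

    def send_message(self, sender_id, to_id, content, client_msg_id):
        """Append to the recipient's mailbox; a retried send is a no-op."""
        box = self._mailboxes.setdefault(to_id, Mailbox())
        key = (sender_id, client_msg_id)
        if key in box.seen:
            return box.seen[key]  # duplicate: return the original sequence_id
        seq = len(box.messages) + 1
        box.messages.append(
            {"sequence_id": seq, "sender_id": sender_id, "content": content}
        )
        box.seen[key] = seq
        return seq

    def sync_messages(self, user_id, last_sync_key):
        """Return every message after last_sync_key plus the new SyncKey."""
        box = self._mailboxes.get(user_id, Mailbox())
        delta = box.messages[last_sync_key:]  # sequence_id == list index + 1
        return delta, len(box.messages)
```

A client that was offline simply calls `sync_messages` with the last key it stored; calling it twice with the same key returns the same delta, which is what keeps retries safe without a distributed transaction.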
Architecture Decision Rationale:
Why this?: The Mailbox/Timeline model is the industry standard for IM because it scales linearly and handles multi-device synchronization naturally.
Functional Satisfaction: Covers real-time chat, group chat, and social feeds.
Non-functional Satisfaction: Horizontally scalable gateway and storage layers handle the massive DAU and QPS.
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery & Traffic Routing:
Use Anycast DNS to route users to the nearest Data Center (DC).
Static Content: Images/Videos served via CDN with edge-side resizing.
Security & Perimeter:
API Gateway: Handles authentication (JWT/OAuth2) and TLS termination.
Long-lived Connections: Connection Gateway maintains TCP/WebSocket sticky sessions. If a client moves from Wi-Fi to LTE, the gateway handles session resumption via a Session Token.
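Session resumption via a token could look roughly like the sketch below. It assumes an HMAC-signed token verified with a key shared by all gateway nodes; the token layout, the `SECRET` value, and both function names are hypothetical and not taken from any specific gateway.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

SECRET = b"demo-only-secret"  # assumption: key shared by all gateway nodes


def issue_session_token(user_id: str, device_id: str, ttl_s: int = 3600,
                        now: Optional[float] = None) -> str:
    """Return a signed token the client presents when it reconnects."""
    now = time.time() if now is None else now
    payload = json.dumps(
        {"uid": user_id, "dev": device_id, "exp": int(now) + ttl_s},
        sort_keys=True,
    ).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(payload).decode() + "." + sig


def resume_session(token: str, now: Optional[float] = None) -> Optional[dict]:
    """Return the session claims if the token is authentic and unexpired."""
    now = time.time() if now is None else now
    try:
        body, sig = token.rsplit(".", 1)
        payload = base64.urlsafe_b64decode(body.encode())
    except ValueError:  # missing separator or invalid base64
        return None
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return None
    claims = json.loads(payload)
    return claims if claims["exp"] > now else None
```

Because any gateway node can verify the token locally, a client that switches from Wi-Fi to LTE can resume on a different node without a lookup in a central session store.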
Service
Topology & Scaling:
Stateless services (Sync, Moments, Presence) deployed in Kubernetes clusters across multiple Availability Zones (AZs).
Auto-scaling based on CPU and Request Count.
API Schema Design:
SendMessage:
POST /v1/chat/send (gRPC). Request: {to_id, type, content, client_msg_id}. Idempotency guaranteed via client_msg_id.
SyncMessages:
GET /v1/chat/sync. Request: {last_sync_key}. Response: {messages: [], new_sync_key}.
Resilience & Reliability:
SyncKey Mechanism: Each user has a monotonically increasing version number (SyncKey). The client sends its last_sync_key, and the server returns all delta messages since that version together with a new_sync_key. This ensures no message is missed even after a long offline period.
Storage
Access Pattern: 90% write for chat (new messages); 10% read (initial sync/scrolling history).
Database Table Design:
Messages Table (NoSQL - Cassandra):
Partition Key: user_id (Mailbox owner).
Clustering Key: sequence_id (SyncKey).
Fields: sender_id, message_body, timestamp, msg_type.
Technical Selection:
Message Store: Cassandra or HBase. High write throughput and efficient range scans for a specific user's timeline.
User/Friend Metadata: PostgreSQL with read replicas (Strong consistency for account data).
Distribution Logic:
Shard the Messages Table by user_id to ensure a single user's mailbox is contiguous on disk.
Cache
Purpose: Reduce DB load for presence and recent feed lookups.
Key-Value Schema:
Presence: presence:{user_id} -> {status: online/offline, last_seen: timestamp}.
Moments Feed: feed:{user_id} -> List[post_ids].
Technical Selection: Redis (Cluster mode).
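Failure Handling: Use a short TTL for presence keys so that stale entries expire on their own. If Redis is down, falling back to the DB for presence is too costly, so presence lookups "Fail-fast" (the status is reported as unknown), while feed lookups "Fail-over" to the DB.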
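The two failure policies can be contrasted in a small sketch. An in-memory dictionary stands in for Redis here, and the `PresenceCache` class, the `available` flag, and the `read_feed` helper are illustrative names rather than part of any client library.

```python
import time


class PresenceCache:
    """Presence entries with a TTL; lookups fail fast when the cache is down."""

    def __init__(self, ttl_s=30.0, clock=time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries = {}      # user_id -> expiry time
        self.available = True   # set to False to simulate a cache outage

    def heartbeat(self, user_id):
        """Refresh the user's online marker; it expires if heartbeats stop."""
        self._entries[user_id] = self._clock() + self._ttl_s

    def status(self, user_id):
        if not self.available:
            return "unknown"    # fail fast: never hit the database
        expires_at = self._entries.get(user_id)
        if expires_at is None or expires_at <= self._clock():
            self._entries.pop(user_id, None)
            return "offline"
        return "online"


def read_feed(user_id, cache_get, db_get):
    """Feed lookups fail over to the database on a cache miss or error."""
    try:
        cached = cache_get(user_id)
    except ConnectionError:
        cached = None
    return cached if cached is not None else db_get(user_id)
```

Presence is cheap to lose and expensive to recompute, so it degrades to "unknown"; the feed is worth a slower database read because an empty timeline is far more visible to the user.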
Messaging
Purpose & Decoupling: Decouples the critical path (sending a message) from side effects (updating feed, pushing to APNs, analytics).
Throughput & Partitioning: Kafka with partitioning by user_id to ensure message ordering for individual users.
Failure Handling: Dead-letter queues (DLQ) for messages that fail to be stored or processed after retries.
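The effect of keying by user_id, and of parking poison messages in a DLQ, can be shown without a broker. The sketch below uses a SHA-256 hash as a stand-in for the producer's partitioner (Kafka's default partitioner uses a different hash), and `partition_for`, `route`, and `consume` are hypothetical helper names.

```python
import hashlib


def partition_for(user_id: str, num_partitions: int) -> int:
    """Map a user to a fixed partition so that user's events stay ordered."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def route(events, num_partitions):
    """Append (user_id, payload) events to in-memory partition logs."""
    partitions = [[] for _ in range(num_partitions)]
    for user_id, payload in events:
        index = partition_for(user_id, num_partitions)
        partitions[index].append((user_id, payload))
    return partitions


def consume(record, handler, dead_letters, max_attempts=3):
    """Retry a handler; park the record in the DLQ if every attempt fails."""
    for _ in range(max_attempts):
        try:
            handler(record)
            return True
        except Exception:
            continue
    dead_letters.append(record)
    return False
```

Ordering is only guaranteed within a partition, which is why every event for one user must hash to the same partition; events for different users may interleave freely.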
Data Processing
Processing Model: Streaming (Flink) for real-time analytics and abuse detection (spam filtering).
Processing DAG: Message Stream -> Content Filter -> Sensitive Word Detection -> Statistics Sink.
Technical Selection: Apache Flink for low-latency windowed aggregations.
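The shape of this DAG can be sketched with plain Python generators. This is not the Flink API; it only mirrors the three stages, and the word list, the message fields, and the function names are assumptions made for illustration.

```python
from collections import Counter

SENSITIVE_WORDS = {"scam", "phishing"}  # hypothetical word list


def content_filter(stream):
    """Drop anything that is not a non-empty text message."""
    for msg in stream:
        if msg.get("type") == "text" and msg.get("content", "").strip():
            yield msg


def sensitive_word_detection(stream):
    """Tag each message with whether it contains a listed word."""
    for msg in stream:
        words = set(msg["content"].lower().split())
        yield {**msg, "flagged": bool(words & SENSITIVE_WORDS)}


def statistics_sink(stream):
    """Consume the stream and return aggregate counts."""
    stats = Counter()
    for msg in stream:
        stats["total"] += 1
        stats["flagged"] += int(msg["flagged"])
    return stats


def run_pipeline(messages):
    return statistics_sink(sensitive_word_detection(content_filter(messages)))
```

In a real streaming job each stage would be an operator running in parallel, and the sink would emit counts per time window instead of a single total.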
Infrastructure (Optional)
Observability: Prometheus for metrics, Jaeger for tracing "Sync" requests across services.
Distributed Coordination: ZooKeeper for service discovery and managing the sharding map of the Connection Gateways.
Wrap Up
Advanced Topics
Trade-offs (PACELC): For Chat, during a network partition we favor Consistency over Availability (the "PC" side of PACELC, i.e. CP in CAP terms). It is better for a message to fail to send than to appear in a scrambled order or be lost.
Moments Fan-out Strategy:
Pull Model: For "Celebrities" (many followers), followers pull the post.
Push Model: For "Regular Users", the post is pushed to friends' timelines.
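A hybrid of the two models can be sketched as follows. The threshold value, the `MomentsService` class, and the in-memory dictionaries (standing in for the feed cache and the post store) are assumptions made for illustration.

```python
from collections import defaultdict


class MomentsService:
    """Push posts from regular users; pull posts from celebrities at read time."""

    def __init__(self, friends, celebrity_threshold=10_000):
        self.friends = friends                      # user_id -> set of friend ids
        self.celebrity_threshold = celebrity_threshold
        self.feed_cache = defaultdict(list)         # user_id -> pushed post ids
        self.posts_by_author = defaultdict(list)    # author_id -> post ids
        self._next_post_id = 1

    def _is_celebrity(self, user_id):
        return len(self.friends.get(user_id, ())) >= self.celebrity_threshold

    def publish(self, author_id):
        post_id = self._next_post_id
        self._next_post_id += 1
        self.posts_by_author[author_id].append(post_id)
        if not self._is_celebrity(author_id):
            # Push model: write the post id into every friend's timeline.
            for friend_id in self.friends.get(author_id, ()):
                self.feed_cache[friend_id].append(post_id)
        return post_id

    def read_feed(self, user_id):
        post_ids = set(self.feed_cache[user_id])
        # Pull model: merge in posts from celebrity friends at read time.
        for friend_id in self.friends.get(user_id, ()):
            if self._is_celebrity(friend_id):
                post_ids.update(self.posts_by_author[friend_id])
        return sorted(post_ids, reverse=True)  # newest first
```

The threshold trades write amplification against read latency: lowering it moves more authors to the pull path, which saves fan-out writes but makes every feed read do more merging.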
Optimization (Connection Migration): When a server in the Gateway Layer is overloaded, we don't just kill connections. We send a "Go Away" signal (on WebSocket, a Close frame with status 1001 "Going Away"; on raw TCP, a special packet) instructing clients to reconnect to a different, less-loaded node after a randomized (jittered) delay, which avoids a "Thundering Herd."
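The jittered reconnect can be sketched as exponential backoff with full jitter. The message fields (`type`, `reconnect_to`, `delay_s`) and both function names are hypothetical; the design does not prescribe a wire format.

```python
import random


def reconnect_delay(attempt, base_s=0.5, cap_s=30.0, rng=random):
    """Full-jitter backoff: a uniform delay in [0, min(cap, base * 2^attempt)]."""
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def drain_gateway(connection_ids, healthy_nodes, rng=random):
    """Build one 'go away' instruction per connection on an overloaded node."""
    return {
        conn_id: {
            "type": "GO_AWAY",
            "reconnect_to": rng.choice(healthy_nodes),
            "delay_s": reconnect_delay(0, rng=rng),
        }
        for conn_id in connection_ids
    }
```

Spreading clients across a random delay window and a random target node keeps a drained gateway's connections from arriving at one healthy node in the same instant.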
Security: Use the Double Ratchet Algorithm, or a similar Signal-protocol-inspired scheme, for 1:1 chats to provide Perfect Forward Secrecy (PFS).
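To give a feel for why ratcheting provides forward secrecy, the sketch below shows only the symmetric-key half of a double ratchet: each step derives a one-time message key and a new chain key with HMAC-SHA256, using the constant inputs 0x01 and 0x02 that the Double Ratchet specification suggests as an example. It omits the Diffie-Hellman ratchet, header handling, and message encryption, and it is an illustration rather than usable cryptography; the function name is an assumption.

```python
import hashlib
import hmac
from typing import Tuple


def kdf_chain_step(chain_key: bytes) -> Tuple[bytes, bytes]:
    """Derive (next_chain_key, message_key) from the current chain key."""
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return next_chain_key, message_key


def derive_message_keys(initial_chain_key: bytes, count: int):
    """Walk the chain forward, returning one message key per step."""
    keys = []
    chain_key = initial_chain_key
    for _ in range(count):
        chain_key, message_key = kdf_chain_step(chain_key)
        keys.append(message_key)
    return keys
```

Because HMAC is one-way, someone who obtains the current chain key cannot recompute earlier chain keys or message keys, provided the old keys were deleted after use.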