The Question

Design a Massive-Scale Instant Messaging and Social Platform

Design a highly available and scalable communication system similar to WeChat. The system must support real-time 1:1 and group messaging for over 1 billion users, a social feed (Moments) with a complex social graph, and real-time presence tracking. Address the challenges of maintaining hundreds of millions of concurrent persistent connections, ensuring message delivery guarantees (no loss, correct ordering), and managing high-throughput write-heavy workloads for social updates. Discuss the trade-offs between push and pull models for feed distribution and the protocols used for mobile network resilience.

Cassandra

Redis

Kafka

WebSocket

gRPC

Flink

ZooKeeper

CDN

TLS

Anycast DNS

Questions & Insights

Clarifying Questions

Scale and Traffic: What is the expected Daily Active User (DAU) count and peak concurrent users (PCU)?

Assumption: 1 Billion DAU, 200 Million PCU, 100 Billion messages per day.

Core Features: Should we focus on the entire "Super App" or specific pillars?

Assumption: Focus on Instant Messaging (1:1 and Group Chat) and "Moments" (Social Feed). Payments and Mini-programs are out-of-scope for the MVP.

Message Types: What types of content are supported?

Assumption: Text, images, and short videos.

Data Retention: Are messages stored forever or for a specific duration?

Assumption: Messages are stored permanently but cold-storage strategies apply for older data.

Latency Requirements: What is the acceptable delay for message delivery?

Assumption: Under 200ms for 95th percentile (P95) in the same region.

Thinking Process

Core Bottleneck 1: Real-time Connection Management: How do we maintain 200M+ persistent connections efficiently without exhausting server resources?

Core Bottleneck 2: Message Ordering & Reliability: How do we ensure messages arrive in the correct order and are never lost? In practice this means at-least-once delivery plus deduplication, which gives users an exactly-once feel.

Core Bottleneck 3: Social Graph Fan-out: How do we handle "Moments" updates when a user has 5,000 friends (Write-heavy vs. Read-heavy tradeoffs)?

Progressive Path:

Establish a Gateway Layer to handle long-lived connections (WebSockets/TCP).

Implement a "Sync-Check" or "Mailbox" mechanism for message delivery.

Design a distributed ID generator that produces unique, time-sortable message IDs, and rely on per-mailbox sequence numbers for ordering within a conversation.

Optimize the "Moments" feed using a hybrid push/pull model based on user activity.
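
The ID generator step can be illustrated with a minimal Snowflake-style sketch. The bit layout (41 bits of milliseconds, 10 bits of worker ID, 12 bits of sequence), the custom epoch, and the class name are assumptions made for illustration, not details from the design above. IDs from one worker are strictly increasing; IDs across workers are only roughly time-ordered, so ordering inside a conversation still depends on the server-assigned sequence.

```python
import threading
import time


class SnowflakeLikeIdGenerator:
    """Hypothetical 64-bit ID: milliseconds | worker id | per-ms sequence."""

    EPOCH_MS = 1_600_000_000_000  # arbitrary custom epoch (assumption)
    WORKER_BITS = 10
    SEQ_BITS = 12

    def __init__(self, worker_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id < (1 << self.WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self._clock = clock
        self._last_ms = -1
        self._seq = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                # Clock went backwards (or we ran ahead): stay on the last
                # timestamp so IDs never decrease.
                now = self._last_ms
            if now == self._last_ms:
                self._seq = (self._seq + 1) & ((1 << self.SEQ_BITS) - 1)
                if self._seq == 0:
                    # Sequence exhausted for this millisecond: advance the
                    # logical timestamp by one instead of blocking.
                    now = self._last_ms + 1
            else:
                self._seq = 0
            self._last_ms = now
            shifted_time = (now - self.EPOCH_MS) << (self.WORKER_BITS + self.SEQ_BITS)
            return shifted_time | (self.worker_id << self.SEQ_BITS) | self._seq


if __name__ == "__main__":
    gen = SnowflakeLikeIdGenerator(worker_id=7)
    print([gen.next_id() for _ in range(3)])
```

Advancing the logical timestamp when the sequence overflows keeps the generator non-blocking, at the cost of the embedded time drifting slightly ahead of the wall clock under extreme load.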

Bonus Points

Protocol Optimization: Mentioning the use of a custom binary protocol (like WeChat's Mars) over TCP/UDP to handle mobile network volatility (tailored heartbeat, quick reconnect).

Intelligent Diff-Sync: Instead of sending full message lists, use state synchronization (SyncKey) to only fetch the delta between the client's last known state and the server's current state.

Shared Timeline for Group Chats: Store each group message once in a per-group "Timeline" and let members read from it, instead of copying it into every member's inbox. This avoids N-write amplification (one write per group rather than one per member).

Multi-region Geo-local Storage: Storing message data close to the user's geographical location to comply with data sovereignty (e.g., GDPR) and reduce latency.

Design Breakdown

Functional Requirements

Core Use Cases:

One-to-one instant messaging (Text/Media).

Group chats (up to 500 members).

User status (Online/Offline/Presence).

"Moments" feed (Post, Like, Comment).

Scope Control:

In-scope: IM core, Moments core, Presence, Media upload.

Out-of-scope: Payments, Official Accounts, Mini-programs, Voice/Video calls.

Non-Functional Requirements

Scale: Support 1 Billion users and high-throughput write operations (roughly 1.15M QPS on average, about 3.5M QPS at peak).

Latency: End-to-end message delivery < 200ms.

Availability & Reliability: 99.99% availability; zero message loss.

Consistency: Causal consistency for chat (messages from A must appear to B in the order A sent them); eventual consistency for Moments.

Security & Privacy: End-to-end encryption for 1:1 chats; TLS for all transit.

Estimation

Traffic Estimation:

100B messages/day $\approx$ 1.15M messages/sec (Avg QPS).

Peak QPS (3x Avg) $\approx$ 3.5M QPS.

Storage Estimation:

Avg message size: 100 bytes (text/metadata).

100B messages $\times$ 100 bytes = 10 TB/day.

10 TB/day over 365 days is about 3.65 PB/year (excluding media). Media requires Object Storage (S3/OSS).

Bandwidth Estimation:

1.15M msg/s $\times$ 100 bytes $\approx$ 115 MB/s ingress (Average).
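
The arithmetic behind these estimates can be reproduced with a short script. The function and key names are illustrative; the inputs (100B messages/day, 100 bytes per message, a 3x peak factor) are the assumptions stated in this section, and decimal units (1 TB = 10^12 bytes) are assumed.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def estimate(messages_per_day, avg_message_bytes, peak_factor):
    """Back-of-envelope traffic, storage, and bandwidth figures."""
    avg_qps = messages_per_day / SECONDS_PER_DAY
    bytes_per_day = messages_per_day * avg_message_bytes
    return {
        "avg_qps": avg_qps,
        "peak_qps": avg_qps * peak_factor,
        "storage_tb_per_day": bytes_per_day / 1e12,
        "storage_pb_per_year": bytes_per_day * 365 / 1e15,
        "ingress_mb_per_sec": avg_qps * avg_message_bytes / 1e6,
    }


numbers = estimate(messages_per_day=100e9, avg_message_bytes=100, peak_factor=3)

if __name__ == "__main__":
    for name, value in numbers.items():
        print(f"{name}: {value:,.2f}")
```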

Blueprint

Concise Summary: A microservices architecture leveraging persistent WebSocket/TCP connections through a Gateway, using a "Timeline" (Mailbox) storage model for IM and a Push-Pull hybrid for Moments.

Major Components:

Connection Gateway: Manages millions of persistent TCP/WebSocket connections and handles heartbeat/session state.

Sync Service: The core logic engine that orchestrates message delivery via the "SyncKey" mechanism.

Chat Storage: A high-write NoSQL store (Wide-column) to hold the message timeline per user.

Moments Service: Manages the social feed using a fan-out pattern to distribute posts to friend timelines.

Push Notification (APNs/FCM): Handles delivery for offline users.

Simplicity Audit: This design avoids complex distributed transactions by using idempotent "SyncKey" sequences and eventual consistency for social feeds.

Architecture Decision Rationale:

Why this?: The Mailbox/Timeline model is the industry standard for IM because it scales linearly and handles multi-device synchronization naturally.

Functional Satisfaction: Covers real-time chat, group chat, and social feeds.

Non-functional Satisfaction: Horizontally scalable gateway and storage layers handle the massive DAU and QPS.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing:

Use Anycast DNS to route users to the nearest Data Center (DC).

Static Content: Images/Videos served via CDN with edge-side resizing.

Security & Perimeter:

API Gateway: Handles authentication (JWT/OAuth2) and TLS termination.

Long-lived Connections: Connection Gateway maintains TCP/WebSocket sticky sessions. If a client moves from Wi-Fi to LTE, the gateway handles session resumption via a Session Token.

Service

Topology & Scaling:

Stateless services (Sync, Moments, Presence) deployed in Kubernetes clusters across multiple Availability Zones (AZs).

Auto-scaling based on CPU and Request Count.

API Schema Design:

SendMessage: POST /v1/chat/send (gRPC). Request: {to_id, type, content, client_msg_id}. Retries are idempotent because the server deduplicates on client_msg_id.

SyncMessages: GET /v1/chat/sync. Request: {last_sync_key}. Response: {messages: [], new_sync_key}.

Resilience & Reliability:

SyncKey Mechanism: Each user's mailbox has a monotonically increasing version number (SyncKey). The client sends its last_sync_key, and the server returns every message with a higher SyncKey. Because the client always resumes from the last key it received, no message is missed even after a long offline period.
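
A minimal in-memory sketch of how SendMessage deduplication and SyncMessages delta fetch could fit together. The Mailbox class and its method names are hypothetical; a real deployment would back this with the message store rather than a Python list, and the SyncKey here is simply the message's position in the mailbox.

```python
from dataclasses import dataclass, field


@dataclass
class Mailbox:
    """One user's timeline; a message's 1-based position is its SyncKey."""

    messages: list = field(default_factory=list)
    seen_client_ids: dict = field(default_factory=dict)

    def append(self, sender_id, client_msg_id, content):
        # Idempotent write: a retried send with the same client_msg_id
        # returns the SyncKey assigned the first time.
        dedup_key = (sender_id, client_msg_id)
        if dedup_key in self.seen_client_ids:
            return self.seen_client_ids[dedup_key]
        sync_key = len(self.messages) + 1
        self.messages.append(
            {"sync_key": sync_key, "sender_id": sender_id, "content": content}
        )
        self.seen_client_ids[dedup_key] = sync_key
        return sync_key

    def sync(self, last_sync_key):
        # Delta fetch: everything after the client's last known SyncKey.
        delta = self.messages[last_sync_key:]
        return {"messages": delta, "new_sync_key": len(self.messages)}


if __name__ == "__main__":
    box = Mailbox()
    box.append("alice", "c1", "hi")
    box.append("alice", "c1", "hi")  # network retry, ignored
    box.append("alice", "c2", "there")
    print(box.sync(last_sync_key=0))
```

A client that was offline simply calls sync with its stored key when it reconnects; a fresh install uses 0 and receives the whole mailbox.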

Storage

Access Pattern: 90% write for chat (new messages); 10% read (initial sync/scrolling history).

Database Table Design:

Messages Table (NoSQL - Cassandra):

Partition Key: user_id (Mailbox owner).

Clustering Key: sequence_id (SyncKey).

Fields: sender_id, message_body, timestamp, msg_type.

Technical Selection:

Message Store: Cassandra or HBase. High write throughput and efficient range scans for a specific user's timeline.

User/Friend Metadata: PostgreSQL with read replicas (Strong consistency for account data).

Distribution Logic:

Shard the Messages Table by user_id so that a user's entire mailbox lives in one partition, ordered by sequence_id, and each sync is a single-partition range scan.

Cache

Purpose: Reduce DB load for presence and recent feed lookups.

Key-Value Schema:

Presence: presence:{user_id} -> {status: online/offline, last_seen: timestamp}.

Moments Feed: feed:{user_id} -> List[post_ids].

Technical Selection: Redis (Cluster mode).

Failure Handling: Use a short TTL for presence keys. If Redis is down, presence lookups fail fast (falling back to the DB at this volume is too costly), while feed lookups fail over to the DB.
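
The two failure policies can be sketched as follows. Everything here is a stand-in: DictCache, FakeDb, and CacheUnavailable are hypothetical test doubles rather than a real Redis client API, and returning an "unknown" status on cache failure is one assumed way to fail fast.

```python
class CacheUnavailable(Exception):
    """Raised by the stand-in cache client when the cache cannot be reached."""


class DictCache:
    """Stand-in for a Redis client; `up=False` simulates an outage."""

    def __init__(self, data=None, up=True):
        self.data = dict(data or {})
        self.up = up

    def get(self, key):
        if not self.up:
            raise CacheUnavailable(key)
        return self.data.get(key)


class FakeDb:
    """Stand-in for the persistent feed store."""

    def __init__(self, feeds):
        self.feeds = feeds

    def load_feed(self, user_id):
        return self.feeds.get(user_id, [])


def get_presence(cache, user_id):
    # Fail fast: presence is best-effort, so never touch the DB.
    try:
        # A missing key means the TTL expired, i.e. the user is offline.
        return cache.get(f"presence:{user_id}") or {"status": "offline"}
    except CacheUnavailable:
        return {"status": "unknown"}


def get_feed(cache, db, user_id):
    # Fail over: the feed must still render, so fall back to the DB
    # on a cache outage (and on an ordinary cache miss).
    try:
        cached = cache.get(f"feed:{user_id}")
        if cached is not None:
            return cached
    except CacheUnavailable:
        pass
    return db.load_feed(user_id)
```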

Messaging

Purpose & Decoupling: Decouples the critical path (sending a message) from side effects (updating feed, pushing to APNs, analytics).

Throughput & Partitioning: Kafka, partitioned by user_id. Kafka preserves order only within a partition, so keying by user_id keeps each user's events in order.

Failure Handling: Dead-letter queues (DLQ) for messages that fail to be stored or processed after retries.
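
Why keying by user_id preserves per-user order can be shown with a toy partitioner. This sketch uses CRC32 purely for illustration; Kafka's own default partitioner uses a different hash, and the function names below are hypothetical.

```python
import zlib
from collections import defaultdict


def partition_for(user_id, num_partitions):
    """Deterministically map a key to a partition (illustrative hash)."""
    return zlib.crc32(user_id.encode("utf-8")) % num_partitions


def route(events, num_partitions):
    """Append each (user_id, payload) event to its partition, in arrival order."""
    partitions = defaultdict(list)
    for user_id, payload in events:
        partitions[partition_for(user_id, num_partitions)].append((user_id, payload))
    return partitions


if __name__ == "__main__":
    events = [("u1", "a"), ("u2", "x"), ("u1", "b"), ("u1", "c")]
    for partition, items in sorted(route(events, 8).items()):
        print(partition, items)
```

Because the same key always hashes to the same partition, a single consumer sees all of a user's events in the order they were produced; events for different users may interleave freely.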

Data Processing

Processing Model: Streaming (Flink) for real-time analytics and abuse detection (spam filtering).

Processing DAG: Message Stream -> Content Filter -> Sensitive Word Detection -> Statistics Sink.

Technical Selection: Apache Flink for low-latency windowed aggregations.
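
The shape of the processing DAG can be mimicked with plain Python generators. This is not the Flink API; it only mirrors the stages named above, with a placeholder word list and a tumbling window keyed on an assumed integer `ts` field in seconds.

```python
from collections import Counter

SENSITIVE_WORDS = {"scam", "lottery"}  # placeholder list (assumption)


def content_filter(stream):
    """Drop events with empty or whitespace-only text."""
    for event in stream:
        if event["text"].strip():
            yield event


def detect_sensitive(stream):
    """Tag each event with a boolean `flagged` field."""
    for event in stream:
        words = set(event["text"].lower().split())
        yield {**event, "flagged": bool(words & SENSITIVE_WORDS)}


def windowed_stats(stream, window_seconds=60):
    """Count total and flagged events per tumbling window."""
    stats = {}
    for event in stream:
        window_start = event["ts"] - event["ts"] % window_seconds
        bucket = stats.setdefault(window_start, Counter())
        bucket["total"] += 1
        bucket["flagged"] += int(event["flagged"])
    return stats


if __name__ == "__main__":
    events = [
        {"ts": 5, "text": "hello"},
        {"ts": 30, "text": "win the lottery"},
        {"ts": 61, "text": "   "},
        {"ts": 65, "text": "Scam alert"},
    ]
    print(windowed_stats(detect_sensitive(content_filter(events))))
```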

Infrastructure (Optional)

Observability: Prometheus for metrics, Jaeger for tracing "Sync" requests across services.

Distributed Coordination: ZooKeeper for service discovery and managing the sharding map of the Connection Gateways.

Wrap Up

Advanced Topics

Trade-offs (CAP/PACELC): For Chat, we favor consistency over availability when a partition occurs (CP in CAP terms, the "PC" side of PACELC). It is better for a message to fail to send than to appear in a scrambled order or be lost.

Moments Fan-out Strategy:

Pull Model: For "Celebrities" (many followers), followers pull the post.

Push Model: For "Regular Users", the post is pushed to friends' timelines.
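
A minimal sketch of the hybrid strategy, assuming a fixed follower-count threshold decides whether an author is "pushed" or "pulled". The class, the threshold parameter, and the in-memory inbox/outbox structures are all illustrative; the sketch also ignores authors who cross the threshold between publishing and reading.

```python
from collections import defaultdict


class MomentsFeed:
    """Hybrid fan-out: push for small audiences, pull for large ones."""

    def __init__(self, followers, push_threshold=5000):
        self.followers = followers            # author_id -> set of follower ids
        self.push_threshold = push_threshold  # hypothetical cut-off
        self.following = defaultdict(set)     # follower_id -> set of author ids
        for author, fans in followers.items():
            for fan in fans:
                self.following[fan].add(author)
        self.inbox = defaultdict(list)        # user_id -> [(ts, post_id)] pushed
        self.outbox = defaultdict(list)       # author_id -> [(ts, post_id)]

    def is_push_author(self, author_id):
        return len(self.followers.get(author_id, ())) <= self.push_threshold

    def publish(self, author_id, post_id, ts):
        self.outbox[author_id].append((ts, post_id))
        if self.is_push_author(author_id):
            # Fan-out on write: copy the post reference to each friend.
            for fan in self.followers.get(author_id, ()):
                self.inbox[fan].append((ts, post_id))

    def read(self, user_id, limit=20):
        # Start from pushed entries, then pull from high-fan-out authors.
        entries = list(self.inbox[user_id])
        for author in self.following[user_id]:
            if not self.is_push_author(author):
                entries.extend(self.outbox[author])
        entries.sort(reverse=True)  # newest first
        return [post_id for _, post_id in entries[:limit]]
```

The push path makes reads cheap for typical users, while the pull path caps write amplification for authors whose audience is too large to fan out to.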

Optimization (Connection Migration): When a server in the Gateway Layer is overloaded, we don't just kill connections. We send a Close frame with status code 1001 "Going Away" (WebSocket) or a special packet (TCP) instructing clients to reconnect to a different, less-loaded node, with jittered reconnect delays to avoid a "Thundering Herd."
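
A sketch of the client-side jitter and server-side target selection this implies. The function names, the "full jitter" backoff formula, and the load-map format are assumptions for illustration, not part of the design above.

```python
import random


def reconnect_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Full-jitter backoff: a uniform delay in [0, min(cap, base * 2**attempt)] seconds."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def pick_target(node_loads, draining):
    """Choose the least-loaded gateway node other than the one being drained."""
    candidates = {node: load for node, load in node_loads.items() if node != draining}
    return min(candidates, key=candidates.get)


if __name__ == "__main__":
    loads = {"gw-1": 0.95, "gw-2": 0.40, "gw-3": 0.55}
    target = pick_target(loads, draining="gw-1")
    print(f"reconnect to {target} after {reconnect_delay(attempt=0):.3f}s")
```

Because each client draws its own random delay, a drained node's connections arrive at the target spread over the backoff window instead of all at once.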

Security: Use the Double Ratchet Algorithm (as in the Signal protocol) or a similar scheme for 1:1 chats to provide Perfect Forward Secrecy (PFS).