The Question
Design an End-to-End Encrypted Messaging System
Design a high-scale, end-to-end encrypted (E2EE) messaging platform similar to Signal or WhatsApp. The system must support 100M daily active users, providing real-time delivery for online users and reliable store-and-forward delivery for offline users. Key requirements include implementing a mechanism for asynchronous key exchange (X3DH), ensuring perfect forward secrecy via the Double Ratchet algorithm, and handling encrypted media uploads. Discuss how you would minimize metadata leakage, handle presence at scale, and ensure no messages are lost during server-side failures.
Signal Protocol
WebSockets
Kafka
Redis
PostgreSQL
Cassandra
S3
gRPC
TLS 1.3
JWT
Questions & Insights
Clarifying Questions
Scale and Traffic: What is the expected Daily Active User (DAU) count and peak message throughput?
Assumption: 100M DAU, average 50 messages/user/day (~5B messages/day), peak ~120k QPS.
E2EE Protocol: Should we implement a custom protocol or use an industry standard like the Signal Protocol?
Assumption: Use the Signal Protocol (Double Ratchet + X3DH) for state-of-the-art security.
Message Persistence: Does the server store messages permanently or just until delivery?
Assumption: Store-and-forward model. Messages are deleted from the server immediately after successful delivery or after a 30-day TTL.
Group Messaging: Is group chat required for the MVP?
Assumption: Yes, but limited to small/medium groups (up to 100 members) to avoid fan-out complexity for the MVP.
Media Support: Do we need to handle images, video, and files?
Assumption: Yes, media is encrypted client-side and stored as blobs in object storage.
Thinking Process
Core Bottleneck: Establishing a secure session between two users who have never met and might not be online at the same time (Asynchronous Key Exchange).
Strategy Steps:
Identity & Trust: How do we verify users and store their public "Pre-keys" so others can initiate encrypted sessions?
Session Establishment: Use X3DH (Extended Triple Diffie-Hellman) to derive a shared secret even if one party is offline.
Message Flow: Implement Double Ratchet for "Perfect Forward Secrecy" (PFS)—changing keys for every message so a compromised key doesn't leak past/future messages.
Reliable Delivery: Use persistent WebSockets for real-time delivery and a "Message Store" (Queue/DB) for offline buffering.
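The symmetric half of the Double Ratchet in the message-flow step above can be sketched in a few lines. This is an illustrative HMAC-based chain-key derivation (the real Signal implementation also runs a Diffie-Hellman ratchet alongside it); the constants and function name `kdf_ck` follow the Double Ratchet specification's convention but are assumptions here, not the production code.

```python
import hashlib
import hmac

def kdf_ck(chain_key: bytes):
    """One symmetric ratchet step: derive a one-time message key and
    the next chain key from the current chain key. The 0x01/0x02
    inputs provide domain separation between the two derivations."""
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return next_chain_key, message_key

# Each message advances the chain and the old chain key is deleted,
# so compromising today's state reveals nothing about past messages.
ck = hashlib.sha256(b"shared-secret-from-x3dh").digest()  # placeholder root
keys = []
for _ in range(3):
    ck, mk = kdf_ck(ck)
    keys.append(mk)
assert len(set(keys)) == 3  # every message gets a distinct key
```

Because derivation only runs forward (SHA-256 is one-way), deleting a used message key is what actually buys forward secrecy.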
Bonus Points
Sealed Sender: Hiding the sender's identity from the server itself by using temporary delivery tokens, reducing metadata leakage.
Key Transparency: Using a verifiable log (like Certificate Transparency) so users can audit if the server has swapped their contacts' public keys (Man-in-the-Middle detection).
Secure Value Recovery: Using Intel SGX or similar TEEs (Trusted Execution Environments) to allow users to recover contacts/keys from a remote encrypted backup without the server seeing the data.
Zero-Knowledge Discovery: Using Private Set Intersection (PSI) or Bloom filters to discover contacts without uploading the entire address book in plaintext.
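To make the Bloom-filter variant of private contact discovery concrete, here is a minimal sketch. It is a toy: the filter size, hash count, and the idea of hashing phone numbers before insertion are illustrative assumptions (real deployments lean on PSI or TEEs because hashed phone numbers alone are brute-forceable).

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions per item, bit array storage."""
    def __init__(self, size_bits=1 << 16, hashes=4):
        self.size = size_bits
        self.hashes = hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: bytes):
        for i in range(self.hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: bytes):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

# Server publishes a filter of (hashed) registered numbers; the client
# checks its address book locally -- no plaintext upload to the server.
registered = BloomFilter()
registered.add(hashlib.sha256(b"+15551234567").digest())
assert hashlib.sha256(b"+15551234567").digest() in registered
```

The trade-off is one-sided error: the filter can report false positives (a non-user looks registered) but never false negatives, which is acceptable for a discovery hint that the client verifies later.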
Design Breakdown
Functional Requirements
Core Use Cases:
One-to-one encrypted messaging.
Group encrypted messaging (small groups).
Online/Offline status (Presence).
Media sharing (encrypted blobs).
Message delivery receipts (Sent, Delivered).
Scope Control:
In-Scope: E2EE logic, message routing, media storage, presence.
Out-of-Scope: Voice/Video calls, desktop client sync, message search across devices, disappearing messages (MVP).
Non-Functional Requirements
Scale: Support 100M DAU with low latency (<200ms message delivery).
Latency: Real-time feel via WebSockets.
Availability & Reliability: 99.99% availability; no message loss for offline users.
Consistency: High availability for messaging; strong consistency for Key management (Pre-keys).
Fault Tolerance: Horizontal scaling of connection managers; multi-AZ database replication.
Security & Privacy: Zero-knowledge architecture; the server cannot read any message content.
Estimation
Traffic Estimation:
5B messages / 86,400 s ≈ 58k average QPS.
Peak QPS (2× average) ≈ 120k QPS.
Storage Estimation:
Metadata (IDs, Timestamps): 100 bytes/msg. 5B * 100B = 500 GB/day.
Since we delete on delivery, active storage (working set) is likely < 10 TB.
Bandwidth Estimation:
Average message (1KB): 120k * 1KB = 120 MB/s (1 Gbps) ingress/egress.
Media will dominate bandwidth; requires CDN and Object Storage.
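The arithmetic behind the estimates above is worth double-checking in one place; this is just the back-of-the-envelope math from the bullets, not new data.

```python
# Traffic: 5B messages/day spread over 86,400 seconds.
messages_per_day = 5 * 10**9
avg_qps = messages_per_day / 86_400          # ~57.9k, quoted as 58k
peak_qps = 2 * avg_qps                       # ~116k, budgeted as 120k

# Storage: 100 bytes of metadata per message.
metadata_gb_per_day = messages_per_day * 100 / 10**9
assert metadata_gb_per_day == 500            # 500 GB/day

# Bandwidth: 1 KB average message at peak load.
bandwidth_mb_s = 120_000 * 1_000 / 10**6
assert bandwidth_mb_s == 120                 # ~120 MB/s, i.e. ~1 Gbps
```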
Blueprint
Concise Summary: A WebSocket-based real-time messaging system utilizing the Signal Protocol for E2EE, featuring a "Store-and-Forward" architecture for asynchronous communication.
Major Components:
WebSocket Gateway: Maintains persistent connections for real-time delivery and presence tracking.
Key Service: Manages the storage and distribution of public Pre-keys (Identity, Signed, and One-time keys).
Message Service: Routes messages and manages the temporary offline storage for undelivered messages.
Media Service: Handles encrypted binary uploads to S3 with temporary access URLs.
Simplicity Audit: This design avoids complex distributed databases for message history by enforcing "Delete-on-Delivery," significantly simplifying scaling and privacy compliance.
Architecture Decision Rationale:
Why this architecture?: WebSockets are essential for the low-latency "typing" experience. The Key Service is separated because its consistency requirements (linearizability for keys) differ from the Message Service (high throughput).
Functional Requirement Satisfaction: Meets the E2EE requirement via X3DH and Double Ratchet. Store-and-forward handles the offline requirement.
Non-functional Requirement Satisfaction: Scalability is achieved by sharding the WebSocket Gateway by UserID and using Kafka for message buffering.
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery & Traffic Routing:
CDN: Used for serving encrypted media blobs to reduce latency.
DNS: Latency-based routing to the nearest regional data center.
Security & Perimeter:
API Gateway: Performs JWT-based authentication and rate limiting (e.g., 500 requests/min per user).
TLS Termination: Hardened TLS 1.3 for all traffic.
Service
Topology & Scaling:
WebSocket Gateway: Stateful, horizontally scaled. Uses Redis to map UserID -> GatewayInstanceID.
Message Service: Stateless. Scales based on CPU load and queue depth.
API Schema Design:
POST /v1/keys: Upload Pre-keys (initial setup).
GET /v1/keys/{userId}: Fetch Pre-keys to start a session.
POST /v1/media/upload: Get a pre-signed S3 URL for an encrypted blob.
WebSocket Protocol: Custom binary framing (Protobuf) for messages to minimize overhead.
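The framing idea behind the WebSocket protocol can be sketched with a simple length-prefixed envelope. This is illustrative only: the design calls for Protobuf payloads, and the header layout, message-type constant, and function names here are assumptions for demonstration.

```python
import struct

# 1-byte message type + 4-byte big-endian payload length.
FRAME_HEADER = struct.Struct("!BI")
MSG_TYPE_CHAT = 0x01

def encode_frame(msg_type: int, payload: bytes) -> bytes:
    """Length-prefix each message so the receiver can split a raw
    byte stream back into discrete frames without ambiguity."""
    return FRAME_HEADER.pack(msg_type, len(payload)) + payload

def decode_frame(buf: bytes):
    """Return (type, payload, remaining bytes after this frame)."""
    msg_type, length = FRAME_HEADER.unpack_from(buf)
    start = FRAME_HEADER.size
    return msg_type, buf[start:start + length], buf[start + length:]

frame = encode_frame(MSG_TYPE_CHAT, b"ciphertext-bytes")
t, payload, rest = decode_frame(frame)
assert (t, payload, rest) == (MSG_TYPE_CHAT, b"ciphertext-bytes", b"")
```

A fixed 5-byte header keeps per-message overhead negligible next to the encrypted payload, which is the point of choosing binary framing over JSON.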
Resilience & Reliability:
Backoff: Clients use exponential backoff with jitter for WebSocket reconnections.
Heartbeats: 30s PING/PONG to detect dead connections and clean up Redis presence.
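The reconnection backoff above is the standard "full jitter" scheme; a minimal sketch, assuming a 1 s base delay and a 60 s cap (both tunables, not values fixed by this design):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter exponential backoff: sleep a uniform random amount
    in [0, min(cap, base * 2^attempt)]. Randomizing the whole window
    spreads the reconnect storm that follows a Gateway node dying."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

delays = [backoff_delay(a) for a in range(8)]
assert all(0.0 <= d <= 60.0 for d in delays)
```

Without the jitter, thousands of clients dropped by the same node would all retry on the same schedule and hammer the replacement node in synchronized waves.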
Storage
Access Pattern:
High write/delete for messages.
Key Service: 90% Read (fetching pre-keys), 10% Write (initialization/periodic updates).
Database Table Design (Key DB):
user_id (PK), identity_key (Blob), signed_prekey (Blob), prekey_signature (Blob).
one_time_prekeys: Table with key_id and key_blob. One row deleted per fetch.
Technical Selection:
PostgreSQL: For Key Storage due to ACID requirements for "Atomic Delete-on-Read" of one-time pre-keys.
Cassandra: For offline message metadata, optimized for high-write/high-delete throughput.
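The "Atomic Delete-on-Read" requirement that motivates PostgreSQL can be demonstrated with an in-memory SQLite stand-in. The table shape follows the schema above; everything else (the `pop_prekey` helper, the seed data) is illustrative, and production PostgreSQL would use `DELETE ... RETURNING` with row locking rather than this two-statement transaction.

```python
import sqlite3

# In-memory stand-in for the Key DB.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE one_time_prekeys (key_id INTEGER PRIMARY KEY, key_blob BLOB)")
db.executemany("INSERT INTO one_time_prekeys VALUES (?, ?)",
               [(i, bytes([i]) * 32) for i in range(3)])

def pop_prekey(conn):
    """Hand out one pre-key atomically: select and delete inside a
    single transaction so two concurrent fetches can never receive
    the same one-time key (which would break its one-time guarantee)."""
    with conn:  # implicit BEGIN ... COMMIT / ROLLBACK
        row = conn.execute(
            "SELECT key_id, key_blob FROM one_time_prekeys ORDER BY key_id LIMIT 1"
        ).fetchone()
        if row is None:
            return None  # exhausted: session falls back to the signed pre-key
        conn.execute("DELETE FROM one_time_prekeys WHERE key_id = ?", (row[0],))
        return row[1]

assert pop_prekey(db) == b"\x00" * 32
assert pop_prekey(db) == b"\x01" * 32
```

The `None` branch is exactly the pre-key exhaustion case discussed under Bottleneck Analysis below.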
Distribution Logic: Sharded by user_id to ensure all keys/messages for a user reside on one partition.
Cache
Purpose & Justification: Tracks online status (Presence) and maps users to specific WebSocket nodes for routing.
Key-Value Schema:
Key: presence:{userId}, Value: nodeId. TTL: 60 seconds (extended by heartbeats).
Technical Selection: Redis. High-performance sub-millisecond lookups for routing.
Failure Handling: If Redis fails, the system falls back to broadcasting messages to all Gateway nodes (expensive) or assumes the user is offline until Redis recovers.
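The TTL-based presence semantics above can be modeled without Redis at all; this dict-backed sketch mirrors a `SETEX presence:{userId} nodeId` whose expiry is pushed forward on every heartbeat. The class and method names are invented for illustration.

```python
class PresenceMap:
    """Stand-in for Redis presence:{userId} -> nodeId with a 60 s TTL."""
    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self._entries = {}  # user_id -> (node_id, expires_at)

    def heartbeat(self, user_id: str, node_id: str, now: float):
        # Each PING extends the TTL, like refreshing EXPIRE in Redis.
        self._entries[user_id] = (node_id, now + self.ttl)

    def lookup(self, user_id: str, now: float):
        entry = self._entries.get(user_id)
        if entry is None or entry[1] <= now:
            return None  # treated as offline -> store-and-forward path
        return entry[0]

p = PresenceMap(ttl_seconds=60.0)
p.heartbeat("alice", "gateway-7", now=0.0)
assert p.lookup("alice", now=30.0) == "gateway-7"  # within TTL: route direct
assert p.lookup("alice", now=61.0) is None         # TTL lapsed: buffer offline
```

Expiring rather than explicitly deleting entries is what makes crashed Gateway nodes self-healing: their users simply age out of the map within one TTL.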
Messaging
Purpose & Decoupling: Acts as a buffer between the Gateway and the Message Service. Handles load leveling during traffic spikes.
Event Schema:
{ senderId, receiverId, encryptedPayload, timestamp, messageId }
Technical Selection: Kafka. Provides durability for offline messages and allows for high-throughput replay if a Gateway node crashes.
Failure Handling: Dead-letter queues (DLQ) for malformed encrypted packets.
Wrap Up
Advanced Topics
Trade-offs: We sacrifice Message History on the server for Privacy. If a user loses their device and hasn't backed up keys, history is gone. This aligns with the Signal model.
Reliability: To prevent message loss during Gateway crashes, the Message Service deletes a message from the "Offline Queue" only after the Gateway receives an ACK from the client.
Bottleneck Analysis: "One-time Pre-key" exhaustion. If a user receives thousands of session requests while offline, they may run out of pre-keys, degrading security to a less optimal DH exchange using only the signed pre-key. Optimization: send a "Key Exhaustion" system message to the client to trigger an immediate key upload.
Security: Perfect Forward Secrecy (PFS) ensures that even if the server is subpoenaed and the long-term identity keys are seized, previous conversations cannot be decrypted because ephemeral session keys were deleted.
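The delete-on-ACK rule from the Reliability point above reduces to a small state machine: delivery and deletion are separate steps, so a crash between them causes a redelivery rather than a loss. This sketch uses invented names (`OfflineQueue`, `deliver_all`, `ack`) to illustrate the at-least-once semantics.

```python
import uuid

class OfflineQueue:
    """Per-user offline store: a message is removed only once the
    recipient's client ACKs it, so a Gateway crash mid-delivery
    results in a harmless redelivery, never a lost message."""
    def __init__(self):
        self._pending = {}  # message_id -> encrypted payload

    def enqueue(self, payload: bytes) -> str:
        message_id = str(uuid.uuid4())
        self._pending[message_id] = payload
        return message_id

    def deliver_all(self):
        # Delivery does NOT delete: at-least-once, not at-most-once.
        return list(self._pending.items())

    def ack(self, message_id: str):
        self._pending.pop(message_id, None)  # idempotent ACK

q = OfflineQueue()
mid = q.enqueue(b"ciphertext")
assert len(q.deliver_all()) == 1  # delivered, but still pending
q.ack(mid)
assert q.deliver_all() == []      # deleted only after the client ACK
```

The corollary is that clients must deduplicate by messageId, since redelivery after a crash is expected behavior under this model.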