The Question
Design

Design a Scalable Distributed Email Service

Design a global-scale email platform (similar to Gmail) capable of supporting hundreds of millions of daily active users. The system must handle high-throughput email delivery (SMTP), reliable metadata storage, efficient attachment handling, and full-text search. Focus on achieving high durability for message storage, sub-second latency for inbox rendering, and explain the architectural trade-offs between consistency and availability in a distributed environment.
SMTP
Cassandra
S3
Kafka
ElasticSearch
Redis
Spark
Kubernetes
OAuth2
TLS
Questions & Insights

Clarifying Questions

Scale & Traffic: What is the target scale? (Assumption: 1 Billion users, 500 Million DAU, averaging 10 emails sent/received per user daily).
Functional Scope: Should we support real-time search, attachments, and labels? (Assumption: MVP includes Sending/Receiving, Attachment support, and Search).
Consistency vs. Availability: Is eventual consistency acceptable for search results? (Assumption: Yes, but email delivery and storage must be highly durable and eventually consistent across devices).
Storage Constraints: What is the maximum email/attachment size? (Assumption: 25MB total per email).
Protocols: Do we need to support legacy IMAP/POP3? (Assumption: MVP focuses on a proprietary HTTP API for modern web/mobile clients, plus standard SMTP for inter-domain delivery).

Thinking Process

Protocol Management: How do we handle the transition between the Web/API layer (HTTP) and the Mail server layer (SMTP)?
Storage Strategy: How do we efficiently store billions of small metadata records vs. large unstructured message bodies and attachments?
Search & Indexing: How do we provide sub-second search over petabytes of text data without blocking the write path?
High Availability: How do we ensure the system stays up during massive "thundering herd" events (e.g., global marketing blasts)?

Bonus Points

Searchable Encryption: Implementing schemes that allow searching over encrypted email content without the server knowing the plaintext.
Spanner-based Metadata: Utilizing Google Spanner for global consistency of thread-views and unread counts to prevent "ghost" notifications.
Delta Encoding: Using delta-sync for attachments and long email threads to minimize mobile bandwidth usage.
Cold Storage Tiering: Automatically moving years-old emails to cheaper "Cold" storage (e.g., S3 Glacier) to optimize COGS (Cost of Goods Sold).
Design Breakdown

Functional Requirements

Core Use Cases:
Send and Receive emails (inter-domain and intra-domain).
View Inbox/Threads and manage unread status.
Search emails by keyword, sender, or date.
Upload/Download attachments.
Scope Control:
In-scope: Core mail transfer, metadata storage, basic search, attachment handling.
Out-of-scope: Calendar integration, Video conferencing (Meet), advanced AI auto-replies, and legacy POP3/IMAP support.

Non-Functional Requirements

Scale: Support 500M DAU and 5 Billion emails/day.
Latency: Sub-200ms for loading the inbox; sub-1s for search results.
Availability & Reliability: 99.99% availability; Zero data loss (Durability is the highest priority).
Consistency: Eventual consistency for search indexing; Read-after-write consistency for the inbox view.
Security: TLS in-transit, AES-256 at-rest, and robust anti-spam/phishing filtering.

Estimation

Traffic Estimation:
5B emails/day \approx 60,000 Average QPS.
Peak QPS (3x) \approx 180,000 QPS for writes.
Read QPS (Inbox view) \approx 10x write \approx 600,000 QPS.
Storage Estimation:
5B emails/day * 100KB average size (body+meta) = 500 TB/day.
182 PB per year. Multi-tier storage is mandatory.
Bandwidth Estimation:
Ingress: 500 TB / 86400s \approx 5.8 GB/s.
Egress: (Reads + Downloads) \approx 50 GB/s.

Blueprint

Concise Summary: A microservices architecture centered around a dual-storage strategy (NoSQL for metadata, Object Storage for bodies) with an asynchronous processing pipeline for indexing and outbound delivery.
Major Components:
API Gateway: Entry point for web/mobile clients, handling Auth and Rate Limiting.
Mail Service: Orchestrates inbox views and metadata updates.
SMTP Inbound/Outbound: Dedicated relays for standard mail transfer.
Kafka: Decouples write paths from heavy processing (indexing, spam check).
Cassandra: Stores email metadata and thread structures for high-write availability.
S3 (Object Store): Stores raw email bodies and attachments.
ElasticSearch: Provides full-text search capabilities.
Simplicity Audit: This design avoids complex distributed transactions by using an event-driven model via Kafka and a sharded NoSQL database.
Architecture Decision Rationale:
Why?: Separating metadata from bodies allows us to scale the fast-access inbox view independently from the massive unstructured data storage.
Functional: Meets all send/receive/search requirements.
Non-functional: Cassandra ensures high availability for writes; Kafka ensures durability even if the search indexer lags.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: Use Anycast DNS to route users to the nearest regional PoP.
Security & Perimeter:
API Gateway: Handles OAuth2 tokens.
Rate Limiting: Tiered limits (Free vs. Business users) implemented in Redis.
WAF: Protects against SQLi and XSS in email bodies.

Service

Topology & Scaling: Stateless microservices deployed in Kubernetes across multiple Availability Zones. Scaling is based on CPU for MailSvc and Queue Depth for Workers.
API Schema Design:
POST /v1/messages/send: Send email (Returns msg_id).
GET /v1/messages/inbox: Paginated inbox list.
GET /v1/messages/{id}: Fetch full message and attachment URLs.
Resilience & Reliability:
Retries: Exponential backoff for SMTP outbound delivery.
Circuit Breaker: Used when calling the ElasticSearch cluster to prevent cascading failures during search peaks.

Storage

Access Pattern:
Metadata: Heavy write (new mail) and heavy read (viewing inbox).
Bodies: Write once, read occasionally.
Database Table Design (Cassandra):
Table: messages_by_user
user_id (Partition Key)
folder_type (Clustering Key - Inbox/Sent/Draft)
timestamp (Clustering Key - Descending)
message_id, subject, from_address, has_attachments, is_read.
Technical Selection: Cassandra is chosen for its linear scalability and excellent write performance. S3 is used for attachments due to cost-efficiency.
Distribution Logic: Sharded by user_id to ensure all of a single user's inbox data resides on the same partition for fast retrieval.

Cache

Purpose & Justification: Reduces DB load for frequently accessed data like user sessions and unread message counts.
Key-Value Schema:
unread_cnt:{user_id} -> Integer.
session:{token} -> JSON user metadata.
Technical Selection: Redis for sub-millisecond latency.

Messaging

Purpose & Decoupling: Kafka acts as the buffer between the synchronous "Message Received" event and asynchronous tasks like Indexing and Spam Scanning.
Event / Topic Schema: email_events topic with payload: {user_id, message_id, action: "RECEIVED"}.
Technical Selection: Kafka for high throughput and replayability (crucial if search index needs rebuilding).

Data Processing

Processing Model: Spark Streaming (or Flink) consumes from Kafka to update the ElasticSearch index.
Correctness Guarantees: Idempotent indexing based on message_id.
Technical Selection: Spark for its robust ecosystem and ability to handle large batch re-indexing if needed.

Infrastructure (Optional)

Observability: Prometheus for RED metrics; Jaeger for tracing the lifecycle of an email from SMTP-In to Indexing.
Wrap Up

Advanced Topics

Trade-offs: We choose Eventual Consistency for the Search Index to maintain high availability for the Send/Receive path. A user might not see a newly arrived email in search results for a few seconds.
Reliability: If Cassandra is down, Kafka stores the incoming emails, providing backpressure and preventing data loss.
Bottleneck Analysis: The "Hot User" problem (celebrities receiving millions of emails). Optimization: Implement a specialized "celebrity" queue or shard highly active mailboxes across more nodes.
Security: Data at rest is encrypted per-user. Attachments are scanned by an async virus-scanning service before being marked "available."