The Question

Design a Scalable Distributed Email Service

Design a global-scale email platform (similar to Gmail) capable of supporting hundreds of millions of daily active users. The system must handle high-throughput email delivery (SMTP), reliable metadata storage, efficient attachment handling, and full-text search. Focus on achieving high durability for message storage, sub-second latency for inbox rendering, and explain the architectural trade-offs between consistency and availability in a distributed environment.

SMTP

Cassandra

Kafka

ElasticSearch

Redis

Spark

Kubernetes

OAuth2

TLS

Questions & Insights

Clarifying Questions

Scale & Traffic: What is the target scale? (Assumption: 1 Billion users, 500 Million DAU, averaging 10 emails sent/received per user daily).

Functional Scope: Should we support real-time search, attachments, and labels? (Assumption: MVP includes Sending/Receiving, Attachment support, and Search).

Consistency vs. Availability: Is eventual consistency acceptable for search results? (Assumption: Yes, but email delivery and storage must be highly durable and eventually consistent across devices).

Storage Constraints: What is the maximum email/attachment size? (Assumption: 25MB total per email).

Protocols: Do we need to support legacy IMAP/POP3? (Assumption: MVP focuses on a proprietary HTTP API for modern web/mobile clients, plus standard SMTP for inter-domain delivery).

Thinking Process

Protocol Management: How do we handle the transition between the Web/API layer (HTTP) and the Mail server layer (SMTP)?

Storage Strategy: How do we efficiently store billions of small metadata records vs. large unstructured message bodies and attachments?

Search & Indexing: How do we provide sub-second search over petabytes of text data without blocking the write path?

High Availability: How do we ensure the system stays up during massive "thundering herd" events (e.g., global marketing blasts)?

Bonus Points

Searchable Encryption: Implementing schemes that allow searching over encrypted email content without the server knowing the plaintext.

Spanner-based Metadata: Utilizing Google Spanner for global consistency of thread-views and unread counts to prevent "ghost" notifications.

Delta Encoding: Using delta-sync for attachments and long email threads to minimize mobile bandwidth usage.

Cold Storage Tiering: Automatically moving years-old emails to cheaper "Cold" storage (e.g., S3 Glacier) to optimize COGS (Cost of Goods Sold).

Design Breakdown

Functional Requirements

Core Use Cases:

Send and Receive emails (inter-domain and intra-domain).

View Inbox/Threads and manage unread status.

Search emails by keyword, sender, or date.

Upload/Download attachments.

Scope Control:

In-scope: Core mail transfer, metadata storage, basic search, attachment handling.

Out-of-scope: Calendar integration, Video conferencing (Meet), advanced AI auto-replies, and legacy POP3/IMAP support.

Non-Functional Requirements

Scale: Support 500M DAU and 5 Billion emails/day.

Latency: Sub-200ms for loading the inbox; sub-1s for search results.

Availability & Reliability: 99.99% availability; Zero data loss (Durability is the highest priority).

Consistency: Eventual consistency for search indexing; Read-after-write consistency for the inbox view.

Security: TLS in-transit, AES-256 at-rest, and robust anti-spam/phishing filtering.

Estimation

Traffic Estimation:

5B emails/day

\approx

60,000 Average QPS.

Peak QPS (3x)

\approx

180,000 QPS for writes.

Read QPS (Inbox view)

\approx

10x write

\approx

600,000 QPS.

Storage Estimation:

5B emails/day * 100KB average size (body+meta) = 500 TB/day.

182 PB per year. Multi-tier storage is mandatory.

Bandwidth Estimation:

Ingress: 500 TB / 86400s

\approx

5.8 GB/s.

Egress: (Reads + Downloads)

\approx

50 GB/s.

Blueprint

Concise Summary: A microservices architecture centered around a dual-storage strategy (NoSQL for metadata, Object Storage for bodies) with an asynchronous processing pipeline for indexing and outbound delivery.

Major Components:

API Gateway: Entry point for web/mobile clients, handling Auth and Rate Limiting.

Mail Service: Orchestrates inbox views and metadata updates.

SMTP Inbound/Outbound: Dedicated relays for standard mail transfer.

Kafka: Decouples write paths from heavy processing (indexing, spam check).

Cassandra: Stores email metadata and thread structures for high-write availability.

S3 (Object Store): Stores raw email bodies and attachments.

ElasticSearch: Provides full-text search capabilities.

Simplicity Audit: This design avoids complex distributed transactions by using an event-driven model via Kafka and a sharded NoSQL database.

Architecture Decision Rationale:

Why?: Separating metadata from bodies allows us to scale the fast-access inbox view independently from the massive unstructured data storage.

Functional: Meets all send/receive/search requirements.

Non-functional: Cassandra ensures high availability for writes; Kafka ensures durability even if the search indexer lags.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: Use Anycast DNS to route users to the nearest regional PoP.

Security & Perimeter:

API Gateway: Handles OAuth2 tokens.

Rate Limiting: Tiered limits (Free vs. Business users) implemented in Redis.

WAF: Protects against SQLi and XSS in email bodies.

Service

Topology & Scaling: Stateless microservices deployed in Kubernetes across multiple Availability Zones. Scaling is based on CPU for MailSvc and Queue Depth for Workers.

API Schema Design:

POST /v1/messages/send: Send email (Returns msg_id).

GET /v1/messages/inbox: Paginated inbox list.

GET /v1/messages/{id}: Fetch full message and attachment URLs.

Resilience & Reliability:

Retries: Exponential backoff for SMTP outbound delivery.

Circuit Breaker: Used when calling the ElasticSearch cluster to prevent cascading failures during search peaks.

Storage

Access Pattern:

Metadata: Heavy write (new mail) and heavy read (viewing inbox).

Bodies: Write once, read occasionally.

Database Table Design (Cassandra):

Table: messages_by_user

user_id (Partition Key)

folder_type (Clustering Key - Inbox/Sent/Draft)

timestamp (Clustering Key - Descending)

message_id, subject, from_address, has_attachments, is_read.

Technical Selection: Cassandra is chosen for its linear scalability and excellent write performance. S3 is used for attachments due to cost-efficiency.

Distribution Logic: Sharded by user_id to ensure all of a single user's inbox data resides on the same partition for fast retrieval.

Cache

Purpose & Justification: Reduces DB load for frequently accessed data like user sessions and unread message counts.

Key-Value Schema:

unread_cnt:{user_id} -> Integer.

session:{token} -> JSON user metadata.

Technical Selection: Redis for sub-millisecond latency.

Messaging

Purpose & Decoupling: Kafka acts as the buffer between the synchronous "Message Received" event and asynchronous tasks like Indexing and Spam Scanning.

Event / Topic Schema: email_events topic with payload: {user_id, message_id, action: "RECEIVED"}.

Technical Selection: Kafka for high throughput and replayability (crucial if search index needs rebuilding).

Data Processing

Processing Model: Spark Streaming (or Flink) consumes from Kafka to update the ElasticSearch index.

Correctness Guarantees: Idempotent indexing based on message_id.

Technical Selection: Spark for its robust ecosystem and ability to handle large batch re-indexing if needed.

Infrastructure (Optional)

Observability: Prometheus for RED metrics; Jaeger for tracing the lifecycle of an email from SMTP-In to Indexing.

Wrap Up

Advanced Topics

Trade-offs: We choose Eventual Consistency for the Search Index to maintain high availability for the Send/Receive path. A user might not see a newly arrived email in search results for a few seconds.

Reliability: If Cassandra is down, Kafka stores the incoming emails, providing backpressure and preventing data loss.

Bottleneck Analysis: The "Hot User" problem (celebrities receiving millions of emails). Optimization: Implement a specialized "celebrity" queue or shard highly active mailboxes across more nodes.

Security: Data at rest is encrypted per-user. Attachments are scanned by an async virus-scanning service before being marked "available."