The Question

Large-Scale Email System Design

Design a globally distributed email service similar to Gmail, capable of supporting billions of users. The system must handle high-volume write traffic for incoming emails, provide low-latency inbox retrieval, and support full-text search over petabytes of data. Discuss the trade-offs between consistency and availability, the strategy for massive metadata storage versus large attachments, and how to maintain a highly responsive search index under extreme scale.

Cassandra

Kafka

Redis

Elasticsearch

CDN

OAuth2

gRPC

Questions & Insights

Clarifying Questions

Scale and User Base: What is the target Monthly Active User (MAU) count and the average number of emails sent/received per user per day?

Storage Limits: Is there a per-user storage quota (e.g., 15GB free tier)? How do we handle users who exceed this?

Search Requirements: Does the search need to be full-text across message bodies and attachments, or just metadata (Subject/Sender)?

Consistency vs. Availability: When a user sends an email, is it acceptable for it to appear in the recipient's inbox with a slight delay (eventual consistency), or is immediate visibility required?

Attachment Constraints: What is the maximum attachment size, and do we need to support virus scanning or preview generation?

Assumptions:

Scale: 1 Billion DAU, average 20 emails per user/day (10 sent, 10 received).

Storage: 15GB per user; 100KB average email size (including metadata).

Search: Full-text search is required but can be eventually consistent (seconds of lag).

Architecture: Focus on the Web/Mobile API backend rather than legacy SMTP/POP3/IMAP protocols for the MVP.

Thinking Process

How do we handle massive write-heavy storage? We utilize a NoSQL wide-column store (like Cassandra or Bigtable) to manage email metadata, partitioning by user_id to ensure horizontal scalability and high write availability.

How do we manage multi-terabyte search indexes? We decouple the search indexing from the write path using an asynchronous event-driven architecture (Kafka) to populate a distributed search engine (Elasticsearch).

How do we handle large binary attachments? We separate message metadata from message bodies and attachments, storing the latter in an Object Store (S3/GCS) with references in the metadata DB.

How do we optimize for the "Inbox" view? Since most users only look at the first page of their inbox, we use a write-through cache for the most recent email headers to minimize database hits.

Bonus Points

Delta Encoding for Storage: For long email threads where users reply with previous content, we can use delta encoding to store only the new parts of the message, significantly reducing storage costs.

Bloom Filters for Spam/Search: Implement Bloom filters at the Edge or Cache layer to quickly check for blacklisted senders or to determine if a search term exists in a user's index before hitting the heavy search cluster.

Hot-Cold Storage Tiering: Automatically move emails older than 2 years to cheaper, high-latency storage (Cold storage) while keeping recent emails in high-performance SSD-backed NoSQL.

Global Secondary Indexing Strategy: Optimize for multi-attribute queries (e.g., "From: X" AND "Label: Work") using specialized indexing structures to avoid cross-shard scans.

Design Breakdown

Functional Requirements

Core Use Cases:

Compose and send emails.

Receive and display emails in an Inbox.

Organize emails using Labels (Folders).

Full-text search across the user's mailbox.

Support for attachments (up to 25MB).

Scope Control:

In-Scope: Core web backend API, metadata storage, search indexing, and attachment handling.

Out-of-Scope: Legacy protocol support (IMAP/POP3), real-time "typing" indicators, and complex third-party app integrations.

Non-Functional Requirements

Scale: Support 1 Billion users and 20 Billion daily email events.

Latency: Reading the inbox should be < 200ms; sending should be perceived as near-instant (async background task).

Availability & Reliability: 99.99% availability; zero data loss (emails must never be "lost").

Consistency: Strong consistency for the sender (Drafts/Sent folder), eventual consistency for the recipient and search index.

Fault Tolerance: Multi-AZ replication for all data layers.

Security: Data encryption at rest (AES-256) and in transit (TLS).

Estimation

Traffic Estimation:

Write QPS: 20 Billion emails / 86,400 seconds

\approx

230,000 QPS.

Read QPS: Assuming users check their inbox 5x more often than they send emails

\approx

1.15 Million QPS.

Storage Estimation:

20B emails/day * 100KB/email = 2 PB per day.

1 Year storage

\approx

730 PB.

Bandwidth Estimation:

Ingress: 230k QPS * 100KB

\approx

23 GB/s.

Egress: 1.15M QPS * 50KB (headers/metadata)

\approx

57.5 GB/s.

Blueprint

This architecture focuses on a decoupled, event-driven model to handle the extreme scale of global email traffic.

Major Components:

API Gateway: Entry point for all client requests, handling auth and rate limiting.

Mail Service: Handles the logic for composing, sending, and retrieving emails.

Metadata DB (Cassandra): A wide-column store holding message headers, statuses, and labels, partitioned by UserID.

Blob Storage (S3): Stores raw email bodies and binary attachments.

Message Queue (Kafka): Decouples the write path from heavy indexing and spam analysis.

Search Indexer: A consumer service that updates the search engine.

Inbox Cache (Redis): Stores the "top N" message headers for each user's inbox to reduce DB load.

Simplicity Audit: This design avoids the complexity of manual database sharding by using Cassandra's native partitioning and leverages managed object storage for large files, keeping the core Mail Service stateless and easy to scale.

Architecture Decision Rationale:

Scalability: Cassandra allows us to add nodes linearly as the user base grows.

Reliability: Kafka ensures that even if the search indexer is down, email indexing tasks are queued and eventually processed.

Performance: Redis ensures the most common action (viewing the first page of the inbox) is extremely fast.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: Static assets (JS/CSS/Icons) are served via Global CDN nodes. DNS uses latency-based routing to direct users to the nearest regional data center.

Security & Perimeter:

API Gateway: Performs OAuth2 token validation.

Rate Limiting: Applied per UserID (e.g., 100 sends per hour for free tier) to prevent spam abuse.

WAF: Protects against SQLi and XSS in email content.

Service

Topology & Scaling: Stateless microservices deployed in Kubernetes across multiple Availability Zones. Scaling is triggered primarily by CPU and Request Count.

API Schema Design:

POST /v1/messages: Sends an email. Returns message_id.

GET /v1/messages/inbox: Paginated list of recent messages.

GET /v1/messages/{id}: Fetch full body and attachment links.

Resilience & Reliability:

Retry Policy: Mail Service uses exponential backoff when writing to Kafka or S3.

Circuit Breakers: Implemented on the Search DB connection to prevent service degradation during index hotspots.

Storage

Access Pattern: Write-heavy for incoming mail; Read-heavy for the "Inbox" view.

Database Table Design (Cassandra):

Table: user_emails

user_id (Partition Key)

received_at (Clustering Key, Descending)

message_id (UUID)

sender_email, subject, snippet, has_attachments, is_read, labels (set<text>).

Technical Selection:

Cassandra: Chosen for its "Always-on" write availability and natural fit for time-ordered data per user.

S3: Chosen for durability (99.999999999%) and cost-efficiency for large blobs.

Distribution Logic: Data is partitioned by user_id. This ensures that all data for a single user's inbox resides on the same node/shard, making the "Inbox" query very efficient.

Cache

Purpose: Reduce latency for the "Inbox View" and "Unread Count."

Key-Value Schema:

Key: inbox:{user_id} -> Value: List of the top 50 message metadata objects.

Key: unread_count:{user_id} -> Value: Integer.

TTL Strategy: 24 hours TTL, but invalidated/updated on every new incoming email (Write-through).

Failure Handling: If Redis is down, the service falls back to Cassandra. To prevent a "thundering herd," we use a random jitter in the TTL.

Messaging

Purpose: Asynchronous processing for search indexing, spam detection, and push notifications.

Event Schema: { "user_id": "123", "message_id": "abc", "action": "CREATED", "timestamp": 167890 }.

Throughput: Kafka partitions are scaled by user_id to ensure message ordering for a single user while maintaining massive parallel throughput.

Technical Selection: Kafka is preferred for its high retention (allowing index replays) and high throughput.

Data Processing

Processing Model: A "Search Indexer" service consumes from Kafka in batches.

Processing DAG: Source (Kafka) -> Extract Text (from S3 Body) -> Tokenize -> Update Index (Elasticsearch).

Scalability: Consumer group auto-scaling based on consumer lag metrics.

Technical Selection: Custom Go or Java microservice for the indexer due to low latency requirements and lightweight footprint.

Wrap Up

Advanced Topics

Trade-offs: We choose Availability over Consistency (AP) for the storage layer. If a network partition occurs, users can still read and write emails, though some messages might take longer to sync across regions.

Reliability:

Dead Letter Queues (DLQ): Used in the Kafka pipeline for emails that fail virus scanning or indexing.

Multi-Region: For a Staff-level design, we would replicate Metadata to a secondary region using Cassandra's cross-DC replication to handle total regional failure.

Bottleneck Analysis:

Hot Partitions: Celebrities or high-volume automated accounts (e.g., "no-reply@amazon.com") could create hot partitions. We solve this by adding a shard ID suffix to partition keys if a user's inbox exceeds a certain size.

Storage Growth: We implement an "Attachment Deduplication" layer—if multiple users receive the same large PDF, we store the blob once in S3 and reference it multiple times.

Security: Emails are scanned for malware using a dedicated service before being committed to S3. All PII in logs is masked.