The Question

Real-Time Email Spam Detection System

Design a high-scale, real-time spam detection system for a global email provider. The system must process over 100,000 emails per second with a P99 latency under 50ms. Focus on a multi-stage architecture that balances heavy text-based deep learning with lightweight metadata-based filtering. Detail how you would handle adversarial attacks (e.g., character obfuscation), the cold-start problem for new senders, and the infrastructure required for high-precision model calibration and daily retraining. Address the critical trade-off between false positives and false negatives in a production environment.

LightGBM

DistilBERT

Kafka

Flink

Spark

Redis

ONNX

BigQuery

Questions & Insights

Clarifying Questions

Business Goal: What is the primary North Star metric?

Answer: Minimize the False Positive Rate (FPR), i.e. legitimate mail landing in Spam, while maximizing the Spam Catch Rate. Legitimate mail in Spam is a high-severity product failure.

Constraints & Scale: What is the expected scale and latency budget?

Answer: Assume 1.5B+ DAU, 100k+ QPS, and a P99 latency budget of <50ms for the classification result to avoid delaying delivery.

Data Freshness: How quickly must we adapt to new spam campaigns?

Answer: Adversaries adapt in minutes. The system needs near real-time feature updates (e.g., sender velocity) and daily/hourly model fine-tuning.

Content Privacy: Can we inspect email bodies?

Answer: Yes, but only via automated, privacy-preserving pipelines (embeddings/hashes). Humans never read the text.

Assumptions:

We have access to historical labels (user marked as spam/not spam).

We handle a corpus of billions of senders/IPs.

The system must handle attachments and embedded URLs.

Thinking Process

Identify the Bottleneck: The sheer volume of spam (often >90% of all internet email) means the system cannot run heavy deep learning on every request. I must design a tiered filtering approach: fast-reject heuristics followed by a more expensive ML model.

Precision vs. Recall: In spam, Precision (avoiding False Positives) is significantly more important than Recall. I will focus on high-threshold classification and calibration.

Feature Engineering: Spam is often a "velocity" and "reputation" game. The system must track real-time aggregations (e.g., "how many emails did this IP send in the last 60 seconds?").

Cold Start: New senders and domains are high risk. I'll need a way to handle low-reputation senders without blocking legitimate new users.

Elite Bonus Points

Adversarial Robustness: Implementing character-level CNNs or robust tokenization to handle obfuscation (e.g., V1agra or S.P.A.M).

Delayed Feedback Loop: "Mark as Spam" labels often arrive hours or days after delivery. I would implement a window-based label joiner that handles this temporal lag without introducing leakage.

Calibration for Thresholding: Since FPR is critical, the model must be well-calibrated (e.g., using Platt Scaling or Isotonic Regression) so that a score of 0.99 truly corresponds to a 99% probability of spam.

Holistic Sender Reputation: Using Graph Neural Networks (GNNs) or simple PageRank-style algorithms to compute the "trust score" of a sender based on their proximity to known spammers in the communication graph.
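The "robust tokenization" idea from the Adversarial Robustness point can be illustrated with a small normalization pass that runs before tokenization. This is a minimal sketch, not a complete defense: the substitution table `LEET_MAP`, the separator set, and the function name `normalize_obfuscated` are all assumptions made for illustration, and the normalized string is best used as an extra view next to the raw text, since the mapping also rewrites harmless tokens such as "mp3".

```python
import re
import unicodedata

# Hypothetical digit-for-letter table; a production table would be larger and data-driven.
LEET_MAP = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t"})
LEET_CHARS = set("013457")

# Runs of single letters joined by separators, e.g. "s.p.a.m" or "s-p-a-m".
SPLIT_WORD = re.compile(r"\b(?:[^\W\d_][.\-_*]){2,}[^\W\d_]\b")


def _deleet(token: str) -> str:
    """Map digits to letters only in tokens that mix both, so '555' stays '555'."""
    has_letter = any(c.isalpha() for c in token)
    has_leet = any(c in LEET_CHARS for c in token)
    return token.translate(LEET_MAP) if has_letter and has_leet else token


def normalize_obfuscated(text: str) -> str:
    # NFKC folds compatibility characters (e.g. full-width letters) to plain forms.
    text = unicodedata.normalize("NFKC", text).lower()
    # Strip the separators out of letter-by-letter spellings.
    text = SPLIT_WORD.sub(lambda m: re.sub(r"[.\-_*]", "", m.group(0)), text)
    return " ".join(_deleet(tok) for tok in text.split())


print(normalize_obfuscated("Buy V1agra now, S.P.A.M free"))
```

Both the raw and the normalized text can be fed to the text model or hashed, so an attacker has to defeat two views instead of one.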

Design Breakdown

Requirements

Product Goal: Protect users from malicious/annoying content while ensuring 100% deliverability of critical mail.

Success Metrics:

Online: Spam-in-Inbox Rate, False Positive Rate (FPR), Latency (P99).

Offline: AUC-PR (Precision-Recall), Precision at 0.0001 FPR.

Guardrail: Inference cost per 1M emails.

System Constraints: 100k QPS, <50ms P99 latency, high availability (99.99%).

Data Availability: SMTP headers (SPF/DKIM), Sender IP, Body Text, User History, Global Click-through on links.

ML Problem Framing

ML Task Type: Binary Classification.

Prediction Target:

P(\text{Spam} | \text{Email Content, Sender Metadata, Context})

Inputs:

User: Historical "Not Spam" clicks, contact list status.

Item (Email): Text embeddings (DistilBERT), link reputation, attachment hashes, HTML structure.

Context: Sender IP velocity, domain age, SPF/DKIM verification status.

Outputs: A probability score [0, 1].

ML Challenges: Extreme class imbalance (most mail is spam), adversarial evolution, and strict latency.

Design Summary & MVP

Concise Summary: A three-stage cascade: a rules-based Fast Rejector (IP blocklists, SPF checks), then a LightGBM classifier for high-volume metadata filtering, then a lightweight DistilBERT for content-level analysis on "gray" cases.

Model Selection:

Baseline: Logistic Regression on header features.

Target Model: LightGBM (for tabular/metadata) + DistilBERT (for text) as a hybrid ensemble.

Choice Rationale: LightGBM is extremely fast (sub-ms) and handles categorical features (like IP/Domain) well via histograms. DistilBERT provides semantic understanding of text while fitting within the latency budget.

Simplicity Audit: We avoid a massive multi-modal transformer for the MVP. 90% of spam can be caught using metadata and simple text-hashing.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Real-time SMTP traffic logs and user-generated feedback (Mark as Spam/Not Spam).

Data Ingestion: Kafka as the backbone. We use a high-throughput producer at the SMTP gateway.

Data Storage: BigQuery for structured logs. This allows for rapid analytical queries to identify new spam waves.

Data Quality: Schema validation on Kafka payloads to ensure header fields are consistent. We monitor "Null Rate" for critical fields like SenderIP.
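Because user feedback arrives hours or days after delivery, the data pipeline needs a join step that attaches labels to logged emails only once a feedback window has closed. The sketch below shows one way to do that in plain Python. The 72-hour window, the record types `LoggedEmail` and `Feedback`, and the rule "no report within the window means not spam" are assumptions for illustration; a production version would run as a batch or streaming join over the Kafka/BigQuery logs.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

LABEL_WINDOW_S = 72 * 3600  # assumed attribution window; tune to the observed feedback delay


@dataclass(frozen=True)
class LoggedEmail:
    email_id: str
    delivered_at: int           # unix seconds
    features: Dict[str, float]  # feature values exactly as logged at inference time


@dataclass(frozen=True)
class Feedback:
    email_id: str
    reported_at: int            # unix seconds
    is_spam: bool               # True = "Mark as Spam", False = "Not Spam"


def join_labels(emails: List[LoggedEmail], feedback: List[Feedback],
                now: int) -> List[Tuple[Dict[str, float], int]]:
    delivered = {e.email_id: e.delivered_at for e in emails}
    latest: Dict[str, Feedback] = {}
    for fb in sorted(feedback, key=lambda f: f.reported_at):
        sent = delivered.get(fb.email_id)
        if sent is not None and 0 <= fb.reported_at - sent <= LABEL_WINDOW_S:
            latest[fb.email_id] = fb  # later in-window feedback overrides earlier
    examples = []
    for e in emails:
        if now - e.delivered_at < LABEL_WINDOW_S:
            continue  # window still open: the label is not final yet
        fb = latest.get(e.email_id)
        label = int(fb.is_spam) if fb else 0  # assumption: unreported mail is legitimate
        examples.append((e.features, label))
    return examples
```

Features come from the inference-time log and are never recomputed at join time, so nothing observed after delivery leaks into the training example.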

Feature Pipeline

Feature Definition:

Global Statistics: Total emails sent by Domain X in last 1 hour (Real-time).

User-Specific: Is sender in user's address book? (Boolean).

Content: TF-IDF of subject line or DistilBERT embeddings.

Online vs Offline:

Online: Flink calculates "Velocity" (sliding window counts) and stores them in Redis.

Offline: Spark calculates long-term "Sender Reputation" (last 30 days) and stores it in the Feature Store.

Training/Serving Skew: We use a Unified Feature Logging approach: the feature values computed at inference time are logged and reused as the training set, so the model trains on exactly what it sees in production.
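The "Velocity" feature is a sliding-window count per key (IP or domain). The following in-memory sketch shows the logic that the Flink job and Redis store would implement at scale; the class name `SlidingWindowCounter` is hypothetical, and it assumes timestamps arrive in non-decreasing order per key, which a real stream would have to enforce with watermarks or tolerate with approximate buckets.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class SlidingWindowCounter:
    """Counts events per key over the last `window_s` seconds."""

    def __init__(self, window_s: int = 60) -> None:
        self.window_s = window_s
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def _evict(self, q: Deque[float], now: float) -> None:
        # Drop timestamps that have fallen out of the window (now - window_s, now].
        while q and q[0] <= now - self.window_s:
            q.popleft()

    def add(self, key: str, ts: float) -> None:
        q = self._events[key]
        q.append(ts)
        self._evict(q, ts)

    def count(self, key: str, now: float) -> int:
        q = self._events.get(key)
        if q is None:
            return 0
        self._evict(q, now)
        return len(q)


velocity = SlidingWindowCounter(window_s=60)
for ts in (0, 10, 30):
    velocity.add("203.0.113.7", ts)
print(velocity.count("203.0.113.7", now=59))
```

Storing every timestamp is exact but memory-hungry; at 100k QPS a bucketed counter (for example one count per second per key) is the more likely choice.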

Model Architecture

Architecture: A Hybrid Ensemble.

Tabular Model (LightGBM): Processes 50+ metadata features (IP, SPF, length, time).

Text Model (CNN/DistilBERT): Processes subject and snippet.

Inference Strategy: To save costs, we run LightGBM first. If the score is ambiguous (e.g., 0.4 - 0.7), we trigger the more expensive DistilBERT model.

Optimization: Use Quantization (INT8) for the DistilBERT model and ONNX Runtime for high-speed cross-platform serving.
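The inference strategy above amounts to a score cascade: the cheap model always runs, and the expensive one runs only inside the gray zone. In this minimal sketch the two models are passed in as plain callables (stand-ins for the LightGBM and ONNX-served DistilBERT scorers), and the way the two scores are blended, a weighted average with `text_weight`, is an assumption rather than something the design specifies.

```python
from typing import Callable, Mapping, Tuple

GRAY_LOW, GRAY_HIGH = 0.4, 0.7  # ambiguous band from the inference strategy


def cascade_score(
    features: Mapping[str, float],
    text: str,
    tabular_model: Callable[[Mapping[str, float]], float],
    text_model: Callable[[str], float],
    text_weight: float = 0.5,  # assumed blending weight
) -> Tuple[float, str]:
    """Return (spam probability, name of the stage that produced it)."""
    p_tab = tabular_model(features)
    if not (GRAY_LOW <= p_tab <= GRAY_HIGH):
        return p_tab, "tabular"  # confident either way: skip the text model
    p_text = text_model(text)
    blended = (1.0 - text_weight) * p_tab + text_weight * p_text
    return blended, "tabular+text"
```

Returning the stage name alongside the score makes it easy to log which path each email took, which helps when estimating how much traffic actually reaches the Transformer.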

Training Pipeline

Dataset Construction: We use Negative Downsampling (subsampling legitimate mail) since the volume of legitimate mail is high, but we keep the original class distribution in the validation set so that calibration is measured on production-like data.

Data Splitting: Time-based split. Train on days 1-28, test on days 29-30. This simulates the temporal nature of spam.

Retraining: Daily incremental training. We fine-tune the existing LightGBM model with the last 24 hours of data to adapt to new "bursty" spam campaigns.
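Negative downsampling inflates the scores a model learns, so they need correcting before thresholds are applied. If negatives are kept with probability w, the odds seen in training are 1/w times the true odds, which gives p = p' / (p' + (1 - p') / w) for a raw score p'. The helper below applies that standard correction; the function name and the example rate are illustrative, and it is a complement to, not a replacement for, Platt Scaling or Isotonic Regression fitted on the untouched validation set.

```python
def correct_for_negative_downsampling(p_sampled: float, keep_rate: float) -> float:
    """Map a score from a model trained on downsampled negatives back to the
    original class prior. `keep_rate` is the fraction of negatives kept (0 < w <= 1)."""
    if not 0.0 < keep_rate <= 1.0:
        raise ValueError("keep_rate must be in (0, 1]")
    if not 0.0 <= p_sampled <= 1.0:
        raise ValueError("p_sampled must be a probability")
    return p_sampled / (p_sampled + (1.0 - p_sampled) / keep_rate)


# A raw score of 0.5 with 10% of negatives kept corresponds to 1/11 in production.
print(correct_for_negative_downsampling(0.5, keep_rate=0.1))
```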

Serving Pipeline

Pattern: Synchronous Request-Response at the SMTP gateway.

Reliability: Fallback to Heuristics. If the ML service has a 5xx error or latency >100ms, the system falls back to a conservative heuristic (e.g., block only known bad IPs).

Latency Optimization: Batching. We batch inference requests for high-volume MTAs (Mail Transfer Agents) to improve throughput on GPUs for the Transformer component.
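The fallback rule can be sketched as a deadline around the model call. This is an illustration under assumptions: `ml_call` stands in for the real RPC to the ML service, `KNOWN_BAD_IPS` is a placeholder blocklist using a documentation IP range, and the 0.99 decision threshold is hypothetical. Any exception, including a timeout, routes the email to the conservative heuristic.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple

_POOL = ThreadPoolExecutor(max_workers=8)
KNOWN_BAD_IPS = {"203.0.113.7"}  # placeholder blocklist (documentation address range)


def heuristic_verdict(sender_ip: str) -> bool:
    """Conservative fallback: block only known bad IPs."""
    return sender_ip in KNOWN_BAD_IPS


def classify_with_fallback(
    sender_ip: str,
    ml_call: Callable[[], float],  # stand-in for the RPC to the ML service
    timeout_s: float = 0.1,        # the >100ms cutoff from the reliability rule
    threshold: float = 0.99,       # assumed spam decision threshold
) -> Tuple[bool, str]:
    """Return (is_spam, source of the decision)."""
    future = _POOL.submit(ml_call)
    try:
        score = future.result(timeout=timeout_s)
        return score >= threshold, "ml"
    except Exception:
        # Covers both a missed deadline and an error raised by the model call.
        future.cancel()  # no effect if the call is already running
        return heuristic_verdict(sender_ip), "heuristic"
```

Logging the second element of the result gives the share of traffic served by the fallback path, a useful health signal for the ML service.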

Evaluation Pipeline

Offline: We look at the Precision-Recall curve and, as the operating-point metric, optimize Recall at a fixed FPR of 10^-5.

Online: A/B Testing. Group A (current model) vs Group B (new model). We monitor "Spam Report Rate" and "Manual Rescue Rate" (user moving mail from spam to inbox).
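Recall at a fixed FPR can be computed directly from held-out scores: pick the lowest threshold that lets through no more than the allowed number of legitimate emails, then measure how much spam clears it. The sketch below is plain Python with a hypothetical function name. Measuring an FPR of 10^-5 with any confidence needs well over 10^5 legitimate examples in the evaluation set, since with fewer the allowed false-positive count rounds down to zero.

```python
import math
from typing import Sequence, Tuple


def recall_at_fpr(scores: Sequence[float], labels: Sequence[int],
                  max_fpr: float) -> Tuple[float, float]:
    """Return (recall, threshold) where mail is flagged when score > threshold
    and the false positive rate on label-0 examples does not exceed max_fpr."""
    neg = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    if not neg or not pos:
        raise ValueError("need at least one example of each class")
    allowed_fp = math.floor(max_fpr * len(neg))  # rounds down, so FPR never exceeds target
    # Flagging strictly above this score admits at most `allowed_fp` negatives.
    threshold = neg[allowed_fp] if allowed_fp < len(neg) else float("-inf")
    recall = sum(s > threshold for s in pos) / len(pos)
    return recall, threshold
```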

Monitoring Pipeline

Prediction Drift: Monitor the average score output. If the mean score shifts by >10% in an hour, it usually signifies a new massive spam attack or a feature pipeline failure.

Feature Drift: Monitor the distribution of "Sender IP Country". A sudden spike in mail from a specific region might indicate a botnet.
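The prediction-drift rule can be expressed as a comparison between the mean score in the current hour and a baseline window. This sketch reads the ">10%" as a relative change in the mean, which is an assumption; the function names are hypothetical, and a production monitor would usually add a minimum sample size and compare full distributions as well.

```python
from statistics import fmean
from typing import Sequence


def relative_mean_shift(baseline: Sequence[float], current: Sequence[float]) -> float:
    """Relative change of the mean score versus the baseline window."""
    base = fmean(baseline)
    if base == 0:
        raise ValueError("baseline mean is zero; relative shift is undefined")
    return abs(fmean(current) - base) / base


def score_drift_alert(baseline: Sequence[float], current: Sequence[float],
                      max_shift: float = 0.10) -> bool:
    """True when the mean score moved by more than `max_shift` (10% by default)."""
    return relative_mean_shift(baseline, current) > max_shift
```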

Wrap Up

Final Evaluation

Observability: Real-time dashboards showing the "Blocking Reason" (Heuristic vs ML vs Text).

Edge Cases:

Cold Start: For new IPs, we rely on "Domain Reputation" and "Sender Authenticity" (SPF/DKIM).

Adversarial: We use Image Hashing (pHash) to detect spam that is sent as an image to bypass text filters.

Trade-offs: We trade off Recall for Precision. It is better to let 10 spam emails through than to block 1 important work email.
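The "10 spam emails versus 1 work email" statement can be turned into a decision threshold, provided the score is a calibrated probability. Blocking costs (1 - p) * C_fp in expectation and delivering costs p * C_fn, so blocking is cheaper only when p > C_fp / (C_fp + C_fn). With the 10:1 ratio above, that works out to about 0.909. The cost values and function names below are illustrative assumptions; a production threshold would more likely be set from the target FPR.

```python
def cost_optimal_threshold(cost_fp: float, cost_fn: float) -> float:
    """Probability above which blocking has lower expected cost than delivering."""
    if cost_fp <= 0 or cost_fn <= 0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


def should_block(p_spam: float, threshold: float) -> bool:
    return p_spam > threshold


# One blocked legitimate email is treated as costing as much as 10 delivered spam emails.
threshold = cost_optimal_threshold(cost_fp=10.0, cost_fn=1.0)
print(round(threshold, 3))
```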