The Question
ML Design: Real-Time Email Spam Detection System
Design a high-scale, real-time spam detection system for a global email provider. The system must process over 100,000 emails per second with a P99 latency under 50ms. Focus on a multi-stage architecture that balances heavy text-based deep learning with lightweight metadata-based filtering. Detail how you would handle adversarial attacks (e.g., character obfuscation), the cold-start problem for new senders, and the infrastructure required for high-precision model calibration and daily retraining. Address the critical trade-off between false positives and false negatives in a production environment.
LightGBM
DistilBERT
Kafka
Flink
Spark
Redis
ONNX
BigQuery
Questions & Insights
Clarifying Questions
Business Goal: What is the primary North Star metric?
Answer: Minimize the False Positive Rate (FPR) (legitimate mail in Spam) while maximizing the Spam Catch Rate. Legitimate mail in Spam is a high-severity product failure.
Constraints & Scale: What is the expected scale and latency budget?
Answer: Assume 1.5B+ DAU, 100k+ QPS, and a P99 latency budget of <50ms for the classification result to avoid delaying delivery.
Data Freshness: How quickly must we adapt to new spam campaigns?
Answer: Adversaries adapt in minutes. The system needs near real-time feature updates (e.g., sender velocity) and daily/hourly model fine-tuning.
Content Privacy: Can we inspect email bodies?
Answer: Yes, but only via automated, privacy-preserving pipelines (embeddings/hashes). Humans never read the text.
Assumptions:
We have access to historical labels (user marked as spam/not spam).
We handle a corpus of billions of senders/IPs.
The system must handle attachments and embedded URLs.
Thinking Process
Identify the Bottleneck: The sheer volume of spam (often >90% of all internet email) means the system cannot run heavy deep learning on every request. I must design a tiered filtering approach: fast-reject heuristics followed by a more expensive ML model.
Precision vs. Recall: In spam, Precision (avoiding False Positives) is significantly more important than Recall. I will focus on high-threshold classification and calibration.
Feature Engineering: Spam is often a "velocity" and "reputation" game. The system must track real-time aggregations (e.g., "how many emails did this IP send in the last 60 seconds?").
Cold Start: New senders and domains are high risk. I'll need a way to handle low-reputation senders without blocking legitimate new users.
Elite Bonus Points
Adversarial Robustness: Implementing character-level CNNs or robust tokenization to handle obfuscation (e.g., V1agra or S.P.A.M).
Delayed Feedback Loop: "Mark as Spam" labels often arrive hours or days after delivery. I would implement a window-based label joiner that handles this temporal lag without introducing leakage.
Calibration for Thresholding: Since FPR is critical, the model must be well-calibrated (e.g., using Platt Scaling or Isotonic Regression) so that a score of 0.99 truly corresponds to a 99% probability of spam.
Holistic Sender Reputation: Using Graph Neural Networks (GNNs) or simple PageRank-style algorithms to compute the "trust score" of a sender based on their proximity to known spammers in the communication graph.
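The "robust tokenization" idea from the Adversarial Robustness point can be illustrated with a small normalization pass run before tokenization. This is a minimal sketch, not the system's actual tokenizer: the function name `normalize_obfuscated`, the `LEET_MAP` table, and the separator set are all assumptions chosen to cover the two examples in the text (V1agra, S.P.A.M).

```python
import re
import unicodedata

# Hypothetical look-alike table; a production table would be larger and data-driven.
LEET_MAP = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s"}
)

# Single letters joined by separators, e.g. "s.p.a.m" or "s-p-a-m".
SPACED_LETTERS = re.compile(r"\b(?:[a-z][.\-_*]){2,}[a-z]\b")
SEPARATORS = re.compile(r"[.\-_*]")
TOKEN = re.compile(r"[a-z0-9@$]+")


def _deleet(match: re.Match) -> str:
    token = match.group(0)
    # Only rewrite tokens that contain a letter, so "555" or "$100" stay intact.
    if any(ch.isalpha() for ch in token):
        return token.translate(LEET_MAP)
    return token


def normalize_obfuscated(text: str) -> str:
    """Fold common character obfuscations before tokenization (illustrative only)."""
    # NFKC folds compatibility look-alikes such as fullwidth letters to ASCII.
    text = unicodedata.normalize("NFKC", text).lower()
    # Collapse letter-separator-letter runs: "s.p.a.m" -> "spam".
    text = SPACED_LETTERS.sub(lambda m: SEPARATORS.sub("", m.group(0)), text)
    # Map digit and symbol look-alikes inside word-like tokens: "v1agra" -> "viagra".
    return TOKEN.sub(_deleet, text)
```

A pass like this is lossy (it would also rewrite the `@` in an email address), so it is probably best applied only to the copy of the body text fed to the classifier, with the original kept for delivery.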
Design Breakdown
Requirements
Product Goal: Protect users from malicious/annoying content while ensuring 100% deliverability of critical mail.
Success Metrics:
Online: Spam-in-Inbox Rate, False Positive Rate (FPR), Latency (P99).
Offline: AUC-PR (Precision-Recall), Precision at 0.0001 FPR.
Guardrail: Inference cost per 1M emails.
System Constraints: 100k QPS, <50ms P99 latency, high availability (99.99%).
Data Availability: SMTP headers (SPF/DKIM), Sender IP, Body Text, User History, Global Click-through on links.
ML Problem Framing
ML Task Type: Binary Classification.
Prediction Target: P(\text{Spam} | \text{Email Content, Sender Metadata, Context}).
Inputs:
User: Historical "Not Spam" clicks, contact list status.
Item (Email): Text embeddings (BERT-mini), link reputation, attachment hashes, HTML structure.
Context: Sender IP velocity, domain age, SPF/DKIM verification status.
Outputs: A probability score [0, 1].
ML Challenges: Extreme class imbalance (most mail is spam), adversarial evolution, and strict latency.
Design Summary & MVP
Concise Summary: A tiered system consisting of a Rules-based Fast Rejector (IP blocklists, SPF checks), followed by a LightGBM classifier for high-volume metadata filtering, and a lightweight DistilBERT for content-level analysis on "gray" cases.
Model Selection:
Baseline: Logistic Regression on header features.
Target Model: LightGBM (for tabular/metadata) + DistilBERT (for text) as a hybrid ensemble.
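Choice Rationale: LightGBM is extremely fast (sub-ms) thanks to histogram-based tree building and supports categorical features natively. Very high-cardinality fields such as raw IP or domain are better fed in as aggregated reputation features than as raw categories. DistilBERT provides semantic understanding of text while fitting within the latency budget.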
Simplicity Audit: We avoid a massive multi-modal transformer for the MVP. 90% of spam can be caught using metadata and simple text-hashing.
System Architecture
Pipeline Deep Dive
Data Pipeline
Data Source: Real-time SMTP traffic logs and user-generated feedback (Mark as Spam/Not Spam).
Data Ingestion: Kafka as the backbone. We use a high-throughput producer at the SMTP gateway.
Data Storage: BigQuery for structured logs. This allows for rapid analytical queries to identify new spam waves.
Data Quality: Schema validation on Kafka payloads to ensure header fields are consistent. We monitor "Null Rate" for critical fields like SenderIP.
Feature Pipeline
Feature Definition:
Global Statistics: Total emails sent by Domain X in last 1 hour (Real-time).
User-Specific: Is sender in user's address book? (Boolean).
Content: TF-IDF of subject line or DistilBERT embeddings.
Online vs Offline:
Online: Flink calculates "Velocity" (sliding window counts) and stores them in Redis.
Offline: Spark calculates long-term "Sender Reputation" (last 30 days) and stores it in the Feature Store.
Training/Serving Skew: We use a Unified Feature Logging approach: the feature values computed at inference time are logged and reused as training data, so the model trains on exactly what it sees in production.
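The "Velocity" feature (sliding window counts per sender) can be sketched in plain Python to make the semantics concrete. This is an in-memory stand-in under stated assumptions, not Flink or Redis code: the class name `SlidingWindowCounter` is hypothetical, and it assumes timestamps arrive in non-decreasing order for each key.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Counts events per key within the trailing `window_seconds` (illustrative)."""

    def __init__(self, window_seconds: float = 60.0) -> None:
        self.window_seconds = window_seconds
        self._events: dict[str, deque] = defaultdict(deque)

    def record(self, key: str, ts: float) -> None:
        # Assumes per-key timestamps are appended in non-decreasing order.
        self._events[key].append(ts)

    def count(self, key: str, now: float) -> int:
        events = self._events.get(key)
        if not events:
            return 0
        cutoff = now - self.window_seconds
        # Evict everything at or before the cutoff, then count what remains.
        while events and events[0] <= cutoff:
            events.popleft()
        return len(events)
```

For example, after recording sends from one IP at t=0, 10, and 50, `count(ip, now=55)` returns 3 and `count(ip, now=65)` returns 2. Storing every timestamp is exact but memory-hungry; at the scale described, bucketed or approximate counters would likely be needed instead.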
Model Architecture
Architecture: A Hybrid Ensemble.
Tabular Model (LightGBM): Processes 50+ metadata features (IP, SPF, length, time).
Text Model (CNN/DistilBERT): Processes subject and snippet.
Inference Strategy: To save costs, we run LightGBM first. If the score is ambiguous (e.g., 0.4 - 0.7), we trigger the more expensive DistilBERT model.
Optimization: Use Quantization (INT8) for the DistilBERT model and ONNX Runtime for high-speed cross-platform serving.
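The inference strategy above (cheap tabular model first, text model only for ambiguous scores) can be sketched as a small cascade. The models are passed in as plain callables so the sketch stays independent of any serving stack; `cascade_score`, `CascadeResult`, and the equal-weight blend are assumptions for illustration, since the text does not say how the two scores are combined.

```python
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass(frozen=True)
class CascadeResult:
    score: float
    stage: str  # which stage produced the final score


def cascade_score(
    features: Mapping[str, float],
    text: str,
    tabular_model: Callable[[Mapping[str, float]], float],
    text_model: Callable[[str], float],
    low: float = 0.4,
    high: float = 0.7,
    text_weight: float = 0.5,  # hypothetical blend weight, not from the source
) -> CascadeResult:
    tab_score = tabular_model(features)
    # Confident either way: skip the expensive text model entirely.
    if tab_score < low or tab_score > high:
        return CascadeResult(tab_score, "tabular")
    # Ambiguous band: pay for the text model and blend the two scores.
    txt_score = text_model(text)
    blended = (1.0 - text_weight) * tab_score + text_weight * txt_score
    return CascadeResult(blended, "tabular+text")
```

The width of the ambiguous band directly controls cost: the fraction of traffic falling in `[low, high]` is the fraction that reaches the transformer.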
Training Pipeline
Dataset Construction: We use Negative Downsampling since the volume of legitimate mail is high, but we maintain the original distribution for the validation set to ensure calibration.
Data Splitting: Time-based split. Train on days 1-28, test on days 29-30. This simulates the temporal nature of spam.
Retraining: Daily incremental training. We fine-tune the existing LightGBM model with the last 24 hours of data to adapt to new "bursty" spam campaigns.
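Downsampling one class during training inflates the model's predicted odds for the other class, which is why the validation set keeps the original distribution. A common correction rescales scores by the sampling rate. The sketch below assumes label 0 is the downsampled class and that `keep_rate` is the fraction of those rows retained; both function names are hypothetical and this is not claimed to be the author's exact procedure.

```python
import random
from typing import Sequence, Tuple


def downsample_negatives(
    rows: Sequence[Tuple[object, int]], keep_rate: float, seed: int = 0
) -> list:
    """Keep every label-1 row and a random `keep_rate` fraction of label-0 rows."""
    if not 0.0 < keep_rate <= 1.0:
        raise ValueError("keep_rate must be in (0, 1]")
    rng = random.Random(seed)  # seeded for reproducible training sets
    return [row for row in rows if row[1] == 1 or rng.random() < keep_rate]


def correct_probability(p: float, keep_rate: float) -> float:
    """Map a score from the downsampled model back to the original base rate.

    Downsampling multiplies the positive-to-negative odds by 1 / keep_rate,
    so the true odds are keep_rate * p / (1 - p).
    """
    return p / (p + (1.0 - p) / keep_rate)
```

For instance, with `keep_rate=0.1` a raw score of 0.5 corresponds to a corrected probability of 1/11, roughly 0.09. A calibration step such as Platt scaling or isotonic regression on the untouched validation set would still be applied afterwards.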
Serving Pipeline
Pattern: Synchronous Request-Response at the SMTP gateway.
Reliability: Fallback to Heuristics. If the ML service has a 5xx error or latency >100ms, the system falls back to a conservative heuristic (e.g., block only known bad IPs).
Latency Optimization: Batching. We batch inference requests for high-volume MTAs (Mail Transfer Agents) to improve throughput on GPUs for the Transformer component.
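The fallback behaviour described under Reliability can be sketched with a thread pool and a result timeout. The function name and the `(score, source)` return shape are assumptions; `ml_classify` and `heuristic_classify` stand in for whatever clients the gateway actually uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple


def classify_with_fallback(
    email: str,
    ml_classify: Callable[[str], float],
    heuristic_classify: Callable[[str], float],
    executor: ThreadPoolExecutor,
    timeout_s: float = 0.1,  # the 100ms cutoff mentioned in the text
) -> Tuple[float, str]:
    """Return (score, source), where source is "ml" or "heuristic"."""
    future = executor.submit(ml_classify, email)
    try:
        return future.result(timeout=timeout_s), "ml"
    except Exception:
        # Covers both a timeout waiting for the result and errors raised by
        # the ML call itself (e.g. a 5xx surfaced as an exception).
        # cancel() only helps if the call has not started yet; an in-flight
        # call keeps running in its worker thread until it finishes.
        future.cancel()
        return heuristic_classify(email), "heuristic"
```

The timeout is left as a parameter because a 100ms cutoff is looser than the 50ms P99 budget stated earlier, so a deployment would probably tune it downward. Logging the `source` field also feeds the "Blocking Reason" dashboards directly.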
Evaluation Pipeline
Offline: We look at the Precision-Recall Curve. Specifically, we optimize for Recall at 10^-5 FPR.
Online: A/B Testing. Group A (current model) vs Group B (new model). We monitor "Spam Report Rate" and "Manual Rescue Rate" (user moving mail from spam to inbox).
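The offline metric "Recall at a fixed FPR" can be computed directly from scored validation data. The sketch below is a plain-Python illustration under the assumption that label 1 means spam and that an email is flagged when its score is strictly above the threshold; the function name is hypothetical.

```python
import math
from typing import Sequence, Tuple


def recall_at_fpr(
    scores: Sequence[float], labels: Sequence[int], max_fpr: float
) -> Tuple[float, float]:
    """Return (threshold, recall) for the loosest threshold with FPR <= max_fpr."""
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not negatives or not positives:
        raise ValueError("need at least one example of each class")
    # Number of legitimate emails we may misclassify (rounded down, so conservative).
    allowed_fp = int(max_fpr * len(negatives))
    if allowed_fp >= len(negatives):
        threshold = -math.inf
    else:
        # Flagging scores strictly above this value misclassifies at most
        # `allowed_fp` negatives, even when scores are tied.
        threshold = negatives[allowed_fp]
    recall = sum(1 for s in positives if s > threshold) / len(positives)
    return threshold, recall
```

One practical consequence: at a target of 10^-5 FPR, the validation set needs at least 100,000 legitimate emails before even a single false positive is allowed, so small evaluation sets cannot measure this operating point reliably.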
Monitoring Pipeline
Prediction Drift: Monitor the average score output. If the mean score shifts by >10% in an hour, it usually signifies a new massive spam attack or a feature pipeline failure.
Feature Drift: Monitor the distribution of "Sender IP Country". A sudden spike in mail from a specific region might indicate a botnet.
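The prediction-drift rule can be expressed as a simple check on the mean score of two windows. This sketch reads ">10%" as a relative change against a baseline window, which is an interpretation rather than something the text specifies; the function names are hypothetical.

```python
from statistics import fmean
from typing import Sequence


def mean_score_shift(baseline: Sequence[float], current: Sequence[float]) -> float:
    """Relative change of the mean score between two windows."""
    base_mean = fmean(baseline)
    if base_mean == 0:
        raise ValueError("baseline mean is zero; relative shift is undefined")
    return abs(fmean(current) - base_mean) / base_mean


def is_drifting(
    baseline: Sequence[float], current: Sequence[float], threshold: float = 0.10
) -> bool:
    # Alert when the hourly mean moves by more than `threshold` (10% by default).
    return mean_score_shift(baseline, current) > threshold
```

A mean-only check is cheap but can miss shifts that leave the mean unchanged, so in practice it would likely sit alongside a distribution-level comparison of the score histogram.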
Wrap Up
Final Evaluation
Observability: Real-time dashboards showing the "Blocking Reason" (Heuristic vs ML vs Text).
Edge Cases:
Cold Start: For new IPs, we rely on "Domain Reputation" and "Sender Authenticity" (SPF/DKIM).
Adversarial: We use Image Hashing (pHash) to detect spam that is sent as an image to bypass text filters.
Trade-offs: We trade off Recall for Precision. It is better to let 10 spam emails through than to block 1 important work email.
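The "10 spam emails versus 1 important email" statement can be turned into a decision threshold if the model's scores are calibrated probabilities. Blocking costs (1 - p) * cost_fp in expectation and delivering costs p * cost_fn, so blocking is cheaper only when p > cost_fp / (cost_fp + cost_fn). The sketch below reads the sentence literally as a 10:1 cost ratio, which is an assumption; the function names are hypothetical.

```python
def cost_optimal_threshold(cost_fp: float, cost_fn: float) -> float:
    """Probability above which blocking has lower expected cost than delivering."""
    if cost_fp <= 0 or cost_fn <= 0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


def should_block(p_spam: float, cost_fp: float = 10.0, cost_fn: float = 1.0) -> bool:
    # Assumes p_spam is a calibrated probability, not a raw model score.
    return p_spam > cost_optimal_threshold(cost_fp, cost_fn)
```

With a 10:1 ratio the threshold is 10/11, about 0.909. The FPR targets quoted elsewhere in this design imply a much stricter operating point than that, so the ratio is better treated as an illustration of the direction of the trade-off than as the production setting.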