The Question
ML Design

Real-Time Email Spam Detection System

Design a high-scale, real-time spam detection system for a global email provider. The system must process over 100,000 emails per second with a P99 latency under 50ms. Focus on a multi-stage architecture that balances heavy text-based deep learning with lightweight metadata-based filtering. Detail how you would handle adversarial attacks (e.g., character obfuscation), the cold-start problem for new senders, and the infrastructure required for high-precision model calibration and daily retraining. Address the critical trade-off between false positives and false negatives in a production environment.
LightGBM
DistilBERT
Kafka
Flink
Spark
Redis
ONNX
BigQuery
Questions & Insights

Clarifying Questions

Business Goal: What is the primary North Star metric?
Answer: Minimize the False Positive Rate (FPR), i.e. legitimate mail landing in Spam, while maximizing the Spam Catch Rate. Legitimate mail in Spam is a high-severity product failure.
Constraints & Scale: What is the expected scale and latency budget?
Answer: Assume 1.5B+ DAU, 100k+ QPS, and a P99 latency budget of <50ms for the classification result to avoid delaying delivery.
Data Freshness: How quickly must we adapt to new spam campaigns?
Answer: Adversaries adapt in minutes. The system needs near real-time feature updates (e.g., sender velocity) and daily/hourly model fine-tuning.
Content Privacy: Can we inspect email bodies?
Answer: Yes, but only via automated, privacy-preserving pipelines (embeddings/hashes). Humans never read the text.
Assumptions:
We have access to historical labels (user marked as spam/not spam).
We handle a corpus of billions of senders/IPs.
The system must handle attachments and embedded URLs.

Thinking Process

Identify the Bottleneck: The sheer volume of spam (often >90% of all internet email) means the system cannot run heavy deep learning on every request. I must design a tiered filtering approach: fast-reject heuristics followed by a more expensive ML model.
Precision vs. Recall: In spam filtering, Precision (avoiding False Positives) is significantly more important than Recall. I will focus on high-threshold classification and calibration.
Feature Engineering: Spam is often a "velocity" and "reputation" game. The system must track real-time aggregations (e.g., "how many emails did this IP send in the last 60 seconds?").
Cold Start: New senders and domains are high risk. I'll need a way to handle low-reputation senders without blocking legitimate new users.

Elite Bonus Points

Adversarial Robustness: Implementing character-level CNNs or robust tokenization to handle obfuscation (e.g., V1agra or S.P.A.M).
Delayed Feedback Loop: "Mark as Spam" labels often arrive hours or days after delivery. I would implement a window-based label joiner that handles this temporal lag without introducing leakage.
Calibration for Thresholding: Since FPR is critical, the model must be well-calibrated (e.g., using Platt Scaling or Isotonic Regression) so that a score of 0.99 truly corresponds to a 99% probability of spam.
Holistic Sender Reputation: Using Graph Neural Networks (GNNs) or simple PageRank-style algorithms to compute the "trust score" of a sender based on their proximity to known spammers in the communication graph.
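As a minimal sketch of the "robust tokenization" idea for obfuscated tokens such as V1agra or S.P.A.M, the normalizer below folds a few look-alike characters and strips separators placed between letters. The substitution table and the name `normalize_token` are illustrative assumptions, not a production mapping.

```python
import re
import unicodedata

# Hypothetical look-alike map; a real system would maintain a much larger table.
LEET_MAP = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s"}
)

# Separators inserted between letters, e.g. "S.P.A.M" or "v-i-a-g-r-a".
_SEPARATORS = re.compile(r"(?<=\w)[.\-_*]+(?=\w)")


def normalize_token(token: str) -> str:
    """Return a canonical lowercase form of a possibly obfuscated token."""
    # NFKD splits accented letters into base letter + combining mark; drop the marks.
    decomposed = unicodedata.normalize("NFKD", token)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    collapsed = _SEPARATORS.sub("", stripped)
    return collapsed.lower().translate(LEET_MAP)
```

Because the mapping also rewrites legitimate tokens that contain digits, the normalized form is probably best fed to the model as an extra view of the text alongside the raw token rather than as a replacement.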
Design Breakdown

Requirements

Product Goal: Protect users from malicious/annoying content while ensuring 100% deliverability of critical mail.
Success Metrics:
Online: Spam-in-Inbox Rate, False Positive Rate (FPR), Latency (P99).
Offline: AUC-PR (Precision-Recall), Precision at 0.0001 FPR.
Guardrail: Inference cost per 1M emails.
System Constraints: 100k QPS, <50ms P99 latency, high availability (99.99%).
Data Availability: SMTP headers (SPF/DKIM), Sender IP, Body Text, User History, Global Click-through on links.

ML Problem Framing

ML Task Type: Binary Classification.
Prediction Target: P(Spam | Email Content, Sender Metadata, Context).
Inputs:
User: Historical "Not Spam" clicks, contact list status.
Item (Email): Text embeddings (DistilBERT), link reputation, attachment hashes, HTML structure.
Context: Sender IP velocity, domain age, SPF/DKIM verification status.
Outputs: A probability score [0, 1].
ML Challenges: Extreme class imbalance (most mail is spam), adversarial evolution, and strict latency.

Design Summary & MVP

Concise Summary: A multi-stage system: a rules-based Fast Rejector (IP blocklists, SPF checks), followed by a LightGBM classifier for high-volume metadata filtering, and a lightweight DistilBERT model for content-level analysis on "gray" cases.
Model Selection:
Baseline: Logistic Regression on header features.
Target Model: LightGBM (for tabular/metadata) + DistilBERT (for text) as a hybrid ensemble.
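Choice Rationale: LightGBM is extremely fast (sub-ms per prediction) thanks to histogram-based splits and supports categorical features natively, although high-cardinality fields such as IP or domain usually need hashing or encoding first. DistilBERT provides semantic understanding of text while fitting within the latency budget.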
Simplicity Audit: We avoid a massive multi-modal transformer for the MVP. 90% of spam can be caught using metadata and simple text-hashing.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Real-time SMTP traffic logs and user-generated feedback (Mark as Spam/Not Spam).
Data Ingestion: Kafka as the backbone. We use a high-throughput producer at the SMTP gateway.
Data Storage: BigQuery for structured logs. This allows for rapid analytical queries to identify new spam waves.
Data Quality: Schema validation on Kafka payloads to ensure header fields are consistent. We monitor "Null Rate" for critical fields like SenderIP.
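The data-quality checks above can be sketched roughly as follows. The field names in `REQUIRED_FIELDS` are hypothetical; the real payload schema is not given in this write-up.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical required fields for an SMTP log payload.
REQUIRED_FIELDS = ("message_id", "sender_ip", "sender_domain", "received_at")


def missing_fields(payload: Dict[str, Any]) -> List[str]:
    """List required fields that are absent, None, or empty strings."""
    return [f for f in REQUIRED_FIELDS if payload.get(f) in (None, "")]


def null_rate(payloads: Iterable[Dict[str, Any]], field: str) -> float:
    """Fraction of payloads in which `field` is missing or empty."""
    total = 0
    nulls = 0
    for payload in payloads:
        total += 1
        if payload.get(field) in (None, ""):
            nulls += 1
    return nulls / total if total else 0.0
```

In practice the null rate would be computed per time window and alerted on when it rises above its usual level.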

Feature Pipeline

Feature Definition:
Global Statistics: Total emails sent by Domain X in last 1 hour (Real-time).
User-Specific: Is sender in user's address book? (Boolean).
Content: TF-IDF of subject line or DistilBERT embeddings.
Online vs Offline:
Online: Flink calculates "Velocity" (sliding window counts) and stores them in Redis.
Offline: Spark calculates long-term "Sender Reputation" (last 30 days) and stores it in the Feature Store.
Training/Serving Skew: We use a Unified Feature Logging approach—the features used during inference are logged to the training set to ensure the model sees exactly what it will see in production.
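To make the "velocity" feature concrete, here is an in-process sketch of a sliding-window counter. It only illustrates the logic; in the design above this state would live in Flink and Redis, and the class name and interface are assumptions.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class SlidingWindowCounter:
    """Counts events per key over the trailing `window_seconds`."""

    def __init__(self, window_seconds: float = 60.0) -> None:
        self.window_seconds = window_seconds
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def record(self, key: str, timestamp: float) -> None:
        # Assumes timestamps arrive in non-decreasing order per key.
        self._events[key].append(timestamp)

    def count(self, key: str, now: float) -> int:
        events = self._events[key]
        # Evict timestamps outside the window (now - window_seconds, now].
        while events and events[0] <= now - self.window_seconds:
            events.popleft()
        return len(events)
```

For example, after recording sends from one IP at t = 0, 10 and 50 seconds, `count(ip, now=65.0)` returns 2 because the first event has left the 60-second window.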

Model Architecture

Architecture: A Hybrid Ensemble.
Tabular Model (LightGBM): Processes 50+ metadata features (IP, SPF, length, time).
Text Model (CNN/DistilBERT): Processes subject and snippet.
Inference Strategy: To save costs, we run LightGBM first. If the score is ambiguous (e.g., 0.4 - 0.7), we trigger the more expensive DistilBERT model.
Optimization: Use Quantization (INT8) for the DistilBERT model and ONNX Runtime for high-speed cross-platform serving.
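The cascaded inference strategy can be sketched as below. The two models are passed in as plain callables, and the weighted-average blend is an assumption, since the write-up does not say how the two scores are combined.

```python
from typing import Callable, Dict, Tuple

# Gray band from the text (0.4 to 0.7); treat it as a tunable assumption.
GRAY_LOW, GRAY_HIGH = 0.4, 0.7


def cascade_score(
    email: Dict[str, object],
    tabular_model: Callable[[Dict[str, object]], float],
    text_model: Callable[[Dict[str, object]], float],
    text_weight: float = 0.5,
) -> Tuple[float, str]:
    """Return (spam_probability, stage), where stage names the models used."""
    tabular_score = tabular_model(email)
    if tabular_score < GRAY_LOW or tabular_score > GRAY_HIGH:
        # Confident either way: skip the expensive text model.
        return tabular_score, "tabular"
    text_score = text_model(email)
    # Simple weighted average; a learned stacker could replace this.
    blended = (1.0 - text_weight) * tabular_score + text_weight * text_score
    return blended, "tabular+text"
```

Returning the stage alongside the score makes it easy to log how often the expensive path is taken, which drives the inference-cost guardrail.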

Training Pipeline

Dataset Construction: We use Negative Downsampling since the volume of legitimate mail is high, but we keep the original class distribution in the validation set so that calibration is measured on production-like data.
Data Splitting: Time-based split. Train on days 1-28, test on days 29-30. This simulates the temporal nature of spam.
Retraining: Daily incremental training. We fine-tune the existing LightGBM model with the last 24 hours of data to adapt to new "bursty" spam campaigns.
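One consequence of Negative Downsampling is that raw scores from the trained model overestimate the spam probability. A common correction, sketched here under the assumption that negatives were kept uniformly at random with rate `keep_rate`, rescales the odds back to the unsampled distribution. The function name is illustrative.

```python
def correct_downsampled_probability(p_sampled: float, keep_rate: float) -> float:
    """Map a score from a model trained with negatives kept at `keep_rate`
    back to the probability scale of the unsampled data."""
    if not 0.0 < keep_rate <= 1.0:
        raise ValueError("keep_rate must be in (0, 1]")
    # Keeping only a fraction keep_rate of negatives inflates the odds of
    # spam by 1 / keep_rate; multiplying the odds by keep_rate undoes that.
    return p_sampled / (p_sampled + (1.0 - p_sampled) / keep_rate)
```

With `keep_rate = 0.1`, a score of 0.5 on the downsampled scale corresponds to 1/11 (about 0.09) on the original scale. A calibration step such as Platt Scaling or Isotonic Regression fitted on the unsampled validation set can serve the same purpose.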

Serving Pipeline

Pattern: Synchronous Request-Response at the SMTP gateway.
Reliability: Fallback to Heuristics. If the ML service has a 5xx error or latency >100ms, the system falls back to a conservative heuristic (e.g., block only known bad IPs).
Latency Optimization: Batching. We batch inference requests for high-volume MTAs (Mail Transfer Agents) to improve throughput on GPUs for the Transformer component.
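The fallback behaviour can be sketched with a timeout around the model call. Here `ml_score` and `heuristic_score` are hypothetical callables standing in for the ML service client and the conservative rule set; any exception from `ml_score` stands in for a 5xx response.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable, Dict

# Shared pool so a timed-out call does not block the caller on shutdown.
_POOL = ThreadPoolExecutor(max_workers=8)


def classify_with_fallback(
    email: Dict[str, object],
    ml_score: Callable[[Dict[str, object]], float],
    heuristic_score: Callable[[Dict[str, object]], float],
    timeout_s: float = 0.1,  # 100 ms, the cutoff named in the text
) -> float:
    """Return the ML score, or the heuristic score on error or timeout."""
    future = _POOL.submit(ml_score, email)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()  # best effort; a call already running is not interrupted
        return heuristic_score(email)
    except Exception:
        # Stands in for a 5xx from the model service.
        return heuristic_score(email)
```

A production gateway would more likely rely on the RPC client's own deadline and a circuit breaker, but the control flow is the same.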

Evaluation Pipeline

Offline: We look at the Precision-Recall Curve. Specifically, we optimize for Recall at 10^-5 FPR, i.e. at most one legitimate email flagged per 100,000.
Online: A/B Testing. Group A (current model) vs Group B (new model). We monitor "Spam Report Rate" and "Manual Rescue Rate" (user moving mail from spam to inbox).
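The offline metric "Recall at a fixed FPR" can be computed directly from validation scores. This is a small sketch that assumes both score lists are non-empty; the function name is illustrative.

```python
import math
from typing import Sequence, Tuple


def recall_at_fpr(
    spam_scores: Sequence[float],
    legit_scores: Sequence[float],
    target_fpr: float,
) -> Tuple[float, float]:
    """Return (threshold, recall) for the rule `score > threshold` such that
    the empirical FPR on legit_scores does not exceed target_fpr."""
    ranked = sorted(legit_scores, reverse=True)
    # Number of legitimate emails we may misclassify at this FPR.
    allowed_fp = math.floor(target_fpr * len(ranked))
    # The threshold sits at the first legitimate score we must NOT flag.
    threshold = ranked[allowed_fp] if allowed_fp < len(ranked) else float("-inf")
    caught = sum(1 for s in spam_scores if s > threshold)
    return threshold, caught / len(spam_scores)
```

Note that resolving an FPR of 10^-5 requires more than 100,000 legitimate validation emails; with fewer, `allowed_fp` is zero and the threshold is simply the highest legitimate score.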

Monitoring Pipeline

Prediction Drift: Monitor the average score output. If the mean score shifts by >10% in an hour, it usually signifies a new massive spam attack or a feature pipeline failure.
Feature Drift: Monitor the distribution of "Sender IP Country". A sudden spike in mail from a specific region might indicate a botnet.
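The prediction-drift rule can be expressed as a simple comparison of window means. This sketch assumes non-empty windows of scores; the choice of baseline window (for example the same hour on the previous day) is left open.

```python
from statistics import fmean
from typing import Sequence


def mean_shift_alert(
    baseline_scores: Sequence[float],
    current_scores: Sequence[float],
    max_relative_shift: float = 0.10,  # the 10% figure from the text
) -> bool:
    """True if the mean score moved by more than max_relative_shift
    relative to the baseline mean."""
    baseline_mean = fmean(baseline_scores)
    current_mean = fmean(current_scores)
    if baseline_mean == 0.0:
        return current_mean != 0.0
    return abs(current_mean - baseline_mean) / baseline_mean > max_relative_shift
```

A mean-only check can miss shifts that change the shape of the score distribution without moving its average, so a distribution-level statistic is a reasonable complement.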
Wrap Up

Final Evaluation

Observability: Real-time dashboards showing the "Blocking Reason" (Heuristic vs ML vs Text).
Edge Cases:
Cold Start: For new IPs, we rely on "Domain Reputation" and "Sender Authenticity" (SPF/DKIM).
Adversarial: We use Image Hashing (pHash) to detect spam that is sent as an image to bypass text filters.
Trade-offs: We trade off Recall for Precision. It is better to let 10 spam emails through than to block 1 important work email.
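The 10-to-1 statement can be turned into a decision threshold, assuming the model outputs a calibrated probability p. Blocking costs (1 - p) times the false-positive cost in expectation, delivering costs p times the false-negative cost, so blocking is cheaper once p exceeds the ratio computed below. The function name and the cost framing are illustrative.

```python
def min_cost_threshold(cost_false_positive: float, cost_false_negative: float) -> float:
    """Spam-probability threshold above which blocking has lower expected cost.

    Only meaningful when the score is a calibrated probability.
    """
    if cost_false_positive <= 0 or cost_false_negative <= 0:
        raise ValueError("costs must be positive")
    return cost_false_positive / (cost_false_positive + cost_false_negative)
```

With a false positive costing 10 times a false negative, the threshold is 10/11, roughly 0.91. Stricter FPR targets correspond to a higher cost ratio and therefore a threshold closer to 1.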