The Question
ML Design

Large-Scale Adversarial Spam Detection System

Design a high-throughput spam detection system capable of processing billions of messages daily. The system must maintain a P99 latency under 50ms while balancing extreme class imbalance and adapting to rapidly evolving adversarial attacks. Focus on the end-to-end architecture from real-time feature engineering (sender reputation, content signals) to model training/serving, and explain how you would handle delayed labels from user feedback and ensure the system minimizes false positives for high-priority legitimate communication.
Related technologies: LightGBM, Kafka, Flink, Spark, Redis, Feast, Prometheus, XGBoost
Questions & Insights

Clarifying Questions

Business Goal: Is the primary objective to minimize False Positives (legitimate mail in spam) or maximize Recall (catching all spam)? Assumption: Precision is the North Star; False Positives (FPs) destroy user trust.
Constraints & Scale: What is the traffic volume? Assumption: 1B+ messages per day (roughly 11.6k QPS on average), peak QPS of 50k, and a P99 latency budget of <50ms.
Scope: Are we detecting text-based spam only, or multi-modal (images, links)? Assumption: Text and Metadata (sender info, links) for the MVP.
Freshness: How fast must the system adapt to new spam campaigns? Assumption: Near real-time adaptation is required to stop "burst" spam attacks.

Thinking Process

Identify the Core Trade-off: Spam detection is an adversarial game. The cost of a False Positive (missing an important job offer) is much higher than a False Negative (seeing a "free prize" email).
Bottleneck Analysis: The volume of data is massive. We cannot run heavy Transformers on every single message. I need a layered approach: Fast Path (Heuristics/Blocklists) -> ML Path (Feature-based classifier).
Featurization: Success depends more on "sender reputation" and "link safety" than just the text content.
Scaling: Using a Feature Store is critical to ensure training-serving consistency, especially for aggregate features (e.g., "how many emails did this IP send in the last 5 minutes?").
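
The "emails from this IP in the last 5 minutes" aggregate can be illustrated with a small in-memory sliding-window counter. This is only a sketch of the idea, not how Flink or a feature store implements it; the class name, the example IP, and the assumption that events arrive in timestamp order are all mine.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Counts events per key over a trailing window (assumes in-order timestamps)."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # key -> timestamps, oldest first

    def _evict(self, key: str, now: float) -> None:
        timestamps = self._events[key]
        # Drop timestamps at or before the window start (now - window).
        while timestamps and timestamps[0] <= now - self.window_seconds:
            timestamps.popleft()

    def record(self, key: str, now: float) -> None:
        self._evict(key, now)
        self._events[key].append(now)

    def count(self, key: str, now: float) -> int:
        self._evict(key, now)
        return len(self._events[key])


# Hypothetical feature: messages sent by one IP in the last 5 minutes.
ip_volume_5m = SlidingWindowCounter(window_seconds=300)
for t in (0, 10, 200, 310):
    ip_volume_5m.record("203.0.113.7", now=t)
```

The same definition has to be used when building training rows offline, otherwise the model sees a different feature at serving time than it was trained on.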

Elite Bonus Points

Adversarial Robustness: Discussing "adversarial training" where we perturb text (e.g., "f-r-e-e" vs "free") to ensure the model isn't easily bypassed.
Delayed Labeling & Active Learning: Spam labels are often provided by users (the "Mark as Spam" button). This creates a feedback loop with a delay. I'll propose a "Human-in-the-loop" queue for ambiguous cases.
Feature Versioning: Using "point-in-time" joins in the Feature Store to prevent data leakage during training.
Sender Reputation Decay: Implementing a Half-Life decay for sender reputation scores to allow reformed spammers (or compromised accounts that are recovered) to regain trust over time.
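
A minimal sketch of half-life decay for a sender's spam score, assuming a 30-day half-life and a score that grows by a fixed weight per spam report. Both values, and the `SenderReputation` name, are illustrative assumptions rather than part of the design above.

```python
from dataclasses import dataclass


@dataclass
class SenderReputation:
    spam_score: float = 0.0       # accumulated, decayed spam evidence
    last_update_day: float = 0.0  # day on which spam_score was last written

    def current(self, today: float, half_life_days: float = 30.0) -> float:
        """Score after exponential decay: halves every half_life_days."""
        elapsed = max(0.0, today - self.last_update_day)
        return self.spam_score * 0.5 ** (elapsed / half_life_days)

    def report_spam(self, today: float, weight: float = 1.0,
                    half_life_days: float = 30.0) -> None:
        """Decay the old score to today, then add the new report."""
        self.spam_score = self.current(today, half_life_days) + weight
        self.last_update_day = today


rep = SenderReputation()
rep.report_spam(today=0.0)
score_after_30_days = rep.current(today=30.0)  # one half-life later
score_after_60_days = rep.current(today=60.0)  # two half-lives later
```

Storing only the score and its last-update time keeps the state small; decay is applied lazily at read time instead of by a batch job.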
Design Breakdown

Requirements

Product Goal: Protect users from malicious or unwanted content while ensuring all legitimate messages are delivered.
Success Metrics:
Online Metrics: Precision (Primary), Recall (Secondary), User "Report as Spam" rate.
Offline Metrics: AUC-PR (Area Under Precision-Recall Curve), F1-Score.
Guardrail Metrics: P99 Latency < 50ms, False Positive Rate (FPR) < 0.01%.
System Constraints: 50k QPS, globally distributed, high availability (99.99%).
Data Availability: Real-time message stream, historical labels (user reports), sender/IP metadata.

ML Problem Framing

ML Task Type: Binary Classification.
Prediction Target: P(is_spam | sender, receiver, content, context).
Inputs:
User/Sender Features: Account age, historical spam rate, IP reputation, verification status (SPF/DKIM).
Content Features: Message length, presence of "spammy" keywords, number of links, link safety (reputation of the domain).
Context Features: Time of day, device type, geographic distance between sender and receiver.
Outputs: A probability score [0, 1].
ML Challenges: High class imbalance (with spam concentrated in clustered campaign bursts rather than spread evenly), delayed labels, and evolving adversarial patterns.

Design Summary & MVP

Concise Summary: A two-tier system where a high-speed "Denylist/Allowlist" filters 20% of traffic, followed by a LightGBM classifier using aggregate sender features and lightweight text features.
Model Architecture & Selection:
Baseline Model: Logistic Regression with TF-IDF features.
Target Model: LightGBM (Gradient Boosted Decision Trees).
Choice Rationale: LightGBM is fast at inference, supports categorical features natively, and handles missing values without explicit imputation. Very high-cardinality fields such as raw IPs or domains are usually better represented through aggregate reputation features than as raw categories.
ML Life Cycle Summary: Data is ingested via Kafka, processed in Spark for offline training and in Flink for real-time feature updates. Predictions happen in a low-latency C++ or Go service using the trained LightGBM model.
Simplicity Audit: Avoids heavy LLMs/Transformers for inference to stay within the 50ms latency budget. Uses a single model rather than an ensemble to simplify maintenance.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Application logs (message body, headers) and User Interaction logs (clicks, "mark as spam").
Data Ingestion: Kafka serves as the backbone. It provides at-least-once delivery, so no message is silently dropped before scoring; because duplicates are possible, downstream consumers must be idempotent.
Data Storage: S3/HDFS for raw logs (Parquet format for optimized storage). Partitioned by date/hour/region.
Data Processing: Spark handles heavy deduplication (same spam sent to millions) to prevent model bias toward a single campaign.

Feature Pipeline

Sender Reputation (Critical): sender_id_spam_rate_1h, ip_address_volume_5m. These are computed in Flink and served from a low-latency Feature Store (e.g., Feast with a Redis online store).
Text Features: Hash-based n-grams (avoids dictionary management) and "Special Character Density" (e.g., "F.R.E.E").
Entity Extraction: Extract URLs and run them against a local cache of known malicious domains.
Online/Offline Consistency: We use a unified feature definition library to ensure the Spark job (offline) and Flink job (online) apply identical transformations.
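
The two text features above can be sketched in a few lines. I am assuming character n-grams (the text does not say character or word), a CRC32 hash, and 2^18 buckets; all three are illustrative choices.

```python
import zlib


def hashed_ngrams(text: str, n: int = 3, num_buckets: int = 2 ** 18) -> dict:
    """Character n-gram counts hashed into a fixed number of buckets."""
    normalized = text.lower()
    counts = {}
    for i in range(len(normalized) - n + 1):
        gram = normalized[i:i + n]
        # zlib.crc32 is stable across processes, unlike the built-in hash().
        bucket = zlib.crc32(gram.encode("utf-8")) % num_buckets
        counts[bucket] = counts.get(bucket, 0) + 1
    return counts


def special_char_density(text: str) -> float:
    """Fraction of characters that are neither alphanumeric nor whitespace."""
    if not text:
        return 0.0
    special = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    return special / len(text)


obfuscated_density = special_char_density("F.R.E.E")  # 3 dots out of 7 characters
plain_density = special_char_density("free")
```

Because the bucket index depends only on the n-gram itself, the offline Spark job and the online service produce the same features without sharing a vocabulary file.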

Model Architecture

Core Model: LightGBM.
Reasoning: It captures non-linear interactions between features (e.g., "Account age < 1 day" AND "Sent > 100 emails") much better than linear models.
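Optimization: A GBDT stores split thresholds and leaf values rather than dense weight matrices, so neural-network-style 8-bit weight quantization does not apply directly. We control inference cost by capping the number and depth of trees and, if needed, compiling the ensemble to native code (e.g., with Treelite).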
Thresholding: We use a very high threshold (e.g., 0.98) for automatic spam folder placement to minimize FPs.

Training Pipeline

Label Construction: Labels are "User Reported" (+) and "User Replied/Opened" (-).
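Negative Downsampling: Since legitimate traffic far outnumbers labeled spam, we downsample the negative class (non-spam) to balance the dataset for LightGBM. Downsampling inflates predicted probabilities, so scores must be recalibrated before a fixed threshold is applied.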
Time-based Split: We train on weeks 1-3 and validate on week 4 to simulate the real-world scenario of predicting the future.
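
Training on downsampled negatives shifts predicted probabilities upward, which matters when a fixed cutoff such as 0.98 is applied. A standard correction, sketched here under the assumption that negatives were kept uniformly at random with rate `negative_keep_rate` (my name for it), maps the model output back to the original class balance.

```python
def recalibrate(p_sampled: float, negative_keep_rate: float) -> float:
    """Map a probability learned on downsampled negatives to the original prior.

    Keeping negatives with rate w multiplies the odds of spam by 1 / w,
    so the correction is p = p' / (p' + (1 - p') / w).
    """
    if not 0.0 < negative_keep_rate <= 1.0:
        raise ValueError("negative_keep_rate must be in (0, 1]")
    if not 0.0 <= p_sampled <= 1.0:
        raise ValueError("p_sampled must be in [0, 1]")
    return p_sampled / (p_sampled + (1.0 - p_sampled) / negative_keep_rate)


# With 10% of negatives kept, a raw score of 0.98 corresponds to about 0.83.
corrected = recalibrate(0.98, negative_keep_rate=0.1)
```

The spam-folder threshold should therefore be chosen on corrected scores, or re-tuned on a validation set that has the natural class ratio.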

Serving Pipeline

Pattern: Request-Response online inference.
Logic:
Fast Path: Check Redis for Sender/IP blocklist. If hit, return SPAM immediately.
ML Path: Fetch aggregate features from the Feature Store, run LightGBM.
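Circuit Breaker: If the ML service or Feature Store does not respond within its timeout (set below the 50ms end-to-end budget so the fallback itself stays within budget), default to "NOT SPAM". This fails open: it protects legitimate mail and user experience at the cost of letting some spam through during an outage.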
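
The three steps above can be sketched as a single function. `score_fn` is a hypothetical stand-in for "fetch features and run LightGBM", the blocklist is a plain set instead of Redis, and the 30ms timeout is my assumption, chosen to leave headroom inside the 50ms budget.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

SPAM, NOT_SPAM = "SPAM", "NOT_SPAM"
_executor = ThreadPoolExecutor(max_workers=8)


def classify(message, blocklist, score_fn, threshold=0.98, timeout_s=0.03):
    """Return SPAM or NOT_SPAM for one message."""
    # Fast path: known-bad sender or IP, no model call needed.
    if message["sender_id"] in blocklist or message["ip"] in blocklist:
        return SPAM

    # ML path, bounded by a timeout so a slow dependency cannot stall delivery.
    future = _executor.submit(score_fn, message)
    try:
        score = future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()  # best effort; an already running task is not interrupted
        return NOT_SPAM  # circuit breaker: fail open
    except Exception:
        return NOT_SPAM  # scoring error: also fail open

    return SPAM if score >= threshold else NOT_SPAM


def slow_scorer(message):
    time.sleep(0.2)  # simulates a stalled feature store
    return 0.99


msg = {"sender_id": "s1", "ip": "198.51.100.4"}
```

In a real service the timeout would more likely be enforced at the RPC layer; the thread pool here only makes the fallback behavior easy to see.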

Evaluation Pipeline

Offline: We monitor the Precision-Recall Curve. We specifically look at Precision at a fixed Recall (e.g., what is Precision when we catch 80% of spam?).
Online: A/B testing two model versions. Success is measured by the reduction in "User Reports" without a drop in "Total Replies" (which would indicate FPs).
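
"Precision at a fixed recall" can be computed by sweeping the threshold from the highest score downward. The sketch below returns the best precision among all thresholds that reach the target recall; the function name and the toy labels are illustrative.

```python
def precision_at_recall(labels, scores, target_recall):
    """Best precision over all score thresholds whose recall >= target_recall."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    best = 0.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score before evaluating.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        if tp / total_pos >= target_recall:
            best = max(best, tp / (tp + fp))
    return best


labels = [1, 1, 0, 1, 0, 0]              # 1 = spam
scores = [0.9, 0.8, 0.7, 0.6, 0.2, 0.1]  # model outputs
p_at_80 = precision_at_recall(labels, scores, target_recall=0.8)
```

On this toy data, catching at least 80% of spam requires all three positives, which also admits one false positive.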

Monitoring Pipeline

Data Drift: Monitor the average sender_account_age in incoming traffic. If it drops suddenly, a botnet might be creating new accounts.
Prediction Drift: Monitor the % of messages marked as spam. A sudden spike might mean the model has gone rogue (FPs) or a massive attack is occurring.
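
One way to put a number on the drift checks above is the Population Stability Index (PSI), which compares binned feature distributions between training data and live traffic. The bin edges, the sample account ages, and the epsilon used for empty bins are assumptions for illustration.

```python
import math
from bisect import bisect_right


def psi(expected, actual, bin_edges, eps=1e-6):
    """PSI = sum over bins of (a_i - e_i) * ln(a_i / e_i).

    expected: feature values from the training set
    actual:   feature values from live traffic
    """
    def fractions(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            counts[bisect_right(bin_edges, v)] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e = fractions(expected)
    a = fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


# Hypothetical sender_account_age samples in days.
train_ages = [400] * 50 + [10] * 50
live_ages = [400] * 10 + [10] * 90   # sudden influx of young accounts
drift = psi(train_ages, live_ages, bin_edges=[30, 365])
```

A commonly quoted rule of thumb treats PSI above roughly 0.25 as a significant shift, but alert thresholds should be tuned per feature.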
Wrap Up

Final Evaluation

Observability: Use PSI (Population Stability Index) to track if the distribution of our features is shifting compared to the training set.
Edge Cases:
Cold Start: For new senders, we use a "Gray-list" strategy: rate-limit them and rely heavily on content features until reputation is established.
Adversarial Attacks: Use a "Regex" layer that can be updated in seconds (via a dynamic config) to block specific text patterns while the model is retraining.
Trade-offs: We chose LightGBM over BERT to prioritize Latency and Cost over marginal accuracy gains in text understanding.
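
A minimal sketch of the hot-reloadable regex layer mentioned under Edge Cases. How patterns reach the service (config push, polling) is left out; `RegexBlockLayer` and the example pattern are hypothetical.

```python
import re


class RegexBlockLayer:
    """Holds compiled block patterns that can be replaced at runtime."""

    def __init__(self):
        self._patterns = []

    def reload(self, pattern_strings):
        compiled = []
        for pattern in pattern_strings:
            try:
                compiled.append(re.compile(pattern, re.IGNORECASE))
            except re.error:
                continue  # skip a bad pattern rather than break serving
        # Swap in the new list in one assignment so readers never see a partial set.
        self._patterns = compiled

    def matches(self, text):
        return any(p.search(text) for p in self._patterns)


layer = RegexBlockLayer()
# The second pattern is deliberately invalid to show it is skipped.
layer.reload([r"f\W*r\W*e\W*e\s+prize", "("])
blocked = layer.matches("Claim your F-R-E-E prize now")
```

Regex rules are a stopgap: each one should be retired once the retrained model covers the campaign, since a growing rule list adds latency and false-positive risk.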