The Question
ML Design

Large-Scale Adversarial Spam Detection System

Design a high-throughput spam detection system capable of processing billions of messages daily. The system must maintain a P99 latency under 50ms while balancing extreme class imbalance and adapting to rapidly evolving adversarial attacks. Focus on the end-to-end architecture from real-time feature engineering (sender reputation, content signals) to model training/serving, and explain how you would handle delayed labels from user feedback and ensure the system minimizes false positives for high-priority legitimate communication.
LightGBM
Kafka
Flink
Spark
Redis
Feast
Prometheus
XGBoost
Questions & Insights

Clarifying Questions

Business Goal: Is the primary objective to minimize False Positives (legitimate mail in spam) or maximize Recall (catching all spam)? Assumption: Precision is the North Star; False Positives (FPs) destroy user trust.
Constraints & Scale: What is the traffic volume? Assumption: 1B+ messages per day, peak QPS of 50k, and a P99 latency budget of <50ms.
Scope: Are we detecting text-based spam only, or multi-modal (images, links)? Assumption: Text and Metadata (sender info, links) for the MVP.
Freshness: How fast must the system adapt to new spam campaigns? Assumption: Near real-time adaptation is required to stop "burst" spam attacks.

Thinking Process

Identify the Core Trade-off: Spam detection is an adversarial game. The cost of a False Positive (a user missing an important job offer) is much higher than the cost of a False Negative (a user seeing a "free prize" email).
Bottleneck Analysis: The volume of data is massive. We cannot run heavy Transformers on every single message. I need a layered approach: Fast Path (Heuristics/Blocklists) -> ML Path (Feature-based classifier).
Featurization: Success depends more on "sender reputation" and "link safety" than just the text content.
Scaling: Using a Feature Store is critical to ensure training-serving consistency, especially for aggregate features (e.g., "how many emails did this IP send in the last 5 minutes?").
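The aggregate feature in the example above ("how many emails did this IP send in the last 5 minutes?") can be sketched as a sliding-window counter. This is a minimal in-memory illustration, not the Flink job itself: the class name `SlidingWindowCounter` and its methods are hypothetical, and it assumes events for a given key arrive in timestamp order.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Counts events per key over the trailing `window_s` seconds (hypothetical helper)."""

    def __init__(self, window_s: float = 300.0):
        self.window_s = window_s
        self._events = defaultdict(deque)  # key -> timestamps, oldest first

    def add(self, key: str, ts: float) -> None:
        # Assumes timestamps for a key are appended in non-decreasing order.
        self._events[key].append(ts)

    def count(self, key: str, now: float) -> int:
        q = self._events[key]
        # Evict timestamps that fell out of the window (now - window_s, now].
        while q and q[0] <= now - self.window_s:
            q.popleft()
        return len(q)


counter = SlidingWindowCounter(window_s=300)
for t in (0, 100, 250, 400):
    counter.add("203.0.113.7", t)
print(counter.count("203.0.113.7", now=420))
```

The same definition has to be used when building training data, otherwise the offline feature (computed over full logs) and the online feature (computed over a stream) can silently diverge.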

Elite Bonus Points

Adversarial Robustness: Discussing "adversarial training" where we perturb text (e.g., "f-r-e-e" vs "free") to ensure the model isn't easily bypassed.
Delayed Labeling & Active Learning: Spam labels are often provided by users (the "Mark as Spam" button). This creates a feedback loop with a delay. I'll propose a "Human-in-the-loop" queue for ambiguous cases.
Feature Versioning: Using "point-in-time" joins in the Feature Store to prevent data leakage during training.
Sender Reputation Decay: Implementing a Half-Life decay for sender reputation scores to allow reformed spammers (or compromised accounts that are recovered) to regain trust over time.
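A half-life decay of the kind described for sender reputation could look like the sketch below. The function names, the one-week half-life, and the "add 1.0 per spam report" update rule are all illustrative assumptions, not values from the design.

```python
def decayed_score(score: float, elapsed_s: float, half_life_s: float) -> float:
    """Exponentially decay a penalty score: it halves every `half_life_s` seconds."""
    return score * 0.5 ** (elapsed_s / half_life_s)


def update_reputation(score: float, last_ts: float, now: float,
                      is_spam: bool, half_life_s: float = 7 * 24 * 3600) -> float:
    """Decay the stored penalty to `now`, then add 1.0 if the new event is spam.

    The one-week default half-life and the unit increment are assumptions.
    """
    decayed = decayed_score(score, now - last_ts, half_life_s)
    return decayed + (1.0 if is_spam else 0.0)


week = 7 * 24 * 3600
# A penalty of 8.0 left untouched for two half-lives drops to a quarter.
print(decayed_score(8.0, elapsed_s=2 * week, half_life_s=week))
```

Because only the score and the last-update timestamp are stored, the decay can be applied lazily at read time rather than by a periodic batch job.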
Design Breakdown

Requirements

Product Goal: Protect users from malicious or unwanted content while ensuring all legitimate messages are delivered.
Success Metrics:
Online Metrics: Precision (Primary), Recall (Secondary), User "Report as Spam" rate.
Offline Metrics: AUC-PR (Area Under Precision-Recall Curve), F1-Score.
Guardrail Metrics: P99 Latency < 50ms, False Positive Rate (FPR) < 0.01%.
System Constraints: 50k QPS, globally distributed, high availability (99.99%).
Data Availability: Real-time message stream, historical labels (user reports), sender/IP metadata.

ML Problem Framing

ML Task Type: Binary Classification.
Prediction Target: $P(\text{is\_spam} \mid \text{sender}, \text{receiver}, \text{content}, \text{context})$.
Inputs:
User/Sender Features: Account age, historical spam rate, IP reputation, verification status (SPF/DKIM).
Content Features: Message length, presence of "spammy" keywords, number of links, link safety (reputation of the domain).
Context Features: Time of day, device type, geographic distance between sender and receiver.
Outputs: A probability score [0, 1].
ML Challenges: Class imbalance (explicit spam labels are sparse relative to total traffic, and the spam that is labeled tends to be clustered in a few large campaigns), label delay, and evolving adversarial patterns.

Design Summary & MVP

Concise Summary: A two-tier system where a high-speed "Denylist/Allowlist" filters 20% of traffic, followed by a LightGBM classifier using aggregate sender features and lightweight text features.
Model Architecture & Selection:
Baseline Model: Logistic Regression with TF-IDF features.
Target Model: LightGBM (Gradient Boosted Decision Trees).
Choice Rationale: LightGBM is extremely fast for inference, handles categorical features natively, and manages missing data without complex imputation. Very high-cardinality identifiers (like raw IP/Domain) are better fed in as aggregate reputation features than as raw categories.
ML Life Cycle Summary: Data is ingested via Kafka, processed in Spark for offline training, and processed in Flink for real-time feature updates. Predictions happen in a low-latency C++ or Go service using the trained LightGBM model.
Simplicity Audit: Avoids heavy LLMs/Transformers for inference to stay within the 50ms latency budget. Uses a single LightGBM model rather than a stack of several different models to simplify maintenance.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Application logs (message body, headers) and User Interaction logs (clicks, "mark as spam").
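Data Ingestion: Kafka serves as the backbone. It is configured for "at-least-once" delivery so that no message is dropped before it is scored; because this can redeliver messages, downstream consumers must be idempotent.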
Data Storage: S3/HDFS for raw logs (Parquet format for optimized storage). Partitioned by date/hour/region.
Data Processing: Spark handles heavy deduplication (same spam sent to millions) to prevent model bias toward a single campaign.
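The deduplication step can be illustrated with a content fingerprint that collapses trivially different copies of the same campaign. This is a plain-Python sketch of the logic only (in Spark it would be a group-by on the fingerprint); the normalization rules, the helper names, and the cap of 3 copies per campaign are assumptions.

```python
import hashlib
import re


def content_fingerprint(body: str) -> str:
    """Hash a normalized body so near-identical copies share a fingerprint."""
    normalized = re.sub(r"\s+", " ", body.lower()).strip()
    # Collapse digit runs so per-recipient amounts or IDs do not split a campaign.
    normalized = re.sub(r"\d+", "0", normalized)
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def dedupe(messages, max_per_fingerprint: int = 3):
    """Keep at most `max_per_fingerprint` training examples per fingerprint."""
    seen = {}
    kept = []
    for msg in messages:
        fp = content_fingerprint(msg["body"])
        seen[fp] = seen.get(fp, 0) + 1
        if seen[fp] <= max_per_fingerprint:
            kept.append(msg)
    return kept


batch = [{"body": "Win $100 now"}, {"body": "win   $999 NOW"}, {"body": "Lunch at noon?"}]
print(len(dedupe(batch, max_per_fingerprint=1)))
```

Keeping a small number of copies per campaign, rather than exactly one, is a judgment call: it preserves some signal about campaign size without letting one burst dominate the training set.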

Feature Pipeline

Sender Reputation (Critical): sender_id_spam_rate_1h, ip_address_volume_5m. These are computed in Flink and stored in a low-latency Feature Store (e.g., Feast backed by a Redis online store).
Text Features: Hash-based n-grams (avoids dictionary management) and "Special Character Density" (e.g., "F.R.E.E").
Entity Extraction: Extract URLs and run them against a local cache of known malicious domains.
Online/Offline Consistency: We use a unified feature definition library to ensure the Spark job (offline) and Flink job (online) apply identical transformations.
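The two text features named above, hash-based n-grams and special-character density, can be sketched as follows. The function names, the trigram size, and the bucket count are illustrative assumptions. `zlib.crc32` is used instead of Python's built-in `hash()` because the latter is salted per process for strings, which would break training-serving consistency.

```python
import zlib


def hashed_char_ngrams(text: str, n: int = 3, num_buckets: int = 2 ** 18) -> dict:
    """Map character n-grams to bucket counts via a stable hash (no vocabulary needed)."""
    text = text.lower()
    counts = {}
    for i in range(len(text) - n + 1):
        bucket = zlib.crc32(text[i:i + n].encode("utf-8")) % num_buckets
        counts[bucket] = counts.get(bucket, 0) + 1
    return counts


def special_char_density(text: str) -> float:
    """Fraction of characters that are neither alphanumeric nor whitespace."""
    if not text:
        return 0.0
    special = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    return special / len(text)


print(special_char_density("F.R.E.E"))  # 3 dots out of 7 characters
print(special_char_density("free"))
```

Obfuscated spellings such as "F.R.E.E" score high on density even when their n-grams no longer match the clean word, which is why the two features complement each other.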

Model Architecture

Core Model: LightGBM.
Reasoning: It captures non-linear interactions between features (e.g., "Account age < 1 day" AND "Sent > 100 emails") much better than linear models.
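Optimization: A tree ensemble has split thresholds and leaf values rather than dense weight matrices, so the main levers are capping the number and depth of trees and compiling the ensemble to native code. Quantizing thresholds and leaf values to 8 bits can further reduce the memory footprint, at some cost in score resolution.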
Thresholding: We use a very high threshold (e.g., 0.98) for automatic spam folder placement to minimize FPs.

Training Pipeline

Label Construction: Labels are "User Reported" (+) and "User Replied/Opened" (-).
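Negative Downsampling: Since non-spam examples vastly outnumber user-reported spam, we downsample the negative class (non-spam) to balance the dataset for LightGBM. Downsampling inflates the predicted probabilities, so scores must be recalibrated before the serving threshold is applied.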
Time-based Split: We train on weeks 1-3 and validate on week 4 to simulate the real-world scenario of predicting the future.
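Downsampling negatives shifts the model's output odds by the sampling rate, which matters when a fixed cut-off such as 0.98 is applied at serving time. A standard correction, assuming negatives were kept uniformly at random with rate `neg_keep_rate` (a hypothetical parameter name), rescales the odds back:

```python
def recalibrate(p_sampled: float, neg_keep_rate: float) -> float:
    """Undo the odds shift caused by keeping only a fraction of negatives.

    If negatives were kept with probability `neg_keep_rate`, the training odds
    are inflated by 1 / neg_keep_rate; this maps them back to the true scale.
    """
    return p_sampled / (p_sampled + (1.0 - p_sampled) / neg_keep_rate)


# A score of 0.5 from a model trained with 10% of negatives corresponds to
# true odds of 1:10, i.e. a probability of 1/11.
print(recalibrate(0.5, neg_keep_rate=0.1))
```

With `neg_keep_rate = 1.0` the function is the identity, so the same code path can be used whether or not downsampling was applied.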

Serving Pipeline

Pattern: Request-Response online inference.
Logic:
Fast Path: Check Redis for Sender/IP blocklist. If hit, return SPAM immediately.
ML Path: Fetch aggregate features from the Feature Store, run LightGBM.
Circuit Breaker: If the ML service or Feature Store does not respond within the timeout (which must sit inside the 50ms budget), fail open and default to "NOT SPAM", so that an outage never blocks legitimate mail.
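The serving logic above (fast path, ML path, fail-open on timeout) can be sketched in a few lines. This is an illustration only: the production service is described as C++ or Go, and `classify`, `score_fn`, and `slow_model` are hypothetical names. `score_fn` stands in for the combined feature fetch and LightGBM call.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_executor = ThreadPoolExecutor(max_workers=8)


def classify(sender_id, blocklist, score_fn, threshold=0.98, timeout_s=0.05):
    """Return "SPAM" or "NOT_SPAM" using blocklist, then model, failing open."""
    # Fast path: a blocklist hit short-circuits the model entirely.
    if sender_id in blocklist:
        return "SPAM"
    # ML path: run feature fetch + scoring with a hard deadline.
    future = _executor.submit(score_fn, sender_id)
    try:
        score = future.result(timeout=timeout_s)
    except FutureTimeout:
        return "NOT_SPAM"  # deadline exceeded: fail open
    except Exception:
        return "NOT_SPAM"  # scoring error: fail open
    return "SPAM" if score >= threshold else "NOT_SPAM"


def slow_model(sender_id):
    time.sleep(0.3)  # simulates a degraded feature store
    return 0.99


print(classify("spammer-1", {"spammer-1"}, slow_model))  # blocklist hit
print(classify("new-sender", set(), slow_model))         # times out, fails open
print(classify("new-sender", set(), lambda s: 0.99))     # scored by the model
```

Note that the timed-out task keeps running in its worker thread; a real circuit breaker would also track failure rates and stop calling the dependency for a cool-down period.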

Evaluation Pipeline

Offline: We monitor the Precision-Recall Curve. We specifically look at Precision at a fixed Recall (e.g., what is Precision when we catch 80% of spam?).
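Precision at a fixed recall can be computed by walking down the score-sorted validation set until the recall target is reached. The sketch below is a simplified version that ignores ties between scores; `precision_at_recall` is a hypothetical helper name.

```python
def precision_at_recall(labels, scores, target_recall=0.8):
    """Return (precision, score_threshold) at the first point recall >= target.

    labels: 1 for spam, 0 for non-spam. Ties between scores are not grouped.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("no positive examples")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    for score, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        if tp / total_pos >= target_recall:
            return tp / (tp + fp), score
    raise ValueError("target_recall must be at most 1.0")


labels = [1, 1, 0, 1, 1, 0, 1]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]
print(precision_at_recall(labels, scores, target_recall=0.8))
```

The returned threshold is the score cut-off that would achieve that recall, which makes it easy to compare against the much stricter serving threshold.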
Online: A/B testing two model versions. Success is measured by the reduction in "User Reports" without a drop in "Total Replies" (which would indicate FPs).

Monitoring Pipeline

Data Drift: Monitor the average sender_account_age in incoming traffic. If it drops suddenly, a botnet might be creating new accounts.
Prediction Drift: Monitor the % of messages marked as spam. A sudden spike might mean the model has gone rogue (FPs) or a massive attack is occurring.
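One common way to quantify the drift described above is the Population Stability Index (PSI), which compares binned feature distributions between a reference sample (e.g., the training set) and live traffic. A minimal sketch, assuming non-empty samples and caller-supplied bin edges (the function name and the `eps` floor are illustrative choices):

```python
import math


def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between two samples over shared bin edges.

    edges: sorted cut points; they define len(edges) + 1 bins.
    eps: floor on bin fractions to avoid log(0) for empty bins.
    """
    def fractions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            idx = sum(1 for e in edges if v >= e)  # index of the bin containing v
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    exp_frac = fractions(expected)
    act_frac = fractions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(exp_frac, act_frac))


train_account_age_days = [1, 2, 3, 4]
live_account_age_days = [1, 1, 1, 4]
print(psi(train_account_age_days, live_account_age_days, edges=[2.5]))
```

A PSI of 0 means identical binned distributions. Alert thresholds are usually set empirically per feature.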
Wrap Up

Final Evaluation

Observability: Use PSI (Population Stability Index) to track if the distribution of our features is shifting compared to the training set.
Edge Cases:
Cold Start: For new senders, we use a "Gray-list" strategy: rate-limit them and rely heavily on content features until reputation is established.
Adversarial Attacks: Use a "Regex" layer that can be updated in seconds (via a dynamic config) to block specific text patterns while the model is retraining.
Trade-offs: We chose LightGBM over BERT to prioritize Latency and Cost over marginal accuracy gains in text understanding.
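The dynamically updated "Regex" layer listed under Adversarial Attacks could be structured as below. `RuleLayer` and its methods are hypothetical, and how the pattern list reaches the service (the dynamic config system) is left out. The point of the sketch is that new patterns are compiled before being swapped in, so a malformed rule cannot take the layer down.

```python
import re


class RuleLayer:
    """Hot-reloadable list of blocking patterns (illustrative sketch)."""

    def __init__(self):
        self._compiled = []

    def load(self, patterns):
        # Compile everything first; re.error propagates and the old rules stay active.
        compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
        self._compiled = compiled  # single assignment swaps the rule set

    def matches(self, text: str) -> bool:
        return any(rx.search(text) for rx in self._compiled)


rules = RuleLayer()
rules.load([r"f\W*r\W*e\W*e\s+prize"])
print(rules.matches("Claim your F-R-E-E prize"))
print(rules.matches("Lunch is free today"))
```

Patterns like this are deliberately narrow: they are a stopgap for one campaign and should be retired once the retrained model covers it.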