The Question
ML Design: Scalable Toxic Content Moderation
Design a high-throughput, low-latency ML system for real-time toxic content detection and multi-label classification, capable of handling adversarial inputs and minimizing bias against protected groups at a scale of 100k+ QPS.
DistilBERT
Kafka
Spark
ONNX
Feast
PyTorch
Questions & Insights
Clarifying Questions
Business Goal: Is the priority to block content proactively (Hard-Gate) or flag for human review (Soft-Gate)? Assumption: MVP focuses on a hybrid approach—blocking high-confidence violations and flagging borderline cases.
Constraints & Scale: What is the traffic volume? Assumption: 100k QPS, P99 latency < 100ms for real-time blocking.
Definition of Toxicity: Is this binary or multi-label (e.g., hate speech, harassment, self-harm)? Assumption: Multi-label classification to allow for nuanced policy enforcement.
Data Freshness: How quickly must the system adapt to new "slang" or adversarial bypasses? Assumption: Daily retraining or active learning loops.
Edge Cases: How do we handle multilingual content and "leetspeak" (e.g., "t0xic")? Assumption: MVP starts with English but uses sub-word tokenization to handle variations.
Thinking Process
Identify the Bottleneck: The primary challenge is the trade-off between semantic nuance (needing Transformers) and extreme scale (needing low latency).
Strategy: Use a tiered approach. A fast heuristic/embedding-based filter (Retrieval/Filter) followed by a high-precision Transformer (Ranking/Classification).
Scale: To handle 100k QPS, I cannot run a massive BERT model on every single "Good Morning" post. I need a "Fast-Path" for clear negatives.
Data Imbalance: Toxic content is usually <3% of total volume. The training strategy must address this extreme skew.
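The source does not name a specific technique for the class skew. One common option, shown here purely as an assumed sketch, is to weight the positive class in binary cross-entropy by the negative-to-positive ratio. The function names `pos_weight` and `weighted_bce` are hypothetical.

```python
import math

def pos_weight(num_pos: int, num_neg: int) -> float:
    """Weight for the positive class: ratio of negatives to positives."""
    if num_pos <= 0:
        raise ValueError("need at least one positive example")
    return num_neg / num_pos

def weighted_bce(prob: float, label: int, w_pos: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy where the loss on positive examples is scaled by w_pos."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    if label == 1:
        return -w_pos * math.log(p)
    return -math.log(1.0 - p)

# Illustrative counts matching a roughly 3% toxic rate (assumed, not measured).
w = pos_weight(num_pos=3, num_neg=97)
print(round(w, 2), round(weighted_bce(0.2, 1, w), 3))
```

With this weighting, a missed toxic example costs far more than a misjudged benign one, which pushes the model away from the trivial "everything is safe" solution.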
Elite Bonus Points
Adversarial Robustness: Implementing "Proactive Character Perturbation" training to make the model resilient to intentional typos or obfuscation (e.g., using Unicode lookalikes).
Model Explainability (LIME/SHAP): Providing moderators with "Saliency Maps" showing which words triggered the flag to speed up manual review.
Active Learning Loop: Automatically sampling "low confidence" predictions (near the decision boundary) for prioritized human labeling to maximize the value of the labeling budget.
Contextual Embeddings: Utilizing user-history features (reputation scores) as side-inputs to the model to reduce False Positives for "reclaimed language" in specific sub-communities.
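The active learning idea above can be sketched as simple uncertainty sampling: rank predictions by distance from the decision boundary and send the closest ones to human labelers. The function name, the 0.5 boundary, and the sample data are assumptions for illustration.

```python
def select_for_labeling(scored, budget, boundary=0.5):
    """Return ids of the `budget` items whose score is closest to the boundary.

    scored: list of (item_id, toxicity_score) pairs.
    """
    ranked = sorted(scored, key=lambda item: abs(item[1] - boundary))
    return [item_id for item_id, _ in ranked[:budget]]

# Hypothetical scored comments: "b" and "d" sit nearest the boundary.
batch = [("a", 0.02), ("b", 0.48), ("c", 0.91), ("d", 0.55)]
print(select_for_labeling(batch, budget=2))
```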
Design Breakdown
Requirements
Product Goal: Protect users by identifying and neutralizing toxic content in real-time.
Success Metrics:
Online: Precision@Fixed-Recall (to minimize user friction), Action Rate (how often moderators agree with the model).
Offline: PR-AUC (due to class imbalance), F1-Score.
Guardrail: P99 Latency, False Positive Rate (FPR) on protected groups (Fairness).
System Constraints: 100k QPS, 50-100ms latency budget, 99.9% availability.
Data Availability: Historical labeled datasets (Civil Comments, Jigsaw), real-time event streams, user metadata.
ML Problem Framing
ML Task Type: Multi-label classification (an independent binary decision per category).
Prediction Target: P(\text{category}_i \mid \text{text}, \text{user\_metadata}, \text{context}).
Inputs:
User Features: Account age, past violation history, reputation score.
Item (Text) Features: Raw text, sub-word tokens, character n-grams.
Context Features: Community/Channel ID (norms vary by sub-group).
Outputs: A vector of probabilities, each in [0, 1], for categories like Hate, Insult, Threat.
ML Challenges: Extreme class imbalance, evolving adversarial tactics, and high cost of False Positives (censorship).
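To make the framing concrete, here is a minimal sketch of how per-category logits could be turned into independent probabilities and then into the hybrid hard-gate/soft-gate action assumed in the clarifying questions. The thresholds `block_at` and `flag_at` and the function name `decide` are hypothetical; real values would be tuned per category.

```python
import math

CATEGORIES = ("hate", "insult", "threat")

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def decide(logits, block_at=0.95, flag_at=0.60):
    """Map one logit per category to an action plus per-category probabilities."""
    probs = {c: sigmoid(z) for c, z in zip(CATEGORIES, logits)}
    top = max(probs.values())
    if top >= block_at:
        action = "block"   # high-confidence violation: hard gate
    elif top >= flag_at:
        action = "flag"    # borderline: route to human review
    else:
        action = "allow"
    return action, probs

print(decide([4.0, -3.0, -3.0])[0])
print(decide([1.0, -3.0, -3.0])[0])
```

Because each category gets its own sigmoid rather than a shared softmax, a single comment can be both an insult and a threat.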
Design Summary & MVP
Concise Summary: A two-stage pipeline consisting of a fast Bloom Filter/Keyword list for known violations, followed by a DistilBERT-based classifier for semantic analysis.
Model Architecture & Selection:
Baseline Model: Logistic Regression with TF-IDF and a Keyword Regex list.
Target Model: DistilBERT (Knowledge-distilled Transformer).
Choice Rationale: Transformers capture context (e.g., "I will kill this task" vs "I will kill you"), which linear models miss. DistilBERT is reported to retain about 97% of BERT's language-understanding performance while being roughly 40% smaller and 60% faster at inference.
Major Pipelines:
Feature Pipeline: Real-time text normalization and embedding lookups.
Model Serving Pipeline: Tiered inference (Heuristic -> DistilBERT).
Simplicity Audit: Avoids multi-modal (image+text) or Reinforcement Learning in the MVP to focus on stabilizing the text-based P99 latency and precision.
Architecture Decision Rationale:
Why this?: Tiered inference lets the bulk of clearly "safe" traffic (assumed here to be around 90%) bypass the expensive GPU-heavy model.
Requirement Satisfaction: Meets latency via distillation and throughput via horizontal scaling of inference nodes.
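The tiered serving idea can be sketched as a small router. This is an illustration under assumptions: `cheap_score` and `heavy_score` stand in for the fast filter and the DistilBERT classifier, and the names, the `safe_below` cutoff, and the 0.5 flag threshold are all hypothetical.

```python
from typing import Callable, Set, Tuple

def route(
    text: str,
    blocklist: Set[str],
    cheap_score: Callable[[str], float],
    heavy_score: Callable[[str], float],
    safe_below: float = 0.05,
) -> Tuple[str, str]:
    """Return (action, stage) where stage names the tier that decided."""
    tokens = set(text.lower().split())
    if tokens & blocklist:
        return "block", "keyword"        # known violation, no model call
    if cheap_score(text) < safe_below:
        return "allow", "fast_path"      # clear negative, skip the Transformer
    # Only ambiguous traffic pays for the expensive model.
    action = "flag" if heavy_score(text) >= 0.5 else "allow"
    return action, "transformer"

# Stub scorers for demonstration only.
print(route("good morning", {"slur"}, lambda t: 0.01, lambda t: 0.9))
```

The point of the structure is that the expensive scorer is only invoked on the last branch, so its share of traffic is set by how aggressive `safe_below` is.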
System Architecture
Pipeline Deep Dive
Data Pipeline
Data Source: Real-time ingestion from user comment microservices via Kafka.
Data Ingestion: Use Kafka with at-least-once semantics. For content moderation, missing a message is worse than processing it twice.
Data Storage: S3 for the Data Lake (Parquet format, partitioned by date and category). Historical data is retained for backtesting.
Data Processing: Spark-based batch jobs for cleaning (removing control characters, standardizing Unicode).
Data Quality: De-duplication of identical spam messages to prevent training bias.
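The cleaning step (removing control characters, standardizing Unicode) together with the leetspeak concern raised earlier might look like the sketch below. The `LEET_MAP` table and the function name are illustrative assumptions; the mapping is lossy (it also rewrites genuine digits), so it would more plausibly feed an extra normalized view of the text than replace the raw input.

```python
import unicodedata

# Hypothetical, deliberately small substitution table.
LEET_MAP = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s"}
)

def normalize(text: str) -> str:
    # NFKC folds compatibility forms such as fullwidth letters into plain ones.
    text = unicodedata.normalize("NFKC", text)
    # Drop control/format characters (category C*), but keep real whitespace.
    text = "".join(
        ch for ch in text if ch.isspace() or unicodedata.category(ch)[0] != "C"
    )
    return text.lower().translate(LEET_MAP)

print(normalize("ｔ0x\u200bic"))  # fullwidth t, digit zero, zero-width space
```

NFKC does not fold cross-script homoglyphs (for example Cyrillic "а" versus Latin "a"), so Unicode lookalikes of that kind need a separate confusables table.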
Feature Pipeline
Feature Engineering:
Text: WordPiece tokenization (robust to OOV words).
Reputation: Feature Store (e.g., Feast) stores rolling window aggregations of user violations (last 24h, 7d).
Online vs Offline: Use a unified "Feature Definition" library to ensure the regex used in training matches the regex used in the Go/Java serving layer.
Training/Serving Skew: Mitigated by logging the exact features used at inference time (Feature Logging) rather than re-computing them from DBs for training.
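The rolling-window reputation features can be illustrated with plain Python. This is not the Feast API; it only sketches the aggregation logic a feature store would materialize, and the class name and feature keys are hypothetical.

```python
from collections import deque

class RollingViolationCounter:
    """Counts one user's violations over trailing windows (24h and 7d by default)."""

    def __init__(self, windows_s=(86_400, 604_800)):
        self.windows_s = windows_s
        self.events = deque()  # violation timestamps in seconds, ascending

    def record(self, ts: float) -> None:
        self.events.append(ts)

    def features(self, now: float) -> dict:
        # Evict events older than the longest window.
        horizon = now - max(self.windows_s)
        while self.events and self.events[0] <= horizon:
            self.events.popleft()
        return {
            f"violations_{w}s": sum(1 for t in self.events if t > now - w)
            for w in self.windows_s
        }

counter = RollingViolationCounter()
for ts in (0, 600_000, 690_000):
    counter.record(ts)
print(counter.features(now=700_000))
```

Logging the dictionary returned by `features` at inference time, rather than recomputing it later, is what keeps the training data consistent with what the model actually saw.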
Model Architecture
Problem Formulation: Multi-label classification using a Shared-Bottom Transformer architecture.
Candidate Model Families:
Linear (Fast, low accuracy).
CNN/LSTM (Good for sequences, but largely superseded by Transformers for contextual text classification).
DistilBERT (Chosen): Best balance of P99 latency and F1.
Architecture Design: The model outputs N logits, one for each toxicity type, passed through a Sigmoid layer.
Model Selection Strategy: Final selection based on the "Recall@1% FPR" metric—we want to catch as much as possible without annoying safe users.
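A minimal sketch of the "Recall@1% FPR" selection metric follows: sweep thresholds from the highest score downward and keep the best recall seen while the false positive rate stays within budget. The function name and the toy data are assumptions for illustration.

```python
def recall_at_fpr(scores, labels, max_fpr=0.01):
    """Best recall achievable by any threshold whose FPR is at most max_fpr."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need both positive and negative examples")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    best, tp, fp, i = 0.0, 0, 0, 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score before evaluating.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        if fp / neg > max_fpr:
            break  # FPR only grows as the threshold drops
        best = max(best, tp / pos)
    return best

scores = [0.9, 0.8, 0.7, 0.6, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(recall_at_fpr(scores, labels, max_fpr=0.0))
```

In a multi-label setting this would be computed per category, since each toxicity type can have a very different score distribution.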
Training Pipeline
Dataset Construction: Use Hard Negative Mining. Include samples that look toxic but aren't (e.g., "This movie is sick!").
Data Splitting: Time-based split (Train: Months 1-5, Test: Month 6) to account for shifting slang.
Training Infrastructure: PyTorch Lightning on A100 GPUs. Use Mixed Precision (FP16) to speed up training.
Retraining Strategy: Triggered when Prediction Drift (the distribution of scores) shifts by >10%, or weekly.
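The source does not say how the shift in score distribution is measured. One assumed option is the Population Stability Index (PSI) between a reference window and the current window of scores, sketched below; the bin count, the `eps` smoothing, and the function name are illustrative choices, and any trigger threshold would need to be calibrated on real traffic.

```python
import math

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between two non-empty lists of scores in [0, 1]."""
    def histogram(scores):
        counts = [0] * bins
        for s in scores:
            counts[min(int(s * bins), bins - 1)] += 1
        # Floor each proportion at eps so empty bins do not produce log(0).
        return [max(c / len(scores), eps) for c in counts]

    e, a = histogram(expected), histogram(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

reference = [0.05] * 90 + [0.95] * 10
drifted = [0.05] * 60 + [0.95] * 40
print(round(psi(reference, reference), 4), round(psi(reference, drifted), 4))
```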
Serving Pipeline
Serving Pattern: Online Inference.
Latency Optimization:
ONNX Runtime / TensorRT: Convert DistilBERT to optimized formats.
Dynamic Batching: Grouping requests over a 5ms window to increase GPU utilization.
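The batching rule can be illustrated as a pure function over arrival timestamps. This is a simulation of the grouping logic only, not a serving implementation; `make_batches`, `window_ms`, and `max_batch` are hypothetical names and values.

```python
def make_batches(arrivals_ms, window_ms=5.0, max_batch=32):
    """Group request indices into batches.

    arrivals_ms must be sorted ascending. A batch closes when its window
    (measured from its first request) has elapsed or when it is full.
    """
    batches, current, window_start = [], [], None
    for idx, t in enumerate(arrivals_ms):
        if current and (t - window_start >= window_ms or len(current) >= max_batch):
            batches.append(current)
            current = []
        if not current:
            window_start = t
        current.append(idx)
    if current:
        batches.append(current)
    return batches

print(make_batches([0, 1, 4, 5, 6, 20]))
```

The window bounds the extra latency any single request can pay for batching, which is why it has to be small relative to the overall latency budget.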
Reliability: If the model service times out, fall back to the "Safe" Heuristic (allow the post but flag for async review).
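The fail-open fallback described above could be sketched with `asyncio.wait_for`. Here `model_call` is a hypothetical coroutine wrapping the model service, and the timeout and block threshold are assumed values.

```python
import asyncio

async def moderate(text, model_call, timeout_s=0.08, block_at=0.95):
    """Score text with the model, falling back to allow-and-review on timeout."""
    try:
        score = await asyncio.wait_for(model_call(text), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Fail open: let the post through but queue it for asynchronous review.
        return {"action": "allow", "async_review": True, "source": "fallback"}
    action = "block" if score >= block_at else "allow"
    return {"action": action, "async_review": False, "source": "model"}

async def slow_model(text):
    # Stand-in for an overloaded model service.
    await asyncio.sleep(0.2)
    return 0.99

print(asyncio.run(moderate("example", slow_model, timeout_s=0.01)))
```

Failing open trades a short window of exposure for availability; a platform with stricter policy could invert the default and hold the post instead.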
Evaluation Pipeline
Offline Evaluation: Use a slice-based evaluation (e.g., performance on "political" keywords vs "gaming" slang) to ensure the model isn't biased against specific communities.
Online Evaluation: Shadow Mode—run the new model alongside the old one and compare results without taking action.
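Slice-based evaluation can be sketched as a false positive rate computed per slice, so a gap between communities is visible rather than averaged away. The slice names, threshold, and records below are made up for illustration.

```python
from collections import defaultdict

def fpr_by_slice(records, threshold=0.5):
    """records: iterable of (slice_name, score, label) with label 0 = benign."""
    false_pos = defaultdict(int)
    negatives = defaultdict(int)
    for slice_name, score, label in records:
        if label == 0:
            negatives[slice_name] += 1
            if score >= threshold:
                false_pos[slice_name] += 1
    # Slices with no benign examples are omitted because FPR is undefined there.
    return {name: false_pos[name] / negatives[name] for name in negatives}

records = [
    ("gaming", 0.7, 0),
    ("gaming", 0.2, 0),
    ("political", 0.1, 0),
    ("political", 0.9, 1),
]
print(fpr_by_slice(records))
```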
Monitoring Pipeline
System Monitoring: Latency and GPU memory saturation.
Model Monitoring: Confidence Score Distribution. If the average toxicity score drops significantly, the model might be failing on new data types.
Data Monitoring: Check for "Label Delay"—it takes time for humans to review flagged content, so the feedback loop is naturally delayed.
Wrap Up
Final Evaluation
Observability: Tracking the "Human-Agreement Rate." If moderators consistently overturn model decisions, the model needs immediate retraining.
Edge Cases:
Cold Start: New users start with a "Neutral" reputation.
Adversarial: Use a dedicated adversarial ("Jailbreak") test set of inputs crafted to evade the classifier, such as obfuscated spellings and Unicode lookalikes.
Trade-offs: Accuracy vs Latency. We chose a distilled model (accuracy loss) to satisfy the 100ms real-time requirement.
Distinguishing Insights: At the Principal level, emphasize Fairness & Bias. Content moderation models often disproportionately flag African American Vernacular English (AAVE) or LGBTQ+ terms as "toxic." Use Counterfactual Fairness testing (e.g., swap "He is gay" with "He is straight" and ensure the toxicity score doesn't change).
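A counterfactual test of this kind can be sketched as follows: fill one template with different identity terms, score each variant, and check that the gap stays within a tolerance. `toy_biased_score` is a deliberately biased stand-in for a real model, and the tolerance and function names are assumptions.

```python
def counterfactual_gaps(score_fn, template, terms):
    """Score template.format(term=...) for each term; gaps are relative to terms[0]."""
    scores = {t: score_fn(template.format(term=t)) for t in terms}
    baseline = scores[terms[0]]
    return {t: s - baseline for t, s in scores.items()}

def passes(gaps, tolerance=0.05):
    return all(abs(g) <= tolerance for g in gaps.values())

def toy_biased_score(text):
    # Stand-in model that wrongly treats an identity term as toxic.
    return 0.8 if "gay" in text.lower() else 0.1

gaps = counterfactual_gaps(toy_biased_score, "He is {term}", ["straight", "gay"])
print(gaps, passes(gaps))
```

Run over a bank of templates and term pairs, a check like this can serve as a release gate alongside the aggregate metrics.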