The Question
ML Design: Scalable Toxic Content Detection System
Design an end-to-end toxic content detection system for a high-traffic social media platform (1B+ posts/day). The system must handle real-time automated moderation with low latency (<150ms) while minimizing false positives to protect user expression. Your design should detail the data ingestion strategy, the model architecture for semantic understanding, techniques for handling adversarial text, and a robust evaluation framework that incorporates human-in-the-loop feedback and fairness auditing.
DistilBERT
Transformers
ONNX
TensorRT
Kafka
Flink
Spark
Redis
Triton
MLflow
Great Expectations
Questions & Insights
Clarifying Questions
Business Goal: Is the primary objective to auto-moderate content (high precision required to avoid "censorship" complaints) or to flag content for human review (high recall required to ensure safety)?
Constraints & Scale: What is the expected throughput (QPS) and content volume? (e.g., 500M posts/day, 20k QPS peak). What is the latency budget for real-time posting (e.g., <100ms)?
Content Types: Is the system strictly for text, or does it include images, video, and audio? (MVP focus: Text and Metadata).
Edge Cases: How should we handle "reclaimed" speech (slurs used within a community), sarcasm, or adversarial attacks (e.g., using "t0xic" instead of "toxic")?
Assumptions:
Content is primarily English for MVP.
Scale is 100M daily active users (DAU).
P99 latency requirement of 150ms for the entire moderation chain.
Thinking Process
Identify the Bottleneck: The trade-off between semantic understanding (Deep Learning) and latency/throughput. A massive LLM is too slow and too expensive for 20k QPS; a keyword list is too dumb.
Cascaded Architecture: Use a "funnel" approach. Fast, cheap heuristics (RegEx/Hash) filter the obvious 80%, a medium-sized model (DistilBERT/FastText) handles the rest, and an ensemble/human-in-the-loop handles high-uncertainty cases.
Multimodal Future-Proofing: Ensure the feature store and serving layer can eventually ingest image/video embeddings without a re-architecture.
Feedback Loops: The system must learn from moderators' overrides to handle evolving language (slang/evasion).
Elite Bonus Points
Counterfactual Fairness: Explicitly testing and penalizing the model if it flags sentences like "I am a [protected group]" as toxic more often than "I am a [neutral group]".
Adversarial Robustness: Implementing a "Robust Encoder" (e.g., Char-CNN or BPE-level subword tokens) to resist character-level perturbations intended to bypass filters.
Delayed Labeling via Active Learning: Using "Uncertainty Sampling" to prioritize which posts go to human moderators, maximizing the information gain for the next training cycle.
Contextual Embeddings: Utilizing the "Community" or "Thread" context (e.g., a toxicity score of the parent post) as a feature, as toxicity is often reactive.
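As a rough illustration of the Uncertainty Sampling idea above, the sketch below ranks posts by how close their toxicity score is to 0.5 and sends the most ambiguous ones to moderators first. The function name `select_for_review`, the `(post_id, score)` tuple layout, and the review budget are assumptions made for this example, not part of the original design.

```python
from typing import List, Tuple


def select_for_review(scored_posts: List[Tuple[str, float]], budget: int) -> List[str]:
    """Return up to `budget` post ids whose scores are closest to 0.5.

    A score near 0.5 means the binary classifier is least certain, so a
    human label on that post is likely to be the most informative.
    """
    ranked = sorted(scored_posts, key=lambda item: abs(item[1] - 0.5))
    return [post_id for post_id, _ in ranked[:budget]]


if __name__ == "__main__":
    # Hypothetical (post_id, toxicity_score) pairs.
    posts = [("a", 0.02), ("b", 0.48), ("c", 0.97), ("d", 0.61)]
    print(select_for_review(posts, budget=2))  # ['b', 'd']
```

For multi-category outputs, the same ranking could use prediction entropy or the margin between the top two categories instead of distance from 0.5.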
Design Breakdown
Requirements
Product Goal: Maintain platform safety by detecting and taking action on toxic text content (hate speech, harassment, threats).
Success Metrics:
Online Metrics: Precision/Recall at different thresholds, Auto-moderation rate, User report rate (Proxy for missed toxicity).
Offline Metrics: PR-AUC, F1-Score, False Positive Rate (FPR) on "Benign" datasets.
Guardrail Metrics: P99 Latency, Inference Cost per 1k requests, Model bias metrics (Equalized Odds).
System Constraints: 1B+ items/day (about 12k QPS on average), 150ms P99 latency, 99.9% availability.
Data Availability: Historical moderated logs (labeled), user report logs, community-specific rules.
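The bias guardrail listed above compares error rates across user groups: a model satisfies equalized odds when its true positive rate and false positive rate are the same for every group. A minimal sketch of that check follows, assuming each evaluation record carries a group tag, a human label, and the model's binary decision. The record layout and function names are illustrative only.

```python
from typing import Dict, List, Tuple

# (group, true_label, predicted_label), with labels in {0, 1}.
Record = Tuple[str, int, int]


def rates_by_group(records: List[Record]) -> Dict[str, Dict[str, float]]:
    """Compute the true positive rate and false positive rate per group."""
    counts: Dict[str, Dict[str, int]] = {}
    for group, y_true, y_pred in records:
        c = counts.setdefault(group, {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
        if y_true == 1:
            c["tp" if y_pred == 1 else "fn"] += 1
        else:
            c["fp" if y_pred == 1 else "tn"] += 1
    rates = {}
    for group, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["fp"] + c["tn"]
        rates[group] = {
            "tpr": c["tp"] / positives if positives else 0.0,
            "fpr": c["fp"] / negatives if negatives else 0.0,
        }
    return rates


def equalized_odds_gap(records: List[Record]) -> float:
    """Largest between-group difference in TPR or FPR (0.0 means parity)."""
    rates = rates_by_group(records)
    tprs = [r["tpr"] for r in rates.values()]
    fprs = [r["fpr"] for r in rates.values()]
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))
```

A guardrail could then alert whenever the gap exceeds a tolerance chosen by the policy team; that tolerance is a product decision and is not specified here.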
ML Problem Framing
ML Task Type: Binary or Multi-label Classification (Toxic, Severe Toxic, Obscene, Threat, Insult, Identity Hate), since a single post can fall into several categories at once.
Prediction Target: $P(\text{toxic} \mid \text{text}, \text{user\_meta}, \text{context})$
Inputs:
User Features: Account age, historical violation count, reputation score.
Item Features: Raw text, subword embeddings, character n-grams.
Context Features: Sub-community/Channel ID, time of day, parent post toxicity.
Outputs: Probability scores per category + an "Action" recommendation (Allow, Flag, Block).
ML Challenges: Highly imbalanced classes (toxicity is <1% of total traffic), adversarial evolution, and linguistic nuance.
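To make the "Action" output above concrete, here is a small sketch that maps per-category probabilities to Allow, Flag, or Block. The two thresholds are placeholder values assumed for illustration; in practice they would be tuned per category from the precision and recall targets.

```python
from typing import Dict

# Assumed thresholds for illustration only.
BLOCK_THRESHOLD = 0.95  # high bar: auto-blocking needs high precision
FLAG_THRESHOLD = 0.60   # lower bar: flagging for review favors recall


def recommend_action(scores: Dict[str, float]) -> str:
    """Map per-category toxicity probabilities to a single action."""
    top_score = max(scores.values(), default=0.0)
    if top_score >= BLOCK_THRESHOLD:
        return "Block"
    if top_score >= FLAG_THRESHOLD:
        return "Flag"
    return "Allow"


if __name__ == "__main__":
    print(recommend_action({"toxic": 0.97, "threat": 0.10}))  # Block
    print(recommend_action({"toxic": 0.70, "threat": 0.10}))  # Flag
    print(recommend_action({"toxic": 0.05, "threat": 0.01}))  # Allow
```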
Design Summary & MVP
Concise Summary: A three-stage cascaded moderation pipeline using a High-Speed Filter (Heuristics), a Semantic Scorer (Fine-tuned DistilBERT), and an Active Learning loop with Human Moderators.
Model Architecture & Selection:
Baseline Model: Logistic Regression on TF-IDF features + Keyword RegEx.
Target Model: DistilBERT (Transformers) for the semantic scorer to balance accuracy and latency.
Choice Rationale: Transformers capture long-range dependencies and sarcasm better than n-grams while DistilBERT is 40% smaller and 60% faster than BERT-base.
ML Life Cycle Summary: Spark processes logs; Flink extracts real-time user features; DistilBERT serves on Triton; Human-in-the-loop (Labelbox/Internal) provides ground truth.
Simplicity Audit: Avoids multi-modal or LLM-scale models (GPT-4) for the MVP to stay within latency and cost constraints.
Architecture Decision Rationale:
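Cascaded Inference: Reduces compute cost substantially (on the order of 70%, assuming the heuristic stage resolves most obvious cases) by short-circuiting before the model is called.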
Asynchronous Human Review: Ensures safety without blocking user experience for borderline cases.
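The cascade can be sketched in a few lines. In this hedged example a plain Python set stands in for the hash/keyword filter, a regular expression catches trivially benign posts, and `semantic_scorer` is any callable returning a toxicity probability (in production this would be the DistilBERT service). The blocklist entry, thresholds, and function names are hypothetical.

```python
import re
from typing import Callable

BLOCKLIST = {"badword"}  # hypothetical placeholder term
# Posts with no letters at all (empty, punctuation, digits, emoji only).
OBVIOUSLY_BENIGN = re.compile(r"^[\W\d\s]*$")


def moderate(
    text: str,
    semantic_scorer: Callable[[str], float],
    block_at: float = 0.95,
    allow_below: float = 0.30,
) -> str:
    """Three-stage funnel: heuristics, then model, then human review."""
    # Stage 1: cheap heuristics short-circuit the obvious cases.
    tokens = set(re.findall(r"[a-z0-9']+", text.lower()))
    if tokens & BLOCKLIST:
        return "Block"
    if OBVIOUSLY_BENIGN.match(text):
        return "Allow"
    # Stage 2: the semantic model is called only for what remains.
    score = semantic_scorer(text)
    if score >= block_at:
        return "Block"
    if score < allow_below:
        return "Allow"
    # Stage 3: uncertain scores are queued for asynchronous human review.
    return "HumanReview"
```

Because stage 3 is asynchronous, a caller would typically publish the post right away and enqueue it for moderators, which matches the non-blocking review flow described above.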
System Architecture
Pipeline Deep Dive
Data Pipeline
Data Source: Application event logs (user_id, text, timestamp), moderator action logs (labeling), and user reports.
Data Ingestion: Kafka for real-time ingestion. Airflow orchestrates daily batch jobs to move data from Kafka to S3.
Data Storage: S3 (Data Lake) for raw logs; Iceberg/Delta Lake for structured, versioned training data. Partitioned by date and category.
Data Processing: Spark for massive scale joins (User metadata + Content). Flink for real-time windowing (e.g., "how many posts has this user made in the last 10 seconds?").
Data Quality: Great Expectations for schema validation. Check for "label leakage" (e.g., ensuring moderator comments aren't in the training text).
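The real-time windowing feature mentioned above ("how many posts has this user made in the last 10 seconds?") would run as a Flink job in production. The pure-Python sketch below only illustrates the sliding-window logic; the class name is hypothetical, and it assumes timestamps arrive in order per user.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class SlidingWindowCounter:
    """Counts events per user within a trailing time window."""

    def __init__(self, window_seconds: float = 10.0) -> None:
        self.window_seconds = window_seconds
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def record(self, user_id: str, timestamp: float) -> int:
        """Register a post and return the user's count inside the window."""
        events = self._events[user_id]
        events.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Evict timestamps that have fallen out of the window.
        while events and events[0] <= cutoff:
            events.popleft()
        return len(events)


if __name__ == "__main__":
    counter = SlidingWindowCounter(window_seconds=10.0)
    for ts in (0.0, 5.0, 9.5, 12.0):
        print(ts, counter.record("user_1", ts))
```

A burst of posts from one account inside the window is a cheap spam or raid signal that can be written to the online feature store.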
Feature Pipeline
Feature Definition:
Content: Raw text, length, punctuation density (excessive caps/exclamations).
User: Historical toxicity rate, account age, verified status.
Contextual: Sub-reddit/Channel rules, community toxicity baseline.
Feature Engineering: Subword tokenization (WordPiece for DistilBERT; Byte-Pair Encoding is a comparable alternative) to handle out-of-vocabulary words. Hashing for categorical features.
Feature Store: Tecton or Feast. Offline store (S3) for training; Online store (Redis) for <10ms feature retrieval during inference.
Skew Mitigation: Use a unified Python feature library shared between Spark (offline) and the inference service (online) to ensure identical transformations.
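A shared feature library can be as simple as one pure function imported by both the batch job and the inference service, so the two paths cannot diverge. The sketch below computes the content features named above (length, caps ratio, exclamation density). The function name and exact feature definitions are assumptions for illustration.

```python
from typing import Dict


def content_features(text: str) -> Dict[str, float]:
    """Content-only features, identical for offline training and online serving."""
    letters = [ch for ch in text if ch.isalpha()]
    n_chars = len(text)
    return {
        "length": float(n_chars),
        # Share of alphabetic characters that are upper case ("SHOUTING").
        "caps_ratio": sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0,
        # Exclamation marks per character.
        "exclamation_density": text.count("!") / n_chars if n_chars else 0.0,
    }


if __name__ == "__main__":
    print(content_features("HEY you!!"))
```

Offline, the same function could be wrapped in a Spark UDF; online, it is called directly in the request path, which keeps the transformation logic in one place.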
Model Architecture
Problem Formulation: Multi-head Binary Classification (Toxic, Hate, etc.).
Architecture Design:
Encoder: DistilBERT-base-uncased.
Classification Head: Multiple Dense layers with Dropout, one for each toxicity category.
Complexity vs. Constraints: DistilBERT (~66M parameters) allows for ~10-20ms inference on GPU (T4/A10G). For even higher scale, we use Knowledge Distillation to a TinyBERT or a highly optimized CNN.
Optimization: Export to ONNX or TensorRT for optimized runtime. Int8 quantization to reduce memory footprint and increase throughput by 2-3x.
Training Pipeline
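Dataset Construction: Use Downsampling on the majority class (non-toxic) and Upsampling on minority classes (Threats). SMOTE applies only to numeric feature vectors (e.g., embeddings), not raw text, so text augmentation is the usual substitute.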
Data Splitting: Time-based split is critical. Language and toxicity patterns change over time (e.g., new political slurs).
Training Infrastructure: PyTorch Lightning on a GPU cluster (e.g., AWS SageMaker). Use DistributedDataParallel for scaling.
Retraining Strategy: Weekly batch retraining + "Flash" retraining triggers if drift detection flags a drop in F1-score or a spike in user reports.
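The time-based split and majority-class downsampling described in this pipeline can be sketched as follows. The row layout (`ts` and `label` keys), the cutoff, and the benign-to-toxic ratio are assumptions for the example. Downsampling is applied to the training split only, so the held-out split keeps the natural class balance.

```python
import random
from typing import Dict, List, Tuple

Row = Dict[str, int]  # assumed keys: "ts" (unix time) and "label" (1 = toxic)


def time_split(rows: List[Row], cutoff_ts: int) -> Tuple[List[Row], List[Row]]:
    """Train on everything before the cutoff, evaluate on everything after."""
    train = [r for r in rows if r["ts"] < cutoff_ts]
    test = [r for r in rows if r["ts"] >= cutoff_ts]
    return train, test


def downsample_majority(rows: List[Row], ratio: int, seed: int = 0) -> List[Row]:
    """Keep all toxic rows and at most `ratio` benign rows per toxic row."""
    toxic = [r for r in rows if r["label"] == 1]
    benign = [r for r in rows if r["label"] == 0]
    keep = min(len(benign), ratio * len(toxic))
    rng = random.Random(seed)  # seeded for reproducible training sets
    return toxic + rng.sample(benign, keep)


if __name__ == "__main__":
    rows = [{"ts": i, "label": 1 if i % 20 == 0 else 0} for i in range(100)]
    train, test = time_split(rows, cutoff_ts=80)
    balanced = downsample_majority(train, ratio=3)
    print(len(train), len(test), len(balanced))  # 80 20 16
```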
Serving Pipeline
Serving Pattern: Synchronous Request-Response for the "Allow/Block" flow.
Latency Optimization:
Tiered Inference: If the keyword Bloom filter reports a hit, confirm it against the exact blocklist (Bloom filters can return false positives); on a confirmed match, skip the model and block.
Request Batching: Group incoming requests with dynamic batching at the serving layer (e.g., Triton) to keep the GPU saturated without exceeding the latency budget.
Reliability: If the model service times out (>100ms), fall back to a "Safety-First" mode (allow the post but flag it for high-priority manual review).
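The timeout fallback described above can be sketched with a thread pool and a bounded wait. This is a simplified illustration: `scorer` stands in for the call to the model service, `slow_scorer` simulates an overloaded service, and the response fields are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Dict

_executor = ThreadPoolExecutor(max_workers=4)


def moderate_with_fallback(
    text: str,
    scorer: Callable[[str], float],
    timeout_s: float = 0.1,
    block_at: float = 0.95,
) -> Dict[str, object]:
    """Score a post, but never wait longer than `timeout_s` for the model."""
    future = _executor.submit(scorer, text)
    try:
        score = future.result(timeout=timeout_s)
    except FutureTimeout:
        # Fallback: publish the post, but queue it for priority manual review.
        return {"action": "Allow", "priority_review": True}
    action = "Block" if score >= block_at else "Allow"
    return {"action": action, "priority_review": False}


def slow_scorer(text: str) -> float:
    """Hypothetical scorer that simulates an overloaded model service."""
    time.sleep(0.5)
    return 0.99


if __name__ == "__main__":
    print(moderate_with_fallback("some post", slow_scorer, timeout_s=0.05))
    print(moderate_with_fallback("some post", lambda t: 0.99, timeout_s=2.0))
```

In a real service the timed-out request would also be cancelled or its result discarded, and exceptions raised by the scorer would need the same fallback path.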
Evaluation Pipeline
Offline: Precision-Recall Curves. We focus on Precision at Recall=0.9 (to see how many users we annoy to catch 90% of trolls).
Online: Shadow Mode Deployment—run the new model in production, log its decisions, but don't act on them. Compare with the current model's decisions and human moderator labels.
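For the offline metric above, precision at a fixed recall can be computed by sweeping the decision threshold from the highest score downward and keeping the best precision among thresholds that reach the recall target. The sketch below is a plain-Python version with hypothetical names; a library routine such as scikit-learn's `precision_recall_curve` would normally be used instead.

```python
from typing import List


def precision_at_recall(labels: List[int], scores: List[float], target_recall: float = 0.9) -> float:
    """Best precision over all thresholds whose recall is >= target_recall."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    best = 0.0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Only evaluate once per distinct score so ties share one threshold.
        last_of_tie = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if last_of_tie and tp / total_pos >= target_recall:
            best = max(best, tp / (tp + fp))
    return best


if __name__ == "__main__":
    y = [1, 1, 0, 1, 0]
    s = [0.9, 0.8, 0.7, 0.6, 0.2]
    print(precision_at_recall(y, s, target_recall=0.9))  # 0.75
```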
Monitoring Pipeline
Data Monitoring: Track "Average Sentence Length" or "Token Distribution". If users start using weird characters, the distribution will shift (KL Divergence).
Model Monitoring: Monitor the Mean Predicted Probability. If it shifts from 0.01 to 0.05, either the world is getting angrier or the model has drifted or lost calibration.
Delayed Feedback: Correlate model predictions with User Reports that arrive 1-2 hours later. These serve as a delayed, noisy proxy for ground truth until moderator labels are available.
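The KL-divergence check mentioned under Data Monitoring can be sketched on character counts. The example below compares a reference window to a current window with additive smoothing so unseen characters do not produce a division by zero. The function names, the choice of characters rather than tokens, and the smoothing constant are assumptions for illustration.

```python
import math
from collections import Counter
from typing import Iterable


def char_counts(texts: Iterable[str]) -> Counter:
    """Count non-whitespace characters across a batch of posts."""
    return Counter(ch for text in texts for ch in text.lower() if not ch.isspace())


def kl_divergence(reference: Counter, current: Counter, smoothing: float = 1.0) -> float:
    """KL(current || reference) over the union vocabulary, with additive smoothing."""
    vocab = set(reference) | set(current)
    ref_total = sum(reference.values()) + smoothing * len(vocab)
    cur_total = sum(current.values()) + smoothing * len(vocab)
    kl = 0.0
    for token in vocab:
        p = (current[token] + smoothing) / cur_total
        q = (reference[token] + smoothing) / ref_total
        kl += p * math.log(p / q)
    return kl


if __name__ == "__main__":
    baseline = char_counts(["you are toxic"])
    shifted = char_counts(["y0u ar3 t0x1c"])
    print(kl_divergence(baseline, baseline))
    print(kl_divergence(baseline, shifted))
```

An alert threshold on this value would have to be set empirically from historical day-over-day variation.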
Wrap Up
Final Evaluation
Observability: Real-time dashboard showing "Auto-block Rate" by community.
Edge Cases:
Cold Start: For new users, use a more conservative threshold or rely purely on content features until a reputation is built.
Adversarial: Periodically inject "adversarial" samples (misspellings) into the test set.
Trade-offs:
Recall vs. Precision: We prioritize Precision for auto-bans and Recall for manual flagging.
Latency vs. Accuracy: We accept DistilBERT's slightly lower accuracy vs. BERT-large to ensure real-time user experience.
Distinguishing Insights: Implementing Model Calibration (e.g., Platt Scaling) is vital. A model might say "0.7 probability," but we need that 0.7 to actually mean a 70% chance of toxicity for reliable decision thresholds.
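Platt Scaling fits a one-dimensional logistic regression, p = sigmoid(a * s + b), that maps the raw model score s to a calibrated probability. The sketch below fits a and b with plain gradient descent on log loss. It is a simplified illustration: the names and hyperparameters are assumptions, and a production setup would fit on a held-out calibration set, typically via a library implementation.

```python
import math
from typing import List, Tuple


def _sigmoid(z: float) -> float:
    # Numerically stable for large positive or negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(scores: List[float], labels: List[int], lr: float = 0.1, epochs: int = 2000) -> Tuple[float, float]:
    """Fit (a, b) so that sigmoid(a * score + b) minimizes log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = _sigmoid(a * s + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrate(score: float, a: float, b: float) -> float:
    """Map a raw score to a calibrated probability."""
    return _sigmoid(a * score + b)


if __name__ == "__main__":
    # Hypothetical raw scores (logits) with human labels from a calibration set.
    raw = [-3.0, -2.0, -1.0, -1.0, 1.0, 1.0, 2.0, 3.0]
    truth = [0, 0, 0, 1, 0, 1, 1, 1]
    a, b = fit_platt(raw, truth)
    print(a, b, calibrate(2.0, a, b))
```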