The Question

Scalable Toxic Content Detection System

Design an end-to-end toxic content detection system for a high-traffic social media platform (1B+ posts/day). The system must handle real-time automated moderation with low latency (<150ms) while minimizing false positives to protect user expression. Your design should detail the data ingestion strategy, the model architecture for semantic understanding, techniques for handling adversarial text, and a robust evaluation framework that incorporates human-in-the-loop feedback and fairness auditing.

DistilBERT

Transformers

ONNX

TensorRT

Kafka

Flink

Spark

Redis

Triton

MLflow

Great Expectations

Questions & Insights

Clarifying Questions

Business Goal: Is the primary objective to auto-moderate content (high precision required to avoid "censorship" complaints) or to flag content for human review (high recall required to ensure safety)?

Constraints & Scale: What is the expected throughput (QPS) and content volume? (e.g., 500M posts/day, 20k QPS peak). What is the latency budget for real-time posting (e.g., <100ms)?

Content Types: Is the system strictly for text, or does it include images, video, and audio? (MVP focus: Text and Metadata).

Edge Cases: How should we handle "reclaimed" speech (slurs used within a community), sarcasm, or adversarial attacks (e.g., using "t0xic" instead of "toxic")?

Assumptions:

Content is primarily English for MVP.

Scale is 100M daily active users (DAU).

P99 latency requirement of 150ms for the entire moderation chain.
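As a back-of-envelope check on scale, the sketch below converts the 1B posts/day figure from the problem statement into an average request rate and compares it with the 20k QPS peak used as an example in the clarifying questions. The function name is illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_qps(posts_per_day: int) -> float:
    """Average requests per second if traffic were spread evenly over a day."""
    return posts_per_day / SECONDS_PER_DAY


avg = average_qps(1_000_000_000)   # 1B posts/day from the problem statement
peak_to_avg = 20_000 / avg         # 20k QPS peak is the example figure above

print(f"average QPS: {avg:,.0f}")
print(f"peak-to-average ratio: {peak_to_avg:.2f}")
```

Real traffic is bursty, so capacity planning would be driven by the peak rather than the average.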

Thinking Process

Identify the Bottleneck: The core trade-off is semantic understanding (deep learning) versus latency and throughput. A large LLM is too slow and too expensive to run at 20k QPS; a keyword list is fast but misses context and meaning.

Cascaded Architecture: Use a "funnel" approach. Fast, cheap heuristics (RegEx/Hash) filter the obvious 80%, a medium-sized model (DistilBERT/FastText) handles the rest, and an ensemble/human-in-the-loop handles high-uncertainty cases.

Multimodal Future-Proofing: Ensure the feature store and serving layer can eventually ingest image/video embeddings without a re-architecture.

Feedback Loops: The system must learn from moderators' overrides to handle evolving language (slang/evasion).
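The funnel described in the Cascaded Architecture point can be sketched roughly as follows. This is a minimal illustration, not a production design: the blocklist pattern, the thresholds, and the `score_fn` callable (standing in for the DistilBERT scorer) are all hypothetical.

```python
import re
from typing import Callable

# Placeholder blocklist; a real list would be maintained by policy teams.
BLOCKLIST_PATTERN = re.compile(r"\b(?:exampleslur1|exampleslur2)\b", re.IGNORECASE)


def route(
    text: str,
    score_fn: Callable[[str], float],
    block_threshold: float = 0.95,   # assumed value, tuned for high precision
    review_threshold: float = 0.50,  # assumed value, tuned for high recall
) -> str:
    """Return 'block', 'flag' (send to human review), or 'allow'."""
    # Stage 1: cheap heuristic. A hit short-circuits the model entirely.
    if BLOCKLIST_PATTERN.search(text):
        return "block"
    # Stage 2: semantic scorer, only reached when the heuristic is silent.
    p_toxic = score_fn(text)
    if p_toxic >= block_threshold:
        return "block"
    # Stage 3: uncertain cases go to human moderators.
    if p_toxic >= review_threshold:
        return "flag"
    return "allow"
```

For example, `route("hello", lambda t: 0.7)` returns `"flag"`, while any text matching the blocklist returns `"block"` without calling the scorer.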

Elite Bonus Points

Counterfactual Fairness: Explicitly testing and penalizing the model if it flags sentences like "I am a [protected group]" as toxic more often than "I am a [neutral group]".

Adversarial Robustness: Implementing a "Robust Encoder" (e.g., Char-CNN or BPE subword tokenization) to resist character-level perturbations intended to bypass filters.

Delayed Labeling via Active Learning: Using "Uncertainty Sampling" to prioritize which posts go to human moderators, maximizing the information gain for the next training cycle.

Contextual Embeddings: Utilizing the "Community" or "Thread" context (e.g., a toxicity score of the parent post) as a feature, as toxicity is often reactive.
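The uncertainty-sampling idea mentioned above can be illustrated with a small sketch: given model scores, pick the posts whose predicted probability is closest to 0.5 and send those to moderators first. The function name and the `(post_id, p_toxic)` input format are assumptions made for this example.

```python
from typing import List, Tuple


def select_for_review(scored_posts: List[Tuple[str, float]], budget: int) -> List[str]:
    """Return up to `budget` post ids, most uncertain (closest to 0.5) first."""
    ranked = sorted(scored_posts, key=lambda item: abs(item[1] - 0.5))
    return [post_id for post_id, _ in ranked[:budget]]


scored = [("a", 0.02), ("b", 0.48), ("c", 0.91), ("d", 0.60)]
print(select_for_review(scored, budget=2))
```

With a multi-label model, a similar ranking could be applied to the most uncertain category per post.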

Design Breakdown

Requirements

Product Goal: Maintain platform safety by detecting and taking action on toxic text content (hate speech, harassment, threats).

Success Metrics:

Online Metrics: Precision/Recall at different thresholds, Auto-moderation rate, User report rate (Proxy for missed toxicity).

Offline Metrics: PR-AUC, F1-Score, False Positive Rate (FPR) on "Benign" datasets.

Guardrail Metrics: P99 Latency, Inference Cost per 1k requests, Model bias metrics (Equalized Odds).

System Constraints: 1B+ posts/day per the problem statement (about 11.6k QPS on average, with peaks around 20k QPS), 150ms P99 latency, 99.9% availability.

Data Availability: Historical moderated logs (labeled), user report logs, community-specific rules.
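For the bias guardrail, one common reading of equalized odds is that true positive rate and false positive rate should match across groups. The sketch below computes the largest gap in each; the record format `(group, y_true, y_pred)` and the function names are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def rates_by_group(records: Iterable[Tuple[str, int, int]]) -> Dict[str, Tuple[float, float]]:
    """Map each group to (TPR, FPR) from (group, y_true, y_pred) records with 0/1 labels."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for group, y_true, y_pred in records:
        correct = "t" if y_true == y_pred else "f"
        predicted = "p" if y_pred else "n"
        counts[group][correct + predicted] += 1
    rates = {}
    for group, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["fp"] + c["tn"]
        tpr = c["tp"] / positives if positives else 0.0
        fpr = c["fp"] / negatives if negatives else 0.0
        rates[group] = (tpr, fpr)
    return rates


def equalized_odds_gap(records: Iterable[Tuple[str, int, int]]) -> Tuple[float, float]:
    """Return (max TPR gap, max FPR gap) across groups; (0, 0) means parity."""
    rates = rates_by_group(records)
    tprs = [tpr for tpr, _ in rates.values()]
    fprs = [fpr for _, fpr in rates.values()]
    return max(tprs) - min(tprs), max(fprs) - min(fprs)
```

For a moderation system, the FPR gap is usually the more sensitive number, since it measures how often benign posts from one group are wrongly flagged relative to another.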

ML Problem Framing

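ML Task Type: Binary classification (toxic vs. non-toxic) or, for per-category scores, multi-label classification (Toxic, Severe Toxic, Obscene, Threat, Insult, Identity Hate). The categories can co-occur, so they are modeled as independent binary outputs rather than mutually exclusive classes.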

Prediction Target:

P(\text{toxic} \mid \text{text}, \text{user\_meta}, \text{context})

Inputs:

User Features: Account age, historical violation count, reputation score.

Item Features: Raw text, subword embeddings, character n-grams.

Context Features: Sub-community/Channel ID, time of day, parent post toxicity.

Outputs: Probability scores per category + an "Action" recommendation (Allow, Flag, Block).

ML Challenges: Highly imbalanced classes (toxicity is <1% of total traffic), adversarial evolution, and linguistic nuance.

Design Summary & MVP

Concise Summary: A three-stage cascaded moderation pipeline using a High-Speed Filter (Heuristics), a Semantic Scorer (Fine-tuned DistilBERT), and an Active Learning loop with Human Moderators.

Model Architecture & Selection:

Baseline Model: Logistic Regression on TF-IDF features + Keyword RegEx.

Target Model: DistilBERT (Transformers) for the semantic scorer to balance accuracy and latency.

Choice Rationale: Transformers capture long-range dependencies and sarcasm better than n-grams while DistilBERT is 40% smaller and 60% faster than BERT-base.

ML Life Cycle Summary: Spark processes logs; Flink extracts real-time user features; DistilBERT serves on Triton; Human-in-the-loop (Labelbox/Internal) provides ground truth.

Simplicity Audit: Avoids multi-modal or LLM-scale models (GPT-4) for the MVP to stay within latency and cost constraints.

Architecture Decision Rationale:

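Cascaded Inference: Short-circuiting obvious cases in the cheap heuristic stage cuts model compute substantially (an estimated 70%; the actual saving depends on how much traffic the heuristics resolve).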

Asynchronous Human Review: Ensures safety without blocking user experience for borderline cases.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Application event logs (user_id, text, timestamp), moderator action logs (labeling), and user reports.

Data Ingestion: Kafka for real-time ingestion. Airflow orchestrates daily batch jobs to move data from Kafka to S3.

Data Storage: S3 (Data Lake) for raw logs; Iceberg/Delta Lake for structured, versioned training data. Partitioned by date and category.

Data Processing: Spark for massive scale joins (User metadata + Content). Flink for real-time windowing (e.g., "how many posts has this user made in the last 10 seconds?").

Data Quality: Great Expectations for schema validation. Check for "label leakage" (e.g., ensuring moderator comments aren't in the training text).
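The real-time windowing feature mentioned under Data Processing ("how many posts has this user made in the last 10 seconds?") can be illustrated in plain Python. This is a stand-in for the logic only, not Flink's API; the class name is hypothetical and timestamps are assumed to arrive in order per user.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class SlidingWindowCounter:
    """Counts events per user inside a trailing time window."""

    def __init__(self, window_seconds: float = 10.0) -> None:
        self.window_seconds = window_seconds
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def record(self, user_id: str, timestamp: float) -> int:
        """Record a post and return the user's post count within the window."""
        events = self._events[user_id]
        events.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop events that have fallen out of the window.
        while events and events[0] <= cutoff:
            events.popleft()
        return len(events)


counter = SlidingWindowCounter(window_seconds=10.0)
for ts in (0.0, 4.0, 9.5):
    burst = counter.record("u1", ts)
print(burst)  # posts by u1 in the 10 seconds ending at t=9.5
```

A streaming engine would additionally handle out-of-order events and state expiry, which this sketch ignores.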

Feature Pipeline

Feature Definition:

Content: Raw text, length, punctuation density (excessive caps/exclamations).

User: Historical toxicity rate, account age, verified status.

Contextual: Subreddit/channel rules, community toxicity baseline.

Feature Engineering: Subword tokenization (Byte-Pair Encoding) to handle out-of-vocabulary words. Hashing for categorical features.

Feature Store: Tecton or Feast. Offline store (S3) for training; Online store (Redis) for <10ms feature retrieval during inference.

Skew Mitigation: Use a unified Python feature library shared between Spark (offline) and the inference service (online) to ensure identical transformations.
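A shared feature library of the kind described under Skew Mitigation might look like the sketch below, covering the content features and categorical hashing listed above. The function names and bucket count are assumptions. It uses `hashlib` rather than Python's built-in `hash()`, because the built-in is salted per process for strings and would give different buckets offline and online.

```python
import hashlib
from typing import Dict


def content_features(text: str) -> Dict[str, float]:
    """Cheap content signals: length, share of uppercase letters, '!' density."""
    letters = [ch for ch in text if ch.isalpha()]
    n_chars = len(text)
    return {
        "length": n_chars,
        "caps_ratio": sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0,
        "exclaim_density": text.count("!") / n_chars if n_chars else 0.0,
    }


def hash_bucket(value: str, num_buckets: int = 1024) -> int:
    """Deterministic bucket id for a categorical value such as a channel id."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


print(content_features("HEY you!!"))
print(hash_bucket("channel_42"))
```

Both the Spark job and the inference service would import these same functions, so any change to a transformation lands in both paths at once.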

Model Architecture

Problem Formulation: Multi-head Binary Classification (Toxic, Hate, etc.).

Architecture Design:

Encoder: DistilBERT-base-uncased.

Classification Head: Multiple Dense layers with Dropout, one for each toxicity category.

Complexity vs. Constraints: DistilBERT (~66M parameters) allows roughly 10-20ms inference on GPU (T4/A10G) for short sequences. For even higher scale, distill further into TinyBERT or a highly optimized CNN.

Optimization: Export to ONNX or TensorRT for an optimized runtime. INT8 quantization reduces memory footprint and can raise throughput by roughly 2-3x, depending on hardware; validate the accuracy impact on a held-out set.

Training Pipeline

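Dataset Construction: Downsample the majority class (non-toxic) and upsample or augment minority classes (e.g., Threats). SMOTE interpolates between fixed-length feature vectors, so it applies to TF-IDF or embedding features rather than raw text. Because resampling shifts class priors, recalibrate predicted probabilities before applying decision thresholds.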

Data Splitting: Time-based split is critical. Language and toxicity patterns change over time (e.g., new political slurs).

Training Infrastructure: PyTorch Lightning on a GPU cluster (e.g., AWS SageMaker). Use DistributedDataParallel for scaling.

Retraining Strategy: Weekly batch retraining + "Flash" retraining triggers if drift detection flags a drop in F1-score or a spike in user reports.
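Downsampling the non-toxic class during dataset construction inflates the scores a model learns, since toxic examples look more common in training than in production. If negatives are kept with probability `keep_rate`, a standard prior correction maps the sampled-data probability back to the original scale. The sketch below assumes only negatives were downsampled; the function name is illustrative.

```python
def correct_for_downsampling(p_sampled: float, keep_rate: float) -> float:
    """Undo the prior shift caused by keeping negatives with probability keep_rate.

    Derivation: downsampling multiplies the odds of the positive class by
    1 / keep_rate, so the true odds are sampled_odds * keep_rate.
    """
    if not 0.0 < keep_rate <= 1.0:
        raise ValueError("keep_rate must be in (0, 1]")
    return p_sampled / (p_sampled + (1.0 - p_sampled) / keep_rate)


# A score of 0.5 on data where only 10% of negatives were kept
# corresponds to roughly 0.09 on the original distribution.
print(round(correct_for_downsampling(0.5, keep_rate=0.1), 4))
```

Upsampling positives by a factor k has the same effect on the odds as `keep_rate = 1 / k`, so the same correction can be reused in that case.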

Serving Pipeline

Serving Pattern: Synchronous Request-Response for the "Allow/Block" flow.

Latency Optimization:

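Tiered Inference: If a Bloom filter over the keyword blocklist reports a match, confirm it against the exact list (Bloom filters can return false positives, never false negatives) and skip the model only for confirmed high-confidence terms.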

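Request Batching: Group incoming requests into small batches at the API gateway or in the inference server (for example, with Triton's dynamic batching) to improve GPU utilization, bounding the batching delay so it fits within the latency budget.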
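The keyword tier can be sketched with a small hand-rolled Bloom filter followed by an exact check. Everything here is illustrative: the sizes, the placeholder terms, and whitespace tokenization are assumptions, and the in-memory set stands in for what might be a slower authoritative lookup in practice.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 8192, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive several bit positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


BANNED_TERMS = {"bannedterm1", "bannedterm2"}  # placeholder terms
bloom = BloomFilter()
for term in BANNED_TERMS:
    bloom.add(term)


def has_banned_term(text: str) -> bool:
    """Fast Bloom pre-check, then exact confirmation to rule out false positives."""
    for token in text.lower().split():
        if bloom.might_contain(token) and token in BANNED_TERMS:
            return True
    return False
```

Most benign tokens fail the Bloom check immediately, so the exact lookup runs only on the rare candidates.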

Reliability: If the model service times out (>100ms), fall back to a fail-open mode: allow the post but flag it for high-priority manual review. This favors user experience over immediate blocking, so the review queue must drain quickly.
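The timeout fallback described above could be wired up roughly as follows, using a thread pool to bound how long the caller waits for a score. The function names, the threshold, and the result dictionary are assumptions for illustration; `score_fn` stands in for the call to the model service.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Dict

_executor = ThreadPoolExecutor(max_workers=4)


def moderate_with_timeout(
    text: str,
    score_fn: Callable[[str], float],
    timeout_s: float = 0.1,         # the 100ms timeout from the text
    block_threshold: float = 0.95,  # assumed value
) -> Dict[str, object]:
    """Score a post, or allow it and queue priority review if scoring is too slow."""
    future = _executor.submit(score_fn, text)
    try:
        p_toxic = future.result(timeout=timeout_s)
    except FutureTimeout:
        # The worker thread keeps running; only the caller stops waiting.
        return {"action": "allow", "needs_priority_review": True}
    action = "block" if p_toxic >= block_threshold else "allow"
    return {"action": action, "needs_priority_review": False}


def slow_scorer(text: str) -> float:
    time.sleep(0.3)  # simulate a stalled model service
    return 0.99


print(moderate_with_timeout("some post", slow_scorer, timeout_s=0.05))
```

A production service would more likely enforce the deadline at the RPC layer, but the control flow is the same.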

Evaluation Pipeline

Offline: Precision-Recall Curves. We focus on Precision at Recall=0.9 (to see how many users we annoy to catch 90% of trolls).

Online: Shadow Mode Deployment. Run the new model in production and log its decisions without acting on them, then compare them with the current model's decisions and human moderator labels.
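The offline metric "Precision at Recall=0.9" can be computed by walking down the score-ranked list until the target recall is reached. The sketch below returns the precision at the highest threshold that achieves the target, along with that threshold. It assumes distinct scores and at least one positive label; the function name is illustrative.

```python
from typing import Sequence, Tuple


def precision_at_recall(
    y_true: Sequence[int], scores: Sequence[float], target_recall: float = 0.9
) -> Tuple[float, float]:
    """Return (precision, threshold) at the first point where recall >= target."""
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    for score, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        if tp / total_pos >= target_recall:
            return tp / (tp + fp), score
    raise ValueError("target_recall must be at most 1.0")


labels = [1, 1, 0, 1, 0, 0]
model_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
print(precision_at_recall(labels, model_scores, target_recall=0.9))
```

In this toy case all three toxic posts are only captured once the threshold drops to 0.6, at which point one benign post is also flagged.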

Monitoring Pipeline

Data Monitoring: Track input statistics such as average sentence length or token distribution. If users start using unusual characters to evade filters, the distribution shifts, which can be quantified with KL divergence against a baseline window.

Model Monitoring: Monitor the mean predicted probability. If it shifts from 0.01 to 0.05, either real toxicity has increased or the model has drifted (for example, through input distribution shift or miscalibration).

Delayed Feedback: Correlate model predictions with user reports that arrive 1-2 hours later. Reports are a noisy proxy rather than true ground truth, so treat them as an early warning signal and confirm with moderator labels.
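The token-distribution drift check mentioned under Data Monitoring can be sketched with a smoothed KL divergence between a baseline window and a current window. Whitespace tokenization, add-one smoothing, and the function names are simplifying assumptions for this example.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def token_distribution(texts: Sequence[str], vocab: List[str], smoothing: float = 1.0) -> Dict[str, float]:
    """Smoothed token frequencies over a fixed vocabulary (so no zero probabilities)."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts[tok] for tok in vocab) + smoothing * len(vocab)
    return {tok: (counts[tok] + smoothing) / total for tok in vocab}


def kl_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """KL(p || q) in nats; both distributions must share the same keys."""
    return sum(p[tok] * math.log(p[tok] / q[tok]) for tok in p)


baseline = ["you are great", "great game today"]
current = ["y0u are gr8", "gr8 g4me t0day"]
vocab = sorted({tok for text in baseline + current for tok in text.lower().split()})

q = token_distribution(baseline, vocab)
p = token_distribution(current, vocab)
drift = kl_divergence(p, q)
print(f"KL(current || baseline) = {drift:.3f}")
```

An alert threshold on this value would need to be set empirically from normal week-to-week variation.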

Wrap Up

Final Evaluation

Observability: Real-time dashboard showing "Auto-block Rate" by community.

Edge Cases:

Cold Start: For new users, use a more conservative threshold or rely purely on content features until a reputation is built.

Adversarial: Periodically inject "adversarial" samples (misspellings) into the test set.
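One way to generate the misspelled samples mentioned in the Adversarial edge case is a simple character-substitution perturber, paired with a normalizer that undoes the same substitutions. The mapping tables and function names below are assumptions and cover only a handful of leetspeak swaps.

```python
import random

# Hypothetical substitution tables; real evasion is far more varied.
TO_LEET = {"o": "0", "i": "1", "e": "3", "a": "4", "s": "5", "t": "7"}
FROM_LEET = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s"}
)


def perturb(text: str, rng: random.Random, rate: float = 0.5) -> str:
    """Randomly replace letters with look-alike digits to build adversarial test cases."""
    return "".join(
        TO_LEET[ch] if ch in TO_LEET and rng.random() < rate else ch for ch in text
    )


def normalize(text: str) -> str:
    """Lowercase and map common look-alike characters back to letters."""
    return text.lower().translate(FROM_LEET)


rng = random.Random(0)  # fixed seed so the test set is reproducible
adversarial = perturb("toxic post", rng)
print(adversarial, "->", normalize(adversarial))
```

Note that this normalizer also rewrites legitimate digits (a year, for instance), so it is better suited to producing an extra model input than to replacing the original text.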

Trade-offs:

Recall vs. Precision: We prioritize Precision for auto-bans and Recall for manual flagging.

Latency vs. Accuracy: We accept DistilBERT's slightly lower accuracy vs. BERT-large to ensure real-time user experience.

Distinguishing Insights: Implementing Model Calibration (e.g., Platt Scaling) is vital. A model might say "0.7 probability," but we need that 0.7 to actually mean a 70% chance of toxicity for reliable decision thresholds.
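To make the calibration point concrete, the sketch below fits a simplified Platt-style mapping: a one-dimensional logistic regression from the model's raw score to a probability, trained by plain gradient descent on held-out labels. It omits the target smoothing from Platt's original method, and the learning rate, epoch count, and function names are assumptions.

```python
import math
from typing import Sequence, Tuple


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_platt(
    scores: Sequence[float], labels: Sequence[int], lr: float = 0.1, epochs: int = 2000
) -> Tuple[float, float]:
    """Fit p = sigmoid(a * score + b) by minimizing log loss on held-out data."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = _sigmoid(a * s + b) - y  # gradient of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrate(score: float, a: float, b: float) -> float:
    return _sigmoid(a * score + b)


# Toy held-out set: raw scores (e.g., logits) with noisy labels.
raw_scores = [-2.0, -1.0, -1.0, 0.0, 0.0, 1.0, 1.0, 2.0]
true_labels = [0, 0, 1, 0, 1, 0, 1, 1]
a, b = fit_platt(raw_scores, true_labels)
print(f"a={a:.3f}, b={b:.3f}, calibrated(1.0)={calibrate(1.0, a, b):.3f}")
```

The fit should use a held-out split rather than the training data, otherwise the calibrated probabilities inherit the model's overconfidence.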