The Question

Scalable Content Moderation System

Design a high-scale content moderation system for a global social media platform. The system must process 100M+ text and image uploads per day with a P99 latency of under 200ms. Your design should cover the end-to-end ML lifecycle, including fast-path filtering, multi-modal scoring models, human-in-the-loop workflows for active learning, and strategies for handling adversarial content and data drift. Address how you would balance precision and recall while managing the costs of inference and human review.

DistilBERT

ResNet

LightGBM

ONNX

TensorRT

Kafka

Redis

Perceptual Hashing

Active Learning

OCR

Questions & Insights

Clarifying Questions

Business Goal: Is the primary goal to minimize legal risk (high recall of bad content) or to maximize user retention by avoiding over-moderation (high precision)?

Constraints & Scale: What is the scale of incoming content (e.g., 100M posts/day)? What is the P99 latency budget for real-time moderation (e.g., <200ms)?

Edge Cases: How do we handle "adversarial" content (e.g., leetspeak like "h4te", or hidden text in images)? Do we need to support multi-modal content (text + image)?

Human-in-the-loop (HITL): What is the capacity of the manual moderation team? Should the system prioritize high-uncertainty samples for them?

Assumptions:

We are building for a social media platform with 100M daily uploads.

Content types are primarily text and images.

P99 latency requirement: 200ms for automated flagging.

We assume an existing "blocklist" of known harmful hashes (MD5/Perceptual Hashes) exists.

Thinking Process

Identify the Funnel: Moderation is a needle-in-a-haystack problem. I need a multi-stage funnel: 1. Heuristic/Hash matching (fastest), 2. High-recall ML models (heavy lifting), 3. Human Review (ground truth).

Data Freshness: Harmful trends (memes, slang) evolve daily. The system must support rapid retraining or a "hot-patch" mechanism (rules/blocklists).

Multi-modality: A "safe" image with "unsafe" text is harmful. I should consider a multi-modal embedding space but keep the MVP simple with late fusion (scoring text and images separately and combining).

System Reliability: If the ML service is down, the fallback should be "Fail-Open" (show content) or "Fail-Closed" (hide content) based on the severity of the category (e.g., CSAM must Fail-Closed).
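
The per-category failure policy described above can be sketched as a small wrapper around the scoring call. This is a minimal illustration, not a production design: the `FAIL_CLOSED` table, the category names, the 0.5 threshold, and the choice to fail open for unknown categories are all assumptions made for the example.

```python
from typing import Callable

# Hypothetical policy table: True means "hide the content if scoring is unavailable".
FAIL_CLOSED = {"csam": True, "hate_speech": False, "nsfw": False}


def decide_with_fallback(category: str,
                         score_fn: Callable[[], float],
                         threshold: float = 0.5) -> str:
    """Return 'block' or 'allow', applying the category's failure policy on errors."""
    try:
        score = score_fn()
    except Exception:
        # The scoring call failed (outage, timeout, ...): apply the policy instead.
        # Categories missing from the table fail open in this sketch.
        return "block" if FAIL_CLOSED.get(category, False) else "allow"
    return "block" if score >= threshold else "allow"
```

In practice the `except` clause would likely be narrowed to timeout and connection errors, so that genuine bugs are not silently absorbed by the fallback.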

Elite Bonus Points

Active Learning: Implementing an "Uncertainty Sampling" strategy where the model automatically routes samples with scores near the decision boundary (e.g., 0.45 - 0.55) to human moderators to maximize the information gain of new training data.

Adversarial Robustness: Using character-level CNNs or tokenization-free models (such as CANINE, which operates on characters, or ByT5, which operates on bytes) to resist text obfuscation (leetspeak, emojis inserted between letters) that often bypasses word-level tokenizers.

Explainable Moderation: Providing "Evidence Highlighting" (e.g., Integrated Gradients) to human moderators so they can quickly see why the ML flagged a post, reducing human review latency by 30-50%.

Negative Feedback Loop: Detecting "Moderator Fatigue" by cross-referencing decisions between different human moderators and using the consensus as the gold standard to clean noisy labels in the training set.
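
As a rough illustration of the uncertainty-sampling idea from the Active Learning point, the sketch below picks the items whose score falls inside the 0.45 - 0.55 band and sends the most ambiguous ones first. The function name, the `budget` parameter (a stand-in for moderator capacity), and the dictionary input format are assumptions for the example.

```python
from typing import Dict, List


def select_for_review(scores: Dict[str, float],
                      low: float = 0.45,
                      high: float = 0.55,
                      budget: int = 100) -> List[str]:
    """Return ids with a score in [low, high], closest to 0.5 first, capped at `budget`."""
    uncertain = [
        (abs(score - 0.5), item_id)
        for item_id, score in scores.items()
        if low <= score <= high
    ]
    uncertain.sort()  # smallest distance from the decision boundary first
    return [item_id for _, item_id in uncertain[:budget]]
```

Sorting by distance from 0.5 means that when the review queue is over capacity, the samples dropped are the ones the model is least unsure about.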

Design Breakdown

Requirements

Product Goal: Detect and act upon content violating community guidelines (NSFW, Hate Speech, Violence).

Success Metrics:

Online: Precision/Recall for each violation category, Time-to-Moderation (latency).

Offline: PR-AUC, F1-score, False Discovery Rate (FDR).

Guardrail: Over-moderation rate (blocking benign content).

System Constraints: 10k QPS, <200ms P99, 99.99% availability.

Data Availability: Historical labeled data from human moderators, third-party "toxic" datasets, and platform user reports.
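
A quick back-of-the-envelope check relates the 100M uploads/day figure to the 10k QPS constraint. The calculation below assumes the 10k figure is a peak provisioning target rather than an average, which the requirements do not state explicitly.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_qps(uploads_per_day: int) -> float:
    """Average request rate if uploads were spread evenly over the day."""
    return uploads_per_day / SECONDS_PER_DAY


def peak_to_average(peak_qps: float, uploads_per_day: int) -> float:
    """How many times the average rate the peak target represents."""
    return peak_qps / average_qps(uploads_per_day)


avg = average_qps(100_000_000)                    # about 1,157 requests per second
headroom = peak_to_average(10_000, 100_000_000)   # about 8.6x the average
```

Under that reading, provisioning for 10k QPS leaves roughly 8.6x headroom over the daily average for traffic spikes and viral events.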

ML Problem Framing

ML Task Type: Multi-label binary classification (one flag for Hate Speech, one for NSFW, etc.).

Prediction Target:

P(\text{violation}_i \mid \text{content}, \text{author}, \text{context}) for each category i

Inputs:

User Features: Account age, previous violations, karma/reputation.

Item Features: Text (raw string), Image (pixels), OCR text from images, Perceptual hashes.

Context Features: Community/Subreddit type (some words are okay in medical groups but not general ones).

Outputs: A vector of probabilities [p_1, p_2, ..., p_n] and a combined "Action" recommendation (Allow, Flag for Review, Auto-Block).

ML Challenges: Class imbalance (harmful content is <1% of total), adversarial evolution, and nuances of language (sarcasm vs. hate).
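
One simple way to turn the probability vector into the combined "Action" recommendation is a pair of thresholds per category. The sketch below is illustrative only: the threshold values and category keys are invented for the example, and real values would be tuned per category against the precision/recall targets.

```python
from typing import Dict, Tuple

# Hypothetical (review_threshold, block_threshold) per category.
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "hate_speech": (0.50, 0.95),
    "nsfw": (0.40, 0.90),
    "violence": (0.50, 0.95),
}


def recommend_action(probs: Dict[str, float]) -> str:
    """Map per-category probabilities to Allow, Flag for Review, or Auto-Block."""
    action = "Allow"
    for category, p in probs.items():
        review_t, block_t = THRESHOLDS[category]
        if p >= block_t:
            return "Auto-Block"  # any category over its block threshold wins
        if p >= review_t:
            action = "Flag for Review"
    return action
```

Lowering a review threshold raises recall for that category at the cost of more human review volume, while raising a block threshold protects precision on automated removals.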

Design Summary & MVP

Concise Summary: A tiered architecture using a Fast-Filter (Regex/Hashes) followed by a specialized ML scoring layer for text (DistilBERT) and images (ResNet-50), with an Active Learning loop for human moderation.

Model Architecture & Selection:

Baseline: Keyword regex and Perceptual Hash matching (e.g., pHash).

Target Model: Late-fusion of a LightGBM classifier (using user metadata + ML scores) and deep learning extractors.

Simplicity Audit: We avoid a massive unified Multi-modal Transformer (too slow/expensive for MVP) in favor of separate, optimized models for text and images.

Architecture Decision Rationale: This "Cascade" approach minimizes compute costs by dropping obvious safe/unsafe content early and only running expensive DL models on ambiguous cases.
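
The cascade can be sketched as a function that runs the cheap checks first and only calls the expensive model when they are inconclusive. Everything named here (`BLOCKED_HASHES`, `BLOCKED_PATTERNS`, the example rule, and the 0.50 / 0.95 cut-offs) is a placeholder for illustration. The sketch shows only the early exit for obviously unsafe content; an early exit for obviously safe content (for example, an allowlist) would follow the same pattern.

```python
import re
from typing import Callable, Set

BLOCKED_HASHES: Set[str] = {"deadbeefcafef00d"}  # hypothetical known-bad hashes
BLOCKED_PATTERNS = [re.compile(r"\bbuy\s+followers\b", re.I)]  # hypothetical rule


def moderate(text: str, content_hash: str, ml_score: Callable[[str], float]) -> str:
    """Run cheap checks first; call the model only if they do not decide the case."""
    if content_hash in BLOCKED_HASHES:
        return "Auto-Block"            # stage 1: exact hash match
    if any(p.search(text) for p in BLOCKED_PATTERNS):
        return "Auto-Block"            # stage 1: regex rule
    score = ml_score(text)             # stage 2: ML scoring (the expensive call)
    if score >= 0.95:
        return "Auto-Block"
    if score >= 0.50:
        return "Flag for Review"       # stage 3: human review queue
    return "Allow"
```

Because the model is passed in as a callable, the early-exit paths never pay its latency or compute cost.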

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Real-time HTTP uploads and user reports via Kafka.

Data Ingestion: Kafka handles spikes in traffic. For video, we chunk frames.

Data Storage: S3 for raw media; metadata in Snowflake/BigQuery for analysis. Retention policies are set to 30 days for "Safe" content and indefinitely for "Violation" content for legal/training.

Data Quality: PII stripping (GDPR compliance) before ingestion into the training pipeline.

Feature Pipeline

Text Features: Subword tokenization (WordPiece, which DistilBERT inherits from BERT) to handle rare and out-of-vocabulary words. TF-IDF for a fast baseline; DistilBERT embeddings for the deep model.

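Image Features: Perceptual Hashing (pHash) to detect near-duplicate images (resized, recompressed, or lightly edited copies); mirrored copies generally need the flipped image hashed as well, because a standard pHash is not flip-invariant. OCR to extract text from memes.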

User Features: Violation history, account age, and "Trust Score" (aggregated from past moderator approvals).

Feature Store: Redis-based online store for user metadata; S3 for offline historical features to ensure training/serving consistency.
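
Perceptual-hash matching is usually a distance check rather than an equality check: two hashes count as a match when only a few bits differ. The sketch below assumes 64-bit hashes stored as integers and a hypothetical `max_distance` of 8 bits; both are illustrative choices, and the hash computation itself is out of scope here.

```python
from typing import Iterable


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def matches_blocklist(candidate: int,
                      blocklist: Iterable[int],
                      max_distance: int = 8) -> bool:
    """True if the candidate is within `max_distance` bits of any blocklisted hash."""
    return any(hamming_distance(candidate, h) <= max_distance for h in blocklist)
```

The linear scan is only suitable for small blocklists. At the scale described here, some form of indexed nearest-neighbor lookup over the hash space would likely be needed to stay inside the latency budget.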

Model Architecture

Text Model: DistilBERT. Why? It's 40% smaller and 60% faster than BERT while retaining 97% of the performance. Perfect for <200ms SLAs.

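Image Model: ResNet-50 or MobileNetV3, whichever fits the latency budget; we prioritize inference speed. It outputs one sigmoid score per category (Adult, Violent, etc.) rather than a softmax, because the task is multi-label and a single image can violate several categories at once.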

Fusion Layer: A Gradient Boosted Decision Tree (LightGBM) that takes (Text Score, Image Score, User Trust Score) as inputs to make the final "Action" decision. Trees handle non-linear interactions between "New User" and "High Toxicity Score" very well.

Training Pipeline

Dataset Construction: Downsample the majority "Safe" class to reach a 1:10 violation-to-safe ratio for training.

Labeling: Human labels are the ground truth. We use "Majority Vote" (3 moderators) for ambiguous cases to reduce label noise.

Retraining Strategy: Weekly scheduled retraining to capture new slang. However, the "Fast Filter" (Regex/Hashes) can be updated in seconds via a Redis broadcast to all serving nodes for emergency blocking.
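
One consequence of downsampling the "Safe" class is that the trained model sees violations far more often than production does, so its raw scores overestimate the true violation probability. A standard prior correction maps a score back to the original distribution. In the sketch below, `keep_rate` is the fraction of "Safe" examples kept during downsampling; the function name and the example value are assumptions for illustration.

```python
def correct_for_downsampling(p_sampled: float, keep_rate: float) -> float:
    """Rescale a score from a model trained with only `keep_rate` of the negatives kept.

    Downsampling negatives multiplies the odds of a violation by 1 / keep_rate,
    so the correction multiplies the odds back by keep_rate.
    """
    if not 0.0 < keep_rate <= 1.0:
        raise ValueError("keep_rate must be in (0, 1]")
    return (keep_rate * p_sampled) / (keep_rate * p_sampled + (1.0 - p_sampled))
```

For example, with a keep rate of 0.1, a raw score of 0.5 corresponds to a corrected probability of about 0.09. Thresholds should therefore be tuned on corrected scores or on a holdout set with the original class balance.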

Serving Pipeline

Serving Pattern: Synchronous for text (blocking the post); Asynchronous for images/video (post is hidden or marked "Under Review" for a few seconds).

Latency Optimization: Model Quantization (INT8) using ONNX Runtime or TensorRT.

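Reliability: If the ML service fails, we fall back to the "Fast Filter" (regex and hash matching only) for most categories. It's better to be slightly less accurate than to stop all content flow; the exception is severe categories such as CSAM, which still fail closed.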

Evaluation Pipeline

Offline: We use a "Time-based Holdout" set (e.g., hold out the most recent week and train only on earlier data) so that future trends do not leak into training.

Online: "Shadow Mode" deployment—run the new model alongside the old one, compare decisions, but only act on the old model's output until metrics are validated.
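
Shadow-mode comparison can be sketched as a loop that enforces only the live model's decision while recording where the candidate model would have decided differently. The function and field names below are hypothetical, and both models are represented as plain callables returning an action string.

```python
from typing import Callable, Dict, List


def shadow_compare(items: List[str],
                   live: Callable[[str], str],
                   shadow: Callable[[str], str]) -> Dict[str, object]:
    """Act on the live model; keep the shadow model's decision for comparison only."""
    actions: Dict[str, str] = {}
    disagreements: List[str] = []
    for item in items:
        live_decision = live(item)
        shadow_decision = shadow(item)  # logged, never enforced
        actions[item] = live_decision
        if shadow_decision != live_decision:
            disagreements.append(item)
    rate = len(disagreements) / len(items) if items else 0.0
    return {
        "actions": actions,
        "disagreements": disagreements,
        "disagreement_rate": rate,
    }
```

The disagreement list is a natural candidate for human review, since those items show exactly where the new model would change behavior.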

Monitoring Pipeline

Precision/Recall Drift: If the % of content sent to humans suddenly doubles, it indicates either a "Viral Attack" or a "Model Drift."

Adversarial Drift: Monitor "Character Distribution" (e.g., sudden increase in special characters like h@te) to detect new obfuscation techniques.
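
Both monitoring signals above reduce to simple ratios compared against a baseline. The sketch below is a minimal version under stated assumptions: "special character" is taken to mean any non-space character that is neither a letter nor a digit, and the alert fires when the review rate reaches a hypothetical `factor` of 2x the baseline.

```python
from typing import Iterable


def special_char_ratio(texts: Iterable[str]) -> float:
    """Share of non-space characters that are neither letters nor digits."""
    total = special = 0
    for text in texts:
        for ch in text:
            if ch.isspace():
                continue
            total += 1
            if not ch.isalnum():
                special += 1
    return special / total if total else 0.0


def review_rate_alert(current_rate: float,
                      baseline_rate: float,
                      factor: float = 2.0) -> bool:
    """True when the share of content routed to humans reaches factor x baseline."""
    return baseline_rate > 0 and current_rate >= factor * baseline_rate
```

Note that this ratio catches symbol substitutions such as "h@te" but not digit substitutions such as "h4te", since digits count as alphanumeric; a digit-inside-word ratio would need to be tracked separately.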

Wrap Up

Final Evaluation

Observability: Use a "Confusion Matrix" dashboard updated daily from Human-in-the-loop results.

Feedback Loop: Content that moderators flagged incorrectly, as established through successful appeals, is automatically funneled back as "Hard Negatives" for the next training cycle.

Trade-offs: We chose a Multi-stage Cascade (Hash -> DL -> Human) to balance Cost vs. Accuracy. While a large Multimodal Transformer might be 2% more accurate, the latency and AWS bill would be 10x higher.