The Question
ML Design

Multimodal Firearm Detection System

Design a high-scale trust and safety system to detect firearm listings on a global marketplace. The system must process multimodal data (text and images) at the point of upload. Your design should address high-throughput data ingestion, low-latency multimodal inference (P99 < 500ms), strategies for handling extreme class imbalance and adversarial evasion, and a robust feedback loop for human-in-the-loop moderation and model retraining.
CLIP
DistilBERT
PyTorch
Kafka
Spark
Triton Inference Server
Redis
Perceptual Hashing
Active Learning
Questions & Insights

Clarifying Questions

Business Goal: Is the primary goal to block listings at upload time (preventive) or to flag them for human review (reactive)?
Assumption: Preventive blocking at upload time to ensure platform safety.
Constraints & Scale: What is the scale of daily uploads and the latency budget?
Assumption: 10M daily uploads, 100M total listings, with a P99 latency budget of < 500ms for the entire moderation check.
Scope of "Firearm": Does this include parts, ammunition, or toy/replica guns?
Assumption: Policy covers functional firearms, regulated parts, and realistic replicas. Airsoft/Paintball are allowed but must be tagged.
Data Freshness: How quickly must we adapt to new evasion tactics (e.g., "g*un" or obscured images)?
Assumption: Daily model retraining or active learning loops to catch emerging evasion patterns.

Thinking Process

Identify the Bottleneck: Processing high-resolution images and long descriptions for 10M uploads/day is computationally expensive. I need a cascaded approach: cheap heuristics/fingerprinting first, then heavy ML.
Multimodal Fusion: Firearms detection is inherently multimodal. A listing might have a generic title ("Sporting equipment") but a clear image of a rifle. I need to decide between early fusion (complex, high accuracy) vs. late fusion (simpler, faster for MVP).
Adversarial Nature: This is a cat-and-mouse game. Sellers use "leetspeak," obfuscated images, or background clutter. The system must be robust to noise.
Recall vs. Precision: False positives (blocking a legitimate toy) hurt the business; false negatives (allowing a real gun) create legal and safety risks. I will tune the model for high recall and route borderline scores to human-in-the-loop review.

Elite Bonus Points

Perceptual Hashing (pHash): Implementing a "Known Bad" image database to instantly block re-uploads of previously banned firearm photos without triggering heavy GPU inference.
Active Learning with Hard Negative Mining: Specifically sampling "near-misses" (e.g., power tools, airsoft, holsters) to retrain the model on the most difficult decision boundaries.
Multilingual Guardrails: Using a multilingual text encoder (like mBERT or XLM-R) to catch firearm listings in different languages or dialects without building separate per-language models.
Explainability (Grad-CAM): Providing internal moderators with heatmaps showing which part of the image triggered the firearm detection to speed up the manual review process.
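
The "Known Bad" lookup in the pHash bullet above reduces to a Hamming-distance comparison between fixed-width hashes. A minimal sketch, assuming the hashes are computed upstream as 64-bit integers; the function names, the sample hash values, and the 8-bit distance threshold are illustrative assumptions, not tuned values.

```python
from typing import Iterable


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def is_known_bad(candidate: int, known_bad: Iterable[int], max_distance: int = 8) -> bool:
    """Return True if candidate is within max_distance bits of any banned hash.

    A linear scan keeps the sketch short; a production index would need
    something sub-linear (for example multi-index hashing or a BK-tree).
    """
    return any(hamming_distance(candidate, h) <= max_distance for h in known_bad)


# Hypothetical hashes of previously banned images.
KNOWN_BAD = {0xF0F0F0F0F0F0F0F0, 0x123456789ABCDEF0}

# A re-upload whose hash differs in three bits (e.g. after light re-compression).
reupload = 0xF0F0F0F0F0F0F0F0 ^ 0b10101
# An unrelated image whose hash differs in many bits.
unrelated = 0x0F0F0F0F0F0F0F0F
```

With these values, `is_known_bad(reupload, KNOWN_BAD)` is true and `is_known_bad(unrelated, KNOWN_BAD)` is false, so only the unrelated image would proceed to GPU inference.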
Design Breakdown

Requirements

Product Goal: Automatically detect and prevent the listing of firearms and regulated weapons.
Success Metrics:
Online Metrics: Precision/Recall, False Positive Rate (FPR), Block Rate.
Offline Metrics: PR-AUC, F1-score on a curated "golden set" of policy-violating items.
Guardrail Metrics: P99 Latency < 500ms, GPU utilization efficiency.
System Constraints: 10M uploads/day (~115 QPS average, 500+ QPS peak), high availability (99.9%).
Data Availability: Historical moderated listings (labeled "Safe" vs. "Banned"), image catalogs, and text descriptions.
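
The average-QPS figure in the System Constraints line follows directly from the daily volume, and the stated peak implies a peak-to-average ratio of a little over 4x. A quick check using only the numbers given above (the variable names are ours):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

daily_uploads = 10_000_000
peak_qps = 500  # lower bound on peak load from the constraints

avg_qps = daily_uploads / SECONDS_PER_DAY
peak_to_avg = peak_qps / avg_qps

print(f"average QPS: {avg_qps:.1f}")
print(f"peak-to-average ratio: {peak_to_avg:.2f}")
```

Capacity planning for Stage 2 should therefore be sized against the peak figure, not the average.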

ML Problem Framing

ML Task Type: Multimodal Binary Classification.
Prediction Target: P(ViolatesPolicy | Image, Text, Context).
Inputs:
User Features: Seller history, account age, previous violations.
Item Features: Title, description text, image pixels, price (outliers often indicate illicit sales).
Context Features: Category (e.g., "Home & Garden" vs. "Sporting Goods"), location.
Outputs: Probability score [0, 1] and a set of "reason codes."
ML Challenges: Extreme class imbalance (most listings are safe), adversarial evasion, and high cost of false positives.

Design Summary & MVP

Concise Summary: A two-stage cascaded system. Stage 1 uses fast text-matching and image hashing. Stage 2 uses a late-fusion multimodal model combining a CLIP-based image encoder and a BERT-based text encoder.
Model Architecture & Selection:
Baseline Model: Keyword regex + Image hashing.
Target Model: Two-Tower architecture (Vision Transformer for images, DistilBERT for text) with an MLP (Multi-Layer Perceptron) head for final classification.
Choice Rationale: Late fusion allows for parallel processing of text and image and is easier to debug and scale than early-fusion transformers.
ML Life Cycle Summary: Logs flow to S3 -> Spark for cleaning -> Multimodal training (PyTorch) -> Model Registry -> Triton Inference Server -> Monitoring.
Simplicity Audit: Avoids complex Graph Neural Networks or 3D vision. Uses pre-trained backbones (Transfer Learning) to minimize the need for massive labeled datasets.
Architecture Decision Rationale:
Why this architecture?: It balances the need for semantic understanding of text with the high-dimensional pattern recognition needed for images.
Requirement Satisfaction: Cascading resolves clear-cut cases in the cheap first stage, which keeps latency low, while the deep model provides the accuracy needed for ambiguous cases.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Real-time listing events (JSON) and binary image data from the Marketplace API.
Data Ingestion: Kafka for high-throughput streaming. We use a "Lambda Architecture" where data is both archived in S3 for training and passed to the inference service.
Data Storage: S3 for the Data Lake (Parquet format for text/metadata, sharded blobs for images).
Data Processing: Spark jobs handle the heavy lifting: resizing images, removing EXIF metadata (privacy), and joining text descriptions with historical labels.
Data Quality: De-duplication is critical. If a seller uploads the same photo 10 times, we should process it once.
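
The de-duplication step above can be sketched with a content digest. This is a minimal in-memory version: `UploadDeduplicator` is a hypothetical name, and a real deployment would presumably keep the seen-set in a shared store rather than in process memory. It only catches byte-identical files; re-encoded or resized copies need a perceptual hash instead.

```python
import hashlib


class UploadDeduplicator:
    """Tracks content digests so identical image bytes are processed once."""

    def __init__(self):
        self._seen = set()

    def is_new(self, image_bytes: bytes) -> bool:
        """Return True the first time these exact bytes are seen."""
        digest = hashlib.sha256(image_bytes).hexdigest()
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True


dedup = UploadDeduplicator()
uploads = [b"photo-A", b"photo-A", b"photo-B", b"photo-A"]
to_process = [u for u in uploads if dedup.is_new(u)]
```

Here `to_process` keeps one copy each of `photo-A` and `photo-B`, so the repeated uploads never reach the resize and inference steps.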

Feature Pipeline

Text Features: Tokenization using WordPiece. We compute TF-IDF for the baseline and dense embeddings (768-dim) for the ML model.
Image Features: Resizing to 224x224. We use pHash (Perceptual Hashing) for the "Fast Filter" stage.
Online Pipeline: The Serving Service pulls user reputation scores from a low-latency Redis-based feature store.
Training/Serving Skew: We use a unified library for text normalization (lowercase, stripping accents) to ensure the model sees the same format during training and inference.
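
A shared normalization function of the kind described under Training/Serving Skew might look like the following sketch. It covers the two steps named above (lowercasing and accent stripping); the whitespace collapsing is an extra assumption of ours, and `normalize_text` is a hypothetical name. The point is that both the Spark training job and the online service import the same function.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Lowercase, strip accents, and collapse runs of whitespace."""
    # NFKD splits an accented character into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_marks.lower().split())


example = normalize_text("  Café   RIFLE\n")
```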

Model Architecture

Problem Formulation: Binary classification (Is_Firearm: 0 or 1).
Candidate Models:
Image: ResNet-50 (Fast but old), ViT (High accuracy, heavy).
Text: FastText (Cheap), BERT (Heavy).
MVP Architecture:
Vision Tower: Pre-trained CLIP (ViT-B/32) image encoder. CLIP is a good fit because its image embeddings are already trained to align with natural-language descriptions.
Text Tower: DistilBERT for a good balance of semantic depth and latency.
Fusion: Concatenate the output vectors from both towers and pass through 2 Dense layers with Dropout.
Optimization: Quantization (INT8) for the DistilBERT model to fit within the 500ms latency budget on CPU if GPUs are saturated.
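
To make the Fusion step concrete, here is a framework-free sketch of the forward pass: concatenate the two tower outputs, apply a dense layer with ReLU, then a dense layer with a sigmoid. The dimensions, the ReLU activation, and the random weights are toy assumptions for illustration; the real head would be trained in PyTorch on the actual encoder output sizes. Dropout is left out because it acts as the identity at inference time.

```python
import math
import random


def dense(x, weights, bias):
    """Affine layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(rng, n_out, n_in):
    """Small random weights and zero biases (untrained placeholders)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def fusion_head(image_emb, text_emb, layer1, layer2):
    """Late fusion: concatenate tower outputs, then two dense layers."""
    fused = image_emb + text_emb           # list concatenation
    hidden = relu(dense(fused, *layer1))
    (logit,) = dense(hidden, *layer2)      # single output unit
    return sigmoid(logit)


rng = random.Random(0)
IMG_DIM, TXT_DIM, HIDDEN = 8, 6, 4         # toy sizes, not the real embedding widths
layer1 = init_layer(rng, HIDDEN, IMG_DIM + TXT_DIM)
layer2 = init_layer(rng, 1, HIDDEN)

image_emb = [rng.gauss(0, 1) for _ in range(IMG_DIM)]
text_emb = [rng.gauss(0, 1) for _ in range(TXT_DIM)]
score = fusion_head(image_emb, text_emb, layer1, layer2)
```

Because the towers only meet at the concatenation, the image and text encoders can run in parallel and be swapped or re-quantized independently, which is the main operational argument for late fusion.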

Training Pipeline

Dataset Construction: We face a 1:1000 class imbalance. We use Downsampling of the majority class (Safe listings) and Image Augmentation (rotation, brightness, blur) for the minority class (Firearms).
Data Splitting: Time-based split. We train on months 1-5 and test on month 6 to simulate "new" evasion tactics appearing.
Training Infrastructure: Distributed training using PyTorch DDP on a cluster of A100s.
Retraining Strategy: Weekly scheduled retraining, or an ad-hoc trigger if the Monitoring Pipeline detects a >10% drop in Recall (suggesting a new evasion campaign).
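
The majority-class downsampling from Dataset Construction can be sketched as follows. The function name and the target ratio of 10 negatives per positive are assumptions chosen for illustration; the source does not state the ratio actually used.

```python
import random


def downsample_negatives(examples, neg_per_pos, seed=0):
    """Keep all positives and at most neg_per_pos negatives per positive.

    `examples` is a list of (item_id, label) pairs, with label 1 = banned.
    """
    positives = [e for e in examples if e[1] == 1]
    negatives = [e for e in examples if e[1] == 0]
    keep = min(len(negatives), neg_per_pos * len(positives))
    rng = random.Random(seed)              # seeded for reproducible datasets
    mixed = positives + rng.sample(negatives, keep)
    rng.shuffle(mixed)
    return mixed


# Toy data at the 1:1000 ratio from the text: 5 positives, 5,000 negatives.
data = [(i, 1) for i in range(5)] + [(i, 0) for i in range(5, 5005)]
train = downsample_negatives(data, neg_per_pos=10)
```

One caveat worth keeping in mind: training on a downsampled set shifts the base rate the model sees, so its raw scores will tend to overstate the violation probability on production traffic and the decision threshold should be chosen on data with the true class ratio.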

Serving Pipeline

Serving Pattern: Request-Response for blocking at upload.
Latency Optimization:
Stage 1 (Fast): Redis lookup for pHash and Regex. (10ms)
Stage 2 (ML): Triton Inference Server with dynamic batching. (200-300ms)
Reliability: If the ML service times out or fails, the system defaults to "Allow" and flags the listing for high-priority asynchronous manual review. This is fail-open to protect the seller experience, with the review queue acting as the safety net.
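
The two-stage flow and the fallback behaviour described above can be sketched as a single decision function. `Decision`, `moderate`, the callable parameters, and the 0.9 block threshold are all hypothetical; the stage functions are passed in so that the sketch stays independent of Redis and Triton client details.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Decision:
    action: str          # "block" or "allow"
    stage: str           # which stage produced the decision
    needs_review: bool   # queue for asynchronous manual review


def moderate(
    listing: dict,
    fast_filter: Callable[[dict], bool],
    ml_score: Callable[[dict], float],
    threshold: float = 0.9,
) -> Decision:
    # Stage 1: cheap pHash/regex check; a hit blocks without ML inference.
    if fast_filter(listing):
        return Decision("block", "fast_filter", needs_review=False)
    # Stage 2: multimodal model; any failure falls back to allow-and-review.
    try:
        score = ml_score(listing)
    except Exception:  # includes a TimeoutError raised by the model client
        return Decision("allow", "fail_open", needs_review=True)
    if score >= threshold:
        return Decision("block", "ml", needs_review=False)
    return Decision("allow", "ml", needs_review=False)
```

A fuller version would add a middle score band that allows or holds the listing while routing it to moderators, matching the human-in-the-loop handling of borderline cases.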

Evaluation Pipeline

Offline: We track PR-AUC. Because a False Positive is costly (frustrated sellers), we select the operating threshold that maximizes Recall subject to Precision of at least 95%.
Online: A/B testing the new model vs. the baseline. We measure "Appeal Rate" (how many users complain their listing was wrongly blocked) as a proxy for False Positives.
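
Choosing the offline operating point against the 95% precision floor can be done with a single sweep over a scored validation set. A minimal sketch, assuming binary labels (1 = violation) and that the lowest qualifying threshold is preferred because it gives the highest recall; `pick_threshold` and the toy data are illustrative.

```python
def pick_threshold(scores, labels, min_precision=0.95):
    """Return (threshold, precision, recall) for the lowest threshold whose
    precision is at least min_precision, or None if no threshold qualifies.

    Items with score >= threshold are predicted positive.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        return None
    ranked = sorted(zip(scores, labels), reverse=True)
    best = None
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Only evaluate at the last item of a run of tied scores.
        if i + 1 < len(ranked) and ranked[i + 1][0] == score:
            continue
        precision = tp / (tp + fp)
        if precision >= min_precision:
            best = (score, precision, tp / total_pos)
    return best


val_scores = [0.99, 0.97, 0.90, 0.80, 0.60, 0.30]
val_labels = [1, 1, 1, 0, 1, 0]
operating_point = pick_threshold(val_scores, val_labels)
```

On this toy set the sweep stops lowering the threshold at 0.90: going further would admit the false positive scored 0.80 and drop precision below the floor, at the cost of missing the positive scored 0.60.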

Monitoring Pipeline

Data Monitoring: Check if the average length of descriptions or the distribution of image aspect ratios changes (could indicate a pipeline bug).
Model Monitoring: Prediction Drift. If the model suddenly starts flagging 50% of listings as firearms, we trigger a circuit breaker.
Feedback Loop: When a human moderator reverses a model's decision, that case is automatically added to a hard-examples dataset for the next training cycle: overturned blocks become "Hard Negatives," and missed firearms become hard positives.
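
The circuit breaker from Model Monitoring can be sketched as a sliding-window check on the flag rate. The class name and window size are assumptions; the 50% trip level comes from the example above.

```python
from collections import deque


class FlagRateBreaker:
    """Trips when the share of flagged listings in a sliding window is too high."""

    def __init__(self, window: int = 1000, max_flag_rate: float = 0.5):
        self.recent = deque(maxlen=window)
        self.max_flag_rate = max_flag_rate

    def record(self, flagged: bool) -> bool:
        """Record one prediction; return True if the breaker is tripped."""
        self.recent.append(1 if flagged else 0)
        # Wait for a full window so a few early flags cannot trip the breaker.
        if len(self.recent) < self.recent.maxlen:
            return False
        return sum(self.recent) / len(self.recent) >= self.max_flag_rate


breaker = FlagRateBreaker(window=4, max_flag_rate=0.5)
states = [breaker.record(f) for f in [True, True, True, False, False, False]]
```

When the breaker trips, a reasonable response is to stop auto-blocking from the model and fall back to the Stage 1 filter plus manual review until the cause is understood.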
Wrap Up

Final Evaluation

Observability: Real-time dashboards showing the "Firearm Detection Rate" by category and region.
Edge Cases:
Contextual Confusion: A person selling a "Water Gun" or a "LEGO Star Wars Blaster." We solve this by training specifically on "Toy" labels.
Adversarial Text: Sellers writing "G.U.N" or using images with text overlays.
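
For the text side of the Adversarial Text case, one inexpensive mitigation is to de-obfuscate before keyword matching. The sketch below is an assumption of ours, not something the design above specifies: the substitution table and keyword list are hypothetical, and text overlays in images would need a separate OCR step that is not shown.

```python
import re

# Hypothetical substitution table; a real one would be curated from moderation data.
LEET_MAP = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s"}
)
# Hypothetical keyword list for illustration only.
KEYWORDS = ("gun", "rifle", "pistol")


def deobfuscate(text: str) -> str:
    """Map leetspeak characters to letters and drop separators inside words."""
    text = text.lower().translate(LEET_MAP)
    # Remove symbols placed between letters, e.g. "g.u.n" or "g*un".
    return re.sub(r"(?<=[a-z])[^a-z\s]+(?=[a-z])", "", text)


def matched_keywords(text: str) -> list:
    """Return the keywords found as whole words after de-obfuscation."""
    cleaned = deobfuscate(text)
    return [kw for kw in KEYWORDS if re.search(rf"\b{kw}s?\b", cleaned)]
```

This rewriting is lossy (prices such as "$500" get mangled, and "water gun" still matches), so the output is best treated as a Stage 1 signal or an extra model feature rather than as the text fed to DistilBERT or as a verdict on its own.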
Trade-offs:
Accuracy vs. Latency: We chose DistilBERT over BERT-Large to save 150ms of inference time.
Complexity vs. Maintainability: Using pre-trained CLIP embeddings reduces the need for custom vision architecture maintenance.