The Question
ML Design: Multimodal Firearm Detection System
Design a high-scale trust and safety system to detect firearm listings on a global marketplace. The system must process multimodal data (text and images) at the point of upload. Your design should address high-throughput data ingestion, low-latency multimodal inference (P99 < 500ms), strategies for handling extreme class imbalance and adversarial evasion, and a robust feedback loop for human-in-the-loop moderation and model retraining.
CLIP
DistilBERT
PyTorch
Kafka
Spark
Triton Inference Server
Redis
Perceptual Hashing
Active Learning
Questions & Insights
Clarifying Questions
Business Goal: Is the primary goal to block listings at upload time (preventive) or to flag them for human review (reactive)?
Assumption: Preventive blocking at upload time to ensure platform safety.
Constraints & Scale: What is the scale of daily uploads and the latency budget?
Assumption: 10M daily uploads, 100M total listings, with a P99 latency budget of < 500ms for the entire moderation check.
Scope of "Firearm": Does this include parts, ammunition, or toy/replica guns?
Assumption: Policy covers functional firearms, regulated parts, and realistic replicas. Airsoft/Paintball are allowed but must be tagged.
Data Freshness: How quickly must we adapt to new evasion tactics (e.g., "g*un" or obscured images)?
Assumption: Daily model retraining or active learning loops to catch emerging evasion patterns.
Thinking Process
Identify the Bottleneck: Processing high-resolution images and long descriptions for 10M uploads/day is computationally expensive. I need a cascaded approach: cheap heuristics/fingerprinting first, then heavy ML.
Multimodal Fusion: Firearms detection is inherently multimodal. A listing might have a generic title ("Sporting equipment") but a clear image of a rifle. I need to decide between early fusion (complex, high accuracy) vs. late fusion (simpler, faster for MVP).
Adversarial Nature: This is a cat-and-mouse game. Sellers use "leetspeak," obfuscated images, or background clutter. The system must be robust to noise.
Recall vs. Precision: False positives (blocking a legitimate toy) hurt the business; false negatives (allowing a real gun) create legal and safety risks. I will tune the model for high recall and send borderline scores to human-in-the-loop review instead of blocking them automatically.
Elite Bonus Points
Perceptual Hashing (pHash): Implementing a "Known Bad" image database to instantly block re-uploads of previously banned firearm photos without triggering heavy GPU inference.
Active Learning with Hard Negative Mining: Specifically sampling "near-misses" (e.g., power tools, airsoft, holsters) to retrain the model on the most difficult decision boundaries.
Multilingual Guardrails: Using a multilingual encoder (such as mBERT or XLM-R) to catch firearm listings across languages and dialects without building a separate model per language.
Explainability (Grad-CAM): Providing internal moderators with heatmaps showing which part of the image triggered the firearm detection to speed up the manual review process.
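One way to picture the hard-negative mining idea above is uncertainty sampling: send moderators the listings whose scores sit closest to the decision threshold. The sketch below is illustrative only; the function name, the 0.8 threshold, and the labeling budget are assumptions, not part of the design.

```python
def select_for_labeling(scored_listings, threshold=0.8, budget=3):
    """Return the `budget` listing IDs whose scores are closest to the threshold.

    scored_listings: iterable of (listing_id, score) pairs, score in [0, 1].
    """
    ranked = sorted(scored_listings, key=lambda pair: abs(pair[1] - threshold))
    return [listing_id for listing_id, _ in ranked[:budget]]


# Toy batch: near-misses (drill, airsoft, holster) sit closer to the
# threshold than the obvious cases (sofa, rifle).
batch = [("drill", 0.78), ("sofa", 0.02), ("airsoft", 0.83),
         ("holster", 0.70), ("rifle", 0.99)]
queue = select_for_labeling(batch, threshold=0.8, budget=3)
```

The obvious cases never reach the labeling queue, so moderator time goes to the examples that should move the decision boundary the most.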
Design Breakdown
Requirements
Product Goal: Automatically detect and prevent the listing of firearms and regulated weapons.
Success Metrics:
Online Metrics: Precision/Recall, False Positive Rate (FPR), Block Rate.
Offline Metrics: PR-AUC, F1-score on a curated "golden set" of policy-violating items.
Guardrail Metrics: P99 Latency < 500ms, GPU utilization efficiency.
System Constraints: 10M uploads/day (~115 QPS average, 500+ QPS peak), high availability (99.9%).
Data Availability: Historical moderated listings (labeled "Safe" vs. "Banned"), image catalogs, and text descriptions.
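The throughput figures in the constraints follow from simple arithmetic. A quick check, using only the numbers stated above (the variable names are mine):

```python
SECONDS_PER_DAY = 24 * 60 * 60      # 86,400 seconds
uploads_per_day = 10_000_000        # stated daily upload volume
peak_qps = 500                      # stated peak figure

avg_qps = uploads_per_day / SECONDS_PER_DAY
peak_to_avg = peak_qps / avg_qps

print(f"average QPS ~ {avg_qps:.1f}")
print(f"peak/average ~ {peak_to_avg:.1f}x")
```

The average lands just under 116 QPS, so the 500+ QPS peak implies provisioning for a bit more than four times the average load.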
ML Problem Framing
ML Task Type: Multimodal Binary Classification.
Prediction Target: P(\text{ViolatesPolicy} \mid \text{Image}, \text{Text}, \text{Context}).
Inputs:
User Features: Seller history, account age, previous violations.
Item Features: Title, description text, image pixels, price (outliers often indicate illicit sales).
Context Features: Category (e.g., "Home & Garden" vs. "Sporting Goods"), location.
Outputs: Probability score [0, 1] and a set of "reason codes."
ML Challenges: Extreme class imbalance (most listings are safe), adversarial evasion, and high cost of false positives.
Design Summary & MVP
Concise Summary: A two-stage cascaded system. Stage 1 uses fast text-matching and image hashing. Stage 2 uses a late-fusion multimodal model combining a CLIP-based image encoder and a BERT-based text encoder.
Model Architecture & Selection:
Baseline Model: Keyword regex + Image hashing.
Target Model: Two-Tower architecture (Vision Transformer for images, DistilBERT for text) with an MLP (Multi-Layer Perceptron) head for final classification.
Choice Rationale: Late fusion allows for parallel processing of text and image and is easier to debug and scale than early-fusion transformers.
ML Life Cycle Summary: Logs flow to S3 -> Spark for cleaning -> Multimodal training (PyTorch) -> Model Registry -> Triton Inference Server -> Monitoring.
Simplicity Audit: Avoids complex Graph Neural Networks or 3D vision. Uses pre-trained backbones (Transfer Learning) to minimize the need for massive labeled datasets.
Architecture Decision Rationale:
Why this architecture?: It balances the need for semantic understanding of text with the high-dimensional pattern recognition needed for images.
Requirement Satisfaction: Cascading keeps latency low for clear-cut items that Stage 1 can resolve on its own, while the deep model provides high accuracy for ambiguous cases.
System Architecture
Pipeline Deep Dive
Data Pipeline
Data Source: Real-time listing events (JSON) and binary image data from the Marketplace API.
Data Ingestion: Kafka for high-throughput streaming. We use a "Lambda Architecture" where data is both archived in S3 for training and passed to the inference service.
Data Storage: S3 for the Data Lake (parquet format for text/metadata, sharded blobs for images).
Data Processing: Spark jobs handle the heavy lifting: resizing images, removing EXIF metadata (privacy), and joining text descriptions with historical labels.
Data Quality: De-duplication is critical. If a seller uploads the same photo 10 times, we should process it once.
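The de-duplication point can be sketched with an exact content hash: identical bytes map to the same key, so the expensive check runs once. This is a minimal in-memory illustration; `VerdictCache` and `fake_classify` are hypothetical names, and a production system would presumably back this with a shared store such as Redis.

```python
import hashlib


class VerdictCache:
    """In-memory stand-in for a shared cache keyed by image content hash."""

    def __init__(self):
        self._verdicts = {}

    def check(self, image_bytes, classify):
        # SHA-256 of the raw bytes: only byte-identical uploads collide.
        digest = hashlib.sha256(image_bytes).hexdigest()
        if digest not in self._verdicts:
            self._verdicts[digest] = classify(image_bytes)
        return self._verdicts[digest]


calls = []


def fake_classify(image_bytes):
    """Placeholder for the real (expensive) moderation check."""
    calls.append(1)
    return "safe"


cache = VerdictCache()
for _ in range(10):                      # same photo uploaded 10 times
    verdict = cache.check(b"same-photo-bytes", fake_classify)
```

Exact hashing only catches byte-identical re-uploads; resized or re-encoded copies need a perceptual hash instead.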
Feature Pipeline
Text Features: Tokenization using WordPiece. We compute TF-IDF for the baseline and dense embeddings (768-dim) for the ML model.
Image Features: Resizing to 224x224. We use pHash (Perceptual Hashing) for the "Fast Filter" stage.
Online Pipeline: The Serving Service pulls user reputation scores from a low-latency Redis-based feature store.
Training/Serving Skew: We use a unified library for text normalization (lowercase, stripping accents) to ensure the model sees the same format during training and inference.
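To make the pHash step concrete, here is a simplified, dependency-free sketch of a DCT-based perceptual hash. It assumes the image has already been converted to grayscale and resized to a small square matrix (32x32 here) by an upstream image library. The helper names and the synthetic test image are my own, and a real deployment would more likely use an established hashing library than this hand-rolled version.

```python
import math


def dct_1d(values):
    """Unnormalized DCT-II of a 1-D sequence."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i, v in enumerate(values))
        for k in range(n)
    ]


def phash(gray, hash_size=8):
    """64-bit perceptual hash of a square grayscale matrix (e.g. 32x32)."""
    rows = [dct_1d(row) for row in gray]                # DCT along each row
    cols = [dct_1d(list(col)) for col in zip(*rows)]    # then along each column
    # Keep the low-frequency block; index 0 is the DC (average brightness) term.
    low = [cols[u][v] for u in range(hash_size) for v in range(hash_size)]
    ac = sorted(low[1:])
    median = ac[len(ac) // 2]                           # median of AC terms only
    bits = 0
    for coeff in low:
        bits = (bits << 1) | int(coeff > median)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


# Synthetic 32x32 "image", a uniformly brightened copy, and an inverted copy.
image = [[(3 * x * x + 7 * y + x * y) % 256 for x in range(32)] for y in range(32)]
brighter = [[p + 10 for p in row] for row in image]
inverted = [[255 - p for p in row] for row in image]

near = hamming(phash(image), phash(brighter))   # small: same content
far = hamming(phash(image), phash(inverted))    # large: different structure
```

A "Known Bad" lookup would then treat any upload within a small Hamming distance of a banned hash as a match. The exact distance cutoff is a tuning decision not specified here.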
Model Architecture
Problem Formulation: Binary classification (Is_Firearm: 0 or 1).
Candidate Models:
Image: ResNet-50 (Fast but old), ViT (High accuracy, heavy).
Text: FastText (Cheap), BERT (Heavy).
MVP Architecture:
Vision Tower: Pre-trained CLIP (ViT-B/32) image encoder. CLIP is a good fit because its image embeddings were trained to align with natural-language captions.
Text Tower: DistilBERT for a good balance of semantic depth and latency.
Fusion: Concatenate the output vectors from both towers and pass through 2 Dense layers with Dropout.
Optimization: Quantization (INT8) for the DistilBERT model to fit within the 500ms latency budget on CPU if GPUs are saturated.
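The fusion step can be written out as a short formula. The symbols are my own notation, and the ReLU activation is an assumption, since the design only specifies "2 Dense layers with Dropout":

```latex
\begin{aligned}
z &= [\, e_{\text{img}} \,;\, e_{\text{txt}} \,] \in \mathbb{R}^{d_{\text{img}} + d_{\text{txt}}} \\
h &= \mathrm{Dropout}\big(\mathrm{ReLU}(W_1 z + b_1)\big) \\
\hat{p} &= \sigma\big(w_2^{\top} h + b_2\big)
\end{aligned}
```

Here e_img and e_txt are the outputs of the vision and text towers, z is their concatenation, and p-hat is the violation probability. Because the towers only meet at z, they can run in parallel and be swapped or quantized independently.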
Training Pipeline
Dataset Construction: We face a 1:1000 class imbalance. We use Downsampling of the majority class (Safe listings) and Image Augmentation (rotation, brightness, blur) for the minority class (Firearms).
Data Splitting: Time-based split. We train on months 1-5 and test on month 6 to simulate "new" evasion tactics appearing.
Training Infrastructure: Distributed training using PyTorch DDP on a cluster of A100s.
Retraining Strategy: Weekly scheduled retraining, or an ad-hoc trigger if the Monitoring Pipeline detects a >10% drop in Recall (suggesting a new evasion campaign).
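The downsampling step in Dataset Construction can be sketched as follows. One caveat worth making explicit: a model trained on downsampled negatives tends to overestimate the violation probability, so if calibrated scores are needed they can be mapped back to the original prior. The sketch below is my own illustration; the 10-negatives-per-positive ratio and the function names are assumptions, and the correction formula assumes negatives were sampled uniformly at random.

```python
import random


def downsample_negatives(positives, negatives, neg_per_pos=10, seed=0):
    """Keep every positive; sample a fixed number of negatives per positive."""
    rng = random.Random(seed)
    k = min(len(negatives), neg_per_pos * len(positives))
    kept = rng.sample(negatives, k)
    beta = k / len(negatives)            # fraction of negatives retained
    return positives + kept, beta


def correct_score(p_sampled, beta):
    """Map a probability learned on downsampled data back to the true prior.

    Dropping negatives inflates the odds by 1/beta, so we scale them back.
    """
    return beta * p_sampled / (beta * p_sampled - p_sampled + 1)


positives = [("firearm", 1)] * 10
negatives = [("safe", 0)] * 10_000       # 1:1000 imbalance, as in the text
train_set, beta = downsample_negatives(positives, negatives)

raw = 0.5                                # score from the downsampled model
calibrated = correct_score(raw, beta)    # roughly 0.0099 at beta = 0.01
```

If the decision threshold is instead chosen on a validation set that keeps the real class ratio, this correction is not strictly required.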
Serving Pipeline
Serving Pattern: Request-Response for blocking at upload.
Latency Optimization:
Stage 1 (Fast): Redis lookup for known-bad pHash values, plus keyword regex matching on the text. (10ms)
Stage 2 (ML): Triton Inference Server with dynamic batching. (200-300ms)
Reliability: If the ML service times out or fails, the system defaults to "Allow" but flags the listing for high-priority asynchronous manual review (fail-open for UX, with manual review as the safety net).
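The cascade and the fail-open rule can be sketched together. Everything named here is hypothetical: the keyword list, the 0.9 block threshold, the 300ms deadline, and the decision labels. Treating a keyword hit as an immediate block is also a simplification; a real deployment might route keyword hits to Stage 2 instead.

```python
import re
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

KEYWORD_RE = re.compile(r"\b(rifle|pistol|handgun)\b", re.IGNORECASE)  # hypothetical
_executor = ThreadPoolExecutor(max_workers=4)


def moderate(listing, known_bad_hashes, score_fn, block_at=0.9, timeout_s=0.3):
    """Return (decision, reason). Stage 1 is cheap; Stage 2 calls the model."""
    # Stage 1: known-bad image hash or explicit keyword.
    if listing["image_hash"] in known_bad_hashes:
        return "BLOCK", "known_bad_image"
    if KEYWORD_RE.search(listing["text"]):
        return "BLOCK", "keyword_match"
    # Stage 2: model score with a hard deadline.
    future = _executor.submit(score_fn, listing)
    try:
        score = future.result(timeout=timeout_s)
    except FutureTimeout:
        # Fail-open: let the upload through, queue it for manual review.
        return "ALLOW_PENDING_REVIEW", "model_timeout"
    except Exception:
        return "ALLOW_PENDING_REVIEW", "model_error"
    if score >= block_at:
        return "BLOCK", "model_score"
    return "ALLOW", "model_score"


def slow_scorer(listing):
    """Simulates a saturated inference service."""
    time.sleep(0.2)
    return 0.99


result = moderate({"image_hash": "x", "text": "sporting equipment"},
                  set(), slow_scorer, timeout_s=0.05)
```

In the usage line the scorer would have returned 0.99, but it misses the deadline, so the listing is allowed and queued for review. That is the cost of failing open.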
Evaluation Pipeline
Offline: We track PR-AUC. Since the cost of a False Positive is high (frustrated sellers), we pick the lowest threshold at which Precision is still at least 95%, which maximizes Recall under that constraint.
Online: A/B testing the new model vs. the baseline. We measure "Appeal Rate" (how many users complain their listing was wrongly blocked) as a proxy for False Positives.
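Choosing an operating point under a precision floor is a small sweep over the sorted validation scores. The function below is a sketch of that idea, not the production evaluator. The toy data uses a 0.75 floor so the example stays small; the design calls for 0.95.

```python
def pick_threshold(scores, labels, min_precision=0.95):
    """Lowest threshold whose precision meets the floor.

    Returns (threshold, precision, recall), or None if no threshold qualifies.
    Listings with score >= threshold are treated as predicted positives.
    """
    pairs = sorted(zip(scores, labels), reverse=True)
    total_pos = sum(labels)
    best = None
    tp = fp = 0
    for i, (score, label) in enumerate(pairs):
        tp += label
        fp += 1 - label
        # Only evaluate once all items sharing this score are counted.
        if i + 1 < len(pairs) and pairs[i + 1][0] == score:
            continue
        precision = tp / (tp + fp)
        if precision >= min_precision:
            # Recall never decreases as the threshold drops, so the last
            # qualifying threshold is also the one with the highest recall.
            best = (score, precision, tp / total_pos)
    return best


scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 1, 0, 1, 0]
operating_point = pick_threshold(scores, labels, min_precision=0.75)
```

On the toy data this returns threshold 0.6 with precision 0.75 and recall 1.0. Raising the floor to 0.95 would push the threshold up to 0.8 and drop recall to two thirds.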
Monitoring Pipeline
Data Monitoring: Check if the average length of descriptions or the distribution of image aspect ratios changes (could indicate a pipeline bug).
Model Monitoring: Prediction Drift. If the model suddenly starts flagging 50% of listings as firearms, we trigger a circuit breaker.
Feedback Loop: When a human moderator reverses a model's decision, that case is automatically added to the hard-example dataset for the next training cycle (hard negatives for overturned blocks, hard positives for missed firearms).
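The prediction-drift circuit breaker can be sketched as a sliding-window flag rate. The class name, window size, and minimum sample count are assumptions; the 50% trip level mirrors the example above.

```python
from collections import deque


class FlagRateBreaker:
    """Trips when the share of flagged listings in a sliding window is too high."""

    def __init__(self, window=1000, max_rate=0.5, min_samples=100):
        self.recent = deque(maxlen=window)   # oldest predictions fall off
        self.max_rate = max_rate
        self.min_samples = min_samples

    def record(self, flagged):
        """Record one prediction; return True if the breaker should trip."""
        self.recent.append(bool(flagged))
        if len(self.recent) < self.min_samples:
            return False                     # not enough data to judge
        return sum(self.recent) / len(self.recent) > self.max_rate


breaker = FlagRateBreaker(window=10, max_rate=0.5, min_samples=5)
stream = [False] * 4 + [True] * 6            # sudden burst of flags
trips = [breaker.record(flagged) for flagged in stream]
```

What happens after a trip (falling back to the Stage 1 baseline, paging on-call, pausing auto-blocks) is a policy choice the design leaves open.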
Wrap Up
Final Evaluation
Observability: Real-time dashboards showing the "Firearm Detection Rate" by category and region.
Edge Cases:
Contextual Confusion: A person selling a "Water Gun" or a "LEGO Star Wars Blaster." We address this by including explicitly labeled toy examples as hard negatives during training.
Adversarial Text: Sellers writing "G.U.N" or using images with text overlays.
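For the adversarial-text case, a shared normalization step can undo the most common obfuscations before any keyword or model check. The sketch below handles separators ("G.U.N", "g*un"), accents, and a small leetspeak map. The map, the keyword set, and the function names are illustrative assumptions, and the approach is ASCII-only, so it does not cover non-Latin scripts or text embedded in images.

```python
import re
import unicodedata

# Hypothetical leetspeak map; only applied to characters sandwiched
# between letters so that model numbers like "AR-15" keep their digits.
LEET = {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t",
        "@": "a", "$": "s"}
LEET_RE = re.compile(r"(?<=[a-z])[013457@$](?=[a-z])")
KEYWORDS = {"gun", "pistol", "rifle"}    # hypothetical policy terms


def normalize(text):
    """Lowercase, strip accents, undo leetspeak, drop in-word punctuation."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = LEET_RE.sub(lambda m: LEET[m.group()], text)
    tokens = (re.sub(r"[^a-z0-9]", "", tok) for tok in text.split())
    return " ".join(tok for tok in tokens if tok)


def has_keyword(text):
    """True if any normalized token is a policy keyword."""
    return any(tok in KEYWORDS for tok in normalize(text).split())


examples = ["G.U.N", "g*un", "P1st0l for sale", "AR-15"]
normalized = [normalize(e) for e in examples]
```

Running the same function in both the training and serving paths also addresses the training/serving skew concern raised in the Feature Pipeline.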
Trade-offs:
Accuracy vs. Latency: We chose DistilBERT over BERT-Large to save 150ms of inference time.
Complexity vs. Maintainability: Using pre-trained CLIP embeddings reduces the need for custom vision architecture maintenance.