The Question

Multimodal Firearm Detection System

Design a high-scale trust and safety system to detect firearm listings on a global marketplace. The system must process multimodal data (text and images) at the point of upload. Your design should address high-throughput data ingestion, low-latency multimodal inference (P99 < 500ms), strategies for handling extreme class imbalance and adversarial evasion, and a robust feedback loop for human-in-the-loop moderation and model retraining.

Technologies: CLIP, DistilBERT, PyTorch, Kafka, Spark, Triton Inference Server, Redis, Perceptual Hashing, Active Learning

Questions & Insights

Clarifying Questions

Business Goal: Is the primary goal to block listings at upload time (preventive) or to flag them for human review (reactive)?

Assumption: Preventive blocking at upload time to ensure platform safety.

Constraints & Scale: What is the scale of daily uploads and the latency budget?

Assumption: 10M daily uploads, 100M total listings, with a P99 latency budget of < 500ms for the entire moderation check.

Scope of "Firearm": Does this include parts, ammunition, or toy/replica guns?

Assumption: Policy covers functional firearms, regulated parts, and realistic replicas. Airsoft/Paintball are allowed but must be tagged.

Data Freshness: How quickly must we adapt to new evasion tactics (e.g., "g*un" or obscured images)?

Assumption: Daily model retraining or active learning loops to catch emerging evasion patterns.

Thinking Process

Identify the Bottleneck: Processing high-resolution images and long descriptions for 10M uploads/day is computationally expensive. I need a cascaded approach: cheap heuristics/fingerprinting first, then heavy ML.

Multimodal Fusion: Firearm detection is inherently multimodal. A listing might have a generic title ("Sporting equipment") but a clear image of a rifle. I need to decide between early fusion (more complex, potentially more accurate) and late fusion (simpler and faster to ship for an MVP).

Adversarial Nature: This is a cat-and-mouse game. Sellers use "leetspeak," obfuscated images, or background clutter. The system must be robust to noise.

Recall vs. Precision: False positives (blocking a legitimate toy) hurt the business; false negatives (allowing a real gun) create legal and safety risks. I will tune the model for high recall and route borderline scores to human-in-the-loop review rather than auto-blocking them.

Elite Bonus Points

Perceptual Hashing (pHash): Implementing a "Known Bad" image database to instantly block re-uploads of previously banned firearm photos without triggering heavy GPU inference.

Active Learning with Hard Negative Mining: Specifically sampling "near-misses" (e.g., power tools, airsoft, holsters) to retrain the model on the most difficult decision boundaries.

Multilingual Guardrails: Using a multilingual text encoder (such as mBERT or XLM-R) to catch firearm listings in different languages or dialects without building a separate model per language.

Explainability (Grad-CAM): Providing internal moderators with heatmaps showing which part of the image triggered the firearm detection to speed up the manual review process.
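
The "Known Bad" lookup from the pHash point above can be sketched as follows. An exact key lookup only catches identical hashes, so this sketch compares fingerprints by Hamming distance instead. It assumes 64-bit hashes are already computed upstream by an image-hashing library; `is_known_bad`, the `max_distance=6` default, and the linear scan are illustrative choices, not part of the original design.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")


def is_known_bad(candidate: int, banned_hashes, max_distance: int = 6) -> bool:
    """True if the candidate hash is within max_distance bits of any banned hash.

    max_distance is a hypothetical tolerance; it would need tuning against
    real re-uploads (crops, re-encodes) versus unrelated images.
    """
    return any(hamming_distance(candidate, h) <= max_distance for h in banned_hashes)


# Example: a re-encoded copy that differs from a banned hash in three bits.
banned = [0xF0F0F0F0F0F0F0F0]
reupload = banned[0] ^ 0b111
print(is_known_bad(reupload, banned))
```

With a large banned set, the linear scan would likely be replaced by an index built for Hamming-distance search.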

Design Breakdown

Requirements

Product Goal: Automatically detect and prevent the listing of firearms and regulated weapons.

Success Metrics:

Online Metrics: Precision/Recall, False Positive Rate (FPR), Block Rate.

Offline Metrics: PR-AUC, F1-score on a curated "golden set" of policy-violating items.

Guardrail Metrics: P99 Latency < 500ms, GPU utilization efficiency.

System Constraints: 10M uploads/day (~116 QPS average, 500+ QPS peak), high availability (99.9%).

Data Availability: Historical moderated listings (labeled "Safe" vs. "Banned"), image catalogs, and text descriptions.
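
The throughput figures in the system constraints can be checked with a few lines of arithmetic. The sketch below also estimates how many requests sit inside the ML stage at peak using Little's law; the 0.3 s service time is an assumption taken from the upper end of the Stage 2 latency estimate, and it treats every upload as reaching the model, so it is an upper bound.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

daily_uploads = 10_000_000
avg_qps = daily_uploads / SECONDS_PER_DAY
peak_qps = 500  # peak figure from the stated constraints

peak_to_avg = peak_qps / avg_qps

# Assumption: ~0.3 s per model call, and no request is short-circuited earlier.
# Little's law: requests in flight = arrival rate x time in system.
stage2_latency_s = 0.3
in_flight_at_peak = peak_qps * stage2_latency_s

print(f"average QPS: {avg_qps:.1f}")
print(f"peak/average ratio: {peak_to_avg:.1f}")
print(f"concurrent model requests at peak: {in_flight_at_peak:.0f}")
```

Under these assumptions the average is about 116 QPS, the peak is roughly 4.3 times the average, and about 150 requests are in flight at peak, which is the number the batching and replica count have to absorb.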

ML Problem Framing

ML Task Type: Multimodal Binary Classification.

Prediction Target:

$$P(\text{ViolatesPolicy} \mid \text{Image}, \text{Text}, \text{Context})$$

Inputs:

User Features: Seller history, account age, previous violations.

Item Features: Title, description text, image pixels, price (outliers often indicate illicit sales).

Context Features: Category (e.g., "Home & Garden" vs. "Sporting Goods"), location.

Outputs: Probability score [0, 1] and a set of "reason codes."

ML Challenges: Extreme class imbalance (most listings are safe), adversarial evasion, and high cost of false positives.

Design Summary & MVP

Concise Summary: A two-stage cascaded system. Stage 1 uses fast text matching and image hashing. Stage 2 uses a late-fusion multimodal model combining a CLIP image encoder and a DistilBERT text encoder.

Model Architecture & Selection:

Baseline Model: Keyword regex + Image hashing.

Target Model: Two-Tower architecture (CLIP Vision Transformer for images, DistilBERT for text) with an MLP (Multi-Layer Perceptron) head for final classification.

Choice Rationale: Late fusion allows for parallel processing of text and image and is easier to debug and scale than early-fusion transformers.

ML Life Cycle Summary: Logs flow to S3 -> Spark for cleaning -> Multimodal training (PyTorch) -> Model Registry -> Triton Inference Server -> Monitoring.

Simplicity Audit: Avoids complex Graph Neural Networks or 3D vision. Uses pre-trained backbones (Transfer Learning) to minimize the need for massive labeled datasets.

Architecture Decision Rationale:

Why this architecture?: It balances the need for semantic understanding of text with the high-dimensional pattern recognition needed for images.

Requirement Satisfaction: Cascading resolves known-bad images and obvious keyword matches in milliseconds without GPU inference, while the deep model provides the accuracy needed for the remaining, ambiguous listings.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Real-time listing events (JSON) and binary image data from the Marketplace API.

Data Ingestion: Kafka for high-throughput streaming. Each event is fanned out to two consumers, in the spirit of a Lambda Architecture: one archives it to S3 for training, and the other passes it to the inference service.

Data Storage: S3 for the Data Lake (Parquet format for text/metadata, sharded blobs for images).

Data Processing: Spark jobs handle the heavy lifting: resizing images, removing EXIF metadata (privacy), and joining text descriptions with historical labels.

Data Quality: De-duplication is critical. If a seller uploads the same photo 10 times, we should process it once.
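
One simple way to process a repeated photo only once is to key cached verdicts on a content hash. The sketch below uses SHA-256 over the raw bytes, so it only catches exact duplicates; resized or re-encoded copies would need the perceptual hash instead. The class name `UploadDeduplicator` and the in-memory dict are hypothetical stand-ins for whatever store the pipeline actually uses.

```python
import hashlib


class UploadDeduplicator:
    """Caches one verdict per unique image payload (exact-byte duplicates only)."""

    def __init__(self):
        self._verdicts = {}  # sha256 hex digest -> cached verdict

    @staticmethod
    def fingerprint(image_bytes: bytes) -> str:
        return hashlib.sha256(image_bytes).hexdigest()

    def check(self, image_bytes: bytes, classify):
        """Run classify(image_bytes) only the first time these bytes are seen."""
        key = self.fingerprint(image_bytes)
        if key not in self._verdicts:
            self._verdicts[key] = classify(image_bytes)
        return self._verdicts[key]


# Example: ten uploads of the same photo trigger a single classification.
calls = []
dedup = UploadDeduplicator()
for _ in range(10):
    dedup.check(b"same-photo-bytes", lambda b: calls.append(1) or "safe")
print(len(calls))
```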

Feature Pipeline

Text Features: Tokenization using WordPiece. We compute TF-IDF for the baseline and dense embeddings (768-dim) for the ML model.

Image Features: Resizing to 224x224. We use pHash (Perceptual Hashing) for the "Fast Filter" stage.

Online Pipeline: The Serving Service pulls user reputation scores from a low-latency Redis-based feature store.

Training/Serving Skew: We use a unified library for text normalization (lowercase, stripping accents) to ensure the model sees the same format during training and inference.
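
A minimal sketch of such a shared normalization function, assuming it is imported by both the training jobs and the serving code. Lowercasing and accent stripping come from the description above; the whitespace collapsing is an added assumption, and the name `normalize_text` is hypothetical.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Lowercase, strip accents, and collapse whitespace."""
    # NFKD splits accented characters into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Assumption: runs of whitespace carry no signal, so collapse them.
    return " ".join(stripped.lower().split())


print(normalize_text("  Pistolé   CAFÉ "))
```

Keeping this in one module that both pipelines import is what prevents the two code paths from drifting apart.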

Model Architecture

Problem Formulation: Binary classification (Is_Firearm: 0 or 1).

Candidate Models:

Image: ResNet-50 (Fast but old), ViT (High accuracy, heavy).

Text: FastText (Cheap), BERT (Heavy).

MVP Architecture:

Vision Tower: Pre-trained CLIP (ViT-B/32) image encoder. CLIP is excellent because it’s already aligned with natural language.

Text Tower: DistilBERT for a good balance of semantic depth and latency.

Fusion: Concatenate the output vectors from both towers and pass through 2 Dense layers with Dropout.

Optimization: Quantization (INT8) for the DistilBERT model to fit within the 500ms latency budget on CPU if GPUs are saturated.

Training Pipeline

Dataset Construction: We face a 1:1000 class imbalance. We use Downsampling of the majority class (Safe listings) and Image Augmentation (rotation, brightness, blur) for the minority class (Firearms).

Data Splitting: Time-based split. We train on months 1-5 and test on month 6 to simulate "new" evasion tactics appearing.

Training Infrastructure: Distributed training using PyTorch DDP on a cluster of A100s.

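Retraining Strategy: Weekly scheduled retraining, or an ad-hoc trigger if the Monitoring Pipeline detects a >10% drop in Recall (suggesting a new evasion campaign). Because true recall cannot be observed directly in production, it is estimated from moderator-labeled samples.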

Serving Pipeline

Serving Pattern: Request-Response for blocking at upload.

Latency Optimization:

Stage 1 (Fast): Redis lookup for the image pHash, plus regex matching on the text. (~10ms)

Stage 2 (ML): Triton Inference Server with dynamic batching. (200-300ms)

Reliability: If the ML service times out or fails, the system defaults to "Allow" but flags the listing for high-priority asynchronous manual review (fail-open to protect seller UX, with human review as the safety net).
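
The two stages and the fail-open behavior can be sketched together. This is an illustration under several assumptions: the banned-hash set and keyword pattern are in-memory stand-ins for the Redis data, `score_fn` stands in for the call to the model server, a keyword hit is assumed to block outright, and the `threshold` and `timeout_s` defaults are placeholders.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins: in the design above these would live in Redis.
BANNED_HASHES = {"f0f0f0f0f0f0f0f0"}
KEYWORD_RE = re.compile(r"\b(rifle|pistol|handgun)\b", re.IGNORECASE)

_executor = ThreadPoolExecutor(max_workers=8)


def moderate(title, image_hash, score_fn, threshold=0.9, timeout_s=0.4):
    """Return a decision dict with action, reason, and needs_review."""
    # Stage 1: cheap checks that never touch the model.
    if image_hash in BANNED_HASHES:
        return {"action": "block", "reason": "known_bad_image", "needs_review": False}
    # Assumption: a keyword hit blocks outright; a real policy might escalate instead.
    if KEYWORD_RE.search(title):
        return {"action": "block", "reason": "keyword_match", "needs_review": False}

    # Stage 2: model call bounded by a timeout.
    future = _executor.submit(score_fn, title, image_hash)
    try:
        score = future.result(timeout=timeout_s)
    except Exception:
        # Timeout or model failure: fail open, but queue for manual review.
        return {"action": "allow", "reason": "ml_unavailable", "needs_review": True}

    action = "block" if score >= threshold else "allow"
    return {"action": action, "reason": "model_score", "needs_review": False}


# Example: a benign listing scored by a stub model.
print(moderate("Vintage desk lamp", "0000000000000000", lambda t, h: 0.02))
```

Note that a timed-out model call keeps running in its worker thread; a production version would also need to cancel or bound that work.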

Evaluation Pipeline

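Offline: We track PR-AUC. Since the cost of a False Positive is high (frustrated sellers), we set the auto-block threshold where Precision is at least 95%. Listings that score below that threshold but still look suspicious go to human review, so recall does not depend on auto-blocking alone.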

Online: A/B testing the new model vs. the baseline. We measure "Appeal Rate" (how many users complain their listing was wrongly blocked) as a proxy for False Positives.
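
Choosing the operating point for a precision target can be done by sweeping thresholds over a labeled validation set. The sketch below returns the lowest threshold that still meets the target, which gives the highest recall available at that precision. It assumes a listing is flagged when its score is greater than or equal to the threshold; the function name is hypothetical.

```python
def pick_threshold(scores, labels, min_precision=0.95):
    """Return (threshold, precision, recall) for the lowest qualifying threshold.

    labels are 1 for policy-violating items and 0 for safe ones.
    Returns None if no threshold reaches min_precision.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        return None
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    best = None
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every item tied at this score before evaluating.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        precision = tp / (tp + fp)
        if precision >= min_precision:
            # Precision is not monotonic, so keep the last (lowest) qualifying one.
            best = (threshold, precision, tp / total_pos)
    return best


scores = [0.99, 0.95, 0.90, 0.80, 0.70, 0.60]
labels = [1, 1, 1, 0, 1, 0]
print(pick_threshold(scores, labels, min_precision=0.95))
```

On this toy data the 95% target forces the threshold up to 0.90, which leaves one of the four violating items uncaught by auto-blocking.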

Monitoring Pipeline

Data Monitoring: Check if the average length of descriptions or the distribution of image aspect ratios changes (could indicate a pipeline bug).

Model Monitoring: Prediction Drift. If the model suddenly starts flagging 50% of listings as firearms, we trigger a circuit breaker.

Feedback Loop: When a human moderator reverses a model's decision, that case is automatically added to a hard-examples dataset for the next training cycle: overturned blocks become hard negatives, and missed firearms become hard positives.
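
The prediction-drift circuit breaker can be sketched as a rolling flag-rate check. The 50% limit comes from the example above; the window size, the minimum sample count, and the class name `FlagRateBreaker` are assumptions for illustration.

```python
from collections import deque


class FlagRateBreaker:
    """Trips when the share of flagged listings in a rolling window is too high."""

    def __init__(self, window=1000, max_flag_rate=0.5, min_samples=100):
        self._recent = deque(maxlen=window)
        self._flagged = 0
        self.max_flag_rate = max_flag_rate
        self.min_samples = min_samples

    def record(self, flagged: bool) -> bool:
        """Record one decision; return True if the breaker should be open."""
        if len(self._recent) == self._recent.maxlen:
            # The oldest entry is about to be evicted by append().
            self._flagged -= self._recent[0]
        self._recent.append(int(flagged))
        self._flagged += int(flagged)

        if len(self._recent) < self.min_samples:
            return False  # not enough data to judge
        return self._flagged / len(self._recent) >= self.max_flag_rate


# Example with a tiny window: five safe decisions, then a burst of flags.
breaker = FlagRateBreaker(window=10, max_flag_rate=0.5, min_samples=5)
states = [breaker.record(f) for f in [False] * 5 + [True] * 5]
print(states[-1])
```

What an open breaker does (fall back to the baseline, page an on-call engineer, route everything to review) is a policy decision the sketch leaves out.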

Wrap Up

Final Evaluation

Observability: Real-time dashboards showing the "Firearm Detection Rate" by category and region.

Edge Cases:

Contextual Confusion: A person selling a "Water Gun" or a "LEGO Star Wars Blaster." We mitigate this by training with an explicit "Toy" label so the model learns the distinction.

Adversarial Text: Sellers writing "G.U.N" or using images with text overlays.
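
For the text side of this edge case, one option is to generate a tolerant regex per keyword that allows separator characters between letters and a few common character substitutions. The sketch below is illustrative: the substitution table `LEET`, the limit of three separator characters, and the function name are assumptions, and a real keyword list would also need plurals and other variants.

```python
import re

# Hypothetical substitution table; extend from observed evasion attempts.
LEET = {"a": "a4@", "e": "e3", "i": "i1!", "o": "o0", "s": "s5$", "g": "g9", "u": "uv"}


def obfuscation_pattern(keyword: str):
    """Compile a regex matching keyword with separators and simple substitutions."""
    parts = []
    for ch in keyword.lower():
        variants = LEET.get(ch, ch)
        parts.append("[" + re.escape(variants) + "]")
    # Up to three non-alphanumeric characters (or underscores) between letters.
    separator = r"[\W_]{0,3}"
    body = separator.join(parts)
    # Lookarounds keep the match from starting or ending inside a longer word.
    return re.compile(r"(?<![a-z0-9])" + body + r"(?![a-z0-9])", re.IGNORECASE)


pattern = obfuscation_pattern("gun")
for text in ["G.U.N for sale", "selling g*un cheap", "begun", "big unit"]:
    print(text, "->", bool(pattern.search(text)))
```

The lookarounds are what stop ordinary words such as "begun" from matching; without them the separator tolerance would produce many false positives.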

Trade-offs:

Accuracy vs. Latency: We chose DistilBERT over BERT-Large to save roughly 150ms of inference time.

Complexity vs. Maintainability: Using pre-trained CLIP embeddings reduces the need for custom vision architecture maintenance.