The Question

Large-Scale Visual Similarity & Search System

Design a high-scale visual similarity system for a global e-commerce platform with a 100M+ item catalog. The system must support 'search-by-image' and 'similar product' recommendations with a P99 latency under 200ms. Your design should cover end-to-end data ingestion, embedding generation using deep vision models, efficient approximate nearest neighbor (ANN) retrieval at scale, and strategies for handling high QPS. Emphasize how you handle data freshness, model evaluation (online vs. offline), and the infrastructure required to serve and monitor these models in a production environment with constant catalog updates.

Topics: Vision Transformers, Siamese Networks, HNSW, Triplet Loss, TensorRT, FAISS, PyTorch, Spark, Milvus, Contrastive Learning

Questions & Insights

Clarifying Questions

Clarifying Questions & Constraints:

Business Goal: Is the goal to power "Similar Product" recommendations on an e-commerce site or a "Search by Image" feature? (Assumption: E-commerce "Shop the Look" to increase Conversion Rate).

Constraints & Scale: What is the corpus size and throughput? (Assumption: 100M items, 10k QPS, <200ms P99 latency).

Data Freshness: How quickly must new items appear in search? (Assumption: Near real-time, within minutes of catalog update).

Edge Cases: How do we handle images with multiple objects (e.g., a person wearing a hat, shirt, and pants)? (Assumption: MVP focuses on the dominant object or uses a simple bounding-box detection).

Assumptions:

Corpus: 100M images stored in S3/Object Store.

Latency: 200ms end-to-end (100ms for embedding generation, 50ms for ANN search, 50ms for overhead).

Infrastructure: Cloud-native (AWS/GCP), using managed Vector Databases for retrieval.

Thinking Process

Identify the Core Pattern: This is a classic Embedding-based Retrieval (EBR) problem. We need to project images into an embedding space where semantic similarity corresponds to geometric proximity (cosine or Euclidean distance).

The Bottleneck: Inference latency for deep vision models and the cost of nearest-neighbor search over 100M high-dimensional vectors (exact search is a linear scan per query) are the primary technical hurdles.

Two-Stage Approach:

Retrieval (Candidate Generation): Use an Approximate Nearest Neighbor (ANN) index for speed.

Ranking (Optional for MVP): A simple re-ranking based on metadata (availability, price) or a lightweight cross-encoder if the budget allows.

YAGNI Implementation: Start with a pre-trained backbone (ViT or ResNet) fine-tuned with triplet loss rather than building a multi-modal transformer from scratch.

Elite Bonus Points

Contrastive Learning (Triplet vs. InfoNCE): Using Triplet loss with "Hard Negative Mining" to ensure the model distinguishes between "similar" and "frustratingly similar but different" items (e.g., two different red dresses).
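To make the hard-negative idea concrete, here is a minimal pure-Python sketch of triplet loss with "hardest negative in the batch" selection. The margin of 0.3, the 2-d toy vectors, and the function names are illustrative assumptions, not values from the design; a real training loop would do this on batched tensors in PyTorch.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge-style triplet loss: zero once the negative is at least
    `margin` farther from the anchor than the positive is."""
    return max(0.0, l2_distance(anchor, positive)
               - l2_distance(anchor, negative) + margin)

def hardest_negative(anchor, negatives):
    """Pick the negative closest to the anchor (the hardest one available)."""
    return min(negatives, key=lambda n: l2_distance(anchor, n))

# Toy example: one easy negative (far away) and one hard negative (nearby).
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[-1.0, 0.0], [0.8, 0.3]]

hard = hardest_negative(anchor, negatives)
loss_easy = triplet_loss(anchor, positive, negatives[0])  # already satisfied
loss_hard = triplet_loss(anchor, positive, hard)          # still violates the margin
```

The easy negative contributes no gradient because its loss is already zero, which is why mining the hard ones matters for telling two different red dresses apart.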

Product-Quantization (PQ) Tuning: For 100M items, memory is expensive. Discussing the trade-off between Recall and Memory using PQ or HNSW indexing is high-signal.
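A back-of-the-envelope calculation helps frame this trade-off. The sketch below assumes 512-d float32 vectors, an HNSW graph with M=32 (up to 2*M links per node on the base layer, upper layers ignored), and PQ with 64 one-byte sub-quantizer codes per vector; all three settings are assumptions chosen for illustration, and PQ codebook storage is ignored as negligible.

```python
def flat_bytes(n, dim, bytes_per_float=4):
    """Raw storage for n uncompressed float vectors."""
    return n * dim * bytes_per_float

def pq_bytes(n, num_subquantizers, bits_per_code=8):
    """Storage for PQ codes only (codebooks are comparatively tiny)."""
    return n * num_subquantizers * bits_per_code // 8

def hnsw_link_bytes(n, m, bytes_per_id=4):
    """Approximate graph overhead: up to 2*m neighbor ids per node
    on the base layer; the sparse upper layers are ignored."""
    return n * 2 * m * bytes_per_id

N, DIM = 100_000_000, 512

raw_gb = flat_bytes(N, DIM) / 1e9                 # uncompressed vectors
pq_gb = pq_bytes(N, 64) / 1e9                     # 64 bytes per vector
hnsw_gb = raw_gb + hnsw_link_bytes(N, 32) / 1e9   # vectors plus graph links

print(f"raw float32: {raw_gb:.1f} GB")
print(f"HNSW (M=32): {hnsw_gb:.1f} GB")
print(f"PQ (64 x 8-bit): {pq_gb:.1f} GB")
```

Under these assumptions HNSW needs roughly 230 GB of RAM versus about 6.4 GB for PQ codes, so HNSW implies sharding across several nodes while PQ trades recall for a footprint that fits on one machine.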

Offline/Online Consistency: Ensuring the image preprocessing (cropping, normalization) used during training exactly matches the online inference pipeline to prevent training/serving skew.

Visual Explainability: Using Grad-CAM to visualize which parts of the image the model is using to determine similarity, which helps in debugging "false positives."

Design Breakdown

Requirements

Product Goal: Enable users to find products visually similar to a query image.

Success Metrics:

Online Metrics: Click-Through Rate (CTR) on similar items, Add-to-Cart (ATC) rate, Revenue per session.

Offline Metrics: Recall@K, Mean Average Precision (mAP), Normalized Discounted Cumulative Gain (NDCG).

Guardrail Metrics: Inference Latency (P99), Index Update Latency (Freshness).

System Constraints: 100M images, 10k QPS, <200ms P99 latency.

Data Availability: Product catalog images, user click logs (for triplet mining), category metadata.

ML Problem Framing

ML Task Type: Deep Metric Learning (Representation Learning).

Prediction Target: An embedding vector $v \in \mathbb{R}^d$ such that $\mathrm{dist}(v_{query}, v_{pos}) < \mathrm{dist}(v_{query}, v_{neg})$.

Inputs:

User/Query: Raw image pixels (RGB).

Item Features: Pre-computed embeddings from the catalog.

Outputs: A ranked list of Item IDs.

ML Challenges: Large-scale indexing, handling "domain gap" (user-taken photos vs. professional catalog photos), and cold-start for new inventory.

Design Summary & MVP

Concise Summary: We will build a Siamese Network using a Vision Transformer (ViT) backbone to generate 512-d embeddings, indexed in a managed Vector DB (e.g., Pinecone or Milvus) using HNSW for sub-50ms retrieval.

Model Architecture & Selection:

Baseline Model: Simple Color Histograms + SIFT descriptors (Heuristic).

Target Model: Vision Transformer (ViT-B/16) fine-tuned with Triplet Loss.

Choice Rationale: ViT captures global dependencies better than CNNs, and Triplet Loss directly optimizes the similarity metric we care about.

ML Life Cycle Summary: Raw Images -> Preprocessing -> ViT Embedding -> HNSW Indexing -> ANN Retrieval -> UI.

Simplicity Audit: We avoid a multi-stage re-ranker for the MVP. A single-stage high-quality embedding retrieval is often sufficient for visual similarity.

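Architecture Decision Rationale: HNSW offers a strong trade-off between query speed and recall at 100M scale, at the cost of higher memory use than quantized indexes. Starting from a pre-trained backbone substantially reduces training time and labeled-data requirements compared with training from scratch.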

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Product images from the catalog (S3), transactional DBs for metadata (categories), and user interaction logs for positive pairs (clicked similar items).

Data Ingestion:

Batch: Daily crawl of the full catalog via Spark.

Streaming: Kafka-triggered updates when a new product is added/modified to ensure the index is fresh.

Data Storage: Raw images in S3. Metadata in a distributed SQL DB (e.g., CockroachDB) for filtering (e.g., "only show items in stock").

Data Processing: Spark-based resizing and normalization. Images are converted to a standardized format (e.g., 224x224 RGB) to ensure consistency between training and serving.

Data Quality: Automated checks for corrupted images, grayscale-only images (if unexpected), and duplicate detection to avoid index bloat.

Feature Pipeline

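Offline Feature Pipeline: Periodically (e.g., every 4 hours), a GPU-based Spark job runs the latest ViT model over all new images in S3 as a backfill and consistency pass; items that must appear within minutes arrive through the Kafka-triggered streaming path. Embeddings are stored in a Feature Store and synced to the Vector Index.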

Online Feature Pipeline: When a user uploads a query image, the Inference Service applies exactly the same transformations as training (resize, then per-channel mean subtraction and scaling) before passing it to the model.

Feature Store: Acts as the source of truth for embeddings, allowing for point-in-time joins if we later decide to use these embeddings as features for a CTR ranking model.

Training/Serving Skew: Mitigation via a shared library for image preprocessing used by both the training script and the serving container.
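One way to picture the shared library is a single module that owns the constants and the transform, imported by both the training script and the serving container. The sketch below is pure Python for readability; the ImageNet mean/std values are the standard published ones, but whether this model uses them is an assumption, and `preprocessing_fingerprint` is a hypothetical helper, not part of any library.

```python
IMAGE_SIZE = 224                   # from the 224x224 RGB format above
MEAN = (0.485, 0.456, 0.406)       # ImageNet channel means (assumed)
STD = (0.229, 0.224, 0.225)        # ImageNet channel stds (assumed)

def normalize_pixel(rgb):
    """Map one 8-bit (R, G, B) pixel to the normalized floats the model sees."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, MEAN, STD))

def preprocess(pixels):
    """Normalize an already-resized image given as rows of (R, G, B) tuples."""
    if len(pixels) != IMAGE_SIZE or any(len(row) != IMAGE_SIZE for row in pixels):
        raise ValueError(f"expected a {IMAGE_SIZE}x{IMAGE_SIZE} image")
    return [[normalize_pixel(p) for p in row] for row in pixels]

def preprocessing_fingerprint():
    """Stable string describing the config; store it with the model at
    training time and compare it at serving start-up to catch skew."""
    return f"size={IMAGE_SIZE};mean={MEAN};std={STD}"

# Usage: a uniform gray image goes through the same path in both pipelines.
gray = [[(128, 128, 128)] * IMAGE_SIZE for _ in range(IMAGE_SIZE)]
normalized = preprocess(gray)
```

Saving the fingerprint next to each model checkpoint lets the serving container refuse to start if its preprocessing config differs from the one used in training.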

Model Architecture

Problem Formulation: Deep Metric Learning using a Siamese network. We want to learn a function $f(x)$ that maps an image $x$ to a vector space.

Architecture Design:

Backbone: ViT-B/16 (Vision Transformer).

Projection Head: A small MLP (Dense -> ReLU -> Dense) that maps the transformer's [CLS] token output to a 512-dimensional L2-normalized embedding.
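L2 normalization is what lets the index treat cosine similarity, inner product, and Euclidean distance interchangeably: for unit vectors, squared Euclidean distance equals 2 - 2 * cosine similarity, so all three produce the same ranking. A small sketch with toy 2-d vectors (the vectors and function names are illustrative assumptions):

```python
import math

def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit length; eps guards against division by zero."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]

def dot(a, b):
    """Inner product; equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

u = l2_normalize([3.0, 4.0])   # -> [0.6, 0.8]
w = l2_normalize([4.0, 3.0])   # -> [0.8, 0.6]

cosine = dot(u, w)
distance_sq = squared_l2(u, w)
# Identity for unit vectors: ||u - w||^2 == 2 - 2 * cos(u, w)
identity_gap = abs(distance_sq - (2 - 2 * cosine))
```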

Model Complexity: ViT-B/16 has ~86M parameters. At 10k QPS, we need a horizontally scaled GPU inference cluster (e.g., NVIDIA T4 or A10G) with TensorRT optimization.
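A rough way to size that cluster is to divide peak QPS by the sustained per-GPU throughput, leaving headroom so P99 latency does not degrade under bursts. In the sketch below, the 500 images/s per-GPU figure and the 60% target utilization are placeholder assumptions, not benchmarks; the real throughput has to be measured for the chosen GPU, batch size, and precision.

```python
import math

def gpus_needed(peak_qps, per_gpu_throughput, target_utilization=0.6):
    """Number of GPUs required to serve `peak_qps` while keeping each GPU
    at or below `target_utilization` of its measured throughput."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    if per_gpu_throughput <= 0:
        raise ValueError("per_gpu_throughput must be positive")
    return math.ceil(peak_qps / (per_gpu_throughput * target_utilization))

# Hypothetical numbers: 10k QPS peak, 500 images/s per GPU after optimization.
with_headroom = gpus_needed(10_000, 500)         # 60% utilization target
no_headroom = gpus_needed(10_000, 500, 1.0)      # lower bound, zero slack
```

Cache hits on popular query images reduce the effective QPS that reaches the GPUs, so the measured cache hit rate should feed back into this estimate.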

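Model Selection Strategy: ViT is chosen over ResNet because self-attention relates distant image regions within a single layer, and transformers are often reported to be more robust to occlusion and variations in object scale. This should be confirmed with an offline comparison against a ResNet baseline on our own data.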

Optimization: Use FP16 (half-precision) inference, which halves the weight memory footprint and typically lowers latency substantially on Tensor Core GPUs (up to roughly 2x), usually without significant loss in recall.

Training Pipeline

Dataset Construction:

Positive Pairs: Product images from the same "parent" SKU or items frequently co-clicked in search.

Negative Pairs: Hard negatives (items in the same category but different SKU) and Easy negatives (random items).

Data Splitting: Time-based split. Train on catalog from Jan-Oct, validate on Nov-Dec to simulate new fashion trends.
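The time-based split can be sketched as a simple cutoff on the listing date. The field names (`sku`, `listed_on`), the year, and the exact cutoff date are hypothetical; the point is only that validation items are strictly newer than anything seen in training.

```python
from datetime import date

def time_based_split(items, cutoff):
    """Items listed before `cutoff` go to train; the rest go to validation."""
    train = [item for item in items if item["listed_on"] < cutoff]
    validation = [item for item in items if item["listed_on"] >= cutoff]
    return train, validation

# Hypothetical catalog rows; the cutoff mirrors the Jan-Oct / Nov-Dec split.
catalog = [
    {"sku": "A", "listed_on": date(2024, 3, 1)},
    {"sku": "B", "listed_on": date(2024, 11, 15)},
    {"sku": "C", "listed_on": date(2024, 10, 31)},
]
train_items, val_items = time_based_split(catalog, cutoff=date(2024, 11, 1))
```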

Training Infrastructure: Distributed training using PyTorch DistributedDataParallel (DDP) on 8x A100 GPUs.

Retraining Strategy: Retrain monthly or when a significant drop in Recall@K is detected via the Monitoring Pipeline.

Serving Pipeline

Serving Pattern: Online Inference for the query image + Vector Search for retrieval.

Serving Architecture:

Embedding Service: Python/FastAPI or Go service wrapping the PyTorch/TensorRT model.

Vector DB: HNSW index for low-latency retrieval.

Latency Optimization:

Use Request Batching (grouping multiple user requests into one GPU call).

Cache embeddings for popular query images in Redis to bypass GPU inference.
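The cache lookup can be keyed on a hash of the image bytes, so identical uploads skip the GPU entirely. The sketch below uses an in-memory dict as a stand-in for Redis, and the key format and `model_version` prefix are assumptions; including the model version in the key means a model rollout automatically stops serving embeddings computed by the previous model.

```python
import hashlib

class EmbeddingCache:
    """In-memory stand-in for a Redis cache keyed by image content hash."""

    def __init__(self, embed_fn, model_version):
        self._embed_fn = embed_fn          # the expensive GPU call
        self._model_version = model_version
        self._store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, image_bytes):
        digest = hashlib.sha256(image_bytes).hexdigest()
        return f"emb:{self._model_version}:{digest}"

    def get(self, image_bytes):
        """Return the cached embedding, computing and storing it on a miss."""
        key = self._key(image_bytes)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        embedding = self._embed_fn(image_bytes)
        self._store[key] = embedding
        return embedding

# Usage with a fake embedder that records how often it is called.
calls = []
def fake_embed(image_bytes):
    calls.append(image_bytes)
    return [float(len(image_bytes))]

cache = EmbeddingCache(fake_embed, model_version="v1")
first = cache.get(b"same-image")
second = cache.get(b"same-image")   # served from cache, no second model call
```

An exact byte hash only matches identical files; re-encoded or resized copies of the same picture would miss, which is acceptable for popular catalog images that are requested verbatim.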

Reliability: If the Embedding Service fails, fall back to a metadata-only (text-based) search or a static "trending products" list.

Evaluation Pipeline

Offline Evaluation:

Recall@10: When a held-out image of a product (one not in the index) is used as the query, does that product's SKU appear in the top 10 results?

mAP (Mean Average Precision): Measures the quality of the entire ranked list.
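Both offline metrics are easy to compute from a ranked list of item IDs and a set of relevant IDs per query. A minimal sketch follows; the function names and toy IDs are illustrative, and when each query has a single relevant SKU, Recall@K reduces to a 0/1 hit rate.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)

def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@rank taken at each rank where a relevant item appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)

def mean_average_precision(queries):
    """`queries` is a list of (ranked_ids, relevant_ids) pairs."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)

ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
r_at_2 = recall_at_k(ranked, relevant, k=2)     # only "a" is in the top 2
ap = average_precision(ranked, relevant)        # (1/1 + 2/3) / 2
```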

Online Evaluation:

A/B Testing: Control (Random/Text-only) vs. Variant (Visual Similarity).

Primary Metric: Conversion Rate (CVR).

Secondary Metric: Average Session Depth (users clicking more items).

Monitoring Pipeline

System Monitoring: Prometheus/Grafana for GPU utilization and P99 latency.

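Data Monitoring: Track the distribution of embedding norms measured before L2 normalization (after normalization every norm is 1, so it carries no signal), along with per-dimension statistics. A shift in the average pre-normalization norm suggests a distribution shift in the input images.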

Performance Monitoring: Use "Click-through on Rank 1" as a proxy for model health in production. If the CTR on top results drops, trigger an alert for model decay.

Wrap Up

Final Evaluation

Observability: We use Population Stability Index (PSI) on the embedding dimensions to detect if the model's output space is collapsing or drifting.
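PSI compares the binned distribution of a value in a baseline window against a recent window: PSI = sum over bins of (actual - expected) * ln(actual / expected). The sketch below computes it for one embedding dimension; the bin edges, the toy data, and the small `eps` floor for empty bins are assumptions for illustration.

```python
import math

def psi(expected, actual, bin_edges, eps=1e-6):
    """Population Stability Index between two samples of one scalar feature.
    `bin_edges` are the interior cut points; empty bins are floored at `eps`."""
    def fractions(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            # Bin index = number of edges the value has reached or passed.
            counts[sum(1 for edge in bin_edges if v >= edge)] += 1
        return [max(c / len(values), eps) for c in counts]

    exp_frac = fractions(expected)
    act_frac = fractions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(exp_frac, act_frac))

# Toy data for a single embedding dimension.
baseline = [i / 100 for i in range(-50, 50)]
shifted = [v + 0.3 for v in baseline]
edges = [-0.25, 0.0, 0.25]

psi_same = psi(baseline, baseline, edges)     # identical windows
psi_shifted = psi(baseline, shifted, edges)   # clearly drifted window
```

Thresholds such as 0.1 (minor shift) and 0.25 (major shift) are common rules of thumb rather than guarantees. With 512 dimensions it is usually more practical to alert on the fraction of dimensions exceeding a threshold than on any single dimension.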

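Feedback Loop: User clicks on similar items are fed back into the Triplet Miner for the next training cycle as implicit feedback. Because clicks only cover items the current model already surfaces, they should be mixed with SKU-based pairs to limit feedback-loop bias.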

Edge Cases:

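Cold Start: Because the embedding depends only on the image, a new item can be embedded by the current production model and indexed immediately, before any interaction data exists for it. All indexed vectors must come from the same model version, since embeddings from different checkpoints (e.g., the raw pre-trained ViT vs. the fine-tuned model) are not comparable.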

Bias: Ensure the model doesn't over-index on specific colors or brands unless relevant.

Trade-offs:

Accuracy vs. Latency: A larger ViT-L model would likely improve recall slightly (on the order of 2%, an estimate to be validated offline) while at least doubling inference latency. For the MVP, ViT-B is the sweet spot.

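HNSW vs. IVF: HNSW generally provides higher recall at a given latency but uses more RAM, because it stores full vectors plus graph links. At 100M 512-d float32 vectors, the raw vectors alone take about 205 GB, which is manageable only when sharded across several nodes. Under that assumption HNSW is preferred, with IVF-PQ as the fallback if memory cost dominates.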