The Question
ML Design: Large-Scale Multi-Modal Image Search System
Design a high-scale image search system capable of handling 100 million images. The system must support both text-to-image and image-to-image queries with sub-200ms P99 latency. Your design should detail the multi-modal representation learning approach, the architecture for approximate nearest neighbor (ANN) search at scale, and how you ensure consistent embedding spaces between training and serving. Address data ingestion for billion-scale corpora and how you handle the reliability of the vector retrieval path.
CLIP
Transformers
ViT
HNSW
FAISS
Product Quantization
Spark
Kafka
Vector Database
InfoNCE Loss
Questions & Insights
Clarifying Questions
Clarifying Questions & Constraints:
Business Goal: Is the goal to find exact duplicates (e.g., copyright) or semantic matches (e.g., "sunset on a beach")? Assumption: Semantic search to drive user engagement.
Modality: Are we supporting Text-to-Image, Image-to-Image, or both? Assumption: Both.
Scale: What is the corpus size and QPS? Assumption: 100M images, 1,000 QPS.
Latency: What is the P99 latency budget? Assumption: 200ms for the entire retrieval path.
Freshness: How quickly must a newly uploaded image be searchable? Assumption: Near real-time (minutes).
Assumptions:
Corpus: 100M images.
Latency: < 200ms P99.
Strategy: Use a unified embedding space for multi-modal retrieval.
Thinking Process
Identify the Core Bottleneck: Search across 100M items cannot be done by brute-force comparison. We need an Approximate Nearest Neighbor (ANN) solution.
Representation Learning: How do we map text and images to the same space? A CLIP-style (Contrastive Language-Image Pre-training) dual-tower architecture is the standard for bridging modalities.
Two-Stage Architecture: Start with a high-recall retrieval (ANN) and determine if a second-stage re-ranker is necessary for the MVP. Per YAGNI, we will start with a single-stage high-quality retrieval.
Scale and Efficiency: 100M embeddings at 512 dimensions (float32) require about 205 GB of RAM (100M × 512 × 4 bytes) before any index overhead. We must compress the vectors with Product Quantization to fit them in memory, or offload the index to a managed Vector DB.
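The memory arithmetic above can be checked with a short back-of-the-envelope script. This is a rough sketch: the PQ setting (64 sub-quantizers, 8-bit codes) and the HNSW graph cost (about 2 × M neighbor ids of 4 bytes each at the base layer, with M = 32) are illustrative assumptions, not values from the design.

```python
def index_memory_gb(num_vectors: int, bytes_per_vector: float) -> float:
    """Return memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_vectors * bytes_per_vector / 1e9


NUM_IMAGES = 100_000_000
DIM = 512

# Uncompressed float32 vectors: 4 bytes per dimension.
raw_gb = index_memory_gb(NUM_IMAGES, DIM * 4)

# Assumed PQ setting: 64 sub-quantizers with 8-bit codes -> 64 bytes per vector.
PQ_SUBQUANTIZERS = 64
pq_gb = index_memory_gb(NUM_IMAGES, PQ_SUBQUANTIZERS * 1)

# Assumed HNSW graph overhead: roughly 2 * M neighbor ids (4 bytes each) per vector.
HNSW_M = 32
graph_gb = index_memory_gb(NUM_IMAGES, 2 * HNSW_M * 4)

print(f"raw float32 vectors: {raw_gb:.1f} GB")
print(f"PQ codes:            {pq_gb:.1f} GB")
print(f"HNSW links (approx): {graph_gb:.1f} GB")
```

Under these assumptions the graph links, not the compressed vectors, dominate the footprint, which is worth keeping in mind when comparing HNSW against IVF-PQ.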
Elite Bonus Points
Cross-Modal Alignment Calibration: Use temperature scaling in the contrastive loss to keep the model from becoming overconfident and to improve the alignment between text and image clusters.
Hierarchical Navigable Small World (HNSW) vs. IVF-PQ: Choosing HNSW for high-precision low-latency retrieval while acknowledging the memory trade-off compared to Inverted File Index with Product Quantization.
Query Expansion via LLMs: Using an LLM to expand a short user text query into a more descriptive one before embedding to improve semantic hit rates.
Negative Mining: Implementing "Hard Negative Mining" during training to ensure the model distinguishes between very similar but distinct images (e.g., different types of "golden retrievers").
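As one illustration of the hard negative mining idea, the sketch below picks, for each text embedding, the most similar image that is not its true match. It assumes that text i is paired with image i and that embeddings are plain lists of floats; the function names are hypothetical, and a production setup would typically mine negatives with the ANN index rather than by exhaustive comparison.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def hardest_negatives(text_embs, image_embs):
    """For each text i (positive = image i), return the index of the
    most similar non-matching image."""
    result = []
    for i, text_vec in enumerate(text_embs):
        best_j, best_sim = None, -math.inf
        for j, image_vec in enumerate(image_embs):
            if j == i:
                continue  # skip the true positive
            sim = cosine(text_vec, image_vec)
            if sim > best_sim:
                best_j, best_sim = j, sim
        result.append(best_j)
    return result


texts = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
images = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.2]]
hard_negative_ids = hardest_negatives(texts, images)
print(hard_negative_ids)
```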
Design Breakdown
Requirements
Product Goal: Enable users to find relevant images using either text descriptions or reference images.
Success Metrics:
Online Metrics: Click-Through Rate (CTR) on search results, Mean Reciprocal Rank (MRR).
Offline Metrics: Recall@K, Mean Average Precision (mAP), Normalized Discounted Cumulative Gain (NDCG).
Guardrail Metrics: P99 Latency, Vector DB CPU/Memory utilization.
System Constraints: 100M image corpus, 1k QPS, <200ms latency.
Data Availability: Image-caption pairs (e.g., COCO, LAION-lite), user click logs.
ML Problem Framing
ML Task Type: Representation Learning and Extreme Classification (via Retrieval).
Prediction Target: Maximize the cosine similarity S between the query embedding E_q and the document embedding E_d:
S = \cos(\theta) = \frac{E_q \cdot E_d}{\|E_q\| \|E_d\|}
Inputs:
Query: Raw text string or raw image bytes.
Item: Image pixels and metadata (tags, alt-text).
Outputs: Top-K ranked list of Image IDs.
ML Challenges: Multi-modal alignment, scale of the index, and handling "dead" embeddings that are never retrieved.
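To make the prediction target concrete, here is a minimal exact (brute-force) version of the scoring and Top-K step. It is a reference sketch only: the image ids and 3-dimensional vectors are made up, vectors are assumed to be non-zero, and at 100M items this exhaustive scan is exactly what the ANN index replaces.

```python
import heapq
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query, index, k):
    """Return the k (image_id, cosine_score) pairs most similar to the query."""
    q = l2_normalize(query)
    scored = (
        (sum(a * b for a, b in zip(q, l2_normalize(vec))), image_id)
        for image_id, vec in index.items()
    )
    # After normalization the dot product equals the cosine similarity.
    return [(image_id, score) for score, image_id in heapq.nlargest(k, scored)]


toy_index = {
    "img_beach": [0.9, 0.1, 0.0],
    "img_dog": [0.0, 1.0, 0.0],
    "img_sunset": [0.8, 0.0, 0.6],
}
results = top_k([1.0, 0.0, 0.0], toy_index, k=2)
print(results)
```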
Design Summary & MVP
Concise Summary: A Two-Tower CLIP-style architecture where images and text are projected into a shared 512-dimensional embedding space, indexed in a Vector Database using HNSW for sub-200ms retrieval.
Model Architecture & Selection:
Baseline Model: Simple TF-IDF on image metadata/tags.
Target Model: Dual-tower transformer (Vision Transformer for images, BERT-style for text) trained with Contrastive Loss.
Choice Rationale: Deep semantic understanding outperforms keyword matching, especially for images without rich metadata.
ML Life Cycle Summary: Raw data (S3) is processed via Spark. Image embeddings are generated offline and stored in a Vector DB. At query time, an inference service encodes the query and performs the ANN search.
Simplicity Audit: We skip the re-ranking stage and complex graph-based expansion for the MVP to minimize infrastructure complexity and latency.
Architecture Decision Rationale:
Dual-tower allows for pre-computation of image embeddings, making the online search phase extremely fast (only query encoding + ANN search).
System Architecture
Pipeline Deep Dive
Data Pipeline
Data Source: Application logs (clicks), S3 buckets for raw images, and relational DBs for metadata/tags.
Data Ingestion: Batch processing for the initial 100M images using Airflow/Spark. For new uploads, a Kafka stream triggers the embedding generation.
Data Storage: Raw images remain in S3. Embeddings are stored in a specialized Vector Database (e.g., Milvus or Pinecone).
Data Quality: Image hash checks to remove duplicates; validation of aspect ratios and minimum resolution to ensure quality.
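The data quality checks could look roughly like the sketch below. The thresholds (minimum side of 224 pixels, maximum aspect ratio of 4) and both function names are assumptions for illustration. Note that a cryptographic hash only catches byte-identical files; near-duplicates such as resized or re-encoded copies would need a perceptual hash or an embedding-distance check.

```python
import hashlib


def dedupe_by_content_hash(images):
    """Keep the first image id for each distinct byte content.

    `images` is an iterable of (image_id, raw_bytes) pairs.
    """
    seen = set()
    unique_ids = []
    for image_id, raw in images:
        digest = hashlib.sha256(raw).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique_ids.append(image_id)
    return unique_ids


def passes_quality(width, height, min_side=224, max_aspect=4.0):
    """Reject images that are too small or too elongated (assumed thresholds)."""
    if min(width, height) < min_side:
        return False
    return max(width, height) / min(width, height) <= max_aspect


kept = dedupe_by_content_hash([("a", b"\x89PNG-1"), ("b", b"\x89PNG-2"), ("c", b"\x89PNG-1")])
print(kept)
```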
Feature Pipeline
Feature Engineering:
Images: Resize to 224x224, normalize pixel values, and apply data augmentation (jitter, crop) only during training.
Text: Tokenization using Byte-Pair Encoding (BPE).
Offline Feature Pipeline: Spark-based inference to generate 100M embeddings.
Online Feature Pipeline: The query image or text is processed with the same preprocessing and the same encoder version that produced the indexed embeddings, which keeps query and document vectors in one space and avoids training-serving skew.
Feature Store: The Vector DB acts as the online feature store for image embeddings.
Model Architecture
Problem Formulation: Supervised representation learning using a contrastive objective (InfoNCE loss).
Candidate Model Families:
ResNet50 (Baseline for images).
ViT (Vision Transformer) + Transformer (Text): Chosen for CLIP because of better global feature capture.
Architecture Design: Two distinct encoders f_{img} and f_{txt} projecting into a latent space of size d.
Model Complexity: The ViT-B/32 image tower has ~86M parameters (the text encoder is additional). Query encoding is fast enough for real-time serving (approx. 20-50 ms on a GPU).
Optimization: Quantization to INT8 for the inference model to reduce latency and cost.
Training Pipeline
Dataset Construction: Pairs of (image, text_description).
Label Construction: In a batch of N pairs, there are N positive matches and N^2 - N negative matches.
Infrastructure: Distributed training using PyTorch DistributedDataParallel (DDP) on a GPU cluster.
Retraining Strategy: Retrain quarterly or when data drift (new visual trends) is detected.
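The label construction above (N positives on the diagonal, N^2 - N in-batch negatives) maps directly onto a symmetric InfoNCE loss. The following is a dependency-free sketch for illustration, not the training code: it assumes embeddings are already L2-normalized and uses a fixed temperature of 0.07 as an assumed value. Real training would use a tensor library with autograd.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum_exp for x in row]


def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over N L2-normalized (image, text) pairs.

    Pair i is the positive; the other N - 1 entries in each row and
    column act as in-batch negatives.
    """
    n = len(image_embs)
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_embs]
        for img in image_embs
    ]
    # Image-to-text: row i should place its probability mass on column i.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image: same computation on the transposed logits.
    transposed = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(transposed[i])[i] for i in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
collapsed = [[1.0, 0.0], [1.0, 0.0]]

loss_aligned = info_nce(aligned, aligned)      # positives on the diagonal
loss_swapped = info_nce(aligned, swapped)      # every pair mismatched
loss_collapsed = info_nce(collapsed, collapsed)  # all embeddings identical
print(loss_aligned, loss_swapped, loss_collapsed)
```

When all embeddings collapse to the same vector, every logit is equal and the loss reduces to log N, which is a handy sanity check for an implementation.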
Serving Pipeline
Serving Pattern: Request-response for the query encoder; ANN lookup in the Vector DB.
Latency Optimization:
Caching: Cache embeddings for popular text queries (e.g., "puppy").
Quantization: Use Product Quantization (PQ) in the Vector DB to reduce memory and speed up distance calculations.
Reliability: Fall back to metadata search (Elasticsearch) if the Vector DB service is unreachable.
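The caching and fallback ideas in this section can be combined in a thin wrapper around the retrieval path. Everything named here (`make_search`, `VectorStoreUnavailable`, and the toy encoder and search functions) is a hypothetical stand-in; a real client library will have its own exception types and call signatures.

```python
from functools import lru_cache
from typing import Callable, List, Sequence, Tuple


class VectorStoreUnavailable(Exception):
    """Hypothetical error raised when the vector store cannot be reached."""


def make_search(
    encode: Callable[[str], Sequence[float]],
    ann_search: Callable[[Tuple[float, ...], int], List[str]],
    keyword_search: Callable[[str, int], List[str]],
    cache_size: int = 10_000,
) -> Callable[..., List[str]]:
    @lru_cache(maxsize=cache_size)
    def cached_encode(query: str) -> Tuple[float, ...]:
        # Tuples are immutable, so callers cannot corrupt cached embeddings.
        return tuple(encode(query))

    def search(query: str, k: int = 10) -> List[str]:
        try:
            return ann_search(cached_encode(query), k)
        except VectorStoreUnavailable:
            # Degrade to metadata/keyword search instead of failing the request.
            return keyword_search(query, k)

    return search


# Toy stand-ins for the encoder, the vector store, and the keyword index.
encode_calls = []

def fake_encode(query):
    encode_calls.append(query)
    return [float(len(query)), 1.0]

def healthy_ann(vector, k):
    return ["img_1", "img_2", "img_3"][:k]

def broken_ann(vector, k):
    raise VectorStoreUnavailable("connection timed out")

def fake_keyword(query, k):
    return ["tag_match_1", "tag_match_2"][:k]

search = make_search(fake_encode, healthy_ann, fake_keyword)
first = search("puppy", k=2)
second = search("puppy", k=2)  # embedding is served from the cache
degraded = make_search(fake_encode, broken_ann, fake_keyword)("puppy", k=2)
```

Caching only the text embedding (rather than the final result list) keeps results fresh as new images are indexed while still skipping the encoder call for popular queries.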
Evaluation Pipeline
Offline: Use the COCO dataset to calculate Recall@1, Recall@10.
Online: A/B test the new CLIP-based model against a legacy tag-based search. Measure Search-to-Click Conversion.
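For the offline metric, Recall@K in the single-ground-truth setting can be computed as below. This sketch assumes each query has exactly one relevant image and that `ranked_results` maps a query id to an ordered list of retrieved image ids; both names and the toy data are illustrative.

```python
def recall_at_k(ranked_results, ground_truth, k):
    """Fraction of queries whose ground-truth image appears in the top k."""
    hits = sum(
        1
        for query_id, ranked in ranked_results.items()
        if ground_truth[query_id] in ranked[:k]
    )
    return hits / len(ranked_results)


ranked_results = {
    "q1": ["a", "b", "c"],
    "q2": ["d", "e", "f"],
    "q3": ["g", "h", "i"],
    "q4": ["j", "k", "l"],
}
ground_truth = {"q1": "a", "q2": "e", "q3": "i", "q4": "z"}

for k in (1, 2, 3):
    print(k, recall_at_k(ranked_results, ground_truth, k))
```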
Monitoring Pipeline
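Data Monitoring: Track the distribution of embedding norms, measured before L2 normalization (normalized vectors always have norm 1). A significant shift suggests that the input distribution has drifted or that the model is degrading.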
Performance Monitoring: Monitor the "Recall Drift" by periodically running a golden set of queries and checking if the expected images still appear in Top-K.
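One simple way to flag a shift in embedding norms is a z-score of the current batch mean against a baseline window. This is a sketch under stated assumptions: norms are taken before L2 normalization, the alert threshold of 3 is arbitrary, and the sample values are invented.

```python
import math
import statistics


def norm_shift_zscore(baseline_norms, current_norms):
    """Z-score of the current mean norm relative to the baseline distribution."""
    baseline_mean = statistics.fmean(baseline_norms)
    baseline_std = statistics.stdev(baseline_norms)
    # Standard error of the mean for a batch of this size.
    standard_error = baseline_std / math.sqrt(len(current_norms))
    return (statistics.fmean(current_norms) - baseline_mean) / standard_error


ALERT_THRESHOLD = 3.0  # assumed; tune against historical false-alarm rates

baseline = [9.8, 10.0, 10.2, 10.1, 9.9]
z_stable = norm_shift_zscore(baseline, [10.0, 9.9, 10.1, 10.0])
z_drifted = norm_shift_zscore(baseline, [12.0, 12.1, 11.9, 12.0])

print(abs(z_stable) > ALERT_THRESHOLD, abs(z_drifted) > ALERT_THRESHOLD)
```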
Wrap Up
Final Evaluation
Observability: Tracking P99 latency of the Query Encoder vs. Vector DB lookup.
Edge Cases:
Cold Start: New images are embedded and added to the index within minutes of upload via the Kafka stream, in line with the near-real-time freshness target.
Low-Quality Queries: If text query embedding has low similarity to all items, return "No high-confidence results found."
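The low-confidence edge case amounts to a score threshold applied after retrieval. In this sketch the cutoff of 0.2 is an assumed placeholder that would need tuning on held-out queries, and `filter_confident` is a hypothetical helper name.

```python
NO_RESULTS_MESSAGE = "No high-confidence results found."


def filter_confident(results, min_score=0.2):
    """Drop hits below the similarity cutoff.

    `results` is a list of (image_id, cosine_score) pairs sorted by score.
    """
    confident = [(image_id, score) for image_id, score in results if score >= min_score]
    if not confident:
        return {"results": [], "message": NO_RESULTS_MESSAGE}
    return {"results": confident, "message": None}


good = filter_confident([("img_1", 0.31), ("img_2", 0.18)])
empty = filter_confident([("img_3", 0.05), ("img_4", 0.02)])
print(good)
print(empty)
```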
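Trade-offs: Accuracy vs. Memory. HNSW provides higher accuracy at low latency but requires more RAM than IVF-PQ. For 100M images, we choose HNSW with PQ-compressed vectors, accepting some recall loss from compression in exchange for a much smaller memory footprint.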