The Question

Large-Scale Multi-Modal Image Search System

Design a high-scale image search system capable of handling 100 million images. The system must support both text-to-image and image-to-image queries with sub-200ms P99 latency. Your design should detail the multi-modal representation learning approach, the architecture for approximate nearest neighbor (ANN) search at scale, and how you ensure consistent embedding spaces between training and serving. Address data ingestion for billion-scale corpora and how you handle the reliability of the vector retrieval path.

CLIP

Transformers

ViT

HNSW

FAISS

Product Quantization

Spark

Kafka

Vector Database

InfoNCE Loss

Questions & Insights

Clarifying Questions

Clarifying Questions & Constraints:

Business Goal: Is the goal to find exact duplicates (e.g., copyright) or semantic matches (e.g., "sunset on a beach")? Assumption: Semantic search to drive user engagement.

Modality: Are we supporting Text-to-Image, Image-to-Image, or both? Assumption: Both.

Scale: What is the corpus size and QPS? Assumption: 100M images, 1,000 QPS.

Latency: What is the P99 latency budget? Assumption: 200ms for the entire retrieval path.

Freshness: How quickly must a newly uploaded image be searchable? Assumption: Near real-time (minutes).

Assumptions:

Corpus: 100M images.

Latency: < 200ms P99.

Strategy: Use a unified embedding space for multi-modal retrieval.

Thinking Process

Identify the Core Bottleneck: Search across 100M items cannot be done by brute-force comparison. We need an Approximate Nearest Neighbor (ANN) solution.

Representation Learning: How do we map text and images to the same space? A CLIP-style (Contrastive Language-Image Pre-training) dual-tower architecture is the standard for bridging modalities.

Two-Stage Architecture: A common pattern is high-recall ANN retrieval followed by a second-stage re-ranker. Per YAGNI, the MVP starts with single-stage, high-quality retrieval; a re-ranker is added only if it proves necessary.

Scale and Efficiency: 100M embeddings at 512 dimensions (float32) require about 205GB of RAM for the raw vectors alone (100M × 512 × 4 bytes), before any index overhead. We must use quantization (Product Quantization) to fit this in memory or use a managed Vector DB.
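The memory estimate above can be checked with a few lines of arithmetic. This is a minimal sketch: the PQ setting of 64 one-byte codes per vector is an assumption chosen for illustration, and the figures cover raw vector storage only, not graph or index overhead.

```python
def index_size_gb(num_vectors: int, bytes_per_vector: int) -> float:
    """Raw vector storage in gigabytes (1 GB = 10**9 bytes), excluding index overhead."""
    return num_vectors * bytes_per_vector / 1e9


NUM_IMAGES = 100_000_000
DIM = 512

# Uncompressed float32: 4 bytes per dimension.
raw_gb = index_size_gb(NUM_IMAGES, DIM * 4)

# Product Quantization with 64 sub-vectors and 8-bit codes (assumed settings):
# each vector is stored as 64 one-byte centroid ids.
PQ_SUBVECTORS = 64
pq_gb = index_size_gb(NUM_IMAGES, PQ_SUBVECTORS * 1)

print(f"float32: {raw_gb:.1f} GB, PQ codes: {pq_gb:.1f} GB")
```

Under these assumptions the compressed codes are 32 times smaller than the float32 vectors, which is what makes a single-node in-memory index plausible.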

Elite Bonus Points

Cross-Modal Alignment Calibration: Using temperature scaling in the contrastive loss to keep the model from becoming overconfident and to improve the alignment between text and image clusters.

Hierarchical Navigable Small World (HNSW) vs. IVF-PQ: Choosing HNSW for high-precision low-latency retrieval while acknowledging the memory trade-off compared to Inverted File Index with Product Quantization.

Query Expansion via LLMs: Using an LLM to expand a short user text query into a more descriptive one before embedding to improve semantic hit rates.

Negative Mining: Implementing "Hard Negative Mining" during training to ensure the model distinguishes between very similar but distinct images (e.g., different types of "golden retrievers").
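As an illustration of the hard negative mining idea, the sketch below picks, for each text, the most similar image that is not its own match. It assumes L2-normalized embeddings where row i of each matrix forms a positive pair; the function name is hypothetical, and production pipelines usually mine across the whole corpus rather than within one small batch.

```python
import numpy as np


def hardest_negatives(text_emb: np.ndarray, image_emb: np.ndarray) -> np.ndarray:
    """For each text i (whose positive is image i), return the index of the
    most similar non-matching image."""
    sims = text_emb @ image_emb.T      # (N, N) similarity matrix
    np.fill_diagonal(sims, -np.inf)    # mask out the positive pairs
    return sims.argmax(axis=1)


text_emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
image_emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]])
print(hardest_negatives(text_emb, image_emb))
```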

Design Breakdown

Requirements

Product Goal: Enable users to find relevant images using either text descriptions or reference images.

Success Metrics:

Online Metrics: Click-Through Rate (CTR) on search results, Mean Reciprocal Rank (MRR).

Offline Metrics: Recall@K, Mean Average Precision (mAP), Normalized Discounted Cumulative Gain (NDCG).

Guardrail Metrics: P99 Latency, Vector DB CPU/Memory utilization.

System Constraints: 100M image corpus, 1k QPS, <200ms latency.

Data Availability: Image-caption pairs (e.g., COCO, LAION-lite), user click logs.
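Two of the ranking metrics named under Success Metrics can be computed as follows. This is a minimal sketch with hypothetical function names: relevance is a set of ids for the reciprocal rank (averaged over queries to get MRR) and a dict of graded gains for NDCG.

```python
import math


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, relevance, k):
    """NDCG@k with linear gains; ids missing from `relevance` count as gain 0."""
    dcg = sum(
        relevance.get(item, 0) / math.log2(pos + 1)
        for pos, item in enumerate(ranked_ids[:k], start=1)
    )
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(pos + 1) for pos, g in enumerate(ideal_gains, start=1))
    return dcg / idcg if idcg > 0 else 0.0


print(reciprocal_rank(["a", "b", "c"], {"b"}))
print(ndcg_at_k(["y", "x"], {"x": 2, "y": 1}, k=2))
```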

ML Problem Framing

ML Task Type: Representation Learning and Extreme Classification (via Retrieval).

Prediction Target: Maximize the cosine similarity $S$ between the query embedding $E_q$ and the document embedding $E_d$:

$$S = \cos(\theta) = \frac{E_q \cdot E_d}{\|E_q\| \|E_d\|}$$

Inputs:

Query: Raw text string or raw image bytes.

Item: Image pixels and metadata (tags, alt-text).

Outputs: Top-K ranked list of Image IDs.

ML Challenges: Multi-modal alignment, scale of the index, and handling "dead" embeddings that are never retrieved.
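The scoring rule and Top-K output above can be written as an exact brute-force search. At 100M items this is too slow to serve, but a sketch like this is a useful reference for what the ANN index approximates. The function names are illustrative.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def top_k_cosine(query: np.ndarray, corpus: np.ndarray, k: int):
    """Return (indices, scores) of the k corpus rows most similar to the query."""
    scores = l2_normalize(corpus) @ l2_normalize(query)
    top = np.argsort(-scores)[:k]
    return top, scores[top]


corpus = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([2.0, 0.0])
ids, scores = top_k_cosine(query, corpus, k=2)
print(ids, scores)
```

Because both sides are normalized first, the magnitude of the query vector does not affect the ranking.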

Design Summary & MVP

Concise Summary: A Two-Tower CLIP-style architecture where images and text are projected into a shared 512-dimensional embedding space, indexed in a Vector Database using HNSW for sub-200ms retrieval.

Model Architecture & Selection:

Baseline Model: Simple TF-IDF on image metadata/tags.

Target Model: Dual-tower transformer (Vision Transformer for images, BERT-style for text) trained with Contrastive Loss.

Choice Rationale: Deep semantic understanding outperforms keyword matching, especially for images without rich metadata.

ML Life Cycle Summary: Raw data (S3) is processed via Spark. Embeddings are generated and stored in a Vector DB. An Inference service encodes the query and performs ANN search.

Simplicity Audit: We skip the re-ranking stage and complex graph-based expansion for the MVP to minimize infrastructure complexity and latency.

Architecture Decision Rationale:

Dual-tower allows for pre-computation of image embeddings, making the online search phase extremely fast (only query encoding + ANN search).

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Application logs (clicks), S3 buckets for raw images, and relational DBs for metadata/tags.

Data Ingestion: Batch processing for the initial 100M images using Airflow/Spark. For new uploads, a Kafka stream triggers the embedding generation.

Data Storage: Raw images remain in S3. Embeddings are stored in a specialized Vector Database (e.g., Milvus or Pinecone).

Data Quality: Image hash checks to remove duplicates; validation of aspect ratios and minimum resolution to ensure quality.
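The data quality checks could look like the sketch below. The thresholds are assumptions chosen for illustration, not values from the design. A SHA-256 content hash only catches byte-identical duplicates; near-duplicates (resized or re-encoded copies) would need a perceptual hash, which is not shown here.

```python
import hashlib

MIN_SIDE_PX = 224        # assumed minimum resolution
MAX_ASPECT_RATIO = 4.0   # assumed limit on long side / short side


def is_valid_image(width: int, height: int) -> bool:
    """Reject images that are too small or too elongated."""
    short, long_ = min(width, height), max(width, height)
    if short < MIN_SIDE_PX:
        return False
    return long_ / short <= MAX_ASPECT_RATIO


def dedupe_exact(images):
    """images: iterable of (image_id, raw_bytes). Keeps the first id seen for
    each distinct byte content."""
    seen = set()
    kept = []
    for image_id, raw in images:
        digest = hashlib.sha256(raw).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(image_id)
    return kept


print(dedupe_exact([("a", b"x"), ("b", b"y"), ("c", b"x")]))
print(is_valid_image(640, 480), is_valid_image(2000, 300))
```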

Feature Pipeline

Feature Engineering:

Images: Resize to 224x224, normalize pixel values, and apply data augmentation (jitter, crop) only during training.

Text: Tokenization using Byte-Pair Encoding (BPE).

Offline Feature Pipeline: Spark-based inference to generate 100M embeddings.

Online Feature Pipeline: The query image or text is processed by the same encoder and preprocessing used in training to keep embeddings consistent (avoiding training-serving skew).

Feature Store: The Vector DB acts as the online feature store for image embeddings.
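One way to keep preprocessing identical between training and serving is to share a single function and fingerprint its configuration. The sketch below assumes the image is already resized to 224x224; the mean and standard deviation of 0.5 are placeholder values, and `config_fingerprint` is a hypothetical helper whose output could be stored with the model checkpoint and compared at serving time.

```python
import hashlib
import json

import numpy as np

# Placeholder normalization constants (assumed, not taken from a real checkpoint).
PREPROCESS_CONFIG = {"size": 224, "mean": [0.5, 0.5, 0.5], "std": [0.5, 0.5, 0.5]}


def config_fingerprint(cfg: dict) -> str:
    """Short stable hash of the preprocessing config."""
    payload = json.dumps(cfg, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()[:12]


def normalize_pixels(image_uint8: np.ndarray, cfg: dict = PREPROCESS_CONFIG) -> np.ndarray:
    """Convert an HxWx3 uint8 image (already resized) to a 3xHxW float32 tensor."""
    expected = (cfg["size"], cfg["size"], 3)
    if image_uint8.shape != expected:
        raise ValueError(f"expected shape {expected}, got {image_uint8.shape}")
    x = image_uint8.astype(np.float32) / 255.0
    mean = np.array(cfg["mean"], dtype=np.float32)
    std = np.array(cfg["std"], dtype=np.float32)
    return ((x - mean) / std).transpose(2, 0, 1)


white = np.full((224, 224, 3), 255, dtype=np.uint8)
print(normalize_pixels(white).shape, config_fingerprint(PREPROCESS_CONFIG))
```

Training-only augmentation (jitter, crop) would sit in front of this function and never run at serving time.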

Model Architecture

Problem Formulation: Supervised representation learning using a contrastive objective (InfoNCE loss).

Candidate Model Families:

ResNet50 (Baseline for images).

ViT (Vision Transformer) + Transformer (Text): Chosen for CLIP because of better global feature capture.

Architecture Design: Two distinct encoders, $f_{img}$ and $f_{txt}$, projecting into a latent space of size $d$.

Model Complexity: The ViT-B/32 image encoder has ~88M parameters. Inference is fast enough for real-time use (approx. 20-50ms on GPU).

Optimization: Quantization to INT8 for the inference model to reduce latency and cost.
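To make the INT8 idea concrete, here is a minimal sketch of symmetric per-tensor weight quantization in NumPy. It only illustrates the size and error trade-off; a real deployment would rely on the quantization tooling of the serving framework, and the weight shape below is an arbitrary assumption.

```python
import numpy as np


def quantize_int8(weights: np.ndarray):
    """Map float weights to int8 using one scale for the whole tensor."""
    max_abs = float(np.abs(weights).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights from int8 codes."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
w = rng.standard_normal((768, 512)).astype(np.float32)
q, scale = quantize_int8(w)
err = float(np.abs(w - dequantize(q, scale)).max())
print(f"{w.nbytes} bytes -> {q.nbytes} bytes, max abs error {err:.4f}")
```

Storage drops by a factor of four, and the rounding error per weight is bounded by half the scale.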

Training Pipeline

Dataset Construction: Pairs of (image, text_description).

Label Construction: In a batch of $N$ pairs, there are $N$ positive matches and $N^2 - N$ negative matches.

Infrastructure: Distributed training using PyTorch DistributedDataParallel (DDP) on a GPU cluster.

Retraining Strategy: Retrain quarterly or when data drift (new visual trends) is detected.
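The in-batch labeling scheme above (N positives on the diagonal, N^2 - N negatives off it) leads to a symmetric InfoNCE loss. The following is a NumPy sketch of the forward computation only, assuming row i of each matrix is a matching pair; the temperature of 0.07 is an assumed starting value, and actual training would use an autodiff framework such as PyTorch.

```python
import numpy as np


def log_softmax(logits: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along one axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(image_emb, text_emb, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy, where the
    correct class for row i is column i."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature   # (N, N)
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)


rng = np.random.default_rng(0)
images = rng.standard_normal((8, 16))
texts_aligned = images + 0.05 * rng.standard_normal((8, 16))
texts_shuffled = texts_aligned[::-1]   # every pair is now mismatched
print(symmetric_info_nce(images, texts_aligned))
print(symmetric_info_nce(images, texts_shuffled))
```

Lowering the temperature sharpens the softmax, so the same similarity gap produces a larger penalty for confusable negatives.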

Serving Pipeline

Serving Pattern: Request-response for the query encoder; ANN lookup in the Vector DB.

Latency Optimization:

Caching: Cache embeddings for popular text queries (e.g., "puppy").

Quantization: Use Product Quantization (PQ) in the Vector DB to reduce memory and speed up distance calculations.
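Product Quantization, as used for the memory reduction above, splits each vector into sub-vectors and stores one centroid id per sub-vector. The toy sketch below trains the codebooks with a few plain k-means iterations; sizes, iteration count, and function names are assumptions for illustration. A library such as FAISS would be used in practice.

```python
import numpy as np


def train_codebooks(x, num_subvectors, num_centroids, iters=10, seed=0):
    """Run a few k-means iterations independently on each sub-vector slice.
    Assumes the dimension is divisible by num_subvectors."""
    rng = np.random.default_rng(seed)
    n, dim = x.shape
    sub_dim = dim // num_subvectors
    codebooks = np.empty((num_subvectors, num_centroids, sub_dim), dtype=np.float32)
    for j in range(num_subvectors):
        chunk = x[:, j * sub_dim:(j + 1) * sub_dim]
        centroids = chunk[rng.choice(n, num_centroids, replace=False)].copy()
        for _ in range(iters):
            dists = ((chunk[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
            assign = dists.argmin(axis=1)
            for c in range(num_centroids):
                members = chunk[assign == c]
                if len(members) > 0:
                    centroids[c] = members.mean(axis=0)
        codebooks[j] = centroids
    return codebooks


def encode(x, codebooks):
    """Replace each sub-vector with the id of its nearest centroid (one byte,
    so at most 256 centroids per codebook)."""
    m, _, sub_dim = codebooks.shape
    codes = np.empty((x.shape[0], m), dtype=np.uint8)
    for j in range(m):
        chunk = x[:, j * sub_dim:(j + 1) * sub_dim]
        dists = ((chunk[:, None, :] - codebooks[j][None, :, :]) ** 2).sum(axis=-1)
        codes[:, j] = dists.argmin(axis=1)
    return codes


def decode(codes, codebooks):
    """Rebuild approximate vectors by concatenating the selected centroids."""
    parts = [codebooks[j][codes[:, j]] for j in range(codebooks.shape[0])]
    return np.concatenate(parts, axis=1)


rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 32)).astype(np.float32)
codebooks = train_codebooks(vectors, num_subvectors=4, num_centroids=16)
codes = encode(vectors, codebooks)
approx = decode(codes, codebooks)
print("bytes per vector:", vectors[0].nbytes, "->", codes[0].nbytes)
```

The reconstruction is lossy, which is the source of the recall cost that PQ trades for memory.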

Reliability: Use a "Fallback to Metadata Search" (ElasticSearch) if the Vector DB service is unreachable.
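The serving path with a query-embedding cache and a metadata fallback might be wired together as in this sketch. All callables are hypothetical stand-ins: `encode_text` for the query encoder, `ann_search` for the Vector DB client, and `metadata_search` for the ElasticSearch client. It also assumes an unreachable Vector DB surfaces as a `ConnectionError`.

```python
from functools import lru_cache


def make_search(encode_text, ann_search, metadata_search, cache_size=10_000):
    """Build a search function with an LRU cache on query embeddings and a
    keyword fallback when the vector path is unavailable."""
    cached_encode = lru_cache(maxsize=cache_size)(encode_text)

    def search(query, k=10):
        embedding = cached_encode(query)
        try:
            return "vector", ann_search(embedding, k)
        except ConnectionError:
            return "metadata_fallback", metadata_search(query, k)

    return search


# Stub dependencies for demonstration only.
encoder_calls = []


def fake_encode(query):
    encoder_calls.append(query)
    return (float(len(query)), 0.0)


def healthy_ann(embedding, k):
    return ["img_1", "img_2"][:k]


def broken_ann(embedding, k):
    raise ConnectionError("vector DB unreachable")


def fake_metadata_search(query, k):
    return ["tagged_" + query][:k]


search = make_search(fake_encode, healthy_ann, fake_metadata_search)
print(search("puppy", k=2))
print(search("puppy", k=2))   # second call reuses the cached embedding

degraded = make_search(fake_encode, broken_ann, fake_metadata_search)
print(degraded("puppy", k=1))
```

Returning the path label alongside the results makes it easy to monitor how often the fallback is serving traffic.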

Evaluation Pipeline

Offline: Use the COCO dataset to calculate Recall@1, Recall@10.

Online: A/B test the new CLIP-based model against a legacy tag-based search. Measure Search-to-Click Conversion.
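For the offline evaluation, Recall@K on a caption-to-image benchmark is commonly computed as the fraction of queries whose ground-truth image appears in the top K. The sketch below assumes one ground-truth image per query and uses a tiny hand-made example rather than real COCO data.

```python
def recall_at_k(ranked_results, ground_truth, k):
    """Fraction of queries whose ground-truth image id is in the top-k list.

    ranked_results: dict of query -> ranked list of image ids
    ground_truth:   dict of query -> the single correct image id
    """
    if not ground_truth:
        return 0.0
    hits = sum(
        1
        for query, target in ground_truth.items()
        if target in ranked_results.get(query, [])[:k]
    )
    return hits / len(ground_truth)


ground_truth = {"q1": "a", "q2": "b", "q3": "c", "q4": "d"}
ranked_results = {
    "q1": ["a", "x", "y"],
    "q2": ["x", "b", "y"],
    "q3": ["x", "y", "z"],
    "q4": ["x", "y", "d"],
}
print(recall_at_k(ranked_results, ground_truth, k=1))
print(recall_at_k(ranked_results, ground_truth, k=10))
```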

Monitoring Pipeline

Data Monitoring: Track the distribution of embedding norms. If the norm shifts significantly, the model may be degrading.

Performance Monitoring: Monitor the "Recall Drift" by periodically running a golden set of queries and checking if the expected images still appear in Top-K.
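The embedding-norm check under Data Monitoring can be as simple as comparing the mean norm of recent embeddings to a stored baseline. In this sketch the 10% tolerance is an assumed value, and the check only makes sense on norms taken before L2 normalization, since normalized vectors always have norm 1.

```python
import numpy as np


def mean_norm(embeddings: np.ndarray) -> float:
    """Average L2 norm of a batch of (pre-normalization) embeddings."""
    return float(np.linalg.norm(embeddings, axis=1).mean())


def norm_shift_alert(baseline_mean_norm, current_embeddings, tolerance=0.10):
    """Return (alert, relative_shift) comparing the current mean norm to the baseline."""
    current = mean_norm(current_embeddings)
    relative_shift = abs(current - baseline_mean_norm) / baseline_mean_norm
    return relative_shift > tolerance, relative_shift


baseline = mean_norm(np.ones((100, 4)))           # each row has norm 2.0
print(norm_shift_alert(baseline, np.ones((50, 4))))
print(norm_shift_alert(baseline, 1.5 * np.ones((50, 4))))
```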

Wrap Up

Final Evaluation

Observability: Tracking P99 latency of the Query Encoder vs. Vector DB lookup.

Edge Cases:

Cold Start: New images are embedded and added to the index within minutes of upload via the Kafka stream, matching the near-real-time freshness assumption.

Low-Quality Queries: If text query embedding has low similarity to all items, return "No high-confidence results found."
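The low-confidence rule above amounts to a similarity floor applied to the ANN results. A minimal sketch, assuming results arrive as (image_id, cosine_score) pairs; the 0.2 threshold is a placeholder that would need tuning on real query logs.

```python
NO_RESULTS_MESSAGE = "No high-confidence results found."


def apply_confidence_floor(scored_results, min_score=0.2):
    """Drop results below the similarity floor; report a message if none remain."""
    confident = [(image_id, score) for image_id, score in scored_results if score >= min_score]
    if not confident:
        return {"results": [], "message": NO_RESULTS_MESSAGE}
    return {"results": confident, "message": None}


print(apply_confidence_floor([("img_1", 0.31), ("img_2", 0.12)]))
print(apply_confidence_floor([("img_3", 0.05)]))
```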

Trade-offs: Accuracy vs. Memory. HNSW provides higher accuracy but requires more RAM. For 100M images, we choose HNSW but with PQ-compressed vectors.