The Question
ML Design
Scalable Similar Listings Recommendation System
Design a high-scale 'Similar Listings' recommendation engine for a global vacation rental platform with 10M+ properties. The system must surface relevant, available alternatives in real-time (<150ms P99) while a user browses a Listing Detail Page. Focus on the end-to-end ML lifecycle: from multi-modal feature engineering (text, images, metadata) and handling unique inventory availability constraints, to a two-stage retrieval and ranking architecture. Discuss strategies for addressing listing cold-start, training/serving data consistency, and how to optimize for long-term business metrics like booking conversion rate.
LightGBM
Item2Vec
HNSW
FAISS
Spark
Kafka
Redis
Word2Vec
Vision Transformer
Questions & Insights
Clarifying Questions
Clarifying Questions & Constraints:
Business Goal: Is the primary North Star metric Bookings (conversion) or Listing Detail Page (LDP) views (engagement)? Answer: Conversion (Bookings).
Constraints & Scale: What is the scale of listings and traffic? Answer: 10M active listings, 100M monthly active users (MAU), 5k QPS at peak.
Latency Budget: What is the P99 latency requirement? Answer: 150ms for the entire recommendation component.
Freshness: How quickly must new listings appear in "Similar Listings"? Answer: Within 1 hour (Cold start is critical).
Availability: Should we only show listings available for the user's specific dates? Answer: Yes, availability is a hard constraint.
Assumptions:
A corpus of 10M listings.
P99 latency requirement of 150ms.
We have access to rich metadata (amenities, price, location) and high-quality images.
Thinking Process
Identify the Bottleneck: Similarity in vacation rentals is multi-faceted. It's not just "looks like this house," but "serves the same travel intent" (price point, location proximity, and group size).
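Retrieval vs. Ranking: With 10M items, scoring every listing with a full ranking model on each request cannot fit in a 150ms budget, so a single-stage model is impractical. I need a Two-Stage approach: 1) Fast Retrieval (Approximate Nearest Neighbors) to cut 10M listings down to ~100 candidates and 2) Precise Ranking (Pointwise/Pairwise re-ranking) over those candidates only.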
The "Availability" Problem: Unlike E-commerce, inventory is unique and time-bound. A "similar" house that is booked for the user's dates is a dead end. Filtering must happen either during retrieval or immediately after.
Scaling the Solution: Leverage listing embeddings (Item2Vec or Content-based) for retrieval to handle the scale and use a GBDT or lightweight MLP for ranking to meet the 150ms budget.
Elite Bonus Points
Multi-modal Embeddings: Using a late-fusion approach to combine image embeddings (from a pre-trained Vision Transformer) with text embeddings (listing descriptions) to capture "vibe" similarity.
Availability-Aware Retrieval: Discussing the trade-off between "Post-filtering" (retrieving 100, filtering down to 10) vs "In-index filtering" (using HNSW with metadata filters) to prevent empty result sets.
Exploration/Exploitation (E&E): Implementing a small epsilon-greedy shuffle to prevent "rich-get-richer" effects and collect data on new, high-potential listings.
Session-Based Personalization: Adjusting "similarity" based on the user's current session (e.g., if they just looked at 3 beach houses, prioritize coastal similarity over price similarity).
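The epsilon-greedy shuffle mentioned above can be sketched as follows. This is a minimal illustration, not a production policy: the function name, the `explore_pool` argument (e.g., new listings with few impressions), and the default `epsilon=0.05` are all assumptions made for the example.

```python
import random
from typing import List, Optional


def epsilon_greedy_slate(
    ranked: List[str],
    explore_pool: List[str],
    k: int = 10,
    epsilon: float = 0.05,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Fill k slots; each slot takes an exploration item with probability epsilon.

    `ranked` is the ranker's output (best first). `explore_pool` is a
    hypothetical set of under-exposed listings we want to collect data on.
    """
    if rng is None:
        rng = random.Random()
    exploit = list(ranked)
    explore = list(explore_pool)
    rng.shuffle(explore)

    slate: List[str] = []
    seen = set()
    while len(slate) < k and (exploit or explore):
        # Explore with probability epsilon, or when the ranked list is exhausted.
        use_explore = bool(explore) and (not exploit or rng.random() < epsilon)
        item = explore.pop() if use_explore else exploit.pop(0)
        if item not in seen:  # never show the same listing twice
            seen.add(item)
            slate.append(item)
    return slate
```

With `epsilon=0.0` the slate is exactly the ranker's top k, which makes the exploration cost easy to switch off during an incident. Logging which slots were exploratory is what makes the collected data usable for training later.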
Design Breakdown
Requirements
Product Goal: Surface 6-12 "Similar Listings" on the Listing Detail Page (LDP) to help users find alternatives and increase booking conversion.
Success Metrics:
Online Metrics: Booking Conversion Rate (CVR), CTR on recommendations, Average Daily Rate (ADR).
Offline Metrics: Recall@K (for retrieval), NDCG, LogLoss/AUC (for ranking).
Guardrail Metrics: P99 Latency, Listing Diversity (to avoid showing 10 identical units in the same building).
System Constraints: 10M items, 5k QPS, <150ms latency.
Data Availability: Listing metadata (price, rooms, location), User clickstream, historical booking logs, listing images.
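The Listing Diversity guardrail above can be enforced with a simple post-ranking cap. The sketch below assumes a hypothetical `building_of` lookup from listing ID to building ID and an illustrative limit of two units per building; neither is specified in the requirements.

```python
from collections import Counter
from typing import Dict, List


def cap_per_building(
    ranked_ids: List[str],
    building_of: Dict[str, str],
    k: int = 10,
    max_per_building: int = 2,
) -> List[str]:
    """Walk the ranked list in order and keep at most N listings per building."""
    counts: Counter = Counter()
    kept: List[str] = []
    for listing_id in ranked_ids:
        # Listings with no known building are treated as their own building.
        building = building_of.get(listing_id, listing_id)
        if counts[building] < max_per_building:
            counts[building] += 1
            kept.append(listing_id)
        if len(kept) == k:
            break
    return kept
```

Because the pass preserves ranker order, it only changes the slate when one building would otherwise dominate it.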
ML Problem Framing
ML Task Type: Two-stage recommendation (Retrieval + Ranking).
Prediction Target: P(\text{Book} | \text{User}, \text{Context}, \text{Candidate Item}).
Inputs:
User: (Optional for MVP) Historical preferences, search filters.
Item (Anchor): Current listing's price, location (GeoHash), amenities, category (e.g., "Tiny Home").
Candidate Items: Features of potential similar listings.
Outputs: A ranked list of Listing IDs.
ML Challenges: Cold start for new listings, extreme data sparsity (most users don't book often), and the "Availability" hard constraint.
Design Summary & MVP
Concise Summary: A two-stage system using Approximate Nearest Neighbors (ANN) on listing embeddings for retrieval, followed by a LightGBM ranker that incorporates real-time availability and price delta features.
Model Architecture & Selection:
Baseline Model: Heuristic-based: "Top 10 listings in the same city within +/- 20% price range."
Target Model: Retrieval: Two-Tower model or Item2Vec embeddings stored in a Vector DB (Milvus/Pinecone). Ranking: LightGBM (Gradient Boosted Decision Trees) for fast, interpretable, and high-performance ranking.
Choice Rationale: GBDTs are typically strong on tabular features (price, room count) and cheap to serve on CPU, which suits a ranking step over ~100 candidates; ANN enables sub-linear search over 10M listings.
ML Life Cycle Summary: Raw logs are processed via Spark; embeddings are generated offline; ANN index is updated hourly; LightGBM ranks the top 100 candidates online.
Simplicity Audit: Avoids complex Graph Neural Networks (GNNs) or real-time Transformers for the MVP, focusing on robust embeddings and efficient GBDT ranking.
Architecture Decision Rationale: This architecture balances the need for semantic similarity (embeddings) with the need for hard-constraint logic (price/location) and low-latency serving.
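The heuristic baseline is worth writing down precisely, since it is also the A/B control. A minimal sketch follows; the `Listing` fields are hypothetical, and because the baseline does not say what "top" means, ordering by price closeness is an assumption made here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Listing:
    listing_id: str
    city: str
    nightly_price: float


def baseline_similar(
    anchor: Listing,
    catalog: List[Listing],
    k: int = 10,
    price_band: float = 0.20,
) -> List[str]:
    """Same city, price within +/- price_band of the anchor, closest price first."""
    low = anchor.nightly_price * (1 - price_band)
    high = anchor.nightly_price * (1 + price_band)
    candidates = [
        c
        for c in catalog
        if c.listing_id != anchor.listing_id
        and c.city == anchor.city
        and low <= c.nightly_price <= high
    ]
    # Assumption: "top" is interpreted as smallest absolute price gap.
    candidates.sort(key=lambda c: abs(c.nightly_price - anchor.nightly_price))
    return [c.listing_id for c in candidates[:k]]
```

In practice this filter would run as an indexed query against the listing store rather than a scan over an in-memory list.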
System Architecture
Pipeline Deep Dive
Data Pipeline
Data Source: Clickstream (listing views, "save to wishlist"), Transaction logs (bookings), Listing Metadata (updated via CDC from production DB).
Data Ingestion: Kafka for real-time events; Airflow for orchestrating batch ingestion from Listing DB.
Data Storage: S3 for the Data Lake (Parquet format for efficiency). Partitioning by date and region.
Data Processing: Spark for heavy-duty joins (User-Item interactions) and sessionization.
Data Quality: De-duplication of events, schema validation (Great Expectations), and checking for "orphan listings" (listings with no metadata).
Feature Pipeline
Feature Definition:
Item (Static): Price, No. of bedrooms, Latitude/Longitude, Amenities (WiFi, Pool, etc.).
Item (Dynamic): 7-day CTR, Booking rate, availability calendar.
Context: Date of stay, number of guests.
Feature Engineering:
Geohashing: Convert Lat/Long to Geohashes of varying precision for proximity matching.
Price Bucketing: Normalize price as a ratio to the city's median price, then bucket the ratio so that listings are comparable across markets.
Online Feature Store: A feature store framework (e.g., Tecton or Feast) backed by Redis as the online store, holding real-time listing counters (e.g., "times viewed in last hour").
Training/Serving Skew: Use a single Feature Definition library for both Spark (offline) and the Online Service to ensure feature consistency.
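The two feature-engineering steps above (geohashing and city-relative price) are small enough to live in the shared Feature Definition library. A sketch under stated assumptions: the function names are illustrative, and the geohash encoder is a plain implementation of the standard base-32 scheme rather than any particular library's API.

```python
from statistics import median
from typing import Sequence

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat: float, lon: float, precision: int = 6) -> str:
    """Standard geohash: interleave longitude/latitude bisection bits, 5 bits per char."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def price_ratio_to_city_median(price: float, city_prices: Sequence[float]) -> float:
    """Price expressed as a multiple of the city's median nightly price."""
    return price / median(city_prices)
```

A shorter geohash is a prefix of a longer one for the same point, so "varying precision" can be served by truncating a single stored hash. Note that two nearby points on opposite sides of a cell boundary can share no prefix, which is one reason to keep Haversine distance as a ranking feature as well.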
Model Architecture
Problem Formulation: Pointwise ranking: Predict the probability of a booking for a candidate item given the anchor item.
Retrieval Architecture: Item2Vec. Treat a user's session of viewed listings as a "sentence" and listings as "words." Train Word2Vec to learn listing embeddings.
Why? Captures co-occurrence (people who look at X also look at Y).
Ranking Architecture: LightGBM.
Features: Embedding cosine similarity, Haversine distance between anchor and candidate, price difference, star rating difference.
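Model Complexity: Item2Vec (128d embeddings) + LightGBM (500 trees). Scoring ~100 candidates with a model of this size is cheap on CPU, so most of the 150ms P99 budget remains for retrieval and the availability lookup.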
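The anchor-candidate ranking features listed above are simple to compute at request time. The sketch below assumes each listing is a dict with hypothetical keys `embedding`, `lat`, `lon`, `price`, and `rating`; the feature names are illustrative, not a fixed schema.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # undefined for zero vectors; fall back to "no similarity"
    return dot / (norm_a * norm_b)


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres on a sphere of mean Earth radius."""
    radius_km = 6371.0088
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    h = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(h)))


def pair_features(anchor: Dict, candidate: Dict) -> Dict[str, float]:
    """One feature row for the ranker, describing candidate relative to anchor."""
    return {
        "embedding_cosine": cosine_similarity(anchor["embedding"], candidate["embedding"]),
        "distance_km": haversine_km(anchor["lat"], anchor["lon"], candidate["lat"], candidate["lon"]),
        "price_diff": candidate["price"] - anchor["price"],
        "rating_diff": candidate["rating"] - anchor["rating"],
    }
```

Keeping the differences signed (candidate minus anchor) lets the trees learn asymmetric preferences, such as users tolerating a cheaper alternative more readily than a pricier one.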
Training Pipeline
Dataset Construction:
Positive Labels: Bookings.
Negative Labels: Sampled from listings shown but not clicked, or random listings from the same city.
Data Splitting: Time-based split. Train on months 1-5, validate on month 6. Never use random split for time-series recommendation data.
Retraining Strategy: Daily batch retraining of the LightGBM model to capture latest trends. Hourly incremental updates to the Vector DB for new listings (using content-based embedding fallback for cold start).
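The dataset construction and time-based split can be sketched as below. The row schema (`event_date` key) and the policy of preferring shown-but-not-clicked negatives before falling back to random same-city listings are assumptions made for illustration; the notes above list both sources without fixing a priority.

```python
import random
from datetime import date
from typing import Dict, List, Sequence, Tuple


def time_split(rows: Sequence[Dict], cutoff: date) -> Tuple[List[Dict], List[Dict]]:
    """Everything strictly before the cutoff trains; the rest validates."""
    train = [r for r in rows if r["event_date"] < cutoff]
    valid = [r for r in rows if r["event_date"] >= cutoff]
    return train, valid


def sample_negatives(
    booked_id: str,
    shown_not_clicked: Sequence[str],
    same_city_ids: Sequence[str],
    n: int,
    rng: random.Random,
) -> List[str]:
    """Prefer impressions the user ignored; top up with random same-city listings."""
    # dict.fromkeys de-duplicates while preserving impression order.
    negatives = [i for i in dict.fromkeys(shown_not_clicked) if i != booked_id][:n]
    fallback = [i for i in same_city_ids if i != booked_id and i not in negatives]
    rng.shuffle(fallback)
    negatives.extend(fallback[: n - len(negatives)])
    return negatives
```

Splitting on a date rather than at random keeps future bookings out of the training set, which is the leakage the "never use random split" rule is guarding against.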
Serving Pipeline
Serving Pattern:
Trigger: User visits Listing A.
Retrieval: Fetch 100 most similar listing IDs from Vector DB using Embedding(A).
Filter: Query Availability Service (Redis) to remove listings booked for the user's dates.
Rank: Batch predict P(\text{Book}) for the remaining ~50 listings using LightGBM.
Serve: Return top 10.
Latency Optimization: Serve ANN from an HNSW index (e.g., built with FAISS or provided by the Vector DB). Run LightGBM batch prediction multi-threaded (OpenMP).
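The retrieve, filter, rank, serve flow above can be sketched end to end in a few lines. This is an illustration only: the exact dot-product scan stands in for the ANN lookup, the in-memory `booked_nights` dict stands in for the Availability Service, and `score_fn` stands in for the LightGBM model. All names are hypothetical.

```python
from datetime import date, timedelta
from typing import Callable, Dict, List, Sequence, Set


def stay_nights(check_in: date, check_out: date) -> Set[date]:
    """Nights occupied by a stay; the check-out day itself is not a night."""
    return {check_in + timedelta(days=i) for i in range((check_out - check_in).days)}


def similar_listings(
    anchor_id: str,
    embeddings: Dict[str, Sequence[float]],
    booked_nights: Dict[str, Set[date]],
    check_in: date,
    check_out: date,
    score_fn: Callable[[str, str], float],
    n_retrieve: int = 100,
    k: int = 10,
) -> List[str]:
    anchor_vec = embeddings[anchor_id]

    # 1) Retrieval (stand-in for ANN): exact dot product against every listing.
    scored = sorted(
        (
            (sum(a * b for a, b in zip(anchor_vec, vec)), listing_id)
            for listing_id, vec in embeddings.items()
            if listing_id != anchor_id
        ),
        reverse=True,
    )
    candidates = [listing_id for _, listing_id in scored[:n_retrieve]]

    # 2) Post-filter: drop anything booked on any night of the requested stay.
    nights = stay_nights(check_in, check_out)
    available = [c for c in candidates if not (booked_nights.get(c, set()) & nights)]

    # 3) Rank the survivors and 4) serve the top k.
    ranked = sorted(available, key=lambda c: score_fn(anchor_id, c), reverse=True)
    return ranked[:k]
```

Post-filtering can leave fewer than k results, or none, when the retrieved set is small or the dates are in high demand. That shortfall is the motivation for over-retrieving or for in-index filtering.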
Evaluation Pipeline
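Offline: Replay historical sessions and check whether the listing the user actually booked appears near the top of the recommendations generated for the listings they viewed. Metrics: Recall@20 (retrieval) and NDCG@10 (ranking).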
Online: A/B Testing.
Control: Heuristic (Same city, same price).
Treatment: ML-based Retrieval + Ranking.
Metric: Booking Conversion Rate (Primary), Click-through Rate (Secondary).
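For the offline metrics, a minimal reference implementation with binary relevance (a listing is relevant if the user booked it) looks like this. The function names are illustrative.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant listings that appear in the top k."""
    if not relevant:
        return 0.0
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-gain NDCG: DCG of the ranking divided by DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(position + 2)  # position 0 gets discount log2(2) = 1
        for position, listing_id in enumerate(ranked[:k])
        if listing_id in relevant
    )
    ideal_hits = min(k, len(relevant))
    ideal_dcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / ideal_dcg if ideal_dcg > 0 else 0.0
```

Recall@K ignores order and so suits the retrieval stage, where the only question is whether the booked listing made it into the candidate set. NDCG@K rewards placing it near the top, which is what the ranker controls.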
Monitoring Pipeline
System Monitoring: Prometheus/Grafana for QPS, latency, and 5xx errors.
Model Monitoring: Track the distribution of the output scores (prediction drift). If the average P(\text{Book}) drops significantly, alert the team.
Feature Monitoring: Monitor for missing values in critical features like price or location.
Wrap Up
Final Evaluation
Observability: Use "Feature Importance" plots in LightGBM to ensure the model isn't over-relying on a single noisy feature.
Feedback Loop: Clicks/Bookings on the "Similar Listings" section are piped back into the training data daily.
Edge Cases:
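New Listing: Use "Content-only" embeddings (average of image + text embeddings) until enough click data exists for Item2Vec. To be retrievable from the same index, these vectors must live in the same space as the Item2Vec embeddings, e.g., via a mapping from content embeddings to Item2Vec embeddings learned on listings that have both.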
Out-of-Stock: The Availability Filter is the most critical non-ML component.
Trade-offs: We trade off some accuracy (by not using a Deep Cross Network) for extreme low latency and maintainability (LightGBM).
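For the New Listing edge case, the "average of image + text embeddings" can be sketched as below. It assumes both encoders' outputs have already been projected to the same dimensionality, which the notes above do not specify. Normalizing each modality first is a design choice made here so that neither dominates the average purely by vector magnitude.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def content_embedding(image_vec: Sequence[float], text_vec: Sequence[float]) -> List[float]:
    """Late fusion by averaging unit-length image and text vectors."""
    if len(image_vec) != len(text_vec):
        raise ValueError("image and text embeddings must share a dimensionality")
    image_unit = l2_normalize(image_vec)
    text_unit = l2_normalize(text_vec)
    fused = [(a + b) / 2 for a, b in zip(image_unit, text_unit)]
    return l2_normalize(fused)  # unit length, so dot product equals cosine
```

The fused vector is still in content space, not Item2Vec space, so it is only directly comparable with other content embeddings unless it is mapped into the behavioral space first.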