The Question

Scalable Similar Listings Recommendation System

Design a high-scale 'Similar Listings' recommendation engine for a global vacation rental platform with 10M+ properties. The system must surface relevant, available alternatives in real-time (<150ms P99) while a user browses a Listing Detail Page. Focus on the end-to-end ML lifecycle: from multi-modal feature engineering (text, images, metadata) and handling unique inventory availability constraints, to a two-stage retrieval and ranking architecture. Discuss strategies for addressing listing cold-start, training/serving data consistency, and how to optimize for long-term business metrics like booking conversion rate.

Tags: LightGBM, Item2Vec, HNSW, FAISS, Spark, Kafka, Redis, Word2Vec, Vision Transformer

Questions & Insights

Clarifying Questions

Clarifying Questions & Constraints:

Business Goal: Is the primary North Star metric Bookings (conversion) or Listing Detail Page (LDP) views (engagement)? Answer: Conversion (Bookings).

Constraints & Scale: What is the scale of listings and traffic? Answer: 10M active listings, 100M monthly active users (MAU), 5k QPS at peak.

Latency Budget: What is the P99 latency requirement? Answer: 150ms for the entire recommendation component.

Freshness: How quickly must new listings appear in "Similar Listings"? Answer: Within 1 hour (Cold start is critical).

Availability: Should we only show listings available for the user's specific dates? Answer: Yes, availability is a hard constraint.

Assumptions:

A corpus of 10M listings.

P99 latency requirement of 150ms.

We have access to rich metadata (amenities, price, location) and high-quality images.

Thinking Process

Identify the Bottleneck: Similarity in vacation rentals is multi-faceted. It's not just "looks like this house," but "serves the same travel intent" (price point, location proximity, and group size).

Retrieval vs. Ranking: With 10M items, scoring every listing with a full ranking model on each request is infeasible within the latency budget. I need a Two-Stage approach: 1) Fast Retrieval (Approximate Nearest Neighbors) and 2) Precise Ranking (Pointwise/Pairwise re-ranking).

The "Availability" Problem: Unlike in e-commerce, inventory is unique and time-bound. A "similar" house that is booked for the user's dates is a dead end. Filtering must happen either during retrieval or immediately after.

Scaling the Solution: Leverage listing embeddings (Item2Vec or Content-based) for retrieval to handle the scale and use a GBDT or lightweight MLP for ranking to meet the 150ms budget.

Elite Bonus Points

Multi-modal Embeddings: Using a late-fusion approach to combine image embeddings (from a pre-trained Vision Transformer) with text embeddings (listing descriptions) to capture "vibe" similarity.

Availability-Aware Retrieval: Discussing the trade-off between "Post-filtering" (retrieving 100, filtering down to 10) vs "In-index filtering" (using HNSW with metadata filters) to prevent empty result sets.

Exploration/Exploitation (E&E): Implementing a small epsilon-greedy shuffle to prevent "rich-get-richer" effects and collect data on new, high-potential listings.

Session-Based Personalization: Adjusting "similarity" based on the user's current session (e.g., if they just looked at 3 beach houses, prioritize coastal similarity over price similarity).
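The epsilon-greedy shuffle mentioned under Exploration/Exploitation could look roughly like the sketch below. It is a minimal illustration, not a production policy: the function name, the `explore_pool` argument (assumed to hold eligible, available new listings), and the default `epsilon` are all hypothetical.

```python
import random


def epsilon_greedy_rerank(ranked_ids, explore_pool, epsilon=0.05, rng=None):
    """Return a copy of ranked_ids where each slot is, with probability
    epsilon, replaced by a not-yet-shown listing from explore_pool.

    explore_pool is a hypothetical list of new listings that have already
    passed the availability filter.
    """
    rng = rng or random.Random()
    result = list(ranked_ids)
    already_shown = set(result)
    # Only explore listings that are not already in the ranked list.
    pool = [lid for lid in explore_pool if lid not in already_shown]
    rng.shuffle(pool)
    for slot in range(len(result)):
        if pool and rng.random() < epsilon:
            result[slot] = pool.pop()
    return result
```

Logging which slots were filled by exploration would let the training pipeline treat those impressions separately from ranked ones.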

Design Breakdown

Requirements

Product Goal: Surface 6-12 "Similar Listings" on the Listing Detail Page (LDP) to help users find alternatives and increase booking conversion.

Success Metrics:

Online Metrics: Booking Conversion Rate (CVR), CTR on recommendations, Average Daily Rate (ADR).

Offline Metrics: Recall@K (for retrieval), NDCG, LogLoss/AUC (for ranking).

Guardrail Metrics: P99 Latency, Listing Diversity (to avoid showing 10 identical units in the same building).

System Constraints: 10M items, 5k QPS, <150ms latency.

Data Availability: Listing metadata (price, rooms, location), User clickstream, historical booking logs, listing images.

ML Problem Framing

ML Task Type: Two-stage recommendation (Retrieval + Ranking).

Prediction Target:

P(\text{Book} | \text{User}, \text{Context}, \text{Candidate Item})

Inputs:

User: (Optional for MVP) Historical preferences, search filters.

Item (Anchor): Current listing's price, location (GeoHash), amenities, category (e.g., "Tiny Home").

Candidate Items: Features of potential similar listings.

Outputs: A ranked list of Listing IDs.

ML Challenges: Cold start for new listings, extreme data sparsity (most users don't book often), and the "Availability" hard constraint.

Design Summary & MVP

Concise Summary: A two-stage system using Approximate Nearest Neighbors (ANN) on listing embeddings for retrieval, followed by a LightGBM ranker that incorporates real-time availability and price delta features.

Model Architecture & Selection:

Baseline Model: Heuristic-based: "Top 10 listings in the same city within +/- 20% price range."

Target Model: Retrieval: Two-Tower model or Item2Vec embeddings stored in a Vector DB (Milvus/Pinecone). Ranking: LightGBM (Gradient Boosted Decision Trees) for fast, interpretable, and high-performance ranking.

Choice Rationale: GBDTs are a strong, low-latency choice for tabular features (price, room count) and are often competitive with deep models on such data, while ANN enables sub-linear search over 10M listings.
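For reference, the heuristic baseline ("same city, within +/- 20% price") can be sketched in a few lines. The `Listing` fields and the `popularity` tie-breaker are assumptions for illustration; the source does not say how the "top 10" are ordered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Listing:
    listing_id: str
    city: str
    price: float
    popularity: float  # hypothetical ordering signal, e.g. 7-day booking rate


def baseline_similar(anchor, catalog, k=10, price_band=0.20):
    """Same-city listings within +/- price_band of the anchor price,
    ordered by the (assumed) popularity signal."""
    low = anchor.price * (1 - price_band)
    high = anchor.price * (1 + price_band)
    candidates = [
        item for item in catalog
        if item.listing_id != anchor.listing_id
        and item.city == anchor.city
        and low <= item.price <= high
    ]
    candidates.sort(key=lambda item: item.popularity, reverse=True)
    return candidates[:k]
```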

ML Life Cycle Summary: Raw logs are processed via Spark; embeddings are generated offline; the ANN index is updated hourly; LightGBM ranks the retrieved candidates online (up to 100, fewer after availability filtering).

Simplicity Audit: Avoids complex Graph Neural Networks (GNNs) or real-time Transformers for the MVP, focusing on robust embeddings and efficient GBDT ranking.

Architecture Decision Rationale: This architecture balances the need for semantic similarity (embeddings) with the need for hard-constraint logic (price/location) and low-latency serving.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Clickstream (listing views, "save to wishlist"), Transaction logs (bookings), Listing Metadata (updated via CDC from production DB).

Data Ingestion: Kafka for real-time events; Airflow for orchestrating batch ingestion from Listing DB.

Data Storage: S3 for the Data Lake (Parquet format for efficiency). Partitioning by date and region.

Data Processing: Spark for heavy-duty joins (User-Item interactions) and sessionization.

Data Quality: De-duplication of events, schema validation (Great Expectations), and checking for "orphan listings" (listings with no metadata).

Feature Pipeline

Feature Definition:

Item (Static): Price, No. of bedrooms, Latitude/Longitude, Amenities (WiFi, Pool, etc.).

Item (Dynamic): 7-day CTR, Booking rate, availability calendar.

Context: Date of stay, number of guests.

Feature Engineering:

Geohashing: Convert Lat/Long to Geohashes of varying precision for proximity matching.

Price Bucketing: Normalize price relative to the median of the city.
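The two feature-engineering steps above can be illustrated with a small stdlib-only sketch. The geohash encoder follows the standard bit-interleaving scheme; the bucket edges in `PRICE_EDGES` are hypothetical and would need tuning on real data.

```python
import bisect
from statistics import median

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=6):
    """Encode a coordinate as a geohash; a shorter prefix means a coarser cell."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, bits, bit_count = [], 0, 0
    use_lon = True  # bits alternate longitude, latitude, longitude, ...
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            bit = 1 if lon >= mid else 0
            lon_lo, lon_hi = (mid, lon_hi) if bit else (lon_lo, mid)
        else:
            mid = (lat_lo + lat_hi) / 2
            bit = 1 if lat >= mid else 0
            lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
        bits = (bits << 1) | bit
        bit_count += 1
        use_lon = not use_lon
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def price_ratio(price, city_prices):
    """Price relative to the median price of listings in the same city."""
    return price / median(city_prices)


PRICE_EDGES = (0.5, 0.8, 1.25, 2.0)  # hypothetical bucket boundaries


def price_bucket(ratio, edges=PRICE_EDGES):
    """Map a price ratio to an integer bucket index (0 = far below median)."""
    return bisect.bisect_right(edges, ratio)
```

Two listings that share a long geohash prefix usually lie in the same cell and are therefore close; listings on opposite sides of a cell boundary can be close without sharing a prefix, so neighbouring cells are often checked as well.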

Online Feature Store: A Redis-backed feature store (e.g., managed through Feast or Tecton) to hold real-time listing counters (e.g., "times viewed in last hour").

Training/Serving Skew: Use a single Feature Definition library for both Spark (offline) and the Online Service to ensure feature consistency.

Model Architecture

Problem Formulation: Pointwise ranking: Predict the probability of a booking for a candidate item given the anchor item.

Retrieval Architecture: Item2Vec. Treat a user's session of viewed listings as a "sentence" and listings as "words." Train Word2Vec to learn listing embeddings.

Why? Captures co-occurrence (people who look at X also look at Y).
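To make the "session as sentence" idea concrete, the sketch below builds the skip-gram (target, context) pairs that a Word2Vec-style trainer would consume. It covers only the data preparation step, not the embedding training itself; the function name and the `window` default are illustrative assumptions.

```python
def skipgram_pairs(session, window=2):
    """Generate (target, context) listing pairs from one browsing session.

    session: ordered list of listing IDs viewed by a user.
    window:  how many neighbours on each side count as context.
    """
    pairs = []
    for i, target in enumerate(session):
        start = max(0, i - window)
        stop = min(len(session), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((target, session[j]))
    return pairs


# Example: three listings viewed in order, context window of 1.
print(skipgram_pairs(["L1", "L2", "L3"], window=1))
# [('L1', 'L2'), ('L2', 'L1'), ('L2', 'L3'), ('L3', 'L2')]
```

Listings that often appear in each other's context windows end up with nearby embeddings, which is the co-occurrence signal described above.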

Ranking Architecture: LightGBM.

Features: Embedding cosine similarity, Haversine distance between anchor and candidate, price difference, star rating difference.
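A minimal sketch of the anchor/candidate pair features listed above, using only the standard library. The dictionary keys (`embedding`, `lat`, `lon`, `price`, `rating`) are assumed field names, not a schema from the source. Keeping this as one shared function for both offline and online use is one way to limit training/serving skew.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two coordinates."""
    radius_km = 6371.0088  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def pair_features(anchor, candidate):
    """Features describing how a candidate relates to the anchor listing."""
    return {
        "embedding_cosine": cosine_similarity(anchor["embedding"], candidate["embedding"]),
        "distance_km": haversine_km(anchor["lat"], anchor["lon"], candidate["lat"], candidate["lon"]),
        "price_diff": candidate["price"] - anchor["price"],
        "rating_diff": candidate["rating"] - anchor["rating"],
    }
```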

Model Complexity: Item2Vec (128d embeddings) + LightGBM (500 trees). Both are cheap at serving time: an ANN lookup plus scoring roughly 100 candidates per request should fit well within the 150ms P99 budget.

Training Pipeline

Dataset Construction:

Positive Labels: Bookings.

Negative Labels: Sampled from listings shown but not clicked, or random listings from the same city.

Data Splitting: Time-based split. Train on months 1-5, validate on month 6. Never use a random split for time-series recommendation data, because it leaks future interactions into training and inflates offline metrics.
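A time-based split is simple to express. The sketch below assumes each training row carries an `event_date` field (a hypothetical name) and uses a single cutoff date.

```python
from datetime import date


def time_based_split(rows, cutoff):
    """Rows strictly before the cutoff go to training, the rest to validation."""
    train = [row for row in rows if row["event_date"] < cutoff]
    valid = [row for row in rows if row["event_date"] >= cutoff]
    return train, valid


rows = [
    {"event_date": date(2024, 1, 15), "label": 0},
    {"event_date": date(2024, 5, 30), "label": 1},
    {"event_date": date(2024, 6, 2), "label": 0},
]
# Train on months 1-5, validate on month 6.
train, valid = time_based_split(rows, cutoff=date(2024, 6, 1))
```

Every training row predates every validation row, which mirrors how the model is used in production: trained on the past, evaluated on the future.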

Retraining Strategy: Daily batch retraining of the LightGBM model to capture latest trends. Hourly incremental updates to the Vector DB for new listings (using content-based embedding fallback for cold start).

Serving Pipeline

Serving Pattern:

Trigger: User visits Listing A.

Retrieval: Fetch 100 most similar listing IDs from Vector DB using Embedding(A).

Filter: Query Availability Service (Redis) to remove listings booked for the user's dates.

Rank: Batch predict $P(\text{Book})$ for the remaining ~50 listings using LightGBM.

Serve: Return top 10.
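The serving steps above can be sketched as one function. The `retrieve` and `score` callables and the `booked_nights` mapping are hypothetical stand-ins for the Vector DB lookup, the LightGBM batch prediction, and the Redis availability data; none of them reflects a specific client API.

```python
from datetime import date, timedelta


def stay_nights(check_in, check_out):
    """The set of nights covered by a stay (check-out day excluded)."""
    return {check_in + timedelta(days=i) for i in range((check_out - check_in).days)}


def similar_listings(anchor_id, check_in, check_out, retrieve, booked_nights, score,
                     n_retrieve=100, n_serve=10):
    """Retrieve -> availability filter -> rank -> serve.

    retrieve(anchor_id, n)   -> list of candidate listing IDs (ANN lookup stand-in)
    booked_nights            -> dict: listing ID -> set of already-booked dates
    score(anchor_id, ids)    -> list of P(Book) scores, one per ID (ranker stand-in)
    """
    candidates = retrieve(anchor_id, n_retrieve)
    nights = stay_nights(check_in, check_out)
    # Hard constraint: drop anything booked on any requested night.
    available = [
        cid for cid in candidates
        if cid != anchor_id and not (booked_nights.get(cid, set()) & nights)
    ]
    if not available:
        return []
    scores = score(anchor_id, available)  # one batched call, not one per candidate
    ranked = sorted(zip(available, scores), key=lambda pair: pair[1], reverse=True)
    return [cid for cid, _ in ranked[:n_serve]]
```

If the filter regularly leaves too few candidates, the usual options are to raise `n_retrieve` or to move the availability check into the index, which is the post-filtering versus in-index trade-off discussed earlier.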

Latency Optimization: Use an HNSW-based ANN index (e.g., as implemented in FAISS). Multi-threaded ranking using OpenMP, which LightGBM uses for parallel prediction.

Evaluation Pipeline

Offline: Replay historical sessions and check how highly the listing the user actually booked is ranked. Metrics: Recall@20 (retrieval) and NDCG@10 (ranking).
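Both offline metrics are short to compute. The sketch below assumes binary relevance (a listing is relevant if the user booked it), which is an assumption about the labelling rather than something the source fixes.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant listings that appear in the top k."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    relevant = set(relevant_ids)
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, listing_id in enumerate(ranked_ids[:k])
        if listing_id in relevant
    )
    ideal = sum(1.0 / math.log2(position + 2) for position in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0
```

With a single booked listing per session, NDCG reduces to a discount on the rank at which that listing appears: 1.0 at rank 1, about 0.63 at rank 2, and 0.5 at rank 3.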

Online: A/B Testing.

Control: Heuristic baseline (same city, price within +/- 20%).

Treatment: ML-based Retrieval + Ranking.

Metric: Booking Conversion Rate (Primary), Click-through Rate (Secondary).
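To judge whether a difference in booking conversion rate between control and treatment is more than noise, one common choice is a two-proportion z-test. The source does not name a statistical test, so the sketch below is an assumption about how the comparison might be done, and the counts in the example are made up.

```python
import math


def two_proportion_z_test(conversions_a, users_a, conversions_b, users_b):
    """Two-sided z-test for a difference in conversion rate (B minus A).

    Returns (z, p_value). Uses the pooled-variance form of the test.
    """
    rate_a = conversions_a / users_a
    rate_b = conversions_b / users_b
    pooled = (conversions_a + conversions_b) / (users_a + users_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    if std_err == 0:
        return 0.0, 1.0
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Illustrative numbers only: 2.0% vs 2.6% conversion on 10,000 users each.
z, p = two_proportion_z_test(200, 10_000, 260, 10_000)
```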

Monitoring Pipeline

System Monitoring: Prometheus/Grafana for QPS, latency, and 5xx errors.

Model Monitoring: Track the distribution of the output scores (prediction drift). If the average $P(\text{Book})$ drops significantly, alert the team.
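One simple way to turn "drops significantly" into an alert is to compare the mean score of a recent window against a reference window. The 20% threshold and the function names below are hypothetical; a real setup would more likely compare full distributions.

```python
from statistics import fmean


def mean_score_drop(baseline_scores, current_scores):
    """Relative drop of the mean predicted score versus the baseline window."""
    baseline = fmean(baseline_scores)
    current = fmean(current_scores)
    return (baseline - current) / baseline if baseline else 0.0


def should_alert(baseline_scores, current_scores, max_relative_drop=0.20):
    """True when the mean score fell by more than the (assumed) threshold."""
    return mean_score_drop(baseline_scores, current_scores) > max_relative_drop
```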

Feature Monitoring: Monitor for missing values in critical features like price or location.

Wrap Up

Final Evaluation

Observability: Use "Feature Importance" plots in LightGBM to ensure the model isn't over-relying on a single noisy feature.

Feedback Loop: Clicks/Bookings on the "Similar Listings" section are piped back into the training data daily.

Edge Cases:

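New Listing: Use "Content-only" embeddings (average of image + text embeddings) until enough click data exists for Item2Vec. Because content embeddings and Item2Vec embeddings live in different vector spaces, the content vectors need to be mapped into the Item2Vec space (or indexed separately) before they can be compared.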

Out-of-Stock: The Availability Filter is the most critical non-ML component.
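For the New Listing edge case above, a content-only fallback might look like the sketch below. It assumes the image and text vectors have already been projected to the same dimension and are comparable with the Item2Vec vectors; both are strong assumptions the source does not spell out. The `min_views` threshold is hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def content_embedding(image_vec, text_vec):
    """Average the unit-normalized image and text vectors, then re-normalize."""
    if len(image_vec) != len(text_vec):
        raise ValueError("image and text vectors must share a dimension")
    image_unit, text_unit = l2_normalize(image_vec), l2_normalize(text_vec)
    return l2_normalize([(a + b) / 2 for a, b in zip(image_unit, text_unit)])


def choose_embedding(listing_id, item2vec, view_counts, image_vec, text_vec, min_views=50):
    """Use the Item2Vec vector once a listing has enough views, else fall back."""
    if listing_id in item2vec and view_counts.get(listing_id, 0) >= min_views:
        return item2vec[listing_id]
    return content_embedding(image_vec, text_vec)
```

Normalizing before averaging keeps one modality from dominating simply because its vectors have a larger magnitude.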

Trade-offs: We trade off some accuracy (by not using a Deep & Cross Network) for very low latency and maintainability (LightGBM).