The Question
ML Design

Similar Listings Recommendation System

Imagine you are building a feature for a major vacation rental platform that suggests 'Similar Listings' to users based on the property they are currently viewing. Design an end-to-end machine learning system that identifies and ranks these recommendations to maximize booking conversions, while accounting for high-dimensional metadata, physical location constraints, and real-time availability.
Two-Tower Model
HNSW
LightGBM
ViT/BERT
Contrastive Learning
Position Bias Correction

Questions & Insights

Clarifying Questions

Business Goal: Is the North Star metric "Booking Conversion Rate" (CVR) or "Click-Through Rate" (CTR) on the recommendations?
Assumption: The primary goal is Booking CVR, with CTR as a proxy metric.
Constraints & Scale: What is the listing corpus size and traffic volume?
Assumption: 10M active listings, 100M Monthly Active Users (MAU), and a P99 latency budget of <100ms for the recommendation component.
Edge Cases: How do we handle new listings (cold start) or highly seasonal availability?
Assumption: New listings will rely on content-based features (images/text) until interaction data is available. Availability will be a hard filter in the serving layer.
Assumptions: I will design for a system that serves a "Similar Listings" carousel on the Property Detail Page (PDP).

Thinking Process

Identify the Core Bottleneck: With 10M listings, we cannot score all items in real-time. The solution must use a multi-stage approach: Candidate Generation (Retrieval) followed by a Precision Ranker.
MVP Strategy (YAGNI): Avoid complex Graph Neural Networks (GNNs) initially. Start with a Two-Tower architecture to learn embeddings that represent "similarity" based on co-occurrence and shared attributes.
Geography as a Hard Constraint: "Similarity" in vacation rentals is heavily location-bound. A listing in Paris is rarely "similar" to one in Tokyo for a user looking at a specific weekend. I'll use Geo-hashing or Radius-filtering to prune the search space.
Freshness: Price and availability change daily. The ranking model must ingest these dynamic features without requiring a full model retrain.

Elite Bonus Points

Calibration for Booking Likelihood: Since bookings are rare events compared to clicks, I would implement Platt Scaling or Isotonic Regression to ensure the predicted probability aligns with actual booking rates, crucial if these scores are used for downstream revenue optimization.
Handling Delayed Feedback: Bookings have a long conversion window (users browse today, book in 3 days), so a recent impression without a booking may still convert. I would use a Negative Sampling strategy that down-weights these recent, unconfirmed negatives relative to impressions whose conversion window has already closed, to mitigate "false negative" noise.
Multi-Modal Embeddings: Use a pre-trained Vision Transformer (ViT) for listing photos and a Sentence-BERT for descriptions to ensure that "Similar" truly looks and feels similar to the user.
Embeddings Versioning: Implement a "Warm-start" mechanism where new models are trained to project into the same manifold as the old model to prevent a "recommender shock" during deployment.
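
As an illustration of the calibration point above, here is a minimal sketch of Isotonic Regression using the Pool Adjacent Violators algorithm in plain Python. The function names (isotonic_fit, isotonic_predict) and the step-function lookup are assumptions made for this example, not part of the original design; a production system would more likely use a library implementation.

```python
def isotonic_fit(scores, labels):
    """Fit a non-decreasing step function from raw scores to booking rates.

    Returns a list of (upper_score, calibrated_probability) steps.
    """
    blocks = []  # each block: [sum_of_labels, count, max_score_in_block]
    for score, label in sorted(zip(scores, labels)):
        blocks.append([float(label), 1, score])
        # Pool adjacent blocks while the earlier block has the higher mean.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, top = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = top
    return [(b[2], b[0] / b[1]) for b in blocks]


def isotonic_predict(model, score):
    """Look up the calibrated probability for a raw ranker score."""
    for upper_score, probability in model:
        if score <= upper_score:
            return probability
    return model[-1][1]  # scores above the training range use the last step


# Toy data: raw ranker scores and whether the impression led to a booking.
model = isotonic_fit([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], [0, 0, 1, 0, 1, 1])
calibrated = isotonic_predict(model, 0.35)
```

The fitted steps are monotone in the raw score, so calibration changes the predicted probabilities without changing the ranking order, apart from introducing ties.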

Design Breakdown

Functional Reqs

Users viewing a Listing (Anchor) see a list of "Similar Listings."
Results must be geographically relevant and currently available for booking.
Results should reflect similarity in price point, style (e.g., "Cabin" vs "Apartment"), and amenities.

Non-Functional Reqs

Scalability: Handle 10,000+ Queries Per Second (QPS).
Latency: Sub-100ms end-to-end (Retrieval + Ranking).
Availability: 99.99% uptime; the PDP must load even if the recommendation service fails (graceful degradation).
Freshness: New listings should appear in recommendations within 1 hour of being published.

ML Problem Framing

ML Objective: Predict the probability $P(\text{Booking} \mid \text{Anchor Listing}, \text{Candidate Listing}, \text{Context})$.
ML Category: Two-stage retrieval and ranking. Retrieval is a K-Nearest Neighbors problem in embedding space; Ranking is a Pointwise Classification (Binary Cross-Entropy).
Input: Anchor Listing ID, User Context (session history), Candidate Listing features.
Output: A list of top-K Listing IDs sorted by booking probability.
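
For reference, the pointwise Binary Cross-Entropy objective named above can be written out as follows. The symbols are notation introduced for this sketch: $a_i$, $c_i$, and $x_i$ stand for the anchor listing, candidate listing, and context of the $i$-th logged impression, $y_i \in \{0, 1\}$ is the booking label, and $N$ is the number of training impressions.

```latex
p_i = P(\text{Booking} \mid a_i, c_i, x_i), \qquad
\mathcal{L}_{\text{BCE}} = -\frac{1}{N} \sum_{i=1}^{N}
  \Bigl[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \Bigr]
```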

Data Prep & Features

Data Pipeline:
Logs: User clicks, "Saved" listings, and successful Bookings (Labels).
Metadata: Listing price, room type, location (lat/long), amenities (WiFi, Pool).
Feature Engineering:
User/Item Features: Historical CTR per listing, Price bucket, Geohash (precision 6, roughly 1.2 km x 0.6 km cells).
Embeddings:
Visual: Average of the image embeddings of the top 5 listing photos.
Textual: Embedding of the listing title and summary.
Context: Current seasonality (month), user's device, search filters active.
Feature Store: Use a Feature Store (e.g., Tecton or Feast) to ensure the price_mean_7d used during training is the same as the one used at inference.
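
The visual embedding described above could be computed roughly as follows. This is a sketch under assumptions: the per-photo vectors are taken as already produced by an image encoder, and the L2 normalization step is an addition of this example (chosen because retrieval uses cosine similarity), not something the design specifies.

```python
import math


def listing_visual_embedding(photo_embeddings, top_n=5):
    """Average the first top_n photo vectors and L2-normalize the result."""
    selected = photo_embeddings[:top_n]
    dim = len(selected[0])
    mean = [sum(vec[i] for vec in selected) / len(selected) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    # Leave an all-zero vector untouched to avoid dividing by zero.
    return [x / norm for x in mean] if norm > 0 else mean


# Six toy 2-d photo vectors; only the first five are used.
photos = [[1.0, 0.0]] * 5 + [[0.0, 1.0]]
embedding = listing_visual_embedding(photos)
```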

Model Architecture

Retrieval (Candidate Generation):
Two-Tower Model: One tower encodes the "Anchor Listing," the other encodes "Candidate Listings." We optimize for cosine similarity using a Contrastive Loss (e.g., InfoNCE).
Ranking (Precision):
LightGBM or XGBoost: For the MVP, GBDTs are extremely efficient for tabular data.
Cross-Features: Anchor_Price / Candidate_Price, Distance_between_listings.
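
To make the retrieval loss concrete, here is a small plain-Python sketch of InfoNCE over cosine similarities. It assumes in-batch negatives (every other candidate in the batch acts as a negative for a given anchor) and a temperature of 0.1; both are choices made for this example. Real training would run on tensors in a deep learning framework rather than Python lists.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor_embs, candidate_embs, temperature=0.1):
    """Mean InfoNCE loss; candidate_embs[i] is the positive for anchor_embs[i]."""
    losses = []
    for i, anchor in enumerate(anchor_embs):
        logits = [cosine(anchor, cand) / temperature for cand in candidate_embs]
        # Numerically stable log-sum-exp over positive and negatives.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce(anchors, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each anchor is closest to its own positive and large when the positives are mismatched, which is the behavior the Two-Tower training relies on.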

Training & Serving

Training: Daily batch training on the last 90 days of interaction data. Use a time-based split (Train: Days 1-83, Val: Days 84-90).
Serving:
Retrieval: Use an HNSW index (for example via FAISS or hnswlib) for Approximate Nearest Neighbor (ANN) search.
Online Inference: Deploy the Ranker on a high-performance framework like Triton Inference Server.
Bias Mitigation: Correct for "Position Bias" by including the carousel position as a feature during training and setting it to a fixed default value during inference, so the ranker's scores reflect listing relevance rather than slot placement.
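
The position-feature approach to bias mitigation can be sketched as below. The helper name build_ranker_row, the feature names, and the default value of 0 are hypothetical and chosen only for illustration.

```python
DEFAULT_POSITION = 0  # hypothetical default slot used for every inference row


def build_ranker_row(listing_features, logged_position=None):
    """Build one ranker input row.

    Training rows pass the carousel slot the listing was actually shown in;
    inference rows omit it, so every candidate gets the same default slot.
    """
    row = dict(listing_features)
    row["position"] = DEFAULT_POSITION if logged_position is None else logged_position
    return row


features = {"price_ratio": 1.1, "distance_km": 2.0}
train_row = build_ranker_row(features, logged_position=3)
serve_row = build_ranker_row(features)
```

Because all candidates share the same position value at serving time, differences in their scores come only from the other features.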

System Architecture

Pipeline Deep Dive

Data Pipeline

Ingestion: Raw events (click, view, book) are captured via Kafka. We use a "Silver" layer in our Delta Lake to store cleaned, schema-validated listing metadata and user interactions.
Storage: S3 stores the parquet files. We partition by event_date and listing_region to optimize retrieval for training.

Feature Pipeline

Real-time signals: Listing availability and price changes are processed via Flink and pushed to Redis.
Embeddings: Periodically, we run an offline inference job to generate listing embeddings from images and text. These are stored as vectors in the Feature Store.

Training Pipeline

Distributed Training: The Two-Tower model is trained using PyTorch on a GPU cluster. We use Negative Sampling (sampling listings the user saw but didn't click) to teach the model what is "not similar."
Orchestration: Airflow manages the DAG, ensuring the Two-Tower model finishes before the ANN index is rebuilt.
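
A possible shape for the negative sampling step is shown below. It assumes impression logs are available as (anchor_id, candidate_id, was_clicked) tuples and that a fixed number of negatives is drawn per positive; the function name, the tuple layout, and the ratio of 2 are assumptions of this sketch.

```python
import random
from collections import defaultdict


def sample_training_pairs(impressions, negatives_per_positive=2, seed=0):
    """Return (anchor_id, candidate_id, label) triples for Two-Tower training."""
    rng = random.Random(seed)
    clicked = defaultdict(list)
    skipped = defaultdict(list)
    for anchor_id, candidate_id, was_clicked in impressions:
        (clicked if was_clicked else skipped)[anchor_id].append(candidate_id)

    pairs = []
    for anchor_id, positives in clicked.items():
        pool = skipped.get(anchor_id, [])  # shown for this anchor, not clicked
        for positive_id in positives:
            pairs.append((anchor_id, positive_id, 1))
            count = min(negatives_per_positive, len(pool))
            for negative_id in rng.sample(pool, count):
                pairs.append((anchor_id, negative_id, 0))
    return pairs


logs = [("a", "b", True), ("a", "c", False), ("a", "d", False),
        ("a", "e", False), ("x", "y", False)]
pairs = sample_training_pairs(logs)
```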

Serving Pipeline

Retrieval: We query the HNSW index using the Anchor Listing's embedding. We limit the search to the same "City" or "Geohash" to ensure physical proximity.
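Filtering and Re-ranking: The business logic layer removes the Anchor ID from the candidates and filters out listings that are already booked for the user's current search dates (if applicable). The ranker then scores the remaining candidates and returns them sorted by booking probability.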
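
The serving-side filters described above (physical proximity, anchor removal, availability) might look like this in a radius-filtering variant. The dictionary keys, the 25 km radius, and the unavailable_ids set are assumptions of this sketch; the design itself leaves the choice between geohash and radius filtering open.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def filter_candidates(anchor, candidates, max_km=25.0, unavailable_ids=frozenset()):
    """Drop the anchor itself, unavailable listings, and listings outside the radius."""
    kept = []
    for cand in candidates:
        if cand["id"] == anchor["id"] or cand["id"] in unavailable_ids:
            continue
        distance = haversine_km(anchor["lat"], anchor["lon"], cand["lat"], cand["lon"])
        if distance <= max_km:
            kept.append(cand)
    return kept


paris = {"id": "paris", "lat": 48.8566, "lon": 2.3522}
candidates = [
    paris,
    {"id": "versailles", "lat": 48.8049, "lon": 2.1204},
    {"id": "tokyo", "lat": 35.6762, "lon": 139.6503},
    {"id": "booked_paris", "lat": 48.8600, "lon": 2.3500},
]
kept = filter_candidates(paris, candidates, unavailable_ids={"booked_paris"})
```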

Evaluation Pipeline

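Interleaved Testing: Instead of a standard A/B test, we can use Interleaving, which mixes the results of two ranking algorithms into a single carousel shown to the same user and credits each algorithm for the items the user engages with. Because every user sees both rankers, differences can be detected with less traffic than a split A/B test.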
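
One common way to interleave two rankings is team-draft interleaving; the sketch below assumes that variant, since the text does not name one. The function names and the credit-by-engagement rule are illustrative.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, k, seed=None):
    """Merge two rankings into one list of up to k items, remembering who picked what."""
    rng = random.Random(seed)
    sources = {"A": ranking_a, "B": ranking_b}
    picks = {"A": 0, "B": 0}
    interleaved, team_of = [], {}
    while len(interleaved) < k:
        # The team with fewer picks goes first; ties are broken by a coin flip.
        if picks["A"] != picks["B"]:
            order = ["A", "B"] if picks["A"] < picks["B"] else ["B", "A"]
        else:
            order = rng.sample(["A", "B"], 2)
        for team in order:
            item = next((x for x in sources[team] if x not in team_of), None)
            if item is not None:
                interleaved.append(item)
                team_of[item] = team
                picks[team] += 1
                break
        else:
            break  # both rankings are exhausted
    return interleaved, team_of


def credit(team_of, engaged_ids):
    """Count engaged items (clicks or bookings) per contributing ranker."""
    wins = {"A": 0, "B": 0}
    for item in engaged_ids:
        if item in team_of:
            wins[team_of[item]] += 1
    return wins


carousel, team_of = team_draft_interleave([1, 2, 3, 4], [3, 1, 5, 6], k=4, seed=0)
wins = credit(team_of, engaged_ids=[carousel[0]])
```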

Monitoring Pipeline

Metric Tracking: We monitor the Recall@K of the Retrieval stage. If Recall drops, it usually indicates that the embedding model or the ANN index is stale.
Label Leakage: Monitor if specific features (like "total_price_after_fees") are leaking the label (booking) during training.
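
Recall@K for the retrieval stage can be computed as below. What counts as "relevant" is an assumption left to the caller: it could be the listings a user later engaged with, or the exact nearest neighbors when checking ANN index quality.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant listings that appear in the top-k retrieved results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = set(retrieved_ids[:k]) & relevant
    return len(hits) / len(relevant)


score = recall_at_k([1, 2, 3, 4], [2, 4, 9], k=3)
```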

Wrap Up

Advanced Topics

Offline Metrics: AUC-ROC for the ranker; Recall@100 for the retrieval stage.
Online Metrics: Booking Conversion Rate (Primary) and Mean Reciprocal Rank (MRR) of the booked item in the "Similar" carousel.
Risk Mitigation:
Cold Start: Use content-only similarity for the first 48 hours of a listing's life.
Fallback: If the ML service fails, return listings in the same neighborhood and price bucket sorted by popularity.
Scalability Audit: The system scales horizontally. ANN indices can be sharded by region, and Ranking nodes can be scaled based on QPS.
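
The Mean Reciprocal Rank listed under Online Metrics can be computed as in the following sketch. It assumes each session is logged as the ordered carousel plus the ID of the listing that was booked, and that a booking outside the carousel contributes zero; both are assumptions of this example.

```python
def mean_reciprocal_rank(sessions):
    """sessions: list of (carousel_ids, booked_id) pairs."""
    if not sessions:
        return 0.0
    total = 0.0
    for carousel_ids, booked_id in sessions:
        if booked_id in carousel_ids:
            total += 1.0 / (carousel_ids.index(booked_id) + 1)
    return total / len(sessions)


mrr = mean_reciprocal_rank([([1, 2, 3], 2), ([4, 5], 4), ([6, 7], 9)])
```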