The Question
ML DesignScalable Personalized Recommendation System
Design a large-scale recommendation system for a music streaming platform to generate weekly personalized discovery playlists for 500 million users from a corpus of 100 million tracks. Your design should cover the end-to-end ML lifecycle: from multi-stage candidate retrieval and ranking to batch inference pipelines. Address specific challenges including cold-start for new tracks, popularity bias, and the use of audio-based vs. collaborative embeddings. Detail your approach to high-throughput offline serving, data consistency across weekly updates, and evaluation metrics that balance user satisfaction with discovery novelty.
Two-Tower DNN
LightGBM
Word2Vec
FAISS
CNN
Approximate Nearest Neighbor
Spark
Cassandra
Negative Sampling
MMR
Questions & Insights
Clarifying Questions
Clarifying Questions & Constraints:
Business Goal: Is the primary metric long-term user retention or immediate listening time? Assumption: The North Star is "Discovery Satisfaction," measured by songs saved to personal libraries or playlists.
Scale: What is the corpus size and user base? Assumption: 100M+ tracks, 500M+ Monthly Active Users (MAU).
Freshness: Does the playlist need to update mid-week? Assumption: No, this is a batch-generated weekly playlist delivered every Monday.
Cold Start: How do we handle users with zero history or brand-new tracks? Assumption: We use content-based audio embeddings and artist metadata for new tracks; popularity-based fallbacks for new users.
Assumptions:
Corpus: 100M tracks.
Training Data: Implicit feedback (plays, skips, saves) from the last 90 days.
Delivery: Offline batch inference (YAGNI principle: no need for real-time serving if the product is a static weekly list).
Thinking Process
Identify the Funnel: With 100M tracks, we cannot rank everything for every user. I need a multi-stage approach: Retrieval (Candidate Generation) to narrow down to ~500 tracks, followed by Ranking to pick the top 30.
Hybrid Intelligence: Collaborative Filtering (CF) captures "users like you," while Content-based filtering (audio features) solves the "cold start" for new music. I'll combine these into a unified embedding space.
Scaling Bottleneck: 500M users x 100M tracks is a massive join. I must use embeddings and Approximate Nearest Neighbor (ANN) search to make this tractable.
The "Weekly" Constraint: Since it's a weekly batch, I can optimize for throughput and cost-efficiency in the data pipeline rather than millisecond latency at the point of request.
Elite Bonus Points
Calibration for Diversity: Users might like 80% Rock and 20% Jazz. If the model only suggests Rock, it fails. I would implement Calibrated Recommendations to ensure the predicted distribution of genres matches the user's historical preference.
Exploration via Bandits: To prevent "filter bubbles," I’d reserve 2-3 slots for "Exploration" tracks using a Thompson Sampling or UCB approach based on genre-level uncertainty.
Negative Signal Weighting: Not all skips are equal. A skip after 2 seconds is a strong negative; a skip after 2 minutes might be a "soft" positive. I'll use weighted loss functions to reflect dwell time.
Incremental Training: To keep track embeddings fresh without retraining the whole graph, I'd use an incremental "Warm Start" for new items using their audio-feature similarity to existing "anchor" tracks.
Design Breakdown
Requirements
Product Goal: Deliver a personalized 30-song playlist every Monday that introduces users to music they haven't heard but are likely to love.
Success Metrics:
Online: Discovery Save Rate (tracks added to library), Playlist Dwell Time, CTR on the playlist banner.
Offline: Precision@30, NDCG, Diversity Score, Novelty Score (to ensure we aren't just recommending hits).
System Constraints:
Throughput: Must generate 500M playlists within a 24-hour window (Sunday).
Storage: Billions of user-item interaction pairs.
Data Availability: User play logs (timestamp, duration), user library (saves), playlist co-occurrence data, audio analysis (tempo, key, energy).
ML Problem Framing
ML Task Type: Two-stage Recommendation (Retrieval + Ranking).
Prediction Target: Probability of a "Positive Interaction" (defined as a play > 30s OR a save).
Inputs:
User Features: Embedded listening history (last 50 songs), demographic, preferred genres.
Item Features: Track embeddings (from Word2Vec on playlists), Audio CNN embeddings, Artist popularity.
Context Features: Time of year (e.g., holiday music), user's current top genres.
ML Challenges: Popularity bias (the "rich get richer" in CF), position bias in training logs, and high-cardinality categorical features (Artist IDs).
Design Summary & MVP
Concise Summary: A hybrid system using Word2Vec (Item2Vec) on millions of user playlists for collaborative filtering, combined with a Two-Tower Neural Network for personalized retrieval, finalized by a LightGBM Ranker.
Model Architecture & Selection:
Baseline: Item-Item CF using cosine similarity on user-library vectors.
Target Model: Two-Tower DNN for retrieval (user tower + item tower) and a Pointwise Ranking model for the final 30 songs.
Simplicity Audit: I am choosing Batch Inference over real-time. Since Discover Weekly is delivered once a week, we calculate all 500M playlists offline and store them in a Key-Value store (Cassandra/Redis). This avoids the complexity of high-availability real-time ranking.
System Architecture
Pipeline Deep Dive
Data Pipeline
Data Source: User event streams (Kafka) for real-time play logs; Metadata DBs for track info.
Data Ingestion: Use Spark/Flink for ETL. For DW, a Batch (Daily) approach is sufficient.
Data Storage: "Medallion" architecture in a Data Lake (S3/Parquet).
Bronze: Raw logs.
Silver: Deduplicated, joined user-track interactions.
Gold: User profiles and track feature vectors.
Data Quality: Pydantic/Great Expectations for schema validation. We must check for "Session Hijacking" (bots) which can skew user embeddings.
Feature Pipeline
Collaborative Features: Item2Vec. Treat a playlist like a sentence and tracks like words. Use Skip-gram to learn track embeddings based on co-occurrence.
Content Features: Use a pre-trained VGG-ish CNN on track spectrograms to get "Audio Embeddings." This is vital for the cold-start of new releases.
Temporal Features: "Recency" weightings—songs played yesterday are more relevant than songs played 3 years ago.
Feature Store: Use Feast or Tecton to ensure that the features used during training (offline) are exactly the same as those used during batch inference (online consistency).
Model Architecture
Retrieval (Candidate Generation):
Two-Tower DNN: One tower encodes user history/metadata into vector U. One tower encodes track features into vector I.
Loss: Sampled Softmax or Triplet Loss to maximize dot(U, I) for positive tracks.
Scale: Use FAISS (Facebook AI Similarity Search) to perform ANN search across 100M track embeddings.
Ranking (Final Selection):
XGBoost/LightGBM: Since we only rank ~500 items per user, a GBDT is highly effective at capturing non-linear interactions (e.g., "User likes Genre X AND Item is Popular in Region Y").
Target: Binary classification (Is interaction > 30s?).
Training Pipeline
Negative Sampling: Crucial for retrieval. Use "In-batch negatives" (random tracks) and "Hard negatives" (tracks the user saw but didn't click).
Data Splitting: Time-based split. Train on data up to Week N-1, validate on Week N. Random splits will cause data leakage via user habits.
Retraining: Two-Tower model retrained weekly; Track embeddings (Item2Vec) can be updated daily to capture new trends.
Serving Pipeline
Pattern: Batch Offline Inference. Every Sunday night, a massive Spark job runs for all active users.
Workflow:
Extract User Embeddings.
FAISS query to find Top 500 tracks.
Join with features.
Rank top 30 with XGBoost.
Write results to Cassandra.
SLA: The API just does a simple GET from Cassandra by
user_id, ensuring <10ms response time for the user.Evaluation Pipeline
Offline: Use Hit Rate @ 30 and NDCG. Compare the new model against the previous week’s model using historical logs.
Online (A/B Test): Split 5% of users to the new model. Measure "Discovery Ratio" (Listening to music not previously in library) vs. Control.
Monitoring Pipeline
Prediction Drift: Monitor the mean predicted score. If it drops, the model is losing confidence.
Feature Drift: Monitor the distribution of user embedding norms. Large shifts suggest changes in user behavior or upstream data bugs.
System Health: Track the completion time of the Sunday batch job. If it exceeds 24h, the system fails to deliver.
Wrap Up
Final Evaluation
Feedback Loop: "Skips" are fed back into the next week's negative sampling. "Saves" are given 5x weight in the positive training set.
Cold Start: Handled by the Audio CNN Tower in the retrieval stage.
Trade-off (Accuracy vs. Diversity): Pure accuracy leads to "Hit" songs. We intentionally inject a Maximum Marginal Relevance (MMR) penalty to the ranker to de-duplicate artists and genres in the final 30.
Distinguishing Insight: A Principal engineer knows that "Exploitation" (giving users what they already know) is easy, but "Discovery" requires a high Novelty Score. I would track the average popularity of recommended tracks and ensure it's significantly lower than the user's current "Top 40" habits to prove the system is actually "discovering."