DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
The Question
ML Design

Scalable Personalized Recommendation System

Design a large-scale recommendation system for a music streaming platform to generate weekly personalized discovery playlists for 500 million users from a corpus of 100 million tracks. Your design should cover the end-to-end ML lifecycle: from multi-stage candidate retrieval and ranking to batch inference pipelines. Address specific challenges including cold-start for new tracks, popularity bias, and the use of audio-based vs. collaborative embeddings. Detail your approach to high-throughput offline serving, data consistency across weekly updates, and evaluation metrics that balance user satisfaction with discovery novelty.
Two-Tower DNN
LightGBM
Word2Vec
FAISS
CNN
Approximate Nearest Neighbor
Spark
Cassandra
Negative Sampling
MMR
Questions & Insights

Clarifying Questions

Clarifying Questions & Constraints:
Business Goal: Is the primary metric long-term user retention or immediate listening time? Assumption: The North Star is "Discovery Satisfaction," measured by songs saved to personal libraries or playlists.
Scale: What is the corpus size and user base? Assumption: 100M+ tracks, 500M+ Monthly Active Users (MAU).
Freshness: Does the playlist need to update mid-week? Assumption: No, this is a batch-generated weekly playlist delivered every Monday.
Cold Start: How do we handle users with zero history or brand-new tracks? Assumption: We use content-based audio embeddings and artist metadata for new tracks; popularity-based fallbacks for new users.
Assumptions:
Corpus: 100M tracks.
Training Data: Implicit feedback (plays, skips, saves) from the last 90 days.
Delivery: Offline batch inference (YAGNI principle: no need for real-time serving if the product is a static weekly list).

Thinking Process

Identify the Funnel: With 100M tracks, we cannot rank everything for every user. I need a multi-stage approach: Retrieval (Candidate Generation) to narrow down to ~500 tracks, followed by Ranking to pick the top 30.
Hybrid Intelligence: Collaborative Filtering (CF) captures "users like you," while Content-based filtering (audio features) solves the "cold start" for new music. I'll combine these into a unified embedding space.
Scaling Bottleneck: 500M users x 100M tracks is a massive join. I must use embeddings and Approximate Nearest Neighbor (ANN) search to make this tractable.
The "Weekly" Constraint: Since it's a weekly batch, I can optimize for throughput and cost-efficiency in the data pipeline rather than millisecond latency at the point of request.

Elite Bonus Points

Calibration for Diversity: Users might like 80% Rock and 20% Jazz. If the model only suggests Rock, it fails. I would implement Calibrated Recommendations to ensure the predicted distribution of genres matches the user's historical preference.
Exploration via Bandits: To prevent "filter bubbles," I’d reserve 2-3 slots for "Exploration" tracks using a Thompson Sampling or UCB approach based on genre-level uncertainty.
Negative Signal Weighting: Not all skips are equal. A skip after 2 seconds is a strong negative; a skip after 2 minutes might be a "soft" positive. I'll use weighted loss functions to reflect dwell time.
Incremental Training: To keep track embeddings fresh without retraining the whole graph, I'd use an incremental "Warm Start" for new items using their audio-feature similarity to existing "anchor" tracks.
Design Breakdown

Requirements

Product Goal: Deliver a personalized 30-song playlist every Monday that introduces users to music they haven't heard but are likely to love.
Success Metrics:
Online: Discovery Save Rate (tracks added to library), Playlist Dwell Time, CTR on the playlist banner.
Offline: Precision@30, NDCG, Diversity Score, Novelty Score (to ensure we aren't just recommending hits).
System Constraints:
Throughput: Must generate 500M playlists within a 24-hour window (Sunday).
Storage: Billions of user-item interaction pairs.
Data Availability: User play logs (timestamp, duration), user library (saves), playlist co-occurrence data, audio analysis (tempo, key, energy).

ML Problem Framing

ML Task Type: Two-stage Recommendation (Retrieval + Ranking).
Prediction Target: Probability of a "Positive Interaction" (defined as a play > 30s OR a save).
Inputs:
User Features: Embedded listening history (last 50 songs), demographic, preferred genres.
Item Features: Track embeddings (from Word2Vec on playlists), Audio CNN embeddings, Artist popularity.
Context Features: Time of year (e.g., holiday music), user's current top genres.
ML Challenges: Popularity bias (the "rich get richer" in CF), position bias in training logs, and high-cardinality categorical features (Artist IDs).

Design Summary & MVP

Concise Summary: A hybrid system using Word2Vec (Item2Vec) on millions of user playlists for collaborative filtering, combined with a Two-Tower Neural Network for personalized retrieval, finalized by a LightGBM Ranker.
Model Architecture & Selection:
Baseline: Item-Item CF using cosine similarity on user-library vectors.
Target Model: Two-Tower DNN for retrieval (user tower + item tower) and a Pointwise Ranking model for the final 30 songs.
Simplicity Audit: I am choosing Batch Inference over real-time. Since Discover Weekly is delivered once a week, we calculate all 500M playlists offline and store them in a Key-Value store (Cassandra/Redis). This avoids the complexity of high-availability real-time ranking.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: User event streams (Kafka) for real-time play logs; Metadata DBs for track info.
Data Ingestion: Use Spark/Flink for ETL. For DW, a Batch (Daily) approach is sufficient.
Data Storage: "Medallion" architecture in a Data Lake (S3/Parquet).
Bronze: Raw logs.
Silver: Deduplicated, joined user-track interactions.
Gold: User profiles and track feature vectors.
Data Quality: Pydantic/Great Expectations for schema validation. We must check for "Session Hijacking" (bots) which can skew user embeddings.

Feature Pipeline

Collaborative Features: Item2Vec. Treat a playlist like a sentence and tracks like words. Use Skip-gram to learn track embeddings based on co-occurrence.
Content Features: Use a pre-trained VGG-ish CNN on track spectrograms to get "Audio Embeddings." This is vital for the cold-start of new releases.
Temporal Features: "Recency" weightings—songs played yesterday are more relevant than songs played 3 years ago.
Feature Store: Use Feast or Tecton to ensure that the features used during training (offline) are exactly the same as those used during batch inference (online consistency).

Model Architecture

Retrieval (Candidate Generation):
Two-Tower DNN: One tower encodes user history/metadata into vector U. One tower encodes track features into vector I.
Loss: Sampled Softmax or Triplet Loss to maximize dot(U, I) for positive tracks.
Scale: Use FAISS (Facebook AI Similarity Search) to perform ANN search across 100M track embeddings.
Ranking (Final Selection):
XGBoost/LightGBM: Since we only rank ~500 items per user, a GBDT is highly effective at capturing non-linear interactions (e.g., "User likes Genre X AND Item is Popular in Region Y").
Target: Binary classification (Is interaction > 30s?).

Training Pipeline

Negative Sampling: Crucial for retrieval. Use "In-batch negatives" (random tracks) and "Hard negatives" (tracks the user saw but didn't click).
Data Splitting: Time-based split. Train on data up to Week N-1, validate on Week N. Random splits will cause data leakage via user habits.
Retraining: Two-Tower model retrained weekly; Track embeddings (Item2Vec) can be updated daily to capture new trends.

Serving Pipeline

Pattern: Batch Offline Inference. Every Sunday night, a massive Spark job runs for all active users.
Workflow:
Extract User Embeddings.
FAISS query to find Top 500 tracks.
Join with features.
Rank top 30 with XGBoost.
Write results to Cassandra.
SLA: The API just does a simple GET from Cassandra by user_id, ensuring <10ms response time for the user.

Evaluation Pipeline

Offline: Use Hit Rate @ 30 and NDCG. Compare the new model against the previous week’s model using historical logs.
Online (A/B Test): Split 5% of users to the new model. Measure "Discovery Ratio" (Listening to music not previously in library) vs. Control.

Monitoring Pipeline

Prediction Drift: Monitor the mean predicted score. If it drops, the model is losing confidence.
Feature Drift: Monitor the distribution of user embedding norms. Large shifts suggest changes in user behavior or upstream data bugs.
System Health: Track the completion time of the Sunday batch job. If it exceeds 24h, the system fails to deliver.
Wrap Up

Final Evaluation

Feedback Loop: "Skips" are fed back into the next week's negative sampling. "Saves" are given 5x weight in the positive training set.
Cold Start: Handled by the Audio CNN Tower in the retrieval stage.
Trade-off (Accuracy vs. Diversity): Pure accuracy leads to "Hit" songs. We intentionally inject a Maximum Marginal Relevance (MMR) penalty to the ranker to de-duplicate artists and genres in the final 30.
Distinguishing Insight: A Principal engineer knows that "Exploitation" (giving users what they already know) is easy, but "Discovery" requires a high Novelty Score. I would track the average popularity of recommended tracks and ensure it's significantly lower than the user's current "Top 40" habits to prove the system is actually "discovering."