The Question

Scalable Personalized Recommendation System

Design a large-scale recommendation system for a music streaming platform to generate weekly personalized discovery playlists for 500 million users from a corpus of 100 million tracks. Your design should cover the end-to-end ML lifecycle: from multi-stage candidate retrieval and ranking to batch inference pipelines. Address specific challenges including cold-start for new tracks, popularity bias, and the use of audio-based vs. collaborative embeddings. Detail your approach to high-throughput offline serving, data consistency across weekly updates, and evaluation metrics that balance user satisfaction with discovery novelty.

Two-Tower DNN

LightGBM

Word2Vec

FAISS

CNN

Approximate Nearest Neighbor

Spark

Cassandra

Negative Sampling

MMR

Questions & Insights

Clarifying Questions

Clarifying Questions & Constraints:

Business Goal: Is the primary metric long-term user retention or immediate listening time? Assumption: The North Star is "Discovery Satisfaction," measured by songs saved to personal libraries or playlists.

Scale: What is the corpus size and user base? Assumption: 100M+ tracks, 500M+ Monthly Active Users (MAU).

Freshness: Does the playlist need to update mid-week? Assumption: No, this is a batch-generated weekly playlist delivered every Monday.

Cold Start: How do we handle users with zero history or brand-new tracks? Assumption: We use content-based audio embeddings and artist metadata for new tracks; popularity-based fallbacks for new users.

Assumptions:

Corpus: 100M tracks.

Training Data: Implicit feedback (plays, skips, saves) from the last 90 days.

Delivery: Offline batch inference (YAGNI principle: no need for real-time serving if the product is a static weekly list).

Thinking Process

Identify the Funnel: With 100M tracks, we cannot rank everything for every user. I need a multi-stage approach: Retrieval (Candidate Generation) to narrow down to ~500 tracks, followed by Ranking to pick the top 30.

Hybrid Intelligence: Collaborative Filtering (CF) captures "users like you," while Content-based filtering (audio features) solves the "cold start" for new music. I'll combine these into a unified embedding space.

Scaling Bottleneck: 500M users x 100M tracks is a massive join. I must use embeddings and Approximate Nearest Neighbor (ANN) search to make this tractable.

The "Weekly" Constraint: Since it's a weekly batch, I can optimize for throughput and cost-efficiency in the data pipeline rather than millisecond latency at the point of request.

Elite Bonus Points

Calibration for Diversity: Users might like 80% Rock and 20% Jazz. If the model only suggests Rock, it fails. I would implement Calibrated Recommendations to ensure the predicted distribution of genres matches the user's historical preference.

Exploration via Bandits: To prevent "filter bubbles," I’d reserve 2-3 slots for "Exploration" tracks using a Thompson Sampling or UCB approach based on genre-level uncertainty.

Negative Signal Weighting: Not all skips are equal. A skip after 2 seconds is a strong negative; a skip after 2 minutes might be a "soft" positive. I'll use weighted loss functions to reflect dwell time.

Incremental Training: To keep track embeddings fresh without retraining the whole graph, I'd use an incremental "Warm Start" for new items using their audio-feature similarity to existing "anchor" tracks.

Design Breakdown

Requirements

Product Goal: Deliver a personalized 30-song playlist every Monday that introduces users to music they haven't heard but are likely to love.

Success Metrics:

Online: Discovery Save Rate (tracks added to library), Playlist Dwell Time, CTR on the playlist banner.

Offline: Precision@30, NDCG, Diversity Score, Novelty Score (to ensure we aren't just recommending hits).

System Constraints:

Throughput: Must generate 500M playlists within a 24-hour window (Sunday).

Storage: Billions of user-item interaction pairs.

Data Availability: User play logs (timestamp, duration), user library (saves), playlist co-occurrence data, audio analysis (tempo, key, energy).

ML Problem Framing

ML Task Type: Two-stage Recommendation (Retrieval + Ranking).

Prediction Target: Probability of a "Positive Interaction" (defined as a play > 30s OR a save).

Inputs:

User Features: Embedded listening history (last 50 songs), demographic, preferred genres.

Item Features: Track embeddings (from Word2Vec on playlists), Audio CNN embeddings, Artist popularity.

Context Features: Time of year (e.g., holiday music), user's current top genres.

ML Challenges: Popularity bias (the "rich get richer" in CF), position bias in training logs, and high-cardinality categorical features (Artist IDs).

Design Summary & MVP

Concise Summary: A hybrid system using Word2Vec (Item2Vec) on millions of user playlists for collaborative filtering, combined with a Two-Tower Neural Network for personalized retrieval, finalized by a LightGBM Ranker.

Model Architecture & Selection:

Baseline: Item-Item CF using cosine similarity on user-library vectors.

Target Model: Two-Tower DNN for retrieval (user tower + item tower) and a Pointwise Ranking model for the final 30 songs.

Simplicity Audit: I am choosing Batch Inference over real-time. Since Discover Weekly is delivered once a week, we calculate all 500M playlists offline and store them in a Key-Value store (Cassandra/Redis). This avoids the complexity of high-availability real-time ranking.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: User event streams (Kafka) for real-time play logs; Metadata DBs for track info.

Data Ingestion: Use Spark/Flink for ETL. For DW, a Batch (Daily) approach is sufficient.

Data Storage: "Medallion" architecture in a Data Lake (S3/Parquet).

Bronze: Raw logs.

Silver: Deduplicated, joined user-track interactions.

Gold: User profiles and track feature vectors.

Data Quality: Pydantic/Great Expectations for schema validation. We must check for "Session Hijacking" (bots) which can skew user embeddings.

Feature Pipeline

Collaborative Features: Item2Vec. Treat a playlist like a sentence and tracks like words. Use Skip-gram to learn track embeddings based on co-occurrence.

Content Features: Use a pre-trained VGG-ish CNN on track spectrograms to get "Audio Embeddings." This is vital for the cold-start of new releases.

Temporal Features: "Recency" weightings—songs played yesterday are more relevant than songs played 3 years ago.

Feature Store: Use Feast or Tecton to ensure that the features used during training (offline) are exactly the same as those used during batch inference (online consistency).

Model Architecture

Retrieval (Candidate Generation):

Two-Tower DNN: One tower encodes user history/metadata into vector

U

. One tower encodes track features into vector

I

Loss: Sampled Softmax or Triplet Loss to maximize

dot(U, I)

for positive tracks.

Scale: Use FAISS (Facebook AI Similarity Search) to perform ANN search across 100M track embeddings.

Ranking (Final Selection):

XGBoost/LightGBM: Since we only rank ~500 items per user, a GBDT is highly effective at capturing non-linear interactions (e.g., "User likes Genre X AND Item is Popular in Region Y").

Target: Binary classification (Is interaction > 30s?).

Training Pipeline

Negative Sampling: Crucial for retrieval. Use "In-batch negatives" (random tracks) and "Hard negatives" (tracks the user saw but didn't click).

Data Splitting: Time-based split. Train on data up to Week

N-1

, validate on Week

N

. Random splits will cause data leakage via user habits.

Retraining: Two-Tower model retrained weekly; Track embeddings (Item2Vec) can be updated daily to capture new trends.

Serving Pipeline

Pattern: Batch Offline Inference. Every Sunday night, a massive Spark job runs for all active users.

Workflow:

Extract User Embeddings.

FAISS query to find Top 500 tracks.

Join with features.

Rank top 30 with XGBoost.

Write results to Cassandra.

SLA: The API just does a simple GET from Cassandra by user_id, ensuring <10ms response time for the user.

Evaluation Pipeline

Offline: Use Hit Rate @ 30 and NDCG. Compare the new model against the previous week’s model using historical logs.

Online (A/B Test): Split 5% of users to the new model. Measure "Discovery Ratio" (Listening to music not previously in library) vs. Control.

Monitoring Pipeline

Prediction Drift: Monitor the mean predicted score. If it drops, the model is losing confidence.

Feature Drift: Monitor the distribution of user embedding norms. Large shifts suggest changes in user behavior or upstream data bugs.

System Health: Track the completion time of the Sunday batch job. If it exceeds 24h, the system fails to deliver.

Wrap Up

Final Evaluation

Feedback Loop: "Skips" are fed back into the next week's negative sampling. "Saves" are given 5x weight in the positive training set.

Cold Start: Handled by the Audio CNN Tower in the retrieval stage.

Trade-off (Accuracy vs. Diversity): Pure accuracy leads to "Hit" songs. We intentionally inject a Maximum Marginal Relevance (MMR) penalty to the ranker to de-duplicate artists and genres in the final 30.

Distinguishing Insight: A Principal engineer knows that "Exploitation" (giving users what they already know) is easy, but "Discovery" requires a high Novelty Score. I would track the average popularity of recommended tracks and ensure it's significantly lower than the user's current "Top 40" habits to prove the system is actually "discovering."