The Question
ML DesignMusic Recommendation System for High-Scale Streaming
Design a personalized music recommendation system for a global streaming platform with 500M+ users and 100M+ tracks. The system must handle high-concurrency requests (50k+ QPS) with sub-200ms latency. Detail the end-to-end architecture including multi-stage retrieval, ranking strategies to balance engagement vs. discovery, real-time feature engineering for session-based personalization, and robust evaluation frameworks to measure long-term user retention. Address specific challenges like audio-based cold start, handling negative signals (skips), and maintaining model freshness at scale.
Two-Tower DNN
MMoE
FAISS
Kafka
Flink
Spark
Tecton
PyTorch
HNSW
Questions & Insights
Clarifying Questions
Business Goal: Is the primary North Star metric user retention (Days Active), session duration (Watch Time/Listen Time), or discovery (Novelty/Diversity)?
Assumption*: The goal is to maximize Long-term Retention by optimizing for Listen Time** (Play-through rate) while maintaining a baseline of discovery.
Constraints & Scale: What is the scale of the user base and track corpus?
Assumption: 500M+ Monthly Active Users (MAU), 100M+ tracks. P99 latency for the recommendation endpoint must be < 200ms.
Data Freshness: How quickly should user actions (e.g., skipping a song) influence the next recommendation?
Assumption: Near real-time. Within-session preferences should be reflected in the next song selection.
Cold Start: How do we handle new users with no history and new songs with no play counts?
Assumption: New users get popular/trending tracks; new songs use content-based embeddings (audio analysis) for initial retrieval.
Thinking Process
Identify the Funnel: With 100M tracks, a single-stage model is impossible. I need a multi-stage approach: Retrieval (select top 1000) \rightarrowRanking (score and sort top 100) \rightarrowRe-ranking (diversity, business logic).
The Retrieval Bottleneck: High-scale retrieval requires efficient Approximate Nearest Neighbor (ANN) search. I will use a Two-Tower Architecture to separate user and track embeddings.
Ranking Complexity: Once we have 1000 candidates, I need a model that captures non-linear feature interactions. A Deep Interest Network or MMoE (Multi-gate Mixture-of-Experts) is ideal to balance play-time vs. skip-probability.
Data Loop: The system must handle "implicit feedback" (playing a song) vs. "negative feedback" (skipping within 10 seconds).
Elite Bonus Points
Handling Delayed Feedback & Negative Signals: Distinguishing between a "skip" (low relevance) and a "pause/stop" (user left the app, not necessarily low relevance).
Calibration for Session Intent: Implementing a "context-aware" tower that captures whether a user is in "Focus/Study" mode vs. "Party/Discovery" mode using recent session embeddings.
Exploration/Exploitation via Thompson Sampling: Instead of just greedy ranking, use a lightweight bandit approach to inject novel tracks to prevent filter bubbles.
Embedding Versioning: Implementing a "warm-start" strategy for embeddings to prevent recommendation "shocks" when updating the retrieval model.
Design Breakdown
Requirements
Product Goal: Deliver a personalized stream of music that keeps users engaged.
Success Metrics:
Online: Play-through Rate (PTR), 30s-retention rate, Daily Active Minutes.
Offline: Recall@K (for retrieval), AUC/NDCG (for ranking).
Guardrail: P99 Latency, Catalog Coverage (are we recommending more than just the top 1% of artists?).
System Constraints: 50k+ QPS, 100M+ Item Corpus, 200ms Latency budget.
Data Availability: User profile (age, geo), User behavior (play history, skips, likes, playlists), Track metadata (genre, BPM, mood, audio embeddings).
ML Problem Framing
ML Task Type: Multi-stage Ranking (Retrieval + Scoring).
Prediction Target: Multi-objective.
P(\text{CompletePlay} | \text{User, Track, Context})
P(\text{Skip} | \text{User, Track, Context})
Inputs:
User: Historical track IDs, artist affinity, session-level features.
Item: Audio embeddings (VGGish/CLMR), genre, popularity (Global/Regional).
Context: Time of day (morning vs. night), Device (Mobile vs. Desktop), Location.
ML Challenges: Extreme class imbalance (most songs are never played), Feedback loops (users only see what we recommend), and Data sparsity.
Design Summary & MVP
Concise Summary: A two-stage system using Two-Tower DNNs for efficient vector-based retrieval followed by an MMoE-based ranker to optimize for multi-objective engagement metrics.
Baseline Model: Matrix Factorization (ALS) for retrieval; Logistic Regression for ranking.
Target Model: Two-Tower (Retrieval) + Multi-gate Mixture-of-Experts (Ranking).
Simplicity Audit: Avoids Reinforcement Learning (RL) in the first version. Uses batch training with near-real-time feature updates for the ranker.
Architecture Decision Rationale: Two-Tower allows pre-computing item embeddings for sub-10ms ANN search. MMoE handles the trade-off between "playing a song" and "not skipping it," which are distinct user intents.
System Architecture
Pipeline Deep Dive
Data Pipeline
Data Source: Mobile/Desktop client logs (JSON/Protobuf) capture
play, pause, skip, and search events.Data Ingestion: Kafka for low-latency event ingestion. Flink for sessionization (grouping events by user session) and calculating real-time counters (e.g., "last 5 songs played").
Data Storage: S3 (Parquet) for raw logs. Partitioned by
date and hour. Data Processing: Spark jobs for daily aggregation (artist popularity, user taste profiles).
Data Quality: De-duplication of events and schema validation at the Flink level to prevent "poison pills" from entering the feature store.
Feature Pipeline
Feature Definition:
User: Embedding of last 20 tracks, mean valence/energy of recently played tracks.
Item: Genre (OHE), BPM (Normalized), Audio features (VGGish), Artist popularity.
Context: Sine/Cosine transformation of "Time of Day" to capture cyclicality.
Feature Store: Use Tecton or Feast.
Offline: Stores historical snapshots for point-in-time joins during training.
Online: Redis-backed store for < 5ms feature retrieval during inference.
Skew Mitigation: Log the exact features used at inference time to a "Log-and-Wait" table to ensure training data matches serving data.
Model Architecture
Retrieval (Two-Tower):
User Tower: Deep Neural Network (DNN) taking user features + historical embeddings \rightarrow User Vector U.
Item Tower: DNN taking track metadata + audio features \rightarrow Item Vector V.
Loss: Binary Cross-Entropy with Sampled Softmax or Triplet Loss to maximize U \cdot V for played songs.
Ranking (MMoE):
Shared Bottom: Embedding layers for sparse features.
Experts: Specialized layers to learn patterns.
Gates: One gate per task (Task A: P(\text{PlayTime} > 30s), Task B: P(\text{Like})).
Why MMoE?: To handle conflicting objectives. A user might "like" a song but not "play it through" (and vice versa).
Training Pipeline
Dataset Construction: Negative sampling is critical. Use "Impression but no play" as hard negatives and "Random tracks" as easy negatives.
Labeling:
Positive: Play duration > 30s.
Negative: Skip within first 5s.
Infrastructure: PyTorch Distributed on GPU clusters.
Retraining: Weekly full-retrain for the Two-Tower model; Daily incremental training for the Ranker to capture trending music.
Serving Pipeline
Retrieval Logic: Export Track Tower embeddings to FAISS (IVF-Flat index). Query with the User Vector generated on-the-fly.
Ranking Service: Containerized (K8s) C++ or Go service calling a TensorFlow Serving or TorchServe endpoint.
Optimization: Use FP16 Quantization for the Ranker to fit in GPU memory and reduce latency.
Fallback: If the model service times out, return the user's "Top Tracks" or "Global Top 50" cached in Redis.
Evaluation Pipeline
Offline:
Recall@100: Percentage of ground-truth tracks caught in retrieval.
NDCG: Measures ranking quality at the top of the list.
Online: A/B Testing (Split traffic by UserID). Track "Long-term Playback Minutes" over a 2-week period.
Monitoring Pipeline
Drift Detection: Monitor the KL-divergence of the predicted score distribution. If scores shift significantly, it indicates feature/label drift.
System Health: Prometheus/Grafana for QPS, Latency (P50, P99), and Error Rates.
Feedback Loop: Automatically trigger a retraining pipeline if offline AUC on a validation set drops below a threshold.
Wrap Up
Final Evaluation
Observability: Track "Prediction Drift" (average predicted play probability vs. actual).
Edge Cases:
Cold Start: For new tracks, use the Item Tower with only metadata/audio features (no play-history) to generate an embedding.
Exploration: Apply a small penalty to the scores of the most popular items to favor "Long-tail" discovery.
Trade-offs:
Accuracy vs. Latency: Increased the candidate pool from 500 to 1000 improves Recall but adds 40ms to Ranking.
Freshness vs. Stability: Real-time updates prevent stale recommendations but risk feedback loops (echo chambers).