The Question

Music Recommendation System for High-Scale Streaming

Design a personalized music recommendation system for a global streaming platform with 500M+ users and 100M+ tracks. The system must handle high-concurrency requests (50k+ QPS) with sub-200ms latency. Detail the end-to-end architecture including multi-stage retrieval, ranking strategies to balance engagement vs. discovery, real-time feature engineering for session-based personalization, and robust evaluation frameworks to measure long-term user retention. Address specific challenges like audio-based cold start, handling negative signals (skips), and maintaining model freshness at scale.

Two-Tower DNN

MMoE

FAISS

Kafka

Flink

Spark

Tecton

PyTorch

HNSW

Questions & Insights

Clarifying Questions

Business Goal: Is the primary North Star metric user retention (Days Active), session duration (Watch Time/Listen Time), or discovery (Novelty/Diversity)?

Assumption*: The goal is to maximize Long-term Retention by optimizing for Listen Time** (Play-through rate) while maintaining a baseline of discovery.

Constraints & Scale: What is the scale of the user base and track corpus?

Assumption: 500M+ Monthly Active Users (MAU), 100M+ tracks. P99 latency for the recommendation endpoint must be < 200ms.

Data Freshness: How quickly should user actions (e.g., skipping a song) influence the next recommendation?

Assumption: Near real-time. Within-session preferences should be reflected in the next song selection.

Cold Start: How do we handle new users with no history and new songs with no play counts?

Assumption: New users get popular/trending tracks; new songs use content-based embeddings (audio analysis) for initial retrieval.

Thinking Process

Identify the Funnel: With 100M tracks, a single-stage model is impossible. I need a multi-stage approach: Retrieval (select top 1000)

\rightarrow

Ranking (score and sort top 100)

\rightarrow

Re-ranking (diversity, business logic).

The Retrieval Bottleneck: High-scale retrieval requires efficient Approximate Nearest Neighbor (ANN) search. I will use a Two-Tower Architecture to separate user and track embeddings.

Ranking Complexity: Once we have 1000 candidates, I need a model that captures non-linear feature interactions. A Deep Interest Network or MMoE (Multi-gate Mixture-of-Experts) is ideal to balance play-time vs. skip-probability.

Data Loop: The system must handle "implicit feedback" (playing a song) vs. "negative feedback" (skipping within 10 seconds).

Elite Bonus Points

Handling Delayed Feedback & Negative Signals: Distinguishing between a "skip" (low relevance) and a "pause/stop" (user left the app, not necessarily low relevance).

Calibration for Session Intent: Implementing a "context-aware" tower that captures whether a user is in "Focus/Study" mode vs. "Party/Discovery" mode using recent session embeddings.

Exploration/Exploitation via Thompson Sampling: Instead of just greedy ranking, use a lightweight bandit approach to inject novel tracks to prevent filter bubbles.

Embedding Versioning: Implementing a "warm-start" strategy for embeddings to prevent recommendation "shocks" when updating the retrieval model.

Design Breakdown

Requirements

Product Goal: Deliver a personalized stream of music that keeps users engaged.

Success Metrics:

Online: Play-through Rate (PTR), 30s-retention rate, Daily Active Minutes.

Offline: Recall@K (for retrieval), AUC/NDCG (for ranking).

Guardrail: P99 Latency, Catalog Coverage (are we recommending more than just the top 1% of artists?).

System Constraints: 50k+ QPS, 100M+ Item Corpus, 200ms Latency budget.

Data Availability: User profile (age, geo), User behavior (play history, skips, likes, playlists), Track metadata (genre, BPM, mood, audio embeddings).

ML Problem Framing

ML Task Type: Multi-stage Ranking (Retrieval + Scoring).

Prediction Target: Multi-objective.

P(\text{CompletePlay} | \text{User, Track, Context})

P(\text{Skip} | \text{User, Track, Context})

Inputs:

User: Historical track IDs, artist affinity, session-level features.

Item: Audio embeddings (VGGish/CLMR), genre, popularity (Global/Regional).

Context: Time of day (morning vs. night), Device (Mobile vs. Desktop), Location.

ML Challenges: Extreme class imbalance (most songs are never played), Feedback loops (users only see what we recommend), and Data sparsity.

Design Summary & MVP

Concise Summary: A two-stage system using Two-Tower DNNs for efficient vector-based retrieval followed by an MMoE-based ranker to optimize for multi-objective engagement metrics.

Baseline Model: Matrix Factorization (ALS) for retrieval; Logistic Regression for ranking.

Target Model: Two-Tower (Retrieval) + Multi-gate Mixture-of-Experts (Ranking).

Simplicity Audit: Avoids Reinforcement Learning (RL) in the first version. Uses batch training with near-real-time feature updates for the ranker.

Architecture Decision Rationale: Two-Tower allows pre-computing item embeddings for sub-10ms ANN search. MMoE handles the trade-off between "playing a song" and "not skipping it," which are distinct user intents.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Mobile/Desktop client logs (JSON/Protobuf) capture play, pause, skip, and search events.

Data Ingestion: Kafka for low-latency event ingestion. Flink for sessionization (grouping events by user session) and calculating real-time counters (e.g., "last 5 songs played").

Data Storage: S3 (Parquet) for raw logs. Partitioned by date and hour.

Data Processing: Spark jobs for daily aggregation (artist popularity, user taste profiles).

Data Quality: De-duplication of events and schema validation at the Flink level to prevent "poison pills" from entering the feature store.

Feature Pipeline

Feature Definition:

User: Embedding of last 20 tracks, mean valence/energy of recently played tracks.

Item: Genre (OHE), BPM (Normalized), Audio features (VGGish), Artist popularity.

Context: Sine/Cosine transformation of "Time of Day" to capture cyclicality.

Feature Store: Use Tecton or Feast.

Offline: Stores historical snapshots for point-in-time joins during training.

Online: Redis-backed store for < 5ms feature retrieval during inference.

Skew Mitigation: Log the exact features used at inference time to a "Log-and-Wait" table to ensure training data matches serving data.

Model Architecture

Retrieval (Two-Tower):

User Tower: Deep Neural Network (DNN) taking user features + historical embeddings

\rightarrow

User Vector

U

Item Tower: DNN taking track metadata + audio features

\rightarrow

Item Vector

V

Loss: Binary Cross-Entropy with Sampled Softmax or Triplet Loss to maximize

U \cdot V

for played songs.

Ranking (MMoE):

Shared Bottom: Embedding layers for sparse features.

Experts: Specialized layers to learn patterns.

Gates: One gate per task (Task A:

P(\text{PlayTime} > 30s)

, Task B:

P(\text{Like})

Why MMoE?: To handle conflicting objectives. A user might "like" a song but not "play it through" (and vice versa).

Training Pipeline

Dataset Construction: Negative sampling is critical. Use "Impression but no play" as hard negatives and "Random tracks" as easy negatives.

Labeling:

Positive: Play duration > 30s.

Negative: Skip within first 5s.

Infrastructure: PyTorch Distributed on GPU clusters.

Retraining: Weekly full-retrain for the Two-Tower model; Daily incremental training for the Ranker to capture trending music.

Serving Pipeline

Retrieval Logic: Export Track Tower embeddings to FAISS (IVF-Flat index). Query with the User Vector generated on-the-fly.

Ranking Service: Containerized (K8s) C++ or Go service calling a TensorFlow Serving or TorchServe endpoint.

Optimization: Use FP16 Quantization for the Ranker to fit in GPU memory and reduce latency.

Fallback: If the model service times out, return the user's "Top Tracks" or "Global Top 50" cached in Redis.

Evaluation Pipeline

Offline:

Recall@100: Percentage of ground-truth tracks caught in retrieval.

NDCG: Measures ranking quality at the top of the list.

Online: A/B Testing (Split traffic by UserID). Track "Long-term Playback Minutes" over a 2-week period.

Monitoring Pipeline

Drift Detection: Monitor the KL-divergence of the predicted score distribution. If scores shift significantly, it indicates feature/label drift.

System Health: Prometheus/Grafana for QPS, Latency (P50, P99), and Error Rates.

Feedback Loop: Automatically trigger a retraining pipeline if offline AUC on a validation set drops below a threshold.

Wrap Up

Final Evaluation

Observability: Track "Prediction Drift" (average predicted play probability vs. actual).

Edge Cases:

Cold Start: For new tracks, use the Item Tower with only metadata/audio features (no play-history) to generate an embedding.

Exploration: Apply a small penalty to the scores of the most popular items to favor "Long-tail" discovery.

Trade-offs:

Accuracy vs. Latency: Increased the candidate pool from 500 to 1000 improves Recall but adds 40ms to Ranking.

Freshness vs. Stability: Real-time updates prevent stale recommendations but risk feedback loops (echo chambers).