DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
The Question
ML Design

Music Recommendation System for High-Scale Streaming

Design a personalized music recommendation system for a global streaming platform with 500M+ users and 100M+ tracks. The system must handle high-concurrency requests (50k+ QPS) with sub-200ms latency. Detail the end-to-end architecture including multi-stage retrieval, ranking strategies to balance engagement vs. discovery, real-time feature engineering for session-based personalization, and robust evaluation frameworks to measure long-term user retention. Address specific challenges like audio-based cold start, handling negative signals (skips), and maintaining model freshness at scale.
Two-Tower DNN
MMoE
FAISS
Kafka
Flink
Spark
Tecton
PyTorch
HNSW
Questions & Insights

Clarifying Questions

Business Goal: Is the primary North Star metric user retention (Days Active), session duration (Watch Time/Listen Time), or discovery (Novelty/Diversity)?
Assumption*: The goal is to maximize Long-term Retention by optimizing for Listen Time** (Play-through rate) while maintaining a baseline of discovery.
Constraints & Scale: What is the scale of the user base and track corpus?
Assumption: 500M+ Monthly Active Users (MAU), 100M+ tracks. P99 latency for the recommendation endpoint must be < 200ms.
Data Freshness: How quickly should user actions (e.g., skipping a song) influence the next recommendation?
Assumption: Near real-time. Within-session preferences should be reflected in the next song selection.
Cold Start: How do we handle new users with no history and new songs with no play counts?
Assumption: New users get popular/trending tracks; new songs use content-based embeddings (audio analysis) for initial retrieval.

Thinking Process

Identify the Funnel: With 100M tracks, a single-stage model is impossible. I need a multi-stage approach: Retrieval (select top 1000) \rightarrowRanking (score and sort top 100) \rightarrowRe-ranking (diversity, business logic).
The Retrieval Bottleneck: High-scale retrieval requires efficient Approximate Nearest Neighbor (ANN) search. I will use a Two-Tower Architecture to separate user and track embeddings.
Ranking Complexity: Once we have 1000 candidates, I need a model that captures non-linear feature interactions. A Deep Interest Network or MMoE (Multi-gate Mixture-of-Experts) is ideal to balance play-time vs. skip-probability.
Data Loop: The system must handle "implicit feedback" (playing a song) vs. "negative feedback" (skipping within 10 seconds).

Elite Bonus Points

Handling Delayed Feedback & Negative Signals: Distinguishing between a "skip" (low relevance) and a "pause/stop" (user left the app, not necessarily low relevance).
Calibration for Session Intent: Implementing a "context-aware" tower that captures whether a user is in "Focus/Study" mode vs. "Party/Discovery" mode using recent session embeddings.
Exploration/Exploitation via Thompson Sampling: Instead of just greedy ranking, use a lightweight bandit approach to inject novel tracks to prevent filter bubbles.
Embedding Versioning: Implementing a "warm-start" strategy for embeddings to prevent recommendation "shocks" when updating the retrieval model.
Design Breakdown

Requirements

Product Goal: Deliver a personalized stream of music that keeps users engaged.
Success Metrics:
Online: Play-through Rate (PTR), 30s-retention rate, Daily Active Minutes.
Offline: Recall@K (for retrieval), AUC/NDCG (for ranking).
Guardrail: P99 Latency, Catalog Coverage (are we recommending more than just the top 1% of artists?).
System Constraints: 50k+ QPS, 100M+ Item Corpus, 200ms Latency budget.
Data Availability: User profile (age, geo), User behavior (play history, skips, likes, playlists), Track metadata (genre, BPM, mood, audio embeddings).

ML Problem Framing

ML Task Type: Multi-stage Ranking (Retrieval + Scoring).
Prediction Target: Multi-objective.
P(\text{CompletePlay} | \text{User, Track, Context})
P(\text{Skip} | \text{User, Track, Context})
Inputs:
User: Historical track IDs, artist affinity, session-level features.
Item: Audio embeddings (VGGish/CLMR), genre, popularity (Global/Regional).
Context: Time of day (morning vs. night), Device (Mobile vs. Desktop), Location.
ML Challenges: Extreme class imbalance (most songs are never played), Feedback loops (users only see what we recommend), and Data sparsity.

Design Summary & MVP

Concise Summary: A two-stage system using Two-Tower DNNs for efficient vector-based retrieval followed by an MMoE-based ranker to optimize for multi-objective engagement metrics.
Baseline Model: Matrix Factorization (ALS) for retrieval; Logistic Regression for ranking.
Target Model: Two-Tower (Retrieval) + Multi-gate Mixture-of-Experts (Ranking).
Simplicity Audit: Avoids Reinforcement Learning (RL) in the first version. Uses batch training with near-real-time feature updates for the ranker.
Architecture Decision Rationale: Two-Tower allows pre-computing item embeddings for sub-10ms ANN search. MMoE handles the trade-off between "playing a song" and "not skipping it," which are distinct user intents.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Mobile/Desktop client logs (JSON/Protobuf) capture play, pause, skip, and search events.
Data Ingestion: Kafka for low-latency event ingestion. Flink for sessionization (grouping events by user session) and calculating real-time counters (e.g., "last 5 songs played").
Data Storage: S3 (Parquet) for raw logs. Partitioned by date and hour.
Data Processing: Spark jobs for daily aggregation (artist popularity, user taste profiles).
Data Quality: De-duplication of events and schema validation at the Flink level to prevent "poison pills" from entering the feature store.

Feature Pipeline

Feature Definition:
User: Embedding of last 20 tracks, mean valence/energy of recently played tracks.
Item: Genre (OHE), BPM (Normalized), Audio features (VGGish), Artist popularity.
Context: Sine/Cosine transformation of "Time of Day" to capture cyclicality.
Feature Store: Use Tecton or Feast.
Offline: Stores historical snapshots for point-in-time joins during training.
Online: Redis-backed store for < 5ms feature retrieval during inference.
Skew Mitigation: Log the exact features used at inference time to a "Log-and-Wait" table to ensure training data matches serving data.

Model Architecture

Retrieval (Two-Tower):
User Tower: Deep Neural Network (DNN) taking user features + historical embeddings \rightarrow User Vector U.
Item Tower: DNN taking track metadata + audio features \rightarrow Item Vector V.
Loss: Binary Cross-Entropy with Sampled Softmax or Triplet Loss to maximize U \cdot V for played songs.
Ranking (MMoE):
Shared Bottom: Embedding layers for sparse features.
Experts: Specialized layers to learn patterns.
Gates: One gate per task (Task A: P(\text{PlayTime} > 30s), Task B: P(\text{Like})).
Why MMoE?: To handle conflicting objectives. A user might "like" a song but not "play it through" (and vice versa).

Training Pipeline

Dataset Construction: Negative sampling is critical. Use "Impression but no play" as hard negatives and "Random tracks" as easy negatives.
Labeling:
Positive: Play duration > 30s.
Negative: Skip within first 5s.
Infrastructure: PyTorch Distributed on GPU clusters.
Retraining: Weekly full-retrain for the Two-Tower model; Daily incremental training for the Ranker to capture trending music.

Serving Pipeline

Retrieval Logic: Export Track Tower embeddings to FAISS (IVF-Flat index). Query with the User Vector generated on-the-fly.
Ranking Service: Containerized (K8s) C++ or Go service calling a TensorFlow Serving or TorchServe endpoint.
Optimization: Use FP16 Quantization for the Ranker to fit in GPU memory and reduce latency.
Fallback: If the model service times out, return the user's "Top Tracks" or "Global Top 50" cached in Redis.

Evaluation Pipeline

Offline:
Recall@100: Percentage of ground-truth tracks caught in retrieval.
NDCG: Measures ranking quality at the top of the list.
Online: A/B Testing (Split traffic by UserID). Track "Long-term Playback Minutes" over a 2-week period.

Monitoring Pipeline

Drift Detection: Monitor the KL-divergence of the predicted score distribution. If scores shift significantly, it indicates feature/label drift.
System Health: Prometheus/Grafana for QPS, Latency (P50, P99), and Error Rates.
Feedback Loop: Automatically trigger a retraining pipeline if offline AUC on a validation set drops below a threshold.
Wrap Up

Final Evaluation

Observability: Track "Prediction Drift" (average predicted play probability vs. actual).
Edge Cases:
Cold Start: For new tracks, use the Item Tower with only metadata/audio features (no play-history) to generate an embedding.
Exploration: Apply a small penalty to the scores of the most popular items to favor "Long-tail" discovery.
Trade-offs:
Accuracy vs. Latency: Increased the candidate pool from 500 to 1000 improves Recall but adds 40ms to Ranking.
Freshness vs. Stability: Real-time updates prevent stale recommendations but risk feedback loops (echo chambers).