The Question
ML Design: Large-Scale Personalized Video Recommendation System
Design a recommendation system for a video platform with 1B+ users and 100M+ videos. The system must provide personalized content in real-time with a P99 latency of <200ms. Detail the two-stage pipeline (retrieval and ranking), explain how you handle real-time user feedback, discuss strategies for negative sampling in training, and describe how the system addresses position bias and data freshness. Ensure the design includes a robust infrastructure for model evaluation and monitoring for training-serving skew.
Two-Tower Model
DeepFM
DCN-v2
ANN
HNSW
FAISS
Spark
Flink
Kafka
TensorRT
ONNX
Feast
Questions & Insights
Clarifying Questions
Business Goal: Is the primary North Star metric Watch Time, Click-Through Rate (CTR), or Long-term Retention? Assumption: Maximize total watch time per session.
Constraints & Scale: What is the scale of users and items? Assumption: 1B+ users, 100M+ videos, 100k QPS, <200ms P99 latency.
Edge Cases: How do we handle cold starts for new videos? Assumption: We use content-based features (metadata/embeddings) for new items until collaborative signals accumulate.
Data Freshness: How quickly must new user actions (e.g., a "like") influence recommendations? Assumption: Near real-time (seconds to minutes).
Assumptions:
Corpus of 100M items.
Two-stage architecture: Retrieval (Candidate Generation) and Ranking.
P99 latency requirement of <200ms.
Thinking Process
Identify the Bottleneck: You cannot rank 100M videos in real-time. The design must focus on a funnel approach: Retrieval (fast, coarse) -> Ranking (slow, precise).
Selection vs. Scale: For retrieval, use an approximate nearest neighbor (ANN) search on embeddings. For ranking, use a deep model that captures cross-feature interactions.
Data Flywheel: Ensure the feedback loop (clicks/watches) is captured and processed into features to avoid "stale" recommendations.
Simplicity (YAGNI): Start with a two-tower model for retrieval and a Deep Neural Network (DNN) for ranking before exploring multi-task learning or transformers.
Elite Bonus Points
Calibration: Optimizing purely for P(Click) tends to reward clickbait. I will weight positive labels by watch time so the model favors "value" over "clicks," and calibrate the output (e.g., Platt Scaling) so predicted probabilities match observed rates.
Delayed Feedback Loops: Watch time labels are only available minutes after a click. I'll use a "request-id" based join mechanism in the feature store to handle asynchronous label arrival.
Position Bias Correction: Users are more likely to click the first item. I will include "Position" as a feature during training but set it to a constant/default during inference to de-bias the model.
Exploration (Epsilon-Greedy/Thompson Sampling): To avoid filter bubbles, dedicate 5% of traffic to explore new content or random niches.
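The position-bias idea above can be illustrated with a minimal sketch. It assumes a toy logistic click model with two inputs; `DEFAULT_POSITION`, `build_features`, `predict_click`, and the weight values are hypothetical names and numbers chosen for illustration, not part of any real system.

```python
import math

DEFAULT_POSITION = 0  # assumption: every candidate is scored as if shown in the top slot

def build_features(relevance, position, training):
    """Return [relevance, position]. The logged slot is used only in training;
    at inference the position is replaced by a constant."""
    pos = position if training else DEFAULT_POSITION
    return [float(relevance), float(pos)]

def predict_click(weights, bias, features):
    """Toy logistic click model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned parameters: relevance helps, lower slots reduce clicks.
WEIGHTS = [2.0, -0.3]
BIAS = -1.0

# Training view: the same video looks worse when it was logged in slot 9.
train_top = predict_click(WEIGHTS, BIAS, build_features(0.8, 0, training=True))
train_low = predict_click(WEIGHTS, BIAS, build_features(0.8, 9, training=True))

# Serving view: the slot no longer influences the score.
serve_a = predict_click(WEIGHTS, BIAS, build_features(0.8, 0, training=False))
serve_b = predict_click(WEIGHTS, BIAS, build_features(0.8, 9, training=False))
```

Because the position weight absorbs the slot effect during training, the relevance weight is less contaminated by where an item happened to be displayed.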
Design Breakdown
Requirements
Product Goal: Increase session watch time and user retention.
Success Metrics:
Online: Watch Time, Session Depth, 7-day Retention.
Offline: Recall@K (for retrieval), AUC/NDCG (for ranking).
Guardrail: P99 Latency, Training/Serving Skew (KL Divergence).
System Constraints: 100M item corpus, 100k QPS, distributed training on TBs of data daily.
Data Availability: User watch history, video metadata (tags, title), user demographics, and real-time interaction logs.
ML Problem Framing
ML Task Type: Two-stage Ranking (Retrieval: Similarity/ANN; Ranking: Binary Classification + Regression for watch time).
Prediction Target: E[\text{watch time} | u, i, c] = P(\text{click} | u, i, c) \times E[\text{watch duration} | \text{click}, u, i, c], where u, i, and c denote the user, item, and context.
Inputs:
User: Historical video IDs (sequence), search history, country, language.
Item: Video ID embedding, channel ID, duration, upload time, topic category.
Context: Device, time of day, current page type.
ML Challenges: Extreme class imbalance (most videos are not watched), position bias, and highly dynamic item catalog.
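As a small worked example of the prediction target, the sketch below ranks three candidates by expected watch time. The video names and the probability and duration values are made up for illustration; in practice both factors would come from model heads.

```python
def expected_watch_time(p_click, expected_duration_given_click):
    """E[watch time] = P(click) * E[watch duration | click], in seconds."""
    return p_click * expected_duration_given_click

# Hypothetical model outputs: (P(click), E[duration | click] in seconds).
candidates = {
    "clickbait_clip": (0.30, 20.0),
    "long_tutorial": (0.10, 300.0),
    "music_video": (0.20, 90.0),
}

ranked = sorted(
    candidates,
    key=lambda vid: expected_watch_time(*candidates[vid]),
    reverse=True,
)
```

The clip with the highest click probability ends up last here, which is the behavior the watch-time objective is meant to produce.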
Design Summary & MVP
Concise Summary: A two-stage pipeline using a Two-Tower Model for embedding-based retrieval and a Deep Ranking model (DNN with Cross-features) to score the top 500 candidates.
Model Architecture & Selection:
Baseline: Popularity-based or Item-Item Collaborative Filtering (Heuristic).
Target Model: Two-Tower Model (Retrieval) and DeepFM or DCN-v2 (Ranking) for feature crossing.
Choice Rationale: Two-Tower allows pre-computing item embeddings for sub-10ms ANN search. Ranking models like DCN-v2 capture non-linear interactions (e.g., User_Language x Video_Language).
Simplicity Audit: Avoid Reinforcement Learning initially. A supervised learning approach with a strong feature store provides 90% of the value with 10% of the complexity.
System Architecture
Pipeline Deep Dive
Data Pipeline
Data Source: Client-side event tracking (clicks, pauses, skips) and server-side logs.
Ingestion: Kafka acts as the backbone. We use a Lambda architecture: Spark for high-throughput batch processing of historical data; Flink for sub-second stream processing of the user's last 5 videos, so recommendations update in near real time.
Storage: Parquet format on S3 for training data, partitioned by date and user_id_hash.
Feature Pipeline
Feature Engineering:
Categorical: Hash video IDs to a fixed vocabulary size.
Continuous: Log-transform view counts to handle long-tail distributions.
Embeddings: Pre-train item embeddings using Word2Vec (on watch sequences) or use the Two-Tower weights.
Feature Store: Use a system like Feast or Tecton.
Online: Low-latency (Redis/DynamoDB) for user context.
Offline: Point-in-time joins to prevent label leakage (ensuring features used for a training sample were known before the event occurred).
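A point-in-time join can be sketched without any feature-store library. The example below assumes a feature's history is a timestamp-sorted list of `(timestamp, value)` pairs; `feature_as_of` and the sample data are hypothetical, and a real feature store would perform the equivalent lookup at scale.

```python
from bisect import bisect_left

def feature_as_of(history, event_ts):
    """Return the latest feature value recorded strictly before event_ts.

    history: list of (timestamp, value) tuples sorted by timestamp.
    Returns None when nothing was known before the event.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_left(timestamps, event_ts)  # first entry with ts >= event_ts
    return history[idx - 1][1] if idx > 0 else None

# Hypothetical "videos watched so far" feature for one user.
watch_count_history = [(100, 3), (200, 5), (300, 9)]

# Training events as (timestamp, label).
training_events = [(250, "click"), (100, "click"), (301, "skip")]

rows = [
    (ts, label, feature_as_of(watch_count_history, ts))
    for ts, label in training_events
]
```

The event at timestamp 100 gets no feature value because the update written at that same instant is not treated as known beforehand, which is the conservative choice for avoiding leakage.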
Model Architecture
Retrieval (Two-Tower):
Separate DNNs for User and Item.
Dot product output: score = \text{UserEmbed} \cdot \text{ItemEmbed}.
Loss: Sampled Softmax or Triplet Loss.
Ranking (DCN-v2):
Cross Network: Explicitly models feature interactions u_i \times v_j.
Deep Network: Captures implicit non-linear relations.
This architecture handles sparse categorical features (IDs) and dense features (durations) effectively.
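To make the cross network concrete, here is a minimal pure-Python sketch of a single cross layer, assuming the commonly cited DCN-v2 form x_{l+1} = x_0 * (W x_l + b) + x_l with an elementwise product. The function name and the tiny weight matrix are illustrative; a production model would use a deep learning framework and learned weights.

```python
def cross_layer(x0, xl, weight, bias):
    """One cross layer: x0 * (W @ xl + b) + xl, with elementwise product.

    x0: original input vector, xl: output of the previous cross layer,
    weight: square matrix as a list of rows, bias: vector.
    """
    projected = [
        sum(w * x for w, x in zip(row, xl)) + b
        for row, b in zip(weight, bias)
    ]
    return [a * p + c for a, p, c in zip(x0, projected, xl)]

# Two input features; the weight matrix swaps them, so the first layer
# produces the explicit product x0[0] * x0[1] in both output dimensions.
x0 = [1.0, 2.0]
weight = [[0.0, 1.0], [1.0, 0.0]]
bias = [0.0, 0.0]

x1 = cross_layer(x0, x0, weight, bias)  # [1*2 + 1, 2*1 + 2] = [3.0, 4.0]
```

Stacking more layers raises the polynomial degree of the interactions, while the residual term keeps the lower-order features available.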
Training Pipeline
Negative Sampling: Since users only click a fraction of items, we use "Easy Negatives" (random videos from the corpus) and "Hard Negatives" (videos shown but not clicked).
Labeling: Positive = Watch time > 30s or > 50% of video.
Data Split: Time-based split (Train on days 1-28, Test on day 29) is mandatory to prevent temporal leakage.
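The labeling rule and the mix of easy and hard negatives can be sketched as follows. This is an illustration under assumptions: impressions are plain dicts with hypothetical field names, any shown video that does not meet the positive threshold is treated as a hard negative, and easy negatives are drawn uniformly from videos the user was not shown.

```python
import random

def is_positive(watch_seconds, video_seconds):
    """Positive if watched more than 30s or more than 50% of the video."""
    return watch_seconds > 30 or watch_seconds > 0.5 * video_seconds

def build_examples(impressions, corpus_ids, easy_per_positive, rng):
    """Return (user, video, label, kind) tuples for training."""
    examples = []
    for imp in impressions:
        user = imp["user"]
        if is_positive(imp["watch_seconds"], imp["video_seconds"]):
            examples.append((user, imp["video"], 1, "positive"))
            # Easy negatives: random corpus videos this user was never shown.
            shown = {i["video"] for i in impressions if i["user"] == user}
            pool = [v for v in corpus_ids if v not in shown]
            for vid in rng.sample(pool, min(easy_per_positive, len(pool))):
                examples.append((user, vid, 0, "easy_negative"))
        else:
            # Hard negative: shown to the user but not meaningfully watched.
            examples.append((user, imp["video"], 0, "hard_negative"))
    return examples

impressions = [
    {"user": "u1", "video": "v1", "watch_seconds": 45, "video_seconds": 120},
    {"user": "u1", "video": "v2", "watch_seconds": 0, "video_seconds": 60},
]
corpus_ids = ["v1", "v2", "v3", "v4", "v5", "v6"]
examples = build_examples(impressions, corpus_ids, easy_per_positive=2, rng=random.Random(0))
```

Uniform sampling is only one option; sampling negatives in proportion to popularity is a common alternative that would need a correction term in the loss.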
Serving Pipeline
Retrieval Strategy: Export Item Tower embeddings to a Vector DB (e.g., Milvus or Pinecone) using an HNSW index, whose search cost grows approximately logarithmically with corpus size.
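Ranking Optimization: Use TensorRT or ONNX (via ONNX Runtime) with reduced precision (FP16/INT8) to cut inference latency. The target is a drop from roughly 50ms to 10ms, to be confirmed by benchmarking on the serving hardware.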
Fallback: If the ranker fails, return the retrieval output directly or a cached "Trending" list.
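The serving path and its fallbacks can be sketched end to end. This example swaps in an exact dot-product scan for the ANN index so that it stays self-contained; `retrieve`, `recommend`, the `ranker` callable, and the sample vectors are all hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve(user_vec, item_vecs, k):
    """Exact top-k by dot product; a stand-in for an HNSW lookup."""
    ordered = sorted(item_vecs, key=lambda vid: dot(user_vec, item_vecs[vid]), reverse=True)
    return ordered[:k]

def recommend(user_vec, item_vecs, ranker, trending, k=500, n=10):
    """Retrieve up to k candidates, rank them, and return the top n.

    Falls back to retrieval order if the ranker raises, and to a cached
    trending list if retrieval returns nothing.
    """
    candidates = retrieve(user_vec, item_vecs, k)
    if not candidates:
        return trending[:n]
    try:
        scores = {vid: ranker(vid) for vid in candidates}
    except Exception:
        return candidates[:n]
    return sorted(candidates, key=scores.get, reverse=True)[:n]

item_vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
user_vec = [1.0, 0.5]
trending = ["t1", "t2"]

def healthy_ranker(vid):
    return {"a": 0.9, "b": 0.2, "c": 0.1}[vid]

def broken_ranker(vid):
    raise RuntimeError("ranker unavailable")

ranked = recommend(user_vec, item_vecs, healthy_ranker, trending, k=2, n=2)
degraded = recommend(user_vec, item_vecs, broken_ranker, trending, k=2, n=2)
empty = recommend(user_vec, {}, healthy_ranker, trending, k=2, n=2)
```

A real service would typically also put a timeout around the ranker call so that a slow model triggers the same fallback as a failed one.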
Evaluation Pipeline
Offline: Use NDCG (Normalized Discounted Cumulative Gain) to evaluate if the "best" videos are at the top.
Online: A/B test with a 1% bucket, measuring "Incremental Watch Time" against the control group.
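For the offline metric, a minimal NDCG@k computation looks like the sketch below. It assumes graded relevance labels listed in the order the model ranked them and uses the linear-gain form of DCG; the exponential-gain variant is also common.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions (linear gain)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

perfect = ndcg_at_k([3, 2, 1], k=3)   # best item first
reversed_ = ndcg_at_k([1, 2, 3], k=3)  # best item last
no_signal = ndcg_at_k([0, 0], k=2)     # nothing relevant in the list
```

A score of 1.0 means the model's ordering matches the ideal ordering within the top k.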
Monitoring Pipeline
Data Drift: Monitor the distribution of input features (e.g., if the % of users from a specific country drops, the pipeline might be broken).
Prediction Drift: Monitor the average score output by the ranker. A sudden shift usually indicates an upstream data quality issue.
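One way to turn drift monitoring into a concrete check is to compare a categorical feature's serving distribution against its training distribution with KL divergence. The sketch below is illustrative: the country values, the smoothing constant, and the alert threshold of 0.1 are assumptions, and a real threshold would be tuned per feature.

```python
import math
from collections import Counter

def distribution(values, categories, smoothing=1e-6):
    """Smoothed relative frequencies, so unseen categories never get probability 0."""
    counts = Counter(values)
    total = len(values) + smoothing * len(categories)
    return {c: (counts[c] + smoothing) / total for c in categories}

def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same categories."""
    return sum(p[c] * math.log(p[c] / q[c]) for c in p)

CATEGORIES = ["US", "IN", "BR"]
ALERT_THRESHOLD = 0.1  # hypothetical value

training = distribution(["US"] * 50 + ["IN"] * 30 + ["BR"] * 20, CATEGORIES)
serving_ok = distribution(["US"] * 50 + ["IN"] * 30 + ["BR"] * 20, CATEGORIES)
serving_broken = distribution(["US"] * 95 + ["IN"] * 5, CATEGORIES)  # BR traffic vanished

healthy_score = kl_divergence(serving_ok, training)
broken_score = kl_divergence(serving_broken, training)
should_alert = broken_score > ALERT_THRESHOLD
```

The same comparison applies to bucketed continuous features and to the ranker's score histogram.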
Wrap Up
Final Evaluation
Cold Start: Address via a separate content-based retrieval tower (extracting embeddings from video thumbnails using a CNN).
Trade-offs:
Accuracy vs. Latency: We cap the number of items sent to the ranker at 500 to maintain <200ms latency.
Exploration vs. Exploitation: Use a Multi-Armed Bandit layer at the very end to re-rank a few exploratory items.
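The bandit layer can be sketched with Beta-Bernoulli Thompson Sampling. In this illustration each exploratory video keeps a count of successes and failures (for example, watched past the positive threshold or not), and a few fixed slots in the ranked list are replaced by sampled picks. The function names, the slot positions, and the statistics are hypothetical.

```python
import random

def thompson_pick(stats, rng):
    """Sample Beta(successes + 1, failures + 1) per video and return the argmax."""
    samples = {vid: rng.betavariate(s + 1, f + 1) for vid, (s, f) in stats.items()}
    return max(samples, key=samples.get)

def inject_exploration(ranked, explore_stats, slots, rng):
    """Replace the items at the given slots with Thompson-sampled exploratory videos."""
    result = list(ranked)
    remaining = dict(explore_stats)
    for slot in slots:
        if not remaining or slot >= len(result):
            break
        pick = thompson_pick(remaining, rng)
        del remaining[pick]  # never show the same exploratory video twice
        result[slot] = pick
    return result

ranked = ["v1", "v2", "v3", "v4", "v5"]
# (successes, failures) per exploratory video; "n2" is brand new.
explore_stats = {"n1": (8, 2), "n2": (0, 0), "n3": (1, 9)}

final_list = inject_exploration(ranked, explore_stats, slots=[2, 4], rng=random.Random(7))
```

New videos with no history get a uniform prior, so they are shown often enough to gather signal, while videos with a poor track record are picked less and less.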