The Question
ML DesignLarge-Scale Video Recommendation System Design
Design an end-to-end recommendation system for a global video streaming platform with 500M users and 100M videos. The system must optimize for multiple objectives (CTR and Watch Time) while meeting a 200ms P99 latency SLA. Detail the multi-stage architecture, including candidate retrieval via embedding-based search and high-precision ranking. Address specific challenges like position bias, delayed feedback in watch-time labels, and the infrastructure required for real-time feature freshness and model monitoring in production.
Two-Tower
MMoE
FAISS
Kafka
Flink
Spark
Tecton
XGBoost
DeepFM
Protobuf
Isotonic Regression
Questions & Insights
Clarifying Questions
Business Goal: Is the primary North Star metric total watch time, long-term retention, or click-through rate (CTR)?
Assumption*: The goal is to maximize Long-term Watch Time** while maintaining healthy CTR.
Constraints & Scale: What is the scale of the user base and item corpus?
Assumption: 500M Monthly Active Users (MAU), 100M+ videos, and a P99 latency budget of 200ms for the full recommendation pipeline.
Data Freshness: How quickly must new videos appear in recommendations?
Assumption: New videos (breaking news, trending clips) should be discoverable within minutes of upload.
Edge Cases: How do we handle the "Cold Start" problem for new users and "Filter Bubbles"?
Assumption: We will use content-based features for new videos and a "popular/trending" heuristic for new users.
Thinking Process
Identify the Funnel: With 100M videos, I cannot rank everything. I must use a multi-stage architecture: Retrieval (Candidate Generation) followed by Ranking (Scoring).
Addressing Scale vs. Latency: For retrieval, use an embedding-based approach (Two-Tower) to enable Approximate Nearest Neighbor (ANN) search. For ranking, use a more complex model (MMoE or DeepFM) on a smaller subset (~1,000 candidates).
Handling Data Skew: Video consumption data is power-law distributed. I need to handle popular "head" items differently from "tail" items to avoid bias.
MVP Mindset (YAGNI): Start with a simple Two-Tower retrieval and a DNN ranker. Avoid Reinforcement Learning or Graph Neural Nets until the baseline is robust.
Elite Bonus Points
Multi-Task Learning (MTL): Optimizing for a single metric (like CTR) leads to clickbait. Use an MMoE (Multi-gate Mixture-of-Experts) architecture to simultaneously predict CTR and "Watch Time Completion Rate."
Position Bias Correction: Users are more likely to click the first item regardless of quality. Use a "Position Feature" during training (but set it to a constant or use a shallow tower to neutralize it at inference) to decorrelate position from relevance.
Delayed Feedback Handling: "Watch time" is a delayed label. I will implement a "Label Joiner" in the pipeline that waits for a specific window (e.g., 20 mins) before emitting a training sample to ensure label accuracy.
Calibration: Since we optimize for watch time (regression) and CTR (classification), raw scores aren't comparable. Apply Isotonic Regression or Platt Scaling to ensure the probabilities represent real-world frequencies.
Design Breakdown
Requirements
Product Goal: Deliver highly relevant, engaging video content to increase user stickiness.
Success Metrics:
Online: Mean Watch Time per Session, Day-7 Retention, CTR.
Offline: AUC (for CTR), LogLoss, NDCG@K (for ranking order).
Guardrail: P99 Latency < 200ms, Error rate < 0.1%.
System Constraints: 50k QPS at peak; real-time ingestion of "Like/Dislike" signals.
Data Availability: User profile (age, geo), Video metadata (tags, transcript embeddings), Interaction logs (watch percentage, skip events).
ML Problem Framing
ML Task Type: Two-stage Ranking (Retrieval = Candidate Generation; Ranking = Precision Scoring).
Prediction Target:
P(\text{click} | \text{user, item, context})
E(\text{watch\_time} | \text{user, item, context, click})
Inputs:
User: Search history (last 50 queries), Watch history (last 100 videos), Embeddings.
Item: Video ID, Creator ID, Duration, Topic Embeddings, Aggregated historical CTR.
Context: Device (Mobile/Desktop), Time of day (weekday vs. weekend), Network quality.
ML Challenges: Extreme Class Imbalance (most videos are not clicked) and Feedback Loops (users only interact with what we show them).
Design Summary & MVP
Concise Summary: A two-stage pipeline using Two-Tower Embeddings for sub-millisecond retrieval and a Deep Neural Network (DNN) for high-precision ranking of the top 500 candidates.
Baseline Model: Matrix Factorization for retrieval and Logistic Regression with engineered cross-features for ranking.
Target Model: Two-Tower (Retrieval) + Multi-Gate Mixture-of-Experts (Ranking).
Choice Rationale: Two-Tower allows for decoupling User/Item towers for ANN search. MMoE allows balancing click propensity vs. watch duration, preventing clickbait.
Simplicity Audit: We avoid real-time online learning (complex infra) in favor of frequent batch retraining (daily) with a real-time feature store for freshness.
System Architecture
Pipeline Deep Dive
Data Pipeline
Data Source: User interaction events (click, play, pause, seek), Video uploads (metadata/blobs).
Data Ingestion: Kafka for real-time events. Using Protobuf for schema enforcement to prevent downstream pipeline breaks.
Data Storage: S3 for the raw data lake; BigQuery or Snowflake for structured analytical queries used by Data Scientists.
Data Processing: Spark for daily heavy lifting (calculating user-long-term preferences). Flink for session-based features (e.g., "videos watched in the last 10 minutes").
Feature Pipeline
Feature Definition:
User:
user_watch_hist_30d, preferred_categories, avg_session_duration.Item:
video_age, normalized_view_count, embedding_v2.Online Feature Pipeline: Flink aggregates real-time counters (e.g., "how many times was this video clicked in the last 5 mins") and pushes to a low-latency Redis/DynamoDB (Feature Store).
Feature Store: Using Tecton or Feast. It solves the Point-in-Time Join problem, ensuring that during training, we only use features available before the interaction occurred.
Model Architecture
Retrieval (Two-Tower):
User Tower: Dense layers processing user history and context.
Item Tower: Dense layers processing video metadata.
Output: Dot-product of towers. Optimized using sampled softmax loss.
Ranking (MMoE):
Shared Bottom: Shared layers for feature representation.
Experts: Task-specific sub-networks.
Towers: One for P(\text{click}) and one for E(\text{watch\_time}).
Optimization: Quantization (INT8) for the ranking model to reduce inference latency by 3-4x.
Training Pipeline
Dataset Construction: Negative sampling is critical. Use "Easy Negatives" (randomly sampled from corpus) for retrieval and "Hard Negatives" (retrieved but not clicked) for ranking.
Data Splitting: Time-based split. Train on days 1-28, validate on day 29, test on day 30. This avoids data leakage from future events.
Retraining: Daily retraining for the Ranking model. Weekly retraining for the Embedding towers (retrieval) as user interests shift slower than trending topics.
Serving Pipeline
Serving Pattern:
Retrieval: User ID -> Embedding -> FAISS search for top 500 Candidate IDs.
Feature Enrichment: Fetch 500 item features + 1 user feature set from Online Feature Store.
Scoring: MMoE predicts scores for 500 items.
Reliability: If the Ranker service fails, fall back to "Popularity-based" or "Retrieval-only" results (graceful degradation).
Evaluation Pipeline
Offline: Replay evaluation. Use historical logs to see if the model would have ranked the actually clicked video higher than the baseline.
Online: Interleaving or standard A/B testing. Interleaving is faster for ranking changes as it exposes users to both models simultaneously in one list.
Monitoring Pipeline
Model Drift: Monitor the distribution of predicted CTR vs. actual CTR. If the Population Stability Index (PSI) > 0.1, trigger a retraining job.
System Health: Track the "Feature Fill Rate." If the Feature Store starts returning nulls for critical features (like
user_id), the model performance will degrade silently.Wrap Up
Final Evaluation
Trade-offs: Accuracy vs. Latency. We use a smaller MMoE to stay under 200ms, even if a Transformer-based ranker would be +1% more accurate.
Cold Start: Use a separate "Exploration" Bandit (e.g., Thompson Sampling) to allocate 5% of traffic to new videos to collect data.
Distinguishing Insight: Multi-Stage Filtering. Between retrieval and ranking, add a "Filtering" layer to remove videos the user has already watched, blocked creators, or age-inappropriate content. This reduces the load on the heavy Ranker.