The Question

Large-Scale Video Recommendation System Design

Design an end-to-end recommendation system for a global video streaming platform with 500M users and 100M videos. The system must optimize for multiple objectives (CTR and Watch Time) while meeting a 200ms P99 latency SLA. Detail the multi-stage architecture, including candidate retrieval via embedding-based search and high-precision ranking. Address specific challenges like position bias, delayed feedback in watch-time labels, and the infrastructure required for real-time feature freshness and model monitoring in production.

Tags: Two-Tower, MMoE, FAISS, Kafka, Flink, Spark, Tecton, XGBoost, DeepFM, Protobuf, Isotonic Regression

Questions & Insights

Clarifying Questions

Business Goal: Is the primary North Star metric total watch time, long-term retention, or click-through rate (CTR)?

Assumption: The goal is to maximize long-term watch time while maintaining a healthy CTR.

Constraints & Scale: What is the scale of the user base and item corpus?

Assumption: 500M Monthly Active Users (MAU), 100M+ videos, and a P99 latency budget of 200ms for the full recommendation pipeline.

Data Freshness: How quickly must new videos appear in recommendations?

Assumption: New videos (breaking news, trending clips) should be discoverable within minutes of upload.

Edge Cases: How do we handle the "Cold Start" problem for new users and "Filter Bubbles"?

Assumption: We will use content-based features for new videos and a "popular/trending" heuristic for new users.

Thinking Process

Identify the Funnel: With 100M videos, I cannot rank everything. I must use a multi-stage architecture: Retrieval (Candidate Generation) followed by Ranking (Scoring).

Addressing Scale vs. Latency: For retrieval, use an embedding-based approach (Two-Tower) to enable Approximate Nearest Neighbor (ANN) search over the full corpus. For ranking, use a more complex model (MMoE or DeepFM) on a much smaller subset (~500 candidates).

Handling Data Skew: Video consumption data is power-law distributed. I need to handle popular "head" items differently from "tail" items to avoid bias.

MVP Mindset (YAGNI): Start with a simple Two-Tower retrieval and a DNN ranker. Avoid Reinforcement Learning or Graph Neural Nets until the baseline is robust.

Elite Bonus Points

Multi-Task Learning (MTL): Optimizing for a single metric (like CTR) leads to clickbait. Use an MMoE (Multi-gate Mixture-of-Experts) architecture to simultaneously predict CTR and "Watch Time Completion Rate."

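Position Bias Correction: Users are more likely to click items in the top slots regardless of quality. Feed the display position as a feature during training, ideally through a separate shallow tower, so the model attributes slot effects to position rather than to the item. At inference, set the position to a constant (or drop the shallow tower) so that scores reflect relevance only.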
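A minimal sketch of the position-feature idea, assuming an additive shallow tower that reduces to one learned bias per slot. The `POSITION_BIAS` values and function names are hypothetical and only illustrate how the position term is present in training and absent at inference.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical per-slot logits that a shallow position tower might learn.
# Slot 0 (top of the list) gets the largest boost.
POSITION_BIAS = [0.9, 0.5, 0.2, 0.0, -0.2]

def train_time_click_prob(relevance_logit: float, position: int) -> float:
    """Training: the position logit is added so it can absorb slot effects."""
    return sigmoid(relevance_logit + POSITION_BIAS[position])

def serve_time_score(relevance_logit: float) -> float:
    """Inference: the position term is dropped, leaving relevance only."""
    return sigmoid(relevance_logit)
```

With this split, two items with the same relevance logit get the same serving score even if one was historically shown in a higher slot.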

Delayed Feedback Handling: "Watch time" is a delayed label. I will implement a "Label Joiner" in the pipeline that waits for a specific window (e.g., 20 mins) before emitting a training sample to ensure label accuracy.
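The label joiner can be sketched as an in-memory buffer, assuming each impression carries an ID that later watch events reference. The class and method names are hypothetical; a production version would live in a stream processor with persistent state rather than a Python dict.

```python
from dataclasses import dataclass, field

@dataclass
class LabelJoiner:
    window_s: float = 20 * 60  # wait 20 minutes before emitting a sample
    # impression_id -> [impression_timestamp, accumulated_watch_seconds]
    pending: dict = field(default_factory=dict)

    def on_impression(self, impression_id: str, ts: float) -> None:
        self.pending[impression_id] = [ts, 0.0]

    def on_watch(self, impression_id: str, watch_seconds: float) -> None:
        # Watch events for unknown or already-emitted impressions are ignored.
        if impression_id in self.pending:
            self.pending[impression_id][1] += watch_seconds

    def flush(self, now: float) -> list:
        """Emit (impression_id, watch_seconds) for impressions past the window."""
        ready = [i for i, (ts, _) in self.pending.items() if now - ts >= self.window_s]
        return [(i, self.pending.pop(i)[1]) for i in ready]
```

The window is a trade-off: a longer wait captures more of a long viewing session but delays the training data.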

Calibration: The CTR head (classification) and the watch-time head (regression) produce scores on different scales, so raw outputs cannot be compared or combined directly. Apply Isotonic Regression or Platt Scaling to the CTR output so that predicted probabilities match observed click frequencies.
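As an illustration of the isotonic option, here is a small pure-Python sketch of the Pool Adjacent Violators algorithm. It fits a non-decreasing step function from raw scores to observed click rates. The function names are hypothetical, and the step-function lookup (no interpolation between blocks) is a simplification.

```python
def fit_isotonic(scores, labels):
    """Pool Adjacent Violators. Returns (upper_bounds, values), both ascending."""
    blocks = []  # each block: [sum_of_labels, count, max_score_in_block]
    for s, y in sorted(zip(scores, labels)):
        blocks.append([float(y), 1, s])
        # Merge backwards while the previous block's mean exceeds the latest one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [b[2] for b in blocks], [b[0] / b[1] for b in blocks]

def calibrate(score, upper_bounds, values):
    """Map a raw score to the calibrated value of the first block that covers it."""
    for upper, value in zip(upper_bounds, values):
        if score <= upper:
            return value
    return values[-1]
```

The mapping should be fitted on a held-out slice of logged traffic, not on the data the ranker was trained on.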

Design Breakdown

Requirements

Product Goal: Deliver highly relevant, engaging video content to increase user stickiness.

Success Metrics:

Online: Mean Watch Time per Session, Day-7 Retention, CTR.

Offline: AUC (for CTR), LogLoss, NDCG@K (for ranking order).

Guardrail: P99 Latency < 200ms, Error rate < 0.1%.

System Constraints: 50k QPS at peak; real-time ingestion of "Like/Dislike" signals.

Data Availability: User profile (age, geo), Video metadata (tags, transcript embeddings), Interaction logs (watch percentage, skip events).

ML Problem Framing

ML Task Type: Two-stage Ranking (Retrieval = Candidate Generation; Ranking = Precision Scoring).

Prediction Target:

P(\text{click} \mid \text{user}, \text{item}, \text{context})

E(\text{watch\_time} \mid \text{user}, \text{item}, \text{context}, \text{click})

Inputs:

User: Search history (last 50 queries), Watch history (last 100 videos), Embeddings.

Item: Video ID, Creator ID, Duration, Topic Embeddings, Aggregated historical CTR.

Context: Device (Mobile/Desktop), Time of day, Day of week (weekday vs. weekend), Network quality.

ML Challenges: Extreme Class Imbalance (most videos are not clicked) and Feedback Loops (users only interact with what we show them).

Design Summary & MVP

Concise Summary: A two-stage pipeline that uses Two-Tower embeddings with ANN search for low-latency retrieval of the top 500 candidates, followed by a Deep Neural Network (DNN) that ranks those candidates with high precision.

Baseline Model: Matrix Factorization for retrieval and Logistic Regression with engineered cross-features for ranking.

Target Model: Two-Tower (Retrieval) + Multi-Gate Mixture-of-Experts (Ranking).

Choice Rationale: Two-Tower allows for decoupling User/Item towers for ANN search. MMoE allows balancing click propensity vs. watch duration, preventing clickbait.

Simplicity Audit: We avoid real-time online learning (complex infra) in favor of frequent batch retraining (daily) with a real-time feature store for freshness.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: User interaction events (click, play, pause, seek), Video uploads (metadata/blobs).

Data Ingestion: Kafka for real-time events, with Protobuf schemas enforced at ingestion to prevent downstream pipeline breaks.

Data Storage: S3 for the raw data lake; BigQuery or Snowflake for structured analytical queries used by Data Scientists.

Data Processing: Spark for daily heavy lifting (calculating long-term user preferences). Flink for session-based features (e.g., "videos watched in the last 10 minutes").

Feature Pipeline

Feature Definition:

User: user_watch_hist_30d, preferred_categories, avg_session_duration.

Item: video_age, normalized_view_count, embedding_v2.

Online Feature Pipeline: Flink aggregates real-time counters (e.g., "how many times was this video clicked in the last 5 mins") and pushes to a low-latency Redis/DynamoDB (Feature Store).
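The kind of counter Flink would maintain can be illustrated with a small sliding-window sketch. This is not Flink code: the class name is hypothetical, state is held in memory, and events are assumed to arrive in timestamp order.

```python
from collections import defaultdict, deque

class SlidingWindowCounter:
    """Counts events per video over the trailing window (default: 5 minutes)."""

    def __init__(self, window_s: float = 300.0):
        self.window_s = window_s
        self.events = defaultdict(deque)  # video_id -> timestamps, oldest first

    def record(self, video_id: str, ts: float) -> None:
        self.events[video_id].append(ts)

    def count(self, video_id: str, now: float) -> int:
        q = self.events[video_id]
        # Evict timestamps that have fallen out of the window.
        while q and q[0] <= now - self.window_s:
            q.popleft()
        return len(q)
```

The value returned by `count` is what would be written to the online store under a key such as the video ID.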

Feature Store: Use Tecton or Feast. A feature store handles the Point-in-Time Join problem: when building training data, each example is joined only to feature values that were available before the interaction occurred, which prevents label leakage.
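A point-in-time lookup can be sketched with a binary search over a feature's timestamped history. This is only an illustration of the join rule, not the Tecton or Feast API; the function name and the (timestamp, value) layout are assumptions.

```python
import bisect

def point_in_time_lookup(feature_history, event_ts):
    """feature_history: list of (timestamp, value) sorted by timestamp.

    Returns the latest value written strictly before event_ts, or None
    if the feature did not exist yet at that time.
    """
    timestamps = [ts for ts, _ in feature_history]
    idx = bisect.bisect_left(timestamps, event_ts)
    return feature_history[idx - 1][1] if idx > 0 else None
```

For example, with history `[(10, 0.1), (20, 0.2), (30, 0.3)]`, an interaction at time 25 is joined to the value written at time 20, never to the one written at time 30.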

Model Architecture

Retrieval (Two-Tower):

User Tower: Dense layers processing user history and context.

Item Tower: Dense layers processing video metadata.

Output: Dot-product of towers. Optimized using sampled softmax loss.
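The retrieval objective can be sketched as a softmax over dot products with in-batch negatives, one common variant of sampled softmax. The sketch below is an assumption about how the loss might look: it takes tower outputs as plain lists, and it omits the correction for popular items being over-represented as negatives.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def in_batch_softmax_loss(user_vecs, item_vecs):
    """Row i of user_vecs is a positive pair with row i of item_vecs.

    Every other item in the batch acts as a sampled negative for user i.
    Returns the mean cross-entropy over the batch.
    """
    total = 0.0
    for i, u in enumerate(user_vecs):
        logits = [dot(u, v) for v in item_vecs]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(user_vecs)
```

The loss is small when each user embedding scores its own item higher than the other items in the batch, which is the property ANN search relies on at serving time.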

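Ranking (MMoE):

Shared Input: Embedding and feature-processing layers shared by both tasks.

Experts: Several sub-networks shared across tasks. Each task has its own softmax gate that weights the expert outputs, so the two tasks can rely on different experts.

Towers: One for P(\text{click}) and one for E(\text{watch\_time}).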
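A forward pass through the MMoE ranker can be sketched in a few lines of plain Python. All weights, shapes, and names here are hypothetical; single linear layers stand in for the experts and towers, and the outputs are raw logits before any sigmoid or regression head.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def linear(weights, x):
    """weights: list of rows. Returns the matrix-vector product weights @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def relu(v):
    return [max(0.0, a) for a in v]

def mmoe_forward(x, expert_weights, gate_weights, tower_weights):
    # Every expert sees the same input and is shared by all tasks.
    expert_outs = [relu(linear(w, x)) for w in expert_weights]
    outputs = {}
    for task, gate_w in gate_weights.items():
        gate = softmax(linear(gate_w, x))  # one mixing weight per expert
        mixed = [
            sum(g * e[d] for g, e in zip(gate, expert_outs))
            for d in range(len(expert_outs[0]))
        ]
        # Task-specific tower, reduced here to a single linear unit.
        outputs[task] = sum(w * m for w, m in zip(tower_weights[task], mixed))
    return outputs
```

Because each task has its own gate, the click head and the watch-time head can weight the same experts differently, which is what lets the model trade off the two objectives.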

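Optimization: Quantize the ranking model to INT8 to reduce inference latency. The target is a 3-4x speedup, but the actual gain depends on the hardware and model structure, and ranking quality should be re-validated after quantization.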

Training Pipeline

Dataset Construction: Negative sampling is critical. Use "Easy Negatives" (randomly sampled from the corpus) for retrieval and "Hard Negatives" (shown to the user but not clicked) for ranking.

Data Splitting: Time-based split. Train on days 1-28, validate on day 29, test on day 30. This avoids data leakage from future events.
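The split can be expressed as a simple partition on the event day. The sketch assumes each logged row carries an integer `day` field (1-indexed within the 30-day window); the field name and function name are illustrative.

```python
def time_based_split(rows, train_end_day=28, val_day=29, test_day=30):
    """Partition rows by day so that no future event leaks into training."""
    split = {"train": [], "val": [], "test": []}
    for row in rows:
        if row["day"] <= train_end_day:
            split["train"].append(row)
        elif row["day"] == val_day:
            split["val"].append(row)
        elif row["day"] == test_day:
            split["test"].append(row)
    return split
```

Unlike a random split, every training row here predates every validation and test row.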

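Retraining: Daily retraining for the Ranking model. Weekly retraining for the Embedding towers (retrieval), since user interests shift more slowly than trending topics. New videos do not have to wait for the weekly retrain: the existing Item Tower can embed them from their content features at upload time and add them to the ANN index.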

Serving Pipeline

Serving Pattern:

Retrieval: User ID -> Embedding -> FAISS search for top 500 Candidate IDs.

Feature Enrichment: Fetch 500 item features + 1 user feature set from Online Feature Store.

Scoring: MMoE predicts scores for 500 items.

Reliability: If the Ranker service fails, fall back to "Popularity-based" or "Retrieval-only" results (graceful degradation).
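The serving flow and its fallback can be sketched as follows. A brute-force dot-product search stands in for the FAISS index, and `ranker` is a hypothetical callable that returns a score per candidate ID; neither reflects a real service interface.

```python
import heapq

def retrieve(user_vec, item_vecs, k):
    """Brute-force top-k by dot product. A stand-in for the ANN index."""
    scored = (
        (sum(a * b for a, b in zip(user_vec, vec)), vid)
        for vid, vec in item_vecs.items()
    )
    return [vid for _, vid in heapq.nlargest(k, scored)]

def recommend(user_vec, item_vecs, ranker, k_retrieve=500, k_final=10):
    candidates = retrieve(user_vec, item_vecs, k_retrieve)
    try:
        scores = ranker(candidates)  # hypothetical: returns {video_id: score}
        candidates.sort(key=lambda vid: scores[vid], reverse=True)
    except Exception:
        # Graceful degradation: keep the retrieval order if the ranker fails.
        pass
    return candidates[:k_final]
```

In a real service the `except` branch would also log the failure and emit a metric, so that a silent fallback does not hide a ranker outage.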

Evaluation Pipeline

Offline: Replay evaluation. Use historical logs to see if the model would have ranked the actually clicked video higher than the baseline.

Online: Interleaving or standard A/B testing. Interleaving typically detects ranking differences with less traffic, because each user sees results from both models in a single list.
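One way to build the interleaved list is team-draft interleaving, sketched below. The source does not name a specific interleaving method, so this choice is an assumption; the function also returns which model contributed each item, so that clicks can be credited to the right side.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k, rng=None):
    """Merge two rankings into one list of up to k items.

    Returns (result, team), where team maps each item to "A" or "B".
    """
    rng = rng or random.Random()
    result, team = [], {}
    count = {"A": 0, "B": 0}
    rankings = {"A": ranking_a, "B": ranking_b}
    while len(result) < k:
        # The side with fewer picks goes next; ties are broken by a coin flip.
        if count["A"] != count["B"]:
            side = "A" if count["A"] < count["B"] else "B"
        else:
            side = rng.choice(["A", "B"])
        pick = next((v for v in rankings[side] if v not in team), None)
        if pick is None:
            # This side is exhausted, so let the other side pick instead.
            side = "B" if side == "A" else "A"
            pick = next((v for v in rankings[side] if v not in team), None)
            if pick is None:
                break
        team[pick] = side
        count[side] += 1
        result.append(pick)
    return result, team
```

The model whose picks collect more clicks across many sessions is judged the better ranker.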

Monitoring Pipeline

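Model Drift: Track calibration by comparing predicted CTR with observed CTR, and track score drift with the Population Stability Index (PSI) between the training-time and live distributions of predicted CTR. If PSI > 0.1, trigger a retraining job.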
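PSI can be computed from two samples of predicted CTR, as in the sketch below. It assumes scores lie in [0, 1] and uses ten equal-width bins; the bin count and the `eps` floor for empty bins are illustrative choices, not values from the design.

```python
import math

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between two lists of scores in [0, 1]."""
    def histogram(values):
        counts = [0] * n_bins
        for v in values:
            counts[min(int(v * n_bins), n_bins - 1)] += 1
        # Floor empty bins at eps so the log ratio stays finite.
        return [max(c / len(values), eps) for c in counts]

    e, a = histogram(expected), histogram(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Here `expected` would be the score distribution captured at training time and `actual` a recent window of live predictions.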

System Health: Track the "Feature Fill Rate." If the Feature Store starts returning nulls for critical features (like user_watch_hist_30d), model performance will degrade silently.
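A fill-rate check can be as small as the sketch below, which assumes each feature-store response is a dict and that a missing key or a `None` value both count as unfilled. The function name is hypothetical.

```python
def fill_rates(feature_rows, feature_names):
    """Fraction of responses in which each feature is present and not None."""
    n = len(feature_rows)
    return {
        name: sum(1 for row in feature_rows if row.get(name) is not None) / n
        for name in feature_names
    }
```

Alerting when a critical feature's fill rate drops below its usual level catches the silent degradation before it shows up in engagement metrics.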

Wrap Up

Final Evaluation

Trade-offs: Accuracy vs. Latency. We use a smaller MMoE to stay under 200ms, even if a Transformer-based ranker would be about 1% more accurate.

Cold Start: Use a separate "Exploration" Bandit (e.g., Thompson Sampling) to allocate 5% of traffic to new videos to collect data.
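A Beta-Bernoulli Thompson Sampling bandit over new videos might look like the following sketch. The class name, the uniform Beta(1, 1) prior, and the use of clicks as the reward are assumptions; the 5% traffic allocation would be handled by the caller.

```python
import random

class ThompsonSampler:
    """Picks which new video to explore, treating each click as a success."""

    def __init__(self, video_ids, rng=None):
        self.rng = rng or random.Random()
        # [alpha, beta] per video; Beta(1, 1) is a uniform prior on CTR.
        self.stats = {v: [1, 1] for v in video_ids}

    def select(self):
        # Sample a plausible CTR for each video and show the highest draw.
        return max(self.stats, key=lambda v: self.rng.betavariate(*self.stats[v]))

    def update(self, video_id, clicked):
        self.stats[video_id][0 if clicked else 1] += 1
```

Videos with little data have wide posteriors and still get sampled occasionally, while videos that keep failing to earn clicks are shown less and less.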

Distinguishing Insight: Multi-Stage Filtering. Between retrieval and ranking, add a "Filtering" layer to remove videos the user has already watched, blocked creators, or age-inappropriate content. This reduces the load on the heavy Ranker.
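The filtering layer can be sketched as a single pass over the retrieved candidates. The field names (`video_id`, `creator_id`, `age_rating`) and the numeric age-rating comparison are assumptions made for illustration.

```python
def filter_candidates(candidates, watched_ids, blocked_creators, max_age_rating):
    """Drop candidates the user must not or need not see, before ranking.

    candidates: list of dicts with 'video_id', 'creator_id', 'age_rating'.
    """
    return [
        c for c in candidates
        if c["video_id"] not in watched_ids
        and c["creator_id"] not in blocked_creators
        and c["age_rating"] <= max_age_rating
    ]
```

Passing `watched_ids` and `blocked_creators` as sets keeps each membership check constant-time, so the filter adds little to the latency budget.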