The Question
ML Design

Scalable Video Recommendation System

As a lead engineer at a global streaming service, how would you architect a production-grade recommendation engine to serve personalized video feeds to hundreds of millions of users while balancing engagement metrics like clicks and long-term watch time? Detail the end-to-end lifecycle from data ingestion to real-time inference and bias mitigation.
Two-Tower Model
MMoE
FAISS/HNSW
MMR Re-ranking
Collaborative Filtering
Questions & Insights

Clarifying Questions

Business Goal: Is the North Star metric total Watch Time, Retention (D7/D30), or Session Depth? Assumption: We aim to maximize Long-form Watch Time while maintaining a secondary goal of User Engagement (Likes/Shares).
Constraints & Scale:
DAU: 500 Million.
Item Corpus: 100 Million videos.
QPS: ~100k peak.
Latency: P99 < 200ms for the entire recommendation funnel.
Edge Cases:
Cold Start: How do we handle 100k new videos uploaded per hour?
Data Freshness: How quickly should a user’s last interaction (e.g., clicking a video) influence the next recommendation?
Assumptions:
I assume a corpus of 100M videos and 500M users.
I assume a two-stage (Retrieval + Ranking) architecture is required to meet the 200ms SLA.
I assume we have access to implicit feedback (clicks, watch time) and explicit feedback (likes, follows).

Thinking Process

Identify the Funnel: With 100M items, a single-pass ranking model is impossible. I must design a multi-stage pipeline: Retrieval (Candidate Generation) -> Filtering -> Ranking (Scoring) -> Re-ranking (Diversity/Business Logic).
Addressing the Bottleneck: The primary bottleneck is the latency-throughput trade-off. I’ll use Approximate Nearest Neighbors (ANN) for retrieval and a Multi-gate Mixture-of-Experts (MMoE) for ranking to handle multiple objectives (click vs. watch time).
Scale Strategy: Use a Feature Store for consistent low-latency feature serving and distributed training (e.g., Parameter Servers or Horovod) for the massive user-item interaction matrix.
Continuous Improvement: Implement a robust Evaluation pipeline with A/B testing and a Monitoring pipeline for feature drift, as video trends change hourly.

Elite Bonus Points

Position Bias Correction: In ranking, items at the top of a list get more clicks regardless of quality. I would implement a "Position Feature" during training (set to a default value during inference) or use Propensity Scoring to de-bias the training data.
Delayed Feedback Modeling: Watch time is a "delayed" label (you only know it after the session ends). I'd implement a "Positive-Unlabeled" learning approach or a feedback windowing strategy to avoid penalizing items that haven't been fully watched yet.
Multi-task Calibration: If the model predicts a 0.8 probability of a click, but the actual CTR is 0.4, the model is uncalibrated. I would apply Platt Scaling or Isotonic Regression to ensure predicted probabilities match empirical reality, which is crucial for downstream business logic.
Exploration (Bandits): To solve the "filter bubble," I would reserve 5% of traffic for Thompson Sampling or Upper Confidence Bound (UCB) strategies to explore new content and collect unbiased data.
Design Breakdown

Functional Reqs

Personalized Home Feed: Users receive a list of videos tailored to their interests upon opening the app.
"Up Next" Recommendations: Real-time suggestions based on the video currently being watched.
Real-time Updates: User actions (likes/dislikes) should influence recommendations within seconds.

Non-Functional Reqs

Scalability: Support 500M DAU and 100M items.
Availability: 99.99% uptime; recommendations should fall back to "Popular" content if the ML service is down.
Latency: P99 < 200ms (Retrieval < 50ms, Ranking < 100ms, Overhead < 50ms).
Freshness: New uploads must be discoverable within minutes.

ML Problem Framing

ML Objective: Maximize E[\text{WatchTime} | \text{User}, \text{Context}, \text{Video}].
ML Category: Ranking (Learning to Rank) using a Pointwise or Listwise approach.
Input/Output/Label:
Input: User Profile, Historical Interactions, Video Metadata, Context (Device, Time).
Output: A ranked list of Video IDs.
Label: Continuous (Watch Time in seconds) and Binary (Clicked, Liked).

Data Prep & Features

Data Pipeline:
Client logs (events) \rightarrow Kafka S3 (Raw Data).
Transformation via Spark for offline training and Flink for real-time feature updates.
Feature Engineering:
User Features: Demographic, Embedding of last N watched videos, Average watch time per category.
Item Features: Video duration, Topic/Tags (NLP), Visual Embeddings (CNN/ViT), Upload timestamp (freshness).
Context Features: Device, Location, Time of Day (e.g., weekend vs. weekday).
Cross-features: User-Category affinity, User-Creator affinity.
Feature Store: Use Redis for online serving and Hive/Iceberg for offline training to prevent Training-Serving Skew.

Model Architecture

Retrieval (Two-Tower Model):
User Tower: Aggregates user history into a 128D vector.
Item Tower: Processes video features into a 128D vector.
Loss: Triplet loss or sampled softmax to maximize the inner product of (User, Positive Item).
Ranking (MMoE - Multi-gate Mixture-of-Experts):
Shared bottom layers to learn general representations.
Task-specific towers (Expert heads) for Click Probability (Classification) and Watch Time (Regression).
This allows the system to balance "clickbait" (high CTR, low watch time) against "quality content."

Training & Serving

Optimization: Adam optimizer with weight decay. Use a Time-based split for training/validation (don't use the future to predict the past).
Serving:
Retrieval: FAISS or HNSW index for sub-10ms similarity search.
Ranking: Model hosted on Triton Inference Server with GPU acceleration for batch scoring.
Addressing Challenges:
Sample Selection Bias: Use "all impressions" for ranking negatives, but "all videos" for retrieval negatives.

System Architecture

Pipeline Deep Dive

Data Pipeline

Ingestion: Kafka handles the firehose of clicks, impressions, and "heartbeat" watch events (sent every 10s to track precise watch duration).
Storage: Raw events are stored in Parquet/Iceberg on S3, partitioned by event_date and user_id_hash for efficient querying.

Feature Pipeline

Stream Processing: Flink calculates sliding window counts (e.g., "how many times has this video been watched in the last 5 minutes?") to detect viral content immediately.
Consistency: The Feature Store acts as the "source of truth." During training, we perform "Point-in-time" joins to ensure the model sees features exactly as they were at the time of the event.

Training Pipeline

Incremental Training: Given the scale, we don't retrain from scratch daily. We use "Warm Start" to initialize weights from the previous day's model and fine-tune on the new day's data.
Workflow: Airflow orchestrates the DAG, ensuring the Two-Tower index is rebuilt and pushed to the Retrieval service after the Ranking model is validated.

Serving Pipeline

Retrieval: We use multiple strategies—Collaborative Filtering, User-Author affinity, and "Popular in your Region"—to ensure high recall. These are merged into a candidate set of ~1,000 videos.
Ranking: The ~1,000 candidates are scored by the MMoE model. We use "Request Batching" on the GPU to score all candidates in a single pass.
Re-ranking: We apply a Maximal Marginal Relevance (MMR) algorithm to ensure the top 10 videos aren't all from the same creator or category.

Evaluation Pipeline

Online Experimentation: We use a "Layered" A/B testing framework where multiple experiments (e.g., UI change vs. Model change) can run concurrently without interference using hashing-based bucketization.
Interleaving: For faster ranking evaluation, we interleave recommendations from Model A and Model B and measure which one the user clicks more frequently.

Monitoring Pipeline

System Metrics: Tracking P99 latency is critical. If latency spikes, the orchestrator triggers a "Simplified Ranking" mode (skipping the MMoE and using a lightweight XGBoost model).
ML Metrics: We monitor the Prediction-to-Label Distribution. If the model starts predicting 0.1 CTR while the actual is 0.2, an automated alert is triggered for retraining or investigation of data pipeline health.
Wrap Up

Advanced Topics

Offline Summary: We look for improvements in PR-AUC for clicks and RMSE for watch time.
Online Summary: The success of the system is measured by a statistically significant increase in Average Daily Watch Time and 7-Day Retention.
Scalability Audit: By using ANN for retrieval and horizontal scaling for ranking nodes, the system can handle a 10x increase in traffic by simply increasing the shard count of the Redis Feature Store and the FAISS index.