The Question
ML Design

Scalable Video Recommendation System

Design a high-throughput, low-latency recommendation engine for a global video sharing platform. The system should handle hundreds of millions of users and items, provide real-time personalization based on user behavior, and optimize for long-term engagement metrics like retention and watch time, while ensuring diversity and handling new content discovery.
Key techniques: Two-Tower Model, MMoE, FAISS/HNSW, MMR Re-ranking, Collaborative Filtering
Questions & Insights

Clarifying Questions

Business Goal: Is the primary metric long-term retention, total watch time, or click-through rate (CTR)?
Assumption: The North Star is long-term retention, optimized via a proxy of weighted watch time (to avoid rewarding clickbait).
Constraints & Scale:
DAU: 200 Million.
Corpus: 500 Million videos.
Latency: P99 < 150ms for the entire recommendation loop.
QPS: ~500k at peak.
Edge Cases:
Cold Start: How do we handle 100k new videos uploaded per hour?
Feedback Loops: How do we prevent the "filter bubble" and ensure diversity?
Assumptions: I assume a multi-stage architecture (Retrieval -> Ranking -> Re-ranking) is required to handle the 500M item corpus within the latency budget.

Thinking Process

Identify the Funnel: With 500M items, scoring every item with a heavy model on each request is infeasible within the latency budget. I must design a multi-stage funnel: Retrieval (candidate generation), Ranking (heavy-weight scoring), and Re-ranking (business logic/diversity).
Addressing Freshness: Video platforms live and die by "what's happening now." I need a streaming feature pipeline (Flink) to capture real-time user actions and update the "User Context" immediately.
Optimization Strategy: Move beyond simple CTR prediction to Multi-Task Learning (MTL), predicting multiple objectives simultaneously (watch time, like, share, dismiss).
System Scalability: Use Approximate Nearest Neighbor (ANN) search for retrieval to keep the first stage sub-10ms.

Elite Bonus Points

Position Bias Modeling: In video feeds, users are more likely to click the top item regardless of relevance. I will implement a shallow "Position Bias" tower during training that is discarded during inference.
Delayed Feedback Loops: Watch time is only known after the session ends. I'll treat an impression with no watch signal as a provisional negative and correct the label once the heartbeat signal arrives.
Calibration: Since we optimize for multiple objectives, the raw scores of different models aren't comparable. I will implement Platt Scaling or Isotonic Regression to ensure predicted probabilities match empirical likelihoods for better downstream ranking fusion.
Embedding Versioning: To prevent version skew during serving (item embeddings produced by one model version being consumed by another), I'll implement a deployment strategy that ensures item embeddings and the ranking model are updated atomically (Version-Pinned Serving).
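The calibration point above can be illustrated with a minimal isotonic regression sketch based on the Pool Adjacent Violators algorithm. It uses only the Python standard library; the function names `fit_isotonic` and `calibrate` and the toy data are illustrative assumptions, not part of the original design. In practice a library implementation would likely be used.

```python
from bisect import bisect_left

def fit_isotonic(scores, labels):
    """Pool Adjacent Violators: returns (upper_score, calibrated_prob) steps."""
    blocks = []  # each block: [label_sum, count, max_score_in_block]
    for score, label in sorted(zip(scores, labels)):
        blocks.append([float(label), 1, score])
        # Merge backwards while the previous block's mean exceeds the current one,
        # so the fitted probabilities end up non-decreasing in the raw score.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            label_sum, count, max_score = blocks.pop()
            blocks[-1][0] += label_sum
            blocks[-1][1] += count
            blocks[-1][2] = max_score
    return [(b[2], b[0] / b[1]) for b in blocks]

def calibrate(steps, raw_score):
    """Map a raw model score to the empirical rate of its step (step function)."""
    uppers = [upper for upper, _ in steps]
    idx = min(bisect_left(uppers, raw_score), len(steps) - 1)
    return steps[idx][1]

# Toy held-out data (hypothetical): raw click scores and observed click labels.
steps = fit_isotonic([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], [0, 1, 0, 0, 1, 1])
print(calibrate(steps, 0.35))
```

Each task head would get its own fitted mapping, so that the calibrated probabilities can be combined in the final ranking score.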
Design Breakdown

Functional Reqs

Users receive a personalized home feed of videos.
Related video suggestions appear next to the current video.
New uploads (Cold Start) must be discoverable within minutes.
Explicit feedback (Likes/Dislikes) must influence the feed immediately.

Non-Functional Reqs

Scalability: Support 500k QPS and horizontal scaling of the ranking layer.
Availability: 99.99% (fallback to "Popular Videos" if the ML stack fails).
Latency: End-to-end response < 150ms.
Freshness: New user interactions should influence recommendations within seconds (Near Real-Time).

ML Problem Framing

ML Objective: Maximize $\mathbb{E}[\text{WatchTime}] + \alpha \, \mathbb{E}[\text{Engagement}]$, where engagement is a weighted sum of likes, shares, and follows.
ML Category: Multi-Task Learning (MTL) using classification (for clicks/likes) and regression (for watch time).
Input/Output/Label:
Input: User Profile, Historical Interactions, Video Metadata, Context (Device, Time).
Output: A ranked list of Video IDs.
Label: Watch duration (continuous), Click/Like/Share (binary).
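As a rough illustration of the objective above, the per-video ranking score can be computed from the task predictions as follows. The value of `alpha`, the engagement weights, and the `Predictions` container are hypothetical placeholders; in a real system they would be tuned through experiments.

```python
from dataclasses import dataclass

@dataclass
class Predictions:
    watch_seconds: float  # regression head: expected watch time
    p_like: float         # classification heads: predicted probabilities
    p_share: float
    p_follow: float

def fused_score(pred, alpha=30.0, weights=(1.0, 2.0, 3.0)):
    """E[WatchTime] + alpha * E[Engagement]; alpha and weights are assumed values."""
    w_like, w_share, w_follow = weights
    engagement = (
        w_like * pred.p_like + w_share * pred.p_share + w_follow * pred.p_follow
    )
    return pred.watch_seconds + alpha * engagement

candidates = {
    "video_a": Predictions(watch_seconds=45.0, p_like=0.10, p_share=0.01, p_follow=0.00),
    "video_b": Predictions(watch_seconds=40.0, p_like=0.30, p_share=0.05, p_follow=0.02),
}
ranked = sorted(candidates, key=lambda v: fused_score(candidates[v]), reverse=True)
print(ranked)
```

With these assumed weights, a slightly shorter expected watch can be outranked by stronger engagement signals, which is the trade-off that `alpha` controls.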

Data Prep & Features

Data Pipeline: Raw logs from Kafka are ingested into a Data Lake (S3).
Feature Engineering:
User Features: Historical watch categories (histograms), embedding-based interest vectors, demographic data.
Item Features: Video embeddings (VGGish for audio, CLIP for thumbnails), tags, uploader authority, freshness (time since upload).
Context Features: Time of day, day of week, device type, network latency.
Cross-Features: User-Category affinity scores, User-Uploader interaction history.
Feature Store: Use a dual-database approach (Redis for online low-latency lookups, Hive/Iceberg for offline training).

Model Architecture

Retrieval (Candidate Generation):
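Two-Tower Neural Network: A User-Tower and an Item-Tower producing 128d embeddings. Offline, we pre-compute item embeddings and load them into an ANN index (e.g., FAISS with an HNSW index). At request time, the user embedding is computed and used as the query vector.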
Ranking (Scoring):
MMoE (Multi-gate Mixture-of-Experts): A set of expert networks is shared across tasks, while each task (Watch Time, Clicks, Completion Rate) has its own gating network and "tower" that mix the expert outputs. This lets correlated tasks share representations while limiting interference between conflicting ones.
Loss Functions:
Log-loss for classification tasks.
Huber Loss for watch time regression (robust to outliers).
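The MMoE ranking layer described above can be sketched as a toy forward pass. This is a minimal pure-Python illustration under simplifying assumptions: single-layer ReLU experts, linear gates, and linear towers with hand-set weights. A production model would be built in a deep learning framework with learned parameters.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def linear(weight_rows, x):
    return [dot(row, x) for row in weight_rows]

def mmoe_forward(x, expert_weights, gate_weights, tower_weights):
    """expert_weights: one weight matrix per shared expert.
    gate_weights / tower_weights: dicts keyed by task name."""
    # Shared experts: every task sees the same expert outputs.
    expert_out = [[max(0.0, v) for v in linear(W, x)] for W in expert_weights]
    hidden = len(expert_out[0])
    outputs = {}
    for task, G in gate_weights.items():
        gate = softmax(linear(G, x))  # one mixing weight per expert, per task
        mixed = [
            sum(g * e[j] for g, e in zip(gate, expert_out)) for j in range(hidden)
        ]
        outputs[task] = dot(tower_weights[task], mixed)  # task-specific tower
    return outputs

x = [1.0, 2.0]
experts = [[[1.0, 0.0], [0.0, 1.0]], [[-1.0, 0.0], [0.0, 1.0]]]
gates = {
    "watch_time": [[0.0, 0.0], [0.0, 0.0]],  # uniform mix of both experts
    "click": [[100.0, 0.0], [0.0, 0.0]],     # almost entirely expert 0
}
towers = {"watch_time": [1.0, 1.0], "click": [1.0, 0.0]}
print(mmoe_forward(x, experts, gates, towers))
```

The two tasks read the same experts but weight them differently, which is the mechanism that separates MMoE from a plain shared-bottom network.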

Training & Serving

Optimization: Adam optimizer with warm-up and exponential decay.
Training Strategy: Incremental training on a sliding window of the last 7 days of data.
Serving:
Online Inference: Ranking model deployed on Triton Inference Server.
Caching: Cache user retrieval results for 5 minutes; rank them on the fly to incorporate real-time context.
Addressing Bias: Use "Exploration" via Upper Confidence Bound (UCB) or Thompson Sampling to give new videos a chance to gain engagement data.
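The exploration idea above can be sketched with a UCB1-style score. The reward definition (for example, a completed watch counted as 1), the exploration constant `c`, and the function names are assumptions made for illustration.

```python
import math

def ucb_score(total_reward, impressions, total_impressions, c=2.0):
    """Mean reward plus an exploration bonus that shrinks as impressions grow."""
    if impressions == 0:
        return float("inf")  # never-shown videos are tried first
    mean = total_reward / impressions
    bonus = math.sqrt(c * math.log(max(total_impressions, 1)) / impressions)
    return mean + bonus

def pick_video(stats):
    """stats: {video_id: (total_reward, impressions)}."""
    total = sum(impressions for _, impressions in stats.values())
    return max(stats, key=lambda v: ucb_score(stats[v][0], stats[v][1], total))

stats = {
    "established": (900.0, 1000),  # high mean, well explored
    "new_upload": (2.0, 10),       # low mean so far, but very uncertain
}
print(pick_video(stats))
```

Here the new upload wins the slot despite its lower observed mean, because its uncertainty bonus is large. In practice, exploration would usually be limited to a small share of feed slots.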

System Architecture

Pipeline Deep Dive

Data Pipeline

Ingestion: Client-side events (clicks, impressions, watch heartbeats) are streamed via Kafka. Heartbeats are sent every 10 seconds to track precise watch time.
Data Storage: S3 acts as the source of truth. Data is partitioned by event_type and date.
Processing: Spark jobs join click logs with impression logs to create "Positive/Negative" training labels, handling the "Attribution Window" (e.g., a click 2 minutes after an impression).
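The label-generation join above can be illustrated in plain Python (not Spark). The tuple layouts and the 120-second window are assumptions based on the 2-minute example. A real job would express the same logic as a distributed join.

```python
def label_impressions(impressions, clicks, window_s=120):
    """impressions: (impression_id, user_id, video_id, ts) tuples.
    clicks: (user_id, video_id, ts) tuples. Timestamps are in seconds."""
    click_times = {}
    for user, video, ts in clicks:
        click_times.setdefault((user, video), []).append(ts)

    labeled = []
    for imp_id, user, video, ts in impressions:
        # Positive only if a click on the same (user, video) falls inside the window.
        hit = any(
            0 <= click_ts - ts <= window_s
            for click_ts in click_times.get((user, video), [])
        )
        labeled.append((imp_id, 1 if hit else 0))
    return labeled

impressions = [
    ("i1", "u1", "v1", 0),
    ("i2", "u1", "v2", 10),
    ("i3", "u2", "v1", 20),
]
clicks = [("u1", "v1", 90), ("u1", "v2", 500)]
print(label_impressions(impressions, clicks))
```

The click on `v2` arrives outside the attribution window, so that impression stays a negative.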

Feature Pipeline

Feature Extraction: Flink calculates "Sliding Window" aggregates (e.g., number of videos a user watched in the last 30 minutes).
Feature Store: Features are tagged with a TTL. Consistency is maintained by using the same transformation logic (via shared feature definitions) for both Flink (online) and Spark (offline).
Versioning: Each feature set has a version ID. The Ranking model metadata includes required feature versions to prevent serving-time crashes.
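The sliding-window aggregate mentioned above can be sketched in plain Python to show the logic. This is not the Flink API; the class name and the assumption that events arrive in timestamp order are illustrative.

```python
from collections import deque

class SlidingWindowCounter:
    """Counts events in the last `window_s` seconds for one user."""

    def __init__(self, window_s=1800):  # 30 minutes, as in the example above
        self.window_s = window_s
        self.events = deque()

    def add(self, ts):
        # Assumes timestamps arrive in non-decreasing order.
        self.events.append(ts)

    def count(self, now):
        # Evict events that fell out of the window (now - window_s, now].
        while self.events and self.events[0] <= now - self.window_s:
            self.events.popleft()
        return len(self.events)

counter = SlidingWindowCounter()
for watch_ts in (0, 600, 1700):
    counter.add(watch_ts)
print(counter.count(now=1800), counter.count(now=2400))
```

A streaming engine additionally handles out-of-order events and state checkpointing, which this sketch leaves out.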

Training Pipeline

Offline Training: We use a "Time-based Split" for validation (train on days 1-20, validate on day 21) so that no future interactions leak into training.
Workflow Orchestration: Airflow schedules the daily retraining. If the new model's AUC/NDCG drops by >1% vs. the production model, the deployment is automatically blocked.
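The time-based split and the deployment gate above can be sketched as two small functions. The row layout, the function names, and the reading of "drops by >1%" as a relative drop are assumptions. This shows the check an orchestrated task might run, not Airflow code.

```python
def time_split(rows, last_train_day, eval_day):
    """Train on days up to last_train_day, evaluate on a strictly later day."""
    train = [r for r in rows if r["day"] <= last_train_day]
    evaluation = [r for r in rows if r["day"] == eval_day]
    return train, evaluation

def should_block_deploy(candidate, production, max_relative_drop=0.01):
    """Block if any tracked metric drops by more than 1% relative to production."""
    for metric, prod_value in production.items():
        relative_drop = (prod_value - candidate[metric]) / prod_value
        if relative_drop > max_relative_drop:
            return True
    return False

rows = [{"day": d, "label": d % 2} for d in range(1, 22)]
train, evaluation = time_split(rows, last_train_day=20, eval_day=21)
print(len(train), len(evaluation))

production = {"auc": 0.80, "ndcg": 0.50}
print(should_block_deploy({"auc": 0.795, "ndcg": 0.50}, production))
print(should_block_deploy({"auc": 0.780, "ndcg": 0.50}, production))
```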

Serving Pipeline

Retrieval: Uses a multi-channel approach.
Neural: Two-tower ANN.
Collaborative: User-User / Item-Item CF.
Rule-based: Trending in user's region.
Ranking: A high-precision MMoE model scores the top 500-1000 candidates from Retrieval.
Re-ranking: Implements Maximal Marginal Relevance (MMR) to ensure the list isn't dominated by a single uploader or category.
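The MMR re-ranking step above can be sketched as a greedy selection. The trade-off weight `lam`, the uploader-based similarity function, and the sample scores are hypothetical. A real system would more likely use embedding similarity.

```python
def mmr_rerank(candidates, relevance, similarity, k, lam=0.7):
    """Greedy Maximal Marginal Relevance.
    Each step picks argmax of lam * relevance - (1 - lam) * max similarity to picks."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr(c):
            max_sim = max((similarity(c, s) for s in selected), default=0.0)
            return lam * relevance[c] - (1 - lam) * max_sim
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected

uploader = {"a1": "A", "a2": "A", "b1": "B"}
relevance = {"a1": 0.90, "a2": 0.85, "b1": 0.60}

def same_uploader(x, y):
    # Toy similarity: 1.0 for the same uploader, 0.0 otherwise.
    return 1.0 if uploader[x] == uploader[y] else 0.0

print(mmr_rerank(["a1", "a2", "b1"], relevance, same_uploader, k=3))
```

Pure relevance ordering would place both videos from uploader A first. With the similarity penalty, the video from uploader B moves up to the second slot.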

Evaluation Pipeline

Online Experimentation: Users are hashed into buckets for A/B tests. We also run interleaving experiments, which mix results from two models in a single feed, for faster model comparison.
Feedback Loop: Online performance (Watch Time per User) is fed back into the training pipeline to adjust the weights of the MTL objectives.
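The user bucketing used for A/B tests above can be sketched with a salted hash. The experiment name, the bucket count, and the 50/50 split are illustrative assumptions.

```python
import hashlib

def assign_bucket(user_id, experiment, n_buckets=100):
    """Deterministic bucket in [0, n_buckets). Salting with the experiment name
    keeps assignments independent across experiments."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def variant(user_id, experiment, treatment_pct=50):
    bucket = assign_bucket(user_id, experiment)
    return "treatment" if bucket < treatment_pct else "control"

# The same user always lands in the same variant for a given experiment.
print(variant("user_42", "mmoe_v2_ranker"))
treated = sum(variant(f"user_{i}", "mmoe_v2_ranker") == "treatment" for i in range(10000))
print(treated)
```

Because the assignment is a pure function of the user ID and the experiment name, no assignment table has to be stored or synchronized across serving nodes.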

Monitoring Pipeline

System Metrics: Monitoring CPU/GPU utilization and P99 latency.
ML Metrics: We monitor "Feature Drift" (e.g., if the distribution of 'User Age' in serving deviates from training) and "Label Leakage" (checking if the model is predicting perfectly due to a bug).
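One common way to quantify the feature drift described above is the Population Stability Index (PSI). PSI is not named in the original design, so it is used here as an assumed example metric. The bin edges, sample ages, and the 0.2 alert threshold (a widely used rule of thumb) are also assumptions.

```python
import math

def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between a training sample and a serving sample."""
    def fractions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            idx = sum(1 for e in edges if v >= e)  # bin index by sorted edges
            counts[idx] += 1
        # eps avoids log(0) for empty bins.
        return [max(c / len(values), eps) for c in counts]

    exp_frac = fractions(expected)
    act_frac = fractions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(exp_frac, act_frac))

training_ages = [20, 25, 30, 35, 40, 45, 50, 55]
serving_ages = [50, 55, 60, 65, 52, 58, 61, 70]
age_edges = [30, 50]  # bins: <30, 30-49, >=50

drift = psi(training_ages, serving_ages, age_edges)
print(drift > 0.2)  # would trigger an alert under the assumed threshold
```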
Wrap Up

Advanced Topics

Offline Summary: We optimize for NDCG@10 (Ranking quality) and Log-Loss (Prediction accuracy).
Online Summary: The ultimate success is determined by an increase in L28 Retention and Average Daily Session Duration.
Scalability Audit: The system scales by sharding the ANN index (Retrieval) and horizontally scaling the Triton inference cluster (Ranking). For 10x traffic, we would prune the ANN search space more aggressively (trading some recall for latency) and use model quantization (FP16/INT8).
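A back-of-the-envelope calculation shows why sharding matters for the retrieval index. The corpus size and embedding dimension come from the design above. The 64 GB per-shard memory figure is a hypothetical assumption, and applying FP16/INT8 to the stored embeddings (rather than to the ranking model) is an illustrative extension. The numbers cover raw vectors only and exclude graph or index overhead.

```python
import math

def embedding_bytes(n_items, dim, bytes_per_value):
    """Raw vector storage only; ANN graph links and metadata are extra."""
    return n_items * dim * bytes_per_value

def shards_needed(total_bytes, shard_memory_bytes):
    return math.ceil(total_bytes / shard_memory_bytes)

N_ITEMS = 500_000_000    # corpus size from the stated constraints
DIM = 128                # embedding dimension from the two-tower model
GB = 10**9
SHARD_MEMORY = 64 * GB   # hypothetical memory budget per shard

for name, width in (("FP32", 4), ("FP16", 2), ("INT8", 1)):
    total = embedding_bytes(N_ITEMS, DIM, width)
    print(name, total // GB, "GB,", shards_needed(total, SHARD_MEMORY), "shard(s)")
```

At FP32 the raw vectors alone take 256 GB, so the index cannot sit on a single node of the assumed size. Halving the precision halves both the memory and the shard count.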