The Question

Scalable Video Recommendation System Design

Design an end-to-end recommendation system for a global video streaming platform with 100M+ DAU. The system must maximize user engagement (watch time) while maintaining low latency (<200ms). Detail the multi-stage architecture (retrieval and ranking), explain how you handle high-cardinality categorical features, and describe the data pipelines required to minimize training-serving skew. Address how the system balances different product objectives (e.g., clicks vs. completion rate) and how you ensure the model remains fresh in the face of rapidly changing trends.

Two-Tower Model

MMoE

ANN

FAISS

HNSW

Kafka

Flink

Spark

Feature Store

Triton Inference Server

Questions & Insights

Clarifying Questions

Business Goal: What is the primary North Star metric? (e.g., Total Watch Time, 7-day Retention, or CTR?)

Assumption: The goal is to maximize Total Watch Time while maintaining a high Click-Through Rate (CTR).

Constraints & Scale: What is the scale of the corpus and traffic?

Assumption: 100M Daily Active Users (DAU), 10M+ video corpus, and 50k+ Queries Per Second (QPS) at peak.

Latency Budget: What is the end-to-end P99 latency requirement for the homepage?

Assumption: P99 latency should be < 200ms.

Data Freshness: How quickly should a user's interaction (e.g., a click) influence the next recommendation?

Assumption: Near real-time (within seconds) for the homepage "Up Next" features.

Edge Cases: Do we need to handle the "Cold Start" problem for new videos/users?

Assumption: Yes, we will use content-based features and exploration strategies for new items.

Thinking Process

Identify the Bottleneck: With 10M videos, I cannot rank every item for every user in real-time. I must use a multi-stage architecture: Retrieval (Candidate Generation) followed by Ranking (Scoring).

Retrieval Strategy: I need a high-recall, low-latency method to narrow 10M videos to ~500. A Two-Tower Neural Network that generates embeddings for Approximate Nearest Neighbor (ANN) search is a widely used approach for this.

Ranking Strategy: Once I have 500 candidates, I can afford a heavier model. I'll use a Deep Neural Network (DNN) that handles cross-features and multi-objective optimization (click vs. watch time).

Scaling & Freshness: I need a Feature Store to serve real-time user state (e.g., "last 5 videos watched") to the ranker to ensure the system reacts to immediate behavior.

Elite Bonus Points

Position Bias Modeling: In training data, users are more likely to click the first item. I will add "Position" as a feature during training but set it to a default value (or remove it) during inference to decouple relevance from placement.

Multi-Objective Optimization (MMoE): Users might click but then skip (clickbait). I'll use a Multi-gate Mixture-of-Experts (MMoE) to predict both P(click) and E(watch_time) simultaneously, combining them into a final utility score.

Exploration vs. Exploitation: To avoid "filter bubbles," I'll implement a simple Upper Confidence Bound (UCB) or "epsilon-greedy" layer to inject 5% fresh/exploratory content into the results.

Calibration: Raw model outputs are often miscalibrated, and watch-time predictions in particular have high variance. I'll apply isotonic regression to the watch-time head (and temperature scaling to the click probability) so that predictions align with observed historical averages.
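The isotonic-regression option can be sketched with the pool-adjacent-violators algorithm. This is a minimal, standard-library-only illustration, not a production calibrator; the function names `fit_isotonic` and `calibrate` and the sample numbers are hypothetical.

```python
import bisect

def fit_isotonic(predicted, observed):
    """Pool-adjacent-violators: returns a non-decreasing step function
    as a list of (upper_prediction_bound, calibrated_value) blocks."""
    blocks = []  # each block: [sum_of_observed, count, max_predicted]
    for x, y in sorted(zip(predicted, observed)):
        blocks.append([y, 1, x])
        # Merge backwards while the previous block's mean exceeds the new one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]

def calibrate(model, prediction):
    """Map a raw prediction to the mean observed value of its block."""
    uppers = [upper for upper, _ in model]
    idx = min(bisect.bisect_left(uppers, prediction), len(model) - 1)
    return model[idx][1]

# Hypothetical raw predictions vs. observed watch seconds.
raw = [60, 120, 180, 240]
seen = [50, 150, 100, 260]
model = fit_isotonic(raw, seen)
```

The two middle points violate monotonicity (150 followed by 100), so they are pooled into one block whose calibrated value is their mean.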

Design Breakdown

Requirements

Product Goal: Surface videos users are most likely to watch to completion, increasing platform engagement.

Success Metrics:

Online: Average Watch Time per session, CTR, 30-day retention.

Offline: Recall@K (for retrieval), AUC/LogLoss (for CTR), MAE (for watch time).

Guardrail: P99 Latency, Training/Serving Skew (KL-Divergence).

System Constraints: 100M DAU, 50k QPS, <200ms latency.

Data Availability: Implicit feedback (clicks, watch duration, skips), user metadata, video metadata (tags, transcript embeddings).
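The training/serving skew guardrail listed under Success Metrics can be approximated by comparing bucketed feature distributions. The sketch below assumes a single numeric feature, hand-picked bucket edges, and additive smoothing (`eps`); all names and sample values are illustrative.

```python
import math

def bucket_counts(values, edges):
    """Histogram with len(edges) + 1 buckets; edges must be ascending."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[sum(1 for e in edges if v >= e)] += 1
    return counts

def kl_divergence(p_counts, q_counts, eps=1e-6):
    """KL(P || Q) over bucket counts, smoothed so empty buckets stay finite."""
    p_total = sum(p_counts) + eps * len(p_counts)
    q_total = sum(q_counts) + eps * len(q_counts)
    kl = 0.0
    for p_c, q_c in zip(p_counts, q_counts):
        p = (p_c + eps) / p_total
        q = (q_c + eps) / q_total
        kl += p * math.log(p / q)
    return kl

edges = [0.25, 0.5, 0.75]
training = bucket_counts([0.1, 0.2, 0.4, 0.6, 0.9], edges)
serving_ok = bucket_counts([0.1, 0.2, 0.4, 0.6, 0.9], edges)
# Hypothetical failure: the feature silently defaults to 0 at serving time.
serving_bad = bucket_counts([0.0, 0.0, 0.0, 0.0, 0.0], edges)

skew_ok = kl_divergence(training, serving_ok)
skew_bad = kl_divergence(training, serving_bad)
```

An alert threshold on this value would need to be tuned per feature; none is implied here.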

ML Problem Framing

ML Task Type: Two-stage Recommendation (Retrieval + Ranking).

Prediction Target:

Retrieval:

P(\text{video}_i \text{ is watched} | \text{User})

Ranking: Multi-task

\text{Score} = w_1 \cdot P(\text{click}) + w_2 \cdot f(\text{expected\_watch\_time})

Inputs:

User: Historical IDs, search history, demographics, device.

Item: Video ID, Creator ID, Embeddings (VGG/ResNet for visual, BERT for title), Age of video.

Context: Time of day, day of week, location.

ML Challenges: Highly sparse ID features, extreme class imbalance (most videos aren't clicked), and selection bias.
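As a small worked example of the ranking score defined under Prediction Target, the sketch below assumes f is `log1p` of expected watch seconds and picks arbitrary weights. Both choices are assumptions for illustration; the source does not specify f, w_1, or w_2.

```python
import math

def utility(p_click, expected_watch_seconds, w_click=1.0, w_watch=0.5):
    """Score = w_1 * P(click) + w_2 * f(expected_watch_time), with f = log1p."""
    return w_click * p_click + w_watch * math.log1p(expected_watch_seconds)

# A clickbait-style item: likely click, almost no watch time.
clickbait = utility(p_click=0.30, expected_watch_seconds=5)
# A less clickable item that viewers actually watch.
solid = utility(p_click=0.10, expected_watch_seconds=300)
```

With these assumed weights the second item scores higher, which is the behavior the multi-task framing is meant to allow.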

Design Summary & MVP

Concise Summary: A two-stage pipeline using Two-Tower embeddings for ANN-based retrieval and a Multi-Task DNN (MMoE) for precision ranking.

Model Architecture & Selection:

Baseline: Popularity-based or Collaborative Filtering (Matrix Factorization).

Target: Two-Tower (Retrieval) + MMoE DNN (Ranking).

Choice Rationale: Two-tower allows decoupled embedding computation for low latency; MMoE balances competing objectives (click vs. duration) better than a single-task model.

Simplicity Audit: We avoid Reinforcement Learning (RL) or Graph Neural Networks (GNNs) initially to focus on a robust, observable supervised learning pipeline.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Application event logs (click, impression, heartbeats for watch time), Video metadata DB, User profile DB.

Data Ingestion: Kafka for real-time streaming of interaction events. Airflow for daily batch ingestion of metadata.

Data Storage: S3 for raw parquet files (Data Lake), partitioned by date/hour for efficient query scanning.

Data Processing: Spark for joining impressions with clicks/watch-time labels. We use a "Wait-for-Label" window (e.g., 24 hours) to ensure watch-time labels are complete before training.
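The label join with a wait-for-label window can be illustrated in plain Python (a production job would express the same logic in Spark). The record fields (`impression_id`, `ts`, `type`, `seconds`) are hypothetical and not taken from any real schema.

```python
from datetime import datetime, timedelta

def build_training_rows(impressions, events, now, wait=timedelta(hours=24)):
    """Join impressions with click and heartbeat events, skipping
    impressions whose label window is still open."""
    clicked = set()
    watch_seconds = {}
    for e in events:
        key = e["impression_id"]
        if e["type"] == "click":
            clicked.add(key)
        elif e["type"] == "heartbeat":
            watch_seconds[key] = watch_seconds.get(key, 0) + e["seconds"]
    rows = []
    for imp in impressions:
        if now - imp["ts"] < wait:
            continue  # too recent: watch-time label may still be incomplete
        key = imp["impression_id"]
        rows.append({
            "impression_id": key,
            "clicked": int(key in clicked),
            "watch_seconds": watch_seconds.get(key, 0),
        })
    return rows

now = datetime(2024, 1, 3, 12, 0)
impressions = [
    {"impression_id": "a", "ts": datetime(2024, 1, 1, 9, 0)},
    {"impression_id": "b", "ts": datetime(2024, 1, 1, 9, 5)},
    {"impression_id": "c", "ts": datetime(2024, 1, 3, 11, 0)},  # window open
]
events = [
    {"impression_id": "a", "type": "click"},
    {"impression_id": "a", "type": "heartbeat", "seconds": 30},
    {"impression_id": "a", "type": "heartbeat", "seconds": 30},
]
rows = build_training_rows(impressions, events, now)
```

Impression "b" becomes a negative example, while "c" is held back until its window closes.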

Feature Pipeline

Feature Definition:

User: user_id (embedding), watched_video_ids (sequence), preferred_genres.

Item: video_id, category, video_length, upload_time.

Context: device_type, time_of_day_bucket.

Feature Engineering:

Continuous: Log-transform view_count to handle power-law distribution.

Categorical: Hashing trick for high-cardinality IDs.

Online/Offline Consistency: Use a Feature Store (e.g., Tecton or Feast). This ensures that the code used to compute "average watch time in last 1 hour" is identical in the Flink stream (serving) and the Spark job (training).
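The two Feature Engineering transforms can be sketched as follows. The bucket count and function names are assumptions. A stable digest is used instead of Python's built-in `hash()`, which is salted per process for strings and would therefore map the same ID to different buckets in training and serving.

```python
import hashlib
import math

def hash_bucket(feature_name, value, num_buckets=1_000_000):
    """Hashing trick: map a high-cardinality ID to a fixed-size index space."""
    digest = hashlib.sha256(f"{feature_name}={value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

def log_view_count(view_count):
    """log(1 + x) compresses the heavy tail and keeps 0 mapped to 0."""
    return math.log1p(view_count)

video_bucket = hash_bucket("video_id", "vid_123456789")
creator_bucket = hash_bucket("creator_id", "vid_123456789")
```

Prefixing the feature name keeps identical raw values from different features in separate hash streams. Collisions still occur and trade some accuracy for a bounded embedding table.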

Model Architecture

Retrieval (Two-Tower):

User Tower: Deep network transforming user features into vector U.

Item Tower: Deep network transforming video features into vector V.

Loss: In-batch softmax with sampled negatives, which maximizes the dot product U · V for observed (user, video) pairs relative to the negatives.
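A minimal sketch of the in-batch softmax loss, assuming row i of the user batch and row i of the item batch form a positive pair and every other item in the batch acts as a negative. The `temperature` parameter and the toy vectors are illustrative additions; a real implementation would use a tensor library.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def in_batch_softmax_loss(user_vecs, item_vecs, temperature=1.0):
    """Mean cross-entropy where the correct item for user i is item i."""
    n = len(user_vecs)
    total = 0.0
    for i in range(n):
        logits = [dot(user_vecs[i], item_vecs[j]) / temperature for j in range(n)]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / n

users = [[1.0, 0.0], [0.0, 1.0]]
aligned_items = [[1.0, 0.0], [0.0, 1.0]]      # positives match their users
misaligned_items = [[0.0, 1.0], [1.0, 0.0]]   # positives point the wrong way

loss_aligned = in_batch_softmax_loss(users, aligned_items)
loss_misaligned = in_batch_softmax_loss(users, misaligned_items)
```

Minimizing this loss pushes each positive pair's dot product above the dot products with the other items in the batch.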

Ranking (MMoE):

Bottom: Shared embedding layer for all features.

Middle: Multiple "Expert" networks.

Top: "Towers" for each task (Task 1: Logistic for Click; Task 2: Regression for Watch Time).

Why?: Separate click and watch-time predictions help counter clickbait: the final score can penalize items with a high click probability but low predicted watch time.
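The forward pass of the ranking structure above can be sketched in plain Python. This assumes the input `x` is already the output of the shared embedding layer, that each expert is a single ReLU layer, and that each task tower is a single linear head; the dimensions and random weights are illustrative only.

```python
import math
import random

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(v):
    peak = max(v)
    exps = [math.exp(a - peak) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def mmoe_forward(x, experts, gates, heads):
    """Returns (p_click, watch_time_prediction) for one example."""
    # Every expert sees the same shared input.
    expert_out = [[max(0.0, a) for a in matvec(W, x)] for W in experts]
    task_outputs = []
    for gate, head in zip(gates, heads):
        mix = softmax(matvec(gate, x))  # per-task weight for each expert
        blended = [sum(m * out[k] for m, out in zip(mix, expert_out))
                   for k in range(len(head))]
        task_outputs.append(sum(h * b for h, b in zip(head, blended)))
    p_click = 1.0 / (1.0 + math.exp(-task_outputs[0]))  # logistic click head
    watch_pred = task_outputs[1]                         # raw regression head
    return p_click, watch_pred

rng = random.Random(0)
IN_DIM, HIDDEN, N_EXPERTS = 4, 3, 2
experts = [random_matrix(HIDDEN, IN_DIM, rng) for _ in range(N_EXPERTS)]
gates = [random_matrix(N_EXPERTS, IN_DIM, rng) for _ in range(2)]
heads = [[rng.uniform(-0.5, 0.5) for _ in range(HIDDEN)] for _ in range(2)]
p_click, watch_pred = mmoe_forward([0.2, -0.1, 0.7, 1.0], experts, gates, heads)
```

Because each task has its own gate, the click and watch-time heads can weight the shared experts differently, which is the property that distinguishes MMoE from a plain shared-bottom network.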

Training Pipeline

Dataset Construction: Negative sampling is critical. For retrieval, treat the whole corpus as the pool of potential negatives and sample from it (in-batch negatives are a common choice). For ranking, use "impressed but not clicked" as negatives.

Data Splitting: Time-based split. Train on days 1-28, validate on day 29, test on day 30. Random splits lead to temporal leakage.

Retraining: Daily batch retraining for the ranker. The Retrieval Tower can be updated less frequently, but item embeddings should be re-generated hourly.
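The time-based split described under Data Splitting can be sketched as below, assuming each row carries a `day` field (a hypothetical name) and using an arbitrary 30-day window.

```python
from datetime import date, timedelta

def time_split(rows, train_end, valid_end):
    """Train on days <= train_end, validate on (train_end, valid_end], test after."""
    train = [r for r in rows if r["day"] <= train_end]
    valid = [r for r in rows if train_end < r["day"] <= valid_end]
    test = [r for r in rows if r["day"] > valid_end]
    return train, valid, test

start = date(2024, 1, 1)
rows = [{"day": start + timedelta(days=i), "label": i % 2} for i in range(30)]
train, valid, test = time_split(
    rows,
    train_end=start + timedelta(days=27),  # day 28
    valid_end=start + timedelta(days=28),  # day 29
)
```

Every validation and test row is strictly later than every training row, so the model is never evaluated on events that precede its training data.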

Serving Pipeline

Retrieval Pattern: Use an ANN index such as HNSW (available in libraries like FAISS). The item tower outputs are pre-computed and indexed, either in-process or in a vector DB (e.g., Pinecone or Milvus).

Ranking Pattern: High-throughput inference using NVIDIA Triton or TFServing.

Latency Optimization:

Feature Prefetching: Fetch user features in parallel while retrieval is running.

Model Quantization: Use FP16 or INT8 for ranking weights to reduce memory bandwidth bottlenecks.
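Feature prefetching amounts to issuing the retrieval call and the feature-store lookup concurrently. The sketch below uses `asyncio.gather` with stub coroutines; `retrieve_candidates`, `fetch_user_features`, and the sleep durations are placeholders for real network calls, not actual service APIs.

```python
import asyncio

async def retrieve_candidates(user_id):
    await asyncio.sleep(0.03)  # stands in for the ANN search round trip
    return [f"video_{i}" for i in range(5)]

async def fetch_user_features(user_id):
    await asyncio.sleep(0.02)  # stands in for the feature-store lookup
    return {"user_id": user_id, "recent_video_ids": ["video_9", "video_7"]}

async def handle_request(user_id):
    # Both calls start immediately, so the slower one bounds the wait
    # instead of the two latencies adding up.
    candidates, features = await asyncio.gather(
        retrieve_candidates(user_id),
        fetch_user_features(user_id),
    )
    return candidates, features

candidates, features = asyncio.run(handle_request("u1"))
```

Only user and context features can be prefetched this way; per-candidate item features still depend on the retrieval result.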

Evaluation Pipeline

Offline:

Retrieval: Recall@100.

Ranking: Weighted AUC (where positive samples are weighted by watch time).

Online: A/B test comparing the new model against the current production baseline. Measure "Total Platform Minutes" as the primary KPI.
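The two offline metrics can be written out directly. This is a small O(n²) reference sketch. It assumes negatives carry weight 1 and positives are weighted by watch time, as described above; the sample scores and weights are made up.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant items that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def weighted_auc(scores, labels, weights):
    """Weighted probability that a positive outranks a negative (ties count half)."""
    numerator = 0.0
    denominator = 0.0
    for s_pos, l_pos, w_pos in zip(scores, labels, weights):
        if l_pos != 1:
            continue
        for s_neg, l_neg, w_neg in zip(scores, labels, weights):
            if l_neg != 0:
                continue
            pair_weight = w_pos * w_neg
            denominator += pair_weight
            if s_pos > s_neg:
                numerator += pair_weight
            elif s_pos == s_neg:
                numerator += 0.5 * pair_weight
    return numerator / denominator

recall = recall_at_k(["v1", "v2", "v3", "v4"], ["v2", "v4", "v9"], k=3)

scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
plain = weighted_auc(scores, labels, [1, 1, 1, 1])
by_watch_time = weighted_auc(scores, labels, [10, 1, 1, 1])
```

The long-watched positive is ranked correctly, so weighting by watch time raises the metric from 0.75 to 21/22 in this example.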

Monitoring Pipeline

System: Monitor P99 latency of the ANN search. If it spikes, fall back to a cached "Popular Videos" list.

Model: Track Prediction Drift. If the average predicted CTR deviates from the historical average by >10%, trigger an alert for potential data pipeline corruption.

Feature: Monitor Feature Completeness. If user_history is missing in 5% of requests, investigate the Feature Store ingestion.
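The model and feature checks above reduce to two simple ratios. In this sketch the 10% threshold is interpreted as a relative deviation, which is an assumption, and the request records are hypothetical.

```python
def ctr_drift_alert(predicted_ctrs, historical_avg, threshold=0.10):
    """True when the mean predicted CTR moves more than `threshold`
    (relative) away from the historical average."""
    current = sum(predicted_ctrs) / len(predicted_ctrs)
    return abs(current - historical_avg) / historical_avg > threshold

def missing_rate(requests, feature="user_history"):
    """Share of requests where the feature is absent or empty."""
    missing = sum(1 for r in requests if not r.get(feature))
    return missing / len(requests)

stable = ctr_drift_alert([0.05, 0.05, 0.05], historical_avg=0.05)
drifted = ctr_drift_alert([0.02, 0.02, 0.02], historical_avg=0.05)

requests = [{"user_history": ["v1"]}] * 9 + [{"user_history": []}]
rate = missing_rate(requests)
```

Here one request in ten arrives without history, which is above the 5% level the text flags for investigation.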

Wrap Up

Final Evaluation

Cold Start: For new videos, use the Item Tower to project them into the embedding space based on metadata/tags alone.

Exploration/Exploitation: Use deterministic hash-based shuffling to ensure a user sees a diverse set of genres in their top 10.

Trade-offs: We give up some accuracy that a heavier ranker (e.g., a Transformer or cross-network) might provide in exchange for lower latency and better maintainability with an MMoE-DNN architecture.
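The deterministic hash-based shuffling mentioned under Exploration/Exploitation is not specified in detail, so the following is only one possible reading: order genres by a stable hash of (user, salt, genre) and interleave them round-robin while preserving rank order within each genre. The function names and the daily `salt` are hypothetical.

```python
import hashlib

def stable_hash(*parts):
    digest = hashlib.sha256("|".join(parts).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def diversify(ranked, user_id, salt, top_n=10):
    """ranked: list of (video_id, genre) in descending score order."""
    by_genre = {}
    for video_id, genre in ranked:
        by_genre.setdefault(genre, []).append(video_id)
    # Same user and salt always yield the same genre order.
    genres = sorted(by_genre, key=lambda g: stable_hash(user_id, salt, g))
    result = []
    while len(result) < top_n and any(by_genre.values()):
        for genre in genres:
            if by_genre[genre] and len(result) < top_n:
                result.append(by_genre[genre].pop(0))
    return result

ranked = ([(f"a{i}", "action") for i in range(20)]
          + [(f"c{i}", "comedy") for i in range(3)]
          + [(f"d{i}", "docs") for i in range(3)])
top10 = diversify(ranked, user_id="u1", salt="2024-01-01")
```

Changing the salt (for example, once per day) reshuffles the genre order without making results flicker between page loads.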