The Question

Large-Scale Personalized Video Recommendation System

Design a recommendation system for a video platform with 1B+ users and 100M+ videos. The system must provide personalized content in real-time with a P99 latency of <200ms. Detail the two-stage pipeline (retrieval and ranking), explain how you handle real-time user feedback, discuss strategies for negative sampling in training, and describe how the system addresses position bias and data freshness. Ensure the design includes a robust infrastructure for model evaluation and monitoring for training-serving skew.

Two-Tower Model

DeepFM

DCN-v2

ANN

HNSW

FAISS

Spark

Flink

Kafka

TensorRT

ONNX

Feast

Questions & Insights

Clarifying Questions

Business Goal: Is the primary North Star metric Watch Time, Click-Through Rate (CTR), or Long-term Retention? Assumption: Maximize total watch time per session.

Constraints & Scale: What is the scale of users and items? Assumption: 1B+ users, 100M+ videos, 100k QPS, <200ms P99 latency.

Edge Cases: How do we handle cold starts for new videos? Assumption: We use content-based features (metadata/embeddings) for new items until collaborative signals accumulate.

Data Freshness: How quickly must new user actions (e.g., a "like") influence recommendations? Assumption: Near real-time (seconds to minutes).

Assumptions:

Corpus of 100M items.

Two-stage architecture: Retrieval (Candidate Generation) and Ranking.

P99 latency requirement of <200ms.

Thinking Process

Identify the Bottleneck: You cannot rank 100M videos in real-time. The design must focus on a funnel approach: Retrieval (fast, coarse) -> Ranking (slow, precise).

Selection vs. Scale: For retrieval, use an approximate nearest neighbor (ANN) search on embeddings. For ranking, use a deep model that captures cross-feature interactions.

Data Flywheel: Ensure the feedback loop (clicks/watches) is captured and processed into features to avoid "stale" recommendations.

Simplicity (YAGNI): Start with a two-tower model for retrieval and a Deep Neural Network (DNN) for ranking before exploring multi-task learning or transformers.

Elite Bonus Points

Calibration: Optimizing for P(Click) alone tends to reward clickbait. I will weight positive labels by watch time to favor "value" over "clicks," and calibrate the ranker's scores (e.g., Platt Scaling) so that predicted probabilities match observed rates.
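
A minimal sketch of Platt Scaling, assuming the ranker emits a raw score (logit) per impression and that a held-out set of scores and click labels is available. The data, learning rate, and step count below are made up for illustration; a real system would more likely use an existing logistic regression implementation.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def fit_platt(scores, labels, lr=0.1, steps=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Hypothetical held-out raw ranker scores and click labels.
raw_scores = [2.0, 1.5, 1.0, 0.5, -0.5, -1.0, 2.5, 0.0]
clicks = [1, 0, 1, 0, 0, 0, 1, 0]

a, b = fit_platt(raw_scores, clicks)
calibrated = [sigmoid(a * s + b) for s in raw_scores]
```

After fitting, the mean calibrated probability should sit close to the observed click rate on the held-out set, which is the property that makes the scores usable as probabilities downstream.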

Delayed Feedback Loops: Watch time labels are only available minutes after a click. I'll use a "request-id" based join mechanism in the feature store to handle asynchronous label arrival.
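
One way to sketch the request-id join, assuming every served impression is logged with its request ID and feature snapshot, and that watch events arrive later carrying the same IDs. The `Impression` record, the event dictionary shape, and the one-hour attribution window are hypothetical choices, not part of any particular feature store API.

```python
from dataclasses import dataclass

@dataclass
class Impression:
    request_id: str
    video_id: str
    features: dict
    served_at: float  # epoch seconds

def join_labels(impressions, watch_events, now, attribution_window_s=3600.0):
    """Attach watch-time labels to logged impressions.

    Impressions with a matching event get its watch time. Impressions older
    than the attribution window with no event become zero-watch negatives.
    Younger unmatched impressions stay pending until their label can arrive.
    """
    watch_by_key = {
        (e["request_id"], e["video_id"]): e["watch_seconds"] for e in watch_events
    }
    labeled, pending = [], []
    for imp in impressions:
        key = (imp.request_id, imp.video_id)
        if key in watch_by_key:
            labeled.append((imp, watch_by_key[key]))
        elif now - imp.served_at >= attribution_window_s:
            labeled.append((imp, 0.0))
        else:
            pending.append(imp)
    return labeled, pending
```

Joining on the logged feature snapshot rather than recomputing features at label time also helps keep training and serving features identical.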

Position Bias Correction: Users are more likely to click the first item. I will include "Position" as a feature during training but set it to a constant/default during inference to de-bias the model.
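
The position-as-feature trick can be sketched as follows. The helper name `build_row` and the default value 0 are assumptions; the only point is that training rows carry the logged position while every serving row carries the same constant.

```python
DEFAULT_POSITION = 0  # hypothetical constant fed to the model for every candidate at serving time

def build_row(user_feats, item_feats, position=None):
    """Build one model input row.

    Training: pass the position at which the item was actually shown.
    Serving: omit position so every candidate is scored as if shown in the same slot.
    """
    row = {**user_feats, **item_feats}
    row["position"] = DEFAULT_POSITION if position is None else position
    return row

# Training example: the item was shown in slot 3 and the log recorded that.
train_row = build_row({"country": "US"}, {"video_id": "v1"}, position=3)

# Serving: all candidates share the same position value.
serving_rows = [build_row({"country": "US"}, {"video_id": v}) for v in ("v1", "v2")]
```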

Exploration (Epsilon-Greedy/Thompson Sampling): To avoid filter bubbles, dedicate 5% of traffic to explore new content or random niches.
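
A rough epsilon-greedy sketch at the request level, assuming a pre-built pool of exploratory videos (new uploads or random niches). The function name and the pool are hypothetical; with `epsilon=0.05`, about 5% of requests are served from the pool.

```python
import random

def recommend(ranked, explore_pool, k, epsilon=0.05, rng=random):
    """Return (items, explored_flag).

    With probability epsilon, serve k random items from the exploration pool.
    Otherwise, serve the top k items from the ranker.
    """
    if explore_pool and rng.random() < epsilon:
        return rng.sample(explore_pool, min(k, len(explore_pool))), True
    return ranked[:k], False
```

Logging the `explored_flag` with each request makes it possible to treat exploration traffic as less biased training data later.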

Design Breakdown

Requirements

Product Goal: Increase session watch time and user retention.

Success Metrics:

Online: Watch Time, Session Depth, 7-day Retention.

Offline: Recall@K (for retrieval), AUC/NDCG (for ranking).

Guardrail: P99 Latency, Training/Serving Skew (KL Divergence).

System Constraints: 100M item corpus, 100k QPS, distributed training on TBs of data daily.

Data Availability: User watch history, video metadata (tags, title), user demographics, and real-time interaction logs.

ML Problem Framing

ML Task Type: Two-stage Ranking (Retrieval: Similarity/ANN; Ranking: Binary Classification + Regression for watch time).

Prediction Target:

E[\text{watch time} | u, i, c] = P(\text{click} | u, i, c) \times E[\text{watch duration} | \text{click}, u, i, c]
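
A small worked example of this target, with made-up probabilities and durations, shows why ranking by expected watch time differs from ranking by click probability alone.

```python
def expected_watch_time(p_click: float, expected_duration_s: float) -> float:
    """E[watch time] = P(click) * E[watch duration | click]."""
    return p_click * expected_duration_s

# Hypothetical candidates: (P(click), expected watch duration in seconds if clicked).
candidates = {
    "clickbait_short": (0.30, 20.0),    # 0.30 * 20  = about 6 expected seconds
    "long_documentary": (0.10, 300.0),  # 0.10 * 300 = about 30 expected seconds
}

ranked = sorted(
    candidates,
    key=lambda vid: expected_watch_time(*candidates[vid]),
    reverse=True,
)
```

The documentary ranks first despite a click probability three times lower.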

Inputs:

User: Historical video IDs (sequence), search history, country, language.

Item: Video ID embedding, channel ID, duration, upload time, topic category.

Context: Device, time of day, current page type.

ML Challenges: Extreme class imbalance (most videos are not watched), position bias, and highly dynamic item catalog.

Design Summary & MVP

Concise Summary: A two-stage pipeline using a Two-Tower Model for embedding-based retrieval and a Deep Ranking model (DNN with Cross-features) to score the top 500 candidates.

Model Architecture & Selection:

Baseline: Popularity-based or Item-Item Collaborative Filtering (Heuristic).

Target Model: Two-Tower Model (Retrieval) and DeepFM or DCN-v2 (Ranking) for feature crossing.

Choice Rationale: Two-Tower allows pre-computing item embeddings for sub-10ms ANN search. Ranking models like DCN-v2 capture non-linear interactions (e.g., User_Language x Video_Language).

Simplicity Audit: Avoid Reinforcement Learning initially. A supervised learning approach with a strong feature store typically captures most of the value at a fraction of the complexity.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Client-side event tracking (clicks, pauses, skips) and server-side logs.

Ingestion: Kafka acts as the backbone. We use a Lambda architecture: Spark for high-throughput batch processing of historical data, and Flink for sub-second stream processing of the user's last 5 videos so that recommendations update within seconds.

Storage: Parquet format on S3 for training data, partitioned by date and user_id_hash.

Feature Pipeline

Feature Engineering:

Categorical: Hash video IDs to a fixed vocabulary size.

Continuous: Log-transform view counts to handle long-tail distributions.

Embeddings: Pre-train item embeddings using Word2Vec (on watch sequences) or use the Two-Tower weights.
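
The categorical and continuous transforms above can be sketched with the standard library. The bucket count of one million is an arbitrary assumption, and MD5 is used only as a stable hash (Python's built-in `hash()` is salted per process, so it would not be consistent between training and serving).

```python
import hashlib
import math

def hash_bucket(video_id: str, num_buckets: int = 1_000_000) -> int:
    """Map a video ID to a stable bucket index in [0, num_buckets)."""
    digest = hashlib.md5(video_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

def log_views(view_count: int) -> float:
    """Compress long-tailed view counts; log1p keeps zero views at 0.0."""
    return math.log1p(view_count)
```

Hash collisions are expected with a fixed vocabulary; the bucket count trades embedding table size against how many videos share a bucket.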

Feature Store: Use a system like Feast or Tecton.

Online: Low-latency (Redis/DynamoDB) for user context.

Offline: Point-in-time joins to prevent label leakage (ensuring features used for a training sample were known before the event occurred).
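
The point-in-time rule can be illustrated without any feature store: for each training event, take the latest feature value recorded strictly before the event timestamp. This is a plain-Python sketch with hypothetical data, not the Feast or Tecton API.

```python
from bisect import bisect_left

def feature_as_of(history, event_ts):
    """Return the latest feature value recorded strictly before event_ts.

    history is a list of (timestamp, value) pairs sorted by timestamp.
    Returns None when nothing was known before the event.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_left(timestamps, event_ts)  # count of snapshots with ts < event_ts
    return history[idx - 1][1] if idx > 0 else None

# Hypothetical snapshots of a video's view count: (timestamp, value).
view_count_history = [(100, 3), (200, 5), (300, 9)]
```

A click at timestamp 250 is joined with the value 5, never with the later value 9, which is what prevents the label from leaking into its own features.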

Model Architecture

Retrieval (Two-Tower):

Separate DNNs for User and Item.

Dot product output:

score = \text{UserEmbed} \cdot \text{ItemEmbed}

Loss: Sampled Softmax or Triplet Loss.
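
A minimal sketch of the retrieval loss, assuming in-batch negatives as the sampling scheme (one common form of sampled softmax; the popularity correction often applied to in-batch negatives is omitted here). Embeddings are plain lists so the example runs without a deep learning framework.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def in_batch_softmax_loss(user_embs, item_embs):
    """Mean softmax cross-entropy over a batch.

    Row i of user_embs is the positive pair for row i of item_embs.
    Every other item in the batch serves as a negative for that user.
    """
    total = 0.0
    for i, u in enumerate(user_embs):
        logits = [dot(u, v) for v in item_embs]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(user_embs)

# Toy batch: each user embedding matches its own item embedding.
users = [[1.0, 0.0], [0.0, 1.0]]
items = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = in_batch_softmax_loss(users, items)
swapped_loss = in_batch_softmax_loss(users, items[::-1])
```

The aligned batch gives a lower loss than the swapped one, which is the signal that pushes matching user and item embeddings together.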

Ranking (DCN-v2):

Cross Network: Explicitly models feature interactions such as u_i \times v_j.

Deep Network: Captures implicit non-linear relations.

This architecture handles sparse categorical features (IDs) and dense features (durations) effectively.
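
For reference, a single DCN-v2 cross layer computes x_{l+1} = x_0 \odot (W_l x_l + b_l) + x_l, where x_0 is the concatenated input embedding. The sketch below implements that one layer on plain lists; the weights are hand-picked for illustration rather than learned.

```python
def cross_layer(x0, xl, W, b):
    """One DCN-v2 cross layer: x0 * (W @ xl + b) + xl, with * element-wise."""
    wx = [sum(w * x for w, x in zip(row, xl)) + bias for row, bias in zip(W, b)]
    return [x0_i * wx_i + xl_i for x0_i, wx_i, xl_i in zip(x0, wx, xl)]

x0 = [1.0, 2.0]

# Identity weights: each output contains the square of its own input feature.
identity_out = cross_layer(x0, x0, W=[[1.0, 0.0], [0.0, 1.0]], b=[0.0, 0.0])

# Off-diagonal weights: each output contains the product of the two features.
crossed_out = cross_layer(x0, x0, W=[[0.0, 1.0], [1.0, 0.0]], b=[0.0, 0.0])
```

Stacking such layers raises the interaction order by one per layer, while the residual term `+ xl` preserves the lower-order signal.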

Training Pipeline

Negative Sampling: Since users only click a fraction of items, we use "Easy Negatives" (random videos from the corpus) and "Hard Negatives" (videos shown but not clicked).
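
A sketch of mixing the two negative types for one user, under the assumption that impression logs give the shown-but-not-clicked list. Filtering the whole corpus is only workable for a toy example; at 100M items, rejection sampling or in-batch negatives would be the likely substitute.

```python
import random

def sample_negatives(positive_ids, impressed_not_clicked, corpus_ids, n_easy, n_hard, rng):
    """Return hard negatives (shown, not clicked) and easy negatives (random corpus items)."""
    positives = set(positive_ids)
    hard_pool = [v for v in impressed_not_clicked if v not in positives]
    hard = rng.sample(hard_pool, min(n_hard, len(hard_pool)))

    # Easy negatives must not collide with positives or the chosen hard negatives.
    exclude = positives | set(hard)
    easy_pool = [v for v in corpus_ids if v not in exclude]
    easy = rng.sample(easy_pool, min(n_easy, len(easy_pool)))
    return {"hard": hard, "easy": easy}
```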

Labeling: Positive = Watch time > 30s or > 50% of video.

Data Split: Time-based split (Train on days 1-28, Test on day 29) is mandatory to prevent temporal leakage.
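
The labeling rule and the time-based split can be written down directly. The row schema (a `day` field on each example) is an assumption for illustration.

```python
def is_positive(watch_seconds: float, video_seconds: float) -> bool:
    """Positive if watched more than 30s or more than 50% of the video."""
    return watch_seconds > 30 or watch_seconds > 0.5 * video_seconds

def time_split(rows, test_day):
    """Train on every day before test_day, test on test_day only."""
    train = [r for r in rows if r["day"] < test_day]
    test = [r for r in rows if r["day"] == test_day]
    return train, test
```

The percentage clause matters for short videos: a 10-second watch of a 15-second clip counts as positive even though it is under 30 seconds.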

Serving Pipeline

Retrieval Strategy: Export Item Tower embeddings to a Vector DB (e.g., Milvus or Pinecone) with an HNSW index, which gives approximately O(log N) search time in exchange for slightly lower recall than exact search.

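Ranking Optimization: Use TensorRT or ONNX Runtime with reduced precision (FP16) or quantization (INT8) to cut inference latency, for example from roughly 50ms to roughly 10ms; the actual gain depends on the model and hardware.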

Fallback: If the ranker fails, return the retrieval output directly or a cached "Trending" list.
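
The fallback chain might look like the sketch below, where `rank_fn` stands in for the call to the ranking service (a hypothetical callable returning one score per candidate) and `trending` is the cached list.

```python
def recommend_with_fallback(candidates, rank_fn, trending, k):
    """Serve ranked results, degrading to retrieval order and then to trending."""
    if not candidates:
        return trending[:k]
    try:
        scores = rank_fn(candidates)  # one score per candidate, same order
        order = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
        return [vid for vid, _ in order][:k]
    except Exception:
        # Ranker failed or timed out: keep the retrieval order as-is.
        return candidates[:k]
```

In practice the ranker call would also carry a timeout so that a slow ranker triggers the same path as a failed one.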

Evaluation Pipeline

Offline: Use NDCG (Normalized Discounted Cumulative Gain) to evaluate if the "best" videos are at the top.
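
For concreteness, NDCG@k with the standard log2 discount can be computed as follows. The relevance grades are hypothetical; in this system they could be derived from watch time.

```python
import math

def dcg(relevances, k):
    """Discounted cumulative gain over the top k positions."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0
```

A list already sorted by relevance scores 1.0, and any ordering that pushes the best videos down scores lower.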

Online: A/B test with a 1% bucket, measuring "Incremental Watch Time" against the control group.
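
Bucket assignment for the 1% test can be sketched with a deterministic hash, so that a user stays in the same group across requests. The experiment name and the 10,000-bucket granularity are assumptions.

```python
import hashlib

def in_treatment(user_id: str, experiment: str, treatment_pct: float = 1.0) -> bool:
    """Deterministically assign a user to the treatment group.

    Salting the hash with the experiment name keeps assignments independent
    across experiments. treatment_pct is a percentage in [0, 100].
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000
    return bucket < treatment_pct * 100
```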

Monitoring Pipeline

Data Drift: Monitor the distribution of input features (e.g., if the % of users from a specific country drops, the pipeline might be broken).

Prediction Drift: Monitor the average score output by the ranker. A sudden shift usually indicates an upstream data quality issue.
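
One way to quantify drift or training-serving skew for a categorical feature is the KL divergence between the serving and training distributions. The counts below are invented, and the alert threshold would need tuning per feature.

```python
import math

def kl_divergence(p_counts, q_counts, eps=1e-9):
    """KL(P || Q) between two categorical distributions given as count dicts.

    eps smooths categories that are missing on one side.
    """
    keys = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    kl = 0.0
    for key in keys:
        p = p_counts.get(key, 0) / p_total + eps
        q = q_counts.get(key, 0) / q_total + eps
        kl += p * math.log(p / q)
    return kl

# Hypothetical country distributions at training time and at serving time.
training = {"US": 500, "IN": 300, "BR": 200}
serving_ok = {"US": 50, "IN": 30, "BR": 20}
serving_shifted = {"US": 800, "IN": 150, "BR": 50}

no_drift = kl_divergence(serving_ok, training)
drift = kl_divergence(serving_shifted, training)
```

The same function applies to the ranker's score output once the scores are bucketed into a histogram.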

Wrap Up

Final Evaluation

Cold Start: Address via a separate content-based retrieval tower (extracting embeddings from video thumbnails using a CNN).

Trade-offs:

Accuracy vs. Latency: We cap the number of items sent to the ranker at 500 to maintain <200ms latency.

Exploration vs. Exploitation: Use a Multi-Armed Bandit layer at the very end to re-rank a few exploratory items.
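
As one possible bandit for that final layer, here is a Beta-Bernoulli Thompson Sampling sketch. The reward is assumed to be binary (for example, whether the exploratory item met the positive-label rule), and the arm names and true rates exist only to drive the simulation.

```python
import random

class BetaArm:
    """Beta posterior over an arm's success rate, starting from a uniform prior."""

    def __init__(self):
        self.successes = 1.0
        self.failures = 1.0

    def sample(self, rng):
        return rng.betavariate(self.successes, self.failures)

    def update(self, reward: bool):
        if reward:
            self.successes += 1
        else:
            self.failures += 1

def pick_arm(arms, rng):
    """Sample a plausible success rate per arm and play the highest one."""
    return max(arms, key=lambda name: arms[name].sample(rng))

def simulate(true_rates, rounds, seed=0):
    """Run the bandit against hypothetical true success rates; return pulls per arm."""
    rng = random.Random(seed)
    arms = {name: BetaArm() for name in true_rates}
    pulls = {name: 0 for name in true_rates}
    for _ in range(rounds):
        name = pick_arm(arms, rng)
        reward = rng.random() < true_rates[name]
        arms[name].update(reward)
        pulls[name] += 1
    return pulls

pulls = simulate({"niche_a": 0.1, "niche_b": 0.5}, rounds=2000)
```

Traffic shifts toward the better arm as evidence accumulates, while uncertain arms still get occasional plays.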