The Question
ML Design

Large-Scale Personalized Video Recommendation System

Design a recommendation system for a video platform with 1B+ users and 100M+ videos. The system must provide personalized content in real-time with a P99 latency of <200ms. Detail the two-stage pipeline (retrieval and ranking), explain how you handle real-time user feedback, discuss strategies for negative sampling in training, and describe how the system addresses position bias and data freshness. Ensure the design includes a robust infrastructure for model evaluation and monitoring for training-serving skew.
Two-Tower Model
DeepFM
DCN-v2
ANN
HNSW
FAISS
Spark
Flink
Kafka
TensorRT
ONNX
Feast
Questions & Insights

Clarifying Questions

Business Goal: Is the primary North Star metric Watch Time, Click-Through Rate (CTR), or Long-term Retention? Assumption: Maximize total watch time per session.
Constraints & Scale: What is the scale of users and items? Assumption: 1B+ users, 100M+ videos, 100k QPS, <200ms P99 latency.
Edge Cases: How do we handle cold starts for new videos? Assumption: We use content-based features (metadata/embeddings) for new items until collaborative signals accumulate.
Data Freshness: How quickly must new user actions (e.g., a "like") influence recommendations? Assumption: Near real-time (seconds to minutes).
Assumptions:
Corpus of 100M items.
Two-stage architecture: Retrieval (Candidate Generation) and Ranking.
P99 latency requirement of <200ms.

Thinking Process

Identify the Bottleneck: You cannot rank 100M videos in real-time. The design must focus on a funnel approach: Retrieval (fast, coarse) -> Ranking (slow, precise).
Selection vs. Scale: For retrieval, use an approximate nearest neighbor (ANN) search on embeddings. For ranking, use a deep model that captures cross-feature interactions.
Data Flywheel: Ensure the feedback loop (clicks/watches) is captured and processed into features to avoid "stale" recommendations.
Simplicity (YAGNI): Start with a two-tower model for retrieval and a Deep Neural Network (DNN) for ranking before exploring multi-task learning or transformers.

Elite Bonus Points

Calibration: Optimizing purely for P(Click) tends to reward clickbait. I will calibrate the predicted probabilities (e.g., Platt Scaling) or weight positive labels by watch time so the model favors "value" over "clicks."
Delayed Feedback Loops: Watch time labels are only available minutes after a click. I'll use a "request-id" based join mechanism in the feature store to handle asynchronous label arrival.
Position Bias Correction: Users are more likely to click the first item. I will include "Position" as a feature during training but set it to a constant/default during inference to de-bias the model.
Exploration (Epsilon-Greedy/Thompson Sampling): To avoid filter bubbles, dedicate 5% of traffic to explore new content or random niches.
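The position-bias correction above can be illustrated with a toy model. This is a minimal sketch, not the production ranker: the logistic scorer, its weights, and the names `score`, `training_score`, `serving_score`, and `DEFAULT_POSITION` are all assumptions made for the example.

```python
import math

# Hypothetical weights for a toy logistic click model (assumed for illustration).
W_RELEVANCE = 2.0     # weight on an item's relevance feature
W_POSITION = -0.5     # learned penalty per display slot (lower slots get fewer clicks)
BIAS = -1.0
DEFAULT_POSITION = 0  # constant position fed to the model at inference time


def score(relevance: float, position: int) -> float:
    """P(click) from a toy logistic model that takes position as an input feature."""
    logit = BIAS + W_RELEVANCE * relevance + W_POSITION * position
    return 1.0 / (1.0 + math.exp(-logit))


def training_score(relevance: float, logged_position: int) -> float:
    """During training, use the slot where the item was actually shown."""
    return score(relevance, logged_position)


def serving_score(relevance: float) -> float:
    """During serving, hold position constant so ranking depends only on relevance."""
    return score(relevance, DEFAULT_POSITION)


# Equal relevance, different logged slots: the training-time predictions differ...
p_top = training_score(0.8, logged_position=0)
p_low = training_score(0.8, logged_position=5)
# ...but serving-time scores are identical because position is fixed.
s_a = serving_score(0.8)
s_b = serving_score(0.8)
```

The position weight absorbs the "shown higher, clicked more" effect during training, so the relevance weight is less contaminated by where items happened to be displayed.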
Design Breakdown

Requirements

Product Goal: Increase session watch time and user retention.
Success Metrics:
Online: Watch Time, Session Depth, 7-day Retention.
Offline: Recall@K (for retrieval), AUC/NDCG (for ranking).
Guardrail: P99 Latency, Training/Serving Skew (KL Divergence).
System Constraints: 100M item corpus, 100k QPS, distributed training on TBs of data daily.
Data Availability: User watch history, video metadata (tags, title), user demographics, and real-time interaction logs.

ML Problem Framing

ML Task Type: Two-stage Ranking (Retrieval: Similarity/ANN; Ranking: Binary Classification + Regression for watch time).
Prediction Target: E[\text{watch time} | u, i, c] = P(\text{click} | u, i, c) \times E[\text{watch duration} | \text{click}, u, i, c], where u is the user, i the item, and c the request context.
Inputs:
User: Historical video IDs (sequence), search history, country, language.
Item: Video ID embedding, channel ID, duration, upload time, topic category.
Context: Device, time of day, current page type.
ML Challenges: Extreme class imbalance (most videos are not watched), position bias, and highly dynamic item catalog.
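To make the prediction target concrete, here is a small sketch of ranking by expected watch time. The `Candidate` type, the field names, and the numbers are hypothetical; in a real system the two factors would come from the classification and regression heads of the ranker.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    video_id: str
    p_click: float            # P(click | u, i, c), from the classification head
    watch_given_click: float  # E[watch duration | click] in seconds, from the regression head


def expected_watch_time(c: Candidate) -> float:
    """Expected watch time = P(click) * E[watch duration | click]."""
    return c.p_click * c.watch_given_click


def rank_by_expected_watch_time(cands):
    """Sort candidates from highest to lowest expected watch time."""
    return sorted(cands, key=expected_watch_time, reverse=True)


# Made-up scores: the most clickable item is not the most valuable one.
candidates = [
    Candidate("clickbait", p_click=0.30, watch_given_click=10.0),   # about 3 s expected
    Candidate("long_form", p_click=0.10, watch_given_click=120.0),  # about 12 s expected
    Candidate("short_clip", p_click=0.20, watch_given_click=30.0),  # about 6 s expected
]
ranked = [c.video_id for c in rank_by_expected_watch_time(candidates)]
```

Ranking by the product rather than by P(click) alone pushes the high-click, low-retention item to the bottom of this toy list.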

Design Summary & MVP

Concise Summary: A two-stage pipeline using a Two-Tower Model for embedding-based retrieval and a Deep Ranking model (DNN with Cross-features) to score the top 500 candidates.
Model Architecture & Selection:
Baseline: Popularity-based or Item-Item Collaborative Filtering (Heuristic).
Target Model: Two-Tower Model (Retrieval) and DeepFM or DCN-v2 (Ranking) for feature crossing.
Choice Rationale: Two-Tower allows pre-computing item embeddings for sub-10ms ANN search. Ranking models like DCN-v2 capture non-linear interactions (e.g., User_Language x Video_Language).
Simplicity Audit: Avoid Reinforcement Learning initially. A supervised learning approach with a strong feature store is likely to capture most of the value at a fraction of the complexity.
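The reason Two-Tower retrieval is fast is that item embeddings are computed offline and only a dot product is needed at request time. The sketch below shows the exact (brute-force) version of that lookup, which an ANN index approximates at scale. The embeddings and the helper names `dot` and `top_k_by_dot` are made up for illustration.

```python
import heapq


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_by_dot(user_vec, item_vecs, k):
    """Exact top-k retrieval by dot-product score (an ANN index approximates this)."""
    scored = ((dot(user_vec, vec), vid) for vid, vec in item_vecs.items())
    return [vid for _, vid in heapq.nlargest(k, scored)]


# Toy 2-d item embeddings, precomputed offline by the item tower (values are made up).
item_vecs = {"a": [1.0, 0.0], "b": [0.7, 0.7], "c": [0.0, 1.0]}
user_vec = [1.0, 0.2]  # output of the user tower at request time (made up)
top2 = top_k_by_dot(user_vec, item_vecs, k=2)
```

Brute force is O(N) per query, which is why a 100M-item corpus needs an approximate index; the scoring function itself stays the same.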

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Client-side event tracking (clicks, pauses, skips) and server-side logs.
Ingestion: Kafka acts as the backbone. We use a Lambda architecture: Spark for high-throughput batch historical data; Flink for sub-second processing of the user's last 5 videos to update recommendations instantly.
Storage: Parquet format on S3 for training data, partitioned by date and user_id_hash.
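One way to derive the date and user_id_hash partition for an event is sketched below. The Hive-style path layout, the bucket count of 256, and the function names are assumptions; the design only states the two partition keys.

```python
import hashlib
from datetime import datetime, timezone

NUM_USER_BUCKETS = 256  # assumed bucket count; not specified in the design


def user_bucket(user_id: str, num_buckets: int = NUM_USER_BUCKETS) -> int:
    """Stable hash bucket. Built-in hash() is salted per process, so use hashlib."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def partition_path(event_ts: float, user_id: str) -> str:
    """Hypothetical Hive-style partition path for one interaction event (UTC date)."""
    day = datetime.fromtimestamp(event_ts, tz=timezone.utc).strftime("%Y-%m-%d")
    return f"dt={day}/user_bucket={user_bucket(user_id):03d}"


path = partition_path(0.0, "user_42")
```

Hashing the user ID into a fixed number of buckets keeps partition counts bounded while still letting a training job read all events for one user from a single bucket per day.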

Feature Pipeline

Feature Engineering:
Categorical: Hash video IDs to a fixed vocabulary size.
Continuous: Log-transform view counts to handle long-tail distributions.
Embeddings: Pre-train item embeddings using Word2Vec (on watch sequences) or use the Two-Tower weights.
Feature Store: Use a system like Feast or Tecton.
Online: Low-latency (Redis/DynamoDB) for user context.
Offline: Point-in-time joins to prevent label leakage (ensuring features used for a training sample were known before the event occurred).
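The point-in-time join can be sketched in a few lines. This is a simplified stand-in for what a feature store does internally, not the API of Feast or Tecton; the `feature_log` layout and `point_in_time_lookup` are hypothetical.

```python
from bisect import bisect_left

# Hypothetical feature log: per user, (timestamp, watch_count_7d) snapshots sorted by timestamp.
feature_log = {
    "u1": [(100, 3), (200, 5), (300, 9)],
}


def point_in_time_lookup(log, user_id, event_ts):
    """Return the latest feature value recorded strictly before event_ts, or None."""
    snapshots = log.get(user_id, [])
    timestamps = [ts for ts, _ in snapshots]
    idx = bisect_left(timestamps, event_ts)  # number of snapshots with ts < event_ts
    return snapshots[idx - 1][1] if idx > 0 else None


# A training example logged at t=250 must see the value from t=200, not t=300.
value_at_250 = point_in_time_lookup(feature_log, "u1", 250)
```

Using the latest value instead (9 in this toy log) would leak information from after the event into the training sample.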

Model Architecture

Retrieval (Two-Tower):
Separate DNNs for User and Item.
Dot product output: score = \text{UserEmbed} \cdot \text{ItemEmbed}.
Loss: Sampled Softmax or Triplet Loss.
Ranking (DCN-v2):
Cross Network: Explicitly models feature interactions u_i \times v_j.
Deep Network: Captures implicit non-linear relations.
This architecture handles sparse categorical features (IDs) and dense features (durations) effectively.
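For reference, a single DCN-v2 cross layer computes x_{l+1} = x_0 \odot (W_l x_l + b_l) + x_l, where \odot is the elementwise product. The pure-Python sketch below implements that one layer on toy inputs; the weights are hand-picked for illustration, and a real model would learn them and stack several layers alongside the deep network.

```python
def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def cross_layer(x0, xl, W, b):
    """One DCN-v2 cross layer: x0 * (W @ xl + b) + xl, with * applied elementwise."""
    proj = [p + bi for p, bi in zip(matvec(W, xl), b)]
    return [x0i * pi + xli for x0i, pi, xli in zip(x0, proj, xl)]


x0 = [1.0, 2.0]                  # toy input features
swap = [[0.0, 1.0], [1.0, 0.0]]  # hand-picked weights that mix the two features
zero_bias = [0.0, 0.0]

# First layer (xl = x0): each output contains the product x0[0] * x0[1] plus a residual.
x1 = cross_layer(x0, x0, swap, zero_bias)
```

With these weights the output is [x0[0]*x0[1] + x0[0], x0[1]*x0[0] + x0[1]], which shows how the layer creates an explicit second-order interaction while the residual term preserves the original features.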

Training Pipeline

Negative Sampling: Since users only click a fraction of items, we use "Easy Negatives" (random videos from the corpus) and "Hard Negatives" (videos shown but not clicked).
Labeling: Positive = Watch time > 30s or > 50% of video.
Data Split: Time-based split (Train on days 1-28, Test on day 29) is mandatory to prevent temporal leakage.
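The labeling rule and the easy/hard negative mix can be sketched as follows. The function names, the uniform sampling of easy negatives, and the toy corpus are assumptions; production systems often sample easy negatives in proportion to popularity instead.

```python
import random


def is_positive(watch_seconds: float, video_seconds: float) -> bool:
    """Positive label: watched more than 30s or more than 50% of the video."""
    return watch_seconds > 30 or watch_seconds > 0.5 * video_seconds


def build_training_examples(positives, impressions, corpus, num_easy, rng):
    """Return (video_id, label) pairs: positives, then hard negatives, then easy negatives."""
    pos = set(positives)
    examples = [(vid, 1) for vid in positives]
    # Hard negatives: shown to the user but not engaged with.
    examples += [(vid, 0) for vid in impressions if vid not in pos]
    # Easy negatives: sampled uniformly from videos the user was never shown.
    excluded = pos | set(impressions)
    pool = [vid for vid in corpus if vid not in excluded]
    easy = rng.sample(pool, min(num_easy, len(pool)))
    examples += [(vid, 0) for vid in easy]
    return examples


corpus = [f"v{i}" for i in range(20)]
examples = build_training_examples(
    positives=["v0"], impressions=["v0", "v1", "v2"],
    corpus=corpus, num_easy=3, rng=random.Random(0),
)
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, which helps when debugging differences between training runs.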

Serving Pipeline

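Retrieval Strategy: Export Item Tower embeddings to a vector database (e.g., Milvus or Pinecone) backed by an ANN index such as HNSW, whose search cost grows roughly logarithmically with corpus size.
Ranking Optimization: Use TensorRT or ONNX Runtime with reduced precision (FP16) or quantization (INT8) to cut ranker inference latency, for example from roughly 50ms to roughly 10ms per request.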
Fallback: If the ranker fails, return the retrieval output directly or a cached "Trending" list.
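The fallback chain can be expressed as a small wrapper around the two stages. This is a sketch under assumptions: `retrieve` and `rank` are hypothetical callables standing in for the real services, and falling back to the trending list when retrieval itself fails is an extension of the rule stated above.

```python
def recommend(user_id, retrieve, rank, trending, k=10):
    """Return up to k video IDs, degrading gracefully when a stage fails."""
    try:
        candidates = retrieve(user_id)
    except Exception:
        return trending[:k]  # assumed behavior: no candidates at all, use cached trending
    if not candidates:
        return trending[:k]
    try:
        return rank(user_id, candidates)[:k]
    except Exception:
        return candidates[:k]  # ranker failed: serve retrieval order directly


# Hypothetical stubs that simulate healthy and failing stages.
def ok_retrieve(user_id):
    return ["v3", "v1", "v2"]


def ok_rank(user_id, candidates):
    return sorted(candidates)


def failing_rank(user_id, candidates):
    raise TimeoutError("ranker unavailable")


degraded = recommend("u1", ok_retrieve, failing_rank, trending=["t1", "t2"], k=2)
```

In practice the ranker call would also carry a timeout so that a slow ranker triggers the same fallback as a failed one.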

Evaluation Pipeline

Offline: Use NDCG (Normalized Discounted Cumulative Gain) to evaluate if the "best" videos are at the top.
Online: A/B test with a 1% bucket, measuring "Incremental Watch Time" against the control group.
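The offline metrics can be computed with a few helper functions. The sketch below uses the linear-gain form of DCG and computes the ideal ordering from the same list of relevances, which assumes every relevant item appears in the ranked list; the function names are illustrative.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain with linear gain and a log2 position discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant items that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


# Relevance grades in ranked order (made up): a perfect ordering and a reversed one.
perfect = ndcg_at_k([3, 2, 1], k=3)
reversed_order = ndcg_at_k([1, 2, 3], k=3)
```

Recall@K is the natural metric for the retrieval stage, where order within the candidate set does not matter, while NDCG rewards the ranker for placing the best items first.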

Monitoring Pipeline

Data Drift: Monitor the distribution of input features (e.g., if the % of users from a specific country drops, the pipeline might be broken).
Prediction Drift: Monitor the average score output by the ranker. A sudden shift usually indicates an upstream data quality issue.
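A drift check on a categorical feature can be sketched with KL divergence between the training and serving distributions. The smoothing constant, the alert threshold of 0.1, and the sample data are assumptions chosen for illustration, not tuned values.

```python
import math
from collections import Counter

DRIFT_THRESHOLD = 0.1  # assumed alert threshold; would be tuned per feature


def distribution(values, categories, eps=1e-6):
    """Smoothed category frequencies, so unseen categories never have zero probability."""
    counts = Counter(values)
    total = len(values) + eps * len(categories)
    return {c: (counts.get(c, 0) + eps) / total for c in categories}


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same categories."""
    return sum(p[c] * math.log(p[c] / q[c]) for c in p)


categories = ["US", "IN", "BR"]
train = ["US"] * 50 + ["IN"] * 30 + ["BR"] * 20
serving = ["US"] * 95 + ["IN"] * 5  # simulated broken pipeline: one country vanished

drift = kl_divergence(distribution(serving, categories), distribution(train, categories))
alert = drift > DRIFT_THRESHOLD
```

The same comparison applied to features logged at serving time versus features recomputed offline for the same requests gives a direct measure of training-serving skew.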
Wrap Up

Final Evaluation

Cold Start: Address via a separate content-based retrieval tower (extracting embeddings from video thumbnails using a CNN).
Trade-offs:
Accuracy vs. Latency: We cap the number of items sent to the ranker at 500 to maintain <200ms latency.
Exploration vs. Exploitation: Use a Multi-Armed Bandit layer at the very end to re-rank a few exploratory items.
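The simplest version of this final exploration layer is epsilon-greedy, sketched below in place of a full bandit. The 5% rate comes from the exploration budget mentioned earlier; replacing only the last slot, and the names `inject_exploration` and `EPSILON`, are assumptions for the example.

```python
import random

EPSILON = 0.05  # share of requests that receive an exploratory item


def inject_exploration(ranked, explore_pool, epsilon, rng):
    """With probability epsilon, replace the last slot with an unseen exploratory item."""
    result = list(ranked)
    candidates = [vid for vid in explore_pool if vid not in result]
    if result and candidates and rng.random() < epsilon:
        result[-1] = rng.choice(candidates)
    return result


rng = random.Random(0)
final_list = inject_exploration(["a", "b", "c"], ["x", "y"], EPSILON, rng)
```

Thompson Sampling would replace the uniform `rng.choice` with a draw from each exploratory item's posterior engagement rate, so that promising new items are explored more often than clearly weak ones.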