DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
The Question
ML Design

Large-Scale Video Recommendation System Design

Design an end-to-end recommendation system for a global video streaming platform with 500M users and 100M videos. The system must optimize for multiple objectives (CTR and Watch Time) while meeting a 200ms P99 latency SLA. Detail the multi-stage architecture, including candidate retrieval via embedding-based search and high-precision ranking. Address specific challenges like position bias, delayed feedback in watch-time labels, and the infrastructure required for real-time feature freshness and model monitoring in production.
Two-Tower
MMoE
FAISS
Kafka
Flink
Spark
Tecton
XGBoost
DeepFM
Protobuf
Isotonic Regression
Questions & Insights

Clarifying Questions

Business Goal: Is the primary North Star metric total watch time, long-term retention, or click-through rate (CTR)?
Assumption*: The goal is to maximize Long-term Watch Time** while maintaining healthy CTR.
Constraints & Scale: What is the scale of the user base and item corpus?
Assumption: 500M Monthly Active Users (MAU), 100M+ videos, and a P99 latency budget of 200ms for the full recommendation pipeline.
Data Freshness: How quickly must new videos appear in recommendations?
Assumption: New videos (breaking news, trending clips) should be discoverable within minutes of upload.
Edge Cases: How do we handle the "Cold Start" problem for new users and "Filter Bubbles"?
Assumption: We will use content-based features for new videos and a "popular/trending" heuristic for new users.

Thinking Process

Identify the Funnel: With 100M videos, I cannot rank everything. I must use a multi-stage architecture: Retrieval (Candidate Generation) followed by Ranking (Scoring).
Addressing Scale vs. Latency: For retrieval, use an embedding-based approach (Two-Tower) to enable Approximate Nearest Neighbor (ANN) search. For ranking, use a more complex model (MMoE or DeepFM) on a smaller subset (~1,000 candidates).
Handling Data Skew: Video consumption data is power-law distributed. I need to handle popular "head" items differently from "tail" items to avoid bias.
MVP Mindset (YAGNI): Start with a simple Two-Tower retrieval and a DNN ranker. Avoid Reinforcement Learning or Graph Neural Nets until the baseline is robust.

Elite Bonus Points

Multi-Task Learning (MTL): Optimizing for a single metric (like CTR) leads to clickbait. Use an MMoE (Multi-gate Mixture-of-Experts) architecture to simultaneously predict CTR and "Watch Time Completion Rate."
Position Bias Correction: Users are more likely to click the first item regardless of quality. Use a "Position Feature" during training (but set it to a constant or use a shallow tower to neutralize it at inference) to decorrelate position from relevance.
Delayed Feedback Handling: "Watch time" is a delayed label. I will implement a "Label Joiner" in the pipeline that waits for a specific window (e.g., 20 mins) before emitting a training sample to ensure label accuracy.
Calibration: Since we optimize for watch time (regression) and CTR (classification), raw scores aren't comparable. Apply Isotonic Regression or Platt Scaling to ensure the probabilities represent real-world frequencies.
Design Breakdown

Requirements

Product Goal: Deliver highly relevant, engaging video content to increase user stickiness.
Success Metrics:
Online: Mean Watch Time per Session, Day-7 Retention, CTR.
Offline: AUC (for CTR), LogLoss, NDCG@K (for ranking order).
Guardrail: P99 Latency < 200ms, Error rate < 0.1%.
System Constraints: 50k QPS at peak; real-time ingestion of "Like/Dislike" signals.
Data Availability: User profile (age, geo), Video metadata (tags, transcript embeddings), Interaction logs (watch percentage, skip events).

ML Problem Framing

ML Task Type: Two-stage Ranking (Retrieval = Candidate Generation; Ranking = Precision Scoring).
Prediction Target:
P(\text{click} | \text{user, item, context})
E(\text{watch\_time} | \text{user, item, context, click})
Inputs:
User: Search history (last 50 queries), Watch history (last 100 videos), Embeddings.
Item: Video ID, Creator ID, Duration, Topic Embeddings, Aggregated historical CTR.
Context: Device (Mobile/Desktop), Time of day (weekday vs. weekend), Network quality.
ML Challenges: Extreme Class Imbalance (most videos are not clicked) and Feedback Loops (users only interact with what we show them).

Design Summary & MVP

Concise Summary: A two-stage pipeline using Two-Tower Embeddings for sub-millisecond retrieval and a Deep Neural Network (DNN) for high-precision ranking of the top 500 candidates.
Baseline Model: Matrix Factorization for retrieval and Logistic Regression with engineered cross-features for ranking.
Target Model: Two-Tower (Retrieval) + Multi-Gate Mixture-of-Experts (Ranking).
Choice Rationale: Two-Tower allows for decoupling User/Item towers for ANN search. MMoE allows balancing click propensity vs. watch duration, preventing clickbait.
Simplicity Audit: We avoid real-time online learning (complex infra) in favor of frequent batch retraining (daily) with a real-time feature store for freshness.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: User interaction events (click, play, pause, seek), Video uploads (metadata/blobs).
Data Ingestion: Kafka for real-time events. Using Protobuf for schema enforcement to prevent downstream pipeline breaks.
Data Storage: S3 for the raw data lake; BigQuery or Snowflake for structured analytical queries used by Data Scientists.
Data Processing: Spark for daily heavy lifting (calculating user-long-term preferences). Flink for session-based features (e.g., "videos watched in the last 10 minutes").

Feature Pipeline

Feature Definition:
User: user_watch_hist_30d, preferred_categories, avg_session_duration.
Item: video_age, normalized_view_count, embedding_v2.
Online Feature Pipeline: Flink aggregates real-time counters (e.g., "how many times was this video clicked in the last 5 mins") and pushes to a low-latency Redis/DynamoDB (Feature Store).
Feature Store: Using Tecton or Feast. It solves the Point-in-Time Join problem, ensuring that during training, we only use features available before the interaction occurred.

Model Architecture

Retrieval (Two-Tower):
User Tower: Dense layers processing user history and context.
Item Tower: Dense layers processing video metadata.
Output: Dot-product of towers. Optimized using sampled softmax loss.
Ranking (MMoE):
Shared Bottom: Shared layers for feature representation.
Experts: Task-specific sub-networks.
Towers: One for P(\text{click}) and one for E(\text{watch\_time}).
Optimization: Quantization (INT8) for the ranking model to reduce inference latency by 3-4x.

Training Pipeline

Dataset Construction: Negative sampling is critical. Use "Easy Negatives" (randomly sampled from corpus) for retrieval and "Hard Negatives" (retrieved but not clicked) for ranking.
Data Splitting: Time-based split. Train on days 1-28, validate on day 29, test on day 30. This avoids data leakage from future events.
Retraining: Daily retraining for the Ranking model. Weekly retraining for the Embedding towers (retrieval) as user interests shift slower than trending topics.

Serving Pipeline

Serving Pattern:
Retrieval: User ID -> Embedding -> FAISS search for top 500 Candidate IDs.
Feature Enrichment: Fetch 500 item features + 1 user feature set from Online Feature Store.
Scoring: MMoE predicts scores for 500 items.
Reliability: If the Ranker service fails, fall back to "Popularity-based" or "Retrieval-only" results (graceful degradation).

Evaluation Pipeline

Offline: Replay evaluation. Use historical logs to see if the model would have ranked the actually clicked video higher than the baseline.
Online: Interleaving or standard A/B testing. Interleaving is faster for ranking changes as it exposes users to both models simultaneously in one list.

Monitoring Pipeline

Model Drift: Monitor the distribution of predicted CTR vs. actual CTR. If the Population Stability Index (PSI) > 0.1, trigger a retraining job.
System Health: Track the "Feature Fill Rate." If the Feature Store starts returning nulls for critical features (like user_id), the model performance will degrade silently.
Wrap Up

Final Evaluation

Trade-offs: Accuracy vs. Latency. We use a smaller MMoE to stay under 200ms, even if a Transformer-based ranker would be +1% more accurate.
Cold Start: Use a separate "Exploration" Bandit (e.g., Thompson Sampling) to allocate 5% of traffic to new videos to collect data.
Distinguishing Insight: Multi-Stage Filtering. Between retrieval and ranking, add a "Filtering" layer to remove videos the user has already watched, blocked creators, or age-inappropriate content. This reduces the load on the heavy Ranker.