The Question
ML Design

Large-Scale Real-time Recommendation System

Design a high-scale, real-time recommendation system for a content platform with 100M+ items. Detail the multi-stage funnel architecture, handle real-time feature engineering, address position bias, and ensure the system meets sub-200ms P99 latency constraints for 100k+ QPS.
Tags: Two-Tower Model, DeepFM, HNSW, ANN, Kafka, Spark, Flink, Milvus, Triton, Feast, PyTorch, AUC, NDCG
Questions & Insights

Clarifying Questions

Business Goal: Is the primary North Star metric Watch Time, CTR, or User Retention?
Assumption: Primary goal is long-term Watch Time (to drive engagement).
Constraints & Scale: What is the scale of the item corpus and traffic?
Assumption: 100M videos, 500M DAU, 100k QPS, and an internal P99 latency budget of 150ms (headroom under the 200ms requirement).
Edge Cases: How do we handle cold starts and data freshness?
Assumption: Freshness is critical (new videos must be recommendable within minutes). Cold start for users is handled via popular/trending heuristics.
Data Access: What signals are available for modeling?
Assumption: We have user interaction history (clicks, watch percentage), video metadata, and real-time context (device, location).

Thinking Process

Identify the Funnel: With 100M videos, I cannot rank everything in real-time. I must use a multi-stage architecture: Retrieval (Candidate Generation) to narrow down to ~1,000 items, and Ranking to select the top ~20.
Bottleneck Analysis: The bottleneck is usually the latency of the ranking model and the freshness of the feature store.
MVP Focus: Avoid complex Graph Neural Networks initially. Use a Two-Tower model for retrieval and a DeepFM or simple MLP for ranking.
Scalability: Decouple feature computation from inference using a Feature Store to ensure online/offline consistency.

Elite Bonus Points

Position Bias Modeling: Users are more likely to click the first item regardless of quality. I will include "Position" as a feature during training, then set it to the same fixed default (e.g., 0) for every candidate during inference, so that scores reflect relevance rather than the slot an item was shown in.
Exploration vs. Exploitation: Implement a simple multi-armed bandit (MAB) strategy, such as Epsilon-Greedy, in the final re-ranking stage to surface fresh/diverse content.
Calibration: Ensure the predicted probability of a click matches the real CTR. This is crucial if we eventually combine multiple objectives (e.g., CTR * WatchTime).
Embedding Warm-start: Initialize each new video's embedding by averaging the embeddings of videos with similar metadata, which avoids the "zero-vector" cold start.
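The position-bias trick above can be sketched in a few lines. This is a minimal illustration, not a trained model: the logistic scorer, the weights, and the helper names (`build_features`, `predict_click`) are all hypothetical, and the last weight stands in for a learned position coefficient.

```python
import math

DEFAULT_POSITION = 0  # fixed value substituted at inference, as in the text

def build_features(item_features, position=None):
    """Append the position feature; position=None means we are serving."""
    pos = DEFAULT_POSITION if position is None else position
    return list(item_features) + [float(pos)]

def predict_click(weights, bias, features):
    """Plain logistic model, standing in for the real ranking network."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned parameters; the last weight is the position coefficient
# (negative: lower slots receive fewer clicks in the logged data).
weights = [0.8, -0.3, -0.5]
bias = -1.0

item = [1.2, 0.4]
logged_p = predict_click(weights, bias, build_features(item, position=3))
served_p = predict_click(weights, bias, build_features(item))
```

Because every candidate is scored with the same position value at serving time, the position term shifts all scores equally and the ordering depends only on the remaining features.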
Design Breakdown

Requirements

Product Goal: Increase total platform watch time.
Success Metrics:
Online: Mean Watch Time per session, Day-7 Retention.
Offline: AUC (for CTR), NDCG (for ranking order), Recall@K (for retrieval).
Guardrail: P99 Latency < 150ms, Infrastructure Cost per 1k requests.
System Constraints: 100M items, 100k QPS, <5 min data freshness for new videos.
Data Availability: User logs (clicks, skips), Video Metadata (tags, duration), Context (time, device).

ML Problem Framing

ML Task Type: Two-stage Ranking.
Prediction Target: Multi-objective. $\text{Score} = P(\text{Click}) \times \mathbb{E}[\text{WatchTime} \mid \text{Click}]$.
Inputs:
User: Historical IDs of last 50 videos watched, user demographics.
Item: Video ID, Creator ID, Duration, Genre, Embedding.
Context: Device, Time of day, Region.
Outputs: A ranked list of Video IDs.
ML Challenges: Highly imbalanced labels (most videos aren't watched), massive cardinality of IDs, and feedback loops.
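To make the multi-objective target concrete, here is a small worked example of ranking by click probability times expected watch time. The candidates and their numbers are invented for illustration, and `Candidate` is a hypothetical container, not part of any library.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    video_id: str
    p_click: float        # predicted P(Click)
    exp_watch_sec: float  # predicted E[WatchTime | Click], in seconds

def expected_watch_time(c: Candidate) -> float:
    """Score = P(Click) * E[WatchTime | Click]."""
    return c.p_click * c.exp_watch_sec

def rank(candidates, k):
    return sorted(candidates, key=expected_watch_time, reverse=True)[:k]

candidates = [
    Candidate("clickbait", 0.30, 20.0),   # high CTR, short watch: about 6 s
    Candidate("long_form", 0.10, 300.0),  # low CTR, long watch: about 30 s
    Candidate("average", 0.15, 90.0),     # about 13.5 s
]
top = rank(candidates, k=2)
```

The high-CTR item drops out of the top two, which is the behavior I want when Watch Time, not CTR, is the North Star. The product is only meaningful if the click probability is calibrated.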

Design Summary & MVP

Concise Summary: A two-stage pipeline using Two-Tower embeddings for low-latency retrieval followed by a DeepFM model for precise ranking.
Model Architecture & Selection:
Baseline: Popularity-based ranking or Logistic Regression on simple count features.
Target Model: Two-Tower DNN (Retrieval) + DeepFM (Ranking).
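Choice Rationale: Two-Tower scores a user-item pair with a dot product, so item embeddings can be precomputed and indexed for Approximate Nearest Neighbor (ANN) search via Milvus/FAISS. DeepFM learns both low- and high-order feature interactions, which suits the ranking stage.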
Simplicity Audit: We use standard ANN and Feed-forward nets. We skip RL-based ranking or real-time online learning for the MVP.
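The two-stage pipeline can be sketched end to end as follows. This is a toy under stated assumptions: retrieval is an exact dot-product scan standing in for the ANN index, `score_fn` is a hypothetical placeholder for the DeepFM ranker, and the corpus is 5,000 random vectors rather than 100M learned embeddings.

```python
import heapq
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve(user_vec, item_vecs, n):
    """Stage 1: top-n item ids by dot product (exact scan; ANN in production)."""
    return heapq.nlargest(n, item_vecs, key=lambda vid: dot(user_vec, item_vecs[vid]))

def rank(candidates, score_fn, k):
    """Stage 2: re-score the small candidate set with a heavier model."""
    return sorted(candidates, key=score_fn, reverse=True)[:k]

def recommend(user_vec, item_vecs, score_fn, n_retrieve=1000, k_final=20):
    candidates = retrieve(user_vec, item_vecs, n_retrieve)
    return rank(candidates, score_fn, k_final)

rng = random.Random(0)
DIM = 8  # illustrative embedding size
item_vecs = {f"v{i}": [rng.gauss(0, 1) for _ in range(DIM)] for i in range(5000)}
user_vec = [rng.gauss(0, 1) for _ in range(DIM)]

# Placeholder ranker: reuses the dot product; the real one would use rich features.
score_fn = lambda vid: dot(user_vec, item_vecs[vid])
result = recommend(user_vec, item_vecs, score_fn, n_retrieve=100, k_final=20)
```

The expensive scorer only ever sees `n_retrieve` items, which is what keeps ranking cost independent of corpus size.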

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Mobile/Web client logs via Kafka.
Ingestion: Kafka acts as the buffer. A streaming job (Spark Streaming, or Flink where per-event latency matters) processes logs for near-real-time feature updates.
Storage: S3 for raw logs (Parquet format). Partitioned by date/hour for efficient batch training.
Processing: Sessionization of user events (e.g., starting a new session after 30 minutes of inactivity) to create clean training samples.
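A minimal sessionization sketch, assuming the 30-minute window is read as an inactivity gap: a user's event joins their current session unless more than 30 minutes have passed since their previous event. The event tuple layout is hypothetical; a production job would do this in Spark or Flink rather than in memory.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)

def sessionize(events):
    """events: iterable of (user_id, timestamp, video_id).
    Returns a list of sessions, each a list of events in time order."""
    sessions = []
    last_seen = {}  # user_id -> (last timestamp, index of current session)
    for user_id, ts, video_id in sorted(events, key=lambda e: (e[0], e[1])):
        prev = last_seen.get(user_id)
        if prev is None or ts - prev[0] > SESSION_GAP:
            sessions.append([])          # gap exceeded: open a new session
            idx = len(sessions) - 1
        else:
            idx = prev[1]
        sessions[idx].append((user_id, ts, video_id))
        last_seen[user_id] = (ts, idx)
    return sessions

events = [
    ("u1", datetime(2024, 1, 1, 10, 0), "a"),
    ("u1", datetime(2024, 1, 1, 10, 10), "b"),
    ("u2", datetime(2024, 1, 1, 10, 5), "c"),
    ("u1", datetime(2024, 1, 1, 11, 0), "d"),  # 50 min after "b": new session
]
sessions = sessionize(events)
```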

Feature Pipeline

Feature Definition:
User: Embedding of last N watched videos (Mean pooling).
Item: Category (One-hot), Video length (Normalized).
Cross: User-Category affinity (e.g., how many "Sci-Fi" videos watched in last 24h).
Online/Offline Consistency: Use a Feature Store (Feast/Tecton). During training, we perform a "Point-in-time" join to ensure the model only sees data available at the time of the event.
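The point-in-time join can be illustrated without any feature-store dependency. The sketch below is not the Feast or Tecton API; `point_in_time_lookup` is a hypothetical helper that returns the latest feature value recorded at or before the event timestamp, which is the property the join must guarantee. Treating an update with the same timestamp as visible is an assumption; a stricter system might require it to be strictly earlier.

```python
import bisect

def point_in_time_lookup(feature_history, event_ts):
    """feature_history: list of (timestamp, value) sorted by timestamp.
    Returns the most recent value with timestamp <= event_ts, else None."""
    timestamps = [ts for ts, _ in feature_history]
    i = bisect.bisect_right(timestamps, event_ts)
    return feature_history[i - 1][1] if i > 0 else None

# Hypothetical history of "Sci-Fi videos watched in last 24h" for one user.
history = [(100, 2), (200, 5), (300, 9)]

# Training rows are labeled events; each gets the value known at that time.
training_events = [50, 250, 300]
joined = [point_in_time_lookup(history, ts) for ts in training_events]
```

The event at time 250 sees the value 5, not the later value 9, so the model never trains on information from the future.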

Model Architecture

Retrieval (Two-Tower):
User Tower: $f(\text{UserFeatures}) \to V_{user} \in \mathbb{R}^d$.
Item Tower: $g(\text{ItemFeatures}) \to V_{item} \in \mathbb{R}^d$.
Loss: Triplet loss or sampled softmax.
Ranking (DeepFM):
FM Part: Captures second-order (pairwise) interactions between features (e.g., UserID and VideoID).
Deep Part: Captures high-order non-linearities.
Rationale: DeepFM doesn't require manual feature engineering for cross-features, making it faster to iterate.
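The FM part is cheap because the sum over all feature pairs can be rewritten so that it costs O(kn) instead of O(kn^2). The sketch below checks that identity numerically; the feature vector `x` and latent vectors `V` are made-up values, and the linear terms and the deep component of DeepFM are omitted.

```python
from itertools import combinations

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def fm_pairwise_naive(x, V):
    """Sum over all pairs i<j of <V[i], V[j]> * x[i] * x[j]."""
    return sum(dot(V[i], V[j]) * x[i] * x[j]
               for i, j in combinations(range(len(x)), 2))

def fm_pairwise_fast(x, V):
    """Same quantity via 0.5 * sum_f ((sum_i v_if x_i)^2 - sum_i (v_if x_i)^2)."""
    total = 0.0
    for f in range(len(V[0])):
        terms = [V[i][f] * x[i] for i in range(len(x))]
        total += sum(terms) ** 2 - sum(t * t for t in terms)
    return 0.5 * total

# Hypothetical input: 4 features, latent dimension k = 3.
x = [1.0, 0.0, 1.0, 0.5]
V = [[0.1, -0.2, 0.3],
     [0.4, 0.0, -0.1],
     [-0.3, 0.5, 0.2],
     [0.2, 0.1, -0.4]]

naive = fm_pairwise_naive(x, V)
fast = fm_pairwise_fast(x, V)
```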

Training Pipeline

Label Construction:
Negative Sampling: For retrieval, use in-batch negatives. For ranking, use "Impressed but not clicked" as negatives.
Training: Distributed training on GPU clusters using Horovod or PyTorch DDP.
Retraining: Weekly full retrains with daily incremental updates to embeddings.
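In-batch negative sampling for the retrieval model can be sketched as a softmax over the batch: row i of the user and item embeddings is a positive pair, and every other item in the batch serves as a negative for user i. This is a pure-Python illustration with toy vectors; real training would use PyTorch tensors, and a correction for popular items being over-sampled as negatives is omitted here.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def in_batch_softmax_loss(user_vecs, item_vecs):
    """Mean cross-entropy where the label for user i is item i."""
    losses = []
    for i, u in enumerate(user_vecs):
        logits = [dot(u, v) for v in item_vecs]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

users = [[1.0, 0.0], [0.0, 1.0]]
aligned_items = [[5.0, 0.0], [0.0, 5.0]]     # each user matches its own item
misaligned_items = [[0.0, 5.0], [5.0, 0.0]]  # positives point the wrong way

good_loss = in_batch_softmax_loss(users, aligned_items)
bad_loss = in_batch_softmax_loss(users, misaligned_items)
```

The loss is near zero when each user embedding is closest to its own positive, and large when another item in the batch scores higher.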

Serving Pipeline

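Retrieval: Use an HNSW (Hierarchical Navigable Small World) index in Milvus for ANN search. With the index held in memory and sharded across nodes, a 100M-item corpus can typically be searched in under 10ms, at the cost of slightly reduced recall versus exact search.
Ranking: Containerized Python service that assembles candidate features and sends batched requests to the ranking model hosted on Triton Inference Server.
Optimization: Run the ranking model in FP16 (half precision), or apply INT8 quantization, to reduce memory footprint and latency, checking that offline AUC does not regress.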
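A quick capacity estimate helps decide how many shards the index needs. The sketch below uses a rough rule of thumb for HNSW memory (vector storage plus about 2·M neighbor ids of 4 bytes each per item); the embedding dimension of 64 and M = 16 are assumptions, not values from the design, and real usage also depends on the engine's overhead.

```python
def hnsw_memory_gb(num_items, dim, m_links, bytes_per_float=4):
    """Rough in-memory size of an HNSW index, in decimal gigabytes."""
    vector_bytes = dim * bytes_per_float
    link_bytes = m_links * 2 * 4  # approx. base-layer neighbor ids, 4 bytes each
    return num_items * (vector_bytes + link_bytes) / 1e9

NUM_ITEMS = 100_000_000  # corpus size from the requirements
fp32_gb = hnsw_memory_gb(NUM_ITEMS, dim=64, m_links=16)
fp16_gb = hnsw_memory_gb(NUM_ITEMS, dim=64, m_links=16, bytes_per_float=2)
```

Under these assumptions the FP32 index is roughly 38 GB and the FP16 variant roughly 26 GB, so the corpus fits on a handful of memory-optimized nodes, with replicas added for QPS rather than capacity.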

Evaluation Pipeline

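Offline: Use a time-based split (train on Mon-Sat, test on Sun) so evaluation never sees future interactions, or a "Leave-one-out" strategy (hold out each user's most recent interaction) when per-user history is short.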
Online: 5% traffic A/B test. Measure "Mean Watch Time" as the gold standard.
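For reference, the two offline ranking metrics can be computed as below. This is a minimal sketch: NDCG uses linear gains with a log2 position discount (some teams use 2^rel - 1 instead), and the relevance labels are invented.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant items that appear in the top-k retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def dcg_at_k(gains, k):
    """Discounted cumulative gain; rank 0 is discounted by log2(2) = 1."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(gains, k):
    """gains: relevance of each item in the order the model ranked them."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

perfect = ndcg_at_k([3, 2, 1], k=3)   # already in ideal order
reversed_ = ndcg_at_k([1, 2, 3], k=3)  # worst ordering of the same items
retrieval_recall = recall_at_k(["a", "b", "c"], {"a", "c", "z"}, k=2)
```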

Monitoring Pipeline

Prediction Drift: Monitor the distribution of ranking scores. If the mean score drops by 20% relative to a trailing baseline (e.g., the previous 7 days), trigger an alert.
System Health: Track P99 latency. If it exceeds 200ms, fall back to a cached "Trending" list.
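Both monitoring rules reduce to simple threshold checks. The sketch below assumes the drift baseline is supplied by the caller and that the fallback decision is made from a recent window of latencies using a nearest-rank P99; the function names are hypothetical, and a production system would more likely enforce a per-request timeout as well.

```python
import math

def score_drift_alert(baseline_mean, current_mean, max_rel_drop=0.20):
    """True when the mean ranking score has dropped by at least max_rel_drop."""
    if baseline_mean <= 0:
        return False
    return (baseline_mean - current_mean) / baseline_mean >= max_rel_drop

def p99(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty list."""
    ordered = sorted(latencies_ms)
    idx = max(0, math.ceil(99 * len(ordered) / 100) - 1)
    return ordered[idx]

def choose_response(personalized, trending, recent_latencies_ms, limit_ms=200.0):
    """Serve the cached trending list while P99 latency is over the limit."""
    return trending if p99(recent_latencies_ms) > limit_ms else personalized

healthy_window = [50.0] * 99 + [500.0]    # one slow outlier: P99 is still 50
degraded_window = [50.0] * 98 + [500.0] * 2
served_ok = choose_response(["rec1"], ["trend1"], healthy_window)
served_fallback = choose_response(["rec1"], ["trend1"], degraded_window)
```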
Wrap Up

Final Evaluation

Edge Cases: For new users, we provide a "Survey" during onboarding or show globally popular videos (Cold Start).
Trade-offs: We give up the accuracy of a more complex model (e.g., Transformer-based ranking) in exchange for lower latency and better maintainability in the MVP.
Feedback Loop: Clicks are fed back into the system in real-time (Flink) to update "Last N watched" features, allowing the system to react to a user's current session intent within seconds.
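The "Last N watched" state behind this feedback loop is a bounded per-user queue. The sketch below keeps it in process memory for illustration; in the real system the Flink job would write to the online feature store, and the class and function names here are hypothetical. Mean pooling is included because the user feature is defined as the pooled embedding of these videos.

```python
from collections import defaultdict, deque

class LastNWatched:
    """Keeps the N most recent video ids per user, oldest first."""

    def __init__(self, n=50):
        self._history = defaultdict(lambda: deque(maxlen=n))

    def on_watch(self, user_id, video_id):
        # deque(maxlen=n) silently evicts the oldest id once full.
        self._history[user_id].append(video_id)

    def get(self, user_id):
        return list(self._history.get(user_id, ()))

def mean_pool(vectors):
    """Element-wise mean of equal-length vectors; empty input gives []."""
    if not vectors:
        return []
    return [sum(col) / len(vectors) for col in zip(*vectors)]

store = LastNWatched(n=3)
for vid in ["v1", "v2", "v3", "v4"]:
    store.on_watch("u1", vid)

recent = store.get("u1")
user_vec = mean_pool([[1.0, 3.0], [3.0, 5.0]])
```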