The Question
ML Design

Large-Scale Short-Video Recommendation System

Design an end-to-end recommendation engine for a high-concurrency short-video platform, focusing on a multi-stage ranking funnel (retrieval and ranking) that optimizes for diverse engagement metrics and handles massive data scale with sub-200ms latency.
Two-Tower Model
MMoE
ANN
FAISS
Kafka
Flink
Spark
TensorRT
MTL
GNN
Questions & Insights

Clarifying Questions

Business Goal: Is the primary North Star metric maximizing short-term engagement (CTR/Watch Time) or long-term retention?
Assumption*: We aim to maximize Total Watch Time and Retention (Day 1/Day 7)**.
Constraints & Scale: What is the scale of the user base and content library?
Assumption: 1 Billion Monthly Active Users (MAU), 100 Million active videos. We need to handle ~100k QPS at peak.
Latency Budget: What is the end-to-end P99 latency requirement for the "For You" feed?
Assumption: P99 latency < 200ms for the entire ranking pipeline.
Data Freshness: How quickly should a user's latest interaction (e.g., liking a video) influence their feed?
Assumption: Near real-time (seconds) for features, daily/hourly for model weights.
Cold Start: How do we handle new videos?
Assumption: We need a mechanism to push new content to a small, diverse subset of users to gather initial engagement signals.

Thinking Process

Identify the Funnel: With 100M videos, I cannot rank everything. I must use a multi-stage funnel: Retrieval (filter 100M to 1000) -> Ranking (score 1000) -> Re-ranking (apply business logic/diversity).
Address Data Velocity: TikTok's magic is its reactivity. The system must support real-time feature engineering (e.g., "last 5 videos watched") to capture shifting user intent within a single session.
Model Selection (MVP): Start with a Two-Tower model for retrieval (scalability) and a Deep Neural Network (DNN) or Multi-Gate Mixture-of-Experts (MMoE) for ranking to optimize multiple objectives (like, share, watch time).
Scale Strategy: Use a Vector Database for approximate nearest neighbor (ANN) search during retrieval to meet the latency budget.

Elite Bonus Points

Multi-Task Learning (MTL): Optimizing for a single metric (e.g., click) leads to clickbait. Using MMoE to balance Watch Time, Likes, and Shares is critical.
Position Bias Correction: Users are more likely to watch the first video shown. I'll implement a shallow tower or use position as a feature during training but set it to a default value during inference to de-bias.
Calibration: Watch time is a regression task, but likes are classification. I will use Isotonic Regression or Platt scaling to ensure predicted probabilities match real-world frequencies for better ranking fusion.
Real-time Feature Interaction: Implementing a Streaming Feature Store using Flink to compute "on-the-fly" user-video category affinities.
Design Breakdown

Requirements

Product Goal: Deliver a highly personalized, addictive "For You" feed that maximizes user retention and time spent.
Success Metrics:
Online Metrics: Average Daily Watch Time, Day-N Retention, Click-Through Rate (CTR).
Offline Metrics: LogLoss (Classification), RMSE (Watch Time), NDCG (Ranking quality), AUC.
Guardrail Metrics: P99 Latency, Diversity (Entropy of categories), Model Training Cost.
System Constraints: 1B users, 100M videos, 200ms latency budget, global distribution.
Data Availability: User profile (age, geo), Video metadata (tags, duration, audio), Interaction logs (watch%, like, comment, skip).

ML Problem Framing

ML Task Type: Two-stage ranking problem.
Stage 1 (Retrieval): Extreme Multiclass Classification (predicting the next video ID).
Stage 2 (Ranking): Multi-objective Classification & Regression.
Prediction Target:
P(\text{engagement} | \text{user, video, context})
E[\text{watch\_time} | \text{user, video, context}]
Inputs:
User: Historical IDs, embedding of last N watches, device, location.
Item: Video embeddings (from CNN/Vision Transformer), audio embeddings, creator features.
Context: Time of day, day of week, network speed (low speed = show shorter videos).
ML Challenges: Highly skewed data (most videos get 0 views), feedback loops, and massive item churn.

Design Summary & MVP

Concise Summary: A two-stage system using Two-Tower Embeddings for sub-10ms retrieval of 1,000 candidates, followed by a Multi-Task DNN that scores candidates based on predicted engagement and watch time.
Model Architecture & Selection:
Baseline: Collaborative Filtering (Matrix Factorization) + Logistic Regression.
Target Model: Two-Tower (Retrieval) + MMoE (Ranking).
Simplicity Audit: Avoids Reinforcement Learning (RL) in MVP. Uses Supervised Learning which is easier to debug and scale.
Architecture Decision Rationale: Two-Tower allows for pre-computing video embeddings, enabling millisecond retrieval via ANN. MMoE handles the "Watch Time vs. Like" trade-off better than a single-head DNN.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Mobile app events (video_start, video_end, like, share, follow).
Data Ingestion: Kafka for high-throughput event streaming. S3 for long-term storage of raw JSON logs.
Data Storage: Parquet on S3 partitioned by date/hour for efficient Spark reads.
Data Processing: Spark for heavy-duty batch joins of user history. Flink for real-time aggregation (e.g., number of times this video was skipped in the last 5 minutes).
Data Quality: De-duplication of events and schema enforcement using Confluent Schema Registry.

Feature Pipeline

Feature Definition:
User: user_id (embedding), watched_video_ids (sequence), preferred_languages.
Item: video_id, author_id, video_duration, topic_tags.
Context: hour_of_day, network_type, region.
Feature Engineering: Log-transforming view counts, Z-score normalization for duration, and hashing for high-cardinality categorical strings.
Online/Offline Consistency: Use a Feature Store (e.g., Feast). Training data is generated by "point-in-time" joins in the offline store to prevent label leakage.

Model Architecture

Retrieval (Two-Tower):
User Tower: Learns a 128-dim vector from user history and context.
Video Tower: Learns a 128-dim vector from video metadata and visual features.
Loss: Triplet loss or sampled softmax.
Ranking (MMoE):
Shared Bottom Layer: Learns common representations.
Experts: Multiple sub-networks specializing in different feature patterns.
Gates: Learns which expert to trust for "Watch Time" vs. "Like".
Final Layers: Sigmoid for Likes, ReLU for Watch Time (regression).
Inference Optimization: Use TensorRT for GPU inference and FP16 quantization to reduce latency.

Training Pipeline

Dataset Construction: Negative sampling is key. For every "watched" video (positive), sample K videos the user was shown but didn't interact with (negative).
Data Splitting: Time-based split. Train on days 1-27, validate on day 28, test on day 29. No random split to avoid "predicting the past using the future."
Infrastructure: Distributed training using Horovod or PyTorch DistributedDataParallel (DDP) across multiple A100 GPUs.

Serving Pipeline

Pattern: Request-Response via gRPC.
Caching: Cache top-N results for popular user segments (e.g., new users in a specific region) in Redis.
Fallback: if the Ranking Service times out, return the raw Retrieval results or a cached "Trending" list to ensure the user always sees content.

Evaluation Pipeline

Offline: AUC for binary tasks (likes). Recall@K for retrieval.
Online: A/B testing framework. Split users by user_id hash. Monitor "Time Spent per Session" as the primary needle-mover.

Monitoring Pipeline

Concept Drift: Monitor the distribution of predicted scores. If the average predicted Watch Time drops significantly, trigger an automated retrain.
System Health: Standard P99 latency and CPU/GPU memory saturation alerts.
Wrap Up

Final Evaluation

Cold Start: Use a "Multi-Armed Bandit" (e.g., Thompson Sampling) to allocate 5% of traffic to new videos to collect data.
Trade-offs: We sacrifice "Perfect Accuracy" (complex Transformer models) for "Low Latency" (Two-Tower/DNN) to ensure the infinite scroll feels seamless.
Distinguishing Insight: To truly reach TikTok's level, one must implement Graph Neural Networks (GNNs) for the retrieval stage later, capturing the "User-Author-Video" graph, but for MVP, Two-Tower is the standard industry choice for scale.