The Question
ML Design

Scalable Short-Video Recommendation System Design

Design the core recommendation engine for a high-growth short-form video platform similar to TikTok. The system must serve personalized content to 100M+ daily active users from a corpus of 100M+ videos. Your design should specifically address the multi-stage funnel (retrieval and ranking), the optimization for multiple competing objectives (e.g., watch time vs. engagement), and the engineering of real-time data pipelines to handle sub-second feedback loops. Focus on how you would minimize serving latency while maximizing content freshness and handling the cold-start problem for new creators.
Two-Tower Model
MMoE
HNSW
FAISS
Kafka
Flink
Redis
Milvus
DeepFM
Transformer
TRT
ONNX
Questions & Insights

Clarifying Questions

Business Goal: Is the primary North Star metric Total Watch Time, Day-1 Retention, or a composite score of engagement (likes, shares, follows)?
Assumption: The goal is to maximize Total Watch Time while maintaining a healthy diversity of content to prevent "filter bubbles."
Constraints & Scale: What is the scale of the user base and the content corpus?
Assumption: 100M+ DAU (as stated in the question), 100M+ active video corpus, and a P99 latency budget of < 200ms for the entire recommendation stack.
Data Freshness: How quickly should a user's interaction (e.g., liking a video) influence their next recommendation?
Assumption: Near real-time (within seconds) to capture immediate interest shifts.
Cold Start: How do we handle new videos with zero interactions?
Assumption: We need an exploration strategy (e.g., epsilon-greedy or multi-armed bandits) to give new content initial impressions.
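
As a minimal sketch of the epsilon-greedy idea in the cold-start assumption above: each feed slot is filled with a fresh video with probability epsilon, and with the next ranked video otherwise. The function name, the video IDs, and the epsilon value are all hypothetical, and a production system would more likely use a bandit with per-video statistics.

```python
import random

def mix_with_exploration(ranked_ids, fresh_ids, epsilon, rng):
    """Fill feed slots one by one: explore a fresh video with probability
    epsilon, otherwise exploit the next video from the ranked list."""
    ranked, fresh = list(ranked_ids), list(fresh_ids)
    feed = []
    while ranked or fresh:
        # Explore if fresh videos remain and either the ranked list is
        # exhausted or the coin flip lands below epsilon.
        explore = bool(fresh) and (not ranked or rng.random() < epsilon)
        source = fresh if explore else ranked
        feed.append(source.pop(0))
    return feed

rng = random.Random(0)  # seeded only to make the example repeatable
feed = mix_with_exploration(["v1", "v2", "v3", "v4"], ["new1", "new2"],
                            epsilon=0.1, rng=rng)
```

Both lists keep their internal order, so exploration changes which slots go to new content without reshuffling the ranker's own ordering.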

Thinking Process

The Funnel Approach: With 100M videos, a single model cannot rank everything. I must design a multi-stage pipeline: Retrieval (Candidate Generation) to narrow down to ~1k candidates, and Ranking to produce the final top 10.
Multi-Objective Optimization: TikTok isn't just about clicks. It's about watch time, likes, and shares. I'll need a Multi-gate Mixture-of-Experts (MMoE) or similar architecture to predict multiple labels simultaneously.
Feature Engineering is King: In short-video, the interaction between user preferences and video content (visual/audio) is vital. I need a robust embedding strategy for video/audio features.
Scaling for Latency: To hit sub-200ms, I must leverage Approximate Nearest Neighbor (ANN) search for retrieval and a high-performance C++ or Go serving layer for ranking.

Elite Bonus Points

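Position Bias Correction: Users are more likely to watch a video simply because it appears in a prominent feed position, regardless of its relevance. I'll feed position as a feature during training but fix it to a constant default value at inference, so the ranker scores candidates on content rather than placement.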
Delayed Feedback Loops: In video, a "watch time" label isn't available until the session ends or the video finishes. I'll use a Fuxi-style streaming joiner to handle labels that arrive minutes after the features.
Explore-Exploit via Calibration: Use a Calibration layer to ensure the predicted probabilities match real-world distributions, allowing for more stable "Upper Confidence Bound" (UCB) exploration of new content.
Graph-based Cold Start: Use Graph Neural Networks (GNNs) or DeepWalk on the user-video interaction graph to propagate embeddings from popular videos to new ones via shared metadata (e.g., same creator or song).
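
The position-bias bullet above can be illustrated with a toy additive model, offered here as a sketch under stated assumptions rather than the actual ranker. The bias table, the default slot, and the function names are hypothetical. During training the logged position would be passed in; at serving time every candidate gets the same default position.

```python
import math

# Hypothetical learned additive bias per feed position (slot 0 is most prominent).
POSITION_BIAS = {0: 0.8, 1: 0.3, 2: 0.0}
DEFAULT_POSITION = 2  # assumed neutral slot applied to all candidates at serving

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(content_logit, position):
    """Training-time view: the logged position contributes its own bias term."""
    return sigmoid(content_logit + POSITION_BIAS[position])

def serve_score(content_logit):
    """Serving-time view: the position is fixed, so only content differs."""
    return predict(content_logit, DEFAULT_POSITION)
```

Because every candidate shares the same position term at inference, the ordering of `serve_score` values depends only on the content logit.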
Design Breakdown

Requirements

Product Goal: Deliver a personalized, addictive feed of short videos.
Success Metrics:
Online: Average Watch Time per User, 7-day Retention, CTR on "Share."
Offline: AUC (for binary actions), LogLoss, nDCG (for ranking order).
Guardrail: P99 Latency, Daily Video Diversity Score.
System Constraints: 10M QPS at peak, < 200ms latency, 100PB+ of historical interaction data.
Data Availability: User profile, video metadata (hashtags, duration), raw video frames and audio waveforms, and real-time user interaction logs.

ML Problem Framing

ML Task Type: Two-stage Ranking.
Retrieval: Multi-channel candidate generation (Collaborative Filtering, Content-based, Embedding-based).
Ranking: Multi-task binary classification and regression.
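Prediction Target: $\text{Score} = \sum_i w_i \cdot P(\text{action}_i)$, where the actions include Like, Finish, and Share, and predicted Watch Time enters as an additional weighted term (a regression output rather than a probability).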
Inputs:
User: Historical watch history (last 50 videos), demographics, device.
Item: Video embeddings (from CLIP/VGG), audio features, creator stats.
Context: Time of day, location (city/country), network speed.
ML Challenges: High cardinality of features, extreme data sparsity, and "Echo Chamber" bias.
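
The weighted score in the Prediction Target can be sketched as a plain weighted sum over the per-task predictions. The weights and prediction values below are made up for illustration; in practice the weights would be tuned through online experiments.

```python
def blended_score(preds, weights):
    """Combine per-task predictions into one ranking score: sum_i w_i * pred_i."""
    return sum(weights[name] * preds[name] for name in weights)

# Hypothetical business weights and model outputs for a single candidate video.
weights = {"like": 1.0, "finish": 2.0, "share": 3.0, "watch_ratio": 1.5}
preds = {"like": 0.1, "finish": 0.6, "share": 0.02, "watch_ratio": 0.8}

score = blended_score(preds, weights)  # 0.1 + 1.2 + 0.06 + 1.2
```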

Design Summary & MVP

Concise Summary: A two-stage system utilizing a Two-Tower Neural Network for low-latency retrieval and an MMoE (Multi-gate Mixture-of-Experts) model for high-precision ranking.
Model Architecture & Selection:
Baseline: Collaborative Filtering (User-Item Matrix Factorization).
Target: Two-Tower (Retrieval) + DeepFM or MMoE (Ranking).
Choice Rationale: MMoE handles the trade-off between conflicting objectives (e.g., Like vs. Watch Time) better than a single-head DNN.
Simplicity Audit: The MVP avoids Reinforcement Learning (RL) in favor of supervised learning with a weighted score, significantly reducing training complexity while still capturing most of the value.
Architecture Decision Rationale:
Scalability: Retrieval uses ANN search (an HNSW index, e.g. via FAISS), whose query time grows roughly logarithmically with corpus size.
Freshness: The Online Feature Store ensures the model reacts to the user's very last swipe.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Mobile app events (ViewStart, ViewEnd, Like, Follow).
Data Ingestion: Kafka as the backbone. Flink for real-time windowing (calculating a user's watch time in the last 5 minutes).
Data Storage: Iceberg/S3 for long-term storage. Partitioned by date and hour to optimize heavy batch training reads.
Data Quality: De-duplication of events (turning at-least-once delivery into effectively exactly-once processing in Flink) and schema validation using Confluent Schema Registry.
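
The de-duplication step can be sketched in plain Python as a stand-in for keyed state in a stream processor. This assumes each event carries a unique `event_id` field, which the source does not specify; a real job would also expire old IDs so the state does not grow without bound.

```python
def deduplicate(events):
    """Yield each event once, dropping redeliveries that share an event_id."""
    seen = set()  # in a real streaming job this would be keyed state with a TTL
    for event in events:
        if event["event_id"] in seen:
            continue
        seen.add(event["event_id"])
        yield event

# Hypothetical stream in which the ViewEnd event was delivered twice.
raw = [
    {"event_id": "e1", "type": "ViewStart", "user_id": "u1"},
    {"event_id": "e2", "type": "ViewEnd", "user_id": "u1"},
    {"event_id": "e2", "type": "ViewEnd", "user_id": "u1"},
    {"event_id": "e3", "type": "Like", "user_id": "u1"},
]
clean = list(deduplicate(raw))
```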

Feature Pipeline

Feature Definition:
User: user_id embeddings, watched_category_ids (sequence), avg_watch_time_30d.
Item: video_id embeddings, visual_embedding (from a pre-trained Vision Transformer), audio_mood.
Online Feature Pipeline: Flink maintains a "sliding window" of the user's last 20 interactions. This is pushed to Redis/Milvus for low-latency retrieval.
Training/Serving Skew: We use a Unified Feature Logging pattern. We log the features exactly as they were seen at inference time to the offline store, rather than re-computing them from history.
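
The per-user window of recent interactions described in the Online Feature Pipeline can be sketched with a bounded deque. This is an in-memory illustration only; the class name is hypothetical, and the real state would live in Flink and be written out to Redis.

```python
from collections import defaultdict, deque

class RecentInteractions:
    """Keep only the most recent N video IDs per user."""

    def __init__(self, max_items=20):
        self._by_user = defaultdict(lambda: deque(maxlen=max_items))

    def record(self, user_id, video_id):
        # Appending to a full deque silently evicts the oldest entry.
        self._by_user[user_id].append(video_id)

    def latest(self, user_id):
        window = self._by_user.get(user_id)
        return list(window) if window else []

store = RecentInteractions(max_items=20)
for i in range(25):
    store.record("u1", f"v{i}")
```

After 25 events the window holds `v5` through `v24`, oldest first.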

Model Architecture

Retrieval (Candidate Generation):
Two-Tower Model: Separate MLP towers for User and Video. The output is a 128-d vector.
Loss: Triplet loss or Softmax with sampled negatives.
Benefit: User tower can be cached; Item tower can be pre-indexed in an ANN.
Ranking (MMoE):
Input Layer: Dense features + Embeddings (shared).
Expert Layer: 8-16 Expert MLPs to learn different data patterns.
Gating Network: A softmax layer per task (Like, WatchTime, Share) that decides which experts to weight for that specific prediction.
Output Layer: Sigmoid (for binary) and ReLU (for watch time regression).
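
To make the MMoE ranking layers concrete, here is a dependency-free forward pass with single-layer experts and linear task towers. It is a toy sketch: all weights are hand-picked so the result is easy to verify, and every name is hypothetical. A real model would be built in a deep learning framework with learned parameters.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, x):
    """Multiply a (rows x len(x)) weight matrix by the vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def relu(vec):
    return [max(0.0, v) for v in vec]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mmoe_forward(x, expert_weights, gate_weights, tower_weights, binary_tasks):
    # Shared experts: every task sees the same expert outputs.
    expert_outs = [relu(linear(w, x)) for w in expert_weights]
    hidden = len(expert_outs[0])
    outputs = {}
    for task, gate_w in gate_weights.items():
        gate = softmax(linear(gate_w, x))  # one mixing weight per expert
        mixed = [sum(g * e[i] for g, e in zip(gate, expert_outs))
                 for i in range(hidden)]
        logit = sum(w * m for w, m in zip(tower_weights[task], mixed))
        # Sigmoid head for binary tasks, ReLU head for watch-time regression.
        outputs[task] = sigmoid(logit) if task in binary_tasks else max(0.0, logit)
    return outputs

x = [1.0, 2.0]
experts = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]           # two 2x2 experts
gates = {"like": [[0, 0], [0, 0]],                        # equal mix of experts
         "watch_time": [[10, 0], [-10, 0]]}               # almost all expert 0
towers = {"like": [1, -1], "watch_time": [1, 1]}
out = mmoe_forward(x, experts, gates, towers, binary_tasks={"like"})
```

The two tasks share the experts but mix them differently, which is the mechanism that lets each objective emphasize different patterns in the same input.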

Training Pipeline

Label Construction:
is_liked: binary.
watch_ratio: min(watch_time / video_duration, 2.0).
Handling Imbalance: Downsample "skips" (negative class) to balance the "likes" (positive class), then use a calibration factor to correct the predicted probability.
Infrastructure: Horovod on PyTorch for distributed data-parallel training across 100+ GPUs.
Retraining: Daily batch retraining on the last 30 days of data + Online Learning (incremental updates) every 1 hour to capture viral trends.
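
The label and imbalance handling above can be sketched as two small functions. The recalibration uses the standard correction for negative downsampling, q = p / (p + (1 - p) / w), where w is the fraction of negatives kept; the function and parameter names are hypothetical.

```python
def watch_ratio_label(watch_time_s, video_duration_s, cap=2.0):
    """Regression label: watch time relative to duration, capped so that
    heavy re-watching does not dominate the loss."""
    return min(watch_time_s / video_duration_s, cap)

def recalibrate(p, negative_keep_rate):
    """Map a probability predicted on downsampled data back to the original
    distribution, given the fraction of negatives that were kept."""
    return p / (p + (1.0 - p) / negative_keep_rate)

looped = watch_ratio_label(45.0, 15.0)   # watched three times, capped at 2.0
half = watch_ratio_label(6.0, 12.0)      # 0.5
corrected = recalibrate(0.5, 0.1)        # 0.5 on 10%-sampled negatives -> 1/11
```

With no downsampling (`negative_keep_rate=1.0`) the correction leaves the prediction unchanged.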

Serving Pipeline

Retrieval Pattern:
Fetch UserTower(user_context) vector.
Query Milvus/FAISS using HNSW index to get top 1000 video IDs.
Ranking Pattern:
Multi-threaded feature hydration (fetch metadata for 1000 items from Redis in parallel).
Batch inference on Triton Inference Server using ONNX/TensorRT optimized models.
Reliability: If the Ranker fails, fall back to a cached "Popularity-based" list to ensure the user sees something.
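
The reliability fallback can be sketched as a wrapper around the ranking call. The `ranker` callable and its return shape (a list of `(video_id, score)` pairs) are assumptions made for this example; a production version would also enforce a timeout, which is omitted here.

```python
def recommend(user_id, candidates, ranker, popular_fallback, top_k=10):
    """Return the top_k ranked video IDs, or a cached popularity list
    if the ranker raises."""
    try:
        scored = ranker(user_id, candidates)  # assumed: [(video_id, score), ...]
        ordered = sorted(scored, key=lambda pair: pair[1], reverse=True)
        return [video_id for video_id, _ in ordered][:top_k]
    except Exception:
        # Deliberately broad: any ranker failure should still yield a feed.
        return popular_fallback[:top_k]

def healthy_ranker(user_id, candidates):
    return [(vid, float(len(vid))) for vid in candidates]  # toy scoring rule

def broken_ranker(user_id, candidates):
    raise RuntimeError("ranking service unavailable")

popular = ["p1", "p2", "p3"]
ok = recommend("u1", ["a", "ccc", "bb"], healthy_ranker, popular, top_k=2)
degraded = recommend("u1", ["a", "ccc", "bb"], broken_ranker, popular, top_k=2)
```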

Evaluation Pipeline

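Offline: We use a Time-based Split. Train on Monday-Saturday, test on Sunday. Metrics: GAUC (Group AUC: AUC calculated per user and then averaged, typically weighted by each user's impression count), which measures ranking quality within each user's own feed rather than across users.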
Online: Interleaving for rapid model comparison (mixing results from two models and seeing which one gets more clicks) followed by a 5% traffic A/B test.
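
A minimal GAUC computation, assuming impression-weighted averaging and skipping users whose labels are all positive or all negative (AUC is undefined for them). The pairwise AUC below is quadratic in the number of impressions per user and is meant only for illustration.

```python
from collections import defaultdict

def auc(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly,
    counting ties as half. Returns None when only one class is present."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def gauc(rows):
    """rows: iterable of (user_id, label, score). Impression-weighted mean
    of per-user AUC over users that have both classes."""
    by_user = defaultdict(list)
    for user_id, label, score in rows:
        by_user[user_id].append((label, score))
    weighted_sum = total_weight = 0.0
    for items in by_user.values():
        labels, scores = zip(*items)
        user_auc = auc(labels, scores)
        if user_auc is None:
            continue
        weighted_sum += user_auc * len(items)
        total_weight += len(items)
    return weighted_sum / total_weight if total_weight else None

rows = [
    ("u1", 1, 0.9), ("u1", 0, 0.1),                   # AUC 1.0, 2 impressions
    ("u2", 1, 0.2), ("u2", 0, 0.8), ("u2", 0, 0.1),   # AUC 0.5, 3 impressions
    ("u3", 1, 0.7),                                   # single class, skipped
]
result = gauc(rows)  # (1.0 * 2 + 0.5 * 3) / 5
```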

Monitoring Pipeline

System: Track QPS and P99. If ranking latency exceeds 150ms, automatically fall back to a smaller student model distilled from the full ranker (teacher-student distillation).
ML Drift: Monitor Feature Value Distribution. If the "avg_video_length" in the serving request deviates significantly from the training set, alert for data pipeline bugs.
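
One common way to quantify "deviates significantly" for a feature distribution is the Population Stability Index (PSI), sketched below. The source does not name a specific metric, so PSI, the bin edges, and the sample values are assumptions made for this example; a frequently quoted rule of thumb treats PSI above roughly 0.2 as worth investigating.

```python
import math

def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between two samples of one feature,
    bucketed by the given ascending bin edges."""
    def fractions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            bucket = sum(v >= e for e in edges)  # index of the bin holding v
            counts[bucket] += 1
        # Floor at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    exp_frac, act_frac = fractions(expected), fractions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(exp_frac, act_frac))

# Hypothetical avg_video_length samples in seconds.
training = [10, 20, 30, 40] * 25
serving_ok = [10, 20, 30, 40] * 10
serving_bad = [40] * 100          # e.g. a pipeline bug pinning the value

edges = [15, 25, 35]
stable = psi(training, serving_ok, edges)
drifted = psi(training, serving_bad, edges)
```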
Wrap Up

Final Evaluation

Cold Start: For new videos, we use a separate "Warm-up" retrieval tower that targets users with high "Exploration" scores (users who enjoy being the first to see new content).
Trade-offs:
Complexity vs. Latency: MMoE adds compute cost over a simple DNN but improves multi-metric performance significantly.
Accuracy vs. Diversity: We use Maximal Marginal Relevance (MMR) in the re-ranking stage to penalize videos that are too similar to what the user just saw.
Distinguishing Insight: A likely differentiator for a TikTok-style system is the Content Understanding Tower. By using a pre-trained Multimodal Transformer, the system understands that a video of a cat and a video of a kitten are related even if they have no overlapping hashtags.
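
The MMR re-ranking mentioned under Trade-offs can be sketched as a greedy loop that trades relevance against similarity to items already chosen for the slate. The candidate tuples, embeddings, and the `lambda_` value are hypothetical; extending the penalty to recently watched videos would mean seeding the selected set with them.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr_rerank(candidates, lambda_=0.7, k=10):
    """candidates: list of (video_id, relevance, embedding).
    Greedily pick the item maximizing
    lambda_ * relevance - (1 - lambda_) * max similarity to selected items."""
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(cand):
            max_sim = max((cosine(cand[2], s[2]) for s in selected), default=0.0)
            return lambda_ * cand[1] - (1.0 - lambda_) * max_sim
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [video_id for video_id, _, _ in selected]

candidates = [
    ("cat1", 0.90, [1.0, 0.0]),
    ("cat2", 0.85, [1.0, 0.0]),   # near-duplicate of cat1
    ("dog", 0.60, [0.0, 1.0]),
]
diverse = mmr_rerank(candidates, lambda_=0.5, k=2)
greedy = mmr_rerank(candidates, lambda_=1.0, k=2)
```

With `lambda_=0.5` the near-duplicate `cat2` is pushed out in favor of `dog`, while `lambda_=1.0` reduces to plain relevance ordering.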