The Question

Scalable Short-Video Recommendation System Design

Design the core recommendation engine for a high-growth short-form video platform similar to TikTok. The system must serve personalized content to 100M+ daily active users from a corpus of 100M+ videos. Your design should specifically address the multi-stage funnel (retrieval and ranking), the optimization for multiple competing objectives (e.g., watch time vs. engagement), and the engineering of real-time data pipelines to handle sub-second feedback loops. Focus on how you would minimize serving latency while maximizing content freshness and handling the cold-start problem for new creators.

Two-Tower Model

MMoE

HNSW

FAISS

Kafka

Flink

Redis

Milvus

DeepFM

Transformer

TRT

ONNX

Questions & Insights

Clarifying Questions

Business Goal: Is the primary North Star metric Total Watch Time, Day-1 Retention, or a composite score of engagement (likes, shares, follows)?

Assumption*: The goal is to maximize Total Watch Time** while maintaining a healthy diversity of content to prevent "filter bubbles."

Constraints & Scale: What is the scale of the user base and the content corpus?

Assumption: 1 Billion+ DAU, 100M+ active video corpus, and a P99 latency budget of < 200ms for the entire recommendation stack.

Data Freshness: How quickly should a user's interaction (e.g., liking a video) influence their next recommendation?

Assumption: Near real-time (within seconds) to capture immediate interest shifts.

Cold Start: How do we handle new videos with zero interactions?

Assumption: We need an exploration strategy (e.g., epsilon-greedy or multi-armed bandits) to give new content initial impressions.

Thinking Process

The Funnel Approach: With 100M videos, a single model cannot rank everything. I must design a multi-stage pipeline: Retrieval (Candidate Generation) to narrow down to ~1k candidates, and Ranking to produce the final top 10.

Multi-Objective Optimization: TikTok isn't just about clicks. It’s about watch time, likes, and shares. I’ll need a Multi-Gate Mixture-of-Experts (MMoE) or similar architecture to predict multiple labels simultaneously.

Feature Engineering is King: In short-video, the interaction between user preferences and video content (visual/audio) is vital. I need a robust embedding strategy for video/audio features.

Scaling for Latency: To hit sub-200ms, I must leverage Approximate Nearest Neighbor (ANN) search for retrieval and a high-performance C++ or Go serving layer for ranking.

Elite Bonus Points

Position Bias Correction: Users are more likely to watch a video just because it’s there. I'll implement a "position feature" during training but set it to a default value during inference to de-bias the model.

Delayed Feedback Loops: In video, a "watch time" label isn't available until the session ends or the video finishes. I'll use a Fuxi-style streaming joiner to handle labels that arrive minutes after the features.

Explore-Exploit via Calibration: Use a Calibration layer to ensure the predicted probabilities match real-world distributions, allowing for more stable "Upper Confidence Bound" (UCB) exploration of new content.

Graph-based Cold Start: Use Graph Neural Networks (GNNs) or DeepWalk on the user-video interaction graph to propagate embeddings from popular videos to new ones via shared metadata (e.g., same creator or song).

Design Breakdown

Requirements

Product Goal: Deliver a personalized, addictive feed of short videos.

Success Metrics:

Online: Average Watch Time per User, 7-day Retention, CTR on "Share."

Offline: AUC (for binary actions), LogLoss, nDCG (for ranking order).

Guardrail: P99 Latency, Daily Video Diversity Score.

System Constraints: 10M QPS at peak, < 200ms latency, 100PB+ of historical interaction data.

Data Availability: User profile, Video metadata (hashtags, duration), Raw Video/Audio pixels/waves, User interaction logs (real-time).

ML Problem Framing

ML Task Type: Two-stage Ranking.

Retrieval: Multi-channel candidate generation (Collaborative Filtering, Content-based, Embedding-based).

Ranking: Multi-task binary classification and regression.

Prediction Target:

Score = \sum w_i \cdot P(\text{action}_i)

where action includes Like, Finish, Share, and predicted Watch Time.

Inputs:

User: Historical watch history (last 50 videos), demographics, device.

Item: Video embeddings (from CLIP/VGG), audio features, creator stats.

Context: Time of day, location (city/country), network speed.

ML Challenges: High cardinality of features, extreme data sparsity, and "Echo Chamber" bias.

Design Summary & MVP

Concise Summary: A two-stage system utilizing a Two-Tower Neural Network for low-latency retrieval and an MMoE (Multi-gate Mixture-of-Experts) model for high-precision ranking.

Model Architecture & Selection:

Baseline: Collaborative Filtering (User-Item Matrix Factorization).

Target: Two-Tower (Retrieval) + DeepFM or MMoE (Ranking).

Choice Rationale: MMoE handles the trade-off between conflicting objectives (e.g., Like vs. Watch Time) better than a single-head DNN.

Simplicity Audit: The MVP avoids Reinforcement Learning (RL) in favor of supervised learning with a weighted score, significantly reducing training complexity while hitting 90% of the value.

Architecture Decision Rationale:

Scalability: Retrieval uses ANN (FAISS/HNSW), which scales logarithmically with corpus size.

Freshness: The Online Feature Store ensures the model reacts to the user's very last swipe.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Mobile app events (ViewStart, ViewEnd, Like, Follow).

Data Ingestion: Kafka as the backbone. Flink for real-time windowing (calculating a user's watch time in the last 5 minutes).

Data Storage: Iceberg/S3 for long-term storage. Partitioned by date and hour to optimize heavy batch training reads.

Data Quality: De-duplication of events (at-least-once to exact-once logic in Flink) and schema validation using Confluent Schema Registry.

Feature Pipeline

Feature Definition:

User: user_id embeddings, watched_category_ids (sequence), avg_watch_time_30d.

Item: video_id embeddings, visual_embedding (from a pre-trained Vision Transformer), audio_mood.

Online Feature Pipeline: Flink maintains a "sliding window" of the user's last 20 interactions. This is pushed to Redis/Milvus for low-latency retrieval.

Training/Serving Skew: We use a Unified Feature Logging pattern. We log the features exactly as they were seen at inference time to the offline store, rather than re-computing them from history.

Model Architecture

Retrieval (Candidate Generation):

Two-Tower Model: Separate MLP towers for User and Video. The output is a 128-d vector.

Loss: Triplet loss or Softmax with sampled negatives.

Benefit: User tower can be cached; Item tower can be pre-indexed in an ANN.

Ranking (MMoE):

Input Layer: Dense features + Embeddings (shared).

Expert Layer: 8-16 Expert MLPs to learn different data patterns.

Gating Network: A softmax layer per task (Like, WatchTime, Share) that decides which experts to weight for that specific prediction.

Output Layer: Sigmoid (for binary) and ReLU (for watch time regression).

Training Pipeline

Label Construction:

is_liked: binary.

watch_ratio: min(watch_time / video_duration, 2.0).

Handling Imbalance: Downsample "skips" (negative class) to balance the "likes" (positive class), then use a calibration factor to correct the predicted probability.

Infrastructure: Horovod on PyTorch for distributed data-parallel training across 100+ GPUs.

Retraining: Daily batch retraining on the last 30 days of data + Online Learning (incremental updates) every 1 hour to capture viral trends.

Serving Pipeline

Retrieval Pattern:

Fetch UserTower(user_context) vector.

Query Milvus/FAISS using HNSW index to get top 1000 video IDs.

Ranking Pattern:

Multi-threaded feature hydration (fetch metadata for 1000 items from Redis in parallel).

Batch inference on Triton Inference Server using ONNX/TensorRT optimized models.

Reliability: If the Ranker fails, fallback to a "Popularity-based" list (cached) to ensure the user sees something.

Evaluation Pipeline

Offline: We use a Time-based Split. Train on Monday-Saturday, test on Sunday. Metrics: GAUC (Group AUC - AUC calculated per user and averaged) to ensure fairness across users.

Online: Interleaving for rapid model comparison (mixing results from two models and seeing which one gets more clicks) followed by a 5% traffic A/B test.

Monitoring Pipeline

System: Track QPS and P99. If ranking latency > 150ms, trigger an automated fallback to a smaller "Teacher-Student" distilled model.

ML Drift: Monitor Feature Value Distribution. If the "avg_video_length" in the serving request deviates significantly from the training set, alert for data pipeline bugs.

Wrap Up

Final Evaluation

Cold Start: For new videos, we use a separate "Warm-up" retrieval tower that targets users with high "Exploration" scores (users who enjoy being the first to see new content).

Trade-offs:

Complexity vs. Latency: MMoE adds compute cost over a simple DNN but improves multi-metric performance significantly.

Accuracy vs. Diversity: We use Maximal Marginal Relevance (MMR) in the re-ranking stage to penalize videos that are too similar to what the user just saw.

Distinguishing Insight: TikTok's secret sauce is the Content Understanding Tower. By using a pre-trained Multimodal Transformer, the system understands that a video of a cat and a video of a kitten are related even if they have no overlapping hashtags.