The Question
ML Design

Short-Form Video Recommendation System

Design the end-to-end recommendation engine for a short-form video platform similar to TikTok. The system must scale to 1B+ users and 100M+ videos, delivering personalized content with sub-150ms latency. Your design should specifically address the multi-objective nature of engagement (watch time, likes, shares), the need for ultra-fresh real-time user feedback, and strategies for video cold-start/exploration. Elaborate on the data pipelines for real-time feature engineering, the two-stage retrieval and ranking architecture, and the infrastructure required for continuous model evaluation and deployment.
Two-Tower
MMoE
FAISS
HNSW
Kafka
Flink
Spark
Redis
PyTorch
Triton Inference Server
Protobuf
Kubernetes
Questions & Insights

Clarifying Questions

Clarifying Questions & Constraints:
Business Goal: Is the primary North Star metric total watch time, user retention (L7/L30), or a multi-objective blend of likes, shares, and follows? Assumption: Maximize high-quality watch time and long-term retention.
Constraints & Scale: What is the scale of the corpus and user base? Assumption: 1B+ DAU, 100M+ active videos, and a P99 latency budget of <150ms for the full recommendation pipeline.
Freshness: How quickly must a newly uploaded video or a user's latest interaction influence the feed? Assumption: Near real-time (seconds) for user actions, and minutes for new video ingestion.
Cold Start: How do we handle new videos with zero interactions? Assumption: Use content-based features (audio/visual embeddings) and a dedicated "exploration" traffic bucket.
Assumptions:
Corpus size: 100M videos.
Retrieval candidates: Top 1,000.
Ranking output: Top 10-20 videos per request.
Environment: Cloud-native (K8s) with GPU acceleration for inference.

Thinking Process

Identify the Funnel: With 100M items, scoring every video with a heavy ranking model on each request is infeasible within the latency budget. I need a multi-stage funnel: Retrieval (efficient filtering) -> Ranking (precise scoring) -> Re-ranking (business logic and diversity).
Multi-Objective Nature: Engagement on a TikTok-style feed isn't just clicks. It combines watch time (regression) with social signals such as likes and shares (classification). I should use a Multi-gate Mixture-of-Experts (MMoE) model so that these potentially conflicting objectives share experts without degrading each other.
Real-time Feedback: User interest in short-form video shifts in seconds. The system must ingest "skip" or "like" events immediately to update the next video in the session.
Scalability vs. Simplicity: Stick to a Two-Tower architecture for retrieval (MVP standard) and a DeepFM or MMoE for ranking. Avoid complex Reinforcement Learning for the MVP.

Elite Bonus Points

Positional Bias Correction: Since users are more likely to watch what's presented first, I’ll implement a "position feature" during training but set it to a default value during inference to de-bias the model.
Delayed Feedback Loop: Watch-time labels are only finalized after the user scrolls away or the session ends. I’ll treat each impression as a provisional "not watched" negative and correct the label if a "watch" event arrives late, which avoids training on false negatives.
Online Learning/Streaming Training: As a lighter-weight alternative to full online training, use Flink to compute hourly interaction trends as features, so the model can pick up viral content before the daily batch training job runs.
Exploration via Bandits: Implementing a small epsilon-greedy or Thompson Sampling layer in the re-ranker to allocate 5% of traffic to "cold-start" videos to discover the next viral hit.
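The bandit layer above could be sketched as Beta-Bernoulli Thompson Sampling over cold-start videos. This is a minimal illustration under stated assumptions, not a production design: `VideoArm`, `pick_cold_start_video`, and `should_explore` are hypothetical names, the binary reward ("the impression ended in a completed view") is an assumed definition, and the 5% rate is taken from the bullet above.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class VideoArm:
    """Beta-Bernoulli posterior over one cold-start video's success rate."""
    video_id: str
    successes: int = 0  # assumed reward: impression ended in a completed view
    failures: int = 0   # impression ended in a quick skip

    def sample(self, rng: random.Random) -> float:
        # Beta(1, 1) prior: a brand-new video starts with a uniform belief.
        return rng.betavariate(1 + self.successes, 1 + self.failures)

    def update(self, reward: bool) -> None:
        if reward:
            self.successes += 1
        else:
            self.failures += 1


def should_explore(rng: random.Random, explore_rate: float = 0.05) -> bool:
    """Decide whether this feed slot goes to the exploration bucket."""
    return rng.random() < explore_rate


def pick_cold_start_video(arms: List[VideoArm], rng: random.Random) -> VideoArm:
    """Thompson Sampling: draw once from each posterior, serve the best draw."""
    return max(arms, key=lambda arm: arm.sample(rng))
```

Videos with few impressions have wide posteriors, so they still win some draws; as evidence accumulates, traffic concentrates on the videos that actually perform.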
Design Breakdown

Requirements

Product Goal: Deliver a personalized, addictive "For You" feed that maximizes user satisfaction.
Success Metrics:
Online: Daily Active Users (DAU), Average Watch Time per User, 1-day Retention.
Offline: AUC (for binary actions), LogLoss, and nDCG (for ranking order).
Guardrail: P99 Latency (<150ms), Diversity Score (Intra-list distance), and Ad-load ratio.
System Constraints: 1M+ QPS, PB-scale data storage, 100ms model execution time.
Data Availability: User profile, Video metadata, Real-time interaction logs (clicks, skips, shares).

ML Problem Framing

ML Task Type: Multi-stage Ranking (Retrieval + Ranking).
Prediction Target: A weighted blend of per-task predictions: y = w_1 P(finish) + w_2 P(like) + w_3 P(share) - w_4 P(skip). Note that y is a ranking score, not a probability.
Inputs:
User: Historical categories, embedding of last 50 watched videos, device, geo.
Item: Video duration, audio-id, creator-id, visual embeddings (ResNet/ViT), hashtags.
Context: Time of day, network speed (important for video bitrate/buffering).
ML Challenges: Extreme data imbalance (shares are rare), high-cardinality ID features, and the "Echo Chamber" effect.
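The prediction target above can be written as a small scoring function. This is a sketch only: the weight values in `DEFAULT_WEIGHTS` are made-up placeholders (in practice they would be tuned through online experiments), and `blended_score` and `rank_candidates` are hypothetical helper names.

```python
from typing import Dict, List, Optional

# Placeholder weights (assumption); w_1..w_4 in the formula above.
DEFAULT_WEIGHTS = {"finish": 1.0, "like": 2.0, "share": 3.0, "skip": 1.5}


def blended_score(preds: Dict[str, float],
                  weights: Optional[Dict[str, float]] = None) -> float:
    """y = w_1 P(finish) + w_2 P(like) + w_3 P(share) - w_4 P(skip)."""
    w = weights or DEFAULT_WEIGHTS
    return (w["finish"] * preds["finish"]
            + w["like"] * preds["like"]
            + w["share"] * preds["share"]
            - w["skip"] * preds["skip"])


def rank_candidates(candidates: Dict[str, Dict[str, float]]) -> List[str]:
    """Return video ids sorted by blended score, best first."""
    return sorted(candidates,
                  key=lambda vid: blended_score(candidates[vid]),
                  reverse=True)
```

Because shares are rare, P(share) is typically small, which is one reason its weight tends to be set higher than the weight on completion.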

Design Summary & MVP

Concise Summary: A two-stage system using Two-Tower embeddings for high-speed retrieval and an MMoE neural network for multi-objective ranking, optimized for low-latency serving via TensorRT.
Model Architecture & Selection:
Baseline Model: Matrix Factorization (Retrieval) + Logistic Regression (Ranking).
Target Model: Two-Tower (Retrieval) + MMoE (Ranking).
Choice Rationale: Two-Tower allows for efficient Approximate Nearest Neighbor (ANN) search; MMoE allows different "experts" to learn watch-time vs. engagement independently.
ML Life Cycle Summary: Data ingested via Kafka -> Features stored in Redis (Online) and S3 (Offline) -> Training via PyTorch Distributed -> Serving via Triton Inference Server -> Monitoring via Prometheus/Grafana.
Simplicity Audit: Avoids RL or Graph Neural Networks for the MVP, focusing on robust embeddings and established multi-task architectures.
Architecture Decision Rationale: This decoupled approach ensures horizontal scalability and allows the retrieval and ranking teams to iterate independently.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Mobile app logs (Protobuf), Video CMS (Metadata), Creator DB (User stats).
Data Ingestion: Kafka for high-throughput event streaming. Using Flink for stateful windowed aggregations (e.g., "how many times was this video watched in the last 10 minutes?").
Data Storage: S3 for the Data Lake (Parquet format, partitioned by date/hour). Snowflake/BigQuery for analytical queries.
Data Processing: Spark for heavy-duty batch processing and feature backfills.
Data Quality: Deequ or Great Expectations for schema validation. Deduplication of events at the Kafka-to-S3 sink.
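The windowed aggregation mentioned under Data Ingestion ("how many times was this video watched in the last 10 minutes?") can be illustrated without Flink. The class below is a single-process stand-in that only mimics the semantics of a keyed sliding window; it is not Flink API code, `SlidingWindowCounter` is a hypothetical name, and it assumes events for each video arrive in timestamp order.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class SlidingWindowCounter:
    """Per-video count of watch events inside a trailing time window."""

    def __init__(self, window_seconds: float = 600.0) -> None:
        self.window_seconds = window_seconds
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def record(self, video_id: str, event_ts: float) -> None:
        # Assumes in-order arrival per key; a real stream job would use
        # watermarks to handle late or out-of-order events.
        self._events[video_id].append(event_ts)

    def count(self, video_id: str, now_ts: float) -> int:
        """Evict expired events, then return how many remain in the window.

        Eviction is destructive, so `now_ts` should not move backwards.
        """
        events = self._events[video_id]
        cutoff = now_ts - self.window_seconds
        while events and events[0] <= cutoff:
            events.popleft()
        return len(events)
```

In the real pipeline the resulting counts would be written to the online feature store keyed by video id, so the ranker can read them at request time.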

Feature Pipeline

Feature Definition:
User: Dense embeddings of user history, sparse categorical IDs (City, Language).
Item: Video embeddings (derived from pre-trained CLIP/Audio models), counts (likes, views).
Feature Engineering: Log-transform for counts, Z-score normalization for dense features.
Feature Store: Redis for low-latency (<5ms) online retrieval; S3 for offline training.
Training/Serving Skew: Use a unified feature logging service. The feature values used at inference are logged and written directly into the training set, so the model trains on exactly the values it saw at serving time.
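The two transforms named under Feature Engineering might look like the sketch below. The function names are hypothetical. The point it illustrates is that the mean and standard deviation are fitted once on training data and then reused unchanged at serving time, which is one concrete way to avoid the skew described above.

```python
import math
from typing import List, Tuple


def log_transform(count: int) -> float:
    """log(1 + x): compresses heavy-tailed counts such as likes and views."""
    return math.log1p(count)


def fit_zscore(values: List[float]) -> Tuple[float, float]:
    """Compute (mean, population std) on the training set only."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


def zscore(value: float, mean: float, std: float, eps: float = 1e-8) -> float:
    """Apply stored statistics; eps guards against a zero-variance feature."""
    return (value - mean) / (std + eps)


# Usage: fit offline, then ship (mean, std) alongside the model.
train_views = [log_transform(c) for c in (0, 10, 1_000, 1_000_000)]
views_mean, views_std = fit_zscore(train_views)
serving_feature = zscore(log_transform(500), views_mean, views_std)
```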

Model Architecture

Retrieval (Two-Tower):
User Tower: Deep Neural Network (DNN) mapping user features to a 128D vector.
Item Tower: DNN mapping video features to a 128D vector.
Loss: InfoNCE (contrastive loss), which raises the dot product of (User, Positive Video) pairs relative to the dot products with negative videos.
Ranking (MMoE):
Input Layer: Shared embedding layer for categorical features.
Expert Layer: 8-16 MLP experts to capture different data distributions.
Gate Layer: Softmax gates per task (Watch time, Like, Share).
Optimization: Use mixed-precision training (FP16) and model quantization (INT8) for inference speed.
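To make the expert/gate structure concrete, here is a forward pass of a toy MMoE in plain Python. It is deliberately dependency-free and is only a sketch: a real model would be written in PyTorch with trained weights, bias terms, deeper experts, and task-specific tower MLPs. `TinyMMoE` and the random weight initialisation are illustrative assumptions.

```python
import math
import random
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]  # shape: [out_dim][in_dim]


def linear(weights: Matrix, x: Vector) -> Vector:
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def softmax(x: Vector) -> Vector:
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class TinyMMoE:
    """Shared experts, one softmax gate per task, one linear tower per task."""

    def __init__(self, in_dim: int, expert_dim: int, num_experts: int,
                 tasks: List[str], seed: int = 0) -> None:
        rng = random.Random(seed)
        self.experts = [random_matrix(expert_dim, in_dim, rng)
                        for _ in range(num_experts)]
        self.gates = {t: random_matrix(num_experts, in_dim, rng) for t in tasks}
        self.towers = {t: random_matrix(1, expert_dim, rng) for t in tasks}

    def gate_weights(self, task: str, x: Vector) -> Vector:
        # Each task learns its own mixture over the shared experts.
        return softmax(linear(self.gates[task], x))

    def forward(self, x: Vector) -> Dict[str, float]:
        expert_outs = [relu(linear(w, x)) for w in self.experts]
        outputs: Dict[str, float] = {}
        for task, tower in self.towers.items():
            g = self.gate_weights(task, x)
            mixed = [sum(g[k] * out[d] for k, out in enumerate(expert_outs))
                     for d in range(len(expert_outs[0]))]
            outputs[task] = linear(tower, mixed)[0]  # raw output (logit or regression value)
        return outputs
```

The like and share outputs would pass through a sigmoid, while the watch-time output would be used as a regression value; the gates are what let each task weight the shared experts differently.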

Training Pipeline

Dataset Construction: Sample negatives from the "impressions but no click" pool. For retrieval, use "in-batch negatives" to scale training.
Data Splitting: Time-based split. Train on days 1-28, validate on day 29, test on day 30.
Infrastructure: Distributed training using Horovod or PyTorch DDP on NVIDIA A100 clusters.
Retraining Strategy: Daily batch retraining for the ranking model; hourly "fine-tuning" or incremental updates for the retrieval index to include new videos.
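The "in-batch negatives" idea for retrieval training can be sketched as an InfoNCE loss over a batch of (user, positive video) embedding pairs: row i of the user batch is paired with row i of the item batch, and every other item in the batch acts as a negative. This plain-Python version is for illustration only; the function name and the `temperature` value are assumptions, and a real implementation would use PyTorch tensors and autograd.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def in_batch_info_nce(user_vecs: List[Vector], item_vecs: List[Vector],
                      temperature: float = 0.1) -> float:
    """Mean softmax cross-entropy where the diagonal entries are the positives."""
    losses = []
    for i, user in enumerate(user_vecs):
        logits = [dot(user, item) / temperature for item in item_vecs]
        # Numerically stable log-sum-exp over positive + in-batch negatives.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Usage: aligned pairs give a near-zero loss, mismatched pairs a large one.
users = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = in_batch_info_nce(users, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = in_batch_info_nce(users, [[0.0, 1.0], [1.0, 0.0]])
```

One known caveat is that popular videos appear as in-batch negatives more often than rare ones, so a sampling-bias correction is often applied to the logits.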

Serving Pipeline

Serving Pattern:
Retrieval: Approximate Nearest Neighbor (ANN) search, e.g., a FAISS index or an HNSW graph index, running on a separate cluster.
Ranking: Request-response serving using gRPC.
Latency Optimization: Model pruning, plus dynamic batching of concurrent requests in NVIDIA Triton.
Reliability: Fallback to "Popular Videos" if the ML service times out.
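The "Popular Videos" fallback can be sketched as a timeout wrapper around the ranking call. This is a simplified illustration: `recommend_with_fallback`, `POPULAR_VIDEOS`, and the thread-pool approach are assumptions, and in a real gRPC service the deadline would normally be set on the RPC itself. The 150 ms default mirrors the latency budget stated earlier.

```python
import concurrent.futures
import time
from typing import Callable, List

# Hypothetical precomputed list, refreshed by a batch job.
POPULAR_VIDEOS = ["popular_1", "popular_2", "popular_3"]

_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def recommend_with_fallback(rank_fn: Callable[[str], List[str]],
                            user_id: str,
                            timeout_s: float = 0.15) -> List[str]:
    """Return ranked videos, or the popular list if ranking is slow or fails."""
    future = _executor.submit(rank_fn, user_id)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # Best effort only: a thread that is already running cannot be interrupted.
        future.cancel()
        return list(POPULAR_VIDEOS)
    except Exception:
        # Any ranking error degrades to the same non-personalized feed.
        return list(POPULAR_VIDEOS)
```

Returning a generic feed keeps the app responsive; the fallback rate itself is worth tracking as a reliability metric.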

Evaluation Pipeline

Offline Evaluation:
Retrieval: Recall@K (is the ground truth in the top 1000?).
Ranking: Weighted AUC for each task, GAUC (Group AUC) per user to ensure intra-user ranking quality.
Online Evaluation: A/B testing framework measuring "Total Watch Time" and "Day-7 Retention".
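The offline metrics above can be computed with a few lines of code. The sketch below assumes binary labels and uses an O(P x N) pairwise AUC for readability; `recall_at_k`, `auc`, and `gauc` are illustrative names, and GAUC here weights each user by impression count, which is one common convention.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Sequence, Set, Tuple


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant videos that appear in the top-k retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def auc(labels: Sequence[int], scores: Sequence[float]) -> Optional[float]:
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        return None  # undefined when only one class is present
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gauc(rows: Iterable[Tuple[str, int, float]]) -> Optional[float]:
    """Impression-weighted mean of per-user AUC over (user_id, label, score)."""
    labels_by_user: Dict[str, List[int]] = defaultdict(list)
    scores_by_user: Dict[str, List[float]] = defaultdict(list)
    for user_id, label, score in rows:
        labels_by_user[user_id].append(label)
        scores_by_user[user_id].append(score)
    weighted_sum, total_weight = 0.0, 0
    for user_id, labels in labels_by_user.items():
        user_auc = auc(labels, scores_by_user[user_id])
        if user_auc is None:
            continue  # skip users with all-positive or all-negative labels
        weighted_sum += len(labels) * user_auc
        total_weight += len(labels)
    return weighted_sum / total_weight if total_weight else None
```

GAUC can differ noticeably from global AUC: a model that merely separates heavy users from light users scores well globally but may order videos poorly within any single user's feed.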

Monitoring Pipeline

System Monitoring: Latency (P50, P99), throughput (QPS), and CPU/GPU utilization.
Data Monitoring: Feature drift (e.g., if a categorical feature's distribution changes due to a bug).
Model Monitoring: Prediction drift (e.g., if the average predicted watch time deviates significantly from actuals).
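One way to quantify drift in a categorical feature is the Population Stability Index (PSI) between a reference sample (e.g., training data) and a recent serving sample. The write-up does not name a specific drift metric, so PSI is an assumption here, as is the function name; both inputs are assumed to be non-empty.

```python
import math
from collections import Counter
from typing import Iterable


def psi(expected: Iterable[str], actual: Iterable[str],
        eps: float = 1e-6) -> float:
    """Population Stability Index over the union of observed categories."""
    exp_counts, act_counts = Counter(expected), Counter(actual)
    exp_total = sum(exp_counts.values())
    act_total = sum(act_counts.values())
    score = 0.0
    for category in set(exp_counts) | set(act_counts):
        # eps keeps the log finite when a category is missing on one side.
        p = max(exp_counts[category] / exp_total, eps)
        q = max(act_counts[category] / act_total, eps)
        score += (q - p) * math.log(q / p)
    return score


# Usage: a language feature that shifts from 50/50 to 90/10.
reference = ["en"] * 50 + ["es"] * 50
recent = ["en"] * 90 + ["es"] * 10
drift = psi(reference, recent)
```

A value near 0 means the distributions match; a frequently quoted rule of thumb treats values above roughly 0.2 as a shift worth investigating, though the alert threshold should be tuned per feature.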
Wrap Up

Final Evaluation

Observability: Real-time dashboards for "Click-Through Rate" (CTR) by video category.
Feedback Loop: "Negative Feedback" (user clicking 'not interested') is weighted heavily to immediately suppress similar content.
Edge Cases:
Cold Start: Reserving roughly 5% of candidate slots for new videos, injected at the retrieval stage so they reach the ranker.
Filter Bubbles: Injecting "Exploration" items from random/diverse categories.
Trade-offs:
Accuracy vs. Latency: Fewer experts in MMoE reduce latency but might hurt AUC.
Freshness vs. Stability: Incremental training is faster but prone to catastrophic forgetting; daily batches are stable but slower to adapt.