The Question
Design

Scalable Video Recommendation System

Design a large-scale video recommendation engine similar to YouTube. The system must provide personalized 'Home' feeds for hundreds of millions of users, handling a corpus of billions of videos with sub-200ms latency. Detail how you would solve the retrieval vs. ranking challenge, manage real-time user feedback loops, and ensure content diversity while maintaining high relevance.
NoSQL
Cassandra
Redis
Kafka
Flink
Spark
Vector DB
gRPC
ANN
HNSW
CDN
Protobuf
Questions & Insights

Clarifying Questions

Scale & Performance: What is the scale of Daily Active Users (DAU) and how many videos are in the corpus?
Freshness vs. Relevance: Should the system prioritize newly uploaded videos (freshness) or historical high-performance videos (relevance)?
Feedback Loops: Do we need to handle both explicit (likes, subscriptions) and implicit (watch time, clicks) signals?
Real-time Requirements: Should the recommendations update immediately after a user watches a video, or is a daily/hourly batch update sufficient?
Assumptions:
DAU: 500M.
Corpus: Billions of videos.
Latency: < 200ms for the recommendation list.
Feedback: Both implicit and explicit; sub-second latency for recording events, but model updates can be near real-time (minutes).
Strategy: Two-stage pipeline (Retrieval + Ranking) is the industry standard for this scale.

Thinking Process

The Retrieval Bottleneck: How do we narrow down billions of videos to a few hundred candidates in milliseconds? (Answer: Approximate Nearest Neighbor (ANN) search via embeddings).
The Ranking Bottleneck: How do we accurately score those hundreds of candidates using complex features (user history, video context)? (Answer: Deep neural networks with feature cross-layer).
Data Freshness: How do we incorporate a user's latest action into the feed immediately? (Answer: Online feature store updates and event-driven embedding refreshes).
Diversity & Exploration: How do we prevent filter bubbles? (Answer: Re-ranking stage for diversity and "exploration" sampling).

Bonus Points

Multi-gate Mixture-of-Experts (MMoE): Designing the ranking model to optimize for multiple objectives (e.g., watch time AND click-through rate) simultaneously.
Negative Feedback Handling: Implementing a "don't recommend this channel" or "not interested" signal with immediate cache invalidation.
Feature Drift Monitoring: Staff-level systems monitor the statistical distribution of features in real-time to detect if the model's training data no longer matches live traffic.
Cost-Efficient Training: Using "Negative Sampling" in the retrieval stage to train models on a fraction of the non-clicked data to save on compute costs.
Design Breakdown

Functional Requirements

Core Use Cases:
Fetch personalized home feed (list of videos).
Fetch "Up Next" / Sidebar recommendations for a specific video.
Record user interactions (watch duration, clicks, skips).
Scope Control:
In-scope: Recommendation serving pipeline, feature storage, event ingestion.
Out-of-scope: Video transcoding, search engine (text-based), ads placement (separate engine).

Non-Functional Requirements

Scale: Support 500M DAU and 100k+ QPS for recommendation requests.
Latency: P99 < 200ms for recommendation generation.
Availability: 99.99% availability (Highly available over strongly consistent).
Consistency: Eventual consistency for new video indexing and user history updates.
Fault Tolerance: If the ML ranking service fails, fallback to a cached global "Popular Videos" list.

Estimation

Traffic: 500M DAU. Assuming 20 feed refreshes/day = 10B daily requests.
Average QPS: ~115k QPS. Peak QPS: ~250k QPS.
Storage:
User history (1000 actions/user * 500M users) = 500B records.
Video features (1B videos * 1KB metadata) = 1TB.
Bandwidth:
Each response (~20 video IDs) is small (~1KB).
115k QPS * 1KB = 115MB/s outgoing.

Blueprint

This architecture follows the Retrieval-Ranking pattern. A "Recommendation Service" orchestrates the flow: first fetching 100-500 candidates from a Vector DB, then scoring them using a Ranking Service, and finally applying business logic (filtering/diversity).
Recommendation Service: Orchestrates retrieval, ranking, and post-processing.
User Event Service: Ingests clicks and watch time into a stream for real-time feature updates.
Feature Store: Stores low-latency user and video features for ranking.
Vector DB: Performs ANN search to find candidates based on user embeddings.
Ranking Service: High-compute ML service that scores candidates.
Simplicity Audit: This design avoids complex "Online Learning" (updating model weights in real-time) and instead focuses on "Online Ingestion" (updating feature inputs), which provides 90% of the value with significantly less operational risk.
Architecture Decision Rationale:
Why this architecture?: You cannot run a heavy Deep Learning model on 1 billion videos. Two-stage filtering is the only way to balance accuracy and latency.
Functional Satisfaction: Covers home feed and user feedback loops.
Non-functional Satisfaction: Vector DB handles scale, Feature Store handles low latency, and Kafka ensures eventual consistency for feedback.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery: Video thumbnails and metadata (titles) are cached on a global CDN.
API Gateway: Handles AuthN/AuthZ. It also implements Context Injection: attaching the user's device type, location (geo-hash), and current time to the request header to inform recommendations.

Service

Recommendation Service:
Protocol: gRPC for internal communication between services.
Workflow:
Call Retrieval Service for ~500 candidates.
Call Ranking Service to score candidates.
Post-Processing: Filter already watched videos, remove "not interested" content, and ensure diversity (e.g., no more than 3 videos from the same channel).
Resilience:
Fallbacks: If the Ranking Service times out, return the raw Retrieval list (shuffled).
Circuit Breaking: Applied to the Ranking Service to prevent cascading failure if a model update causes latency spikes.

Storage

Feature DB (Cassandra):
Access Pattern: Heavy writes (user clicks) and heavy random reads (ranking).
Schema:
user_features: user_id (PK), last_10_categories (list), watch_history_stats (blob).
video_features: video_id (PK), click_through_rate, total_views, embedding_version.
Vector DB (Milvus/Pinecone):
Stores high-dimensional embeddings (d=128 or 256).
Index: HNSW (Hierarchical Navigable Small World) for sub-10ms search latency.

Cache

Purpose: Reducing read load on Feature DB during Ranking.
Key-Value Schema:
Key: u:{user_id} or v:{video_id}.
Value: Protobuf-serialized feature vector.
Failure Handling: If Cache misses, fetch from Cassandra and write back. If Redis is down, Service Layer falls back to local LRU cache of popular video features.

Messaging

Purpose: Decoupling the feedback loop. When a user watches a video, the system must update the User Embedding and Feature Store asynchronously.
Event Schema: {user_id, video_id, action_type: "WATCH", duration: 120s, timestamp: ...}.
Technical Selection: Kafka. High throughput for billions of events. Partitioned by user_id to ensure chronological order of a single user's history during processing.

Data Processing

Processing Model:
Streaming (Flink): Aggregates real-time features (e.g., "how many times has this video been clicked in the last 5 minutes?"). Updates the Feature Store.
Batch (Spark): Weekly/Daily re-training of the Embedding model (Two-Tower Network) and the Ranking model (Deep & Cross Network).
Correctness: Checkpointing in Flink ensures no user events are lost.
Wrap Up

Advanced Topics

Trade-offs: We choose Precision over Recall in the Ranking stage and Recall over Precision in the Retrieval stage. This is a fundamental trade-off to manage the vast search space.
Cold Start: For new users, we provide "Popular" or "Trending" videos based on their Geo-location. For new videos, we use "Content-based Filtering" (extracting features from video tags/descriptions) until enough user interaction data exists.
Bottleneck Analysis:
Hot Shards: Popular videos (e.g., a viral music video) can cause hot partitions in the Feature Store. Mitigation: Cache the top 10,000 video features in local memory on every Ranking worker node.
Fan-out: Fetching features for 500 candidates simultaneously is expensive. Optimization: Use multi-get/batch requests and parallel I/O.
Security: PII scrubbing in the Data Processing Layer ensures user watch history used for training does not contain clear-text identifiers.