The Question

Scalable Video Recommendation System

Design a large-scale video recommendation engine similar to YouTube. The system must provide personalized 'Home' feeds for hundreds of millions of users, handling a corpus of billions of videos with sub-200ms latency. Detail how you would solve the retrieval vs. ranking challenge, manage real-time user feedback loops, and ensure content diversity while maintaining high relevance.

NoSQL

Cassandra

Redis

Kafka

Flink

Spark

Vector DB

gRPC

ANN

HNSW

CDN

Protobuf

Questions & Insights

Clarifying Questions

Scale & Performance: What is the scale of Daily Active Users (DAU) and how many videos are in the corpus?

Freshness vs. Relevance: Should the system prioritize newly uploaded videos (freshness) or historical high-performance videos (relevance)?

Feedback Loops: Do we need to handle both explicit (likes, subscriptions) and implicit (watch time, clicks) signals?

Real-time Requirements: Should the recommendations update immediately after a user watches a video, or is a daily/hourly batch update sufficient?

Assumptions:

DAU: 500M.

Corpus: Billions of videos.

Latency: < 200ms for the recommendation list.

Feedback: Both implicit and explicit; sub-second latency for recording events, but model updates can be near real-time (minutes).

Strategy: Two-stage pipeline (Retrieval + Ranking) is the industry standard for this scale.

Thinking Process

The Retrieval Bottleneck: How do we narrow down billions of videos to a few hundred candidates in milliseconds? (Answer: Approximate Nearest Neighbor (ANN) search via embeddings).

The Ranking Bottleneck: How do we accurately score those hundreds of candidates using complex features (user history, video context)? (Answer: Deep neural networks with feature cross-layer).

Data Freshness: How do we incorporate a user's latest action into the feed immediately? (Answer: Online feature store updates and event-driven embedding refreshes).

Diversity & Exploration: How do we prevent filter bubbles? (Answer: Re-ranking stage for diversity and "exploration" sampling).

Bonus Points

Multi-gate Mixture-of-Experts (MMoE): Designing the ranking model to optimize for multiple objectives (e.g., watch time AND click-through rate) simultaneously.

Negative Feedback Handling: Implementing a "don't recommend this channel" or "not interested" signal with immediate cache invalidation.

Feature Drift Monitoring: Staff-level systems monitor the statistical distribution of features in real-time to detect if the model's training data no longer matches live traffic.

Cost-Efficient Training: Using "Negative Sampling" in the retrieval stage to train models on a fraction of the non-clicked data to save on compute costs.

Design Breakdown

Functional Requirements

Core Use Cases:

Fetch personalized home feed (list of videos).

Fetch "Up Next" / Sidebar recommendations for a specific video.

Record user interactions (watch duration, clicks, skips).

Scope Control:

In-scope: Recommendation serving pipeline, feature storage, event ingestion.

Out-of-scope: Video transcoding, search engine (text-based), ads placement (separate engine).

Non-Functional Requirements

Scale: Support 500M DAU and 100k+ QPS for recommendation requests.

Latency: P99 < 200ms for recommendation generation.

Availability: 99.99% availability (Highly available over strongly consistent).

Consistency: Eventual consistency for new video indexing and user history updates.

Fault Tolerance: If the ML ranking service fails, fallback to a cached global "Popular Videos" list.

Estimation

Traffic: 500M DAU. Assuming 20 feed refreshes/day = 10B daily requests.

Average QPS: ~115k QPS. Peak QPS: ~250k QPS.

Storage:

User history (1000 actions/user * 500M users) = 500B records.

Video features (1B videos * 1KB metadata) = 1TB.

Bandwidth:

Each response (~20 video IDs) is small (~1KB).

115k QPS * 1KB = 115MB/s outgoing.

Blueprint

This architecture follows the Retrieval-Ranking pattern. A "Recommendation Service" orchestrates the flow: first fetching 100-500 candidates from a Vector DB, then scoring them using a Ranking Service, and finally applying business logic (filtering/diversity).

Recommendation Service: Orchestrates retrieval, ranking, and post-processing.

User Event Service: Ingests clicks and watch time into a stream for real-time feature updates.

Feature Store: Stores low-latency user and video features for ranking.

Vector DB: Performs ANN search to find candidates based on user embeddings.

Ranking Service: High-compute ML service that scores candidates.

Simplicity Audit: This design avoids complex "Online Learning" (updating model weights in real-time) and instead focuses on "Online Ingestion" (updating feature inputs), which provides 90% of the value with significantly less operational risk.

Architecture Decision Rationale:

Why this architecture?: You cannot run a heavy Deep Learning model on 1 billion videos. Two-stage filtering is the only way to balance accuracy and latency.

Functional Satisfaction: Covers home feed and user feedback loops.

Non-functional Satisfaction: Vector DB handles scale, Feature Store handles low latency, and Kafka ensures eventual consistency for feedback.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery: Video thumbnails and metadata (titles) are cached on a global CDN.

API Gateway: Handles AuthN/AuthZ. It also implements Context Injection: attaching the user's device type, location (geo-hash), and current time to the request header to inform recommendations.

Service

Recommendation Service:

Protocol: gRPC for internal communication between services.

Workflow:

Call Retrieval Service for ~500 candidates.

Call Ranking Service to score candidates.

Post-Processing: Filter already watched videos, remove "not interested" content, and ensure diversity (e.g., no more than 3 videos from the same channel).

Resilience:

Fallbacks: If the Ranking Service times out, return the raw Retrieval list (shuffled).

Circuit Breaking: Applied to the Ranking Service to prevent cascading failure if a model update causes latency spikes.

Storage

Feature DB (Cassandra):

Access Pattern: Heavy writes (user clicks) and heavy random reads (ranking).

Schema:

user_features: user_id (PK), last_10_categories (list), watch_history_stats (blob).

video_features: video_id (PK), click_through_rate, total_views, embedding_version.

Vector DB (Milvus/Pinecone):

Stores high-dimensional embeddings (

d=128

256

Index: HNSW (Hierarchical Navigable Small World) for sub-10ms search latency.

Cache

Purpose: Reducing read load on Feature DB during Ranking.

Key-Value Schema:

Key: u:{user_id} or v:{video_id}.

Value: Protobuf-serialized feature vector.

Failure Handling: If Cache misses, fetch from Cassandra and write back. If Redis is down, Service Layer falls back to local LRU cache of popular video features.

Messaging

Purpose: Decoupling the feedback loop. When a user watches a video, the system must update the User Embedding and Feature Store asynchronously.

Event Schema: {user_id, video_id, action_type: "WATCH", duration: 120s, timestamp: ...}.

Technical Selection: Kafka. High throughput for billions of events. Partitioned by user_id to ensure chronological order of a single user's history during processing.

Data Processing

Processing Model:

Streaming (Flink): Aggregates real-time features (e.g., "how many times has this video been clicked in the last 5 minutes?"). Updates the Feature Store.

Batch (Spark): Weekly/Daily re-training of the Embedding model (Two-Tower Network) and the Ranking model (Deep & Cross Network).

Correctness: Checkpointing in Flink ensures no user events are lost.

Wrap Up

Advanced Topics

Trade-offs: We choose Precision over Recall in the Ranking stage and Recall over Precision in the Retrieval stage. This is a fundamental trade-off to manage the vast search space.

Cold Start: For new users, we provide "Popular" or "Trending" videos based on their Geo-location. For new videos, we use "Content-based Filtering" (extracting features from video tags/descriptions) until enough user interaction data exists.

Bottleneck Analysis:

Hot Shards: Popular videos (e.g., a viral music video) can cause hot partitions in the Feature Store. Mitigation: Cache the top 10,000 video features in local memory on every Ranking worker node.

Fan-out: Fetching features for 500 candidates simultaneously is expensive. Optimization: Use multi-get/batch requests and parallel I/O.

Security: PII scrubbing in the Data Processing Layer ensures user watch history used for training does not contain clear-text identifiers.