The Question
Design

Video Recommendation System

Design a personalized video recommendation system at YouTube's scale. The system should surface relevant content to hundreds of millions of users in real time based on watch history, engagement signals, and collaborative filtering, while balancing freshness and diversity.
Vector Database
ANN
Two-Tower Model
Feature Store
Spark
Questions & Insights

Thinking Process

To design a recommendation system like YouTube, focus on the "Information Retrieval" funnel. You cannot rank 1 billion videos in real-time; you must filter them down through stages.
The Funnel Architecture: How do we reduce 1 billion videos to 10-20 candidates for a specific user in under 200ms? (Candidate Generation -> Ranking -> Re-ranking).
The Retrieval Bottleneck: How do we mathematically find "similar" content without a full table scan? (Vector Embeddings & Approximate Nearest Neighbor/ANN).
Feature Freshness: How do we balance historical preferences (what you liked last year) with context (what you are watching right now)? (Feature Stores & Real-time context injection).
The Cold Start Problem: How does the system handle a brand-new user or a brand-new video with zero engagement data?

Bonus Points

Two-Tower Architecture: Using a "User Tower" and "Video Tower" to generate embeddings independently, allowing for extremely fast dot-product similarity searches in production.
Exploration vs. Exploitation (E&E): Implementing Multi-Armed Bandits or epsilon-greedy strategies to ensure users aren't trapped in an "echo chamber" by occasionally showing new/diverse content.
Multi-gate Mixture-of-Experts (MMoE): Optimizing for multiple objectives simultaneously (e.g., Click-Through Rate vs. Watch Time vs. Shares) rather than a single metric.
Negative Downsampling: Crucial for training; the system needs to learn not just what users liked, but strategically ignore the millions of things they didn't interact with.
Design Breakdown

Functional Requirements

Generate Recommendations: Provide a list of top-N videos for a user upon opening the home page.
Capture Feedback: Track clicks, views, likes, and "not interested" signals to improve future recommendations.
Content Variety: Ensure the feed is not 100% identical every time the user refreshes.

Non-Functional Requirements

Low Latency: Recommendations must be served in < 200ms.
Scalability: Support billions of videos and millions of concurrent users.
High Availability: The system must serve a "default" list (e.g., trending) if the personalized engine fails.
Data Consistency: Eventual consistency is acceptable for feedback loops (a "like" doesn't need to change the feed in milliseconds).

Estimation

Daily Active Users (DAU): 500 Million.
Total Videos: 1 Billion+.
Throughput: 500M DAU / 86400s \approx 6,000 requests per second (RPS) average, peaking at 15,000+ RPS.
Storage (Metadata): 1 Billion videos * 1KB metadata = 1TB.
Embeddings: 1 Billion videos * 256-dim float vector (1KB) = 1TB (Vector Index).

Blueprint

Concise Summary: A classic two-stage pipeline using an offline training loop and an online inference engine. The system retrieves 100s of candidates via Vector Search and ranks the top 20 via a Deep Neural Network.
Major Components:
Recommendation Service: The orchestrator that fetches user context and coordinates retrieval and ranking.
Candidate Retrieval (ANN): Uses a Vector Database (like Milvus/Pinecone) to find videos close to the user's embedding.
Ranking Service: A heavy ML inference service (Triton/TensorFlow Serving) that scores the retrieved candidates.
Data Processing (Spark): Offline pipeline to compute embeddings and aggregate user features.
Simplicity Audit: This design avoids complex real-time "Online Learning" (training on every click) in favor of batch updates, which is significantly easier to maintain and monitor for an MVP.
Architecture Decision Rationale:
Why this architecture?: The "Funnel" approach is the industry standard for any recommendation system operating at scale. Direct ranking is computationally impossible for 10^9 items.
Functional Satisfaction: Meets the need for personalized lists and feedback capture via a decoupled event-driven pipeline.
Non-functional Satisfaction: Latency is managed by doing the heavy lifting (training/embedding generation) offline and using high-speed Vector search and Feature Stores (Redis) for online requests.

High Level Architecture

Sub-system Deep Dive

Service

Recommendation Service: Stateless Go/Java service. It takes user_id, calls the Vector DB for the top 200 candidates, hydrates those candidates with features from Redis, and sends them to the Ranking Service.
API Spec:
GET /v1/recommendations?user_id=123&limit=20
POST /v1/events (Click, WatchTime, Dislike)
Scaling: Horizontal scaling behind an NLB (Network Load Balancer).

Storage

Data Model (Vector DB): Stores video_id and its corresponding float32[256] embedding vector. Indexed using HNSW (Hierarchical Navigable Small World) for logarithmic search time.
Database Logic:
Primary Data Store (Data Lake): S3/HDFS stores the source of truth for all interactions.
Metadata: PostreSQL/Spanner for video titles, descriptions, and uploader info.

Cache

Redis (Feature Store): Used for extremely low-latency lookups of pre-computed features (e.g., User's last 50 watched categories, Video's average watch time in the last hour).
TTL: 24 hours for user features, updated by the Spark pipeline.
Eviction: Least Recently Used (LRU).

Messaging

Kafka: Acts as the buffer for all user interaction events.
Topic Structure: user-interactions topic partitioned by user_id to ensure chronological order of events for a single user.
Delivery Guarantee: At-least-once delivery is sufficient for recommendation training data.

Data Processing

Spark / Batch Jobs:
Job A: Aggregates raw logs into feature tables (e.g., total likes per user per category).
Job B (Embedding Gen): Runs the "Video Tower" of the Neural Network to generate new embeddings for newly uploaded videos.
Windowing: Hourly batches for feature updates; daily for model re-training.
Wrap Up

Advanced Topics

Monitoring:
Metrics: Click-Through Rate (CTR), Mean Reciprocal Rank (MRR), P99 Latency of the Ranking Service.
Tools: Prometheus/Grafana for system health; ELK for log analysis.
Trade-offs:
Batch vs. Real-time: We chose batch for embeddings (YAGNI). This means new videos might take an hour to appear in personalized recommendations (mitigated by a "Trending/New" retrieval heuristic).
Consistency: The system is eventually consistent. A user liking a video won't immediately change their home feed until the next feature update.
Bottlenecks: The Ranking Service is the most CPU/GPU intensive. We use a two-stage process specifically to prevent the Ranking Service from being the bottleneck.
Failure Handling:
Fallback: If the Recommendation Service times out, the API Gateway returns a cached "Global Trending" list.
Replication: Vector DB and Redis are deployed across multiple Availability Zones (AZs).
Alternatives & Optimization:
Approximate Nearest Neighbor: Instead of HNSW, one could use FAISS (Facebook AI Similarity Search) if memory efficiency is more critical than search speed.
Knowledge Distillation: Use a smaller, faster model for the Ranking Service to further reduce latency at the cost of slight accuracy loss.