The Question

Scalable Video Feed & Recommendation System

Design a high-scale short-form video feed system similar to TikTok's 'For You' page. The system must support 500 million daily active users, provide personalized content with sub-200ms latency, and handle massive ingestion of user interaction signals. Explain how you would architect the multi-stage recommendation pipeline (Retrieval and Ranking), manage global video distribution, and ensure seamless playback across varying network conditions while maintaining high availability.

Kafka

Flink

Redis

Cassandra

Milvus

Vector DB

CDN

HLS

gRPC

NoSQL

Machine Learning

Questions & Insights

Clarifying Questions

Scale: What is the target Daily Active Users (DAU) and the average number of videos watched per user daily?

Content Type: Are we supporting only short-form videos (e.g., < 60s), or do we need to handle long-form/live streams for the MVP?

Personalization Latency: How quickly should a user's interaction (e.g., liking a video) influence their feed? Is real-time feedback (seconds) required, or is near-real-time (minutes) acceptable?

Global Distribution: Do we need to support multi-region delivery with localized content, or is a single-region deployment sufficient for the MVP?

Assumptions:

Scale: 500M DAU, 40 videos watched per user/day.

Content: Short-form videos only (~15-60s).

Personalization: Hybrid approach. Near-real-time feedback for the next session, but the immediate "next video" uses a pre-computed candidate pool.

Availability: High availability is prioritized over strict consistency for feed generation.

Thinking Process

Core Bottleneck: The primary challenge is not just delivering data, but the "Recommendation Cold Start" and the massive write-volume of interaction events (likes, views, skips) that drive the ML models.

Key Questions for Design Path:

How do we retrieve a small set of relevant candidates from billions of videos in < 100ms?

How do we ingest and process TBs of interaction logs daily without crashing the feed service?

How do we balance "Exploitation" (showing what they like) vs. "Exploration" (showing new categories) in the feed?

How do we ensure video playback starts instantly (Zero-latency perception)?

Bonus Points

Vector Embeddings & ANN: Using Vector Databases (e.g., Milvus or Pinecone) for Approximate Nearest Neighbor (ANN) search to perform semantic retrieval in the recommendation pipeline.

Online Machine Learning: Implementation of a "Flin-to-Model" pipeline where user signals are processed via Flink and fed into a parameter server for sub-second model updates.

QUIC/HTTP3 Protocol: Utilizing QUIC to reduce head-of-line blocking and improve video start times in lossy mobile network conditions.

Cost-Efficient Storage Tiering: Moving older, less popular videos to "Cold Storage" (S3 Glacier) while keeping viral hits in multi-tier CDNs.

Design Breakdown

Functional Requirements

Core Use Cases:

Users can scroll through an infinite "For You" feed of short videos.

Users can "Like" or "Follow" to influence the feed.

Creators can upload videos.

Video playback must be seamless (auto-play).

Scope Control:

In-scope: Feed generation, video metadata management, basic interaction tracking.

Out-of-scope: Live streaming, social messaging, ad-engine integration, search functionality.

Non-Functional Requirements

Scale: Support 500M DAU and 20B video views per day.

Latency: Feed generation (metadata) < 200ms; Video start time < 500ms.

Availability & Reliability: 99.99% availability; the system must serve a "fallback" popular feed if the recommendation engine fails.

Consistency: Eventual consistency for like counts and view counts.

Fault Tolerance: Isolated failure of the ML ranking service should not stop users from seeing videos.

Estimation

Traffic Estimation:

Read QPS: 500M DAU * 40 videos / 86400s ≈ 230k QPS (Average). Peak QPS ≈ 500k-700k.

Write QPS (Interactions): Each video view generates ~5 events (start, 25%, 50%, finish, like) ≈ 1.1M QPS (Average).

Storage Estimation:

Video Metadata: 1B videos * 1KB/record = 1TB.

Video Blobs: 1B videos * 5MB/video = 5PB.

Bandwidth Estimation:

Outgoing (Video Stream): 230k views/s * 5MB = 1.15 TB/s (mostly handled by CDN).

Blueprint

The design follows a "Retrieval & Ranking" architecture. A Feed Service orchestrates the flow by fetching candidates from a pre-computed cache or a Vector DB, sending them to a Ranking Service, and returning the ordered list to the client.

Major Components:

API Gateway: Handles authentication, rate limiting, and request routing.

Feed Service: Orchestrates the assembly of the user's feed.

Recommendation Service: Multi-stage pipeline (Retrieval -> Ranking -> Re-ranking).

Interaction Service: High-throughput ingestion of user signals (likes, watch time).

Video Service: Manages video uploads, transcoding, and metadata storage.

Simplicity Audit: We use a pre-computed feed strategy for most users to minimize runtime computation, falling back to a "Popularity-based" feed for guests or system failures.

Architecture Decision Rationale:

Retrieval/Ranking Split: Necessary because ranking 1 billion videos in real-time is computationally impossible.

Event-driven Analytics: Decouples the feed delivery from the complex ML feature engineering.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing:

Use a global CDN (Cloudflare/Akamai) for video segment delivery (.m3u8/HLS).

Latency-based DNS routing to the nearest regional data center.

Security & Perimeter:

API Gateway handles JWT validation.

Rate limiting set at 100 feed requests per minute per user to prevent scraping.

Service

Topology & Scaling:

Feed Service: Stateless, scaled horizontally based on QPS. Multi-AZ deployment.

Recommendation Service: Heavy CPU/GPU usage; scaled based on inference latency.

API Schema Design:

GET /v1/feed:

Request: user_id, device_id, cursor, count.

Response: list<video_objects>, next_cursor.

POST /v1/interactions:

Request: video_id, interaction_type (LIKE/VIEW/SKIP), duration_ms.

Resilience:

Circuit breaker on the Recommendation Service. If it times out, the Feed Service returns a cached "Global Top 20" list.

Storage

Access Pattern:

Video Metadata: High read, low write.

Interactions: Extremely high write.

Database Table Design:

Video Metadata (NoSQL - Cassandra): video_id (K), creator_id, s3_url, thumbnail_url, tags, created_at.

User Features (Key-Value - DynamoDB): user_id (K), preferred_tags, embedding_vector, last_active.

Technical Selection:

Cassandra: For metadata, as it handles high-volume writes and scales linearly.

Vector DB (Milvus): To store video embeddings for real-time similarity retrieval.

Cache

Purpose & Justification: Pre-computed feeds are stored in Redis to ensure < 50ms response times for the initial app load.

Key-Value Schema:

Key: feed:{user_id}.

Value: List<video_id>.

TTL: 2 hours.

Failure Handling: If Redis is down, the system triggers a "Cold Start" retrieval from the Recommendation Service.

Messaging

Purpose & Decoupling: Kafka decouples user actions (high frequency) from the heavy ML training and feature update processes.

Event Schema: timestamp, user_id, video_id, event_type.

Throughput: Partitioned by user_id to ensure chronological order of actions for a single user during processing.

Technical Selection: Kafka is chosen for its durability and ability to replay events for ML model backtesting.

Data Processing

Processing Model: Streaming (Flink) for real-time feature updates (e.g., if a user watches 3 cat videos, update their "interest vector" immediately).

Processing DAG:

Source: Kafka.

Map: Extract user/video features.

Aggregate: Update rolling counters (e.g., likes in last 10 mins).

Sink: Vector DB / Feature Store.

Technical Selection: Apache Flink for low-latency stateful processing.

Wrap Up

Advanced Topics

Trade-offs (Latency vs. Freshness): We pre-compute feeds (high freshness/low latency) but use a hybrid approach where the next "page" of the feed can be re-ranked based on the last 5 minutes of activity.

Reliability: If the main recommendation pipeline fails, we use a "Diversity-based Fallback" (randomly sampled popular videos from various categories) to ensure the user never sees an empty screen.

Zero-Latency Video: The API response includes the first 2 seconds of video as a data URI or a high-priority pre-fetch URL for the client to begin buffering immediately.

Security: PII scrubbing in the Interaction Service before data reaches the ML logs to comply with GDPR/CCPA.