The Question
DesignReal-time Trending Video Ranking System
Design a highly scalable system for YouTube to identify and display the top 10 most popular videos globally based on view count within a rolling 24-hour window. The system must handle over 100,000 write events per second with minimal lag and serve the leaderboard to millions of concurrent users with sub-50ms latency. Discuss how you handle high-cardinality data, viral 'hot' keys, and ensure the system remains available even during massive traffic spikes.
Kafka
Apache Flink
Redis
CDN
Load Balancer
ZSET
Count-Min Sketch
Sliding Window
Questions & Insights
Clarifying Questions
What is the definition of popular? (e.g., views in the last 24 hours, likes, or all-time engagement?)
Assumption: We will define "popular" as the most-viewed videos within a rolling 24-hour window.
What is the required freshness of the data? (e.g., real-time updates vs. hourly updates?)
Assumption: 1–5 minute delay is acceptable for the "Top 10" list.
What is the scale of the system? (e.g., DAU and daily video views?)
Assumption: YouTube scale—approx. 5 billion views per day, leading to ~60k average write QPS and peak 150k+.
Is the "Top 10" global or personalized/regional?
Assumption: A single global "Top 10" for the MVP to minimize complexity.
Is 100% precision required?
Assumption: No. Approximate rankings are acceptable at this scale to ensure low latency and high availability.
Thinking Process
Core Bottleneck: The massive write volume (view events) makes a synchronous database update impossible. We must decouple ingestion from aggregation.
Key Questions for Design:
How do we ingest 100k+ events per second without dropping data or slowing down the user experience? (Answer: Async Message Queue).
How do we aggregate these events in a rolling window efficiently? (Answer: Stream Processing).
Where do we store the final result so that millions of users can read it with sub-10ms latency? (Answer: In-memory Cache).
How do we handle "Hot Keys" (e.g., a viral video receiving millions of views simultaneously)? (Answer: Local pre-aggregation in the ingestion layer).
Bonus Points
Count-Min Sketch: Use a probabilistic data structure for frequency estimation to save 99% of memory compared to a hash map at the cost of slight over-estimation.
Local Pre-aggregation: In-memory buffering on the ingestion service (e.g., 10 seconds of views for Video X) to reduce the number of messages sent to Kafka.
Lambda Architecture: Use a batch layer (Hadoop/Spark) for 100% accurate daily reports and a speed layer (Flink) for real-time trending, merging results if necessary.
Read-Heavy Optimization: Push the "Top 10" list to the Edge/CDN to avoid hitting the origin server for every request.
Design Breakdown
Functional Requirements
Core Use Cases:
Ingest video view events from millions of clients.
Compute the top 10 most viewed videos in a rolling 24-hour window.
Provide an API endpoint to retrieve the current Top 10 list.
Scope Control:
In-scope: Global trending list, real-time view counting, high-throughput ingestion.
Out-of-scope: Personalization (recommendations), regional trending, video playback/streaming, comment processing.
Non-Functional Requirements
Scale: Must handle 150k+ write QPS and millions of read QPS.
Latency: Read latency < 50ms; Processing lag < 5 mins.
Availability & Reliability: 99.99% availability for the read path; no single point of failure (SPOF).
Consistency: Eventual consistency is sufficient; exact counts are less critical than ranking order.
Fault Tolerance: The system must recover from node failures in the stream processing layer without losing significant data.
Estimation
Traffic:
5 Billion views / 86,400 seconds ≈ 58k QPS (Average).
Peak = 2x - 3x Average ≈ 150k QPS.
Storage:
Top 10 List: 10 * (8 bytes VideoID + 8 bytes Count) ≈ 160 bytes. (Negligible).
Aggregation State (if using a Map): 1 million active videos * (16 bytes) ≈ 16MB. (Very small).
Bandwidth:
Ingress: 150k events/s * 100 bytes/event ≈ 15MB/s (Manageable for a Kafka cluster).
Blueprint
Concise Summary: A streaming pipeline architecture that captures view events via an API, buffers them in a message queue, aggregates them using a rolling-window stream processor, and serves the final top 10 list from an in-memory cache.
Major Components:
Ingestion Service: Stateless workers that receive view heartbeats and perform initial batching.
Kafka (Messaging Layer): Durable log for decoupling event production from processing.
Flink (Data Processing Layer): Statefully aggregates counts over a 24-hour sliding window.
Redis (Cache Layer): Stores the final Top 10 list for low-latency retrieval.
Simplicity Audit: This design avoids complex relational joins or heavy database writes. It uses a unidirectional data flow which is easy to scale and monitor.
Architecture Decision Rationale:
Why this architecture?: Stream processing is the industry standard for "Top K" problems at scale. It handles the "Big Data" aspect of views while maintaining low latency.
Functional Satisfaction: Direct mapping from events to a sorted leaderboard.
Non-functional Satisfaction: Kafka provides durability; Flink provides scalability; Redis provides the required read latency.
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery & Traffic Routing:
Global Load Balancing: Use Latency-based DNS to route traffic to the nearest Ingestion cluster.
CDN: The API Service response for
/top-10 is cached at the Edge with a TTL of 60 seconds to drastically reduce origin load.Security:
Rate Limiting: Applied at the API Gateway to prevent bot-driven view inflation.
Service
Topology & Scaling:
Ingestion Service: Stateless, horizontally scaled on K8s based on CPU/Network usage.
Local Aggregation: To mitigate Kafka pressure, workers buffer views in a local ConcurrentHashMap and flush to Kafka every 1 second or 1000 events.
API Schema Design:
Endpoint:
GET /v1/trendingResponse:
{"videos": [{"id": "abc", "rank": 1}, ...]}Protocol: REST over HTTP/2.
Resilience:
Retries: Client-side retries with exponential backoff for ingestion.
Storage
Technical Selection: Redis (Sorted Set / ZSET).
Access Pattern:
Flink writes the top 1,000 videos to the Redis ZSET every minute.
API Service calls
ZREVRANGE trending_videos 0 9 to get the Top 10.Distribution Logic:
Since the dataset is tiny (Top 1000 videos), we use a Redis Master-Replica setup for high availability. No sharding is required.
Cache
Purpose: To serve the final "Top 10" list at high QPS.
Key-Value Schema:
Key:
trending_globalStructure:
ZSET (Member: VideoID, Score: ViewCount).TTL: No TTL, explicitly overwritten by Flink.
Failure Handling: If Redis is empty, the API Service can fall back to the last known good result stored in a persistent backup (e.g., S3 or a secondary DB).
Messaging
Purpose: Decouples the ingestion of billions of events from the compute-heavy aggregation.
Throughput & Partitioning:
Partition key:
VideoID. This ensures all views for a specific video go to the same Flink task for easier local aggregation.Technical Selection: Kafka. High throughput and data retention for 24 hours allow for replay in case of Flink failure.
Data Processing
Processing Model: Sliding Window. (e.g., 24-hour window that slides every 1 minute).
Processing DAG:
Source: Kafka. Map: Deserialize and filter invalid events. Window Aggregate: Sum views per VideoID. Global Rank: A final operator collects the Top K from all partitions and outputs to Redis.Technical Selection: Apache Flink. Best-in-class for stateful stream processing and windowing.
Optimization: Use Two-stage Aggregation (Aggregate locally by partition first, then merge) to avoid the "Global Top K" becoming a bottleneck.
Wrap Up
Advanced Topics
Trade-offs: We trade Consistency for Availability. A view might take 2 minutes to reflect in the Top 10 list, but the system will never crash under high load.
Reliability: If Flink fails, Kafka's 24-hour retention allows us to reprocess events. Flink Checkpoints ensure we don't start from zero.
Bottleneck Analysis:
Hot Video Partition: If one video gets 90% of views, a single Kafka partition/Flink task will be overwhelmed.
Mitigation: Use "Salting" for the partition key or pre-aggregate in the Ingestion Service to reduce the volume of a single key.
Security: View-count auditing is needed to prevent "view farming." This can be a separate offline Spark job that detects patterns and issues "correction" events to the stream.