The Question

Real-time Trending Video Ranking System

Design a highly scalable system for YouTube to identify and display the top 10 most popular videos globally based on view count within a rolling 24-hour window. The system must handle over 100,000 write events per second with minimal lag and serve the leaderboard to millions of concurrent users with sub-50ms latency. Discuss how you handle high-cardinality data, viral 'hot' keys, and ensure the system remains available even during massive traffic spikes.

Kafka

Apache Flink

Redis

CDN

Load Balancer

ZSET

Count-Min Sketch

Sliding Window

Questions & Insights

Clarifying Questions

What is the definition of popular? (e.g., views in the last 24 hours, likes, or all-time engagement?)

Assumption: We will define "popular" as the most-viewed videos within a rolling 24-hour window.

What is the required freshness of the data? (e.g., real-time updates vs. hourly updates?)

Assumption: 1–5 minute delay is acceptable for the "Top 10" list.

What is the scale of the system? (e.g., DAU and daily video views?)

Assumption: YouTube scale—approx. 5 billion views per day, leading to ~60k average write QPS and peak 150k+.

Is the "Top 10" global or personalized/regional?

Assumption: A single global "Top 10" for the MVP to minimize complexity.

Is 100% precision required?

Assumption: No. Approximate rankings are acceptable at this scale to ensure low latency and high availability.

Thinking Process

Core Bottleneck: The massive write volume (view events) makes a synchronous database update impossible. We must decouple ingestion from aggregation.

Key Questions for Design:

How do we ingest 100k+ events per second without dropping data or slowing down the user experience? (Answer: Async Message Queue).

How do we aggregate these events in a rolling window efficiently? (Answer: Stream Processing).

Where do we store the final result so that millions of users can read it with sub-10ms latency? (Answer: In-memory Cache).

How do we handle "Hot Keys" (e.g., a viral video receiving millions of views simultaneously)? (Answer: Local pre-aggregation in the ingestion layer).

Bonus Points

Count-Min Sketch: Use a probabilistic data structure for frequency estimation to save 99% of memory compared to a hash map at the cost of slight over-estimation.

Local Pre-aggregation: In-memory buffering on the ingestion service (e.g., 10 seconds of views for Video X) to reduce the number of messages sent to Kafka.

Lambda Architecture: Use a batch layer (Hadoop/Spark) for 100% accurate daily reports and a speed layer (Flink) for real-time trending, merging results if necessary.

Read-Heavy Optimization: Push the "Top 10" list to the Edge/CDN to avoid hitting the origin server for every request.

Design Breakdown

Functional Requirements

Core Use Cases:

Ingest video view events from millions of clients.

Compute the top 10 most viewed videos in a rolling 24-hour window.

Provide an API endpoint to retrieve the current Top 10 list.

Scope Control:

In-scope: Global trending list, real-time view counting, high-throughput ingestion.

Out-of-scope: Personalization (recommendations), regional trending, video playback/streaming, comment processing.

Non-Functional Requirements

Scale: Must handle 150k+ write QPS and millions of read QPS.

Latency: Read latency < 50ms; Processing lag < 5 mins.

Availability & Reliability: 99.99% availability for the read path; no single point of failure (SPOF).

Consistency: Eventual consistency is sufficient; exact counts are less critical than ranking order.

Fault Tolerance: The system must recover from node failures in the stream processing layer without losing significant data.

Estimation

Traffic:

5 Billion views / 86,400 seconds ≈ 58k QPS (Average).

Peak = 2x - 3x Average ≈ 150k QPS.

Storage:

Top 10 List: 10 * (8 bytes VideoID + 8 bytes Count) ≈ 160 bytes. (Negligible).

Aggregation State (if using a Map): 1 million active videos * (16 bytes) ≈ 16MB. (Very small).

Bandwidth:

Ingress: 150k events/s * 100 bytes/event ≈ 15MB/s (Manageable for a Kafka cluster).

Blueprint

Concise Summary: A streaming pipeline architecture that captures view events via an API, buffers them in a message queue, aggregates them using a rolling-window stream processor, and serves the final top 10 list from an in-memory cache.

Major Components:

Ingestion Service: Stateless workers that receive view heartbeats and perform initial batching.

Kafka (Messaging Layer): Durable log for decoupling event production from processing.

Flink (Data Processing Layer): Statefully aggregates counts over a 24-hour sliding window.

Redis (Cache Layer): Stores the final Top 10 list for low-latency retrieval.

Simplicity Audit: This design avoids complex relational joins or heavy database writes. It uses a unidirectional data flow which is easy to scale and monitor.

Architecture Decision Rationale:

Why this architecture?: Stream processing is the industry standard for "Top K" problems at scale. It handles the "Big Data" aspect of views while maintaining low latency.

Functional Satisfaction: Direct mapping from events to a sorted leaderboard.

Non-functional Satisfaction: Kafka provides durability; Flink provides scalability; Redis provides the required read latency.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing:

Global Load Balancing: Use Latency-based DNS to route traffic to the nearest Ingestion cluster.

CDN: The API Service response for /top-10 is cached at the Edge with a TTL of 60 seconds to drastically reduce origin load.

Security:

Rate Limiting: Applied at the API Gateway to prevent bot-driven view inflation.

Service

Topology & Scaling:

Ingestion Service: Stateless, horizontally scaled on K8s based on CPU/Network usage.

Local Aggregation: To mitigate Kafka pressure, workers buffer views in a local ConcurrentHashMap and flush to Kafka every 1 second or 1000 events.

API Schema Design:

Endpoint: GET /v1/trending

Response: {"videos": [{"id": "abc", "rank": 1}, ...]}

Protocol: REST over HTTP/2.

Resilience:

Retries: Client-side retries with exponential backoff for ingestion.

Storage

Technical Selection: Redis (Sorted Set / ZSET).

Access Pattern:

Flink writes the top 1,000 videos to the Redis ZSET every minute.

API Service calls ZREVRANGE trending_videos 0 9 to get the Top 10.

Distribution Logic:

Since the dataset is tiny (Top 1000 videos), we use a Redis Master-Replica setup for high availability. No sharding is required.

Cache

Purpose: To serve the final "Top 10" list at high QPS.

Key-Value Schema:

Key: trending_global

Structure: ZSET (Member: VideoID, Score: ViewCount).

TTL: No TTL, explicitly overwritten by Flink.

Failure Handling: If Redis is empty, the API Service can fall back to the last known good result stored in a persistent backup (e.g., S3 or a secondary DB).

Messaging

Purpose: Decouples the ingestion of billions of events from the compute-heavy aggregation.

Throughput & Partitioning:

Partition key: VideoID. This ensures all views for a specific video go to the same Flink task for easier local aggregation.

Technical Selection: Kafka. High throughput and data retention for 24 hours allow for replay in case of Flink failure.

Data Processing

Processing Model: Sliding Window. (e.g., 24-hour window that slides every 1 minute).

Processing DAG:

Source: Kafka.

Map: Deserialize and filter invalid events.

Window Aggregate: Sum views per VideoID.

Global Rank: A final operator collects the Top K from all partitions and outputs to Redis.

Technical Selection: Apache Flink. Best-in-class for stateful stream processing and windowing.

Optimization: Use Two-stage Aggregation (Aggregate locally by partition first, then merge) to avoid the "Global Top K" becoming a bottleneck.

Wrap Up

Advanced Topics

Trade-offs: We trade Consistency for Availability. A view might take 2 minutes to reflect in the Top 10 list, but the system will never crash under high load.

Reliability: If Flink fails, Kafka's 24-hour retention allows us to reprocess events. Flink Checkpoints ensure we don't start from zero.

Bottleneck Analysis:

Hot Video Partition: If one video gets 90% of views, a single Kafka partition/Flink task will be overwhelmed.

Mitigation: Use "Salting" for the partition key or pre-aggregate in the Ingestion Service to reduce the volume of a single key.

Security: View-count auditing is needed to prevent "view farming." This can be a separate offline Spark job that detects patterns and issues "correction" events to the stream.