The Question

Global Video Streaming Platform

Design a large-scale video sharing and streaming platform similar to YouTube. The system must support millions of daily video uploads and billions of views globally. Key challenges include reliable large-file ingestion, efficient asynchronous video transcoding for multiple device resolutions, and low-latency global content delivery. Discuss how you would handle adaptive bitrate streaming, cost-effective storage for massive datasets, and maintaining system availability during traffic spikes.

CDN

Kafka

PostgreSQL

Redis

FFmpeg

HLS

DASH

Microservices

TUS Protocol

Kubernetes

Questions & Insights

Clarifying Questions

What is the scale of the system? (Assumed: 1 Billion DAU, 5 Million new video uploads/day, 5 Billion video views/day).

What are the core features for the MVP? (Assumed: Video uploading, metadata storage, video processing/transcoding, and low-latency video streaming).

What are the constraints on video size and format? (Assumed: Support up to 4K resolution, max 2GB per file, MP4/MOV inputs).

Is global availability required? (Assumed: Yes, users are worldwide, requiring a Content Delivery Network (CDN) strategy).

What is the target latency for playback start? (Assumed: Under 200ms for the first frame).

Thinking Process

The design revolves around the massive asymmetry between write (upload/transcode) and read (stream) traffic, alongside the sheer volume of binary data.

How do we handle massive video uploads reliably? Use resumable uploads via a pre-signed URL to decouple the application server from the heavy lifting of byte-stream ingestion.

How do we ensure videos are playable on any device? Implement an asynchronous transcoding pipeline that converts raw uploads into multiple resolutions (1080p, 720p, 480p) and formats (HLS/DASH).

How do we scale video delivery globally? Utilize a multi-tiered CDN strategy to cache popular content at the edge, closer to users.

How do we manage metadata at scale? Separate video binary storage (Object Store) from video metadata (Sharded RDBMS/NoSQL) and utilize caching for frequently accessed video info.

Bonus Points

Adaptive Bitrate Streaming (ABR): Mentioning HLS/DASH to dynamically adjust video quality based on the user's real-time network conditions.

Cost-Optimized Storage Tiers: Moving older, unpopular videos from "Standard" Object Storage to "Cold" storage (e.g., S3 Glacier) to save millions in infrastructure costs.

Video Chunking: Uploading and processing videos in small chunks (e.g., 2-5 seconds) to allow parallel transcoding and faster "time-to-first-view" before the entire file is processed.

CDN Pre-warming: Intelligently pushing highly anticipated content (e.g., a trailer from a major studio) to the edge before it "goes live" to prevent a thundering herd on the origin.
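
As a rough illustration of the ABR idea above, a player can pick the highest rendition whose bitrate fits inside a safety margin of its measured throughput. This is a minimal sketch, not how any particular HLS/DASH player works: the bitrate ladder, the 0.8 safety factor, and the names Rendition and pick_rendition are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rendition:
    name: str
    bitrate_kbps: int  # hypothetical ladder value, not from the design


# Hypothetical bitrate ladder, sorted from highest to lowest quality.
LADDER = [
    Rendition("1080p", 5000),
    Rendition("720p", 2800),
    Rendition("480p", 1400),
]


def pick_rendition(throughput_kbps: float, safety: float = 0.8) -> Rendition:
    """Return the best rendition that fits within a fraction of throughput.

    Falls back to the lowest rendition when nothing fits, so playback
    can still start on a very slow connection.
    """
    budget = throughput_kbps * safety
    for rendition in LADDER:
        if rendition.bitrate_kbps <= budget:
            return rendition
    return LADDER[-1]
```

Real players also factor in buffer occupancy and avoid switching too often; the throughput-only rule here is the simplest variant.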

Design Breakdown

Functional Requirements

Core Use Cases:

Users can upload videos.

Users can view videos with minimal buffering.

Users can search for videos by title.

Content creators receive notifications when transcoding is complete.

Scope Control:

In-scope: Uploading, Transcoding, Streaming, Metadata management.

Out-of-scope: Recommendations engine, real-time comments, live streaming, copyright Content ID (for MVP).

Non-Functional Requirements

Scale: Support storage growth of roughly 1 PB/day and millions of concurrent streams.

Latency: Low latency for video playback start and search queries.

Availability & Reliability: 99.99% availability; no data loss for uploaded videos.

Consistency: Eventual consistency is acceptable for view counts and search indexing.

Security & Privacy: Support for private/unlisted videos and secure pre-signed upload URLs.

Estimation

Traffic Estimation:

Uploads: 5M/day ≈ 60 uploads/sec.

Views: 5B/day ≈ 60,000 QPS (Read).

Storage Estimation:

5M videos * 200MB (average compressed size) = 1 PB/day.

365 PB/year, before counting transcoded renditions and replication.

Bandwidth Estimation:

Ingress: 60 uploads/sec * 200MB = 12 GB/s.

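Egress: Sustained egress is concurrent streams * average bitrate, not view starts/sec * bitrate. With 60,000 view starts/sec and an assumed average watch time of 5 minutes, about 18M streams are live at once; at an average bitrate of 2 Mbps that is roughly 36 Tbps (massively distributed via CDN).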
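
The arithmetic behind these estimates can be checked with a few lines of Python. This is a back-of-the-envelope sketch: the daily volumes, 200 MB average size, and 2 Mbps bitrate come from the estimates above, while the 5-minute average watch time is an assumption introduced here so that concurrency can be derived from the view rate.

```python
SECONDS_PER_DAY = 86_400

uploads_per_day = 5_000_000
views_per_day = 5_000_000_000
avg_video_bytes = 200 * 10**6   # 200 MB, decimal units
avg_bitrate_bps = 2 * 10**6     # 2 Mbps
avg_watch_seconds = 300         # ASSUMPTION: 5 minutes, not a measured value

uploads_per_sec = uploads_per_day / SECONDS_PER_DAY
views_per_sec = views_per_day / SECONDS_PER_DAY

# Storage and ingress follow directly from upload volume and file size.
storage_per_day_pb = uploads_per_day * avg_video_bytes / 10**15
ingress_gb_per_sec = uploads_per_sec * avg_video_bytes / 10**9

# Little's law: concurrent streams = arrival rate * average duration.
concurrent_streams = views_per_sec * avg_watch_seconds
egress_tbps = concurrent_streams * avg_bitrate_bps / 10**12

if __name__ == "__main__":
    print(f"uploads/sec        ~ {uploads_per_sec:.0f}")
    print(f"views/sec          ~ {views_per_sec:.0f}")
    print(f"storage/day (PB)   = {storage_per_day_pb:.1f}")
    print(f"ingress (GB/s)     ~ {ingress_gb_per_sec:.1f}")
    print(f"concurrent streams ~ {concurrent_streams:,.0f}")
    print(f"egress (Tbps)      ~ {egress_tbps:.1f}")
```

The egress result scales linearly with the assumed watch time, so that single parameter dominates the CDN capacity estimate.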

Blueprint

Concise Summary: A microservices architecture leveraging an asynchronous processing pipeline for video transcoding and a globally distributed CDN for delivery.

Major Components:

API Gateway: Entry point for auth, rate limiting, and routing.

Upload Service: Generates pre-signed URLs and manages upload sessions.

Object Storage: High-durability storage for raw and transcoded video files.

Transcoding Pipeline: Worker cluster that converts videos into multiple formats.

Metadata DB: Relational store for video info, user data, and status.

CDN: Edge cache for video segments to ensure low-latency delivery.

Simplicity Audit: This design avoids complex recommendation graphs or real-time synchronization in favor of a robust, linear pipeline for video availability.

Architecture Decision Rationale:

Why this architecture?: Decoupling upload from processing via a Message Queue (Kafka) allows for independent scaling and fault tolerance. If transcoding fails, the raw file is safe in S3.

Functional Requirement Satisfaction: Covers the full lifecycle from upload to playback.

Non-functional Requirement Satisfaction: CDN ensures global scale; Object Storage ensures reliability; Kafka provides a buffer for traffic spikes.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing:

CDN Strategy: Use a hybrid approach (e.g., CloudFront or Akamai). Static assets, HLS manifests (.m3u8), and video segments (.ts files) are cached at the edge.

L7 Load Balancing: Route traffic based on geography and service type (e.g., /upload vs /watch).

Security & Perimeter:

API Gateway: Handles JWT-based authentication.

Rate Limiting: Tiered limits (e.g., higher for verified creators) to curb abuse and blunt application-layer DDoS; volumetric attacks are absorbed at the CDN edge.
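
One common way to implement tiered limits is a token bucket per client, with the refill rate and burst size chosen by tier. The sketch below is illustrative only: the tier names and numbers in TIER_LIMITS are hypothetical, and state is held in process memory, whereas a gateway fleet would keep it in a shared store.

```python
import time
from dataclasses import dataclass, field

# Hypothetical tiers: (refill rate in requests/sec, burst capacity).
TIER_LIMITS = {
    "standard": (5.0, 10.0),
    "verified_creator": (20.0, 40.0),
}


@dataclass
class TokenBucket:
    rate: float       # tokens added per second
    capacity: float   # maximum tokens the bucket can hold
    tokens: float = field(init=False)
    updated_at: float = field(init=False)

    def __post_init__(self):
        # Start full so a new client can use its whole burst allowance.
        self.tokens = self.capacity
        self.updated_at = time.monotonic()

    def allow(self, now=None) -> bool:
        """Refill by elapsed time, then spend one token if available."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def bucket_for(tier: str) -> TokenBucket:
    rate, capacity = TIER_LIMITS[tier]
    return TokenBucket(rate, capacity)
```

The optional now argument exists only to make the behaviour deterministic in tests.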

Service

Topology & Scaling: Stateless microservices deployed in Kubernetes across multiple Availability Zones.

API Schema Design:

POST /v1/uploads: Initializes upload, returns pre-signed S3 URL.

GET /v1/videos/{id}: Returns metadata and the HLS manifest URL (.m3u8).

PATCH /v1/videos/{id}: Updates title/description (idempotent as long as the body sets absolute field values).
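
To show the idea behind the URL returned by POST /v1/uploads, here is a simplified expiring, HMAC-signed upload URL. It is analogous in spirit to an S3 pre-signed URL but is not S3's actual signing scheme (S3 uses AWS Signature Version 4). The host uploads.example.com, the /raw/ path prefix, the secret, and the function names are all hypothetical.

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET = b"demo-secret"  # hypothetical; a real key would come from a secret store


def _signature(method: str, path: str, expires: int) -> str:
    # Bind the signature to the HTTP method, object path, and expiry time.
    message = f"{method}\n{path}\n{expires}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def sign_upload_url(object_key: str, ttl_seconds: int = 900, now=None) -> str:
    """Return a URL that authorizes one PUT to one object until it expires."""
    now = int(time.time()) if now is None else now
    expires = now + ttl_seconds
    path = f"/raw/{object_key}"
    query = urlencode({"expires": expires, "sig": _signature("PUT", path, expires)})
    return f"https://uploads.example.com{path}?{query}"


def verify_upload(method: str, path: str, expires: int, sig: str, now=None) -> bool:
    """Storage-side check: reject expired or tampered requests."""
    now = int(time.time()) if now is None else now
    if now > expires:
        return False
    return hmac.compare_digest(sig, _signature(method, path, expires))
```

Because the method and path are part of the signed message, a URL issued for uploading one object cannot be reused to read it or to write a different key.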

Resilience & Reliability:

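TUS Protocol: Use resumable uploads to survive network interruptions on mobile. S3 pre-signed URLs do not speak TUS natively, so either terminate TUS at an upload gateway (e.g., tusd with an S3 backend) or get the same resumability from S3 multipart uploads with one pre-signed URL per part.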

Circuit Breaker: Used between Video Service and Metadata DB to prevent cascading failures.
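
The core of a resumable upload is offset bookkeeping: the server remembers how many bytes it has stored, and the client must resume from exactly that point. The sketch below mimics the tus PATCH semantics (the client sends an Upload-Offset; a mismatch yields 409 Conflict, success yields 204). The UploadSession class, the handle_patch function, and the 400 used for an oversized chunk are choices made for this example, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class UploadSession:
    total_bytes: int   # declared size of the whole file
    offset: int = 0    # bytes durably received so far


def handle_patch(session: UploadSession, upload_offset: int, chunk: bytes):
    """Apply one chunk and return (status_code, server_offset).

    The client learns the server offset from the response (or from a
    HEAD request after a dropped connection) and resumes from there.
    """
    if upload_offset != session.offset:
        # Client and server disagree about progress: reject, report ours.
        return 409, session.offset
    if session.offset + len(chunk) > session.total_bytes:
        # Chunk would overrun the declared length (status is this sketch's choice).
        return 400, session.offset
    # A real server would append the bytes to storage before advancing.
    session.offset += len(chunk)
    return 204, session.offset
```

Retrying the same chunk after a timeout is safe: the second attempt carries a stale offset and is rejected instead of being written twice.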

Storage

Access Pattern:

Metadata: Heavy Read (60k QPS), Low Write (60 QPS).

Video Blobs: Write once, Read many.

Database Table Design (Metadata DB):

VideoID (UUID, PK), UserID (FK), Title, Description, RawURL, ManifestURL, Status (Pending/Processing/Ready), CreatedAt.
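
The row above can be modelled in application code as a small record with a guarded status field. This is a sketch under assumptions: the strictly linear Pending -> Processing -> Ready transitions and the rule that a manifest URL is required before Ready are inferred from the pipeline description, and the class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional
from uuid import UUID, uuid4


class Status(Enum):
    PENDING = "Pending"
    PROCESSING = "Processing"
    READY = "Ready"


# Assumed linear lifecycle; retries or failure states are not modelled.
ALLOWED = {
    Status.PENDING: {Status.PROCESSING},
    Status.PROCESSING: {Status.READY},
    Status.READY: set(),
}


@dataclass
class Video:
    user_id: UUID
    title: str
    raw_url: str
    description: str = ""
    manifest_url: Optional[str] = None
    status: Status = Status.PENDING
    video_id: UUID = field(default_factory=uuid4)
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def advance(self, new_status: Status, manifest_url: Optional[str] = None) -> None:
        """Move to new_status, rejecting transitions the lifecycle forbids."""
        if new_status not in ALLOWED[self.status]:
            raise ValueError(f"cannot go from {self.status.value} to {new_status.value}")
        if new_status is Status.READY:
            if manifest_url is None:
                raise ValueError("a manifest URL is required before a video is Ready")
            self.manifest_url = manifest_url
        self.status = new_status
```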

Technical Selection:

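PostgreSQL: Sharded horizontally by VideoID for write and storage scalability, with per-shard replicas for high availability. ACID guarantees hold within a shard, which is sufficient because each video's metadata lives on a single shard.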

S3 (Object Storage): Industry standard for cost-effective, durable blob storage.
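
Routing a query to the right PostgreSQL shard can be as simple as hashing the VideoID. The sketch below assumes a fixed shard count of 16 and a function named shard_for, both hypothetical; it uses a stable hash rather than Python's built-in hash(), which is not guaranteed to be consistent across processes.

```python
import hashlib
from uuid import UUID

NUM_SHARDS = 16  # hypothetical shard count


def shard_for(video_id: UUID, num_shards: int = NUM_SHARDS) -> int:
    """Map a VideoID to a shard index using a stable hash."""
    digest = hashlib.sha256(video_id.bytes).digest()
    # The first 8 bytes give a large, uniformly distributed integer.
    return int.from_bytes(digest[:8], "big") % num_shards
```

Plain modulo routing remaps most keys whenever the shard count changes, so designs that expect to reshard often use consistent hashing or a lookup directory instead.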

Reliability & Recovery: S3 provides 99.999999999% durability. Multi-region replication for critical metadata.

Cache

Purpose & Justification: Reduces Metadata DB load for trending videos and stores temporary upload session state.

Key-Value Schema:

Key: video_meta:{id}, Value: JSON blob of video details. TTL: 1 hour (longer for viral content).

Key: upload_session:{upload_id}, Value: Current chunk offset (keyed by upload rather than by user, so one user can run several uploads at once).
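
The video_meta entry is typically populated with a cache-aside read: check the cache, fall back to the database on a miss, then store the result with a TTL. In the sketch below, TTLCache is an in-memory stand-in for Redis (not a Redis client), and load_from_db is a hypothetical callable supplied by the caller.

```python
import json
import time


class TTLCache:
    """In-memory stand-in for a Redis-style GET / SET-with-expiry."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self._data.get(key)
        if entry is None or entry[1] <= now:
            self._data.pop(key, None)  # drop expired entries lazily
            return None
        return entry[0]

    def set(self, key, value, ttl_seconds, now=None):
        now = time.monotonic() if now is None else now
        self._data[key] = (value, now + ttl_seconds)


def get_video_meta(video_id, cache, load_from_db, ttl_seconds=3600, now=None):
    """Cache-aside read: serve from cache, else load and cache for one hour."""
    key = f"video_meta:{video_id}"
    cached = cache.get(key, now)
    if cached is not None:
        return json.loads(cached)
    meta = load_from_db(video_id)
    cache.set(key, json.dumps(meta), ttl_seconds, now)
    return meta
```

For viral content, the caller can pass a larger ttl_seconds, matching the longer TTL mentioned above.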

Technical Selection: Redis. Support for high-throughput sub-millisecond reads and data structures like Hashes for metadata.

Messaging

Purpose & Decoupling: Decouples the upload completion from the intensive transcoding process.

Event Schema: VideoUploadedEvent { video_id, s3_path, user_id }.

Throughput & Partitioning: Kafka topic partitioned by video_id to ensure ordered processing if multiple updates occur.

Technical Selection: Kafka. High throughput and durability allow for replaying messages if transcoding workers fail.
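
Keying events by video_id is what keeps all events for one video on the same partition, and therefore in order. The sketch below serializes a VideoUploadedEvent and shows the key-to-partition mapping in principle. It does not use a Kafka client, and CRC32 is a stand-in hash: Kafka's default partitioner for keyed records uses murmur2. The partition count of 12 is hypothetical.

```python
import json
import zlib

NUM_PARTITIONS = 12  # hypothetical partition count


def encode_event(video_id: str, s3_path: str, user_id: str):
    """Return (key, value) bytes for a VideoUploadedEvent."""
    key = video_id.encode()
    value = json.dumps(
        {"video_id": video_id, "s3_path": s3_path, "user_id": user_id}
    ).encode()
    return key, value


def partition_for(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stand-in hash for illustration; the same key always maps to the
    # same partition, which is the property that preserves ordering.
    return zlib.crc32(key) % num_partitions
```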

Data Processing

Processing Model: Asynchronous Batch/Streaming processing via FFmpeg.

Processing DAG:

Pull raw video from S3.

Inspect metadata (resolution, bitrate).

Parallel transcode into 1080p, 720p, 480p.

Segment into 4-second .ts chunks.

Generate HLS Manifest (.m3u8).

Upload segments and manifests to the transcoded-output S3 bucket.
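
The transcode, segment, and manifest steps above can be sketched as FFmpeg command lines plus a hand-written master playlist. This only builds the commands; it does not run FFmpeg or touch S3. The bitrate ladder, file naming, libx264/AAC codec choice, and the 16:9 aspect ratio used for RESOLUTION are assumptions for illustration, and BANDWIDTH counts video bitrate only.

```python
# (name, height in pixels, video bitrate in kbps); hypothetical ladder.
RENDITIONS = [
    ("1080p", 1080, 5000),
    ("720p", 720, 2800),
    ("480p", 480, 1400),
]


def ffmpeg_hls_command(src, out_dir, name, height, kbps, segment_seconds=4):
    """Build an FFmpeg argument list for one HLS rendition."""
    return [
        "ffmpeg", "-i", src,
        "-vf", f"scale=-2:{height}",          # keep aspect ratio, even width
        "-c:v", "libx264", "-b:v", f"{kbps}k",
        "-c:a", "aac",
        # Force a keyframe at every segment boundary so cuts land on time.
        "-force_key_frames", f"expr:gte(t,n_forced*{segment_seconds})",
        "-f", "hls",
        "-hls_time", str(segment_seconds),
        "-hls_playlist_type", "vod",
        "-hls_segment_filename", f"{out_dir}/{name}_%05d.ts",
        f"{out_dir}/{name}.m3u8",
    ]


def master_playlist() -> str:
    """Return a master .m3u8 that points at each rendition's playlist."""
    lines = ["#EXTM3U"]
    for name, height, kbps in RENDITIONS:
        width = round(height * 16 / 9 / 2) * 2  # assumes 16:9, rounded to even
        lines.append(
            f"#EXT-X-STREAM-INF:BANDWIDTH={kbps * 1000},RESOLUTION={width}x{height}"
        )
        lines.append(f"{name}.m3u8")
    return "\n".join(lines) + "\n"
```

Each command is independent, so a worker fleet can run the three renditions in parallel, for example by passing each list to subprocess.run on a separate worker.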

Technical Selection: Custom worker fleet using FFmpeg and managed by a workflow engine like Temporal or AWS Step Functions.

Infrastructure (Optional)

Observability: Prometheus for metrics (transcoding lag, 5xx errors), ELK stack for log aggregation, and Jaeger for tracing the lifecycle of an upload.

Wrap Up

Advanced Topics

Trade-offs: We choose Eventual Consistency for view counts and search results to maintain high availability and low latency.

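Reliability: If a transcoding worker fails, it does not commit its Kafka offset, so the event is redelivered to another worker; after repeated failures the event is routed to a dead-letter topic (DLQ). Because the raw file is already in S3, the user never has to re-upload. Transcoding jobs should be idempotent so that redelivery is harmless.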

Bottleneck Analysis: The primary bottleneck is the cost and speed of transcoding.

Optimization: Use hardware acceleration (GPUs) for faster transcoding or "Per-Title Encoding" to optimize bitrate/quality for specific types of content (e.g., animation vs. high-motion sports).

Security: Pre-signed URLs let clients send large binaries directly to object storage, bypassing our application servers. Each URL is scoped to a single object key and expires after a short window, so only authorized users can write to the upload bucket.
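
The retry behaviour described under Reliability above can be sketched with an in-memory queue standing in for Kafka. This is an illustration of at-least-once processing with a dead-letter list, not Kafka client code: the drain function, the limit of 3 attempts, and the transcode callable are hypothetical.

```python
from collections import deque

MAX_ATTEMPTS = 3  # hypothetical retry limit


def drain(events, transcode, max_attempts=MAX_ATTEMPTS):
    """Process events at-least-once.

    A failed event is re-queued until it has been tried max_attempts
    times, then parked in a dead-letter list for later inspection.
    Returns (done, dead_letter).
    """
    queue = deque((event, 1) for event in events)
    done, dead_letter = [], []
    while queue:
        event, attempt = queue.popleft()
        try:
            transcode(event)
            done.append(event)
        except Exception:
            if attempt < max_attempts:
                queue.append((event, attempt + 1))  # retry later
            else:
                dead_letter.append(event)           # give up, keep for triage
    return done, dead_letter
```

Because the same event may be processed more than once, the transcode step has to be safe to repeat, for example by writing output to a deterministic path keyed by video_id.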