The Question

Design a Global Video Sharing and Streaming Platform

Design a system similar to YouTube that supports large-scale video uploads, transcoding into multiple resolutions, and global low-latency playback. The system must handle millions of concurrent viewers and provide a seamless experience across varying network conditions. Discuss how you would handle massive storage requirements, efficient video delivery, and the asynchronous processing of video files.

Kafka

Cassandra

Redis

ElasticSearch

CDN

HLS

DASH

FFmpeg

Kubernetes

gRPC

Questions & Insights

Clarifying Questions

What is the target scale? (Assumption: 100M Daily Active Users (DAU), 5M video uploads per day, and 1B video views per day).

What are the primary functional requirements for the MVP? (Assumption: Video upload, video streaming/playback, and metadata search).

What video qualities and formats must we support? (Assumption: Support for multiple resolutions up to 1080p using H.264/AAC, delivered via HLS/DASH for adaptive bitrate streaming).

Is there a requirement for social features or monetization in the MVP? (Assumption: Basic likes and comments are in-scope; ads and complex recommendations are out-of-scope for the MVP).

Thinking Process

Core Bottleneck: High-bandwidth video ingestion and low-latency global video delivery.

Key Progressive Questions:

How do we handle large file uploads reliably without failing on poor connections? (Answer: Chunked uploads with resumability).

How do we ensure videos play smoothly across different devices and network speeds? (Answer: Async transcoding pipeline and Adaptive Bitrate Streaming).

How do we scale storage and delivery for petabytes of data? (Answer: Object storage for blobs and Global CDN for delivery).

How do we manage high-velocity metadata (views, likes)? (Answer: NoSQL database for horizontal scaling).

Bonus Points

Adaptive Bitrate Streaming (ABR): Implementing DASH (Dynamic Adaptive Streaming over HTTP) or HLS to dynamically adjust video quality based on the user's real-time bandwidth.

Cost-Optimized Storage Tiering: Using S3 Intelligent-Tiering or moving older, less-popular videos to "Cold Storage" (e.g., Glacier) to reduce OpEx.

Content ID & Copyright Ingestion: Implementing a fingerprinting service during the transcoding phase to check against a database of copyrighted material.

Edge Side Compositing: Using Lambda@Edge to personalize manifest files (.m3u8) for users at the CDN level.
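For the Adaptive Bitrate Streaming item above, the client-side decision can be sketched roughly as follows. This is a minimal illustration, not how any particular player implements it: the `Rendition` type, the bitrate ladder values, and the `safety` factor are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rendition:
    name: str
    bitrate_kbps: int  # assumed average bitrate of the encoded stream


# Hypothetical ladder; the bitrates are illustrative, not part of the design.
LADDER = [
    Rendition("360p", 800),
    Rendition("720p", 2800),
    Rendition("1080p", 5000),
]


def pick_rendition(measured_kbps: float, safety: float = 0.8) -> Rendition:
    """Pick the highest rendition that fits within a safety fraction of the
    measured throughput; fall back to the lowest rendition if none fits."""
    budget = measured_kbps * safety
    fitting = [r for r in LADDER if r.bitrate_kbps <= budget]
    if not fitting:
        return LADDER[0]
    return max(fitting, key=lambda r: r.bitrate_kbps)
```

A real player would typically also consider buffer occupancy and smooth its throughput estimate, so treat this as the simplest possible version of the idea.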

Design Breakdown

Functional Requirements

Core Use Cases:

Users can upload videos (up to 1GB).

Users can view videos with minimal buffering.

Users can search for videos by title.

Users can like/comment on videos.

Scope Control:

In-Scope: Upload, Transcoding, Streaming, Metadata Management, Search.

Out-of-Scope: Live streaming, 4K/8K support, complex recommendation engine, Video analytics dashboard.

Non-Functional Requirements

Scale: Support 5M uploads and 1B views daily (1:200 write-to-read ratio).

Latency: Playback start latency < 200ms (via CDN); Upload completion is asynchronous.

Availability & Reliability: 99.99% availability for playback; 99.999% durability for stored videos.

Consistency: Eventual consistency for view counts and comments; Strong consistency for video metadata during upload.

Security & Privacy: Support for private/unlisted videos and TLS for all transmissions.

Estimation

Traffic Estimation:

Upload QPS: 5M / 86,400 s ≈ 60 uploads/sec.

Read QPS: 1B / 86,400 s ≈ 11,500 views/sec (peak ~25k).

Storage Estimation:

5M videos/day * 100MB (avg) = 500 TB/day.

1 year of storage ≈ 180 PB (before replication/transcoding).

Bandwidth Estimation:

Outgoing: 11,500 views/sec × 2 MB/min × 5 min (avg) ≈ 115 GB/sec.
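The back-of-envelope numbers above can be reproduced with a few lines of arithmetic. The inputs are the assumptions stated in this section (5M uploads/day, 1B views/day, 100 MB average upload, 2 MB/min streamed for 5 minutes on average); the variable names are just for illustration, and decimal units (1 TB = 1,000,000 MB) are assumed.

```python
SECONDS_PER_DAY = 86_400

# Inputs taken from the assumptions in this section.
uploads_per_day = 5_000_000
views_per_day = 1_000_000_000
avg_video_mb = 100
stream_mb_per_min = 2
avg_watch_min = 5

upload_qps = uploads_per_day / SECONDS_PER_DAY  # roughly 58, rounded to ~60
view_qps = views_per_day / SECONDS_PER_DAY      # roughly 11,574

# Storage: MB -> TB -> PB using decimal units.
daily_storage_tb = uploads_per_day * avg_video_mb / 1_000_000
yearly_storage_pb = daily_storage_tb * 365 / 1_000

# Egress: each view pulls stream_mb_per_min * avg_watch_min MB in total.
egress_gb_per_sec = view_qps * stream_mb_per_min * avg_watch_min / 1_000

print(f"uploads/sec ~ {upload_qps:.0f}, views/sec ~ {view_qps:.0f}")
print(f"storage: {daily_storage_tb:.0f} TB/day, {yearly_storage_pb:.1f} PB/year")
print(f"egress ~ {egress_gb_per_sec:.0f} GB/sec")
```

The exact results (about 58 uploads/sec and 182.5 PB/year) are slightly off the rounded figures quoted above, which is fine at this level of estimation.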

Blueprint

Concise Summary: A microservices-based architecture centered around an asynchronous transcoding pipeline and a global CDN for delivery.

Major Components:

API Gateway: Entry point for authentication, rate limiting, and request routing.

Upload Service: Handles chunked video uploads and stores raw files in S3.

Transcoding Pipeline: An event-driven system (Kafka + Workers) that converts raw videos into multiple formats and resolutions.

Video Service: Manages video metadata (titles, descriptions, URLs).

CDN: Distributes transcoded video files to edge locations for low-latency playback.

Simplicity Audit: This design avoids complex "Live" infrastructure and relies on managed Object Storage and CDNs to handle the heaviest lifting, which is standard for an MVP.

Architecture Decision Rationale:

Why this architecture? Decoupling upload from transcoding keeps the system resilient: if the transcoders are saturated, uploads still succeed and the transcoding work simply queues up.

Functional Satisfaction: Covers the full lifecycle from upload to global consumption.

Non-functional Satisfaction: CDN ensures low latency; S3/NoSQL ensure massive scale.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing:

A global CDN (e.g., CloudFront or Akamai) is required to cache video segments (e.g., .ts files for HLS) near users.

Latency-based DNS routing directs users to the nearest CDN POP.

Security & Perimeter:

API Gateway handles OAuth2/JWT validation.

Rate limiting at the user level to prevent API abuse during uploads.
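One common way to implement per-user rate limiting is a token bucket. The sketch below is an in-process illustration only; the class name, parameters, and in-memory dictionary are assumptions, and a real gateway would more likely keep this state in a shared store such as Redis.

```python
import time
from typing import Dict, Optional, Tuple


class UserRateLimiter:
    """Token bucket per user: each request spends one token, and tokens
    refill continuously up to a fixed capacity."""

    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        # user_id -> (tokens remaining, timestamp of last update)
        self._buckets: Dict[str, Tuple[float, float]] = {}

    def allow(self, user_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        tokens, last = self._buckets.get(user_id, (self.capacity, now))
        # Refill for the time elapsed since the last request, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.refill_per_sec)
        if tokens >= 1.0:
            self._buckets[user_id] = (tokens - 1.0, now)
            return True
        self._buckets[user_id] = (tokens, now)
        return False
```

The `now` parameter exists only so the behavior can be exercised deterministically; callers would normally omit it.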

Service

Topology & Scaling: Stateless microservices deployed on Kubernetes (EKS/GKE). Autoscaling is driven by CPU utilization for API services and by queue depth for transcode workers.
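Queue-depth scaling for the transcode workers can be reasoned about with a simple proportional rule. The function below is a sketch under assumed parameters (`per_worker_backlog`, `min_workers`, and `max_workers` are hypothetical knobs, not values from this design); in practice this logic would live in the cluster's autoscaler rather than in application code.

```python
import math


def desired_workers(
    queue_depth: int,
    per_worker_backlog: int,
    min_workers: int = 1,
    max_workers: int = 500,
) -> int:
    """Return the worker count that keeps roughly `per_worker_backlog`
    pending videos per worker, clamped to [min_workers, max_workers]."""
    target = math.ceil(queue_depth / per_worker_backlog)
    return max(min_workers, min(max_workers, target))
```

For example, a backlog of 95 videos with a target of 10 per worker yields 10 workers, while an empty queue falls back to the configured minimum.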

API Schema Design:

POST /v1/uploads: Initializes a resumable upload session. Returns upload_id.

PUT /v1/uploads/{id}: Uploads a video chunk.

GET /v1/videos/{id}: Fetches metadata and the HLS master manifest (.m3u8) link.
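The server-side state behind the two upload endpoints might look roughly like this. It is a sketch only: the `UploadSession` class, the fixed chunk size, and identifying chunks by a zero-based index are assumptions, since the API above does not specify how a chunk is addressed.

```python
import math
import uuid
from typing import List, Set


class UploadSession:
    """Tracks which chunks of a resumable upload have arrived."""

    def __init__(self, total_bytes: int, chunk_bytes: int):
        self.upload_id = uuid.uuid4().hex  # returned by POST /v1/uploads
        self.total_chunks = math.ceil(total_bytes / chunk_bytes)
        self.received: Set[int] = set()

    def put_chunk(self, index: int) -> None:
        """Record a chunk from PUT /v1/uploads/{id}; re-sending is harmless."""
        if not 0 <= index < self.total_chunks:
            raise ValueError(f"chunk index {index} out of range")
        self.received.add(index)

    def missing_chunks(self) -> List[int]:
        """Chunks the client still has to send after a dropped connection."""
        return [i for i in range(self.total_chunks) if i not in self.received]

    @property
    def complete(self) -> bool:
        return len(self.received) == self.total_chunks
```

With a 1 GB file split into 10 MB chunks, a client that disconnects after 90 chunks would ask for `missing_chunks()` on reconnect and resend only the last 10.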

Resilience:

Retries with exponential backoff for chunk uploads.

Circuit breakers on the Video Service to prevent cascading failures if the Metadata DB is slow.
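Retrying chunk uploads with exponential backoff could be sketched as below. The helper name, the default delays, and the choice of "full jitter" (scaling each delay by a random factor) are assumptions for illustration; `sleep` and `rng` are injectable purely so the behavior can be checked without waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base: float = 0.5,
    cap: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call `operation`, retrying on ConnectionError with exponentially
    growing, jittered delays. Re-raises after the final attempt."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            # Delay ceiling doubles each attempt (base, 2*base, 4*base, ...),
            # capped at `cap`, then scaled by a random factor in [0, 1).
            sleep(min(cap, base * 2 ** attempt) * rng())
    raise RuntimeError("max_attempts must be at least 1")
```

Jitter matters here because many clients reconnecting at once after an outage would otherwise retry in lockstep.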

Storage

Access Pattern:

Write-heavy for raw video blobs.

Read-heavy for metadata.

Database Table Design:

Video Metadata (Cassandra): video_id (PK), user_id, title, description, manifest_url, status (processing/ready), created_at.

Choice: Cassandra is used for its high write throughput and easy horizontal scaling as the video library grows.

Technical Selection:

Object Storage (S3): For both raw and transcoded video files.

ElasticSearch: For full-text search on video titles and descriptions.

Cache

Purpose: Reduce load on Cassandra for "Viral" videos.

Key-Value Schema:

Key: video:metadata:{video_id}

Value: JSON blob of video details.

TTL: 1 hour, with cache invalidation on metadata updates.

Technical Selection: Redis.
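The read path for this cache follows the usual cache-aside pattern. In the sketch below, `TTLCache` is an in-memory stand-in for Redis so the example runs on its own, and `load_from_db` is a hypothetical callable representing the Cassandra lookup; only the key format and the 1-hour TTL come from the schema above.

```python
import json
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory stand-in for a key-value cache with per-key expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._data: Dict[str, Tuple[str, float]] = {}
        self._clock = clock

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: behave as a miss
            return None
        return value

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        self._data[key] = (value, self._clock() + ttl_seconds)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)  # used for invalidation on metadata updates


def get_video_metadata(
    video_id: str,
    cache: TTLCache,
    load_from_db: Callable[[str], Dict[str, Any]],
    ttl_seconds: float = 3600,
) -> Dict[str, Any]:
    """Cache-aside read: try the cache, fall back to the DB, then populate."""
    key = f"video:metadata:{video_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    record = load_from_db(video_id)
    cache.set(key, json.dumps(record), ttl_seconds)
    return record
```

On a metadata update, the writer would call `cache.delete(f"video:metadata:{video_id}")` so the next read repopulates the entry.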

Messaging

Purpose: Decouple the expensive transcoding process from the user-facing upload service.

Event Schema: VideoUploadedEvent { video_id, raw_s3_path, user_id }.

Failure Handling: Dead-letter queues (DLQ) for videos that fail transcoding (e.g., corrupted files).

Technical Selection: Kafka.
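The consumer-side handling of `VideoUploadedEvent`, including the dead-letter path, can be sketched as follows. The event fields come from the schema above; everything else is an assumption for illustration: `handle_event`, the `max_attempts` value, and the plain Python list standing in for a Kafka dead-letter topic.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class VideoUploadedEvent:
    video_id: str
    raw_s3_path: str
    user_id: str


def handle_event(
    event: VideoUploadedEvent,
    transcode: Callable[[VideoUploadedEvent], None],
    dead_letters: List[Tuple[VideoUploadedEvent, str]],
    max_attempts: int = 3,
) -> bool:
    """Try to transcode up to `max_attempts` times. If every attempt fails,
    park the event (with the last error) in the dead-letter list so the main
    queue is not blocked. Returns True on success, False if dead-lettered."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            transcode(event)
            return True
        except Exception as exc:  # e.g. a corrupted source file
            last_error = exc
    dead_letters.append((event, repr(last_error)))
    return False
```

Keeping the error alongside the event makes it easier to triage dead-lettered videos later, for instance to distinguish corrupted uploads from transient worker failures.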

Data Processing

Processing Model: Asynchronous batch processing per video.

Processing DAG:

Pull message from Kafka.

Download the raw video from the raw-video S3 bucket.

FFmpeg Transformation: Generate 360p, 720p, 1080p versions and segment into .ts files.

Upload the segments to the transcoded-video S3 bucket.

Update Video Service status to "READY".

Technical Selection: Custom workers using FFmpeg libraries.
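The FFmpeg step in the DAG above might be driven by something like the following. This sketch only builds the command line and the HLS master playlist; it does not run FFmpeg. The rendition ladder, bitrates, 6-second segment length, and file naming are assumptions chosen for the example.

```python
from typing import List, Sequence, Tuple

# (name, width, height, video_kbps): illustrative values, not from the design.
RENDITIONS: List[Tuple[str, int, int, int]] = [
    ("360p", 640, 360, 800),
    ("720p", 1280, 720, 2800),
    ("1080p", 1920, 1080, 5000),
]


def ffmpeg_hls_args(
    src: str, out_dir: str, name: str, height: int, video_kbps: int,
    segment_seconds: int = 6,
) -> List[str]:
    """Build an FFmpeg command that encodes one H.264/AAC rendition and
    segments it into .ts files with a VOD playlist."""
    return [
        "ffmpeg", "-i", src,
        "-vf", f"scale=-2:{height}",  # keep aspect ratio, force even width
        "-c:v", "libx264", "-b:v", f"{video_kbps}k",
        "-c:a", "aac", "-b:a", "128k",
        "-f", "hls",
        "-hls_time", str(segment_seconds),
        "-hls_playlist_type", "vod",
        "-hls_segment_filename", f"{out_dir}/{name}_%03d.ts",
        f"{out_dir}/{name}.m3u8",
    ]


def master_playlist(
    renditions: Sequence[Tuple[str, int, int, int]], audio_kbps: int = 128
) -> str:
    """Build the master .m3u8 that points players at each rendition."""
    lines = ["#EXTM3U"]
    for name, width, height, video_kbps in renditions:
        bandwidth = (video_kbps + audio_kbps) * 1000  # bits per second
        lines.append(
            f"#EXT-X-STREAM-INF:BANDWIDTH={bandwidth},RESOLUTION={width}x{height}"
        )
        lines.append(f"{name}.m3u8")
    return "\n".join(lines) + "\n"
```

A worker would pass each argument list to `subprocess.run(args, check=True)`, then upload the segments, the per-rendition playlists, and the master playlist together.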

Wrap Up

Advanced Topics

Trade-offs: We choose Eventual Consistency for view counts. A centralized counter would be a bottleneck; instead, we buffer view increments in Redis and flush to DB periodically.

Reliability: Uploads are resumable. If a user loses the connection 90% of the way through a 1 GB file, only the missing chunks need to be re-uploaded.

Bottleneck Analysis:

Hot Shards: A viral video might overwhelm a single DB shard. Solution: Cache metadata heavily and use a CDN for the video data.

Transcoding Lag: Sudden spikes in uploads can back up the transcoding queue. Solution: Auto-scale transcoding workers on spot instances to absorb bursts while keeping costs low.

Security: Use pre-signed URLs so clients upload directly to S3, bypassing the application servers. This saves server bandwidth and gives each client only time-limited access to a single object rather than storage credentials.
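Returning to the view-count trade-off described under Trade-offs: buffering increments and flushing them periodically can be sketched as below. The `ViewCountBuffer` class is hypothetical, and its in-process `Counter` stands in for the Redis counters; `apply_increment` represents whatever batched write the metadata store supports.

```python
from collections import Counter
from typing import Callable


class ViewCountBuffer:
    """Accumulates view increments in memory and writes them out in batches,
    trading a short delay in visible counts for far fewer DB writes."""

    def __init__(self) -> None:
        self._pending: Counter = Counter()

    def record_view(self, video_id: str) -> None:
        self._pending[video_id] += 1

    def flush(self, apply_increment: Callable[[str, int], None]) -> int:
        """Send one increment per video to the DB and reset the buffer.
        Returns the number of videos flushed."""
        batch, self._pending = self._pending, Counter()
        for video_id, delta in batch.items():
            apply_increment(video_id, delta)
        return len(batch)
```

A viral video that receives thousands of views between flushes results in a single write of the accumulated delta, which is where the eventual consistency comes from.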