The Question
Design: Video Streaming Platform
Design a large-scale video streaming platform similar to YouTube. The system should handle video upload and transcoding pipelines, CDN-based adaptive bitrate delivery, search and discovery, and personalized recommendations for a global user base.
Tags: CDN, S3, FFmpeg Transcoding, HLS/DASH, Redis
Questions & Insights
Thinking Process
To design a video streaming platform like YouTube, the core challenge is the asymmetry between write (upload/transcode) and read (global playback) paths.
Focus Areas: Large file handling, asynchronous processing pipelines, and low-latency global content delivery.
Progressive Questions:
How do we handle multi-gigabyte uploads without timing out or wasting bandwidth? (Answer: Chunked uploads + Resumable sessions).
How do we ensure videos play smoothly on both 4K monitors and 3G mobile phones? (Answer: Asynchronous transcoding into multiple resolutions + Adaptive Bitrate Streaming).
How do we prevent our primary database from melting under millions of concurrent playback requests? (Answer: Decouple metadata from the video binary and utilize a CDN for the heavy lifting).
How do we ensure global availability with sub-second start times? (Answer: Geo-distributed Object Storage and Edge Caching).
Bonus Points
Cost-Optimized Storage Tiering: Using S3 Intelligent-Tiering or moving older, unpopular videos to "Cold Storage" (Glacier) to reduce OpEx.
Adaptive Bitrate Streaming (ABR): Implementing HLS (HTTP Live Streaming) or MPEG-DASH to dynamically adjust video quality based on the user's real-time network throughput.
VNET/Edge Side Compositing: Using Lambda@Edge to personalize manifest files or inject ads at the CDN level rather than the origin.
Quorum-based Metadata Writes: Using Linearizable consistency for video status (to prevent "Video Not Found" after a successful upload) while using Eventual consistency for view counts.
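As a rough illustration of the ABR bullet above, the player-side decision can be sketched as picking the highest rendition whose bitrate fits within a safety margin of the measured throughput. The bitrate ladder and the 0.8 safety factor are assumptions for illustration, not values from this design; real HLS/DASH players use more elaborate heuristics (buffer level, throughput smoothing).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Rendition:
    name: str
    bitrate_kbps: int  # assumed average bitrate, for illustration only


# Hypothetical ladder, sorted from lowest to highest bitrate.
LADDER: List[Rendition] = [
    Rendition("360p", 800),
    Rendition("720p", 2800),
    Rendition("1080p", 5000),
]


def pick_rendition(throughput_kbps: float, safety: float = 0.8) -> Rendition:
    """Return the highest rendition that fits within safety * throughput.

    Falls back to the lowest rendition when nothing fits, so playback
    can still start on a very slow connection.
    """
    budget = throughput_kbps * safety
    best = LADDER[0]
    for rendition in LADDER:
        if rendition.bitrate_kbps <= budget:
            best = rendition
    return best
```

A player would call pick_rendition again after each downloaded segment, which is what lets quality step up or down mid-playback.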
Design Breakdown
Functional Requirements
Upload: Users can upload videos (up to 1GB for MVP).
Streaming: Users can watch videos in different resolutions (360p, 720p, 1080p).
Search: Users can search for videos by title.
Metadata: Users can view video titles, descriptions, and view counts.
Non-Functional Requirements
High Availability: 99.9% uptime for playback (the read path takes priority over the write path).
Low Latency: Minimal buffering and fast start times globally.
Scalability: Support for millions of concurrent viewers.
Reliability: Uploaded data must not be lost (Durability).
Estimation
DAU: 10 Million.
Daily Uploads: 50,000 videos.
Average Video Size: 200MB (Original).
Daily Ingest Storage: 50k * 200MB = 10 TB/day.
Read/Write Ratio: 100:1 (Heavy Read).
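Egress Bandwidth: 10M views/day (assuming about one view per daily active user) * 100MB avg watched = 1 PB/day.
Bandwidth throughput: 1 PB / 86,400 s ≈ 11.6 GB/s (about 93 Gbps) on average, before peak-hour spikes (Requires heavy CDN offloading).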
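The arithmetic behind these estimates can be checked with a few lines of Python. This is a sketch using decimal units (1 MB = 10^6 bytes) and assuming one view per daily active user; the variable names are illustrative.

```python
MB = 10**6
GB = 10**9
TB = 10**12
PB = 10**15
SECONDS_PER_DAY = 86_400

daily_uploads = 50_000
avg_video_size = 200 * MB
daily_views = 10_000_000  # assumption: roughly one view per DAU
avg_watched = 100 * MB

ingest_per_day = daily_uploads * avg_video_size      # bytes/day of raw uploads
egress_per_day = daily_views * avg_watched           # bytes/day served to viewers
egress_per_second = egress_per_day / SECONDS_PER_DAY  # average, not peak

print(f"Ingest: {ingest_per_day / TB:.1f} TB/day")
print(f"Egress: {egress_per_day / PB:.1f} PB/day")
print(f"Average egress: {egress_per_second / GB:.2f} GB/s")
```

Note that the ingest figure covers only the original files; transcoded renditions add to it.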
Blueprint
Concise Summary: A microservices-based architecture that separates the heavy video processing pipeline from the lightweight metadata and search services, utilizing a CDN for global delivery.
Major Components:
Upload Service: Handles chunked binary transfers and generates pre-signed upload URLs.
Transcoding Engine: An asynchronous worker fleet that converts raw files into HLS/DASH segments and multiple resolutions.
CDN (Content Delivery Network): Caches processed video segments at edge locations close to users.
Metadata Store: A relational database for structured video info (UserID, Title, Storage Path).
Simplicity Audit: This design uses S3 for storage and FFmpeg-based workers for processing, avoiding complex custom streaming servers by leveraging standard HTTP-based streaming (HLS).
Architecture Decision Rationale:
Why this architecture fits the problem: It decouples the compute-heavy transcoding task from the user-facing API, ensuring that a surge in uploads doesn't degrade the playback experience.
Functional Requirement Satisfaction: Meets upload, transcode, and play requirements via S3 + Workers + CDN.
Non-functional Requirement Satisfaction: CDN ensures low latency; S3 is designed for 99.999999999% (11 nines) durability; Queue-based processing ensures system resilience.
High Level Architecture
Sub-system Deep Dive
Service
Topology & Scaling: Services are deployed as Dockerized containers in an auto-scaling group (K8s/ECS).
API Spec:
POST /v1/uploads/resumable: Returns a session ID and S3 pre-signed URL for chunked uploads.
GET /v1/videos/{id}: Returns metadata and the CDN URL for the .m3u8 manifest file.
GET /v1/search?q=...: Interfaces with the Metadata DB (MVP) or Elasticsearch (Post-MVP).
Communication: REST for external; gRPC for internal service-to-service calls.
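On the client side, a resumable upload session comes down to splitting the file into byte ranges and re-sending only the parts the server has not acknowledged. The sketch below shows that bookkeeping only; the function names and the 8 MiB default are assumptions, and no network calls are made. If the chunks map onto S3 multipart upload parts, each part except the last must be at least 5 MiB.

```python
from typing import Iterator, List, Set, Tuple

Chunk = Tuple[int, int, int]  # (part_number, first_byte, last_byte_inclusive)


def plan_chunks(file_size: int, chunk_size: int = 8 * 1024 * 1024) -> Iterator[Chunk]:
    """Yield 1-based part numbers with inclusive byte ranges covering the file."""
    if file_size < 0 or chunk_size <= 0:
        raise ValueError("file_size must be >= 0 and chunk_size must be > 0")
    part, start = 1, 0
    while start < file_size:
        end = min(start + chunk_size, file_size) - 1
        yield part, start, end
        start = end + 1
        part += 1


def remaining_chunks(file_size: int, uploaded_parts: Set[int],
                     chunk_size: int = 8 * 1024 * 1024) -> List[Chunk]:
    """Return the chunks still to send after an interrupted session."""
    return [c for c in plan_chunks(file_size, chunk_size)
            if c[0] not in uploaded_parts]
```

After a dropped connection, the client would ask the Upload Service which parts it already holds and pass that set to remaining_chunks, so only the missing ranges consume bandwidth.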
Storage
Data Model:
Videos Table: video_id (PK), user_id, title, description, s3_path, status (processing/ready), created_at.
Database Logic: Postgres is used for ACID compliance on metadata. Horizontal scaling via Read Replicas since playback info is read-intensive.
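The Videos table and its processing-to-ready transition can be sketched as follows. SQLite from the Python standard library stands in for Postgres here so the example is self-contained; the column types, the CHECK constraint, and the helper names are assumptions layered on the column list above.

```python
import sqlite3

# Assumed column types; the source lists only the column names.
SCHEMA = """
CREATE TABLE videos (
    video_id    TEXT PRIMARY KEY,
    user_id     TEXT NOT NULL,
    title       TEXT NOT NULL,
    description TEXT,
    s3_path     TEXT NOT NULL,
    status      TEXT NOT NULL CHECK (status IN ('processing', 'ready')),
    created_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
)
"""


def create_db() -> sqlite3.Connection:
    """Create an in-memory database with the videos table (SQLite stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def insert_video(conn: sqlite3.Connection, video_id: str, user_id: str,
                 title: str, s3_path: str, description: str = "") -> None:
    """Register a new upload; every video starts in the 'processing' state."""
    conn.execute(
        "INSERT INTO videos (video_id, user_id, title, description, s3_path, status) "
        "VALUES (?, ?, ?, ?, ?, 'processing')",
        (video_id, user_id, title, description, s3_path),
    )
    conn.commit()


def mark_ready(conn: sqlite3.Connection, video_id: str) -> bool:
    """Flip processing -> ready; returns False if the row was not in 'processing'."""
    cur = conn.execute(
        "UPDATE videos SET status = 'ready' "
        "WHERE video_id = ? AND status = 'processing'",
        (video_id,),
    )
    conn.commit()
    return cur.rowcount == 1
```

Guarding the UPDATE on the current status makes the transition safe to repeat, which matters when a transcoding worker reports completion more than once.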
Cache
Redis: Stores the Video object serialized as JSON.
TTL: 24 hours for popular videos; evicted via LRU (Least Recently Used).
Logic: When GET /v1/videos/{id} is called, check Redis first to avoid DB hits.
Messaging
SQS / Kafka: Acts as a buffer between the Upload Service and Transcoding Workers.
Message Structure:
{ "video_id": "123", "input_path": "s3://raw/123.mp4", "resolutions": ["360p", "720p", "1080p"] }
Guarantees: At-least-once delivery to ensure no video is left untranscoded.
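At-least-once delivery means a worker may receive the same message twice, so the handler should be idempotent. The sketch below parses the message structure shown above and skips any (video, resolution) pair already completed. The in-memory set and the transcode callback are stand-ins; a real worker would presumably keep this record in a durable store such as the metadata DB.

```python
import json
from typing import Callable, Set


def handle_message(raw: str, done: Set[str],
                   transcode: Callable[[str, str], None]) -> bool:
    """Process one queue message; return True if any work was performed.

    `done` records completed "video_id:resolution" keys so that a
    redelivered message does not trigger a second transcode.
    """
    msg = json.loads(raw)
    did_work = False
    for resolution in msg["resolutions"]:
        key = f"{msg['video_id']}:{resolution}"
        if key in done:
            continue  # already transcoded on an earlier delivery
        transcode(msg["input_path"], resolution)
        done.add(key)  # record only after the transcode succeeded
        did_work = True
    return did_work
```

Because the key is recorded only after a successful transcode, a worker crash mid-message leaves the unfinished resolutions eligible for retry on redelivery.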
Data Processing
Transcoding Workers: Python/Go wrappers around FFmpeg.
DAG/Workflow:
Pull message from SQS.
Download raw file from S3.
Run parallel FFmpeg processes for different resolutions.
Segment videos into 10-second .ts chunks.
Generate a Master Playlist (.m3u8).
Upload all to Processed S3.
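The segmenting and playlist steps of this workflow could look roughly like the sketch below, which builds an FFmpeg HLS command per rendition and writes a master playlist. The rendition table (sizes and bitrates), the libx264/aac codec choices, and the file naming are assumptions; the RESOLUTION values assume 16:9 sources, and BANDWIDTH is approximated by the video bitrate alone. The commands are only constructed here, not executed.

```python
from typing import Dict, List, Tuple

# Hypothetical ladder: name -> (width, height, video bitrate in kbit/s).
RENDITIONS: Dict[str, Tuple[int, int, int]] = {
    "360p": (640, 360, 800),
    "720p": (1280, 720, 2800),
    "1080p": (1920, 1080, 5000),
}


def ffmpeg_hls_args(input_path: str, out_dir: str, name: str) -> List[str]:
    """Build an FFmpeg command that emits 10-second .ts segments for one rendition."""
    _, height, kbps = RENDITIONS[name]
    return [
        "ffmpeg", "-i", input_path,
        "-vf", f"scale=-2:{height}",      # keep aspect ratio, even width
        "-c:v", "libx264", "-b:v", f"{kbps}k",
        "-c:a", "aac",
        "-f", "hls",
        "-hls_time", "10",                # target segment length in seconds
        "-hls_playlist_type", "vod",
        "-hls_segment_filename", f"{out_dir}/{name}_%03d.ts",
        f"{out_dir}/{name}.m3u8",         # per-rendition media playlist
    ]


def master_playlist(names: List[str]) -> str:
    """Return master .m3u8 text pointing at each rendition's media playlist."""
    lines = ["#EXTM3U"]
    for name in names:
        width, height, kbps = RENDITIONS[name]
        lines.append(
            f"#EXT-X-STREAM-INF:BANDWIDTH={kbps * 1000},RESOLUTION={width}x{height}"
        )
        lines.append(f"{name}.m3u8")
    return "\n".join(lines) + "\n"
```

A worker could pass each argument list to subprocess.run in its own process to get the per-resolution parallelism, then upload the segments and playlists together.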
Wrap Up
Advanced Topics
Monitoring:
Prometheus/Grafana: Track Transcoding Queue depth (to scale workers) and API 5xx errors.
CloudWatch: Monitor S3 egress costs and CDN hit ratios.
Trade-offs:
Availability vs. Consistency: We choose Eventual Consistency for view counts. A user might see 100 views while another sees 105; this is acceptable to avoid locking the DB.
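One way to get eventually consistent view counts without locking the database on every playback is to buffer increments and flush them in batches. The sketch below uses an in-memory dict as a stand-in for the database; the class name and flush threshold are illustrative assumptions.

```python
from collections import Counter
from typing import Dict


class ViewCountBuffer:
    """Accumulates view increments and writes them to the store in batches."""

    def __init__(self, flush_threshold: int = 100) -> None:
        self.flush_threshold = flush_threshold
        self.pending: Counter = Counter()   # increments not yet persisted
        self.pending_total = 0
        self.store: Dict[str, int] = {}     # stand-in for the metadata DB

    def record_view(self, video_id: str) -> None:
        self.pending[video_id] += 1
        self.pending_total += 1
        if self.pending_total >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        """Apply all buffered increments with one write per video."""
        for video_id, count in self.pending.items():
            self.store[video_id] = self.store.get(video_id, 0) + count
        self.pending.clear()
        self.pending_total = 0

    def displayed_count(self, video_id: str) -> int:
        """What a reader sees: the persisted value, which may lag behind."""
        return self.store.get(video_id, 0)
```

Between flushes, readers see a stale count, which is exactly the 100-versus-105 discrepancy described above.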
Bottlenecks:
Transcoding Speed: Heavy 4K videos take time. Optimization: Split a single video into chunks and transcode chunks in parallel across multiple workers.
Cost: Bandwidth is expensive. Optimization: Use CDN with aggressive caching and private peering with ISPs.
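The chunk-parallel transcoding optimization can be sketched as splitting the timeline into fixed-length ranges and fanning them out to workers. A thread pool stands in for the worker fleet, and transcode_range is a hypothetical callback; in practice the split points would also need to align with keyframes, which this sketch ignores.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")
TimeRange = Tuple[float, float]  # (start_seconds, end_seconds)


def split_into_ranges(duration_s: float, chunk_s: float = 60.0) -> List[TimeRange]:
    """Split [0, duration_s] into consecutive ranges of at most chunk_s seconds."""
    if duration_s <= 0 or chunk_s <= 0:
        raise ValueError("duration_s and chunk_s must be positive")
    count = math.ceil(duration_s / chunk_s)
    return [(i * chunk_s, min((i + 1) * chunk_s, duration_s)) for i in range(count)]


def transcode_parallel(ranges: List[TimeRange],
                       transcode_range: Callable[[TimeRange], T],
                       max_workers: int = 4) -> List[T]:
    """Run transcode_range over all ranges concurrently.

    Executor.map returns results in input order, so the outputs can be
    concatenated back into the original timeline.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(transcode_range, ranges))
```

With this layout, wall-clock transcoding time is bounded by the slowest chunk rather than by the full video length, at the cost of a final stitching step.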
Failure Handling:
Dead Letter Queues (DLQ): For videos that fail transcoding more than 3 times.
S3 Cross-Region Replication: Ensure videos are available even if an entire AWS region goes down.
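The Dead Letter Queue rule can be sketched as a retry loop that counts failures per message and parks the message once it has failed more than three times. The in-memory deque and list stand in for the real queue and DLQ, and the "attempts" field is an illustrative assumption; managed queues such as SQS can apply an equivalent redrive policy without custom code.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

MAX_FAILURES = 3  # a message moves to the DLQ once it fails more than this


def drain(queue: Deque[Dict], process: Callable[[Dict], None],
          dlq: List[Dict]) -> None:
    """Process messages until the queue is empty, retrying failures."""
    while queue:
        msg = queue.popleft()
        try:
            process(msg)
        except Exception:
            msg["attempts"] = msg.get("attempts", 0) + 1
            if msg["attempts"] > MAX_FAILURES:
                dlq.append(msg)    # give up; keep it for inspection
            else:
                queue.append(msg)  # requeue for another attempt
```

Messages in the DLQ stay available for debugging or manual replay, so a corrupt upload cannot block the rest of the pipeline.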
Alternatives:
NoSQL (Cassandra): Could be used for metadata if we expect billions of rows, but Postgres is simpler for an MVP.
Peer-to-Peer (P2P) Streaming: Could be used to reduce CDN costs (like BitChute), but increases client-side complexity.