The Question
Design

Video Streaming Platform

Design a large-scale video streaming and sharing platform similar to YouTube. The system should support video uploads, transcoding into multiple resolutions, CDN-based delivery, and personalized recommendations for hundreds of millions of users worldwide.
Tags: CDN, S3, FFmpeg Transcoding, ABR Streaming, PostgreSQL
Questions & Insights

Thinking Process

To design a video-sharing platform like YouTube, we must prioritize efficient content delivery and resilient video processing.
Core Bottleneck: High-bandwidth video delivery and the CPU-intensive transcoding process.
Progressive Questions:
How do we ensure users across the globe experience low latency during playback? (Answer: CDN and Edge Caching).
How do we handle different device types and network conditions for a single upload? (Answer: Async Transcoding Pipeline and Adaptive Bitrate Streaming).
How do we prevent the metadata database from becoming a bottleneck during viral events? (Answer: Cache-aside pattern for hot video metadata).
How do we scale the storage of petabytes of raw and processed video files? (Answer: Decoupled Object Storage with Lifecycle Policies).

Bonus Points

Adaptive Bitrate Streaming (ABR): Implementing protocols like DASH (Dynamic Adaptive Streaming over HTTP) or HLS to dynamically adjust video quality based on the user's real-time network bandwidth.
Geo-sharded Metadata: Using a globally distributed database (e.g., Spanner or CockroachDB) to store video metadata, aiming for consistent reads across regions while keeping write latency low by placing data near the users who write it.
Cost Optimization: Using cold storage (e.g., S3 Glacier) for rarely watched legacy videos while keeping viral content in SSD-backed edge caches to reduce the high cost of origin egress.
Blob Storage Multi-part Upload: Implementing resumable uploads with checksum validation to handle large file transfers over unstable mobile networks.
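
As a rough illustration of the ABR point above, a player can pick the highest rendition whose bitrate fits within a safety fraction of its measured throughput. This is a minimal sketch: the bitrate ladder and the 0.8 safety factor are illustrative assumptions, not values from this design, and real players also consider buffer level.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rendition:
    name: str
    bitrate_kbps: int  # assumed average bitrate; real ladders vary by content


# Hypothetical ladder for the three resolutions used in this design.
LADDER = [
    Rendition("360p", 800),
    Rendition("720p", 2800),
    Rendition("1080p", 5000),
]


def pick_rendition(throughput_kbps: float, safety: float = 0.8) -> Rendition:
    """Return the highest rendition that fits within safety * throughput."""
    budget = throughput_kbps * safety
    fitting = [r for r in LADDER if r.bitrate_kbps <= budget]
    # Fall back to the lowest rendition when nothing fits the budget.
    return max(fitting, key=lambda r: r.bitrate_kbps) if fitting else LADDER[0]
```

The player would re-run this choice at each segment boundary, which is what lets quality follow the network as it changes.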
Design Breakdown

Functional Requirements

Video Upload: Users can upload videos (up to 1GB for MVP).
Video Streaming: Users can watch videos in various resolutions (360p, 720p, 1080p).
Search: Users can search for videos by title.
View Counts: Near-real-time, eventually consistent tracking of video views.

Non-Functional Requirements

High Availability: Playback must be available 99.99% of the time.
Low Latency: Start-up time for video playback should be < 2 seconds globally.
Scalability: Must handle a massive increase in both storage volume and concurrent viewers.
Reliability: Uploaded videos must not be lost (Durability).

Estimation

DAU: 5 Million (MVP scale).
Daily Uploads: 10,000 videos.
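Average Video Size: 200MB (Original) + 300MB (Transcoded versions) = 500MB stored per video.
Daily Storage Growth: 10,000 * 500MB = 5 TB / day.
Read/Write Ratio: At least 100:1 (Heavy Read); 5M daily views against 10,000 uploads works out to roughly 500:1.
Bandwidth: 5M views (assuming about one view per DAU) * 100MB avg = 500 TB egress per day. (Requires heavy CDN reliance.)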
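
The arithmetic above can be checked with a few lines. This sketch uses decimal units (1 TB = 10^12 bytes) and assumes one view per daily active user, which the estimate implies but does not state. The average egress rate is derived from the daily total; it is not a figure from the original estimate.

```python
MB = 10**6
TB = 10**12

daily_uploads = 10_000
bytes_per_video = (200 + 300) * MB   # original + transcoded versions
daily_views = 5_000_000              # assumption: one view per DAU
bytes_per_view = 100 * MB

storage_per_day_tb = daily_uploads * bytes_per_video / TB
egress_per_day_tb = daily_views * bytes_per_view / TB

# Average egress rate over 24 hours; peak traffic will be well above this.
seconds_per_day = 86_400
avg_egress_gbps = daily_views * bytes_per_view * 8 / seconds_per_day / 10**9

views_per_upload = daily_views / daily_uploads

print(f"storage growth: {storage_per_day_tb} TB/day")
print(f"egress: {egress_per_day_tb} TB/day (~{avg_egress_gbps:.1f} Gbps average)")
print(f"views per upload: {views_per_upload:.0f}")
```

An average of roughly 46 Gbps sustained is far more than an origin should serve directly, which is why the estimate calls for heavy CDN reliance.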

Blueprint

Concise Summary: An asynchronous, event-driven architecture that decouples video ingestion from playback, utilizing a distributed transcoding pipeline and Global CDN delivery.
Major Components:
API Gateway: Entry point for authentication, rate limiting, and request routing to internal services.
Upload Service: Handles multipart uploads and persists raw video files to temporary storage.
Transcoding Workers: CPU-optimized fleet that converts raw videos into multiple formats and resolutions.
Metadata DB: Stores video information (title, URL, owner, duration).
Object Storage: High-durability storage for both original and processed video files.
CDN: Geographically distributed cache to serve video segments to end-users from the nearest edge.
Simplicity Audit: This architecture uses managed Object Storage and CDNs to offload the heaviest tasks (storage and delivery), allowing the application logic to remain lightweight and horizontally scalable.
Architecture Decision Rationale:
Why this architecture: Video processing is too slow for synchronous requests; an async worker-queue model is mandatory for stability.
Functional Satisfaction: Covers the full lifecycle from upload to playback and search.
Non-functional Satisfaction: CDN ensures low latency; Object Storage ensures durability; microservices allow independent scaling of the Transcoder fleet during peak upload times.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling: Services are deployed as Docker containers in an auto-scaling group. The Upload Service is tuned for long-lived connections with generous timeouts, while the Video Metadata Service is tuned for high-throughput small reads.
API Spec:
POST /v1/uploads: Initiates a resumable upload session.
GET /v1/videos/{id}: Returns metadata and the CDN URL for the manifest file (m3u8/mpd).
GET /v1/search?q={query}: Basic keyword search against the metadata database.
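
To make the resumable upload session behind POST /v1/uploads more concrete, here is a minimal in-memory sketch. The class name, fields, and offset-based protocol are hypothetical; a real service would write chunks to object storage rather than a buffer. The idea is that the server tracks the next expected offset so a client can resume after a dropped connection, and the upload only completes if the checksum matches.

```python
import hashlib
import uuid


class UploadSession:
    """In-memory stand-in for a resumable upload session (hypothetical)."""

    def __init__(self, total_size: int, expected_sha256: str):
        self.upload_id = uuid.uuid4().hex
        self.total_size = total_size
        self.expected_sha256 = expected_sha256
        self.buffer = bytearray()

    @property
    def next_offset(self) -> int:
        # A reconnecting client asks for this value and resumes from it.
        return len(self.buffer)

    def append(self, offset: int, chunk: bytes) -> int:
        """Accept a chunk only if it starts exactly where the last one ended."""
        if offset != self.next_offset:
            raise ValueError(f"expected offset {self.next_offset}, got {offset}")
        if offset + len(chunk) > self.total_size:
            raise ValueError("chunk exceeds declared size")
        self.buffer.extend(chunk)
        return self.next_offset

    def complete(self) -> bool:
        """True only when all bytes arrived and the checksum matches."""
        if self.next_offset != self.total_size:
            return False
        return hashlib.sha256(self.buffer).hexdigest() == self.expected_sha256
```

Rejecting out-of-order offsets keeps the server logic simple; a client that loses track just asks for next_offset and retries from there.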

Storage

Data Model:
Table: Videos: video_id (PK), user_id, title, description, raw_path, manifest_path, status (processing/ready), created_at.
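Database Logic: PostgreSQL with B-Tree indexes on video_id (created automatically for the primary key) and user_id. For search, a GIN index over a tsvector built from the title column (e.g., to_tsvector('english', title)) provides basic full-text search for the MVP.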
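
One possible shape for the table and indexes described above, kept as SQL strings in Python so they can be run through any PostgreSQL driver. The column types, index names, and the 'english' text-search configuration are assumptions. The %s placeholders follow the psycopg convention.

```python
# PostgreSQL DDL for the Videos table (column types are assumptions).
CREATE_VIDEOS = """
CREATE TABLE videos (
    video_id      UUID PRIMARY KEY,   -- primary key gets a B-Tree index automatically
    user_id       UUID NOT NULL,
    title         TEXT NOT NULL,
    description   TEXT,
    raw_path      TEXT NOT NULL,
    manifest_path TEXT,               -- NULL until transcoding finishes
    status        TEXT NOT NULL DEFAULT 'processing'
                  CHECK (status IN ('processing', 'ready')),
    created_at    TIMESTAMPTZ NOT NULL DEFAULT now()
);
"""

CREATE_INDEXES = """
CREATE INDEX idx_videos_user_id ON videos (user_id);
CREATE INDEX idx_videos_title_fts
    ON videos USING GIN (to_tsvector('english', title));
"""


def search_query(q: str, limit: int = 20):
    """Build a parameterized title search that can use the GIN index."""
    sql = (
        "SELECT video_id, title FROM videos "
        "WHERE status = 'ready' "
        "AND to_tsvector('english', title) @@ plainto_tsquery('english', %s) "
        "ORDER BY created_at DESC LIMIT %s"
    )
    return sql, (q, limit)
```

The WHERE clause repeats the exact expression used in the index definition; PostgreSQL only considers an expression index when the query expression matches it.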

Cache

Data Structures: Redis Strings for video metadata caching (Key: vid:{id}).
TTL: 24 hours for standard videos; shorter for trending content.
Eviction: LRU (Least Recently Used), so that limited memory stays occupied by the most frequently requested content.
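
The cache-aside read path for video metadata can be sketched as follows. TTLCache is a tiny in-process stand-in for Redis (it expires keys but does not implement LRU eviction), and load_from_db is a hypothetical callable representing the PostgreSQL lookup.

```python
import time
from typing import Any, Callable, Optional


class TTLCache:
    """In-process stand-in for Redis GET / SET-with-expiry (no LRU here)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._data = {}
        self._clock = clock

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key: str, value: Any, ttl_seconds: float) -> None:
        self._data[key] = (value, self._clock() + ttl_seconds)


def get_video(video_id, cache, load_from_db, ttl_seconds=24 * 3600):
    """Cache-aside: try the cache, fall back to the DB, then populate."""
    key = f"vid:{video_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    row = load_from_db(video_id)
    if row is not None:
        cache.set(key, row, ttl_seconds)
    return row
```

During a viral spike, repeated reads for the same video hit the cache after the first miss, so the database sees roughly one read per key per TTL window.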

Messaging

Topic Structure: transcoding_tasks queue. Each message contains video_id and the S3 path to the raw file.
Delivery Guarantees: At-least-once delivery to ensure no video upload is "lost" in processing.
Consumers: Transcoder workers pull tasks, process them, and acknowledge upon completion.
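
At-least-once delivery means a worker may see the same message twice, so processing should be idempotent. The sketch below uses an in-memory deque in place of a real queue; the retry limit, the attempts field, and the done set are illustrative assumptions. It skips already-processed videos, re-enqueues failures, and moves messages to a dead-letter list once the retry budget is spent.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumed retry budget before dead-lettering


def run_worker(queue: deque, transcode, done: set, dlq: list) -> None:
    """Drain the queue; failed messages are re-enqueued (redelivered)."""
    while queue:
        msg = queue.popleft()
        video_id = msg["video_id"]
        if video_id in done:
            continue  # duplicate delivery: already processed, just drop it
        try:
            transcode(video_id, msg["raw_path"])
            done.add(video_id)  # success acts as the acknowledgement
        except Exception:
            msg["attempts"] = msg.get("attempts", 0) + 1
            if msg["attempts"] >= MAX_ATTEMPTS:
                dlq.append(msg)    # give up; keep for manual inspection
            else:
                queue.append(msg)  # try again later
```

In a real deployment the done set would live in durable storage, for example the status column of the Videos table, so that deduplication survives worker restarts.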

Data Processing

DAG/Transformations: The Transcoder splits the raw video into 10-second chunks, converts them into H.264/AAC at 360p, 720p, and 1080p, and generates an HLS manifest file.
Windowing Strategy: Not applicable for video files; processing is per-file.
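
A sketch of how a worker might build the FFmpeg invocation for one rendition, plus the HLS master playlist that ties the renditions together. The bitrates, output file naming, and 16:9 aspect ratio are assumptions. The function only builds the argument list and does not run FFmpeg. As a simplification, it lets FFmpeg's HLS muxer encode and segment in one pass rather than splitting first and converting chunks separately.

```python
# Assumed video bitrates (kbps) per output height.
RENDITIONS = {360: 800, 720: 2800, 1080: 5000}


def ffmpeg_hls_args(src: str, out_dir: str, height: int, segment_seconds: int = 10):
    """Build an FFmpeg command producing H.264/AAC HLS output for one height."""
    return [
        "ffmpeg", "-i", src,
        "-vf", f"scale=-2:{height}",          # keep aspect ratio, even width
        "-c:v", "libx264", "-b:v", f"{RENDITIONS[height]}k",
        "-c:a", "aac",
        "-f", "hls",
        "-hls_time", str(segment_seconds),     # target segment length
        "-hls_playlist_type", "vod",
        "-hls_segment_filename", f"{out_dir}/{height}p_%03d.ts",
        f"{out_dir}/{height}p.m3u8",
    ]


def master_playlist() -> str:
    """Build a master playlist listing every rendition (assumes 16:9 video)."""
    lines = ["#EXTM3U"]
    for height in sorted(RENDITIONS):
        width = height * 16 // 9
        bandwidth = RENDITIONS[height] * 1000  # approximate; ignores audio
        lines.append(
            f"#EXT-X-STREAM-INF:BANDWIDTH={bandwidth},RESOLUTION={width}x{height}"
        )
        lines.append(f"{height}p.m3u8")
    return "\n".join(lines) + "\n"
```

The master playlist is the manifest the player fetches first; each entry points at a per-resolution playlist, which in turn lists the segments.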
Wrap Up

Advanced Topics

Monitoring:
Prometheus/Grafana: Monitor Transcoding queue depth and worker CPU usage.
CDN Logs: Track egress traffic and cache hit ratios.
Trade-offs:
Consistency vs Availability: We choose Eventual Consistency for view counts and metadata updates to maintain high availability during peak traffic.
Bottlenecks: The Transcoding process is the slowest part. If the queue grows too long, "Time to Ready" for videos increases.
Failure Handling:
S3 Replication: Cross-region replication for disaster recovery.
Dead Letter Queue (DLQ): Captures videos that repeatedly fail transcoding so they can be inspected manually.
Alternatives & Optimization:
Alternative: Use a NoSQL DB like Cassandra for metadata if the write load scales beyond what a single PostgreSQL primary can handle.
Optimization: Implement Edge Side Rendering or Lambda@Edge to customize manifests (e.g., ad insertion) without hitting the origin.
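
For the eventual-consistency trade-off on view counts noted under Trade-offs, one common approach is to buffer increments in memory and flush them in batches. The class below is a minimal sketch; the flush threshold and the dict standing in for the database are assumptions.

```python
from collections import Counter


class ViewCounter:
    """Buffers view increments and writes them to the store in batches."""

    def __init__(self, flush_every: int = 1000):
        self.pending = Counter()
        self.flush_every = flush_every
        self._since_flush = 0

    def record(self, video_id: str, store: dict) -> None:
        self.pending[video_id] += 1
        self._since_flush += 1
        if self._since_flush >= self.flush_every:
            self.flush(store)

    def flush(self, store: dict) -> None:
        # One write per video instead of one write per view.
        for video_id, n in self.pending.items():
            store[video_id] = store.get(video_id, 0) + n
        self.pending.clear()
        self._since_flush = 0
```

Displayed counts lag by up to one flush interval, and a crash loses the unflushed buffer. That is the cost accepted in exchange for keeping per-view writes off the metadata database.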