The Question
Design

Video Streaming Platform

Design a large-scale video streaming platform similar to YouTube. The system should handle video upload and transcoding pipelines, CDN-based adaptive bitrate delivery, search and discovery, and personalized recommendations for a global user base.
PostgreSQL
Cassandra
DynamoDB
Elasticsearch
S3
Questions & Insights

Thinking Process

To design a video streaming platform like YouTube at scale, the primary bottleneck shifts from simple CRUD operations to massive binary data ingress and high-bandwidth egress.
How do we handle multi-gigabyte uploads reliably?
Use Resumable Chunked Uploads. The client breaks files into chunks; the server validates hashes and tracks progress in a metadata store to allow resumes after network failure.
How do we ensure smooth playback across diverse devices and bandwidths?
Implement an Asynchronous Transcoding Pipeline. Raw videos are converted into multiple resolutions (360p, 720p, 1080p) and formats (HLS/DASH) to support Adaptive Bitrate Streaming (ABR).
How do we minimize global latency?
Leverage a Content Delivery Network (CDN) to cache transcoded video segments at edge locations, moving the "heavy lifting" closer to the end user.
How do we manage high-frequency metadata (View Counts)?
Use Write-Back Caching or stream processing. Incrementing a SQL counter on every view serializes writes on a single row, which causes lock contention on popular videos.
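The adaptive bitrate idea above can be illustrated with a minimal client-side sketch. The rendition ladder, the bitrates, and the 0.8 safety factor are illustrative assumptions, not values from this design; real HLS/DASH players use more elaborate heuristics (buffer level, throughput history).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rendition:
    name: str
    bitrate_kbps: int  # assumed average bitrate, illustrative only


# Hypothetical ladder matching the resolutions named above.
LADDER = [
    Rendition("360p", 800),
    Rendition("720p", 2500),
    Rendition("1080p", 5000),
]


def pick_rendition(measured_kbps: float, safety: float = 0.8) -> Rendition:
    """Pick the highest rendition that fits within a safety margin of the
    measured throughput; fall back to the lowest one if nothing fits."""
    budget = measured_kbps * safety
    fitting = [r for r in LADDER if r.bitrate_kbps <= budget]
    if not fitting:
        return LADDER[0]
    return max(fitting, key=lambda r: r.bitrate_kbps)
```

A player would call pick_rendition after each downloaded segment and switch playlists when the result changes.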

Bonus Points

Cost-Aware Storage Tiering: Store "hot" popular videos in SSD-backed caches/standard S3, and "cold" legacy videos in S3 Glacier to optimize multi-petabyte storage costs.
QUIC/HTTP3 for Uploads: Recommend QUIC to reduce head-of-line blocking and improve upload performance in lossy mobile network environments.
VMAF-based Encoding: Mention using Netflix’s Video Multi-Method Assessment Fusion (VMAF), a perceptual quality metric, to guide per-title encoding: choose the lowest bitrate that still meets a target quality score, since visual complexity varies by content (e.g., an animation needs less bitrate than an action movie).
Read-Your-Writes Consistency: Ensure that a creator sees their video in the "My Videos" list immediately after upload by using session-based routing or a strongly consistent metadata store.
Design Breakdown

Functional Requirements

Upload: Users can upload videos up to 2GB.
View: Users can stream videos with minimal buffering.
Metadata: Users can search for videos by title/description.
Interaction: Basic view counts and "Like" functionality.

Non-Functional Requirements

High Availability: 99.99% for viewing (Read-heavy system).
Scalability: Support 100M Daily Active Users (DAU).
Reliability: Uploaded videos must never be lost (Durability).
Performance: Low startup latency for video playback.

Estimation

DAU: 100M.
Upload Rate: 1% of DAU upload 1 video/day = 1M videos/day.
Storage: 1M videos * 100MB (avg compressed) = 100 TB/day.
Egress (Streaming): 100M users * 5 videos * 50MB = 25 PB/day.
Implication: Massive CDN reliance is mandatory.
Write QPS (Metadata): 1M uploads / 86400s ≈ 12 QPS.
Read QPS (Metadata): 500M views / 86400s ≈ 6,000 QPS.
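The arithmetic above can be checked with a short script. It assumes decimal units (1 MB = 10^6 bytes, 1 PB = 10^15 bytes), which is the convention the round numbers in the estimate imply.

```python
SECONDS_PER_DAY = 86_400
MB = 10**6
TB = 10**12
PB = 10**15

dau = 100_000_000

# 1% of DAU upload one video per day.
uploads_per_day = dau // 100
# Each upload averages 100 MB after compression.
storage_per_day = uploads_per_day * 100 * MB

# Each user watches 5 videos of 50 MB each.
views_per_day = dau * 5
egress_per_day = views_per_day * 50 * MB

write_qps = uploads_per_day / SECONDS_PER_DAY
read_qps = views_per_day / SECONDS_PER_DAY

print(f"uploads/day: {uploads_per_day:,}")
print(f"storage/day: {storage_per_day / TB:.0f} TB")
print(f"egress/day:  {egress_per_day / PB:.0f} PB")
print(f"write QPS:   {write_qps:.1f}")
print(f"read QPS:    {read_qps:.0f}")
```

These are daily averages; peak traffic is typically a multiple of the average, so capacity planning would add headroom on top.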

Blueprint

Concise Summary: A microservices architecture leveraging asynchronous processing for video ingestion and a CDN-first approach for content delivery.
Major Components:
Upload Service: Handles resumable file uploads and initiates the processing pipeline.
Transcoding Worker: Decoupled fleet of workers that convert raw video into streamable formats.
Metadata DB: Distributed store for video info, user data, and view counts.
CDN: Globally distributed edge caches to serve video segments.
Simplicity Audit: This architecture avoids complex recommendation engines and real-time live-streaming synchronization, focusing strictly on the core loop: Upload -> Transcode -> Store -> Stream.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling: Services are deployed as Docker containers in K8s clusters. The Upload Service is stateless and scales based on active socket connections.
API Spec:
POST /v1/uploads: Initiates a session, returns an upload_id.
PUT /v1/uploads/{id}/chunks: Transfers binary data.
GET /v1/videos/{id}: Returns metadata and the HLS manifest URL (.m3u8).
Communication: Internal communication via gRPC; external via REST/HTTPS.
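As a rough illustration of what sits behind the two upload endpoints, here is an in-memory sketch of a resumable upload session. The class and method names (UploadSession, put_chunk, missing_chunks) are hypothetical, the use of SHA-256 per chunk is an assumption, and a real service would persist progress in the metadata store and write chunk bytes to object storage rather than keep them in memory.

```python
import hashlib
import uuid


class UploadSession:
    """Tracks which chunks of one upload have arrived (hypothetical sketch)."""

    def __init__(self, total_chunks: int):
        self.upload_id = uuid.uuid4().hex  # what POST /v1/uploads would return
        self.total_chunks = total_chunks
        self.received = {}  # chunk index -> bytes

    def put_chunk(self, index: int, data: bytes, sha256_hex: str) -> bool:
        """Accept a chunk only if its hash matches; resending is harmless."""
        if not 0 <= index < self.total_chunks:
            raise IndexError(f"chunk index {index} out of range")
        if hashlib.sha256(data).hexdigest() != sha256_hex:
            return False  # corrupted in transit; client should resend
        self.received[index] = data
        return True

    def missing_chunks(self) -> list:
        """Indices the client still has to send, e.g. after a reconnect."""
        return [i for i in range(self.total_chunks) if i not in self.received]

    def is_complete(self) -> bool:
        return not self.missing_chunks()
```

After a network failure the client asks for missing_chunks() and resumes from there instead of restarting the whole upload.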

Storage

Data Model:
Video Metadata: video_id (PK), creator_id, title, description, s3_path, status (Processing/Ready).
Database Logic:
Use Cassandra or DynamoDB for high-scale metadata to handle high-volume writes and horizontal scaling.
Use Elasticsearch as a secondary index for video search functionality.
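The data model above maps onto a small record type. This is a sketch only: the field names follow the list above, while the enum, the default status, and the mark_ready helper are assumptions added for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class VideoStatus(Enum):
    PROCESSING = "Processing"
    READY = "Ready"


@dataclass
class VideoMetadata:
    video_id: str  # primary key
    creator_id: str
    title: str
    description: str
    s3_path: str
    # Assumed default: a new upload starts in Processing until transcoding ends.
    status: VideoStatus = VideoStatus.PROCESSING


def mark_ready(video: VideoMetadata) -> VideoMetadata:
    """Hypothetical helper the pipeline could call once all renditions exist."""
    video.status = VideoStatus.READY
    return video
```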

Cache

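Data Structures: Redis Hashes for video metadata; atomic counters (INCR) for real-time view counts, HyperLogLog for approximate unique-viewer counts, and Sorted Sets for ranking videos by views. All of these absorb writes in memory to prevent DB thrashing.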
TTL: 24 hours for popular video metadata.
Eviction: LRU (Least Recently Used).
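To show why buffering view counts protects the database, here is a minimal write-back sketch in plain Python. The ViewCountBuffer class and its flush threshold are hypothetical, and the store attribute merely stands in for the durable metadata DB; in the design above the buffer would live in Redis rather than in process memory.

```python
from collections import Counter


class ViewCountBuffer:
    """Accumulates views in memory and writes them back in batches."""

    def __init__(self, flush_threshold: int = 1000):
        self.flush_threshold = flush_threshold
        self.pending = Counter()  # video_id -> views not yet persisted
        self.pending_total = 0
        self.store = Counter()  # stand-in for the durable metadata DB

    def record_view(self, video_id: str) -> None:
        self.pending[video_id] += 1
        self.pending_total += 1
        if self.pending_total >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        # One batched increment per video instead of one write per view.
        self.store.update(self.pending)
        self.pending.clear()
        self.pending_total = 0
```

The trade-off is that views recorded since the last flush are lost if the buffer crashes, which is acceptable for an approximate counter.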

Messaging

Topic Structure: video-transcode-tasks.
Delivery Guarantees: At-least-once delivery. Transcoding is idempotent (overwriting the same S3 path is safe).
Consumers: Transcoding workers pulling tasks. Using a queue (SQS/RabbitMQ) allows for graceful handling of spikes in uploads.
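At-least-once delivery means a worker may see the same task twice, so the handler has to be safe to repeat. The sketch below shows the idea with a dict standing in for S3; the key layout and the task fields are assumptions for illustration.

```python
def output_key(video_id: str, rendition: str) -> str:
    """Deterministic output path: the same task always maps to the same key."""
    return f"transcoded/{video_id}/{rendition}/index.m3u8"


def handle_task(task: dict, object_store: dict) -> str:
    """Process one transcode task. Running it twice overwrites the same key,
    so a redelivered message leaves the store in the same final state."""
    key = output_key(task["video_id"], task["rendition"])
    # Placeholder for the real transcode output.
    object_store[key] = f"manifest for {task['video_id']} at {task['rendition']}"
    return key
```

Because the key is derived only from the task contents, duplicates never create extra objects; a design that generated a random output name per attempt would not have this property.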

Data Processing

Transcoding DAG:
Inspection: Verify file integrity/codec.
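Splitting: Break video into roughly 4-second segments cut at GOP (Group of Pictures) boundaries, so each segment starts on a keyframe and can be transcoded independently.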
Parallel Transcode: Convert chunks to 360p, 720p, 1080p concurrently.
Packaging: Generate M3U8/MPD manifest files.
Technology: FFmpeg-based binaries running on GPU-optimized instances.
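The packaging step can be sketched for the HLS case as follows. The function builds a master playlist that points at one media playlist per rendition; the rendition table (bandwidths in bits per second, URIs) is an illustrative assumption, and a production packager would emit further tags such as codec information.

```python
# Hypothetical rendition table for the three resolutions in the DAG.
RENDITIONS = [
    {"bandwidth": 800_000, "width": 640, "height": 360, "uri": "360p/index.m3u8"},
    {"bandwidth": 2_500_000, "width": 1280, "height": 720, "uri": "720p/index.m3u8"},
    {"bandwidth": 5_000_000, "width": 1920, "height": 1080, "uri": "1080p/index.m3u8"},
]


def build_master_playlist(renditions: list) -> str:
    """Return an HLS master playlist listing each variant stream."""
    lines = ["#EXTM3U"]
    for r in renditions:
        lines.append(
            f"#EXT-X-STREAM-INF:BANDWIDTH={r['bandwidth']},"
            f"RESOLUTION={r['width']}x{r['height']}"
        )
        lines.append(r["uri"])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_master_playlist(RENDITIONS))
```

The player fetches this file first, then chooses among the listed variants based on measured bandwidth.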
Wrap Up

Advanced Topics

Trade-offs: We prioritize Availability over Consistency (AP) for view counts and metadata. It is acceptable if a user sees "1.1M views" while another sees "1.11M views."
Bottlenecks: The Transcoding Pipeline is the most expensive and slowest part.
Optimization: Implement "Priority Queues" so that "Verified Creators" or short videos are processed faster than 2-hour long uploads.
Failure Handling:
S3 Replication: Cross-region replication for disaster recovery.
Worker Retries: If a transcoding worker fails mid-task, the message is never acknowledged; once its Visibility Timeout expires, it becomes visible in the MQ again and is retried by another worker.
Alternatives:
Database: Could use Sharded PostgreSQL if complex relational queries (e.g., joins on user subscriptions) are required immediately.
Direct S3 Upload: Instead of an Upload Service proxy, use S3 Presigned URLs to let clients upload directly to S3, reducing server bandwidth costs. (Recommended for the next iteration).
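Returning to the priority-queue optimization noted under Bottlenecks, one possible ordering rule is sketched below. The class name and the rule itself (verified creators first, then shorter videos first, ties broken by arrival order) are assumptions; a managed queue such as SQS would more likely use separate queues per priority tier.

```python
import heapq
import itertools


class TranscodeQueue:
    """In-memory priority queue for transcode tasks (hypothetical sketch)."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker preserving arrival order

    def push(self, video_id: str, duration_s: int, verified: bool) -> None:
        # Lower tuples pop first: verified creators, then shorter videos.
        priority = (0 if verified else 1, duration_s)
        heapq.heappush(self._heap, (priority, next(self._seq), video_id))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

A strict ordering like this can starve long uploads under sustained load, so a real scheduler would usually add aging or reserve some capacity for the low-priority tier.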