The Question
Design

Ultra-Scale Metadata Storage for Video Streaming

Design a highly-available metadata storage system for a global video streaming platform with 100M+ DAUs. The system must handle massive write loads for watch progress tracking (1M+ events/sec) and provide low-latency access to the content catalog. Specifically, address how you would scale a relational database architecture (Postgres-based) to meet these requirements while handling connection management, data partitioning, and write-heavy bottlenecks. Discuss the trade-offs between consistency and performance for playback features.
PostgreSQL
Redis
Kafka
PGBouncer
S3
CDN
CDC
JSONB
Questions & Insights

Clarifying Questions

Scale and Traffic: What is the target Daily Active User (DAU) count and what is the expected write QPS for features like "watch progress" heartbeats?
Data Characteristics: Are we storing only metadata (titles, user profiles, watch history) or are we attempting to store binary large objects (BLOBs) like video segments in Postgres?
Consistency vs. Availability: For the streaming metadata, is strict ACID compliance required for every transaction (e.g., billing), or is eventual consistency acceptable for watch history?
Geographic Distribution: Is the application global, requiring multi-region data locality for low-latency metadata access?
Assumptions for this design:
DAU: 100 million.
Write Load: Watch progress heartbeats occur every 10 seconds per active user (approx. 1M writes/sec at peak).
Storage Scope: Postgres will handle structured metadata (Catalog, User data, History); S3/CDN will handle video binary data.
Scale Strategy: We will utilize Postgres with horizontal sharding and partitioning to handle "ultra-scale."

Thinking Process

Core Bottleneck: The primary bottleneck in ultra-scale Postgres is connection management and write-throughput on a single primary node.
Strategy Questions:
How do we handle 1M+ concurrent connections without crashing the Postgres backend?
How do we partition data to ensure no single table exceeds the performance "cliff" (typically >100GB or millions of rows)?
How do we scale writes horizontally while maintaining Postgres's relational benefits?
How do we protect the database from the massive heartbeat traffic of millions of concurrent streamers?

Bonus Points

Logical Decoding (CDC): Using Postgres Logical Replication slots to stream changes to an analytics engine or search index (Elasticsearch) without impacting read performance.
Partial Indexing: Implementing partial indexes (e.g., WHERE status = 'active') to keep index sizes small and memory-resident.
BRIN Indexes: Utilizing Block Range Indexes for massive time-series tables (like logs or history) to save space and speed up range scans.
Zero-Downtime Sharding: Using a "Shadow Sharding" approach or Citus-like shard rebalancer to move data between nodes without stopping writes.
Design Breakdown

Functional Requirements

Core Use Cases:
Users can browse a massive catalog of videos.
Users can resume videos from where they left off (Watch Progress).
Users can manage their profiles and preferences.
Scope Control:
In-Scope: Database schema design, sharding strategy, connection pooling, and caching.
Out-of-Scope: Video transcoding pipelines, CDN content propagation algorithms, and billing/payment gateways.

Non-Functional Requirements

Scale: Support 100M users and petabytes of metadata.
Latency: Metadata retrieval (e.g., "Continue Watching") must be < 50ms (p99).
Availability & Reliability: 99.99% uptime; no data loss for user progress.
Consistency: Strong consistency for user accounts; Eventual consistency for watch history and view counts.
Fault Tolerance: Automatic failover for database primaries.

Estimation

Traffic Estimation:
10M concurrent users 1 heartbeat/10s = 1 Million Writes/Sec**.
Catalog browsing = 500k Reads/Sec.
Storage Estimation:
User Metadata: 100M users * 1KB = 100GB.
Watch History (3 years): 100M users 500 entries * 200 bytes = 10TB**.
Bandwidth Estimation:
1M writes/sec 500 bytes payload = 500 MB/s** incoming to the DB cluster.

Blueprint

Concise Summary: A microservices-based architecture utilizing a sharded Postgres cluster managed via a proxy layer (like Vitess or Citus logic) to handle ultra-scale metadata.
Major Components:
PGBouncer: A lightweight connection pooler to manage thousands of client connections per DB node.
Sharded Postgres Cluster: Metadata is split by user_id across multiple primary instances.
Redis Cache: Offloads catalog and session lookups from the database.
Kafka: Acts as a buffer/write-behind layer for high-frequency watch history updates.
Simplicity Audit: This design uses Postgres for its reliability while introducing sharding only where necessary (user data). The MVP avoids complex multi-master setups in favor of a simpler primary-replica model per shard.
Architecture Decision Rationale:
Why Postgres?: Proven reliability, excellent JSONB support for semi-structured metadata, and robust indexing.
Functional Satisfaction: Sharding ensures the "ultra-scale" user base doesn't hit single-node limits.
Non-functional Satisfaction: Redis and Kafka protect the DB from the 1M QPS heartbeat load, ensuring low latency.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: CDN (Cloudfront/Akamai) caches static assets and video fragments. Global DNS directs users to the nearest regional data center.
Security & Perimeter: API Gateway handles JWT authentication and rate-limiting. It prevents "Retry Storms" by implementing circuit breakers at the entry point.

Service

Topology & Scaling: Services are stateless Docker containers on K8s. Scaling is triggered by CPU (>60%) or Request Count.
API Schema Design:
POST /v1/playback/progress: Updates current timestamp. Async via Kafka.
GET /v1/catalog/{id}: Returns video metadata. Cache-first.
Resilience: Services use exponential backoff for DB connections.

Storage

Access Pattern: High-frequency writes (Progress) and read-heavy catalog. 1:10 write-to-read ratio for user profiles, but 100:1 for progress heartbeats.
Database Table Design:
Table: Users: user_id (UUID, PK), email, profile_json (JSONB).
Table: WatchHistory: history_id (BIGSERIAL), user_id (FK), video_id, last_position, updated_at. Partitioned by updated_at (range) and sharded by user_id (hash).
Technical Selection: Postgres (v15+) for its native declarative partitioning and logical replication.
Distribution Logic: Hash Sharding on `user_id. This ensures all data for a single user resides on one shard, enabling efficient joins for "My List" or "History."
Reliability: WAL (Write Ahead Log) archiving to S3 for Point-in-Time Recovery (PITR).

Cache

Purpose: Reduce read pressure on Postgres for hot catalog items (Trending movies).
Key-Value Schema: cat:{video_id} -> JSON string of metadata. TTL: 1 hour.
Technical Selection: Redis Cluster.
Failure Handling: If Redis fails, the system falls back to Postgres with strict rate-limiting to prevent DB meltdown (Cache Stampede protection).

Messaging

Purpose: Write-behind pattern. Playback heartbeats are too frequent for direct DB inserts.
Event Schema: {user_id, video_id, timestamp, device_id}.
Throughput: Kafka handles the 1M QPS burst. History Processor consumes in batches (e.g., INSERT ... ON CONFLICT DO UPDATE) to reduce Postgres IOPS.
Technical Selection: Kafka. High throughput and persistence allow for replaying events if the DB shard is offline.

Data Processing

Processing Model: Stream processing (History Processor) aggregates multiple heartbeats for the same user within a 10s window, sending only the final state to the DB.
Technical Selection: Custom Go-based worker or Flink for windowed aggregations.
Wrap Up

Advanced Topics

Trade-offs: We trade off real-time consistency in watch progress for system stability. A user might see a slightly stale "resume" point if they switch devices instantly, due to the Kafka buffer.
Postgres vs. NoSQL: While Cassandra or DynamoDB are great for high-write "Watch History," Postgres with Sharding + Kafka is chosen to keep the ecosystem unified and leverage SQL's powerful querying for complex catalog filtering.
Connection Scaling: Postgres handles connections poorly (one process per connection). Using PGBouncer in Transaction Mode is mandatory to support 100k+ application-level connections.
Vacuum Management: In high-write environments, autovacuum must be tuned aggressively (increase autovacuum_max_workers and vacuum_cost_limit) to prevent transaction ID wraparound and table bloat.