The Question
Design

Music Streaming System Design

Design a globally scalable music streaming platform capable of serving millions of concurrent users. The system must support low-latency audio playback, efficient track search, and high-volume event logging for royalty processing and user analytics, while ensuring the high availability of the content catalog.
Key technologies: CDN, PostgreSQL, Elasticsearch, S3, Kafka
Questions & Insights

Clarifying Questions

Scale and Traffic: What is the target Monthly Active User (MAU) count and concurrent playback volume?
Assumption: 500M MAU, 50M Daily Active Users (DAU), with a peak of 10M concurrent listeners.
Content Catalog: How many tracks are we hosting, and what is the typical growth rate?
Assumption: 100 million tracks.
Core MVP Scope: Does the MVP include social features (friend feeds), real-time collaborative playlists, or just core search and playback?
Assumption: Core playback, search, and basic library management (playlists/likes).
Audio Quality: Are we supporting multiple bitrates or lossless audio?
Assumption: Standard lossy codecs (AAC or Ogg Vorbis) at 96 kbps, 160 kbps, and 320 kbps to optimize for mobile bandwidth; no lossless tier in the MVP.
Geographic Distribution: Is this a global rollout or regional?
Assumption: Global distribution requiring a heavy reliance on CDNs for low-latency streaming.

Thinking Process

Key Points: High availability for playback, extremely low latency for "Time to First Byte" (TTFB), and cost-efficient storage of massive binary blobs.
Logical Flow:
How do we store and serve 100M+ high-quality audio files globally? (Object Storage + CDN).
How do we model track metadata and user libraries for high-speed retrieval? (Relational DB for consistency + Cache for reads).
How do we ensure search is instantaneous as the user types? (Search Indexing).
How do we capture play-logs for royalties and recommendations without blocking the user? (Asynchronous Event Processing).

Bonus Points

Content-Addressable Storage (CAS): Using hash-based storage IDs to prevent duplicate uploads of the same audio file across different albums/re-releases, saving petabytes of storage.
Adaptive Bitrate Streaming: Implementing HLS (HTTP Live Streaming) or DASH so that the player switches between the pre-encoded bitrates based on the user's current network throughput. The CDN edge simply serves whichever rendition the client requests.
Fan-out for Artists: Using a push-model with a dedicated "Inbox" pattern for celebrity artists to notify millions of followers of new releases without overwhelming the primary database.
Regional Data Pinning: Complying with GDPR/CCPA by ensuring user profile data stays within specific geographic boundaries while content (music) is replicated globally.
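
The content-addressable storage idea above can be sketched in a few lines. This is a minimal illustration, not a production design: the `audio/<prefix>/<digest>` key layout, the `ContentAddressableStore` class, and the in-memory dict standing in for object storage are all assumptions made for the example.

```python
import hashlib


def content_address(audio_bytes: bytes) -> str:
    """Derive a storage key from the file contents (hypothetical key layout)."""
    digest = hashlib.sha256(audio_bytes).hexdigest()
    # A two-character prefix spreads keys across object-store partitions.
    return f"audio/{digest[:2]}/{digest}"


class ContentAddressableStore:
    """In-memory stand-in for an object store such as S3."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, audio_bytes: bytes) -> str:
        key = content_address(audio_bytes)
        # Identical bytes map to the same key, so a re-upload stores nothing new.
        self._blobs.setdefault(key, audio_bytes)
        return key

    def blob_count(self) -> int:
        return len(self._blobs)


store = ContentAddressableStore()
album_key = store.put(b"fake-audio-bytes")
rerelease_key = store.put(b"fake-audio-bytes")  # same file on a re-release
print(album_key == rerelease_key, store.blob_count())
```

Note that hashing only deduplicates byte-identical files; a remaster or a re-encode of the same recording produces a different digest and is stored separately.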
Design Breakdown

Functional Requirements

Users can search for tracks, artists, and albums.
Users can stream audio in high quality.
Users can create and manage playlists.
Artists/Admins can upload new tracks.

Non-Functional Requirements

Low Latency: Playback should start in < 200ms.
High Availability: 99.99% uptime (Music is a "utility" for many).
Scalability: Must handle massive spikes during major album releases.
Data Integrity: User playlists and library "likes" must never be lost.

Estimation

Storage: 100M tracks * 3 bitrates * 5 MB/track (avg) ≈ 1.5 PB.
Bandwidth: 10M concurrent users * 160 kbps ≈ 1.6 Tbps total egress.
Metadata Storage: 100M tracks * 1 KB metadata ≈ 100 GB (Fits easily in a modern DB).
QPS (Queries Per Second): 50M DAU * 20 requests/day / 86,400 s ≈ 11,600 Average QPS. Peak QPS ≈ 50,000 (roughly 4x the average).
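
The arithmetic behind these estimates can be checked with a short script. It uses decimal units (1 PB = 10^15 bytes) and the same inputs as the assumptions above; the variable names are only for illustration.

```python
# Back-of-envelope check of the capacity estimates (decimal units).
TRACKS = 100_000_000
BITRATES = 3
AVG_FILE_BYTES = 5 * 10**6           # 5 MB per encoded file (assumed average)
CONCURRENT_LISTENERS = 10_000_000
STREAM_BITS_PER_SEC = 160_000        # 160 kbps
METADATA_BYTES_PER_TRACK = 1_000     # 1 KB
DAU = 50_000_000
REQUESTS_PER_USER_PER_DAY = 20
SECONDS_PER_DAY = 86_400

audio_storage_pb = TRACKS * BITRATES * AVG_FILE_BYTES / 1e15
egress_tbps = CONCURRENT_LISTENERS * STREAM_BITS_PER_SEC / 1e12
metadata_gb = TRACKS * METADATA_BYTES_PER_TRACK / 1e9
average_qps = DAU * REQUESTS_PER_USER_PER_DAY / SECONDS_PER_DAY

print(f"audio storage: {audio_storage_pb:.1f} PB")
print(f"peak egress:   {egress_tbps:.1f} Tbps")
print(f"metadata:      {metadata_gb:.0f} GB")
print(f"average QPS:   {average_qps:,.0f}")
```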

Blueprint

Concise Summary: A microservices-based architecture leveraging a Multi-CDN strategy for media delivery and a partitioned relational database for metadata management.
Major Components:
CDN (Content Delivery Network): Globally distributed edge nodes to cache and serve audio files near the user.
API Gateway: Entry point for authentication, rate limiting, and request routing.
Metadata Service: Manages artists, albums, and tracks using a relational schema for strict consistency.
Search Service: Provides full-text search capabilities using a dedicated search index.
Audio Processing Service: Transcodes uploaded tracks into multiple bitrates and stores them in Object Storage.
Simplicity Audit: The MVP avoids P2P networking (which Spotify used historically but phased out) and complex real-time recommendation engines in favor of simple, scalable cloud-native components.
Architecture Decision Rationale:
Why this architecture fits the problem: Decoupling the "heavy" media delivery (CDN) from the "light" metadata management (API) allows each to scale independently.
Functional Requirement Satisfaction: Search is handled by Elasticsearch; Streaming is handled by CDN; Playlists are handled by Postgres.
Non-functional Requirement Satisfaction: CDN ensures low latency; Read-replicas and caching ensure high availability and scalability.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling: Services are deployed in Kubernetes clusters across 3 Availability Zones. Horizontal Pod Autoscaling (HPA) is used to scale based on CPU/Request count.
API Spec:
GET /v1/search?q={query}: Returns track/artist/album list.
GET /v1/tracks/{id}: Returns metadata and a signed CDN URL for the audio file.
POST /v1/library/playlists: Create a new playlist.
POST /v1/play-events: Async heartbeat for "now playing" and royalty tracking.
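
To make the "signed CDN URL" returned by GET /v1/tracks/{id} concrete, here is a generic HMAC-based sketch. Real CDNs each define their own signing scheme, so the `expires`/`sig` query parameters, the `cdn.example.com` host, the object path, and the shared secret below are hypothetical and chosen only to show the idea.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

CDN_HOST = "https://cdn.example.com"   # hypothetical CDN host
SIGNING_SECRET = b"demo-secret"        # hypothetical key shared by API and CDN edge


def _signature(path: str, expires_at: int, secret: bytes) -> str:
    message = f"{path}:{expires_at}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def sign_url(path: str, expires_at: int, secret: bytes = SIGNING_SECRET) -> str:
    """Return a URL that is only valid for `path` until `expires_at` (Unix seconds)."""
    query = urlencode({"expires": expires_at, "sig": _signature(path, expires_at, secret)})
    return f"{CDN_HOST}{path}?{query}"


def verify_url(url: str, now: int, secret: bytes = SIGNING_SECRET) -> bool:
    """Check performed at the edge: signature must match and the URL must not be expired."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires_at = int(params["expires"][0])
        provided = params["sig"][0]
    except (KeyError, ValueError):
        return False
    expected = _signature(parsed.path, expires_at, secret)
    return hmac.compare_digest(provided, expected) and now < expires_at


url = sign_url("/audio/track-123/160.ogg", expires_at=1_000)
print(verify_url(url, now=999), verify_url(url, now=1_001))
```

Because the signature covers the path, a client cannot reuse a URL issued for the 160 kbps file to fetch a different object, and the short expiry limits link sharing.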

Storage

Data Model:
Tracks: id (UUID), artist_id, album_id, duration, s3_path, genre.
Playlists: id, user_id, title, visibility.
Playlist_Tracks: playlist_id, track_id, position.
Database Logic:
Postgres is partitioned by user_id for library data to allow horizontal scaling.
Read-replicas are used for global artist/track lookups.
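
Partitioning library data by user_id needs a stable way to map a user to a shard. The sketch below assumes a fixed shard count and simple hash-modulo routing; `SHARD_COUNT` and `shard_for_user` are illustrative names, not part of the design above.

```python
import hashlib

SHARD_COUNT = 16  # hypothetical number of Postgres shards


def shard_for_user(user_id: str, shard_count: int = SHARD_COUNT) -> int:
    """Map a user_id to a shard index with a stable hash.

    Python's built-in hash() is salted per process for strings, so a
    cryptographic digest is used to keep routing consistent across servers.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# All playlists and likes for one user land on the same shard,
# so a "load my library" request touches a single database.
print(shard_for_user("user-42"), shard_for_user("user-42"))
```

Hash-modulo routing is the simplest option, but changing the shard count remaps most users; consistent hashing or a lookup table of user-to-shard assignments avoids that at the cost of extra machinery.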

Cache

Data Structures: Redis Strings for Track Metadata; Redis Sorted Sets for "Trending Tracks" (per region).
TTL & Eviction: 24-hour TTL for metadata. LRU (Least Recently Used) eviction policy.
Logic: API checks Redis first; on miss, fetches from Postgres and hydrates the cache.
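
The read path described above is the cache-aside pattern. The following sketch uses a small in-memory TTL cache as a stand-in for Redis and a plain function as a stand-in for the Postgres query; the `TTLCache` and `MetadataReader` classes are hypothetical, and LRU eviction is omitted for brevity.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """In-memory stand-in for Redis with per-entry expiry (no LRU eviction)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[dict, float]] = {}

    def get(self, key: str) -> Optional[dict]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key: str, value: dict) -> None:
        self._entries[key] = (value, self._clock() + self._ttl)


class MetadataReader:
    """Cache-aside read path: check the cache, fall back to the DB, then hydrate."""

    def __init__(self, cache: TTLCache, load_from_db: Callable[[str], Optional[dict]]):
        self._cache = cache
        self._load_from_db = load_from_db

    def get_track(self, track_id: str) -> Optional[dict]:
        key = f"track:{track_id}"
        cached = self._cache.get(key)
        if cached is not None:
            return cached
        row = self._load_from_db(track_id)
        if row is not None:
            self._cache.set(key, row)  # hydrate so the next read is a hit
        return row
```

A usage example with a fake database function: the first call for a track hits the database, and repeat calls within the 24-hour TTL are served from the cache.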

Messaging

Topic Structure: play_logs, user_signup, track_upload.
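Delivery Guarantees: At-least-once delivery for royalty reporting. Because events can be redelivered, consumers must process them idempotently (for example, by deduplicating on an event ID) so that a replayed event does not count as a second play.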
Consumers: An analytics worker consumes play_logs to update user history and calculate monthly artist payouts.
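
With at-least-once delivery the same play event can reach the consumer more than once, so royalty counts need deduplication. The sketch below assumes each event carries a unique `event_id` assigned by the client (a field not specified above) and keeps the seen-ID set in memory; a real worker would persist that state.

```python
from collections import defaultdict


class PlayLogConsumer:
    """Idempotent consumer: a redelivered event does not change the counts."""

    def __init__(self) -> None:
        self._seen_event_ids: set[str] = set()
        self.play_counts: dict[str, int] = defaultdict(int)

    def handle(self, event: dict) -> bool:
        """Process one play_logs event; return False if it was a duplicate."""
        event_id = event["event_id"]  # hypothetical field, assumed unique per play
        if event_id in self._seen_event_ids:
            return False
        self._seen_event_ids.add(event_id)
        self.play_counts[event["track_id"]] += 1
        return True


consumer = PlayLogConsumer()
event = {"event_id": "e-1", "track_id": "track-123", "user_id": "user-42"}
consumer.handle(event)
consumer.handle(event)  # broker redelivery after a missed acknowledgement
print(consumer.play_counts["track-123"])
```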

Analytics

Data Modeling: Star schema in BigQuery.
ETL Flow: Kafka Connect streams events directly from Kafka to BigQuery.
Purpose: Used for business intelligence, royalty calculations, and periodic recommendation model training.
Wrap Up

Advanced Topics

Monitoring:
CloudWatch/Datadog: Monitor API 4xx/5xx error rates and CDN cache hit ratios.
Critical Metric: "Playback Start Latency" (Time from clicking play to audio start).
Trade-offs:
Consistency vs. Availability: We choose Eventual Consistency for "Like" counts and "Play" counts to maintain high availability under load.
Bottlenecks:
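Database Writes: During a "New Music Friday," playlist additions spike. Sharding library data by user_id spreads this write load across databases; write-ahead logging keeps the writes durable but does not by itself add write throughput.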
Failure Handling:
CDN Failover: The client is configured with a secondary CDN URL and falls back to it if requests to the primary CDN fail.
Circuit Breakers: Implemented in the API Gateway to prevent cascading failures if the Metadata service slows down.
Alternatives & Optimization:
Storage: Could use MinIO for on-premise object storage to reduce S3 costs, but S3 is preferred for MVP speed.
Protocols: Use gRPC for inter-service communication to reduce serialization overhead compared to JSON/REST.
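
The circuit breaker mentioned under Failure Handling can be illustrated with a minimal sketch. It is a simplified version of the pattern under stated assumptions: the thresholds, the `CircuitBreaker` and `CircuitOpenError` names, and the injectable clock are all chosen for the example, and production gateways usually rely on a library or a service-mesh feature instead.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected without reaching the downstream service."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        return self._opened_at is not None

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                # Open: fail fast instead of piling load on a slow service.
                raise CircuitOpenError("downstream calls are temporarily blocked")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The gateway would wrap each Metadata service call in `breaker.call(...)` and return a cached or degraded response when `CircuitOpenError` is raised.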