The Question
Design

Scalable Social Media Feed Design

Design a high-scale social media platform similar to Instagram that allows users to upload photos, follow others, and view a real-time feed. The system must support 500 million daily active users and handle the 'celebrity problem' where certain accounts have millions of followers. Focus on the end-to-end architecture for media storage, metadata management, and the feed generation strategy (push vs. pull). Detail how you would ensure sub-second latency for feed retrieval and high availability for uploads.
PostgreSQL
Redis
S3
Kafka
CDN
ZSET
JWT
Kubernetes
Questions & Insights

Clarifying Questions

Scale: What is the target Daily Active User (DAU) count and the expected Read/Write ratio?
Media Content: Does the MVP support both images and videos, or strictly images? What is the average file size?
Feed Logic: Is the feed strictly chronological for the MVP, or do we need to support an algorithmic ranking engine?
Social Graph: What is the maximum number of people a user can follow, and what is the maximum number of followers a user can have (e.g., celebrity handling)?
Privacy: Do we need to support private accounts and "Close Friends" lists for the MVP?
Assumptions:
DAU: 500 Million.
Read/Write Ratio: 100:1 (Heavy read-centric system).
Media: Images only for MVP, average size 1MB (optimized to 200KB for mobile).
Feed: Chronological feed of followed users.
Scale: High celebrity fan-out must be handled (up to 50M followers).

Thinking Process

The core challenge of Instagram is the Fan-out process: how to distribute a single post to millions of followers efficiently without crashing the system or causing high latency.
How do we handle high-volume media uploads without blocking the metadata service? (Decouple media storage using S3 and handle metadata in a structured DB).
How do we generate the home feed for 500M users with sub-200ms latency? (Move from "Compute-on-read" to "Compute-on-write" using a Feed Cache).
How do we handle "Celebrity" accounts where a single post triggers 50M cache updates? (Implement a Hybrid Fan-out model: Push for normal users, Pull/On-demand for celebrities).
How do we ensure the system is globally performant? (Utilize a Geo-distributed CDN for media and edge-caching for popular metadata).

Bonus Points

Hybrid Fan-out Model: Implementing a "Celebrity Threshold." Users with 100k or more followers have their posts pulled at read-time by followers, while all other users have their posts pushed to follower caches at write-time.
Media Optimization Pipeline: Using a DAG-based approach for image processing to generate multiple resolutions (thumbnail, standard, HD) and formats (WebP/HEIC) asynchronously.
Z-Order Curves/Geo-sharding: If location-based search is required, using Z-order curves or Geo-hashes to shard the spatial data.
Consistent Hashing with V-Nodes: To ensure even distribution across the Feed Cache cluster and prevent "hot shards" when a celebrity's feed is accessed frequently.
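Illustrative sketch: the consistent-hashing point above can be shown with a small ring. The class name HashRing, the node names, and the choice of 100 virtual nodes per server are assumptions made for this example, not part of the design.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Stable 64-bit hash; Python's built-in hash() is salted per process.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, nodes, vnodes=100):
        # Each physical node is placed on the ring `vnodes` times.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["redis-a", "redis-b", "redis-c"])
owner = ring.node_for("feed:12345")
```

Removing a node only remaps the keys that node owned. Every other key keeps its owner, which is the property that makes resizing the cache cluster cheap.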
Design Breakdown

Functional Requirements

Core Use Cases:
Users can upload images with captions.
Users can follow/unfollow other users.
Users can view a chronological feed of posts from people they follow.
Users can search for other users by username.
Scope Control:
In-Scope: Image uploads, Following, Chronological Feed, Basic Profile.
Out-of-Scope: Video/Reels, Stories, DMs, Comments/Likes, Algorithmic Discovery, Tagging.

Non-Functional Requirements

Scale: Support 500M DAU and roughly 180B new posts per year (500M posts/day, matching the estimation below).
Latency: Feed retrieval < 200ms; Image upload < 2s (excluding network transfer).
Availability & Reliability: 99.99% availability (CAP Theorem: Favor Availability over Consistency - AP).
Consistency: Eventual consistency for the Feed and Follower counts is acceptable.
Fault Tolerance: No single point of failure; multi-AZ deployment for data stores.
Security & Privacy: Secure media access via signed URLs; TLS for all transit.

Estimation

Traffic Estimation:
500M DAU * 1 post/user/day / 86,400 s = ~5,800 Write QPS.
500M DAU * 100 feed views/user/day / 86,400 s = ~580,000 Read QPS.
Storage Estimation:
500M images/day * 1MB/image = 500 TB/day.
500 TB * 365 days = ~182 PB/year (Before compression/resizing).
Bandwidth Estimation:
Inbound: 5,800 * 1MB = 5.8 GB/s.
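Outbound: 580,000 * 200KB (avg thumbnail) = 116 GB/s. This assumes one image fetched per feed view, so it is a lower bound; most of this traffic is served by the CDN.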
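Worked arithmetic: the estimates above can be reproduced with a few lines of Python. The inputs are the stated assumptions (500M DAU, 1 post and 100 feed views per user per day, 1MB uploads, 200KB served images). Decimal units are assumed, and one image per feed view is a simplification.

```python
DAU = 500_000_000
SECONDS_PER_DAY = 86_400
POSTS_PER_USER_PER_DAY = 1
FEED_VIEWS_PER_USER_PER_DAY = 100
UPLOAD_BYTES = 1_000_000       # 1 MB original image
THUMBNAIL_BYTES = 200_000      # 200 KB image served to mobile clients

write_qps = DAU * POSTS_PER_USER_PER_DAY / SECONDS_PER_DAY
read_qps = DAU * FEED_VIEWS_PER_USER_PER_DAY / SECONDS_PER_DAY

storage_tb_per_day = DAU * POSTS_PER_USER_PER_DAY * UPLOAD_BYTES / 1e12
storage_pb_per_year = storage_tb_per_day * 365 / 1000

inbound_gb_per_s = write_qps * UPLOAD_BYTES / 1e9
outbound_gb_per_s = read_qps * THUMBNAIL_BYTES / 1e9

print(f"write QPS  ~{write_qps:,.0f}")
print(f"read QPS   ~{read_qps:,.0f}")
print(f"storage    {storage_tb_per_day:,.0f} TB/day, {storage_pb_per_year:,.1f} PB/year")
print(f"bandwidth  in ~{inbound_gb_per_s:.1f} GB/s, out ~{outbound_gb_per_s:.0f} GB/s")
```

These are daily averages. Peak traffic is typically a multiple of them, so capacity planning would apply a peak factor on top.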

Blueprint

Concise Summary: The system uses a microservices architecture where media is decoupled from metadata. A hybrid push/pull fan-out mechanism ensures high-speed feed delivery via Redis, while S3 and a Sharded Relational Database provide durable storage.
Major Components:
API Gateway: Handles authentication, rate limiting, and request routing.
Media Service: Manages image uploads to S3 and triggers the processing pipeline.
Post Service: Manages post metadata (captions, timestamps, locations) in PostgreSQL.
Follow Service: Stores the social graph (User A follows User B).
Fanout Processor: Asynchronously pushes new post IDs to the Redis caches of followers.
Feed Service: Aggregates post metadata from the cache and DB to serve the user's timeline.
Simplicity Audit: This architecture avoids complex machine learning for the MVP and uses industry-standard components (Redis/Kafka) for the heavy lifting of real-time distribution.
Architecture Decision Rationale:
Why this architecture: Pushing post IDs at write time turns a feed read into a single cache lookup for the majority of users, and pulling celebrity posts at read time avoids tens of millions of cache writes for a single post.
Functional Requirement Satisfaction: Meets all core CRUD requirements for posts and follows while enabling a fast feed.
Non-functional Requirement Satisfaction: Scales horizontally via sharding and asynchronous processing; provides high availability by using AP-oriented stores where possible.
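Illustrative sketch: on the read path, the Feed Service has to combine the pre-computed (pushed) feed with posts pulled from followed celebrities. A minimal version of that merge is shown below. It assumes each source is already sorted newest first as (timestamp, post_id) pairs; the function name merge_feed and the input shapes are hypothetical.

```python
import heapq


def merge_feed(pushed, celebrity_timelines, limit=20):
    """Merge newest-first (timestamp, post_id) lists into one page of post IDs."""
    merged = heapq.merge(
        pushed, *celebrity_timelines, key=lambda entry: entry[0], reverse=True
    )
    page, seen = [], set()
    for _timestamp, post_id in merged:
        if post_id in seen:
            continue  # the same post may appear in more than one source
        seen.add(post_id)
        page.append(post_id)
        if len(page) == limit:
            break
    return page


pushed_feed = [(300, "p3"), (100, "p1")]
celebrity_feeds = [[(250, "c2"), (50, "c0")], [(200, "d1")]]
first_page = merge_feed(pushed_feed, celebrity_feeds, limit=3)
```

Because heapq.merge is lazy, the loop stops as soon as the page is full instead of sorting every candidate post.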

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing:
CDN (CloudFront/Akamai): Caches processed images closer to users to reduce latency and egress costs.
DNS: Latency-based routing to the nearest regional Data Center.
Security & Perimeter:
API Gateway: Performs JWT validation.
Rate Limiting: Limits uploads (e.g., 100/hour) to prevent spam/DDoS.
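Illustrative sketch: the upload limit could be enforced with a fixed-window counter per user. The version below keeps counters in process memory for clarity. A real gateway would more likely keep them in a shared store, and the class name and parameters are assumptions made for this example.

```python
import time
from collections import defaultdict


class FixedWindowLimiter:
    """Allow at most `limit` actions per user in each fixed time window."""

    def __init__(self, limit=100, window_seconds=3600):
        self.limit = limit
        self.window = window_seconds
        self._counts = defaultdict(int)  # (user_id, window index) -> count

    def allow(self, user_id, now=None):
        now = time.time() if now is None else now
        key = (user_id, int(now // self.window))
        if self._counts[key] >= self.limit:
            return False
        self._counts[key] += 1
        return True


limiter = FixedWindowLimiter(limit=100, window_seconds=3600)
accepted = limiter.allow(user_id=42)
```

A fixed window permits short bursts around a window boundary. A sliding window or token bucket smooths that out at the cost of more state. This sketch also never evicts old windows, which a long-running process would need to do.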

Service

Topology & Scaling:
Stateless Services: All services are deployed in Kubernetes (EKS/GKE) across multiple Availability Zones.
Scaling: Media service scales on CPU (image processing); Feed service scales on QPS.
API Schema Design:
POST /v1/posts: Upload metadata and media reference. Protocol: REST.
GET /v1/feed: Retrieve current user's feed. Protocol: REST/gRPC.
Resilience & Reliability:
Circuit Breakers: Prevents a slow Follow Service from taking down the Feed Service.
Retries: Exponential backoff for Fanout Processor tasks.
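Illustrative sketch: exponential backoff for a fan-out task, with random jitter added so that many workers retrying at once do not hit the downstream service in lockstep. The base delay, cap, and attempt count are placeholder values, and the helper names are hypothetical.

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0):
    # "Full jitter": a random delay up to the capped exponential bound.
    return random.uniform(0, min(cap, base * 2 ** attempt))


def run_with_retries(task, attempts=5, base=0.5, cap=30.0, sleep=time.sleep):
    """Call task() until it succeeds or `attempts` calls have failed."""
    for attempt in range(attempts):
        try:
            return task()
        except Exception:  # a real worker would catch narrower, retryable errors
            if attempt == attempts - 1:
                raise  # out of retries; surface the last error
            sleep(backoff_delay(attempt, base, cap))
```

The sleep function is injectable so the retry loop can be tested without waiting. Retried fan-out tasks should be idempotent, which holds for inserting a post ID into a sorted set.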

Storage

Access Pattern:
Posts: Written once, then read heavily for the first 24-48 hours.
Follows: Read-heavy for social graph lookups.
Database Table Design:
Users: user_id (PK), username, email, created_at.
Posts: post_id (PK), user_id (FK), media_url, caption, created_at.
Follows: follower_id, followee_id, created_at. (Composite PK on follower_id, followee_id).
Technical Selection:
PostgreSQL: Used for Users, Posts, and Follows metadata due to strong ACID requirements for the social graph.
Distribution Logic:
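Sharding: Shard Posts by user_id to keep a user's content on one shard. Shard Follows by follower_id, which keeps "who do I follow" lookups on one shard; the fan-out's reverse lookup (all followers of an author) then spans shards unless a second copy is indexed by followee_id.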
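Illustrative sketch: a shard router for the scheme above. The shard count of 64 and the function names are assumptions. Plain hash-modulo is used for brevity, although it forces most keys to move when the shard count changes.

```python
import hashlib

NUM_SHARDS = 64  # hypothetical shard count


def shard_for(user_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Map a user ID to a shard index using a stable hash."""
    digest = hashlib.sha1(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def posts_shard(author_id: int) -> int:
    # All posts by one author live together, so a profile page is one query.
    return shard_for(author_id)


def follows_shard(follower_id: int) -> int:
    # All rows "follower_id follows X" live together.
    return shard_for(follower_id)
```

Hashing the ID rather than using it directly avoids skew when IDs are allocated sequentially or encode a timestamp.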

Cache

Purpose & Justification: Redis is used to store the "Feed" of each user (list of Post IDs) to avoid expensive JOIN queries across millions of rows at read-time.
Key-Value Schema:
Key: feed:{user_id}
Value: Sorted Set (ZSET) where member = post_id and score = timestamp.
Failure Handling: If Redis fails, the system falls back to a "Pull" from the SQL database (slower but functional).
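Illustrative sketch: the feed operations against the ZSET schema above. The functions call zadd, zremrangebyrank, and zrevrange with redis-py style arguments. To keep the example runnable without a server, it includes a small in-memory stand-in (FakeZSetClient) that imitates only those three calls. The cap of 500 entries per feed is an assumption.

```python
MAX_FEED_LEN = 500  # hypothetical cap on cached post IDs per user


class FakeZSetClient:
    """In-memory stand-in for the three sorted-set calls used below."""

    def __init__(self):
        self._data = {}

    def _ordered(self, name):
        # Ascending by score, ties broken by member.
        items = self._data.get(name, {})
        return sorted(items, key=lambda member: (items[member], member))

    def zadd(self, name, mapping):
        self._data.setdefault(name, {}).update(mapping)

    def zrevrange(self, name, start, end):
        # Supports non-negative indices only, which is all this sketch needs.
        return self._ordered(name)[::-1][start:end + 1]

    def zremrangebyrank(self, name, start, end):
        ordered = self._ordered(name)
        n = len(ordered)
        start = max(start + n, 0) if start < 0 else start
        end = end + n if end < 0 else end
        doomed = ordered[start:end + 1] if end >= start else []
        for member in doomed:
            del self._data[name][member]
        return len(doomed)


def feed_key(user_id):
    return f"feed:{user_id}"


def push_post(client, user_id, post_id, created_at, max_len=MAX_FEED_LEN):
    key = feed_key(user_id)
    client.zadd(key, {post_id: created_at})
    # Ranks ascend by score, so this drops everything older than the newest max_len.
    client.zremrangebyrank(key, 0, -(max_len + 1))


def read_feed(client, user_id, limit=20):
    # Newest first: highest timestamp score at rank 0 in reverse order.
    return client.zrevrange(feed_key(user_id), 0, limit - 1)


client = FakeZSetClient()
for post_id, created_at in [("p1", 100), ("p2", 200), ("p3", 300)]:
    push_post(client, user_id=1, post_id=post_id, created_at=created_at)
latest = read_feed(client, user_id=1, limit=2)
```

Trimming on every write bounds memory per user. Pages older than the cap would come from the database, the same path used when Redis is unavailable.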

Messaging

Purpose & Decoupling: Kafka is used to decouple the "Post Upload" from the "Feed Update."
Event / Topic Schema: PostCreated event containing post_id, author_id, and timestamp.
Technical Selection: Kafka. High throughput and ability to replay messages if a fanout worker fails.
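Illustrative sketch: one possible shape for the PostCreated event. JSON encoding, integer IDs, epoch-second timestamps, and keying messages by author are all assumptions. The source only names the three fields.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PostCreated:
    post_id: int
    author_id: int
    timestamp: float  # Unix epoch seconds (assumed unit)

    def to_bytes(self) -> bytes:
        """Serialize as the message value."""
        return json.dumps(asdict(self), sort_keys=True).encode("utf-8")

    @classmethod
    def from_bytes(cls, raw: bytes) -> "PostCreated":
        return cls(**json.loads(raw))

    def partition_key(self) -> bytes:
        # Keying by author keeps one author's events in order within a partition.
        return str(self.author_id).encode("utf-8")


event = PostCreated(post_id=1, author_id=2, timestamp=1_700_000_000.0)
restored = PostCreated.from_bytes(event.to_bytes())
```

The event carries IDs only. Workers fetch captions and media URLs on demand, which keeps messages small and avoids stale copies of post metadata in the log.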

Data Processing

Processing Model: Fanout Workers (Custom Go/Python workers).
Processing Logic:
Consumer reads PostCreated event.
Queries Follow Service for all follower_ids.
If follower_count < 100k: Pushes post_id into each follower's Redis ZSET.
If follower_count >= 100k: Marks as "Celebrity Post" (No push, let followers pull during GET /feed).
Technical Selection: Custom microservices for low latency.
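Illustrative sketch: the processing logic above as a single handler. The four callables stand in for the Follow Service, the Redis writer, and the celebrity marker. Their names and signatures are hypothetical, and a real worker would batch the pushes.

```python
CELEBRITY_THRESHOLD = 100_000


def handle_post_created(
    event,
    get_follower_count,
    iter_follower_ids,
    push_to_feed,
    mark_celebrity_post,
    threshold=CELEBRITY_THRESHOLD,
):
    """Fan out one PostCreated event; return the number of feeds written."""
    author_id = event["author_id"]
    post_id = event["post_id"]

    # Check the count first so celebrity follower lists are never loaded.
    if get_follower_count(author_id) >= threshold:
        mark_celebrity_post(post_id, author_id)  # followers pull at read time
        return 0

    written = 0
    for follower_id in iter_follower_ids(author_id):
        push_to_feed(follower_id, post_id, event["timestamp"])
        written += 1
    return written
```

Usage with in-memory fakes: pass dict.get for the count lookup and list.append wrappers for the writers to observe which path an event takes.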
Wrap Up

Advanced Topics

Trade-offs: We choose Eventual Consistency for the feed. When a user posts, it might take a few seconds for the post to reach all of their followers' caches (up to 100k followers on the push path). This is a classic Availability over Consistency (AP) trade-off.
Reliability: If the Fanout Processor lags, users see stale feeds. We monitor "Kafka Lag" as a primary health indicator.
Bottleneck Analysis:
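Hot Shards: Celebrity accounts concentrate load on a few shards. Mitigated by the hybrid model, which pulls celebrity posts at read time instead of pushing them to every follower's cache.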
Storage Growth: S3 lifecycle policies move older, unviewed images to "Infrequent Access" or "Glacier" to save costs.
Security: Signed URLs for S3 carry an expiry and a signature, so private media is viewable only by clients that were issued a valid URL, and images cannot be scraped by URL guessing.
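Illustrative sketch: the idea behind a signed URL, shown with a plain HMAC over the path and expiry. This is a generic illustration and not S3's presigned URL scheme, which uses AWS Signature Version 4. The secret, parameter names, and helper functions are assumptions.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # placeholder; a real key would come from a secret store


def _signature(path: str, expires_at: int, secret: bytes) -> str:
    message = f"{path}\n{expires_at}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def sign_url(path: str, expires_at: int, secret: bytes = SECRET) -> str:
    query = urlencode(
        {"expires": expires_at, "sig": _signature(path, expires_at, secret)}
    )
    return f"{path}?{query}"


def verify_url(url: str, now: int, secret: bytes = SECRET) -> bool:
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires_at = int(params["expires"][0])
        supplied = params["sig"][0]
    except (KeyError, ValueError):
        return False  # missing or malformed parameters
    expected = _signature(parsed.path, expires_at, secret)
    # compare_digest avoids leaking the match position through timing.
    return now < expires_at and hmac.compare_digest(supplied, expected)


url = sign_url("/media/photo-1.jpg", expires_at=2000)
still_valid = verify_url(url, now=1999)
```

Because the signature covers both the path and the expiry, changing either one invalidates the URL, and guessing a valid URL requires the secret.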