The Question
Scalable Photo Sharing Newsfeed Design
Design a highly scalable photo-sharing platform similar to Instagram. The system must support hundreds of millions of users who can upload photos, follow others, and view a chronologically ordered newsfeed. Focus specifically on the architectural challenges of feed generation at scale, handling 'celebrity' accounts with millions of followers, and ensuring low-latency media delivery under high global traffic.
Tags: S3, Redis, Kafka, Cassandra, CDN, NoSQL, ZSet, API Gateway, PostgreSQL, JWT
Questions & Insights
Clarifying Questions
Scale and Traffic: What is the scale of Daily Active Users (DAU) and how many photos are uploaded/viewed per day? (Assumption: 500M DAU, 50M photo uploads/day, 5B feed views/day).
Social Graph Complexity: What is the average and maximum number of followers a user can have? (Assumption: Average 200, Max 10M for celebrities).
Feed Ordering: Is the feed strictly chronological or algorithm-based? (Assumption: MVP will use reverse-chronological ordering).
Media Specs: Are we handling video or just photos, and are there specific latency requirements for "Time to First Image"? (Assumption: Photos only for MVP, <200ms latency for feed loading).
Consistency: Is eventual consistency acceptable for newsfeed updates? (Assumption: Yes, it is better to have high availability than immediate consistency for a social feed).
Thinking Process
The core challenge is the "Fan-out" problem: how to efficiently deliver a single photo upload to millions of followers' feeds without crashing the system or causing massive lag.
How do we handle high-volume write and read traffic? We decouple the heavy lifting of image storage (Object Store) from metadata (NoSQL) and utilize a hybrid fan-out strategy (Push for regular users, Pull for celebrities).
How do we ensure low-latency feed retrieval? We pre-compute the newsfeed and store it in an in-memory cache (Redis), so the Read path is a single sorted-set range read: O(log N + K) for a page of K items from a feed of N entries.
How do we manage the massive storage requirements? We use Object Storage with a Content Delivery Network (CDN) to offload 90% of the read traffic from our origin servers.
How do we handle "Celebrity" users? We avoid pushing updates to millions of feeds at once; instead, we fetch celebrity posts dynamically when a follower requests their feed.
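A minimal sketch of that read-time merge, assuming the pre-computed feed and each followed celebrity's recent posts are already available as lists of (timestamp, photo_id) sorted newest first. The name `merge_feed` and the tuple layout are illustrative assumptions, not part of the original design.

```python
import heapq
from itertools import islice
from typing import Iterable, List, Tuple

Post = Tuple[int, int]  # (created_at timestamp, photo_id); hypothetical layout


def merge_feed(pushed: List[Post],
               celebrity_timelines: Iterable[List[Post]],
               limit: int = 20) -> List[Post]:
    """Merge the pre-computed (pushed) feed with celebrity timelines.

    Every input list must already be sorted newest first. heapq.merge
    consumes the inputs lazily, so only `limit` items are materialised.
    """
    merged = heapq.merge(pushed, *celebrity_timelines, reverse=True)
    return list(islice(merged, limit))
```

The cost of the merge grows with the number of celebrities a user follows, which is the pull-side trade-off this design accepts.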
Bonus Points
Hybrid Fan-out Model: Implementing a "push" model for 99% of users but a "pull" model for high-fan-out (celebrity) accounts to prevent write amplification.
Intelligent CDN Tiering: Using Origin Shielding and Request Collapsing at the Edge to handle viral content spikes and reduce egress costs.
Bitmap Indexing for Seen State: Using Redis Bitmaps or Bloom Filters to track "seen" posts efficiently, ensuring users don't see the same photo repeatedly while minimizing memory footprint.
Availability vs. Consistency (PACELC): Explicitly choosing PA (Availability over Consistency during partition) for the feed service, while using a more consistent model for the social graph.
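To make the "seen state" idea concrete, here is a small Bloom filter sketch. The class name `SeenFilter`, the sizes, and the choice of SHA-256 are assumptions for illustration; a production system would more likely use a Redis-side structure and tuned parameters.

```python
import hashlib
from typing import Iterator


class SeenFilter:
    """Tiny Bloom filter answering 'has this user probably seen photo X?'.

    False positives are possible (an unseen post may be skipped);
    false negatives are not.
    """

    def __init__(self, size_bits: int = 8192, num_hashes: int = 4) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, photo_id: int) -> Iterator[int]:
        # Derive k bit positions by salting a single hash function.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{photo_id}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def mark_seen(self, photo_id: int) -> None:
        for pos in self._positions(photo_id):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def probably_seen(self, photo_id: int) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(photo_id))
```

The memory per user is fixed by `size_bits` regardless of how many posts are marked, which is the footprint benefit the bullet refers to.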
Design Breakdown
Functional Requirements
Core Use Cases:
Users can upload photos.
Users can follow/unfollow other users.
Users can view a newsfeed of photos from people they follow (Reverse-chronological).
Scope Control:
In-scope: Photo uploads, social graph management, feed generation, image delivery.
Out-of-scope: Comments, likes, image editing, video, AI-based content moderation, direct messaging.
Non-Functional Requirements
Scale: Support 500M DAU and 50M uploads/day.
Latency: Feed generation/load in < 200ms; photo upload < 2s.
Availability & Reliability: 99.99% availability (SLA). No data loss for uploaded photos.
Consistency: Eventual consistency for feed updates (lag of a few seconds is acceptable).
Fault Tolerance: System must survive the failure of a data center (Multi-region replication).
Security: Private photos (if applicable) and secure upload URLs.
Estimation
Traffic Estimation:
Upload QPS: 50M / 86,400 ≈ 600 avg; 1,200 peak.
Feed Read QPS: 5B / 86,400 ≈ 60k avg; 120k peak.
Storage Estimation:
Metadata: 50M photos × 500 bytes ≈ 25GB/day.
Photos (Average 200KB per photo): 50M × 200KB ≈ 10TB/day.
Total storage for 1 year: ≈ 3.6PB.
Bandwidth Estimation:
Ingress: 600 QPS × 200KB ≈ 120MB/s.
Egress: Largely offloaded to the CDN (estimated 10× read/write ratio, so ≈ 1.2GB/s served from the CDN).
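The back-of-envelope figures above can be reproduced in a few lines. This sketch uses decimal units (1KB = 1,000 bytes); the variable names are illustrative, and the unrounded results differ slightly from the rounded numbers quoted in the estimate.

```python
SECONDS_PER_DAY = 86_400

uploads_per_day = 50_000_000
feed_views_per_day = 5_000_000_000
metadata_bytes_per_photo = 500
photo_bytes = 200_000  # 200KB average photo, decimal units

upload_qps = uploads_per_day / SECONDS_PER_DAY       # about 579, rounded to 600
feed_qps = feed_views_per_day / SECONDS_PER_DAY      # about 57,870, rounded to 60k

metadata_gb_per_day = uploads_per_day * metadata_bytes_per_photo / 1e9
photos_tb_per_day = uploads_per_day * photo_bytes / 1e12
photos_pb_per_year = photos_tb_per_day * 365 / 1000

ingress_mb_per_s = upload_qps * photo_bytes / 1e6    # about 116, rounded to 120

print(f"upload QPS   ~ {upload_qps:,.0f}")
print(f"feed QPS     ~ {feed_qps:,.0f}")
print(f"metadata/day ~ {metadata_gb_per_day:.0f} GB")
print(f"photos/day   ~ {photos_tb_per_day:.0f} TB")
print(f"photos/year  ~ {photos_pb_per_year:.2f} PB")
print(f"ingress      ~ {ingress_mb_per_s:.0f} MB/s")
```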
Blueprint
This architecture uses a decoupled, event-driven approach. Photos are stored in Object Storage, while metadata is stored in a distributed NoSQL database. A hybrid fan-out mechanism handles feed generation asynchronously via message queues.
Major Components:
API Gateway: Entry point for all requests, handling auth and rate limiting.
Upload Service: Coordinates photo storage and metadata persistence.
Newsfeed Service: Serves pre-computed feeds from Redis.
Fan-out Worker: Asynchronously updates follower feeds.
Object Storage (S3): Highly durable storage for raw and processed images.
Simplicity Audit: We use Redis for the feed cache to avoid complex SQL joins on the read path. We use a message queue to ensure uploads remain fast by deferring feed distribution.
Architecture Decision Rationale:
Why this architecture?: Separating the Read path (Feed Service) from the Write path (Upload Service) lets us scale each independently. Redis serves feed reads from memory, typically in under a millisecond, which leaves most of the 200ms budget for network and metadata hydration.
Functional Satisfaction: Covers upload, follow, and feed view.
Non-functional Satisfaction: Scalable via sharded NoSQL and Redis; available via async fan-out.
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery & Traffic Routing:
CDN: Use Amazon CloudFront or Akamai for image delivery. Images are cached at the edge.
DNS: Latency-based routing to the nearest regional data center.
Security & Perimeter:
API Gateway: Handles JWT authentication, TLS termination, and request throttling (e.g., 100 uploads/min per user).
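The per-user throttle could be implemented as a token bucket. The sketch below is a single-process illustration; the class name, the injectable `clock`, and the in-memory state are assumptions, and a real gateway with several instances would need to keep this state in a shared store.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucketLimiter:
    """Per-user token bucket allowing `capacity` requests per `period` seconds."""

    def __init__(self, capacity: int = 100, period: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_rate = capacity / period  # tokens added per second
        self.clock = clock
        # user_id -> (tokens remaining, time of last update)
        self._state: Dict[str, Tuple[float, float]] = {}

    def allow(self, user_id: str) -> bool:
        now = self.clock()
        tokens, last = self._state.get(user_id, (float(self.capacity), now))
        # Refill in proportion to elapsed time, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.refill_rate)
        if tokens >= 1:
            self._state[user_id] = (tokens - 1, now)
            return True
        self._state[user_id] = (tokens, now)
        return False
```

With the defaults this matches the example limit of 100 uploads per minute per user, while still allowing short bursts up to the bucket capacity.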
Service
Topology & Scaling:
Upload Service: Stateless, scales on CPU. Multi-AZ deployment.
Newsfeed Service: Stateless, scales on QPS. Heavily dependent on Cache hit ratio.
API Schema Design:
POST /v1/photos: Multipart upload. Returns photo_id.
GET /v1/feed?limit=20&cursor=timestamp: Returns list of photo metadata.
Resilience & Reliability:
Retries with exponential backoff on photo uploads.
Circuit breaker on the Social Graph DB to prevent cascading failures.
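A minimal sketch of retries with exponential backoff, using random "full jitter" so that many clients do not retry in lockstep. The function name, default delays, and the injectable `sleep` are assumptions; the sketch also assumes the wrapped upload is safe to repeat (for example, because the server deduplicates on a client-supplied request ID).

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(operation: Callable[[], T],
                       max_attempts: int = 5,
                       base_delay: float = 0.2,
                       max_delay: float = 5.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `operation` until it succeeds or attempts run out.

    Before retry n the caller waits a random time in
    [0, min(max_delay, base_delay * 2**n)].
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

In practice only transient errors (timeouts, 5xx responses) should be retried; catching every exception here keeps the sketch short.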
Storage
Access Pattern:
Metadata: Heavy write (uploads), heavy read (feed hydration).
Social Graph: Read-heavy (checking followers).
Database Table Design:
Photos Metadata (NoSQL/Cassandra):
photo_id (PK), user_id, image_url, created_at, location.
Social Graph (PostgreSQL/DynamoDB):
user_id, follower_id, created_at. Indexed on user_id.
Technical Selection:
S3: For binary data (Photos).
Cassandra: For metadata (Wide-column store handles write-heavy traffic and is easily sharded).
Distribution Logic: Shard Metadata DB by user_id to keep a user's photos in the same partition.
Cache
Purpose & Justification: Redis stores the "Newsfeed" (List of photo IDs) for each user to avoid expensive joins.
Key-Value Schema:
Key: feed:{user_id}
Value: Redis ZSet (Score = Timestamp, Value = photo_id).
TTL: 72 hours.
Failure Handling: If Redis fails, the Newsfeed service falls back to querying the Metadata DB directly (degraded performance).
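The sketch below models the per-user feed as an in-memory stand-in for a sorted set, with the cursor-based paging used by the feed endpoint. It does not talk to Redis; in Redis the same operations would roughly correspond to ZADD, a reverse range-by-score read, and ZREMRANGEBYRANK for trimming. The class name `FeedCache` and the `max_len` cap are assumptions.

```python
import bisect
from typing import Dict, List, Optional, Tuple


class FeedCache:
    """In-memory stand-in for one sorted set per user (score = timestamp)."""

    def __init__(self, max_len: int = 500) -> None:
        self.max_len = max_len
        # user_id -> entries kept in ascending (timestamp, photo_id) order
        self._feeds: Dict[int, List[Tuple[int, int]]] = {}

    def add(self, user_id: int, timestamp: int, photo_id: int) -> None:
        feed = self._feeds.setdefault(user_id, [])
        entry = (timestamp, photo_id)
        idx = bisect.bisect_left(feed, entry)
        if idx == len(feed) or feed[idx] != entry:  # skip exact duplicates
            feed.insert(idx, entry)
        if len(feed) > self.max_len:                # trim the oldest entries
            del feed[: len(feed) - self.max_len]

    def page(self, user_id: int, limit: int = 20,
             cursor: Optional[int] = None) -> Tuple[List[int], Optional[int]]:
        """Return (photo_ids newest first, next_cursor).

        `cursor` is the timestamp of the last item already returned; only
        strictly older posts are served, so this sketch assumes timestamps
        are unique within a feed.
        """
        feed = self._feeds.get(user_id, [])
        end = len(feed) if cursor is None else bisect.bisect_left(feed, (cursor,))
        start = max(0, end - limit)
        window = feed[start:end][::-1]
        next_cursor = window[-1][0] if window and start > 0 else None
        return [photo_id for _, photo_id in window], next_cursor
```

A client would call `page(user_id)` first, then pass the returned cursor back until it receives `None`.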
Messaging
Purpose & Decoupling: Kafka or SQS decouples the upload confirmation from the feed distribution.
Event Schema:
{ "user_id": 123, "photo_id": 456, "timestamp": 1625... }
Throughput & Partitioning: Partition Kafka by user_id of the uploader to ensure ordered processing for a single user's posts.
Technical Selection: Kafka for its ability to replay messages and high throughput.
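Keying by the uploader's user_id works because a stable hash always maps the same key to the same partition, and ordering is preserved within a partition. The sketch below illustrates the property with CRC32; this is an assumption for illustration only, since Kafka's default partitioner uses a different hash (murmur2 over the serialized key).

```python
import zlib


def partition_for(user_id: int, num_partitions: int) -> int:
    """Map an uploader to a partition using a stable hash of the key.

    Illustrative only: the hash differs from Kafka's, but the guarantee is
    the same. Every event for one user_id lands on one partition, so a
    single consumer sees that user's posts in order.
    """
    return zlib.crc32(str(user_id).encode("utf-8")) % num_partitions
```

Note that changing the partition count remaps keys, so ordering across a repartition is not guaranteed.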
Data Processing
Processing Model: The "Fan-out Worker" consumes messages from Kafka.
Processing DAG:
If the uploader is a celebrity, stop here (followers will pull this post on demand).
Otherwise, fetch the uploader's followers from the Social Graph DB.
For each follower, add the post to that follower's Redis Feed Cache.
Technical Selection: Custom Go/Python workers for low-latency processing.
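The worker's per-event logic can be sketched as a single function. The callables stand in for the Social Graph DB and the feed cache, and `CELEBRITY_THRESHOLD` is a hypothetical cut-off; the design does not state where the line between regular and celebrity accounts is drawn.

```python
from typing import Callable, Dict, Iterable

CELEBRITY_THRESHOLD = 100_000  # hypothetical follower cut-off, not from the design


def fan_out(event: Dict[str, int],
            get_followers: Callable[[int], Iterable[int]],
            follower_count: Callable[[int], int],
            push_to_feed: Callable[[int, int, int], None]) -> int:
    """Process one upload event and return the number of feeds updated."""
    uploader = event["user_id"]
    if follower_count(uploader) >= CELEBRITY_THRESHOLD:
        return 0  # celebrity: followers pull this post at read time
    updated = 0
    for follower_id in get_followers(uploader):
        push_to_feed(follower_id, event["timestamp"], event["photo_id"])
        updated += 1
    return updated
```

Checking the celebrity condition first avoids loading millions of follower IDs only to discard them.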
Infrastructure (Optional)
Observability: Prometheus for RED metrics (Rate, Errors, Duration). Jaeger for distributed tracing to find bottlenecks in fan-out.
Platform Security: mTLS between services. S3 buckets are private; images are accessed via CDN signed URLs.
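To illustrate the idea behind signed URLs, the sketch below appends an expiry and an HMAC to a path and verifies both. This is a generic scheme assumed for illustration; real CDNs define their own signing formats (CloudFront, for example, uses key-pair signatures), so the parameter names `expires` and `sig` are hypothetical.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse


def sign_url(path: str, expires_at: int, secret: bytes) -> str:
    """Append an expiry and an HMAC-SHA256 signature to a path."""
    message = f"{path}:{expires_at}".encode()
    signature = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return f"{path}?{urlencode({'expires': expires_at, 'sig': signature})}"


def verify_url(url: str, now: int, secret: bytes) -> bool:
    """Check the signature and expiry, as an edge or origin might."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires_at = int(params["expires"][0])
        signature = params["sig"][0]
    except (KeyError, ValueError):
        return False
    expected = hmac.new(secret, f"{parsed.path}:{expires_at}".encode(),
                        hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return (hmac.compare_digest(signature.encode(), expected.encode())
            and now <= expires_at)
```

Because the path is part of the signed message, a valid link for one photo cannot be reused for another.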
Wrap Up
Advanced Topics
Trade-offs (Push vs. Pull):
Push (Fan-out on write): Great for low-latency reads. Problem: High write load for celebrities (The "Justin Bieber" problem).
Pull (Fan-out on read): Better for celebrities. Problem: Slower feed load if a user follows 1,000 celebrities.
The Winner: Hybrid approach. Regular users push to feeds; celebrity posts are merged at read-time.
Reliability: If the Fan-out queue is backed up, users might see a delay in their feed updating. This is an acceptable trade-off (Eventual Consistency).
Optimization: Use Thrift or Protobuf for service-to-service communication to reduce payload size and latency.
Security: Implement "Image Scanning" as an async process using a Lambda/Worker after upload to detect inappropriate content before the fan-out occurs.