The Question
Design a Scalable Social Media News Feed
Design a news feed system similar to Meta (Facebook) that supports 1 billion daily active users. The system must allow users to post updates, follow others, and view a personalized feed of content from their network in reverse-chronological order. Address the 'Celebrity Problem' (hot keys/fan-out spikes), ensure sub-200ms feed retrieval latency, and discuss the trade-offs between push-based (fan-out on write) and pull-based (fan-out on read) architectures. Detail how you would handle data consistency, social graph storage, and feed caching at scale.
Cassandra
Redis
Kafka
PostgreSQL
CDN
gRPC
ZSET
Bloom Filter
Questions & Insights
Clarifying Questions
What is the scale of the system? (Assumed: 1 Billion DAU, 10 Billion posts per day, average 500 friends/follows per user).
What is the feed ranking requirement for the MVP? (Assumed: Reverse chronological order or a simple weight-based score; complex ML-based ranking is out of scope for MVP).
Does the feed support media like images and videos? (Assumed: Yes, but the News Feed system manages metadata/links; actual binary storage is handled by a separate Blob storage/CDN).
Is the relationship model bilateral (Friends) or unilateral (Follows)? (Assumed: Both, represented as a directed graph).
What is the latency target for feed retrieval? (Assumed: P99 < 200ms).
Thinking Process
Core Bottleneck: The "Fan-out" problem. How do we deliver one post to millions of followers without crashing the database or causing massive lag?
Key Questions:
How do we store the massive volume of posts while maintaining high write availability?
How do we balance between "Push" (pre-computing feeds) and "Pull" (computing on the fly) to handle "Celebrities" vs. "Normal Users"?
How do we structure the Social Graph to quickly identify who should see a post?
How do we ensure the feed remains responsive even under heavy load?
Bonus Points
Hybrid Fan-out Strategy: Use "Push" (Fan-out on write) for active, low-follower users and "Pull" (Fan-out on read) for celebrities/high-follower accounts to prevent write amplification.
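Feed Pagination via Sockets/Cursors: Implement cursors that pin the user's position in the feed (for example, the last-seen timestamp and post ID) rather than numeric offsets, to prevent duplicate posts when the feed is updated while the user is scrolling.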
Edge Side Composition: Discussing how GraphQL or a BFF (Backend for Front End) layer can aggregate post metadata with user profile data at the edge.
Storage Tiering: Moving older posts to cold storage (HDFS/S3) while keeping recent posts in high-performance NoSQL (Cassandra/ScyllaDB).
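The cursor-based pagination mentioned above can be sketched as follows. This is a minimal illustration, not a prescribed design: the cursor format (base64-encoded JSON holding the last-seen timestamp and post ID) and the names `encode_cursor`, `decode_cursor`, and `get_page` are assumptions made for the example.

```python
import base64
import json
from typing import List, Optional, Tuple

Post = Tuple[int, str]  # (timestamp, post_id); hypothetical minimal shape


def encode_cursor(ts: int, post_id: str) -> str:
    """Pack the position of the last returned post into an opaque token."""
    raw = json.dumps({"ts": ts, "id": post_id}).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_cursor(cursor: str) -> Tuple[int, str]:
    data = json.loads(base64.urlsafe_b64decode(cursor.encode("ascii")))
    return data["ts"], data["id"]


def get_page(feed: List[Post], limit: int,
             cursor: Optional[str] = None) -> Tuple[List[Post], Optional[str]]:
    """Return (page, next_cursor), newest first.

    Only posts strictly older than the cursor position are returned, so
    posts that arrive at the top of the feed between requests cannot
    shift or duplicate later pages.
    """
    ordered = sorted(feed, reverse=True)
    if cursor is not None:
        boundary = decode_cursor(cursor)
        ordered = [p for p in ordered if p < boundary]
    page = ordered[:limit]
    next_cursor = encode_cursor(*page[-1]) if len(ordered) > limit else None
    return page, next_cursor
```

With offset-based paging, a new post arriving between two requests would push an already-seen post onto the next page; comparing against the cursor position avoids that.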
Design Breakdown
Functional Requirements
Core Use Cases:
Users can create posts (text, links, media references).
Users can view a feed of posts from friends and followed entities.
Users can add/remove friends or follow/unfollow entities.
Scope Control:
In-scope: Post creation, Feed generation, Fan-out logic, Basic Ranking.
Out-of-scope: Complex ML-based Ad insertion, Real-time comments/likes (separate service), Search, Media transcoding.
Non-Functional Requirements
Scale: Support 1B+ users with millions of concurrent requests.
Latency: Feed retrieval must stay under 200ms at P99.
Availability & Reliability: Highly available (99.99%); users should always see some content even if it's slightly stale.
Consistency: Eventual consistency is acceptable (it’s okay if a friend's post shows up 2 seconds late).
Fault Tolerance: System must survive regional data center failures.
Security: Privacy settings (e.g., "Friends only") must be strictly enforced.
Estimation
Traffic:
10B posts/day / 86,400 sec ≈ 116,000 avg write QPS.
Peak write QPS (5x) ≈ 580,000 QPS.
Read/Write ratio (10:1) ≈ 5.8M Peak Read QPS.
Storage:
10B posts/day * 500 bytes/post = 5 TB/day.
3 years retention ≈ 5.5 PB (before replication).
Bandwidth:
Write: 580k QPS * 500B ≈ 290 MB/s.
Read: 5.8M QPS * 2KB (feed metadata) ≈ 11.6 GB/s.
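A minimal sketch of this back-of-envelope arithmetic as a function of its inputs, so that different assumptions can be plugged in. The function and parameter names are illustrative, and the sketch assumes write QPS is derived from the number of posts per day.

```python
SECONDS_PER_DAY = 86_400


def estimate(posts_per_day: float, bytes_per_post: float, peak_factor: float,
             read_write_ratio: float, feed_response_bytes: float,
             retention_days: int) -> dict:
    """Capacity estimate using decimal units (1 TB = 1e12 bytes)."""
    write_avg_qps = posts_per_day / SECONDS_PER_DAY
    write_peak_qps = write_avg_qps * peak_factor
    read_peak_qps = write_peak_qps * read_write_ratio
    daily_bytes = posts_per_day * bytes_per_post
    return {
        "write_avg_qps": write_avg_qps,
        "write_peak_qps": write_peak_qps,
        "read_peak_qps": read_peak_qps,
        "storage_per_day_tb": daily_bytes / 1e12,
        "storage_total_pb": daily_bytes * retention_days / 1e15,
        "write_bandwidth_mb_s": write_peak_qps * bytes_per_post / 1e6,
        "read_bandwidth_gb_s": read_peak_qps * feed_response_bytes / 1e9,
    }


if __name__ == "__main__":
    # Inputs taken from the assumptions stated in this document.
    result = estimate(posts_per_day=10e9, bytes_per_post=500, peak_factor=5,
                      read_write_ratio=10, feed_response_bytes=2000,
                      retention_days=3 * 365)
    for name, value in result.items():
        print(f"{name}: {value:,.2f}")
```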
Blueprint
Concise Summary: A hybrid architecture utilizing a "Push" model for standard users via a Fan-out worker and a "Pull" model for celebrities. Feeds are pre-computed and stored in a fast Cache layer (Redis).
Major Components:
API Gateway: Handles authentication and routing.
Post Service: Manages creation and storage of post metadata.
Social Graph Service: Manages friend/follow relationships.
Feed Service: Aggregates and serves the feed to users.
Fan-out Worker: Asynchronously pushes new posts to followers' feed caches.
Redis (Feed Store): Stores pre-computed feed IDs for active users.
Cassandra (Post Store): Highly scalable NoSQL for persistent post data.
Simplicity Audit: This design avoids complex real-time ML pipelines by using a pre-computed "Feed Cache," keeping reads low-latency, which addresses the primary user pain point.
Architecture Decision Rationale:
Hybrid Model: Essential at Facebook scale to avoid fan-out write storms (one post amplified into millions of cache writes) on celebrity posts while keeping feeds instant for 99% of users.
Functional Satisfaction: Covers posting, friending, and viewing.
Non-functional Satisfaction: Cassandra provides horizontal write scaling; Redis provides sub-millisecond read latency.
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery & Traffic Routing: Use a Global CDN for static assets (images/JS). DNS uses latency-based routing to the nearest Data Center.
Security & Perimeter: API Gateway handles JWT validation, SSL termination, and IP-based rate limiting to prevent scrapers.
Service
Topology & Scaling: Stateless microservices deployed on Kubernetes across multiple Availability Zones (AZs). Scaling is based on CPU and Request Count.
API Schema Design:
POST /v1/posts: Create a post. Protocol: REST. Request: {content, media_ids, privacy}. Idempotency: client_request_id.
GET /v1/feed: Get news feed. Protocol: REST. Response: {posts: [], next_cursor}. SLA: 200ms.
Resilience & Reliability: Implementation of exponential backoff for Fan-out workers and circuit breakers on the Graph Service to prevent cascading failures during database lag.
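The exponential backoff mentioned for the Fan-out workers might look like the sketch below. It adds random jitter ("full jitter") so that many workers retrying at once do not retry in lockstep; the jitter, the helper name `retry_with_backoff`, and the default delays are assumptions for illustration and are not specified in the design above.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(op: Callable[[], T], max_attempts: int = 5,
                       base_delay: float = 0.1, max_delay: float = 5.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call op(), retrying on any exception with exponential backoff.

    The wait before retry number n is drawn uniformly from
    [0, min(max_delay, base_delay * 2**n)].
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the last error
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")
```

A production worker would normally retry only on errors known to be transient (timeouts, throttling) rather than on every exception.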
Storage
Access Pattern:
Posts: Heavy write, heavy read by ID.
Graph: Complex joins (who follows whom) but mostly read-heavy.
Database Table Design:
Cassandra (Posts): Partition Key: post_id. Columns: author_id, content, timestamp, media_links.
Postgres (Graph): Table relationships (follower_id, followee_id, type). Index on follower_id and followee_id.
Technical Selection: Cassandra is chosen for its "Write-Optimized" nature and seamless cross-region replication. Postgres is used for the Graph due to ACID requirements for relationship integrity.
Cache
Purpose & Justification: Redis stores the "News Feed" as a list of Post IDs for each active user. This avoids expensive DB queries on every page load.
Key-Value Schema:
Key: feed:{user_id}
Value: ZSET (Sorted Set) where score = timestamp and value = post_id.
TTL: 72 hours for active users.
Failure Handling: If a user's feed key is missing from Redis (cold start or expired TTL), the Feed Service falls back to a "Pull" model by querying the Graph Service and Cassandra directly, then repopulates the cache.
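The read path with its cold-start fallback can be sketched as below. To keep the example self-contained, an in-memory class stands in for Redis: `add` plays the role of ZADD and `newest` the role of a reverse-order range read on the sorted set. The class `FeedCache`, the function `get_feed`, and the `pull_from_source` callback (standing in for the Graph Service plus Cassandra query) are hypothetical names, and TTL handling is omitted.

```python
from typing import Callable, Dict, List, Optional, Tuple


class FeedCache:
    """In-memory stand-in for Redis: feed:{user_id} -> {post_id: timestamp}."""

    def __init__(self, max_len: int = 500) -> None:
        self._zsets: Dict[str, Dict[str, int]] = {}
        self.max_len = max_len

    def add(self, user_id: str, post_id: str, ts: int) -> None:
        zset = self._zsets.setdefault(f"feed:{user_id}", {})
        zset[post_id] = ts
        if len(zset) > self.max_len:
            # Trim the oldest entry so each feed stays bounded.
            del zset[min(zset, key=zset.get)]

    def newest(self, user_id: str, limit: int) -> Optional[List[str]]:
        """Post IDs by descending timestamp, or None if the key is absent."""
        zset = self._zsets.get(f"feed:{user_id}")
        if zset is None:
            return None
        ordered = sorted(zset, key=lambda pid: (zset[pid], pid), reverse=True)
        return ordered[:limit]


def get_feed(user_id: str, limit: int, cache: FeedCache,
             pull_from_source: Callable[[str], List[Tuple[str, int]]]) -> List[str]:
    cached = cache.newest(user_id, limit)
    if cached is not None:
        return cached  # cache hit: no database work
    # Cache miss: rebuild via the pull path, then repopulate the cache.
    for post_id, ts in pull_from_source(user_id):
        cache.add(user_id, post_id, ts)
    return cache.newest(user_id, limit) or []
```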
Messaging
Purpose & Decoupling: Kafka decouples the Post Service from the Fan-out process. This ensures that even if the fan-out is slow (e.g., a user has 5k friends), the "Post Creation" remains fast.
Event Schema:
PostCreatedEvent {post_id, author_id, timestamp, privacy_level}.
Throughput & Partitioning: Partitioned by author_id to ensure ordering of posts from the same user.
Technical Selection: Kafka for high-throughput and replayability in case of fan-out logic bugs.
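To illustrate why keying by author_id preserves per-author ordering, the sketch below routes events to partitions by hashing the key. It is a simulation only: `zlib.crc32` is used as a stand-in hash (Kafka's default partitioner for keyed records uses a murmur2 hash of the key bytes), and `partition_for` and `route` are hypothetical helper names.

```python
import zlib
from collections import defaultdict
from typing import Dict, List


def partition_for(author_id: str, num_partitions: int) -> int:
    """Map a key to a partition; the same key always maps to the same one."""
    return zlib.crc32(author_id.encode("utf-8")) % num_partitions


def route(events: List[dict], num_partitions: int) -> Dict[int, List[dict]]:
    """Append each event to its partition in arrival order."""
    partitions: Dict[int, List[dict]] = defaultdict(list)
    for event in events:
        partitions[partition_for(event["author_id"], num_partitions)].append(event)
    return partitions
```

Because every event from one author lands in a single partition, and a partition is an ordered log, a consumer sees that author's posts in the order they were produced. Ordering across different authors is not guaranteed, which the feed does not need.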
Infrastructure (Optional)
Observability: Prometheus for RED metrics (Rate, Errors, Duration). Jaeger for tracing the lifecycle of a post from creation to feed appearance.
Distributed Coordination: Not heavily required for MVP, though Consul may be used for service discovery.
Wrap Up
Advanced Topics
Trade-offs (PACELC): The design is PA/EL. During a partition we choose Availability over Consistency, and in normal operation we choose low Latency over Consistency. It's better for a user to see an old feed than no feed at all.
The "Celebrity" Problem:
Problem: If a user with 50M followers posts, "Pushing" to 50M Redis lists causes a massive spike.
Optimization: We flag celebrity accounts and do NOT push their posts. Instead, when a follower requests the feed, the Feed Service "Pulls" the latest posts from the celebrities that user follows and merges them with the pre-computed feed.
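A minimal in-memory sketch of this hybrid approach: posts from ordinary authors are pushed to follower feeds at write time, while celebrity posts are stored once and merged in at read time. The class name `HybridFeed`, the follower-count threshold, and the data layout are assumptions chosen for illustration; the design above does not fix a specific cutoff.

```python
import heapq
from itertools import islice
from typing import Dict, List, Set, Tuple

Entry = Tuple[int, str]  # (timestamp, post_id)


class HybridFeed:
    def __init__(self, followers: Dict[str, Set[str]], threshold: int = 10_000) -> None:
        self.followers = followers      # author -> set of follower IDs
        self.threshold = threshold      # hypothetical celebrity cutoff
        self.feed_cache: Dict[str, List[Entry]] = {}       # pushed entries per user
        self.celebrity_posts: Dict[str, List[Entry]] = {}  # author -> own posts

    def is_celebrity(self, author: str) -> bool:
        return len(self.followers.get(author, ())) >= self.threshold

    def publish(self, author: str, post_id: str, ts: int) -> int:
        """Store a post and return the number of feed-cache writes it caused."""
        if self.is_celebrity(author):
            # Fan-out on read: one write, regardless of follower count.
            self.celebrity_posts.setdefault(author, []).append((ts, post_id))
            return 0
        targets = self.followers.get(author, set())
        for follower in targets:  # fan-out on write
            self.feed_cache.setdefault(follower, []).append((ts, post_id))
        return len(targets)

    def read(self, user: str, limit: int) -> List[str]:
        """Merge the pushed feed with posts pulled from followed celebrities."""
        sources = [sorted(self.feed_cache.get(user, []), reverse=True)]
        for author, fans in self.followers.items():
            if user in fans and self.is_celebrity(author):
                sources.append(sorted(self.celebrity_posts.get(author, []), reverse=True))
        merged = heapq.merge(*sources, reverse=True)  # inputs are newest-first
        return [post_id for _, post_id in islice(merged, limit)]
```

The read path pays a small merge cost per followed celebrity in exchange for removing the multi-million-write spike from the write path.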
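Read Path Optimization: The Feed Service uses a per-user Bloom Filter to check whether a post has probably been seen already, preventing duplicates in the UI. A Bloom Filter can return false positives but never false negatives, so an unseen post is occasionally skipped, which is an acceptable trade-off for a feed.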
Data Privacy: The Fan-out worker checks the Social Graph for "Blocked" relationships before adding a post ID to a user's feed cache.
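The Bloom Filter check in the read path above can be sketched as follows. This is a small illustrative implementation, not a tuned one: the bit-array size, the number of hash functions, and the use of SHA-256 with double hashing to derive bit positions are all assumptions for the example, as are the names `BloomFilter` and `filter_unseen`.

```python
import hashlib
from typing import List


class BloomFilter:
    def __init__(self, num_bits: int = 8192, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> List[int]:
        # Double hashing: derive k positions from two 64-bit halves of a digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the stride non-zero
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        """False means definitely not added; True means probably added."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def filter_unseen(post_ids: List[str], seen: BloomFilter) -> List[str]:
    """Drop posts the user has probably seen, preserving feed order."""
    return [pid for pid in post_ids if not seen.might_contain(pid)]
```

A post that was added to the filter is always reported as seen, so duplicates never slip through; the cost is that a small fraction of unseen posts may be filtered out by a false positive.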