The Question
DesignScalable News Feed System
Design a news feed system similar to Facebook or Twitter. The system should support 300 million daily active users who can post updates and follow other users. Key challenges include maintaining low-latency feed retrieval (under 300ms) while handling the 'celebrity problem' where a single user might have millions of followers. Explain your strategy for data consistency, storage selection for the social graph, and your approach to the push-vs-pull fan-out trade-off.
Kafka
Redis
PostgreSQL
Cassandra
ZSET
CDN
gRPC
Questions & Insights
Clarifying Questions
Scale and Traffic: What is the target Daily Active User (DAU) count and the average number of follows per user?
Assumption: 300M DAU, average 200 follows per user, with some "celebrity" accounts having millions of followers.
Content Types: Does the feed support only text, or also media like images and videos?
Assumption: Support for text and media metadata (URLs). Media storage itself is out of scope for the feed logic.
Feed Ordering: Is the feed strictly chronological or algorithmically ranked?
Assumption: Chronological for the MVP, but the architecture should allow for ranking hooks.
Latency SLAs: What are the target latencies for posting a message and loading the feed?
Assumption: Post visibility within 2 seconds; Feed load time under 300ms.
Thinking Process
Core Bottleneck: The primary challenge is the "Fan-out" problem—efficiently delivering a single post to millions of followers' feeds without crashing the system or causing massive lag.
Strategy Steps:
Write Path: How do we ingest a post and trigger the distribution? (Async Fan-out).
Read Path: How do we serve the feed instantly? (Pre-computed Caching).
Optimization: How do we handle "Celebrities" (users with >100k followers) who break the standard push model? (Hybrid Push/Pull).
Storage: Where do we store the social graph and the posts for durability?
Bonus Points
Hybrid Fan-out Model: Implementing a "Push" model for regular users and a "Pull" model for celebrities to prevent "write amplification" and "thundering herd" issues in the cache.
Pagination Strategy: Using "Keyset Pagination" (Cursor-based) instead of Offset-based pagination to ensure consistent feed viewing while new items are being injected.
Write-Through Cache with TTL: Managing Feed Cache memory by only pre-computing feeds for "Active Users" (e.g., users logged in within the last 3 days).
Social Graph Partitioning: Sharding the Follower database by
user_id to ensure localized lookups for fan-out workers.Design Breakdown
Functional Requirements
Core Use Cases:
Users can create posts (text/media links).
Users can follow/unfollow other users.
Users can view a chronological news feed of people they follow.
Scope Control:
In-scope: Post creation, social graph management, feed generation, and retrieval.
Out-of-scope: Notifications, search, ad-insertion, and media transcoding.
Non-Functional Requirements
Scale: Support 300M DAU and 10k+ QPS for writes, 100k+ QPS for reads.
Latency: Feed retrieval < 300ms (P99).
Availability & Reliability: 99.99% availability (CAP theorem: prioritize Availability over Consistency - AP).
Consistency: Eventual consistency is acceptable; it is okay if a follower sees a post a few seconds late.
Fault Tolerance: The system should continue to serve the feed even if the fan-out worker lags.
Estimation
Traffic Estimation:
300M DAU. If each user checks feed 5 times/day = 1.5B requests/day (~17.5k QPS average, 35k QPS peak).
If 10% of users post once/day = 30M posts/day (~350 QPS average, 1k QPS peak).
Storage Estimation:
Post size: 500 bytes (text + metadata). 30M posts * 500 bytes = 15GB/day. 5.5TB/year.
Bandwidth Estimation:
Outgoing: 1.5B feed views 20 posts 500 bytes = 15TB/day (~1.4 Gbps).
Blueprint
Concise Summary: A hybrid system that uses a Push-model (fan-out on write) for most users to ensure low-latency reads, while falling back to a Pull-model for celebrities to maintain write performance.
Major Components:
API Gateway: Entry point for authentication and rate limiting.
Post Service: Handles incoming posts and persists them to the Metadata DB.
Follow Service: Manages the social graph (who follows whom).
Fan-out Worker: An asynchronous worker that pushes post IDs to the feed caches of followers.
Feed Service: Aggregates and serves the final feed from the Cache and Post DB.
Redis Cache: Stores pre-computed feed IDs for active users.
Simplicity Audit: This design avoids complex ML ranking pipelines and uses a standard MQ-based fan-out, which is the industry standard for MVP-scale social feeds.
Architecture Decision Rationale:
Why this architecture?: Pure "Pull" models (computing feed on-the-fly) are too slow for 200+ follows. Pure "Push" models (writing to every follower) break for celebrities. Hybrid is the most robust.
Functional Satisfaction: Covers posting, following, and viewing.
Non-functional Satisfaction: Redis ensures <300ms latency; Kafka ensures write scalability.
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery & Traffic Routing:
Global DNS (Route 53) directs users to the nearest regional data center.
API Gateway handles SSL termination and JWT-based authentication.
Security & Perimeter:
Rate limiting (e.g., 5 posts per minute per user) to prevent spam.
WAF to protect against common injection attacks.
Service
Topology & Scaling:
Stateless microservices deployed on Kubernetes, auto-scaled based on CPU/Request count.
Regional deployments to minimize latency.
API Schema Design:
POST /v1/posts: Create a post. { "content": "string", "media_ids": [] }GET /v1/feed: Retrieve feed. Returns list of Post objects. Supports cursor for pagination.POST /v1/follow/{user_id}: Follow a user.Resilience & Reliability:
Retries with exponential backoff for Fan-out workers.
Dead Letter Queues (DLQ) for failed fan-out tasks.
Storage
Access Pattern:
Social Graph: Heavy read/write of relationships. (Relational/Graph needed).
Post Metadata: Write once, read many. (NoSQL preferred for scaling).
Database Table Design:
Follows Table (SQL):
follower_id (PK/Shard Key), followee_id, created_at.Posts Table (NoSQL/Cassandra):
post_id (PK), author_id, content, media_urls, created_at.Technical Selection:
PostgreSQL: For Follows (to ensure ACID when following/unfollowing).
Cassandra: For Posts (high write throughput and easy horizontal scaling).
Cache
Purpose & Justification: Pre-computed feeds are stored in Redis to avoid expensive multi-table joins on every feed load.
Key-Value Schema:
Key:
feed:{user_id}Value: Redis
ZSET (Sorted Set) where score is timestamp and member is post_id.TTL: 72 hours (only cache for active users).
Failure Handling: If Redis is cold or empty, Feed Service falls back to a "Pull" query against the PostDB and Social Graph DB.
Messaging
Purpose & Decoupling: Kafka decouples the Post Service from the Fan-out logic. Post Service returns success once the post is persisted and the message is in Kafka.
Event / Topic Schema:
Topic:
post-publishedPayload:
{ "author_id": "uuid", "post_id": "uuid", "timestamp": "long" }Technical Selection: Kafka for its high throughput and message replayability.
Data Processing
Processing Model: The Fan-out Workers are the "Push" engine.
Processing DAG:
Read message from Kafka.
Fetch all
follower_ids from GraphDB (excluding followers of celebrities).For each follower, update their
feed:{user_id} ZSET in Redis.Technical Selection: Python or Go workers using a library like Sarama or Kafka-python.
Wrap Up
Advanced Topics
Trade-offs: We chose Eventual Consistency. A user might not see their own post in their feed for a second or two. This allows the system to remain highly available and performant.
Reliability & Failure Handling:
Celebrity Handling: When a user follows a celebrity, the Fan-out worker skips the push. When the follower views their feed, the Feed Service pulls the celebrity's latest posts and merges them with the Redis-cached feed.
Bottleneck Analysis:
Hot Shards: High-frequency posters could overwhelm a single Kafka partition. We partition by
author_id to distribute load but use an internal buffer for high-volume users.Security & Privacy:
Block/Mute lists: Feed Service filters out posts from blocked users during the final aggregation.