The Question
Design

Meta News Feed Design

Design a large-scale social media activity feed system similar to Meta's News Feed. The system must support hundreds of millions of users, handle the 'celebrity fan-out' problem, and ensure that users can view their personalized feed with sub-second latency while maintaining eventual consistency for new posts.
Redis
Cassandra
Kafka
Hybrid Fan-Out
PostgreSQL
Questions & Insights

Clarifying Questions

Scale: What is the scale of the system in terms of Daily Active Users (DAU) and the average number of follows per user?
Assumption: 300M DAU, average 500 follows per user.
Content Types: Does the feed support only text/images, or do we need to handle complex media like auto-playing video and live streams?
Assumption: Text, images, and links for the MVP. Video is handled via external CDN links.
Ordering: Is the feed strictly chronological or algorithmically ranked?
Assumption: Simple weighted ranking (recency + engagement score) for the MVP.
Latency: What are the target p99 latencies for posting and viewing the feed?
Assumption: < 500ms for feed retrieval; < 2s for post propagation to followers.

Thinking Process

Core Bottleneck: The primary challenge is the "Fan-out" problem—efficiently delivering a single post to millions of followers without crashing the system or causing massive lag.
Key Questions for Design Evolution:
How do we store posts and the social graph (who follows whom) to minimize join operations?
Which delivery model fits our scale: Push (Fan-out on write), Pull (Fan-out on load), or a Hybrid?
How do we ensure the feed stays fast for users with thousands of follows?
How do we handle "Hot Users" (celebrities) whose fan-out would overwhelm the message queue?

Bonus Points

Hybrid Fan-out: Using a "Push" model for regular users and a "Pull" model for celebrities to prevent "Thundering Herd" issues on the write path.
Read-Repair & Lazy Loading: Optimizing feed caches by only regenerating them when a user actually logs in, rather than eagerly for 2B people.
Feed Pagination via Tailing Cursors: Using stateful cursors instead of offset-based pagination to prevent duplicate items when new posts are added.
Multi-layered Cache Strategy: Utilizing a "Social Graph Cache" (Redis) to quickly fetch friend lists before querying the "Feed Cache."
Design Breakdown

Functional Requirements

Users can create posts (text/images).
Users can see a News Feed containing posts from people/pages they follow.
Users can follow/unfollow others.
The feed should support pagination.

Non-Functional Requirements

High Availability: The feed is the core product; it must be available even if the ranking engine is degraded.
Low Latency: Viewing the feed should be near-instantaneous.
Eventual Consistency: It is acceptable if a post takes 1-5 seconds to appear in all followers' feeds.
Scalability: Must handle surges in traffic (e.g., major news events).

Estimation

DAU: 300M
Read QPS (Feed View): Assuming a user checks 10x/day: (300M 10) / 86400 ≈ 35,000 QPS**.
Write QPS (Post): Assuming 1 post/day: 300M / 86400 ≈ 3,500 QPS.
Storage: 300M posts/day 10KB (avg metadata + text) ≈ 3TB/day**.
Fan-out Load: 3,500 posts/sec 500 followers ≈ 1.75M write operations/sec** to feed caches.

Blueprint

Concise Summary: A hybrid-cloud architecture utilizing a "Fan-out-on-Write" approach for standard users and "Fan-out-on-Load" for celebrities, backed by an in-memory Feed Cache for low-latency retrieval.
Major Components:
Post Service: Handles the creation and persistence of new content.
Social Graph Service: Manages follow/unfollow relationships and fetches friend lists.
Fan-out Worker: Asynchronously pushes post IDs into the Feed Caches of followers.
Feed Service: Aggregates post metadata and serves the ranked list to the user.
Feed Cache (Redis): Stores the "Top N" post IDs for every active user's feed.
Simplicity Audit: This design avoids complex Spark/Flink streaming for the MVP, relying instead on a simple Message Queue (Kafka) and horizontally scalable Workers to handle the fan-out logic.
Architecture Decision Rationale:
Why this architecture?: The Push model ensures reads are extremely fast (O(1) lookups in Redis), which is critical for a read-heavy system like Meta.
Functional Satisfaction: Meets all core posting and viewing requirements through decoupled microservices.
Non-functional Satisfaction: Kafka provides the durability and asynchronous processing needed to handle write spikes without impacting the user who is posting.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling:
Services are deployed as Docker containers on Kubernetes (EKS/GKE).
Feed Service is scaled based on CPU/Request count (Read-heavy).
Fan-out Workers are scaled based on Message Queue lag (Write-heavy).
API Schema Design:
POST /v1/posts: Create a new post. (Body: text, media_ids).
GET /v1/feed: Retrieve current user's feed. (Query: limit, cursor).
POST /v1/follow/{userId}: Follow a user.

Storage

Database Table Design:
Posts Table (Cassandra): post_id (K), author_id, content, media_urls, timestamp. (Wide-column for high write throughput).
Social Graph (Postgres): follower_id, followee_id, created_at. (Index on both IDs).
Technical Selection:
Cassandra: Used for post storage because it handles high-volume writes and scales linearly.
Postgres: Used for the social graph in the MVP due to ACID requirements for follow relationships.
Database Logic: Sharding the Social Graph by follower_id to ensure all "follows" for a single user are on the same shard.

Cache

Key-Value Schema:
Key: feed:{user_id} | Value: List<post_id> (Redis ZSet, score = timestamp).
TTL: 72 hours (only cache active users' feeds).
Technical Selection: Redis is used for its native support of Sorted Sets (ZSet), which makes inserting and retrieving ranked post IDs O(log N).

Messaging

Event/Topic Schema:
Topic: post-published
Payload: { "post_id": "uuid", "author_id": "uuid", "timestamp": "long", "is_celebrity": "bool" }
Technical Selection: Kafka is chosen for its high throughput and ability to replay events if the Fan-out Worker fails.

Data Processing

Processing DAG:
Consume: Read post-published event from Kafka.
Fetch: Query Social Service for all follower_ids of the author_id.
Filter: Identify "Active" followers (logged in within last 30 days).
Update: For each active follower, ZADD the post_id into their Redis feed:{user_id}.
Technical Selection: Manual Worker Service (Go/Java). Spark/Flink is bypassed for the MVP to reduce infrastructure complexity.
Wrap Up

Advanced Topics

Monitoring:
Prometheus/Grafana: Track "Fan-out Lag" (time from post creation to appearing in follower's cache).
Cache Hit Ratio: Monitor how often the Feed Service falls back to the DB.
Trade-offs:
Storage vs. Computation: We choose to duplicate data (Post IDs in millions of Redis lists) to save computation time during read requests.
Bottlenecks: Celebrity users (millions of followers).
Failure Handling:
If Redis fails, the Feed Service performs a "Pull" on-demand from the Metadata DB (Slower, but functional).
Kafka partitions ensure that even if one worker dies, others pick up the load.
Alternatives: Using a Graph Database (Neo4j) for the Social Graph if the relationship logic becomes more complex than simple "follows" (e.g., "Friends of Friends").