DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
The Question
Design

Design a Scalable Social Media News Feed

Design a news feed system similar to Meta (Facebook) that supports 1 billion daily active users. The system must allow users to post updates, follow others, and view a personalized feed of content from their network in reverse-chronological order. Address the 'Celebrity Problem' (hot keys/fan-out spikes), ensure sub-200ms feed retrieval latency, and discuss the trade-offs between push-based (fan-out on write) and pull-based (fan-out on read) architectures. Detail how you would handle data consistency, social graph storage, and feed caching at scale.
Cassandra
Redis
Kafka
PostgreSQL
CDN
gRPC
ZSET
Bloom Filter
Questions & Insights

Clarifying Questions

What is the scale of the system? (Assumed: 1 Billion DAU, 10 Billion posts per day, average 500 friends/follows per user).
What is the feed ranking requirement for the MVP? (Assumed: Reverse chronological order or a simple weight-based score; complex ML-based ranking is out of scope for MVP).
Does the feed support media like images and videos? (Assumed: Yes, but the News Feed system manages metadata/links; actual binary storage is handled by a separate Blob storage/CDN).
Is the relationship model bilateral (Friends) or unilateral (Follows)? (Assumed: Both, represented as a directed graph).
What is the latency target for feed retrieval? (Assumed: P99 < 200ms).

Thinking Process

Core Bottleneck: The "Fan-out" problem. How do we deliver one post to millions of followers without crashing the database or causing massive lag?
Key Questions:
How do we store the massive volume of posts while maintaining high write availability?
How do we balance between "Push" (pre-computing feeds) and "Pull" (computing on the fly) to handle "Celebrities" vs. "Normal Users"?
How do we structure the Social Graph to quickly identify who should see a post?
How do we ensure the feed remains responsive even under heavy load?

Bonus Points

Hybrid Fan-out Strategy: Use "Push" (Fan-out on write) for active, low-follower users and "Pull" (Fan-out on read) for celebrities/high-follower accounts to prevent write amplification.
Feed Pagination via Sockets/Cursors: Implement stateful cursors to prevent duplicate posts when the feed is updated while the user is scrolling.
Edge Side Composition: Discussing how GraphQL or a BFF (Backend for Front End) layer can aggregate post metadata with user profile data at the edge.
Storage Tiering: Moving older posts to cold storage (HDFS/S3) while keeping recent posts in high-performance NoSQL (Cassandra/ScyllaDB).
Design Breakdown

Functional Requirements

Core Use Cases:
Users can create posts (text, links, media references).
Users can view a feed of posts from friends and followed entities.
Users can add/remove friends or follow/unfollow entities.
Scope Control:
In-scope: Post creation, Feed generation, Fan-out logic, Basic Ranking.
Out-of-scope: Complex ML-based Ad insertion, Real-time comments/likes (separate service), Search, Media transcoding.

Non-Functional Requirements

Scale: Support 1B+ users with millions of concurrent requests.
Latency: Feed loading must be near-instant (< 200ms).
Availability & Reliability: Highly available (99.99%); users should always see some content even if it's slightly stale.
Consistency: Eventual consistency is acceptable (it’s okay if a friend's post shows up 2 seconds late).
Fault Tolerance: System must survive regional data center failures.
Security: Privacy settings (e.g., "Friends only") must be strictly enforced.

Estimation

Traffic:
1B DAU / 86400 sec ≈ 12,000 avg QPS.
Peak QPS (5x) ≈ 60,000 QPS for writes.
Read/Write ratio (10:1) ≈ 600,000 Peak Read QPS.
Storage:
10B posts/day * 500 bytes/post = 5 TB/day.
3 years retention ≈ 5.5 PB.
Bandwidth:
Write: 60k QPS * 500B ≈ 30 MB/s.
Read: 600k QPS * 2KB (feed metadata) ≈ 1.2 GB/s.

Blueprint

Concise Summary: A hybrid architecture utilizing a "Push" model for standard users via a Fan-out worker and a "Pull" model for celebrities. Feeds are pre-computed and stored in a fast Cache layer (Redis).
Major Components:
API Gateway: Handles authentication and routing.
Post Service: Manages creation and storage of post metadata.
Social Graph Service: Manages friend/follow relationships.
Feed Service: Aggregates and serves the feed to users.
Fan-out Worker: Asynchronously pushes new posts to followers' feed caches.
Redis (Feed Store): Stores pre-computed feed IDs for active users.
Cassandra (Post Store): Highly scalable NoSQL for persistent post data.
Simplicity Audit: This design avoids complex real-time ML pipelines by using a pre-computed "Feed Cache," ensuring low-latency reads which is the primary user pain point.
Architecture Decision Rationale:
Hybrid Model: Essential for Facebook-scale to avoid the "Thundering Herd" on celebrity posts while keeping feeds instant for 99% of users.
Functional Satisfaction: Covers posting, friending, and viewing.
Non-functional Satisfaction: Cassandra provides horizontal write scaling; Redis provides sub-millisecond read latency.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: Use a Global CDN for static assets (images/JS). DNS uses latency-based routing to the nearest Data Center.
Security & Perimeter: API Gateway handles JWT validation, SSL termination, and IP-based rate limiting to prevent scrapers.

Service

Topology & Scaling: Stateless microservices deployed on Kubernetes across multiple Availability Zones (AZs). Scaling is based on CPU and Request Count.
API Schema Design:
POST /v1/posts: Create a post. Protocol: REST. Request: {content, media_ids, privacy}. Idempotency: client_request_id.
GET /v1/feed: Get news feed. Protocol: REST. Response: {posts: [], next_cursor}. SLA: 200ms.
Resilience & Reliability: Implementation of exponential backoff for Fan-out workers and circuit breakers on the Graph Service to prevent cascading failures during database lag.

Storage

Access Pattern:
Posts: Heavy write, heavy read by ID.
Graph: Complex joins (who follows whom) but mostly read-heavy.
Database Table Design:
Cassandra (Posts): Partition Key: post_id. Columns: author_id, content, timestamp, media_links.
Postgres (Graph): Table relationships (follower_id, followee_id, type). Index on follower_id and followee_id.
Technical Selection: Cassandra is chosen for its "Write-Optimized" nature and seamless cross-region replication. Postgres is used for the Graph due to ACID requirements for relationship integrity.

Cache

Purpose & Justification: Redis stores the "News Feed" as a list of Post IDs for each active user. This avoids expensive DB queries on every page load.
Key-Value Schema:
Key: feed:{user_id}
Value: ZSET (Sorted Set) where score = timestamp and value = post_id.
TTL: 72 hours for active users.
Failure Handling: If Redis is empty (cold start), the Feed Service falls back to a "Pull" model by querying the Graph Service and Cassandra directly, then repopulating the cache.

Messaging

Purpose & Decoupling: Kafka decouples the Post Service from the Fan-out process. This ensures that even if the fan-out is slow (e.g., a user has 5k friends), the "Post Creation" remains fast.
Event Schema: PostCreatedEvent {post_id, author_id, timestamp, privacy_level}.
Throughput & Partitioning: Partitioned by author_id to ensure ordering of posts from the same user.
Technical Selection: Kafka for high-throughput and replayability in case of fan-out logic bugs.

Infrastructure (Optional)

Observability: Prometheus for RED metrics (Rate, Errors, Duration). Jaeger for tracing the lifecycle of a post from creation to feed appearance.
Distributed Coordination: Not heavily required for MVP, though Consul may be used for service discovery.
Wrap Up

Advanced Topics

Trade-offs (PACELC): We choose Availability (AP) over Consistency. It's better for a user to see an old feed than no feed at all.
The "Celebrity" Problem:
Problem: If a user with 50M followers posts, "Pushing" to 50M Redis lists causes a massive spike.
Optimization: We flag celebrities. When they post, we do NOT push. Instead, when their followers request their feed, the Feed Service "Pulls" the celebrity's latest posts and merges them with the pre-computed feed.
Read Path Optimization: The Feed Service uses a Bloom Filter to check if a user has already seen a post to prevent duplication in the UI.
Data Privacy: The Fan-out worker checks the Social Graph for "Blocked" relationships before adding a post ID to a user's feed cache.