DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
The Question
Design

Scalable News Feed System

Design a news feed system similar to Facebook or Twitter. The system should support 300 million daily active users who can post updates and follow other users. Key challenges include maintaining low-latency feed retrieval (under 300ms) while handling the 'celebrity problem' where a single user might have millions of followers. Explain your strategy for data consistency, storage selection for the social graph, and your approach to the push-vs-pull fan-out trade-off.
Kafka
Redis
PostgreSQL
Cassandra
ZSET
CDN
gRPC
Questions & Insights

Clarifying Questions

Scale and Traffic: What is the target Daily Active User (DAU) count and the average number of follows per user?
Assumption: 300M DAU, average 200 follows per user, with some "celebrity" accounts having millions of followers.
Content Types: Does the feed support only text, or also media like images and videos?
Assumption: Support for text and media metadata (URLs). Media storage itself is out of scope for the feed logic.
Feed Ordering: Is the feed strictly chronological or algorithmically ranked?
Assumption: Chronological for the MVP, but the architecture should allow for ranking hooks.
Latency SLAs: What are the target latencies for posting a message and loading the feed?
Assumption: Post visibility within 2 seconds; Feed load time under 300ms.

Thinking Process

Core Bottleneck: The primary challenge is the "Fan-out" problem—efficiently delivering a single post to millions of followers' feeds without crashing the system or causing massive lag.
Strategy Steps:
Write Path: How do we ingest a post and trigger the distribution? (Async Fan-out).
Read Path: How do we serve the feed instantly? (Pre-computed Caching).
Optimization: How do we handle "Celebrities" (users with >100k followers) who break the standard push model? (Hybrid Push/Pull).
Storage: Where do we store the social graph and the posts for durability?

Bonus Points

Hybrid Fan-out Model: Implementing a "Push" model for regular users and a "Pull" model for celebrities to prevent "write amplification" and "thundering herd" issues in the cache.
Pagination Strategy: Using "Keyset Pagination" (Cursor-based) instead of Offset-based pagination to ensure consistent feed viewing while new items are being injected.
Write-Through Cache with TTL: Managing Feed Cache memory by only pre-computing feeds for "Active Users" (e.g., users logged in within the last 3 days).
Social Graph Partitioning: Sharding the Follower database by user_id to ensure localized lookups for fan-out workers.
Design Breakdown

Functional Requirements

Core Use Cases:
Users can create posts (text/media links).
Users can follow/unfollow other users.
Users can view a chronological news feed of people they follow.
Scope Control:
In-scope: Post creation, social graph management, feed generation, and retrieval.
Out-of-scope: Notifications, search, ad-insertion, and media transcoding.

Non-Functional Requirements

Scale: Support 300M DAU and 10k+ QPS for writes, 100k+ QPS for reads.
Latency: Feed retrieval < 300ms (P99).
Availability & Reliability: 99.99% availability (CAP theorem: prioritize Availability over Consistency - AP).
Consistency: Eventual consistency is acceptable; it is okay if a follower sees a post a few seconds late.
Fault Tolerance: The system should continue to serve the feed even if the fan-out worker lags.

Estimation

Traffic Estimation:
300M DAU. If each user checks feed 5 times/day = 1.5B requests/day (~17.5k QPS average, 35k QPS peak).
If 10% of users post once/day = 30M posts/day (~350 QPS average, 1k QPS peak).
Storage Estimation:
Post size: 500 bytes (text + metadata). 30M posts * 500 bytes = 15GB/day. 5.5TB/year.
Bandwidth Estimation:
Outgoing: 1.5B feed views 20 posts 500 bytes = 15TB/day (~1.4 Gbps).

Blueprint

Concise Summary: A hybrid system that uses a Push-model (fan-out on write) for most users to ensure low-latency reads, while falling back to a Pull-model for celebrities to maintain write performance.
Major Components:
API Gateway: Entry point for authentication and rate limiting.
Post Service: Handles incoming posts and persists them to the Metadata DB.
Follow Service: Manages the social graph (who follows whom).
Fan-out Worker: An asynchronous worker that pushes post IDs to the feed caches of followers.
Feed Service: Aggregates and serves the final feed from the Cache and Post DB.
Redis Cache: Stores pre-computed feed IDs for active users.
Simplicity Audit: This design avoids complex ML ranking pipelines and uses a standard MQ-based fan-out, which is the industry standard for MVP-scale social feeds.
Architecture Decision Rationale:
Why this architecture?: Pure "Pull" models (computing feed on-the-fly) are too slow for 200+ follows. Pure "Push" models (writing to every follower) break for celebrities. Hybrid is the most robust.
Functional Satisfaction: Covers posting, following, and viewing.
Non-functional Satisfaction: Redis ensures <300ms latency; Kafka ensures write scalability.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing:
Global DNS (Route 53) directs users to the nearest regional data center.
API Gateway handles SSL termination and JWT-based authentication.
Security & Perimeter:
Rate limiting (e.g., 5 posts per minute per user) to prevent spam.
WAF to protect against common injection attacks.

Service

Topology & Scaling:
Stateless microservices deployed on Kubernetes, auto-scaled based on CPU/Request count.
Regional deployments to minimize latency.
API Schema Design:
POST /v1/posts: Create a post. { "content": "string", "media_ids": [] }
GET /v1/feed: Retrieve feed. Returns list of Post objects. Supports cursor for pagination.
POST /v1/follow/{user_id}: Follow a user.
Resilience & Reliability:
Retries with exponential backoff for Fan-out workers.
Dead Letter Queues (DLQ) for failed fan-out tasks.

Storage

Access Pattern:
Social Graph: Heavy read/write of relationships. (Relational/Graph needed).
Post Metadata: Write once, read many. (NoSQL preferred for scaling).
Database Table Design:
Follows Table (SQL): follower_id (PK/Shard Key), followee_id, created_at.
Posts Table (NoSQL/Cassandra): post_id (PK), author_id, content, media_urls, created_at.
Technical Selection:
PostgreSQL: For Follows (to ensure ACID when following/unfollowing).
Cassandra: For Posts (high write throughput and easy horizontal scaling).

Cache

Purpose & Justification: Pre-computed feeds are stored in Redis to avoid expensive multi-table joins on every feed load.
Key-Value Schema:
Key: feed:{user_id}
Value: Redis ZSET (Sorted Set) where score is timestamp and member is post_id.
TTL: 72 hours (only cache for active users).
Failure Handling: If Redis is cold or empty, Feed Service falls back to a "Pull" query against the PostDB and Social Graph DB.

Messaging

Purpose & Decoupling: Kafka decouples the Post Service from the Fan-out logic. Post Service returns success once the post is persisted and the message is in Kafka.
Event / Topic Schema:
Topic: post-published
Payload: { "author_id": "uuid", "post_id": "uuid", "timestamp": "long" }
Technical Selection: Kafka for its high throughput and message replayability.

Data Processing

Processing Model: The Fan-out Workers are the "Push" engine.
Processing DAG:
Read message from Kafka.
Fetch all follower_ids from GraphDB (excluding followers of celebrities).
For each follower, update their feed:{user_id} ZSET in Redis.
Technical Selection: Python or Go workers using a library like Sarama or Kafka-python.
Wrap Up

Advanced Topics

Trade-offs: We chose Eventual Consistency. A user might not see their own post in their feed for a second or two. This allows the system to remain highly available and performant.
Reliability & Failure Handling:
Celebrity Handling: When a user follows a celebrity, the Fan-out worker skips the push. When the follower views their feed, the Feed Service pulls the celebrity's latest posts and merges them with the Redis-cached feed.
Bottleneck Analysis:
Hot Shards: High-frequency posters could overwhelm a single Kafka partition. We partition by author_id to distribute load but use an internal buffer for high-volume users.
Security & Privacy:
Block/Mute lists: Feed Service filters out posts from blocked users during the final aggregation.