DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
The Question
Design

Design a Scalable News Feed System

Design the core architecture for a global social media news feed similar to Facebook or Instagram. The system must handle over 2 billion Daily Active Users (DAU). Key requirements include: supporting various media types (text, images, video), generating a personalized feed with low latency (<200ms), and managing the 'celebrity problem' (high-profile accounts with millions of followers). Explain your strategy for feed generation (push vs. pull), how you handle real-time interaction counters (likes/comments), and your data sharding approach to ensure high availability and horizontal scalability.
Redis
Kafka
PostgreSQL
Cassandra
Neo4j
Flink
CDN
gRPC
GraphQL
Questions & Insights

Clarifying Questions

Scale of Operation: What is the target Daily Active User (DAU) count and average number of friends/followings per user?
Assumption: 2 Billion DAU, average 300-500 friends, peak load of 100k+ writes/sec and 1M+ reads/sec.
Content Freshness: Is "real-time" defined as sub-second delivery or is a few seconds of lag acceptable for the feed?
Assumption: Near real-time (1-2 seconds) for typical users; sub-second for interactions like likes.
Ranking Complexity: Should we implement a heavy Machine Learning (ML) ranking engine or a simple heuristic-based ranking (e.g., chronological + weight) for the MVP?
Assumption: Heuristic-based ranking for the MVP, with architectural hooks for future ML integration.
Media Handling: Does the system handle raw media processing, or is that outsourced to a dedicated media service?
Assumption: The feed system stores metadata/URLs; media processing (transcoding/compression) is handled by a separate pipeline.

Thinking Process

Core Bottleneck: The primary challenge is the Fan-out process—how to deliver a single post to millions of followers without crashing the database or causing massive latency.
Progressive Strategy:
How do we store and retrieve the social graph (who follows whom)?
How do we balance "Push" (Fan-out on write) for regular users vs. "Pull" (Fan-out on load) for celebrities/influencers?
How do we cache the pre-computed feeds to ensure <200ms latency?
How do we integrate real-time interactions (likes/comments) without re-calculating the whole feed?

Bonus Points

Hybrid Fan-out Strategy: Implementing a "Celebrity" flag to prevent "Hot Key" issues in Redis and write amplification in the database.
Consistent Hashing with Vnodes: Using advanced sharding for the Feed Cache to ensure rebalancing doesn't cause a massive cache miss storm.
Read-Repair & Lazy Updates: Using a "Pull" mechanism for inactive users to save storage and compute costs (YAGNI on inactive data).
Graph Partitioning: Discussing how to shard the social graph based on "User Clusters" to keep friend-data on the same physical shard for faster queries.
Design Breakdown

Functional Requirements

Core Use Cases:
Users can create posts with text and media metadata.
Users can view a personalized feed of posts from friends/followings.
Users can "Like" or "Comment" on posts.
Real-time notification/update when a new post is available in the feed.
Scope Control:
In-Scope: Feed generation, fan-out logic, basic ranking, media metadata storage.
Out-of-Scope: Ad-engine integration, deep-learning video transcoding, full-text search across all historical posts.

Non-Functional Requirements

Scale: Support 2B+ users and billions of posts.
Latency: Feed retrieval P99 < 200ms.
Availability & Reliability: 99.99% availability (AP in CAP theorem); feed should never be "empty" if the ranking service is down.
Consistency: Eventual consistency is acceptable for feed updates.
Fault Tolerance: Graceful degradation (show chronological feed if personalized ranking fails).

Estimation

Traffic Estimation:
2B DAU * 10 feed refreshes/day = ~230,000 Read QPS.
200M Posts/day (10% users post) = ~2,300 Write QPS.
Peak QPS = 5x Average (~1.1M Read, ~11k Write).
Storage Estimation:
Metadata: 200M posts * 1KB = 200GB/day.
1 Year Metadata = ~73TB (Sharding required).
Bandwidth Estimation:
Outgoing: 230k QPS 20 posts 2KB (metadata only) = ~9.2 GB/s.

Blueprint

Concise Summary: A hybrid Fan-out architecture where regular user posts are pushed to friend-specific caches, while high-profile (celebrity) posts are pulled at query time to avoid write amplification.
Major Components:
Post Service: Handles incoming writes, media metadata, and persists posts to the DB.
Fan-out Worker: Asynchronously distributes post IDs to the "Home Feeds" of followers.
Feed Service: Aggregates post IDs from the cache, fetches metadata, and applies ranking.
Social Graph Service: Manages follow/friend relationships using a graph-optimized store.
Interaction Service: High-throughput counter service for likes and comments.
Simplicity Audit: We use a simple Redis-based pre-computed feed for the MVP instead of a complex real-time streaming ML pipeline.
Architecture Decision Rationale:
Why this?: Hybrid fan-out is industry-standard for social media to balance write-heavy vs read-heavy loads.
Functional Satisfaction: Covers posts, feed, and interactions seamlessly.
Non-functional Satisfaction: Redis ensures low latency; Kafka ensures async processing to keep post-creation latency low.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing
CDN: Used for all static media (images/video). Feed metadata is dynamic and bypasses CDN but uses regional Load Balancers.
Security & Perimeter
API Gateway: Performs JWT validation, rate limiting (prevents bots from scraping feeds), and TLS termination.

Service

Topology & Scaling
Stateless Services: All services are deployed in K8s clusters across multiple Availability Zones (AZs).
Scaling: Feed Service scales on CPU/Memory; Fan-out workers scale on MQ lag.
API Schema Design
POST /v1/posts: Create a post. (gRPC)
GET /v1/feed: Retrieve current user's feed. (REST/GraphQL)
POST /v1/interactions: Like/Comment. (gRPC)
Resilience & Reliability
Circuit Breakers: If the Ranking Service is slow, the Feed Service falls back to a simple chronological sort.
Retries: Exponential backoff for Fan-out workers to ensure eventual delivery.

Storage

Access Pattern
Posts: Write once, read many. Strong consistency for the author, eventual for followers.
Graph: Heavy read for friendship verification.
Database Table Design
Post Table: post_id (PK), user_id, content_body, media_urls[], timestamp.
Graph Table: user_id, friend_id, created_at.
Technical Selection
Post DB: Sharded PostgreSQL or Cassandra for high-volume metadata.
Graph DB: Neo4j or sharded MySQL with specialized indexing for adjacency lists.
Distribution Logic
Sharding: Posts sharded by post_id (Snowflake ID) to ensure uniform distribution.

Cache

Purpose & Justification: Pre-computing the feed is the only way to meet the 200ms requirement.
Key-Value Schema:
Key: feed:{user_id}.
Value: Sorted Set (PostID as value, Timestamp as score).
TTL: 72 hours for active users; evicted for inactive users.
Failure Handling: If Redis fails, the system falls back to querying the Post DB directly for friends' posts (performance hit but functional).

Messaging

Purpose & Decoupling: Kafka decouples the critical "Post Creation" path from the heavy "Fan-out" path.
Throughput & Partitioning:
Topic: post-published.
Partition Key: author_id.
Failure Handling: Dead-letter queues (DLQ) for posts that fail to fan out after 3 retries.

Data Processing

Processing Model: Spark/Flink jobs consume interaction logs (likes/clicks) to update the Feature Store.
Correctness Guarantees: Periodic batch jobs reconcile any missed counts in the real-time interaction service.
Technical Selection: Flink for real-time feature aggregation.

Infrastructure (Optional)

Observability
Prometheus/Grafana: Monitoring "Fan-out Lag"—the time between post creation and appearance in friend feeds.
Distributed Coordination
Etcd: Configuration management for "Celebrity List" (updated dynamically).
Wrap Up

Advanced Topics

Trade-offs: We prioritize Availability over Consistency. A user might not see a friend's post for a few seconds (Eventual Consistency), but the feed must always load.
Reliability: Hot Shard Prevention. For "Celebrities" (users with >100k followers), we DO NOT push to friend caches. Instead, we pull their recent posts at query time and merge them into the feed.
Bottleneck Analysis: The "Feed Cache" size can explode. We limit the cache to the latest 500 post IDs per user to optimize memory.
Security & Privacy: Visibility checks (e.g., "Friends Only") are applied at the Feed Service layer during the retrieval of post metadata.