The Question
Design

Scalable Social Media Feed

Design a high-throughput social media platform that allows users to post short text updates, follow other users, and view a real-time aggregated feed of content from their network. Ensure the system can handle millions of concurrent users and significant variations in user popularity (e.g., accounts with millions of followers vs. standard users).
Redis
Kafka
PostgreSQL
Hybrid Fan-Out
Snowflake ID
Questions & Insights

Clarifying Questions

What is the scale of the system (DAU and QPS)? Assumption: 200M Daily Active Users (DAU), 500M tweets per day, and a 1:100 write-to-read ratio (~6k write TPS, ~600k read QPS).
What are the core MVP features? Assumption: Tweeting, Following/Unfollowing, Home Timeline (tweets from people you follow), and User Timeline (your own tweets).
Do we need to support media (images/video)? Assumption: For the MVP, we will support text and metadata. Media is handled via external Blob storage references.
Is search or "trending" a priority? Assumption: No, these are secondary. Focus on the core feed and social graph first.
What is the consistency vs. availability trade-off? Assumption: High availability and low latency are critical. Eventual consistency for timelines is acceptable.

Thinking Process

The core challenge of Twitter is the Fan-out problem: how to efficiently deliver a single tweet to millions of followers' feeds with sub-second latency.
Storage Strategy: How do we store the massive volume of tweets and the social graph? (Answer: Sharded RDBMS for tweets, NoSQL/Graph for relationships).
Read Strategy: How do we serve the Home Timeline without performing massive JOINs across tables? (Answer: Pre-compute feeds in a Cache layer).
The Celebrity Problem: How do we handle users with 50M+ followers where pre-computation (push) would crash the system? (Answer: Hybrid Push/Pull model).
Global Scale: How do we ensure low latency for a global user base? (Answer: Multi-region deployment and Edge Caching).

Bonus Points

Snowflake ID Generation: Using a custom distributed ID generator (like Twitter's Snowflake) to ensure tweet IDs are k-sortable and unique without a central bottleneck.
Cache Stampede Mitigation: Implementing "Lease-based" caching or "Cache-aside with locks" to prevent a surge of requests from hitting the DB when a celebrity tweet's cache expires.
Selective Fan-out: Using a "User Tiering" system to treat "Power Users" (celebrities) differently by using a pull-model (on-demand merge) while standard users use a push-model.
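
To make the Snowflake ID point concrete, here is a minimal sketch of a Snowflake-style generator. The 41/10/12 bit split (timestamp, worker, sequence), the epoch constant, and the class and parameter names are assumptions chosen for illustration; this design does not fix any of them.

```python
import threading
import time


class SnowflakeGenerator:
    """Snowflake-style ID: [timestamp ms | worker id | per-ms sequence]."""

    EPOCH_MS = 1_288_834_974_657  # assumed custom epoch; any fixed past instant works
    WORKER_BITS = 10              # assumed: up to 1024 generator nodes
    SEQUENCE_BITS = 12            # assumed: up to 4096 IDs per ms per node
    MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1

    def __init__(self, worker_id, clock=None):
        if not 0 <= worker_id < (1 << self.WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self._clock = clock or (lambda: int(time.time() * 1000))
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                # Refuse to issue IDs that could collide or sort out of order.
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & self.MAX_SEQUENCE
                if self._sequence == 0:
                    # Sequence exhausted for this millisecond: spin until the next one.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            timestamp_part = (now - self.EPOCH_MS) << (self.WORKER_BITS + self.SEQUENCE_BITS)
            worker_part = self.worker_id << self.SEQUENCE_BITS
            return timestamp_part | worker_part | self._sequence


generator = SnowflakeGenerator(worker_id=7)
sample_ids = [generator.next_id() for _ in range(10_000)]
```

Because the timestamp occupies the high bits, sorting by ID approximates sorting by creation time, which is what lets timelines store only tweet IDs. Each node needs a distinct worker_id; how those are assigned is outside this sketch.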
Design Breakdown

Functional Requirements

Users can post tweets (text-based, 280 chars).
Users can follow/unfollow other users.
Users can view a "Home Timeline" (aggregated tweets from everyone they follow, sorted by time).
Users can view a "User Timeline" (history of their own tweets).

Non-Functional Requirements

High Availability: The system must be available 99.99% of the time.
Low Latency: Timeline generation should be < 200ms.
Scalability: Must handle spikes (e.g., major sporting events or breaking news).
Eventual Consistency: It is okay if a tweet takes a few seconds to appear in all followers' feeds.

Estimation

Writes: 500M tweets / 86400s ≈ 5,800 TPS.
Reads: 5,800 TPS × 100 ≈ 580,000 QPS, rounded to ~600,000 QPS.
Storage (Tweets): 500M tweets/day × 200 bytes ≈ 100 GB/day. Over 5 years ≈ 182 TB.
Bandwidth: 5,800 writes/s × 200 bytes ≈ 1.1 MB/s (inbound text). Reads are significantly higher (100x).
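
The arithmetic above can be reproduced with a short script. The inputs come from the stated assumptions; the 200-byte average tweet size and 5-year horizon are the figures used in the estimates, and the function and key names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def estimate(tweets_per_day=500_000_000, read_write_ratio=100,
             tweet_size_bytes=200, retention_years=5):
    """Back-of-the-envelope capacity numbers for the feed system."""
    write_tps = tweets_per_day / SECONDS_PER_DAY
    read_qps = write_tps * read_write_ratio
    storage_gb_per_day = tweets_per_day * tweet_size_bytes / 1e9
    storage_tb_total = storage_gb_per_day * 365 * retention_years / 1e3
    inbound_mb_per_s = write_tps * tweet_size_bytes / 1e6
    return {
        "write_tps": write_tps,
        "read_qps": read_qps,
        "storage_gb_per_day": storage_gb_per_day,
        "storage_tb_total": storage_tb_total,
        "inbound_mb_per_s": inbound_mb_per_s,
    }


numbers = estimate()
for name, value in numbers.items():
    print(f"{name}: {value:,.1f}")
```

These are averages; peak traffic during spikes would be a multiple of them, so provisioning should not be sized from these figures alone.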

Blueprint

Concise Summary: A microservices-based architecture using a Hybrid Fan-out approach. Most tweets are pushed to followers' caches asynchronously, while celebrity tweets are pulled and merged at read-time.
Major Components:
Tweet Service: Handles incoming writes, persists tweets to the database, and assigns unique IDs.
Social Graph Service: Manages follow/unfollow relationships and provides "Follower" lists for fan-out.
Timeline Service: Serves pre-computed feeds from Redis to users.
Fan-out Workers: Asynchronous workers that update follower caches when a new tweet arrives.
Simplicity Audit: This is the simplest architecture that addresses the "Celebrity Problem" while maintaining the extreme read performance required for a social feed.
Architecture Decision Rationale:
Why is this architecture the best fit for this problem? It separates the heavy write path (fan-out) from the read path (timeline retrieval), so each can scale independently.
Functional Requirement Satisfaction: Meets all core posting and viewing requirements.
Non-functional Requirement Satisfaction: Uses Redis for low-latency reads and Kafka for resilient asynchronous processing.
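
As a rough sketch of the read-time half of the Hybrid Fan-out approach, the function below merges a user's pre-computed (pushed) timeline with the recent tweets of the celebrities they follow. It assumes tweet IDs are time-sortable, so a larger ID means a newer tweet, and that every input list is already sorted newest-first. The dictionaries stand in for the cache and the celebrity tweet store; all names are hypothetical.

```python
import heapq
from itertools import islice


def home_timeline(user_id, pushed_timelines, followed_celebrities,
                  celebrity_tweets, limit=20):
    """Return the newest `limit` tweet IDs for a user.

    pushed_timelines:     user_id -> tweet IDs fanned out at write time (newest first)
    followed_celebrities: user_id -> IDs of celebrity accounts the user follows
    celebrity_tweets:     celebrity_id -> that account's recent tweet IDs (newest first)
    """
    sources = [pushed_timelines.get(user_id, [])]
    for celebrity_id in followed_celebrities.get(user_id, []):
        sources.append(celebrity_tweets.get(celebrity_id, []))
    # heapq.merge lazily merges already-sorted inputs; reverse=True keeps
    # descending (newest-first) order without materializing every tweet.
    merged = heapq.merge(*sources, reverse=True)
    return list(islice(merged, limit))


pushed = {1: [90, 70, 40]}
follows_celebs = {1: [100, 200]}
celeb_tweets = {100: [95, 60], 200: [80]}
page = home_timeline(1, pushed, follows_celebs, celeb_tweets, limit=4)
```

The merge cost grows with the number of celebrities a user follows rather than with any celebrity's follower count, which is the reason for pulling those accounts at read time.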

High Level Architecture

Sub-system Deep Dive

Service

Topology: Services are deployed as Docker containers on Kubernetes (K8s) with autoscaling.
API Spec:
POST /v1/tweet: Create a tweet. Returns tweet_id.
GET /v1/timeline/home: Returns a list of tweet objects for the authenticated user.
POST /v1/follow/{user_id}: Follow a user.
Communication: Internal communication via gRPC for low latency; External via REST/JSON.
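
A minimal, framework-free sketch of the POST /v1/tweet handler logic follows. The 280-character limit comes from the functional requirements; the status codes, error messages, response shape, and the injected next_id and store parameters are assumptions for illustration only.

```python
from dataclasses import dataclass

MAX_TWEET_CHARS = 280  # from the functional requirements


@dataclass(frozen=True)
class Response:
    status: int
    body: dict


def create_tweet(user_id, content, store, next_id):
    """Validate and persist a tweet, returning its ID.

    store:   dict standing in for the Tweet DB (tweet_id -> (user_id, content))
    next_id: callable returning a unique tweet ID (hypothetical ID service)
    """
    if not content or not content.strip():
        return Response(400, {"error": "content must not be empty"})
    if len(content) > MAX_TWEET_CHARS:
        return Response(400, {"error": f"content exceeds {MAX_TWEET_CHARS} characters"})
    tweet_id = next_id()
    store[tweet_id] = (user_id, content)
    return Response(201, {"tweet_id": tweet_id})


_ids = iter(range(1, 1_000))
tweet_store = {}
ok = create_tweet(42, "hello world", tweet_store, lambda: next(_ids))
too_long = create_tweet(42, "x" * 281, tweet_store, lambda: next(_ids))
```

In a real service the user ID would come from the authenticated session rather than a function argument.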

Storage

Data Model (Tweet DB):
Table: tweets (tweet_id: bigint, user_id: bigint, content: varchar, created_at: timestamp).
Sharding: Sharded by user_id to keep a user's tweets on one node.
Data Model (Graph DB):
Table: follows (follower_id: bigint, followee_id: bigint, created_at: timestamp).
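Indexing: Compound index on (follower_id, followee_id) for "who does this user follow" lookups, plus an index on (followee_id) for "who follows this user" lookups during fan-out.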
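
The data model can be sketched with SQLite as an in-process stand-in for the production database. The column names and the two follows indexes mirror the description above; the SQLite column types, the hash-modulo shard_for_user function, and the shard count are assumptions, not part of this design.

```python
import hashlib
import sqlite3


def create_schema(conn):
    conn.executescript("""
        CREATE TABLE tweets (
            tweet_id   INTEGER PRIMARY KEY,
            user_id    INTEGER NOT NULL,
            content    TEXT    NOT NULL,
            created_at TEXT    NOT NULL
        );
        CREATE TABLE follows (
            follower_id INTEGER NOT NULL,
            followee_id INTEGER NOT NULL,
            created_at  TEXT    NOT NULL,
            PRIMARY KEY (follower_id, followee_id)  -- compound index
        );
        CREATE INDEX idx_follows_followee ON follows (followee_id);
    """)


def followers_of(conn, followee_id):
    """Used by fan-out: served by the (followee_id) index."""
    rows = conn.execute(
        "SELECT follower_id FROM follows WHERE followee_id = ? ORDER BY follower_id",
        (followee_id,),
    )
    return [row[0] for row in rows]


def following_of(conn, follower_id):
    """Served by the (follower_id, followee_id) compound index."""
    rows = conn.execute(
        "SELECT followee_id FROM follows WHERE follower_id = ? ORDER BY followee_id",
        (follower_id,),
    )
    return [row[0] for row in rows]


def shard_for_user(user_id, num_shards):
    """Map a user to a shard so all of that user's tweets land on one node."""
    digest = hashlib.sha256(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


db = sqlite3.connect(":memory:")
create_schema(db)
db.executemany(
    "INSERT INTO follows VALUES (?, ?, '2024-01-01T00:00:00Z')",
    [(1, 100), (2, 100), (1, 200)],
)
```

Plain modulo hashing forces most keys to move when the shard count changes; consistent hashing or a lookup directory are common ways to soften that, at the cost of extra machinery.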

Cache

Data Structure: Redis Lists or Sorted Sets (ZSET).
Logic: Key = timeline:{user_id}, Value = [tweet_id_1, tweet_id_2, ...].
TTL: 72 hours for active users. Inactive users' caches are evicted to save memory.
Eviction: LRU (Least Recently Used).
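
The timeline cache behaviour can be illustrated with an in-memory stand-in for a Redis Sorted Set. The class below is not a Redis client; it only mimics the effect of adding a member scored by tweet ID, trimming the oldest entries, and reading newest-first (in Redis terms, roughly ZADD, ZREMRANGEBYRANK, and ZREVRANGE). The max_entries cap is an assumption; this design does not state a timeline length limit.

```python
import bisect


class TimelineCache:
    """In-memory model of per-user timelines keyed as timeline:{user_id}."""

    def __init__(self, max_entries=800):
        self.max_entries = max_entries  # assumed cap per timeline
        self._timelines = {}            # key -> tweet IDs in ascending order

    def add(self, user_id, tweet_id):
        ids = self._timelines.setdefault(f"timeline:{user_id}", [])
        pos = bisect.bisect_left(ids, tweet_id)
        if pos < len(ids) and ids[pos] == tweet_id:
            return  # already present: set semantics make re-adding a no-op
        ids.insert(pos, tweet_id)
        overflow = len(ids) - self.max_entries
        if overflow > 0:
            del ids[:overflow]  # drop the oldest (smallest) IDs

    def latest(self, user_id, limit=20):
        ids = self._timelines.get(f"timeline:{user_id}", [])
        return ids[::-1][:limit]  # newest first


cache = TimelineCache(max_entries=3)
for tid in (10, 30, 20, 30, 40):
    cache.add(user_id=1, tweet_id=tid)
```

A Sorted Set has one practical advantage over a List here: adding the same tweet ID twice leaves the timeline unchanged, which makes redelivered fan-out messages harmless.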

Messaging

Topology: A Kafka topic (new_tweets) partitioned by the author's user_id, so each author's tweets are processed in order.
Delivery: At-least-once delivery to ensure no tweet is lost in the fan-out process.
Consumers: Fan-out workers read from the new_tweets topic, fetch the author's followers from the Graph Service, and update their Redis timelines.
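
The consumer logic can be sketched without a Kafka client as a plain function applied to each event. The event shape, the follower-count threshold, and the callables standing in for the Graph Service are hypothetical. Timelines are modelled as sets so that a redelivered message, which at-least-once delivery permits, does not create duplicates.

```python
CELEBRITY_FOLLOWER_THRESHOLD = 1_000_000  # hypothetical cut-off for the pull path


def handle_new_tweet(event, follower_count, followers_of, timelines,
                     threshold=CELEBRITY_FOLLOWER_THRESHOLD):
    """Process one new_tweets event; return how many timelines were touched.

    event:          {"tweet_id": int, "user_id": int} (assumed shape)
    follower_count: callable author_id -> int (stand-in for the Graph Service)
    followers_of:   callable author_id -> iterable of follower IDs
    timelines:      dict follower_id -> set of tweet IDs (stand-in for the cache)
    """
    author_id = event["user_id"]
    tweet_id = event["tweet_id"]
    if follower_count(author_id) >= threshold:
        # Celebrity: skip the push; readers merge these tweets at read time.
        return 0
    touched = 0
    for follower_id in followers_of(author_id):
        timelines.setdefault(follower_id, set()).add(tweet_id)  # idempotent write
        touched += 1
    return touched


graph = {10: [1, 2, 3], 99: [1, 2, 3, 4, 5]}
timelines = {}
event = {"tweet_id": 500, "user_id": 10}
first = handle_new_tweet(event, lambda a: len(graph[a]), graph.get, timelines, threshold=5)
again = handle_new_tweet(event, lambda a: len(graph[a]), graph.get, timelines, threshold=5)
celebrity = handle_new_tweet({"tweet_id": 501, "user_id": 99},
                             lambda a: len(graph[a]), graph.get, timelines, threshold=5)
```

In a real worker, the consumer offset would be committed only after the timeline writes succeed; that ordering is what gives at-least-once behaviour, and the idempotent writes are what make it safe.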
Wrap Up

Advanced Topics

Monitoring: Prometheus for metrics (QPS, Error Rate, Latency), Jaeger for distributed tracing of the fan-out process.
Trade-offs: We trade Consistency for Availability. A user might follow someone and not see their tweet for 1-2 seconds (Eventual Consistency).
Bottlenecks: The Social Graph Service can become a bottleneck for users with millions of followers. This is mitigated by the Hybrid Push/Pull approach, which skips fan-out for those accounts.
Failure Handling:
Redis Failure: If a cache node dies, the Timeline Service can reconstruct the feed from the Tweet DB (Pull model fallback).
Kafka Lag: Monitor "Consumer Lag" to ensure fan-out isn't falling behind during peaks.
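
The Redis failure fallback interacts with the cache stampede concern raised earlier: if many requests miss the cache at once, each would rebuild the same feed from the Tweet DB. Below is a minimal single-process sketch of cache-aside with a per-key lock, so only one caller rebuilds a given timeline while the others wait and reuse the result. The class and the load_from_db callable are hypothetical; a multi-node deployment would need a distributed lock or lease instead of threading.Lock.

```python
import threading
from collections import defaultdict


class TimelineReader:
    """Cache-aside reads with one rebuild per key at a time."""

    def __init__(self, load_from_db):
        self._load_from_db = load_from_db  # slow path: rebuild feed from the Tweet DB
        self._cache = {}
        self._locks = defaultdict(threading.Lock)
        self._locks_guard = threading.Lock()

    def get(self, user_id):
        cached = self._cache.get(user_id)
        if cached is not None:
            return cached  # fast path: cache hit
        with self._locks_guard:
            lock = self._locks[user_id]  # one lock per timeline key
        with lock:
            # Re-check: another thread may have rebuilt while we waited.
            cached = self._cache.get(user_id)
            if cached is None:
                cached = self._load_from_db(user_id)
                self._cache[user_id] = cached
            return cached


db_calls = []


def slow_rebuild(user_id):
    db_calls.append(user_id)
    threading.Event().wait(0.05)  # simulate a slow database query
    return [3, 2, 1]


reader = TimelineReader(slow_rebuild)
results = []
threads = [threading.Thread(target=lambda: results.append(reader.get(1)))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With eight concurrent readers missing on the same key, the database is queried once and the other seven reuse the cached result.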
Alternatives:
Graph DB (Neo4j): Could be used for complex "Follow suggestions" but is overkill for simple "Follow" lists in an MVP.
Cassandra: Could replace the SQL Tweet DB for better write scalability, but requires more complex data modeling.