The Question
Design a Social Media Microblogging Platform
Design a social media platform similar to Twitter. The system should support posting short messages, following other users, delivering personalized real-time feeds, handling viral content fan-out, and serving hundreds of millions of daily active users at low latency.
PostgreSQL
Redis
Kafka
RabbitMQ
Kubernetes
Questions & Insights
Thinking Process
Core Bottleneck: The "Fan-out" problem. Delivering a single tweet to millions of followers (e.g., a celebrity tweet) creates a massive write spike that can stall the system if handled synchronously.
Architectural Shift: Move from "Pull on Read" (expensive SQL joins) to "Push on Write" (pre-computing timelines in memory).
Questions for Progression:
How do we store tweets such that they are retrievable by ID and by user?
How do we minimize feed latency for the 99% of users who are not celebrities?
How do we handle "Mega-Celebrities" whose fan-out would overwhelm the pre-computation buffers?
How do we ensure the system remains available even if the pre-computation (Fan-out) lag increases?
Bonus Points
Hybrid Fan-out Strategy: Use "Push" (Fan-out) for regular users to keep their followers' feeds instant, but use "Pull" (On-demand merge) for celebrities (e.g., >100k followers) to prevent "Thundering Herd" write amplification.
Geo-Sharding & Edge Caching: Sharding the Timeline Cache by user's geographic proximity to reduce cross-region latency for feed retrieval.
Feed Ranking: Implementing a lightweight "scoring" mechanism (Recency + Engagement) during the fan-out process rather than a pure chronological sort.
Storage Cost Optimization: Use Cold Storage (S3/Glacier) for tweets older than 30 days, keeping only the recent "Hot" tweets in the primary DB and Redis.
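The "Recency + Engagement" scoring idea from Feed Ranking can be sketched as a single function. This is a minimal illustration, not a ranking used by any real platform: the half-life, the engagement weight, the retweet multiplier, and the name feed_score are all assumptions chosen for the example.

```python
import math


def feed_score(
    age_seconds: float,
    likes: int,
    retweets: int,
    half_life_seconds: float = 6 * 3600,   # assumed: score halves every 6 hours
    engagement_weight: float = 0.1,        # assumed: how much engagement can boost
) -> float:
    """Combine recency and engagement into one sortable score (hypothetical formula)."""
    # Recency decays exponentially: 1.0 for a brand-new tweet, 0.5 after one half-life.
    recency = 0.5 ** (age_seconds / half_life_seconds)
    # log1p dampens engagement so a single viral tweet cannot dominate the feed.
    engagement = math.log1p(likes + 2 * retweets)
    return recency * (1.0 + engagement_weight * engagement)


# A fresh tweet with some engagement outranks a fresh tweet with none,
# and the same tweet scores lower once it is a day old.
fresh_plain = feed_score(0, likes=0, retweets=0)
fresh_liked = feed_score(0, likes=50, retweets=10)
day_old_liked = feed_score(24 * 3600, likes=50, retweets=10)
```

Because the score is a plain float, it could be stored as the sorted-set score in place of the raw timestamp, at the cost of losing strict chronological order.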
Design Breakdown
Functional Requirements
Users can post tweets (text-based for MVP).
Users can follow other users.
Users can view a "Home Timeline" (posts from people they follow).
Users can view a "User Timeline" (their own posts).
Non-Functional Requirements
High Availability: The system must be available for reading feeds even if some background processing is delayed.
Low Latency: Home Timeline generation must be < 200ms.
Scalability: Support 300M Daily Active Users (DAU).
Eventual Consistency: It is acceptable if a tweet takes a few seconds to appear in all followers' feeds.
Estimation
DAU: 300 Million.
Write Throughput (Tweets): 500M tweets/day ≈ 5,800/s, rounded up to 6,000 Tweets Per Second (TPS).
Read Throughput (Timeline views): 300M users * 10 views/day = 3B views/day ≈ 35,000 reads per second.
Storage (Tweets): 500M tweets * 280 bytes \approx 140 GB/day.
Fan-out Load: Avg 200 followers/user. 6,000 TPS * 200 = 1.2M write ops/sec to Redis Timeline caches.
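The estimates above can be reproduced with a few lines of arithmetic. The sketch below uses only the figures stated in this section; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

dau = 300_000_000
tweets_per_day = 500_000_000
views_per_user_per_day = 10
bytes_per_tweet = 280
avg_followers = 200

# Average rates; real traffic peaks well above the daily average.
write_tps = tweets_per_day / SECONDS_PER_DAY                  # about 5,787, rounded to 6,000
read_qps = dau * views_per_user_per_day / SECONDS_PER_DAY     # about 34,722, rounded to 35,000
storage_gb_per_day = tweets_per_day * bytes_per_tweet / 1e9   # 140.0 (text only, no metadata)

# Fan-out uses the rounded write rate, as in the estimate above.
fanout_writes_per_sec = 6_000 * avg_followers                 # 1,200,000

print(f"writes/s: {write_tps:,.0f}")
print(f"reads/s: {read_qps:,.0f}")
print(f"storage: {storage_gb_per_day:.0f} GB/day")
print(f"fan-out: {fanout_writes_per_sec:,} Redis writes/s")
```

The fan-out figure is roughly 200 times the tweet write rate, which is why the push path is treated as the dominant cost in the rest of the design.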
Blueprint
Concise Summary: A write-heavy pre-computation architecture where tweets are asynchronously pushed into the Redis-based timelines of followers.
Major Components:
Tweet Service: Handles incoming tweet writes and persists them to the primary database.
Social Graph Service: Manages the follow/unfollow relationships and provides follower lists for fan-out.
Timeline Service: Serves pre-computed feeds from Redis to the end-user.
Fan-out Workers: Asynchronous consumers that propagate new tweets to the cached timelines of followers.
Simplicity Audit: This architecture uses a standard "Push" model which solves the read-latency requirement immediately. It avoids complex machine learning ranking for the MVP, using simple chronological ordering.
High Level Architecture
Sub-system Deep Dive
Service
Topology & Scaling: Services are stateless Docker containers deployed on K8s, scaled horizontally based on CPU/Request count.
API Spec:
POST /v1/tweet: Accepts tweet body; returns tweet_id.
GET /v1/timeline/home: Returns list of tweet objects; paginated via max_id.
POST /v1/follow/{userId}: Updates social graph.
Storage
Data Model:
Tweet DB (PostgreSQL):
tweet_id (PK), user_id (FK), content, created_at.
Graph DB (PostgreSQL):
follower_id, followee_id. Composite PK on both.
Database Logic: Sharded by user_id to ensure all tweets for a single user live on the same node, making "User Timeline" queries efficient.
Cache
Implementation: Redis clusters sharded by user_id.
Data Structure: Redis List or Sorted Set (ZSET). Key: timeline:{user_id}, Value: tweet_id, Score: timestamp.
TTL & Eviction: Timelines are cached for active users only (e.g., logged in within the last 3 days). TTL of 72 hours. Limit list size to the 1,000 latest tweet_ids.
Messaging
Implementation: Kafka or RabbitMQ.
Topic Structure: tweet-published topic.
Delivery Guarantees: At-least-once delivery to ensure no follower's feed misses a tweet. Idempotent writes to Redis (ZADD) handle retries.
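To see why at-least-once delivery is safe with a sorted set, the sketch below models the timeline cache in memory. It is a stand-in for Redis, not Redis client code: the class TimelineCache and its methods are hypothetical, with add mirroring the effect of ZADD followed by a trim such as ZREMRANGEBYRANK.

```python
from typing import Dict, List


class TimelineCache:
    """In-memory stand-in for per-user Redis sorted sets (illustration only)."""

    def __init__(self, max_len: int = 1000) -> None:
        self.max_len = max_len
        self._zsets: Dict[str, Dict[int, float]] = {}

    def add(self, user_id: int, tweet_id: int, timestamp: float) -> None:
        # Like ZADD: re-adding an existing member only overwrites its score,
        # so a redelivered message leaves the timeline unchanged.
        zset = self._zsets.setdefault(f"timeline:{user_id}", {})
        zset[tweet_id] = timestamp
        # Like a rank-based trim: drop the oldest entries beyond max_len.
        overflow = len(zset) - self.max_len
        if overflow > 0:
            for old_id in sorted(zset, key=lambda t: (zset[t], t))[:overflow]:
                del zset[old_id]

    def latest(self, user_id: int, count: int = 20) -> List[int]:
        # Newest first, similar to reading a sorted set in reverse score order.
        zset = self._zsets.get(f"timeline:{user_id}", {})
        return sorted(zset, key=lambda t: (zset[t], t), reverse=True)[:count]


cache = TimelineCache(max_len=3)
for tweet_id, ts in [(1, 100.0), (2, 101.0), (2, 101.0), (3, 102.0), (4, 103.0)]:
    cache.add(user_id=42, tweet_id=tweet_id, timestamp=ts)  # tweet 2 is delivered twice

timeline = cache.latest(42)
```

A Redis List would not give the same guarantee, since pushing the same tweet_id twice would store it twice; that is the main reason to prefer the sorted set when retries are expected.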
Data Processing
Component: Fan-out Workers (Python/Go consumers).
Logic:
Consume tweet_id and author_id from Kafka.
Query Social Graph Svc for follower_ids.
For each follower_id, insert tweet_id into their Redis ZSET.
Optimization: For celebrities (e.g., >100k followers), skip the fan-out and mark the tweet for "Pull" retrieval by the Timeline Service.
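The worker logic above, including the celebrity shortcut, might look roughly like the following. This is a sketch under stated assumptions: the event is a plain dict, the social graph is reached through two hypothetical callables (follower_count and follower_ids), and nested dicts stand in for the Redis sorted sets. No Kafka or Redis client is involved.

```python
from typing import Callable, Dict, Iterable, Set

CELEBRITY_THRESHOLD = 100_000  # follower cutoff taken from the example in the text


def fan_out(
    event: dict,
    follower_count: Callable[[int], int],
    follower_ids: Callable[[int], Iterable[int]],
    timelines: Dict[int, Dict[int, float]],
    pull_authors: Set[int],
    threshold: int = CELEBRITY_THRESHOLD,
) -> int:
    """Handle one tweet-published event; return the number of timelines written."""
    author_id = event["author_id"]
    # Celebrities are not pushed: record the author so the read path can pull instead.
    if follower_count(author_id) > threshold:
        pull_authors.add(author_id)
        return 0
    written = 0
    for follower_id in follower_ids(author_id):
        # Dict assignment stands in for ZADD timeline:{follower_id} <timestamp> <tweet_id>.
        timelines.setdefault(follower_id, {})[event["tweet_id"]] = event["created_at"]
        written += 1
    return written


# Tiny in-memory graph: author 1 is ordinary, author 2 exceeds the demo threshold of 3.
graph = {1: [10, 11], 2: [10, 11, 12, 13]}
timelines: Dict[int, Dict[int, float]] = {}
pull_authors: Set[int] = set()

events = [
    {"tweet_id": 501, "author_id": 1, "created_at": 1000.0},
    {"tweet_id": 502, "author_id": 2, "created_at": 1001.0},
]
for ev in events:
    fan_out(ev, lambda a: len(graph[a]), lambda a: graph[a],
            timelines, pull_authors, threshold=3)
```

Checking a follower count before loading the follower list is a deliberate choice in this sketch: it avoids fetching millions of IDs only to discard them.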
Wrap Up
Advanced Topics
Trade-offs: We prioritize Availability over Consistency (AP in CAP). If the Fan-out workers are delayed, a user will see their own tweet on their profile right away, but followers may not see it in their feeds until the workers catch up.
Bottlenecks: The Celebrity Fan-out is the primary bottleneck. If a user has 50M followers, a single write causes 50M Redis operations.
Failure Handling:
Redis Failure: If a Redis shard dies, the Timeline Service falls back to a "Pull" model (SQL join) for those specific users, albeit at higher latency.
Kafka Lag: Monitor "Consumer Lag" metrics; auto-scale Fan-out workers if lag exceeds 5 seconds.
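The Redis fallback described above can be sketched as a read path that tries the cache first and pulls from the database when the shard is unreachable. The function names, the use of ConnectionError to signal a dead shard, and treating a cache miss the same as a failure are all assumptions for this illustration.

```python
from typing import Callable, List, Optional, Tuple


def home_timeline(
    user_id: int,
    cache_get: Callable[[int], Optional[List[int]]],
    pull_from_db: Callable[[int, int], List[int]],
    limit: int = 20,
) -> Tuple[List[int], str]:
    """Return (tweet_ids, source), where source is "cache" or "pull"."""
    try:
        cached = cache_get(user_id)
    except ConnectionError:
        # Shard is down: degrade to the slower pull path instead of failing the request.
        cached = None
    if cached is None:
        # Assumed here: a miss (e.g., an inactive user whose timeline expired) also pulls.
        return pull_from_db(user_id, limit), "pull"
    return cached[:limit], "cache"


# Hypothetical stand-ins for the Redis lookup and the SQL-join query.
def healthy_cache(user_id: int) -> Optional[List[int]]:
    return [903, 902, 901]


def dead_shard(user_id: int) -> Optional[List[int]]:
    raise ConnectionError("redis shard unavailable")


def pull_from_db(user_id: int, limit: int) -> List[int]:
    return [903, 902, 901][:limit]


fast = home_timeline(42, healthy_cache, pull_from_db)
degraded = home_timeline(42, dead_shard, pull_from_db, limit=2)
```

Returning the source alongside the result makes it easy to emit a metric for how often the slow path is taken.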
Alternatives & Optimization:
Alternative: Use a Graph Database (Neo4j) for the Social Graph. Decision: Stick to PostgreSQL for MVP to reduce operational complexity (YAGNI).
Optimization: Use "Seen Markers" (Bloom Filters) to ensure users don't see the same tweet twice if the cache is rebuilt.
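A "Seen Marker" can be illustrated with a small Bloom filter. The sketch below is a generic textbook construction, not the design of any particular platform; the class name, the bit-array size, and the number of hash functions are arbitrary choices for the example. A Bloom filter never reports a seen tweet as unseen, but it can occasionally report an unseen tweet as seen, so a rebuilt feed may rarely hide a tweet the user has not viewed.

```python
import hashlib
from typing import Iterator, List


class BloomFilter:
    """Fixed-size Bloom filter over integer tweet IDs (illustrative parameters)."""

    def __init__(self, size_bits: int = 8192, num_hashes: int = 4) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: int) -> Iterator[int]:
        # Derive num_hashes bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: int) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: int) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def filter_unseen(tweet_ids: List[int], seen: BloomFilter) -> List[int]:
    """Drop tweets the user has (probably) already seen, preserving order."""
    return [t for t in tweet_ids if t not in seen]


seen = BloomFilter()
for tweet_id in (101, 102, 103):
    seen.add(tweet_id)

rebuilt_feed = [104, 103, 102, 105]
fresh_only = filter_unseen(rebuilt_feed, seen)
```

With 8,192 bits per user the filter costs 1 KB, which is far smaller than storing every seen tweet_id explicitly.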