The Question

Scalable Microblogging Platform (Twitter)

Design a microblogging service similar to Twitter. The system must support posting short text updates, following other users, and viewing a real-time home timeline. Key challenges include handling high-volume read traffic (600k+ QPS), managing the 'fan-out' process for users with millions of followers, and ensuring sub-200ms latency for timeline retrieval. Address the trade-offs between push and pull models for timeline generation and define a storage strategy that scales to billions of tweets.

Kafka

Redis

Cassandra

PostgreSQL

CDN

Snowflake ID

Microservices

Questions & Insights

Clarifying Questions

What is the expected scale (DAU and traffic)?

Assumption: 300 Million Daily Active Users (DAU), ~6,000 tweets/sec (write), and ~600,000 timeline views/sec (read).

What are the core features for the MVP?

Assumption: Tweeting (text + media), following/unfollowing, and viewing a reverse-chronological home timeline.

How do we handle "Celebrities" (users with millions of followers)?

Assumption: A hybrid fan-out approach is required to prevent "the Justin Bieber effect" from breaking the write path.

What is the consistency vs. availability trade-off?

Assumption: High availability is preferred over strong consistency for the timeline. It is acceptable if a tweet takes a few seconds to appear for all followers.

Thinking Process

To design a scalable microblogging platform, we must solve the Fan-out problem.

How do we deliver a tweet to 100 million followers without crashing? We use an asynchronous fan-out service triggered by a message queue to update pre-computed timelines in a cache.

How do we handle celebrities? We switch from a "Push" (Pre-computed) model to a "Pull" (On-demand) model for users with massive follower counts.

How do we ensure low latency for the Home Timeline? We pre-compute the timeline and store it in an in-memory data store (Redis) so the read path is a simple O(1) or O(N) cache lookup.

How do we store massive amounts of tweet metadata? We use a distributed NoSQL or sharded Relational database optimized for high-write throughput.

Bonus Points

Hybrid Fan-out Strategy: Implementing a "Celebrity Status" flag. For normal users, we push to follower caches. For celebrities, we pull at read-time and merge.

Read-Repair & Lazy Loading: Optimizing the timeline by only loading the top 200 tweets and lazily fetching older content.

Geo-Sharding: Sharding user data based on geographical affinity to reduce cross-region latency.

Media Optimization: Using a "Bit-rate Adaptive" storage strategy and Edge-CDNs for media delivery.

Design Breakdown

Functional Requirements

Core Use Cases:

Users can post tweets (text + media).

Users can follow/unfollow other users.

Users can view a "Home Timeline" (reverse-chronological tweets from people they follow).

Users can view a "User Timeline" (tweets from a specific user).

Scope Control:

In-scope: Tweeting, Following, Home/User Timeline, Media uploads.

Out-of-scope: Direct Messages, Search, Trending Topics, Retweets (for MVP), Analytics.

Non-Functional Requirements

Scale: Support 300M DAU and 100k+ QPS at peak.

Latency: Timeline generation < 200ms.

Availability & Reliability: 99.99% (Highly Available); social media is not mission-critical for strong consistency.

Consistency: Eventual consistency for the timeline.

Fault Tolerance: No single point of failure; graceful degradation if the fan-out service lags.

Estimation

Traffic Estimation:

Writes (Tweets): 6k tweets/sec * 86400s

\approx

500M tweets/day.

Reads (Timeline): 300M DAU * 20 visits

\approx

6B views/day (

\approx

70k average QPS, 140k peak).

Storage Estimation:

Text: 500M tweets * 280 bytes

\approx

140 GB/day.

Media: 10% of tweets have media (50M images/day * 1MB)

\approx

50 TB/day.

Bandwidth Estimation:

Ingress: 50 TB/day

\approx

580 MB/s.

Egress: (Reads are 100x writes)

\approx

50 GB/s (handled by CDN).

Blueprint

The architecture centers on a Hybrid Fan-out model. The write path is decoupled via a Message Queue to ensure the user isn't blocked while their followers' timelines are updated. The read path is optimized by serving pre-computed timelines from Redis.

Major Components:

API Gateway: Handles authentication, rate limiting, and request routing.

Tweet Service: Manages tweet creation and persists metadata to the database.

Social Graph Service: Manages follow/unfollow relationships and provides follower lists for fan-out.

Fan-out Workers: Asynchronously updates follower timelines in the cache.

Timeline Service: Aggregates and serves the Home/User timelines.

Redis Cluster: Stores pre-computed Home Timelines for fast retrieval.

Simplicity Audit: This design avoids complex real-time stream processing engines (like Flink) in favor of simple workers and a cache, which is sufficient for a reverse-chronological MVP.

Architecture Decision Rationale:

Why this?: Twitter is extremely read-heavy (

100:1

read/write ratio). Pre-computing timelines shifts the heavy lifting from the read path to the write path.

Functional: Meets all core posting and viewing requirements.

Non-functional: Redis ensures sub-100ms latency; Kafka ensures the system can handle write spikes.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing:

CDN: Used for all media (images/videos). Static assets are cached at the edge.

Load Balancer: L7 Load Balancer handles SSL termination and routes traffic based on path (/v1/tweets vs /v1/timeline).

Security & Perimeter:

API Gateway: Validates JWTs, performs rate limiting (e.g., 100 tweets/hour per user), and scrubs PII from logs.

Service

Topology & Scaling:

Stateless microservices deployed across multiple Availability Zones (AZs).

Scaling is based on CPU and Request Count (HPA in Kubernetes).

API Schema Design:

POST /v1/tweets: { text, media_ids }. Protocol: REST. Idempotency Key required in header.

GET /v1/timeline/home: Returns list of Tweet objects. Supporting pagination via since_id.

Resilience & Reliability:

Circuit Breakers: If the Fan-out service is down, the Tweet service continues to save to DB, but timelines might be delayed.

Storage

Access Pattern:

High write volume for Tweets. Extremely high read volume for Social Graph (following lists).

Database Table Design:

Tweets Table: tweet_id (PK), user_id, content, created_at.

Follows Table: follower_id, followee_id, created_at. Unique index on (follower_id, followee_id).

Technical Selection:

Tweet DB: Sharded PostgreSQL or Cassandra. Cassandra is ideal for high-write throughput and time-series-like data (tweets).

Social Graph DB: PostgreSQL (Relational is better for mapping followers) or a Graph DB like Neo4j (though overkill for MVP).

Distribution Logic:

Shard Tweets by user_id to ensure all tweets from a user are co-located, making "User Timeline" fetches efficient.

Cache

Purpose & Justification: To avoid expensive SQL JOIN operations across millions of rows for every timeline request.

Key-Value Schema:

Key: timeline:{user_id}

Value: List of tweet_id (Redis List or Sorted Set).

TTL: 72 hours (inactive users' caches are evicted).

Failure Handling: If Redis fails, the Timeline Service falls back to "Pull" mode (querying the Tweet DB directly for followed users), albeit with higher latency.

Messaging

Purpose & Decoupling: Decouples tweet ingestion from follower delivery.

Event Schema: tweet_id, author_id, timestamp.

Technical Selection: Kafka. Required for high throughput and durability.

Failure Handling: Dead-letter queues for tweets that fail fan-out (e.g., author has 0 followers or invalid state).

Data Processing

Processing Model: Stream processing (Consumers).

Processing Logic:

Fetch author_id from the queue.

Query Social Graph Service for all follower_ids.

Filter out "Celebrities" (Celebrity timelines are not pushed).

For each follower, insert tweet_id into their Redis timeline:{user_id}.

Technical Selection: Custom Go/Java workers for low-latency processing.

Wrap Up

Advanced Topics

Trade-offs (PACELC): We choose Availability (AP). If the system partitions, it's better to show an old timeline than no timeline at all.

The Celebrity Problem:

Problem: A user with 50M followers (e.g., Elon Musk) would require 50M Redis writes per tweet, causing massive lag.

Optimization: Use a "Hybrid" model. For celebrities, we don't fan out. Instead, when a follower views their timeline, the system fetches the celebrity's recent tweets from the Tweet DB and merges them with the pre-computed Redis results.

Data Sharding: Sharding the Tweet DB by tweet_id using Snowflake IDs (K-ordered) ensures that we maintain chronological order across shards without a global lock.

Storage Optimization: Cold storage (S3) for tweets older than 1 year to keep the active DB size manageable.