DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
The Question
Design

Scalable Microblogging Platform (Twitter)

Design a microblogging service similar to Twitter. The system must support posting short text updates, following other users, and viewing a real-time home timeline. Key challenges include handling high-volume read traffic (600k+ QPS), managing the 'fan-out' process for users with millions of followers, and ensuring sub-200ms latency for timeline retrieval. Address the trade-offs between push and pull models for timeline generation and define a storage strategy that scales to billions of tweets.
Kafka
Redis
Cassandra
PostgreSQL
CDN
S3
Snowflake ID
Microservices
Questions & Insights

Clarifying Questions

What is the expected scale (DAU and traffic)?
Assumption: 300 Million Daily Active Users (DAU), ~6,000 tweets/sec (write), and ~600,000 timeline views/sec (read).
What are the core features for the MVP?
Assumption: Tweeting (text + media), following/unfollowing, and viewing a reverse-chronological home timeline.
How do we handle "Celebrities" (users with millions of followers)?
Assumption: A hybrid fan-out approach is required to prevent "the Justin Bieber effect" from breaking the write path.
What is the consistency vs. availability trade-off?
Assumption: High availability is preferred over strong consistency for the timeline. It is acceptable if a tweet takes a few seconds to appear for all followers.

Thinking Process

To design a scalable microblogging platform, we must solve the Fan-out problem.
How do we deliver a tweet to 100 million followers without crashing? We use an asynchronous fan-out service triggered by a message queue to update pre-computed timelines in a cache.
How do we handle celebrities? We switch from a "Push" (Pre-computed) model to a "Pull" (On-demand) model for users with massive follower counts.
How do we ensure low latency for the Home Timeline? We pre-compute the timeline and store it in an in-memory data store (Redis) so the read path is a simple O(1) or O(N) cache lookup.
How do we store massive amounts of tweet metadata? We use a distributed NoSQL or sharded Relational database optimized for high-write throughput.

Bonus Points

Hybrid Fan-out Strategy: Implementing a "Celebrity Status" flag. For normal users, we push to follower caches. For celebrities, we pull at read-time and merge.
Read-Repair & Lazy Loading: Optimizing the timeline by only loading the top 200 tweets and lazily fetching older content.
Geo-Sharding: Sharding user data based on geographical affinity to reduce cross-region latency.
Media Optimization: Using a "Bit-rate Adaptive" storage strategy and Edge-CDNs for media delivery.
Design Breakdown

Functional Requirements

Core Use Cases:
Users can post tweets (text + media).
Users can follow/unfollow other users.
Users can view a "Home Timeline" (reverse-chronological tweets from people they follow).
Users can view a "User Timeline" (tweets from a specific user).
Scope Control:
In-scope: Tweeting, Following, Home/User Timeline, Media uploads.
Out-of-scope: Direct Messages, Search, Trending Topics, Retweets (for MVP), Analytics.

Non-Functional Requirements

Scale: Support 300M DAU and 100k+ QPS at peak.
Latency: Timeline generation < 200ms.
Availability & Reliability: 99.99% (Highly Available); social media is not mission-critical for strong consistency.
Consistency: Eventual consistency for the timeline.
Fault Tolerance: No single point of failure; graceful degradation if the fan-out service lags.

Estimation

Traffic Estimation:
Writes (Tweets): 6k tweets/sec * 86400s \approx 500M tweets/day.
Reads (Timeline): 300M DAU * 20 visits \approx 6B views/day (\approx 70k average QPS, 140k peak).
Storage Estimation:
Text: 500M tweets * 280 bytes \approx 140 GB/day.
Media: 10% of tweets have media (50M images/day * 1MB) \approx 50 TB/day.
Bandwidth Estimation:
Ingress: 50 TB/day \approx 580 MB/s.
Egress: (Reads are 100x writes) \approx 50 GB/s (handled by CDN).

Blueprint

The architecture centers on a Hybrid Fan-out model. The write path is decoupled via a Message Queue to ensure the user isn't blocked while their followers' timelines are updated. The read path is optimized by serving pre-computed timelines from Redis.
Major Components:
API Gateway: Handles authentication, rate limiting, and request routing.
Tweet Service: Manages tweet creation and persists metadata to the database.
Social Graph Service: Manages follow/unfollow relationships and provides follower lists for fan-out.
Fan-out Workers: Asynchronously updates follower timelines in the cache.
Timeline Service: Aggregates and serves the Home/User timelines.
Redis Cluster: Stores pre-computed Home Timelines for fast retrieval.
Simplicity Audit: This design avoids complex real-time stream processing engines (like Flink) in favor of simple workers and a cache, which is sufficient for a reverse-chronological MVP.
Architecture Decision Rationale:
Why this?: Twitter is extremely read-heavy (100:1 read/write ratio). Pre-computing timelines shifts the heavy lifting from the read path to the write path.
Functional: Meets all core posting and viewing requirements.
Non-functional: Redis ensures sub-100ms latency; Kafka ensures the system can handle write spikes.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing:
CDN: Used for all media (images/videos). Static assets are cached at the edge.
Load Balancer: L7 Load Balancer handles SSL termination and routes traffic based on path (/v1/tweets vs /v1/timeline).
Security & Perimeter:
API Gateway: Validates JWTs, performs rate limiting (e.g., 100 tweets/hour per user), and scrubs PII from logs.

Service

Topology & Scaling:
Stateless microservices deployed across multiple Availability Zones (AZs).
Scaling is based on CPU and Request Count (HPA in Kubernetes).
API Schema Design:
POST /v1/tweets: { text, media_ids }. Protocol: REST. Idempotency Key required in header.
GET /v1/timeline/home: Returns list of Tweet objects. Supporting pagination via since_id.
Resilience & Reliability:
Circuit Breakers: If the Fan-out service is down, the Tweet service continues to save to DB, but timelines might be delayed.

Storage

Access Pattern:
High write volume for Tweets. Extremely high read volume for Social Graph (following lists).
Database Table Design:
Tweets Table: tweet_id (PK), user_id, content, created_at.
Follows Table: follower_id, followee_id, created_at. Unique index on (follower_id, followee_id).
Technical Selection:
Tweet DB: Sharded PostgreSQL or Cassandra. Cassandra is ideal for high-write throughput and time-series-like data (tweets).
Social Graph DB: PostgreSQL (Relational is better for mapping followers) or a Graph DB like Neo4j (though overkill for MVP).
Distribution Logic:
Shard Tweets by user_id to ensure all tweets from a user are co-located, making "User Timeline" fetches efficient.

Cache

Purpose & Justification: To avoid expensive SQL JOIN operations across millions of rows for every timeline request.
Key-Value Schema:
Key: timeline:{user_id}
Value: List of tweet_id (Redis List or Sorted Set).
TTL: 72 hours (inactive users' caches are evicted).
Failure Handling: If Redis fails, the Timeline Service falls back to "Pull" mode (querying the Tweet DB directly for followed users), albeit with higher latency.

Messaging

Purpose & Decoupling: Decouples tweet ingestion from follower delivery.
Event Schema: tweet_id, author_id, timestamp.
Technical Selection: Kafka. Required for high throughput and durability.
Failure Handling: Dead-letter queues for tweets that fail fan-out (e.g., author has 0 followers or invalid state).

Data Processing

Processing Model: Stream processing (Consumers).
Processing Logic:
Fetch author_id from the queue.
Query Social Graph Service for all follower_ids.
Filter out "Celebrities" (Celebrity timelines are not pushed).
For each follower, insert tweet_id into their Redis timeline:{user_id}.
Technical Selection: Custom Go/Java workers for low-latency processing.
Wrap Up

Advanced Topics

Trade-offs (PACELC): We choose Availability (AP). If the system partitions, it's better to show an old timeline than no timeline at all.
The Celebrity Problem:
Problem: A user with 50M followers (e.g., Elon Musk) would require 50M Redis writes per tweet, causing massive lag.
Optimization: Use a "Hybrid" model. For celebrities, we don't fan out. Instead, when a follower views their timeline, the system fetches the celebrity's recent tweets from the Tweet DB and merges them with the pre-computed Redis results.
Data Sharding: Sharding the Tweet DB by tweet_id using Snowflake IDs (K-ordered) ensures that we maintain chronological order across shards without a global lock.
Storage Optimization: Cold storage (S3) for tweets older than 1 year to keep the active DB size manageable.