DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
The Question
Design

Global URL Shortener Design

Design a globally distributed URL shortening service similar to Bitly. The system must support 10 billion persistent mappings, handle 1 million read requests per second with sub-20ms latency, and 10,000 write requests per second. Focus specifically on a collision-free ID generation strategy that works across multiple data centers and how to maintain high availability and performance during regional outages.
Cassandra
Redis
Zookeeper
Kafka
Geo-DNS
NoSQL
Base62 Encoding
Bloom Filter
Questions & Insights

Clarifying Questions

Scale & Traffic: What is the expected scale for total stored URLs and the peak Read/Write QPS?
Latency Targets: What are the P99 latency requirements for both shortening a URL and performing a redirect?
Customization: Do we need to support custom "vanilla" aliases (e.g., bit.ly/my-link), or only auto-generated IDs?
Retention: Should URLs expire after a certain period, and if so, what is the default TTL?
Geographic Distribution: Is the traffic globally distributed, and do we need to ensure local low-latency redirects (e.g., a user in Asia redirected by a server in Asia)?
Assumptions for MVP:
Total Storage: 10 Billion URLs.
Read/Write Ratio: 100:1 (Read-heavy).
Read QPS: 1 Million redirects per second.
Write QPS: 10,000 shortening requests per second.
ID Generation: Base62 (a-z, A-Z, 0-9). 7 characters provide ~3.5 trillion combinations (62^7), sufficient for 10B records.
Availability: 99.99% (High availability is critical).

Thinking Process

Core Bottleneck: High read throughput (1M RPS) and ensuring globally unique, collision-free ID generation without a central bottleneck.
Progressive Strategy:
Unique ID Generation: How do we generate 7-character IDs at scale without collisions or high-latency locking? (Solution: Distributed Range Allocator).
Low Latency Reads: How do we handle 1M RPS globally? (Solution: Multi-region caching and Geo-routing).
Data Persistence: How do we store 10B records while maintaining performance? (Solution: Partitioned NoSQL/Wide-column store).
Availability: How do we handle regional failures? (Solution: Global Server Load Balancing (GSLB) and multi-region replication).

Bonus Points

Bloom Filters: Implement at the caching layer to prevent "Cache Penetration" for non-existent short URLs, saving DB lookups.
Range-Based ID Allocation: Using a distributed coordinator (e.g., Zookeeper) to hand out batches of 100,000 IDs to local application servers to minimize cross-region latency for ID generation.
Read-Repair & Anti-Entropy: For multi-region NoSQL, using read-repair to ensure eventual consistency across regions for the redirect mappings.
Edge-Redirects: Utilizing Cloudflare Workers or Lambda@Edge to perform redirects at the CDN POP, reducing latency to near-zero.
Design Breakdown

Functional Requirements

Core Use Cases:
Shorten: User provides a long URL and receives a short, unique URL.
Redirect: User clicks a short URL and is redirected (302 Found) to the original long URL.
Custom Alias: (Optional for MVP) User provides a custom string for the short URL.
Scope Control:
In-Scope: Shortening, Redirecting, Analytics (basic hit counts), ID generation.
Out-of-Scope: User accounts/auth, sophisticated marketing dashboards, link editing/deletion.

Non-Functional Requirements

Scale: Support 1M+ Read QPS and 10k Write QPS.
Latency: < 20ms for redirects (from cache), < 200ms for shortening.
Availability & Reliability: 99.99% uptime; no single point of failure.
Consistency: Eventual consistency is acceptable for redirects; Strong consistency for ID range allocation.
Fault Tolerance: Regional failover via Geo-DNS.

Estimation

Traffic Estimation:
Write: 10k/sec.
Read: 1M/sec.
Storage Estimation:
10B URLs (500 bytes per record: Short URL + Long URL + Metadata) = 5 TB**.
Bandwidth Estimation:
Write: 10k * 500B = 5 MB/s.
Read: 1M * 500B = 500 MB/s (High egress, requires CDN/Global Load Balancer).

Blueprint

Concise Summary: A geo-distributed service using a distributed range-based ID generator to ensure unique short IDs, backed by a NoSQL store for persistence and a global Redis layer for high-speed redirects.
Major Components:
Range Manager (Zookeeper): Coordinates blocks of unique IDs to prevent collisions across distributed workers.
URL Service: Handles shortening and redirect logic; fetches and caches ranges of IDs locally.
Distributed NoSQL (Cassandra/DynamoDB): Stores the short_id -> long_url mapping across multiple regions.
Cache Layer (Redis): Stores frequently accessed mappings to handle the 1M RPS load.
Simplicity Audit: This design avoids complex hashing/collision-retry logic by using pre-allocated ranges, ensuring writes are always O(1).
Architecture Decision Rationale:
ID Generation: Range-based allocation is chosen over Snowflake IDs to keep the URL length strictly to 7 characters without complex time-based bits.
Database: NoSQL (Cassandra) is used for its linear scalability and multi-region master-less replication.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery: Use a Global Server Load Balancer (GSLB) to route users to the nearest regional data center based on IP.
Security: Implement Rate Limiting at the API Gateway level (e.g., 100 requests/minute per IP for shortening) to prevent scrapers from exhausting the ID space.
SSL/TLS termination at the Load Balancer to reduce the cryptographic load on the URL Service.

Service

Topology & Scaling: Stateless URL Service deployed in multiple regions (US-East, EU-West, AP-South). Horizontal scaling triggered by CPU/Request Count.
ID Generation Logic:
Each instance requests a range (e.g., 1 to 100,000) from Zookeeper.
Once the local range is 80% exhausted, the instance requests the next available range.
Base62 encoding converts the numeric ID to a 7-character string.
API Schema Design:
POST /v1/shorten: { "long_url": "...", "ttl": 3600 } -> Returns { "short_url": "..." }.
GET /{short_id}: Performs 302 redirect.
Resilience: Circuit breakers on DB/Cache calls; if Cache is down, fallback to DB.

Storage

Access Pattern: Heavy read-by-key (short_id). Write-heavy during peak shortening hours.
Database Table Design:
short_id (Primary Key, String)
long_url (String)
created_at (Timestamp)
expiry_at (Timestamp, TTL Index)
Technical Selection: Cassandra.
Rationale: Excellent for high-volume writes and scales horizontally by adding nodes. Multi-region replication is native.
Distribution Logic: Partitioned by short_id hash. This ensures even distribution across the cluster.

Cache

Purpose: Absorb 1M+ RPS. Redirects are static, making them perfect for caching.
Key-Value Schema: Key: short_id, Value: long_url. TTL: 24 hours (LRU eviction).
Technical Selection: Redis Cluster.
Failure Handling: Use a "Write-through" or "Cache-aside" strategy. If a key is missing, fetch from DB and populate cache.

Messaging

Purpose: Asynchronous processing of analytics. We don't want to block the redirect while logging click data.
Event Schema: short_id, timestamp, geo_location, referrer.
Technical Selection: Kafka. High throughput for high-volume clickstream data.
Wrap Up

Advanced Topics

Consistency vs. Availability (CAP): We prioritize Availability and Partition Tolerance (AP). If a link is shortened in the US and someone in Europe clicks it immediately, they might get a 404 for a few milliseconds until the record replicates. This is acceptable.
ID Generation Alternative: We could use MD5/SHA hashes of the URL, but collisions require a "check-and-retry" logic which is slower than range-based allocation.
Optimization (Custom Aliases): For custom aliases, we store them in a separate table or a specific partition. Since custom aliases are user-provided, we must check for collisions in the DB before committing.
Storage Optimization: We can use a compact binary format for storing long URLs to save space, but with 5TB, a standard NoSQL approach is manageable and more flexible for the MVP.