The Question
DesignScalable URL Shortening Service
Design a global URL shortening service similar to Bitly. The system must support generating shortened aliases for long URLs and perform high-speed redirections. Consider a scale of 100 million new URLs per month and 10 billion redirections. Address the challenges of unique ID generation without collisions, low-latency redirection, and how to handle database growth over a 5-year period while maintaining high availability.
NoSQL
DynamoDB
Cassandra
Redis
ZooKeeper
CDN
Base62
Bloom Filter
API Gateway
Questions & Insights
Clarifying Questions
Scale & Traffic: What is the expected volume of new URLs created per month and the number of redirections (reads)?
Key Constraints: What is the desired length of the shortened URL? Should they be human-readable or random?
Features: Do we need to support custom aliases, link expiration, or private links?
Analytics: Is real-time click-tracking/analytics required for the MVP, or can we stick to basic redirection?
Availability vs. Consistency: In the event of a network partition, do we prioritize being able to create new links or ensuring that a newly created link is immediately available globally?
Assumptions for this design:
Scale: 100M new URLs/month, 10B redirections/month (100:1 read/write ratio).
Storage: 5-year retention, resulting in ~6 billion records.
Latency: Redirection should be < 20ms at the tail (p99).
Availability: 99.99% (highly available redirection).
Thinking Process
Core Bottleneck: Generating unique, non-colliding short IDs at scale without introducing a single point of failure or high latency during the write path.
Efficiency: How do we handle 10B redirects monthly without hammering the database? (Caching strategy).
Scale: How do we store 6B records efficiently while maintaining fast lookups?
Progressive Logic:
How do we generate a unique 7-character string? (Base62 encoding + ID Generation).
How do we prevent ID collisions during concurrent writes? (Key Generation Service/KGS).
How do we optimize the read path (Redirection) for global users? (CDN/Cache).
How do we shard the data to ensure the system scales horizontally?
Bonus Points
Key Generation Service (KGS): Using a dedicated service with pre-generated keys in a "range-based" allocation via ZooKeeper to eliminate database round-trips for ID uniqueness.
Bloom Filters: Implementing Bloom Filters in the cache layer to quickly identify non-existent short URLs, preventing "cache-miss attacks" on the database.
Geo-Proximity Routing: Utilizing Anycast IP or Latency-based DNS to route users to the nearest PoP (Point of Presence) for redirection.
Cold Storage Migration: Strategy for moving expired or "dead" links (no hits in 1 year) to cheaper S3-based cold storage to keep the hot database size manageable.
Design Breakdown
Functional Requirements
Core Use Cases:
Shorten a long URL to a unique short alias (e.g.,
tiny.url/abc123).Redirect a short URL to the original long URL with minimum latency (301/302 Redirect).
Scope Control:
In-scope: Shortening, redirection, basic expiration.
Out-of-scope: Detailed user analytics dashboard, custom UI, link editing, malware scanning of destination URLs.
Non-Functional Requirements
Scale: Must handle 4,000+ RPS (Read) and 40+ RPS (Write) on average, with 10x peaks.
Latency: Redirection (Read) must be extremely fast (<20ms).
Availability: High availability is critical; if the service is down, all links are broken.
Consistency: Eventual consistency is acceptable; it is okay if a newly created link takes 1-2 seconds to be "visible" globally.
Fault Tolerance: No single point of failure in ID generation or storage.
Estimation
Traffic Estimation:
Writes: 100M / (30 days 24h 3600s) ≈ 40 URLs/sec.
Reads: 10B / (30 days 24h 3600s) ≈ 4,000 URLs/sec.
Storage Estimation:
Record size: URL ID (8 bytes) + Long URL (500 bytes) + CreatedAt (4 bytes) + Expiry (4 bytes) ≈ 520 bytes.
Total Storage (5 years): 6B records * 520 bytes ≈ 3.12 TB.
Bandwidth Estimation:
Incoming (Write): 40 * 520 bytes ≈ 21 KB/s.
Outgoing (Read): 4,000 * 520 bytes ≈ 2 MB/s.
Blueprint
Concise Summary: A microservices-based architecture utilizing a Key Generation Service for ID assignment and a distributed NoSQL store for high-performance lookups.
Major Components:
API Gateway: Handles rate limiting, authentication, and request routing.
URL Write Service: Manages the creation of short URLs and interacts with the KGS.
URL Read Service: Handles high-volume redirection lookups.
Key Generation Service (KGS): A stateful service that provides unique IDs to the Write Service using a range-based approach.
Redis Cache: Stores hot URL mappings to minimize database load.
NoSQL DB (DynamoDB/Cassandra): Persistent store for URL mappings, chosen for horizontal scalability.
Simplicity Audit: This design avoids complex distributed locking by using a range-based ID generator (KGS), making it highly performant and easy to scale.
Architecture Decision Rationale:
Why NoSQL?: Our data is simple (Key-Value), and we need to scale to billions of records. NoSQL offers better horizontal scaling than RDBMS for this specific schema.
Functional Requirement Satisfaction: KGS ensures unique IDs; Cache + NoSQL ensures fast redirection.
Non-functional Requirement Satisfaction: Multi-node deployment ensures High Availability; Redis ensures low Latency.
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery & Traffic Routing:
Use a global CDN (Cloudflare/Fastly) to cache the redirection logic at the edge for top-tier "viral" links.
Latency-based DNS routing to direct users to the nearest regional data center.
Security & Perimeter:
API Gateway handles Rate Limiting (e.g., 100 requests/min per IP for creation) to prevent exhaustion attacks.
SSL Termination at the Load Balancer.
Service
Topology & Scaling:
Stateless services (Read/Write) deployed across multiple Availability Zones (Multi-AZ).
Auto-scaling based on CPU and Request Count.
API Schema Design:
POST /v1/shorten: Shortens a URL. Protocol: REST. Returns short_url.GET /{short_id}: Redirects to long URL. Returns 301 (Permanent Redirect).Resilience & Reliability:
Implement retries with exponential backoff for KGS requests.
Circuit breakers on the Database connection to prevent cascading failures.
Storage
Access Pattern: 100:1 Read/Write ratio. Primary key lookup is the dominant pattern.
Database Table Design:
short_id (String, Partition Key)long_url (String)user_id (String, Global Secondary Index)created_at (Timestamp)expiry_at (Timestamp)Technical Selection: DynamoDB or Cassandra.
Rationale: Predictable performance at scale, managed (if DynamoDB), and optimized for key-value lookups.
Distribution Logic:
Hash-based sharding on
short_id ensures even distribution across the cluster.Cache
Purpose & Justification: Reduces database load for viral links.
Key-Value Schema: Key:
short_id, Value: long_url. TTL: 24 hours (sliding window for active links).Failure Handling: If Redis is down, fall back to Database. Use "Cache Aside" pattern.
Infrastructure (Optional)
Distributed Coordination:
ZooKeeper: Used by the KGS to manage ID ranges. Each KGS instance grabs a range of IDs (e.g., 1 million IDs at a time) to avoid constant locking of the database.
Wrap Up
Advanced Topics
Trade-offs: We choose Eventual Consistency for link creation across regions to ensure high availability for the 99.9% of users who are just clicking links.
Key Generation Alternatives:
UUID: Too long (36 chars), bad for "tiny" URLs.
MD5/SHA: Requires collision handling (Check-then-insert), which is slow at scale.
KGS (Range-based): Selected because it is the most efficient for generating short, unique, sequential-looking strings using Base62.
Security:
Short IDs are generated from a random range to prevent users from "guessing" the next URL in the sequence.
Bottleneck Analysis:
The KGS range allocation in ZooKeeper could be a bottleneck if ranges are too small. We will set large ranges (1M) to minimize coordination frequency.