The Question
Design

Distributed Web Crawler Design

Design a highly scalable, distributed system capable of crawling and indexing a significant portion of the web. The system must efficiently manage URL discovery, prioritize content fetching, and strictly adhere to website-specific politeness policies while handling petabytes of data.
Redis
Bloom Filter
S3
PostgreSQL
Distributed Workers
Questions & Insights

Clarifying Questions

What is the target scale and throughput? (e.g., How many URLs should be crawled per month?)
What data do we need to store? (Just the HTML content, or do we need to process media, extract metadata, and index keywords?)
What are the "politeness" requirements? (How many requests per second per domain are allowed, and do we need to respect robots.txt?)
How do we handle updates/freshness? (Is this a one-time crawl, or do we need to re-crawl pages periodically?)
Assumptions for MVP:
Scale: 1 billion pages per month (~400 URLs/sec).
Storage: Store raw HTML snapshots and URL metadata.
Politeness: Strict adherence to robots.txt and domain-level rate limiting.
Freshness: Monthly re-crawl cycle.
Deduplication: URL-based deduplication is sufficient for the MVP; content-based (checksum) can follow.

Thinking Process

The core challenge of a web crawler is not just fetching data, but managing the "Frontier" (the list of URLs to visit) while maintaining politeness and avoiding infinite loops (spider traps).
The URL Frontier: How do we efficiently store, prioritize, and distribute billions of URLs across workers while ensuring we don't hit the same domain too hard?
Politeness & Distribution: How do we map URLs to specific workers so that a single worker handles a specific domain, making rate-limiting trivial to implement locally?
Deduplication: How do we avoid crawling the same URL multiple times without a massive database lookup for every discovery?
Fault Tolerance: If a worker node dies mid-crawl, how do we ensure the URLs it was processing are not lost?

Bonus Points

Crawl Smartness: Implementing an "Importance Score" (based on PageRank or change frequency) to prioritize high-value pages over obscure ones.
DNS Optimization: Implementing a local DNS cache or custom DNS resolver to avoid the bottleneck of synchronous DNS lookups which can stall high-throughput fetchers.
Checkpointing: Using a distributed state management system (like Zookeeper) to coordinate large-batch crawl "states" to allow resuming from failures.
Browser Rendering: Utilizing headless browsers (e.g., Playwright/Puppeteer) for a subset of "High Value" JavaScript-heavy sites, while using standard HTTP clients for the rest to save costs.
Design Breakdown

Functional Requirements

Crawl the web starting from a set of seed URLs.
Extract all outgoing links from crawled HTML.
Store the raw HTML content in a persistent object store.
Respect robots.txt and domain-level rate limits.
Prevent duplicate URL crawling.

Non-Functional Requirements

Scalability: Must be horizontally scalable to handle billions of URLs.
Robustness: Must handle "spider traps," malformed HTML, and server timeouts.
Politeness: Must not overwhelm target web servers.
Extensibility: Must be easy to add new processors (e.g., image extractors) later.

Estimation

Throughput: 1B pages / 30 days ≈ 385 pages/sec.
Storage: 1B pages * 100KB/page (avg) = 100TB per month.
Metadata: 1B URLs * 500 bytes = 500GB metadata storage.
Bandwidth: 400 pages/sec * 100KB ≈ 40 MB/s (320 Mbps) sustained.

Blueprint

Concise Summary: A distributed worker-based architecture where a centralized "URL Frontier" manages the queue, and "Fetcher Workers" handle the I/O, using a Bloom Filter for fast URL deduplication.
Major Components:
URL Frontier (Redis): Acts as the distributed priority queue and storage for the URL crawl state.
Fetcher Service: Multi-process workers that perform DNS resolution, politeness checks, and HTML downloading.
HTML Storage (S3): Highly scalable object storage for raw page content.
URL Database (PostgreSQL): Stores the metadata and status of crawled URLs.
Duplicate Filter (Bloom Filter): A memory-efficient data structure to check if a URL has already been discovered.
Simplicity Audit: This architecture avoids complex stream processing frameworks (like Flink) in favor of a robust task queue and worker pattern. It uses a Bloom Filter to minimize expensive DB lookups for deduplication, which is the most common bottleneck in crawlers.
Architecture Decision Rationale:
Scalability: Adding fetchers increases throughput linearly.
Functional: Separation of "URL Management" (Frontier) from "Fetching" (Workers) allows for independent scaling of logic and I/O.
Non-functional: Using Redis for the Frontier provides the low-latency required for 400+ ops/sec while ensuring atomic operations for task leasing.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling: The Fetcher Service is a fleet of stateless containers (Docker/K8s). Scaling is based on the length of the pending_urls queue in Redis.
API Schema Design:
Internal Task Lease: GET /frontier/next -> Returns {url, depth, priority}.
Internal Task Complete: POST /frontier/complete -> Submits {url, status_code, timestamp}.
Storage Flow: Fetchers stream data directly to S3 via Multipart Upload to minimize memory footprint.

Storage

Database Table Design:
URLs Table: url_hash (PK, ByteA), url_string (Text), last_crawled_at (Timestamp), status (Enum: Pending, Success, Failed), etag (String).
Technical Selection:
PostgreSQL: Used for metadata because of its reliability and support for indexing url_hash.
Amazon S3: Used for HTML content for its virtually infinite scaling and low cost.
Database Logic: Sharding the URLs table by hash(domain) ensures that all URLs for a specific site are co-located, helping with bulk updates and reporting.

Cache

Key-Value Schema:
Queue Key: queue:{domain_hash} -> List of URLs (Ensures domain-level isolation).
Politeness Key: wait:{domain_hash} -> TTL-based lock to enforce delay between requests to the same domain.
DNS Cache: dns:{hostname} -> IP Address (TTL 24h).
Technical Selection: Redis is chosen for its native data structures (Lists, Sets) and speed, which are essential for the high-frequency operations of the Frontier.

Data Processing

Processing DAG:
Fetch: Download HTML.
Parse: Clean HTML and extract <a> tags.
Normalize: Convert relative URLs to absolute and lowercase.
Filter: Pass through Bloom Filter.
Enqueue: Push new URLs back to Redis.
Technical Selection: Manual Service (Go/Python Workers). Using lightweight concurrency (Goroutines) allows one worker to manage thousands of concurrent I/O-bound requests efficiently.
Wrap Up

Advanced Topics

Monitoring:
Metrics: 4xx/5xx error rates, URLs/sec, Redis memory usage, Bloom Filter false positive rate.
Trade-offs:
Consistency vs. Scalability: We accept "At-least-once" delivery. A URL might be crawled twice if a worker fails, which is preferred over missing it.
Bottlenecks:
Redis Memory: As the frontier grows, Redis RAM might become expensive.
Failure Handling:
Dead Letter Queue: URLs that fail 3+ times are moved to a separate table for manual inspection to avoid wasting resources on broken sites.
Alternatives & Optimization:
SSTables: For much larger scales, switching the URL Frontier from Redis to an SSTable-based system (like RocksDB) on local NVMe drives can significantly reduce costs.