The Question

Scalable Web Crawler Design

Design a distributed web crawler capable of processing 1 billion pages per month. The system should efficiently discover new URLs, handle duplicate content detection, respect robots.txt and politeness constraints, and store both raw HTML and metadata. Address challenges such as DNS resolution bottlenecks, URL frontier management, and fault tolerance at scale.

Kafka

Redis

Cassandra

Bloom Filter

DNS Cache

NoSQL

Object Storage

Questions & Insights

Clarifying Questions

What is the primary scale and purpose of the crawler? (e.g., a search engine indexer, price aggregator, or data mining tool).

What is the target throughput? (e.g., How many billion pages per month? This determines the distribution strategy).

What types of content are we crawling? (HTML only, or also PDFs, images, and JavaScript-rendered content?).

What is the freshness requirement? (How often should we recrawl a known page?).

How do we handle politeness and `robots.txt`? (Are we crawling a few domains deeply or millions of domains broadly?).

Assumptions for this design:

Scale: 1 billion pages per month (~400 pages/second).

Content: HTML only (ignoring JS rendering for MVP).

Freshness: Monthly recrawl cycle.

Politeness: Strict adherence to robots.txt and domain-level rate limiting.

Storage: Store both raw HTML and extracted metadata.

Thinking Process

To design a production-grade web crawler, we focus on the loop of discovery, fetching, and extraction while managing the "frontier" of URLs.

Core Bottleneck: The URL Frontier must handle massive read/write volumes while maintaining politeness and priority.

Progressive Flow:

How do we fetch pages without overloading the network or the target servers? (Fetcher + DNS Cache).

How do we ensure we don't crawl the same content twice? (URL Deduplication + Bloom Filters).

How do we extract new links and feed them back into the system? (Parser + Link Extraction).

How do we persist data for downstream consumption? (Object Storage + Metadata DB).

Bonus Points

Custom DNS Resolver: Standard OS DNS calls are blocking and slow; a custom asynchronous DNS resolver with aggressive caching is mandatory at scale.

Simhash for Near-Duplicates: Many pages (e.g., mirrors or templated news) are nearly identical. Using Simhash on content helps avoid indexing redundant data.

Checkpointing & State Recovery: Large crawls are prone to failure; implementing fetcher state snapshots allows resuming without restarting from the seed.

Adaptive Politeness: Instead of fixed delays, dynamically adjust crawl rates based on target server latency (back off if the site slows down).

Design Breakdown

Functional Requirements

Core Use Cases:

Discover new URLs from seed lists.

Fetch HTML content from remote servers.

Extract links and store page metadata.

Respect robots.txt and crawl delays.

Scope Control:

In-scope: Distributed crawling, URL deduplication, basic parsing, storage.

Out-of-scope: JavaScript execution (headless browsers), image/video processing, sophisticated ranking/indexing (search engine logic).

Non-Functional Requirements

Scale: Support horizontal scaling of fetchers to handle 400+ RPS.

Latency: Fetchers must maximize parallel I/O; Parser must be non-blocking.

Availability & Reliability: Distributed architecture to prevent single-node failure from halting the crawl.

Consistency: Eventual consistency for URL discovery is acceptable.

Fault Tolerance: Handle "spider traps" (infinite URL loops) and malformed HTML.

Security: Prevent Server-Side Request Forgery (SSRF) by validating IP ranges.

Estimation

Traffic: 1 billion pages / 30 days ≈ 33 million pages/day ≈ 385 QPS average. Peak might be 1,000 QPS.

Storage:

Raw HTML: ~100 KB per page. 1B pages * 100 KB = 100 TB per month.

Metadata: ~1 KB per page. 1B pages * 1 KB = 1 TB per month.

Bandwidth: 400 QPS * 100 KB = 40 MB/s incoming traffic (well within a 1Gbps link).

Blueprint

Concise Summary: A distributed, asynchronous system centered around a URL Frontier (Kafka) that feeds a pool of Fetcher Workers, which in turn feed an Extraction Pipeline.

Major Components:

URL Frontier: A persistent message queue (Kafka) managing URL prioritization and politeness.

Fetcher Service: A cluster of stateless workers that perform the actual HTTP requests.

DNS Cache: A local high-speed cache (Redis) to minimize DNS lookups.

Content Parser: A processing layer that extracts links and generates content hashes.

Storage Layer: A combination of S3 for raw files and Cassandra for URL metadata.

Simplicity Audit: This design avoids complex distributed locking by using Kafka's partitioning for politeness and Bloom filters for deduplication, keeping components decoupled.

Architecture Decision Rationale:

Why this architecture?: Decoupling the Frontier from the Fetcher allows us to scale fetching independently of the URL management logic.

Functional Satisfaction: Covers the full cycle from discovery to storage while honoring site owner constraints.

Non-functional Satisfaction: Scalable via worker pools, fault-tolerant via Kafka's persistence, and low-latency via DNS caching.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: Not applicable for fetching; however, the Fetchers should be distributed across multiple regions to reduce latency to target websites.

Security:

SSRF Protection: Fetchers must check resolved IPs against a blacklist of internal ranges (10.x.x.x, 192.168.x.x) to prevent internal network scanning.

DNS Configuration:

Custom DNS Resolver: The fetcher uses a local DNS cache (Redis) and an async resolver to avoid blocking on standard getaddrinfo calls.

Service

Topology & Scaling:

Fetcher Workers: Stateless containers deployed in Kubernetes. Scaling is triggered by Kafka lag (URL Frontier depth).

Isolation: Workers are partitioned by domain hash to ensure that a single worker handles all requests for a domain, simplifying politeness management.

API Schema Design:

Since this is an internal pipeline, it uses a Message-driven interface.

Message Payload: { "url": "string", "depth": int, "priority": int, "checksum": "string" }.

Resilience & Reliability:

Timeouts: 5-10 second timeouts per fetch to prevent "slow loris" sites from hanging workers.

Retries: Max 3 retries with exponential backoff for 5xx errors; 4xx errors (except 429) are discarded.

Storage

Access Pattern: Write-heavy (every fetched page). Read-heavy for URL deduplication check.

Database Table Design (Cassandra):

url_metadata:

url_hash (Primary Key - Partition)

original_url (Text)

last_crawled_at (Timestamp)

content_hash (Text)

status_code (Int)

Technical Selection:

S3: Ideal for large-scale, low-cost storage of immutable raw HTML blobs.

Cassandra: Handles high-volume writes and wide-column queries needed for metadata at scale.

Distribution Logic: Partition by URL hash to avoid hot partitions for specific domains.

Cache

Purpose & Justification:

DNS Cache: Reduces latency for repeated visits to the same domain.

URL Deduplication (Bloom Filter): Prevents re-fetching the same URL.

Key-Value Schema:

DNS: domain_name -> {ip_list, ttl}.

Bloom Filter: Bitmask of seen URL hashes.

Technical Selection: Redis (cluster mode) for its speed and native Bloom filter support (via RedisBloom).

Messaging

Purpose & Decoupling: Kafka acts as the URL Frontier, decoupling discovery from fetching.

Event / Topic Schema:

crawl_queue: Partitioned by domain name to enforce politeness (one consumer per partition).

Failure Handling: Dead-letter queue (DLQ) for URLs that consistently fail or trigger security alerts.

Technical Selection: Kafka for high throughput and the ability to replay URL streams if the parser logic changes.

Data Processing

Processing Model: Streaming (per-page processing).

Processing Logic (Parser Worker):

Link Extraction: Regex or DOM parsing to find <a href="..."> tags.

Normalization: Converting relative URLs to absolute URLs.

Content Hashing: Generating a checksum of the HTML body to detect if content has changed since the last crawl.

Technical Selection: Custom Go or Python workers. Go is preferred for its high-concurrency model (goroutines) during extraction.

Wrap Up

Advanced Topics

Trade-offs (PACELC): We favor Availability and Partition Tolerance (AP). If the Bloom filter has a false positive, we miss one page (acceptable); if the system goes down, the crawl stops (unacceptable).

Reliability & Failure Handling:

Spider Traps: The Link Extractor limits the URL depth and path length to avoid infinite calendar/filter loops.

Robots.txt Cache: Fetchers cache robots.txt for 24 hours to avoid redundant hits on the host's root.

Bottleneck Analysis:

The Frontier: Centralized Kafka can become a bottleneck. We solve this by partitioning by Domain Hash.

IP Blocking: Target sites may block our IP. Optimization: Use a proxy pool or distribute workers across multiple cloud regions/providers.

Security & Privacy:

User-agent must clearly identify the bot and provide a link to a contact page (Politeness).

Compliance with GDPR/CCPA for personal data found on pages is managed at the indexing layer (downstream).