The Question

Scalable Distributed Web Crawler Design

Design a distributed web crawler capable of traversing and indexing the global web. The system must manage 10 billion+ documents monthly, ensuring strict adherence to politeness protocols (robots.txt), efficient URL deduplication, and high fault tolerance against spider traps and varied server behaviors. Discuss how you would handle URL prioritization and the trade-offs between storage costs and crawl freshness.

Kafka

Cassandra

Redis

Bloom Filter

SimHash

DNS Caching

Questions & Insights

Clarifying Questions

What is the scale of the "entire web"? (Assumption: We aim to crawl 10 billion pages with a refresh cycle of 30 days.)

What type of content are we crawling? (Assumption: HTML only for the MVP. We will ignore images, videos, and complex JavaScript rendering/SPAs for now.)

What are the freshness requirements? (Assumption: Recrawl frequency varies based on site importance; top-tier sites daily, others monthly.)

Are there legal or ethical constraints? (Assumption: Strict adherence to robots.txt and domain-level rate limiting to ensure politeness.)

What is the output of the system? (Assumption: Stored raw HTML in an object store and extracted metadata/links in a database for indexing by downstream services.)

Thinking Process

Core Bottleneck: The primary challenge is not fetching, but managing the "URL Frontier" (prioritization, politeness, and deduplication) at a trillion-scale without overloading external servers or internal databases.

Progressive Logic:

How do we store and prioritize billions of URLs to fetch next? (The URL Frontier).

How do we ensure we don't crawl the same page twice or get stuck in "spider traps"? (Deduplication & DNS Caching).

How do we maintain "politeness" so we don't DOS small websites? (Domain-based queueing).

How do we store petabytes of data efficiently? (Object Storage + Wide-column Metadata store).

Bonus Points

Bloom Filters/Quotient Filters: Use high-performance probabilistic data structures for URL deduplication to reduce disk I/O by 99% for known URLs.

Checkpointing & State Management: Implementing a "distributed snapshot" of the URL Frontier to allow for seamless recovery after a cluster failure.

DNS Prefetching & Custom Resolver: Bypassing standard OS DNS resolution to handle millions of queries per second and avoid DNS bottlenecks.

Etiquette-First Scheduling: Using a two-level priority queue (Front Queues for priority, Back Queues for politeness/IP-affinity).

Design Breakdown

Functional Requirements

Core Use Cases:

Discover and fetch HTML content from the web.

Parse HTML to extract new links and metadata.

Store crawled content for later processing.

Adhere to robots.txt and rate-limiting policies.

Scope Control:

In-scope: Distributed fetching, URL deduplication, politeness, basic HTML parsing.

Out-of-scope: JavaScript execution (headless browsers), image/video processing, full-text search indexing, CAPTCHA solving.

Non-Functional Requirements

Scale: Handle 10B+ URLs per month (~4,000 requests per second).

Latency: Not a primary concern for the crawler itself, but internal DNS and dedupe checks must be sub-millisecond.

Availability & Reliability: Highly distributed; if a worker fails, the URL must be re-queued.

Consistency: Eventual consistency is acceptable for metadata; strict ordering is not required.

Fault Tolerance: Graceful handling of malformed HTML, slow servers (timeouts), and "spider traps."

Estimation

Traffic Estimation:

10 Billion pages / 30 days

\approx

3,850 Pages Per Second (PPS).

Peak PPS: ~8,000.

Storage Estimation:

Avg page size: 100 KB.

10B * 100KB = 1 Petabyte (PB) per month.

Metadata (URL + headers + hash): ~500 bytes per page = 5 TB per month.

Bandwidth Estimation:

4,000 PPS * 100 KB = 400 MB/s (Inbound).

Blueprint

Concise Summary: A distributed worker-based system coordinated by a centralized URL Frontier. Workers fetch URLs, parse them for new links, and store results in an object store while updating a metadata database.

Major Components:

URL Frontier: The brain that manages URL prioritization and politeness queues.

Fetcher Workers: Stateless containers that perform DNS resolution and HTTP fetching.

Deduplication Engine: Uses Bloom filters and a key-value store to ensure URLs aren't crawled twice.

Storage Layer: S3 for raw content and Cassandra for URL metadata and status.

Simplicity Audit: This design avoids complex stream processing or headless browsers, focusing on high-throughput HTTP requests and disk-efficient deduplication.

Architecture Decision Rationale:

Why this architecture?: A decoupled architecture allows the Fetcher to scale independently of the Parser or Frontier.

Functional Satisfaction: Covers discovery, fetching, and storage.

Non-functional Satisfaction: Uses distributed databases (Cassandra) and object stores (S3) to handle petabyte-scale storage and high write throughput.

High Level Architecture

Sub-system Deep Dive

Service

Fetcher Workers (Go/Rust):

High-concurrency, non-blocking I/O workers.

Each worker fetches a batch of URLs from the Message Bus (Kafka).

Implements a strict timeout (e.g., 5 seconds) to avoid hanging on slow servers.

URL Frontier:

Front Queues: Sort URLs by priority (e.g., PageRank or domain authority).

Back Queues: Map URLs to queues based on domain hash. A "Back Queue Router" ensures only one worker fetches from a specific domain queue at a time to maintain politeness.

API Schema:

Internal GetNextURL(worker_id) -> returns (URL, priority, robots_metadata).

Internal SubmitResult(url, html_content, headers).

Resilience:

Exponential backoff for 5xx errors.

Dead-letter queues for URLs that consistently fail.

Storage

Access Pattern:

Write-heavy for raw content.

Read/Write for URL metadata (checking status).

Metadata DB (Cassandra):

Table: crawl_metadata

Fields: url_hash (PK), url_string, last_crawled_at, etag, status_code, content_hash.

Rationale: Cassandra handles high-volume writes and is partitionable by url_hash.

Raw Content (S3/GCS):

Store HTML as compressed files. Path structure: s3://crawler-bucket/YYYY/MM/DD/hash.html.gz.

Distribution Logic:

Consistent hashing on the URL to determine which metadata node handles the record.

Cache

DNS Cache:

Standard DNS is too slow. Workers maintain a local, high-TTL cache of IP addresses for frequently visited domains.

URL Deduplicator (Bloom Filter + Redis):

Purpose: Prevent redundant crawling.

Logic:

Check Bloom Filter (In-memory, very fast).

If hit, double-check Redis/DB (Avoid false positives).

If miss, it's definitely a new URL.

Messaging

Message Bus (Kafka):

Acts as the buffer between the Frontier and Fetchers.

Topics: urls_to_fetch, parsed_links, crawl_results.

Rationale: Decouples the speed of the URL discovery from the speed of the fetchers.

Data Processing

HTML Parser:

Extracts <a href="..."> tags.

Normalizes URLs (converts relative to absolute, removes fragments like #section).

Filters out unwanted file types (PDFs, ZIPs) and known spider traps (infinite calendar loops).

Technical Selection: Custom Go-based parser using net/html for speed.

Infrastructure (Optional)

Monitoring:

Metrics: Number of active workers, 4xx/5xx error rates, throughput (PPS), Queue depth (backlog).

Health Checks: If a worker node's network bandwidth saturates, trigger auto-scaling.

Wrap Up

Advanced Topics

Trade-offs:

BFS vs DFS: We use BFS (Breadth-First Search) combined with PageRank priority. Pure DFS risks getting stuck in a single site's "trap."

Consistency: We favor availability. It's okay if we crawl a page twice in extreme edge cases due to Bloom filter false positives or eventual consistency in the DB.

Reliability:

Spider Traps: Implement path depth limits and repetitive pattern detection in URLs to avoid infinite loops.

Security:

Sandbox Parsing: Parsing malformed HTML from the open web is risky. Parsers should run in restricted containers to prevent exploit execution.

Distinguishing Insights:

Content Hash Deduplication: Sometimes different URLs lead to the same content (mirrors). We compute a SimHash of the content to identify and skip duplicates even if the URLs are unique.