The Question

Distributed Web Crawler

Design a globally distributed web crawler capable of processing and storing 15 billion URLs. The system must prioritize page freshness, strictly adhere to domain-level politeness (robots.txt and rate limits), and handle petabytes of HTML content. Address specific challenges such as URL deduplication at scale, avoiding spider traps, and optimizing DNS resolution for high-throughput fetching. Discuss the architectural trade-offs between crawl speed and server politeness.

Kafka

Cassandra

Redis

Bloom Filter

Flink

DNS Caching

gRPC

Questions & Insights

Clarifying Questions

Scale and Scope: What is the target scale (e.g., 1 billion pages or 100 billion pages)? Are we crawling the entire internet or a specific set of domains?

Freshness: How often do pages need to be recrawled? Is there a priority for frequently updated sites (e.g., news)?

Content Type: Are we crawling only HTML, or do we need to process PDFs, images, and videos?

Politeness: What are the constraints for politeness (e.g., robots.txt compliance, rate limiting per domain)?

Data Sink: Where is the crawled data being stored, and what is the downstream consumer (e.g., a search engine indexer or a data lake)?

Assumptions for this design:

Scale: 15 billion URLs total, with a target of 1 billion pages crawled per month.

Politeness: Strict adherence to robots.txt and domain-level rate limiting is required.

Storage: Store raw HTML content and metadata (URL, timestamp, checksum).

Architecture: Distributed, stateless fetchers with a centralized (but partitioned) URL Frontier.

Thinking Process

Core Bottleneck: The URL Frontier is the primary bottleneck. It must manage trillions of URLs, prioritize them, and ensure we don't overwhelm any single host (Politeness).

How do we ensure we don't crawl the same page twice? (URL Deduplication).

How do we handle the sheer volume of data? (Distributed Fetching and Scalable Storage).

How do we manage host-level politeness in a distributed environment? (Mapping hosts to specific worker queues).

How do we detect and avoid "Spider Traps"? (Depth limiting and checksum-based content deduplication).

Bonus Points

DNS Resolver Caching: At scale, DNS lookups become a bottleneck. Implementing a custom, high-performance local DNS cache/resolver avoids hitting external DNS servers for every request.

Checksum-based Content Deduplication: Beyond URL deduplication, we use SimHash or MinHash to detect "near-duplicate" content to avoid storing and processing redundant data.

Dynamic Politeness: Implementing an adaptive rate-limiter that adjusts crawl frequency based on the target server's response latency (increase delay if latency spikes).

Checkpointing & Fault Tolerance: Use a persistent offset-based system for the Frontier to ensure that if a fetcher fails, the URL can be retried without losing its position in the crawl cycle.

Design Breakdown

Functional Requirements

Core Use Cases:

Discover new URLs from crawled pages.

Fetch and store HTML content from URLs.

Adhere to robots.txt and politeness rules.

Prioritize "important" URLs over others.

Scope Control:

In-scope: Distributed crawling, URL Frontier management, politeness, and basic content extraction.

Out-of-scope: Complex search indexing (PageRank calculation), JavaScript rendering (Headless Chrome), and image/video processing.

Non-Functional Requirements

Scale: Must handle 1,000+ fetches per second (approx. 100M+ per day).

Latency: Fetching is inherently slow (IO-bound); the system must maximize throughput via concurrency.

Availability & Reliability: Distributed architecture to prevent a single node failure from stopping the crawl.

Consistency: Eventual consistency for URL discovery; strong consistency for politeness state (to avoid DOSing sites).

Fault Tolerance: Retries with exponential backoff for transient network errors.

Estimation

Traffic Estimation:

1 Billion pages/month

\approx

400 pages/second (average).

Peak QPS: 1,200 pages/second.

Storage Estimation:

Average page size: 100 KB.

1 Billion pages

\times

100 KB = 100 TB/month.

Metadata (URL, hash, timestamp): ~500 bytes per URL. 15 Billion URLs

\approx

7.5 TB.

Bandwidth Estimation:

1,200 pages/sec

\times

100 KB/page

\approx

120 MB/s (960 Mbps).

Blueprint

Concise Summary: A distributed system where a URL Frontier manages a prioritized queue of URLs, Fetchers retrieve content respecting politeness, and Parsers extract new links to feed back into the cycle.

Major Components:

URL Frontier: The brain that schedules URLs, manages priority, and ensures domain-level politeness.

Fetcher Service: Stateless workers that perform HTTP requests and handle DNS resolution.

Content/URL Deduplicator: Uses Bloom Filters and Hash-based storage to avoid redundant crawls.

Storage (S3/NoSQL): Persists raw HTML and crawl metadata.

Simplicity Audit: This architecture uses a standard producer-consumer model with a message queue, which is the simplest way to decouple scheduling (Frontier) from execution (Fetcher).

Architecture Decision Rationale:

Why this architecture?: The decouple between the Frontier and Fetchers allows us to scale the number of workers independently of the scheduling logic.

Functional Satisfaction: Covers the full lifecycle: Seed -> Fetch -> Parse -> Discover.

Non-functional Satisfaction: Politeness is handled by the Frontier's queue mapping, while horizontal scaling is achieved by adding more Fetcher pods.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling

Fetcher Fleet: Stateless workers deployed in Kubernetes. Scaled based on CPU and Network I/O.

Isolation: Fetchers are partitioned by IP address range to avoid getting blacklisted globally.

API Schema Design

While primarily internal, the Frontier exposes a gRPC interface:

GetNextURLs(worker_id, batch_size) -> List<URL>

SubmitDiscoveredURLs(parent_url, list<new_urls>)

Resilience & Reliability

Retry Policy: 3 retries with exponential backoff (1s, 10s, 60s).

Circuit Breaker: If a specific domain returns 5xx errors repeatedly, the Frontier "freezes" that host queue for 1 hour.

Observability

Metrics: Successful vs. Failed fetches per domain, 403/404 rate, queue depth.

Tracing: Trace a URL from "discovered" to "stored."

Storage

Access Pattern

Metadata DB: High write (new URLs), high read (scheduling).

Object Store: Write-heavy (raw HTML).

Database Table Design

URL Metadata Table:

url_hash (PK, SHA-256)

url_string (Text)

priority (Int)

last_crawl_time (Timestamp)

etag/checksum (String)

Technical Selection

Metadata: Cassandra (Wide-column). Highly scalable for trillion-row datasets with predictable query patterns on url_hash.

Content: S3 / GCS. Cost-effective for petabyte-scale unstructured data.

Distribution Logic

Shard Cassandra by host_hash to ensure all URLs for a single domain are co-located, helping with politeness logic.

Cache

Purpose & Justification: URL Seen Cache prevents re-processing billions of already visited URLs.

Key-Value Schema:

Key: url_hash

Structure: Redis Bloom Filter or a bitset.

Technical Selection: Redis with the Bloom module. It provides memory-efficient membership testing.

Failure Handling: If Redis fails, the system falls back to the Metadata DB (slower but accurate).

Messaging

Purpose & Decoupling: Host Queues act as buffers. We ensure only one Fetcher thread hits one host at a time.

Throughput & Partitioning:

Use Kafka with partitions mapped to host hash ranges.

Technical Selection: Kafka. Essential for its high throughput and "at-least-once" delivery guarantees.

Data Processing

Processing Model: Stream processing for link extraction.

Processing DAG: Fetcher -> Raw HTML -> Parser -> Link Extraction -> Normalization (relative to absolute URLs) -> Deduplication -> Frontier.

Technical Selection: Flink. Handles the URL normalization and deduplication stream in real-time with stateful operators.

Wrap Up

Advanced Topics

Trade-offs (Consistency vs. Availability): We prioritize Availability (AP in CAP). If a URL is crawled twice due to a temporary network partition in the "seen" cache, it is acceptable.

Reliability: To prevent "Spider Traps" (infinite URL loops like calendars), we limit crawl depth per domain and ignore URLs with more than 5 query parameters.

Bottleneck Analysis: DNS resolution is a hidden killer. We solve this by using a local unbound or bind cache on every Fetcher node to reduce external RTT.

Security & Privacy: Ensure the Fetchers do not have access to internal network ranges (SSRF protection) via strict VPC egress rules.

Distinguishing Insights: Politeness is hierarchical. We use a two-level queue: Front-queues for priority (News > Blogs) and Back-queues for politeness (mapped to specific hosts with a next_available_time timestamp).