The Question

Distributed Web Crawler

Design a high-performance, distributed web crawler capable of processing billions of pages monthly. Your solution must address the complexities of URL discovery, domain-level politeness (rate limiting), deduplication of trillions of URLs, and efficient storage of petabytes of HTML content. Explain how you would handle DNS resolution bottlenecks, spider traps, and the architectural trade-offs between crawl freshness and politeness compliance.

Kafka

Redis

NoSQL

Bloom Filter

Cassandra

Zstandard

DNS Resolver

Questions & Insights

Clarifying Questions

Scale and Throughput: What is the target volume of pages to crawl (e.g., 1 billion pages per month)? What is the expected QPS for the fetchers?

Content Type: Are we crawling only static HTML, or do we need to execute JavaScript (SPA support) using headless browsers?

Politeness and Compliance: Do we need to strictly follow robots.txt and implement per-domain rate limiting to avoid DDoS-ing target servers?

Freshness: Is this a one-time crawl, or do we need a re-crawling strategy to keep the data fresh?

Data Usage: Are we storing the full HTML for indexing, or just extracting specific metadata/links?

Assumptions for MVP:

Scale: 1 billion URLs per month (approx. 400 QPS).

Content: Static HTML only (no JS rendering for MVP).

Politeness: Strict robots.txt compliance and domain-level throttling.

Storage: Store raw HTML in object storage and URL metadata in a NoSQL database.

Thinking Process

The URL Frontier: How do we manage the prioritization and distribution of billions of URLs while ensuring we don't hit the same domain too hard?

The "Seen" Problem: How do we efficiently check if a URL has already been crawled without performing a disk-seek for every discovery?

DNS Optimization: How do we prevent DNS resolution from becoming the bottleneck for millions of external requests?

Fault Tolerance: How do we handle worker crashes, network timeouts, and "tarpit" websites that never close connections?

Bonus Points

Bloom Filters & Counting Filters: Use of Probabilistic data structures to minimize memory usage for the "URL Seen" component.

Checkpointing & State Recovery: Implementing a mechanism to resume a massive crawl from the last known state in case of cluster-wide failure.

SimHash for Near-Duplicate Detection: Going beyond URL deduplication to detect pages with nearly identical content (e.g., same page with different tracking IDs).

Hybrid DNS Strategy: Implementing a local DNS cache and a dedicated DNS resolver fleet to bypass OS-level blocking and reduce latency.

Design Breakdown

Functional Requirements

Core Use Cases:

Seed URL ingestion to start the crawl.

Efficiently fetch HTML content from diverse web servers.

Parse HTML to extract links for further crawling.

Store extracted content and metadata for downstream processing.

Scope Control:

In-scope: Distributed fetching, politeness logic, URL deduplication, and basic parsing.

Out-of-scope: JavaScript execution (Puppeteer/Playwright), Search Indexing (Elasticsearch integration), and sophisticated NLP parsing.

Non-Functional Requirements

Scale: Must scale horizontally to handle billions of URLs.

Latency: High throughput is prioritized over low per-request latency.

Availability & Reliability: The system must be resilient to worker failures and network partitions.

Consistency: Eventual consistency is acceptable for URL discovery and metadata storage.

Fault Tolerance: Robust handling of malformed HTML, infinite loops (spider traps), and slow servers.

Security & Privacy: Respecting robots.txt and handling SSL/TLS certificates securely.

Estimation

Traffic Estimation:

1 Billion pages / month

\approx

385 pages per second (avg).

Peak QPS:

\approx

800 - 1,000 pages per second.

Storage Estimation:

Avg page size: 100 KB (compressed).

1 Billion pages * 100 KB = 100 TB per month.

URL Metadata: 1 Billion * 500 bytes

\approx

500 GB.

Bandwidth Estimation:

400 QPS * 100 KB = 40 MB/s (Inbound).

This is well within the limits of modern cloud networking.

Blueprint

This architecture utilizes a distributed URL Frontier to manage the lifecycle of URLs. It separates discovery from fetching to ensure scalability.

Major Components:

URL Frontier: Manages URL prioritization and ensures domain-level politeness.

Worker Fleet: Stateless nodes that fetch content and parse links.

URL Seen Service: A fast, cache-backed service to prevent redundant crawling.

Object Store: Long-term persistent storage for raw HTML content.

Message Queue: Decouples URL discovery from the fetching process.

Simplicity Audit: The design avoids complex distributed locks by using a partitioned message queue for politeness and a probabilistic filter for deduplication.

Architecture Decision Rationale:

Why this architecture?: Separating parsing and fetching allows us to scale the "heavy" network I/O (fetching) independently from the "heavy" CPU task (parsing/link extraction).

Functional Requirement Satisfaction: The Frontier ensures seeds are processed, and the Worker Fleet handles the scale of the web.

Non-functional Requirement Satisfaction: Using SQS/Kafka provides fault tolerance; if a worker dies, the message remains in the queue.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling

Crawler Workers: Stateless containers (Docker/K8s) deployed across multiple Availability Zones. Scaling is based on the URL Queue depth.

Parser Service: Decoupled CPU-intensive service to extract links and metadata.

API Schema Design

Internal Seed API: POST /v1/seeds (REST) - Body: { "urls": [...], "priority": 1 }.

Idempotency: URLs are canonicalized (lowercase, trailing slash removal) to ensure same-URL requests are idempotent.

Resilience & Reliability

Timeouts: Aggressive fetching timeouts (e.g., 10s) to prevent workers from being stuck on slow servers.

Retries: Exponential backoff for 5xx errors; 4xx errors are generally ignored or logged.

Security

Worker Identity: Workers use a specific User-Agent string (e.g., "MyBot/1.0") providing a link to a "Why am I crawling you?" page.

Storage

Access Pattern

URL Metadata: High write/read (Status updates).

Blob Store: Write-heavy (HTML content).

Database Table Design

URL Metadata (NoSQL - e.g., Cassandra/DynamoDB):

url_hash (Partition Key): SHA-256 of URL.

url_string: Original URL.

status: [PENDING, FETCHED, FAILED].

last_crawled_at: Timestamp.

etag: For conditional GETs in future crawls.

Technical Selection

Object Storage (S3/GCS): Best for storing 100TB+ of HTML content due to cost and durability.

NoSQL (DynamoDB): Scales horizontally to billions of rows; TTL support for old metadata.

Distribution Logic

Sharding based on url_hash to avoid hot partitions on specific domains.

Cache

Purpose & Justification: "URL Seen" check is the most frequent operation. Querying the main DB for every discovered link is too slow.

Key-Value Schema:

Structure: Redis Bloom Filter.

Naming: seen:urls.

Consistency: Eventual. It's okay to occasionally re-crawl a page (false negative), but we want to minimize it.

Failure Handling: If Redis fails, workers can fall back to the URL Metadata DB (slower) or skip the check temporarily.

Messaging

Purpose & Decoupling: The URL Queue acts as a buffer between discovery and fetching, allowing for load leveling.

Throughput & Partitioning:

Partitioning is done by Domain Hash. This is critical for politeness. One partition = One consumer group = Easy to implement "one request per second" logic for a single domain.

Technical Selection: Kafka. High throughput and the ability to replay messages are vital for large-scale crawls.

Data Processing

Processing Model: Streaming (using the Parser Service).

Processing DAG:

Fetch HTML -> 2. Store to S3 -> 3. Extract Links -> 4. Check "URL Seen" -> 5. Enqueue New URLs to Frontier.

Correctness Guarantees: At-least-once delivery via Kafka.

Technical Selection: A fleet of Go-based parsers for high-concurrency link extraction.

Infrastructure (Optional)

Observability:

Metrics: Track pages_per_second, 4xx_error_rate, 5xx_error_rate, and queue_lag.

Alerting: Alert if the 5xx_error_rate spikes, indicating we might be blocked or hitting a "crawler trap."

Wrap Up

Advanced Topics

Politeness vs. Freshness: For MVP, we prioritize politeness. Using a 1-second delay per domain means we crawl 86k pages/domain/day maximum.

Storage Optimization: HTML content is highly compressible. Using Zstandard (zstd) can reduce storage costs by 70-80%.

DNS Bottleneck: Standard getaddrinfo is blocking. Workers should use a custom asynchronous DNS resolver with a local cache (TTL-respecting) to avoid overloading public DNS providers.

Spider Traps: Implement safeguards like maximum URL length, maximum folder depth, and detection of dynamically generated calendars to avoid infinite crawling loops.

Staff-level Distinction: To handle 10x scale, one would implement a Mercator-style Frontier which uses a set of "Per-Domain Queues" and "Priority Queues" to manage politeness and importance mathematically.