The Question

Scalable Web Crawler Design

Design a distributed web crawler capable of indexing 1 billion unique pages per month. The system must efficiently handle URL discovery, ensure domain-level politeness (respecting robots.txt), and implement robust deduplication strategies for both URLs and page content. Address how the system handles the massive scale of storage, network-bound fetching bottlenecks, and the prevention of infinite loops or spider traps.

Kafka

Redis

Cassandra

Bloom Filter

gRPC

SimHash

NoSQL

Object Storage

Questions & Insights

Clarifying Questions

What is the primary goal of the crawler? (Assumed: A general-purpose search engine crawler to discover and index HTML content.)

What is the expected scale of the system? (Assumed: 1 Billion unique URLs per month.)

What are the "politeness" requirements? (Assumed: Must respect robots.txt and ensure we do not overwhelm any single domain with requests.)

How do we handle content updates? (Assumed: Content is recrawled based on a priority score; for MVP, we will focus on the initial discovery and basic recrawl logic.)

What types of content should we support? (Assumed: HTML only for the MVP; extensible for PDFs/images later.)

Thinking Process

The Core Loop: How do we move from a seed URL to a stored document while discovering new links?

The Frontier Bottleneck: How do we manage billions of URLs in a queue while maintaining domain-level politeness?

Duplicate Prevention: How do we avoid crawling the same URL or storing the same content twice?

Distributed Coordination: How do we scale fetchers horizontally without them stepping on each other's toes?

Bonus Points

SimHash/MinHash: Using fingerprinting to detect "near-duplicate" content (e.g., same article on different URLs with minor template changes).

DNS Caching: Implementing a custom DNS resolver or local cache to avoid the bottleneck of synchronous DNS lookups during the fetch phase.

Priority-Based Crawling: Implementing a "Freshness" score using PageRank or change-frequency heuristics to prioritize the URL Frontier.

Checkpointing: State-saving in the URL Frontier to allow for seamless recovery from system-wide restarts without losing the crawl progress.

Design Breakdown

Functional Requirements

Core Use Cases:

Seeds: Accept a list of starting URLs.

Fetching: Download HTML content via HTTP/HTTPS.

Extraction: Parse HTML to extract text content and new outbound links.

Deduplication: Ensure URLs are only crawled once per cycle.

Storage: Save raw HTML and extracted metadata.

Scope Control:

In-Scope: URL Frontier management, politeness, basic HTML parsing, and storage.

Out-of-Scope: JavaScript rendering (SPA crawling), PageRank calculation, and Image/Video processing.

Non-Functional Requirements

Scale: Support 1 Billion URLs/month (~400 URLs/second average).

Latency: Fetching is inherently high-latency (network-bound); system must be highly concurrent/asynchronous.

Availability & Reliability: Distributed architecture to prevent a single node failure from stopping the crawl.

Consistency: Eventual consistency for URL discovery is acceptable.

Fault Tolerance: Handle "spider traps" (infinite URL loops) and malformed HTML gracefully.

Security & Privacy: Respecting robots.txt and no-follow tags.

Estimation

Traffic Estimation:

1 Billion pages / 30 days ≈ 385 pages/sec.

Peak QPS: ~1,000 pages/sec.

Storage Estimation:

Average page size (compressed): 100 KB.

1 Billion * 100 KB = 100 TB per month.

Metadata (URL, Timestamp, Hash): 500 bytes * 1 Billion = 500 GB per month.

Bandwidth Estimation:

1,000 pages/sec * 100 KB = 100 MB/s (800 Mbps) sustained outbound bandwidth.

Blueprint

Concise Summary: A distributed master-worker architecture where a URL Frontier (Kafka-based) distributes tasks to Fetcher Services. Workers extract links and content, storing data in Object Storage and feeding new links back into the loop via a Deduplication Filter.

Major Components:

URL Frontier (Messaging): A persistent, partitioned message queue that manages the crawl order and politeness.

Fetcher Service (Service): Scalable workers that perform the heavy lifting of network I/O and HTML downloading.

Content Extractor (Data Processing): Parses HTML, extracts links, and generates content hashes for deduplication.

Storage Layer: A combination of Blob storage for raw HTML and NoSQL for URL metadata.

Cache Layer: High-speed lookup for robots.txt rules and URL "seen" status.

Simplicity Audit: This design avoids complex distributed graph databases by using a stream-based approach (Kafka) and a simple Bloom filter for URL deduplication, which is sufficient for MVP scale.

Architecture Decision Rationale:

Why this architecture?: It decouples the fetching (network-bound) from processing (CPU-bound) and storage (I/O-bound).

Functional Requirement Satisfaction: Handles discovery, fetching, and storage through a feedback loop.

Non-functional Requirement Satisfaction: Horizontal scalability is achieved by adding more fetcher pods; Kafka provides the necessary durability and load leveling.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling:

Fetcher Service: Stateless k8s deployment. Scaled based on CPU and Kafka consumer lag.

Link Filter: Part of the extractor logic; uses a distributed Bloom filter to verify if a URL is new.

API Schema Design:

Internal Fetcher API:

fetch(url, priority): gRPC-based internal call (though primarily driven by Kafka).

Idempotency: Based on URL hashing.

Resilience & Reliability:

Retry Policy: 3 retries with exponential backoff for 5xx errors; 0 retries for 4xx errors.

Timeout: Strict 10-second timeout per fetch to prevent hanging on slow servers.

Security:

Outbound traffic restricted to ports 80/443.

User-Agent identification (e.g., MyMVPCrawler/1.0).

Storage

Access Pattern:

High write throughput for raw content.

Metadata lookups by URL hash.

Database Table Design:

URL Metadata (Cassandra):

url_hash (Primary Key - Partition Key)

original_url (Text)

last_crawled_at (Timestamp)

etag (String for conditional GET)

status (Enum: Pending, Success, Failed)

Technical Selection:

Object Storage (S3/GCS): Best for unstructured raw HTML content due to cost-effectiveness at PB scale.

Cassandra: Wide-column store provides the necessary write-heavy performance and linear scalability for URL metadata.

Distribution Logic: Partition by url_hash to ensure uniform data distribution across nodes.

Cache

Purpose & Justification:

Politeness Cache: Store robots.txt rules and "last-crawl" timestamps per domain to avoid DDOSing sites.

URL Seen Cache: A bitset or Bloom filter to quickly reject already crawled URLs before hitting the DB.

Key-Value Schema:

domain:last_access: TTL set to the politeness delay (e.g., 1 second).

domain:robots_txt: Cached for 24 hours.

Technical Selection: Redis. High throughput and support for atomic operations like SET NX for locking/delaying domain access.

Messaging

Purpose & Decoupling: Kafka acts as the URL Frontier. It decouples the link discovery from the fetching speed.

Event / Topic Schema:

Topic: crawling_tasks.

Key: domain_name (Crucial: ensure all URLs from one domain go to the same partition to manage politeness sequentially).

Payload: {url: string, depth: int, priority: int}.

Failure Handling: Use a Dead Letter Queue (DLQ) for URLs that consistently fail or time out.

Data Processing

Processing Model: Stream processing.

Processing DAG:

Input: Raw HTML stream from Fetcher.

Extractor:

1. Normalize HTML.

2. Extract <a href="..."> tags.

3. Calculate Content Hash (Check for duplicates).

4. Extract Text/Metadata.

Output: Push unique links to Link Filter; push content to S3.

Technical Selection: Custom Python/Go Workers. For MVP, specialized frameworks like Flink are overkill; simple stateless workers consuming from Kafka are more cost-effective.

Wrap Up

Advanced Topics

Trade-offs:

Consistency vs. Availability: We prioritize Availability. If a URL is crawled twice due to a race condition in the Bloom filter, it is acceptable.

Reliability & Failure Handling:

Spider Traps: We implement a max_depth and max_url_length constraint in the Extractor to prevent infinite loops (e.g., calendars, dynamically generated paths).

Bottleneck Analysis:

DNS Resolution: Standard getaddrinfo is synchronous. Optimization: Use an asynchronous DNS library (e.g., c-ares) to handle thousands of concurrent lookups.

Security & Privacy:

Respect rel="nofollow" and noindex meta tags.

Implement a central "Blacklist" for sensitive or malicious domains.

Distinguishing Insights:

IP Rotation: To avoid being blocked, the Fetcher Service should run across multiple subnets or use a proxy rotation service.

Adaptive Crawling: Adjusting the crawl frequency of a domain based on how often its content changes (detected via ETag or Content Hash history).