The Question
Design

Scalable Web Crawler Design

Design a distributed web crawler capable of indexing 1 billion unique pages per month. The system must efficiently handle URL discovery, ensure domain-level politeness (respecting robots.txt), and implement robust deduplication strategies for both URLs and page content. Address how the system handles the massive scale of storage, network-bound fetching bottlenecks, and the prevention of infinite loops or spider traps.
Kafka
Redis
Cassandra
S3
Bloom Filter
gRPC
SimHash
NoSQL
Object Storage
Questions & Insights

Clarifying Questions

What is the primary goal of the crawler? (Assumed: A general-purpose search engine crawler to discover and index HTML content.)
What is the expected scale of the system? (Assumed: 1 Billion unique URLs per month.)
What are the "politeness" requirements? (Assumed: Must respect robots.txt and ensure we do not overwhelm any single domain with requests.)
How do we handle content updates? (Assumed: Content is recrawled based on a priority score; for MVP, we will focus on the initial discovery and basic recrawl logic.)
What types of content should we support? (Assumed: HTML only for the MVP; extensible for PDFs/images later.)

Thinking Process

The Core Loop: How do we move from a seed URL to a stored document while discovering new links?
The Frontier Bottleneck: How do we manage billions of URLs in a queue while maintaining domain-level politeness?
Duplicate Prevention: How do we avoid crawling the same URL or storing the same content twice?
Distributed Coordination: How do we scale fetchers horizontally without them stepping on each other's toes?

Bonus Points

SimHash/MinHash: Using fingerprinting to detect "near-duplicate" content (e.g., same article on different URLs with minor template changes).
DNS Caching: Implementing a custom DNS resolver or local cache to avoid the bottleneck of synchronous DNS lookups during the fetch phase.
Priority-Based Crawling: Implementing a "Freshness" score using PageRank or change-frequency heuristics to prioritize the URL Frontier.
Checkpointing: State-saving in the URL Frontier to allow for seamless recovery from system-wide restarts without losing the crawl progress.
Design Breakdown

Functional Requirements

Core Use Cases:
Seeds: Accept a list of starting URLs.
Fetching: Download HTML content via HTTP/HTTPS.
Extraction: Parse HTML to extract text content and new outbound links.
Deduplication: Ensure URLs are only crawled once per cycle.
Storage: Save raw HTML and extracted metadata.
Scope Control:
In-Scope: URL Frontier management, politeness, basic HTML parsing, and storage.
Out-of-Scope: JavaScript rendering (SPA crawling), PageRank calculation, and Image/Video processing.

Non-Functional Requirements

Scale: Support 1 Billion URLs/month (~400 URLs/second average).
Latency: Fetching is inherently high-latency (network-bound); system must be highly concurrent/asynchronous.
Availability & Reliability: Distributed architecture to prevent a single node failure from stopping the crawl.
Consistency: Eventual consistency for URL discovery is acceptable.
Fault Tolerance: Handle "spider traps" (infinite URL loops) and malformed HTML gracefully.
Security & Privacy: Respecting robots.txt and no-follow tags.

Estimation

Traffic Estimation:
1 Billion pages / 30 days ≈ 385 pages/sec.
Peak QPS: ~1,000 pages/sec.
Storage Estimation:
Average page size (compressed): 100 KB.
1 Billion * 100 KB = 100 TB per month.
Metadata (URL, Timestamp, Hash): 500 bytes * 1 Billion = 500 GB per month.
Bandwidth Estimation:
1,000 pages/sec * 100 KB = 100 MB/s (800 Mbps) sustained outbound bandwidth.

Blueprint

Concise Summary: A distributed master-worker architecture where a URL Frontier (Kafka-based) distributes tasks to Fetcher Services. Workers extract links and content, storing data in Object Storage and feeding new links back into the loop via a Deduplication Filter.
Major Components:
URL Frontier (Messaging): A persistent, partitioned message queue that manages the crawl order and politeness.
Fetcher Service (Service): Scalable workers that perform the heavy lifting of network I/O and HTML downloading.
Content Extractor (Data Processing): Parses HTML, extracts links, and generates content hashes for deduplication.
Storage Layer: A combination of Blob storage for raw HTML and NoSQL for URL metadata.
Cache Layer: High-speed lookup for robots.txt rules and URL "seen" status.
Simplicity Audit: This design avoids complex distributed graph databases by using a stream-based approach (Kafka) and a simple Bloom filter for URL deduplication, which is sufficient for MVP scale.
Architecture Decision Rationale:
Why this architecture?: It decouples the fetching (network-bound) from processing (CPU-bound) and storage (I/O-bound).
Functional Requirement Satisfaction: Handles discovery, fetching, and storage through a feedback loop.
Non-functional Requirement Satisfaction: Horizontal scalability is achieved by adding more fetcher pods; Kafka provides the necessary durability and load leveling.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling:
Fetcher Service: Stateless k8s deployment. Scaled based on CPU and Kafka consumer lag.
Link Filter: Part of the extractor logic; uses a distributed Bloom filter to verify if a URL is new.
API Schema Design:
Internal Fetcher API:
fetch(url, priority): gRPC-based internal call (though primarily driven by Kafka).
Idempotency: Based on URL hashing.
Resilience & Reliability:
Retry Policy: 3 retries with exponential backoff for 5xx errors; 0 retries for 4xx errors.
Timeout: Strict 10-second timeout per fetch to prevent hanging on slow servers.
Security:
Outbound traffic restricted to ports 80/443.
User-Agent identification (e.g., MyMVPCrawler/1.0).

Storage

Access Pattern:
High write throughput for raw content.
Metadata lookups by URL hash.
Database Table Design:
URL Metadata (Cassandra):
url_hash (Primary Key - Partition Key)
original_url (Text)
last_crawled_at (Timestamp)
etag (String for conditional GET)
status (Enum: Pending, Success, Failed)
Technical Selection:
Object Storage (S3/GCS): Best for unstructured raw HTML content due to cost-effectiveness at PB scale.
Cassandra: Wide-column store provides the necessary write-heavy performance and linear scalability for URL metadata.
Distribution Logic: Partition by url_hash to ensure uniform data distribution across nodes.

Cache

Purpose & Justification:
Politeness Cache: Store robots.txt rules and "last-crawl" timestamps per domain to avoid DDOSing sites.
URL Seen Cache: A bitset or Bloom filter to quickly reject already crawled URLs before hitting the DB.
Key-Value Schema:
domain:last_access: TTL set to the politeness delay (e.g., 1 second).
domain:robots_txt: Cached for 24 hours.
Technical Selection: Redis. High throughput and support for atomic operations like SET NX for locking/delaying domain access.

Messaging

Purpose & Decoupling: Kafka acts as the URL Frontier. It decouples the link discovery from the fetching speed.
Event / Topic Schema:
Topic: crawling_tasks.
Key: domain_name (Crucial: ensure all URLs from one domain go to the same partition to manage politeness sequentially).
Payload: {url: string, depth: int, priority: int}.
Failure Handling: Use a Dead Letter Queue (DLQ) for URLs that consistently fail or time out.

Data Processing

Processing Model: Stream processing.
Processing DAG:
Input: Raw HTML stream from Fetcher.
Extractor:
1. Normalize HTML.
2. Extract <a href="..."> tags.
3. Calculate Content Hash (Check for duplicates).
4. Extract Text/Metadata.
Output: Push unique links to Link Filter; push content to S3.
Technical Selection: Custom Python/Go Workers. For MVP, specialized frameworks like Flink are overkill; simple stateless workers consuming from Kafka are more cost-effective.
Wrap Up

Advanced Topics

Trade-offs:
Consistency vs. Availability: We prioritize Availability. If a URL is crawled twice due to a race condition in the Bloom filter, it is acceptable.
Reliability & Failure Handling:
Spider Traps: We implement a max_depth and max_url_length constraint in the Extractor to prevent infinite loops (e.g., calendars, dynamically generated paths).
Bottleneck Analysis:
DNS Resolution: Standard getaddrinfo is synchronous. Optimization: Use an asynchronous DNS library (e.g., c-ares) to handle thousands of concurrent lookups.
Security & Privacy:
Respect rel="nofollow" and noindex meta tags.
Implement a central "Blacklist" for sensitive or malicious domains.
Distinguishing Insights:
IP Rotation: To avoid being blocked, the Fetcher Service should run across multiple subnets or use a proxy rotation service.
Adaptive Crawling: Adjusting the crawl frequency of a domain based on how often its content changes (detected via ETag or Content Hash history).