The Question
Design

Scalable Web Crawler Design

Design a distributed system capable of crawling and indexing billions of web pages. The system must efficiently manage URL discovery, ensure politeness to host servers, handle content deduplication, and store massive amounts of unstructured data while maintaining high horizontal scalability.
Kafka
Redis Bloom Filter
S3
Cassandra
Distributed Workers
Questions & Insights

Clarifying Questions

Scale and Throughput: What is the target scale for the MVP?
Assumption: 1 billion pages per month, requiring a crawl rate of ~400 pages per second.
Content Type & Rendering: Do we need to execute JavaScript (SPA support) or just parse static HTML?
Assumption: MVP focuses on static HTML to minimize compute cost and complexity.
Freshness: How often should a page be re-crawled?
Assumption: Fixed schedule (e.g., weekly) for simplicity, with a prioritization queue for high-authority domains.
Storage Constraints: Are we storing the full HTML or just extracted metadata/text?
Assumption: Store compressed raw HTML for future processing and a metadata index for search.
Politeness: What is the policy for robots.txt and crawl frequency per host?
Assumption: Strict adherence to robots.txt and a minimum 1-second delay between requests to the same IP/domain.

Thinking Process

The URL Frontier: How do we manage billions of URLs, ensuring we don't visit the same page twice while maintaining domain-level politeness? (Solution: Use a distributed message queue partitioned by Hostname and a URL Filter/Deduplicator).
The Fetcher Efficiency: How do we handle high latency and timeouts from millions of external servers? (Solution: Asynchronous I/O with high concurrency and a DNS cache).
Data Integrity & Storage: Where do we put petabytes of raw HTML and the massive URL adjacency graph? (Solution: Object Storage for HTML and a wide-column NoSQL database for the URL index).
End-to-End Orchestration: How do we transition from a "Seed URL" to a "Parsed Link" back into the Frontier? (Solution: A decoupled pipeline using Messaging to bridge the Fetcher and the Parser).

Bonus Points

Crawl Smartness: Implement a "Check-sum" or "SimHash" comparison to detect "Near-Duplicates" (content that is 95% identical but has different URLs).
DNS Optimization: Implement a custom, high-performance local DNS resolver to avoid bottlenecking on standard system calls when resolving millions of domains.
Checkpointing: Use distributed snapshots of the URL Frontier state to allow the system to resume quickly after a cluster-wide failure without re-crawling the entire web.
Cost-Effective Storage: Use S3 Intelligent-Tiering or cold storage for old crawl data while keeping the URL metadata in a high-throughput store like Bigtable.
Design Breakdown

Functional Requirements

Fetch: Download HTML from a given URL.
Parse: Extract text and new outgoing links.
Deduplicate: Ensure a URL is only crawled once per cycle.
Store: Save raw content and metadata.
Politeness: Respect robots.txt and avoid overloading host servers.

Non-Functional Requirements

Scalability: Must be horizontally scalable to thousands of nodes.
Robustness: Handle malformed HTML, infinite loops (spider traps), and server timeouts gracefully.
Extensibility: Easily add new parsers (e.g., for PDF or Images) later.
Performance: High throughput with low overhead per URL.

Estimation

Monthly Volume: 1B pages.
Average Page Size: 100 KB (HTML + Metadata).
Storage: 1B * 100 KB = 100 TB / month.
QPS: 1,000,000,000 / (30 days * 86,400s) ≈ 385 URLs/second.
Bandwidth: 385 * 100 KB ≈ 38.5 MB/s (easily handled by a small cluster).
URL Index: 1B URLs * 128 bytes (URL + Hash) ≈ 128 GB in RAM/Cache for deduplication.

Blueprint

Concise Summary: A distributed worker-based architecture where a central URL Frontier (Kafka) distributes tasks to Fetchers, which store results in Object Storage (S3) and send links to a Parser for further URL discovery.
Major Components:
URL Frontier (Kafka): A persistent, partitioned message queue that manages the lifecycle of URLs to be crawled.
Fetcher Service: Highly concurrent workers that download content and enforce politeness rules.
Content Parser (Processing): Extracts links and metadata from downloaded HTML.
URL Deduplicator: A service using a Bloom filter and Database to prevent redundant crawling.
Metadata Store (NoSQL): Stores the state of every known URL (Crawled, Pending, Failed).
Simplicity Audit: This is the simplest architecture because it uses standard messaging patterns (Pub/Sub) to decouple components, avoiding complex distributed locking or state synchronization.
Architecture Decision Rationale:
Why this architecture is the best for this problem?: It scales horizontally by adding more Fetcher/Parser nodes and uses Kafka to handle backpressure.
Functional Requirement Satisfaction: Meets all fetch, parse, and deduplication needs through specialized microservices.
Non-functional Requirement Satisfaction: High availability is provided by the distributed nature of Kafka and NoSQL; scalability is linear with the number of worker nodes.

High Level Architecture

Sub-system Deep Dive

Service

Fetcher Service:
Topology: Deployed as a fleet of stateless containers (K8s). Uses async/await to handle thousands of concurrent HTTP connections per node.
Politeness: Before fetching, the worker checks Redis for the "Last Crawl Time" of the specific host. If it's too recent, the URL is re-queued with a delay.
API Spec:
Internal only. Fetchers consume from Kafka crawling-tasks topic.
Parsers consume from raw-html internal stream or read directly from S3 triggers.

Storage

URL Metadata DB (Cassandra):
Schema: url_hash (PK), original_url, status (PENDING/FETCHED), last_crawl_time, content_hash.
Partitioning: Partitioned by url_hash to ensure even distribution across the cluster.
Content Store (S3):
HTML files are stored as objects. Key format: s3://bucket/YYYY-MM-DD/{url_hash}.html.

Cache

Politeness Cache (Redis): Stores keys as host:{hostname} with a value of the timestamp of the last request. TTL is set to the politeness interval.
URL Seen Filter (Redis): Uses a Bloom Filter to provide a fast "probably not seen" check. If the filter says "not seen," it's definitely new. If it says "seen," we double-check Cassandra.

Messaging

URL Frontier (Kafka):
Topics: crawl_queue (partitioned by Hostname to ensure all URLs for one domain go to the same consumer group for politeness management).
Guarantee: At-least-once delivery. Deduplication at the ingestion point handles potential duplicates.

Data Processing

Content Parser:
Mechanism: A stream processing job that receives HTML blobs.
Logic:
Parse DOM to extract href links.
Normalize URLs (convert relative to absolute, remove fragments).
Extract Page Title and Snippet for the Metadata DB.
Output: Pushes extracted URLs to the Deduplicator.
Wrap Up

Advanced Topics

Monitoring:
Prometheus/Grafana: Monitor "4xx/5xx" error rates per domain.
Kafka Lag: Critical metric to ensure Fetchers are keeping up with the discovery rate.
Trade-offs:
Consistency vs. Freshness: We sacrifice strict real-time consistency (knowing exactly when a page changed) for massive throughput and eventual discovery.
Bottlenecks:
DNS Resolution: Standard getaddrinfo is blocking. Optimization: Use a non-blocking library like c-ares.
Failure Handling:
Dead Letter Queues (DLQ): URLs that fail consistently (e.g., 404s) are moved to a DLQ to prevent blocking the main pipeline.
Alternatives & Optimization:
Alternative: Instead of Cassandra, one could use HBase if already in a Hadoop ecosystem.
Optimization: Use Zstandard compression for HTML storage in S3 to reduce storage costs by 70-80%.