The Question
Design

Scalable Web Crawler Design

Design a distributed system capable of crawling billions of web pages. The system must efficiently discover new URLs, handle deduplication, store massive amounts of raw content, and strictly adhere to web politeness protocols and rate limits.
Kafka
Redis Bloom Filter
S3
Distributed Workers
DNS Cache
Questions & Insights

Clarifying Questions

What is the target scale and throughput? (e.g., 1 billion pages per month, ~400 pages/sec).
What is the primary goal of the crawl? (e.g., full-text indexing for search, metadata extraction, or link discovery).
What are the constraints regarding "Politeness"? (e.g., respecting robots.txt, per-domain rate limiting, and delay between requests).
How do we handle content freshness? (e.g., is this a one-time crawl, or do we need to re-crawl updated pages periodically?).
What types of content are we crawling? (e.g., HTML only, or do we need to execute JavaScript via headless browsers?).
Assumptions for MVP:
Scale: 1 billion URLs/month.
Content: Static HTML (no JS execution for MVP).
Freshness: Periodic (weekly/monthly) recrawling.
Politeness: Strict adherence to robots.txt and domain-level rate limiting.

Thinking Process

The core challenge of a web crawler is managing a massive, ever-growing list of URLs while ensuring we don't overwhelm target servers or get blocked.
URL Frontier Management: How do we store and prioritize billions of URLs while ensuring politeness?
The Fetch-Parse Cycle: How do we efficiently download and extract new links without duplicating work?
Deduplication: How do we ensure we don't crawl the same URL twice or store identical content from different URLs?
Resilience: How does the system handle DNS timeouts, 404s, and "spider traps"?

Bonus Points

Bloom Filters for URL Seen Set: Using a scalable Bloom Filter or Cuckoo Filter to minimize memory usage when checking if a URL has been visited.
DNS Caching: Implementing a local DNS cache to avoid the bottleneck of synchronous DNS resolution for every fetch.
Check-pointing: Storing the state of the URL Frontier frequently so the crawl can resume from the last known state after a crash.
Domain-based Partitioning: Ensuring all URLs for a specific domain are handled by a specific queue to simplify rate-limiting logic.
Design Breakdown

Functional Requirements

URL Discovery: Extract links from crawled pages and add them to the queue.
Content Storage: Store the raw HTML or extracted text for downstream processing.
Deduplication: Avoid crawling the same URL multiple times.
Politeness: Respect robots.txt and implement per-host rate limiting.

Non-Functional Requirements

Scalability: Must handle billions of URLs using a distributed worker architecture.
Robustness: Handle malformed HTML, infinite loops (spider traps), and server timeouts.
Extensibility: Easily add new parsers for different file types (PDF, Images) in the future.

Estimation

Monthly Volume: 1 Billion pages.
QPS: 1,000,000,000 / (30 \text{ days} \times 86,400 \text{ sec}) \approx 385 \text{ pages/sec}.
Storage: 1 Billion pages \times 100 KB/page = 100 TB per month.
Bandwidth: 400 pages/sec \times 100 KB = 40 MB/s (easily handled by a few high-bandwidth nodes).

Blueprint

Concise Summary: A distributed worker-based system that uses a centralized messaging queue (Frontier) to manage URLs and a NoSQL store for content.
Major Components:
URL Management Service: Handles URL normalization, filtering (e.g., file extensions), and deduplication.
URL Frontier (Message Queue): Orchestrates the order of crawling and manages politeness delays.
Fetcher Workers: Distributed nodes that perform DNS resolution, fetching, and parsing.
Object Storage: High-capacity storage for the raw page content.
Simplicity Audit: This design avoids complex PageRank-based scheduling in favor of a simpler FIFO/Priority queue structure. It leverages managed services for storage and messaging to reduce operational overhead.
Architecture Decision Rationale:
Why this architecture is the best for this problem?: Decoupling the "Discovery" (URL Frontier) from "Execution" (Fetchers) allows independent scaling.
Functional Requirement Satisfaction: The Frontier ensures politeness, while the Seen-set (Redis) ensures no duplicate crawls.
Non-functional Requirement Satisfaction: Using a distributed message queue allows the system to scale horizontally as the URL backlog grows.

High Level Architecture

Sub-system Deep Dive

Service

Fetcher Workers:
Topology: Stateless containers deployed in multiple availability zones.
Logic: Each worker pulls a URL from the Frontier, checks the Robots.txt Cache, fetches the content via HTTP/HTTPS, parses the HTML for new links, and pushes those links back to the URL Management Service.
API: Fetchers communicate with the Frontier via standard Pub/Sub protocols (e.g., Kafka Consumer API).

Storage

Data Model:
Metadata DB (Optional for MVP): Store URL, status (200, 404), crawl timestamp, and content hash.
Object Storage (S3): Key is the hash of the URL; Value is the compressed HTML.
Database Logic: S3 provides virtually infinite scaling for raw content. Metadata is stored in a NoSQL DB (e.g., MongoDB) to support fast lookups of crawl history.

Cache

Seen-Set (Redis): Uses a Bloom Filter to store hashes of visited URLs. This allows for an O(1) check with minimal memory.
Robots.txt Cache: Stores the parsed rules of robots.txt for each domain with a TTL of 24 hours to avoid redundant fetches of the same rules.

Messaging

URL Frontier (Kafka):
Structure: Topics are partitioned by domain hash. This ensures that all URLs for "example.com" go to the same partition, allowing a single consumer to manage rate-limiting (politeness) for that domain effectively.
Guarantees: At-least-once delivery.

Data Processing

Parser (Integrated in Fetcher):
Logic: Uses a streaming HTML parser (like BeautifulSoup or Go's net/html) to extract <a href="..."> tags.
Normalization: Converts relative URLs to absolute URLs and removes fragments (e.g., index.html#top -> index.html).
Wrap Up

Advanced Topics

Monitoring:
Metrics: Queue depth (Frontier lag), HTTP 4xx/5xx rates, Fetcher CPU/Memory, and Disk I/O.
Tools: Prometheus/Grafana for real-time metrics.
Trade-offs:
Consistency vs. Availability: We choose Availability. If the Seen-set (Redis) is slightly out of sync, the worst case is we crawl a page twice (acceptable).
Bottlenecks: DNS resolution is often the bottleneck. We mitigate this with a local DNS cache on fetcher nodes.
Failure Handling:
Exponential Backoff: For 5xx errors from target servers.
Dead Letter Queue (DLQ): For URLs that consistently fail after N retries.
Alternatives & Optimization:
Headless Browsers: For an advanced version, use Playwright/Puppeteer for JS-heavy sites, but it increases resource consumption by 10x.