DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
The Question
Design

Scalable Web Crawler Design

Design a distributed web crawler capable of indexing 1 billion unique pages per month. The system must efficiently handle URL discovery, ensure domain-level politeness (respecting robots.txt), and implement robust deduplication strategies for both URLs and page content. Address how the system handles the massive scale of storage, network-bound fetching bottlenecks, and the prevention of infinite loops or spider traps.
Kafka
Redis
Cassandra
S3
Bloom Filter
gRPC
SimHash
NoSQL
Object Storage
Questions & Insights

Clarifying Questions

What is the primary goal of the crawler? (Assumed: A general-purpose search engine crawler to discover and index HTML content.)
What is the expected scale of the system? (Assumed: 1 Billion unique URLs per month.)
What are the "politeness" requirements? (Assumed: Must respect robots.txt and ensure we do not overwhelm any single domain with requests.)
How do we handle content updates? (Assumed: Content is recrawled based on a priority score; for MVP, we will focus on the initial discovery and basic recrawl logic.)
What types of content should we support? (Assumed: HTML only for the MVP; extensible for PDFs/images later.)

Thinking Process

The Core Loop: How do we move from a seed URL to a stored document while discovering new links?
The Frontier Bottleneck: How do we manage billions of URLs in a queue while maintaining domain-level politeness?
Duplicate Prevention: How do we avoid crawling the same URL or storing the same content twice?
Distributed Coordination: How do we scale fetchers horizontally without them stepping on each other's toes?

Bonus Points

SimHash/MinHash: Using fingerprinting to detect "near-duplicate" content (e.g., same article on different URLs with minor template changes).
DNS Caching: Implementing a custom DNS resolver or local cache to avoid the bottleneck of synchronous DNS lookups during the fetch phase.
Priority-Based Crawling: Implementing a "Freshness" score using PageRank or change-frequency heuristics to prioritize the URL Frontier.
Checkpointing: State-saving in the URL Frontier to allow for seamless recovery from system-wide restarts without losing the crawl progress.
Design Breakdown

Functional Requirements

Core Use Cases:
Seeds: Accept a list of starting URLs.
Fetching: Download HTML content via HTTP/HTTPS.
Extraction: Parse HTML to extract text content and new outbound links.
Deduplication: Ensure URLs are only crawled once per cycle.
Storage: Save raw HTML and extracted metadata.
Scope Control:
In-Scope: URL Frontier management, politeness, basic HTML parsing, and storage.
Out-of-Scope: JavaScript rendering (SPA crawling), PageRank calculation, and Image/Video processing.

Non-Functional Requirements

Scale: Support 1 Billion URLs/month (~400 URLs/second average).
Latency: Fetching is inherently high-latency (network-bound); system must be highly concurrent/asynchronous.
Availability & Reliability: Distributed architecture to prevent a single node failure from stopping the crawl.
Consistency: Eventual consistency for URL discovery is acceptable.
Fault Tolerance: Handle "spider traps" (infinite URL loops) and malformed HTML gracefully.
Security & Privacy: Respecting robots.txt and no-follow tags.

Estimation

Traffic Estimation:
1 Billion pages / 30 days ≈ 385 pages/sec.
Peak QPS: ~1,000 pages/sec.
Storage Estimation:
Average page size (compressed): 100 KB.
1 Billion * 100 KB = 100 TB per month.
Metadata (URL, Timestamp, Hash): 500 bytes * 1 Billion = 500 GB per month.
Bandwidth Estimation:
1,000 pages/sec * 100 KB = 100 MB/s (800 Mbps) sustained outbound bandwidth.

Blueprint

Concise Summary: A distributed master-worker architecture where a URL Frontier (Kafka-based) distributes tasks to Fetcher Services. Workers extract links and content, storing data in Object Storage and feeding new links back into the loop via a Deduplication Filter.
Major Components:
URL Frontier (Messaging): A persistent, partitioned message queue that manages the crawl order and politeness.
Fetcher Service (Service): Scalable workers that perform the heavy lifting of network I/O and HTML downloading.
Content Extractor (Data Processing): Parses HTML, extracts links, and generates content hashes for deduplication.
Storage Layer: A combination of Blob storage for raw HTML and NoSQL for URL metadata.
Cache Layer: High-speed lookup for robots.txt rules and URL "seen" status.
Simplicity Audit: This design avoids complex distributed graph databases by using a stream-based approach (Kafka) and a simple Bloom filter for URL deduplication, which is sufficient for MVP scale.
Architecture Decision Rationale:
Why this architecture?: It decouples the fetching (network-bound) from processing (CPU-bound) and storage (I/O-bound).
Functional Requirement Satisfaction: Handles discovery, fetching, and storage through a feedback loop.
Non-functional Requirement Satisfaction: Horizontal scalability is achieved by adding more fetcher pods; Kafka provides the necessary durability and load leveling.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling:
Fetcher Service: Stateless k8s deployment. Scaled based on CPU and Kafka consumer lag.
Link Filter: Part of the extractor logic; uses a distributed Bloom filter to verify if a URL is new.
API Schema Design:
Internal Fetcher API:
fetch(url, priority): gRPC-based internal call (though primarily driven by Kafka).
Idempotency: Based on URL hashing.
Resilience & Reliability:
Retry Policy: 3 retries with exponential backoff for 5xx errors; 0 retries for 4xx errors.
Timeout: Strict 10-second timeout per fetch to prevent hanging on slow servers.
Security:
Outbound traffic restricted to ports 80/443.
User-Agent identification (e.g., MyMVPCrawler/1.0).

Storage

Access Pattern:
High write throughput for raw content.
Metadata lookups by URL hash.
Database Table Design:
URL Metadata (Cassandra):
url_hash (Primary Key - Partition Key)
original_url (Text)
last_crawled_at (Timestamp)
etag (String for conditional GET)
status (Enum: Pending, Success, Failed)
Technical Selection:
Object Storage (S3/GCS): Best for unstructured raw HTML content due to cost-effectiveness at PB scale.
Cassandra: Wide-column store provides the necessary write-heavy performance and linear scalability for URL metadata.
Distribution Logic: Partition by url_hash to ensure uniform data distribution across nodes.

Cache

Purpose & Justification:
Politeness Cache: Store robots.txt rules and "last-crawl" timestamps per domain to avoid DDOSing sites.
URL Seen Cache: A bitset or Bloom filter to quickly reject already crawled URLs before hitting the DB.
Key-Value Schema:
domain:last_access: TTL set to the politeness delay (e.g., 1 second).
domain:robots_txt: Cached for 24 hours.
Technical Selection: Redis. High throughput and support for atomic operations like SET NX for locking/delaying domain access.

Messaging

Purpose & Decoupling: Kafka acts as the URL Frontier. It decouples the link discovery from the fetching speed.
Event / Topic Schema:
Topic: crawling_tasks.
Key: domain_name (Crucial: ensure all URLs from one domain go to the same partition to manage politeness sequentially).
Payload: {url: string, depth: int, priority: int}.
Failure Handling: Use a Dead Letter Queue (DLQ) for URLs that consistently fail or time out.

Data Processing

Processing Model: Stream processing.
Processing DAG:
Input: Raw HTML stream from Fetcher.
Extractor:
1. Normalize HTML.
2. Extract <a href="..."> tags.
3. Calculate Content Hash (Check for duplicates).
4. Extract Text/Metadata.
Output: Push unique links to Link Filter; push content to S3.
Technical Selection: Custom Python/Go Workers. For MVP, specialized frameworks like Flink are overkill; simple stateless workers consuming from Kafka are more cost-effective.
Wrap Up

Advanced Topics

Trade-offs:
Consistency vs. Availability: We prioritize Availability. If a URL is crawled twice due to a race condition in the Bloom filter, it is acceptable.
Reliability & Failure Handling:
Spider Traps: We implement a max_depth and max_url_length constraint in the Extractor to prevent infinite loops (e.g., calendars, dynamically generated paths).
Bottleneck Analysis:
DNS Resolution: Standard getaddrinfo is synchronous. Optimization: Use an asynchronous DNS library (e.g., c-ares) to handle thousands of concurrent lookups.
Security & Privacy:
Respect rel="nofollow" and noindex meta tags.
Implement a central "Blacklist" for sensitive or malicious domains.
Distinguishing Insights:
IP Rotation: To avoid being blocked, the Fetcher Service should run across multiple subnets or use a proxy rotation service.
Adaptive Crawling: Adjusting the crawl frequency of a domain based on how often its content changes (detected via ETag or Content Hash history).