The Question
Design

Scalable Web Crawler Design

Design a web crawler capable of processing 1 billion URLs per month. The system should automatically discover new links, store the page content for future indexing, and strictly adhere to politeness constraints (robots.txt and per-domain rate limits). Detail the architecture for the URL Frontier, deduplication strategy, and how you would handle distributed workers while maintaining scalability and fault tolerance.
Kafka
S3
Redis
NoSQL
Cassandra
Bloom Filter
Kubernetes
DNS Caching
Questions & Insights

Clarifying Questions

Scale and Throughput: What is the target scale? (Assumption: 1 billion URLs per month, roughly 400 pages per second).
Content Type: Are we crawling only static HTML, or do we need to execute JavaScript (SPA)? (Assumption: MVP focuses on static HTML; JS rendering is deferred under YAGNI).
Freshness: How often do we need to recrawl pages? (Assumption: Recrawl period is 30 days; prioritization based on change frequency).
Politeness: Do we need to strictly follow robots.txt and per-domain rate limits? (Assumption: Yes, strict adherence to politeness is a functional requirement).
Storage: Do we store the full content or just metadata and extracted links? (Assumption: Store full compressed HTML in object storage and metadata in a database).

Thinking Process

Core Bottleneck: The primary challenge is the URL Frontier. We must manage billions of URLs while ensuring we don't visit the same URL twice and don't overwhelm a single host (Politeness).
Strategy Steps:
Seed & Frontier: Start with seed URLs, put them in a distributed queue (Frontier) that handles prioritization and deduplication.
Polite Fetching: Fetchers pull URLs from the Frontier using a mapping that ensures only one worker fetches from one domain at a time (or adheres to a specific delay).
Processing Pipeline: Extract content, store it, and extract new links.
Feedback Loop: Push newly discovered links back into the Frontier after a "seen" check.
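
The four steps above can be sketched with a small in-memory frontier. This is only an illustration under stated assumptions: the class name PoliteFrontier, the fixed per-host delay, and the Python set used as the "seen" store are all hypothetical stand-ins for the distributed components described later.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit


class PoliteFrontier:
    """In-memory stand-in for the Frontier (hypothetical, single process).

    One FIFO queue per host plus a heap of (next_allowed_time, host), so a
    host is handed out at most once per `delay` seconds.
    """

    def __init__(self, delay: float = 1.0):
        self.delay = delay
        self.queues = {}    # host -> deque of pending URLs
        self.ready = []     # heap of (next_allowed_time, host)
        self.next_ok = {}   # host -> earliest time of the next fetch
        self.seen = set()   # stand-in for the "URL Seen" store

    def add(self, url: str, now: float = 0.0) -> bool:
        """Feedback loop: enqueue a URL unless it was already seen."""
        if url in self.seen:
            return False
        self.seen.add(url)
        host = urlsplit(url).netloc
        if host not in self.queues:
            self.queues[host] = deque()
            start = max(now, self.next_ok.get(host, 0.0))
            heapq.heappush(self.ready, (start, host))
        self.queues[host].append(url)
        return True

    def next_url(self, now: float):
        """Polite fetching: return a URL whose host may be hit at `now`."""
        if not self.ready or self.ready[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready)
        url = self.queues[host].popleft()
        self.next_ok[host] = now + self.delay
        if self.queues[host]:
            heapq.heappush(self.ready, (self.next_ok[host], host))
        else:
            del self.queues[host]
        return url
```

A fetcher would call next_url in a loop and pass every extracted link back through add, which is where the "seen" check happens.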

Bonus Points

Bloom Filters: Use a multi-stage Bloom filter (or Cuckoo filter) in-memory for the "URL Seen" check to minimize disk I/O for the billions of duplicate links encountered.
Checksum/Simhash: Implement "Near-Duplicate" detection using Simhash to avoid storing and processing pages with the same content but different URLs (e.g., tracking parameters).
DNS Caching: Implement a dedicated, highly available DNS resolver/cache to avoid the latency and load of millions of DNS lookups on public resolvers.
Checkpointing: State-managed fetchers that can resume from a specific offset in the Frontier to handle mass worker failures.
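
As a rough illustration of the Simhash idea mentioned above, the sketch below builds a 64-bit fingerprint from whitespace-separated tokens and compares fingerprints by Hamming distance. The tokenization, the use of SHA-256 as the per-token hash, and the threshold of 3 bits are assumptions chosen for the example, not values from the design.

```python
import hashlib

BITS = 64  # fingerprint width (assumption for this sketch)


def simhash(text: str) -> int:
    """Return a 64-bit Simhash fingerprint of whitespace-separated tokens."""
    weights = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit hash of the token
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: str, b: str, threshold: int = 3) -> bool:
    """Treat two pages as near-duplicates if few fingerprint bits differ."""
    return hamming_distance(simhash(a), simhash(b)) <= threshold
```

Because every token contributes equally, two pages with the same words in a different order get the same fingerprint; a production version would more likely weight shingles or terms.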
Design Breakdown

Functional Requirements

Core Use Cases:
Discover new URLs starting from seeds.
Fetch and store HTML content.
Extract links and metadata from pages.
Respect robots.txt and domain-specific crawl delays.
Scope Control:
In-scope: Distributed crawling, link extraction, basic deduplication, politeness.
Out-of-scope: JavaScript execution (Puppeteer/Playwright), image/video processing, advanced search indexing (focus is on the crawler, not the engine).
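
The robots.txt requirement can be illustrated with Python's standard urllib.robotparser module. The robots.txt body and the agent name "MyCrawler" below are made up for the example; a real fetcher would download the file from each host and cache the parsed rules.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "MyCrawler"  # hypothetical agent name for this sketch

# Example robots.txt content (invented for illustration).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that was already fetched and cached."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_rules(ROBOTS_TXT)
allowed = rules.can_fetch(USER_AGENT, "https://example.com/index.html")
blocked = rules.can_fetch(USER_AGENT, "https://example.com/private/data.html")
delay = rules.crawl_delay(USER_AGENT)  # seconds between requests, or None
```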

Non-Functional Requirements

Scale: Support 1 billion URLs/month (~400 QPS).
Latency: High-throughput oriented; individual page latency is less important than aggregate throughput.
Availability & Reliability: Distributed architecture to prevent single points of failure; retry logic for failed fetches.
Consistency: Eventual consistency for the "URL Seen" database is acceptable.
Security & Privacy: User-agent identification and compliance with standard web scraping ethics.

Estimation

Traffic: 1B URLs / 30 days ≈ 385 QPS. Peak QPS ≈ 800.
Storage: 100KB per compressed page × 1B pages = 100TB per month.
Bandwidth: 400 QPS × 100KB ≈ 40MB/s (320 Mbps) incoming.
Metadata: 1KB per URL metadata × 1B ≈ 1TB per month.
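
The arithmetic behind these estimates can be checked with a few lines of Python. Decimal units (1TB = 10^9 KB) are used, and the 2x peak-to-average factor is an assumption that roughly matches the ~800 QPS peak figure.

```python
URLS_PER_MONTH = 1_000_000_000
SECONDS_PER_MONTH = 30 * 24 * 60 * 60    # 2,592,000 seconds
PAGE_SIZE_KB = 100                       # compressed page size
METADATA_KB = 1                          # metadata per URL
ROUNDED_QPS = 400                        # rounded average used for bandwidth

avg_qps = URLS_PER_MONTH / SECONDS_PER_MONTH         # about 385.8
peak_qps = 2 * avg_qps                               # assumption: peak is ~2x average
storage_tb = URLS_PER_MONTH * PAGE_SIZE_KB / 1e9     # KB -> TB (decimal)
bandwidth_mb_s = ROUNDED_QPS * PAGE_SIZE_KB / 1000   # KB/s -> MB/s
bandwidth_mbps = bandwidth_mb_s * 8                  # MB/s -> Mbit/s
metadata_tb = URLS_PER_MONTH * METADATA_KB / 1e9     # KB -> TB (decimal)

print(f"avg {avg_qps:.0f} QPS, peak {peak_qps:.0f} QPS")
print(f"storage {storage_tb:.0f} TB/month, metadata {metadata_tb:.0f} TB/month")
print(f"bandwidth {bandwidth_mb_s:.0f} MB/s ({bandwidth_mbps:.0f} Mbps)")
```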

Blueprint

Concise Summary: A distributed worker-based architecture using Kafka for URL orchestration and S3 for content storage.
Major Components:
URL Frontier (Kafka): Acts as the task orchestrator, decoupling discovery from fetching.
Fetcher Service: Stateless workers that download HTML, respecting politeness rules via Redis-based locking.
URL Seen Store (Redis + Bloom Filter): High-speed deduplication layer.
Document Store (S3): Persistent storage for crawled content.
Simplicity Audit: This design avoids complex distributed graph processing by treating the crawl as a stream of URLs. We use managed services (S3/Kafka) to reduce operational overhead.
Architecture Decision Rationale:
Why this?: Kafka provides the necessary buffering and ordering (by partition) to handle spikes and ensure we can scale fetchers horizontally.
Functional Satisfaction: Link extractor feeds Kafka, creating a recursive discovery loop.
Non-functional Satisfaction: Scalable by adding more Kafka partitions and Fetcher instances.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling
Fetcher Workers: Stateless containers deployed in Kubernetes. Scaled based on Kafka consumer lag.
Isolation: Fetchers use a local cache for DNS to prevent external network bottlenecks.
API Schema Design (Internal Control API)
POST /crawl-seed: Start a new crawl. Protocol: REST.
GET /health: Monitor worker status.
Resilience & Reliability
Retries: 3 retries for 5xx errors with exponential backoff. 4xx errors are logged and dropped, except 429 (Too Many Requests), which is retried after backing off.
Circuit Breaker: If a specific domain returns >10% errors, temporarily pause crawling that domain.
Observability
Metrics: URLs per second, Success/Failure rate by domain, Consumer lag, Network throughput.
Tracing: Trace a URL from discovery to storage using a unique crawl_id.
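
The retry and circuit-breaker rules above might look roughly like the following. Treating 429 as retryable, the 1-second base delay, the 60-second cap, and the 20-sample minimum before the breaker can trip are all assumptions added for the sketch.

```python
from collections import defaultdict

MAX_RETRIES = 3


def classify(status: int) -> str:
    """Map an HTTP status code to 'ok', 'retry', or 'drop'."""
    if status == 429 or 500 <= status < 600:
        return "retry"   # assumption: 429 is retried like a 5xx
    if 400 <= status < 500:
        return "drop"    # other 4xx errors are logged and dropped
    return "ok"          # assumes the HTTP client already followed redirects


def backoff_seconds(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff: base * 2^attempt, capped at `cap` seconds."""
    return min(cap, base * 2 ** attempt)


class DomainBreaker:
    """Pause a domain once its error rate exceeds `threshold`."""

    def __init__(self, threshold: float = 0.10, min_samples: int = 20):
        self.threshold = threshold
        self.min_samples = min_samples   # avoid tripping on tiny samples
        self.total = defaultdict(int)
        self.errors = defaultdict(int)

    def record(self, domain: str, ok: bool) -> None:
        self.total[domain] += 1
        if not ok:
            self.errors[domain] += 1

    def is_open(self, domain: str) -> bool:
        """True means crawling this domain should be paused."""
        n = self.total[domain]
        return n >= self.min_samples and self.errors[domain] / n > self.threshold
```

A real breaker would also use a sliding time window and a cool-down before resuming; both are omitted here to keep the example short.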

Storage

Access Pattern: Write-heavy (saving content); Read-heavy for "Seen" checks.
Database Table Design (Metadata DB):
url_hash (PK): SHA-256 of the URL.
last_crawled_at: Timestamp.
content_hash: To detect content changes.
s3_path: Pointer to the raw HTML.
Technical Selection:
NoSQL (Cassandra/DynamoDB): High write throughput and easy partitioning by url_hash.
S3: Industry standard for cost-effective, high-volume object storage.
Distribution Logic: Partition by url_hash to avoid hot spots for popular domains.
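
A metadata row with the four fields listed above could be modeled as below. The bucket name "crawl-bucket" and the two-character key prefix in the S3 path are hypothetical choices for illustration only.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CrawlRecord:
    url_hash: str              # SHA-256 of the URL (primary/partition key)
    last_crawled_at: datetime
    content_hash: str          # SHA-256 of the body, to detect content changes
    s3_path: str               # pointer to the raw HTML object


def build_record(url: str, html: bytes, bucket: str = "crawl-bucket") -> CrawlRecord:
    """Build the metadata row for one fetched page."""
    url_hash = hashlib.sha256(url.encode("utf-8")).hexdigest()
    content_hash = hashlib.sha256(html).hexdigest()
    # Hypothetical key layout: a short hash prefix spreads objects across keys.
    s3_path = f"s3://{bucket}/{url_hash[:2]}/{url_hash}.html.gz"
    return CrawlRecord(url_hash, datetime.now(timezone.utc), content_hash, s3_path)
```

Because url_hash is uniformly distributed, partitioning on it spreads the pages of a single popular domain across many partitions.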

Cache

Purpose & Justification:
Robots Cache: Store robots.txt results for 24h to avoid fetching them before every page crawl.
URL Seen Filter: Prevent recrawling the same URL in the same cycle.
Key-Value Schema:
domain:robots -> String (rules).
url_hash -> Boolean (or Bloom Filter bit array).
Technical Selection: Redis. High IOPS for the millions of "Seen" checks required per minute.
Failure Handling: If Redis fails, the system may recrawl some URLs (acceptable) until the cache is restored.
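
To make the "Bloom Filter bit array" option concrete, here is a minimal in-process Bloom filter. It is a sketch only: the double-hashing scheme derived from SHA-256 is one common construction, and in the design above the bit array would live in Redis rather than in a local bytearray.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, tunable false positives."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two 64-bit hashes.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd, never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = BloomFilter(num_bits=10_000, num_hashes=7)
seen.add("https://example.com/a")
already_seen = "https://example.com/a" in seen   # True: no false negatives
```

A membership test can return a false positive, which here means a new URL is occasionally skipped; it never returns a false negative.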

Messaging

Purpose & Decoupling: Kafka decouples the Link Extractor (producer) from the Fetcher (consumer).
Throughput & Partitioning:
Partition by Domain Name. This is critical: all URLs from example.com go to the same partition, allowing a single consumer to manage the crawl rate (politeness) for that domain sequentially.
Failure Handling: Dead-letter queue for malformed URLs that fail extraction repeatedly.
Technical Selection: Kafka. High throughput and persistence allow for "replay" if the Extractor logic changes.
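
Domain-based partitioning can be sketched as a hash of the hostname modulo the partition count. The partition count of 64 and the use of SHA-256 are assumptions for the example; in practice one would usually just set the message key to the hostname and let the Kafka client's own partitioner do the hashing.

```python
import hashlib
from urllib.parse import urlsplit

NUM_PARTITIONS = 64  # hypothetical partition count


def partition_key(url: str) -> str:
    """Message key: the lowercased hostname, so one domain maps to one key."""
    return urlsplit(url).hostname or ""


def partition_for(url: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stand-in for a keyed partitioner: hash(hostname) mod partition count."""
    digest = hashlib.sha256(partition_key(url).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Every URL on the same host lands in the same partition, so the single consumer that owns that partition sees the host's URLs in order and can pace them.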

Data Processing

Processing Model: Stream processing (via Link Extractor workers).
Processing DAG: Fetcher Output -> HTML Parser -> Link Normalizer (absolute vs relative) -> Seen Check -> Kafka.
Scalability: Horizontal scaling of Extractor workers based on Kafka topic depth.
Technical Selection: Custom Go/Python workers for efficiency in string parsing.
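
The parser and normalizer stages of the DAG can be approximated with the standard library alone. This sketch resolves relative links against the page URL, strips fragments, and keeps only http/https links; the function name extract_links is hypothetical, and further normalization (tracking parameters, trailing slashes) is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkParser(HTMLParser):
    """Collect raw href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url: str, html: str) -> list:
    """Return absolute, fragment-free http(s) links in document order."""
    parser = LinkParser()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        # Relative -> absolute, then drop any "#fragment" suffix.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlsplit(absolute).scheme in ("http", "https") and absolute not in links:
            links.append(absolute)
    return links
```

The returned list is what would go through the "Seen" check before being published to Kafka.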

Infrastructure (Optional)

Distributed Coordination:
Redis Locks: Used for cross-worker politeness (ensure at most N workers are hitting the same domain simultaneously).
Platform Security:
VPC Isolation: Fetchers run in a private subnet.
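
The lock-based politeness rule can be illustrated with an in-memory lease table that mimics what expiring Redis keys would provide. Everything here is a stand-in: the class name DomainSlots, the explicit `now` argument, and the TTL value are assumptions, and a real deployment would keep the leases in Redis (for example with SET ... NX EX for a single-holder lock) so that all workers share them.

```python
class DomainSlots:
    """Allow at most `max_concurrent` workers per domain, with expiring leases."""

    def __init__(self, max_concurrent: int, ttl: float):
        self.max_concurrent = max_concurrent
        self.ttl = ttl       # lease lifetime in seconds, frees crashed workers
        self.leases = {}     # domain -> {worker_id: expiry_time}

    def acquire(self, domain: str, worker_id: str, now: float) -> bool:
        holders = self.leases.setdefault(domain, {})
        # Drop expired leases so a dead worker cannot block the domain forever.
        for wid in [w for w, expiry in holders.items() if expiry <= now]:
            del holders[wid]
        if worker_id in holders or len(holders) < self.max_concurrent:
            holders[worker_id] = now + self.ttl   # grant or renew the lease
            return True
        return False

    def release(self, domain: str, worker_id: str) -> None:
        self.leases.get(domain, {}).pop(worker_id, None)
```

The TTL is what gives fault tolerance: if a fetcher dies while holding a slot, the slot frees itself once the lease expires.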
Wrap Up

Advanced Topics

Trade-offs: We prioritize Throughput over Strict Freshness. A URL found now might not be crawled for minutes or hours depending on the queue depth.
Reliability: Using Kafka as the Frontier provides natural "back-pressure." If fetchers slow down, the queue grows, but links are not lost.
Bottleneck Analysis:
DNS: Public DNS can throttle us. Solution: Use a local Unbound or CoreDNS cluster.
Memory: The Bloom filter for 1B URLs takes about 1.25GB of RAM at 10 bits per URL, which gives a false-positive rate of roughly 1%. This fits easily in a single Redis node.
Security: Crawler identifies itself via User-Agent: MyCrawler/1.0 (+http://example.com/bot).
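
The Bloom filter memory figure in the bottleneck analysis can be reproduced with the standard sizing formulas. The numbers below assume 1B URLs and 10 bits per URL as stated; the choice of 7 hash functions follows from the usual optimum k = (m/n) ln 2.

```python
import math

N_URLS = 1_000_000_000
BITS_PER_URL = 10

# Memory: total bits converted to decimal gigabytes.
memory_gb = N_URLS * BITS_PER_URL / 8 / 1e9

# Optimal number of hash functions for this bits-per-item ratio.
optimal_k = round(BITS_PER_URL * math.log(2))


def false_positive_rate(bits_per_item: float, k: int) -> float:
    """Standard approximation: (1 - e^(-k / bits_per_item)) ** k."""
    return (1 - math.exp(-k / bits_per_item)) ** k


fp_rate = false_positive_rate(BITS_PER_URL, optimal_k)
print(f"{memory_gb} GB, k={optimal_k}, false positives ~{fp_rate:.2%}")
```

At about 0.8% false positives, slightly under one in a hundred new URLs would be wrongly skipped as "already seen"; raising the bits-per-URL budget lowers that rate at a proportional memory cost.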