DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
The Question
Design

Distributed Web Crawler Design

Design a globally distributed web crawler capable of indexing 1 billion pages per day. The system must address the challenges of URL discovery at scale, ensuring 'politeness' to avoid overwhelming target servers, deduplicating content, and managing the storage of petabytes of data. Explain how you would handle DNS resolution bottlenecks, prioritize URLs based on 'freshness' requirements, and ensure the system is resilient to network failures or worker crashes.
Kafka
Cassandra
Redis
S3
Bloom Filter
gRPC
Simhash
DNS Caching
Questions & Insights

Clarifying Questions

What is the target scale and refresh rate? (Assumption: We need to crawl 1 billion pages per day with a refresh cycle of 30 days for a total of 30 billion URLs.)
What types of content are we indexing? (Assumption: MVP focuses on HTML text, metadata, and link extraction; media like video/audio is out of scope.)
What are the constraints regarding "Politeness"? (Assumption: We strictly follow robots.txt and implement a per-host rate limit to avoid DOS-ing small sites.)
Do we need to handle JavaScript rendering? (Assumption: For MVP, we use static HTML fetching. Headless browsers like Puppeteer are skipped due to massive resource costs.)
How do we handle duplicate content? (Assumption: We use checksumming (Simhash) to detect near-duplicates at the content level.)

Thinking Process

Core Bottleneck: The primary challenge is the URL Frontier. It must balance priority (crawling important sites first) and politeness (not hitting the same host too fast) while managing trillions of URLs.
Progressive Questions:
How do we ensure we don't crawl the same URL twice? (Use a distributed Bloom Filter).
How do we manage the politeness constraint across thousands of distributed workers? (Centralized Frontier with per-host queues).
How do we prevent DNS resolution from becoming the system bottleneck? (Implement a custom high-performance DNS cache).
How do we scale storage for trillions of small objects? (Use Object Storage for content and a Wide-column Store for metadata).

Bonus Points

DNS Prefetching & Custom Resolver: Standard OS DNS libraries are blocking and slow. Using an asynchronous, custom DNS resolver with a local cache saves 100ms+ per request.
Simhash for Near-Duplicate Detection: Web content is often redundant (e.g., same article on different mirrors). Simhash allows us to identify "near-duplicates" by comparing bit-differences in fingerprints.
Checkpointing & State Recovery: In a distributed crawl, nodes fail. Using a Kafka-backed URL Frontier ensures that if a Fetcher crashes, the URL is re-queued via consumer offset management.
Trap Detection: Implementing heuristics to detect "infinite URL loops" (e.g., dynamically generated calendars) to save bandwidth and storage.
Design Breakdown

Functional Requirements

Core Use Cases:
Seed URL Injection: Ability to bootstrap the crawl with high-quality domains.
HTML Fetching: Distributed downloading of web pages.
Link Extraction: Parsing HTML to find new URLs to add to the frontier.
Content Storage: Storing the raw HTML and metadata for downstream processing (Search Indexing).
Scope Control:
In-Scope: URL Frontier, Politeness logic, HTML fetching, Link extraction, De-duplication.
Out-of-Scope: JavaScript rendering (SPA), image/video processing, deep web crawling (behind logins).

Non-Functional Requirements

Scale: Support for 12,000+ pages per second (1B/day).
Latency: High throughput is prioritized over single-page latency, but DNS/TCP handshakes must be optimized.
Availability & Reliability: The system must be fault-tolerant; a worker crash should not lose the crawl progress.
Consistency: Eventual consistency is acceptable for URL discovery.
Politeness: Mandatory adherence to robots.txt and host-level throttling.
Extensibility: Modular design to add new processors (e.g., language detection) later.

Estimation

Traffic: 1,000,000,000 pages / 86,400s ≈ 11,500 QPS (Average). Peak QPS ≈ 25,000.
Storage (Content): 1B pages/day 100 KB/page = 100 TB/day. Over 30 days = 3 PB**.
Storage (Metadata/URL): 30B URLs 500 bytes (URL + metadata) ≈ 15 TB**.
Bandwidth: 11,500 pages/s * 100 KB/page ≈ 1.15 GB/s (9.2 Gbps) total throughput.

Blueprint

Concise Summary: A distributed architecture centered around a "URL Frontier" that manages work distribution via Kafka, ensuring politeness and priority while horizontally scalable Fetchers handle the networking.
Major Components:
URL Frontier: The brain of the system, managing URL queues, prioritization, and politeness constraints.
Fetcher Service: Stateless workers that perform DNS resolution and HTTP fetching.
Content Storage (S3): Highly durable blob store for raw page content.
Metadata Store (Cassandra): Distributed wide-column store to track URL crawl history and status.
URL Filter (Bloom Filter): High-speed, memory-efficient component to check if a URL has already been discovered.
Simplicity Audit: We avoid complex distributed locking by using Kafka partitions for host-based affinity, ensuring politeness is naturally maintained.
Architecture Decision Rationale:
Why this architecture?: Separating URL management (Frontier) from execution (Fetchers) allows independent scaling of I/O-bound and CPU-bound tasks.
Functional Satisfaction: Meets the need to discover, fetch, and store content at scale.
Non-functional Satisfaction: Kafka provides the durability needed for 1B pages/day; Cassandra handles the high-write volume for metadata.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling
Fetcher Workers: Stateless, deployed in auto-scaling groups across multiple regions to reduce cross-region latency. Scaling is based on Kafka Consumer Lag.
Parser Service: Decoupled from Fetchers. Processes the HTML to extract links and generate Simhashes.
API Schema Design
Internal API (Frontier to Fetcher): gRPC for low-latency communication or Kafka message consumption.
Fetch Response: { url: string, status: int, body_s3_key: string, timestamp: long }.
Resilience & Reliability
Retry Policy: 3 retries with exponential backoff for 5xx errors. 4xx errors (except 429) are logged but not retried.
Circuit Breaker: If a host fails consistently, it is blacklisted in the Frontier for 24 hours.
Observability
Metrics: Track pages_per_second, http_error_rate_by_host, dns_resolution_time.

Storage

Access Pattern
Write-Heavy: Every page crawl results in a content write (S3) and metadata update (Cassandra).
Read-Light: Reads mainly occur when the Frontier re-initializes or for re-crawling logic.
Database Table Design (Cassandra)
Table: url_metadata
Fields: url_hash (PK), url_string, last_crawl_time, etag, priority_score, content_checksum.
Technical Selection
Cassandra: Chosen for linear write scalability and the ability to handle TTLs (Time-To-Live) for old crawl data.
S3: Industry standard for cost-effective storage of massive blob data.
Distribution Logic
Sharding: Cassandra partitions by url_hash. Kafka partitions by host_domain to ensure a single consumer handles one host (simplifies politeness).

Cache

Purpose & Justification: DNS resolution is a major bottleneck.
Key-Value Schema (Redis):
DNS Cache: Key: domain_name, Value: IP_address, TTL: 24h.
Politeness Cache: Key: host_name, Value: last_access_timestamp.
Failure Handling: If Redis fails, workers fall back to standard DNS resolution (degraded performance).

Messaging

Purpose & Decoupling: Kafka acts as the buffer between URL discovery (Parser) and URL execution (Fetcher).
Throughput & Partitioning:
Topic: pending_urls.
Partition Key: HostHash(url). This ensures that all URLs for a single domain go to the same Kafka partition, allowing a single worker to enforce politeness for that domain easily.
Technical Selection: Kafka for its high throughput and durability.

Data Processing

Processing Model: Stream processing using the Parser Service.
Processing DAG:
Fetcher saves HTML to S3.
Parser reads HTML -> Extracts Links.
Parser checks Bloom Filter for "New" URLs.
Parser calculates Simhash and checks Metadata DB for content duplicates.
Valid New URLs are sent back to the Frontier.
Technical Selection: Custom Go-based service for link extraction (fast regex/DOM parsing).
Wrap Up

Advanced Topics

Trade-offs (Freshness vs. Depth): We prioritize breadth (new URLs) but use the Metadata DB to track change frequency. If a site changes often, the Frontier increases its priority for the next cycle.
Reliability: Use of Kafka ensures that even if the whole Fetcher cluster goes down, the "work list" is preserved.
Bottleneck Analysis:
Hot Shards: High-traffic domains (e.g., Wikipedia) could overwhelm a single Kafka partition. Mitigation: Sub-partitioning high-volume domains.
Storage Cost: 3 PB/month is expensive. Mitigation: Use S3 Intelligent-Tiering to move older crawl data to Glacier.
Security: Fetchers are placed in a restricted VPC. Outbound traffic is restricted to ports 80/443. A User-Agent string is clearly defined to identify the crawler to webmasters.
Distinguishing Insight: "Robot-Exclusion Protocol" (REP) Cache. Fetching robots.txt for every page is redundant. We cache the parsed robots.txt rules in Redis for each domain to significantly reduce round-trips.