The Question

Distributed Web Crawler Design

Design a high-scale, distributed web crawler capable of processing billions of URLs per month. The system must address efficient URL discovery and deduplication, domain-level politeness constraints, high-throughput metadata storage, and fault tolerance in the components that handle large-scale data ingestion and storage.
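One way to approach the deduplication requirement (and the reason a Bloom filter appears in the technology list below) is a probabilistic seen-URL set: it never misses a URL it has recorded, at the cost of a tunable false-positive rate. The following is a minimal, single-process sketch; the class name BloomFilter, the 2^20-bit size, and the 5-hash count are illustrative assumptions, and a production crawler would shard a far larger filter across workers or back it with Redis.

import hashlib

class BloomFilter:
    """Probabilistic set for URL deduplication: no false negatives,
    small false-positive rate. Sizes here are illustrative only."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))

bf = BloomFilter()
bf.add("https://example.com/")
assert bf.might_contain("https://example.com/")     # already crawled: skip
print(bf.might_contain("https://example.com/new"))  # almost certainly False

A "might contain" answer of True occasionally skips a never-seen URL (a false positive), which is an acceptable trade for a crawler; a False answer is always safe to fetch.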
Relevant technologies: Kafka, Cassandra, Redis, S3, Bloom Filter, Kubernetes, gRPC, DNS Caching
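The domain-level politeness constraint from the question can likewise be sketched as a scheduler that releases at most one fetch per domain per delay window. This is a minimal, single-process illustration; the name PolitenessScheduler and the 1-second default delay are assumed placeholders, and a real deployment would honor per-site robots.txt crawl-delays and coordinate the timestamps through shared state such as Redis.

import heapq
import time
from collections import deque
from urllib.parse import urlparse

class PolitenessScheduler:
    """Hands out URLs only when their domain's crawl-delay has elapsed."""

    def __init__(self, delay_seconds=1.0):
        self.delay = delay_seconds
        self.ready = []    # min-heap of (next_allowed_time, domain)
        self.queues = {}   # domain -> deque of pending URLs

    def enqueue(self, url):
        domain = urlparse(url).netloc
        if domain not in self.queues:
            self.queues[domain] = deque()
            heapq.heappush(self.ready, (time.monotonic(), domain))
        self.queues[domain].append(url)

    def next_url(self):
        # Pop the domain whose politeness window opens soonest.
        while self.ready:
            next_time, domain = heapq.heappop(self.ready)
            if not self.queues[domain]:
                del self.queues[domain]   # drained; forget the domain
                continue
            wait = next_time - time.monotonic()
            if wait > 0:
                time.sleep(wait)
            url = self.queues[domain].popleft()
            # Re-arm this domain's window one delay in the future.
            heapq.heappush(self.ready, (time.monotonic() + self.delay, domain))
            return url
        return None

sched = PolitenessScheduler(delay_seconds=1.0)
for u in ["https://a.test/1", "https://a.test/2", "https://b.test/1"]:
    sched.enqueue(u)
print(sched.next_url())  # a.test/1
print(sched.next_url())  # b.test/1 (a.test's window is still closed)
print(sched.next_url())  # waits ~1s, then a.test/2

Keying the heap by domain rather than by URL is what keeps the scheduler polite under skew: a domain with a million queued pages still gets only one fetch per window, while other domains proceed in parallel.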