The Question
Distributed Web Crawler Design
Design a highly scalable, distributed system capable of crawling and indexing a significant portion of the web. The system must efficiently manage URL discovery, prioritize content fetching, and strictly adhere to website-specific politeness policies while handling petabytes of data.
Redis
Bloom Filter
S3
PostgreSQL
Distributed Workers
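To make the requirements above concrete, here is a minimal single-process sketch of a URL frontier that combines the three concerns named in the question: URL discovery with deduplication, priority ordering, and a per-host politeness delay. It is only an illustration under stated assumptions, not a reference design. The class names `BloomFilter` and `Frontier`, the `crawl_delay` parameter, and the fixed filter size are all hypothetical. In a distributed deployment the in-memory heap and host table would plausibly live in a shared store such as Redis, and the delay would come from each site's robots.txt rather than a constant.

```python
import hashlib
import heapq
from urllib.parse import urlsplit


class BloomFilter:
    """Fixed-size Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive the bit positions from salted SHA-256 digests of the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


class Frontier:
    """Priority queue of URLs with deduplication and per-host rate limiting."""

    def __init__(self, crawl_delay=1.0):
        self.crawl_delay = crawl_delay  # assumed uniform delay, in seconds
        self.seen = BloomFilter()
        self.heap = []          # entries are (priority, sequence, url)
        self.next_allowed = {}  # host -> earliest time of the next fetch
        self.counter = 0        # tie-breaker that keeps insertion order

    def add(self, url, priority=10):
        """Queue a URL unless it was (probably) seen before. Lower = sooner."""
        if url in self.seen:
            return False
        self.seen.add(url)
        heapq.heappush(self.heap, (priority, self.counter, url))
        self.counter += 1
        return True

    def pop_ready(self, now):
        """Return the best URL whose host may be fetched at `now`, else None."""
        deferred = []
        result = None
        while self.heap:
            entry = heapq.heappop(self.heap)
            host = urlsplit(entry[2]).netloc
            if self.next_allowed.get(host, 0.0) <= now:
                self.next_allowed[host] = now + self.crawl_delay
                result = entry[2]
                break
            deferred.append(entry)  # host is still cooling down
        for entry in deferred:
            heapq.heappush(self.heap, entry)
        return result
```

A worker would call `add` for every link it extracts and `pop_ready(time.monotonic())` to pick its next fetch. The Bloom filter trades a small false-positive rate (some new URLs are skipped) for constant memory, which is usually an acceptable trade at web scale. Scanning past cooling-down hosts is linear in the worst case, so a larger system would more likely keep one queue per host.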