Web Crawler
URL frontier, politeness, deduplication, distributed fetching.
Components (8)
- Seed URLs
- URL Frontier
- Fetcher Service
- Parser Service
- Bloom Filter
- robots.txt Cache
- Content Store
- URL Metadata DB
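As a rough illustration of how the URL Frontier, Bloom Filter, and politeness rule might fit together, here is a minimal single-process sketch. The class names (`BloomFilter`, `Frontier`), the one-second per-host delay, and the filter sizing are assumptions made for the example, not details from the design above. A production frontier would be distributed and would also consult the robots.txt Cache before releasing a URL.

```python
import hashlib
import heapq
from collections import deque
from urllib.parse import urlsplit


class BloomFilter:
    """Fixed-size Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):  # sizes are illustrative
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive k bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


class Frontier:
    """Per-host FIFO queues plus a heap of (next_allowed_time, host)."""

    def __init__(self, delay_seconds=1.0):  # hypothetical politeness delay
        self.delay = delay_seconds
        self.seen = BloomFilter()
        self.queues = {}        # host -> deque of pending URLs
        self.ready = []         # min-heap of (next_allowed_time, host)
        self.next_allowed = {}  # host -> earliest time of the next fetch

    def add(self, url, now=0.0):
        """Enqueue url unless the Bloom filter reports it as already seen."""
        if url in self.seen:
            return False
        self.seen.add(url)
        host = urlsplit(url).netloc
        if host not in self.queues:
            self.queues[host] = deque()
            start = max(now, self.next_allowed.get(host, now))
            heapq.heappush(self.ready, (start, host))
        self.queues[host].append(url)
        return True

    def next_url(self, now):
        """Return a URL whose host may be fetched at time `now`, else None."""
        if not self.ready or self.ready[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready)
        queue = self.queues[host]
        url = queue.popleft()
        self.next_allowed[host] = now + self.delay
        if queue:
            heapq.heappush(self.ready, (now + self.delay, host))
        else:
            del self.queues[host]
        return url
```

A fetcher would call `next_url(time.monotonic())` in a loop and hand parsed links back through `add`. Because a Bloom filter can return false positives, a small fraction of new URLs would be skipped, which is the usual trade-off for keeping the seen-set in memory.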
Headline numbers
- Crawl throughput needed: ~385 pages/sec
- Storage for content: ~20 TB / month
- URL Frontier queue depth: ~10B entries
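The inputs behind these figures are not stated. As a back-of-envelope check, the throughput and storage numbers are consistent with roughly 1 billion pages per month at about 20 KB stored per page. Both inputs are assumptions chosen to match the headline values, and the sketch uses a 30-day month and decimal units.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000 s in a 30-day month

# Assumed inputs (not given in the source), chosen to match the headline numbers.
ASSUMED_PAGES_PER_MONTH = 1_000_000_000
ASSUMED_AVG_PAGE_KB = 20


def pages_per_second(pages_per_month):
    """Average sustained fetch rate needed to crawl the monthly volume."""
    return pages_per_month / SECONDS_PER_MONTH


def storage_tb_per_month(pages_per_month, avg_page_kb):
    """Monthly content storage in decimal terabytes (1 TB = 1e12 bytes)."""
    return pages_per_month * avg_page_kb * 1_000 / 1e12


print(f"{pages_per_second(ASSUMED_PAGES_PER_MONTH):.1f} pages/sec")
print(f"{storage_tb_per_month(ASSUMED_PAGES_PER_MONTH, ASSUMED_AVG_PAGE_KB):.1f} TB/month")
```

Under these assumptions the rate works out to about 385.8 pages/sec and the storage to 20 TB per month. Peak throughput would need headroom above this average.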