Web Crawler: System Design

Requirements & API: Web Crawler

The first move in any interview: define requirements and sketch the API before drawing a single box.

Functional requirements

•Given a set of seed URLs, crawl the web and download page content.
•Extract all hyperlinks from each page and add new URLs to the crawl queue.
•Avoid crawling the same URL twice (deduplication).
•Respect robots.txt and per-domain crawl rate limits (politeness).
•Store crawled content for downstream indexing or analysis.
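The link-extraction and deduplication requirements above can be sketched with the Python standard library alone. This is a minimal illustration, not a production parser: the names LinkExtractor and extract_new_links are hypothetical, the in-memory seen set stands in for whatever deduplication store the real system uses, and only <a href> links are considered.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkExtractor(HTMLParser):
    """Collects absolute http(s) links from the href attribute of <a> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL, then drop the
                # #fragment, which never changes the document that is fetched.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if urlsplit(absolute).scheme in ("http", "https"):
                    self.links.append(absolute)


def extract_new_links(base_url: str, html: str, seen: set[str]) -> list[str]:
    """Return links not yet in `seen`, in page order, and record them in `seen`."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    fresh = []
    for link in parser.links:
        if link not in seen:
            seen.add(link)
            fresh.append(link)
    return fresh
```

Calling extract_new_links("https://example.com/dir/", page_html, seen) returns only the URLs that should be appended to the crawl queue; non-HTTP schemes such as mailto: are skipped.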

Non-functional requirements

•Throughput: crawl billions of pages per month (large-scale) or millions per day (mid-scale).
•Politeness: never send more than N requests/sec to any single domain.
•Fault tolerance: a crashed worker loses at most one in-flight URL, not its entire queue.
•Freshness: recrawl popular or frequently changing pages more often than pages that rarely change.
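To make the throughput and politeness targets concrete, it helps to convert them into a sustained fetch rate. The arithmetic below is a back-of-the-envelope sketch: the 30-day month and the per-domain limit of N = 1 request/sec are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def pages_per_second(pages: float, days: float) -> float:
    """Average fetch rate needed to crawl `pages` within `days`."""
    return pages / (days * SECONDS_PER_DAY)


def min_domains_in_flight(target_pages_per_sec: float, per_domain_limit: float) -> int:
    """Fewest distinct domains that must be crawled concurrently so the
    per-domain politeness cap does not prevent reaching the target rate."""
    return math.ceil(target_pages_per_sec / per_domain_limit)


# 1 billion pages in an assumed 30-day month: roughly 386 pages/sec sustained.
large_scale = pages_per_second(1e9, days=30)
# 1 million pages per day: roughly 11.6 pages/sec sustained.
mid_scale = pages_per_second(1e6, days=1)
# With an assumed cap of 1 request/sec per domain, the large-scale target
# needs at least 386 different domains being fetched at any moment.
domains_needed = min_domains_in_flight(large_scale, per_domain_limit=1.0)
```

The takeaway is that politeness, not raw bandwidth, often bounds throughput: the crawler has to spread its work across many domains at once.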

API contract

POST /crawl/seed { urls: string[] } → { jobId }

Bootstraps the frontier with seed URLs.

GET /crawl/status/:jobId → { queued, fetched, failed, qps }

Job status and throughput metrics.

GET /content/:urlHash → { url, html, fetchedAt, statusCode }

Retrieves stored page content by URL fingerprint.
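The contract does not say how :urlHash is computed. One plausible approach, shown here purely as an assumption, is to normalize the URL and hash it with SHA-256. The helper names normalize_url and url_hash are hypothetical, and the normalization is deliberately minimal: it drops userinfo and does not handle IPv6 literals or query-parameter ordering.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop default ports and fragments,
    and use "/" for an empty path, so trivially different spellings
    of the same URL compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def url_hash(url: str) -> str:
    """Hex SHA-256 fingerprint of the normalized URL (64 characters)."""
    return hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()
```

Under this scheme, HTTP://Example.com:80/a#frag and http://example.com/a map to the same fingerprint, so the same key serves both deduplication and content lookup.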

About Web Crawler

A web crawler is the engine behind every search index. Google's Googlebot fetches billions of pages a day, and the same fundamental design shows up in SEO tools, price monitors, and academic research scrapers. The core challenge isn't fetching a page; that's trivial. The challenge is doing it at scale without hammering a single website, without fetching the same page twice, and without losing your place when a worker crashes.

The design centers on a URL Frontier: a prioritized, politeness-aware queue of URLs waiting to be crawled. When a fetcher grabs a page, a parser extracts every link and feeds new URLs back into the Frontier, but only after a deduplication check confirms they haven't already been seen, whether fetched or still queued. The Frontier also enforces politeness: you wait between requests to the same domain, and you respect robots.txt.
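The Frontier's two core duties, deduplication and per-domain spacing, can be sketched in a few dozen lines. This is a single-process illustration under stated assumptions: the Frontier class and its method names are hypothetical, the fixed delay_seconds stands in for a real per-domain policy, time is passed in explicitly to keep the sketch deterministic, and both URL prioritization and robots.txt checks are left out.

```python
import heapq
from collections import defaultdict, deque
from urllib.parse import urlsplit


class Frontier:
    """One FIFO queue per host, plus a heap of hosts ordered by the
    earliest time each may be contacted again."""

    def __init__(self, delay_seconds: float = 1.0) -> None:
        self.delay = delay_seconds
        self.queues = defaultdict(deque)  # host -> URLs waiting to be fetched
        self.next_allowed = {}            # host -> earliest next fetch time
        self.scheduled = set()            # hosts that currently sit in the heap
        self.heap = []                    # (earliest fetch time, host)
        self.seen = set()                 # every URL ever added (dedup)

    def add(self, url: str) -> bool:
        """Queue a URL unless it was seen before. Returns True if queued."""
        if url in self.seen:
            return False
        self.seen.add(url)
        host = urlsplit(url).hostname or ""
        self.queues[host].append(url)
        if host not in self.scheduled:
            heapq.heappush(self.heap, (self.next_allowed.get(host, 0.0), host))
            self.scheduled.add(host)
        return True

    def next_url(self, now: float):
        """Return a URL whose host may be contacted at time `now`, or None."""
        if not self.heap or self.heap[0][0] > now:
            return None
        _ready_at, host = heapq.heappop(self.heap)
        url = self.queues[host].popleft()
        self.next_allowed[host] = now + self.delay
        if self.queues[host]:
            # More work for this host: reschedule it after the politeness delay.
            heapq.heappush(self.heap, (self.next_allowed[host], host))
        else:
            self.scheduled.discard(host)
        return url
```

A fetcher loop would call next_url(time.monotonic()) and sleep briefly when it returns None. Because the heap is keyed by host rather than by URL, a domain with a million queued pages cannot starve the others.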

At scale, the interesting problems are exactly where you'd expect: distributed deduplication across hundreds of workers (a Bloom filter over a URL fingerprint set), back-pressure when the frontier grows faster than workers drain it, and deciding which pages to recrawl and how often (freshness scheduling).
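A Bloom filter answers "have I seen this URL?" in a fixed amount of memory, at the cost of occasional false positives (a new URL wrongly reported as seen) but never false negatives. The sketch below is a single-process, in-memory version meant only to show the mechanics; sharing it across hundreds of workers would need a shared or partitioned store, which is not shown. The BloomFilter class is hypothetical, it uses the standard sizing formulas m = -n ln p / (ln 2)^2 and k = (m / n) ln 2, and it derives its k positions from one SHA-256 digest via double hashing, which is one common choice among several.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, expected_items: int, false_positive_rate: float) -> None:
        # Bits needed for n items at target false-positive rate p.
        bits = -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)
        self.num_bits = max(1, math.ceil(bits))
        # Number of hash positions per item that minimizes false positives.
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: position_i = h1 + i * h2 (mod m).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd, never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )
```

With expected_items=1000 and false_positive_rate=0.01 the filter uses about 9,600 bits (roughly 1.2 KB) and 7 hash positions per URL. Since a false positive means a page is silently skipped, the target rate is a coverage trade-off rather than a correctness bug.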