Requirements & API: Web Crawler

The first move in any interview: define requirements and sketch the API before drawing a single box.

Functional requirements

  • Given a set of seed URLs, crawl the web and download page content.
  • Extract all hyperlinks from each page and add new URLs to the crawl queue.
  • Avoid crawling the same URL twice (deduplication).
  • Respect robots.txt and per-domain crawl rate limits (politeness).
  • Store crawled content for downstream indexing or analysis.
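The link-extraction and robots.txt requirements above can be sketched with Python's standard library alone. This is a minimal illustration, not a production parser: the helper names (LinkExtractor, extract_links, allowed_by_robots) are hypothetical, and it assumes the page HTML and the robots.txt text have already been downloaded.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit
from urllib.robotparser import RobotFileParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url: str, html: str) -> list[str]:
    """Return absolute http(s) links in document order, without fragments or repeats."""
    parser = LinkExtractor()
    parser.feed(html)
    seen: set[str] = set()
    links: list[str] = []
    for href in parser.hrefs:
        # Resolve relative links against the page URL, then drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlsplit(absolute).scheme in ("http", "https") and absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt text that was fetched earlier."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(user_agent, url)
```

In a real crawler the parsed robots.txt rules would be cached per domain rather than re-parsed for every URL; the sketch skips that for brevity.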

Non-functional requirements

  • Throughput: crawl billions of pages per month at search-engine scale (1 billion pages/month is about 385 pages/sec sustained, assuming a 30-day month), or millions per day at mid-scale.
  • Politeness: never send more than N requests/sec to any single domain.
  • Fault tolerance: a crashed worker loses at most one in-flight URL, not its entire queue.
  • Freshness: recrawl popular or frequently changing pages more often than pages that rarely change.
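One way to meet the fault-tolerance requirement above is to lease URLs to workers instead of removing them from the queue outright. The sketch below is an in-memory illustration under that assumption; the LeasedQueue class, its lease_seconds parameter, and the injectable clock are all hypothetical names, and a real deployment would presumably back this with a durable queue.

```python
import time
from collections import deque
from typing import Callable, Optional


class LeasedQueue:
    """Work queue where a URL is only dropped once a worker acknowledges it."""

    def __init__(self, lease_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._lease_seconds = lease_seconds
        self._clock = clock
        self._pending: deque[str] = deque()
        self._in_flight: dict[str, float] = {}  # url -> lease expiry time

    def put(self, url: str) -> None:
        self._pending.append(url)

    def lease(self) -> Optional[str]:
        """Hand out the next URL, first reclaiming any whose lease has expired."""
        now = self._clock()
        expired = [u for u, deadline in self._in_flight.items() if deadline <= now]
        for url in expired:
            # The worker holding this URL is presumed dead; make it available again.
            del self._in_flight[url]
            self._pending.append(url)
        if not self._pending:
            return None
        url = self._pending.popleft()
        self._in_flight[url] = now + self._lease_seconds
        return url

    def ack(self, url: str) -> None:
        """Called after the page is stored; only now is the URL gone for good."""
        self._in_flight.pop(url, None)
```

If a worker crashes between lease() and ack(), its URL reappears once the lease expires, so the loss is bounded by what that worker had in flight. The trade-off is that a slow but healthy worker can cause a URL to be fetched twice.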

API contract

POST /crawl/seed { urls: string[] } → { jobId }
  Bootstraps the frontier with seed URLs.

GET /crawl/status/:jobId → { queued, fetched, failed, qps }
  Reports job status and throughput metrics.

GET /content/:urlHash → { url, html, fetchedAt, statusCode }
  Retrieves stored page content by URL fingerprint.
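The contract does not say how :urlHash is derived. One plausible choice, shown here purely as an assumption, is a SHA-256 digest of a lightly normalized URL, so that trivially different spellings of the same address map to one stored record. The function names and the normalization rules are illustrative; the ContentRecord fields mirror the response shape of GET /content/:urlHash.

```python
import hashlib
from dataclasses import dataclass
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment, default an empty path to '/'.

    Assumes the URL carries no user:password@ part, which would be case-sensitive.
    """
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def url_hash(url: str) -> str:
    """Hex SHA-256 of the normalized URL (assumed fingerprint scheme)."""
    return hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ContentRecord:
    """Field names follow the JSON response rather than Python naming style."""
    url: str
    html: str
    fetchedAt: str  # assumed to be an ISO 8601 timestamp
    statusCode: int
```

How aggressive normalization should be (sorting query parameters, stripping tracking parameters, and so on) is a design decision the contract leaves open.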

About Web Crawler

A web crawler is the engine behind every search index. Google's Googlebot fetches billions of pages a day, and the same fundamental design shows up in SEO tools, price monitors, and academic research scrapers. The core challenge isn't fetching a page; that's trivial. The challenge is doing it at scale without hammering a single website, without fetching the same page twice, and without losing your place when a worker crashes.

The design centers on a URL frontier: a prioritized, politeness-aware queue of URLs waiting to be crawled. When a fetcher downloads a page, a parser extracts every link and feeds the new URLs back into the frontier, but only after a deduplication check confirms they have not already been seen. The frontier also enforces politeness: it spaces out requests to the same domain, and robots.txt rules are respected.
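A frontier along these lines can be sketched as one FIFO queue per host plus a heap of hosts ordered by the earliest time each may be fetched again. The code below is a single-process illustration under stated assumptions: the Frontier class, its delay_seconds parameter, and the injectable clock are hypothetical, deduplication uses a plain in-memory set, and the prioritization the text mentions is left out to keep the example short.

```python
import heapq
import time
from collections import deque
from typing import Callable, Optional
from urllib.parse import urlsplit


class Frontier:
    """Per-host queues plus a heap of (next allowed fetch time, host)."""

    def __init__(self, delay_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._delay = delay_seconds
        self._clock = clock
        self._seen: set[str] = set()
        self._queues: dict[str, deque[str]] = {}
        self._next_allowed: dict[str, float] = {}
        self._ready: list[tuple[float, str]] = []

    def add(self, url: str) -> bool:
        """Enqueue url unless it was seen before. Returns True if it was new."""
        if url in self._seen:
            return False
        self._seen.add(url)
        host = urlsplit(url).netloc
        queue = self._queues.setdefault(host, deque())
        if not queue:
            # Host is not scheduled yet: schedule it, but not before its cooldown ends.
            ready_at = max(self._clock(), self._next_allowed.get(host, 0.0))
            heapq.heappush(self._ready, (ready_at, host))
        queue.append(url)
        return True

    def next_url(self) -> Optional[str]:
        """Return a URL whose host may be fetched now, or None if every host is cooling down."""
        now = self._clock()
        if not self._ready or self._ready[0][0] > now:
            return None
        _, host = heapq.heappop(self._ready)
        queue = self._queues[host]
        url = queue.popleft()
        self._next_allowed[host] = now + self._delay
        if queue:
            # More work for this host: reschedule it after the politeness delay.
            heapq.heappush(self._ready, (now + self._delay, host))
        return url
```

Each host appears in the heap at most once, and only while it has queued URLs, so a busy domain cannot starve the others and no domain is hit more often than once per delay_seconds.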

At scale, the interesting problems are exactly where you'd expect: distributed deduplication across hundreds of workers (typically a Bloom filter over URL fingerprints, which saves a great deal of memory at the cost of a small false-positive rate, so an occasional unseen URL is wrongly skipped), back-pressure when the frontier grows faster than workers drain it, and freshness scheduling (deciding which pages to recrawl and how often).
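To make the Bloom filter idea concrete, here is a small single-process sketch. It uses the standard sizing formulas and derives its bit positions from one SHA-256 digest by double hashing. The class and parameter names are illustrative, and the distributed part (for example, sharding the filter across workers) is not shown.

```python
import hashlib
import math


class BloomFilter:
    """Set membership with no false negatives and a tunable false-positive rate."""

    def __init__(self, expected_items: int, false_positive_rate: float) -> None:
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.num_bits = max(1, math.ceil(
            -expected_items * math.log(false_positive_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self._bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str) -> list[int]:
        # Double hashing: combine two 64-bit halves of one digest into k positions.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self._bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self._bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


seen = BloomFilter(expected_items=1_000_000, false_positive_rate=0.01)
seen.add("https://example.com/")
print("https://example.com/" in seen)  # True: added items are always reported as present
```

At a 1% false-positive rate the filter needs roughly 9.6 bits per URL, which is why it is attractive compared with storing every fingerprint in full.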