The first move in any interview: define requirements and sketch the API before drawing a single box.
POST /crawl/seed { urls: string[] } → { jobId }
GET /crawl/status/:jobId → { queued, fetched, failed, qps }
GET /content/:urlHash → { url, html, fetchedAt, statusCode }
A web crawler is the engine behind every search index. Google's Googlebot fetches billions of pages a day, and the same fundamental design shows up in SEO tools, price monitors, and academic research scrapers. The core challenge isn't fetching a page; that's trivial. The challenge is doing it at scale without hammering a single website, without fetching the same page twice, and without losing your place when a worker crashes.
The design centers on a URL Frontier: a prioritized, politeness-aware queue of URLs waiting to be crawled. When a fetcher grabs a page, a parser extracts every link and feeds new URLs back into the Frontier, but only after a deduplication check confirms they haven't been seen before. The Frontier also enforces politeness: you wait between requests to the same domain, and you respect robots.txt.
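To make the Frontier concrete, here is a minimal single-process sketch in Python. It is an illustration under assumptions, not a reference design: the class name `Frontier`, the `delay_seconds` parameter, and the explicit `now` argument (passed in so the behavior is easy to test) are all hypothetical. Prioritization and robots.txt handling are left out, and deduplication is an exact-match set that a real system would replace with something more compact.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit


class Frontier:
    """Toy URL frontier: exact-match dedup plus a fixed per-host crawl delay."""

    def __init__(self, delay_seconds=1.0):
        self.delay = delay_seconds
        self.seen = set()        # every URL ever accepted (dedup)
        self.queues = {}         # host -> deque of pending URLs
        self.next_allowed = {}   # host -> earliest time of the next fetch
        self.ready = []          # min-heap of (earliest_fetch_time, host)

    def add(self, url, now):
        """Enqueue url unless it was seen before. Returns True if accepted."""
        if url in self.seen:
            return False
        self.seen.add(url)
        host = urlsplit(url).netloc
        queue = self.queues.setdefault(host, deque())
        if not queue:
            # Host has no pending work, so it is not in the heap yet.
            heapq.heappush(self.ready, (self.next_allowed.get(host, now), host))
        queue.append(url)
        return True

    def next(self, now):
        """Return a URL whose host may be fetched at time `now`, else None."""
        if not self.ready or self.ready[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready)
        url = self.queues[host].popleft()
        # Politeness: this host is off limits until the delay has passed.
        self.next_allowed[host] = now + self.delay
        if self.queues[host]:
            heapq.heappush(self.ready, (self.next_allowed[host], host))
        return url


frontier = Frontier(delay_seconds=2.0)
frontier.add("https://a.example/1", now=0)
frontier.add("https://a.example/2", now=0)
frontier.add("https://b.example/1", now=0)
frontier.add("https://a.example/1", now=0)  # duplicate, rejected
```

The heap holds one entry per host with pending URLs, keyed by the earliest time that host may be contacted again, so a fetcher asking for work never receives two back-to-back URLs for the same site. A parser would call `add` for each extracted link and a fetcher would call `next` in a loop.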
At scale, the interesting problems are exactly where you'd expect: distributed deduplication across hundreds of workers (a Bloom filter over a URL fingerprint set), back-pressure when the frontier grows faster than workers drain it, and deciding which pages to recrawl and how often (freshness scheduling).
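The Bloom filter mentioned above can be sketched in a few lines of Python. This is a minimal in-memory version under assumptions: the class name `BloomFilter`, the use of SHA-256 as the URL fingerprint, and the double-hashing scheme for deriving bit positions are illustrative choices, not something the design prescribes. The sizing uses the standard formulas m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions.

```python
import hashlib
import math


class BloomFilter:
    """Probabilistic set: no false negatives, tunable false-positive rate."""

    def __init__(self, expected_items, false_positive_rate=0.01):
        # Standard sizing for n expected items and target error rate p.
        bits = -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)
        self.num_bits = max(8, math.ceil(bits))
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, url):
        # Fingerprint the URL once, then derive k positions by double hashing.
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the step non-zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url)
        )


seen = BloomFilter(expected_items=1000)
seen.add("https://a.example/1")
print("https://a.example/1" in seen)  # a URL that was added is always found
```

The trade-off is that a false positive makes the crawler skip a URL it has never fetched, so the error rate is a coverage decision as much as a memory one. Sharing the filter across hundreds of workers needs something this sketch does not show, such as partitioning URLs by host hash so each worker owns its own filter, or keeping the bit array in a shared store.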