Reqflow

Web Crawler vs URL Shortener

Web Crawler

URL frontier, politeness, deduplication, distributed fetching.
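Politeness is usually enforced inside the frontier by spacing out requests to the same host. The sketch below is one minimal way to do that, not the design this page describes: the `PoliteFrontier` class, its method names, and the fixed per-host delay are all assumptions made for illustration. Time is passed in explicitly so the behavior is easy to follow.

```python
import heapq
from collections import defaultdict, deque
from urllib.parse import urlsplit


class PoliteFrontier:
    """Hypothetical frontier that waits `delay_seconds` between hits to one host."""

    def __init__(self, delay_seconds: float = 1.0) -> None:
        self.delay = delay_seconds
        self.queues = defaultdict(deque)  # host -> pending URLs
        self.ready_heap = []              # (earliest allowed fetch time, host)
        self.scheduled = set()            # hosts currently present in the heap
        self.next_allowed = {}            # host -> earliest time of next fetch

    def add(self, url: str, now: float) -> None:
        host = urlsplit(url).netloc
        self.queues[host].append(url)
        if host not in self.scheduled:
            # Respect the delay left over from the last fetch to this host.
            ready_at = max(now, self.next_allowed.get(host, now))
            heapq.heappush(self.ready_heap, (ready_at, host))
            self.scheduled.add(host)

    def next_url(self, now: float):
        """Return a URL whose host may be fetched at `now`, or None."""
        if not self.ready_heap or self.ready_heap[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready_heap)
        url = self.queues[host].popleft()
        self.next_allowed[host] = now + self.delay
        if self.queues[host]:
            heapq.heappush(self.ready_heap, (now + self.delay, host))
        else:
            self.scheduled.discard(host)
            del self.queues[host]
        return url


frontier = PoliteFrontier(delay_seconds=2.0)
for u in ("https://a.example/1", "https://a.example/2", "https://b.example/1"):
    frontier.add(u, now=0.0)
first = frontier.next_url(now=0.0)
second = frontier.next_url(now=0.0)
too_soon = frontier.next_url(now=1.0)
third = frontier.next_url(now=2.0)
```

With a 2-second delay, the second URL for `a.example` is held back until time 2.0, while `b.example` can be fetched immediately in between.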

Components (8)

  • Seed URLs
  • URL Frontier
  • Fetcher Service
  • Parser Service
  • Bloom Filter
  • robots.txt Cache
  • Content Store
  • URL Metadata DB
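
The Bloom Filter component most likely backs URL deduplication: a compact "have we seen this URL?" check in front of the frontier. The following is a minimal sketch under that assumption. The `BloomFilter` class, the bit-array size, the hash count, and the `enqueue_if_new` helper are illustrative choices, not values taken from this page.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter using double hashing over one SHA-256 digest."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 7) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def enqueue_if_new(url: str, seen: BloomFilter, frontier: list) -> bool:
    """Append `url` to the frontier unless the filter reports it as seen."""
    if url in seen:
        return False
    seen.add(url)
    frontier.append(url)
    return True


seen = BloomFilter()
frontier = []
added_first = enqueue_if_new("https://example.com/a", seen, frontier)
added_again = enqueue_if_new("https://example.com/a", seen, frontier)
```

A false positive makes the crawler skip a URL it has never fetched, so the filter size trades memory against a small amount of lost coverage.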

Headline numbers

  • Crawl throughput needed: ~385 pages/sec
  • Storage for content: ~20 TB / month
  • URL Frontier queue depth: ~10B entries
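
The crawler headline numbers can be cross-checked with a little arithmetic. The sketch below assumes a 30-day month and decimal terabytes, and treats the content storage figure as covering every crawled page. None of these assumptions is stated on this page. Under them, the throughput works out to about 1 billion pages per month and roughly 20 KB stored per page.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_MONTH = 30            # assumption: 30-day month

pages_per_sec = 385            # headline crawl throughput
storage_per_month_tb = 20      # headline content storage
frontier_entries = 10 * 10**9  # headline frontier depth

pages_per_month = pages_per_sec * SECONDS_PER_DAY * DAYS_PER_MONTH

# Assumption: 1 TB = 10**12 bytes and all stored bytes are page content.
avg_page_bytes = storage_per_month_tb * 10**12 / pages_per_month

# Time to drain the frontier at the headline rate, ignoring newly found URLs.
frontier_drain_days = frontier_entries / pages_per_sec / SECONDS_PER_DAY

print(f"pages per month: {pages_per_month:,}")
print(f"average page size: {avg_page_bytes / 1000:.1f} KB")
print(f"frontier drain time: {frontier_drain_days:.0f} days")
```

At that rate a 10B-entry frontier represents about 300 days of work, which suggests the frontier is expected to hold far more URLs than one crawl cycle fetches.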

URL Shortener

Hashing, key generation, read-heavy caching.

Components (6)

  • Client
  • API Gateway
  • Write Service
  • Read Service
  • Redis
  • Postgres
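
For key generation, one common option is to base62-encode a unique integer ID, for example a database sequence value. This page does not say which scheme the Write Service uses, so the following is only a sketch of that option. The alphabet order and the function names are assumptions.

```python
import string

# Assumed alphabet order: digits, then lowercase, then uppercase (62 symbols).
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)


def encode_base62(n: int) -> str:
    """Encode a non-negative integer ID as a base62 short key."""
    if n < 0:
        raise ValueError("ID must be non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, rem = divmod(n, BASE)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))


def decode_base62(key: str) -> int:
    """Invert encode_base62; raises ValueError on characters outside ALPHABET."""
    n = 0
    for ch in key:
        n = n * BASE + ALPHABET.index(ch)
    return n


short_key = encode_base62(123_456_789)
round_trip = decode_base62(short_key)
```

Sequential IDs produce guessable keys. If that matters, a random or hashed key with a collision check is the usual alternative.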

Headline numbers

  • Write QPS (avg): ~1,200/sec
  • Read QPS (avg): ~120,000/sec
  • Storage per year: ~5 TB
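
These figures imply a 100:1 read-to-write ratio, which is why the design leans on caching. The sketch below works through the rest of the arithmetic, assuming a 365-day year and decimal terabytes, and assuming the yearly storage is spent entirely on link records. These assumptions are not stated on this page. It also estimates the shortest base62 key that can cover one year of writes.

```python
SECONDS_PER_YEAR = 86_400 * 365  # assumption: 365-day year

writes_per_sec = 1_200
reads_per_sec = 120_000
storage_per_year_tb = 5

read_write_ratio = reads_per_sec // writes_per_sec
writes_per_year = writes_per_sec * SECONDS_PER_YEAR

# Assumption: 1 TB = 10**12 bytes, all of it spent on link records.
bytes_per_record = storage_per_year_tb * 10**12 / writes_per_year


def min_key_length(num_keys: int, alphabet_size: int = 62) -> int:
    """Smallest key length whose key space holds at least `num_keys` keys."""
    length = 1
    while alphabet_size ** length < num_keys:
        length += 1
    return length


print(f"read:write ratio: {read_write_ratio}:1")
print(f"new links per year: {writes_per_year:,}")
print(f"bytes per record: {bytes_per_record:.0f}")
print(f"min base62 key length for one year: {min_key_length(writes_per_year)}")
```

Six base62 characters cover roughly 56.8 billion keys, enough for one year at this write rate. A seventh character extends the key space to about 3.5 trillion.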

Key differences

Only in Web Crawler
  • URL Frontier
  • Content Store
In both
  • Client
  • Service
  • Cache
  • Database
Only in URL Shortener
  • API Gateway

Flow shape

Web Crawler flows
  • Crawl a page: 8 steps
  • Fetcher worker crashes: 4 steps
URL Shortener flows
  • Shorten a URL: 3 steps
  • Resolve short URL (cache hit): 3 steps
  • Resolve short URL (cache miss): 4 steps
  • Redis is down: 4 steps
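
The three resolve flows (cache hit, cache miss, Redis down) match the usual cache-aside pattern with a database fallback. This page lists the flows but not their individual steps, so the sketch below is a guess at the shape rather than the documented sequence. `InMemoryCache` stands in for Redis and a plain dict stands in for Postgres; both, along with the `resolve` function, are hypothetical.

```python
class CacheUnavailable(Exception):
    """Raised by the stand-in cache when it is simulated as down."""


class InMemoryCache:
    """Hypothetical stand-in for Redis with an on/off switch."""

    def __init__(self) -> None:
        self._data = {}
        self.available = True

    def get(self, key: str):
        if not self.available:
            raise CacheUnavailable
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        if not self.available:
            raise CacheUnavailable
        self._data[key] = value


def resolve(short_key: str, cache: InMemoryCache, db: dict):
    """Return (long_url, path_taken) using cache-aside with a DB fallback."""
    cache_ok = True
    try:
        cached = cache.get(short_key)
        if cached is not None:
            return cached, "cache hit"
    except CacheUnavailable:
        cache_ok = False  # degrade to database-only reads

    long_url = db.get(short_key)
    if long_url is None:
        return None, "not found"

    if cache_ok:
        try:
            cache.set(short_key, long_url)  # populate for later reads
        except CacheUnavailable:
            pass
    return long_url, ("cache miss" if cache_ok else "cache down")


db = {"abc123": "https://example.com/some/long/path"}
cache = InMemoryCache()
miss = resolve("abc123", cache, db)
hit = resolve("abc123", cache, db)
cache.available = False
down = resolve("abc123", cache, db)
```

When the cache is down every read falls through to the database, so at the headline read rate the database needs protection of its own, such as rate limiting or read replicas.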