DNS (Domain Name System)

The distributed phone book that turns hostnames into IP addresses.

DNS turns a name into an IP by walking a hierarchy: root points to the .com servers, which point to example.com's nameserver, which holds the real address. Resolvers cache the answer so the next lookup skips the whole walk, which is why DNS changes take time to propagate.

Plain English: computers route by numbers (IP addresses), but humans type names (google.com). DNS is the global lookup system that translates between them. Nearly every site visit starts with a DNS lookup.

Used in: Netflix, Yelp

What it is

A hierarchical, distributed key-value store mapping human-readable names (api.example.com) to IP addresses. Lookups go through a chain: local cache → resolver → root → TLD (.com) → authoritative server.
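The chain above can be sketched as a toy walk through the hierarchy. This is a minimal illustration, not a real DNS client: the server names (`tld-com`, `ns-example`) and the address are hypothetical, and each "server" is just a dictionary that either refers the caller onward or holds the final answer.

```python
# Toy model of the DNS hierarchy. Names and addresses are illustrative only
# (192.0.2.0/24 is reserved for documentation), not real DNS data.
ROOT = {"com.": "tld-com"}                                # root knows who serves .com
SERVERS = {
    "tld-com": {"example.com.": "ns-example"},            # .com knows example.com's nameserver
    "ns-example": {"api.example.com.": "192.0.2.10"},     # authoritative server holds the address
}


def resolve(name: str) -> tuple[str, list[str]]:
    """Walk root -> TLD -> authoritative; return (address, servers asked)."""
    labels = name.rstrip(".").split(".")
    tld = labels[-1] + "."
    zone = ".".join(labels[-2:]) + "."
    fqdn = ".".join(labels) + "."

    path = ["root"]
    tld_server = ROOT[tld]                     # referral 1: root -> TLD servers
    path.append(tld_server)
    auth_server = SERVERS[tld_server][zone]    # referral 2: TLD -> authoritative
    path.append(auth_server)
    address = SERVERS[auth_server][fqdn]       # final answer from the authoritative server
    return address, path


address, path = resolve("api.example.com")
print(address, path)
```

Each step returns a referral rather than the answer, which is why an uncached lookup costs several round trips.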

The problem it solves

Users type names; computers route to IPs. DNS bridges the gap at internet scale, and thanks to caching most lookups are answered in milliseconds.

How it works

Your OS asks a recursive resolver (often your ISP's, or a public one such as 8.8.8.8). The resolver checks its cache; on a miss, it walks the hierarchy. The result is cached for the TTL specified in the DNS record. Record types: A (IPv4 address), AAAA (IPv6 address), CNAME (alias to another name), MX (mail server), TXT (arbitrary text, often used for verification), NS (delegates a zone to a nameserver).
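The cache-then-walk behaviour can be sketched as a small TTL cache in front of an upstream lookup. This is a minimal sketch under stated assumptions: `CachingResolver`, its `lookup` callback, and the `RECORDS` table are hypothetical names invented for illustration, and the clock is injectable so that expiry can be shown without waiting.

```python
import time
from typing import Callable


class CachingResolver:
    """Caches answers per (name, record type) until their TTL expires."""

    def __init__(self, lookup: Callable[[str, str], tuple[str, int]],
                 clock: Callable[[], float] = time.monotonic):
        self._lookup = lookup        # hypothetical upstream: returns (value, ttl_seconds)
        self._clock = clock
        self._cache: dict[tuple[str, str], tuple[str, float]] = {}
        self.upstream_queries = 0    # how often we had to walk the hierarchy

    def resolve(self, name: str, rtype: str = "A") -> str:
        key = (name, rtype)
        now = self._clock()
        hit = self._cache.get(key)
        if hit is not None and hit[1] > now:
            return hit[0]                          # fresh cache entry: no upstream query
        value, ttl = self._lookup(name, rtype)     # miss or expired: ask upstream
        self.upstream_queries += 1
        self._cache[key] = (value, now + ttl)      # remember until the TTL runs out
        return value


# Illustrative records: (name, type) -> (value, TTL in seconds).
RECORDS = {
    ("example.com", "A"): ("192.0.2.1", 300),
    ("example.com", "AAAA"): ("2001:db8::1", 300),
}

now = [0.0]                                        # fake clock we can advance by hand
resolver = CachingResolver(lambda n, t: RECORDS[(n, t)], clock=lambda: now[0])
resolver.resolve("example.com")                    # miss: goes upstream
resolver.resolve("example.com")                    # hit: served from cache
now[0] = 301.0                                     # the 300 s TTL has elapsed
resolver.resolve("example.com")                    # expired: goes upstream again
print(resolver.upstream_queries)                   # 2
```

The cache key includes the record type, so an A answer never satisfies an AAAA query for the same name.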

Why use it

  • Distributed, cacheable, decades of operational experience
  • Lets you change IPs without changing what users type
  • Can do basic load balancing (multiple A records, round-robin)
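The round-robin point in the last bullet can be illustrated with a short sketch. It assumes a server that rotates the record set by one position per query and clients that try the first address they receive; the addresses and the `round_robin` helper are hypothetical.

```python
from typing import Iterator

# Hypothetical A records for one name (documentation addresses).
A_RECORDS = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]


def round_robin(records: list[str]) -> Iterator[list[str]]:
    """Yield the full record set, rotated by one position per query."""
    start = 0
    while True:
        yield records[start:] + records[:start]
        start = (start + 1) % len(records)


answers = round_robin(A_RECORDS)
# Clients that take the first address end up spread across all three servers.
first_choices = [next(answers)[0] for _ in range(6)]
print(first_choices)
```

This spreads load only roughly: the server cannot see how many users sit behind each caching resolver, and it does not know whether a given address is healthy.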

What it costs you

  • TTL caching means changes take time to propagate globally
  • DNS-based failover is slow (TTL-bound); for fast failover, use anycast IPs or an actual load balancer
  • Classic DNS runs over unencrypted UDP by default, so queries are visible to anyone on the path; DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) exist for privacy
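The TTL costs in the first two bullets come down to simple arithmetic. The sketch below is a back-of-the-envelope model, not a measurement: it assumes every resolver honors the TTL and re-fetches exactly once per TTL, and the function names and input numbers are hypothetical.

```python
def worst_case_failover_seconds(ttl_s: int, detection_s: int) -> int:
    """Upper bound on how long a client may keep using a dead IP: the time to
    detect the failure and update the record, plus one full TTL for a resolver
    that cached the old answer just before the change."""
    return detection_s + ttl_s


def upstream_queries_per_day(resolvers: int, ttl_s: int) -> int:
    """Approximate authoritative queries per day for one record if every
    resolver re-fetches once per TTL (assumes constant demand)."""
    return resolvers * (86_400 // ttl_s)


for ttl in (60, 86_400):
    print(
        f"TTL {ttl:>6}s: failover up to {worst_case_failover_seconds(ttl, 30)}s, "
        f"{upstream_queries_per_day(10_000, ttl):,} queries/day"
    )
```

With these assumed inputs, a 60-second TTL bounds failover at 90 seconds but costs 14,400,000 queries a day, while a one-day TTL costs 10,000 queries a day but can leave clients on a dead address for about a day. Resolvers that ignore or clamp TTLs make the real window less predictable.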

Where it shows up in our architectures

  • Netflix

    DNS routes users to the nearest Open Connect appliance; geo-aware DNS is part of the CDN

  • Yelp

    DNS-based geographic routing to the nearest API region

Gotchas

  • Low TTLs (60s) give you fast change-over but multiply lookup traffic. High TTLs (1 day) cache well but slow your ability to react.
  • Don't rely on DNS for hard failover, because caches lie. Use a load balancer or anycast for fast failover.
  • GeoDNS picks the region from the resolver's IP, not the user's. The two usually sit close together, but not always (for example, a user on a distant public resolver).
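The GeoDNS gotcha in the last bullet can be made concrete with a sketch of a geo-aware answer function. Everything here is an assumption for illustration: the network-to-region table, the region names, and the addresses (all documentation ranges) are hypothetical, and real GeoDNS services use far larger geolocation databases.

```python
import ipaddress

# Hypothetical mapping from resolver networks to regions.
REGION_BY_NETWORK = {
    ipaddress.ip_network("192.0.2.0/24"): "us-east",
    ipaddress.ip_network("198.51.100.0/24"): "eu-west",
}
REGION_ADDRESS = {"us-east": "203.0.113.10", "eu-west": "203.0.113.20"}
DEFAULT_REGION = "us-east"


def geo_answer(resolver_ip: str) -> str:
    """Pick the regional address from the IP that sent the query.

    The authoritative server sees the resolver's address, not the end user's.
    """
    ip = ipaddress.ip_address(resolver_ip)
    for network, region in REGION_BY_NETWORK.items():
        if ip in network:
            return REGION_ADDRESS[region]
    return REGION_ADDRESS[DEFAULT_REGION]


# A user in Europe whose resolver sits in the US range is sent to us-east.
print(geo_answer("192.0.2.53"))
# The same user on a European resolver is sent to eu-west.
print(geo_answer("198.51.100.53"))
```

The function never receives the user's own address, which is exactly the mismatch the bullet describes.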

When this went wrong in production

Facebook locks itself out of its own datacenter · 2021

A faulty backbone maintenance command led Facebook to withdraw its own BGP routes, taking it off the internet for about six hours and reportedly breaking badge access too.

A routine command intended to assess global backbone capacity was issued, and a bug in Facebook's audit tool failed to stop it. The command disconnected the backbone links between Facebook's datacenters. Facebook's DNS servers are designed to withdraw their BGP advertisements when they cannot reach the datacenters, so they pulled their own routes and the rest of the internet could no longer resolve any Facebook name. Worse, the internal tools engineers would normally use to investigate and roll back depended on the same DNS and network, and reports at the time said badge access at the datacenters failed too. Engineers couldn't connect remotely or reach the management plane to roll back, so recovery required sending engineers on-site. The lesson: never let your control plane depend on your data plane. Out-of-band access has to actually be out-of-band.

Google Cloud networking failure: 4 hours, 3 regions · 2019

A config push to the backbone control plane caused packet loss across three GCP regions for four hours.

Google pushed a config update to the network control plane managing inter-region backbone routing. The config included software that consumed far more memory than expected under production conditions, causing the control plane to crash on a large fraction of routers. Each restarting router needed to re-establish BGP peering, which consumed network capacity. Restarting routers and network traffic competing for bandwidth created a feedback loop: routers trying to recover caused more congestion, which slowed recovery further. Three GCP regions (us-east1, us-central1, europe-west1) experienced 30-87% packet loss for services using the Google backbone. The lesson: stage control plane changes and validate memory/resource usage before the push. A control plane change should never be able to create a data plane feedback loop.

Interview angle

DNS comes up in global system design and in 'how does a request reach your server?' questions. The thing interviewers want to hear is that DNS has TTL-based caching, which means changes are slow to propagate and you cannot use DNS alone for fast failover. Show you know the difference between DNS-based routing (GeoDNS for global load balancing) and load-balancer routing (for fast failover within a region). Candidates lose points by treating DNS as a simple lookup that resolves instantly.
