Operations · 3 min read

Distributed Tracing

Stitch one request's journey across many services into a single timeline so you can see exactly where the time went.

One request touches many services. Distributed tracing tags it with a trace ID and records a timed span at each hop, then stitches the spans into a waterfall. Instead of guessing which service is slow, you see it. In a request that passes through a gateway, auth, orders, and a database, for example, the waterfall might show that the database is the bottleneck.

First time reading this? Start here

Plain English: when one user request bounces through ten microservices, distributed tracing tags it with a shared ID and records how long each hop took. The result is a waterfall chart showing exactly which service made the request slow, instead of guessing.

Used in: Netflix, Uber, Payment Gateway

What it is

A technique for tracking a single request as it propagates through a distributed system. A trace is the whole request's journey; it's made of spans, one per unit of work (a service call, a DB query), each with a start/end time, parent link, and metadata. A trace ID is propagated across every hop so the pieces can be reassembled into one end-to-end timeline.
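To make the trace and span vocabulary concrete, here is a minimal sketch of a span record and a function that arranges spans into an indented waterfall. The `Span` class, the `render_waterfall` helper, the field names, and the timings are all illustrative assumptions, not the data model of any particular tracing backend.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    trace_id: str              # shared by every span in the same request
    span_id: str               # unique per unit of work
    parent_id: Optional[str]   # None marks the root span
    name: str
    start_ms: float
    end_ms: float
    attributes: dict = field(default_factory=dict)  # free-form metadata

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def render_waterfall(spans: List[Span]) -> List[str]:
    """Group spans by parent, then walk the tree depth-first from the root."""
    children: Dict[Optional[str], List[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines: List[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            indent = "  " * depth
            lines.append(
                f"{indent}{span.name} {span.start_ms:.0f}-{span.end_ms:.0f} ms "
                f"({span.duration_ms:.0f} ms)"
            )
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


# Hypothetical timings for one request, chosen only for illustration.
spans = [
    Span("t1", "a", None, "gateway", 0, 800),
    Span("t1", "b", "a", "auth", 10, 40),
    Span("t1", "c", "a", "orders", 50, 790),
    Span("t1", "d", "c", "database", 70, 770, {"query": "SELECT ..."}),
]
waterfall = render_waterfall(spans)
for line in waterfall:
    print(line)
```

Because the walk starts at the root, a span whose parent was never reported is unreachable and silently drops out of the output. That is the orphan problem that broken propagation causes.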

The problem it solves

In microservices, one user action triggers a cascade of internal calls. When it's slow or failing, per-service metrics tell you each service's health but not how they compose for this request. Distributed tracing reconstructs the exact path and timing across all services, so you can see that the 800ms latency was 700ms waiting on one downstream call, turning 'somewhere it's slow' into 'here it's slow.'
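The "800ms total, 700ms in one downstream call" arithmetic can be sketched as a self-time calculation: each span's duration minus the time spent in its direct children. The span values and the `self_times` helper below are hypothetical, and the sketch assumes that children of the same parent run one after another without overlapping.

```python
def self_times(spans):
    """Return {span name: self time in ms}.

    Self time = own duration minus the summed duration of direct children.
    Assumes sibling spans do not overlap, which is a simplification.
    """
    child_total = {}
    for s in spans:
        if s["parent"] is not None:
            duration = s["end"] - s["start"]
            child_total[s["parent"]] = child_total.get(s["parent"], 0) + duration
    return {
        s["name"]: (s["end"] - s["start"]) - child_total.get(s["id"], 0)
        for s in spans
    }


# Hypothetical spans for one 800 ms request (times in ms).
request_spans = [
    {"id": "a", "parent": None, "name": "gateway", "start": 0, "end": 800},
    {"id": "b", "parent": "a", "name": "auth", "start": 10, "end": 40},
    {"id": "c", "parent": "a", "name": "orders", "start": 50, "end": 790},
    {"id": "d", "parent": "c", "name": "database", "start": 70, "end": 770},
]

times = self_times(request_spans)
slowest = max(times, key=times.get)
print(times, slowest)
```

With these numbers the gateway and orders spans look slow by total duration, but almost all of their time is spent waiting on children. The self-time view attributes 700 of the 800 ms to the database span.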

How it works

When a request enters the system, a trace ID and root span are created. Each service that handles it creates a child span (with a span ID and parent ID) and propagates the trace context (typically via HTTP headers like W3C traceparent, or message metadata) to every downstream call. Each service reports its spans to a tracing backend (Jaeger, Tempo, Zipkin), which joins them by trace ID into a waterfall. Sampling decides which traces to keep; OpenTelemetry is the standard instrumentation layer.
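The propagation step can be sketched by hand with the W3C `traceparent` header, whose value has the form `version-traceid-parentid-flags` (2, 32, 16, and 2 lowercase hex characters). In practice an OpenTelemetry SDK would normally do this for you; the `parse_traceparent` and `start_span` helpers below are hypothetical names, and the validation covers only part of what the spec requires.

```python
import re
import secrets
from typing import Optional, Tuple

_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(value: str) -> Optional[dict]:
    """Parse a traceparent value; return None if it is malformed."""
    match = _TRACEPARENT.fullmatch(value.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid in the W3C format.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "flags": flags}


def start_span(incoming_headers: dict) -> Tuple[dict, dict]:
    """Create a span for this hop and the headers to send downstream.

    Assumes header names are already lowercased, a simplification.
    """
    ctx = parse_traceparent(incoming_headers.get("traceparent", ""))
    span_id = secrets.token_hex(8)            # 16 hex chars
    if ctx is None:
        # No usable context: this hop becomes the root of a new trace.
        trace_id, parent_id, flags = secrets.token_hex(16), None, "01"
    else:
        trace_id, parent_id, flags = ctx["trace_id"], ctx["parent_id"], ctx["flags"]
    span = {"trace_id": trace_id, "span_id": span_id, "parent_id": parent_id}
    outgoing = {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}
    return span, outgoing


# Two hops: the entry point starts the trace, the next service continues it.
root_span, headers_to_next = start_span({})
child_span, _ = start_span(headers_to_next)
print(root_span, child_span)
```

The child keeps the trace ID it received and records the caller's span ID as its parent. That pair is what lets the backend join spans reported by different services into one tree.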

Why use it

Pinpoints exactly which service/hop owns the latency or error in a multi-service request
Reveals the real call graph and dependencies, including surprise N+1 fan-outs
OpenTelemetry standardizes instrumentation across languages and backends

What it costs you

Context must be propagated through every hop, and one un-instrumented service breaks the trace into disconnected pieces
Full-fidelity tracing is expensive at scale; sampling is required and can miss the rare bad trace if done naively
Async/queued hops (a message sits in a queue for minutes) make spans awkward, so you must propagate context through the message
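For the queued-hop cost above, a rough sketch of carrying trace context inside the message itself. It uses an in-process `queue.Queue` as a stand-in for a real broker; the message envelope, the `publish` and `consume` helpers, and the `queue_wait_s` field are all assumptions made for illustration.

```python
import json
import queue
import secrets
import time


def publish(q: queue.Queue, payload: dict, trace_id: str, span_id: str) -> None:
    """Producer: put the trace context in message metadata next to the payload."""
    message = {
        "headers": {"traceparent": f"00-{trace_id}-{span_id}-01"},
        "enqueued_at": time.time(),
        "payload": payload,
    }
    q.put(json.dumps(message))


def consume(q: queue.Queue) -> dict:
    """Consumer: read the context back and start a span in the same trace."""
    message = json.loads(q.get())
    _version, trace_id, parent_id, _flags = message["headers"]["traceparent"].split("-")
    return {
        "trace_id": trace_id,
        "span_id": secrets.token_hex(8),
        "parent_id": parent_id,
        # Wall-clock difference; across hosts this is subject to clock skew.
        "queue_wait_s": time.time() - message["enqueued_at"],
        "payload": message["payload"],
    }


broker = queue.Queue()
producer_trace_id = secrets.token_hex(16)
producer_span_id = secrets.token_hex(8)
publish(broker, {"order_id": 42}, producer_trace_id, producer_span_id)
consumer_span = consume(broker)
print(consumer_span)
```

Recording the enqueue time separately makes the time spent waiting in the queue visible on its own, instead of it appearing as one long unexplained consumer span.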

Where it shows up in our architectures

Netflix: Tracing across hundreds of services is the only practical way to localize a slow API call to the responsible downstream
Uber: A trace follows a ride request through dispatch, geo-index, and trip services to show which tier added latency during a surge
Payment Gateway: A charge's trace spans the gateway, fraud check, ledger write, and async webhook so a stuck payment can be located precisely

Gotchas

A single un-instrumented service (or one that drops the trace headers) breaks the chain, and downstream spans become orphans. Propagation must be end-to-end to be useful.
Tail-based sampling beats head-based for debugging: decide whether to keep a trace after you know it was slow or errored, not at the start when every request looks the same.
Tracing across async boundaries (queues, batch jobs) requires explicitly carrying the trace context inside the message; it won't flow automatically like it does over HTTP.
Clock skew between hosts can make a child span look like it started before its parent. Some tracing backends adjust for this, but don't trust raw cross-host timestamps to the microsecond.
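The tail-based sampling gotcha can be sketched as a decision made once all of a trace's spans have been collected: always keep traces that errored or were slow, and keep only a small random share of the rest. The `keep_trace` function, the 500 ms threshold, and the 1% baseline rate are hypothetical choices, not defaults from any real collector.

```python
import random


def keep_trace(spans, latency_threshold_ms=500.0, baseline_rate=0.01,
               rng=random.random):
    """Decide whether to retain a finished trace.

    spans: list of dicts with "start_ms", "end_ms", and optional "error".
    rng is injectable so the decision can be made deterministic in tests.
    """
    if any(s.get("error") for s in spans):
        return True                      # always keep failures
    start = min(s["start_ms"] for s in spans)
    end = max(s["end_ms"] for s in spans)
    if end - start >= latency_threshold_ms:
        return True                      # always keep slow requests
    return rng() < baseline_rate         # sample a little normal traffic


fast_ok = [{"start_ms": 0, "end_ms": 40}]
slow_ok = [{"start_ms": 0, "end_ms": 800}, {"start_ms": 70, "end_ms": 770}]
fast_error = [{"start_ms": 0, "end_ms": 25, "error": True}]

decisions = {
    "fast_ok": keep_trace(fast_ok, rng=lambda: 0.5),
    "slow_ok": keep_trace(slow_ok, rng=lambda: 0.5),
    "fast_error": keep_trace(fast_error, rng=lambda: 0.5),
}
print(decisions)
```

A head-based sampler would have to make the same call before any of these spans existed. The trade-off is that tail-based sampling has to buffer every span of a trace until that trace is judged complete.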

When this went wrong in production

Google Docs deletes documents for 0.001% of users · 2023

A storage migration bug silently deleted the document content for a small fraction of Google Docs users.

During a backend storage migration for Google Drive, a race condition in the migration code permanently deleted document contents for a small fraction of users, roughly 0.001% of the user base, which still represents hundreds of thousands of documents. The deletion was silent: Google Drive kept showing the document title and metadata, but opening the document showed blank content. Users didn't realize the content was gone right away, which delayed support tickets. Recovery was partial. Google's cross-region replication had the content, but the deletion had already propagated before the replication lag resolved, making some recent edits unrecoverable. The lesson: data migrations must be zero-destructive. Write, then verify, then delete, with a flag that can be rolled back. Replication protects against node failure, not application-layer bugs that replicate the bug to every replica.

Interview angle

Distributed tracing is a great concept to mention proactively when you propose a microservices architecture, because it shows you're thinking about operational reality, not just the happy path. The key thing to say is that you'd use OpenTelemetry for instrumentation and propagate the trace context (the trace ID plus the parent span ID, for example in a W3C traceparent header) across every service hop, and inside message metadata for async hops. Candidates who skip tracing in a multi-service design can come across as never having debugged a production incident in a distributed system.
