Operations·3 min read

Distributed Tracing

Stitch one request's journey across many services into a single timeline so you can see exactly where the time went.

First time reading this? Start here

Plain English: when one user request bounces through ten microservices, distributed tracing tags it with a shared ID and records how long each hop took. The result is a waterfall chart that shows which service made the request slow, so you don't have to guess.

Used in: Netflix, Uber, Payment Gateway
What it is

A technique for tracking a single request as it propagates through a distributed system. A trace is the whole request's journey. It is made of spans, one per unit of work (a service call, a DB query), each with start and end times, a link to its parent span, and metadata. A trace ID is propagated across every hop so the pieces can be reassembled into one end-to-end timeline.
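
As a rough illustration of the trace and span vocabulary above, here is a minimal Python sketch. The Span class, its field names, the children_by_parent helper, and the sample services and timings are all hypothetical, chosen for this example rather than taken from any tracing library.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str             # shared by every span in the same request
    span_id: str              # unique to this unit of work
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    end_ms: int
    attributes: dict = field(default_factory=dict)  # free-form metadata

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def children_by_parent(spans: list[Span]) -> dict[Optional[str], list[Span]]:
    """Group spans by parent_id, ordered by start time, so the tree can be walked from the root."""
    tree: dict[Optional[str], list[Span]] = {}
    for span in sorted(spans, key=lambda s: s.start_ms):
        tree.setdefault(span.parent_id, []).append(span)
    return tree


# Made-up spans for one request; every span carries the same trace ID.
spans = [
    Span("t1", "c", "a", "payment-service", 45, 115),
    Span("t1", "a", None, "GET /checkout", 0, 120),
    Span("t1", "d", "c", "db: insert charge", 60, 100, {"db.system": "postgresql"}),
    Span("t1", "b", "a", "cart-service", 5, 40),
]

tree = children_by_parent(spans)
root = tree[None][0]
```

Walking tree from root, and indenting each child under its parent, gives the end-to-end timeline that a tracing UI draws as a waterfall.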

The problem it solves

In microservices, one user action triggers a cascade of internal calls. When that action is slow or failing, per-service metrics tell you each service's health but not how the services compose for this particular request. Distributed tracing reconstructs the exact path and timing across all services, so you can see, for example, that 700 ms of an 800 ms request was spent waiting on one downstream call. That turns 'somewhere it's slow' into 'here it's slow.'
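
To make the latency arithmetic concrete, the sketch below computes each span's self time, meaning its own duration minus the time covered by its direct children. The span tuples, service names, and timings are invented for illustration, and the calculation assumes child spans run one after another without overlapping.

```python
from collections import defaultdict

# (span_id, parent_id, name, start_ms, end_ms); all values are made up.
SPANS = [
    ("s1", None, "api-gateway", 0, 800),
    ("s2", "s1", "order-service", 20, 780),
    ("s3", "s2", "inventory-service", 50, 750),
]


def self_times(spans):
    """Return {span name: ms spent in that span but not in its direct children}."""
    child_total = defaultdict(int)
    for _, parent_id, _, start, end in spans:
        if parent_id is not None:
            child_total[parent_id] += end - start
    return {
        name: (end - start) - child_total[span_id]
        for span_id, _, name, start, end in spans
    }


times = self_times(SPANS)
slowest = max(times, key=times.get)
print(times)    # {'api-gateway': 40, 'order-service': 60, 'inventory-service': 700}
print(slowest)  # inventory-service
```

Here the gateway and the order service each look slow from the outside (800 ms and 760 ms), but almost all of that time is inherited from the single inventory call.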

How it works

When a request enters the system, a trace ID and a root span are created. Each service that handles the request creates a child span (with its own span ID and its parent's span ID) and propagates the trace context to every downstream call, typically via HTTP headers such as W3C traceparent, or via message metadata. Each service reports its spans to a tracing backend (such as Jaeger, Tempo, or Zipkin), which joins them by trace ID into a waterfall. Sampling decides which traces to keep. OpenTelemetry is the standard instrumentation layer.
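
The propagation step can be sketched by hand in Python. The header layout follows the W3C traceparent format (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags), but start_span and outgoing_headers are hypothetical helpers written for this example. The sketch assumes lowercase header keys, handles only version 00, and marks every new trace as sampled. In practice an OpenTelemetry SDK would normally do this for you.

```python
import re
import secrets

# version 00, 32-hex trace ID, 16-hex parent span ID, 2-hex trace flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def start_span(incoming_headers: dict) -> dict:
    """Continue the caller's trace if a valid traceparent is present, else start a new trace."""
    match = TRACEPARENT_RE.match(incoming_headers.get("traceparent", ""))
    # All-zero IDs are invalid in the W3C format, so treat them as missing.
    if match and set(match.group(1)) != {"0"} and set(match.group(2)) != {"0"}:
        trace_id, parent_id, flags = match.groups()
    else:
        # Assumption for this sketch: every new trace is sampled (flags 01).
        trace_id, parent_id, flags = secrets.token_hex(16), None, "01"
    return {
        "trace_id": trace_id,
        "span_id": secrets.token_hex(8),  # new 16-hex ID for this unit of work
        "parent_id": parent_id,
        "flags": flags,
    }


def outgoing_headers(span: dict) -> dict:
    """Headers to attach to a downstream call; the current span becomes the parent."""
    return {"traceparent": f"00-{span['trace_id']}-{span['span_id']}-{span['flags']}"}


# Service A receives a request with no trace context and creates the root span.
root = start_span({})
# Service B receives A's outgoing headers and creates a child span in the same trace.
child = start_span(outgoing_headers(root))
```

Both spans share one trace ID, and the child's parent ID equals the root's span ID, which is the information a backend needs to join them into a single waterfall.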

Why use it
What it costs you
Where it shows up in our architectures
Gotchas
