Stitch one request's journey across many services into a single timeline so you can see exactly where the time went.
Trace one request across services. The waterfall shows where time goes.
One request touches many services. Distributed tracing tags it with a trace ID and records a timed span at each hop, then stitches them into a waterfall. Instead of guessing which service is slow, you see it; here, the database is the bottleneck.
Plain English: when one user request bounces through ten microservices, distributed tracing tags it with a shared ID and records how long each hop took. The result is a waterfall chart that shows exactly which service made the request slow, so you no longer have to guess.
A technique for tracking a single request as it propagates through a distributed system. A trace is the whole request's journey; it's made of spans, one per unit of work (a service call, a DB query), each with a start/end time, parent link, and metadata. A trace ID is propagated across every hop so the pieces can be reassembled into one end-to-end timeline.
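A minimal sketch of the trace and span model described above, using plain Python rather than any real tracing library. The `Span` fields, the `assemble_trace` and `render` helpers, and the sample data are all hypothetical names chosen for illustration; real backends store more fields and handle missing or late spans.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One unit of work inside a trace (illustrative field names)."""
    trace_id: str               # shared by every span of the same request
    span_id: str                # unique within the trace
    parent_id: Optional[str]    # None marks the root span
    name: str
    start_ms: float
    end_ms: float
    attributes: dict = field(default_factory=dict)  # free-form metadata


def assemble_trace(spans, trace_id):
    """Pick out one trace's spans and index children by their parent ID."""
    root = None
    children = {}
    own = [s for s in spans if s.trace_id == trace_id]
    for span in sorted(own, key=lambda s: s.start_ms):
        if span.parent_id is None:
            root = span
        else:
            children.setdefault(span.parent_id, []).append(span)
    return root, children


def render(span, children, depth=0):
    """Return indented text lines, one per span, in parent-first order."""
    lines = [f"{'  ' * depth}{span.name} [{span.start_ms:.0f}-{span.end_ms:.0f} ms]"]
    for child in children.get(span.span_id, []):
        lines.extend(render(child, children, depth + 1))
    return lines


# Spans arrive out of order and mixed with other traces (made-up data).
SPANS = [
    Span("t1", "c", "b", "db.query", 40, 740, {"table": "orders"}),
    Span("t1", "a", None, "GET /checkout", 0, 800),
    Span("t1", "b", "a", "orders.get", 20, 760),
    Span("t2", "x", None, "GET /health", 0, 5),
]

root, children = assemble_trace(SPANS, "t1")
TIMELINE = render(root, children)
print("\n".join(TIMELINE))
```

The point of the sketch is that the trace ID groups the spans and the parent links order them; nothing else is needed to rebuild the end-to-end timeline.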
In microservices, one user action triggers a cascade of internal calls. When it's slow or failing, per-service metrics tell you each service's health but not how they compose for this request. Distributed tracing reconstructs the exact path and timing across all services, so you can see that the 800ms latency was 700ms waiting on one downstream call, turning 'somewhere it's slow' into 'here it's slow.'
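The 800ms example can be worked through as a small calculation. This is a sketch under stated assumptions: the service names and timings are invented, and child spans are assumed to run one after another (no overlap), so a span's "self time" is its duration minus the time spent in its direct children.

```python
from collections import defaultdict

# (span_id, parent_id, name, start_ms, end_ms) -- hypothetical sample data
SPANS = [
    ("a", None, "api-gateway", 0, 800),
    ("b", "a", "auth-service", 10, 40),
    ("c", "a", "inventory-service", 50, 750),
]


def self_times(spans):
    """Map span name -> time spent in that span excluding its direct children.

    Assumes children of the same parent do not overlap in time.
    """
    child_total = defaultdict(float)
    for span_id, parent_id, name, start, end in spans:
        if parent_id is not None:
            child_total[parent_id] += end - start
    return {
        name: (end - start) - child_total[span_id]
        for span_id, parent_id, name, start, end in spans
    }


SELF = self_times(SPANS)
SLOWEST = max(SELF, key=SELF.get)
print(SELF)
print("most time spent in:", SLOWEST)
```

With this data the gateway's own work accounts for 70ms of the 800ms total, while 700ms is spent inside the single downstream call, which is the distinction per-service dashboards alone do not make for an individual request.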
When a request enters the system, a trace ID and root span are created. Each service that handles it creates a child span (with its own span ID and the caller's span ID as parent ID) and propagates the trace context (typically via HTTP headers such as the W3C traceparent header, or via message metadata) to every downstream call. Each service reports its spans to a tracing backend (Jaeger, Tempo, Zipkin), which joins them by trace ID into a waterfall. Because storing every trace is expensive at high traffic, sampling decides which traces to keep; OpenTelemetry is the de facto standard instrumentation layer.
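The propagation step can be sketched with the standard library alone. The header layout below follows version 00 of the W3C traceparent format (`version-traceid-parentid-flags`, lowercase hex, with the lowest flag bit meaning "sampled"); the function names and the dict-based context are hypothetical, and a production service would normally rely on an OpenTelemetry SDK rather than hand-rolled parsing.

```python
import re
import secrets

# Version 00 layout: 2 hex version, 32 hex trace ID, 16 hex parent span ID, 2 hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_root_context(sampled=True):
    """Start a trace at the edge of the system: fresh trace ID and root span ID."""
    return {
        "trace_id": secrets.token_hex(16),  # 16 bytes -> 32 hex characters
        "span_id": secrets.token_hex(8),    # 8 bytes -> 16 hex characters
        "parent_id": None,
        "sampled": sampled,
    }


def to_traceparent(ctx):
    """Serialize the current span's context for an outgoing call."""
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{ctx['span_id']}-{flags}"


def from_traceparent(header):
    """Parse an incoming header and open a child span; None if the header is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None  # all-zero IDs are defined as invalid
    return {
        "trace_id": trace_id,                # unchanged across every hop
        "span_id": secrets.token_hex(8),     # new ID for this service's span
        "parent_id": parent_id,              # the caller's span
        "sampled": bool(int(flags, 16) & 1), # honor the upstream sampling decision
    }


# Service A starts the trace and calls service B with the header attached.
root_ctx = new_root_context()
outgoing_headers = {"traceparent": to_traceparent(root_ctx)}
child_ctx = from_traceparent(outgoing_headers["traceparent"])
print(outgoing_headers["traceparent"])
print(child_ctx)
```

Carrying the sampled flag in the header is one way to keep a trace whole: if the first service decides to keep it, downstream services can follow the same decision instead of each sampling independently.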
Tracing across hundreds of services is the only practical way to localize a slow API call to the responsible downstream service.
A trace follows a ride request through dispatch, geo-index, and trip services to show which tier added latency during a surge.
A charge's trace spans the gateway, fraud check, ledger write, and async webhook so a stuck payment can be located precisely.
A storage migration bug silently deleted the document content for a small fraction of Google Docs users.
During a backend storage migration for Google Drive, a race condition in the migration code permanently deleted document contents for a small fraction of users, roughly 0.001% of the user base, which still represents hundreds of thousands of documents. The deletion was silent: Google Drive kept showing the document title and metadata, but opening the document showed blank content. Users didn't realize the content was gone right away, which delayed support tickets. Recovery was partial. Google's cross-region replication had the content, but the deletion had already propagated before the replication lag resolved, making some recent edits unrecoverable. The lesson: data migrations must be non-destructive. Write, then verify, then delete, with the delete step behind a flag so it can be rolled back. Replication protects against node failure, not against application-layer bugs, because a buggy delete is faithfully copied to every replica.
Distributed tracing is a great concept to mention proactively when you propose a microservices architecture, because it shows you're thinking about operational reality, not just the happy path. The key thing to say is that you'd use OpenTelemetry for instrumentation and propagate the trace context (for example, the W3C traceparent header) across every service hop. Skipping tracing in a multi-service design can signal that a candidate has never debugged a production incident in a distributed system.