Observability (Logs, Metrics, Traces)

The three pillars (logs, metrics, traces) that let you ask questions you didn't anticipate about a live system.

Observability is the three pillars working together: metrics tell you something is wrong, logs tell you what happened, traces tell you where the time went. Monitoring alerts you; observability lets you ask new questions about a problem you did not anticipate.

Plain English: observability is being able to figure out what's wrong with a running system from the outside. Metrics tell you something is broken (latency spiked), logs tell you what happened (this error), and traces tell you where in the chain of services it happened. You need all three.

Used in: Netflix, Payment Gateway, Uber

What it is

The property of being able to understand a system's internal state from its external outputs. It rests on three pillars: metrics (numeric time-series like request rate, error rate, latency percentiles), logs (timestamped event records, ideally structured), and traces (the path of a single request across many services). The goal is to answer questions you didn't pre-define, not just watch fixed dashboards.
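To make "latency percentiles" concrete, here is a minimal sketch of why metrics usually track percentiles rather than averages. The latency values are invented for illustration, and the nearest-rank method is only one of several ways to compute a percentile; real metrics backends typically estimate percentiles from histogram buckets instead of raw samples.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # 1-based rank; integer math before the division avoids float rounding surprises.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical checkout latencies in milliseconds: 98 fast requests, 2 very slow ones.
latencies_ms = [100] * 98 + [5000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms}ms p50={p50_ms}ms p99={p99_ms}ms")
```

In this made-up data the mean (198 ms) and the median (100 ms) both look healthy, while the p99 (5000 ms) exposes the slow tail that some users are actually hitting.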

The problem it solves

In a distributed system, a user-facing slowdown could come from any of dozens of services, a database, a cache, or the network. Without observability you're guessing. Metrics tell you something is wrong and alert you; logs tell you what specifically happened; traces tell you where in the request path the time or error originated. Monitoring answers known questions ('is CPU high?'); observability lets you explore unknown ones ('why are checkouts from this region slow only on Tuesdays?').

How it works

Metrics: services emit counters, gauges, and histograms (Prometheus-style); you aggregate them and alert on rates and percentiles.
Logs: services emit structured (JSON) events carrying a correlation/trace ID and ship them to a central store (ELK, Loki) for search.
Traces: a trace ID is propagated across every service hop (via request headers); each service records spans with timing, and a tracing backend (Jaeger, Tempo) reconstructs the full request waterfall.
Logs and traces are tied together by the shared ID. Metrics should not carry that ID as a label (it is high-cardinality), so the pivot from a spiking metric usually goes through the service, endpoint, and time window, or through exemplars that attach a sample trace ID to a metric, to reach the logs and the trace behind it.
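The log and trace mechanics described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not how any particular tracing SDK is implemented: the header layout follows the W3C Trace Context traceparent format (version, 32-hex trace ID, 16-hex span ID, flags), while the function names, service names, and log fields are hypothetical. In practice a library such as OpenTelemetry would handle this for you.

```python
import json
import secrets
import time


def new_traceparent():
    """Start a trace: version 00, random 16-byte trace ID, random 8-byte span ID, sampled flag."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(incoming):
    """Keep the trace ID from the incoming header and mint a new span ID for this hop."""
    version, trace_id, _parent_span_id, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def log_event(service, traceparent, message, **fields):
    """Emit one structured (JSON) log line that carries the trace ID."""
    trace_id = traceparent.split("-")[1]
    record = {"ts": time.time(), "service": service, "trace_id": trace_id,
              "message": message, **fields}
    line = json.dumps(record)
    print(line)
    return line


# Hypothetical two-hop request: checkout calls payments.
checkout_header = new_traceparent()
checkout_line = log_event("checkout", checkout_header, "order received", cart_items=3)

# The child header is what checkout would send to payments as the traceparent HTTP header.
payments_header = child_traceparent(checkout_header)
payments_line = log_event("payments", payments_header, "charge attempted", duration_ms=412)
```

Both log lines end up with the same trace_id value, so a search for that ID in the central log store returns every event for the request, and the same ID locates its trace in the tracing backend.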

Why use it

Lets you debug novel failures in production without redeploying instrumentation
Each pillar answers a different question: metrics 'is it broken?', logs 'what happened?', traces 'where?'
Correlation IDs let you pivot between the three when chasing an incident

What it costs you

Telemetry volume is huge and expensive: at scale, storing every log and trace at full fidelity can cost more than running the service itself
Cardinality explosions (high-cardinality metric labels like user_id) can blow up your metrics backend
Instrumentation is real work and easy to do inconsistently, and gaps leave blind spots exactly where you need visibility
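The cardinality cost above is simple multiplication, which a few lines of arithmetic make visible. The label names and counts below are invented for illustration, and the product is an upper bound, since not every label combination occurs in practice.

```python
from math import prod


def series_count(label_cardinalities):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical label sets for a single request-latency metric.
bounded = {"service": 50, "endpoint": 20, "status_code": 5}
with_user_id = {**bounded, "user_id": 1_000_000}

bounded_series = series_count(bounded)
exploded_series = series_count(with_user_id)

print(f"bounded labels: {bounded_series:,} series")
print(f"with user_id:   {exploded_series:,} series")
```

With these assumed numbers, three bounded labels give at most 5,000 series, while adding a user_id label with a million values raises the ceiling to 5 billion.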

Where it shows up in our architectures

Netflix
Hundreds of microservices make observability mandatory; distributed traces and per-service metrics are the only way to localize a slow request
Payment Gateway
Structured audit logs and metrics on every charge are both an operational and a compliance requirement
Uber
Metrics on dispatch latency and traces across the matching/geo/trip services pinpoint which tier is slow during a surge

Gotchas

Monitoring (known questions) is not observability (unknown questions). Fixed dashboards won't help with the novel failure; you need the ability to slice telemetry on dimensions you didn't pre-plan.
Watch metric cardinality. Putting user_id or request_id as a metric label multiplies time-series into the millions and bankrupts your metrics store; those belong in logs/traces, not labels.
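Sample traces. Tracing 100% of requests at scale is usually unaffordable. Head-based sampling decides at the start of a request, so it is cheap but keeps only a random slice; tail-based sampling decides after the trace completes, so it can keep the interesting traces (errors, slow ones) while dropping the boring majority, at the cost of buffering spans until the decision is made.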
Without a shared correlation/trace ID across services, your logs, metrics, and traces are three disconnected islands. Propagating that ID is the single highest-leverage thing to get right.
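For the sampling gotcha above, a tail-based decision might look like the following sketch. The span shape (start_ms, end_ms, error), the 1000 ms threshold, and the 1% baseline rate are all assumptions chosen for illustration; a real collector would also have to buffer spans until the trace is complete.

```python
import random


def keep_trace(spans, slow_ms=1000, baseline_rate=0.01, rng=random.random):
    """Tail-based decision: runs after the trace is complete, so it can see errors and duration."""
    has_error = any(span["error"] for span in spans)
    total_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if has_error or total_ms >= slow_ms:
        return True  # always keep the interesting traces
    return rng() < baseline_rate  # keep a small random share of healthy traces


# Hypothetical completed traces, each a list of spans.
fast_ok = [{"start_ms": 0, "end_ms": 80, "error": False}]
slow_ok = [{"start_ms": 0, "end_ms": 1500, "error": False},
           {"start_ms": 100, "end_ms": 1400, "error": False}]
fast_err = [{"start_ms": 0, "end_ms": 50, "error": True}]

print(keep_trace(slow_ok), keep_trace(fast_err))
```

A head-based sampler could not apply the error or duration checks, because it has to decide before any span has finished; it would reduce to the random baseline branch alone.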

When this went wrong in production

Google Docs deletes documents for 0.001% of users · 2023

A storage migration bug silently deleted the document content for a small fraction of Google Docs users.

During a backend storage migration for Google Drive, a race condition in the migration code permanently deleted document contents for a small fraction of users, roughly 0.001% of the user base, which still represents hundreds of thousands of documents. The deletion was silent: Google Drive kept showing the document title and metadata, but opening the document showed blank content. Users didn't realize the content was gone right away, which delayed support tickets. Recovery was partial. Google's cross-region replicas briefly still had the content, but the deletion propagated to them as well, making some recent edits unrecoverable. The lesson: data migrations must be non-destructive. Write, then verify, then delete, behind a flag that can be rolled back. Replication protects against node failure, not against application-layer bugs, which are faithfully copied to every replica.

Interview angle

Observability questions test whether you understand the difference between monitoring and actually debugging a live system. The signal is knowing all three pillars by name and what question each one answers: metrics for alerting on known failure modes, logs for what specifically happened, traces for where in a distributed call chain the time or error originated. Candidates lose points by just saying 'add logging' without tying logs to traces via a correlation ID; uncorrelated logs are close to useless in a microservices architecture.
