The three pillars (logs, metrics, traces) that let you ask questions you didn't anticipate about a live system.
Checkout is slow. Reveal each pillar to see what it tells you.
Observability is the three pillars working together: metrics tell you something is wrong, logs tell you what happened, traces tell you where the time went. Monitoring alerts you; observability lets you ask new questions about a problem you did not anticipate.
Plain English: observability is being able to figure out what's wrong with a running system from the outside. Metrics tell you something is broken (latency spiked), logs tell you what happened (this error), and traces tell you where in the chain of services it happened. You need all three.
The property of being able to understand a system's internal state from its external outputs. It rests on three pillars: metrics (numeric time-series like request rate, error rate, latency percentiles), logs (timestamped event records, ideally structured), and traces (the path of a single request across many services). The goal is to answer questions you didn't pre-define, not just watch fixed dashboards.
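As a rough illustration of the metrics pillar, the sketch below derives an error rate and latency percentiles from a small window of request records. The record fields (`latency_ms`, `status`), the sample values, and the nearest-rank percentile method are assumptions made for this example; production metrics systems usually aggregate into histograms rather than keeping raw samples.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical window of ten checkout requests; one is slow and failed.
window = [
    {"latency_ms": 100, "status": 200},
    {"latency_ms": 110, "status": 200},
    {"latency_ms": 120, "status": 200},
    {"latency_ms": 130, "status": 200},
    {"latency_ms": 140, "status": 200},
    {"latency_ms": 150, "status": 200},
    {"latency_ms": 160, "status": 200},
    {"latency_ms": 170, "status": 200},
    {"latency_ms": 180, "status": 200},
    {"latency_ms": 2500, "status": 500},
]

latencies = [r["latency_ms"] for r in window]
error_rate = sum(1 for r in window if r["status"] >= 500) / len(window)
mean_ms = sum(latencies) / len(latencies)
p50_ms = percentile(latencies, 50)
p99_ms = percentile(latencies, 99)

print(f"error_rate={error_rate:.0%} mean={mean_ms}ms p50={p50_ms}ms p99={p99_ms}ms")
```

In this made-up window the median stays at 140 ms while the p99 is 2500 ms, which is why latency is usually tracked as percentiles: a single average would blur the one request that actually hurt a user.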
In a distributed system, a user-facing slowdown could come from any of dozens of services, a database, a cache, or the network. Without observability you're guessing. Metrics tell you something is wrong and alert you; logs tell you what specifically happened; traces tell you where in the request path the time or error originated. Monitoring answers known questions ('is CPU high?'); observability lets you explore unknown ones ('why are checkouts from this region slow only on Tuesdays?').
Metrics: services emit counters/gauges/histograms (Prometheus-style); you aggregate and alert on rates and percentiles. Logs: emit structured (JSON) events with a correlation/trace ID, ship them to a central store (ELK, Loki) for search. Traces: propagate a trace ID across every service hop (via headers); each service records spans with timing; a tracing backend (Jaeger, Tempo) reconstructs the full request waterfall. The three are tied together by shared IDs so you can pivot from a spiking metric to the logs and the trace behind it.
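The shared-ID idea can be sketched with nothing but the standard library. In the example below, two hypothetical services (`checkout_service`, `payment_service`) pass a trace ID in a header, emit structured JSON logs tagged with it, and record timed spans. The header name `x-trace-id`, the in-memory `LOGS` and `SPANS` lists, and the `Span` helper are all assumptions for illustration; a real deployment would more likely use an instrumentation library and a standard header such as W3C `traceparent`.

```python
import json
import time
import uuid

LOGS = []   # stand-in for a central log store (e.g. ELK or Loki)
SPANS = []  # stand-in for a tracing backend (e.g. Jaeger or Tempo)
TRACE_HEADER = "x-trace-id"  # hypothetical header name for this sketch

def log_event(service, trace_id, message, **fields):
    """Emit one structured (JSON) log line tagged with the trace ID."""
    record = {"ts": time.time(), "service": service,
              "trace_id": trace_id, "msg": message, **fields}
    LOGS.append(json.dumps(record))

class Span:
    """Context manager that records how long one unit of work took."""
    def __init__(self, service, name, trace_id):
        self.service, self.name, self.trace_id = service, name, trace_id

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        SPANS.append({
            "trace_id": self.trace_id,
            "service": self.service,
            "name": self.name,
            "duration_ms": (time.perf_counter() - self.start) * 1000,
            "error": exc_type is not None,
        })
        return False  # never swallow exceptions

def payment_service(headers):
    trace_id = headers[TRACE_HEADER]  # continue the caller's trace
    with Span("payment", "charge_card", trace_id):
        time.sleep(0.02)  # simulated work
        log_event("payment", trace_id, "card charged", amount_cents=1999)

def checkout_service(headers):
    # Reuse an incoming trace ID, or start a new trace at the edge.
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    with Span("checkout", "handle_checkout", trace_id):
        log_event("checkout", trace_id, "checkout started")
        payment_service({TRACE_HEADER: trace_id})  # propagate on the next hop
    return trace_id

trace_id = checkout_service({})

# Pivot: given one trace ID, pull every log line and span behind it.
trace_logs = [r for r in map(json.loads, LOGS) if r["trace_id"] == trace_id]
trace_spans = [s for s in SPANS if s["trace_id"] == trace_id]
for span in trace_spans:
    print(f'{span["service"]}.{span["name"]}: {span["duration_ms"]:.1f} ms')
```

Because every log line and span carries the same `trace_id`, a single filter recovers the whole story of one request across both services, which is the pivot the paragraph above describes.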
Hundreds of microservices make observability mandatory; distributed traces and per-service metrics are the only practical way to localize a slow request.
Structured audit logs and metrics on every charge are both an operational and a compliance requirement.
Metrics on dispatch latency and traces across the matching/geo/trip services pinpoint which tier is slow during a surge.
A storage migration bug silently deleted the document content for a small fraction of Google Docs users.
During a backend storage migration for Google Drive, a race condition in the migration code permanently deleted document contents for a small fraction of users: roughly 0.001% of the user base, which still represents hundreds of thousands of documents. The deletion was silent. Google Drive kept showing each document's title and metadata, but opening the document showed blank content, so users did not realize right away that anything was missing, and support tickets arrived late. Recovery was partial: Google's cross-region replication had the content, but the deletion had already propagated before the replication lag resolved, so some recent edits were unrecoverable. The lesson: data migrations must be non-destructive. Write to the new location, verify the copy, and only then delete the original, behind a flag that can be rolled back. Replication protects against node failure, not against application-layer bugs, whose effects are copied faithfully to every replica.
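The write, verify, then delete pattern can be sketched as follows. This is a minimal illustration, not a reconstruction of any real migration code: the stores are plain dictionaries, and `migrate_document`, `rollback`, `purge`, and the `tombstones` set (the reversible "delete" flag) are hypothetical names introduced for the example.

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def migrate_document(doc_id, old_store, new_store, tombstones):
    """Copy one document, verify the copy, then soft-delete the original."""
    content = old_store[doc_id]

    # 1. Write to the new location; the original is left untouched.
    new_store[doc_id] = content

    # 2. Verify by reading the copy back and comparing checksums.
    if checksum(new_store[doc_id]) != checksum(content):
        del new_store[doc_id]  # discard the bad copy and keep serving the original
        return False

    # 3. "Delete" by setting a reversible flag rather than erasing data.
    tombstones.add(doc_id)
    return True

def rollback(doc_id, new_store, tombstones):
    """Undo a migration: clear the flag so the original is authoritative again."""
    tombstones.discard(doc_id)
    new_store.pop(doc_id, None)

def purge(old_store, new_store, tombstones):
    """Irreversible cleanup, intended to run only after a grace period."""
    for doc_id in list(tombstones):
        if doc_id in new_store:  # re-check the copy exists before erasing
            del old_store[doc_id]
            tombstones.discard(doc_id)

old_store = {"doc-1": b"quarterly report"}
new_store = {}
tombstones = set()

migrated = migrate_document("doc-1", old_store, new_store, tombstones)
print(migrated, "doc-1" in old_store, "doc-1" in tombstones)
```

The design choice is that nothing irreversible happens inside the migration step itself. A bug in `migrate_document` can at worst set a flag that `rollback` clears, and the only destructive code path (`purge`) is small, separate, and delayed.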
Observability questions test whether you understand the difference between monitoring a live system and actually debugging it. The signal is knowing all three pillars by name and the question each one answers: metrics for alerting on known failure modes, logs for what specifically happened, traces for where in a distributed call chain the time or error originated. Candidates lose points by just saying 'add logging' without connecting logs to traces via a correlation ID; logs that cannot be tied to a specific request are close to useless in a microservices architecture.