The three pillars (logs, metrics, traces) that let you ask new, unanticipated questions about a live system.
Plain English: observability is being able to figure out what's wrong with a running system from the outside. Metrics tell you something is broken (latency spiked), logs tell you what happened (this error), and traces tell you where in the chain of services it happened. You need all three.
The property of being able to understand a system's internal state from its external outputs. It rests on three pillars: metrics (numeric time-series like request rate, error rate, latency percentiles), logs (timestamped event records, ideally structured), and traces (the path of a single request across many services). The goal is to answer questions you didn't pre-define, not just watch fixed dashboards.
In a distributed system, a user-facing slowdown could come from any of dozens of services, a database, a cache, or the network. Without observability you're guessing. Metrics tell you something is wrong and alert you; logs tell you what specifically happened; traces tell you where in the request path the time or error originated. Monitoring answers known questions ('is CPU high?'); observability lets you explore unknown ones ('why are checkouts from this region slow only on Tuesdays?').
Metrics: services emit counters, gauges, and histograms (Prometheus-style); you aggregate them and alert on rates and percentiles. Logs: services emit structured (JSON) events that carry a correlation/trace ID and ship them to a central store (ELK, Loki) for search. Traces: a trace ID is propagated across every service hop (via request headers); each service records spans with timing; a tracing backend (Jaeger, Tempo) reconstructs the full request waterfall. Shared IDs tie the three together: logs and spans carry the same trace ID, and metrics can point at sample traces (for example through exemplars), so you can pivot from a spiking metric to the logs and the trace behind it.
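The flow above can be sketched with nothing but the Python standard library. This is a minimal illustration, not a production setup: the in-memory lists stand in for a metrics store, a log store, and a tracing backend, and every name here (`checkout_service`, `inventory_service`, `Span`, `pivot`, the `x-trace-id` header) is hypothetical. Real systems usually rely on a standard header such as W3C `traceparent` and a library such as OpenTelemetry rather than hand-rolled code.

```python
import json
import time
import uuid
from collections import defaultdict

# Hypothetical in-memory sinks standing in for real backends.
METRICS = defaultdict(list)  # metric name -> observed latency samples (ms)
LOGS = []                    # structured log lines, stored as JSON strings
SPANS = []                   # finished spans, one dict per unit of work


def log_event(trace_id, service, message, **fields):
    """Emit one structured JSON log line that carries the trace ID."""
    record = {"ts": time.time(), "trace_id": trace_id,
              "service": service, "message": message, **fields}
    LOGS.append(json.dumps(record))


class Span:
    """Context manager that times a block and records a span plus a metric."""

    def __init__(self, trace_id, service, name):
        self.trace_id, self.service, self.name = trace_id, service, name

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        duration_ms = (time.perf_counter() - self.start) * 1000
        SPANS.append({"trace_id": self.trace_id, "service": self.service,
                      "name": self.name, "duration_ms": duration_ms,
                      "error": exc_type is not None})
        # The same timing also feeds a per-service latency metric.
        METRICS[f"{self.service}.latency_ms"].append(duration_ms)
        return False  # never swallow exceptions


def inventory_service(headers):
    """Downstream hop: reuses the trace ID it received in the headers."""
    trace_id = headers["x-trace-id"]
    with Span(trace_id, "inventory", "reserve_stock"):
        log_event(trace_id, "inventory", "stock reserved", sku="abc-123")


def checkout_service(headers):
    """Entry point: starts a trace if the caller did not send one."""
    trace_id = headers.get("x-trace-id") or uuid.uuid4().hex
    with Span(trace_id, "checkout", "handle_checkout"):
        log_event(trace_id, "checkout", "checkout started")
        # Propagate the trace ID to the next hop via headers.
        inventory_service({"x-trace-id": trace_id})
    return trace_id


def pivot(trace_id):
    """Given one trace ID, collect the logs and spans that share it."""
    logs = [rec for rec in map(json.loads, LOGS) if rec["trace_id"] == trace_id]
    spans = [s for s in SPANS if s["trace_id"] == trace_id]
    return logs, spans
```

Calling `checkout_service({})` returns a fresh trace ID, and `pivot(trace_id)` then returns the two log records and two spans for that request. The inner `inventory` span finishes first, so it is recorded before the enclosing `checkout` span; a real tracing backend would use parent/child span IDs to rebuild the waterfall, which this sketch omits.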
Hundreds of microservices make observability mandatory; distributed traces and per-service metrics are the only practical way to localize a slow request.
Structured audit logs and metrics on every charge are both an operational and a compliance requirement.
Metrics on dispatch latency and traces across the matching/geo/trip services pinpoint which tier is slow during a surge.