The first move in any interview: define requirements and sketch the API before drawing a single box.
POST /ingest/metrics [{ name, labels, timestamp, value }] → 204
GET /query/range?metric=http_requests&labels={service:auth}&start=T1&end=T2&step=60s
GET /query/instant?expr=avg(http_latency_p99{env='prod'}) → float
POST /alerts/rules { name, expr, threshold, for, notify } → ruleId
Every production system generates a constant stream of numbers: request latency, error rate, queue depth, memory usage, CPU load. A metrics and monitoring system ingests all of those numbers, stores them efficiently, lets engineers query them in real time, and fires alerts when something goes wrong. Without it, you're flying blind.
The design challenge is scale. A medium-sized company might have 50,000 services, each emitting 200 metrics every 15 seconds. That's roughly 667,000 data points per second (50,000 × 200 ÷ 15), 24/7. The data model is always the same: a metric name, a set of labels (which service, which instance, which region), a timestamp, and a float value. The access pattern is also distinctive: almost all writes are recent (you never write to the past), almost all reads are recent time ranges, and queries aggregate across many series (average latency across 50 instances of the same service).
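To make the arithmetic and the data model above concrete, here is a minimal Python sketch. The names `DataPoint`, `make_point`, `points_per_second`, and `average_where` are illustrative assumptions, not taken from any particular TSDB, and the in-memory list stands in for real storage.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class DataPoint:
    """One sample: metric name, label set, timestamp, float value."""
    name: str
    labels: frozenset  # frozenset of (key, value) pairs, so it is hashable
    timestamp: int     # seconds since epoch (assumed unit)
    value: float


def make_point(name, labels, timestamp, value):
    """Build a DataPoint from a plain dict of labels."""
    return DataPoint(name, frozenset(labels.items()), timestamp, float(value))


def points_per_second(services, metrics_per_service, interval_seconds):
    """Steady-state ingest rate for the sizing estimate in the text."""
    return services * metrics_per_service / interval_seconds


def average_where(points, name, **selector):
    """Average a metric across every series whose labels match the selector."""
    wanted = set(selector.items())
    values = [p.value for p in points if p.name == name and wanted <= p.labels]
    return mean(values)


if __name__ == "__main__":
    print(round(points_per_second(50_000, 200, 15)))
    samples = [
        make_point("http_latency_ms", {"service": "auth", "instance": "a1"}, 1000, 100.0),
        make_point("http_latency_ms", {"service": "auth", "instance": "a2"}, 1000, 300.0),
        make_point("http_latency_ms", {"service": "billing", "instance": "b1"}, 1000, 900.0),
    ]
    print(average_where(samples, "http_latency_ms", service="auth"))
```

The label selector is a subset match, which mirrors the "average latency across many instances of the same service" query shape: every series carrying `service=auth` contributes, whatever its other labels are.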
A write-heavy, time-ordered, heavily aggregated workload like this is a poor fit for a general-purpose relational database. Time-series databases (TSDBs) are optimized specifically for it: columnar storage groups values from the same metric over time, delta encoding and compression shrink the small differences between consecutive timestamps and values into a few bits each, and pre-aggregation (downsampling) makes old data cheap to query.
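The compression idea can be sketched in a few lines. The example below assumes a Gorilla-style scheme (delta-of-delta for timestamps, XOR of consecutive float bit patterns for values). It only computes the small integers a real encoder would then bit-pack, and the function names are hypothetical.

```python
import struct


def delta_of_delta_encode(timestamps):
    """Return [first, first_delta, delta_of_delta, ...] for a timestamp list."""
    if not timestamps:
        return []
    encoded = [timestamps[0]]
    prev_delta = 0
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        encoded.append(delta - prev_delta)  # near zero for a regular interval
        prev_delta = delta
    return encoded


def delta_of_delta_decode(encoded):
    """Invert delta_of_delta_encode."""
    if not encoded:
        return []
    timestamps = [encoded[0]]
    delta = 0
    for dod in encoded[1:]:
        delta += dod
        timestamps.append(timestamps[-1] + delta)
    return timestamps


def float_bits(x):
    """IEEE 754 double reinterpreted as a 64-bit unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_encode(values):
    """Return [first_bits, bits[i] XOR bits[i-1], ...]; repeats become 0."""
    if not values:
        return []
    bits = [float_bits(v) for v in values]
    return [bits[0]] + [a ^ b for a, b in zip(bits, bits[1:])]


if __name__ == "__main__":
    ts = [1000, 1015, 1030, 1045, 1061]  # 15 s interval with one late sample
    print(delta_of_delta_encode(ts))     # mostly zeros after the first delta
    print(xor_encode([0.5, 0.5, 0.5]))   # unchanged values XOR to 0
```

With a fixed 15-second interval, almost every timestamp collapses to a zero, and a gauge that does not change collapses to a zero as well. That is where most of the space saving comes from before any bit-level packing is applied.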
The two dominant open-source paradigms are Prometheus-style scraping (the monitoring system pulls metrics from services at regular intervals) and push-based ingestion (services push metrics to a collector). Both are in widespread production use, with different trade-offs for service discovery, network topology, and reliability.
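The difference between the two paradigms is mostly about who initiates the transfer. The sketch below models both against a toy in-memory store, with no networking. `Store`, `Service`, and `scrape_once` are hypothetical names for illustration and do not correspond to the Prometheus API or any client library.

```python
class Store:
    """Toy in-memory stand-in for the TSDB write path."""

    def __init__(self):
        self.points = []

    def write(self, name, labels, timestamp, value):
        self.points.append((name, tuple(sorted(labels.items())), timestamp, value))


class Service:
    """An instrumented service that can be scraped (pull) or can push."""

    def __init__(self, name, store=None):
        self.name = name
        self.requests = 0
        self._store = store  # only needed in push mode

    def handle_request(self):
        self.requests += 1

    def expose(self):
        """Pull mode: the current values a scraper would read."""
        return {"http_requests": float(self.requests)}

    def push(self, now):
        """Push mode: the service itself sends its values to the collector."""
        for metric, value in self.expose().items():
            self._store.write(metric, {"service": self.name}, now, value)


def scrape_once(store, targets, now):
    """Pull mode: the monitoring side walks its target list and reads each one."""
    for target in targets:
        for metric, value in target.expose().items():
            store.write(metric, {"service": target.name}, now, value)


if __name__ == "__main__":
    pulled = Store()
    auth = Service("auth")
    auth.handle_request()
    scrape_once(pulled, [auth], now=100)
    print(pulled.points)

    pushed = Store()
    billing = Service("billing", store=pushed)
    billing.handle_request()
    billing.push(now=200)
    print(pushed.points)
```

In the pull version the monitoring system has to know the list of targets, which is where service discovery comes in. In the push version each service has to know where the collector lives and be able to reach it, which is where network topology comes in.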