Metrics & Monitoring System: System Design

Requirements & API: Metrics & Monitoring System

The first move in any interview: define requirements and sketch the API before drawing a single box.

Functional requirements

•Ingest metrics from thousands of services (push) or scrape them (pull) at configurable intervals.
•Store metrics as time series: (metric_name, labels, timestamp, value) tuples.
•Support range queries: give me all values of metric M with labels L over the last N hours.
•Support aggregation: average, sum, percentile across a set of matching series.
•Evaluate alerting rules continuously and fire alerts when thresholds are crossed.
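
The data model and the two read requirements above can be sketched with a toy in-memory store. This is only an illustration under simplifying assumptions: the class name TinySeriesStore and its methods are hypothetical, there is no persistence, and label matching is exact equality only.

```python
from collections import defaultdict
from statistics import mean


class TinySeriesStore:
    """Toy in-memory store: one list of (timestamp, value) points per series."""

    def __init__(self):
        # A series is identified by its metric name plus its full label set.
        self._series = defaultdict(list)

    @staticmethod
    def _key(name, labels):
        return (name, frozenset(labels.items()))

    def append(self, name, labels, timestamp, value):
        self._series[self._key(name, labels)].append((timestamp, value))

    def range_query(self, name, matchers, start, end):
        """Return {label_set: [(ts, value), ...]} for series matching all matchers."""
        wanted = set(matchers.items())
        result = {}
        for (series_name, label_set), points in self._series.items():
            if series_name == name and wanted <= label_set:
                result[label_set] = [(t, v) for t, v in points if start <= t <= end]
        return result

    def aggregate_avg(self, name, matchers, start, end):
        """Average of every matching point in the window, across all matching series."""
        matched = self.range_query(name, matchers, start, end)
        values = [v for points in matched.values() for _, v in points]
        return mean(values) if values else None


store = TinySeriesStore()
store.append("http_requests", {"service": "auth", "instance": "a"}, 100, 10.0)
store.append("http_requests", {"service": "auth", "instance": "b"}, 100, 30.0)
store.append("http_requests", {"service": "cart", "instance": "c"}, 100, 99.0)
auth_avg = store.aggregate_avg("http_requests", {"service": "auth"}, 0, 200)
```

The point to notice is that one metric name fans out into many series, one per distinct label set, and every aggregation has to find and merge them.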

Non-functional requirements

•Ingest rate: 1M+ data points per second without data loss.
•Query latency: dashboard queries over the last hour should return in under 1 second.
•Retention: high-resolution data for 15 days; downsampled data for 1 year.
•Availability: monitoring must stay up even when the systems it monitors are down.
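
The retention requirement implies a downsampling job that rolls raw points up into coarser buckets. A minimal sketch, assuming fixed-width buckets and a rollup that keeps count, sum, min and max (the function name and the rollup shape are illustrative choices, not a prescribed format):

```python
def downsample(points, bucket_seconds):
    """Collapse (timestamp, value) points into fixed-width buckets.

    Keeps count, sum, min and max per bucket, so averages and extremes
    can still be answered from the rollup.
    """
    buckets = {}
    for ts, value in points:
        bucket_start = ts - (ts % bucket_seconds)
        agg = buckets.get(bucket_start)
        if agg is None:
            buckets[bucket_start] = {"count": 1, "sum": value, "min": value, "max": value}
        else:
            agg["count"] += 1
            agg["sum"] += value
            agg["min"] = min(agg["min"], value)
            agg["max"] = max(agg["max"], value)
    # Return buckets in time order.
    return dict(sorted(buckets.items()))


raw = [(0, 1.0), (15, 3.0), (30, 2.0), (45, 6.0), (60, 10.0)]
rollup = downsample(raw, 60)
```

Storing sum and count rather than a precomputed average lets buckets be merged again later. Exact percentiles cannot be recovered from a rollup like this, which is worth raising as a trade-off.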

API contract

POST /ingest/metrics [{ name, labels, timestamp, value }] → 204

Push path. Batched writes for throughput.
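
On the client side, batching might look like the sketch below. The batch size of 500 and the helper name build_batches are assumptions for illustration; the code only builds request bodies in the shape shown above and does not send anything.

```python
import json


def build_batches(points, max_batch_size=500):
    """Yield JSON request bodies, each holding at most max_batch_size points."""
    for i in range(0, len(points), max_batch_size):
        yield json.dumps(points[i:i + max_batch_size])


# 1,200 synthetic points in the { name, labels, timestamp, value } shape.
points = [
    {
        "name": "queue_depth",
        "labels": {"service": "auth"},
        "timestamp": 1700000000 + i,
        "value": float(i),
    }
    for i in range(1200)
]
bodies = list(build_batches(points))
```

Each body would become one POST /ingest/metrics call, so 1,200 points cost three requests instead of 1,200.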

GET /query/range?metric=http_requests&labels={service:auth}&start=T1&end=T2&step=60s

Range query: the core read path for dashboards.
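
The step parameter means the server returns one value per step timestamp, not every raw sample. One plausible way to do that, sketched here as an assumption, is to take the most recent sample at or before each step and treat samples older than a lookback window as stale. The function name and the 300-second default are illustrative.

```python
from bisect import bisect_right


def evaluate_steps(points, start, end, step, lookback=300):
    """For each step timestamp, return the most recent sample at or before it.

    points must be sorted by timestamp. Samples older than `lookback`
    seconds are treated as stale and reported as None.
    """
    timestamps = [t for t, _ in points]
    out = []
    t = start
    while t <= end:
        # Index of the last sample with timestamp <= t, or -1 if none.
        i = bisect_right(timestamps, t) - 1
        if i >= 0 and t - timestamps[i] <= lookback:
            out.append((t, points[i][1]))
        else:
            out.append((t, None))
        t += step
    return out


samples = [(0, 1.0), (15, 2.0), (30, 3.0), (45, 4.0), (75, 5.0)]
result = evaluate_steps(samples, start=0, end=120, step=60)
```

With a 60-second step over samples taken every 15 seconds, the dashboard receives a quarter of the points, which is where much of the latency budget is saved.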

GET /query/instant?expr=avg(http_latency_p99{env='prod'}) → float

PromQL-style instant query for alerting evaluation.

POST /alerts/rules { name, expr, threshold, for, notify } → ruleId

Rule registration. Stored rules feed the continuous alert evaluation loop.
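
How threshold and for interact can be sketched as a small state machine. This assumes for means "the condition must hold continuously for this long before the alert fires", as in Prometheus-style rules, and that the condition is "value above threshold". The AlertRule class is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    name: str
    threshold: float
    for_seconds: int
    pending_since: Optional[float] = None  # when the condition first became true
    firing: bool = False

    def evaluate(self, value, now):
        """Feed one instant-query result; return True if the alert is firing."""
        if value is None or value <= self.threshold:
            # Condition is false: reset any pending or firing state.
            self.pending_since = None
            self.firing = False
            return False
        if self.pending_since is None:
            self.pending_since = now
        # Fire only after the condition has held for the full duration.
        self.firing = (now - self.pending_since) >= self.for_seconds
        return self.firing


rule = AlertRule(name="HighLatency", threshold=0.5, for_seconds=120)
observations = [(0, 0.9), (60, 0.8), (120, 0.7), (180, 0.2), (240, 0.9)]
states = [rule.evaluate(value, now) for now, value in observations]
```

The pending period is what keeps a single noisy sample from paging someone: the alert fires at t=120, clears at t=180, and the spike at t=240 starts a fresh pending window.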

About Metrics & Monitoring System

Every production system generates a constant stream of numbers: request latency, error rate, queue depth, memory usage, CPU load. A metrics and monitoring system ingests all of those numbers, stores them efficiently, lets engineers query them in real time, and fires alerts when something goes wrong. Without it, you're flying blind.

The design challenge is scale. A medium-sized company might have 50,000 services, each emitting 200 metrics every 15 seconds. That's 50,000 × 200 / 15 ≈ 667,000 data points per second, 24/7. The data model is always the same: a metric name, a set of labels (which service, which instance, which region), a timestamp, and a float value. The access pattern is also distinctive: almost all writes are recent (you never write to the past), almost all reads are recent time ranges, and queries aggregate across many series (average latency across 50 instances of the same service).
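
The back-of-envelope estimate above is easy to reproduce, and the same three inputs also give the active series count and the daily volume. The variable names are illustrative; the numbers come straight from the paragraph.

```python
services = 50_000
metrics_per_service = 200
scrape_interval_s = 15

# Every service/metric pair is its own time series (ignoring extra labels).
active_series = services * metrics_per_service

# Each series produces one point per interval.
points_per_second = active_series / scrape_interval_s

# 86,400 seconds per day.
points_per_day = active_series * 86_400 // scrape_interval_s

print(f"{active_series:,} series, {points_per_second:,.0f} points/s, {points_per_day:,} points/day")
```

That works out to 10 million active series and about 57.6 billion points per day, which is why per-sample storage cost dominates the design.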

This combination of write-heavy, time-ordered, heavily aggregated workload is a poor fit for a general-purpose relational database. Time-series databases (TSDBs) are optimized specifically for it: columnar storage keeps each series' values together in time order, delta encoding and compression reduce regularly spaced timestamps and slowly changing float values to a few bits per sample, and pre-aggregation (downsampling) makes old data cheap to query.
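
One well-known way to realize the delta-encoding idea is the scheme described in Facebook's Gorilla paper: delta-of-delta for timestamps and XOR against the previous value for floats. The sketch below only computes those residuals to show why they compress well. A real encoder would also bit-pack them, and the function names here are illustrative.

```python
import struct


def delta_of_delta(timestamps):
    """Return (first, first_delta, [delta-of-deltas]) for a sorted timestamp list.

    Assumes at least two timestamps.
    """
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], deltas[0], dods


def float_bits(x):
    """Reinterpret a float64 as a 64-bit unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_stream(values):
    """XOR each value's bit pattern with its predecessor's."""
    bits = [float_bits(v) for v in values]
    return [a ^ b for a, b in zip(bits, bits[1:])]


# Scrapes every 15 seconds, with the last one arriving a second late.
ts = [1000, 1015, 1030, 1045, 1061]
first, first_delta, dods = delta_of_delta(ts)

# A gauge that stays flat and then moves slightly.
xors = xor_stream([12.0, 12.0, 12.0, 12.5])
```

On a regular scrape interval almost every delta-of-delta is zero, and an unchanged value XORs to zero, so most samples need only a bit or two each.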

The two dominant open-source paradigms are Prometheus-style scraping (the monitoring system pulls metrics from services at regular intervals) and push-based ingestion (services push metrics to a collector). Both are in widespread production use, with different trade-offs for service discovery, network topology, and reliability.
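
The difference between the two paradigms comes down to who initiates the transfer. A toy collector supporting both paths might look like this. The Collector class, the target names, and the callables standing in for HTTP endpoints are all hypothetical. The up series mirrors the convention Prometheus uses to record scrape health.

```python
class Collector:
    """Toy collector that accepts pushed batches and can also scrape targets."""

    def __init__(self):
        self.points = []

    def receive(self, batch):
        # Push path: the service decides when to send.
        self.points.extend(batch)

    def scrape(self, targets, now):
        # Pull path: the collector decides when to read each target.
        for target, read_metrics in targets.items():
            try:
                metrics = read_metrics()
            except Exception:
                # A failed scrape is itself a data point about target health.
                self.points.append(("up", target, now, 0.0))
                continue
            self.points.append(("up", target, now, 1.0))
            for name, value in metrics.items():
                self.points.append((name, target, now, value))


def healthy():
    return {"queue_depth": 7.0}


def broken():
    raise ConnectionError("target unreachable")


collector = Collector()
collector.scrape({"auth:9100": healthy, "cart:9100": broken}, now=1000)
collector.receive([("queue_depth", "batch-job", 1000, 3.0)])
```

Pull gives liveness detection almost for free but requires the collector to know and reach every target. Push suits short-lived jobs and restrictive networks, at the cost of not knowing whether silence means "healthy and idle" or "dead".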