Performance

Latency vs Throughput

Latency is how long one request takes; throughput is how many you handle per second. Optimizing one often hurts the other.


Plain English: latency is the wait for a single request (how fast). Throughput is how many requests you handle per second (how much). They're different: a system can be high-throughput but high-latency (as with batching) or low-latency but low-throughput. Improving one often means giving up some of the other.

Used in: Stock Exchange (Matching Engine), Apache Kafka, Netflix
What it is

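Two distinct performance dimensions. Latency is the time to complete a single operation end to end, reported as percentiles (p50, p99, p999) rather than averages, because a mean hides the slow tail that some users actually experience. Throughput is the rate of operations the system sustains: requests/sec, MB/sec, messages/sec. The two are related but not interchangeable: on a highway, the time one car takes to get from A to B is latency, while the number of cars passing per hour (set largely by the number of lanes) is throughput.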
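To see why percentiles beat averages, here is a minimal Python sketch that summarizes one window of request timings. The helper names `percentile` and `summarize` are hypothetical, the sample data is made up for illustration, and the percentile uses the nearest-rank definition (other definitions interpolate and can give slightly different numbers).

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample such that at least p% of samples are <= it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def summarize(latencies_ms, window_s):
    """Latency percentiles plus throughput for the requests completed in one window."""
    return {
        "mean_ms": statistics.mean(latencies_ms),
        "p50_ms": percentile(latencies_ms, 50),
        "p99_ms": percentile(latencies_ms, 99),
        "throughput_rps": len(latencies_ms) / window_s,
    }


# Hypothetical sample: 100 requests finished in a 1-second window;
# 98 took 10 ms and 2 took 500 ms.
latencies = [10.0] * 98 + [500.0] * 2
print(summarize(latencies, window_s=1.0))
```

Here the mean comes out near 20 ms, a latency no request actually had, while p50 (10 ms) and p99 (500 ms) show both the typical case and the slow tail. Throughput is 100 requests/sec regardless of how the latencies are distributed, which is one way to see that the two dimensions are independent.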

The problem it solves

Forces you to be precise about what 'fast' means. A user staring at a spinner cares about latency. A batch pipeline processing a billion rows overnight cares about throughput. Conflating them leads to the wrong optimization: batching improves throughput but adds latency, while handling each request on its own as soon as it arrives minimizes latency but pays the fixed per-request overhead every time, which caps throughput.
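The batching trade-off can be put into numbers with a back-of-envelope model. This is a sketch under stated assumptions, not a measurement: `batch_tradeoff` is a hypothetical helper, each call is assumed to pay a fixed overhead plus a per-item cost, arrivals are assumed evenly spaced, a batch is sent only when full, and queueing delay is ignored.

```python
def batch_tradeoff(batch_size, fixed_cost_ms, per_item_ms, arrivals_per_ms):
    """Back-of-envelope batching model (evenly spaced arrivals, no queueing).

    Each call pays fixed_cost_ms once (e.g. a round trip or disk flush) plus
    per_item_ms for every item in the batch. A batch is sent only when full.
    Returns (max sustainable items/sec, average per-item latency in ms).
    """
    service_ms = fixed_cost_ms + batch_size * per_item_ms
    max_throughput_per_s = batch_size / service_ms * 1000
    # With evenly spaced arrivals, the first item waits (N-1)/rate for the batch
    # to fill and the last waits 0, so the average wait is (N-1) / (2 * rate).
    avg_fill_wait_ms = (batch_size - 1) / (2 * arrivals_per_ms)
    avg_latency_ms = avg_fill_wait_ms + service_ms
    return max_throughput_per_s, avg_latency_ms


# Hypothetical costs: 2 ms fixed overhead per call, 0.1 ms per item,
# one request arriving every 5 ms (200 requests/sec).
for n in (1, 10, 100):
    thr, lat = batch_tradeoff(n, fixed_cost_ms=2.0, per_item_ms=0.1, arrivals_per_ms=0.2)
    print(f"batch={n:>3}  max throughput={thr:8.0f}/s  avg latency={lat:6.1f} ms")
```

With these made-up costs, capacity climbs from roughly 476 items/sec at batch size 1 to roughly 8,300 at batch size 100, while average latency grows from about 2 ms to about 260 ms. Since the offered load here is only 200 requests/sec, the extra capacity buys nothing yet; batching pays off when load approaches what unbatched calls can sustain.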

How it works

Latency is reduced by caching, doing less work per request, moving computation closer to the user (CDN), and avoiding serial round-trips. Throughput is increased by parallelism, batching, pipelining, and adding nodes (horizontal scaling). The tension: batching N requests amortizes fixed costs (great for throughput), but each request now waits for the batch to fill (worse latency). Little's Law ties them together for a system in steady state: average concurrency (requests in flight) = throughput × average latency.
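Little's Law is simple enough to check by hand. The minimal Python sketch below computes it directly and also measures the average in-flight count for a toy system. The names `in_flight` and `measured_in_flight` are hypothetical, and the toy system (one arrival every fixed interval, fixed latency, unlimited workers so nothing queues) is an assumption chosen to keep the arithmetic easy to verify.

```python
def in_flight(throughput_per_s, avg_latency_s):
    """Little's Law: average requests in flight L = lambda * W."""
    return throughput_per_s * avg_latency_s


def measured_in_flight(arrival_interval_ms, latency_ms, duration_ms):
    """Time-averaged in-flight count for a toy system, computed directly.

    Assumes one request arrives every arrival_interval_ms and each stays exactly
    latency_ms (unlimited workers, so no queueing). The area under the
    "requests in flight over time" curve equals the sum of each request's time
    in the system within the window; dividing by the window length gives the average.
    """
    area_ms = 0
    for arrival in range(0, duration_ms, arrival_interval_ms):
        departure = min(arrival + latency_ms, duration_ms)  # clip at window end
        area_ms += departure - arrival
    return area_ms / duration_ms


# 100 req/s (one every 10 ms) with 50 ms latency should mean ~5 requests in flight.
print(in_flight(100, 0.05))
print(measured_in_flight(arrival_interval_ms=10, latency_ms=50, duration_ms=10_000))
```

Both come out at about 5. The measured value lands slightly below because requests still in flight when the window closes are clipped. The same formula is handy for sizing: a service expected to sustain 2,000 requests/sec at 50 ms average latency needs roughly 100 requests in flight at once, for example about 100 concurrent workers or connections.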

