Latency vs Throughput

Latency is how long one request takes; throughput is how many you handle per second. Optimizing one often hurts the other.

There is no universal winner, only the right fit for a given situation. Different workloads push the decision in different directions, which is exactly how this tradeoff shows up in real design questions.

First time reading this? Start here

Plain English: latency is the wait for a single request (how fast). Throughput is how many requests you handle per second (how much). They're different: a system can be high-throughput but high-latency (like batching) or low-latency but low-throughput. You usually have to trade one for the other.

Used in: Stock Exchange (Matching Engine), Apache Kafka, Netflix

What it is

Two distinct performance dimensions. Latency is the time to complete a single operation, end to end, measured in percentiles (p50, p99, p999), not averages. Throughput is the rate of operations the system sustains: requests/sec, MB/sec, messages/sec. They are related but distinct: on a highway, the time one car takes to get from on-ramp to exit is latency; the number of cars passing a point per hour is throughput, and adding lanes raises it without making any single trip faster.
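To make the "percentiles, not averages" point concrete, here is a minimal Python sketch. The sample latencies are invented for illustration, and `percentile` is a hypothetical helper that uses the nearest-rank definition (other definitions interpolate and give slightly different values).

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds: 98 fast requests and 2 very slow ones.
latencies_ms = [10] * 98 + [2000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f} ms  p50={p50_ms} ms  p99={p99_ms} ms")
```

In this made-up data set the mean sits between the two groups and describes no real request, while p50 shows the typical case and p99 exposes the slow tail.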

The problem it solves

Forces you to be precise about what 'fast' means. A user staring at a spinner cares about latency. A batch pipeline processing a billion rows overnight cares about throughput. Conflating them leads to the wrong optimization: batching improves throughput but adds latency; handling each request on its own the moment it arrives minimizes latency but pays the fixed overhead every time, which wastes throughput.

How it works

Latency is reduced by caching, doing less work per request, moving computation closer to the user (CDN), and avoiding serial round-trips. Throughput is increased by parallelism, batching, pipelining, and adding nodes (horizontal scaling). The tension: batching N requests amortizes fixed costs (great for throughput) but each request now waits for the batch to fill (worse latency). Little's Law ties them together: concurrency = throughput × latency.
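The batching tension can be sketched with a deliberately simplified model. Everything here is an assumption for illustration: `batch_stats` is a hypothetical function, each batch pays one fixed cost plus a per-item cost, arrivals are evenly spaced, and queueing behind earlier batches is ignored.

```python
def batch_stats(batch_size, fixed_cost_ms, per_item_ms, arrival_rate_per_s):
    """Return (throughput ceiling in items/s, worst-case latency in ms).

    Assumed model: one fixed cost per batch, a linear per-item cost,
    evenly spaced arrivals, and no queueing behind earlier batches.
    """
    # Time to process one full batch.
    service_ms = fixed_cost_ms + batch_size * per_item_ms
    # Upper bound on items/s if batches run back to back.
    throughput_per_s = batch_size / service_ms * 1000
    # The first item in a batch waits for the remaining items to arrive.
    fill_wait_ms = (batch_size - 1) / arrival_rate_per_s * 1000
    return throughput_per_s, fill_wait_ms + service_ms

for n in (1, 10, 100):
    tput, latency = batch_stats(n, fixed_cost_ms=5.0, per_item_ms=0.1,
                                arrival_rate_per_s=100.0)
    print(f"batch={n:>3}  ceiling={tput:8.1f} items/s  worst-case latency={latency:7.1f} ms")
```

Under these assumed numbers, larger batches raise the throughput ceiling because the fixed cost is shared across more items, while the worst-case latency grows with the time the first item spends waiting for the batch to fill.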

Why use it

Measuring both separately tells you which optimization actually helps your users
Throughput optimizations (batching, pipelining) are often cheap capacity wins
Latency optimizations (caching, CDN) directly improve perceived user experience

What it costs you

They trade off: batching for throughput adds latency; minimizing latency wastes throughput on per-request overhead
Averages lie about latency, so you must track tail percentiles (p99/p999) because that's what users feel
Optimizing the wrong metric burns engineering time without moving the number users care about

Where it shows up in our architectures

Stock Exchange (Matching Engine): Microsecond latency is the entire product; the design sacrifices almost everything else to shave tail latency off order matching
Apache Kafka: Built for throughput: batches and pipelines records, accepting some per-message latency to push millions of messages/sec
Netflix: CDN edges cut video start latency; the encoding pipeline is throughput-optimized batch work where latency doesn't matter

Gotchas

Never report latency as an average. p99 and p999 are what users actually feel; averages hide the tail where everyone gets angry.
Batching is the classic latency-for-throughput trade. Tune batch size and max-wait to the SLA, and don't make latency-sensitive requests wait for a batch to fill.
Little's Law (concurrency = throughput × latency) is the cheat sheet: if latency rises and throughput is flat, in-flight requests are piling up, which is where queues back up and systems fall over.
Higher throughput doesn't imply lower latency. A system can be saturated (max throughput) while every individual request crawls.
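To put numbers on the Little's Law gotcha above, here is a small worked example in Python. The traffic figures are hypothetical and `in_flight` is an illustrative helper, not part of any library.

```python
def in_flight(throughput_per_s, latency_ms):
    """Little's Law: average concurrency = throughput x latency."""
    return throughput_per_s * latency_ms / 1000

# Hypothetical service handling a steady 500 requests/sec.
healthy = in_flight(500, 200)     # 200 ms per request
degraded = in_flight(500, 1000)   # latency rises to 1 s, throughput stays flat

print(f"healthy:  {healthy:.0f} requests in flight")
print(f"degraded: {degraded:.0f} requests in flight")
```

With throughput held flat, a fivefold rise in latency means five times as many requests in flight. If the system has a fixed limit on concurrent work (say, a worker pool smaller than the degraded figure), the excess has to wait in a queue, which is the pile-up described above.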

Interview angle

Latency vs throughput questions test whether you know what you're actually optimizing for. The first thing to ask is 'what does the user feel?' because if it's an interactive request, latency matters, and if it's a batch pipeline, throughput matters. Show you measure latency at p99, not the average, because tail latency is what real users experience. Candidates lose points by conflating throughput with performance and proposing batching for a real-time user-facing request where it would make response time worse.
