Reqflow
Performance·3 min read

Latency vs Throughput

Latency is how long one request takes; throughput is how many you handle per second. Optimizing one often hurts the other.

There is no universal winner here, only the right fit for a given situation. Interactive requests push the decision toward latency and bulk work pushes it toward throughput, which is exactly how this tradeoff shows up in real design questions.

First time reading this? Start here

Plain English: latency is the wait for a single request (how fast). Throughput is how many requests you handle per second (how much). They're different: a system can be high-throughput but high-latency (a batching pipeline, for example) or low-latency but low-throughput. Improving one often costs you some of the other.

What it is

Two distinct performance dimensions. Latency is the time to complete a single operation, end to end, measured in percentiles (p50, p99, p999), not averages. Throughput is the rate of operations the system sustains: requests/sec, MB/sec, messages/sec. They are related but distinct: a highway's speed limit sets latency (how long one trip takes); its number of lanes sets throughput (how many cars get through per hour). Adding lanes moves more cars without making any single trip faster.
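
To see why percentiles are preferred over averages, here is a minimal sketch using a nearest-rank percentile. The `percentile` helper and the latency samples are made up for illustration; they are not measurements from any real system.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[max(rank, 1) - 1]

# Hypothetical latencies in milliseconds: 98 fast requests and 2 very slow ones.
latencies_ms = [10.0] * 98 + [2000.0] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f} ms  p50={p50_ms:.1f} ms  p99={p99_ms:.1f} ms")
```

In this sample the mean is about 50 ms, a number no request actually experienced: the typical request took 10 ms and the slow tail took 2000 ms. The p50 and p99 values report both of those facts directly.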

The problem it solves

Forces you to be precise about what 'fast' means. A user staring at a spinner cares about latency. A batch pipeline processing a billion rows overnight cares about throughput. Conflating them leads to the wrong optimization: batching improves throughput but adds latency; handling every request on its own minimizes latency but spends capacity on per-request overhead.

How it works

Latency is reduced by caching, doing less work per request, moving computation closer to the user (CDN), and avoiding serial round-trips. Throughput is increased by parallelism, batching, pipelining, and adding nodes (horizontal scaling). The tension: batching N requests amortizes fixed costs (great for throughput) but each request now waits for the batch to fill (worse latency). Little's Law ties them together: concurrency = throughput × latency.
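
The batching tension and Little's Law can be made concrete with a toy model. This is a sketch under stated assumptions, not a benchmark: one serial worker, evenly spaced arrivals, a fixed cost per batch (`fixed_ms`) plus a cost per item (`per_item_ms`), and no queueing behind earlier batches. All names and numbers are illustrative.

```python
def batch_model(batch_size, fixed_ms, per_item_ms, arrival_per_s):
    """Return (capacity_per_s, mean_latency_ms) for one serial worker.

    Assumes evenly spaced arrivals and that each batch finishes before the next one fills.
    """
    service_ms = fixed_ms + batch_size * per_item_ms       # time to process one full batch
    capacity_per_s = batch_size / service_ms * 1000.0       # items the worker could complete per second
    # On average a request waits for half of the remaining batch to arrive.
    mean_fill_wait_ms = (batch_size - 1) / 2 / arrival_per_s * 1000.0
    return capacity_per_s, mean_fill_wait_ms + service_ms

def in_flight(throughput_per_s, latency_ms):
    """Little's Law: concurrency = throughput x latency (convert ms to seconds first)."""
    return throughput_per_s * latency_ms / 1000.0

ARRIVAL_PER_S = 100.0  # assumed steady request rate

results = {}
for n in (1, 10, 100):
    capacity, latency = batch_model(n, fixed_ms=5.0, per_item_ms=0.1, arrival_per_s=ARRIVAL_PER_S)
    concurrency = in_flight(ARRIVAL_PER_S, latency)
    results[n] = (capacity, latency, concurrency)
    print(f"batch={n:>3}  capacity={capacity:8.1f}/s  mean latency={latency:6.1f} ms  in flight={concurrency:5.2f}")
```

With these made-up numbers, capacity rises from roughly 196 to roughly 6,667 items/sec as the batch grows from 1 to 100, while mean latency rises from about 5 ms to about 510 ms and the number of requests in flight grows from about 0.5 to about 51. The fixed cost is amortized, and every request pays for it in waiting time.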

Why use it

  • Measuring both separately tells you which optimization actually helps your users
  • Throughput optimizations (batching, pipelining) are often cheap capacity wins
  • Latency optimizations (caching, CDN) directly improve perceived user experience

What it costs you

  • They trade off: batching for throughput adds latency; minimizing latency wastes throughput on per-request overhead
  • Averages lie about latency, so you must track tail percentiles (p99/p999): the slow tail is what users feel
  • Optimizing the wrong metric burns engineering time without moving the number users care about

Where it shows up in our architectures

  • Stock Exchange (Matching Engine)

    Microsecond latency is the entire product; the design sacrifices almost everything else to shave tail latency off order matching

  • Apache Kafka

    Built for throughput: batches and pipelines records, accepting some per-message latency to push millions of messages/sec

  • Netflix

    CDN edges cut video start latency; the encoding pipeline is throughput-optimized batch work where latency doesn't matter

Gotchas

  • Never report latency as an average. p99 and p999 are what users actually feel; averages hide the tail where everyone gets angry.
  • Batching is the classic latency-for-throughput trade. Tune batch size and max-wait to the SLA, and don't make latency-sensitive requests wait for a batch to fill.
  • Little's Law (concurrency = throughput × latency) is the cheat sheet: if latency rises and throughput is flat, in-flight requests are piling up, which is where queues back up and systems fall over.
  • Higher throughput doesn't imply lower latency. A system can be saturated (max throughput) while every individual request crawls.
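
The batch-size and max-wait advice above can be sketched as a flush policy: send a batch when it is full, or when its oldest item has waited long enough, whichever comes first. The function below is a hypothetical, deterministic simulation over arrival timestamps, not the implementation of any particular system; `plan_flushes`, `max_size`, and `max_wait_ms` are names invented for this example.

```python
def plan_flushes(arrivals_ms, max_size, max_wait_ms):
    """Return a list of (flush_time_ms, batch) pairs, where batch holds arrival times.

    A batch is flushed when it reaches max_size items or when its oldest
    item has waited max_wait_ms, whichever happens first.
    """
    flushes = []
    batch = []
    for t in sorted(arrivals_ms):
        # The pending batch hit its deadline before this arrival: flush it at the deadline.
        if batch and t > batch[0] + max_wait_ms:
            flushes.append((batch[0] + max_wait_ms, batch))
            batch = []
        batch.append(t)
        # The batch is full: flush immediately, at the arrival time of the last item.
        if len(batch) == max_size:
            flushes.append((t, batch))
            batch = []
    if batch:
        # Leftover items are flushed when the oldest one reaches its deadline.
        flushes.append((batch[0] + max_wait_ms, batch))
    return flushes

# Hypothetical arrival times in milliseconds: a burst, a trickle, then a lone request.
arrivals = [0, 1, 2, 3, 50, 51, 200]
flushes = plan_flushes(arrivals, max_size=4, max_wait_ms=20)
worst_wait_ms = max(flush_time - t for flush_time, batch in flushes for t in batch)

for flush_time, batch in flushes:
    print(f"flush at {flush_time} ms: {batch}")
print(f"worst wait: {worst_wait_ms} ms")
```

The burst fills a batch and flushes immediately, while the trickle and the lone request are released by the deadline, so no request waits longer than `max_wait_ms` before its batch is sent. Setting `max_wait_ms` from the latency SLA bounds the cost of batching when traffic is light.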

Interview angle

Latency vs throughput questions test whether you know what you're actually optimizing for. The first thing to ask is 'what does the user feel?' If it's an interactive request, latency matters; if it's a batch pipeline, throughput matters. Show that you measure latency at p99, not the average, because tail latency is what real users experience. Candidates lose points by conflating throughput with performance and proposing batching for a real-time user-facing request, where it would make response time worse.
