Operations · 3 min read

Auto-Scaling

Automatically add and remove instances based on load so you pay for what you need and survive spikes without a pager.

Auto-scaling watches load and adds servers when each one gets too busy, then removes them when traffic falls, so you pay for roughly what you use and stay responsive during spikes. The trade-off is that new servers take time to warm up, so plain auto-scaling reacts to load rather than predicting it.

First time reading this? Start here

Plain English: instead of manually adding servers when traffic rises and removing them when it falls, you set rules ('keep CPU around 60%') and the platform spins instances up and down for you. It saves money off-peak and absorbs spikes, as long as you've sized the limits and warm-up right.

Used in: Instagram Feed, Notification System, Ticketmaster (Seat Booking)

What it is

A control loop that adjusts the number of running instances (or container replicas) in response to observed load. It watches metrics (CPU, request rate, queue depth, custom signals), compares them to a target, and scales out (add instances) or in (remove instances) within configured min/max bounds.

The problem it solves

Traffic is rarely flat; it has daily peaks, weekly cycles, and unpredictable spikes. Provisioning for peak wastes money off-peak; provisioning for average means falling over during spikes. Auto-scaling matches capacity to demand automatically, controlling cost during quiet periods while absorbing surges without a human in the loop.

How it works

Target-tracking: set a target (e.g. 60% CPU or 1000 RPS/instance); the controller adds or removes instances to hold the metric near the target.
Step/threshold: define rules ('+2 instances if CPU > 80% for 5 min').
Scheduled: scale ahead of known events (Black Friday, market open).
Predictive: ML forecasts demand and pre-warms capacity.

New instances must pass health checks before the load balancer routes to them; scale-in respects cooldowns and connection draining to avoid killing in-flight work.
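
To make the target-tracking mode concrete, here is a minimal sketch of the proportional rule such a controller might apply on each evaluation. The formula matches the one documented for the Kubernetes Horizontal Pod Autoscaler; that other platforms use exactly this calculation is an assumption. The function name `desired_instances` and all the numbers are hypothetical, chosen only for illustration.

```python
import math


def desired_instances(current: int, metric: float, target: float,
                      min_instances: int, max_instances: int) -> int:
    """Return the instance count expected to bring `metric` back near `target`.

    `metric` is the observed per-instance average (e.g. CPU % or RPS/instance).
    Hypothetical helper for illustration, not any platform's actual API.
    """
    if current < 1 or target <= 0:
        raise ValueError("current must be >= 1 and target must be > 0")
    # Proportional rule: if each instance runs 1.5x over target, run 1.5x as many.
    raw = math.ceil(current * metric / target)
    # Stay inside the configured min/max bounds, whatever the metric says.
    return max(min_instances, min(max_instances, raw))


print(desired_instances(4, 90.0, 60.0, 2, 20))    # 6  (scale out)
print(desired_instances(10, 30.0, 60.0, 2, 20))   # 5  (scale in)
print(desired_instances(10, 300.0, 60.0, 2, 20))  # 20 (capped at max)
print(desired_instances(3, 5.0, 60.0, 2, 20))     # 2  (held at min)
```

The clamp on the last line is what keeps a runaway metric from scaling past the budget: the third call would ask for 50 instances without it.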

Why use it

Cost efficiency: pay for peak capacity only during the peak
Handles spikes and failures without a human reacting at 3am
Pairs naturally with horizontal scaling of stateless tiers and queue-based workloads

What it costs you

Reactive scaling lags the spike: new instances take time to boot and warm up, so you eat latency at the front edge
Bad metrics or thresholds cause thrashing (scale out, scale in, repeat) or runaway scaling that empties your budget
Only works for stateless or shardable tiers; adding instances can't scale out a single stateful primary database
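
The lag cost in the first point can be estimated with simple arithmetic. The sketch below assumes a sudden step in traffic and a fleet that stays at its old size until the new instances are warm; the function name and every number except the 1000 RPS/instance figure are hypothetical.

```python
def requests_over_capacity(spike_rps: float, instances: int,
                           per_instance_rps: float,
                           detect_seconds: float,
                           warmup_seconds: float) -> float:
    """Estimate requests arriving beyond fleet capacity before new instances help.

    Assumes (for illustration) a step change in traffic and no extra capacity
    until detection plus warm-up has fully elapsed.
    """
    capacity_rps = instances * per_instance_rps
    # Only traffic above what the current fleet can serve counts as excess.
    excess_rps = max(0.0, spike_rps - capacity_rps)
    return excess_rps * (detect_seconds + warmup_seconds)


# 4 instances at 1000 RPS each, spike to 6000 RPS,
# 60 s for the metric to trip, 120 s to boot and warm up.
print(requests_over_capacity(6000.0, 4, 1000.0, 60.0, 120.0))  # 360000.0
```

Under these assumptions, 360,000 requests arrive above capacity and must be queued, served slowly, or shed before the first new instance takes traffic.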

Where it shows up in our architectures

Instagram Feed →
The stateless feed-serving tier auto-scales on request rate to ride out daily traffic peaks
Notification System →
Worker pools auto-scale on queue depth: a burst of send requests spins up more workers, then scales back down
Ticketmaster (Seat Booking) →
Scheduled and predictive scaling ahead of a hot on-sale; the virtual queue smooths the spike auto-scaling can't react to fast enough
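
For the queue-driven worker pool, one common way to size the fleet is by backlog: pick how quickly the queue should drain and work backwards to a worker count. This is a sketch under assumed numbers, not the Notification System's actual policy; `workers_for_backlog`, the throughput figure, and the drain target are all hypothetical.

```python
import math


def workers_for_backlog(queue_depth: int, msgs_per_worker_per_sec: float,
                        drain_target_sec: float,
                        min_workers: int, max_workers: int) -> int:
    """Return how many workers are needed to drain the backlog in time.

    Hypothetical helper: assumes each worker processes messages at a
    steady, known rate.
    """
    if msgs_per_worker_per_sec <= 0 or drain_target_sec <= 0:
        raise ValueError("rate and drain target must be positive")
    # Messages one worker can clear within the drain target.
    per_worker_capacity = msgs_per_worker_per_sec * drain_target_sec
    needed = math.ceil(queue_depth / per_worker_capacity)
    # Keep a floor for baseline traffic and a ceiling for cost control.
    return max(min_workers, min(max_workers, needed))


print(workers_for_backlog(12_000, 50.0, 60.0, 1, 50))     # 4
print(workers_for_backlog(0, 50.0, 60.0, 1, 50))          # 1  (floor)
print(workers_for_backlog(1_000_000, 50.0, 60.0, 1, 50))  # 50 (ceiling)
```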

Gotchas

Auto-scaling is reactive and instances aren't instant. For known spikes (on-sales, launches) use scheduled or predictive scaling, or a virtual queue, instead of waiting for the metric to trip.
Scale on the right metric. CPU is a poor proxy for I/O-bound services; queue depth or request latency is often the real signal.
Set a sane max. Without an upper bound, a traffic spike (or a retry storm) can scale you into a five-figure bill before anyone notices.
Cold starts and connection draining matter: new instances need warm-up before taking full traffic, and scale-in must drain in-flight requests or you drop user work.
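
One way to damp thrashing is to scale out immediately but scale in only once the lower recommendation has held for a while. The sketch below keeps the highest recommendation seen within a time window, which is similar in spirit to the scale-down stabilization window in the Kubernetes Horizontal Pod Autoscaler. The class name `ScaleInStabilizer` and the 300-second window are assumptions for illustration.

```python
from collections import deque


class ScaleInStabilizer:
    """Apply scale-out at once; delay scale-in until it has held for a window."""

    def __init__(self, window_seconds: float):
        self.window = window_seconds
        self.history = deque()  # (timestamp, recommended instance count)

    def decide(self, now: float, recommended: int) -> int:
        self.history.append((now, recommended))
        # Forget recommendations that are older than the window.
        while self.history[0][0] < now - self.window:
            self.history.popleft()
        # Highest recent recommendation wins: a new high applies immediately,
        # a lower value applies only after every higher one has aged out.
        return max(count for _, count in self.history)


s = ScaleInStabilizer(window_seconds=300)
print(s.decide(0, 4))    # 4
print(s.decide(60, 8))   # 8  (scale out immediately)
print(s.decide(120, 4))  # 8  (dip ignored, spike still in window)
print(s.decide(361, 4))  # 4  (spike aged out, scale in)
```

The cost of this design is that capacity stays high for up to one window after a spike ends, which is usually cheaper than repeatedly killing and rebooting instances.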

When this went wrong in production

Amazon Prime Day collapses under its own launch load · 2018

Prime Day 2018 opened with Amazon's own landing page returning errors for the first 90 minutes.

Amazon had anticipated the Prime Day 2018 launch spike and prepared for it, but not quite enough. The front-end tier scaled horizontally via auto-scaling groups. The recommendation service underneath did not: it depended on a Redis cluster sized for projected peak, not actual peak. The Redis cluster hit its connection limit within minutes of launch. Backend services queuing for Redis connections started timing out, and the front-end returned errors. The recommendation service's circuit breaker was supposed to fail open (show a degraded UI without personalization), but configuration drift meant it was set to fail closed instead. Customers saw Amazon's dog-photo error pages for 90 minutes. The lesson: auto-scaling the front-end while leaving stateful dependencies unscaled is the most common Prime-Day-class mistake. Circuit breakers also need to be exercised in production, not just configured and forgotten.

Interview angle

Auto-scaling comes up when you're discussing how to handle traffic spikes. The thing interviewers want to hear is that reactive scaling has a lag, so you need either predictive scaling or a queue buffer for spikes you can see coming. Show you know the right trigger metric for your workload: CPU for compute-bound services, queue depth for workers, request latency for user-facing APIs. Candidates lose points by saying 'it just scales automatically' without addressing boot time, warm-up, or what happens between the spike and the first new instance coming online.
