Automatically add and remove instances based on load so you pay for what you need and survive spikes without a pager.
Auto-scaling watches load and adds servers when each one gets too busy, then removes them when traffic falls, so you pay for roughly what you use and stay responsive during spikes. The trade-off is that new servers take time to boot and warm up, so by default auto-scaling reacts to load after it arrives rather than predicting it.
Plain English: instead of manually adding servers when traffic rises and removing them when it falls, you set rules ('keep CPU around 60%') and the platform spins instances up and down for you. Saves money off-peak and absorbs spikes, as long as you've sized the limits and warm-up right.
A control loop that adjusts the number of running instances (or container replicas) in response to observed load. It watches metrics (CPU, request rate, queue depth, custom signals), compares them to a target, and scales out (adds instances) or in (removes instances) within configured min/max bounds.
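One iteration of that loop can be sketched as a proportional calculation: scale the current instance count by the ratio of observed metric to target, then clamp to the bounds. This is a minimal illustration, not any platform's actual controller; the function and parameter names are hypothetical, and the proportional rule is an assumption, similar in shape to the one documented for the Kubernetes Horizontal Pod Autoscaler.

```python
import math


def desired_instances(current: int, observed: float, target: float,
                      min_instances: int, max_instances: int) -> int:
    """Return the instance count for one pass of a proportional scaling loop.

    All names are illustrative. `observed` and `target` are per-instance
    values of the same metric (e.g. average CPU %).
    """
    if current < 1 or target <= 0:
        raise ValueError("current must be >= 1 and target must be > 0")
    # If the metric is 1.5x the target, we need roughly 1.5x the instances.
    raw = math.ceil(current * observed / target)
    # Never go below the floor or above the ceiling, whatever the metric says.
    return max(min_instances, min(max_instances, raw))


print(desired_instances(4, 90.0, 60.0, 2, 10))   # scale out: 6
print(desired_instances(4, 30.0, 60.0, 2, 10))   # scale in: 2
print(desired_instances(8, 120.0, 60.0, 2, 10))  # capped at max: 10
```

The clamp is what makes the min/max bounds matter: a max set too low silently caps capacity during a spike, and a min set too low leaves little headroom while new instances boot.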
Traffic is rarely flat; it has daily peaks, weekly cycles, and unpredictable spikes. Provisioning for peak wastes money off-peak; provisioning for average means falling over during spikes. Auto-scaling matches capacity to demand automatically, controlling cost during quiet periods while absorbing surges without a human in the loop.
Target-tracking: set a target (e.g. 60% CPU or 1000 RPS/instance); the controller adds or removes instances to hold the metric near the target.
Step/threshold: define rules ('+2 instances if CPU > 80% for 5 min').
Scheduled: scale ahead of known events (Black Friday, market open).
Predictive: ML forecasts demand and pre-warms capacity.
New instances must pass health checks before the load balancer routes to them; scale-in respects cooldowns and connection draining to avoid killing in-flight work.
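The step rule quoted above ('+2 instances if CPU > 80% for 5 min') could be sketched roughly as follows. This is an assumption-laden illustration: the class name, the one-sample-per-minute cadence, and the 3-sample cooldown after a scale-out are all hypothetical choices, not taken from any specific platform.

```python
from dataclasses import dataclass, field


@dataclass
class StepScaler:
    """Illustrative step policy: add `step` instances after a sustained breach."""
    threshold: float = 80.0    # CPU % that counts as a breach
    breach_periods: int = 5    # consecutive samples required (1 sample/min assumed)
    step: int = 2              # instances added per trigger
    cooldown_periods: int = 3  # samples ignored after scaling (assumed value)
    max_instances: int = 10
    _breaches: int = field(default=0, init=False)
    _cooldown: int = field(default=0, init=False)

    def observe(self, cpu: float, instances: int) -> int:
        """Feed one metric sample; return the desired instance count."""
        if self._cooldown > 0:
            # Give the last scale-out time to take effect before judging again.
            self._cooldown -= 1
            self._breaches = 0
            return instances
        # Count consecutive breaches; any sample at or below threshold resets.
        self._breaches = self._breaches + 1 if cpu > self.threshold else 0
        if self._breaches >= self.breach_periods:
            self._breaches = 0
            self._cooldown = self.cooldown_periods
            return min(self.max_instances, instances + self.step)
        return instances


scaler = StepScaler()
count = 4
for cpu in [85, 88, 91, 90, 93]:  # five consecutive minutes above 80%
    count = scaler.observe(cpu, count)
print(count)  # 6
```

Requiring a sustained breach keeps a single noisy sample from triggering a scale-out, and the cooldown keeps the policy from stacking further steps before the new instances have started absorbing load.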
The stateless feed-serving tier auto-scales on request rate to ride out daily traffic peaks
Worker pools auto-scale on queue depth: a burst of send requests spins up more workers, then scales back down
Scheduled and predictive scaling ahead of a hot on-sale; the virtual queue smooths the spike auto-scaling can't react to fast enough
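For the queue-depth case above, one common way to size a worker pool is to ask how many workers would drain the current backlog within a chosen time window. The sketch below assumes a known, steady per-worker throughput; the function name and every parameter are hypothetical.

```python
import math


def workers_for_backlog(queue_depth: int, per_worker_rate: float,
                        drain_seconds: float,
                        min_workers: int, max_workers: int) -> int:
    """Workers needed to drain `queue_depth` messages within `drain_seconds`.

    `per_worker_rate` is messages per second per worker (assumed constant).
    """
    if per_worker_rate <= 0 or drain_seconds <= 0:
        raise ValueError("per_worker_rate and drain_seconds must be > 0")
    # Messages a single worker can clear inside the drain window.
    capacity_per_worker = per_worker_rate * drain_seconds
    needed = math.ceil(queue_depth / capacity_per_worker)
    # Keep at least the floor running and never exceed the ceiling.
    return max(min_workers, min(max_workers, needed))


# 12,000 queued sends, 20 msg/s per worker, drain within 60 s
print(workers_for_backlog(12_000, 20.0, 60.0, 1, 50))  # 10
```

Queue depth suits workers better than CPU because it measures the work still waiting, not just how busy the current workers are.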
Prime Day 2018 opened with Amazon's own landing page returning errors for the first 90 minutes.
Prime Day 2018 launched with a load spike Amazon had anticipated but had not fully prepared for. The front-end tier scaled horizontally via auto-scaling groups. The recommendation service underneath did not: it depended on a Redis cluster sized for projected peak, not actual peak. The Redis cluster hit its connection limit within minutes of launch. Backend services queuing for Redis connections started timing out, and the front-end returned errors. The recommendation service's circuit breaker was supposed to fail open (show a degraded UI without personalization), but configuration drift meant it was set to fail closed instead. Customers saw Amazon.com's dog-photo error pages for 90 minutes. The lesson: auto-scaling the front-end while leaving stateful dependencies unscaled is the most common Prime-Day-class mistake. Circuit breakers also need to be exercised in production, not just configured and forgotten.
Auto-scaling comes up when you're discussing how to handle traffic spikes. The thing interviewers want to hear is that reactive scaling has a lag, so you need something to cover the gap: scheduled or predictive scaling for spikes you can see coming, or a queue buffer to absorb the burst while capacity catches up. Show you know the right trigger metric for your workload: CPU for compute-bound, queue depth for workers, request latency for user-facing APIs. Candidates lose points by saying 'it just scales automatically' without addressing boot time, warm-up, or what happens between the spike and the first new instance coming online.
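A back-of-the-envelope calculation makes the lag concrete. The sketch below estimates how many requests arrive beyond capacity between the start of a spike and the first new instance serving traffic; the function name and all numbers are made up for illustration, and it assumes the spike is a flat step and no capacity is added until the whole lag has elapsed.

```python
def backlog_during_lag(spike_rps: float, capacity_rps: float,
                       detect_s: float, boot_s: float, warmup_s: float):
    """Return (lag_seconds, excess_requests) for a step-shaped spike.

    Lag = time to detect the breach + instance boot time + warm-up time.
    Excess requests must be queued, shed, or served slowly during that lag.
    """
    lag = detect_s + boot_s + warmup_s
    # Only traffic above existing capacity contributes to the shortfall.
    excess_rps = max(0.0, spike_rps - capacity_rps)
    return lag, excess_rps * lag


lag, excess = backlog_during_lag(
    spike_rps=5000, capacity_rps=3000,  # hypothetical traffic and capacity
    detect_s=60, boot_s=90, warmup_s=30,
)
print(f"{lag:.0f} s of lag, about {excess:,.0f} requests over capacity")
```

With these example numbers the gap is three minutes and roughly 360,000 requests, which is the quantity a queue buffer, headroom in the minimum instance count, or pre-scaling has to cover.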