Standalone pages for foundational concepts that show up across multiple systems. Read these once and you'll recognize the pattern everywhere.
When a network partition happens, you have to pick: consistency or availability. You can't have both.
Require a majority of nodes (⌊N/2⌋+1 out of N) to agree on every operation, so the system stays consistent even when a minority of nodes is down.
One deployable unit that's simple but couples everything, vs many small services that scale teams but cost you a distributed system.
Pick exactly one node to coordinate, and re-pick safely when it dies, without ever ending up with two leaders.
Services react to events others emit instead of calling each other directly: loose coupling at the cost of harder reasoning.
Replace a cross-service ACID transaction with a sequence of local transactions, plus compensating actions that undo the completed steps on failure.
Pick relational when you need joins and transactions; pick a NoSQL store when you need the specific scaling property it was built for.
Pre-computed lookup structures that turn O(n) table scans into O(log n) or O(1) lookups.
Split the write model from the read model (CQRS) and store state as a log of events instead of current values (event sourcing).
A probabilistic 'have I seen this?' check that uses very little memory at the cost of occasional false positives (but never false negatives).
Spread incoming requests across a pool of identical servers so no single one melts.
Store the answer to expensive questions so you don't pay to compute it again.
Buy a bigger machine (simple, capped, a single point of failure) or add more machines (no hard ceiling, but now you're a distributed system).
Latency is how long one request takes; throughput is how many you handle per second. Optimizing one often hurts the other.
How a cache decides what to throw out (LRU/LFU/TTL) and how writes propagate to the backing store (write-through/write-back/write-around).
Stop calling a failing dependency before it takes you down with it.
Keep copies of your data on multiple nodes so you don't lose it (or your ability to serve) when one dies.
Periodic 'I'm alive' signals between nodes so failures are detected within seconds, not minutes.
An operation you can safely apply more than once and get the same result: the foundation of every retryable system.
Keep spare capacity standing by and switch to it automatically when something dies, so a component failure isn't an outage.
Deliberately inject failures into production to prove your resilience works before a real outage tests it for you.
A single front door for many backend services that handles auth, rate limiting, routing, and observability in one place.
Serve static and semi-static content from edge servers physically close to the user.
The distributed phone book that turns hostnames into IP addresses.
Servers that sit between clients and servers and relay their traffic: a forward proxy acts on behalf of clients, a reverse proxy on behalf of servers.
Three ways to deliver server-pushed updates; pick based on direction, scale, and infra constraints.
Three API styles: REST is the universal default, gRPC is fast binary RPC for service-to-service, GraphQL lets clients ask for exactly the data they want.
Decouple producers from consumers with a durable buffer so spikes get absorbed and slow work happens asynchronously.
Automatically add and remove instances based on load so you pay for what you need and survive spikes without a pager.
The three pillars (logs, metrics, traces) that let you ask questions you didn't anticipate about a live system.
Stitch one request's journey across many services into a single timeline so you can see exactly where the time went.
Authentication proves who you are; authorization decides what you're allowed to do. Different problems, often confused.
Encrypt data on the wire (TLS) and on disk (encryption at rest) so neither a network eavesdropper nor a stolen drive yields plaintext.
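
As a small worked example of one entry above, the majority-quorum rule (N/2+1 nodes must agree, using integer division) can be sketched in Python as follows. The function names `quorum_size`, `tolerated_failures`, and `has_quorum` are made up for this illustration and do not come from any particular system; real consensus protocols layer much more on top of this arithmetic.

```python
def quorum_size(n: int) -> int:
    """Smallest number of nodes that forms a strict majority of n nodes."""
    if n < 1:
        raise ValueError("a cluster needs at least one node")
    return n // 2 + 1  # integer division: floor(n / 2) + 1


def tolerated_failures(n: int) -> int:
    """How many nodes can be down while a majority can still be reached."""
    return n - quorum_size(n)


def has_quorum(acks: int, n: int) -> bool:
    """True if `acks` acknowledgements are enough to commit in an n-node cluster."""
    return acks >= quorum_size(n)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"nodes={n} quorum={quorum_size(n)} tolerated_failures={tolerated_failures(n)}")
```

Under this rule a 4-node cluster tolerates the same single failure as a 3-node cluster, which is one reason odd cluster sizes are a common choice.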