Core ideas, explained on their own

CAP Theorem

When a network partition happens, you have to choose between consistency and availability. You can't have both.

Quorum

Require a majority of nodes (floor(N/2) + 1 out of N) to agree on every operation, so the system stays consistent even when a minority of nodes are down.
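
As a rough illustration, the majority size can be computed with integer division. The function names here are hypothetical, and the read/write overlap rule (R + W > N) is the common Dynamo-style formulation, which the summary above does not spell out.

```python
def majority(n: int) -> int:
    """Smallest number of nodes that forms a strict majority of n."""
    if n < 1:
        raise ValueError("cluster must have at least one node")
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """How many nodes can be down while a majority is still reachable."""
    return n - majority(n)


def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True if every read quorum shares at least one node with every write quorum."""
    return r + w > n
```

Note that a 4-node cluster needs 3 votes and tolerates only 1 failure, the same as a 3-node cluster, which is one reason odd cluster sizes are popular.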

Monolith vs Microservices

One deployable unit that's simple but couples everything, vs many small services that scale teams but cost you a distributed system.

Leader Election

Pick exactly one node to coordinate, and re-pick safely when it dies, without ever ending up with two leaders.
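
One common way to approximate this is a time-limited lease with a term number that increases on every takeover. The sketch below is a single-process toy: `LeaseStore` is a hypothetical stand-in for a strongly consistent store, and time is passed in explicitly so the behaviour is easy to follow. Real systems also have to deal with clock skew and network delays, which this ignores.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: str
    term: int
    expires_at: float


class LeaseStore:
    """In-memory stand-in for a strongly consistent store (hypothetical)."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self.lease: Optional[Lease] = None

    def try_acquire(self, node: str, now: float) -> Optional[int]:
        """Return the term if `node` holds the lease after this call, else None."""
        current = self.lease
        if current is not None and now < current.expires_at:
            if current.holder != node:
                return None  # someone else still holds a live lease
            current.expires_at = now + self.ttl  # renewal keeps the same term
            return current.term
        # No lease yet, or it expired: start a new term so that writes tagged
        # with an older term can be rejected downstream (fencing).
        term = 1 if current is None else current.term + 1
        self.lease = Lease(holder=node, term=term, expires_at=now + self.ttl)
        return term
```

The term doubles as a fencing token: a deposed leader that wakes up late still carries its old, smaller term, so downstream services can refuse its requests.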

Event-Driven Architecture

Services react to events others emit instead of calling each other directly: loose coupling at the cost of harder reasoning.

Saga Pattern (Distributed Transactions)

Replace a cross-service ACID transaction with a sequence of local transactions plus compensating actions to undo on failure.
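
A minimal orchestration-style sketch of that idea, assuming each step is a pair of plain callables (the `run_saga` name and the step shape are illustrative, not from any particular framework):

```python
from typing import Callable, List, Tuple

# Each step pairs a local transaction with the action that undoes it.
Step = Tuple[Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> bool:
    """Run actions in order; on failure, compensate completed steps in reverse."""
    completed: List[Callable[[], None]] = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True
```

In practice the compensations themselves can fail, so they usually need to be idempotent and retried; the sketch leaves that out.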

Sharding

Consistent Hashing

Distribute keys across N nodes such that adding or removing a node only reshuffles ~1/N of the keys.
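
A small hash ring makes the property concrete. This is a sketch under assumptions: the `HashRing` class, the use of SHA-256, and the default of 100 virtual nodes per server are all illustrative choices rather than anything the summary prescribes.

```python
import bisect
import hashlib
from typing import Dict, List


def _hash(value: str) -> int:
    # Stable 64-bit hash; Python's built-in hash() is randomized per process.
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes: List[str], vnodes: int = 100) -> None:
        self.vnodes = vnodes
        self._points: List[int] = []      # sorted positions on the ring
        self._owner: Dict[int, str] = {}  # position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key: str) -> str:
        # First ring position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]
```

When a fifth node joins a four-node ring, the only keys that change owner are the ones the new node takes over, roughly a fifth of them.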

Sharding / Data Partitioning

Split one big dataset across N smaller stores so each one stays manageable.
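
The simplest routing scheme is hash-modulo: hash the key and take the remainder by the shard count. The sketch below uses hypothetical helper names and also measures what that simplicity costs when the shard count changes.

```python
import hashlib
from typing import List


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index with a stable hash (not Python's hash())."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def moved_fraction(keys: List[str], old_shards: int, new_shards: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(1 for k in keys if shard_for(k, old_shards) != shard_for(k, new_shards))
    return moved / len(keys)
```

With modulo routing, going from 4 to 5 shards relocates most keys, which is why resharding is usually planned well ahead or handled with a scheme that moves fewer keys.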

Data

SQL vs NoSQL

Pick relational when you need joins and transactions; pick a NoSQL store when you need the specific scaling property it's built for.

Database Indexes

Pre-computed lookup structures that turn O(n) table scans into O(log n) or O(1) lookups.
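
To make the trade concrete, here is a toy hash index over a list of rows, next to the full scan it replaces. The row shape and function names are made up for illustration; real databases more often use B-trees, which give the O(log n) case and also support range queries.

```python
from collections import defaultdict
from typing import Any, Dict, List

Row = Dict[str, Any]


def scan(rows: List[Row], column: str, value: Any) -> List[Row]:
    """O(n): inspect every row."""
    return [row for row in rows if row[column] == value]


def build_index(rows: List[Row], column: str) -> Dict[Any, List[int]]:
    """Pay O(n) once up front to map each value to the positions holding it."""
    index: Dict[Any, List[int]] = defaultdict(list)
    for position, row in enumerate(rows):
        index[row[column]].append(position)
    return dict(index)


def lookup(rows: List[Row], index: Dict[Any, List[int]], value: Any) -> List[Row]:
    """O(1) on average: jump straight to the matching positions."""
    return [rows[i] for i in index.get(value, [])]
```

The index has to be updated on every write to the indexed column, which is the cost you pay for faster reads.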

CQRS & Event Sourcing

Split the write model from the read model (CQRS) and store state as a log of events instead of current values (event sourcing).
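
The event-sourcing half can be sketched as a fold over the event log. The bank-account events below are a made-up example, and the `replay` function plays the part of a very small read model.

```python
from dataclasses import dataclass
from typing import Iterable, Union


@dataclass(frozen=True)
class Deposited:
    amount: int


@dataclass(frozen=True)
class Withdrawn:
    amount: int


Event = Union[Deposited, Withdrawn]


def replay(events: Iterable[Event]) -> int:
    """Derive the current balance by applying every event in order."""
    balance = 0
    for event in events:
        if isinstance(event, Deposited):
            balance += event.amount
        elif isinstance(event, Withdrawn):
            balance -= event.amount
        else:
            raise TypeError(f"unknown event: {event!r}")
    return balance
```

Because the log is the source of truth, replaying a prefix of it gives the state at any earlier point in time.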

Performance

Bloom Filter

A probabilistic 'have I seen this?' check that uses tiny memory at the cost of occasional false positives.
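
A minimal sketch of the structure, assuming a fixed-size bit array and a few hash positions derived from SHA-256 with different seeds. The sizes are arbitrary defaults, not tuned values.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # One bit position per seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        """False means definitely absent; True means probably present."""
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))
```

Items that were added always report `True`; the only possible error is a `True` for something never added, which becomes more likely as the bit array fills up.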

Load Balancing

Spread incoming requests across a pool of identical servers so no single one melts.
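
Round-robin is the simplest strategy and can be sketched in a few lines. The class name is hypothetical, and real balancers usually add health checks and weighting on top.

```python
import itertools
from typing import Iterable


class RoundRobinBalancer:
    def __init__(self, servers: Iterable[str]) -> None:
        pool = list(servers)
        if not pool:
            raise ValueError("need at least one server")
        self._cycle = itertools.cycle(pool)

    def pick(self) -> str:
        """Return the next server in rotation."""
        return next(self._cycle)
```

Each call to `pick()` hands back the next server in order and wraps around at the end of the pool.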

Caching

Store the answer to expensive questions so you don't pay to compute it again.

Vertical vs Horizontal Scaling

Buy a bigger machine (simple, capped, SPOF) or add more machines (unlimited, but now you're a distributed system).

Latency vs Throughput

Latency is how long one request takes; throughput is how many you handle per second. Optimizing one often hurts the other.

Cache Eviction & Write Policies

How a cache decides what to throw out (LRU/LFU/TTL) and how writes propagate to the backing store (through/back/around).
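
As one example of an eviction policy, LRU can be sketched with `collections.OrderedDict`, which keeps keys in the order they were last touched. The `LRUCache` class is illustrative and covers eviction only, not write policies.

```python
from collections import OrderedDict
from typing import Any, Hashable


class LRUCache:
    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._data: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable, default: Any = None) -> Any:
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key: Hashable, value: Any) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # drop the least recently used entry
```

Reading a key counts as a use, so a frequently read entry survives even if it was written long ago.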

Reliability

Circuit Breaker

Stop calling a failing dependency before it takes you down with it.
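
A bare-bones version of the pattern might look like the following. The thresholds, the `CircuitOpenError` name, and the injectable `clock` parameter are assumptions made for the sketch; production libraries typically track failure rates over a window rather than a simple counter.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised instead of calling a dependency that is marked unhealthy."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency marked unhealthy")
            # Timeout elapsed: half-open, so let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None  # a success closes the circuit
        return result
```

While the circuit is open, callers fail fast with `CircuitOpenError` instead of piling up on timeouts.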

Replication

Keep copies of your data on multiple nodes so you don't lose it (or your ability to serve) when one dies.

Heartbeat

Periodic 'I'm alive' signals between nodes so failures are detected within seconds, not minutes.
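
On the receiving side, failure detection reduces to remembering when each node was last heard from. The monitor below is a hypothetical sketch that takes timestamps as arguments instead of reading a clock, and it uses a fixed timeout rather than an adaptive detector.

```python
from typing import Dict, List


class HeartbeatMonitor:
    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_seen: Dict[str, float] = {}

    def record(self, node: str, now: float) -> None:
        """Note that a heartbeat from `node` arrived at time `now`."""
        self.last_seen[node] = now

    def suspected_dead(self, now: float) -> List[str]:
        """Nodes whose last heartbeat is older than the timeout."""
        return sorted(n for n, seen in self.last_seen.items() if now - seen > self.timeout)
```

A missed heartbeat only proves silence, not death, so the timeout trades detection speed against false alarms from slow networks.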

Idempotency

An operation you can safely apply more than once and get the same result: the foundation of every retryable system.
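
A common way to make a non-idempotent operation such as "charge the card" safe to retry is an idempotency key supplied by the client. The service below is a made-up in-memory sketch; a real one would persist the keys and expire them.

```python
from typing import Dict


class PaymentService:
    def __init__(self) -> None:
        self.total_charged = 0
        self._results: Dict[str, Dict[str, int]] = {}  # idempotency key -> first result

    def charge(self, idempotency_key: str, amount: int) -> Dict[str, int]:
        """Apply the charge once per key; repeats return the stored result."""
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.total_charged += amount
        result = {"charged": amount, "total": self.total_charged}
        self._results[idempotency_key] = result
        return result
```

A client that times out can resend the same request with the same key and be sure the customer is charged only once.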

Failover & Redundancy

Keep spare capacity standing by and switch to it automatically when something dies, so a component failure isn't an outage.

Chaos Engineering

Deliberately inject failures into production to prove your resilience works before a real outage tests it for you.

Networking

API Gateway

A single front door for many backend services that handles auth, rate limiting, routing, and observability in one place.

CDN (Content Delivery Network)

Serve static and semi-static content from edge servers physically close to the user.

DNS (Domain Name System)

The distributed phone book that turns hostnames into IP addresses.

Proxies (Forward & Reverse)

Servers that sit between two parties and intercept their traffic for some purpose.

Communication

Long-Polling vs WebSockets vs SSE

Three ways to deliver server-pushed updates; pick based on direction, scale, and infra constraints.

REST vs gRPC vs GraphQL

Three API styles: REST is the universal default, gRPC is fast binary RPC for service-to-service, GraphQL lets clients ask for exactly the data they want.

Message Queues

Decouple producers from consumers with a durable buffer so spikes get absorbed and slow work happens asynchronously.

Operations

Auto-Scaling

Automatically add and remove instances based on load so you pay for what you need and survive spikes without a pager.
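
One simple scaling rule is proportional: scale the instance count by the ratio of the observed metric to its target, then clamp to a configured range. The function below is a sketch of that rule with hypothetical parameter names; it resembles the formula Kubernetes documents for its Horizontal Pod Autoscaler but leaves out cooldowns and tolerance bands.

```python
import math


def desired_replicas(current: int, metric: float, target: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Proportional scaling: ceil(current * metric / target), clamped to bounds."""
    if current < 1 or target <= 0:
        raise ValueError("current must be >= 1 and target must be > 0")
    desired = math.ceil(current * metric / target)
    return max(min_replicas, min(max_replicas, desired))
```

For example, 4 instances running at 90% CPU against a 60% target would be scaled to 6.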

Observability (Logs, Metrics, Traces)

The three pillars (logs, metrics, traces) that let you ask questions you didn't anticipate about a live system.

Distributed Tracing

Stitch one request's journey across many services into a single timeline so you can see exactly where the time went.

Security

Authentication vs Authorization

Authentication proves who you are; authorization decides what you're allowed to do. Different problems, often confused.
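
The split is easier to see when the two checks are separate functions. Everything below is illustrative: the in-memory user table, the role names, and the permission strings are invented, and the PBKDF2 iteration count is a placeholder rather than a recommendation.

```python
import hashlib
import hmac
import os
from typing import Dict, Set, Tuple

ROLE_PERMISSIONS: Dict[str, Set[str]] = {
    "viewer": {"report:read"},
    "admin": {"report:read", "report:delete"},
}
_users: Dict[str, Tuple[bytes, bytes, str]] = {}  # username -> (salt, digest, role)


def _digest(password: str, salt: bytes) -> bytes:
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)


def register(username: str, password: str, role: str) -> None:
    salt = os.urandom(16)
    _users[username] = (salt, _digest(password, salt), role)


def authenticate(username: str, password: str) -> bool:
    """Who are you? Check the password against the stored salted hash."""
    record = _users.get(username)
    if record is None:
        return False
    salt, expected, _role = record
    return hmac.compare_digest(expected, _digest(password, salt))


def authorize(username: str, permission: str) -> bool:
    """What may you do? Check the user's role against the permission table."""
    record = _users.get(username)
    if record is None:
        return False
    return permission in ROLE_PERMISSIONS.get(record[2], set())
```

A user can pass `authenticate` and still be refused by `authorize`, which is exactly the distinction the two terms draw.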

Encryption in Transit & at Rest

Encrypt data on the wire (in transit, typically with TLS) and on disk (at rest) so neither a network eavesdropper nor a stolen drive yields plaintext.