Performance·3 min read

Bloom Filter

A probabilistic 'have I seen this?' check that uses tiny memory at the cost of occasional false positives.

Try it

Add a word (it flips a few bits), then ask if a word is present.

Add

Query

A Bloom filter answers "have I seen this?" in a tiny amount of memory. A 0 bit proves a thing is absent; all-1 bits only mean it is probably present, since two different words can set the same bits. You use it when a cheap "probably yes" is fine because you will double-check after.

First time reading this? Start here

Plain English: a tiny memory structure that answers 'have I seen this before?' very fast. It works like a bouncer with a guest list who can say 'definitely not on the list' for sure, but only 'maybe' when he thinks you are. So it says either 'definitely no' or 'probably yes,' and you use it when 'probably yes' is okay because you'll double-check after.

Used in:Search Engine

What it is

A bit array plus several hash functions. To add an item, hash it K times and set those K bits. To check membership, hash it the same way: if any of those K bits is 0, the item is definitely NOT in the set. If all K are 1, it's probably in the set (with a tunable false-positive rate).

The problem it solves

Some applications need to check 'have I seen this?' billions of times against a huge set. Storing the full set in a hash table costs gigabytes; a database lookup is too slow. A Bloom filter answers the question in constant time and a few bits per item, which is perfect when false positives are tolerable but you can't afford the storage.

How it works

Allocate an array of M bits. For each item, compute K hash values modulo M and set those bits to 1. On lookup, recompute the same K hash values and check those bits. If any are 0 → definitely not in the set. If all are 1 → probably in the set (false positive rate is configurable via M and K).

Why use it

Large space savings: bits per item, not bytes
Constant-time lookup, no I/O
No false negatives, so if it says 'no', it really means no

What it costs you

False positives are real, so tune the bit array size to keep them tolerable
Items cannot be removed (variants like counting Bloom filters help)
Hashes must be independent; a bad hash choice destroys the math

Where it shows up in our architectures

Search Engine →
The crawler uses a Bloom filter to check 'have I already crawled this URL?' A false positive means we occasionally skip a URL (acceptable), while a false negative would mean re-crawling everything (unacceptable)

Gotchas

False positives are usually fine, while false negatives are usually fatal. Pick problems where the asymmetry matches.
Sizing matters. The false-positive rate is (1 - e^(-K*n/M))^K, so work this out for your expected n before deciding M and K.
Use a quality hash function (Murmur3 is the standard). Cryptographic hashes are overkill and slow.

Interview angle

Bloom filters come up when an interviewer asks how you'd handle a lookup that's expensive to miss on. The key signal they want is that you know the asymmetry: false positives are fine, false negatives are fatal, so you only use this when 'probably yes' is okay and you'll verify before acting. Candidates lose points by saying 'just cache the results' without explaining why a cache doesn't work at billions-of-entries scale.

Your notes

Private to you