A probabilistic 'have I seen this?' check that uses tiny memory at the cost of occasional false positives.
Add a word (it flips a few bits), then ask if a word is present.
A Bloom filter answers "have I seen this?" in a tiny amount of memory. A 0 bit proves a thing is absent; all-1 bits only mean it is probably present, since two different words can set the same bits. You use it when a cheap "probably yes" is fine because you will double-check after.
Plain English: a tiny memory structure that answers 'have I seen this before?' very fast. It works like a bouncer with a guest list who can say 'definitely not on the list' for sure, but only 'maybe' when he thinks you are. So it says either 'definitely no' or 'probably yes,' and you use it when 'probably yes' is okay because you'll double-check after.
A bit array plus several hash functions. To add an item, hash it K times and set those K bits. To check membership, hash it the same way: if any of those K bits is 0, the item is definitely NOT in the set. If all K are 1, it's probably in the set (with a tunable false-positive rate).
Some applications need to check 'have I seen this?' billions of times against a huge set. Storing the full set in a hash table costs gigabytes; a database lookup is too slow. A Bloom filter answers the question in constant time and a few bits per item, which is perfect when false positives are tolerable but you can't afford the storage.
Allocate an array of M bits. For each item, compute K hash values modulo M and set those bits to 1. On lookup, recompute the same K hash values and check those bits. If any are 0 → definitely not in the set. If all are 1 → probably in the set (false positive rate is configurable via M and K).
The crawler uses a Bloom filter to check 'have I already crawled this URL?' A false positive means we occasionally skip a URL (acceptable), while a false negative would mean re-crawling everything (unacceptable)
Bloom filters come up when an interviewer asks how you'd handle a lookup that's expensive to miss on. The key signal they want is that you know the asymmetry: false positives are fine, false negatives are fatal, so you only use this when 'probably yes' is okay and you'll verify before acting. Candidates lose points by saying 'just cache the results' without explaining why a cache doesn't work at billions-of-entries scale.