Question 1

What is a Bloom filter?

Accepted Answer

A Bloom filter is a space-efficient probabilistic data structure that tests whether an element is a member of a set. It can report 'possibly in set' or 'definitely not in set', using a bit array and several hash functions instead of storing the items themselves.

Question 2

Why does a Bloom filter never give false negatives?

Accepted Answer

When an item is inserted, all k of its hashed bit positions are set to 1 and are never cleared. So if any of an item's k bits is 0, the item was definitely never added — there are no false negatives.
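The insert and query steps described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the class name `BloomFilter`, the method names, and the use of salted SHA-256 to derive the k positions are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: m bits, k hashed positions per item."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, chosen for readability

    def _positions(self, item: str):
        # Assumption: salting SHA-256 with the index i stands in for
        # k separate hash functions. Simple, but not fast.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1  # bits are only ever set, never cleared

    def might_contain(self, item: str) -> bool:
        # A single 0 bit proves the item was never added.
        return all(self.bits[pos] for pos in self._positions(item))
```

Because `add` only ever sets bits, every item that was added finds all k of its positions set when queried later, which is why `might_contain` cannot return False for it.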

Question 3

What causes false positives in a Bloom filter?

Accepted Answer

A false positive happens when all k bits for a queried item are already set to 1 by other inserted items, even though the item itself was never added. The probability of this rises as the filter fills up.

Question 4

What is the false-positive probability formula?

Accepted Answer

After inserting n items into m bits with k hash functions, the approximate false-positive rate is (1 − e^(−kn/m))^k. The term e^(−kn/m) estimates the fraction of bits still 0.
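As a quick check of the formula, the sketch below evaluates it directly. The function name `false_positive_rate` is hypothetical, and the sample values (n = 1000, m = 9586, k = 7) are chosen only for illustration.

```python
import math


def false_positive_rate(n: int, m: int, k: int) -> float:
    """Approximate false-positive rate (1 - e^(-kn/m))^k."""
    zero_fraction = math.exp(-k * n / m)  # estimated fraction of bits still 0
    return (1.0 - zero_fraction) ** k


# Illustrative values: 1000 items, 9586 bits, 7 hash functions.
print(f"{false_positive_rate(1000, 9586, 7):.4f}")
```

With these values the result comes out close to 0.01, that is, a rate of about 1%.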

Question 5

How do I choose k, the number of hash functions?

Accepted Answer

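The optimal number of hash functions is k = (m/n) · ln 2, which minimises the false-positive rate; in practice k is rounded to a whole number. Too few hashes mean each query checks too few bits, so a chance match is likely; too many set so many bits per item that the array fills quickly.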
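A small helper makes the rule concrete. The name `optimal_k` is hypothetical, and rounding to the nearest integer with a floor of 1 is an assumption of this sketch, since k must be a whole number of hash functions.

```python
import math


def optimal_k(m: int, n: int) -> int:
    """Number of hash functions near the optimum k = (m/n) * ln 2."""
    k = (m / n) * math.log(2)
    # Assumption: round to the nearest integer and never go below 1.
    return max(1, round(k))


# 10 bits per item: 10 * ln 2 is about 6.93, which rounds to 7.
print(optimal_k(10_000, 1_000))
```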

Question 6

How big should the bit array m be?

Accepted Answer

For a target false-positive rate p and n items, the optimal size is m = −(n · ln p) / (ln 2)^2 bits, assuming k is also chosen optimally. That is roughly 9.6 bits per element for a 1% error rate.
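The sizing formula can be evaluated the same way. The name `optimal_m` is hypothetical, and rounding up to a whole number of bits is an assumption of this sketch.

```python
import math


def optimal_m(n: int, p: float) -> int:
    """Bits needed for n items at target false-positive rate p."""
    m = -n * math.log(p) / (math.log(2) ** 2)
    return math.ceil(m)  # assumption: round up to a whole bit


n, p = 1_000, 0.01
m = optimal_m(n, p)
print(m, f"{m / n:.2f} bits per item")
```

For 1000 items at a 1% target this gives about 9.6 bits per item, matching the figure quoted above.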

Question 7

Can you delete items from a Bloom filter?

Accepted Answer

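Not from a standard Bloom filter. A bit may be shared by several items, so clearing it to delete one item could make the filter report another inserted item as absent, which would be a false negative. A counting Bloom filter replaces each bit with a small counter that is incremented on insert and decremented on delete, which allows deletions at the cost of extra space.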
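The counter idea can be sketched as follows. The class and method names are hypothetical, and the counters here are unbounded Python integers; a real counting filter would typically use small fixed-width counters and would have to handle overflow.

```python
import hashlib


class CountingBloomFilter:
    """Sketch of a counting Bloom filter: one counter per slot instead of a bit."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.counts = [0] * m  # assumption: unbounded ints, no overflow handling

    def _positions(self, item: str):
        # Assumption: salted SHA-256 stands in for k hash functions.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.counts[pos] += 1

    def remove(self, item: str) -> None:
        # Refuse items that are definitely absent. Removing an item that was
        # never added but passes this check would still corrupt the counters.
        if not self.might_contain(item):
            raise KeyError(item)
        for pos in self._positions(item):
            self.counts[pos] -= 1

    def might_contain(self, item: str) -> bool:
        return all(self.counts[pos] > 0 for pos in self._positions(item))
```

Deleting an item only lowers the counters it raised, so slots shared with other items stay above zero and those items are still reported as possibly present.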

Question 8

Where are Bloom filters used in practice?

Accepted Answer

Databases like Cassandra and HBase use them to skip disk reads for missing keys, web browsers use them for malicious-URL checks, and CDNs and caches use them to avoid one-hit-wonder caching.

Question 9

How do real implementations get k independent hashes?

Accepted Answer

Rather than k separate hash functions, double hashing combines two base hashes as h_i(x) = (h1(x) + i · h2(x)) mod m for i = 0, …, k − 1, which behaves close to k independent hashes for filter purposes.
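The double-hashing scheme might look like this in Python. The function name, the choice of taking h1 and h2 from two halves of one SHA-256 digest, and forcing h2 to be odd are all assumptions of this sketch rather than part of the scheme itself.

```python
import hashlib


def double_hash_positions(item: str, k: int, m: int) -> list[int]:
    """Derive k positions from two base hashes: (h1 + i * h2) mod m."""
    digest = hashlib.sha256(item.encode()).digest()
    h1 = int.from_bytes(digest[:8], "big")
    # Assumption: force h2 odd so the stride is never 0; when m is a
    # power of two this also keeps the k positions distinct.
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % m for i in range(k)]


print(double_hash_positions("example", k=5, m=1024))
```

Only one digest is computed per item, however large k is, which is the main practical appeal of this approach.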

Question 10

What is the difference between measured and theoretical error?

Accepted Answer

The theoretical rate (1 − e^(−kn/m))^k assumes idealised uniform hashing. The measured rate in this simulation counts actual false positives over random test queries, so it fluctuates around the theory.
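The comparison can be reproduced with a short script. This is a stand-in written for illustration, not the simulation referred to above: the helper names, the salted SHA-256 hashing, the string keys, and the parameters are all assumptions.

```python
import hashlib
import math


def positions(item: str, k: int, m: int) -> list[int]:
    # Assumption: salted SHA-256 approximates k uniform, independent hashes.
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big") % m
        for i in range(k)
    ]


def measured_rate(n: int, m: int, k: int, trials: int) -> float:
    bits = bytearray(m)
    for j in range(n):
        for pos in positions(f"member-{j}", k, m):
            bits[pos] = 1
    # Probe keys were never inserted, so every hit is a false positive.
    hits = sum(
        all(bits[pos] for pos in positions(f"probe-{j}", k, m))
        for j in range(trials)
    )
    return hits / trials


n, m, k = 1_000, 9_586, 7
theory = (1 - math.exp(-k * n / m)) ** k
measured = measured_rate(n, m, k, trials=20_000)
print(f"theory={theory:.4f} measured={measured:.4f}")
```

The measured value depends on which keys are inserted and probed, so it lands near the theoretical figure rather than exactly on it; more probe queries narrow the gap.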
