Contents
raw book
bits (m)=?, k=? Fill: 0% FP≈0% visual bits: ?
Powered by BloomFilter.js

Introduction to Bloomfilters

Bloomfilters are probabilistic data structures designed to test set membership efficiently. They provide a clear answer when an element is not in a set, while allowing a small probability of false positives when reporting that an element is in the set. This balance makes them especially useful when memory constraints matter more than occasional uncertainty. A classical example is email filtering: before querying a blocklist database, a Bloomfilter can be used as a quick preliminary test to reject addresses that are definitely not listed.

1. The Basic Idea

A Bloomfilter uses a bit array of length \(m\) and \(k\) hash functions \(h_1, \ldots, h_k\).

The key question is: how large must \(m\) be, and how many hash functions \(k\) should be used, to achieve a low false positive probability for \(n\) inserted elements?

2. False Positive Probability

2.1 Occupancy of Bits

Each insertion sets \(k\) bits. After \(n\) insertions, there are \(kn\) bit assignments.

The probability that a given bit remains 0 after all insertions is

\[ \left(1 - \frac{1}{m}\right)^{kn}. \]

The probability that the bit is 1 is then

\[ p = 1 - \left(1 - \frac{1}{m}\right)^{kn}. \]

2.2 Querying an Element

For an element not in the set, each of the \(k\) checked positions has probability \(p\) of being 1. The false positive probability is therefore

\[ P_\text{fp} = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^k. \]

2.3 Approximation

For large \(m\),

\[ \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m}. \]

Thus

\[ P_\text{fp} \approx \left(1 - e^{-kn/m}\right)^k. \]

This compact expression connects \(m\), \(n\), and \(k\) to the false positive rate.

3. Optimal Number of Hash Functions

The optimal \(k\) minimizes

\[ f(k) = \left(1 - e^{-kn/m}\right)^k. \]

Taking logarithms,

\[ \ln f(k) = k \ln\!\big(1 - e^{-kn/m}\big). \]

Differentiating and solving yields

\[ k = \frac{m}{n}\ln 2. \]

At this value, half of the bits in the filter are expected to be set, maximizing entropy and minimizing false positives.

Substituting \(k = \tfrac{m}{n}\ln 2\) into the approximation gives

\[ P_\text{fp} \approx \left(1 - e^{-(m/n)\ln 2 \cdot n/m}\right)^{(m/n)\ln 2}. \]

The inner term simplifies:

\[ e^{-(m/n)\ln 2 \cdot n/m} = e^{-\ln 2} = \tfrac{1}{2}. \]

So

\[ P_\text{fp} \approx \left(1 - \tfrac{1}{2}\right)^{(m/n)\ln 2} = \left(\tfrac{1}{2}\right)^{(m/n)\ln 2}. \]

Since \((1/2)^{\ln 2} \approx 0.6185\),

\[ P_\text{fp} \approx (0.6185)^{m/n}. \]

This step shows why the minimal false positive probability depends only on the bits-per-element ratio \(m/n\).

4. Hashing Mechanisms

The mathematical analysis assumes perfectly uniform and independent hash functions. In practice, real functions approximate this ideal with varying quality. The key requirements are:

Different mechanisms approximate these requirements in different ways.

5. Hashing Algorithms in Practice

5.1 Murmur3

Murmur3 is a non-cryptographic hash designed for speed and dispersion. It mixes input blocks with rotations and multiplications by large odd constants, creating near-random diffusion.

Mathematically, the mixing steps approximate a random affine transformation in \(\mathbb{Z}_{2^{32}}\) or \(\mathbb{Z}_{2^{64}}\). Repeated rotations and multiplications spread entropy evenly, keeping variance of bit occupancy close to 0, which is optimal for Bloomfilters.

Example: With \(m/n = 10\) and \(k \approx 6.93\), the theoretical false positive rate is

\[ P_\text{fp} \approx (0.6185)^{10} \approx 0.0082. \]

Empirical results with Murmur3 confirm this prediction within negligible error.

5.2 Fowler–Noll–Vo (FNV)

FNV iteratively multiplies the hash value by a prime and XORs with input bytes:

\[ H_{i+1} = (H_i \cdot P) \oplus b_{i+1} \quad \text{mod } 2^w, \]

where \(b_{i+1}\) is the next input byte, \(P\) is a fixed prime, and \(w\) (where \(w\) is 32 or 64) is the machine word size.

Because multiplication in \(\mathbb{Z}_{2^w}\) is linear, residues tend to align on arithmetic progressions. This leads to biased output distributions for structured input (such as strings with shared prefixes), effectively lowering the usable \(m\) by increasing variance in probe distribution. Bloomfilters using FNV exhibit higher false positive rates than theory predicts.

Example: Suppose FNV increases probe variance by 10%, giving

\[ m_\text{eff} = \frac{m}{1+0.1} = 0.91m. \]

For \(m/n = 10\), the adjusted false positive rate is

\[ P_\text{fp} \approx (0.6185)^{9.1} \approx 0.0102, \]

about 25% worse than Murmur3.

In practice, FNV sometimes shows even larger variance, especially for data with repeated structure.

5.3 Cryptographic Hashes (SHA-1, SHA-256, Blake2)

Cryptographic hashes employ multiple nonlinear mixing rounds (S-boxes, modular additions, rotations). These functions behave statistically like random oracles: output is indistinguishable from uniform over the codomain.

With \(\sigma^2 \approx 0\), cryptographic hashes yield the same effective performance as Murmur3. Their main advantage is stability against adversarially chosen inputs.

5.4 Double Hashing

A practical optimization is to reduce the number of independent hash evaluations. Instead of computing \(k\) distinct functions, it is sufficient to compute two base hashes \(h_1\) and \(h_2\) and generate the rest as

\[ h_i(x) = h_1(x) + i \cdot h_2(x) \pmod m, \quad i = 0, 1, \dots, k-1. \]

This method was introduced by Adam Kirsch and Michael Mitzenmacher (2006). Their analysis shows that double hashing closely approximates the behavior of \(k\) independent hash functions in Bloomfilters, with only negligible loss in performance.

Though not perfectly independent, this scheme produces nearly uniform probes in practice. It preserves \(\sigma^2 \approx 0\) provided \(h_1\) and \(h_2\) are well-chosen. The reduction in hash evaluations significantly speeds up Bloomfilter operations and is widely adopted in high-performance Bloomfilter implementations.

6. Weak Hashing in Bloomfilters

If the hash distribution is non-uniform, some positions are set more often than others. Let \(\sigma^2\) be the normalized variance in occupancy probability across the \(m\) positions. The Bloomfilter behaves as if it had

\[ m_\text{eff} = \frac{m}{1+\sigma^2}. \]

usable bits.

The false positive probability then becomes

\[ P_\text{fp} \approx \left(0.6185\right)^{m_\text{eff}/n}. \]

Thus every percentage of variance in probe distribution translates directly into an exponential increase in false positives.

Countermeasures in the Bloomfilter context include:

7. Applications of Bloomfilters

Bloomfilters are widely deployed where compact membership testing is needed:

References