Introduction to Bloomfilters

Bloomfilters are probabilistic data structures designed to test set membership efficiently. They provide a clear answer when an element is not in a set, while allowing a small probability of false positives when reporting that an element is in the set. This balance makes them especially useful when memory constraints matter more than occasional uncertainty. A classical example is email filtering: before querying a blocklist database, a Bloomfilter can be used as a quick preliminary test to reject addresses that are definitely not listed.

1. The Basic Idea

A Bloomfilter uses a bit array of length \(m\) and \(k\) hash functions \(h_1, \ldots, h_k\).

To insert an element \(x\): compute \(h_i(x) \pmod m\) for each \(i\) and set those bits to 1.
To query \(y\): compute the same \(k\) indices. If any bit is 0, then \(y\) is definitely not in the set. If all are 1, then \(y\) is possibly in the set.

The key question is: how large must \(m\) be, and how many hash functions \(k\) should be used, to achieve a low false positive probability for \(n\) inserted elements?

2. False Positive Probability

2.1 Occupancy of Bits

Each insertion sets \(k\) bits. After \(n\) insertions, there are \(kn\) bit assignments.

The probability that a given bit remains 0 after all insertions is

\[ \left(1 - \frac{1}{m}\right)^{kn}. \]

The probability that the bit is 1 is then

\[ p = 1 - \left(1 - \frac{1}{m}\right)^{kn}. \]

2.2 Querying an Element

For an element not in the set, each of the \(k\) checked positions has probability \(p\) of being 1. The false positive probability is therefore

\[ P_\text{fp} = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^k. \]

2.3 Approximation

For large \(m\),

\[ \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m}. \]

Thus

\[ P_\text{fp} \approx \left(1 - e^{-kn/m}\right)^k. \]

This compact expression connects \(m\), \(n\), and \(k\) to the false positive rate.

3. Optimal Number of Hash Functions

The optimal \(k\) minimizes

\[ f(k) = \left(1 - e^{-kn/m}\right)^k. \]

Taking logarithms,

\[ \ln f(k) = k \ln\!\big(1 - e^{-kn/m}\big). \]

Differentiating and solving yields

\[ k = \frac{m}{n}\ln 2. \]

At this value, half of the bits in the filter are expected to be set, maximizing entropy and minimizing false positives.

Substituting \(k = \tfrac{m}{n}\ln 2\) into the approximation gives

\[ P_\text{fp} \approx \left(1 - e^{-(m/n)\ln 2 \cdot n/m}\right)^{(m/n)\ln 2}. \]

The inner term simplifies:

\[ e^{-(m/n)\ln 2 \cdot n/m} = e^{-\ln 2} = \tfrac{1}{2}. \]

\[ P_\text{fp} \approx \left(1 - \tfrac{1}{2}\right)^{(m/n)\ln 2} = \left(\tfrac{1}{2}\right)^{(m/n)\ln 2}. \]

Since \((1/2)^{\ln 2} \approx 0.6185\),

\[ P_\text{fp} \approx (0.6185)^{m/n}. \]

This step shows why the minimal false positive probability depends only on the bits-per-element ratio \(m/n\).

4. Hashing Mechanisms

The mathematical analysis assumes perfectly uniform and independent hash functions. In practice, real functions approximate this ideal with varying quality. The key requirements are:

Uniformity: output values should be evenly distributed across the range \(0, \dots, m-1\).
Independence: different hash functions should not produce correlated outputs.
Avalanche effect: small input changes should flip many output bits.

Different mechanisms approximate these requirements in different ways.

5. Hashing Algorithms in Practice

5.1 Murmur3

Murmur3 is a non-cryptographic hash designed for speed and dispersion. It mixes input blocks with rotations and multiplications by large odd constants, creating near-random diffusion.

Pros: High-quality uniform distribution, excellent avalanche behavior, fast.
Cons: Predictable under adversarial control, not cryptographically secure.

Mathematically, the mixing steps approximate a random affine transformation in \(\mathbb{Z}_{2^{32}}\) or \(\mathbb{Z}_{2^{64}}\). Repeated rotations and multiplications spread entropy evenly, keeping variance of bit occupancy close to 0, which is optimal for Bloomfilters.

Example: With \(m/n = 10\) and \(k \approx 6.93\), the theoretical false positive rate is

\[ P_\text{fp} \approx (0.6185)^{10} \approx 0.0082. \]

Empirical results with Murmur3 confirm this prediction within negligible error.

5.2 Fowler–Noll–Vo (FNV)

FNV iteratively multiplies the hash value by a prime and XORs with input bytes:

\[ H_{i+1} = (H_i \cdot P) \oplus b_{i+1} \quad \text{mod } 2^w, \]

where \(b_{i+1}\) is the next input byte, \(P\) is a fixed prime, and \(w\) (where \(w\) is 32 or 64) is the machine word size.

Pros: Extremely simple and fast.
Cons: Non-uniform probe distribution, clustering of outputs.

Because multiplication in \(\mathbb{Z}_{2^w}\) is linear, residues tend to align on arithmetic progressions. This leads to biased output distributions for structured input (such as strings with shared prefixes), effectively lowering the usable \(m\) by increasing variance in probe distribution. Bloomfilters using FNV exhibit higher false positive rates than theory predicts.

Example: Suppose FNV increases probe variance by 10%, giving

\[ m_\text{eff} = \frac{m}{1+0.1} = 0.91m. \]

For \(m/n = 10\), the adjusted false positive rate is

\[ P_\text{fp} \approx (0.6185)^{9.1} \approx 0.0102, \]

about 25% worse than Murmur3.

In practice, FNV sometimes shows even larger variance, especially for data with repeated structure.

5.3 Cryptographic Hashes (SHA-1, SHA-256, Blake2)

Cryptographic hashes employ multiple nonlinear mixing rounds (S-boxes, modular additions, rotations). These functions behave statistically like random oracles: output is indistinguishable from uniform over the codomain.

Pros: Excellent uniformity, robust against adversarial input.
Cons: 5–50x slower than non-cryptographic options.

With \(\sigma^2 \approx 0\), cryptographic hashes yield the same effective performance as Murmur3. Their main advantage is stability against adversarially chosen inputs.

5.4 Double Hashing

A practical optimization is to reduce the number of independent hash evaluations. Instead of computing \(k\) distinct functions, it is sufficient to compute two base hashes \(h_1\) and \(h_2\) and generate the rest as

\[ h_i(x) = h_1(x) + i \cdot h_2(x) \pmod m, \quad i = 0, 1, \dots, k-1. \]

This method was introduced by Adam Kirsch and Michael Mitzenmacher (2006). Their analysis shows that double hashing closely approximates the behavior of \(k\) independent hash functions in Bloomfilters, with only negligible loss in performance.

Pros: Requires only two base hash evaluations, greatly improving speed.
Cons: Slightly less independence between probes, though in practice the false positive probability remains essentially unchanged.

Though not perfectly independent, this scheme produces nearly uniform probes in practice. It preserves \(\sigma^2 \approx 0\) provided \(h_1\) and \(h_2\) are well-chosen. The reduction in hash evaluations significantly speeds up Bloomfilter operations and is widely adopted in high-performance Bloomfilter implementations.

6. Weak Hashing in Bloomfilters

If the hash distribution is non-uniform, some positions are set more often than others. Let \(\sigma^2\) be the normalized variance in occupancy probability across the \(m\) positions. The Bloomfilter behaves as if it had

\[ m_\text{eff} = \frac{m}{1+\sigma^2}. \]

usable bits.

The false positive probability then becomes

\[ P_\text{fp} \approx \left(0.6185\right)^{m_\text{eff}/n}. \]

Thus every percentage of variance in probe distribution translates directly into an exponential increase in false positives.

Countermeasures in the Bloomfilter context include:

Using double hashing with strong base hashes (Murmur3, SHA family).
Introducing random seeds to decorrelate instances.
Rebuilding filters before occupancy exceeds 50%, since clustering worsens rapidly beyond this point.

7. Applications of Bloomfilters

Bloomfilters are widely deployed where compact membership testing is needed:

Email filtering: pre-check addresses against blocklists.
Databases: avoid unnecessary disk lookups by testing keys.
Networking: routing, peer discovery, and cache membership.
Security: blacklist testing for malicious IPs or URLs.
Distributed systems: compact set exchange (Bigtable, HBase, Apache Cassandra).
Browsers: check if sites were previously visited or bookmarked

References

BloomBloom, B. H. (1970). Space/Time Trade-offs in Hash Coding with Allowable Errors. Communications of the ACM, 13(7), 422–426.
KMKirsch, A., & Mitzenmacher, M. (2006). Less Hashing, Same Performance: Building a Better Bloom Filter. ESA 2006.
ApplebyAppleby, A. (2008). MurmurHash3.
FNVFowler, G., Noll, L. C., & Vo, K. (1991). The FNV Non-Cryptographic Hash Algorithm.