```c
// MurmurHash3 64-bit finalizer: mixes the bits of h so that every input bit
// affects every output bit.
uint64_t murmur64(uint64_t h) {
  h ^= h >> 33;
  h *= UINT64_C(0xff51afd7ed558ccd);
  h ^= h >> 33;
  h *= UINT64_C(0xc4ceb9fe1a85ec53);
  h ^= h >> 33;
  return h;
}
```
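The filters below derive all of their probe positions from one such 64-bit hash. A seeded hash family can be obtained, for instance, by adding a seed to the key before mixing; this is only a sketch of the idea, and the exact construction may differ from the code shown here:

```c
// Sketch (assumption): a seeded 64-bit hash built on the mixer above.
uint64_t seeded_hash(uint64_t key, uint64_t seed) {
  return murmur64(key + seed);
}
```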
A Bloom filter uses k hash functions: inserting a key sets k bits to 1, and a query checks those same k bits. The query below derives all k probe positions from a single 64-bit hash using double hashing (the i-th position is a + i*b):

```c
uint64_t hash = hasher(key);
uint64_t a = (hash >> 32) | (hash << 32); // rotate so that a and b differ
uint64_t b = hash;
for (int i = 0; i < k; i++) {
  // If any probed bit is 0, the key was never inserted.
  if ((data[reduce(a, length)] & getBit(a)) == 0) {
    return NotFound;
  }
  a += b; // next probe position
}
return Found;
```
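The helpers reduce and getBit are not defined above. Below is one plausible implementation, assuming data is an array of length 64-bit words: reduce maps a hash to a word index without a modulo (a multiply-shift reduction) and getBit selects one bit inside that word. These exact definitions are an assumption rather than the original code:

```c
#include <stdint.h>

// Assumed helper: map a hash to a word index in [0, length) without a modulo,
// using the multiply-shift ("fastrange") trick on the low 32 bits.
static inline uint32_t reduce(uint64_t hash, uint32_t length) {
  return (uint32_t)(((uint64_t)(uint32_t)hash * (uint64_t)length) >> 32);
}

// Assumed helper: select one of the 64 bits inside the chosen word.
static inline uint64_t getBit(uint64_t hash) {
  return UINT64_C(1) << (hash & 63);
}
```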
bits per element | hash functions | fpp |
---|---|---|
9 | 6 | 1.3% |
10 | 7 | 0.8% |
12 | 8 | 0.3% |
13 | 9 | 0.2% |
15 | 10 | 0.07% |
16 | 11 | 0.04% |
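These figures follow the standard Bloom filter approximation: with c bits per element and k hash functions, the false-positive probability is roughly (1 - e^(-k/c))^k, and the optimal k is about c * ln 2. A quick sanity check of the table (illustrative C, not part of any filter library):

```c
#include <math.h>
#include <stdio.h>

// Approximate false-positive probability of a Bloom filter with
// c bits per element and k hash functions.
double bloom_fpp(double c, int k) {
  return pow(1.0 - exp(-(double)k / c), (double)k);
}

int main(void) {
  printf("%.2f%%\n", 100.0 * bloom_fpp(9.0, 6));   // ~1.3%, as in the table
  printf("%.2f%%\n", 100.0 * bloom_fpp(12.0, 8));  // ~0.3%, as in the table
  // The optimal number of hash functions is roughly c * ln(2).
  printf("optimal k for 12 bits/element: %.1f\n", 12.0 * log(2.0)); // ~8.3
  return 0;
}
```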
number of hash functions | cache misses per query (all out) | cache misses per query (all in) |
---|---|---|
8 | 3.5 | 7.5 |
11 | 3.8 | 10.5 |
(Intel Ice Lake processor, out-of-cache filter)
A negative query can stop at the first unset bit it finds, so it touches fewer cache lines than a positive query, which must probe all k bits.
number of hash functions | mispredicted branches per query (all out) | mispredicted branches per query (all in) |
---|---|---|
8 | 0.95 | 0.0 |
11 | 0.95 | 0.0 |
(Intel Ice Lake processor, out-of-cache filter)
When every key is present, the early-exit branch is never taken and is predicted perfectly; when every key is absent, the exit point is hard to predict, costing about one misprediction per query.
number of hash functions | always out (cycles/entry) | always in (cycles/entry) |
---|---|---|
8 | 135 | 170 |
11 | 140 | 230 |
(Intel Ice Lake processor, out-of-cache filter)
```cpp
// Blocked Bloom filter query (AVX2): all probed bits live in a single
// 256-bit (32-byte) block.
auto hash = hasher_(key);
// The rotation makes the bucket index and the mask use different hash bits.
uint32_t bucket_idx = reduce(rotl64(hash, 32), bucketCount);
// One bit set per 32-bit lane, derived from the hash.
__m256i mask = MakeMask(hash);
__m256i bucket = directory[bucket_idx];
// Returns 1 (found) only if every bit of mask is also set in the bucket.
return _mm256_testc_si256(bucket, mask);
```
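MakeMask is not shown above. A common construction, used by Impala/Arrow-style blocked Bloom filters, multiplies the 32-bit hash by eight odd constants and uses the top five bits of each product to set one bit per 32-bit lane. The version below is a sketch along those lines, not necessarily the exact routine behind the snippet above:

```c
#include <immintrin.h>
#include <stdint.h>

// Sketch of MakeMask: set one bit in each of the eight 32-bit lanes,
// chosen by multiply-shift hashing with eight odd constants.
static inline __m256i MakeMask(uint32_t hash) {
  const __m256i ones = _mm256_set1_epi32(1);
  const __m256i rehash = _mm256_setr_epi32(
      0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
      0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U);
  __m256i h = _mm256_mullo_epi32(rehash, _mm256_set1_epi32((int)hash));
  h = _mm256_srli_epi32(h, 27); // keep the top 5 bits: a bit position in [0, 32)
  return _mm256_sllv_epi32(ones, h);
}
```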
```c
// Binary fuse filter query: a key is (probably) present if its fingerprint
// equals the XOR of the fingerprints stored at its three locations.
bool contain(uint64_t key, const binary_fuse_t *filter) {
  uint64_t hash = mix_split(key, filter->Seed);
  uint8_t f = fingerprint(hash);
  binary_hashes_t hashes = hash_batch(hash, filter);
  f ^= filter->Fingerprints[hashes.h0] ^ filter->Fingerprints[hashes.h1] ^
       filter->Fingerprints[hashes.h2];
  return f == 0;
}
```
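Because each stored fingerprint is an 8-bit value, a key that was not part of the construction matches only when an essentially random byte comes out as zero, so the false-positive probability is about 2^-8 = 1/256 ≈ 0.39%, the fpp reported for binary fuse filters in the tables below.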
 | cache misses per query | mispredicted branches per query |
---|---|---|
3-wise binary fuse | 2.8 | 0.0 |
4-wise binary fuse | 3.7 | 0.0 |
(Intel Ice Lake processor, out-of-cache filter)
 | always out (cycles/entry) | always in (cycles/entry) | bits per entry |
---|---|---|---|
Bloom | 135 | 170 | 12 |
3-wise bin. fuse | 85 | 85 | 9.0 |
4-wise bin. fuse | 100 | 100 | 8.6 |
(Intel Ice Lake processor, out-of-cache filter)
For small filters that stay warm in the cache, the number of memory accesses matters less and the query cost is mostly computational. For large cold filters, each access is costly.
 | ns/query (all out) | ns/query (all in) | fpp | bits per entry |
---|---|---|---|---|
Bloom | 17 | 14 | 0.32% | 12.0 |
Blocked Bloom (NEON) | 3.8 | 3.8 | 0.6% | 12.8 |
3-wise bin. fuse | 3.5 | 3.5 | 0.39% | 9.0 |
4-wise bin. fuse | 4.0 | 4.0 | 0.39% | 8.6 |
(Apple M2)
 | ns/query (all out) | ns/query (all in) | fpp | bits per entry |
---|---|---|---|---|
Bloom | 38 | 33 | 0.32% | 12.0 |
Blocked Bloom (NEON) | 11 | 11 | 0.6% | 12.8 |
3-wise bin. fuse | 17 | 17 | 0.39% | 9.0 |
4-wise bin. fuse | 20 | 20 | 0.39% | 8.6 |
(Apple M2)
 | bits per entry (raw) | bits per entry (zstd) |
---|---|---|
Bloom | 12.0 | 12.0 |
3-wise bin. fuse | 9.0 | 8.59 |
4-wise bin. fuse | 8.60 | 8.39 |
theory | 8.0 | 8.0 |
Compressed (zstd) binary fuse filters can be within 5% of the theoretical minimum.

---

# How does the performance scale with size?

---

# Huge tables?

---

# Compressibility?

If you are sending the filters over a network, you can further compress them.
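For example, the fingerprint array of a binary fuse filter can be run through a general-purpose compressor before transmission. Below is a minimal sketch using the zstd library; the caller passes the fingerprint array and its length explicitly, since field names vary between implementations:

```c
#include <stdint.h>
#include <stdlib.h>
#include <zstd.h>

// Sketch: compress a filter's fingerprint array with zstd before sending it.
// Returns the number of compressed bytes, or 0 on failure; *out must be freed.
size_t compress_fingerprints(const uint8_t *fingerprints, size_t count, void **out) {
  size_t bound = ZSTD_compressBound(count);
  *out = malloc(bound);
  if (*out == NULL) return 0;
  size_t written = ZSTD_compress(*out, bound, fingerprints, count, 3 /* level */);
  if (ZSTD_isError(written)) {
    free(*out);
    *out = NULL;
    return 0;
  }
  return written;
}
```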
The gap between binary fuse filters and blocked Bloom filters narrows, while the regular Bloom filter struggles (lots of computation plus branch mispredictions). Blocked Bloom filters dominate in raw query speed.