Sifting Through the Archive: Private Set Membership for Blossom

Binary Fuse filters let Blossom clients privately check which files exist on a server - download a compact filter once, query locally with zero server load, and use delta lists for real-time accuracy.
Sifting Through the Archive: Private Set Membership for Blossom

Blossom is a protocol for storing blobs on media servers, using SHA-256 hashes as content addresses - simple, elegant, and effective. But lurking beneath this simplicity is a privacy problem: every time a client checks whether a file exists via the HEAD /<sha256> endpoint, the server learns exactly which file the client is looking for. For users who wish to synchronize thousands of files without revealing their entire library to the server, a different approach is needed.

The problem

Imagine you have 10,000 files and want to determine which ones already exist on a Blossom server before uploading. The naive approach - issuing 10,000 individual HEAD requests - certainly works, but it hands the server a complete manifest of your interests. The server can build a profile of your stored content and correlate your queries over time. What we need is a privacy-preserving solution with specific properties: minimal server computation (ideally zero per query), minimal bandwidth consumption, the ability to push computation to the client, and most importantly, a guarantee that the server learns nothing about which hashes the client queries.

Why probabilistic filters work

The key insight comes from understanding how probabilistic filters like Bloom and Binary Fuse filters behave. These data structures exhibit an asymmetric error profile: they never produce false negatives, meaning that when the filter declares an item “not present,” that answer is absolutely reliable. However, they can produce false positives, occasionally claiming an item is “maybe present” when it is not present. This asymmetry turns out to be acceptable for our use case.

For file deduplication, this asymmetry aligns beautifully with our needs. When the filter says “not present,” we know with certainty the file is missing and should be uploaded. When it says “maybe present,” the file probably exists and we can safely skip the upload. The occasional false positive means we might skip uploading something that is missing, but this delays the upload until the next synchronization cycle after the filter rebuilds - an acceptable trade-off for most applications.

How Binary Fuse filters work

Binary Fuse filters, introduced by Graf and Lemire in 2022, represent the current state of the art in static set membership data structures. Understanding their elegance requires a closer look at their construction.

The filter works by mapping each element to three distinct positions in an array through hash functions. During construction, the algorithm finds a “peeling order” - a sequence in which elements can be processed such that each element, when its turn comes, has at least one array position that no remaining element touches. Think of it like carefully removing books from a precarious stack: you must find the right order so that each book you remove doesn’t disturb the others still waiting.

Once this peeling order is discovered, the filter assigns fingerprints (small hash values) to array positions in reverse order. For each element, the algorithm stores a value at one of its three positions such that XORing the values at all three positions yields the element’s fingerprint. This XOR-based construction is what gives the filter its name and its efficiency.

To query whether an element exists, the client hashes the element to find its three positions, XORs the values stored there, and checks if the result matches the element’s expected fingerprint. A match indicates the element is probably in the set; a mismatch proves it is absent. The entire query requires just three memory lookups and two XOR operations - extraordinarily fast.

The “binary fuse” innovation specifically refers to how the array is partitioned into segments with a binary structure, which dramatically improves construction success rates and reduces the space overhead compared to earlier Xor filter designs.

Three variants for different needs

Binary Fuse filters come in three variants that trade space for accuracy:

Variant False Positive Rate Size per element Use case
Binary Fuse8 0.4% (1/256) ~9 bits General purpose, best space efficiency
Binary Fuse16 0.0015% (1/65536) ~18 bits When higher accuracy is required
Binary Fuse32 0.00000002% (1/2^32) ~36 bits Effectively perfect, for critical applications

All three variants share remarkably fast construction times - approximately 350 milliseconds for 10 million keys - and query times of roughly 25 nanoseconds per lookup.

Binary Fuse8 offers the best space efficiency, but with a 0.4% false positive rate, roughly 40 out of every 10,000 queries might incorrectly return “maybe present.”

These filters are deterministic, and that property deserves emphasis. The same hash queried against the same filter will always return the same result, which means a false positive will not resolve itself through repeated queries. It persists stubbornly until the server rebuilds the filter with a fresh structure. This is precisely why periodic rebuilds matter - they incorporate new files while also reshuffling which hashes happen to collide into false positives.

Binary Fuse16 is the sweet spot for most applications. It reduces the false positive rate to 0.0015%, nearly 256 times better than Fuse8, at the cost of doubling the filter size. At this rate, you would expect roughly one false positive per 65,000 queries - negligible in practice.

Binary Fuse32 achieves what is effectively perfection for practical purposes - you would need billions of queries before encountering a single false positive. The filter grows to four times the size of Fuse8, but for applications where correctness is paramount, this trade-off is worthwhile.

For a server hosting 10 million files, the concrete numbers look like this:

Variant Filter size False positives per 10K queries
Binary Fuse8 ~11 MB ~40
Binary Fuse16 ~22 MB ~0.15
Binary Fuse32 ~45 MB ~0.000002

Once the client downloads the filter, all queries happen locally with perfect privacy - the server never learns which hashes the client checks.

The dynamic data problem

Blossom servers, however, are not static archives. Files arrive constantly - one every ten seconds on a busy server - and Binary Fuse filters are immutable by design. You cannot incrementally add or remove elements; the entire structure must be rebuilt. Does this limitation doom the approach?

Not at all. The mathematics work decisively in our favor. Rebuilding a filter over 10 million hashes costs 350 milliseconds of computation. Performing this rebuild every hour consumes just 0.01% of a single CPU core - utterly negligible. Between hourly rebuilds, approximately 360 new files may arrive outside the filter window, appearing only after the next rebuild.

When a client queries for one of these recently-added files, the filter correctly responds “not present” (after all, the file was not present when the filter was constructed), and the client attempts to upload it. The server, already possessing the file, deduplicates the upload and responds accordingly. The cost is minor bandwidth waste, not a correctness failure.

The hybrid solution

For applications demanding near-real-time accuracy without the overhead of constant filter rebuilds, the elegant solution combines the filter with compact delta lists:

GET /filter
  Returns:
    - Binary Fuse filter (rebuilt hourly/daily)
    - List of hashes added since filter was built
    - List of hashes removed since filter was built
    - Timestamp of filter generation

Client logic is clear: download the filter and delta lists (caching them with ETags for efficiency), then for each hash to check, first consult the removal list (if present, the file is gone, then the addition list (if present, the file exists), and finally the filter itself. Periodic refreshes keep the client synchronized with server state.

The beauty of this approach lies in the size disparity between the components. The delta lists remain tiny - a few hundred hashes accumulating between rebuilds amounts to 10-20 KB - while the filter efficiently handles the bulk of millions of hashes. Together, they deliver real-time accuracy with minimal bandwidth overhead.

Server Implementation

The server’s responsibilities are minimal. It maintains three components: a base filter rebuilt on a schedule (hourly for busy servers, daily for quieter ones), an addition list accumulating hashes of files uploaded since the last filter build, and a removal list tracking deletions over the same period.

As files arrive, their hashes accumulate in the addition list; deletions go into the removal list. At rebuild time, the server regenerates the filter from its database and clears both lists. The total server cost per client query is zero - all the computational work happens on the client side.

A potential BUD

This scheme could be formalized as a Blossom Upgrade Document:

GET /filter
  Headers:
    X-Filter-Type: binary-fuse-16
    X-Filter-Timestamp: 2024-01-15T12:00:00Z
    X-Element-Count: 10485760
    ETag: "abc123"
  
  Body: {
    "filter": <base64-encoded Binary Fuse filter>,
    "added": ["<sha256>", "<sha256>", ...],
    "removed": ["<sha256>", "<sha256>", ...]
  }

Clients indifferent to privacy can continue using HEAD /<sha256> as before, while privacy-conscious clients fetch the filter and query entirely locally.

Implementation

Binary Fuse filter libraries exist for most popular languages:

  • C: github.com/FastFilter/xor_singleheader
  • Go: github.com/FastFilter/xorfilter
  • Rust: docs.rs/xorf
  • Zig: github.com/hexops/fastfilter
  • JavaScript: Various npm packages

All of these libraries support serialization, enabling simple HTTP transmission of the filter.

Conclusion

The combination of Binary Fuse filters and delta lists solves private set membership for Blossom with remarkable elegance. Clients enjoy perfect privacy since queries never leave their machines. Servers expend zero computation per query since all work happens client-side. Bandwidth remains minimal - a compact filter plus tiny delta lists. Real-time accuracy comes from the delta lists covering changes since the last rebuild. Well-tested libraries cover most languages.

The server rebuilds a filter periodically - a few hundred milliseconds of work - serves it to clients alongside two small lists, and that is all. Clients download the filter, query locally as often as needed, then refresh on a schedule. No cryptographic complexity, no multi-round protocols, just efficient data structures doing precisely what they were designed to do.

Sometimes the simplest solution is also the right one.



Loading comments…