Bloom Filters, Merkle Trees and … Accumulators

I really miss Meetups and physical conferences. To me, the online versions of these just do not work.

Bloom Filters, Merkle Trees and … Accumulators

I really miss Meetups and physical conferences. To me, the online versions of these events, just do not work, especially related to Meetups.

Why? Because these are places to meet up with people with ideas and vision, and to share thoughts. The current pandemic has thus restricted the opportunities for innovators to engage with others, and get feedback (and meet potential investors, too).

It was at one of these Meetups a few years ago, that I pitched the idea of commitments and proofs as part of a trusted system. My use case was quite trivial, and based on a real-life event of someone making a bet on a football match. For this, they took odds of a team winning a match (and who were already 2–0 up with a fairly short time to go), and where they ended up winning the bet. But, the betting company said that the odds were a mistake, and refused to payout. To me, this is was unfair, and a better approach might have been that the person and the betting company should have entered into a binding (smart) contract, and where it is possible to make a commitment to the bet, and then reveal it after the event:

Much of the foundation work around commitments and proofs relate to the work of the mighty Torben Pedersen. In 1991, he published his classic paper of “Non-interactive and information-theoretic secure verifiable secret sharing.” [pdf] and where he outlined a method that would soon be known as a Pedersen Commitment [here]:

While Pedersen commitments are now used extensively in privacy-preserving cryptocurrency, let’s look at another method … the accumulator, and compare with the other well-known methods of Merkle Trees and Bloom filters. Overall accumulators were initially defined in 1994 by Benaloh and de Mare [1][here]:

The paper describes a fixed-length data value — an accumulator — that can contain multiple values in a set. None of the values in the accumulator can then be discovered until we reveal them. This allows us to make a commitment to values, and then create a proof that our values are within the dataset. For this, we add values into the accumulator, and make this public. We can then check for the existence of a value, and create a proof that it exists.

Bloom filters

With a Bloom filter, we take three non-cryptography hashes of a data value and then add them onto a bit vector. We use non-cryptographic hashes as they are much faster than cryptographic hashes, and are also used in areas of hashing tables. When we want to check if a value is stored in the Bloom filter, we perform the same hashing and check the bits in the Bloom filter. If the bits are all set to ‘1’, there is a likelihood of a match, otherwise, if any of the bits are zero, the data value is not stored in the Bloom filter. In our research here, we used this method to triage disk system for contraband content — and which has since developed in a successful spin-out (Cyan Forensics):

Fast contraband detection in large capacity disk drives
In recent years the capacity of digital storage devices has been increasing at a rate that has left digital forensic…www.sciencedirect.com

A Bloom filter is used to create a probabilistic guess on whether an item is in a data structure, and was created by Burton Howard Bloom. Within the test, the query will define if the value is “possibly in the set” or “definitely not in the set”. Each element that is added is hashed with two or more hashing methods, and the values generated are used to set the bits in a bit array. In this example, we use a 32-bit bit vector, and use Murmur 2 and FNV for the hashes:

https://asecuritysite.com/hash/bloom

Merkle Trees

With Merkle Trees, we create a tree of hashes and then make a commitment to a root hash. In a blockchain, it is this root hash that we add to our blockchain, in order to create a chaining effect. So let’s say we have the names of a few countries: ‘France’, ‘Germany’, ‘Denmark’, ‘Sweden’, ‘UK’, ‘Spain’, and ‘Italy’. We can then hash as [here]:

Input:		 [ 'France', 'Germany', 'Denmark', 'Sweden', 'UK', 'Spain', 'Italy' ]
Root hash:	 65C7C87E2731AE97D0BE89EA3299A544FD28DE6A
Tree depth	 3
Tree levels:	 4
Tree nodes:	 7

Level  0
[ '65C7C87E2731AE97D0BE89EA3299A544FD28DE6A' ]

Level  1
[ '8D832398ACABF7B669C023174169C8876F9764DC',
  '2BBB8E1E1D16671B763D0C026C9A2BA43FD73381' ]

Level  2
[ '0F705D70B6CF6FFE1FE570FC24947767438541FA',
  'A5D0D57347C4BE03B6E91E0176D5E53614AA861E',
  'F8A5C1FC111019BDCD2E6F7877299878F9699640',
  'AD79EF0F076D3A686AB9738925F4DD2C7E69D7D1' ]

Level  3
[ 'E3772AC4B4DB87B4A8DBFA59EF43CD1A8AD29515',
  '17D53E0E6A68ACDF80B78D4F9D868C8736DB2CEC',
  '89DA124E04DFE1AD9946CD37D91A119E1D028898',
  '72DDD2B619AF6D6A73FEBF80F7FCAD22495498CD',
  '7770D04C7BFB791E7C95959F12D6797BEE87A8D8',
  '20A8DF9B760336178FCA425339EC1C7E542A2463',
  'AD79EF0F076D3A686AB9738925F4DD2C7E69D7D1' ]

The root of the tree is then 65C7C87E2731…299A544FD28DE6A, and we have seven nodes and four levels. We can now draw the Merkle Tree:

The SHA-1 hash of ‘France’ is E3772AC4B4D…43CD1A8AD29515. Notice that because we have an odd number of nodes, the hash for ‘Italy’ is used for the next level. The great advantage of using a Merkle Tree is that we can easily check that a value is included in the root hash. In the diagram below, to test that ‘Germany’ (Node 2) is in the tree, we would need only need Node 1, Node 34 and Node 567 in order to generate the root (Node 1234567):

The reason we only need nodes 1, 34 and 567, is that we can regenerate Node 12 and Node 1234 from these values. Bitcoin, for example, uses this method in order to check the validity of a transaction within the block. We can then add our transactions into a Merkle Tree, and produce a root hash. Within the block we can then add the root hash of the previous block and the root hash of the next block, and thus interlock the blocks into a chain:

Accumulators

And so we end with accumulators. Basically these have a similar application to Bloom filters and Merkle Trees (and where we can hide the contents of our data set), but where we can publish proofs of knowledge of the contents of the data.

A great advantage of accumulators over Merkle Trees is that we can easily add and delete data elements from our accumulator, whereas Merkle Trees tend to be fixed in their proofs. In Bloom filters, too, it is difficult for us to remove a data element, and we probably have to rebuild the complete filter in order to remove new entries. With an accumulator, we can add and remove data entries, and can even create batch processing to add and remove data elements in an efficient manner. Overall, though, with your private key, we can produce a proof of knowledge of the existence of a data value in a data set, and that can be proven with our public key. In this way, we create a digital signature on the existing data that we have already committed, too.

Here is a quick overview of how accumulators work:

And a number of links for practical implementations:

Accumulators. Accumulator. An accumulator can be used for zero-knowledge proof of knowledge. In this case, we will use the BLS curve and create an accumulator and data a proof onto it, and then remove the proof.
Witness and ZKP. Witness. An accumulator can be used for zero-knowledge proof of knowledge. In this case, we will use the BLS curve and create an accumulator and data a proof onto it, and then remove the proof. We will then create a witness proof to show that a given data element is added within the accumulator.
RSA Accumulators. RSA Accumulator. An accumulator can be used for zero-knowledge proof of knowledge. In this case, we will generate RSA-based proofs of knowledge.
ECC Accumulators with Golang. ECC Accumulator. An accumulator can be used for zero-knowledge proof of knowledge. In this case, we will use the BLS curve and create an accumulator and data a proof onto it, and then remove the proof.
ECC Accumulators with Node.js. ECC Accumulator with JavaScript. An accumulator can be used for zero-knowledge proof of knowledge. In this case we will generate ECC-based proofs of knowledge using JavaScript.

But, you may ask about performance compared with Bloom filters? Surely public key methods are slower? Well, Kumar et al [2] analysed RSA and ECC methods against Bloom filters, and found that RSA methods were indeed slower than Bloom filters, but that ECC methods were faster as than Bloom filters [2]:

Conclusions

While online meetings possibly save time, I don’t think we can replace the spark and energy of seeing people present their ideas and vision.

And so, Web 3 will bring a whole lot of trust to the Internet. With this, we will see an increased usage of digital signing, but with this, we need to preserve privacy. There’s a good chance that this privacy will involve commitments to data, and then when required, for that data to be revealed in a trusted way. So, go build a more trusted, secure and resilient world.

And for Meetups? I hope they get back up and running soon. I miss seeing new ideas being pitched, and in meeting the people with vision and passion for disrupting this flawed digital world that we have built.

References

[1] Benaloh J., de Mare M. (1994) One-Way Accumulators: A Decentralized Alternative to Digital Signatures. In: Helleseth T. (eds) Advances in Cryptology — EUROCRYPT ’93. EUROCRYPT 1993.

[2] Kumar, A., Lafourcade, P., & Lauradoux, C. (2014, September). Performances of cryptographic accumulators. In 39th Annual IEEE Conference on Local Computer Networks (pp. 366–369). IEEE.