How To Analyse A Hashing Method … Meet Balls In Bins

With hashing methods, we typically want to build in some form of integrity, and where we can check that the data has not been modified. This might relate to a file that you want to download, a driver on your system, or the text in an email. At the core of this checking is a simple hashing function. Often, too, we digitally sign a hash value with our private key and then prove the signature with our public key.

With hashing, we perform a one-way function that takes our data and produces a hashed value. We should not be able to mathematically reverse this, as we don’t want someone to reverse our hashes. But, at times, Eve can actually reverse the hash if she has already computed the match between the data and the generated hash. And so we also add a bit of salt, where the salt value changes the hash for a given input.
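As a quick sketch, here is how a salt changes the hash for the same input, using Python's hashlib (the message is just an example value, and a real password store would use a purpose-built KDF such as PBKDF2 rather than a single salted SHA-256):

import hashlib
import os

data = b"The quick brown fox"   # example message (assumed value)
salt1 = os.urandom(16)          # 16 random bytes of salt
salt2 = os.urandom(16)          # a different salt

# Same data, different salt -> completely different hash value
h1 = hashlib.sha256(salt1 + data).hexdigest()
h2 = hashlib.sha256(salt2 + data).hexdigest()

print("Salted hash 1:", h1)
print("Salted hash 2:", h2)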

So, here’s a bit of background on hashing:

Balls in bins approach (Biba)

We can analyse a hash function by looking at the balls-in-bins (Biba) problem. In this, we imagine we are at a fairground and have a number of balls to throw into bins. We throw the balls randomly, and each will land in one of the bins. If we get two balls in a bin, we win. So let's take five bins and determine the probability of getting two balls in a bin at any given throw [here]:

Number of bins: 5

Average number of throws 3.28

Throw	Prob of winning
2	0.200
3	0.520
4	0.808
5	0.962
6	1.000

The code is:

import sys

# Number of bins (default 5); can be passed as a command-line argument
n = 5
if len(sys.argv) > 1:
    n = int(sys.argv[1])

print("Number of bins: %d\n" % n)

# Expected number of throws before two balls land in the same bin.
# p_prev holds P(first collision on throw i); start with P(2) = 1/n.
p_prev = 1.0 / n
E = p_prev * 2

for i in range(2, n):
    # P(first collision on throw i+1) derived from P(first collision on throw i)
    pnew = p_prev * (n + 1 - i) * i / (n * (i - 1))
    E = E + pnew * (i + 1)
    p_prev = pnew
# Note: this sums the terms for throws 2..n; the final guaranteed-collision
# term for throw n+1 is omitted, which is negligible for large n.

print("Average number of throws", E)

print("\nThrow\tProb of winning")

# Cumulative probability of winning (a collision) after i throws
val = (n - 1) / n
for i in range(2, n + 2):
    print("%d\t%.3f" % (i, 1 - val))
    val = val * (n - i) / n
    if i >= 100:
        print("Not printing after 100")
        break

We can use this as an analogy for hashing, where a collision occurs when different data elements lead to the same hashed value. In this case, with five bins, we will generate a collision every 3.28 attempts on average. With 65,536 bins (a 16-bit hash, since 2¹⁶ = 65,536), we have an average of 321.51 attempts before we get two balls in a bin [here]:

Number of bins: 65536
Average number of throws 321.515493347
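As a cross-check, for a large number of bins the average number of throws before a collision is well approximated by the birthday bound of sqrt(πn/2) + 2/3. A minimal sketch of this, using the 65,536-bin case from above, agrees with the program output:

import math

n = 65536
# Birthday-bound approximation for the expected number of throws to a collision
approx = math.sqrt(math.pi * n / 2) + 2.0 / 3
print("Approximation for %d bins: %.2f throws" % (n, approx))   # about 321.52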

And, what about birthdays? Well, I often do a live demo to my class around birthday collisions, and where you just need a class of around 23 students to have a 50% chance of two students sharing a birthday. This is equivalent to having 365 bins and throwing the balls into them in a random way. On average, it will take 24.61 throws to get two balls in a bin [here]:

Number of bins: 365
Average number of throws 24.6165858946
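We can also check this figure empirically with a quick Monte Carlo sketch (the number of trials is arbitrary), which throws balls into 365 bins until two land in the same bin and averages the number of throws:

import random

def throws_until_collision(bins=365):
    # Throw balls into random bins until one bin receives a second ball
    seen = set()
    throws = 0
    while True:
        throws += 1
        b = random.randrange(bins)
        if b in seen:
            return throws
        seen.add(b)

trials = 100000
avg = sum(throws_until_collision() for _ in range(trials)) / trials
print("Average throws over %d trials: %.2f" % (trials, avg))   # roughly 24.6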

So, the Biba concept allows us, for a given number of bits used to represent a hash value, to determine how often a collision of hash values will occur.
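For a b-bit hash there are n = 2^b bins, so, under the same birthday-bound approximation, a collision is expected after roughly sqrt(πn/2), which is about 1.25 × 2^(b/2) random hashes. A rough sketch:

import math

# Expected number of random hashes before a collision, for a b-bit hash
for bits in (16, 32, 64, 128, 256):
    attempts = math.sqrt(math.pi * (2 ** bits) / 2)
    print("%3d-bit hash -> collision after about 2^%.1f attempts" % (bits, math.log2(attempts)))

This is why a b-bit hash only gives around b/2 bits of collision resistance.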

Entropy

The true measure of the actual information within a hashed value is its entropy. This determines the actual amount of information contained in the data, and it is measured in bits. If the hash is truly random, the number of bits in the hash will be the entropy value. If, however, we use a 32-bit hash but only have 1,024 different input values, the entropy will only be 10 bits (as 2¹⁰ = 1,024).

To calculate the entropy we take the number of possible input values for our hashing (count) and compute:

Entropy = log10(count)/log10(2)

This is basically a log2() function, but converted into log10 (as not many calculators have log2() on them). For a five-character lower-case password, we have 26⁵ different possible passwords, so the key entropy is:

Entropy = log10(26⁵)/log10(2) = 23.5 bits

Overall, 23.5 bits is a low value, as the state of the art in hash cracking is around 72 bits. Now let's try different password lengths [here]:

We can see that we get 56.41 bits for a 12-character lower-case password, but this is still not strong. If we use upper and lower case, it improves to 68.41 bits, which is near the limits of hash cracking [here]:
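These entropy values can be reproduced with a few lines of Python (using 26 characters for lower case and 52 for mixed case):

import math

def entropy_bits(charset_size, length):
    # Entropy = log10(count)/log10(2) = log2(charset_size ** length)
    return length * math.log2(charset_size)

print("5 chars, lower case : %.2f bits" % entropy_bits(26, 5))    # 23.50
print("12 chars, lower case: %.2f bits" % entropy_bits(26, 12))   # 56.41
print("12 chars, mixed case: %.2f bits" % entropy_bits(52, 12))   # 68.41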

The reason we can crack 12-character passwords is that we can employ parallel processing with GPUs or ASICs. To measure the strength of a hashed password, we thus turn to the true measure of its strength: the key entropy value.

Meet the hashing methods

And so, what about the hashing methods? We can authenticate messages, have lightweight versions of hashes, and so much more. Here are most of the main methods, with a short Python sketch of a few of them after the list:

  • cryptographic hashes. These are fast and secure, such as SHA-256, BLAKE3 and SHA-3. Examples here.
  • message authentication codes (MACs). These use a secret key shared between Bob and Alice, which they use to create keyed hashes. Examples here.
  • light-weight hashes. These run well on resource-constrained devices, such as sensors and IoT devices. Examples here.
  • non-crypto hashes. These are ultra-fast, but not as secure as cryptographic hashes. Examples include Farm and xxHash. Examples here.
  • one-time passwords. These generate a unique hash value from a counter or from the current time. Examples here.
  • quantum-robust hashes. These use hash-based signatures and allow digital signatures to scale into an era of quantum computers. Examples here.
  • similarity hashes. These look to match similar objects, such as similar words or similar images. Examples here.
  • key derivation function (KDF) hashes. These generate encryption keys from passwords. Examples here.
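And here is a short sketch of a few of these categories using Python's standard library (BLAKE3 is not in the standard library, so BLAKE2b stands in for a fast cryptographic hash, and PBKDF2 stands in as the KDF; the key and message values are just examples):

import hashlib
import hmac

msg = b"hello world"
key = b"shared secret between Bob and Alice"   # example key

# Cryptographic hashes
print("SHA-256 :", hashlib.sha256(msg).hexdigest())
print("SHA-3   :", hashlib.sha3_256(msg).hexdigest())
print("BLAKE2b :", hashlib.blake2b(msg).hexdigest())

# Message authentication code (HMAC with SHA-256)
print("HMAC    :", hmac.new(key, msg, hashlib.sha256).hexdigest())

# Key derivation function (PBKDF2 with SHA-256)
dk = hashlib.pbkdf2_hmac("sha256", b"password", b"salt", 100000)
print("PBKDF2  :", dk.hex())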