In Search of the Perfect Hash

The Crypto hash, that is

In Search of the Perfect Hash

The Crypto hash, that is

We hash values in order to get a digital thumbprint of data. In most cases we take data of any size and then convert it into a digital hash. For MD5 this is a 128-bit hash (32 hex characters) and for SHA-1 it is a 160-bit hash (40 hex characters).

The longer the hash, the less chance there is of getting a collision, and where two sets of data will give the same hashed value. While this is acceptable if there is no context between the data, it becomes worrying when we use data in a different context, and it then results in the same hashed value. For example if a message of “Hello” can be changed to “Goodbye”, and still produce the same hashed value, then we have changed the context of the message, and still produced the same hash.

In another example, we might have two images, and if they create the same hashed signature, we cannot tell them apart from the hashed value. Thus if someone could go to prison because we detected a given hashed value for a banned image, but where the person could have viewed another image with the same hash signature. This is one of the reasons that MD5 signatures are often not acceptable in digital forensics investigations.

In this article I will pose a few questions:

  • If I change a single bit in the input of a hash, what is the effect on the output bits?
  • Is SHA-2 an improvement over SHA-1 for its variation?

MD5 problems

MD5 is used extensively in checking that data has not been changed and in proving identity. In the past it has been used to store hashed values of passwords, but its application in this area is reducing fast, as many of the common hashed MD5 values for words have been resolved. One of the key things that is important for MD5 is that different data does not produce a collision — where different data, especially in the same type of context does not produce the same hash signature.

Recently, though, Mat McHugh showed that he could produce the same hash signature for different images, using hashclash, and for just 65 cents on the Amazon GPU Cloud, and took just 10 hours to process.

For 10 hours of computing on the Amazon GPU Cloud, Mat created these two images which generate the same hash signature (Figure 1). If we check the hash signatures we get:

C:\openssl>openssl md5 hash01.jpg
MD5(hash01.jpg)= e06723d4961a0a3f950e7786f3766338
C:\openssl>openssl md5 hash02.jpg
MD5(hash02.jpg)= e06723d4961a0a3f950e7786f3766338
Figure 1: Images used for a hash collision

SHA-1 collision

While the world was focused on ransomware, and which pin-pointed some of the weakest computer systems, I’ve been reading Google’s paper on cracking SHA-1, and which used some of the best computing power around [here]. It took Google two years to create a single collision. Considering the resources at Google’s fingertips, the cracking of SHA-1 is well out-of-budget for most organisations [here]. The research team produced two PDF files which have the same hash signature: (PDF 1 and PDF 2).

Figure 2: Google SHAttered

Overall SHAttered took 9,223,372,036,854,775,808 computations, with the equivalent of 6,500 years of single-CPU computations (such as from a desktop computer) and 110 years of single-GPU computations. With 110 GPUs, of course, it will only take one year. So, if you can afford lots of GPUs, then you could find a collision, but its going to be costly in terms for energy requirements. SHAttered is 100,000 faster than a brute force attack that uses the birthday paradox, and where we would it would take 12,000,000 GPU one year to crack.

Differential cryptanalysis

As the methods are well defined in how the hashing is created, it is not too difficult for a cryptography method to be analysed using small changes in the input values and then observing the difference in the output. For the perfect hashing method, we should be able to change one bit on the input, and, on average, 50% of the bits on the output will change.

So let’s analyse some hashing methods for the number of bits that change for an input string. In this case we will use an input of “abc” and change the first bit of the input string [here]:

Plaintext 1: abc
Plaintext 2: ábc
Hash 1: 900150983cd24fb0d6963f7d28e17f72
Hash 2: 73daa9468b3e3999589875343afa18f5
Bits changed: 58  out of  128 ( 45 %)
Cipher1: 10010000000000010101000010011000001111001101001001
Cipher2: 01110011110110101010100101000110100010110011111000
Changes: XXX---XXXX-XX-XXXXXXX--XXX-XXXX-X-XX-XXXXXX-XX---X

It can be seen that 58 bits of the 128 bits have changed, with a 45% change of the bits (and where an X identifies a change in the bits from one cipher to the next). Next we will try SHA-1 (a 160-bit signature) [here]:

Plaintext 1: abc
Plaintext 2: ábc
Hash 1: a9993e364706816aba3e25717850c26c9cd0d89d
Hash 2: 4c5b0ec5033ffce76fe3f3c162398cb38d04e838
Bits changed: 79 out of 160 ( 49 %)
Cipher1: 10101001100110010011111000110110010001110000011010
Cipher2: 01001100010110110000111011000101000000110011111111
Changes: XXX--X-XXX----X---XX----XXXX--XX-X---X----XXX--X-X

It can be see that we now have a 49% change in the output bits for a single bit change on the input. Finally let’s try SHA-512 [here]:

Plaintext 1: abc
Plaintext 2: ábc
Hash 1: ddaf35a193617abacc417349ae20413112e6fa4e89a97ea20a9eeee64b55d39a2192992a274fc1a836ba3c23a3feebbd454d4423643ce80e2a9ac94fa54ca49f
Hash 2: 6f2b2f91ea5b6457a886ea2533afbaa775a007cb3f8ede145a5f64dc8958218fd5fedf32da21974166500870b9faea1c13e16602166a36c768d007345bbede3c
Bits changed: 261 out of 512 ( 50 %)
Cipher1: 11011101101011110011010110100001100100110110000101
Cipher2: 01101111001010110010111110010001111010100101101101
Changes: X-XX--X-X----X-----XX-X---XX-----XXXX--X--XXX-X---

And now we see better results, as the output change is now 50%, which is near optimal. If we now try a simple test with differing inputs strings we get:

Input     MD5   SHA-1   SHA-512
-------------------------------
a 51% 48% 48% [Link]
ab 53% 41% 49% [Link]
abc 45% 49% 50%
abcd 50% 46% 50%
abcde 52% 47% 50%
abcdef 45% 46% 50%
abcdefg 49% 43% 49%
abcdefgh 40% 48% 49% [Link]
abcdefghi 47% 58% 52% [Link]

We can see that MD5 has a fairly wide variation (from 40% to 53% changes) for different strings. SHA-1 improves on this with a smaller variation, but SHA-512 - which is a 512-bit version of SHA-2 — shows a more consistent response for the changes in the output bits. On this sample, a single bit change is thus masked better with SHA-512 than it is with MD5.

The code I have used is here:

Conclusions

We need to design cryptography systems where it is not possible to spot changes, and thus a single bit should not be identified on the output. As we can see in this article, SHA-2 (in the form of SHA-512) produces more randomness for our sample than MD5 or SHA-1.