When Fast Just Isn’t Enough: Nobody inspects the spammish repetition

A message-digest provides a fingerprint for data and is used to prove the identity and integrity of messages and entities. The most common…

When Fast Just Isn’t Enough: Nobody inspects the spammish repetition

A message-digest provides a fingerprint for data and is used to prove the identity and integrity of messages and entities. The most common ones used are MD5 (128-bit message hash), SHA1 (164-bit message hash), and SHA-256 (256-bit message hash). So for “hello” we get an MD5 signature of “5D41402ABC4B2A76B9719D911017C592” and for SHA-1 it is “AAF4C61DDCC5E8A2DABEDE0F3B482CD9AEA9434D”.

Overall the hash signature should be fairly unique, so if we change any part of the data, it will produce a completely different hash signature. The longer the hash signature the less chance we will have of two pieces of data having the same hash signature.

These methods are fairly fast and can run on most computers, but they use cryptography methods for their processing. This makes it difficult to run the hashing method in parallel, so if we had 16 cores on our computer, with these methods we would only be using one of the cores. Along with this the methods such as MD5 and SHA-1 use numbers which are fairly difficult to process on a 32-bit or 64-bit processor.

Non-crypto hashing

So how do we improve the performance of our hashing process? The best solution is to use non-crypto hashing. The main methods are typically written in C++, in order that they are fast, and are well mapped to the processor. These include xxHash, Mumur, Spooky, City Hash and FNV. A tool such as smHash can thus benchmark each of the methods, and, using a Linux Mint 64 bits on a Core i5 3340M at 2.7GHz, we get:

As we can see xxHash runs at 15 times faster than MD5 and SHA1, and can take full use of the cores on the processor. In fact, the method runs almost at the speed of the memory on the system. The throughput on a 64-bit system is also highly optimised with a blistering 13.8GB/s processed. Compare that with SHA-1 which gives 0.25GB/s, and we get a 55 fold improvement in throughput.

The other problem that the non-crypto methods overcome is that most processors only have a 32-bit or 64-bit register, thus dealing with 128-bits and more is quite cumbersome for them. With the non-crypto methods, we typically generate a 64-bit or 32-bit hash signature, and which is optimised to the processor.

With hashing methods we should always get the same result if we take the same message and/or seed value:

  • Try “Nobody inspects the spammish repetition”. Try!, which should give a hash of: 0xe2293b2f (32-bit) and 0xfbcea83c8a378bf1 (64-bit).
  • Try “The quick brown fox jumps over the lazy dog”. Try!, which should give a hash of: 0xe85ea4de
  • Try “The quick brown fox jumps over the lazy dog.”. Try!, which should give a hash of: 0x68d039c8

You can see that the 32-bit signature has 8 hex characters, where each represents 4 bits, so we get 32-bits. With a 64-bit signature, we use 16 hex characters. Using a Linux Mint 64 bits on a Core i5 3340M at 2.7GHz, we get:

Name     Speed on 64 bits    Speed on 32 bits
XXH64 13.8 GB/s 1.9 GB/s
XXH32 6.8 GB/s 6.0 GB/s

The Python equivalent is:

If you want to try the other methods, they are here:

Siphash, Farm and djb2

The SipHash produces a hash function which is optimised for speed on short messages. It is typically used in networking application and is simpler than cryptographic hashing methods. It produces a 64-bit output and can produce Gigahashes/sec. It is not quite as fast as xxHash, but is considerably faster than MD5 and SHA-1. A sample run is [here]:

======Siphash=========
Key: 0123456789ABCDE0
Message: hello
Hash: 10442178880121868220
======Farm Hash=========
Farmhash: 13009744463427800296
======Murmur Hash=========
Murmur3: 11548395115745546715
Murmur3: 213030289162235495270783145757721615258
======xxHash=========
xxHash32: fb0077f9
xxHash64: 26c7827d889f6da3
======djb2 Hash=========
DJB2: 0xf923099L
======FNV-1a Hash=========
FNV-1: 2879634789696118705735610293694015385900839271
FNV-1: 23615531076550004469577376377959424885116481126984737282095593755520035349976263
FNV-1a: 2879634757790805244413300996931090322359790763
FNV-1a: 23615531076550004533854898275474062960111728987064870418154029787321558796778763
======Loselose Hash=========
Loselose: 532

The performance of the Siphash is fairly good, but no where near as fast as xxHash64 Cityhash64, Spooky and Farm64:

Ref: [here]

Conclusions

Processors do not like crypto that much, especially when we deal with numbers of more than 4,000 bits. In most cases, they either run in 32-bit mode, and use 32-bit registers, or in 64-bit mode with 64-bits registers. Our numbers normally fit nicely into this, where an unsigned integer takes up 64 bits, and even our floating-point values fit into these. The maths processor thus deals with these numbers. So the non-crypto methods overcome the problems of the normal hashing methods and optimize for the processor. The xxHash is one of the fastest, and ports with C++, to make it even faster.

So:

Nobody inspects the spammish repetition