Blooming Great: The Birth of Cyan Forensics

Our research has supported four amazing spin-outs: Zonefox (acquired by Fortinet), Symphonic (acquired by Ping), Cyan Forensics and Memcrypt. Each of the spin-outs built on fundamental research work, which was then used to build a core business case. It is thus important to understand the route that research takes, and how it can be turned into a high-impact spin-out.

For Cyan Forensics, a core part of the success of the spin-out has been the creation of the core IP (through the PhD work of Phil Penrose), the translation of this work into a law enforcement domain (through Bruce Ramsay, who had previously worked as a forensics investigator in Police Scotland) and the translation of the research into a business context (through people like Ian Stevenson and Bruce Ramsay). Along the way, the support of Scottish Enterprise was key in translating the work from a research domain into a commercial one. The company is now scaling on an international basis, and could possibly be the next tech unicorn in Scotland.

In this article, I will outline the patent that was submitted in 2017 and published in 2018, and which has since become part of the core innovation within the company.

Addressing the problem

Just over five years ago, Phil Penrose was investigating the fundamental weaknesses of digital forensics. This included the growing problem of investigating ever-larger disk drives. For a traditional HDD, it could take around a day to make a copy of a disk. Along with this, law enforcement agencies had to cope with an ever-increasing number of devices that stored data. Another related matter was the use of flawed MD5 hashes for identifying contraband. For Phil, the answer involved randomly sampling areas on the disk for fragments of files. On most disks, the smallest fragment is 4KB. So, he hashed fragments of contraband files into a Bloom filter. If we sample a disk a number of times (n), for a disk with a number of sectors (N) which contains a number of sectors of contraband data (M), what is the probability of finding the contraband in at least one of the sampled sectors?

In this case, we only have to sample 30,000 sectors (with 512 bytes in each sector) to detect, with high probability, the presence of a 200MB file on a 1TB disk.
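The sampling question above can be checked with a few lines of Python. This is only a sketch under a simplifying assumption: each sample is taken to hit a sector independently and uniformly at random (sampling with replacement), so the chance of at least one hit is 1 - (1 - M/N)^n. The function name and the use of decimal units for MB and TB are assumptions made for illustration, and the original work may use a different (for example, without-replacement) expression.

```python
def detection_probability(n_samples: int, total_sectors: int, contraband_sectors: int) -> float:
    """Probability that at least one of n independent, uniform samples
    lands on a contraband sector (sampling with replacement assumed)."""
    miss_one = 1.0 - contraband_sectors / total_sectors  # one sample misses
    return 1.0 - miss_one ** n_samples                   # not all samples miss

SECTOR_BYTES = 512
N = 10**12 // SECTOR_BYTES        # sectors on a 1TB disk (decimal units assumed)
M = 200 * 10**6 // SECTOR_BYTES   # sectors occupied by a 200MB file
n = 30_000                        # number of sampled sectors

p = detection_probability(n, N, M)
print(f"Probability of at least one hit: {p:.4f}")
```

Under these assumptions, the result is about 0.9975, which is consistent with the claim that 30,000 samples are enough for this case.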

A Bloom filter

A Bloom filter gives a probabilistic answer on whether an item is in a set, and was created by Burton Howard Bloom (Bloom, 1970). A query returns either “possibly in the set” or “definitely not in the set”. Each added element is hashed with two or more hashing methods, and the generated values are used to set bits in a bit array. In this example, we use a 32-bit bit vector, with Murmur 2 and FNV for the hashes. Typically we use non-cryptographic hashes, in order to speed up the process.

In this demo, the first value is taken from Murmur 2, and the second one is from FNV. Each of these is used to set a bit position in the 32-bit bit vector. We will add “fred”, “bert” and “greg”, which gives a Bloom filter of:

                01234567890123456789012345678901
Add fred:       00000000000000100000010000000000  fred [21,14]
Add bert:       00000000100000100000010000000100  bert [29,8]
Add greg:       00000000100100100000011000000100  greg [11,22]

We now have bit positions 8, 11, 14, 21, 22 and 29 set.

We can now test for “amy” and “greg”:

Now we can test for amy:
    amy is not there [16,12]
Now we can test for greg:
    greg may be in there [11,22]
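The add-and-test steps above can be sketched in Python. This is a minimal illustration and not the code behind the demo: the two hash functions are pure-Python transcriptions of 32-bit MurmurHash2 (with an assumed seed of 0) and 32-bit FNV-1a, and reducing each hash to a bit position with a modulo of 32 is also an assumption. The bit positions it prints may therefore differ from those shown above. The class and function names are hypothetical.

```python
def murmur2_32(data: bytes, seed: int = 0) -> int:
    """Pure-Python transcription of 32-bit MurmurHash2 (seed is an assumption)."""
    m, r = 0x5BD1E995, 24
    h = (seed ^ len(data)) & 0xFFFFFFFF
    i = 0
    while len(data) - i >= 4:                      # mix four bytes at a time
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * m) & 0xFFFFFFFF
        k ^= k >> r
        k = (k * m) & 0xFFFFFFFF
        h = ((h * m) & 0xFFFFFFFF) ^ k
        i += 4
    rem = len(data) - i                            # handle the last 0-3 bytes
    if rem == 3:
        h ^= data[i + 2] << 16
    if rem >= 2:
        h ^= data[i + 1] << 8
    if rem >= 1:
        h ^= data[i]
        h = (h * m) & 0xFFFFFFFF
    h ^= h >> 13                                   # final avalanche
    h = (h * m) & 0xFFFFFFFF
    h ^= h >> 15
    return h

def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a (the FNV variant is an assumption)."""
    h = 0x811C9DC5
    for b in data:
        h ^= b
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h

class BloomFilter:
    def __init__(self, size: int = 32):
        self.size = size
        self.bits = 0                              # bit vector held in an int

    def positions(self, item: str) -> list:
        data = item.encode("utf-8")
        return [murmur2_32(data) % self.size, fnv1a_32(data) % self.size]

    def add(self, item: str) -> None:
        for p in self.positions(item):
            self.bits |= 1 << p

    def might_contain(self, item: str) -> bool:
        # True means "possibly in the set"; False means "definitely not"
        return all((self.bits >> p) & 1 for p in self.positions(item))

    def __str__(self) -> str:
        # Position 0 is printed first, matching the layout used above
        return "".join("1" if (self.bits >> i) & 1 else "0" for i in range(self.size))

bf = BloomFilter()
for name in ["fred", "bert", "greg"]:
    bf.add(name)
    print(f"Add {name}: {bf}  {bf.positions(name)}")

for name in ["amy", "greg"]:
    verdict = "may be in there" if bf.might_contain(name) else "is not there"
    print(f"{name} {verdict} {bf.positions(name)}")
```

A test returns “may be in there” only when every bit position for the item is already set, which is why an added item is never reported as absent, while an item that was never added can occasionally be reported as possibly present.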

Non-crypto hashes

Traditional cryptographic hashing methods, such as MD5, SHA-1 and SHA-256, would be too slow for the scale of a Bloom filter, and thus we turn to non-cryptographic hashes. These do not have the same security proofs as the cryptographic hashes, but are fast and fairly robust.

In 2011, Google released a fast hashing method in two main forms: CityHash64 (64-bit) and CityHash128 (128-bit). These are non-cryptographic hashing methods, but can be used for fast hash tables. Overall, it is one of the fastest hashes without major known problems. Google produced fairly complex code and then optimized it for speed, especially on little-endian 32-bit or 64-bit CPUs. With little-endian, we organise the bytes of a value so that the least significant byte is stored first (at the lowest memory address). This suits Intel x86 and x64 architectures. Then, in 2014, Google released FarmHash as a successor to CityHash, with a number of enhancements.
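To make the byte ordering concrete, here is a small Python sketch that lays out the same 32-bit value in little-endian and big-endian form. The example value is arbitrary and chosen only for illustration.

```python
value = 0x0A0B0C0D  # arbitrary 32-bit example value

# Little-endian: least significant byte (0x0D) sits at the lowest address
little = value.to_bytes(4, byteorder="little")
# Big-endian: most significant byte (0x0A) sits at the lowest address
big = value.to_bytes(4, byteorder="big")

print("little-endian:", little.hex())  # 0d0c0b0a
print("big-endian:   ", big.hex())     # 0a0b0c0d

# Reading the bytes back with the matching order recovers the value
assert int.from_bytes(little, byteorder="little") == value
```

A hash that reads its input four or eight bytes at a time can load those words directly on a little-endian CPU, which is one reason the byte order matters for speed.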

xxHash was created by Yann Collet and is one of the fastest non-cryptographic hashing methods. A significant speed improvement is achieved on processors that support SSE2, which is an extension to the IA-32 architecture. This, of course, limits the range of architectures on which this speed-up is available. Overall, xxHash works at close to RAM speed limits. In a recent test, XXH3 achieved a hashing rate of 31GB/s, and was especially efficient for small amounts of data (such as text strings). In the test, MUM (MUltiply and Mix) also achieved good levels of throughput. It was created by Vladimir Makarov in 2016 and designed for 64-bit CPUs. Many of these methods are based on the Murmur hash, which was designed by Austin Appleby. Murmur has good performance compared with other hashing methods, and generally provides a good balance between performance and CPU utilization. It also performs well in terms of hash collisions.

SpookyHash was created by Bob Jenkins and was released on Halloween in 2010. It is one of the fastest non-cryptographic hashes around and is generally free of security problems. It can produce 32-bit, 64-bit and 128-bit hash values. MetroHash was created by J. Andrew Rogers in 2015.

I have created a demo of the non-cryptographic methods here:

https://asecuritysite.com/encryption/smh

Conclusions

From a university point of view, our part of the work is complete, and we can now only stand back and watch Cyan become an international leader in improving the scientific methods of digital investigation. From the core team of Phil Penrose, Bruce Ramsay, Owen Lo, Rich Macfarlane, Ian Stevenson and myself, it has grown into something that is truly world-beating. Here are a few things from our scrapbook:

https://www.qinetiq.com/en/blogs/cyan-forensics-case-study

https://www.eu-startups.com/2021/03/scottish-crime-busting-startup-cyan-forensics-lands-e5-8-million/

https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/989753/UK_Safety_Tech_Analysis_2021_-_Final_-_190521.pdf

https://internationalsecurityjournal.com/cyan-forensics-announces-first-american-partnership/

https://cyanforensics.com/2019/11/18/govtech-summit-2019-win/

Go and innovate like never before!