Malware/Spear Phishing Detection: Meet TLSH — A Locality Sensitive Hash

In digital forensics and cybersecurity, we often use hashing methods to pin-point a unique hash value for a given set of data. This is…

Photo by Hannes Johnson on Unsplash

Malware/Spear Phishing Detection: Meet TLSH — A Locality Sensitive Hash

In digital forensics and cybersecurity, we often use hashing methods to pinpoint a unique hash value for a given set of data. This is known as a cryptographic hash. We can then use the cryptographic hash to determine if something has been changed or not. But, sometimes, we need to create a hash to show that a file is similar to another file. This is the case of malware or spear phishing emails, and where the data used has often been copied from one source to another, and just changed a little.

Similarity hashing

In malware analysis and in text similarity, we often have to identify areas of data within files that are similar. One method is TLSH (A Locality Sensitive Hash), and which was defined by Oliver et al [1]:

It is used — along with ssdeep — by VirusTotal to identify a hash value for malware:

It is a fuzzy hashing method that requires at least 50 bytes of data. The hash itself is 35 bytes long with “T1” (the version number) at the start and followed by 70 hex characters. An example is [here]:

T10F9022C0330203338E88008038882A80FF820A0C203203222C00000023030200022C88

TLSH examples

Now let’s try a few examples [here]:

String 1: This is the first string, and it has some text. Thank you
String 2: This is the first string, and it has some text. Thank you

tlsh.hash: T10F9022C0330203338E88008038882A80FF820A0C203203222C00000023030200022C88
tlsh.hash: T10F9022C0330203338E88008038882A80FF820A0C203203222C00000023030200022C88
tlsh.diff(hex1, hex2) 0

Now, just by adding a full stop at the end [here] we can see that the score changes:

String 1: This is the first string, and it has some text. Thank you
String 2: This is the first string, and it has some text. Thank you.

tlsh.hash: T10F9022C0330203338E88008038882A80FF820A0C203203222C00000023030200022C88
tlsh.hash: T1E0A022C8330203338E88008038882A80FF820A0C203203222C00000023030200022C88
tlsh.diff(hex1, hex2) 4

We see that our perfect score (0) has now changed to 4. We can see that the change in the hash is mainly localized at the start of the hash:

String 1: This is the first string, and it has some text. Thank you
String 2: This is the first string, and it has some text. Thank you.

tlsh.hash: T1 0F9 022C0330203338E88008038882A80FF820A0C203203222C00000023030200022C88
tlsh.hash: T1 E0A 022C8330203338E88008038882A80FF820A0C203203222C00000023030200022C88
tlsh.diff(hex1, hex2) 4

Now we can modify one of the words [here]:

String 1: This is the first string, and it has some text. Thank you.
String 2: This is the second string, and it has some text. Thank you.

tlsh.hash: T1E0A022C8330203338E88008038882A80FF820A0C203203222C00000023030200022C88
tlsh.hash: T1CEA022CA330803B288A802A080882280FF820808B0380A2228020800030B2020022880
tlsh.diff(hex1, hex2) 62

This has a significant effect on the score.

String 1: This is the first string, and it has some text. Thank you.
String 2: This is the second string, and it has some text. Thank you.

tlsh.hash: T1 E0A 022C 8 330203 33 8E 8 80 08 0 38 882 A 80FF820 A 0 C203 20 3222C000000230 3020 0022 C 8 8
tlsh.hash: T1 CEA 022C A 330803 B2 88 A 80 2A 0 80 882 2 80FF820 8 0 8B03 80 A2228020800030 B202 0022 8 8 0
tlsh.diff(hex1, hex2) 62

Now let’s use two string which are completely different [here]:

String 1: And so Bill Gates defined the strategy of Microsoft.
String 2: This is the second string, and it has some text. Thank you.

tlsh.hash: T15C900251560DC62195211192D48CC5819905D96702105975514D0E3D0805654D568191
tlsh.hash: T1CEA022CA330803B288A802A080882280FF820808B0380A2228020800030B2020022880
tlsh.diff(hex1, hex2) 181

We can see that the scoring depends on the changes between one set of data and another, and where we could vary the threshold depending on the number of changes that have been made.

For spear phishing, we often see emails which only differ in the name of the person, and so:

String 1: Dear Bill,

Please you have won $1,000,000 in the local lottery.
Please contact me to receive it.

String 2: Dear Fred,

Please you have won $1,000,000 in the local lottery.
Please contact me to receive it.

tlsh.hash: T183B0121AC00102E15450D381C60F6179AF09D0D4C3849C77442D005840987AF604B0C8
tlsh.hash: T1E8B01229C00102D15550D340C60BA1AABB00D0D4C2489877082D815480983AE60470C4
tlsh.diff(hex1, hex2) 32

In this case, there is a fairly good match between the two data inputs. Figure 1 outlines the performance of TLSH against sdhash and ssdeep, and shows that the false positive rate shows a significant improvement with TLSH.

Figure 1: ROC Curve [1]

Coding

The coding is Python is [here]:

import sys
import tlsh


str1="This was a sample piece of text that we are going to assess."
str2="This is a sample piece of text that we are going to assess"

if (len(sys.argv)>1):
str1=str(sys.argv[1])

if (len(sys.argv)>2):
str2=str(sys.argv[2])


print(f"String 1: {str1}")
print(f"String 2: {str2}")

if (len(str1)<50 or len(str2)<0):
print("Strings need to have at least 50 characters")
sys.exit(0)

data=bytes(str1, 'utf-8')
hex1 = tlsh.hash(data)

print('\ntlsh.hash:', hex1)


data=bytes(str2, 'utf-8')
hex2 = tlsh.hash(data)

print('tlsh.hash:', hex2)
print('tlsh.diff(hex1, hex2)', tlsh.diff(hex1, hex2))

Conclusions

Cryptographic hashes give us near certainty that something hasn’t changed, but similarity hashes aim to show as where things have not changed. In spear phishing we often see template messages being used, and where only small parts of the message change. The same can happen with malware and were just small parts of the malware change — and which avoids signature detection methods that use a hash value for malicious code. So, enjoy being similar:

https://asecuritysite.com/hashsim/tlhash

References

[1] Oliver, J., Cheng, C., & Chen, Y. (2013, November). TLSH — a locality sensitive hash. In 2013 Fourth Cybercrime and Trustworthy Computing Workshop (pp. 7–13). IEEE.