Spotting Spam with Similarity Hashes

Many researchers have tried to find methods where they can hash a string and then compare the hashes. This allows content to be searched…

Spotting Spam with Similarity Hashes

Many researchers have tried to find methods where they can hash a string and then compare the hashes. This allows content to be searched for and matched against to be represented in a hash form (and not in the original text). This makes it more efficient to store and to process. A good example is within a spam message, and where we receive two spam emails which only differ in the target’s contact name (taken from a real spam email):

Dear Sir, target@home !!Reference #PP-003-AC7-993-014
Account Security Alret. 
We need your help resolving an issue with your account.
What's going on?
Your debit or credit card issuer let us know that someone used your 
card without your permission. We want to make sure that you authorised 
any recent PayPal payments.

and:

Dear Sir, other@home !!Reference #PP-003-AC7-993-014
Account Security Alret. 
We need your help resolving an issue with your account.
What's going on?
Your debit or credit card issuer let us know that someone used your 
card without your permission. We want to make sure that you authorised 
any recent PayPal payments.

With this, we can use a similarity match to a known spam message, and gain a score. If the score goes over a certain level, we can quarantine it.

Charikar

The Charikar similarity method is often used for documents and metadata in order to located duplicates or clustered objects. It uses strings in a list and where each word is a token. The order of the words does not matter. If we try the following:

word1 =”this is the first string”, word2 =”this is the first string” Try!
word1 =”this is the first string”, word2 =”this is the string first” Try!
word1 =”this is the first string”, word2 =”this is the first help” Try!
word1 =”this is the first string”, word2 =”this keep the first help” Try!
word1 =”this is the first string”, word2 =”a totally different sentence” Try!

Nilsimsa

Nilsimsa is a similarity hash which is used as an anti-spam focused locality-sensitive hashing method [here]:

This produces a score from 0 (dissimilar objects) to 128 (identical or very similar objects). Nilsima was created Damiani et al. in order to detect spam emails. It uses a 5-byte fixed-size sliding window that analyses the input on a byte-by-byte and produces trigrams of possible combinations of the input characters.

The trigrams map into a 256-bit array (known as the accumulator) to create the hash, and every time a given position is accessed, its value is incremented. At the end of the processing, if the values are above a certain threshold, the value is set to a 1, else it will be zero. This produces a 32-byte digest.

To compare two hashes, the method checks the number of identical bits red to the same position. This produces a score from 0 (dissimilar objects) to 128 (identical or very similar objects).

Examples:

“This is an email.”, “This is an email.” Score: 128 Try!
“Hello Bill”, “hello bill” Score: 89 Try!
“Dear Bill, You have won £10,000,000. Please email me back.”, “Dear Mark, You have won £10,000,000. Please email me back.” Score: 108 Try!
“Dear Bill, You have won £10,000,000. Please email me back.”, “Dear Mark, You have won £10,000,000. Please email me back” Score: 107 Try!
“The quick brown fox jumps over the lazy dog”, “ A quick brown fox jumps over the angry dog” Score: 86 Try!
“This is a true email and I want you to contact me immediately”, “Hello. This is a true email and I want you to contact me immediately” Score: 76 Try!
“Please be sure this is a sample email”, “Hello. This is to be sure this email is a sample” Score: 73 Try!
This email conforms to all data protection standards.”, “All parts of data protection this Web conforms to.” Score: 52 Try!
“Dear Bill, Please be ready to receive the money.”, “Dear Mark, I hope you are okay.” Score: 1 Try!

A sample run is:

Nilsimsa method
String 1:	Dear Bill, You have won £10,000,000. Please email me back.
String 2:	Dear Mark, You have won £10,000,000. Please email me back.--------------
String 1 digest: 28af87350c35cd842f13c9bd70ba24e232aedc1e7610cef4ed436f342ba82afb
String 2 digest: 28af87370e25cd842f17c9bd62ba20e212aadc0e7690ced5e9036f342bac221b
Score:	108

Minhash

Minhash provides a measure of the resemblence between sets of data (Jaccard similarity). MinHash was created by Andrei Z. Broder [paper]:

A sample run:

String 1:	this is the first string
String 2:	this is the first string
===================
Estimated Jaccard:	1.0
Actual Jaccard:		1.0String 1:	this is the first string
String 2:	this is the string first
===================
Estimated Jaccard:	1.0
Actual Jaccard:		1.0String 1:	this is the first string
String 2:	this is the string help
===================
Estimated Jaccard:	0.65625
Actual Jaccard:		0.666666666667String 1:	this is the first string
String 2:	this keep the string help
===================
Estimated Jaccard:	0.4453125
Actual Jaccard:		0.428571428571String 1:	this is the first string
String 2:	a totally different sentence
===================
Estimated Jaccard:	0.0
Actual Jaccard:		0.0

Conclusions

We increasing have to find data elements which are matched to each and need to assess the criteria for the match (such as not about the order of words or the case of the letters). The similarity methods defined here give a basic toolkit for integrating this analysis.