Finding Similarities

Often in data science, we need to find the similarity between two or more strings. For example, we might want to match the string “Celtic won the cup” with “The cup was won by Celtic”. For this, we can use string similarity methods. The most common methods are:

  • Token. This involves finding similar blocks of text between the two strings and using these as a match. This method is strong when the same words appear in both strings, but in different places. For example, it works very well for matching “Loans and Advances” with “Advances and Loans”.
  • Jaro. This method works by analysing the number of character transpositions required between two strings.
  • QGrams. This method involves analysing the number of matching n-length character sequences between two strings. It is similar in operation to the edit distance, but performs well when the strings share the same prefixes.
  • Edit distance. This method analyses the number of changes required to go from one string to another. It works well when there are only a few changes between the two strings, such as between “Loan and Account” and “Loans and Accounts”, which block methods often struggle with. Typical methods are Levenshtein, Needleman-Wunsch, Gotoh, Gotoh Window Affine, Jaro, Jaro-Winkler, Q-gram, Block, Cosine and Euclidean.
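As a concrete illustration of the edit distance, here is a minimal Python sketch of the Levenshtein method using the standard dynamic-programming recurrence (the function name `levenshtein` is our own, not from any particular library):

```python
def levenshtein(a: str, b: str) -> int:
    # Edit (Levenshtein) distance: the number of single-character
    # insertions, deletions, and substitutions needed to turn a into b.
    prev = list(range(len(b) + 1))  # distances for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("Loan and Account", "Loans and Accounts"))  # 2
```

Only two insertions (the two “s” characters) are needed, so the distance stays small even though a whole-word comparison would see two of the three words as different.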

Some of the key events likely to occur between similar strings are:

  • Loss of insignificant word. This involves losing insignificant words such as “and”, “or”, “the”, and so on. The test used is “Loans and Accounts” and “Loans Accounts”.
  • Small changes. This involves small changes in the words. The test used is “loans and accounts” and “loan and account”.
  • Rearrangement of words. This involves moving the words around in the strings. The test used is “loans and accounts” and “accounts and loans”.
  • Punctuation. This involves adding/removing punctuation marks. The test used for this is “fishing, “camping”; and ‘forest” and “fishing camping and forest”.
  • Case. This involves analysing the case of the characters. The test used for this is “Loan Account and Dealings” and “LOAN ACCOUNT AND DEALING”.
  • Spacing. This involves changing the spaces between words. The test used for this is “LoanAccountDealing” and “Loan Account Dealing”.
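The token approach can be sketched as a Jaccard similarity over word sets. The helper below is our own illustration (not a standard API): it lower-cases, strips punctuation, and ignores word order, so it handles the case, punctuation, and rearrangement events directly:

```python
import re

def token_similarity(s1: str, s2: str) -> float:
    # Jaccard similarity over word tokens: lower-case, strip
    # punctuation, and ignore word order.
    t1 = set(re.findall(r"[a-z0-9]+", s1.lower()))
    t2 = set(re.findall(r"[a-z0-9]+", s2.lower()))
    if not (t1 or t2):
        return 1.0
    return len(t1 & t2) / len(t1 | t2)

# Rearrangement of words: a perfect match.
print(token_similarity("loans and accounts", "accounts and loans"))       # 1.0
# Loss of an insignificant word: two of three tokens still match.
print(round(token_similarity("Loans and Accounts", "Loans Accounts"), 2)) # 0.67
```

Because punctuation is stripped before tokenising, the punctuation test pair from the list above also scores a perfect 1.0 with this helper.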

The following figure outlines the different methods we can use:

A scoring system for the methods is given in Table 1. A few conclusions can be drawn from this:

  • The block methods generally work well with the loss of insignificant words, as they try to find the words wherever they appear in the strings, whereas the distance methods struggle because they count each movement as a penalty.
  • The block methods also perform well with the rearrangement of words, as they are generally trying to find matching blocks.
  • For small changes, the distance methods perform best, as they only give a small penalty to one or two changes in a word, such as when making a word plural. The block methods often use a space as a delimiter and thus might miss a similar word if it is changed in a small way. The Q-gram method also performs well here.
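This trade-off can be seen with Python's standard library. Here `difflib.SequenceMatcher` (a Ratcliff/Obershelp block matcher, used as a stand-in for a character-level method rather than one of the methods named above) tolerates the plural change, while a whole-word Jaccard score fails once the words change slightly:

```python
import re
from difflib import SequenceMatcher

s1, s2 = "loans and accounts", "loan and account"

# Character-level similarity: the plural "s" characters cost only a little.
char_sim = SequenceMatcher(None, s1, s2).ratio()

# Word-level (token) similarity: "loans" != "loan", so whole words
# fail to match and only "and" survives.
w1 = set(re.findall(r"\w+", s1.lower()))
w2 = set(re.findall(r"\w+", s2.lower()))
token_sim = len(w1 & w2) / len(w1 | w2)

print(round(char_sim, 2), round(token_sim, 2))  # 0.94 0.2
```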

Table 1: Performance of string matching methods

So let’s try a few examples:

  • Loss of insignificant words: “loans and accounts” and “loans accounts”.
  • Small changes: “loans and accounts” and “loan and account”.
  • Rearrangement of words: “loans and accounts” and “accounts and loans”.
  • Punctuation: “fishing, “camping”; and ‘forest$” and “fishing camping and forest”.
  • Case: “Loan Account and Dealing” and “LOAN ACCOUNT DEALING”.
  • Spacing: “LoanAccountDealing” and “Loan, Account, Dealing”.

Here are some other tests:

  • Test: “The quick brown fox” and “The slow brawn fur”.
  • Test: “Harry Potter” and “Potter Harry”.
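These tests can be tried with `difflib.SequenceMatcher` as a stand-in character-level block matcher (Ratcliff/Obershelp; again, not one of the specific tools scored in Table 1):

```python
from difflib import SequenceMatcher

def ratio(a: str, b: str) -> float:
    # Ratcliff/Obershelp similarity: 2 * (matched characters) / (total
    # characters), based on the longest matching blocks between the strings.
    return SequenceMatcher(None, a, b).ratio()

print(round(ratio("The quick brown fox", "The slow brawn fur"), 2))
print(round(ratio("Harry Potter", "Potter Harry"), 2))  # 0.5
```

The rearranged “Harry Potter”/“Potter Harry” pair scores only 0.5 at the character level, because once the longest block (“Potter”) is matched, the remaining “Harry” on each side lies on opposite ends and cannot be matched; a word-level token method would score it 1.0.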

Similarity hash

Another method we can use is to create a hash value of a string, and from this we can create a similarity hash. The Charikar similarity method is often used on documents and metadata in order to locate duplicates or to cluster objects. It treats the string as a list of tokens, where each word is a token, and the order of the words does not matter. If we try the following:

  • word1 = “this is the first string”, word2 = “this is the first string”. We get a similarity of 1.0.
  • word1 = “this is the first string”, word2 = “this is the string first”. We get a similarity of 1.0.
  • word1 = “this is the first string”, word2 = “this is the first help”. We get a 256-bit similarity of 0.92.
  • word1 = “this is the first string”, word2 = “this keep the first help”. We get a 256-bit similarity of 0.89.
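A minimal sketch of a Charikar-style similarity hash is given below, assuming SHA-256 as the per-word hash (the real method can use any fixed-width hash, so the exact scores will differ from those above). Each word votes +1 or −1 on every bit position, and the sign of each position's total gives the final hash; reordering words therefore does not change the result:

```python
import hashlib

def simhash(text: str, bits: int = 256) -> int:
    # Charikar-style similarity hash: hash each word token, let each
    # token vote +1/-1 per bit position, and keep the sign per position.
    v = [0] * bits
    for word in text.split():
        h = int.from_bytes(hashlib.sha256(word.encode()).digest(), "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    out = 0
    for i in range(bits):
        if v[i] > 0:
            out |= 1 << i
    return out

def similarity(a: str, b: str, bits: int = 256) -> float:
    # Fraction of bit positions on which the two hashes agree.
    diff = bin(simhash(a, bits) ^ simhash(b, bits)).count("1")
    return 1.0 - diff / bits

# Word order does not matter, so reordered strings hash identically.
print(similarity("this is the first string", "this is the string first"))  # 1.0
```

Changing a word (e.g. “string” to “help”) flips only the bit positions where that word's vote was decisive, so similar strings keep most bits in common.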


Conclusions

We increasingly have to find data elements which match each other, and we need to assess the criteria for the match (such as ignoring the order of words or the case of the letters). The similarity methods outlined here give a basic toolkit for this analysis.

If you are interested, we recently applied the Damerau–Levenshtein distance, Cosine distance, and Jaccard distance methods to the detection of insider threats in cyber security. You can read the paper here.