The Charikar similarity method is often used on documents and metadata in order to locate duplicates or to cluster similar objects [paper].
Simhash
Outline
The following uses Charikar similarity in order to create a hex signature for each string. It takes the words in each string and compares the strings by how similar their words are. In the following we calculate the percentage of bits that are the same in the two signatures.
import sys
from hashes.simhash import simhash

# Default inputs; the first and second command-line arguments override them
str1 = "This is a test string"
str2 = "This is not a test string"

if (len(sys.argv) > 1):
    str1 = str(sys.argv[1])
if (len(sys.argv) > 2):
    str2 = str(sys.argv[2])

print "String 1:\t", str1
print "String 2:\t", str2

# Repeat the comparison for each signature length
for bits in (8, 16, 24, 1024):
    print "\n==== %d-bit hash ====" % bits
    hash1 = simhash(str1, hashbits=bits)
    hash2 = simhash(str2, hashbits=bits)
    print "Hash1:\t\t", hash1.hex()
    print "Hash2:\t\t", hash2.hex()
    print "Similarity:\t", hash1.similarity(hash2)
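To illustrate what a signature of this kind involves, here is a minimal Python 3 sketch of a Charikar-style fingerprint together with the bit-level similarity described above. It is not the code inside hashes.simhash, and its output will not match that library's hashes: the whitespace tokenisation, the SHAKE-128 word hash, and the names simhash_fingerprint and bit_similarity are all assumptions chosen for illustration.

```python
import hashlib


def word_hash(word, hashbits):
    """Hash one word to an integer of `hashbits` bits.

    SHAKE-128 is an assumption made here because it can emit any number
    of bytes; other word hashes would work equally well.
    """
    nbytes = (hashbits + 7) // 8
    digest = hashlib.shake_128(word.encode("utf-8")).digest(nbytes)
    return int.from_bytes(digest, "big") & ((1 << hashbits) - 1)


def simhash_fingerprint(text, hashbits=64):
    """Return a Charikar-style fingerprint of `text` as an integer."""
    votes = [0] * hashbits  # one signed counter per bit position
    for word in text.lower().split():  # assumed tokenisation: whitespace
        h = word_hash(word, hashbits)
        for i in range(hashbits):
            # Each word votes +1 where its hash has a 1 bit, -1 otherwise.
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:  # a positive total sets the bit; ties leave it clear
            fingerprint |= 1 << i
    return fingerprint


def bit_similarity(fp1, fp2, hashbits):
    """Fraction of bit positions at which two fingerprints agree."""
    differing = bin(fp1 ^ fp2).count("1")  # Hamming distance
    return (hashbits - differing) / hashbits


if __name__ == "__main__":
    s1 = "This is a test string"
    s2 = "This is not a test string"
    for bits in (8, 16, 24, 1024):
        f1 = simhash_fingerprint(s1, bits)
        f2 = simhash_fingerprint(s2, bits)
        width = (bits + 3) // 4  # hex digits needed for this many bits
        print("\n==== %d-bit hash ====" % bits)
        print("Hash1:\t\t%0*x" % (width, f1))
        print("Hash2:\t\t%0*x" % (width, f2))
        print("Similarity:\t", bit_similarity(f1, f2, bits))
```

Because every word casts one vote per bit, two strings that share most of their words tend to agree on most bits. With very short signatures such as 8 bits, a high similarity value can also arise by chance.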