The Charikar similarity method [1] is often applied to documents and metadata in order to locate duplicates or to cluster similar objects.
Similarity hash: SimHash (Charikar similarity)
The following uses Charikar similarity to create a hex signature for each of two strings. Each string is split into words, and the signature is built from those words, so strings that share most of their words produce signatures that differ in only a few bits. We then calculate the proportion of bits that are the same in the two signatures, for several signature sizes.
    import sys
    from hashes.simhash import simhash

    # Default inputs; override them with up to two command-line arguments.
    str1 = "This is a test string"
    str2 = "This is not a test string"
    if len(sys.argv) > 1:
        str1 = sys.argv[1]
    if len(sys.argv) > 2:
        str2 = sys.argv[2]

    # Single-argument print(...) calls behave the same in Python 2 and Python 3.
    print("String 1:\t%s" % str1)
    print("String 2:\t%s" % str2)

    # Hash both strings at each signature size and compare the signatures.
    for bits in (8, 16, 24, 1024):
        hash1 = simhash(str1, hashbits=bits)
        hash2 = simhash(str2, hashbits=bits)
        print("\n==== %d-bit hash ====" % bits)
        print("Hash1:\t\t%s" % hash1.hex())
        print("Hash2:\t\t%s" % hash2.hex())
        print("Similarity:\t%s" % hash1.similarity(hash2))
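To make the idea behind the signature more concrete, here is a minimal sketch of a SimHash-style signature in plain Python 3. It is not the implementation used by the hashes library above. The names token_hash, simhash_sketch, and bit_similarity are hypothetical. The per-word hash (SHAKE-256 truncated to hashbits) is an assumption chosen only because it is reproducible, and every word is given equal weight.

```python
import hashlib


def token_hash(token, hashbits):
    """Hash one word to an integer of at most `hashbits` bits.

    Assumption: SHAKE-256 is used purely as a stable, arbitrary-width hash.
    """
    nbytes = (hashbits + 7) // 8
    digest = hashlib.shake_256(token.encode("utf-8")).digest(nbytes)
    return int.from_bytes(digest, "big") & ((1 << hashbits) - 1)


def simhash_sketch(text, hashbits=64):
    """Build a SimHash-style signature from the whitespace-separated words of `text`."""
    counts = [0] * hashbits
    for token in text.split():
        h = token_hash(token, hashbits)
        for i in range(hashbits):
            # Each word votes +1 for a set bit and -1 for a clear bit.
            counts[i] += 1 if (h >> i) & 1 else -1
    signature = 0
    for i, count in enumerate(counts):
        if count > 0:  # bit is set only when the positive votes win
            signature |= 1 << i
    return signature


def bit_similarity(a, b, hashbits=64):
    """Fraction of bit positions (0.0 to 1.0) in which two signatures agree."""
    differing = bin(a ^ b).count("1")  # Hamming distance
    return (hashbits - differing) / hashbits


sig1 = simhash_sketch("This is a test string")
sig2 = simhash_sketch("This is not a test string")
print(hex(sig1), hex(sig2), bit_similarity(sig1, sig2))
```

Because each word only adds or subtracts one vote per bit position, inserting a single word such as "not" can flip only those bits where the vote was close, which is why similar strings tend to keep most bits in common. With very short signatures (for example 8 bits) the similarity value is coarse, since each differing bit changes it by 1/8.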
[1] Charikar, M. S. (2002, May). Similarity estimation techniques from rounding algorithms. In Proceedings of the thirty-fourth annual ACM symposium on Theory of computing (pp. 380-388).