The Charikar similarity method [1] is often used on documents and metadata in order to locate duplicates or clustered objects. A comparison value of zero means no differences were detected, and the higher the value, the more the two inputs differ.
SimHash: Charikar similarity method with Golang
Outline
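The following uses the Charikar similarity method to create a hex signature for each string. It splits each string into words, builds a 64-bit signature from those words, and then compares the two signatures. The comparison value is the Hamming distance, which is the number of bit positions at which the two signatures differ. For a distance d, the percentage of bits which are the same is (64 - d) / 64 x 100.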
    package main

    import (
        "fmt"
        "os"

        "github.com/mfonda/simhash"
    )

    func main() {
        // Default inputs, overridden by the first two command-line arguments.
        s1 := "This is a test"
        s2 := "This is a test1"

        argCount := len(os.Args[1:])
        if argCount > 0 {
            s1 = os.Args[1]
        }
        if argCount > 1 {
            s2 = os.Args[2]
        }

        string1 := []byte(s1)
        string2 := []byte(s2)

        // Build a signature for each string from its word features.
        hashes1 := simhash.Simhash(simhash.NewWordFeatureSet(string1))
        fmt.Printf("Simhash of `%s`: %x\n", string1, hashes1)

        hashes2 := simhash.Simhash(simhash.NewWordFeatureSet(string2))
        fmt.Printf("Simhash of `%s`: %x\n", string2, hashes2)

        // Compare returns the number of differing bits (0 means identical signatures).
        fmt.Printf("Comparison of `%s` and `%s`: %d\n", string1, string2, simhash.Compare(hashes1, hashes2))
    }
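To make the word-by-word comparison more concrete, here is a minimal from-scratch sketch of the idea in Go, using only the standard library. It is not the code of github.com/mfonda/simhash, so the signatures it prints need not match those of the library. The following are all assumptions made for illustration: the whitespace tokenizer, the lowercasing, the equal weight of 1 per word, the choice of FNV-1a as the word hash, and the names wordHash, simhashSketch, hammingDistance and percentSame.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/bits"
	"strings"
)

// wordHash returns a 64-bit hash of a single word.
// FNV-1a is an assumption of this sketch, not a requirement of the method.
func wordHash(word string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(word))
	return h.Sum64()
}

// simhashSketch builds a 64-bit signature from the whitespace-separated,
// lowercased words of text. Every word carries a weight of 1.
func simhashSketch(text string) uint64 {
	var counts [64]int
	for _, word := range strings.Fields(strings.ToLower(text)) {
		h := wordHash(word)
		for i := 0; i < 64; i++ {
			// Each word votes +1 for bits it has set and -1 for the rest.
			if h&(uint64(1)<<uint(i)) != 0 {
				counts[i]++
			} else {
				counts[i]--
			}
		}
	}
	// A signature bit is 1 when the votes for that position are positive.
	var sig uint64
	for i, c := range counts {
		if c > 0 {
			sig |= uint64(1) << uint(i)
		}
	}
	return sig
}

// hammingDistance counts the bit positions at which a and b differ.
func hammingDistance(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

// percentSame converts the distance into the percentage of matching bits.
func percentSame(a, b uint64) float64 {
	return 100 * float64(64-hammingDistance(a, b)) / 64
}

func main() {
	a := simhashSketch("This is a test")
	b := simhashSketch("This is a test1")
	fmt.Printf("Signature 1: %016x\n", a)
	fmt.Printf("Signature 2: %016x\n", b)
	fmt.Printf("Distance: %d bits, %.1f%% of bits match\n",
		hammingDistance(a, b), percentSame(a, b))
}
```

In this sketch, two strings with the same words in the same proportions always give a distance of 0 and 100% matching bits. percentSame is one way of turning the distance into the percentage of bits which are the same.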
References
[1] Charikar, M. S. (2002, May). Similarity estimation techniques from rounding algorithms. In Proceedings of the thirty-fourth annual ACM symposium on Theory of computing (pp. 380-388).