Jesse Kornblum [1] created ssdeep and which computes fuzzy hashes, and is a Context Triggered Piecewise Hashes (CTPH). This aims to match similar byte sequences, no matter if there are differences between these sequences. For this we break a file up into fragments with a rolling hash function. We then produce a small hash for each of those fragments. These are then aggregated to produce the complete hash of a file. In comparing file signatures, we treat the resultant hash value as a string, and then uses an edit distance method to compare.
Context Triggered Piecewise Hashes (CTPH): ssdeep |
Outline
CTPH is a rolling hash, and involves multiple traditional cryptographic hashes for one or more fixed-size segments in the file. One of the most popular CTPH methods is ssdeep. This was created, in 2006, by Kornblum et al [1] and uses fuzzy hashes.
This aims to match similar byte sequences, no matter if there are differences between these sequences. For this we break a file up into fragments with a rolling hash function. We then produce a small hash for each of those fragments. These are then aggregated to produce the complete hash of a file. In comparing file signatures, we treat the resultant hash value as a string, and then uses an edit distance method to compare.
With this, ssdeep keeps the state based on just the last few bytes as an input. Bytes are added to the state, and then removed once other bytes have been added — this is similar to having an window which moves over the input data. Areas which have the same byte sequences in parts will produce the same hash output sequence for that segment.
The code is:
package main import ( "fmt" "os" "github.com/glaslos/ssdeep" ) func getSSD(msg string) string { buffer := make([]byte, 4097) copy (buffer,msg) h, err := ssdeep.FuzzyBytes(buffer) if err != nil { fmt.Println(err) } return(h) } func main() { msg1:="hello. how are you?" msg2:="hello. how are you" argCount := len(os.Args[1:]) if (argCount>0) {msg1= string(os.Args[1])} if (argCount>1) {msg2 = string(os.Args[2])} h1:=getSSD(msg1) fmt.Printf("%s:\t%s\n",msg1,h1) h2:=getSSD(msg2) fmt.Printf("%s:\t%s\n",msg2,h2) }
A sample run is
this is the first string: 3:YKRkjWKC1:YS this is the second string: 3:YKRym1:YV
and:
this is the first string: 3:YKRkjWKC1:YS a totally different sentence: 3:sRKP+DTfWALXAl1:sa+DKALQl
If we take an example of:
to be or not to be that is the question, or is it?: 3:qEYAFrxWFr2AWB7PRu:R3rk24 to be or beet to be that is the question, or is it?: 3:qhK9AFrxWFr2AWB7PRu:m3rk24
Intially we have:
to be or : 3:q:
And then the middle bit changes the hash, and then finally we have a match at the end:
q EY AFrxWFr2AWB7PRu q hK9 AFrxWFr2AWB7PRu
As we have inserted or deleted bytes, similar sequences will always match. This is perfectly matched to malware detection, as we can still detect a match, even if they bytes are not aligned.
One system which uses an ssdeep hash to identify malware is VirusTotal:
References
Kornblum, J. (2006). Identifying almost identical files using context triggered piecewise hashing. Digital investigation, 3, 91-97.