Malware Detection: Context Triggered Piecewise Hashes (CTPH)

A CTPH is a rolling hash and involves multiple traditional cryptographic hashes for one or more fixed-size segments in a file. One of the…

Photo by Growtika Developer Marketing Agency

Malware Detection: Context Triggered Piecewise Hashes (CTPH)

A CTPH is a rolling hash and involves multiple traditional cryptographic hashes for one or more fixed-size segments in a file. One of the most popular CTPH methods is ssdeep. This was created, in 2006, by Jesse Kornblum [1] and uses fuzzy hashes:

This aims to match similar byte sequences, no matter if there are differences between these sequences. For this, we break a file up into fragments with a rolling hash function. We then produce a small hash for each of those fragments. These are then aggregated to produce the complete hash of a file. In comparing file signatures, we treat the resultant hash value as a string and then use an edit distance method to compare.

With this, ssdeep keeps the state based on just the last few bytes as an input. Bytes are added to the state, and then removed once other bytes have been added — this is similar to having a window that moves over the input data. Areas that have the same byte sequences in parts will produce the same hash output sequence for that segment.

The code is [here]:

package main
import (
"fmt"

"os"
"github.com/glaslos/ssdeep"
)

func getSSD(msg string) string {

buffer := make([]byte, 4097)

copy (buffer,msg)

h, err := ssdeep.FuzzyBytes(buffer)
if err != nil {
fmt.Println(err)
}
return(h)
}


func main() {
msg1:="hello. how are you?"
msg2:="hello. how are you"

argCount := len(os.Args[1:])
if (argCount>0) {msg1= string(os.Args[1])}
if (argCount>1) {msg2 = string(os.Args[2])}

h1:=getSSD(msg1)
fmt.Printf("%s:\t%s\n",msg1,h1)
h2:=getSSD(msg2)
fmt.Printf("%s:\t%s\n",msg2,h2)

}

A sample run is [here]:

this is the first string:	3:YKRkjWKC1:YS
this is the second string: 3:YKRym1:YV

and [here]:

this is the first string:	3:YKRkjWKC1:YS
a totally different sentence: 3:sRKP+DTfWALXAl1:sa+DKALQl

If we take an example of:

to be or not to be that is the question, or is it?: 3:qEYAFrxWFr2AWB7PRu:R3rk24
to be or beet to be that is the question, or is it?: 3:qhK9AFrxWFr2AWB7PRu:m3rk24

Initially we have:

to be or : 3:q:

And then the middle bit changes the hash, and then finally we have a match at the end:

q EY  AFrxWFr2AWB7PRu
q hK9 AFrxWFr2AWB7PRu

As we have inserted or deleted bytes, similar sequences will always match. This is perfectly matched to malware detection, as we can still detect a match, even if the bytes are not aligned.

Some examples are:

  • word1 =”this is the first string”, word2 =”this is the first string” Try!
  • word1 =”this is the first string”, word2 =”this is the string first” Try!
  • word1 =”this is the first string”, word2 =”this is the first help” Try!
  • word1 =”this is the first string”, word2 =”this keep the first help” Try!
  • word1 =”this is the first string”, word2 =”a totally different sentence” Try!

One system which uses an ssdeep hash to identify malware is VirusTotal:

Conclusions

With CTPH we create a fuzzy hash. Other fuzzy hashing methods include those which have Statiscally-Improbable Features (SIF), such as with sdhash, and for Block-Based Rebuilding (BBR), such as mvHash-B.

References

[1] Kornblum, J. (2006). Identifying almost identical files using context triggered piecewise hashing. Digital investigation, 3, 91–97.