Machine Learning in Digital Forensics

Data Science is not just for Data Scientists, and Machine Learning (ML) is not just for AI Developers.

Data science applies long-established maths to data analysis. Engineers and scientists have been using these methods for decades, so the term "data scientist" is a little lost on many of them. These days, most researchers could be defined as data scientists, as they spend a good deal of their time capturing, analyzing and presenting data. The same holds for ML: the years of it being a specialized tool for those who study ML have passed. Now, ML is a tool in every researcher’s tool kit, and with the click of a button we have access to advanced ML models and deep learning techniques.

What is required is not so much a deep understanding of ML as core domain knowledge: without a deep understanding of the domain, there is little chance of advancing research. A data scientist is unlikely to uncover a new method to determine a side channel in an elliptic curve operation, but a cryptography researcher can readily apply ML methods to their art.

It is in cybersecurity that we possibly face one of the greatest challenges in data science: how do we detect and mitigate cyber threats when we are increasingly swamped with data? Nowhere is this more relevant than in digital forensics, which has seen a massive increase in the amount of data archived and analysed.

A new paper now reviews the state of the art in the use of ML in digital forensics [1].

A basic analysis of keywords in papers shows that image-based research and the use of CNNs (Convolutional Neural Networks) seem to be at the forefront of ML work in digital forensics.
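As a rough illustration of what such a keyword analysis involves, here is a minimal sketch in Python. The keyword lists are hypothetical and are not taken from the review; the function name `keyword_counts` is also just an assumption for this example. It simply counts how many papers mention each keyword.

```python
from collections import Counter

# Hypothetical author keywords for four papers (illustrative only,
# not data from the review paper).
papers = [
    ["CNN", "image forensics", "deepfake detection"],
    ["CNN", "camera model identification", "image forensics"],
    ["SVM", "printer identification"],
    ["LSTM", "network forensics", "CNN"],
]

def keyword_counts(keyword_lists):
    """Count in how many papers each keyword appears (case-insensitive)."""
    counts = Counter()
    for keywords in keyword_lists:
        # A set avoids counting a keyword twice within one paper.
        counts.update({k.lower() for k in keywords})
    return counts

counts = keyword_counts(papers)
top = counts.most_common(2)
print(top)  # [('cnn', 3), ('image forensics', 2)]
```

A real study would work from a much larger corpus and would normally merge synonyms (for example "CNN" and "convolutional neural network") before counting.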

So, why are CNNs ahead of other methods? Well, just like the brain, a CNN is good at detecting objects within images and can be trained to detect these. An analysis of the linkages shows key research hubs around detection, identification and images.
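To make the CNN point a little more concrete, the sketch below shows the core operation of a convolutional layer: sliding a small kernel over an image and summing the products. The toy image and the hand-picked edge kernel are assumptions for illustration; in a trained CNN the kernel weights are learned from data rather than written by hand.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out

# Toy 4x4 greyscale image: dark left half (0), bright right half (9).
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]

# Hand-picked vertical-edge kernel (hypothetical weights).
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d_valid(image, kernel)
print(feature_map)  # [[0, 18, 0], [0, 18, 0], [0, 18, 0]]
```

The output is zero over the flat regions and large only where the dark and bright halves meet, which is the sense in which a filter "detects" a feature. A CNN stacks many such filters and learns which ones are useful for the task.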

For the ML methods used, we see that CNN and DL (Deep Learning) are the most popular. Other methods used include Bayesian methods, Logistic Regression (LR), KNN (k-nearest neighbours), LSTM (Long Short-Term Memory), CapsNet, and K-means.

The paper identifies that the use of deep learning models in digital forensics has grown significantly since 2017. Key research domains include:

CGI and fake detection. This includes the detection of fake audio, video and images. Overall, the authors find that deep learning models, especially CNN-based models, have achieved good results for fake detection because they have many features to train on. Unfortunately, their performance decreases on blind detection. The example given is that, when the computer graphics rendering tools used are not known, it is often difficult for the model to pinpoint a unique digital signature within the processing. For this, convolutional trace analysis, with feature extraction using the Expectation-Maximization (EM) algorithm combined with an SVM, has produced good results [2].
Manipulation detection. This includes the detection of manipulation in audio, video, images and text. With images, a CNN model can combine feature extractors and classifiers [3].
Camera or phone source identification. An important piece of evidence can be the link between a camera or phone and an image or video, such as through a camera fingerprint for copyright purposes. Again, CNNs provide ways to identify the camera model [4].
Printer source identification. While the usage of printers has reduced over the years, there is still important work being done on fingerprinting printers down to the model and individual instance (such as for copyright ownership and document manipulation). As with text detection methods, SVM (Support Vector Machine) provides a strong method for identifying the source printer of text documents [5].
Authorship attribution and profiling. This area identifies the profile of an author. While there is a good correlation for fairly long documents, it can be difficult for short messages (such as those on social media) [6]. One interesting area of research is Source Code Authorship Attribution, where a hybrid approach of a program dependence graph and a deep learning model has produced high levels of success [7].
Attack and malware detection (network-based). Research in attack and malware detection using network-based methods has advanced a great deal over the years, but still faces challenges around the usage of tunneling. Along with this, a significant amount of pre-processing of the data is needed for the learning process, and much of the attack data is based on time analysis. For this, a recurrent neural network (RNN) with an LSTM (Long Short-Term Memory) architecture can produce good success rates for time-based data. It has also been shown to detect operations from side channels, such as from elliptic curve cryptography (ECC) operations [8].
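For the time-based data mentioned in the last item, it may help to see what an LSTM actually computes at each time step. The following is a minimal sketch of a single-unit LSTM cell in plain Python, not the method of any of the cited papers. The weights in `params` and the `traffic` values are hypothetical and untrained; a real detector would learn the weights from labelled, pre-processed traffic and use many units.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single-unit LSTM cell with a scalar input."""
    pre = {}
    for name in ("i", "f", "o", "g"):
        w, u, b = params[name]  # input weight, recurrent weight, bias
        pre[name] = w * x + u * h_prev + b
    i = sigmoid(pre["i"])      # input gate: how much new information to admit
    f = sigmoid(pre["f"])      # forget gate: how much old memory to keep
    o = sigmoid(pre["o"])      # output gate: how much memory to expose
    g = math.tanh(pre["g"])    # candidate value for the cell memory
    c = f * c_prev + i * g     # updated cell memory
    h = o * math.tanh(c)       # updated hidden state (the cell's output)
    return h, c

def run_sequence(xs, params):
    """Feed a sequence through the cell, starting from a zero state."""
    h, c = 0.0, 0.0
    hidden_states = []
    for x in xs:
        h, c = lstm_step(x, h, c, params)
        hidden_states.append(h)
    return hidden_states

# Hypothetical, untrained weights: (input weight, recurrent weight, bias).
params = {
    "i": (0.5, 0.1, 0.0),
    "f": (0.5, 0.1, 1.0),
    "o": (0.5, 0.1, 0.0),
    "g": (1.0, 0.1, 0.0),
}

# Hypothetical normalised traffic volume per time window, with a burst.
traffic = [0.1, 0.1, 0.9, 0.8, 0.1]
states = run_sequence(traffic, params)
print([round(s, 3) for s in states])
```

The cell memory `c` carries information forward between time windows, so the state after the burst still reflects it when the traffic drops back. That memory is what makes the architecture a natural fit for time-ordered attack data.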

Conclusions

The authors identify that there is generally a lack of a taxonomy and ontology for ML applied to digital forensics but, at least, make a start on defining the key areas involved. Along with this, apart from image and video analysis, the usage of ML remains fairly limited.

References

[1] Nayerifard, T., Amintoosi, H., Bafghi, A. G., & Dehghantanha, A. (2023). Machine Learning in Digital Forensics: A Systematic Literature Review. arXiv preprint arXiv:2306.04965.

[2] Guarnera, L., Giudice, O., & Battiato, S. (2020). Deepfake detection by analyzing convolutional traces. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops (pp. 666–667).

[3] Bayar, B., & Stamm, M. C. (2018). Constrained convolutional neural networks: A new approach towards general purpose image manipulation detection. IEEE Transactions on Information Forensics and Security, 13(11), 2691–2706.

[4] Bondi, L., Baroffio, L., Güera, D., Bestagini, P., Delp, E. J., & Tubaro, S. (2016). First steps toward camera model identification with convolutional neural networks. IEEE Signal Processing Letters, 24(3), 259–263.

[5] Jain, H., Joshi, S., Gupta, G., & Khanna, N. (2020). Passive classification of source printer using text-line-level geometric distortion signatures from scanned images of printed documents. Multimedia Tools and Applications, 79(11–12), 7377–7400.

[6] Boenninghoff, B., Nickel, R. M., Zeiler, S., & Kolossa, D. (2019, May). Similarity learning for authorship verification in social media. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 2457–2461). IEEE.

[7] Ullah, F., Wang, J., Jabbar, S., Al-Turjman, F., & Alazab, M. (2019). Source code authorship attribution using hybrid approach of program dependence graph and deep learning model. IEEE Access, 7, 141987–141999.

[8] Sayakkara, A., Le-Khac, N. A., & Scanlon, M. (2020). Facilitating electromagnetic side-channel analysis for IoT investigation: Evaluating the EMvidence framework. Forensic Science International: Digital Investigation, 33, 301003.