Teaching Cybersecurity and Machine Learning

Breaking Down the Python Barriers and Integrating With Splunk to Learn ML

Photo by Christina @ wocintechchat.com on Unsplash

In teaching, you often need to innovate and try out new things. Some things will work, and others won’t. So, this week, we introduced a lecture and lab on cybersecurity and machine learning (ML). Why? Well, ML is often taught from a data science point of view, and its presentation typically has no real application to cybersecurity. Students in cybersecurity and networking can then struggle to see the importance of the topic. Along with this, I think machine learning is one of the least understood areas of cybersecurity, and just understanding the core concepts is a significant step forward. There’s a small barrier to get over, and it’s often just a matter of understanding the key areas of knowledge.

Another problem is that students often get presented with Python code built on scikit-learn. While researchers see this as a natural way of presenting machine learning, it can also be a barrier to understanding the methods. The gap between the core principles and the Python code can often break down the learning process. And so, this week, we used Splunk to present the basic methods, with data sets that are relevant to cybersecurity.

For this, I presented a lecture and lab to our (excellent) Year 3 students. The lecture provided a background to intelligence and outlined its different aspects, followed by the main coverage of ML and cybersecurity.

But it is the practical side that should bring the topic alive, and so we gave students access to the Splunk machine learning experimental area [here].

To me, a topic comes alive when you can relate theory to practice. A core part of this is to use data sets that are relevant to cybersecurity and networking. While we still use some standard ones, such as the US Asthma data set and the iris one, the lab generally tries to explain ML through data sets that relate to malware analysis and network traffic flows.

What do students need to understand?

For cybersecurity students, the main thing to learn is the confusion matrix, which we use to assess our true positives and false positives. If possible, we want to maximise the true-positive rate and minimise the number of false positives. To understand how well our models work, we take these values and determine metrics such as accuracy (the fraction of all predictions that are correct) and recall (the fraction of actual positives that we detect). This applies to categorical prediction, such as with logistic regression, SVM (Support Vector Machine) and the Random Forest Classifier.
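
For readers who do want to see what sits behind these numbers, here is a minimal sketch in plain Python. The labels ("malware" and "benign"), the sample data and the function names are assumptions made for illustration, and precision is added here as a companion metric because it is the one that drops as false positives rise.

```python
from collections import Counter


def confusion_counts(actual, predicted, positive="malware"):
    """Return (TP, FP, TN, FN) for a two-class problem."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            # We raised an alert: it was either correct (TP) or a false alarm (FP).
            counts["tp" if a == positive else "fp"] += 1
        else:
            # We stayed quiet: we either missed a threat (FN) or were right (TN).
            counts["fn" if a == positive else "tn"] += 1
    return counts["tp"], counts["fp"], counts["tn"], counts["fn"]


def accuracy(tp, fp, tn, fn):
    """Fraction of all predictions that were correct."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


def recall(tp, fn):
    """Fraction of the real positives that we detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """Fraction of our alerts that were real positives."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical ground truth and model output for eight packets.
actual = ["malware", "malware", "malware", "benign",
          "benign", "benign", "benign", "malware"]
predicted = ["malware", "malware", "benign", "benign",
             "malware", "benign", "benign", "malware"]

tp, fp, tn, fn = confusion_counts(actual, predicted)
print("TP, FP, TN, FN:", tp, fp, tn, fn)
print("accuracy:", accuracy(tp, fp, tn, fn))
print("recall:", recall(tp, fn))
print("precision:", precision(tp, fp))
```

In this made-up sample, one missed detection and one false alarm are enough to pull each metric down from 1.0, which is the trade-off the confusion matrix makes visible.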

If we want to predict numeric values from our numeric features, we can use a method such as linear regression, Random Forest Regressor or Lasso. To understand our success here, we use evaluation metrics such as R² and RMSE (Root Mean Square Error). RMSE measures the typical distance between the real values and the predicted values, and we want to get it as near to zero as possible. An R² score of 1 (the best score) means we have a perfect fit, while a score of zero means the model does no better than always predicting the mean, which is a very poor fit. R² can even go negative for a model that does worse than that.
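
Both metrics are short enough to write out by hand. The sketch below uses the standard definitions, with RMSE as the square root of the mean squared error and R² as one minus the ratio of the residual sum of squares to the total sum of squares. The sample values and function names are invented for illustration.

```python
import math


def rmse(actual, predicted):
    """Root Mean Square Error: square root of the average squared error."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot (undefined if every actual value is identical)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical real values and model predictions.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]

print("RMSE:", rmse(actual, predicted))
print("R^2:", r_squared(actual, predicted))

# A "model" that always predicts the mean of the real values scores R^2 = 0.
baseline = [sum(actual) / len(actual)] * len(actual)
print("R^2 of mean-only baseline:", r_squared(actual, baseline))
```

The mean-only baseline at the end is a useful reference point: any model worth keeping should beat it.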

For clustering, we group our data into a number of clusters, where we define the number of clusters ourselves. For this, we can use a method such as k-means.
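
As a rough sketch of what k-means does, the plain-Python version below alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its members. The six two-dimensional points are invented. Seeding the centroids with the first k points is an assumption made here so that the result is repeatable, whereas real implementations usually pick the starting centroids more carefully.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means on a list of equal-length numeric tuples."""
    # Assumption for repeatability: start from the first k points.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # Nothing moved, so the clustering has converged.
        centroids = new_centroids
    return labels, centroids


# Hypothetical points forming two obvious groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0),
          (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]

labels, centroids = kmeans(points, k=2)
print("cluster labels:", labels)
print("centroids:", centroids)
```

The only choice the user makes is k, the number of clusters; everything else follows from the distances between points.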

Another key area in ML for cybersecurity is anomaly detection and outlier detection. This is useful for detecting something that does not fit the norm. This could relate to a customer whose spending normally stays within a certain range, but who then purchases something well outside it. Or our Web traffic on a Monday morning might range from 10–100 Mbps, but we then see traffic flows that are outside that range.
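
One simple way to turn "outside the norm" into a rule is a z-score test: learn the mean and standard deviation from normal traffic, then flag anything more than a few standard deviations away. This is only one of many possible approaches, and the baseline readings, the threshold of 3 and the function names below are assumptions for illustration.

```python
import statistics


def fit_baseline(samples):
    """Learn what 'normal' looks like: the mean and sample standard deviation."""
    return statistics.mean(samples), statistics.stdev(samples)


def is_outlier(value, mean, stdev, threshold=3.0):
    """Flag a value that lies more than `threshold` standard deviations from the mean."""
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > threshold


# Hypothetical Monday-morning throughput readings in Mbps.
monday_mbps = [20, 35, 50, 65, 80, 45, 55, 70, 30, 60]
mean, stdev = fit_baseline(monday_mbps)

for reading in (75, 400):
    status = "outlier" if is_outlier(reading, mean, stdev) else "normal"
    print(reading, "Mbps ->", status)
```

A reading of 75 Mbps sits comfortably inside the learned range, while 400 Mbps is far enough from it to be flagged.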

Overall, when we are fitting a model, we select a field to predict (such as whether a data packet contains malware or not) and the fields to use for the prediction (such as the TCP source port and IP destination address). If the field to predict holds categories, such as strings, we need to use categorical prediction, and if we want to predict a numerical value, we can use numerical prediction.
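
The same choice can be sketched in a few lines of Python: pick the field to predict, pick the fields to predict from, and let the type of the target decide between categorical and numerical prediction. The record layout and field names (`src_port`, `dest_ip`, `bytes`, `label`) are hypothetical, and the type check is a simplification of what a real tool would do.

```python
def split_fields(records, target, features):
    """Separate each record into the feature values (X) and the value to predict (y)."""
    X = [[row[f] for f in features] for row in records]
    y = [row[target] for row in records]
    return X, y


def prediction_type(y):
    """Numeric targets call for numerical prediction; anything else is categorical."""
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in y
    )
    return "numerical" if numeric else "categorical"


# Hypothetical packet records.
records = [
    {"src_port": 443, "dest_ip": "10.0.0.5", "bytes": 1200, "label": "benign"},
    {"src_port": 4444, "dest_ip": "10.0.0.9", "bytes": 64, "label": "malware"},
    {"src_port": 80, "dest_ip": "10.0.0.5", "bytes": 900, "label": "benign"},
]

# Predicting a string label is a categorical problem.
X, y = split_fields(records, target="label", features=["src_port", "dest_ip"])
print(prediction_type(y))

# Predicting the byte count is a numerical problem.
X2, y2 = split_fields(records, target="bytes", features=["src_port"])
print(prediction_type(y2))
```

Note that string-valued features such as `dest_ip` generally need to be encoded as numbers before most algorithms can use them.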

Conclusions

I hope using the experimental setup in Splunk has broken down the barrier of Python, and that students can see the worth of machine learning for the subject area. Machine learning is not just for data scientists; it is for everyone to learn. It can be applied to so many areas, and cybersecurity needs it more than virtually any other area.

So, go build a smarter future, and perhaps stop showing Python code as the natural way to learn about machine learning. Using the Splunk platform can be a way to show how ML would be applied to real-world cybersecurity problems, and to break down some of the barriers we have in understanding the core methods involved.