ROC’ing The Data Science of Cyber Security

As you may know, we are developing the Data Science of Cyber Security course, and a core part of this is the investigation of machine learning. Within Cybersecurity, we are increasingly swamped with data, in many different formats and from many different sources. The detection of a data breach often involves searching through terabytes of log data in order to find the trace of an intruder scanning the network (the reconnaissance phase), the dropping of malware onto a site (the delivery phase), the running of a malicious script to install a backdoor (the installation phase), the call back to a master control network (the command and control phase), or even the transfer of files (the action phase).

But if an organisation has good defences in place, it could detect the threat at an early phase, and thus make plans to stop the threat from progressing into the action phase. Many organisations are thus moving to Security Operations Centres for 24x7 monitoring of their data infrastructure. Unfortunately, the security analysts could be swamped by the number of alerts/alarms being generated, or become desensitized by too many false alarms. Thus we increasingly use machine learning to classify our inputs, and thus aid the analysts in making reasoned decisions. We must therefore understand how we classify data, and how we discover our thresholds. This article outlines some of the metrics we use for this, and the creation of the ROC (Receiver Operating Characteristic) Curve.

Introduction

In our data analysis for Cybersecurity, we must often classify our data so that we can efficiently search for things, or use it to trigger alerts from rules. A rule might relate to the blocking of access to a remote site. Thus our classification might relate to building up lists of whitelisted and blacklisted IP addresses. For each access, we may have to assess the trustworthiness of an IP address, such as whether it is listed as malicious, or whether its domain name has existed for a specified amount of time. If we get it right, we have a successful classification; if it is wrong, then we have been unsuccessful. The success rate will then be defined as:

  • True-positive (TP). That something has been successfully classified as the thing that we want it classified as.
  • False-positive (FP). That we have classified something incorrectly as a match. This would be defined as a Type I error. An example might be where a system classifies an alert as a hack when a user enters an incorrect password a number of times, but, on investigation, it is found that the valid user had just forgotten their password.
  • True-negative (TN). That we rejected something, and it is not a match.
  • False-negative (FN). That we have dismissed something, but, in fact, it is true. This is defined as a miss and is a Type II error. With this, a hacker might try a number of passwords for a user, but where the system does not create an alert for the intrusion.

Within Cybersecurity, we must be careful not to desensitize the analyst, and thus it is important not to have too many false-positives, and also not for them to lose trust in the matching process by seeing too many false-negatives. So a quality metric might relate to the accuracy of the classification. Let’s say we search for credit card details in a data source of 100 data records. Our search then returns 10 records which are true positives (TP), and five which have been misclassified as credit card numbers (FP), but are not. It also misses three records which have credit card details but have not been matched (FN). We then have 82 records which do not have credit card details, and which have been correctly identified as not containing credit cards (TN). We can then define a confusion matrix for a binary class classification problem:

Confusion Matrix

The Accuracy of the analysis could then be defined as:

Accuracy = (TP+TN)/Total = (10+82)/100 = 0.92
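
As a quick check, here is a minimal Python sketch (assuming scikit-learn is installed, and using made-up label vectors that simply reproduce the counts above) which builds the confusion matrix and computes the accuracy:

from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical ground truth and predictions matching the example:
# 13 records really contain credit card details (10 found, 3 missed),
# and 87 do not (82 correctly rejected, 5 wrongly flagged).
y_true = [1]*10 + [1]*3 + [0]*82 + [0]*5
y_pred = [1]*10 + [0]*3 + [0]*82 + [1]*5

# confusion_matrix returns [[TN, FP], [FN, TP]] for the labels [0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                  # 82 5 3 10
print(accuracy_score(y_true, y_pred))  # 0.92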

The Sensitivity (or the True Positive Rate) is then the number of true positives (TP) against the total number of actual positive cases (TP+FN):

Sensitivity = TP/(TP+FN) = 10/13 = 0.77

The Sensitivity is also known as Recall.

Within Cybersecurity analytics, it is often the Sensitivity Rate which is seen to be a strong measure of the trustworthiness of the system. Thus the higher the Sensitivity Rate, the higher the confidence that an analyst will have in the machine classification/search. If the Sensitivity Rate is low, the human analysts may lose trust in the machine to properly classify and find things correctly. A rate of 0.1 would mean that only one in ten of the actual positive cases was correctly detected. But this metric does not measure the number of false positives, and these could waste an analyst’s time in investigating something that is not a real match. So an improved measure might relate to the precision of the matches, and will be the ratio of the true positive matches to the total number of positive matches. In this example, we would have a precision of:

Precision = TP/(TP+FP) = 10/(10+5)= 0.67

In this case, the analyst would have to deal with one false positive in every three positive matches. For the Specificity, we define the True Negative Rate: when we do not have a match, how often do we predict it correctly? In our example, this would be:

Specificity = TN/(TN+FP) = 82/(82+5) = 0.94

And the False Positive Rate (FPR) is then 1 minus the Specificity. In our example, this would be:

False Positive Rate = FP/(TN+FP) = 5/(82+5) = 0.06

Prevalence then defines how often the positive condition actually occurs, as a ratio of all the data records that have been sampled. In our example, this would be:

Prevalence = (TP+FN)/Total = (10+3)/100 = 0.13
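
Pulling these together, a short sketch (using the counts from the credit card example above) reproduces each of the metrics:

tp, fp, fn, tn = 10, 5, 3, 82
total = tp + fp + fn + tn

accuracy    = (tp + tn) / total   # 0.92
sensitivity = tp / (tp + fn)      # 0.77 - recall, or True Positive Rate
precision   = tp / (tp + fp)      # 0.67
specificity = tn / (tn + fp)      # 0.94 - True Negative Rate
fpr         = fp / (tn + fp)      # 0.06 - i.e. 1 minus the Specificity
prevalence  = (tp + fn) / total   # 0.13

print(accuracy, sensitivity, precision, specificity, fpr, prevalence)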

The ROC Curve

Along with these measures, we can also define a ROC (Receiver Operating Characteristic) Curve. This plots the performance of the classifier for all possible thresholds. With this, we plot the True Positive Rate (TPR) — the sensitivity — against the False Positive Rate (FPR) — which is one minus the specificity. It is used to understand the real costs/benefits within the decision-making process. The ROC curve is used in many areas, including in understanding the ability of humans to react to stimuli. From this curve, we can analyse the effect that different thresholds would have in creating true positives and false alarms. For example, we might define a threshold of three incorrect logins for a user as generating an alert. Unfortunately, many users who forget their password will try it more than three times. This would trigger a false positive — a false alarm — for an attempted hack on the system by valid users, but would generate a true positive for an attacker who is continually trying different passwords. We may thus train our system to distinguish valid users trying different passwords from an attacker using brute force.

In the example in Listing 1, we will try to distinguish Bob from Eve by sampling their typing speeds. For Eve, we determine that she types at 20, 25, 16, 42 and 22 characters per minute, and Bob types at 50, 41, 60, 54 and 39 characters per minute. We then use the metrics.roc_curve() method in order to generate a range of threshold values that could be used to differentiate Bob from Eve, and also return the FPR and TPR for these threshold values. In this case, the result is:

FPR: [0. 0. 0. 0.2 0.2 1. ]
TPR: [0. 0.2 0.6 0.6 1. 1. ]
Thresholds: [61 60 50 42 39 16]

We can then see that if we use a threshold of 61 characters per minute, the TPR and FPR values will both be zero, and thus we will never detect Bob (but will also never raise a false alarm for Eve). Next, a threshold of 60 gives us a TPR of 0.2, and so we can detect Bob in one-in-five occurrences (as one of his sampled typing speeds was 60 characters per minute). Next, we have a threshold of 50 characters per minute. In this case, we increase the TPR to 0.6, and still have a zero FPR. We could now detect Bob in 60% of the occurrences from the training set. If we reduce the threshold down to 42 characters per minute, the TPR stays the same, but we now get a value of 0.2 for the FPR. This is because Eve had a rate of 42 characters per minute in the training set, and thus if we used this threshold we would detect Bob in three of his five samples, and falsely flag Eve in one of her five. If we now drop to 39 characters per minute, we will always detect Bob (TPR = 1.0), and have an FPR of 0.2. A threshold of 16 characters per minute then gives us a TPR of 1.0 and an FPR of 1.0. The resulting ROC curve is given in Figure 1.

Figure 1: Example ROC

Coding

The coding is [here]:
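
For reference, a minimal sketch of the kind of code involved (assuming scikit-learn and matplotlib are available, using the typing speeds from the worked example above as the scores, and Bob as the positive class) might look like this:

from sklearn import metrics
import matplotlib.pyplot as plt

# Typing speeds (characters per minute); Bob is the positive class
names  = ['Eve', 'Eve', 'Eve', 'Eve', 'Eve', 'Bob', 'Bob', 'Bob', 'Bob', 'Bob']
speeds = [20, 25, 16, 42, 22, 50, 41, 60, 54, 39]
y_true = [1 if name == 'Bob' else 0 for name in names]

print(names)
print(speeds)

# roc_curve returns the FPR and TPR for a range of candidate thresholds
fpr, tpr, thresholds = metrics.roc_curve(y_true, speeds)

print('FPR:', fpr)
print('TPR:', tpr)
print('Thresholds:', thresholds)

# Plot the ROC curve, along with the line of no discrimination
plt.plot(fpr, tpr, marker='o')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve for distinguishing Bob from Eve')
plt.show()

Note that the sample run below uses a slightly different set of typing speeds, and so returns slightly different FPR/TPR values and thresholds.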

And a sample run is [here]:

['Eve', 'Eve', 'Eve', 'Eve', 'Eve', 'Bob', 'Bob', 'Bob', 'Bob', 'Bob']
[30, 25, 16, 42, 22, 50, 41, 60, 54, 80]
FPR: [0. 0. 0. 0.2 0.2 1. ]
TPR: [0. 0.2 0.8 0.8 1. 1. ]
Thresholds: [81 80 50 42 41 16]

And the chart [here]:

Conclusion

If you are interested in the Data Science of Cyber Security, get in contact with me, as we will be running taster courses in September and October.