From Little Acorns Do Mighty Oaks Grow: The Data Science of Cyber Security

One of the great things about being in academia is that you have the chance to grow seeds that can become mighty oaks. And so in June, Dr Owen Lo and I start on a new road. After asecuritysite.com (with over three million visitors per year) and the Bright Red Digital Zone (with over 120K registered users, and currently used by tens of thousands of pupils in Scotland for the N5/Higher/Advanced Higher exams), we embark on a new route.

With Data Labs funding, we are now creating a MOOC related to the Data Science of Cyber Security. If there’s one area of our economy that needs Data Science, it is Cyber Security, and we need our future data scientists to be more security-aware, and our cyber security professionals to have the tools to find ‘a pin-prick on the moon’.

Owen and I worked on the Digital Zone, and we are up for the challenge of creating an immersive environment for learning Data Science for Cybersecurity. It will integrate Python into a browser, and be supported by videos, challenges, and even a book. We now have open source tools at our fingertips, and we are going to use them. At the core must be deep learning.

The content will provide some of the most extensive coverage of Data Science within Cyber Security, and is split into four main units, which can be taken either as whole units or as cognitive subject areas. The units and subjects are:

Unit 1: Fundamentals

  • Subject 1: Fundamentals and Threats. Analyse key threats and data models for incident response and reporting, and outline the elements of networked systems and their data infrastructures. A key focus is on signature-based and anomaly-based detection.
  • Subject 2: Fundamentals of Data Capture. Using Data Science techniques to analyse the trails of evidence that are gathered within network and host logs.
  • Subject 3: Data Protection: Cryptography and Access Rights. This will outline some of the fundamental methods used in protecting data as it is transmitted, at-rest, and in-process, while defining the fundamentals of cryptography.
  • Subject 4: Blockchain and distributed ledgers. This chapter will outline the basic operations of distributed ledgers and how blockchain methods can be used to increase the trustworthiness of transactions and log events.
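
To give a flavour of the idea behind Subject 4, here is a minimal sketch of a hash chain over log events, where each record commits to the hash of the one before it. It is not course material and not a full blockchain (there is no consensus or distribution); the function names `chain_events` and `verify_chain`, the all-zero starting value, and the sample events are assumptions made purely for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the first record


def _digest(event, prev_hash):
    """Hash an event together with the hash of the previous record."""
    payload = json.dumps({"event": event, "prev_hash": prev_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def chain_events(events):
    """Link log events so that each record depends on all earlier ones."""
    chained = []
    prev_hash = GENESIS
    for event in events:
        record = {
            "event": event,
            "prev_hash": prev_hash,
            "hash": _digest(event, prev_hash),
        }
        chained.append(record)
        prev_hash = record["hash"]
    return chained


def verify_chain(chained):
    """Return True only if no record has been altered or reordered."""
    prev_hash = GENESIS
    for record in chained:
        if record["prev_hash"] != prev_hash:
            return False
        if _digest(record["event"], record["prev_hash"]) != record["hash"]:
            return False
        prev_hash = record["hash"]
    return True


# Hypothetical log events, invented for this sketch.
log = chain_events(["user alice logged in", "file report.doc read", "user alice logged out"])
print(verify_chain(log))
```

Changing any earlier event after the fact breaks every later link, which is what makes tampering with a chained log detectable.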

Unit 2: Fundamental Data Science Methods for Cyber Security

  • Subject 5: Log File Analysis, e-Discovery and Timelining. This will investigate the mining of network and host logs for the detection of threats, and then timeline these.
  • Subject 6: Data Analysis. This chapter will investigate the key methods used to search, aggregate, link, parse and join data sets.
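
As an illustration of the timelining idea in Subject 5, the sketch below parses a few log lines from different hosts and orders them into a single timeline. The log format (`<ISO timestamp> <host> <message>`), the host names, the dates and the function names are all hypothetical; real network and host logs vary widely in format.

```python
from datetime import datetime

# Hypothetical log lines gathered from two hosts, deliberately out of order.
SAMPLE_LOGS = [
    "2024-01-01T09:15:02 fw01 DENY tcp 10.0.0.5 -> 10.0.0.9:22",
    "2024-01-01T09:14:58 web01 login failed for user admin",
    "2024-01-01T09:15:10 web01 login succeeded for user admin",
]


def parse_line(line):
    """Split an assumed '<timestamp> <host> <message>' line into fields."""
    timestamp, host, message = line.split(" ", 2)
    return {
        "time": datetime.fromisoformat(timestamp),
        "host": host,
        "message": message,
    }


def build_timeline(lines):
    """Parse every line and return the events sorted by time."""
    return sorted((parse_line(line) for line in lines), key=lambda e: e["time"])


for event in build_timeline(SAMPLE_LOGS):
    print(event["time"].isoformat(), event["host"], event["message"])
```

Once events from several sources sit on one timeline, patterns such as a failed login followed shortly by a successful one become much easier to spot.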

Unit 3: Machine Learning for Cyber Security

  • Subject 7: Core Methods. This chapter will outline the usage of the key methods, including: neural networks; decision trees; linear regression; generalised linear models; random forests; association rules; cluster models; Naïve Bayes models; and support vector machines. A key element of this will be a critical review of how each of these methods fits within Cyber Security.
  • Subject 8: Machine Learning Methods. Machine learning methods will be studied in connection with data capture; data pre-processing; feature extraction; the analysis engine; and the decision engine.
  • Subject 9: Data Transformation. This part will outline the key elements of building a machine learning model, including: normalisation, mapping and discretisation processes; data aggregation; and associated function calls.
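
The normalisation and discretisation steps named in Subject 9 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: min-max scaling and equal-width bins are one choice among many, and the feature (`bytes_sent` per connection) and its values are invented for the example.

```python
def min_max_normalise(values):
    """Rescale values linearly into the range 0.0 to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to scale, so map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretise(scaled, bins):
    """Map values in [0, 1] to equal-width bin indices 0 .. bins-1."""
    return [min(int(v * bins), bins - 1) for v in scaled]


# Hypothetical feature: bytes sent per connection.
bytes_sent = [200, 450, 1200, 5000, 800]

scaled = min_max_normalise(bytes_sent)
levels = discretise(scaled, bins=4)

print(scaled)
print(levels)
```

With a single large outlier, equal-width bins push almost every other value into the lowest bin, which is one reason the choice of transformation matters before a model ever sees the data.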

Unit 4: Threat models

  • Subject 10: Spear Phishing Attack. This subject will provide practical data sets for the detection of a phishing email.
  • Subject 11: Insider Detection. This subject will provide practical data sets for the analysis of insider threats, including using Data Science to detect anomalies in user behaviour. This will include predictive and real-time detection.
  • Subject 12: Intelligence Gathering. This subject will analyse how social media logs can be crawled and analysed for the detection of threats. It will include OSINT methodologies and tools.

Throughout the course, the data sets presented will either be open source information or information generated for scenario-based training. No personal information will be revealed within the course, and the legal and ethical context of the material will be integrated directly into the content. Each subject will contain tutorial and test elements, and these will use Python examples, with full on-line demonstrations of solutions.

So thank you to The Data Labs for the funding, and watch this space for demos and workshops. We aim to use the latest open source tools, but our core aim is to produce content that will allow Data Scientists to learn about Cybersecurity, and Cybersecurity professionals to learn about Data Science. In fact, we want to engage as many people as we can to learn about the future of our world. If you want to come and help us, please say.

From Little Acorns Do Mighty Oaks Grow … ask Zonefox, Symphonic and Cyan Forensics.