“Machine Learning is Mainly Not Data Science!”

Are we creating an amazing new world of AI, or a rat's nest of problems?


Well, a quote from Arun Ghosh at KPMG stuck out for me last week:

“a large part of machine learning is not data science but data engineering”

and then he followed up with:

“It’s cleaning and collating and integrating information, and then you run the algorithm. What we are finding is that you can compress the data engineering process by adding a trusted layer that is immutable by nature.”

And I realised that some people in the industry are starting to understand the need to rebuild our systems in a trustworthy way. Many in the industry currently promote throwing a whole lot of data at AI and letting the machine sort things out. But how can we trust the veracity of that data, or even know that it was correct in the first place? Just like in finance, we now need audit trails and the ability to trace data back to its source.

Corporations should now be analysing their data and asking serious questions … “Who owns it?”, “Do we have their consent?”, “Where did it come from?”, “Has it been modified?”, “Can we retrain without using these parts of the data set?”, and so on. We need to be able to audit our data sources!
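To make the audit concrete, here is a minimal sketch (my own illustration, not any specific product or company's schema) of how those questions could be attached to each training record as provenance metadata, so that a model can be retrained without the parts of the data set we cannot vouch for:

```python
from dataclasses import dataclass, field

@dataclass
class DataRecord:
    """Illustrative provenance metadata for one training record."""
    record_id: str
    owner: str              # "Who owns it?"
    consent: bool           # "Do we have their consent?"
    source: str             # "Where did it come from?"
    modified: bool = False  # "Has it been modified?"
    payload: dict = field(default_factory=dict)

def auditable_training_set(records):
    """Keep only records that pass the provenance audit, so the model
    can be retrained without the data we cannot vouch for."""
    return [r for r in records if r.consent and not r.modified]

records = [
    DataRecord("r1", "Alice", True, "sensor-A"),
    DataRecord("r2", "Bob", False, "web-scrape"),  # no consent: excluded
]
usable = auditable_training_set(records)
print([r.record_id for r in usable])  # ['r1']
```

The point is not the data structure itself, but that the answers to the audit questions travel with the data, rather than being reconstructed after the fact.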

For AI, we often have a “black box” approach, which scares many organisations, as they have no idea what is inside the black box, or how it will react to certain conditions. Microsoft was caught out by this when it released Tay, which aimed to learn from on-line activity, but she ended up learning from the worst of humankind.

With an expert system, we knew exactly what the knowledge was, and could predict how our system would react to a range of situations. For AI, one bit of bad data could shut our companies down!

I remember when I worked on a consultancy project at St Fergus, which looked to optimize a gas compressor (one which, at the time, pumped around 40% of all the gas in the UK). I created a Neural Network solution for optimizing it, but the company did not want it, as their control engineers could not see how it worked. So, in the end, an expert system ran most of the time, with the NN kicking in to perform a bit of optimization, and then being kicked off again if it moved too far away from the norm.

And so, to overcome these concerns, Microsoft has released Azure Blockchain Data Manager, which is ledger-agnostic. Overall, it provides a way to transact data from nodes in a trustworthy way, with the integration of smart contracts. A core focus is on creating trustworthy IoT infrastructures. This approach can then be used to prove the provenance of the data before it is fed into the AI infrastructure, and then to provide an audit trail for the learning process. The data could be encoded into different forms, but the core data would still be there to trace back to. This has particular uses within the supply chain industry and could be used to trace foodstuffs from production to consumption.
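To illustrate the idea of a trusted layer that is "immutable by nature" (this is just a sketch of a hash chain in Python, not the actual Azure Blockchain Data Manager API), each data event can be hashed together with the previous entry, so that any later tampering with the data breaks the chain and shows up in an audit:

```python
import hashlib
import json

def block_hash(prev_hash, payload):
    """Hash the previous block's digest together with the payload, so a
    change to any earlier entry invalidates every entry after it."""
    raw = prev_hash + json.dumps(payload, sort_keys=True)
    return hashlib.sha256(raw.encode()).hexdigest()

class ProvenanceLedger:
    """Append-only chain of data events: a toy immutable audit trail."""
    GENESIS = "0" * 64

    def __init__(self):
        self.chain = []  # list of (payload, digest) pairs

    def append(self, payload):
        prev = self.chain[-1][1] if self.chain else self.GENESIS
        self.chain.append((payload, block_hash(prev, payload)))

    def verify(self):
        prev = self.GENESIS
        for payload, digest in self.chain:
            if block_hash(prev, payload) != digest:
                return False
            prev = digest
        return True

ledger = ProvenanceLedger()
ledger.append({"event": "ingest", "source": "sensor-A"})
ledger.append({"event": "clean", "rows_dropped": 3})
print(ledger.verify())  # True

# Tamper with the first event while keeping its stored digest:
ledger.chain[0] = ({"event": "ingest", "source": "forged"}, ledger.chain[0][1])
print(ledger.verify())  # False: the audit detects the change
```

A real deployment would distribute the chain across nodes and anchor it in a blockchain, but the auditability property comes from this same hash-linking idea.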

Conclusions

So, data architects should be asking … “Where did this data come from, and can you show me the transaction that generated it?” Only then can we have provable systems using AI.