
AI without good data is just RS

In addition to huge amounts of data, machine learning needs decisions made by humans, in the form of training data. Only this way can AI learn to make decisions on its own.


Jyrki Suokas


All forms of artificial intelligence (AI) – and more specifically its burgeoning form, machine learning (ML) – need data, and lots and lots of it. Machine learning algorithms also require teaching data – meaning raw data sets and decisions made by humans on each particular set – so the adaptive algorithms can be used to make decisions.

In fact, these algorithms are not very smart – the simple trick is to use correct and correctly sampled teaching data so the algorithms can start making decisions on behalf of humans. The key here is the definition of the word “correct”: if the decisions (again, made by human beings) contain bias, the ML algorithm very efficiently learns or “inherits” the organisational bias and starts to make the same decisions that have already been made but, by the very nature of being technology, makes them in a very opaque fashion. Algorithms by themselves are never biased. Neither is data by itself biased. Only the decisions that are used to teach the algorithm can contain bias.
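The inheritance of bias described above can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the `history` records, the `region` and `income` fields and the majority-vote "learner" are hypothetical and far simpler than any real ML system. The point is only that a model trained on past human decisions reproduces whatever pattern those decisions contain.

```python
from collections import Counter, defaultdict

# Hypothetical past decisions made by humans: (region, income band, decision).
# Both regions have the same income band, but reviewers approved "north"
# applicants more readily than "south" applicants.
history = [
    ("north", "mid", "approve"),
    ("north", "mid", "approve"),
    ("north", "mid", "approve"),
    ("south", "mid", "reject"),
    ("south", "mid", "reject"),
    ("south", "mid", "approve"),
]

def train(examples):
    """Learn the most common human decision for each feature combination."""
    votes = defaultdict(Counter)
    for region, income, decision in examples:
        votes[(region, income)][decision] += 1
    return {
        features: counts.most_common(1)[0][0]
        for features, counts in votes.items()
    }

def predict(model, region, income):
    """Return the decision the model learned for this feature combination."""
    return model[(region, income)]

model = train(history)
print(predict(model, "north", "mid"))  # approve
print(predict(model, "south", "mid"))  # reject
```

The learner contains no rule about regions at all. The different outcomes for two otherwise identical applicants come entirely from the decisions it was taught with.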

There are not many organisations in the world that have data of the needed (1) volume (amount, history, etc.), (2) scope (breadth of different data points within a particular domain) and (3) quality (correct, timely and attributable to me). This high-quality data needs to be fed to their algorithms as raw material so that a truly 360-degree view of me can be formed – whether in terms of the past, present or future. Consequently, innovators of new services in which ML or other AI technologies are used to create insights out of data are forced to source the data from third parties, as no single organisation by itself has all the data there is on me in its possession. But since the GDPR took effect on 25 May 2018, companies have found it much more difficult to acquire supplementary data from other sources.

Not having all the data is a two-sided problem.

First, it is a problem for me: I share my banking business across a handful of banks; I am a customer of three or four mobile operators; I buy my daily groceries from several supermarket chains, dutifully remembering every time to show my loyalty card at the checkout. None of these organisations can create the full picture of me and for me, even within their own field of industry, let alone across numerous different fields. I think this is probably the reason why I get strange and often oddly biased marketing messages from all of them, created of course with the latest and greatest marketing tools, the newest of which use ML. Another example of strange results is when somebody in our extended family forgets to log in to our joint Netflix account with their own avatar and I, being the primary user, start getting recommendations along the lines of “Because you liked Driving Miss Daisy so much you should watch…” No, I want my Altered Carbon, thank you very much.

Second, it is a huge problem for the organisations that need the data: all of them seem to have acquired these top-of-the-line 1:1 marketing tools to either stay abreast or get ahead of the game. And these tools and systems do not come easy or cheap.

Where are the tool vendor’s promised up-selling, cross-selling, in-store-selling, yard-selling uplifts (insert any other preferred techno jargon terms here)? Where is the ROI for the investment? With half of the data you get less than half of the results, and those results diminish rapidly in proportion to the share of wallet these organisations have of their individual customers. And we have not even started to talk about the majority of large organisations whose data and system landscape is “non-optimal”, that is, to use proper human-speak, just “plain messy”. Installing AI/ML systems on top of this kind of environment is more or less like putting lipstick on a pig.

We should start making a clear distinction between data management and data use.

So, what should be done? First and foremost, we should stop muddying the waters when we talk about AI and ML, and we should start making a clear distinction between data management and data use. If data is not available, we cannot make decisions. AI is particularly relevant to the middle box. Of course, AI algorithms can be used to combine and cleanse the data, but for simplicity’s sake it is easier to reserve the middle box for AI/ML where the usage is warranted and creates added value. The key here is that the output of the third box – automated decision-making – is looped automatically back to the operational systems. This way decisions can be made faster, encompass all targets and, most importantly, be devoid of human mistakes.

In summary, when we have more than one data source, we need to be able to combine these multiple data streams into one harmonised set to enable us to perform analysis upon which we can make pre-emptive, predictive and/or descriptive decisions. This is not easy in this day and age of GDPR when the consent of the individual is needed for many actions. If each organisation creates its own consent-management processes, we as individuals will drown in a sea of consents in the same way we were inundated by the tidal wave of GDPR-related emails last May.
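As a rough sketch of what combining several data streams under consent could look like, the example below merges per-customer records from two sources into one harmonised record, and skips any source the individual has not consented to. The `harmonise` function, the source names and the shape of the `consents` mapping are all hypothetical, chosen only to make the idea concrete. Real consent management under the GDPR involves far more than a lookup table.

```python
def harmonise(sources, consents):
    """Merge records per customer, keeping only sources the customer consented to.

    sources:  {source_name: {customer_id: {field: value}}}
    consents: {customer_id: set of source names the customer has approved}
    """
    merged = {}
    for source_name, records in sources.items():
        for customer_id, fields in records.items():
            # Skip this source's data if the customer has not approved it.
            if source_name not in consents.get(customer_id, set()):
                continue
            # Prefix each field with its source so the origin stays traceable.
            merged.setdefault(customer_id, {}).update(
                {f"{source_name}.{key}": value for key, value in fields.items()}
            )
    return merged

# Hypothetical example data.
sources = {
    "bank": {"c1": {"balance": 1200}},
    "grocer": {"c1": {"weekly_spend": 85}, "c2": {"weekly_spend": 40}},
}
consents = {"c1": {"bank", "grocer"}, "c2": set()}

combined = harmonise(sources, consents)
print(combined)
```

Customer `c1` ends up with a single record covering both sources, while `c2`, who has given no consent, does not appear in the harmonised set at all.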

When the difficult part, consent management, is solved and individuals are able to enforce their GDPR-given superpowers by using their own “GDPR consent remote controls”, it will be easier for organisations to move data and to process that data. Then it can be either a human being or a machine performing the analysis, and in the case of more automated systems, machines can also make decisions on people’s behalf. But only if we can prove it makes sense and our business benefits from this kind of set-up.

And what was the “RS” in the headline? It stands for “real stupidity”.

