Photo: Susa Junnola

Published December 22, 2016

What can science learn from Google?

Can data really replace scientific methods?

Digital space might grow tenfold by the year 2020. How can science benefit from this? Hannu-Pekka Ikäheimo discusses the topic in the Weekly Notes blog.

The Big Data hubris already reached a culmination of sorts on the other side of the pond in 2008. That was when Chris Anderson, then editor of Wired magazine, predicted in his article The End of Theory: The Data Deluge Makes the Scientific Method Obsolete that the exponential growth of data would lead to the abandonment of scientific methods altogether. According to Anderson, we may soon do away with scientific classifications, ontology, hypotheses and testing, as the numbers speak for themselves. Statistical algorithms find patterns that yield faster and more accurate information about people’s true behaviour. Anderson regarded the question of why people do what they do as insignificant; knowing how they act in real life was enough. This inspired him to end his article with a question that has since become a classic: what can science learn from Google?

Anderson’s provocative theses immediately triggered a cascade of protests in the scientific community, and lively discussion on the data deluge has continued ever since the article’s publication. In his article published in 2014, Big Data: a big mistake, economist Tim Harford reminded those immersed in the data craze of the basic principles of statistics. Google searches had predicted flu outbreaks successfully for several consecutive seasons, but then, in one season, the predictions failed. Why was that? There could be numerous reasons, as the relationship between the searches and the spread of influenza is correlative, not causal.

According to Harford, this is a good example of how examination devoid of theory may be misleading. The purpose of science also includes understanding reasons and producing explanations.
Findings therefore cannot be verified on the basis of data alone; a wider understanding of the surrounding reality is also needed.

According to a report on Big Data (link in Finnish) by the Ministry of Transport and Communications, digital space will grow by 40% a year, increasing tenfold by the year 2020. As described above, from the viewpoint of scientific research, the growth in the amount of available data does not provide a shortcut to a better understanding of the world. No amount of data or technical skill in data mining will suffice if there is a lack of understanding of the shortcomings of the data or an inability to pose relevant questions to it.

Even so, it is obvious that datafication will also provide many new opportunities for the world of research. I will highlight three such opportunities, which may also provide answers to the challenge presented by Anderson.

1. Increasingly extensive research data

The digitisation of data reserves, the automation of data collection and the falling cost of data storage will make it possible to manage increasingly extensive datasets in research. At best, this could improve the quality of empirical research, although one rule of thumb is worth keeping in mind: no matter the scale, data is not valuable in itself. Potentially valuable knowledge is produced through refining, organising and analysing data.

The humanities have also responded to the possibilities of big data by starting to use the methods of data science in collecting, managing and analysing research material. One of the most interesting ongoing Finnish projects is the Citizen Mindscapes research consortium project (in Finnish), which examines data from the Suomi24 discussion forum using the methods of statistics and language technology, as well as visual tools.
The data, which has been made available for research purposes by Aller Oy, includes over 70 million messages written by Finns over a period of more than 15 years. The work has only just begun, but one of the tasks of the project is to lead the way for research in the social sciences by making use of digital text data.

2. More accurate information on human behaviour

Devices connected to the internet, as well as social media, sensor networks and geodata, make it possible to obtain increasingly accurate, versatile and real-time data on actual human behaviour. Automatic data collection is less affected by cognitive biases and delays than methods such as questionnaire surveys, which brings models closer to real-life conditions.

A study by Etla observed that including information from Google search queries in existing models enables more accurate predictions of unemployment in Finland, both at present and in the near future. The Google searches tend to improve prediction accuracy particularly around turning points.

Based on this finding, Etla has developed the ETLAnow forecasting tool in collaboration with 28 European research institutes. The tool predicts the development of unemployment in all EU countries. ETLAnow is the first economic forecasting tool using big data from the internet that has been made available to the general public. The tool uses Google’s search data and official data from Eurostat in its forecasts. In the future, the aim is to extend the tool to other economic phenomena, such as the development of the real estate market.

3. New and surprising findings

Big and increasingly versatile digital datasets make it possible to discover connections that have previously remained hidden. A prime example of this is a Big Data experiment by the Dutch tax authorities.
In the experiment, combining data from different authorities led to the discovery that people whose marriage had recently ended in divorce were considerably more likely to make mistakes in filing their tax returns than the average person. Corrective measures were taken accordingly. While such findings are obviously valuable for the authorities, similar discoveries by researchers can also lead to a more thorough understanding of the way people and society act.

Emphasising the importance of collaboration

According to the research company Gartner, the honeymoon is over for Big Data. It is becoming an established phenomenon, and people already expect it to produce concrete results.

Nevertheless, researcher Sami Holopainen argues that Finns and Europeans have only recognised the opportunities presented by Big Data in the last couple of years. While large information companies have rushed to add Big Data products to their portfolios, university education has been falling behind. Indeed, Holopainen estimates in his article published in the Futura journal that universities fail to perceive Big Data as a particularly significant phenomenon.

Nearly all studies and reports on Big Data highlight the idea that working with big, structured and unstructured datasets requires new kinds of dialogic, interdisciplinary and multi-method approaches. In fact, one of the objectives of the Citizen Mindscapes project is to bridge the gap between linguists and content researchers in different fields on the one hand, and language technologists and data analysts on the other. At best, this sort of interaction can lead to brand new analysis tools and methods, in the humanities and social sciences as well.

There is therefore good reason to establish such experiments more firmly in Finnish higher education institutions as well. A lack of competence and experts is considered one of the most significant bottlenecks limiting the potential of Big Data.
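
A side note on the growth figure cited from the Ministry of Transport and Communications report: 40% annual growth and a tenfold increase are two ways of stating the same compound growth, and the arithmetic linking them is easy to check. The sketch below is only an illustration. It assumes a constant growth rate, and the function and variable names are made up for the example; the report's own starting year and method are not given here.

```python
import math

# Assumption: a constant 40% annual growth rate, as cited from the report.
ANNUAL_GROWTH = 0.40
TARGET_MULTIPLE = 10.0


def growth_multiple(years: float, rate: float = ANNUAL_GROWTH) -> float:
    """Return how many times larger the volume is after `years` of compound growth."""
    return (1.0 + rate) ** years


def years_to_reach(multiple: float, rate: float = ANNUAL_GROWTH) -> float:
    """Solve (1 + rate) ** t = multiple for t."""
    return math.log(multiple) / math.log(1.0 + rate)


years_needed = years_to_reach(TARGET_MULTIPLE)
whole_years = math.ceil(years_needed)

print(f"Years of 40% growth needed for a tenfold increase: {years_needed:.2f}")
print(f"Multiple after {whole_years} full years: {growth_multiple(whole_years):.2f}")
```

Under this assumption a tenfold increase takes a little under seven years, so the two figures are consistent if the report's baseline lies in the early 2010s.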
The Weekly Notes blog deals with current topics in Sitra’s strategy and research teams.
