Big datasets made researchers chase Newton's dream
Hannu-Pekka Ikäheimo explores the world of culturomics.
Google converted books into data and, in doing so, ended up creating a new field of science – culturomics. What does it really mean, and what can we expect from it?

In 2004, Google launched a megalomaniac project in its typical style: to convert every book ever written into electronic form and make them openly accessible online. For this purpose, Google developed a scanner that automatically turns pages, allowing millions of books to be scanned. Within just a couple of years, the world’s largest digital library had been set up.

However, because the books had been stored online as images, mining the text at the level of words remained impossible, and users had to know in advance what they were searching for. Google therefore decided to convert the books into data with optical character recognition software, which recognises letters, words and sentences in the images. The outcome is the world’s largest library of computerised text, readable and interpretable not only by humans but also by computers.

Currently, Google’s datafied digital library contains over 30 million books. By Google’s own rough estimate, it already covers 15 to 20 per cent of the world’s written heritage.

Researchers and research groups have naturally been overjoyed by a database of such unprecedented proportions. In fact, the project has even given rise to a brand new field of science, culturomics, which aims to understand human behaviour and cultural trends through quantitative analysis. Erez Aiden and Jean-Baptiste Michel, pioneers of culturomics, have used the datafied material to examine the origins and prevalence of words in different periods. One of their main findings was that around half of the words used in English cannot be found in dictionaries.
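The core of this kind of quantitative analysis – tracking how often a word appears relative to everything published in a given year – can be sketched in a few lines of Python. The counts below are invented toy numbers, not real Google Books data, and the function is only an illustration of the idea, not Google’s actual pipeline.

```python
from collections import defaultdict

# Toy (year, word) -> occurrence counts; invented for illustration only.
counts = {
    (1900, "telegraph"): 120, (1900, "telephone"): 80,
    (1950, "telegraph"): 40,  (1950, "telephone"): 300,
    (2000, "telegraph"): 5,   (2000, "telephone"): 500,
}

def frequency_series(word, counts):
    """Relative frequency of `word` per year: its count divided by
    the total count of all words recorded for that year."""
    totals = defaultdict(int)
    for (year, _), n in counts.items():
        totals[year] += n
    return {
        year: counts.get((year, word), 0) / totals[year]
        for year in sorted(totals)
    }

series = frequency_series("telegraph", counts)
# e.g. series[1900] == 120 / 200 == 0.6, falling towards zero by 2000
```

Plotting such a series over time for one or more words is, in essence, what a tool like the Ngram Viewer displays.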
These studies also inspired the Google Ngram Viewer, a search engine that makes it easy to trace the emergence and spread of words – or even the popularity of different people – across periods of time.

Aiden and Michel have called culturomics a completely new kind of “cultural telescope” and modestly compared the opportunities it offers to those opened up by Galileo Galilei, who, they argue, removed the Earth from the centre of the universe with a telescope only 30 times more powerful than the human eye. Researcher Helga Nowotny, however, asserts that Isaac Newton would be a much more natural point of reference for culturomics. After all, one of Newton’s less well-known dreams was to shed light on the origins of civilisations by creating a numerical codex, derived from astrology, to be applied to texts, particularly Biblical ones. According to Nowotny, over the years this pursuit led Newton to compile an extensive bank of material – the big data of his time – which he used in an attempt to understand the history of humankind in mathematical terms; in other words, by quantifying and measuring the output of human culture.

The idea that unconditional, accurate and predictable rules underlie the variation, arbitrariness and apparent chaos of the world, and can be mathematically calculated and explained, was irresistible in the 17th century. Nevertheless, Newton, the discoverer of laws of nature, never witnessed the realisation of his dream of discovering the “laws” underlying human activity.

What, then, can we expect from culturomics? Will the great expectations for this new method meet the same fate as Newton’s dream? At least for now, we have yet to see any major breakthroughs. One concrete step forward has arguably already emerged: the digitised online library has made it easier to find new and interesting sources, while at the same time making plagiarism harder to hide.
It also seems that Google has used this huge mass of text in its efforts to develop automated translation. When it comes to research, however, the work is only in its initial stages.

Even so, unprecedentedly extensive datasets can already help researchers make new discoveries and focus their sights on phenomena that would be easy to overlook with traditional datasets. Applying this methodology to big news data has already produced interesting network analyses of the world’s “natural civilisations”, including the detection of rising tensions in North African nations before the Arab Spring – albeit only after the events. If the capacity of machines to analyse natural language develops as expected, quick summaries of cultural change should also become possible in the near future.

For now, traditional research methods and human analysis continue to be required for a deeper understanding of the data. According to Nowotny, increasingly close collaboration between fields of science is also needed if culturomics is to yield more than peculiar historical details.

The Viikon varrelta (Weekly notes) blog deals with current topics in Sitra’s strategy and research teams.