Google Flu Trends 09 by charlene mcbride, Flickr Creative Commons
Following on from my last post on Big Data, here is a story that beautifully illustrates the strengths and weaknesses of big data.
First, though, I would like to draw attention to, and highly recommend, an article entitled “Big data: are we making a big mistake?” by Tim Harford in the Financial Times magazine. It gives a thorough account of the problems with big data, illustrated with examples such as that of Google and influenza.
The story begins in 2008, when Google beat the Centers for Disease Control and Prevention (CDC) in predicting the spread of influenza (“the flu”) across the United States.
Publishing their results in Nature (February 2009), Google described how they aggregated historical logs of the 50 million most common online search queries between 2003 and 2008.
Google was faster at tracking the flu outbreak because they found a correlation between people’s web searches and whether they had flu symptoms. The CDC took around a week to track the flu, as they had to form the picture by collating data “on the ground” – that is, from individual practices. In contrast, Google’s tracking took only about a day.
So, Google Flu Trends (GFT), working solely on data and algorithms, was quick, accurate and cheap. There was no antecedent theory, no null hypothesis on the correlation between certain search terms and the spread of the disease itself.
Now skip ahead four years: in February 2013, Nature News reported that GFT had overestimated the spread of flu, predicting double the number of episodes reported by the CDC. GFT used big data, whilst the CDC used traditional methods of data collection and analysis – and the CDC was proved right.
So why, after accurately predicting flu patterns over the preceding winters, had GFT suddenly failed with its big data?
The first big problem was that the GFT team did not know what connected the search terms and the actual spread of flu. They were not looking for causation. They were simply looking at correlation, and finding patterns.
Apparently, as discussed in my earlier post, this is common when companies look at big data: it is far cheaper to look for correlation than causation. The latter can be impossible, and perhaps not cost-effective.
So the failure of GFT resulted from not knowing what caused the correlation in the first place, and therefore what might cause it to collapse. For example, flu scares during a previous winter may have triggered flu-related web searches by healthy people.
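To see how a purely correlational model can break, here is a toy sketch with entirely invented numbers (not GFT’s actual data or method): a model fitted only to the historical correlation between search volume and flu cases overshoots badly once search behaviour changes for reasons unrelated to illness.

```python
# Toy illustration (invented numbers): a correlational model trained when
# searches were driven by genuinely ill people, then applied during a flu
# scare when healthy people search too.

# Training period: search volume tracks actual cases.
train_searches = [100, 150, 200, 250, 300]
train_cases    = [ 50,  75, 100, 125, 150]   # cases are about half of searches

# Fit a one-parameter model, cases = k * searches (least squares through origin).
k = sum(s * c for s, c in zip(train_searches, train_cases)) / \
    sum(s * s for s in train_searches)

# New season: a flu scare doubles searches among worried-but-well people,
# yet actual cases stay at the old level.
scare_searches = 400
actual_cases = 100
predicted = k * scare_searches

print(f"k = {k:.2f}, predicted = {predicted:.0f}, actual = {actual_cases}")
# The model overshoots because the reason behind the correlation
# (sick people searching) no longer holds.
```

The model has no notion of *why* searches and cases moved together, so it cannot tell a sick searcher from a scared one.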
“Google Flu Shot Locator” by Search Engine Land, Flickr Creative Commons
“The Parable of Google Flu: Traps in Big Data Analysis” is a paper that discusses the problems encountered by GFT, which are also translatable to other organisations.
Published by David Lazer, Ryan Kennedy, Gary King and Alessandro Vespignani, it explores two main issues that led to GFT’s failure – which they call “big data hubris” and “algorithm dynamics”.
The former refers to the implicit assumption that big data can substitute for, rather than supplement, traditional data collection and analysis; the latter refers to the programming tweaks made by the operators to improve their service (and to changes in the behaviour of that service’s users).
Changes in GFT’s search algorithm and user behaviour (the dynamics) probably affected GFT’s flu tracking programme, leading to their incorrect prediction of flu prevalence.
The common explanation for the error – a (media-fuelled) flu panic the previous year – does not explain why GFT had missed its predictions by wide margins for over two years. Earlier versions of GFT had not succumbed to previous flu scares.
One likely cause was changes made to GFT’s algorithm itself.
Certain search patterns – such as searches for flu treatments and searches for information on how to tell flu from the common cold – appeared to track GFT’s errors.
Another lesson from GFT concerns reproducibility (or replicability, as the authors call it) and transparency, both of which are causes for concern. The authors encountered several difficulties when trying to replicate the original algorithm: the search terms are unclear, and both access to Google’s data and the scope for replicating GFT’s analysis are limited, for example by privacy considerations.
Remember the “multiple-comparisons problem”? If you look for many patterns – or even just any pattern – in a large data set, you are likely to find one. Test enough different correlations and you are bound to get some fluke results.
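The multiple-comparisons problem is easy to demonstrate with simulated data. The sketch below (purely illustrative, with random numbers standing in for search-term histories) generates thousands of series that have nothing to do with a fake “flu” series, and counts how many nonetheless show a strong-looking correlation by pure chance.

```python
import random

random.seed(0)

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# One fake "flu" series over 20 weeks, and many unrelated "search term" series.
weeks = 20
flu = [random.random() for _ in range(weeks)]

n_terms = 5000
flukes = 0
for _ in range(n_terms):
    term = [random.random() for _ in range(weeks)]
    if abs(pearson(flu, term)) > 0.5:   # looks like a strong correlation
        flukes += 1

print(f"{flukes} of {n_terms} random series correlate with 'flu' at |r| > 0.5")
```

Even though every series is pure noise, some fraction will clear the threshold – which is why mining 50 million search terms for correlations demands great care.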
Correlation does not equal causation
The problems discussed above are not limited to GFT. Although valuable, big data cannot yet replace traditional data collection, methods and analysis.
At the end of “The Parable of Google Flu”, the authors suggest an “all data revolution,” where advanced analysis of both traditional “small data” and new big data might provide the clearest picture of the world.
Big Data has become a mainstream commodity in science, technology and business. But it must be handled carefully.
Google Flu will no doubt return, refreshed and upgraded. For now, however, it serves as a lesson on looking at big data and avoiding previous mistakes.
“Google Flu Trends” by KamiPhuc, Flickr Creative Commons