
The story of Nature, Google and the flu


Google Flu Trends 09 by charlene mcbride, Flickr Creative Commons 

Following on from my last post on Big Data, this story illustrates beautifully the strengths and weaknesses of big data.

First, though, I would highly recommend an article entitled “Big data: are we making a big mistake?” by Tim Harford in the Financial Times magazine. It gives a thorough account of the problems with big data, illustrated with examples such as that of Google and influenza.

The story begins in 2008, when Google beat the Centers for Disease Control and Prevention (CDC) in predicting the spread of influenza (“the flu”) across the United States.

Publishing their results in Nature (February, 2009), Google described how they aggregated historical logs of the (50 million) most common online search queries between 2003 and 2008.

Google was faster at tracking the flu outbreak because it had found a correlation between people’s web searches and whether they had flu symptoms. The CDC took around a week to track the flu because it had to build the picture by collating data “on the ground” – that is, from individual practices. In contrast, Google’s tracking took only about a day.

So, Google Flu Trends (GFT), working solely on data and algorithms, was quick, accurate and cheap. There was no antecedent theory, no null hypothesis on the correlation between certain search terms and the spread of the disease itself.

Now, skip ahead four years: in February 2013, Nature News reported that GFT had over-estimated the spread of flu. It had predicted double the number of episodes compared with the CDC. GFT used big data, whilst the CDC used traditional methods of data collection and analysis and were proved right.

So why, after accurately predicting flu patterns over the preceding winters, had GFT suddenly failed with its big data?

The first big problem was that the GFT team did not know what connected the search terms and the actual spread of flu. They were not looking for causation. They were simply looking at correlation, and finding patterns.

Apparently, as discussed in my earlier post, this is common when companies look at big data: it is far cheaper to look for correlation than causation. Establishing causation can be impossible, or at least not cost-effective.

So GFT failed because its team did not know what lay behind the correlation, and therefore could not tell what might cause that correlation to collapse. For example, flu scares in the previous winter may have triggered web searches by healthy people.


Google Flu Shot Locator by search engine land, Flickr Creative Commons

“The Parable of Google Flu: Traps in Big Data Analysis” is a paper that discusses the problems encountered by GFT, which also apply to other organisations.

Published by Harvard authors David Lazer, Ryan Kennedy, Gary King and Alessandro Vespignani, it explores two main issues that led to GFT’s failure – which they call “big data hubris” and “algorithm dynamics”.

The former is the assumption, often implicit, that big data can substitute for traditional data collection and analysis rather than supplement them; the latter refers to the changes made by engineers to improve the service, and by users in how they use that service.

Changes in Google’s search algorithm and in user behaviour (the dynamics) probably affected GFT’s flu tracking programme, leading to its incorrect prediction of flu prevalence.

The common explanation for the error – media-fuelled flu panic the previous year – does not explain why GFT had missed predictions by wide margins for over two years. Earlier versions of GFT did not succumb to previous flu scares.

One likely cause was a change made to Google’s search algorithm itself.

Certain search patterns – such as searches for flu treatments and searches for information on differentiating flu from the common cold – appeared to track GFT’s errors closely.

Another learning point from GFT concerns reproducibility (or replicability, as the authors call it) and transparency, both of which are causes for concern. Several difficulties were encountered when trying to replicate the original algorithm. The search terms used are unclear, and both access to Google’s data and the scope for replicating GFT’s analysis are limited, for example by privacy concerns.

Remember the “multiple-comparisons problem”? If you’re looking for many, or even just any patterns in a large data set, it’s likely that you’ll find one. Test enough different correlations and you’re bound to get some fluke results.
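To see how easily this happens, here is a minimal Python sketch. It is my own illustration, not GFT’s method: the “flu” series and every “search term” are independent random noise, and the function names (`pearson`, `best_noise_predictor`) and all parameter values are made up for the example. It picks the term that best matches the flu series over a training period, then checks how that same term fares on later weeks.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_noise_predictor(n_terms=5000, n_train=30, n_test=30, seed=1):
    """Pick the random 'search term' that best tracks a random 'flu' series.

    Every series here is independent noise (an assumption of this sketch),
    so any correlation found on the training weeks is a fluke by construction.
    Returns (in_sample_r, out_of_sample_r) for the winning term.
    """
    rng = random.Random(seed)
    total = n_train + n_test
    flu = [rng.gauss(0, 1) for _ in range(total)]

    best_r, best_term = 0.0, None
    for _ in range(n_terms):
        term = [rng.gauss(0, 1) for _ in range(total)]
        # Score each term on the training weeks only.
        r = pearson(flu[:n_train], term[:n_train])
        if abs(r) > abs(best_r):
            best_r, best_term = r, term

    # Re-score the winner on weeks it was not selected on.
    out_r = pearson(flu[n_train:], best_term[n_train:])
    return best_r, out_r


if __name__ == "__main__":
    in_r, out_r = best_noise_predictor()
    print(f"best in-sample r: {in_r:.2f}, same term out-of-sample r: {out_r:.2f}")
```

With thousands of candidates, the winning term typically looks impressive on the training weeks and much weaker afterwards, even though nothing connects it to the flu series at all.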

Correlation does not equal causation

The problems discussed above are not limited to GFT. Although valuable, big data cannot yet replace traditional data collection, methods and analysis.

At the end of “The Parable of Google Flu”, the authors suggest an “all data revolution,” where advanced analysis of both traditional “small data” and new big data might provide the clearest picture of the world.

Big Data has become a mainstream commodity in science, technology and business. But it must be handled carefully.

Google Flu will no doubt return, refreshed and upgraded. For now, however, it serves as a lesson on looking at big data and avoiding previous mistakes.


Google Flu Trends by KamiPhuc, Flickr Creative Commons


The power and problems of BIG data


Cartoon Big Data by Thierry Gregorius, Flickr Creative Commons

I remember (back in 2000) being asked about the world’s biggest databases. My mentor told me the story of Walmart: how, from early in its history, it collected customer data. By the year 2000, it had the largest database in the world. That data store is still vast (several hundred terabytes), though no longer the biggest, and occupies a site of over one hectare in Missouri, the so-called “Area 71”.

Walmart’s database provides a powerful marketing resource.

Companies could pay Walmart to look at customer data in order to pitch advertising. By looking at what products a particular individual bought, it was possible to get an idea of household income.

From there, a car company for example could decide what model of car might be best suited for a particular person and direct promotions that way.

Big data is perhaps a somewhat vague term for the sheer scale of data that now exists, which may be measured in petabytes or exabytes – Paul Bradshaw recently reminded me that around 2-3 exabytes of data are created every day.

Big data describes large and complex collections of data that cannot be processed by traditional data processing tools. Challenges arise at every stage, from collection through to interpretation.

In February 2001, an analyst named Doug Laney at the META Group described the three ‘V’s of big data: volume, velocity and variety.

Volume simply refers to the increasing amount of data, the cause of which is multi-factorial. Sources range from “unstructured data” on social media to quantitative data collected by machine sensors.

Velocity is the speed of data coming in and going out. Dealing with the rapid influx of data quickly enough is a challenge for many organisations.

Variety refers to the full range of data types and sources. Formats may be structured or unstructured, both of which place demands on the analysts.

Big data sets are often beyond the capability of normal software tools to manage within an acceptable, or tolerable, timeframe. Add to this the difficulties of variability and complexity and it is easy to lose control.

Sense of Statistics

Making sense of big data requires the use of complex statistical techniques. Large data sets can be untidy, and are potentially full of biases (especially when just looking for correlations). Using big data – to figure out exactly what is going on and how to effect worthwhile change in a system – requires advanced statistics.

Big data causes a problem because there are many more possible combinations of (“linked”) data points that can be compared – and so there is a high chance of finding an association.

This is the “multiple-comparisons problem”: if you’re looking for many, or even just any patterns in a large data set, it’s likely that you’ll find one. Test enough different correlations and you’re bound to get some fluke results.
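A rough calculation shows how quickly the odds stack up. The Python sketch below is my own illustration and rests on two simplifying assumptions: the tests are independent of one another, and there is no real effect anywhere in the data. The function names are made up for the example.

```python
def family_wise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n_tests independent tests,
    each run at significance level alpha, when no real effect exists."""
    return 1.0 - (1.0 - alpha) ** n_tests


def expected_false_positives(alpha, n_tests):
    """Average number of fluke 'findings' under the same assumptions."""
    return alpha * n_tests


if __name__ == "__main__":
    for n in (1, 20, 100, 1000):
        print(
            f"{n:>5} tests: "
            f"P(at least one fluke) = {family_wise_error_rate(0.05, n):.3f}, "
            f"expected flukes = {expected_false_positives(0.05, n):.1f}"
        )
```

At the usual 5% level, 20 independent tests already give roughly a 64% chance of at least one fluke, and 1,000 tests would be expected to throw up about 50 of them.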

We must therefore ask whether a pattern is “statistically significant” (unlikely to have arisen by chance alone), or whether it is simply a fluke.
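One standard (and deliberately conservative) way of guarding against flukes is the Bonferroni correction, which divides the significance level by the number of tests carried out. It is not discussed in the sources above; I include it only as a minimal Python sketch of what “raising the bar” looks like, with made-up p-values and a hypothetical function name.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag which p-values survive a Bonferroni correction.

    Each test is held to the stricter threshold alpha / number_of_tests.
    """
    if not p_values:
        return []
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


if __name__ == "__main__":
    # Illustrative p-values only, not real data.
    p_values = [0.001, 0.02, 0.04, 0.3]
    # Four tests at alpha = 0.05 give a per-test threshold of 0.0125.
    print(bonferroni_significant(p_values))  # [True, False, False, False]
```

Three of these four results would pass an uncorrected 5% test, but only one survives once the number of comparisons is taken into account.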

The aforementioned difficulties of managing big data with standard desktop software (for statistics and visualisation) mean that more advanced database systems are required. Terms such as inductive statistics and nonlinear system identification are essential jargon. These concepts allow the user to identify true relationships within the data as well as predict outcomes.

David Spiegelhalter, statistics professor at Cambridge University, who has also talked at City University, gives a useful lecture on the trickiness of numbers, number hygiene and statistical significance. The new statistical techniques for big data will work by building on old methods.

With respect to big data he says, “There are a lot of small data problems that occur in big data. They don’t disappear because you’ve got lots of the stuff. They get worse.”


Big Data by Walt Stoneburner, Flickr Creative Commons


Big data enthusiasts make four claims, which Spiegelhalter warns could be “complete bollocks”. Certainly, they are over-simplistic and are listed below – together with their flaws.

  1. Analysis of big data can yield accurate results.

– ignoring biases, statistical methods and causation means that we can overrate accuracy.

  2. Recording (almost) every individual data point renders former statistical sampling techniques obsolete.

– see Spiegelhalter’s comments above. The newer statistical methods are derived from former ones.

  3. Correlation gives us the necessary picture – causation is an outdated secondary issue.

– again, bias matters. To downgrade the importance of causation is only permissible in a stable environment. Making predictions without knowing about bias and causation does not work in a changing world.

  4. Statistical models are unnecessary because with big data “the numbers speak for themselves”.

– the numbers cannot reliably speak for themselves: random patterns / correlations exist in big data and these outnumber true (statistically significant) findings.

To conclude, companies are interested in big data sets because they are relatively cheap to collect for their size and they can be readily updated. A multitude of individual data points can also be used for many different purposes. All this helps in the marketplace.

Big data made a big splash. But to fulfil its potential it must be married with statistical insight. Only then will we appreciate its full impact.