Cartoon Big Data by Thierry Gregorius, Flickr Creative Commons
I remember being asked, back in 2000, about the world’s biggest databases. My mentor told me the story of Walmart: how, from early in its history, it began to collect customer data. By the year 2000, it had the largest database in the world. That data store is still vast (several hundred terabytes), though no longer the biggest, and occupies a site of over one hectare in Missouri, the so-called “Area 71”.
Walmart’s database provides a powerful marketing resource.
Companies could pay Walmart to look at customer data in order to pitch advertising. By looking at what products a particular individual bought, it was possible to get an idea of household income.
From there, a car company for example could decide what model of car might be best suited for a particular person and direct promotions that way.
Big data is a somewhat vague term for the sheer scale of data that now exists, which may be measured in petabytes or exabytes. Paul Bradshaw recently reminded me that around 2-3 exabytes of data are created every day.
Big data describes large and complex collections of data that cannot be processed by traditional data processing tools. Challenges arise at every stage, from collection through to interpretation. Three characteristics are commonly used to describe it: volume, velocity and variety.
Volume simply refers to the increasing amount of data, which has many causes. The data ranges from “unstructured data” on social media to quantitative data collected by machine sensors.
Velocity is the speed of data coming in and going out. Dealing with the rapid influx of data quickly enough is a challenge for many organisations.
Variety refers to the full range of data types and sources. Formats may be structured or unstructured, both of which place demands on the analysts.
Big data sets are often beyond the capability of normal software tools to manage within an acceptable, or tolerable, timeframe. Add to this the difficulties of variability and complexity and it is easy to lose control.
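As a rough illustration of why the timeframe becomes the limiting factor, the sketch below estimates how long a single machine would need just to read a data set once. The 200 MB/s read speed is an assumed figure for one disk, not a measurement, and the function name is hypothetical.

```python
# Back-of-envelope estimate: time to read a data set once, end to end.
# Decimal units are used throughout (1 MB = 10**6 bytes, 1 PB = 10**15 bytes).

MEGABYTE = 10**6
PETABYTE = 10**15
EXABYTE = 10**18
SECONDS_PER_DAY = 24 * 60 * 60


def scan_days(size_bytes, throughput_bytes_per_second):
    """Days needed to read size_bytes once at a constant throughput."""
    return size_bytes / throughput_bytes_per_second / SECONDS_PER_DAY


# Assumption: one disk sustaining 200 MB/s of sequential reads.
assumed_throughput = 200 * MEGABYTE

print(f"1 petabyte: {scan_days(PETABYTE, assumed_throughput):.1f} days")
print(f"1 exabyte:  {scan_days(EXABYTE, assumed_throughput) / 365:.0f} years")
```

Under that assumption a single pass over one petabyte takes roughly two months, before any analysis has been done, which is why such data sets are usually spread across many machines.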
Sense of Statistics
Making sense of big data requires the use of complex statistical techniques. Large data sets can be untidy, and are potentially full of biases (especially when just looking for correlations). Using big data – to figure out exactly what is going on and how to effect worthwhile change in a system – requires advanced statistics.
Big data causes a problem because there are many more possible combinations of (“linked”) data points that can be compared – and so there is a high chance of finding an association.
This is the “multiple-comparisons problem”: if you’re looking for many, or even just any patterns in a large data set, it’s likely that you’ll find one. Test enough different correlations and you’re bound to get some fluke results.
Here, as ever, we must find out whether the pattern is “statistically significant” (unlikely to have arisen by chance alone), or whether it is just a fluke.
The aforementioned difficulties of managing big data with standard desktop software (for statistics and visualisation) mean that more advanced database systems are required. Terms such as inductive statistics and nonlinear system identification are essential jargon. These concepts allow the user to identify true relationships within the data as well as predict outcomes.
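The multiple-comparisons problem can be sketched with a small simulation. The code below is an illustration under stated assumptions, not a recommended analysis pipeline: it fills 40 columns with pure random noise, tests every pair for correlation, and counts how many pairs pass the usual 5% threshold. The p-values use the Fisher z-transform approximation, and the function names, column count and sample size are all hypothetical choices. A Bonferroni correction (dividing the threshold by the number of tests) is shown as one simple, conservative remedy.

```python
import itertools
import math
import random
from statistics import NormalDist


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def approx_p_value(r, n):
    """Two-sided p-value for r via the Fisher z-transform.

    If there is no real correlation, atanh(r) * sqrt(n - 3) is
    approximately standard normal.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def count_fluke_correlations(n_vars=40, n_obs=200, alpha=0.05, seed=1):
    """Test every pair of pure-noise columns and count the 'significant' ones."""
    rng = random.Random(seed)
    # Every column is independent noise, so any correlation found is a fluke.
    columns = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    p_values = [
        approx_p_value(pearson_r(a, b), n_obs)
        for a, b in itertools.combinations(columns, 2)
    ]
    n_tests = len(p_values)
    naive_hits = sum(p < alpha for p in p_values)
    # Bonferroni: require p < alpha / n_tests to allow for the number of tests.
    bonferroni_hits = sum(p < alpha / n_tests for p in p_values)
    return n_tests, naive_hits, bonferroni_hits


n_tests, naive_hits, bonferroni_hits = count_fluke_correlations()
print(f"{n_tests} pairs tested")
print(f"{naive_hits} look significant at the 5% level")
print(f"{bonferroni_hits} survive a Bonferroni correction")
```

With 40 columns there are 780 pairs, so around 5% of them (roughly 39) would be expected to pass the uncorrected threshold even though no real relationship exists anywhere in the data.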
David Spiegelhalter, statistics professor at Cambridge University, who has also talked at City University, gives a useful lecture on the trickiness of numbers, number hygiene and statistical significance. The new statistical techniques for big data will work by building on old methods.
With respect to big data he says, “There are a lot of small data problems that occur in big data. They don’t disappear because you’ve got lots of the stuff. They get worse.”
Big Data by Walt Stoneburner, Flickr Creative Commons
Big data enthusiasts make four claims, which Spiegelhalter warns could be “complete bollocks”. Certainly, they are over-simplistic and are listed below – together with their flaws.
- Analysis of big data can yield accurate results.
– ignoring biases, statistical methods and causation means that we can overrate accuracy.
- Recording (almost) every individual data point renders former statistical sampling techniques obsolete.
– see Spiegelhalter’s comments above. The newer statistical methods are derived from former ones.
- Correlation gives us the necessary picture – causation is an outdated secondary issue.
– again, bias matters. To downgrade the importance of causation is only permissible in a stable environment. Making predictions without knowing about bias and causation does not work in a changing world.
- Statistical models are unnecessary because with big data “the numbers speak for themselves”.
– the numbers cannot reliably speak for themselves: random patterns and correlations exist in big data, and these can outnumber the true findings.
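The point about bias and sampling in the list above can be illustrated with a short simulation. This is a minimal sketch on made-up data: the population, the capture probabilities (80% for above-average values, 40% for the rest) and the function name are all assumptions chosen for illustration. It compares a very large data set gathered with selection bias against a small random sample drawn from the same population.

```python
import random


def compare_biased_big_vs_small_random(pop_size=200_000, small_n=1_000, seed=7):
    """Return (size of big biased data set, its error, small random sample's error)."""
    rng = random.Random(seed)
    # Hypothetical population: some measurement with mean 50 and spread 10.
    population = [rng.gauss(50, 10) for _ in range(pop_size)]
    true_mean = sum(population) / pop_size

    # "Big" data set: a record is twice as likely to be captured when its
    # value is above 50. This is selection bias, and more data does not fix it.
    big_biased = [
        x for x in population
        if rng.random() < (0.8 if x > 50 else 0.4)
    ]

    # Small data set: a simple random sample, as in traditional statistics.
    small_random = rng.sample(population, small_n)

    big_error = abs(sum(big_biased) / len(big_biased) - true_mean)
    small_error = abs(sum(small_random) / small_n - true_mean)
    return len(big_biased), big_error, small_error


big_n, big_error, small_error = compare_biased_big_vs_small_random()
print(f"biased data set: {big_n} records, error in mean = {big_error:.2f}")
print(f"random sample:   1000 records, error in mean = {small_error:.2f}")
```

In this set-up the biased collection holds over a hundred times more records, yet its estimate of the average is further from the truth than the estimate from the small random sample.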
To conclude, companies are interested in big data sets because they are relatively cheap to collect for their size and they can be readily updated. A multitude of individual data points can also be used for many different purposes. All this helps in the marketplace.
Big data made a big splash. But to fulfil its potential it must be married with statistical insight. Only then will we appreciate the full impact of big data.