Category Archives: Interpretation

The story of Nature, Google and the flu

Google Flu Trends 09 by charlene mcbride, Flickr Creative Commons

Following on from my last post on Big Data, here is a story that beautifully illustrates its strengths and weaknesses.

First, however, I would like to draw attention to and highly recommend an article entitled “Big data: are we making a big mistake?” by Tim Harford in the Financial Times magazine. It provides a thorough account of the problems with big data, illustrated with examples such as that of Google and influenza.

The story begins in 2008, when Google beat the Centers for Disease Control and Prevention (CDC) in predicting the spread of influenza (“the flu”) across the United States.

Publishing its results in Nature (February 2009), Google described how it had aggregated historical logs of the 50 million most common online search queries made between 2003 and 2008.

Google was faster at tracking the flu outbreak because it had found a correlation between people’s web searches and whether they had flu symptoms. The CDC took around a week to track the flu, as it had to build the picture by collating data “on the ground” – that is, from individual practices. In contrast, Google’s tracking took only about a day.

So, Google Flu Trends (GFT), working solely on data and algorithms, was quick, accurate and cheap. There was no antecedent theory, no null hypothesis on the correlation between certain search terms and the spread of the disease itself.

Now, skip ahead four years: in February 2013, Nature News reported that GFT had over-estimated the spread of flu. It had predicted double the number of episodes compared with the CDC. GFT used big data, whilst the CDC used traditional methods of data collection and analysis and were proved right.

So why, after accurately predicting flu patterns over the preceding winters, had GFT suddenly failed with its big data?

The first big problem was that the GFT team did not know what connected the search terms and the actual spread of flu. They were not looking for causation. They were simply looking at correlation, and finding patterns.

Apparently, as discussed in my earlier post, this is common when companies look at big data: it is far cheaper to look for correlation than causation. The latter can be impossible, and perhaps not cost-effective.

So GFT failed because its creators did not know the reason for the correlation, and therefore could not know what might cause that correlation to collapse. For example, flu scares in the previous winter may have triggered web searches by perfectly healthy people.

Google Flu Shot Locator by Search Engine Land, Flickr Creative Commons

“The Parable of Google Flu: Traps in Big Data Analysis” is a paper that discusses the problems encountered by GFT – problems that also translate to other organisations.

Written by David Lazer, Ryan Kennedy, Gary King and Alessandro Vespignani, and published in Science (March 2014), it explores two main issues that led to GFT’s failure – which the authors call “big data hubris” and “algorithm dynamics”.

The former is the (often implicit) assumption that big data can substitute for, rather than supplement, traditional data collection and analysis; the latter refers to the changes made by engineers to improve the service, and by users in the way they use it.

Changes in Google’s search algorithm and in user behaviour (the dynamics) probably affected GFT’s flu-tracking programme, leading to its incorrect predictions of flu prevalence.

The common explanation for the error – media-fuelled flu panic the previous year – does not explain why GFT had been missing its predictions by wide margins for over two years. Earlier versions of GFT did not succumb to previous flu scares.

One likely cause was change made to Google’s own search algorithm.

Certain modifications – such as the recommending of searches for flu treatments, and for information on how to tell flu from the common cold – appear to track GFT’s errors.

Another learning point from GFT concerns reproducibility (or replicability, as the authors call it) and transparency, both of which are causes for concern. Several difficulties were encountered in trying to replicate the original algorithm: the original search terms were never disclosed, and both access to Google’s data and the possibility of replicating GFT’s analysis are limited, for example by privacy considerations.

Remember the “multiple-comparisons problem”? If you’re looking for many, or even just any patterns in a large data set, it’s likely that you’ll find one. Test enough different correlations and you’re bound to get some fluke results.
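
To make that concrete, here is a minimal Python sketch – with entirely made-up numbers – showing how, if you test enough random “search terms” against a random “flu” series, the best of them will look impressively correlated by chance alone.

```python
import numpy as np

rng = np.random.default_rng(0)

n_weeks = 104        # two years of weekly data (hypothetical)
n_terms = 50_000     # hypothetical pool of candidate search terms

flu_cases = rng.normal(size=n_weeks)                   # random "flu" series
search_volumes = rng.normal(size=(n_terms, n_weeks))   # random "search" series

# Pearson correlation of every term with the flu series, computed in one go
flu_z = (flu_cases - flu_cases.mean()) / flu_cases.std()
terms_z = (search_volumes - search_volumes.mean(axis=1, keepdims=True)) \
          / search_volumes.std(axis=1, keepdims=True)
correlations = terms_z @ flu_z / n_weeks

print(f"Strongest correlation among {n_terms} random terms: "
      f"{np.abs(correlations).max():.2f}")
# Typically around 0.4, even though no real relationship exists at all.
```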

Correlation does not equal causation

The problems discussed above are not limited to GFT. Although valuable, big data cannot yet replace traditional data collection, methods and analysis.

At the end of “The Parable of Google Flu”, the authors suggest an “all data revolution,” where advanced analysis of both traditional “small data” and new big data might provide the clearest picture of the world.

Big Data has become a mainstream commodity in science, technology and business. But it must be handled carefully.

Google Flu will no doubt return, refreshed and upgraded. For now, however, it serves as a lesson on looking at big data and avoiding previous mistakes.

‘Google Flu Trends’ by KamiPhuc, Flickr Creative Commons


Ambulance waiting times – not the best data story?

I’m using this article on ambulance waiting times from the BBC to illustrate some of the frustrations encountered with certain data stories. There is also a lesson to learn I hope.

Picture: lydia_shiningbrightly, Flickr Creative Commons

This is a short critical analysis.  I like the BBC, but they are evidently susceptible to making news out of what many may consider to be non-stories.

Let’s start with the title:

Wales’ ambulance transfer times worst in UK

Now, regarding waiting or transfer times, longer probably equals worse. So this is essentially true.

However, does the longest wait (at six hours 22 minutes) equate to the worst service?  What if all their other times were less than 30 minutes?  They weren’t but anyway…

The fact is, these numbers tell us nothing about the distribution of the data: they give no idea of proportion.

Normal distribution, or Gaussian distribution as it is sometimes called, is an important concept in mathematical probability and statistics.

The classic symmetrical bell-shaped curve, shown below, is used to describe “normally” distributed data. When dealing with continuous data, the area under the curve between two limits gives the proportion of observations expected to fall between them.

In many areas of scientific study, physical measurements often follow a normal distribution. This is very useful because the Gaussian distribution is frequently assumed for random variables whose actual distribution is unknown.

Furthermore, analysis of results becomes more straightforward when the relevant variables are normally distributed.

The normal “Gaussian” distribution. Picture by Namal Perera
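
As a quick illustration of that “area between two limits” idea, the short Python sketch below draws samples from a normal distribution (with made-up figures) and checks what fraction of observations falls within one, two and three standard deviations of the mean – the familiar 68-95-99.7 rule.

```python
import numpy as np

rng = np.random.default_rng(42)
mean, sd = 20.0, 5.0                       # hypothetical figures, e.g. minutes
observations = rng.normal(mean, sd, size=100_000)

for k in (1, 2, 3):
    within = np.mean(np.abs(observations - mean) <= k * sd)
    print(f"Within {k} standard deviation(s) of the mean: {within:.1%}")
# Expect roughly 68.3%, 95.4% and 99.7%
```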

Applying the bell-curve principle to the story – is it really news if you take one extreme of the curve and make it seem like the norm? Excuse the minor pun.

It’s a point I’ve raised again and again. As journalists, it’s fine to report on the extremes. And often fun – consider the world’s largest hotdog. However, as a serious data journalist, it’s worth taking the time to consider how relevant your data point is in the context of the whole distribution story.

And just btw, the lesson still stands – actually it’s worse – if we consider asymmetrically distributed data.

Positively skewed data. Picture by Namal Perera

Here, the end of the curve is even less representative of the data distribution.
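
To see how unrepresentative the extreme can be, here is a minimal Python sketch using hypothetical, positively skewed “hand-over times”: the median describes the typical patient, while the longest wait – the headline number – sits far out in the tail.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical hand-over times in minutes, drawn from a right-skewed
# (log-normal) distribution so that a few very long waits drag the tail out.
waits = rng.lognormal(mean=np.log(20), sigma=0.6, size=10_000)

print(f"Median wait:  {np.median(waits):6.1f} min")  # the 'typical' patient
print(f"Mean wait:    {waits.mean():6.1f} min")      # pulled up by the tail
print(f"Longest wait: {waits.max():6.1f} min")       # the headline number
```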

In fairness to the reporter(s), not named on the webpage, they do give some points of reference. For example, they state “no service saw its longest wait dip under an hour, with many around the two-hour mark”.

And for comparison, the article mentions an ambulance service in the east of England whose longest single wait was five hours 51 minutes.

The report isn’t likely to be the full story (it rarely is from what I’ve learned during the Science Journalism MA). However, if you ask for the “longest waits” you will probably only be given the longest waits.

Yes, it is unacceptable for patients to wait around for lengthy hand-overs, but if it doesn’t lead to harm and the majority of people are treated promptly, how big an issue is this?

The article does quote a Welsh government spokesperson saying, “most people were waiting for an average of 20 minutes.”

One A&E consultant I spoke to said, “It’s no surprise. You see this a lot. It’s an easy target highlighting the weak areas. It’s a shame they don’t mention how quickly the sickest [patients] were transferred.”

So the next time you read a story about the longest, shortest, fastest, or healthiest please consider what data you’re looking at, and the context in which it is presented. Consider whether you have the full picture before forming your opinion.

Describing Data

Drowning by numbers. Picture by Jorge Franganillo, Flickr Creative Commons

Data is basically information – a set of quantitative or qualitative values.

As I said in my introduction, the term is used as a mass noun i.e. “the data shows…” (although “the data show…” is also correct).

An individual data point or value represents a piece of information.

Data is usually collected by measurement and visualised by images such as charts or graphs.

Raw data refers to unprocessed information in the form in which it was originally collected. This can be from scientific experiment (based on observation under laboratory conditions) or simply from the field.

However it is collected and in whatever form, it is first necessary to recognise exactly what type of data you are dealing with. The diagram below should give the reader a general idea of the different data types. It is just one way to look at data, and I hope it is clear.

Data types

 

Initially you can make two broad distinctions: whether the data is continuous or discrete.

Continuous data is always quantitative, or numerical, and is measured on an interval or ratio scale. It can take any value within its range – not just whole numbers (1, 2, 3… etc.) but all the decimal values in between.

Discrete data is also called categorical data, as it refers to data arranged in categories. Categorical data can be ordered or ranked, such as first, second, third, or mild, moderate, severe – this is therefore ordinal data.

Alternatively, categorical data may (and often does) have no ranking, such as the colours of cars. This is nominal data. Neither nominal nor ordinal data has a numerical value, which is why they are usually analysed with non-parametric methods. This matters when it comes to statistics, the subject that gives data its meaning and value.
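
For readers comfortable with code, the short Python sketch below (using pandas, with entirely invented values) shows one way these types can be represented in practice: continuous measurements as numbers, ordinal categories with an explicit order, and nominal categories with none.

```python
import pandas as pd

df = pd.DataFrame({
    # Continuous (numerical): any value in a range, decimals included
    "temperature_c": [36.6, 37.2, 38.9, 39.4],
    # Discrete / categorical, ordinal: categories with a natural order
    "severity": pd.Categorical(
        ["mild", "moderate", "severe", "moderate"],
        categories=["mild", "moderate", "severe"],
        ordered=True,
    ),
    # Discrete / categorical, nominal: categories with no order
    "car_colour": pd.Categorical(["red", "blue", "blue", "green"]),
})

print(df.dtypes)                        # float64, category, category
print(df["severity"].min())             # ordering is meaningful: 'mild'
print(df["car_colour"].value_counts())  # counting is about all nominal data allows
```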

There you have it – all the types of data. Although journalists probably don’t need to get bogged down in the details, it will always be handy to recognise exactly what you’re dealing with. This is especially true for science journalists I feel.

In case I don’t manage an easy-to-understand statistics post, it is worth me mentioning how we can handle data (as journalists or scientists). There are three broad stages…

  1. Collection: may be from surveys or (scientific) studies; for journalists, collection is usually from a source.
  2. Presentation: usually in graphic form, with measurement of certain markers, e.g. maximums, minimums, averages.
  3. Interpretation: using statistics is a major part of analysing results, although journalists perhaps rightly look to the expert discussion of the results too. (A minimal sketch of all three stages follows below.)
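
As a rough sketch of those three stages in code, the Python snippet below assumes a hypothetical CSV file (“waits.csv”) with a single column of waiting times in minutes; the file name and column name are invented purely for illustration.

```python
import csv
import statistics

# 1. Collection: read the raw values from the source
with open("waits.csv", newline="") as f:
    waits = [float(row["wait_minutes"]) for row in csv.DictReader(f)]

# 2. Presentation: summarise the markers you might plot or report
summary = {
    "n": len(waits),
    "minimum": min(waits),
    "maximum": max(waits),
    "mean": statistics.mean(waits),
    "median": statistics.median(waits),
}
print(summary)

# 3. Interpretation: a deliberately crude check on how typical the extreme is
if summary["maximum"] > 3 * summary["median"]:
    print("The longest wait is far from typical - report it with context.")
```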

My aim is to understand the process better. I hope yours is too.

Introduction: one doc’s issues with data

I thought I might struggle with data journalism. And not just because of time constraints, or my own IT limitations. I guess it’s because I didn’t know what to expect. Having learned the fundamentals of using data and gained a reasonable grasp of basic stats (during research at school, university and work), I enter the field of data journalism with an open mind but a little scepticism.

Data Recovery. Picture by Sean MacEntee, Flickr Creative Commons

I’ve often been irritated by stories in the press that don’t do justice to the actual figures. “Data” seems to be thrown around in the media to add authority, a buzzword to proclaim truth. I believe that the majority of journalists, like scientists, do a decent job when it comes to the numbers behind the stories. However, from time to time, there appears to be a woeful lack of insight in the interpretation of those numbers.

Ben Goldacre’s Bad Science column / blog / book and Paul Bradshaw’s Online Journalism Blog are excellent resources for de-bunking scientific or medical myths. They highlight how journalists, politicians and scientists can mislead the public at almost every stage of research – from the methodology, to the interpretation of results.

So, where to start? Taking the advice of my journalism tutor and Dr Goldacre, I just started writing – if only to vent some of my frustrations. These mainly stem from several data stories that appear (to me, on closer reading) incomplete, misrepresented or overestimated in value.

Even the word itself – data, singular datum – causes contention. Let’s be clear: as a mass noun to signify information, it is perfectly acceptable to use data in the singular, although the (more pedantic?) academic types often prefer to acknowledge the Latin roots of the word and would say “these data show” as each piece of information is a datum.

I want people to see things as they truly are, through the objectivity that data offers. This requires both reliable sources and recording of data, and accurate interpretation. Data journalists will have their own styles and opinions, but robust data analysis should yield clear and consistent meanings.

So, here are a few things I’d like to cover: sources of data, presentation and basic analysis. The latter will not involve much in the way of statistics. I also aim to critique some data stories as well as try out software and online tools for my own data stories.

Lastly, for my blog-posts I’d like to invoke my own extension of the “KISS” acronym – now “KISSASS” =

keep it short, sweet, and simple, stupid

Short: I see that around 500 words is recommended for blog-posts, although I can’t say with any conviction what the ideal word count is. “Sweet” really means selective and stimulating – one post for one (interesting) idea. And simple: where possible, it should be understandable to almost everyone.

I hope it works out and I’m keen to hear your comments.

Picture by bixentro, Flickr Creative Commons