The story of Nature, Google and the flu

Google Flu Trends 09 by charlene mcbride, Flickr Creative Commons 

Following on from my last post on Big Data, this story illustrates beautifully the strengths and weaknesses of big data.

First, however, I would like to draw attention to, and highly recommend, an article entitled “Big data: are we making a big mistake?” by Tim Harford in the Financial Times magazine. It provides a thorough account of the problems with big data, illustrated with examples such as that of Google and influenza.

The story begins in 2008, when Google beat the Centers for Disease Control and Prevention (CDC) in predicting the spread of influenza (“the flu”) across the United States.

Publishing their results in Nature (February, 2009), Google described how they aggregated historical logs of the (50 million) most common online search queries between 2003 and 2008.

Google was faster at tracking the flu outbreak because they found a correlation between people’s web searches and whether they had flu symptoms. The CDC took around a week to track the flu as they had to form the picture by collating data “on the ground” – that is from individual practices. In contrast, Google’s tracking took only about a day.

So, Google Flu Trends (GFT), working solely on data and algorithms, was quick, accurate and cheap. There was no antecedent theory, no null hypothesis on the correlation between certain search terms and the spread of the disease itself.

Now, skip ahead four years: in February 2013, Nature News reported that GFT had over-estimated the spread of flu. It had predicted double the number of episodes compared with the CDC. GFT used big data, whilst the CDC used traditional methods of data collection and analysis and were proved right.

So why, after accurately predicting flu patterns over the preceding winters, had GFT suddenly failed with its big data?

The first big problem was that the GFT team did not know what connected the search terms and the actual spread of flu. They were not looking for causation. They were simply looking at correlation, and finding patterns.

Apparently, as discussed in my earlier post, this is common when companies look at big data: it is far cheaper to look for correlation than causation. The latter can be impossible, and perhaps not cost-effective.

So GFT failed because its team did not know why the correlation existed, and therefore could not know what might cause it to collapse. For example, flu scares in the previous winter may have triggered web searches by healthy people.

Google Flu Shot Locator by Search Engine Land, Flickr Creative Commons

“The Parable of Google Flu: Traps in Big Data Analysis” is a paper that discusses the problems encountered by GFT, which also apply to other organisations.

Published by Harvard authors David Lazer, Ryan Kennedy, Gary King and Alessandro Vespignani, it explores two main issues that led to GFT’s failure – which they call “big data hubris” and “algorithm dynamics”.

The former is the often implicit assumption that big data can substitute for, rather than supplement, traditional data collection and analysis; the latter refers to the changes made by engineers to improve the service, and by users in how they use it.

Changes in Google’s search algorithm and in user behaviour (the dynamics) probably affected GFT’s flu tracking, leading to its incorrect estimates of flu prevalence.

The common explanation for the error – a media-fuelled flu panic the previous year – does not explain why GFT had missed predictions by wide margins for over two years. Earlier versions of GFT did not succumb to previous flu scares.

One likely cause was changes made to Google’s search algorithm itself.

Certain search patterns – such as searches for flu treatments and for information on differentiating flu from the common cold – appeared to track GFT’s errors.

Another learning point from GFT concerns reproducibility (or replicability, as the authors call it) and transparency, both of which are causes for concern. Several difficulties were encountered when trying to replicate the original algorithm. The search terms used are unclear, and both access to Google’s data and the possibility of replicating GFT’s analysis are limited, for example by privacy concerns.

Remember the “multiple-comparisons problem”? If you’re looking for many, or even just any patterns in a large data set, it’s likely that you’ll find one. Test enough different correlations and you’re bound to get some fluke results.
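To make the fluke effect concrete, here is a minimal Python sketch. It is purely illustrative and has nothing to do with Google's actual method: the function names (`pearson`, `best_fluke`), the 20 data points and the seed are all my own assumptions. It compares one random "target" series against a growing number of equally random "candidate" series and reports the strongest correlation found by chance alone.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def best_fluke(n_candidates, n_points=20, seed=1):
    """Strongest |correlation| between a random target and n_candidates random series.

    Every series is pure noise, so any correlation found is a fluke.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        best = max(best, abs(pearson(candidate, target)))
    return best

# The more candidates we test, the stronger the best chance correlation looks.
for n in (1, 10, 100, 1000):
    print(n, round(best_fluke(n), 2))
```

Because the seed is fixed, each larger run re-tests the same candidates as the smaller one and then adds more, so the best correlation can only stay the same or grow as the number of comparisons increases.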

Correlation does not equal causation

The problems discussed above are not limited to GFT. Although valuable, big data cannot yet replace traditional data collection, methods and analysis.

At the end of “The Parable of Google Flu”, the authors suggest an “all data revolution,” where advanced analysis of both traditional “small data” and new big data might provide the clearest picture of the world.

Big Data has become a mainstream commodity in science, technology and business. But it must be handled carefully.

Google Flu will no doubt return, refreshed and upgraded. For now, however, it serves as a lesson on looking at big data and avoiding previous mistakes.

Google Flu Trends by KamiPhuc, Flickr Creative Commons


Buzzfeed’s infographic review of Beautiful Science

The British Library is running an exhibit entitled “Beautiful Science: Picturing Data, Inspiring Insight” from 20th February to 26th May 2014.

Picture: The British Library and St Pancras by Jim Linwood, Flickr Creative Commons

The exhibition explores how scientific stories are told by turning numbers into pictures – the story of infographics.

This historical review of infographics reveals how scientific understanding has developed together with people’s capacity to represent data in pictures and graphs. Buzzfeed have paid tribute to Beautiful Science with a look at “9 Glorious Infographics Through History”, which is both an appropriate and considerate choice. The display features a variety of designs spanning almost four centuries.

Perhaps unsurprisingly, my favourite infographics were John Graunt’s “Bills of Mortality” from 1662 and Florence Nightingale’s “Rose Diagram” from 1854.

Picture: John Graunt’s Bills of Mortality (1662). From the British Library

Graunt’s table is one of the earliest publications of public health data. It was collated from early death notifications gathered by parish clerks in London at the turn of the 17th century, in an attempt to monitor deaths from plague.

Among the more interesting points were the three to four people per year who died from lethargy; the eight who died by “Wolf” between 1633 and 1636 (why none before or after – was there a cull of man-eaters?); and most sadly what appears to be 279 folk who died from grief over those 15-20 years.

John Graunt was a haberdasher by trade, although he is now considered to be one of the first epidemiologists.

The Rose Diagram by Florence Nightingale (below) also stands out as a fine public health infographic.

Nightingale is famous for looking after thousands of soldiers during the Crimean War (1853-6). But the Lady of the Lamp was also a splendid epidemiologist, who harnessed the power of the infographic and statistics to initiate change.

Picture: Florence Nightingale’s Rose Diagram (1854). From the British Library

Iconic may be too strong a word to describe Nightingale’s “rose diagram” but I think it is appropriate – this nineteenth century pie-chart is indeed a visual icon.

It shows seasonal variation in the cause of mortality of soldiers in the military field hospital.

At the end of the war, Nightingale wrote a report including this infographic, which carried a stark message: hospitals can kill. The majority of soldiers died from preventable diseases (in blue) rather than from battle wounds (in red).

The Rose Diagram was designed to show that improving sanitation in hospitals could save lives. It ultimately led to cleaner hospitals, where more lives were saved.

The Beautiful Science exhibit runs from 20th February to 26th May 2014 with free admission to the Folio Gallery.

Describing Data

Drowning by numbers.  Picture by Jorge Franganillo, Flickr Creative Commons

Data is basically information – a set of quantitative or qualitative values.

As I said in my introduction, the term is used as a mass noun i.e. “the data shows…” (although “the data show…” is also correct).

An individual data point or value represents a piece of information.

Data is usually collected by measurement and visualised by images such as charts or graphs.

Raw data refers to unprocessed information in the form in which it was originally collected. This can be from scientific experiment (based on observation under laboratory conditions) or simply from the field.

However it is collected and in whatever form, it is first necessary to recognise exactly what type of data you are dealing with. The diagram below should give the reader a general idea of the different data types. It is just one way to look at data, and I hope it is clear.

Data types


Initially you can make two broad distinctions: whether the data is continuous or discrete.

Continuous data is always quantitative or numerical. It has a numerical value that may be an integer or lie on a ratio or interval scale. This means it can be a whole number (1, 2, 3, etc.) or any value in between, decimals included.

In this scheme, discrete data is also called categorical data, as it refers to data arranged in categories. Categorical data can be ordered or ranked, such as first, second, third, etc. or mild, moderate, severe; this is ordinal data.

Alternatively, categorical data may (often) be unranked, such as the colours of cars. This is nominal data. Neither nominal nor ordinal data has a true numerical value, so both are analysed with non-parametric methods. This is important when it comes to statistics, the subject that gives the data meaning and value.
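As a rough illustration of why the type matters, the Python sketch below picks a different summary for each kind of data. All of the values and variable names are invented for this example; it is one possible way to show the distinction, not a standard recipe.

```python
from statistics import mean, median_low, mode

# Hypothetical example values, invented purely for illustration.
temperatures_c = [36.6, 37.2, 38.1, 39.0, 37.5]            # continuous
severity = ["mild", "severe", "moderate", "mild", "mild"]  # ordinal (ranked categories)
car_colours = ["red", "blue", "red", "white", "red"]       # nominal (unranked categories)

# Continuous data supports arithmetic, so a mean is meaningful.
mean_temp = mean(temperatures_c)

# Ordinal data has an order but no true numerical value:
# convert each category to its rank, then take the middle rank.
severity_order = ["mild", "moderate", "severe"]
ranks = [severity_order.index(s) for s in severity]
median_severity = severity_order[median_low(ranks)]

# Nominal data has no order at all, so the most common
# category (the mode) is the natural summary.
most_common_colour = mode(car_colours)

print(round(mean_temp, 2), median_severity, most_common_colour)
```

Averaging the colours of cars would be meaningless, and averaging "mild" and "severe" is dubious at best, which is why recognising the data type comes before any analysis.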

There you have it – all the types of data. Although journalists probably don’t need to get bogged down in the details, it will always be handy to recognise exactly what you’re dealing with. This is especially true for science journalists I feel.

In case I don’t manage an easy-to-understand statistics post, it is worth me mentioning how we can handle data (as journalists or scientists). There are three broad stages…

  1. Collection: may be from surveys or (scientific) studies; for journalists, collection is usually from a source.
  2. Presentation: usually in graphic format, with measurement of certain markers, e.g. maximums, minimums, averages.
  3. Interpretation: using statistics is a major part of results analysis, although journalists perhaps rightly look to the expert discussion of the results too.
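The "markers" mentioned under presentation can be computed in a few lines. The following Python sketch uses only the standard library; the `summarise` function and the weekly figures are hypothetical, made up for illustration.

```python
from statistics import mean, median

def summarise(values):
    """Return the basic markers (minimum, maximum, two averages) for a list of numbers."""
    return {
        "minimum": min(values),
        "maximum": max(values),
        "mean": mean(values),
        "median": median(values),
    }

# Hypothetical weekly case counts, invented for illustration only.
weekly_cases = [12, 15, 11, 90, 14]
summary = summarise(weekly_cases)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Note that "average" can mean more than one thing: a single unusual week pulls the mean well above the median here, so it is worth checking which one a story is quoting.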

My aim is to understand the process better. I hope yours is too.

Introduction: one doc’s issues with data

I thought I might struggle with data journalism. And not just because of time constraints, or my own IT limitations. I guess it’s because I didn’t know what to expect. Having learned the fundamentals of using data and gained a reasonable grasp of basic stats (during research at school, university and work), I enter the field of data journalism with an open mind but a little scepticism.

Data Recovery.  Picture by Sean MacEntee, Flickr Creative Commons

I’ve often been irritated by stories in the press that don’t do justice to the actual figures. “Data” seems to be thrown around in the media to add authority, a buzz-word to proclaim truth. I believe that the majority of journalists, like scientists, do a decent job when it comes to the numbers behind the stories. However, from time to time, there appears to be a woeful lack of insight during the interpretation of these numbers.

Ben Goldacre’s Bad Science column / blog / book and Paul Bradshaw’s Online Journalism Blog are excellent resources for de-bunking scientific or medical myths. They highlight how journalists, politicians and scientists can mislead the public at almost every stage of research – from the methodology, to the interpretation of results.

So, where to start? Taking the advice of my journalism tutor and Dr Goldacre, I just started writing – if only to vent some of my frustrations. These mainly stem from several data stories that appear (to me, on closer reading) incomplete, misrepresented or overestimated in value.

Even the word itself – data, singular datum – causes contention. Let’s be clear: as a mass noun to signify information, it is perfectly acceptable to use data in the singular, although the (more pedantic?) academic types often prefer to acknowledge the Latin roots of the word and would say “these data show” as each piece of information is a datum.

I want people to see things as they truly are, through the objectivity that data offers. This requires both reliable sources and recording of data, and accurate interpretation. Data journalists will have their own styles and opinions, but robust data analysis should yield clear and consistent meanings.

So, here are a few things I’d like to cover: sources of data, presentation and basic analysis. The latter will not involve much in the way of statistics. I also aim to critique some data stories as well as try out software and online tools for my own data stories.

Lastly, for my blog-posts I’d like to invoke my own extension of the “KISS” acronym – now “KISSASS” =

keep it short, sweet, and simple, stupid

Short: I see that around 500 words is recommended for blog-posts, although I can’t say with any conviction what the ideal word count is. “Sweet” really means selective and stimulating – one post for one (interesting) idea. And simple: where possible it should be understandable to almost everyone.

I hope it works out and I’m keen to hear your comments.

Picture by bixentro, Flickr Creative Commons