Violent crime at University

Uni Bath Chen Zhao

University of Bath by Chen Zhao, Flickr Creative Commons

Across the UK last week, on August 14th, thousands of students received their A-Level grades and will now be making plans to start university.

Once the euphoria has settled, it is time to face the practicalities. Many will be moving away from home to new towns or cities. One important issue is security – both on campus and in the new places the students find themselves in.

Violent crime covers a wide range of offences against the person. Fortunately, in about half of all cases the victim suffers no physical injury.

Using CartoDB, I’ve made a map showing violent crime rates across 29 university cities.

The UK data originally comes from the Home Office and the ONS (England and Wales); Justice Analytical Services (Scotland); and the Central Survey Unit (Northern Ireland).

If you wish to see the rate of violent crime in, say, Leeds, simply click on the location (you should know the geography) – and in the info-window you should see the rate of violent crime per 1,000 people.

The size of the bubble relates to how dangerous the city is in terms of violent crime.
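The number behind each bubble is a simple per-1,000 rate. Here is a minimal sketch of the calculation, using made-up city names and figures purely for illustration:

```python
# Hypothetical figures, purely to illustrate how a per-1,000 rate is derived.
cities = {
    "Exampleton": {"violent_crimes": 4200, "population": 350_000},
    "Sampleford": {"violent_crimes": 1100, "population": 150_000},
}

for name, d in cities.items():
    rate = d["violent_crimes"] / d["population"] * 1000
    print(f"{name}: {rate:.1f} violent crimes per 1,000 people")
```

A bigger bubble on the map simply corresponds to a bigger value of `rate`.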

CBD pic uni city

How dangerous is your university? Graphic by Namal Perera

Nottingham had the highest rate of violent crime in England, and also had high rates of burglary and robbery. Unfortunately, nobody from the university security was available to speak to me. Speaking to former students from Nottingham – they reported being aware of the city having a certain degree of notoriety attached to it, but had not experienced any particular violence against their person.

Mike Porter is the Security Manager at the University of Bath, which has been ranked as the safest university in the UK. There are probably many reasons for this – e.g. overall crime may be lower with a smaller population – but it’s useful to know how they address security issues there.

“At their induction, students usually receive a security lecture from student services,” he said. “We [security] are present during Freshers’ Week and can give advice and additional information to students.”

Location is also important. Mr Porter says, “Being on top of a hill is beneficial from a crime prevention point of view.”

For more general information on England and Wales violent crime statistics click here.



How to make a map with CartoDB

World Map Parchment by Guy Sie

World Map Parchment by Guy Sie, Flickr Creative Commons

I must admit that I quite enjoyed the various tutorials on how to make maps. Learning to use different software for tabulating and geo-coding data had its moments, but ultimately it allowed me to develop some basic skills that I’ve applied in other parts of the course.

I found CartoDB was the easiest mapping software to use, and I’ve used it a couple of times for MA assignments. Although probably not as versatile as using Google pivot tables, it is a simple and user-friendly tool. Useful for someone like me.

So here is a stepwise guide to making a map. As always, the first step is sourcing a reliable set of data. From there it’s as easy as…

  1. Collate the data in a Microsoft Excel file.
  2. Log in to CartoDB. All that is needed is a valid email address; once registered, there is the option of five free data tables (up to 50 megabytes).
  3. Click on ‘tables created’, then ‘add new table’: this allows direct import of data from a URL, Google Drive or Dropbox. Or you can simply upload your own file from a laptop.

Please NOTE: ensure your Excel file contains fully cleaned data – removing empty cells, deleting unwanted columns etc.
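The cleaning step above can be sketched with pandas before you export to CartoDB. This is a toy example – the column names and values are my own assumptions, standing in for whatever your Excel sheet contains:

```python
import pandas as pd

# A toy table standing in for the Excel sheet (column names are assumptions).
df = pd.DataFrame({
    "city": ["Leeds", "Bath", None, "Nottingham"],
    "violent_crime_rate": [10.2, 4.1, None, 14.8],
    "notes": ["", "", "", ""],  # an unwanted column
})

df = df.drop(columns=["notes"])  # delete unwanted columns
df = df.dropna()                 # remove rows with empty cells
print(df.shape)
```

The resulting frame can be saved with `df.to_csv(...)` and uploaded in its finalised form.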

It also helps to have city geocodes in place, ideally in a column next to the city. These can be found online.

Alternatively, Google finds geocodes automatically through its mapping function, linked to pivot tables.

  4. Go to the dashboard, which shows the names and number of existing data files. Upload your data file.

Believe it or not, you’re almost done. Once all the data is uploaded, there should be a complete table with the same layout as the Excel document. The first column is automatically given a CartoDB ID number. Again, please note that it is important to have the data in its finalised form before importing from Excel: it is difficult to edit data on CartoDB.

eg CDB chicago parking pay boxes by Steven Vance

Example of CartoDB table – Chicago parking pay boxes by Steven Vance, Flickr Creative Commons 

5. Once you have your complete table uploaded, go through the columns and, under each heading, choose whether the data is a string, date or number. You can generally ignore the Boolean option.

6. Go to map view – on the right-hand side there is a task bar, where you can select wizards to present or highlight the data in different ways.

  • There are a number of different icons.
  • The info-window icon (“bubble”) allows you to choose what information appears in the window when you click on the map location.
  • I particularly liked the cluster wizard (paintbrush icon), which sizes each bubble in proportion to a particular data column. Please see my next post for an example of this.

7. Finally, click on ‘visualise’, give your map / URL a name and publish.

There you have it – fewer than 10 steps to make a map. Simply click on a location to access the relevant data.

If you have problems, the user support is fairly prompt. I tried to geocode regions of the UK once but failed.

Nick Jaremek from CartoDB Support initially thought the geocoding option might not be working because the codes did not have the right format. He later stated, “UK regions are something too specific to be geocoded right now.”

It does have its weaknesses, but for straightforward location mapping with attached numerical data, I found CartoDB to be a valuable and extremely easy open-source mapping tool.

Fatalities on the road

Crushed car RTA by Emilian Robert Vicol

Crushed-car by Emilian Robert Vicol, Flickr Creative Commons 

Every so often, usually when stuck in traffic, I contemplate just how dangerous the UK’s roads are compared with other countries’. I suppose you do too. Probably at a similar time, whilst at a standstill on a motorway, when the police and ambulances go shrieking past.

In 2013 the World Health Organisation (WHO) published its Global Status Report on Road Safety, which again highlights that road traffic accidents (RTAs) are the leading cause of death for young people (aged 15-29), killing more people than malaria.

RTA chart

Chart showing fatalities from RTAs across different countries adjusted for population

Each year around 1.25m people are killed in traffic accidents globally. WHO Director-General, Margaret Chan, has previously said, “Road traffic crashes are a public health and development crisis,” adding,

The vast majority of those affected are young people in developing countries.

We are in the UN Decade of Action for Road Safety. There is an ongoing drive (excuse the pun) to reduce deaths on the road by 50% by 2020, with experts estimating that five million lives could be saved. Currently, annual deaths are predicted to rise to 1.9m by the end of the decade.

So where are the world’s most dangerous roads? Using a tree-map, I’ve highlighted data pertaining to certain key countries at both ends of this fatal scale.

RTA fig

Road fatalities by country – Picture by Namal Perera


Key points from WHO data:-

  • Middle-income countries account for around 80% of RTA deaths but are home to only around 50% of the world’s registered vehicles: they therefore bear a disproportionately high burden of deaths.
  • Eritrea is estimated to have the highest rate of road deaths (48.4 per 100,000 people). This is, however, based on 2009 data.
  • The world’s most populous countries, China and India, have the highest absolute number of recorded road deaths (275,983 and 243,475 respectively) but lie mid-table when adjusting for population.
  • In Africa, Nigeria has the largest population and buys the most cars. South Africa has the highest car ownership per capita. Both are in the top 10 (7th and 8th) when it comes to road fatalities (34 and 32 deaths per 100,000 population).
  • San Marino has the best record according to the WHO, with zero fatalities on its roads (2010 data). However, the tragic deaths of Formula 1 legend Ayrton Senna and Roland Ratzenberger during the 1994 grand prix weekend are more than enough for this tiny enclave to cope with.
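The population adjustment behind the China and India point can be sketched in a few lines. The death totals are the WHO figures quoted above; the population figures are my own rough assumptions for illustration:

```python
# Death totals are the WHO figures quoted above; populations are
# approximate assumptions (circa 2013), for illustration only.
deaths = {"China": 275_983, "India": 243_475}
population = {"China": 1_350_000_000, "India": 1_250_000_000}

rates = {c: deaths[c] / population[c] * 100_000 for c in deaths}
for country, rate in rates.items():
    print(f"{country}: {rate:.1f} road deaths per 100,000 people")
```

Both work out at roughly 20 deaths per 100,000 – far below Eritrea’s 48.4, which is why both countries sit mid-table once population is taken into account.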

The story of Nature, Google and the flu

GFT 09 charlene mcbride

Google Flu Trends 09 by charlene mcbride, Flickr Creative Commons 

Following on from my last post on big data, this story beautifully illustrates its strengths and weaknesses.

First, though, I would like to draw attention to – and highly recommend – an article entitled “Big data: are we making a big mistake?” by Tim Harford in the Financial Times magazine. It provides a thorough account of the problems with big data, highlighted with examples such as that of Google and influenza.

The story begins in 2008, when Google beat the Centers for Disease Control and Prevention (CDC) in predicting the spread of influenza (“the flu”) across the United States.

Publishing their results in Nature (February, 2009), Google described how they aggregated historical logs of the (50 million) most common online search queries between 2003 and 2008.

Google was faster at tracking the flu outbreak because they found a correlation between people’s web searches and whether they had flu symptoms. The CDC took around a week to track the flu as they had to form the picture by collating data “on the ground” – that is from individual practices. In contrast, Google’s tracking took only about a day.

So, Google Flu Trends (GFT), working solely on data and algorithms, was quick, accurate and cheap. There was no antecedent theory, no null hypothesis on the correlation between certain search terms and the spread of the disease itself.

Now, skip ahead four years: in February 2013, Nature News reported that GFT had over-estimated the spread of flu. It had predicted double the number of episodes compared with the CDC. GFT used big data, whilst the CDC used traditional methods of data collection and analysis and were proved right.

So why, after accurately predicting flu patterns over the preceding winters, had GFT suddenly failed with its big data?

The first big problem was that the GFT team did not know what connected the search terms and the actual spread of flu. They were not looking for causation. They were simply looking at correlation, and finding patterns.

Apparently, as discussed in my earlier post, this is common when companies look at big data: it is far cheaper to look for correlation than causation. Establishing the latter can be impossible, and perhaps not cost-effective.

So the failure of GFT was a result of not knowing what lay behind the correlation – and therefore what might cause the correlation to collapse. For example, flu scares in the previous winter may have triggered web searches by healthy people.

search-engine-land by Google Flu Shot Locator

search engine land by Google Flu Shot Locator, Flickr Creative Commons 

“The Parable of Google Flu: Traps in Big Data Analysis” is a paper that discusses the problems encountered by GFT, which are also translatable to other organisations.

Written by academics David Lazer, Ryan Kennedy, Gary King and Alessandro Vespignani, it explores two main issues that led to GFT’s failure – which they call “big data hubris” and “algorithm dynamics”.

The former refers to the challenges of properly analysing the sheer quantity of data; the latter refers to the programming tweaks made by the operators to improve the service (and to the changing behaviour of the users of that service).

Changes in GFT’s search algorithm and user behaviour (the dynamics) probably affected GFT’s flu tracking programme, leading to their incorrect prediction of flu prevalence.

The common explanation for the error – (media fuelled) flu-panic the previous year – does not explain why GFT had missed predictions by wide margins for over two years. Earlier versions of GFT did not succumb to previous flu scares.

One likely cause was a change made by GFT’s algorithm itself.

Certain differences – such as searches for flu treatments and searches for information on differentiating flu from the common cold – appeared to follow GFT’s errors.

Another learning point from GFT concerns reproducibility (or replicability, as they call it) and transparency, both of which are causes for concern. Several difficulties were encountered when trying to replicate the original algorithm. The search terms are unclear, and both access to Google’s data and the possibility of replicating GFT’s analysis have limitations (e.g. privacy).

Remember the “multiple-comparisons problem”? If you’re looking for many, or even just any patterns in a large data set, it’s likely that you’ll find one. Test enough different correlations and you’re bound to get some fluke results.

Correlation does not equal causation

The problems discussed above are not limited to GFT. Although valuable, big data cannot yet replace traditional data collection, methods and analysis.

At the end of “The Parable of Google Flu”, the authors suggest an “all data revolution,” where advanced analysis of both traditional “small data” and new big data might provide the clearest picture of the world.

Big Data has become a mainstream commodity in science, technology and business. But it must be handled carefully.

Google Flu will no doubt return, refreshed and upgraded. For now, however, it serves as a lesson on looking at big data and avoiding previous mistakes.

KamiPhuc by GFT

‘KamiPhuc’ by Google Flu Trends, Flickr Creative Commons 

The power and problems of BIG data

by Thierry Gregorius

Cartoon Big Data by Thierry Gregorius, Flickr Creative Commons

I remember (back in 2000) being asked about the world’s biggest databases. My mentor told me the story of Walmart: how, from early in its creation, it began to collect customer data. By the year 2000, it had the largest database in the world. That data store is still vast (several hundred terabytes), though no longer the biggest, and occupies a site of over one hectare in Missouri – the so-called “Area 71”.

Walmart’s database provides a powerful marketing resource.

Companies could pay Walmart to look at customer data in order to pitch advertising. By looking at what products a particular individual bought, it was possible to get an idea of household income.

From there, a car company for example could decide what model of car might be best suited for a particular person and direct promotions that way.

Big data is perhaps a somewhat vague term for the sheer scale of data that now exists, which may be measured in petabytes or exabytes – Paul Bradshaw recently reminded me that around 2-3 exabytes of data are created every day.

Big data describes large and complex collections of data, which cannot be processed by traditional data processing tools. Challenges arise from its collection to interpretation.

In February 2001, an analyst named Doug Laney at the META Group described the three ‘V’s of big data: volume, velocity and variety.

Volume simply refers to the increasing amount of data, the causes of which are multi-factorial. Sources range from “unstructured data” on social media to quantitative data collected by machine sensors.

Velocity is the speed of data coming in and going out. Dealing with the rapid influx of data quickly enough is a challenge for many organisations.

Variety refers to the full range of data types and sources. Formats may be structured or unstructured, both of which place demands on the analysts.

Big data sets are often beyond the capability of normal software tools to manage within an acceptable, or tolerable, timeframe. Add to this the difficulties of variability and complexity and it is easy to lose control.

Sense of Statistics

Making sense of big data requires the use of complex statistical techniques. Large data sets can be untidy, and are potentially full of biases (especially when just looking for correlations). Using big data – to figure out exactly what is going on and how to effect worthwhile change in a system – requires advanced statistics.

Big data causes a problem because there are many more possible combinations of (“linked”) data points that can be compared – and so there is a high chance of finding an association.

This is the “multiple-comparisons problem”: if you’re looking for many, or even just any patterns in a large data set, it’s likely that you’ll find one. Test enough different correlations and you’re bound to get some fluke results.
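The multiple-comparisons problem is easy to demonstrate with a simulation. This is a minimal sketch in plain Python: it generates many series of pure noise, compares every pair, and counts how many “strong” correlations appear by chance alone:

```python
import random
import statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(42)  # reproducible run
n_vars, n_points = 100, 30

# 100 series of random noise - no real relationships exist here at all.
series = [[random.gauss(0, 1) for _ in range(n_points)] for _ in range(n_vars)]

# Compare every pair (4,950 comparisons) and count "strong" correlations.
flukes = sum(
    1
    for i in range(n_vars)
    for j in range(i + 1, n_vars)
    if abs(pearson_r(series[i], series[j])) > 0.4
)
print(f"Fluke correlations with |r| > 0.4: {flukes} out of 4950 pairs")
```

Even though every series is random, dozens of pairs typically clear the threshold – exactly the fluke results the paragraph above describes.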

Yet again, we must seek to find out whether the pattern is “statistically significant” (a true finding), or whether it occurred by chance.

The aforementioned difficulties of managing big data with standard desktop software (for statistics and visualisation) mean that more advanced database systems are required. Terms such as inductive statistics and nonlinear system identification are essential jargon. These concepts essentially allow the user to identify true relationships within the data as well as predict outcomes.

David Spiegelhalter, statistics professor at Cambridge University, who has also talked at City University, gives a useful lecture on the trickiness of numbers, number hygiene and statistical significance. The new statistical techniques for big data will work by building on old methods.

With respect to big data he says, “There are a lot of small data problems that occur in big data. They don’t disappear because you’ve got lots of the stuff. They get worse.”

BD by Walt Stoneburner

Big Data by Walt Stoneburner, Flickr Creative Commons


Big data enthusiasts make four claims, which Spiegelhalter warns could be “complete bollocks”. Certainly, they are over-simplistic and are listed below – together with their flaws.

  1. Analysis of big data can yield accurate results.

– ignoring biases, statistical methods and causation means that we can overrate accuracy.

  2. Recording (almost) every individual data point renders former statistical sampling techniques obsolete.

– see Spiegelhalter’s comments above. The newer statistical methods are derived from former ones.

  3. Correlation gives us the necessary picture – causation is an outdated secondary issue.

– again, bias matters. To downgrade the importance of causation is only permissible in a stable environment. Making predictions without knowing about bias and causation does not work in a changing world.

  4. Statistical models are unnecessary because with big data “the numbers speak for themselves”.

– the numbers cannot reliably speak for themselves: random patterns / correlations exist in big data and these outnumber true (statistically significant) findings.

To conclude, companies are interested in big data sets because they are relatively cheap to collect for their size and they can be readily updated. A multitude of individual data points can also be used for many different purposes. All this helps in the marketplace.

Big data made a big splash. But to fulfil its potential it must be married with statistical insight. Only then will we appreciate the full impact of big data.

Ambulance waiting times – not the best data story?

I’m using this article on ambulance waiting times from the BBC to illustrate some of the frustrations encountered with certain data stories. There is also a lesson to learn I hope.

Em amb by lydia_shiningbrightly

Picture: lydia_shiningbrightly, Flickr Creative Commons

This is a short critical analysis.  I like the BBC, but they are evidently susceptible to making news out of what many may consider to be non-stories.

Let’s start with the title:

Wales’ ambulance transfer times worst in UK

Now, regarding waiting or transfer times, longer probably equals worse. So this is essentially true.

However, does the longest wait (at six hours 22 minutes) equate to the worst service?  What if all their other times were less than 30 minutes?  They weren’t but anyway…

The fact is, the numbers tell us nothing about the distribution of the data: they give no idea of proportion.

Normal distribution, or Gaussian distribution as it is sometimes called, is an important concept in mathematical probability and statistics.

The classic symmetrical bell-shaped curve, shown below, is used to demonstrate “normally” distributed data. With continuous data, the area under the curve between two limits represents the proportion of observations falling between them.

In many areas of scientific study, physical measurements are often normally distributed. This is very useful in science because the Gaussian distribution is frequently assumed for random values whose actual distribution is unknown.

Furthermore, analysis of results becomes more straightforward when the relevant variables are normally distributed.

ND 1

The normal “Gaussian” distribution. Picture by Namal Perera
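To show how thin the tail of a normal curve is, here is a quick sketch. It assumes, purely for illustration, that handover waits were normally distributed around a 20-minute mean with a 10-minute standard deviation – both numbers are modelling assumptions, not the real Welsh data:

```python
from statistics import NormalDist

# Illustrative model only: mean 20 min, sd 10 min (both assumed).
waits = NormalDist(mu=20, sigma=10)

# Under this model, a wait of over an hour is a four-sigma event.
tail = 1 - waits.cdf(60)
print(f"P(wait > 60 min) = {tail:.6f}")
```

Under such a model, a six-hour wait would be vanishingly rare – which is precisely why quoting only the extreme tells you so little about the distribution as a whole.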

Applying the bell-curve principle to the story – is it really news if you take one extreme of the curve and make it seem like the norm? Excuse the minor pun.

It’s a point I’ve raised again and again. As journalists, it’s fine to report on the extremes. And often fun – consider the world’s largest hotdog. However, as a serious data journalist, it’s worth taking the time to consider how relevant your data point is in the context of the whole distribution story.

And, by the way, the lesson still stands – in fact it gets worse – if we consider asymmetrically distributed data.

pos skew d 2

Positively skewed data. Picture by Namal Perera

Here, the end of the curve is even less representative of the data distribution.

In fairness to the reporter(s), not named on the webpage, they do give some points of reference. For example, they state “no service saw its longest wait dip under an hour, with many around the two-hour mark”.

And for comparison, it mentions an ambulance service in the east of England whose longest single wait was 5 hours 51 minutes.

The report isn’t likely to be the full story (it rarely is from what I’ve learned during the Science Journalism MA). However, if you ask for the “longest waits” you will probably only be given the longest waits.

Yes, it is unacceptable for patients to wait through lengthy hand-overs, but if it doesn’t lead to harm and the majority of people are treated promptly, how big an issue is this?

The article does quote a Welsh government spokesperson saying, “most people were waiting for an average of 20 minutes.”

One A&E consultant I spoke to said, “It’s no surprise. You see this a lot. It’s an easy target highlighting the weak areas. It’s a shame they don’t mention how quickly the sickest [patients] were transferred.”

So the next time you read a story about the longest, shortest, fastest, or healthiest please consider what data you’re looking at, and the context in which it is presented. Consider whether you have the full picture before forming your opinion.

Buzzfeed’s infographic review of Beautiful Science

The British Library is running an exhibit entitled “Beautiful Science: Picturing Data, Inspiring Insight” from 20th February to 26th May 2014.

BL and St Panc by Jim Linwood

Picture: The British Library and St Pancras by Jim Linwood, Flickr Creative Commons

The exhibition explores how scientific stories are told by turning numbers into pictures – the story of infographics.

This historical review of infographics reveals how scientific understanding has developed together with people’s capacity to represent data in pictures and graphs. Buzzfeed have paid tribute to Beautiful Science with a look at “9 Glorious Infographics Through History”, which is both an appropriate and considered choice. The display features a variety of designs spanning almost four centuries.

BF BS exhibit

Perhaps unsurprisingly, my favourite infographics were John Graunt’s “Bills of Mortality” from 1662 and Florence Nightingale’s “Rose Diagram” from 1854.

enhanced-buzz-wide-Graunt's Bills of Mortality

Picture: John Graunt’s Bills of Mortality (1662). From British Library.  Click to enlarge

Graunt’s table is one of the earliest publications of public health data. It was collated from early death notifications gathered by parish clerks in London at the turn of the 17th century, in an attempt to monitor deaths from plague.

Among the more interesting points were the three to four people per year who died from lethargy; the eight who died by “Wolf” between 1633 and 1636 (why none before or after – was there a cull of man-eaters?); and most sadly what appears to be 279 folk who died from grief over those 15-20 years.

John Graunt was a haberdasher by trade, although he is now considered to be one of the first epidemiologists.

The Rose Diagram by Florence Nightingale (below) also stands out as a fine public health infographic.

Nightingale is famous for looking after thousands of soldiers during the Crimean War (1853-6). But the Lady of the Lamp was also a splendid epidemiologist, who harnessed the power of the infographic and statistics to initiate change.

Flo N Rose Diagram

Picture: Florence Nightingale’s Rose Diagram (1854). From the British Library

“Iconic” may seem too strong a word to describe Nightingale’s rose diagram, but I think it is appropriate – this nineteenth-century pie chart is indeed a visual icon.

It shows seasonal variation in the cause of mortality of soldiers in the military field hospital.

At the end of the war, Nightingale wrote a report including this infographic, which carried a stark message: hospitals can kill. The majority of soldiers died from preventable diseases (in blue) rather than from battle wounds (in red).

The Rose Diagram was designed to show that improving sanitation in hospitals could save lives. It ultimately led to cleaner hospitals, where more lives were saved.

The Beautiful Science exhibit runs from 20th February to 26th May 2014 with free admission to the Folio Gallery.

Describing Data

Drowning by numbers by Jorge Franganillo

Drowning by numbers.  Picture by Jorge Franganillo, Flickr Creative Commons

Data is basically information – a set of quantitative or qualitative values.

As I said in my introduction, the term is used as a mass noun i.e. “the data shows…” (although “the data show…” is also correct).

An individual data point or value represents a piece of information.

Data is usually collected by measurement and visualised by images such as charts or graphs.

Raw data refers to unprocessed information in the form in which it was originally collected. This can be from scientific experiment (based on observation under laboratory conditions) or simply from the field.

However it is collected and in whatever form, it is first necessary to recognise exactly what type of data you are dealing with. The diagram below should give the reader a general idea of the different data types. It is just one way to look at data, and I hope it is clear.

Data types


Initially you can make one broad distinction: whether the data is continuous or discrete.

Continuous data is always quantitative, or numerical, and is measured on an interval or ratio scale. It can take any value within a range – not just whole numbers (1, 2, 3… etc.) but all the decimal values in between.

Discrete data is here also called categorical data, as it refers to data arranged in categories. Categorical data can be ordered or ranked – such as first, second, third, or mild, moderate, severe: this is ordinal data.

Alternatively, categorical data may (often) be unranked, such as the colours of cars: this is nominal data. Neither nominal nor ordinal data has a true numerical value – this makes them non-parametric. That matters when it comes to statistics, the subject that gives the data meaning and value.
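The distinctions above can be made concrete with a few toy values (all purely illustrative):

```python
# Toy examples of each data type described above (values are illustrative).
continuous = [36.6, 37.1, 38.25]          # e.g. temperatures - any decimal value
ordinal = ["mild", "moderate", "severe"]  # categories with a natural order
nominal = ["red", "blue", "green"]        # categories with no inherent order

# Ordinal categories can be ranked even though they carry no numeric value:
severity_rank = {level: i for i, level in enumerate(ordinal)}
print(sorted(["severe", "mild"], key=severity_rank.get))  # ['mild', 'severe']
```

Nominal categories, by contrast, have no meaningful ordering at all – sorting car colours alphabetically tells you nothing about the data.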

There you have it – all the types of data. Although journalists probably don’t need to get bogged down in the details, it will always be handy to recognise exactly what you’re dealing with. This is especially true for science journalists I feel.

In case I don’t manage an easy-to-understand statistics post, it is worth me mentioning how we can handle data (as journalists or scientists). There are three broad stages…

  1. Collection: may be from surveys or (scientific) studies; for journalists, collection is usually from a source.
  2. Presentation: usually in graphic format, with measurement of certain markers e.g. maximums, minimums, averages.
  3. Interpretation: using statistics is a major part of results analysis, although journalists perhaps rightly look to the expert discussion of the results too.

My aim is to understand the process better. I hope yours is too.

Introduction: one doc’s issues with data

I thought I might struggle with data journalism. And not just because of time constraints, or my own IT limitations. I guess it’s because I didn’t know what to expect.  Having learned the fundamentals of using data and a reasonable grasp of basic stats (during research at school, university and work), I enter the field of data journalism with an open mind but a little skepticism.

Sean MacEntee

Data Recovery.  Picture by Sean MacEntee, Flickr Creative Commons

I’ve often been irritated by stories in the press, which don’t do justice to the actual figures. “Data” seems to be thrown around in the media to add authority, a buzz-word to proclaim truth. I believe that the majority of journalists, like scientists, do a decent job when it comes to the numbers behind the stories. However from time to time, there appears to be a woeful lack of insight during the interpretation of these numbers.

Ben Goldacre’s Bad Science column / blog / book and Paul Bradshaw’s Online Journalism Blog are excellent resources for de-bunking scientific or medical myths. They highlight how journalists, politicians and scientists can mislead the public at almost every stage of research – from the methodology, to the interpretation of results.

So, where to start? Taking the advice of my journalism tutor and Dr Goldacre, I just started writing – if only to vent some of my frustrations. These mainly stem from several data stories that appear (to me, on closer reading) incomplete, misrepresented or overestimated in value.

Even the word itself – data, singular datum – causes contention. Let’s be clear: as a mass noun to signify information, it is perfectly acceptable to use data in the singular, although the (more pedantic?) academic types often prefer to acknowledge the Latin roots of the word and would say “these data show” as each piece of information is a datum.

I want people to see things as they truly are, through the objectivity that data offers. This requires both reliable sources and recording of data, and accurate interpretation. Data journalists will have their own styles and opinions, but robust data analysis should yield clear and consistent meanings.

So, here are a few things I’d like to cover: sources of data, presentation and basic analysis. The latter will not involve much in the way of statistics. I also aim to critique some data stories as well as try out software and online tools for my own data stories.

Lastly, for my blog-posts I’d like to invoke my own extension of the “KISS” acronym – now “KISSASS” =

keep it short, sweet, and simple, stupid

Short: I see that around 500 words is recommended for blog-posts, although I can’t say with any conviction what the ideal word count is. “Sweet” really means selective and stimulating – one post for one (interesting) idea. And simple: where possible it should be understandable to almost everyone.

I hope it works out and I’m keen to hear your comments.

by bixentro

Picture by bixentro, Flickr Creative Commons