
How to make a map with CartoDB


World Map Parchment by Guy Sie, Flickr Creative Commons

I must admit that I quite enjoyed the various tutorials on how to make maps. Learning to use different software for tabulating and geo-coding data had its moments, but ultimately it allowed me to develop some basic skills that I’ve applied in other parts of the course.

I found CartoDB was the easiest mapping software to use, and I’ve used it a couple of times for MA assignments. Although probably not as versatile as using Google pivot tables, it is a simple and user-friendly tool. Useful for someone like me.

So here is a stepwise guide to making a map. As always, the first step is sourcing a reliable set of data. From there it’s as easy as…

  1. Collate the data in a Microsoft Excel file.
  2. Log in to CartoDB. All that is needed is a valid email address, and once registered, there is the option of five free data tables (up to 50 megabytes).
  3. Click on ‘tables created’, then ‘add new table’: this allows direct import of data from a URL, Google Drive or Dropbox. Or you can simply upload your own file from a laptop.

Please NOTE: ensure your Excel file contains fully cleaned data – removing empty cells, deleting unwanted columns etc.

It also helps to have city geocodes in place, ideally in a column next to the city. These can be found at http://www.freegeocoder.co.uk/latitude-longitude-search/

Alternatively, Google finds geocodes automatically through its mapping function, linked to pivot tables.
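If you would rather script this preparation stage than do it by hand, here is a minimal sketch using pandas and geopy’s free Nominatim geocoder (my own choice of tools, not part of CartoDB); the file name and the ‘city’ and ‘notes’ columns are hypothetical examples.

```python
# Sketch: clean an Excel file and add latitude/longitude columns before uploading to CartoDB.
# Assumes a hypothetical cities.xlsx with a "city" column and an unwanted "notes" column.
import time

import pandas as pd
from geopy.geocoders import Nominatim

df = pd.read_excel("cities.xlsx")                   # hypothetical source file
df = df.dropna(how="all")                           # remove completely empty rows
df = df.drop(columns=["notes"], errors="ignore")    # delete unwanted columns

geolocator = Nominatim(user_agent="cartodb-prep")   # free OpenStreetMap geocoder

def geocode_city(name):
    """Return (latitude, longitude) for a city name, or (None, None) if not found."""
    time.sleep(1)  # be polite: Nominatim asks for at most one request per second
    location = geolocator.geocode(name)
    return (location.latitude, location.longitude) if location else (None, None)

df[["latitude", "longitude"]] = df["city"].apply(
    lambda name: pd.Series(geocode_city(name))
)

df.to_csv("cities_clean.csv", index=False)          # ready to upload to CartoDB
```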

  4. Go to the dashboard, which shows the number and names of your existing data files. Upload your data file.
  • Believe it or not, you’re almost done. Once all the data is uploaded, there should be a complete table with the same layout as the Excel document. The first column is automatically given a cartodb ID number. Again, please note that it is important to have the data in its finalised form before importing from Excel. It is difficult to edit data on CartoDB.


Example of CartoDB table – Chicago parking pay boxes by Steven Vance, Flickr Creative Commons 

5. Once you have your complete table uploaded, go through the columns and, under each heading, choose whether the data is a string, date or number. You can generally ignore the Boolean option.
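If CartoDB guesses a column type wrongly, I find it easier to fix the types in the source file and re-upload than to edit them online. A small sketch with pandas, assuming hypothetical ‘visit_date’ and ‘count’ columns in the cleaned file from the earlier example:

```python
import pandas as pd

df = pd.read_csv("cities_clean.csv")  # the cleaned file from the earlier sketch

# Coerce each column to the type you want CartoDB to see: date, number or string.
df["visit_date"] = pd.to_datetime(df["visit_date"], errors="coerce")  # hypothetical date column
df["count"] = pd.to_numeric(df["count"], errors="coerce")             # hypothetical number column
df["city"] = df["city"].astype(str)                                   # plain string column

df.to_csv("cities_clean.csv", index=False)
```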

6. Go to map view – on the right-hand side there is a task bar, where you can select wizards to present or highlight the data in different ways.

  • There are a number of different icons.
  • The info-window icon (“bubble”) allows you to choose what information appears in the window when you click on the map location.
  • I particularly liked the cluster wizard (paintbrush icon), which scales the size of each bubble according to a particular data column. Please see my next post for an example of this.

7. Finally, click on ‘visualise’, give your map / URL a name and publish.

There you have it. Fewer than 10 steps to make a map. Simply click on a location to access the relevant data.

If you have problems, user support is fairly prompt. I once tried to geocode regions of the UK but failed.

Nick Jaremek from CartoDB Support initially thought the geocoding option might not be working because the codes did not have the right format. He later stated, “UK regions are something too specific to be geocoded right now.”

It does have its weaknesses, but for straightforward location mapping with attached numerical data, I found CartoDB to be a valuable and extremely easy-to-use open-source mapping tool.

The power and problems of BIG data


Cartoon Big Data by Thierry Gregorius, Flickr Creative Commons

I remember (back in 2000) being asked about the world’s biggest databases. My mentor told me the story of Walmart: how, from early in its history, it began to collect customer data. By the year 2000, it had the largest database in the world. That data store is still vast (several hundred terabytes), though no longer the biggest, and occupies a site of over one hectare in Missouri, the so-called “Area 71”.

Walmart’s database provides a powerful marketing resource.

Companies could pay Walmart to look at customer data in order to pitch advertising. By looking at what products a particular individual bought, it was possible to get an idea of household income.

From there, a car company for example could decide what model of car might be best suited for a particular person and direct promotions that way.

Big data is perhaps a somewhat vague term for the sheer scale of data that now exists, which may be measured in petabytes or exabytes – Paul Bradshaw recently reminded me that around 2-3 exabytes of data are created every day.

Big data describes large and complex collections of data, which cannot be processed by traditional data processing tools. Challenges arise from its collection to interpretation.

In February 2001, an analyst named Doug Laney at the META Group described the three ‘V’s of big data: volume, velocity and variety.

Volume simply refers to the increasing amount of data, the causes of which are multi-factorial: it ranges from “unstructured data” generated on social media to quantitative data collected by machine sensors.

Velocity is the speed of data coming in and going out. Dealing with the rapid influx of data quickly enough is a challenge for many organisations.

Variety refers to the full range of data types and sources. Formats may be structured or unstructured, both of which place demands on the analysts.

Big data sets are often beyond the capability of normal software tools to manage within an acceptable, or tolerable, timeframe. Add to this the difficulties of variability and complexity and it is easy to lose control.

Sense of Statistics

Making sense of big data requires the use of complex statistical techniques. Large data sets can be untidy, and are potentially full of biases (especially when just looking for correlations). Using big data – to figure out exactly what is going on and how to effect worthwhile change in a system – requires advanced statistics.

Big data causes a problem because there are many more possible combinations of (“linked”) data points that can be compared – and so there is a high chance of finding an association.

This is the “multiple-comparisons problem”: if you’re looking for many, or even just any patterns in a large data set, it’s likely that you’ll find one. Test enough different correlations and you’re bound to get some fluke results.

Yet again, we must work out whether the pattern is “statistically significant” (a true finding) or whether it occurred by chance.
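To see how easily flukes arise, here is a small illustrative sketch of my own (not taken from any particular study): it tests 1,000 correlations between columns of pure random noise and, at the usual 5% significance threshold, still flags roughly 50 of them as “significant”.

```python
# Sketch of the multiple-comparisons problem: pure noise still yields "significant" correlations.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_rows, n_tests = 100, 1000

false_positives = 0
for _ in range(n_tests):
    x = rng.normal(size=n_rows)      # random noise, no real relationship
    y = rng.normal(size=n_rows)
    _, p_value = pearsonr(x, y)
    if p_value < 0.05:               # the usual 5% significance threshold
        false_positives += 1

print(f"{false_positives} of {n_tests} correlations look 'significant' by chance alone")
# Expect roughly 50 (about 5%), even though every relationship is pure coincidence.
```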

The aforementioned difficulties of managing big data with standard desktop software (for statistics and visualisation) mean more advanced database systems are required. Terms such as inductive statistics and nonlinear system identification are essential jargon. These concepts essentially allow the user to identify true relationships within the data as well as predict outcomes.

David Spiegelhalter, statistics professor at Cambridge University, who has also talked at City University, gives a useful lecture on the trickiness of numbers, number hygiene and statistical significance. The new statistical techniques for big data will work by building on old methods.

With respect to big data he says, “There are a lot of small data problems that occur in big data. They don’t disappear because you’ve got lots of the stuff. They get worse.”


Big Data by Walt Stoneburner, Flickr Creative Commons

 

Big data enthusiasts make four claims, which Spiegelhalter warns could be “complete bollocks”. Certainly, they are over-simplistic and are listed below – together with their flaws.

  1. Analysis of big data can yield accurate results.

– ignoring biases, statistical methods and causation means that we can overrate accuracy.

  2. Recording (almost) every individual data point renders former statistical sampling techniques obsolete.

– see Spiegelhalter’s comments above. The newer statistical methods are derived from former ones.

  3. Correlation gives us the necessary picture – causation is an outdated secondary issue.

– again, bias matters. To downgrade the importance of causation is only permissible in a stable environment. Making predictions without knowing about bias and causation does not work in a changing world.

  4. Statistical models are unnecessary because with big data “the numbers speak for themselves”.

– the numbers cannot reliably speak for themselves: random patterns / correlations exist in big data and these outnumber true (statistically significant) findings.

To conclude, companies are interested in big data sets because they are relatively cheap to collect for their size and they can be readily updated. A multitude of individual data points can also be used for many different purposes. All this helps in the marketplace.

Big data made a big splash. But to fulfil its potential it must be married with statistical insight. Only then will we appreciate its full impact.

Introduction: one doc’s issues with data

I thought I might struggle with data journalism. And not just because of time constraints, or my own IT limitations. I guess it’s because I didn’t know what to expect. Having learned the fundamentals of using data and gained a reasonable grasp of basic stats (during research at school, university and work), I enter the field of data journalism with an open mind but a little scepticism.


Data Recovery.  Picture by Sean MacEntee, Flickr Creative Commons

I’ve often been irritated by stories in the press that don’t do justice to the actual figures. “Data” seems to be thrown around in the media to add authority, a buzz-word to proclaim truth. I believe that the majority of journalists, like scientists, do a decent job when it comes to the numbers behind the stories. However, from time to time, there appears to be a woeful lack of insight in the interpretation of these numbers.

Ben Goldacre’s Bad Science column / blog / book and Paul Bradshaw’s Online Journalism Blog are excellent resources for de-bunking scientific or medical myths. They highlight how journalists, politicians and scientists can mislead the public at almost every stage of research – from the methodology, to the interpretation of results.

So, where to start? Taking the advice of my journalism tutor and Dr Goldacre, I just started writing – if only to vent some of my frustrations. These mainly stem from several data stories that appear (to me, on closer reading) incomplete, misrepresented or overestimated in value.

Even the word itself – data, singular datum – causes contention. Let’s be clear: as a mass noun to signify information, it is perfectly acceptable to use data in the singular, although the (more pedantic?) academic types often prefer to acknowledge the Latin roots of the word and would say “these data show” as each piece of information is a datum.

I want people to see things as they truly are, through the objectivity that data offers. This requires both reliable sources and recording of data, and accurate interpretation. Data journalists will have their own styles and opinions, but robust data analysis should yield clear and consistent meanings.

So, here are a few things I’d like to cover: sources of data, presentation and basic analysis. The latter will not involve much in the way of statistics. I also aim to critique some data stories as well as try out software and online tools for my own data stories.

Lastly, for my blog-posts I’d like to invoke my own extension of the “KISS” acronym – now “KISSASS”:

keep it short, sweet, and simple, stupid

Short: I see that around 500 words is recommended for blog-posts, although I can’t say with any conviction what the ideal word count is. “Sweet” really means selective and stimulating – one post for one (interesting) idea. And simple: where possible it should be understandable to almost everyone.

I hope it works out and I’m keen to hear your comments.


Picture by bixentro, Flickr Creative Commons