Open Refine (previously Google Refine) is a tool for manipulating and ‘cleaning’ data (more information is available on the new Open Refine site). If you use Excel to do general data jobs, then it’s worth looking at Refine to see if it can help.
I’ve used Refine in the past to do explore and clean up data, and been impressed by the tools it provides. I’m currently working on two projects – KnowledgeBase+ and GoKB which are concerned with collecting and organising data about electronic resources. As anyone who has had to deal with data about electronic journal collections knows (especially e-resource librarians), a lot of data is sent around in spreadsheets and as comma/tab separated values – for example the KBART Guidelines, which are used by some publishers/content providers to publish lists of e-journals, recommend a TSV format.
Today I wanted to understand how a particular e-journals package was changing in 2013. I had a list of the journals included in the package in 2012, and a new list of the journals that will be included in 2013. Each file contained the title of the journals, ISSNs, eISSNs, URLs, the year, volume and issue of the first journal issue included in the online package. However, the column order and names were not the same as can be seen here:
I wanted to know
- which titles were in the 2012 version that were not in the 2013 version
- which titles were in the 2013 package that were not in the 2012
- for titles that were in both, if there had been any changes in key information such as the first issue included
I’d seen a blog post by Tony Hirst (@psychemedia on Twitter) on merging datasets using Refine which shows how you can match values across two Refine projects. Following this tutorial I was able to start matching across the data sets. Since both data sets contained the ISSN and eISSN it seemed likely that one of these would be a good starting point for matching – alas, this was not the case. Not all entries had an ISSN or eISSN, in more than one case the what was recorded as eISSN in one was listed as the ISSN in the other. I decided to look for another approach to matching across the data sets.
In the end the most successful approach was using a three letter code that was used in the URLs in both files to identify the journal – this allowed me to get good matching between the two files and be pretty happy that I wasn’t missing matches. Interestingly the next most reliable matching mechanism I found was using a ‘fingerprint’ version of the journal title (‘fingerprint’ is a mechanism to try to standardised text strings, which is also described by Tony in his blog post).
Having found a decent way of matching between the files, I started to try to answer the questions above. Firstly from the 2013 file, I added in a new column which pulled the matching title from the 2012 file. Any blank cells in this column represented a 2013 title not in the 2012 file (easy to find using the Refine ‘faceting’ function).
To discover the titles in the 2012 file, not int the 2013 one, I did the same match process, but starting from the 2012 file. I couldn’t think of any other way of doing this which was a shame – it would be nice to get all the data into a single project and then do all the analysis from there – at the moment the only way I can think of doing this would be to somehow merge the files before importing into Refine – which would seem to defeat the point a bit.
Finally, I used the same matching mechanism to pull in the information relating to the first issue (first issue number, volume number, year) from the 2012 file into the 2013 file. I could then compare the 2012 version of each of these to the 2013 version
This compares the value in the “Frontfile 1st Issue Online Vol” with the value in the column I started from – and allocates a value of ‘same’ or ‘changed’. I found you needed to be careful when doing these comparisons that the data is of the same ‘type’. I had one example where I ended up comparing a ‘number’ type with a ‘string’ type and getting a ‘changed’ outcome even though both contained the character ‘4’.
By using facets to find all those that had ‘changed’, and using another facet on the 2012 title match to eliminate those titles that were not in the 2012 file, I was quickly able to find the 5 titles where the year had changed, and a further 16 titles where the volume of the first issue available was different, even though the year had stayed the same.
Could I have achieved the same outcome in Excel? Probably. However, Refine has several nice features (like faceting) that are very straightforward compared to doing the same thing in Excel. Perhaps, though, it’s greatest strength is the ability to easily view a history of all the changes that have been made to the file, undo changes, copy changes and re-apply by amending field names, and the ability to export all the changes applied in such a way that you can apply exactly the same changes to future files where the same actions are required.