The context of this post and related work is that our local public library service (Warwickshire Library and Information Service) is facing a cut of approximately 27% of its budget over the next 3 years (£2million from an overall budget of £7.4million). For more information see Facing the Challenge on libraries on the Warkwickshire County Council website.
One canard that seems to come up repeatedly in discussions around public libraries and their role is that ‘books are cheap’ and that this means there isn’t the need for public libraries that there once was. I thought an interesting experiment might be to work out how much our family’s use of the library service might equate to based on the cost of buying the books that we have borrowed from the library. So I set about writing a script to achieve this.
The first stage was to retrieve details of a loan history – luckily the system used by Warwickshire libraries (Vubis) do have a loan history which seems to go back 18 months (although with an absolute minimum of information – more on this later). Unluckily there is no way of querying this via an API, only via a web browser. This is not so unusual, and there are good ways of scraping data out of html which I’ve used before. However, in this case the page I needed to get the data from was protected by a login – as it was a loan history linked to an individual library account, and required a barcode and PIN to be input into a form on a webpage before accessing the loan history page.
To help with this I used a Ruby library called ‘Mechanize‘ (inspired by the Perl Mechanize library which does the same sort of thing). This essentially allows you to simulate web browser behaviour within a script – it can fill in forms for you and deal with things like cookies which the web page might try to write to the browser. (incidentally, if you are interested in doing stuff like automating checks for authenticated library resources online Mechanize is worth a look)
Once Mechanize has navigated through the login form and then to the loan history page the content of the page can be parsed as usual – luckily Mechanize supports parsing page content with another Ruby library called Nokogiri – which is my preferred HTML/XML parser in Ruby anyway. So grabbing the details from loan history page was fine once Mechanize had retrieved it.
Unfortunately the amount of information available in the loan history is minimal – just a title and a date. Since at minimum I need to have a price, I need more information. My first thought was I’d get prices from Amazon, but with only the title available I quickly realised I was going to get a lot of duplicate items – especially since my son (3 and a half) borrows books with titles like ‘Goldilocks and the three bears’ and ‘Shapes’ – not very unique.
So I needed to get more details – preferably an ISBN – before I start looking up prices. Any method of looking up books given only the title is going to suffer from the same problems of many duplicate titles. The way to minimise the number of duplicates is to search in the most limited pool of items for the titles from the loan history – which in this case is to search the Warwickshire library catalogue and limit the results to books in our local branch (Leamington Spa) – as this is the only branch we use.
The software that Warwickshire use to provide their library catalogue interface on the web is called ‘Aquabrowser‘ (as opposed to the one that stores administrative information like the loan history – which is called Vubis). Happily unlike Vubis, Aquabrowser does have some APIs that can be used to get information out of the system in a machine readable format. I have to admit that this functionality isn’t well advertised (neither by Warwickshire Library service nor by the vendor of the software), but luckily Chris Keene and Ed Chamberlain have both done some work in documenting the Aquabrowser API (here and here) which is in use at their place of work (University of Sussex and University of Cambridge respectively).
Aquabrowser supports what is meant to be a sector wide standard interface called SRU. The location of the SRU interface for each Aquabrowser interface seems to be in a standard place – http://aquabrowser-url/aquabrowser-path/sru.ashx? (in this case librarycatalogue.warwickshire.gov.uk/abwarwick/sru.ashx?) – then followed by the relevant SRU parameters (e.g. operation=explain&version=1.1&query=dogs&maximumRecords=100&recordSchema=dc – see http://www.loc.gov/standards/sru/ for full documentation on SRU). However, there are several issues with the Aquabrowser implementation of SRU. I’m not clear which issues are Aquabrowser issues, and which are down to local configuration decisions, but the result is a rather limited interface. In this case the main problems for me was that it didn’t allow me to limit a search by title, and it didn’t include the ISBN in the results.
As well as the ‘standard’ SRU interface, Aquabrowser also supports a proprietary interface which will provide search results as XML. This is pretty much undocumented as far as I can tell, but it seems you can take some of the URLs from the HTML interfaces and add the parameter ‘output=xml’ to the URL (I’m not 100% sure, but my guess is that the HTML interface actually is built over the XML API). A basic request looks like http://aquabrowser-url/aquabrowser-path/result.ashx?q=emily%20gravett&output=xml (in this case http://librarycatalogue.warwickshire.gov.uk/abwarwick/result.ashx?q=emily%20gravett&output=xml).
The default HTML interface uses frames which obscures some of what is happening – however, if you use the ‘Accessible’ version, the syntax in the URLs can be seen easily. Using this method I was able to copy the syntax for limiting the results to stock available in a specific branch – in this case the parameter required is ‘branch=LEA’.
The results returned by the XML interface in the Warwickshire instance include information lacking from the SRU interface – and ISBN is one of the elements included. However searching by title is still not possible as far as I can tell – although I came up with a work around for this.
So, once I have the list of titles from the loan history, I use each title to run a search against the Aquabrowser XML interface, limiting the results to stock in the Leamington Spa branch. Since I can’t limit to just title search instead I check the contents of the <title> tag in each record contained in the XML response (note in the XML response Aquabrowser wraps matches to your search terms <exact> tags so you have to ignore these when comparing search phrases). This technique also means I can do more exact matching (e.g. including case sensitivity) than a search would let me.
I decided that if I found duplicates having done this filtering I’d take each duplicate record and include them in my results for later manual tweaking.
So now I have a set of bibliographic records with generally a title, an author (or author statement) and an ISBN. An added bonus is a URL for a cover image for the book is in the response, so I grab that as well (it’s an Amazon cover image). What I don’t have is the price (price may well be in the library record somewhere, but it isn’t included in the XML response). So I then use the ISBN to look up the price on Amazon – I tried a couple of Ruby libraries designed to work with the Amazon ‘Product Advertising’ API (which is the one you need for stuff like prices), but the only one I really liked was ‘Sucker‘ – it is very lightweight, doesn’t hide the Amazon API that much, and again has Nokogiri somewhere underneath (and some other stuff) – which means dealing with the results is relatively straightforward if you’ve used Nokogiri.
So, from the Amazon response I grab the price (I check it is in GBP first) and also a URL for the Amazon product page (so I can provide a link back to Amazon). (This isn’t perfect – despite the library catalogue searching above, I don’t get ISBNs for every item and so can’t search them all on Amazon).
Now I have all the information I need but I don’t have a way to present it (and I still have some unwanted duplicate entries in the books list). For this I went back to a tactic I used previously for the (currently defunct) ‘What to Watch‘ application – importing the results of a script into WordPress as a series of blog posts. In this case created a csv file which allowed me to easily add in a few tags – such as names to indicate which account (mine, Damyanti’s or our son’s) the item had been borrowed on. The ‘text’ of the blog post was some simple html to display the cover image and a few links – including an option to add the item to a LibraryThing account in a single click. I also included some WordPress ‘custom fields’ to record the title, ISBN and price – to enable to additional functions in the blog.
Using the ‘csv importer’ WordPress plugin (a great plugin) to import the csv file the content was now in the blog. This is the point at which we removed those duplicate book entries (this did require us to remember what we’d borrowed!). The final step was to actually do the calculation I originally wanted to do – the cost of the books we’d borrowed. This final step was achieved by writing a WordPress plugin to support the display of a Widget which uses SQL to get values from the ‘custom field’ storing price information. Since I haven’t got prices for all books, I use those I have got, work out an average per book, and multiply this up by the total number of loans.
The final result is the ‘Overdue Books‘ blog showing every book we have borrowed as a family since September 2009 at http://www.meanboyfriend.com/overdue_books/. The headline figure is that if we’d bought those items we’ve borrowed in this time the cost would have been approximately £3,428.26 – or around £190 per month – or £44 pounds a week. I think it is fair to say even a relatively well off family would find this steep!
Finally the code is available at https://github.com/ostephens/Savelibs – there are all kinds of things I’m unhappy about and want to improve – including modifying it so it can be used to regularly update the Overdue Books blog without any manual intervention from me – and if anyone wants to take it and make something that any library user could use to do the same type of calculation feel free (I’d like to do this, but not sure where I’m going to get the time).
In an oversight last night I forgot to add my apologies and thanks to the IT people at Warwickshire council and library – firstly they chose a system for their library catalogue with an API (despite some shortcomings, this is a big step!) and secondly they didn’t shout at me when I repeatedly ran hundreds of searches against their catalogue in quick succession, but rather got in touch politely so we could work together to ensure the scripts weren’t going to cause a problem on the system – if you are thinking of hacking stuff around like this, probably as well to get in touch with system owners first!
What I do hope is that there might be some opportunity for developers like me to work with the library system and data – it would be really good if we could see applications built on top of the library systems APIs (such as this one which won the first ‘Hack Warwickshire’ competition) – the attitude Warwickshire have to open data and wanting to get developers involved is really refreshing 🙂
5 thoughts on “Counting the cost”
Did some fiddling around when I was looking at doing a mobile interface for the library catalogue, and found that q=title:something seemed to work, so:
You still get the nonexact versions in there too, but if you surround the query in quotes, for example:
…it seems to come back with the exact titled versions, although I’ve not had a chance to test this. There might be a way of combining field searches, (so everything a particular title and a particular author), but I’ve not worked that out yet.
Can query on the ISBN with this too, so:
Ah – cool. I had this syntax but with an additional space between ‘title:’ and the search term – which simply added ‘title’ as another keyword – thanks!
A quick look at the phrase searching however doesn’t look successful – will have a further look.
Whups, yes – I think it just stops those nonexact results from appearing rather than an entire phrase, at least on the queries I’ve tried, which is a bit disappointing.
Boolean operators also seem to work, so:
…but again, I’ve not done much testing of this.