To scrape or not to scrape?

I’m currently participating in the #willhack online hackathon. This is an event being run by EDINA at the University of Edinburgh, as part of their Will’s World project, which in turn is part of the Jisc Discovery Programme.

The Discovery Programme came out of a Jisc task force looking at how ‘resource discovery’ might be improved for researchers in UK HE. The taskforce (catchily known as RDTF) outlined a vision, based on the idea of the publication of ‘open’ metadata (open in terms of licensing and access) by data owners, and the building of ‘aggregations’ of data with APIs etc., which would provide platforms for the development of user facing services by service providers.

The Will’s World project is building an aggregation of data relating to Shakespeare, and the idea of the hackathon is to test the theory that this type of aggregation can be a platform for building services.

As usual when getting involved in this kind of hackathon, I spent quite a lot of time unsure exactly what to do. The Will’s World registry has data from a number of sources, and the team have also posted other data sources that might be used, including xml markup of the plays.

I played around with some sample queries on the main registry (it supports the SOLR API for queries), but didn’t get that far – it was hard to know what you’d get back from any particular query, and I struggled to know what queries to throw at it beyond the title of each play – which inevitably brought back large numbers of hits.

I also had a couple of other data sources I was interested in – one was Theatricalia – a database of performances of plays in the UK including details of venue, play, casts etc. This is crowdsourced data, and the site was created and is maintained by Matthew Sommerville (@dracos).

The other was a database called ‘Designing Shakespeare‘. Designing Shakespeare was originally an AHDS (Arts and Humanities Data Service) project, which is now hosted by RHUL (Royal Holloway, University of London). The site contains information about London-based Shakespeare productions including the cast lists, pictures from productions, interviews with designers and even VRML models of the key theatres spaces. Designing Shakespeare is one of those publicly funded resources that I think never gets the exposure or love it deserves – a really interesting resource that is (probably) underused (I don’t have any stats on usage, so that’s just me being pessimistic!)

Both these sites about performance made me think there was potential to link plays with performance information, and then maybe some other information from the Will’s World registry. I liked the idea of using the cast lists to add some interest – many of the London based performances at least have that “oh I didn’t realise X played Y” feel to them (Sean Bean as Puck anyone?). Unfortunately neither Theatricalia nor Designing Shakespeare have an API to get at their data programmatically. So I decided I’d write a scraper to extract the cast lists. Having done some quick checking, I found the Designing Shakespeare cast lists tended to be more complete, so I decided to scrape those first. While there is lots of information about the copyright nature of many materials on the Designing Shakespeare site (pictures/audio/video etc.) there is no mention of copyright on the cast lists. Since this is very factual data and my only reason for extracting the data was to point back at the website, I felt reasonably OK scraping this data out.

As always with these things, it wasn’t as straightforward as I hoped, and it’s taken me much longer to get the data out than I expected. Time I probably should have spent actually developing the main product I want to produce for the hack, but now it’s done (using the wonderful ScraperWiki) –  you can access all the data at https://scraperwiki.com/scrapers/designing_shakespeare_cast_lists/ – around 24,000 role/actor combinations (that seems very high – I’m hoping that the data is all good at the moment!)

You can access the data via API or using SQL – I hope others will find it useful as well.

Now I need to find some time to move onto the next part of my hack – my aim is to build a WordPress plugin that starts from the basis of plays expressed as a series of WordPress posts, and adds, via widgets, links to other sources based on the text you are viewing – this will be hopefully a mixture of  at least the cast lists I’ve extracted, and searches of the Will’s World registry. Other stuff if I have time!

If anyone want’s to collaborate on this, I’d be very happy to do so. I’ll be posting code (once I have any) on GitHub at https://github.com/ostephens/shakespearepress.

 

 

4 thoughts on “To scrape or not to scrape?

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.