Shakespeare as you like it

December 19, 2012 · ostephens

This is a slightly delayed final post on my Will Hack entry – which I’m really happy to say won the “Best Open Hack” prize in the competition.

I should start by acknowledging the other excellent work done in the competition, with special mention of the overall winner, a ‘second screen’ app by Kate Ho and Tom Salyers to use when watching a Shakespeare play. Thanks also to the team behind the whole Will Hack event at Edina – Muriel, Nicola, Neil and Richard – the idea of an online hackathon was great, and I hope they’ll be writing up the experience of running it.

I presented my hack as part of the final Google+ Hangout and you can watch the video on YouTube. Here I’ll describe the hack, but also reflect on the nature and potential of the Will’s World Registry which I used in the hack.

The Will’s World Registry and #willhack are part of the Jisc Discovery Programme – which is a programme I’ve been quite heavily involved in. The idea of an ‘aggregation’ which brings together data from multiple sources (which is what the Will’s World Registry does) was part of the original Vision document which informed the setting up of the Discovery programme. When I sat down to start my #willhack, I really wanted to see if the registry fulfilled any of the promise that the Vision outlined.

The Hack

When I started looking at the registry, it was difficult to know what I should search for – what data was in there? what searches would give interesting results? So I decided that rather than trying to construct a search service over the top of the registry (which uses Solr and so supports the Solr API – see the Querying Data section in this tutorial), I’d see if I could extract relevant data from the plays (e.g. names of characters, places mentioned etc.) and use those to create queries for the registry and return relevant results.

It seemed to me this approach could provide a set of resources alongside the plays – a starting point for someone reading the play and wanting more information. As I’ve done with previous hacks I decided that I’d use WordPress as a platform to deliver the results, and so would build the hack as a WordPress plugin. I did consider using Moodle, the learning management system, instead of WordPress, as I wanted the final product to be something that could be used easily in an educational context – however in the end, I went with WordPress as having a larger audience.

The first thing I wanted to do was import the text of the plays into a WordPress install, and then start extracting relevant keywords to throw at the registry. This ended up being a lot more time-consuming than I expected. I got the basics working quite quickly: downloading a play in xml (I used the simple xml provided by the Will’s World team at http://wwsrv.edina.ac.uk/wworld/plays/index) and creating posts. However, it turned out my decision to create one WP post per line of dialogue was an expensive one – creating posts in WP seems to be quite a slow process, and so it would take minutes to load a play. This in turn led to timeout issues – the web server would time out while waiting for the php script to run. It took me some considerable time to amend the import process to import only one act at a time based on the user pressing a ‘Next’ button. The final result isn’t ideal but it does the job – improving it would take a rewrite of the code, inserting posts directly using SQL rather than the pre-packaged ‘insert post’ function provided by WordPress. I also realised very late on in the hack that my own laptop was a hell of a lot faster than my online host – and I could have avoided these issues altogether if I’d done development locally – but then I guess that would have detracted from the ‘open’ aspect I won a prize for!
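The act-at-a-time workaround can be sketched roughly as follows. This is Python rather than the plugin's PHP, and every name in it (`import_next_act`, the `state` dictionary, the `create_post` callback) is a hypothetical stand-in rather than the plugin's actual code. The idea is only that each request does a bounded amount of work and remembers where it stopped.

```python
from typing import Callable, Dict, List


def import_next_act(
    acts: List[List[str]],
    state: Dict[str, int],
    create_post: Callable[[str], None],
) -> bool:
    """Import one act per call and report whether more acts remain.

    `state` is assumed to persist between requests (for example in a
    database option), so each press of 'Next' only does one act's work.
    """
    index = state.get("next_act", 0)
    if index >= len(acts):
        return False  # nothing left to import
    for line in acts[index]:
        create_post(line)  # the slow step, now bounded per request
    state["next_act"] = index + 1
    return state["next_act"] < len(acts)


if __name__ == "__main__":
    play = [["Line 1.1", "Line 1.2"], ["Line 2.1"]]
    saved_state: Dict[str, int] = {}
    posts: List[str] = []
    # Each loop iteration stands in for one click of the 'Next' button.
    while import_next_act(play, saved_state, posts.append):
        pass
    print(len(posts))
```

The return value tells the caller whether to show the 'Next' button again, which keeps the stepping logic out of the page that drives it.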

I also spent some time working out how to customise WordPress to display the play text in a useful way. I drew on some experience of developing ‘multi faceted documents’ in WordPress – although I didn’t go quite as far as I would have liked. I also benefited from using WordPress and having access to existing WordPress plugins. For example, putting up a tag cloud immediately gives you information on which characters in the play have the most lines (as I automatically tagged each post with the speaking character name, as well as the Act and Scene it was in).

As I developed I posted all the code to Github, and kept an online demonstration site up to date with the latest version of the plugin.

I now had a working plugin that imported the play and displayed it in a useful way. So I was ready to go back to the purpose of the hack – to draw data out of the registry. Earlier when I’d been looking for ideas of what to do I’d also created a data store on ScraperWiki of cast lists from various productions of Shakespeare plays and so an obvious starting point was a page per character in the play that would display this information, plus results from the registry.

I started to put this together. Unfortunately, as you can see in the tag cloud, the character names I got from the xml are actually ‘CharIDs’ and have had white space removed – meaning that while this approach works well for ‘Beatrice’ it fails for ‘Don Pedro’, which is converted to ‘DONPEDRO’. I could solve this either by using a different source for the plays, or by linking with the work Richard Light did for #willhack where he established URIs for each character with their real name and this CharID from the xml. (I also threw some schema.org markup into each post – this was a bit of an after-thought and I’m not sure if it is useful, or indeed is marked up correctly!)
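One way the CharID problem might be handled is a small lookup table with a fallback, sketched here in Python. The table contents and the `display_name` helper are assumptions for illustration; a real table would need to come from a source that pairs each CharID with the character's real name.

```python
from typing import Dict

# Hypothetical lookup table, seeded with the one failing case noted above.
DISPLAY_NAMES: Dict[str, str] = {
    "DONPEDRO": "Don Pedro",
}


def display_name(char_id: str, names: Dict[str, str] = DISPLAY_NAMES) -> str:
    """Return a readable name for a CharID, falling back to title case.

    The fallback is fine for single-word names such as BEATRICE, but it
    cannot restore spaces, so multi-word names need a table entry.
    """
    return names.get(char_id, char_id.title())


if __name__ == "__main__":
    for char_id in ("BEATRICE", "DONPEDRO"):
        print(char_id, "->", display_name(char_id))
```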

I found immediately that just throwing a character name at the registry didn’t return great results (perhaps not surprisingly) but combining the character name with the name of the play was not bad generally. Where it tended to provide less interesting results was where the character name was also in the name of the play – for example searching “Romeo AND Romeo and Juliet” doesn’t improve over just searching for “Romeo and Juliet” – and the top hits are all versions of the play from the Open Library data – which is OK, but to be honest not very interesting.
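The query construction described above can be sketched in Python along these lines. The endpoint URL is a placeholder, not the registry's real address, and the function names are made up for the example. Dropping the character term when the play title already contains it is my own assumption about how to avoid the redundant “Romeo AND Romeo and Juliet” case, not something the plugin necessarily does. The `q`, `wt` and `rows` parameters are standard Solr select parameters.

```python
from urllib.parse import urlencode

# Placeholder endpoint: substitute the registry's real Solr select URL.
REGISTRY_SELECT_URL = "http://example.org/solr/select"


def quote_phrase(text: str) -> str:
    """Wrap text in double quotes so Solr treats it as one phrase."""
    return '"' + text.replace('"', '\\"') + '"'


def build_query(character: str, play: str) -> str:
    """Combine character and play, unless the title already names the character."""
    if character.lower() in play.lower():
        return quote_phrase(play)
    return quote_phrase(character) + " AND " + quote_phrase(play)


def build_url(character: str, play: str, rows: int = 10) -> str:
    """Build a Solr select URL asking for a JSON response."""
    params = {"q": build_query(character, play), "wt": "json", "rows": rows}
    return REGISTRY_SELECT_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_query("Dogberry", "Much Ado About Nothing"))
    print(build_query("Romeo", "Romeo and Juliet"))
    print(build_url("Dogberry", "Much Ado About Nothing"))
```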

However, at its best, quite a basic approach here creates some interesting results, such as this one for the character of Dogberry from Much Ado About Nothing:

and this interesting snippet from the Culture Grid

The British Museum data proved particularly rich having many images of Shakespearean characters.

I finished off the hack with creating a ‘summary’ page for the play, which tries to get a summary of the play via the DuckDuckGo API (in turn this tends to get the data from Wikipedia – but the MediaWiki API seemed less well documented and harder to use). It also tries to get a relevant podcast and epub version of the play from the Oxford University “Approach Shakespeare” series.
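For anyone curious what the summary lookup involves, here is a minimal Python sketch against the DuckDuckGo Instant Answer API. It assumes the JSON response carries the summary in `AbstractText` and the origin in `AbstractSource`; the function names are invented for the example, and the plugin itself is written in PHP, so this is not its code.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

DDG_API = "https://api.duckduckgo.com/"


def summary_url(play_title: str) -> str:
    """Build an Instant Answer request that returns plain-text JSON."""
    params = {"q": play_title, "format": "json", "no_html": 1}
    return DDG_API + "?" + urlencode(params)


def extract_summary(response: dict) -> Optional[str]:
    """Pull the abstract out of a decoded response, or None if empty."""
    text = response.get("AbstractText", "")
    if not text:
        return None
    source = response.get("AbstractSource", "")
    return f"{text} (source: {source})" if source else text


def fetch_summary(play_title: str) -> Optional[str]:
    """Fetch and extract a summary; needs network access."""
    with urlopen(summary_url(play_title), timeout=10) as handle:
        return extract_summary(json.load(handle))
```

Keeping `extract_summary` separate from the network call means the page can fall back gracefully when no abstract comes back.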

Once the posts and pages have been created they are static – they are created on the fly, but don’t update at all. This means that the blog owner can then edit them as they see fit – maybe adding in descriptions of the characters, or annotating parts of the play etc. All the usual WordPress functionality is available so you could add more plugins etc. (although the play layout depends on you using the Theme that is delivered as part of the plugin).

I think this could be a great starting point for creating a resource aimed at schools – a teacher gets a great starting point, a website with the play text and pointers to more resources. I hope that illustrations of characters, and information about people who have played them (especially where that’s a recognisable name like Sean Bean playing Puck) bring the play to life a bit and give some context. It also occurred to me that I could create some ‘plan a trip’ pages which would present resources from a particular collection – like the British Museum – in a single page, pointing at objects you could look at when you visited the museum.

You can try the plugin right now – just set up a clean WordPress install, and download the ShakespearePress plugin from Github, drop it into the ‘plugins’ directory, install it via the WP admin interface, then go to the settings page (under the settings menu) – and it will walk you through the process. All feedback very welcome. You can also browse a site created by the plugin at http://demonstrators.ostephens.com/willhack.

I was going to comment on my experience of using the Registry here, but I’ve already gone on too much – that will have to be a separate post!

The time is out of joint

December 10, 2012 · ostephens

This is an update on my progress with my #willhack project. As I wrote in my previous post:

my aim is to build a WordPress plugin that starts from the basis of plays expressed as a series of WordPress posts, and adds, via widgets, links to other sources based on the text you are viewing – this will be hopefully a mixture of at least the cast lists I’ve extracted, and searches of the Will’s World registry. Other stuff if I have time!

I’ve made some progress with this, and there is now code for a WordPress plugin up on GitHub although it’s still a work in progress.

I decided to use WordPress as in the past I’ve found this a shortcut for getting a hack up on the web in a usable form quickly, and the widespread use of WordPress means that by packaging the hack as a WordPress plugin I can make it accessible to a wide audience easily. I did consider using Moodle as an alternative to WordPress to target an audience in education more directly, but thought I’d probably find WordPress quicker.

I was able to write a plugin which loaded the text of the first Act of “Much Ado about Nothing” using the XML version of the text provided by the Will’s World service at http://wwsrv.edina.ac.uk/wworld/plays/Much_Ado_about_Nothing.xml. Each ‘paragraph’ in the play – which equates to either a stage direction, or a piece of dialogue by a character – is made into a WordPress post. The plugin also includes a WordPress Theme which is automatically applied, which subverts the usual ‘blog’ style used by WordPress to group all these posts by Act and Scene. The default view is the whole of the text available, but using built-in WordPress functionality you can view a single Act, a single Scene or a single line of dialogue. Using the built-in ‘commenting’ facility you can annotate or comment on any line of dialogue.
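The paragraph-to-post step might look something like the Python sketch below. The sample XML is made up: its element and attribute names (`act`, `scene`, `paragraph`, `charid`) are assumptions for illustration and will not match the real Will’s World play files, and the plugin does this in PHP through WordPress rather than with plain dictionaries.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Made-up sample: element and attribute names are assumptions only.
SAMPLE = """
<play title="Much Ado about Nothing">
  <act num="1">
    <scene num="1">
      <paragraph type="stage">Sample stage direction.</paragraph>
      <paragraph type="speech" charid="BEATRICE">First sample line.</paragraph>
      <paragraph type="speech" charid="LEONATO">Second sample line.</paragraph>
    </scene>
  </act>
</play>
"""


def play_to_posts(xml_text: str) -> List[Dict[str, object]]:
    """Turn each paragraph into a post-like dict tagged by act and scene."""
    posts: List[Dict[str, object]] = []
    root = ET.fromstring(xml_text)
    for act in root.iter("act"):
        for scene in act.iter("scene"):
            for para in scene.iter("paragraph"):
                tags = [f"act-{act.get('num')}", f"scene-{scene.get('num')}"]
                char_id = para.get("charid")
                if char_id:  # stage directions have no speaker
                    tags.append(char_id.lower())
                posts.append({"content": (para.text or "").strip(), "tags": tags})
    return posts


if __name__ == "__main__":
    for post in play_to_posts(SAMPLE):
        print(post["tags"], post["content"])
```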

Each ‘post’ is also tagged with the name of the character speaking the line. This means it is possible to view all lines by a particular role, and also, via a Tag cloud, to see easily who gets the most lines.
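The tag cloud is effectively a frequency count over the character tags. A small Python sketch of that count, with a hypothetical `lines_per_character` helper and invented sample data, could look like this:

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def lines_per_character(
    post_tags: Iterable[List[str]], characters: Set[str]
) -> List[Tuple[str, int]]:
    """Count posts per character tag, most lines first.

    Tags outside `characters` (such as act and scene tags) are ignored.
    """
    counts = Counter(
        tag for tags in post_tags for tag in tags if tag in characters
    )
    return counts.most_common()


if __name__ == "__main__":
    # Invented sample data: one list of tags per post.
    sample = [
        ["act-1", "scene-1", "beatrice"],
        ["act-1", "scene-1", "leonato"],
        ["act-1", "scene-1", "beatrice"],
    ]
    print(lines_per_character(sample, {"beatrice", "leonato"}))
```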

So far, so good. Unfortunately at this point I hit a problem – which was when I tried to load an entire play (rather than just the first Act), I consistently got a http status 500 error. I’m pretty sure this is because either PHP or Apache is hitting a timeout limit when it tries to create so many posts in one go (each line of dialogue is a post so a play is hundreds of posts). I spent far too long trying to tweak the timeout limits to resolve this problem with no luck.

The time I spent on the timeout issue I really should have been spending developing the planned widget to display contextual information from other sources such as cast lists, and the Will’s World Registry. I’d originally hoped I might be able to achieve something approaching “Every story has a beginning” – an example of a linked data + javascript ‘active’ document from Tim Sherratt (@wragge). I’ve now had to concede to myself that I’m not going to have time to do this (although I have put some rudimentary linked data markup into the play text for each speaking character, using schema.org markup).

So, instead of a widget I’ve decided to create a page for each character in the play which uses at least data from the cast lists I’ve scraped and the Will’s World registry. As I’ve already got timeout issues when getting the play text into WordPress, I’m going to have to work out a way of adding the creation of these pages without hitting timeout issues again. I think I’ve got an approach which, although a bit clunky, should break both tasks (creation of the play text as posts, and creation of ‘character’ pages) into smaller chunks that the person installing the plugin can activate in turn – so I just need to make sure no single step takes too long to carry out. I’m sure there must be a better way of doing this but I haven’t found any decent examples so far – I suspect one approach would be to break down into steps as I’m proposing to do, but trigger each step in turn by some javascript rather than require the user to click a ‘next’ button, but I don’t want to spend any more time on this than is absolutely necessary at this stage.

If you want to try out the plugin as it currently stands you can get the code from Shakespearepress on GitHub (N.B. don’t install on a WordPress blog you care about at the moment – it creates lots of posts, and changes the theme and it might break). Once the plugin is activated you need to go to the plugin settings page to import the text of a play (at the moment it supports the first Act of Much Ado about Nothing or the first Act of King Lear).

If you want to see an example of a WordPress blog that has been Shakespeared then take a look at http://demonstrators.ostephens.com/willhack, which uses the first act of Much Ado about Nothing:

All of Act 1: http://demonstrators.ostephens.com/willhack/tag/act-1/
Act 1, Scene 1 only: http://demonstrators.ostephens.com/willhack/tag/scene-1
All lines by Beatrice: http://demonstrators.ostephens.com/willhack/tag/beatrice
Prototype Character page for Beatrice using data from https://scraperwiki.com/scrapers/designing_shakespeare_cast_lists/: http://demonstrators.ostephens.com/willhack/beatrice-3/
Google Rich Snippets view of the embedded microdata: http://www.google.com/webmasters/tools/richsnippets?url=http%3A%2F%2Fdemonstrators.ostephens.com%2Fwillhack&html=

Hopefully before the submission deadline I can extend the Character pages and sort the installation process so you can load a whole play and create a page per (main?) character.

To scrape or not to scrape?

December 6, 2012 · ostephens

I’m currently participating in the #willhack online hackathon. This is an event being run by EDINA at the University of Edinburgh, as part of their Will’s World project, which in turn is part of the Jisc Discovery Programme.

The Discovery Programme came out of a Jisc task force looking at how ‘resource discovery’ might be improved for researchers in UK HE. The taskforce (catchily known as RDTF) outlined a vision, based on the idea of the publication of ‘open’ metadata (open in terms of licensing and access) by data owners, and the building of ‘aggregations’ of data with APIs etc., which would provide platforms for the development of user facing services by service providers.

The Will’s World project is building an aggregation of data relating to Shakespeare, and the idea of the hackathon is to test the theory that this type of aggregation can be a platform for building services.

As usual when getting involved in this kind of hackathon, I spent quite a lot of time unsure exactly what to do. The Will’s World registry has data from a number of sources, and the team have also posted other data sources that might be used, including xml markup of the plays.

I played around with some sample queries on the main registry (it supports the Solr API for queries), but didn’t get that far – it was hard to know what you’d get back from any particular query, and I struggled to know what queries to throw at it beyond the title of each play – which inevitably brought back large numbers of hits.

I also had a couple of other data sources I was interested in – one was Theatricalia – a database of performances of plays in the UK including details of venue, play, casts etc. This is crowdsourced data, and the site was created and is maintained by Matthew Somerville (@dracos).

The other was a database called ‘Designing Shakespeare’. Designing Shakespeare was originally an AHDS (Arts and Humanities Data Service) project, which is now hosted by RHUL (Royal Holloway, University of London). The site contains information about London-based Shakespeare productions including the cast lists, pictures from productions, interviews with designers and even VRML models of the key theatre spaces. Designing Shakespeare is one of those publicly funded resources that I think never gets the exposure or love it deserves – a really interesting resource that is (probably) underused (I don’t have any stats on usage, so that’s just me being pessimistic!).

Both these sites about performance made me think there was potential to link plays with performance information, and then maybe some other information from the Will’s World registry. I liked the idea of using the cast lists to add some interest – many of the London based performances at least have that “oh I didn’t realise X played Y” feel to them (Sean Bean as Puck anyone?). Unfortunately neither Theatricalia nor Designing Shakespeare have an API to get at their data programmatically. So I decided I’d write a scraper to extract the cast lists. Having done some quick checking, I found the Designing Shakespeare cast lists tended to be more complete, so I decided to scrape those first. While there is lots of information about the copyright nature of many materials on the Designing Shakespeare site (pictures/audio/video etc.) there is no mention of copyright on the cast lists. Since this is very factual data and my only reason for extracting the data was to point back at the website, I felt reasonably OK scraping this data out.

As always with these things, it wasn’t as straightforward as I hoped, and it’s taken me much longer to get the data out than I expected. Time I probably should have spent actually developing the main product I want to produce for the hack, but now it’s done (using the wonderful ScraperWiki) – you can access all the data at https://scraperwiki.com/scrapers/designing_shakespeare_cast_lists/ – around 24,000 role/actor combinations (that seems very high – I’m hoping that the data is all good at the moment!)

You can access the data via API or using SQL – I hope others will find it useful as well.
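To give a flavour of the kind of SQL query this makes possible, here is a self-contained Python sketch using an in-memory SQLite database. The table name (`swdata`) and column names (`play`, `role`, `actor`) are assumptions and may not match the scraper's actual schema, and the second row is invented sample data.

```python
import sqlite3
from typing import List, Tuple

# In-memory stand-in for the scraped data; schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE swdata (play TEXT, role TEXT, actor TEXT)")
conn.executemany(
    "INSERT INTO swdata VALUES (?, ?, ?)",
    [
        ("A Midsummer Night's Dream", "Puck", "Sean Bean"),
        ("Much Ado about Nothing", "Beatrice", "Example Actor"),  # invented row
    ],
)


def actors_for_role(connection: sqlite3.Connection, role: str) -> List[Tuple[str, str]]:
    """Return (actor, play) pairs for everyone recorded in a given role."""
    rows = connection.execute(
        "SELECT actor, play FROM swdata WHERE role = ? ORDER BY actor",
        (role,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    print(actors_for_role(conn, "Puck"))
```

Using a parameterised query (the `?` placeholder) rather than building the SQL string by hand avoids quoting problems with names like “A Midsummer Night's Dream”.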

Now I need to find some time to move onto the next part of my hack – my aim is to build a WordPress plugin that starts from the basis of plays expressed as a series of WordPress posts, and adds, via widgets, links to other sources based on the text you are viewing – this will be hopefully a mixture of at least the cast lists I’ve extracted, and searches of the Will’s World registry. Other stuff if I have time!

If anyone wants to collaborate on this, I’d be very happy to do so. I’ll be posting code (once I have any) on GitHub at https://github.com/ostephens/shakespearepress.

Overdue Ideas

Ideas linking Libraries, Computing, E-learning, and anything else that springs to mind.
