Shakespeare as you like it

This is a slightly delayed final post on my Will Hack entry – which I’m really happy to say won the “Best Open Hack” prize in the competition.

I should start by acknowledging the other excellent work done in the competition, with special mention of the overall winner – a ‘second screen’ app by Kate Ho and Tom Salyers to use when watching a Shakespeare play. Thanks also to the team behind the whole Will Hack event at Edina – Muriel, Nicola, Neil and Richard – the idea of an online hackathon was a great one, and I hope they’ll be writing up the experience of running it.

I presented my hack as part of the final Google+ Hangout and you can watch the video on YouTube. Here I’ll describe the hack, but also reflect on the nature and potential of the Will’s World Registry which I used in the hack.

The Will’s World Registry and #willhack are part of the Jisc Discovery Programme – which is a programme I’ve been quite heavily involved in. The idea of an ‘aggregation’ which brings together data from multiple sources (which is what the Will’s World Registry does) was part of the original Vision document which informed the setting up of the Discovery programme. When I sat down to start my #willhack, I really wanted to see if the registry fulfilled any of the promise that the Vision outlined.

The Hack

When I started looking at the registry, it was difficult to know what I should search for – what data was in there? what searches would give interesting results? So I decided that rather than trying to construct a search service over the top of the registry (which uses Solr and so supports the Solr API – see the Querying Data section in this tutorial), I’d see if I could extract relevant data from the plays (e.g. names of characters, places mentioned etc.) and use those to create queries for the registry and return relevant results.
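
For anyone who hasn’t used Solr before, querying the registry is basically just an HTTP GET against a Solr ‘select’ endpoint – something like the sketch below. Note that the base URL and the field name used here are guesses for illustration, so check the Will’s World documentation for the real ones.

<?php
// Rough sketch of querying the Will's World registry via its Solr API.
// The endpoint URL below is an assumption for illustration only.
$solrBase = 'http://wwsrv.edina.ac.uk/solr/select'; // hypothetical endpoint

$url = $solrBase . '?' . http_build_query(array(
    'q'    => '"Much Ado About Nothing"',
    'wt'   => 'json',   // ask Solr for a JSON response
    'rows' => 10,       // just the first ten hits
));

$results = json_decode(file_get_contents($url), true);

// Standard Solr JSON layout: response -> docs
if (isset($results['response']['docs'])) {
    foreach ($results['response']['docs'] as $doc) {
        // 'title' is a guess at a field name in the registry's schema
        echo (isset($doc['title']) ? $doc['title'] : '[no title]') . "\n";
    }
}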

It seemed to me this approach could provide a set of resources alongside the plays – a starting point for someone reading the play and wanting more information. As I’ve done with previous hacks I decided that I’d use WordPress as a platform to deliver the results, and so would build the hack as a WordPress plugin. I did consider using Moodle, the learning management system, instead of WordPress, as I wanted the final product to be something that could be used easily in an educational context – however in the end I went with WordPress as it has a larger audience.

The first thing I wanted to do was import the text of the plays into a WordPress install, and then start extracting relevant keywords to throw at the registry. This ended up being a lot more time consuming than I expected. I got the basics of downloading a play in XML (I used the simple XML provided by the Will’s World team at http://wwsrv.edina.ac.uk/wworld/plays/index) and creating posts working quite quickly. However, it turned out my decision to create one WP post per line of dialogue was an expensive one – creating posts in WP seems to be quite a slow process, and so it would take minutes to load a play. This in turn led to timeout issues – the web server would time out while waiting for the PHP script to run. It took me some considerable time to amend the import process to import only one act at a time, based on the user pressing a ‘Next’ button. The final result isn’t ideal but it does the job – to improve this would take a re-write of the code, inserting posts directly using SQL rather than the pre-packaged ‘insert post’ function provided by WordPress. I also realised very late on in the hack that my own laptop was a hell of a lot faster than my online host – and I could have avoided these issues altogether if I’d done development locally – but then I guess that would have detracted from the ‘open’ aspect I won a prize for!
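
To give an idea of what the import is doing, here is an illustrative sketch rather than the actual plugin code – it assumes it runs inside WordPress, where wp_insert_post() and wp_set_post_tags() are available:

<?php
// Turn one line of dialogue into a WordPress post. wp_insert_post() fires
// hooks and updates taxonomies and caches on every call, which is why
// importing hundreds of lines in a single request is so slow.
function shakespearepress_insert_line($speaker, $text, $act, $scene) {
    $post_id = wp_insert_post(array(
        'post_title'   => $speaker,
        'post_content' => $text,
        'post_status'  => 'publish',
        'post_type'    => 'post',
    ));

    if ($post_id && !is_wp_error($post_id)) {
        // Tag the post with the speaker and its Act/Scene so the built-in
        // tag cloud and tag archives can be used to navigate the play.
        wp_set_post_tags($post_id, array($speaker, "Act $act", "Scene $scene"));
    }
    return $post_id;
}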

Willhack Much Ado Tag Cloud

I also spent some time working out how to customise WordPress to display the play text in a useful way. I drew on some experience of developing ‘multi faceted documents’ in WordPress – although I didn’t go quite as far as I would have liked. I also benefited from using WordPress and having access to existing WordPress plugins. For example, putting up a tag cloud immediately gives you information on which characters in the play have the most lines (as I automatically tagged each post with the speaking character’s name, as well as the Act and Scene it was in).

As I developed I posted all the code to Github, and kept an online demonstration site up to date with the latest version of the plugin.

I now had a working plugin that imported the play and displayed it in a useful way. So I was ready to go back to the purpose of the hack – to draw data out of the registry. Earlier when I’d been looking for ideas of what to do I’d also created a data store on ScraperWiki of cast lists from various productions of Shakespeare plays and so an obvious starting point was a page per character in the play that would display this information, plus results from the registry.

I started to put this together. Unfortunately, as you can see in the tag cloud, the character names I got from the XML are actually ‘CharIDs’ and have had white space removed – meaning that while this approach works well for ‘Beatrice’ it fails for ‘Don Pedro’, which is converted to ‘DONPEDRO’. I could solve this either by using a different source for the plays, or by linking with the work Richard Light did for #willhack, where he established URIs for each character with their real name and this CharID from the XML. (I also threw some schema.org markup into each post – this was a bit of an after-thought and I’m not sure if it is useful, or indeed marked up correctly!)
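
For what it’s worth, the markup I had in mind is along these lines – a sketch rather than exactly what the plugin outputs (esc_html() is WordPress’s HTML-escaping helper):

<?php
// A guess at the sort of schema.org microdata that could be wrapped around
// a speaking character's name - not necessarily what the plugin outputs.
function shakespearepress_character_markup($name) {
    return '<span itemscope itemtype="http://schema.org/Person">'
         . '<span itemprop="name">' . esc_html($name) . '</span>'
         . '</span>';
}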

I found immediately that just throwing a character name at the registry didn’t return great results (perhaps not surprisingly), but combining the character name with the name of the play was generally not bad. Where it tended to provide less interesting results was where the character name was also in the name of the play – for example searching “Romeo AND Romeo and Juliet” doesn’t improve over just searching for “Romeo and Juliet” – and the top hits are all versions of the play from the Open Library data – which is OK, but to be honest not very interesting.

However, at its best, quite a basic approach here creates some interesting results, such as this one for the character of Dogberry from Much Ado About Nothing:

Dogberry page created by ShakespearePress WP plugin

and this interesting snippet from the Culture Grid

CultureGrid resource on Dogberry

The British Museum data proved particularly rich, having many images of Shakespearean characters.

I finished off the hack by creating a ‘summary’ page for the play, which tries to get a summary of the play via the DuckDuckGo API (in turn this tends to get the data from Wikipedia – but the MediaWiki API seemed less well documented and harder to use). It also tries to get a relevant podcast and EPUB version of the play from the Oxford University “Approaching Shakespeare” series.
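
The DuckDuckGo part is roughly along these lines – not the exact code from the plugin, and the response field names (AbstractText, AbstractURL) are from memory, so worth double-checking against the API documentation:

<?php
// Fetch a short summary of a play title from the DuckDuckGo API.
function shakespearepress_play_summary($play_title) {
    $url = 'http://api.duckduckgo.com/?' . http_build_query(array(
        'q'      => $play_title,
        'format' => 'json',
    ));
    $data = json_decode(file_get_contents($url), true);

    if (!empty($data['AbstractText'])) {
        return array(
            'summary' => $data['AbstractText'],
            'source'  => isset($data['AbstractURL']) ? $data['AbstractURL'] : '',
        );
    }
    return null; // no summary available for this title
}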

Once the posts and pages have been created they are static – they are generated during the import, but don’t update after that. This means that the blog owner can then edit them as they see fit – maybe adding in descriptions of the characters, or annotating parts of the play etc. All the usual WordPress functionality is available, so you could add more plugins etc. (although the play layout depends on using the theme that is delivered as part of the plugin).

I think this could be a great starting point for creating a resource aimed at schools – a teacher gets a ready-made website with the play text and pointers to more resources. I hope that illustrations of characters, and information about people who have played them (especially where that’s a recognisable name like Sean Bean playing Puck), bring the play to life a bit and give some context. It also occurred to me that I could create some ‘plan a trip’ pages which would present resources from a particular collection – like the British Museum – in a single page, pointing at objects you could look at when you visited the museum.

You can try the plugin right now – just set up a clean WordPress install, download the ShakespearePress plugin from GitHub, drop it into the ‘plugins’ directory, activate it via the WP admin interface, then go to the settings page (under the settings menu) – and it will walk you through the process. All feedback very welcome. You can also browse a site created by the plugin at http://demonstrators.ostephens.com/willhack.

I was going to comment on my experience of using the Registry here, but I’ve already gone on too much – that will have to be a separate post!

The time is out of joint

This is an update on my progress with my #willhack project. As I wrote in my previous post:

 my aim is to build a WordPress plugin that starts from the basis of plays expressed as a series of WordPress posts, and adds, via widgets, links to other sources based on the text you are viewing – this will hopefully be a mixture of at least the cast lists I’ve extracted, and searches of the Will’s World registry. Other stuff if I have time!

I’ve made some progress with this, and there is now code for a WordPress plugin up on GitHub although it’s still a work in progress.

I decided to use WordPress as in the past I’ve found this a shortcut for getting a hack up on the web in a usable form quickly, and the widespread use of WordPress means that by packaging the hack as a WordPress plugin I can make it accessible to a wide audience easily. I did consider using Moodle as an alternative to WordPress to target an audience in education more directly, but thought I’d probably find WordPress quicker.

I was able to write a plugin which loaded the text of the first Act of “Much Ado about Nothing” using the XML version of the text provided by the Will’s World service at http://wwsrv.edina.ac.uk/wworld/plays/Much_Ado_about_Nothing.xml. Each ‘paragraph’ in the play – which equates to either a stage direction or a piece of dialogue by a character – is made into a WordPress post. The plugin also includes a WordPress theme, which is automatically applied and which subverts the usual ‘blog’ style used by WordPress to group all these posts by Act and Scene. The default view is the whole of the text available, but using built-in WordPress functionality you can view a single Act, a single Scene or a single line of dialogue. Using the built-in ‘commenting’ facility you can annotate or comment on any line of dialogue.
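
To give an idea of the import step, the sketch below walks a play’s XML with SimpleXML. The element names (ACT, SCENE, SPEECH, SPEAKER, LINE) follow the common Bosak-style Shakespeare markup and are assumptions rather than the exact Will’s World structure, so adjust as necessary:

<?php
// Walk the play XML and turn each speech into a WordPress post.
$xml = simplexml_load_file(
    'http://wwsrv.edina.ac.uk/wworld/plays/Much_Ado_about_Nothing.xml'
);

foreach ($xml->ACT as $act) {
    foreach ($act->SCENE as $scene) {
        foreach ($scene->SPEECH as $speech) {
            $speaker = (string) $speech->SPEAKER;
            $lines   = array();
            foreach ($speech->LINE as $line) {
                $lines[] = (string) $line;
            }
            // Each speech becomes one post, tagged with the speaker
            // and the Act/Scene it belongs to.
        }
    }
}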

Each ‘post’ is also tagged with the name of the character speaking the line. This means it is possible to view all lines by a particular role, and also, via a tag cloud, easily see who gets the most lines.

So far, so good. Unfortunately at this point I hit a problem – when I tried to load an entire play (rather than just the first Act), I consistently got an HTTP status 500 error. I’m pretty sure this is because either PHP or Apache is hitting a timeout limit when it tries to create so many posts in one go (each line of dialogue is a post, so a play is hundreds of posts). I spent far too long trying to tweak the timeout limits to resolve this problem, with no luck.
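
For anyone wondering, the tweaks I mean are the standard PHP execution settings – something like the lines below. These are real PHP settings, but as noted they didn’t solve the problem for me, which suggests the limit being hit was at the web server/host end:

<?php
// Standard ways of raising PHP's limits for a long-running import request.
set_time_limit(300);                    // allow this script up to 5 minutes
ini_set('max_execution_time', '300');   // the equivalent php.ini setting
ini_set('memory_limit', '256M');        // creating many posts also uses memory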

The time I spent on the timeout issue I really should have been spending on the planned widget to display contextual information from other sources such as cast lists and the Will’s World Registry. I’d originally hoped I might be able to achieve something approaching “Every story has a beginning” – an example of a linked data + javascript ‘active’ document from Tim Sherratt (@wragge). I’ve now had to concede to myself that I’m not going to have time to do this (although I have put some rudimentary linked data markup into the play text for each speaking character, using schema.org markup).

So, instead of a widget I’ve decided to create a page for each character in the play which uses at least the data from the cast lists I’ve scraped and the Will’s World registry. As I’ve already got timeout issues when getting the play text into WordPress, I’m going to have to work out a way of adding the creation of these pages without hitting timeout issues again. I think I’ve got an approach which, although a bit clunky, should break both tasks (creation of the play text as posts, and creation of ‘character’ pages) into smaller chunks that the person installing the plugin can activate in turn – so I just need to make sure no single step takes too long to carry out. I’m sure there must be a better way of doing this but I haven’t found any decent examples so far. I suspect one approach would be to break the work into steps as I’m proposing, but trigger each step in turn with some javascript rather than requiring the user to click a ‘Next’ button – but I don’t want to spend any more time on this than is absolutely necessary at this stage.
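
To make the idea concrete, one possible shape for the chunked import is sketched below – the settings page only processes one Act per request, so no single request runs long enough to time out. The function and option names here are illustrative rather than the plugin’s own:

<?php
// Settings page that imports the play one Act at a time.
function shakespearepress_settings_page() {
    $step = isset($_POST['sp_step']) ? (int) $_POST['sp_step'] : 0;

    if ($step > 0) {
        shakespearepress_import_act($step);   // hypothetical per-Act importer
        update_option('shakespearepress_last_act', $step);
    }

    $next = get_option('shakespearepress_last_act', 0) + 1;
    ?>
    <div class="wrap">
        <h2>ShakespearePress import</h2>
        <form method="post">
            <input type="hidden" name="sp_step" value="<?php echo esc_attr($next); ?>" />
            <?php submit_button("Import Act $next"); ?>
        </form>
    </div>
    <?php
}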

If you want to try out the plugin as it currently stands you can get the code from Shakespearepress on GitHub (N.B. don’t install it on a WordPress blog you care about at the moment – it creates lots of posts, changes the theme, and it might break). Once the plugin is activated you need to go to the plugin settings page to import the text of a play (at the moment it supports the first Act of Much Ado about Nothing or the first Act of King Lear).

If you want to see an example of a WordPress blog that has been Shakespeared then take a look at http://demonstrators.ostephens.com/willhack, which uses the first act of Much Ado about Nothing.

Hopefully before the submission deadline I can extend the Character pages and sort the installation process so you can load a whole play and create a page per (main?) character.

To scrape or not to scrape?

I’m currently participating in the #willhack online hackathon. This is an event being run by EDINA at the University of Edinburgh, as part of their Will’s World project, which in turn is part of the Jisc Discovery Programme.

The Discovery Programme came out of a Jisc task force looking at how ‘resource discovery’ might be improved for researchers in UK HE. The taskforce (catchily known as RDTF) outlined a vision, based on the idea of the publication of ‘open’ metadata (open in terms of licensing and access) by data owners, and the building of ‘aggregations’ of data with APIs etc., which would provide platforms for the development of user facing services by service providers.

The Will’s World project is building an aggregation of data relating to Shakespeare, and the idea of the hackathon is to test the theory that this type of aggregation can be a platform for building services.

As usual when getting involved in this kind of hackathon, I spent quite a lot of time unsure exactly what to do. The Will’s World registry has data from a number of sources, and the team have also posted other data sources that might be used, including xml markup of the plays.

I played around with some sample queries on the main registry (it supports the SOLR API for queries), but didn’t get that far – it was hard to know what you’d get back from any particular query, and I struggled to know what queries to throw at it beyond the title of each play – which inevitably brought back large numbers of hits.

I also had a couple of other data sources I was interested in – one was Theatricalia – a database of performances of plays in the UK including details of venue, play, casts etc. This is crowdsourced data, and the site was created and is maintained by Matthew Somerville (@dracos).

The other was a database called ‘Designing Shakespeare‘. Designing Shakespeare was originally an AHDS (Arts and Humanities Data Service) project, which is now hosted by RHUL (Royal Holloway, University of London). The site contains information about London-based Shakespeare productions including cast lists, pictures from productions, interviews with designers and even VRML models of the key theatre spaces. Designing Shakespeare is one of those publicly funded resources that I think never gets the exposure or love it deserves – a really interesting resource that is (probably) underused (I don’t have any stats on usage, so that’s just me being pessimistic!)

Both these sites about performance made me think there was potential to link plays with performance information, and then maybe some other information from the Will’s World registry. I liked the idea of using the cast lists to add some interest – many of the London-based performances at least have that “oh I didn’t realise X played Y” feel to them (Sean Bean as Puck anyone?). Unfortunately neither Theatricalia nor Designing Shakespeare has an API to get at their data programmatically. So I decided I’d write a scraper to extract the cast lists. Having done some quick checking, I found the Designing Shakespeare cast lists tended to be more complete, so I decided to scrape those first. While there is lots of information about the copyright status of many materials on the Designing Shakespeare site (pictures/audio/video etc.) there is no mention of copyright on the cast lists. Since this is very factual data and my only reason for extracting the data was to point back at the website, I felt reasonably OK scraping this data out.

As always with these things, it wasn’t as straightforward as I hoped, and it’s taken me much longer to get the data out than I expected – time I probably should have spent actually developing the main product I want to produce for the hack. But now it’s done (using the wonderful ScraperWiki) – you can access all the data at https://scraperwiki.com/scrapers/designing_shakespeare_cast_lists/ – around 24,000 role/actor combinations (that seems very high – I’m hoping that the data is all good at the moment!)

You can access the data via API or using SQL – I hope others will find it useful as well.

Now I need to find some time to move on to the next part of my hack – my aim is to build a WordPress plugin that starts from the basis of plays expressed as a series of WordPress posts, and adds, via widgets, links to other sources based on the text you are viewing – this will hopefully be a mixture of at least the cast lists I’ve extracted, and searches of the Will’s World registry. Other stuff if I have time!

If anyone wants to collaborate on this, I’d be very happy to do so. I’ll be posting code (once I have any) on GitHub at https://github.com/ostephens/shakespearepress.

Experimenting with British Museum data

[UPDATE 2014-11-20: The British Museum data model has changed quite a bit since I wrote this. While there is still useful stuff in this post, the detail of any particular query or comment may well now be outdated. I’ve added some updates in square brackets in some cases]

In September 2011 the British Museum started publishing descriptions of items in its collections as RDF (the data structure that underlies Linked Data). The data is available from http://collection.britishmuseum.org/ where the Museum have made a ‘SPARQL endpoint’ available. SPARQL is a query language for extracting data from RDF stores – it can be seen as a parallel to SQL, which is a query language for extracting data from traditional relational databases.

Although I knew what SPARQL was, and what it looked like, I really hadn’t got to grips with it, and since I’d just recently purchased “Learning SPARQL” it seemed like a good opportunity to get familiar with the British Museum data and SPARQL syntax. So I had a play (more below). Skip forward a few months, and I noticed some tweets from a JISC meeting about the Pelagios project (which is interested in the creation of linked (geo)data to describe ‘ancient places’), and in particular from Mia Ridge and Alex Dutton, which indicated they were experimenting with the British Museum data. My previous experience seemed to gel with the experience they were having, and prompted me to finally get on with a blog post documenting it, so hopefully others can benefit.

Perhaps one reason I’ve been a bit reluctant to blog this is that I struggled with the data, and I don’t want this post to come across as overly critical of the British Museum. The fact they have their data out there at all is amazing – and I hope other museums (and archives and libraries) follow the lead of the British Museum in releasing data onto the web. So I hope that all comments/criticisms below come across as offering suggestions for improving the Museum data on the web (and offering pointers to others doing similar projects), and of course the opportunity for some dialogue about the issues. There is also no doubt that some of the issues I encountered were down to my own ignorance/stupidity – so feel free to point out obvious errors.

When you arrive at the British Museum SPARQL endpoint the nice thing is there is a pre-populated query that you can run immediately. It just retrieves 10 results, of any type, from the data – but it means you aren’t staring at a blank form, and those ten results give a starting point for exploring the data set. Most URIs in the resulting data are clickable, and give you a nice way of finding what data is in the store, and to start to get a feel for how it is structured.

For example, running the default search now brings back the triple:

Subject: http://collection.britishmuseum.org/id/object/EAF119772
Predicate: http://collection.britishmuseum.org/id/crm/P3F.has_note
Object: Object type :: marriage equipment ::

Which is intriguing enough to make you want to know more (I am married, and have to admit I don’t remember any special equipment). Clicking on the URI http://collection.britishmuseum.org/id/object/EAF119772 in a browser takes you to an HTML representation of the resource – a list of all the triples that make statements about the item in the British Museum identified by that URI.

While I think it would be an exaggeration to say this is ‘easily readable’, sometimes, as with the triple above, there is enough information to guess the basics of what is being said – for example:

Subject: http://collection.britishmuseum.org/id/object/EAF119772
Predicate: http://collection.britishmuseum.org/id/crm/P3F.has_note
Object: Acquisition date :: 1994 ::

From this it is perhaps easy enough to see that there is some item (identified by the URI http://collection.britishmuseum.org/id/object/EAF119772) which has a note related to it stating that it was acquired (presumably by the museum) in 1994.

So far, so good. I’d got an idea of the kind of information that might be in the database. So the next question I had was “what kind of queries could I throw at the data that might produce some interesting/useful results?” Since I’d recently been playing around with data about composers I thought it might be interesting to see if the British Museum had any objects that were related to a well-known composer – say Mozart.

This is where I started to hit problems… In my initial explorations, while some information was obvious, I’d also realised that the data was modelled using something called CIDOC CRM, which is intended to model ‘cultural heritage’ data. With some help from Twitter (including staff at the British Museum) I started to read up on CIDOC CRM – and struggled! Even now I’m not sure I’d say I feel completely on top of it, but I now have a bit of a better understanding. Much of the CIDOC model is based around ‘events’ – things that happened at a certain time/in a certain place. This means that often what might seem like a simple piece of information – such as where an item in the museum originates from – becomes complex.

To give a simple example, the ‘discovery’ of an item is a kind of event. So to find all the items in the British Museum ‘discovered’ in Greenwich you have to first find all the ‘discovery’ events that ‘took place at’ Greenwich, then link these discovery events back to the items they are related to:

An item -> was discovered by a discovery event -> which took place at Greenwich

This adds extra complexity to what might seem initially (naively?) a simple query. This example was inspired by discussion at the Pelagios event mentioned earlier – the full query is:

SELECT ?greenwichitem WHERE
{
	?s <http://collection.britishmuseum.org/id/crm/P7F.took_place_at> <http://collection.britishmuseum.org/id/thesauri/x34215> .
	?subitem <http://collection.britishmuseum.org/id/crm/bm-extensions/PX.was_discovered_by> ?s .
	?greenwichitem <http://collection.britishmuseum.org/id/crm/P46F.is_composed_of> ?subitem
}

and the results can be seen at http://bit.ly/vojTWq.

[UPDATE 2014-11-20: This query no longer works. The query is now simpler:

PREFIX ecrm: <http://erlangen-crm.org/current/>
SELECT ?greenwichitem WHERE 
{ 
 ?find ecrm:P7_took_place_at <http://collection.britishmuseum.org/id/place/x34215> .
 ?greenwichitem ecrm:P12i_was_present_at ?find
}

END UPDATE]

To make things even more complex, the British Museum data seems to describe all items as being made up of (what I’m calling) ‘sub-items’. In some cases this makes sense: if a single item is actually made up of several pieces, each with its own properties and provenance, it clearly makes sense to describe each part separately.

However, the British Museum data describes even single items as made up of ‘pieces’ – just that the single item consists of a single piece – and it is then that piece that has many of the properties of the item associated with it. To illustrate, a multi-piece item looks like:

An item -> is composed of -> Piece 1, Piece 2, …

Which makes sense to me. But a single piece item looks like:

An item -> is composed of -> a single Piece (which carries many of the item’s properties)

I found (and continue to find) this confusing. It isn’t helped, in my view, by the fact that some properties are attached to the ‘parent’ object and some to the ‘child’ object, and I can’t really work out the logic behind this. For example it is the ‘parent’ object that belongs to a department in the British Museum, while it is the ‘child’ object that is made of a specific material. Both the parent and child in this situation are classified as physical objects, and this feels wrong to me.

Thankfully a link from the Pelagios meeting alerted me to some more detailed documentation around the British Museum data (http://www.researchspace.org/Stage-2-Outputs), and this suggests that the British Museum are going to move away from this model:

Firstly, after much debate we have concluded that preserving the existing modelling relationship as described earlier whereby each object always consists of at least one part is largely nonsense and should not be preserved.

While arguments were put forward earlier for retaining this minimum one part per object scheme, it has now been decided that only objects which are genuinely composed of multiple parts will be shown as having parts.

The same document notes that the current modelling “may be slightly counter-intuitive” – I can back up this view!

So – back to finding stuff related to Mozart… apart from struggling with the data model, the other issue I encountered was that it was difficult to approach the dataset through anything except a URI for an entity. That is to say, if you knew the URI for ‘Wolfgang Amadeus Mozart’ in the museum data set, the query would be easy, but if you only know a name, then it is much more difficult. How could I find the URI for Mozart, to then find all related objects?

Just using SPARQL, there are two approaches that might work. If you know the exact (and I mean exact) form of the name in the data, you can query for a ‘literal’ – i.e. do a SPARQL query for a textual string such as “Mozart, Wolfgang Amadeus”. If this is the exact form used in the data, the query will be successful, but if you get it slightly wrong then you’ll fail to get any result. A working example for the British Museum data is:

SELECT * WHERE 
{ 
	?s ?p "Mozart, Wolfgang Amadeus"
}

The second approach is to do a more general query and ‘filter’ the results using a regular expression. Regular expressions are ways of looking for patterns in text strings, and are incredibly powerful (supporting things like wildcards, ignoring case etc.). So you can be a lot less precise than searching for an exact string – for example, you might try to retrieve all the statements about ‘people’ and filter for those containing the (case insensitive) word ‘mozart’. While this would get you Leopold Mozart as well as Wolfgang Amadeus if both are present in the data, there are probably a small enough number of Mozarts that you would be able to pick out WA Mozart by eye, and get the relevant URI which identifies him.

A possible query of this type is:

SELECT * WHERE 
{ 
	?s <http://xmlns.com/foaf/0.1/Name> ?o 
	FILTER regex(?o, "mozart", "i") 
}

Unfortunately this latter type of ‘filter’ query is pretty inefficient, and the British Museum SPARQL endpoint has some restrictions which mean that if you try to retrieve more than a relatively small amount of data at one time you just get an error. Since this is essentially how ‘filter’ queries work (retrieve a largish amount of data first, then filter out the stuff you don’t want), I couldn’t get this working. The issue of only being able to retrieve small sets of data was a bit of a frustration overall with the SPARQL endpoint, not helped by the fact that the limit seemed relatively arbitrary in terms of what ‘size’ of result set caused an error – I assume it is something about the overall amount of data retrieved, as it seemed unrelated to the actual number of results. For example, using:

SELECT * WHERE
{
	?s ?p ?o
}

you can retrieve only 123 results before you get an error, while using

SELECT ?s WHERE
{
	?s ?p ?o
}

you can retrieve over 300 results without getting an error.

This limitation is an issue in itself (and the British Museum are by no means alone in having performance issues with an RDF triple store), but it is doubly frustrating that the limit is unclear.

The difficulty of approaching the British Museum data from a simple textual string became a real frustration as I explored it. It made me realise that while the Linked Data/RDF approach of using URIs rather than literals is something I understand and agree with, as people all we know are the textual strings that describe things – so to make the data more immediately usable, supporting textual searches (e.g. via a Solr index over the literals in the data) might be a good idea.

I got so frustrated that I went looking for ways of compensating. The British Museum data makes extensive use of ‘thesauri’ – lists of terms for describing people, places, times, object types, etc. In theory these thesauri would give the text string entry points into the data, and I found that one of the relevant thesauri (object types) was available on the Collections Link website (http://www.collectionslink.org.uk/assets/thesaurus/Objintro.htm). Each term in this data corresponds to a URI in the British Museum data, and so I wrote a ScraperWiki script which would search for each term in the British Museum data and identify the relevant URI and record both the term and the URI. At the same time a conversation with @portableant on twitter alerted me to the fact that the ‘Portable Antiquities‘ site uses a (possibly modified) version of the same thesaurus for classifying objects, so I added in a lookup of the term on this site to start to form connections between the Portable Antiquities data and the British Museum data. This script is available at https://scraperwiki.com/scrapers/british_museum_object_thesaurus/, but comes with some caveats about the question of how up to date the thesaurus on the Collections Link website is, and the possible imperfections of the matching between the thesaurus and the British Museum data.
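
The gist of the script is below – a sketch rather than the actual ScraperWiki code. The SPARQL endpoint URL, the ‘output’ parameter and the label predicate (skos:prefLabel) are all assumptions that may need adjusting:

<?php
// For a given thesaurus term, ask the SPARQL endpoint for a subject whose
// label exactly matches the term, and return its URI (or null).
function bm_lookup_term($term) {
    $sparql = 'SELECT ?uri WHERE { ?uri '
            . '<http://www.w3.org/2004/02/skos/core#prefLabel> "'
            . addslashes($term) . '" } LIMIT 1';
    $url = 'http://collection.britishmuseum.org/sparql?' . http_build_query(array(
        'query'  => $sparql,
        'output' => 'json',   // parameter name varies between SPARQL servers
    ));
    $data = json_decode(file_get_contents($url), true);
    // Standard SPARQL JSON results layout: results -> bindings
    if (!empty($data['results']['bindings'])) {
        return $data['results']['bindings'][0]['uri']['value'];
    }
    return null;
}

// e.g. look up one object-type term from the thesaurus
$uri = bm_lookup_term('candlestick');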

Unfortunately it seems that this ‘object type’ thesaurus is the only one made publicly available (or at least the only one I could find), while clearly the people and place thesauri would be really interesting, and provide valuable access points into the data. Ideally, though, these would be built from the British Museum data directly, rather than being separate lists.

So, finally back to Mozart. I discovered another way into the data – via the really excellent British Museum website, which offers the ability to search the collections through a nice web interface. To be honest this already solves problems such as the one I set myself here (finding all objects related to Mozart) – but never mind that now! If you search this interface and find an object, when you view the record for the object you’ll probably be at a URL something like:

http://www.britishmuseum.org/research/search_the_collection_database/search_object_details.aspx?objectid=3378094&partid=1&searchText=mozart&numpages=10&orig=%2fresearch%2fsearch_the_collection_database.aspx&currentPage=1

If you extract the “objectid” (in this case ‘3378094’) from this, you can use this to look up the RDF representation of the same object using a query like:

SELECT * WHERE
{
	?s <http://www.w3.org/2002/07/owl#sameAs> <http://collection.britishmuseum.org/id/codex/3378094>
}

This gives you the URI for the object, which you can then use to find other relevant URIs. So in this case I was able to extract the URI for Wolfgang Amadeus Mozart (http://collection.britishmuseum.org/id/person-institution/39629) and so create a query like:

SELECT ?item WHERE
{
	?s ?p <http://collection.britishmuseum.org/id/person-institution/39629> .
	?item <http://collection.britishmuseum.org/id/crm/P46F.is_composed_of> ?s
}

This finds the 9 (as of today) items that are in some way related to Mozart (mostly pictures/engravings of Mozart).

The discussion at the Pelagios meeting identified several ‘anti-patterns’ related to the usability of Linked Data – and some of these jumped out at me as being issues when using the British Museum data:

Anti-patterns

  • homepages that don’t say where data can be found
  • not providing info on licences
  • not providing info on RDF syntaxes
  • not providing examples of query construction
  • not providing easy way to get at term lists
  • no html browsing
  • complex data models

The Pelagios wiki has some more information on ‘stumbling blocks’ at http://pelagios.pbworks.com/w/page/48544935/Stumbling%20Blocks, and the group exploring (amongst other things) the British Museum data made notes at http://pelagios.pbworks.com/w/page/48535503/UK%20Cultural%20Heritage. I also know that Dominic Oldman from the British Museum was at the meeting, and was keen to get feedback on how they could improve the data or the way it is made available.

One thing I felt strongly when I was looking at the British Museum data is that it would have been great to be able to ‘go’ somewhere that others looking at or using the data would also be, to discuss the issues. The British Museum provide an email address to send feedback to (which I’ve used), but what I wanted to do was say things like “am I being stupid?” and “anyone else find this?” etc. As a result of discussion at the Pelagios meeting, and on twitter, Mia Ridge has set up a wiki page for just such a discussion.

A final thought. The potential of ‘linked data’ is to bring together data from multiple sources, and combine it to give something that is more than the sum of its parts. At the moment the British Museum data sits in isolation. How amazing would it be to join up the British Museum ‘people’ records such as http://collection.britishmuseum.org/id/person-institution/39629 with the VIAF (http://viaf.org/viaf/32197206/) or Library of Congress (http://id.loc.gov/authorities/names/n80022788) identifier for the same person, and start to produce searches and results that build on the best of all this data?

Spotlight on Names

A few people have been kind enough to test out my Composed bookmarklet and give some feedback (here on Google+ amongst other places). A couple of people identified composers on COPAC for which my bookmarklet didn’t produce any information, and when I checked this was because the underlying data I’m using, the MusicNet Codex, didn’t have a record for those examples.

This reinforced an idea that keeps coming back to me as I think about library data and Linked Data – that we need good ways of capturing feedback from consumers/users of the data, and then re-expressing that feedback in our data. In the case of the Composed bookmarklet it seems sensible to have a way to allow people to say at least:

  • “This record contains a composer [name or identifier] you don’t have in your data”
  • “For this record the bookmarklet displays a composer not mentioned in the record”

This triggered some further thoughts that a bookmarklet could be a nice way of generally allowing those interested (librarians or others) to add information to the record – specifically structured information and identifiers for people (and perhaps some other entities). Then, discussing the #discodev competition with Mathieu d’Aquin at the Open University today, he mentioned the DBpedia ‘Spotlight’ tool, which does entity extraction and gives back identifiers from DBpedia.

So how about a bookmarklet which:

  • Grabs the ‘people’ fields (100, 700, 600 – others?)
  • Passes contents to Spotlight and gets back possible DBPedia matches
  • Links to VIAF (think this is possible where VIAF has a Wikipedia URI) (possibly do this after decision made below)
  • Allows the user to confirm or reject the suggestions – if they confirm allows them to state a relationship as defined by MARC relators (available as Linked Data at http://id.loc.gov/vocabulary/relators.html)
  • Posts a triple expressing a link between the catalogue record, the relator, and the DBpedia URI and/or VIAF URI

This could then be harvested back by libraries or others to get more expressive linked data relating bibliographic entities to people entities with a meaningful relationship. I haven’t looked at this in detail – but I don’t think it would be very difficult – my guess is just a few hours work.
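
For the Spotlight step, the server-side piece might look something like the sketch below. The endpoint URL and the response keys (‘Resources’, ‘@URI’) are from memory of the Spotlight REST service, so treat them as assumptions to be checked against the Spotlight documentation:

<?php
// Pass the text of the 'people' fields to DBpedia Spotlight and collect
// candidate DBpedia URIs for the user to confirm or reject.
function spotlight_candidates($text) {
    $url = 'http://spotlight.dbpedia.org/rest/annotate?' . http_build_query(array(
        'text'       => $text,
        'confidence' => 0.4,
    ));
    // Spotlight returns XML by default; ask for JSON instead
    $context = stream_context_create(array(
        'http' => array('header' => "Accept: application/json\r\n"),
    ));
    $data = json_decode(file_get_contents($url, false, $context), true);

    $uris = array();
    if (!empty($data['Resources'])) {
        foreach ($data['Resources'] as $resource) {
            $uris[] = $resource['@URI'];   // the DBpedia identifier for the match
        }
    }
    return $uris;
}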

I think this also starts to address another issue that always comes up when discussing libraries and linked data: how might linked data become part of the metadata creation process in libraries? Since this approach relies on an existing record it doesn’t really get there – but if libraries are going to successfully exploit linked data we need to play around with interfaces that help us integrate linked data into our data as it is created.

Compose yourself

‘Composed’ is my entry for the UK Discovery #discodev developer competition. Composed helps you link between information about composers of classical music by exploiting the MusicNet Codex.

Specifically, Composed currently enables linking from COPAC catalogue records mentioning a composer to other information and records about the composer.

What is it and how do I install it?
Composed is a Bookmarklet which you can install by dragging this link to your browser bookmark bar: Composed.

How do I use it?
If you haven’t already, drag this Composed bookmarklet to your browser bookmark bar (or otherwise add it to your bookmarks). Next find a record on COPAC which mentions a classical composer – such as this one – and once you are viewing it in your browser, click the bookmarklet.

Assuming it is all working, you should find that the display is enhanced by the addition of some links on the right hand side of the screen. If you’ve used the example given above you should see:

  • An image of the composer (Handel) [UPDATE 5th August 2011: See my comment below about problems with displaying images]
  • Links to:
    • Records in the British Library catalogue
    • More records in COPAC
    • The entry for Handel in the Grove Dictionary of Music (subscribers only)
    • Record in RISM (a music and music literature database)
    • A page about Handel on the BBC
    • A page about Handel in the IMDB
    • The Wikipedia page for Handel

These are based on the information available for Handel from the MusicNet Codex and the BBC.

Example COPAC page enhanced by 'Composed'

If the record you are looking at contains multiple composers, you should see multiple entries in the display. If there are no links available for the record you are viewing, you should see a message to this effect.

How does it work?
The mechanics are pretty simple. The bookmarklet itself is a simple piece of javascript, which calls another script. This script finds the COPAC Record ID from the COPAC record currently in the browser. This ID is passed to some php which uses a database I built specifically for this purpose to match a COPAC record ID to one or more MusicNet Codex URIs. For each MusicNet Codex URI retrieved, the script requests information from the URI, gets back some RDF, from which it extracts the links that are passed back to the javascript, which inserts them into the COPAC display. If the MusicNet RDF contains a link to the BBC, further RDF is grabbed from the BBC and the relevant information is added into the data passed back to the display.
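
To make that flow concrete, here’s a simplified sketch – not the real composed.php. The include paths, database table, column names and the RDF property used to pull out the links are all illustrative assumptions; Graphite’s load()/resource()/all() calls are as per its documentation:

<?php
require_once 'arc2/ARC2.php';   // Graphite uses ARC2 under the hood
require_once 'Graphite.php';

$copacid = $_GET['copacid'];

// 1. Map the COPAC record ID to one or more MusicNet Codex URIs
//    (hypothetical table built from the scraped search results).
$db   = new PDO('mysql:host=localhost;dbname=composed', 'user', 'pass');
$stmt = $db->prepare('SELECT musicnet_uri FROM copac_musicnet WHERE copac_id = ?');
$stmt->execute(array($copacid));

$composers = array();
foreach ($stmt->fetchAll(PDO::FETCH_COLUMN) as $uri) {
    // 2. Fetch the RDF from the MusicNet URI and pull out links.
    $graph = new Graphite();
    $graph->load($uri);
    $composer = array('uri' => $uri, 'links' => array());
    foreach ($graph->resource($uri)->all('foaf:page') as $page) {
        $composer['links'][] = (string) $page;   // property choice is a guess
    }
    $composers[] = $composer;
}

// 3. Hand the result back to the bookmarklet's javascript as JSON.
header('Content-Type: application/json');
echo json_encode($composers);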

So what were the challenges?

Challenge 1

Manipulating RDF. Although I’ve done quite a bit of work with RDF in one form or another, I’ve never actually written scripts that consume it – so this was new to me. I ended up using PHP because the Graphite RDF parser, written by Chris Gutteridge at the University of Southampton, made it so easy to retrieve the RDF and grab the information I needed – although it took me a little while to get my head around navigating a graph rather than a tree (being pretty used to parsing XML).

So I guess I owe Chris a pint for that one 🙂

Challenge 2

The major challenge was getting a list of COPAC record IDs which mapped to MusicNet Codex URIs. Actually – I wasn’t able to do this and what I have is an approximation – almost certainly you can find examples where the bookmarklet populates the screen with a composer when there is no composer mentioned in the record.

Unfortunately MusicNet is unable to point at a COPAC identifier or URI for a person – like many library catalogues, COPAC identifies items in libraries (or perhaps more accurately, the records that describe those items), but not any of the entities (people, places, etc.) within the record. This means that while MusicNet can point at a specific URI at the BBC that represents (for example) ‘Handel’, with COPAC all it can do is give a URL which triggers a search that should bring back a set of records, all of which mention Handel.

There is a whole load of background as to how MusicNet got to this point, and how they build the searches against COPAC – but essentially it is based on the text strings in COPAC that were found by the MusicNet project to refer to the same composer. These text strings are used to build the search against COPAC. This is also the explanation of why you sometimes see multiple links to COPAC/the British Library catalogue in the Composed bookmarklet display – because there are multiple strings that MusicNet found to represent the same composer.

What I’ve done to create a rough mapping between MusicNet and COPAC records is to run each search that MusicNet defines for COPAC and grab all the record IDs in the resultant record set. This gives a rough and ready mapping, but there are bound to be plenty of errors in there. For example, one of the searches MusicNet holds for the composer Franz Schubert on the British Library catalogue is http://catalogue.bl.uk/F/?func=find-b&request=Schubert&find_code=WNA – which will actually find everything by anyone called ‘Schubert’ – and if there are any similar searches in the COPAC data I’ll be grabbing a lot of irrelevant records. Since the number of searches, and resultant records, is relatively high (e.g. over 30k records mention Mozart), at the time of writing I’m still in the process of populating my mapping – it is currently listing around 50k [Update: 31/7/2011 at 15:33 – final total is 601,286] COPAC IDs, but I’ll add more as my searches run and produce results in the background.

I’m talking to the MusicNet team to see if they are able at this stage to track back to the original COPAC records they used to derive their data, so that we could get an exact mapping of their URIs to lists of record IDs on COPAC – this would be incredibly useful and would allow functions such as mine to work much more reliably.

None of this should be seen as a criticism of either the MusicNet or COPAC teams – without these sources I wouldn’t have even been able to get started on this!

Final Thoughts

I hope this shows how data linked across multiple sources can bring together information that would otherwise be separated. There is no reason in theory why the bookmarklet shouldn’t be expanded to take in the other data sources MusicNet knows about – and possibly beyond (as long as there is access to an ID that can eventually be brought back to MusicNet).

Libraries desperately need to move beyond ‘the record’ as the way they think about their data – and start adding the identifiers they already have available to their data – this would make this type of task much easier.

If you want to build other functionality on my rough and ready MusicNet to COPAC record mapping, you can pass COPAC IDs to the script:

http://www.meanboyfriend.com/composed/composed.php?copacid=<copac_record_id>

You’ll get back some JSON containing information about one or more composers with a Name, Links, and an Image if the BBC have a record of an image in their data.
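
For example, a quick consumer of this might look like the sketch below – the JSON field names (Name, Links) are assumptions based on the description above, so check an actual response:

<?php
$copacid = 'example-copac-record-id';   // substitute a real COPAC record ID
$json = file_get_contents(
    'http://www.meanboyfriend.com/composed/composed.php?copacid=' . urlencode($copacid)
);
$composers = json_decode($json, true);

foreach ((array) $composers as $composer) {
    echo $composer['Name'] . "\n";
    foreach ((array) $composer['Links'] as $link) {
        echo '  ' . $link . "\n";
    }
}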

Discovering Discovery

As I mentioned in a recent post I’ve been involved in UK Discovery (http://discovery.ac.uk) – an initiative to enable resource discovery through the publication and aggregation of metadata according to simple, open, principles.

Discovery is currently running a Developer competition. Others have already blogged about the competition, but what I wanted to do here was note the reasons for running it, capture some ideas that I’ve had, and hopefully inspire others to enter (as I hope to myself).

Firstly – why the developer competition? For me I hope we can achieve three things through the competition:

  1. Engage developers in/get them excited about Discovery
  2. Get feedback from developers on what works for them in terms of building on Discovery
  3. Start building a set of examples of what can be achieved in the Discovery ecosystem

If we achieve any of these I’ll be pretty happy. We are still at early days in building an environment of open (meta)data for libraries, archives and museums, but the 10 data sets we are featuring in the competition provide good examples of the type of data we hope will be published with the encouragement and advice of the Discovery initiative.

On to ideas. The list below is basically just me brainstorming – my hope is that others might be inspired by one of the ideas, or others might contribute more ideas via the comments. (I’ve already picked one of the ideas below that I’m going to try and turn into an entry of my own – but for the purposes of dramatic tension, I won’t reveal this until the end of the post!)

  • Linked Library Catalogue. Rather than having a catalogue made up of MARC (or other format of choice) records, simply have a list of URIs which point to the bibliographic entities on the web. Build an OPAC on top of this list by crawling the URIs for metadata and indexing locally (e.g. with Solr). Could use the Cambridge University Library, Jerome and BNB featured datasets as well as other bibliographic information on the web.
  • What’s hot in research? Use the Mosaic Activity Data, the OpenURL Router data and other relevant data (e.g. from research publication repositories) to look at trends in research areas. Possibly mash up with Museum/Archive data to highlight relevant collections to the research community based on the current ‘hot topics’?
  • Composer Bookmarklet. Use the MusicNet Codex to power a bookmarklet that when installed and used would link from relevant pages/records in COPAC/BL/RISM/Grove/BBC/DbPedia/MusicBrainz to other sources. Focus on providing links from library catalogue records to other relevant sources (like recordings/BBC programmes)
  • Heritage Britain. Map various cultural heritage items/collections onto a map of Britain. Out of the featured datasets English Heritage data is the obvious starting point, but could include data from Archives Hub, National Archives Flickr collection, and the Tyne and Wear Museums data.

Remember that although entries have to use data from one of the featured data sets (I’ve mentioned them all here), you can use whatever other data you like…

If you’ve got ideas (perhaps especially if you aren’t in a position to develop them yourself) that you think would be great demonstrations or just really useful, feel free to blog yourself, or comment here.

And the one I’m hoping to take forward? The Composer Bookmarklet – I’ll blog progress here if/when I make any (although don’t let that stop you if you want to develop one as well!)

An internal monologue on Metadata Licensing

I’ve been involved in discussions around the licensing of library/museum/archive metadata over the last couple of years, specifically through my work on UK Discovery (http://discovery.ac.uk) – an initiative to enable resource discovery through the publication and aggregation of metadata according to simple, open, principles.

In the course of this work I’ve co-authored the JISC Guide to Open Bibliographic Data and become a signatory of the Discovery Open Metadata Principles. Recently I’ve been discussing some of the issues around licensing with Ed Chamberlain and others (see Ed’s thoughts on licensing on the CUL-COMET blog), and over coffee this morning I was trying to untangle the issues and the reasons for specific approaches to licensing – for some reason they formed in my head as a set of Q&As, so I’ve jotted them down in that form… at the moment this is really to help me with my thinking, but I thought I’d share it just in case it’s useful.

N.B. These are just some thoughts – not comprehensive, and not official views from the Discovery initiative

Q1: Why apply an explicit license to your metadata?
A1.1: To enable appropriate re-use

Q2: What prevents appropriate re-use?
A2.1: Uncertainty/lack of clarity about what can be done with the data
A2.2: Any barriers that add an overhead – could be technical or legal

Q3: What sort of barriers add overhead?
A3.1: Attribution licensing – where data from many sources are being mixed together this overhead can be considerable.
A3.2: Machine readable licensing data to be provided with data – adds complexity to data processing, potentially increases network traffic and slows applications
A3.3: Licensing requiring human intervention to determine rights for reuse at a data level – this type of activity effectively stops applications being built on the data as it isn’t possible for software to decide if a piece of data can be included or not (NB human intervention for whole datasets is less of an issue – building an app on a dataset where all data is covered by the same license which has been interpreted by a human in advance of writing software is not an issue)
A3.4: Licensing which is not clear about what type of reuse is allowed. The NC (Non-commercial) licenses exemplify this, as the definition of what amounts to ‘commercial use’ is often unclear.
A3.5: Licensing not generally familiar to the potential consumers of the data (for re-use purposes) – e.g. writing a new license specific to your requirements rather than adopting a Creative Commons or other more widely used licence.

Q4: What does this suggest in terms of data licensing decisions?
A4.1: Putting data in public domain removes all doubt – it can be reused freely – a consumer doesn’t have to check anything etc.
A4.2: Putting data in public domain removes the overhead of attribution – where data from many sources are being mixed together this overhead can be considerable
A4.3: Where there is licensing beyond public domain, reuse will be encouraged if it is easy to establish (preferably in an automated way) what licensing is associated with any particular data
A4.4: Where data within a single set is available under different licensing, reuse will be encouraged by making it easy to address only data with a specified license attached. E.g. ‘only give me data that is CC0 or ODC-PDDL’

Comments/questions welcome…

Open Culture 2011 – Google Art Project

Laura Scott – Head of External Relations from Google UK

Art Project (http://www.googleartproject.com/) – v ambitious for Google – making art and information relating to art more accessible to people everywhere in the world. Google new to ‘arts’ – but have commitment.

Google 20% time – if you have a great idea, you can spend up to 20% time on that – e.g. Google Mail big success – but failure also embraced. Failure does cost – Google may well be able to take financial risk, but when things fail media will pick up on this…

Encourage experimentation – small things as many previous speakers have mentioned.

Partnerships are crucial for Google – Art project had partnerships with 9(?) galleries across the world.

Google not a curator – doesn’t want to be – wanted to make art immediately accessible – possible to ‘jump right in’ – but in no sense meant to replace a physical visit – and evidence so far is that this is not the case at all. Each museum chose one image to do in v v high resolution – can zoom in to very high levels – example of ‘The Harvesters’ from the Metropolitan Museum of Art.

Example of ‘No Woman, No Cry’ from Tate Britain – again v high levels of zoom – but also ‘view in darkness’ option – to reveal message about Stephen Lawrence – a key part of artwork.

Also ability to integrate other media – e.g. films of people commenting on works

This is the first version – Google very aware it’s not perfect, and there are things to improve. Want to increase geographical spread of museums included. Have had over 10 million visitors and 90k people creating ‘my collections’ – importance of making things social and being able to share – and realised need to develop that feature further.

Art Project took 18 months to get up and running – felt like a long time – but takes time, and long term project for Google – next phase over next couple of years.

Open Culture 2011 – Funding Digital

Marco de Niet, Digitaal Erfgoed Nederland

All looking for new business models for digital – culture sector and commercial sector. Sometimes we are good at ‘doing’ but not so good at ‘sustaining’.

www.numeric.ws – survey showed there is more budget available for digitisation than there is for planning what to digitise and why.

‘Business Model Innovation Cultural Heritage’ – booklet about new value propositions. Look at different levels:

analog ‘in house’
digital ‘in house’
digital in controlled network
‘out there’ on the web (this is the ‘new world’ of social media; open data; linked open data; semantic web)

How can we think about our role in this move towards digital? Business Model Canvas (Alex Osterwalder) – has the following components:

  1. Value propositions
  2. Customer segments
  3. Channels
  4. Customer relationships
  5. Revenue stream
  6. Key activities
  7. Key resources
  8. Key partnerships
  9. Cost structure

1-5 are outward facing things, 6-9 are ‘back office’ tasks

Major obstacles for cultural heritage to achieve business model innovation…

  1. Organisations in transition
    • Didn’t realise that as we embraced technology, so did our audience
  2. Open ICT infrastructure
    • IT moving to generic solutions in collaboration; tendency for orgs traditionally to do custom; in-house
  3. Clearing copyright – four ways of dealing with it
    • Opt out – put it online and take it down if people complain. Can get away with it if you are a small player
    • Clearing by the institution – costly; time-consuming
    • Clearing through an outside organisation – outsource risk
    • Changing legislation
  4. Creating revenue – how can we make money with digitisation – 5 ways
    • Put stuff online, hope people come to the institution
    • Become a broker of digital collection – e.g. put low res images online and sell high-res
    • Digital curator – provide context not just objects
    • Digital branding – create and build the brand – e.g. Powerhouse Museum – become world renowned
    • Product bundle: trans-media combination of various sources of income

Fiona Talbott, Heritage Lottery Fund

Policy changes: traditionally HLF only supported projects that had real objects at core (although there could be some digital component). Now possibility of allowing digital only projects – needs approval though.

Use of digital technology – should this be compulsory in all projects? This was seen as over-prescriptive.

Digital policy issues – want to see sustainability as key. Project Management will have to go beyond just the ‘launch’.

Projects have to be about Audiences first and foremost. Resulting content must be publicly accessible – not paywalled – think Guardian not Times…

Make use of resources – your staff are your most valuable resource…
Fiona sees the possibility of HLF funding hackdays.
Don’t re-invent the wheel with technology – use stuff that is already there – off-the-shelf; open source projects

Exploit social media – already lots doing this, but needs to go across the heritage sector…

Emma Wakelin, AHRC

Weren’t cut as badly as some … but much tighter guidelines and more strings attached.
Realised big need to fund academics to work with others – not about consultancy but about knowledge exchange and creation. £20 million available.

New call coming out for a centre to look at copyright and business models – covering creative industries, including cultural heritage – this is cross funder initiative – £5 million available.

Invite people to think about the impact of digital technologies on the way we do arts and humanities research – a theme which will be active over the next four years…

Also fund the ‘Digging into Data’ challenge – large-scale data analysis; European Net Heritage project (http://heritageportal.eu)

AHRC won’t fund projects where sole aim is ‘to get stuff digitised’ – but can fund projects with digitisation components if greater aim in line with AHRC remit.

Jon Pratty, Arts Council England

Five main goals – and these are used when looking at funding applications:

  1. Talent and artistic excellence are thriving and celebrated
  2. More people experience and are inspired by the arts
  3. The arts are sustainable, resilient and innovative
  4. The arts leadership and workforce are diverse and highly skilled
  5. Every child and young person has the opportunity to experience the richness of the arts

Won’t necessarily have to meet every one of these criteria – e.g. a digital project might meet 2, 3 and 4.

Arts Council going to be working with NESTA and AHRC…
NESTA R&D fund [£500k] – but this is to get moving in the direction of the ACE Digital Innovation and Development Fund [£15-20m]

Also announced with BBC ‘Building digital capacity in the Arts’ scheme (http://www.bbc.co.uk/academy/news/view/Arts-Council-Article)

The funding schemes broadly sector-agnostic – so not limited to specific areas or types of organisation.

Themes emerging from pilot period:

  1. User generated content and social media – harnessing the power of the Internet and social media to reach audiences and to give them a platform for discussion, participation and creativity
  2. Distribution – how digital can enable this
  3. Mobile, location and games – developing a new generation of mobile and location-based experience and service, including games
  4. Data and archive – making archives, collections and other data more widely available to other arts and cultural organisation and the general public
  5. Resources – using digital tech to improve the way in which arts/cultural orgs are run – business efficiency, income generation and collaborations
  6. Education and learning – creating education resources/experiences for children, teachers, young people, adult learning etc.

Remember – existing ACE funding routes such as ‘Grants for the Arts’ are open to museums and libraries – Arts related collaborations or explorations – examples:

  • RAMM – artists working within collection
  • Fitzwilliam – China project
  • Hove – blacksmith artist ‘let loose’ in the collection

Will also be some millions as a ‘strategic digital’ fund – to fill gaps