Experimenting with British Museum data

[UPDATE 2014-11-20: The British Museum data model has changed quite a bit since I wrote this. While there is still useful stuff in this post, the detail of any particular query or comment may well now be outdated. I've added some updates in square brackets in some cases]

In September 2011 the British Museum started publishing descriptions of items in its collections as RDF (the data structure that underlies Linked Data). The data is available from http://collection.britishmuseum.org/ where the Museum have made a 'SPARQL Endpoint' available. SPARQL is a query language for extracting data from RDF stores – it can be seen as a parallel to SQL, which is a query language for extracting data from traditional relational databases.

Although I knew what SPARQL was, and what it looked like, I really hadn't got to grips with it, and since I'd recently purchased "Learning SPARQL" it seemed like a good opportunity to get familiar with the British Museum data and SPARQL syntax. So I had a play (more below). Skip forward a few months, and I noticed some tweets from a JISC meeting about the Pelagios project (which is interested in the creation of linked (geo)data to describe 'ancient places'), in particular from Mia Ridge and Alex Dutton, which indicated they were experimenting with the British Museum data. Their experience seemed to gel with mine, and prompted me to finally get on with a blog post documenting my experience so that others can hopefully benefit.

Perhaps one reason I’ve been a bit reluctant to blog this is that I struggled with the data, and I don’t want this post to come across as overly critical of the British Museum. The fact they have their data out there at all is amazing – and I hope other museums (and archives and libraries) follow the lead of the British Museum in releasing data onto the web. So I hope that all comments/criticisms below come across as offering suggestions for improving the Museum data on the web (and offering pointers to others doing similar projects), and of course the opportunity for some dialogue about the issues. There is also no doubt that some of the issues I encountered were down to my own ignorance/stupidity – so feel free to point out obvious errors.

When you arrive at the British Museum SPARQL endpoint the nice thing is there is a pre-populated query that you can run immediately. It just retrieves 10 results, of any type, from the data – but it means you aren’t staring at a blank form, and those ten results give a starting point for exploring the data set. Most URIs in the resulting data are clickable, and give you a nice way of finding what data is in the store, and to start to get a feel for how it is structured.
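
For reference, the pre-populated query is (more or less) the simplest possible SPARQL query – something along these lines, which just asks for any ten triples in the store:

SELECT * WHERE
{
	?s ?p ?o
}
LIMIT 10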

For example, running the default search now brings back the triple:

Subject: http://collection.britishmuseum.org/id/object/EAF119772
Predicate: http://collection.britishmuseum.org/id/crm/P3F.has_note
Object: "Object type :: marriage equipment ::"

Which is intriguing enough to make you want to know more (I am married, and have to admit I don’t remember any special equipment). Clicking on the URI http://collection.britishmuseum.org/id/object/EAF119772 in a browser takes you to an HTML representation of the resource – a list of all the triples that make statements about the item in the British Museum identified by that URI.

While I think it would be an exaggeration to say this is ‘easily readable’, sometimes, as with the triple above, there is enough information to guess the basics of what is being said – for example:

Subject: http://collection.britishmuseum.org/id/object/EAF119772
Predicate: http://collection.britishmuseum.org/id/crm/P3F.has_note
Object: "Acquisition date :: 1994 ::"

From this it is perhaps easy enough to see that there is some item (identified by the URI http://collection.britishmuseum.org/id/object/EAF119772) which has a note related to it stating that it was acquired (presumably by the museum) in 1994.
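
To pull back all the notes recorded against this object, you can reuse the same subject and predicate URIs in a simple query – for example:

SELECT ?note WHERE
{
	<http://collection.britishmuseum.org/id/object/EAF119772> <http://collection.britishmuseum.org/id/crm/P3F.has_note> ?note
}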

So far, so good. I’d got an idea of the kind of information that might be in the database. So the next question I had was “what kind of queries could I throw at the data that might produce some interesting/useful results?” Since I’d recently been playing around with data about composers I thought it might be interesting to see if the British Museum had any objects that were related to a well-known composer – say Mozart.

This is where I started to hit problems… In my initial explorations, while some information was obvious, I'd also realised that the data was modelled using something called CIDOC CRM, which is intended to model 'cultural heritage' data. With some help from Twitter (including staff at the British Museum) I started to read up on CIDOC CRM – and struggled! Even now I'm not sure I'd say I feel completely on top of it, but I now have a somewhat better understanding. Much of the CIDOC model is based around 'events' – things that happened at a certain time and/or in a certain place. This means that what might seem like a simple piece of information – such as where an item in the museum originates from – often becomes complex.

To give a simple example, the 'discovery' of an item is a kind of event. So to find all the items in the British Museum 'discovered' in Greenwich you first have to find all the 'discovery' events that 'took place at' Greenwich, then link these discovery events back to the items they relate to:

An item -> was discovered by a discovery event -> which took place at Greenwich

This adds extra complexity to what might seem initially (naively?) a simple query. This example was inspired by discussion at the Pelagios event mentioned earlier – the full query is:

SELECT ?greenwichitem WHERE
{
	# find 'discovery' events that took place at Greenwich (thesaurus URI x34215)
	?s <http://collection.britishmuseum.org/id/crm/P7F.took_place_at> <http://collection.britishmuseum.org/id/thesauri/x34215> .
	# find the pieces that were discovered by those events
	?subitem <http://collection.britishmuseum.org/id/crm/bm-extensions/PX.was_discovered_by> ?s .
	# finally, find the parent items those pieces are part of
	?greenwichitem <http://collection.britishmuseum.org/id/crm/P46F.is_composed_of> ?subitem
}

and the results can be seen at http://bit.ly/vojTWq.

[UPDATE 2014-11-20: This query no longer works. The query is now simpler:

PREFIX ecrm: <http://erlangen-crm.org/current/>
SELECT ?greenwichitem WHERE 
{ 
 ?find ecrm:P7_took_place_at <http://collection.britishmuseum.org/id/place/x34215> .
 ?greenwichitem ecrm:P12i_was_present_at ?find
}

END UPDATE]

To make things even more complex, the British Museum data seems to describe all items as made up of (what I'm calling) 'sub-items'. In some cases this makes sense: if a single item is actually made up of several pieces, each piece has its own properties and provenance, and it clearly makes sense to describe each part separately.

However, the British Museum data describes even single items as made up of 'pieces' – just that the single item consists of a single piece – and it is then that piece that has many of the properties of the item associated with it. To illustrate, a multi-piece item is like:

An item -> is composed of -> piece 1
An item -> is composed of -> piece 2

Which makes sense to me. But a single piece item is like:

An item -> is composed of -> a single piece (which carries many of the item's properties)

I found (and continue to find) this confusing. It isn't helped, in my view, by the fact that some properties are attached to the 'parent' object and some to the 'child' object, and I can't really work out the logic behind this. For example, it is the 'parent' object that belongs to a department in the British Museum, while it is the 'child' object that is made of a specific material. Both the parent and the child in this situation are classified as physical objects, and this feels wrong to me.

Thankfully a link from the Pelagios meeting alerted me to some more detailed documentation around the British Museum data (http://www.researchspace.org/Stage-2-Outputs), and this suggests that the British Museum are going to move away from this model:

Firstly, after much debate we have concluded that preserving the existing modelling relationship as described earlier whereby each object always consists of at least one part is largely nonsense and should not be preserved.

While arguments were put forward earlier for retaining this minimum one part per object scheme, it has now been decided that only objects which are genuinely composed of multiple parts will be shown as having parts.

The same document notes that the current modelling “may be slightly counter-intuitive” – I can back up this view!

So – back to finding stuff related to Mozart… apart from struggling with the data model, the other issue I encountered was that it was difficult to approach the dataset through anything except a URI for an entity. That is to say, if you knew the URI for 'Wolfgang Amadeus Mozart' in the museum data set, the query would be easy; but if you only know a name, it is much more difficult. How could I find the URI for Mozart, in order to then find all related objects?

Just using SPARQL, there are two approaches that might work. If you know the exact (and I mean exact) form of the name in the data, you can query for a 'literal' – i.e. do a SPARQL query for a textual string such as "Mozart, Wolfgang Amadeus". If this is the exact form used in the data, the query will be successful, but if you get it even slightly wrong you'll fail to get any result. A working example for the British Museum data is:

SELECT * WHERE 
{ 
	?s ?p "Mozart, Wolfgang Amadeus"
}

The second approach is to do a more general query and 'filter' the results using a regular expression. Regular expressions are ways of looking for patterns in text strings, and are incredibly powerful (supporting wildcards, case-insensitive matching, and much more). So you can be a lot less precise than searching for an exact string: for example, you might retrieve all the statements about 'people' and filter for those containing the (case-insensitive) word 'mozart'. While this would get you Leopold Mozart as well as Wolfgang Amadeus if both are present in the data, there are probably few enough Mozarts that you could pick out WA Mozart by eye, and get the relevant URI which identifies him.

A possible query of this type is:

SELECT * WHERE
{
	?s <http://xmlns.com/foaf/0.1/Name> ?o
	# keep only names containing 'mozart', ignoring case
	FILTER regex(?o, "mozart", "i")
}

Unfortunately this latter type of 'filter' query is pretty inefficient, and the British Museum SPARQL endpoint has some restrictions which mean that if you try to retrieve more than a relatively small amount of data at one time you just get an error. Since retrieving a largish amount of data and then filtering out the stuff you don't want is essentially how 'filter' queries work, I couldn't get this working. The issue of only being able to retrieve small sets of data was a frustration overall with the SPARQL endpoint, not helped by the fact that the limit seemed relatively arbitrary: I assume it relates to the overall amount of data retrieved, as it seemed unrelated to the actual number of results. For example, using:

SELECT * WHERE
{
	?s ?p ?o
}

you can retrieve only 123 results before you get an error, while using

SELECT ?s WHERE
{
	?s ?p ?o
}

you can retrieve over 300 results without getting an error.

This limitation is an issue in itself (and the British Museum are by no means alone in having performance issues with an RDF triple store), but it is doubly frustrating that the limit is unclear.

The difficulty of approaching the British Museum data from a simple textual string became a real frustration as I explored the data. The Linked Data/RDF principle of using URIs rather than literals is something I understand and agree with, but as people, all we know are the textual strings that describe things. So to make the data more immediately usable, supporting textual searches (e.g. via a Solr index over the literals in the data) might be a good idea.

I got so frustrated that I went looking for ways of compensating. The British Museum data makes extensive use of 'thesauri' – lists of terms for describing people, places, times, object types, etc. In theory these thesauri would give text-string entry points into the data, and I found that one of the relevant thesauri (object types) was available on the Collections Link website (http://www.collectionslink.org.uk/assets/thesaurus/Objintro.htm). Each term in this thesaurus corresponds to a URI in the British Museum data, so I wrote a ScraperWiki script which would search for each term in the British Museum data, identify the relevant URI, and record both the term and the URI. At the same time a conversation with @portableant on Twitter alerted me to the fact that the 'Portable Antiquities' site uses a (possibly modified) version of the same thesaurus for classifying objects, so I added in a lookup of the term on this site to start to form connections between the Portable Antiquities data and the British Museum data. This script is available at https://scraperwiki.com/scrapers/british_museum_object_thesaurus/, but comes with some caveats about how up to date the thesaurus on the Collections Link website is, and about the possible imperfections of the matching between the thesaurus and the British Museum data.
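
The heart of the script was just a literal lookup of each thesaurus term against the labels in the British Museum data – in spirit, something like the query below (I'm using skos:prefLabel and the term "amulet" purely as illustrations here; the actual labelling predicate in the data may be different):

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?term WHERE
{
	# find the URI whose label exactly matches the thesaurus term
	?term skos:prefLabel "amulet"
}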

Unfortunately it seems that this 'object type' thesaurus is the only one made publicly available (or at least the only one I could find), while the people and place thesauri would clearly be really interesting, and would provide valuable access points into the data. Ideally, though, these would be built from the British Museum data directly, rather than being separate lists.

So, finally back to Mozart. I discovered another way into the data – via the really excellent British Museum website, which offers the ability to search the collections through a nice web interface. This is a good search interface, and gives access to the collections – to be honest, already solving problems such as the one I set myself here (of finding all objects related to Mozart) – but never mind that now! If you search this interface and find an object, when you view the record for the object you'll probably be at a URL something like:

http://www.britishmuseum.org/research/search_the_collection_database/search_object_details.aspx?objectid=3378094&partid=1&searchText=mozart&numpages=10&orig=%2fresearch%2fsearch_the_collection_database.aspx&currentPage=1

If you extract the "objectid" (in this case '3378094') from this, you can use it to look up the RDF representation of the same object using a query like:

SELECT * WHERE
{
	?s <http://www.w3.org/2002/07/owl#sameAs> <http://collection.britishmuseum.org/id/codex/3378094>
}

This gives you the URI for the object, which you can then use to find other relevant URIs. So in this case I was able to extract the URI for Wolfgang Amadeus Mozart (http://collection.britishmuseum.org/id/person-institution/39629) and so create a query like:

SELECT ?item WHERE
{
	# find all statements that reference the Mozart URI
	?s ?p <http://collection.britishmuseum.org/id/person-institution/39629> .
	# then find the items those pieces belong to
	?item <http://collection.britishmuseum.org/id/crm/P46F.is_composed_of> ?s
}

This finds the 9 (as of today) items that are in some way related to Mozart (mostly pictures/engravings of him).

The discussion at the Pelagios meeting identified several 'anti-patterns' related to the usability of Linked Data – and some of these jumped out at me as being issues when using the British Museum data:

Anti-patterns

  • homepages that don’t say where data can be found
  • not providing info on licences
  • not providing info on RDF syntaxes
  • not providing egs of query construction
  • not providing easy way to get at term lists
  • no html browsing
  • complex data models

The Pelagios wiki has some more information on 'stumbling blocks' at http://pelagios.pbworks.com/w/page/48544935/Stumbling%20Blocks, and the group exploring (amongst other things) the British Museum data made notes at http://pelagios.pbworks.com/w/page/48535503/UK%20Cultural%20Heritage. I also know that Dominic Oldman from the British Museum was at the meeting, and was keen to get feedback on how they could improve the data or the way it is made available.

One thing I felt strongly while looking at the British Museum data was that it would have been great to be able to 'go' somewhere that others looking at or using the data would also be, to discuss the issues. The British Museum provide an email address for feedback (which I've used), but what I wanted to do was ask things like "am I being stupid?" and "anyone else find this?". As a result of discussion at the Pelagios meeting, and on Twitter, Mia Ridge has set up a wiki page for just such a discussion.

A final thought. The potential of 'linked data' is to bring together data from multiple sources, and combine it into something that is more than the sum of its parts. At the moment the British Museum data sits in isolation. How amazing would it be to join up British Museum 'people' records such as http://collection.britishmuseum.org/id/person-institution/39629 with the VIAF (http://viaf.org/viaf/32197206/) or Library of Congress (http://id.loc.gov/authorities/names/n80022788) identifiers for the same person, and start to produce searches and results that build on the best of all this data?
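
If such owl:sameAs links existed in (or alongside) the British Museum data, a query could start from the VIAF identifier and hop straight through to the related objects – a sketch, assuming those hypothetical links and reusing the pattern from the Mozart query above:

PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?item WHERE
{
	# hypothetical link from a BM person record to its VIAF identifier
	?person owl:sameAs <http://viaf.org/viaf/32197206/> .
	# any statement referencing that person
	?s ?p ?person .
	# and the parent items those pieces belong to
	?item <http://collection.britishmuseum.org/id/crm/P46F.is_composed_of> ?s
}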

What’s so hard about Linked Data?

My post on Linked Data from a couple of weeks ago got some good comments and definitely helped me in exploring my own understanding of the area. The 4 Principles of Linked Data as laid out by Tim Berners-Lee seem relatively straightforward, and although there are some things that you need to get your head around (some terminology, some concepts) the basic principles don’t seem that difficult.

So what is difficult about Linked Data (and what is not)?

Data Modelling

Data Modelling is "a method used to define and analyze data requirements needed to support the business processes of an organization". The problem is that the real world is messy, and describing it in a way that can be manipulated by computers is always problematic.

Basically, data modelling is difficult. This is probably true of any sector, but anyone working in libraries who has looked at how we represent bibliographic and related data, and library processes, in our systems will know it gets complicated extremely quickly. With library data you can easily get bogged down in philosophical questions (what is a book? how do you represent an 'idea'?).

This is not a problem unique to Linked Data – modelling is hard however you approach it – but my suspicion is that using a Linked Data approach brings these questions to the fore. I'm not entirely sure about this, but my guess is that if you store your data in a relational database, the model lives much more in the software you build on top of the database than in the database structure itself. With Linked Data I think there is a tendency to build richer models into the inherent data structure (because you can?), leaving fewer of the modelling decisions to the software implementation.

If I'm right about this, it means Linked Data forces you to think more carefully about the data model at a much earlier point in the process of designing and developing systems. It also means that anyone interacting with your Linked Data (consumers) needs to understand not just your data, but also your model – which can be challenging.
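
To make this concrete, think back to the British Museum example: even a simple "what is this made of?" question requires the consumer to know that the model routes such facts through an event. A sketch, using Erlangen CRM property names (the exact properties any given dataset uses may differ):

PREFIX ecrm: <http://erlangen-crm.org/current/>
SELECT ?object ?material WHERE
{
	# the consumer has to know about the 'production event' part of the model
	?object ecrm:P108i_was_produced_by ?production .
	# before they can get at the material the production employed
	?production ecrm:P126_employed ?material
}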

I'd recommend having a look at various presentations/articles/posts by those involved in implementing Linked Data for parts of the BBC website, e.g. this presentation on How the BBC make Websites from IWMW2009.

Also to see (or contribute to) the thought processes behind building a Linked Data model, have a look at this work in progress on modelling Science Museum data/collections by Mia Ridge.

Reuse

One of the concepts with Linked Data is that you don't invent new identifiers, models and vocabularies if someone else has already done it. This sounds great, and is one of the promises that open Linked Data brings – if the BBC have already established an 'identifier' for the common Kingfisher species, then I shouldn't need to do this again. Similarly, if someone else has already established a Linked Data vocabulary for describing people, and I want to describe a person, I can simply use this existing vocabulary. More than this, I can mix and match existing elements in new models – so if I want to describe books about wildlife, and their authors, I can use the BBC wildlife identifiers when I want to show a book is about a particular species, and I can use the FOAF vocabulary (linked above) to describe the authors.
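
As a sketch of what this mixing might look like in a query (the species URI below is invented for illustration, and I'm using Dublin Core for the book properties):

PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?book ?authorName WHERE
{
	# a hypothetical BBC wildlife URI identifies the species the book is about
	?book dc:subject <http://www.bbc.co.uk/nature/species/Common_Kingfisher> .
	?book dc:creator ?author .
	# FOAF provides the vocabulary for describing the author
	?author foaf:name ?authorName
}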

This all sounds great – and given that I've said modelling data is difficult, the idea that someone else may have done the hard work for you, and that you can just pick up their model, is very attractive. Unfortunately I think reuse is actually much more difficult than it sounds.

First you've got to find the existing identifier/vocabulary, then you've got to decide if it does what you need it to do, and you may have to make some judgements about the provenance and long-term prospects of those things you are going to reuse. If you use the BBC URI for Kingfishers, are you sure they are talking about the same thing you are? If so, how much do you trust that URI to be there in a year? In 5 years? In 10 years? (My books are highly likely to be around for 10 years.)

I would guess reuse will get easier as Linked Data becomes more established (assuming it does). The recently established Schemapedia looks like a good starting point for discovering existing vocabularies you may be able to reuse, while Sameas.org is a good place to find existing Linked Data identifiers. This is also an area where communities of practice are going to be very important. For libraries it isn't too hard to imagine a collaborative effort to establish common Linked Data identifiers for authors (VIAF as Linked Data looks like it could well be a viable starting point for this).

RDF and SPARQL

In my previous post I questioned the mention of RDF and SPARQL as part of the Linked Data principles. I don't particularly have an issue with RDF and SPARQL as such – but my perception is that others do. Recently Mike Ellis laid down a challenge to the Linked Data community in which he says "How I should do this [publish linked data], and easily. If you need to use the word "ontology" or "triple" or make me understand the deepest horrors of RDF, consider your approach a failed approach", which suggests that RDF is difficult, or at the least, complicated.

I'm not going to defend RDF as uncomplicated, but I do think it is an area of Linked Data that attracts some bad press, which is probably unwarranted. My argument is that RDF isn't the difficult bit – it's the data modelling that gets represented in RDF that is difficult. This is echoed by the comment in the Nodalities article from Tom Scott and Michael Smethurst of the BBC:

The trick here isn’t the RDF mapping – it’s having a well thought through and well expressed domain model. And if you’re serious about building web sites that’s something you need anyway. Using this ontology we began to add RDF views to /programmes (e.g. www.bbc.co.uk/programmes/b00f91wz.rdf). Again the work needed was minimal.

So for those considering the Linked Data approach we’d say that 95% of the work is work you should be doing just to build for the (non-semantic) web. Get the fundamentals right and the leap to the Semantic Web is really more of a hop.

I do think that we are still lacking anything close to decent consumer-facing tools for interacting with RDF (although I'd be really happy to be shown some good examples). When I played around with an RDF representation of characters from Middlemarch I authored the RDF by hand, having failed to find an authoring tool I could use. I found a few more tools that were OK for visualising and exploring the data I created – but to be honest most of these seemed buggy or flaky in some way.

I have to admit that I haven't got my head around SPARQL in any meaningful way yet, and I'm not convinced it deserves the prominence it seems to be getting in the Linked Data world at the moment. SPARQL is a language for querying RDF, and as such is clearly going to be an essential tool for those using and manipulating RDF. However, you could say the same about SQL (a language for querying data stored as tables with rows and columns) in relation to traditional databases – yet most people neither know, nor care, about SQL.

Tony Hirst often mentions how widespread the use of spreadsheets to store tabular data is, and that this enables basic data manipulation to happen on the desktop. Many people are comfortable with representing sets of data as tables – and I suspect this is embedded strongly in our culture. It may be that we will see tools that start to bridge this divide – I was very, very impressed by the demonstration videos of the Gridworks tool posted on the Freebase blog recently, and I'm really looking forward to playing with it when it is made publicly available.

Conclusion

I’m not sure I have a strong conclusion – sorry! What I am aware of is a shift in my thinking. I used to think the technical aspects of Linked Data were the hard bits – RDF, SPARQL, and a whole load of stuff I haven’t mentioned. While there is no doubt that these things are complicated, and complex, I now believe the really difficult bits are the modelling and reuse aspects. I also think that there is an overlap here with the areas where domain experts need to have an understanding of ‘computing’ concepts, and computing experts need to understand the domain – and this kind of crossover is always difficult.

Middlemash, Middlemarch, Middlemap

The next Mashed Library event was announced a few months ago, but now more details are available. Middlemash is happening at Birmingham City University on 30th November 2009. I hope to see you there.

In discussion with Damyanti Patel, who is organising Middlemash, we thought it would be nice to do a little project in advance of Middlemash. When we brainstormed what we could do I originally suggested that maybe someone had drawn a map of the fictional geography of Middlemarch, and if we could find one, we could make it interactive in some way. Unfortunately a quick search turned up no such map. However, what it did turn up was something equally interesting – this map of relationships between characters in Middlemarch on LibraryThing.

This inspired a new idea – could this be represented in RDF somehow? My first thought was FOAF, but initially this seemed limited, as it doesn't allow for the expression of different types of relationship. However, I then came across this post from Ian Davis (the first in a series of 3), which used the Relationship vocabulary in addition to FOAF to express more of the kind of thing I was looking for.
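
To give a flavour of what this enables, once the relationships are expressed using the Relationship vocabulary (rel: below) alongside FOAF, you can query them directly – for example, pulling out all the married couples:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rel: <http://purl.org/vocab/relationship/>
SELECT ?name ?spouseName WHERE
{
	# each character is a foaf:Person with a name
	?person foaf:name ?name .
	# the Relationship vocabulary supplies the typed relationship
	?person rel:spouseOf ?spouse .
	?spouse foaf:name ?spouseName
}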

The resulting RDF is at http://www.meanboyfriend.com/overdue_ideas/middlemash.rdf. However, if you want to explore this in a more user-friendly manner, you probably want to use an RDF viewer. Although there are several you could use, the one I found easiest as a starting point was the Zitgist dataviewer. You should be able to browse the file directly with Zitgist via this link. There are however a couple of issues:

  • Zitgist doesn't seem to display the whole file, although if you browse through relationships you can view all records eventually
  • At time of posting I’m having some problems with Zitgist response times, but hopefully these are temporary

This is the first time I’d written any RDF, and I did it by hand, and I was learning as I went along. So I’d be very glad to know what I’ve done wrong, and how to improve it – leave comments on this post please.

I did find some problems with the Relationship vocabulary. It still only expresses a specific range of relationships, and it also seems to rely on inferred relationships in some cases. The relationships uncle/aunt/nephew/niece aren't expressed directly in the vocabulary – presumably on the basis that they could be inferred through other relationships of 'parentOf', 'childOf' and 'siblingOf' (i.e. your uncle is your father's brother, etc.). However, in Middlemarch there are a few characters who are described as related in this manner, but to my knowledge no mention of the intermediary relationships is made. So we know that Edward Casaubon has an Aunt Julia, but it is not stated whether she is his father's or his mother's sister, and further, his parents are not mentioned (as far as I know – I haven't read Middlemarch for many years, and I went from SparkNotes and the relationship map on LibraryThing).
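
So an aunt or uncle has to be inferred by querying across the intermediate relationships – something like the sketch below, which only works if the childOf and siblingOf links are actually present in the data (which, as noted above, they often aren't in Middlemarch):

PREFIX rel: <http://purl.org/vocab/relationship/>
SELECT ?person ?auntOrUncle WHERE
{
	# your aunt or uncle is a sibling of one of your parents
	?person rel:childOf ?parent .
	?parent rel:siblingOf ?auntOrUncle
}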

Something that seemed odd is that the Relationship vocabulary does allow you to explicitly relate grandparents to grandchildren, without relying on the inference from two parentOf relationships.

Another problem, which Ian Davis explores at length in his posts on representing Einstein's biography in RDF, is the time element. The relationships I express here aren't linked to time – so where someone has remarried it is impossible to say, from the work I have done here, whether they are polygamous or not! I suspect that at least some of this could have been dealt with by adding details like dates of marriages via the Bio vocabulary Ian uses, but I think this would be a problem in terms of the details available from Middlemarch itself (I'm not confident that dates would necessarily be given). It also looked like hard work 🙂
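
For completeness, the event-based approach would look something like the query below – each marriage becomes a resource in its own right, with its own date (the class and property names here are my best reading of the Bio vocabulary, so treat this as a sketch to be checked against the vocabulary itself):

PREFIX bio: <http://purl.org/vocab/bio/0.1/>
SELECT ?person ?marriage ?date WHERE
{
	# each person is linked to the life events they took part in
	?person bio:event ?marriage .
	# a marriage is an event with its own date
	?marriage a bio:Marriage .
	?marriage bio:date ?date
}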

So – there you have it, my first foray into RDF – a nice experiment, and potentially an interesting way of developing representations of literary works in the future?