Linked Data is getting a lot of press at the moment – perhaps most notably last week Gordon Brown (the UK Prime Minister) said:
Underpinning the digital transformation that we are likely to see over the coming decade is the creation of the next generation of the web – what is called the semantic web, or the web of linked data.
This statement was part of a speech at “Building Britain’s Digital Future” (#bbdf). For more on the context of this statement, see David Flanders’ ‘eye witness’ account of the speech, and his thoughts.
Last week I attended a ‘Platform Open Day’ at Talis, which was about Linked Data and related technologies, so I thought I’d try to get my thoughts in order. I may well have misunderstood bits and pieces here and there, but I’m pretty sure that the gist of what I’m saying here is right (and feel free to post comments or clarifications if I’ve got anything wrong).
I’m going to start by considering what Linked Data is. Tim Berners-Lee’s Linked Data design note sets out four principles:
- Use URIs as names for things
- Use HTTP URIs so that people can look up those names.
- When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)
- Include links to other URIs. so that they can discover more things.
What does this mean?
While most people are familiar with URLs, the concept of a URI is less well known. A URL is a resource locator – if you know the URL, you can locate the resource. A URI is a resource identifier – it simply identifies the resource. In fact, URLs are a special kind of URI: any URL is also a URI, because a URL both identifies and locates a resource. So all URLs are URIs, but not vice versa. You can read more about URIs on Wikipedia.
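To make the distinction a bit more concrete, here is a minimal Python sketch. It assumes, as a simplification of my own rather than any formal rule, that a URI counts as a URL when its scheme is one you can fetch over the network; the `LOCATOR_SCHEMES` set, the `describe` function and the example.org address are all made up for illustration.

```python
from urllib.parse import urlparse

# Schemes treated as "locators" in this sketch (an assumption, not a formal rule).
LOCATOR_SCHEMES = {"http", "https", "ftp"}


def describe(uri: str) -> str:
    """Say whether a URI is (roughly) also a URL, judging by its scheme alone."""
    scheme = urlparse(uri).scheme
    if scheme in LOCATOR_SCHEMES:
        return "URI and URL"
    return "URI only"


# A URN names a book by ISBN but says nothing about where to find it.
print(describe("urn:isbn:0141439513"))
# An HTTP URI both names something and tells you where to look it up.
print(describe("http://example.org/id/book/pride-and-prejudice"))
```

The ISBN-based URN identifies the edition without telling you where to get anything, whereas the HTTP URI can be typed straight into a browser.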
Further to this, an ‘HTTP URI’ is a URL as we are used to using on the web.
This means that the first two principles together basically say you should identify things using web addresses. This sounds reasonably straightforward. Unfortunately there is some quite tricky stuff hidden behind these straightforward principles, which basically comes down to the fact that you have to be very careful and clear about what any particular HTTP URI identifies.
For example this URI:
http://www.amazon.co.uk/Pride-Prejudice-Penguin-Classics-Austen/dp/0141439513/ref=sr_1_9?ie=UTF8&s=books&qid=1269423132&sr=8-9
doesn’t identify (as you might expect) Pride and Prejudice, but rather identifies the Amazon web page that describes the Penguin Classics edition of Pride and Prejudice. This may seem like splitting hairs, but if you want to start to make statements about things using their identifiers it is very important. I might want to state that the author of Pride and Prejudice is Jane Austen. If I say:
http://www.amazon.co.uk/Pride-Prejudice-Penguin-Classics-Austen/dp/0141439513/ref=sr_1_9?ie=UTF8&s=books&qid=1269423132&sr=8-9 is authored by Jane Austen, then strictly I’m saying Jane Austen wrote the web page, rather than the book described by the web page.
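The difference is easier to see if the statements are written out as data. This is only a sketch: the two example.org URIs for the book and for Jane Austen are hypothetical identifiers I have made up, and I am assuming the Dublin Core ‘creator’ property as the way of saying ‘is authored by’.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    """One 'subject - predicate - object' assertion."""
    subject: str
    predicate: str
    obj: str


# The Amazon URI from the text: it identifies a web page.
AMAZON_PAGE = (
    "http://www.amazon.co.uk/Pride-Prejudice-Penguin-Classics-Austen/dp/"
    "0141439513/ref=sr_1_9?ie=UTF8&s=books&qid=1269423132&sr=8-9"
)
# Hypothetical URIs, invented for this example.
BOOK = "http://example.org/id/book/pride-and-prejudice"
AUSTEN = "http://example.org/id/person/jane-austen"
# Assumed property for 'authored by' (Dublin Core terms).
CREATOR = "http://purl.org/dc/terms/creator"

# Strictly, this says Jane Austen created the web page.
about_page = Statement(AMAZON_PAGE, CREATOR, AUSTEN)
# This says what was actually meant: Jane Austen created the book.
about_book = Statement(BOOK, CREATOR, AUSTEN)

print(about_page.subject == about_book.subject)
```

Both statements have the same predicate and object; only the subject differs, and that is exactly where the meaning changes.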
Moving on to principle 3, things get a little more controversial. I’m going to break this down into two parts. Firstly “When someone looks up a URI, provide useful information”. Probably the key thing to note here is that when you identify things with an HTTP URI (as per principles 1 and 2), you are often going to be identifying things that can’t be delivered online. If I identify a physical copy of a book (for example, my copy of Pride and Prejudice, sitting on my bookshelf), I can give it an HTTP URI to identify it, but if you type that URI into a web browser, or in some other way try to ‘retrieve’ that URI, the physical item isn’t going to appear before you. So if you look up that URI, the third principle says that you should get some ‘useful information’ – for example, a description of my copy of Pride and Prejudice. There are some technical implications of this, as I have to make sure that you get some useful information about the item (e.g. a description), while still being clear that the URI identifies the physical item rather than the description of the physical item – but I’m not going to worry too much about this now.
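As an aside for the curious, one convention I have seen used for this is to answer a request for the ‘thing’ URI with an HTTP 303 (“See Other”) redirect to a separate document URI that holds the description. The sketch below simulates that with plain dictionaries rather than a real web server; the paths, the `look_up` function and the description text are all invented for illustration, and this is one possible approach rather than the only one.

```python
# Hypothetical paths: /id/... names the physical thing, /doc/... names a document about it.
THINGS = {"/id/book/my-copy": "/doc/book/my-copy"}
DOCUMENTS = {
    "/doc/book/my-copy": "Description of one physical copy of Pride and Prejudice.",
}


def look_up(path: str):
    """Return (status, headers, body) for a requested path."""
    if path in THINGS:
        # The thing itself cannot be sent over the wire, so point at its description.
        return 303, {"Location": THINGS[path]}, ""
    if path in DOCUMENTS:
        # A document can be delivered directly.
        return 200, {"Content-Type": "text/plain"}, DOCUMENTS[path]
    return 404, {}, ""


status, headers, _ = look_up("/id/book/my-copy")
print(status, headers["Location"])
print(look_up(headers["Location"])[2])
```

The point of the redirect is that the two URIs stay distinct: statements about the book on my shelf use the `/id/` URI, while statements about the description (who wrote it, when it was updated) use the `/doc/` URI.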
The second part of principle 3 is where we move into territory which tends to set off heated debate. This says “using the standards (RDF, SPARQL)”. Firstly it invokes ‘standards’, and secondly it lists two specific standards. I feel that the wording isn’t very helpful. It does make it clear that Linked Data is about doing things in a standardised way. This is clearly important, and yet also very difficult. As anyone who has worked with bibliographic metadata will appreciate, achieving standards even across a relatively small and tight-knit community such as librarians is difficult enough; getting standardisation across larger, disparate communities is very challenging indeed.
What I don’t think the principle makes very clear is which standards are being used. It lists two (RDF and SPARQL), but as far as I can tell most people would agree RDF is actually the key thing here, which makes the list of two misleading however you read it. I’m not going to describe RDF or SPARQL here, but may come back to them in future posts. In short, RDF provides a structured way of making assertions about resources – there is a simple introduction in my SlideShare presentation on the Semantic Web. SPARQL is a language for querying RDF.
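To give a flavour of the idea without using real RDF or SPARQL tooling, here is a toy Python sketch. Assertions are held as (subject, predicate, object) triples, and the `match` function is a hypothetical stand-in of my own for a single SPARQL triple pattern, where `None` plays the role of a variable. The example.org URIs are invented; the two Dublin Core properties are assumed choices for ‘title’ and ‘creator’.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical identifiers, invented for this example.
BOOK = "http://example.org/id/book/pride-and-prejudice"
AUSTEN = "http://example.org/id/person/jane-austen"
# Assumed properties (Dublin Core terms).
TITLE = "http://purl.org/dc/terms/title"
CREATOR = "http://purl.org/dc/terms/creator"

# A tiny set of assertions about one resource.
TRIPLES: List[Triple] = [
    (BOOK, TITLE, "Pride and Prejudice"),
    (BOOK, CREATOR, AUSTEN),
]


def match(triples: List[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> List[Triple]:
    """Return triples that fit the pattern; None means 'anything' (a variable)."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Roughly the question a SPARQL query such as
#   SELECT ?author WHERE { <BOOK> <CREATOR> ?author }
# would ask, with the full URIs in place of the names.
authors = [obj for (_, _, obj) in match(TRIPLES, s=BOOK, p=CREATOR)]
print(authors)
```

Real SPARQL does a great deal more than this (joining several patterns, filtering, querying remote endpoints), but the basic move of filling in the blanks in a triple is the same.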
So finally, on to principle 4: “Include links to other URIs. so that they can discover more things.” The first three principles are concerned with making your data linkable – i.e. making it possible for people to link to your data in meaningful ways. The fourth principle says you should link from your data to other things. For my example of representing my own copy of Pride and Prejudice, that could include linking to information about the book in a more general sense – rather than record the full details myself, I could (for example) link to an OpenLibrary record for the book. Supporting both inbound and outbound links is key to making a rich, interconnected set of data, enabling the ‘browsing’ of data in the same way we currently ‘browse’ webpages.
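A rough sketch of what ‘browsing’ data by following links might look like is below. Everything in it is made up for illustration: the example.org URIs are placeholders (including the one standing in for an OpenLibrary record), and the `LINKS` dictionary pretends to be the outbound links you would find by looking up each URI in turn.

```python
from collections import deque
from typing import Dict, List

# Placeholder URIs, invented for this example.
MY_COPY = "http://example.org/id/book/my-copy"
BOOK_RECORD = "http://example.org/id/record/pride-and-prejudice"  # stands in for an OpenLibrary record
AUTHOR = "http://example.org/id/person/jane-austen"

# Outbound links each resource's description would contain (simulated, no network access).
LINKS: Dict[str, List[str]] = {
    MY_COPY: [BOOK_RECORD],
    BOOK_RECORD: [AUTHOR],
    AUTHOR: [],
}


def discover(start: str, links: Dict[str, List[str]]) -> List[str]:
    """Follow links breadth-first from start, returning URIs in the order found."""
    seen = [start]
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for target in links.get(current, []):
            if target not in seen:
                seen.append(target)
                queue.append(target)
    return seen


for uri in discover(MY_COPY, LINKS):
    print(uri)
```

Starting from nothing more than the URI for my copy, a program can reach the general record for the book and then its author, without my description having to repeat any of those details.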
I was originally intending to explore some of the arguments I’ve come across recently about ‘Linked Data’ – I especially wanted to tackle some of the issues raised by Mike Ellis in his ‘challenge’, but I think that this post is quite long enough already, so I’ll leave that for another time.