Linked Data

Linked Data is getting a lot of press at the moment – perhaps most notably last week Gordon Brown (the UK Prime Minister) said:

Underpinning the digital transformation that we are likely to see over the coming decade is the creation of the next generation of the web – what is called the semantic web, or the web of linked data.

This statement was part of a speech at “Building Britain’s Digital Future” (#bbdf) (for more on the context of this statement, see David Flanders’ ‘eyewitness’ account of the speech, and his thoughts).

Last week I attended a ‘Platform Open Day’ at Talis, which was about Linked Data and related technologies, so I thought I’d try to get my thoughts in order. I may well have misunderstood bits and pieces here and there, but I’m pretty sure that the gist of what I’m saying here is right (and feel free to post comments or clarifications if I’ve got anything wrong).

I’m going to start with considering what Linked Data is…

The principles of Linked Data are stated by Tim Berners-Lee as:

  1. Use URIs as names for things
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)
  4. Include links to other URIs. so that they can discover more things.

What does this mean?

While most people are familiar with URLs, the concept of a URI is less well known. A URL is a resource locator – if you know the URL, you can locate the resource. A URI is a resource identifier – it simply identifies the resource. In fact, a URL is a special kind of URI: it both identifies and locates a resource. So all URLs are also URIs, but not vice versa. You can read more about URIs on Wikipedia.

Further to this, an ‘HTTP URI’ is simply a URL of the kind we are used to using on the web.

This means that the first two principles together basically say you should identify things using web addresses. This sounds reasonably straightforward. Unfortunately there is some quite tricky stuff hidden behind these straightforward principles, which basically comes down to the fact that you have to be very careful and clear about what any particular HTTP URI identifies.

For example this URI:

http://www.amazon.co.uk/Pride-Prejudice-Penguin-Classics-Austen/dp/0141439513/ref=sr_1_9?ie=UTF8&s=books&qid=1269423132&sr=8-9

Doesn’t identify (as you might expect) Pride and Prejudice, but rather identifies the Amazon web page that describes the Penguin Classics edition of Pride and Prejudice. This may seem like splitting hairs, but if you want to start to make statements about things using their identifiers it is very important. I might want to state that the author of Pride and Prejudice is Jane Austen. If I say:

http://www.amazon.co.uk/Pride-Prejudice-Penguin-Classics-Austen/dp/0141439513/ref=sr_1_9?ie=UTF8&s=books&qid=1269423132&sr=8-9 is authored by Jane Austen, then strictly I’m saying Jane Austen wrote the web page, rather than the book described by the web page.
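To make that distinction concrete: in RDF (which I’ll come to under principle 3), a statement like this is a triple of subject, predicate and object. Here’s a sketch in plain Python – the predicate is Dublin Core’s dc:terms/creator, but the example.org book URI is a hypothetical one I’ve made up, since the whole point is that Amazon’s URI identifies the page, not the book:

```python
# A sketch of RDF's triple model using plain Python tuples.
# The http://example.org/... book URI is hypothetical.

# What I'd accidentally be saying – Jane Austen wrote the *web page*:
wrong = ("http://www.amazon.co.uk/Pride-Prejudice-Penguin-Classics-Austen/dp/0141439513/",
         "http://purl.org/dc/terms/creator",
         "Jane Austen")

# What I actually mean – a distinct URI identifying the *book*:
right = ("http://example.org/id/book/pride-and-prejudice",
         "http://purl.org/dc/terms/creator",
         "Jane Austen")

# Same predicate and object, but different subjects –
# so these are two quite different claims.
print(wrong[0] == right[0])
```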

Moving on to principle 3, things get a little more controversial. I’m going to break this down into two parts. Firstly, “When someone looks up a URI, provide useful information”. Probably the key thing to note here is that when you identify things with an HTTP URI (as per principles 1 and 2), you are often going to be identifying things that can’t be delivered online. If I identify a physical copy of a book (for example, my copy of Pride and Prejudice, sitting on my bookshelf), I can give it an HTTP URI, but if you type that URI into a web browser, or in some other way try to ‘retrieve’ it, the physical item isn’t going to appear before you.

So, if you look up that URI, the third principle says that you should get some ‘useful information’ – for example, you might get a description of my copy of Pride and Prejudice. There are some technical implications of this, as I have to make sure that you get some useful information about the item (e.g. a description), while still being clear that the URI identifies the physical item, rather than identifying the description of the physical item – but I’m not going to worry too much about this now.
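For the curious, a sketch of one common way this is handled (not the only one – this mirrors the widely used ‘303 redirect’ convention, where the URI of a real-world thing redirects to a separate document that describes it; ‘hash’ URIs are the other common option). All URIs here are hypothetical:

```python
# Sketch of the '303 redirect' convention: the URI of a non-information
# resource (a physical book) redirects to a separate document about it.
# Both URIs below are hypothetical.
THING_TO_DOC = {
    "http://example.org/id/my-copy-of-pride-and-prejudice":
        "http://example.org/doc/my-copy-of-pride-and-prejudice",
}

def look_up(uri):
    """Return (status, location) for a request to this imagined server."""
    if uri in THING_TO_DOC:
        # 303 See Other: "I can't send you the thing itself,
        # but here is a description of it"
        return 303, THING_TO_DOC[uri]
    return 200, uri  # an ordinary web document is served directly

print(look_up("http://example.org/id/my-copy-of-pride-and-prejudice"))
```

The two-URI split is what keeps the identity of the physical item separate from the identity of its description.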

The second part of principle 3 is where we move into territory that tends to set off heated debate. This says “using the standards (RDF, SPARQL)”. Firstly it invokes ‘standards’, and secondly it lists two specific standards. I feel that the wording isn’t very helpful. It does make it clear that Linked Data is about doing things in a standardised way – this is clearly important, and yet also very difficult. As anyone who has worked with bibliographic metadata will appreciate, achieving standards even across a relatively small and tight-knit community such as librarians is difficult enough – getting standardisation across larger, disparate communities is very challenging indeed.

What I don’t think the principle makes very clear is which standards are being used – it lists two (RDF and SPARQL), but as far as I can tell most people would agree RDF is actually the key thing here, making this list of two misleading however you read it. I’m not going to describe RDF or SPARQL in detail here, but may come back to them in future posts. In short, RDF provides a structured way of making assertions about resources – there is a simple introduction in my slideshare presentation on the Semantic Web. SPARQL is a language for querying RDF.
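The division of labour can be sketched in a few lines of plain Python: RDF is the data model (a set of triples) and SPARQL is pattern matching over those triples. The data and prefixes below are made up, and real SPARQL of course runs against a proper triple store – this is just to show the shape of the idea:

```python
# Toy illustration: RDF as a set of triples, SPARQL as pattern
# matching over them. Data, URIs and prefixes are all hypothetical.
triples = {
    ("ex:prideAndPrejudice", "dc:creator", "ex:janeAusten"),
    ("ex:emma",              "dc:creator", "ex:janeAusten"),
    ("ex:janeAusten",        "foaf:name",  "Jane Austen"),
}

def match(pattern):
    """Match a triple pattern; None plays the role of a SPARQL variable."""
    return [t for t in triples
            if all(p is None or p == v for p, v in zip(pattern, t))]

# Roughly: SELECT ?book WHERE { ?book dc:creator ex:janeAusten }
books = sorted(t[0] for t in match((None, "dc:creator", "ex:janeAusten")))
print(books)  # ['ex:emma', 'ex:prideAndPrejudice']
```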

There is quite a bit of discussion about whether RDF is essential to ‘Linked Data’ including Andy Powell on eFoundations, Ian Davis on Internet Alchemy, and Paul Miller on Cloud of Data.

So finally, on to principle 4: “Include links to other URIs. so that they can discover more things.” The first three principles are concerned with making your data linkable – i.e. making it possible for people to link to your data in meaningful ways. The fourth principle says you should link from your data to other things. For my example of representing my own copy of Pride and Prejudice, that could include linking to information about the book in a more general sense – rather than record the full details myself, I could (for example) link to an OpenLibrary record for the book. Supporting both inbound and outbound links is key to making a rich, interconnected set of data, enabling the ‘browsing’ of data in the same way we currently ‘browse’ webpages.
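Continuing the plain-Python sketch, the outbound link is just another triple. To be clear, both the ‘exemplarOf’ predicate and the OpenLibrary work identifier below are invented for illustration – I haven’t looked up the real record:

```python
# Principle 4 as a triple: my description links *out* to someone
# else's URI rather than duplicating their data. The predicate and
# the OpenLibrary work identifier are both hypothetical.
my_copy = "http://example.org/id/my-copy-of-pride-and-prejudice"

triples = [
    (my_copy,
     "http://example.org/vocab/exemplarOf",   # hypothetical predicate
     "http://openlibrary.org/works/OL0000000W"),  # hypothetical work ID
]

# Anyone who looks up my copy can now follow the link outwards
# to discover more about the work itself.
print(triples[0][2].startswith("http://openlibrary.org/"))
```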

I was originally intending to explore some of the arguments I’ve come across recently about ‘Linked Data’ – I especially wanted to tackle some of the issues raised by Mike Ellis in his ‘challenge’, but I think that this post is quite long enough already, so I’ll leave that for another time.

9 thoughts on “Linked Data”

  1. Hi Stephen,

    I think that’s a nice and clear summary, but I think there’s also a wider and simpler perspective. I think of it as the sandwich or layer view:

    The bottom layer is composed of very diverse and rapidly expanding data sets in all sorts of formats; Access and proper databases, spreadsheets, EXIF and other metadata, tables in Word files, PDFs etc.

    The top layer is information that meets a particular audience’s need exactly, and is in a form that they are comfortable with (spreadsheets for accountants, groovy graphics for creative types, webpages etc.)

    Linked Data is the middle layer: it describes how you get from any of the bottom layer data sets to any of the documents in the top layer.

    In order for that to happen, bottom layer data must be available on the web for starters (i.e. Open Data), but it must also have a predictable form, give handles on its meaning, have a predictable form of transport, and you should be able to chop and change it in a predictable way.

    That’s where the standards come in: RDF for form, URIs for meaning, http for transport and SPARQL for chopping and changing.

    Ergo: no standards means no Linked Data.

    To be sure, there are other ways to get from the bottom to the top, but they’re all partial and particular to one data type or tool. Getting from one such column to another means changing tool sets, learning new methods & tools & patterns and often lots of brittle hackery. Like BBSs before the web.

    Does that mean that you have to have RDF to play in Linked Data space? Not necessarily: even if you don’t provide the RDF, someone else might convert your data for themselves or others. It just shifts the burden.

  2. No problem Wilbert – thanks for taking the time to reply (if only I identified myself by a http URI there would be no need for this type of ambiguity!)

    I think the description you offer is helpful, and I can see that without the relevant standards we don’t have Linked Data. However, I can also see how you could have ‘data which is linked’ by adopting http URIs as identifiers, but adopting other forms of data.

    I can see the arguments against this – use RDF as it allows you to process stuff in standard ways, and agree on the ‘meaning’ of things using Ontologies. However, I’m slightly sceptical (although not entirely) about these arguments – partly because I suspect ontologies will quickly be abused (humans are going to be responsible for implementing ontologies, and they are going to get it wrong), and the processing argument comes down to where the cost is as far as I can see – but these are things I’ll try and pick up in another post.

    The inclusion of SPARQL as a specified standard in the principles bothers me more than RDF. For example, the BBC publish ‘Linked Data’ as far as I understand it on a variety of pages, but they don’t provide a SPARQL endpoint (although the data is replicated in a Talis Platform store which does). I also feel it leads to the impression that the only way to interact with linked data is via SPARQL – whereas in terms of mashups, the benefit (I think) should come from shared identifiers rather than anything else.

  3. Hi Owen, this is a very helpful addition to the Linked Data debate. I’ve also had a go at summarising the arguments over on my CETIS blog. In terms of TBL’s Linked Data principles I think it’s worth pointing out that they are a “personal note” and are not fully endorsed by W3C so there has to be some flexibility in how one interprets them. I also think it’s worth remembering the clause TBL added after these four principles: “I’ll refer to the steps above as rules, but they are expectations of behaviour. Breaking them does not destroy anything, but misses an opportunity to make data interconnected.”

    Which kind of reminds me of the pirate code: “…the code is more what you’d call “guidelines” than actual rules…” 😉

  4. Hi Stephen owl:sameAs OwenStephen

    Thanks for allowing me to think some of these issues through here…

    One of the things that distinguishes the Linked Data technology stack from earlier Semantic Web approaches is the de-emphasis of the more rarified aspects of ontologies. What you use in LD is probably more accurately called vocabularies.

    In practice, it’s the resilience in the face of abuse and different uses of such vocabs that convinces me most of RDF:
    you use a different URI for the same thing? Fine, I’ll use owl:sameAs (or skos:similarTo)
    I want to say something similar to what you’re saying, but be more specific? No problem, I’ll define my stuff as subclasses and subproperties from the ones you used
    You use the same term for something slightly different? Not a big deal, I’ll just treat your graphs accordingly
    You use some wildly different and convoluted vocabulary from me? If it’s worth it, I’ll just extract the info in my own terms using SPARQL’s CONSTRUCT
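    (To sketch just the owl:sameAs case in plain Python, with hypothetical URIs – the point being that nothing breaks, you simply rewrite one URI to its equivalent before querying:)

```python
# Toy 'smushing' of owl:sameAs: before querying, rewrite every
# occurrence of one URI to its canonical equivalent.
# All URIs here are hypothetical.
canonical = {"ex:JaneAusten2": "ex:JaneAusten1"}  # from an owl:sameAs triple

triples = {
    ("ex:prideAndPrejudice", "dc:creator", "ex:JaneAusten1"),
    ("ex:emma",              "dc:creator", "ex:JaneAusten2"),
}

smushed = {tuple(canonical.get(x, x) for x in t) for t in triples}

# Both books now share a single author node.
authors = {o for (s, p, o) in smushed if p == "dc:creator"}
print(authors)  # {'ex:JaneAusten1'}
```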

    Unlike some other infrastructures, nothing breaks catastrophically; it’s just a matter of balancing effort with optimisation.

    I think the BBC’s case illustrates perfectly where SPARQL is useful; if I want to mesh up the BBC’s data with other data, a SPARQL endpoint at the BBC is fairly useless. That just gives me answers about the BBC’s own data. If I want to meshup without constraints, I bring my own SPARQL endpoint and let that loose on their, and other people’s triples.

    I could, of course, use other technologies to meshup data-with-URIs, but why would I? I can’t see any that are as mature, give as much choice, are as widely implemented/implementable or are as vendor independent as RDF + SPARQL (though oData is one to watch)

    Lastly, with regard to the label thing, I think that it’s a simple categorisation issue, not a value judgment. A canal and a motorway are both worthy and useful pieces of transportation infrastructure, but recognising that fact doesn’t entail calling a motorway a canal. By analogy, there are many forms of Machine Readable Data infrastructures, of which Linked Data is just one.

  5. @Wilbert – thanks for the further thoughts – very useful.

    I’ve read your post on using your own SPARQL endpoint to query other people’s RDF (http://blogs.cetis.ac.uk/wilbert/2010/03/12/meshing-up-a-jisc-e-learning-project-timeline-or-its-linked-data-on-the-web-stupid/), and it definitely started to feel like this was a more useful application of SPARQL than I’d seen before. I’m still not entirely clear – if I install my own SPARQL endpoint, does that mean I can query any data that is available in RDF, or only that in a triple store?

    I definitely agree with your comments about owl:sameAs (I hadn’t realised there was a skos:similarTo but again, I definitely see the use). I’m less convinced about the statement “You use the same term for something slightly different? Not a big deal, I’ll just treat your graphs accordingly” – I’m not convinced this is as easy as you make it sound, and it implies human intervention, which I think starts to decrease the possible gains of using linked data. I probably need another post to expand on these concerns more fully.
