My post on Linked Data from a couple of weeks ago got some good comments and definitely helped me in exploring my own understanding of the area. The 4 Principles of Linked Data as laid out by Tim Berners-Lee seem relatively straightforward, and although there are some things that you need to get your head around (some terminology, some concepts) the basic principles don’t seem that difficult.
So what is difficult about Linked Data (and what is not)?
Data Modelling

Data Modelling is “a method used to define and analyze data requirements needed to support the business processes of an organization”. The problem is that the real world is messy, and describing it in a way that can be manipulated by computers is always problematic.
Basically, data modelling is difficult. This is probably true of any sector, but anyone working in libraries who has looked at how we represent bibliographic and related data, and library processes, in our systems will know it gets complicated extremely quickly. With library data you can easily get bogged down in philosophical questions (what is a book? how do you represent an ‘idea’?).
This is not a problem unique to Linked Data – modelling is hard however you approach it, but my suspicion is that using a Linked Data approach brings these questions to the fore. I’m not entirely sure about this, but my guess is that if you store your data in a relational database, the model is much more in the software that you build on top of this than in the database structure. With Linked Data I think there is a tendency to try to build better models in the inherent data structure (because you can?), leaving less of the modelling decisions to the software implementation.
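To make the contrast concrete, here is a minimal sketch in Python. All of the identifiers and values below are made up for illustration, except the Dublin Core and FOAF predicate URIs, which are real vocabulary terms:

```python
# Hypothetical example: the same fact ("Middlemarch was written by
# George Eliot") stored two ways.

# Relational style: the meaning of each column lives in the software
# you build on top of the database.
book_row = {"id": 42, "title": "Middlemarch", "author": "George Eliot"}

# Linked Data style: each statement is a (subject, predicate, object)
# triple, and the predicates are themselves identifiers (URIs), so more
# of the model travels with the data itself.
triples = [
    ("http://example.org/book/42", "http://purl.org/dc/terms/title", "Middlemarch"),
    ("http://example.org/book/42", "http://purl.org/dc/terms/creator", "http://example.org/person/eliot"),
    ("http://example.org/person/eliot", "http://xmlns.com/foaf/0.1/name", "George Eliot"),
]
```

In the relational version, nothing in the data says what “author” means; in the triple version, the predicate URI is a globally meaningful statement about the relationship.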
If I’m right about this, it means Linked Data forces you to think more carefully about the data model at a much earlier point in the process of designing and developing systems. It also means that anyone interacting with your Linked Data (consumers) needs to understand not just your data, but also your model – which can be challenging.
I’d recommend having a look at various presentations/articles/posts by those involved in implementing Linked Data for parts of the BBC website, e.g. this presentation on How the BBC make Websites from IWMW2009.
Also to see (or contribute to) the thought processes behind building a Linked Data model, have a look at this work in progress on modelling Science Museum data/collections by Mia Ridge.
Reuse

One of the concepts with Linked Data is that you don’t invent new identifiers, models and vocabularies if someone else has already done it. This sounds great, and is one of the promises that open Linked Data brings – if the BBC have already established an ‘identifier’ for the common Kingfisher species, then I shouldn’t need to do this again. Similarly if someone else has already established a Linked Data vocabulary for describing people, and I want to describe a person, I can simply use this existing vocabulary. More than this, I can mix and match existing elements in new models – so if I want to describe books about wildlife, and their authors, I can use the BBC wildlife identifiers when I want to show a book is about a particular species, and I can use the FOAF vocabulary (linked above) to describe the authors.
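As a sketch of what this mixing looks like, here is a hypothetical wildlife book described with a mixture of identifiers. The Dublin Core and FOAF namespaces are real; the book, author and species URIs below are invented for illustration (the BBC Wildlife Finder used a similar URI pattern, but treat the exact URI as an assumption):

```python
# Real, widely used vocabulary namespaces.
DC = "http://purl.org/dc/terms/"
FOAF = "http://xmlns.com/foaf/0.1/"

# Invented identifiers for this example.
book = "http://example.org/books/kingfisher-book"
author = "http://example.org/people/jane-doe"
kingfisher = "http://www.bbc.co.uk/nature/species/Common_Kingfisher"  # illustrative BBC-style URI

triples = [
    (book, DC + "title", "A Book about Kingfishers"),
    (book, DC + "subject", kingfisher),      # reuse the BBC identifier for the species
    (book, DC + "creator", author),
    (author, FOAF + "name", "Jane Doe"),     # reuse FOAF to describe the person
]
```

The point is that nothing in this description had to be invented from scratch except the identifiers for the things that are genuinely mine: the book and its author.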
This all sounds great – and given that I’ve said modelling data is difficult, the idea that someone else may have done the hard work for you and you can just pick up their model sounds great. Unfortunately I think that reuse is actually much more difficult than it sounds.
First you’ve got to find the existing identifier/vocabulary, then you’ve got to decide if it does what you need it to do, and you may have to make some judgements about the provenance and long-term prospects of those things you are going to reuse. If you use the BBC URI for Kingfishers, are you sure they are talking about the same thing you are? If so, how much do you trust that URI to be there in a year? In 5 years? In 10 years? (my books are highly likely to be around for 10 years).
I would guess reuse will get easier as Linked Data becomes more established (assuming it does). The recently established Schemapedia looks like a good starting point for discovering existing vocabularies you may be able to reuse, while Sameas.org is a good place to find existing Linked Data identifiers. This is also an area where communities of practice are going to be very important. For libraries it isn’t too hard to imagine a collaborative effort to establish common Linked Data identifiers for authors (VIAF as Linked Data looks like it could well be a viable starting point for this).
RDF and SPARQL
In my previous post I questioned the mention of RDF and SPARQL as part of the Linked Data principles. I don’t particularly have an issue with RDF and SPARQL as such – my perception, however, is that others do. Recently Mike Ellis laid down a challenge to the Linked Data community in which he says “How I should do this [publish linked data], and easily. If you need to use the word “ontology” or “triple” or make me understand the deepest horrors of RDF, consider your approach a failed approach”, which suggests that RDF is difficult, or at the least, complicated.
I’m not going to defend RDF as uncomplicated, but I do think it is an area of Linked Data that attracts some bad press, which is probably unwarranted. My argument is that RDF isn’t the difficult bit – it’s the data modelling that gets represented in RDF that is difficult. This is echoed by the comment in the Nodalities article from Tom Scott and Michael Smethurst from the BBC:
The trick here isn’t the RDF mapping – it’s having a well thought through and well expressed domain model. And if you’re serious about building web sites that’s something you need anyway. Using this ontology we began to add RDF views to /programmes (e.g. www.bbc.co.uk/programmes/b00f91wz.rdf). Again the work needed was minimal.
So for those considering the Linked Data approach we’d say that 95% of the work is work you should be doing just to build for the (non-semantic) web. Get the fundamentals right and the leap to the Semantic Web is really more of a hop.
I do think that we are still lacking anything close to decent consumer-facing tools for interacting with RDF (although I’d be really happy to be shown some good examples). When I played around with an RDF representation of characters from Middlemarch I authored the RDF by hand, having failed to find an authoring tool I could use. I found a few more tools that were OK to use for visualising and exploring the data I created – but to be honest most of these seemed buggy or flaky in some way.
I have to admit that I haven’t got my head around SPARQL in any meaningful way yet, and I’m not convinced it deserves the prominence it seems to be currently getting in the Linked Data world. SPARQL is a language for querying RDF, and as such is clearly going to be an essential tool for those using and manipulating RDF. However, you could say the same about SQL (a language for querying data stored as tables with rows and columns) in relation to traditional databases – but most people neither know, nor care, about SQL.
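The essence of what SPARQL does – matching patterns with variables against triples – can be sketched in a few lines of ordinary Python. The data and prefixed names below are invented for illustration, and this toy matcher handles only a single triple pattern (real SPARQL does far more):

```python
# Toy data: abbreviated, made-up identifiers standing in for full URIs.
triples = [
    ("book:1", "dc:title", "Middlemarch"),
    ("book:1", "dc:creator", "person:eliot"),
    ("book:2", "dc:creator", "person:eliot"),
    ("person:eliot", "foaf:name", "George Eliot"),
]

def match(pattern, data):
    """Match one (s, p, o) pattern against triples; '?'-prefixed terms are variables."""
    results = []
    for triple in data:
        bindings = {}
        ok = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                bindings[term] = value   # variable: bind it to this value
            elif term != value:
                ok = False               # constant: must match exactly
                break
        if ok:
            results.append(bindings)
    return results

# Roughly equivalent to: SELECT ?book WHERE { ?book dc:creator person:eliot }
books = [b["?book"] for b in match(("?book", "dc:creator", "person:eliot"), triples)]
# books == ["book:1", "book:2"]
```

Seen this way, a basic SPARQL query is no more mysterious than a SQL SELECT with a WHERE clause – which is the point of the analogy: it’s infrastructure, not something most end users should ever need to see.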
Tony Hirst often mentions how widespread the use of spreadsheets to store tabular data is, and that this enables basic data manipulation to happen on the desktop. Many people are comfortable with representing sets of data as tables – and I suspect this is embedded strongly in our culture. It may be that we will see tools that start to bridge this divide – I was very impressed by the demonstration videos of the Gridworks tool posted on the Freebase blog recently, and I’m really looking forward to playing with it when it is made publicly available.
I’m not sure I have a strong conclusion – sorry! What I am aware of is a shift in my thinking. I used to think the technical aspects of Linked Data were the hard bits – RDF, SPARQL, and a whole load of stuff I haven’t mentioned. While there is no doubt that these things are complicated and complex, I now believe the really difficult bits are the modelling and reuse aspects. I also think that there is an overlap here with the areas where domain experts need to have an understanding of ‘computing’ concepts, and computing experts need to understand the domain – and this kind of crossover is always difficult.
3 thoughts on “What’s so hard about Linked Data?”
When I had a play with RDF / OWL I used Protégé http://protege.stanford.edu/ and found it very good.
We’re hopefully about to begin some work with Talis on linked data at the OU – focussing on the data and domain modelling stuff.
Thanks for making linked data clearer.
With regard to vocabulary control, I just wonder if the issue of users searching via different vocabulary to find the same thing is resolved using linked data? e.g. one internet user might search via “finance”, another via “money” and a third via “cash”, all looking for the same end result. Would this side of things be addressed by the search tools, rather than the way the data is linked? Does the vocabulary control only cover the data structure and not the data you put into it?
The answer to your last question is yes – the vocabularies I’ve talked about here control the data structure, not the data (generally) – the terminology is probably not quite as clear as it could be. The documentation around this seems to use ‘vocabulary’ and ‘schema’ interchangeably, but although I’ve used vocabulary in this post I think schema is a better description to be honest.
There is a lot more to go into on this – and having started an answer here, I decided it deserved its own post – I’ll try to write this soon.