Linked Data and Libraries: Keynote

Today I’m at the second Linked Data and Libraries meeting, organised by Talis.

Lynne Brindley is kicking off the day…
She notes that the broad agenda is about getting access to information, linking across domains etc. The British Library sees the potential of a Linked Data approach to increase the use of its catalogue, and so of its collections – bringing better discovery and therefore utility to researchers, and exploiting the legacy of data that has been created over long periods of time.

The British Library is ‘up for it’ – but needs to look at the costs and benefits, and may need to convince sceptics. The BL has a history of taking innovative steps, though – it introduced public online information retrieval systems to the UK around 40 years ago (MEDLARS in the 1970s), and ten years later the UK was one of the first countries to publish its National Bibliography on CD-ROM (now ‘museum pieces’! says Lynne).

And now the BL is exposing the national bibliography as linked open data… but first, some history:

The BL is involved in UK PubMed Central (UKPMC) – a repository of papers, patents, reports and more, containing many data types from many organisations. It provides better access to hard-to-find reports, theses etc. For Lynne this is also about ‘linking’, even if it is not built on the “Linked Data” technology stack – she sees it as part and parcel of the same thing, and a movement in the direction of linking materials and collections.

Also ‘sounds’ – the UK Sound Map http://sounds.bl.uk/uksoundmap/index.aspx – linked across domains, and also involved the public in capturing the ‘sounds of Britain’ via AudioBoo, with metadata added and the results mashed up with Google Maps…

The ‘Evolving English’ exhibition had a ‘map your voice’ element – many people recorded the same piece of material, and this has been incorporated into a research database of linguistic recordings – global collaboration and massive participation.

Lynne says it is pretty difficult to do multi-national, multi-organisational stuff – and we should learn from these examples.

The BL Catalogue is the primary tool to access, order and manage the BL collections. The BL has long operated a priced service through which catalogue records are sold to others, in various forms. Despite pressure from Government to earn money, the BL decided to take the step of offering BNB records as RDF/XML under CC0. Today it will be announcing a linked data version of the BNB – more later today from Neil Wilson.

The hope is that the data will get used in a wide variety of ways. The key lesson for the BL, says Lynne, is to ‘relinquish control, let go’ – however you think people are going to use what you put out there, they will use it in a different way.

The promise of linked data offers many benefits across sectors and across ‘memory institutions’. But the institutions involved will need to face cultural change to achieve this. ‘Curators’ in any context (libraries, archives, museums) are used to their ‘vested authority’ – and we need to recognise this at the same time as ‘letting go’. From the library point of view, no-one can afford to stand on the sidelines – we need to get in there and experiment.

We need to get out of our institutional and metadata silos and take a journey to the ‘mainstream future’. Partnerships are important – and everyone wants to ‘partner’ with the British Library – but often the proposed partnerships are one-sided; we need to look for win-win partnerships with institutions like the BL.

Final message – we are good at talking, but we need to ‘just do it’ – do it, show the benefits, and convince people.

What’s so hard about Linked Data?

My post on Linked Data from a couple of weeks ago got some good comments and definitely helped me explore my own understanding of the area. The four principles of Linked Data as laid out by Tim Berners-Lee seem relatively straightforward, and although there are some things you need to get your head around (some terminology, some concepts), the basic principles don’t seem that difficult.
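
As a minimal illustration of those principles in action – HTTP URIs naming things, useful RDF returned when you look them up, and links out to other URIs – here is a sketch using Python’s rdflib library. The example.org URIs are hypothetical (the DBpedia URI is a real one):

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import FOAF, RDF

# Principles 1 and 2: HTTP URIs as names for things (hypothetical namespace)
EX = Namespace("http://example.org/id/")

g = Graph()
g.bind("foaf", FOAF)

# Principle 3: when the URI is looked up, provide useful information (RDF)
g.add((EX.person1, RDF.type, FOAF.Person))
g.add((EX.person1, FOAF.name, Literal("Example Author")))

# Principle 4: link out to other people's URIs so more can be discovered
g.add((EX.person1, FOAF.knows,
       URIRef("http://dbpedia.org/resource/Tim_Berners-Lee")))

print(g.serialize(format="turtle"))
```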

So what is difficult about Linked Data (and what is not)?

Data Modelling

Data Modelling is “a method used to define and analyze data requirements needed to support the business processes of an organization”. The problem is that the real world is messy, and describing it in a way that can be manipulated by computers is always problematic.

Basically, data modelling is difficult. This is probably true of any sector, but anyone working in libraries who has looked at how we represent bibliographic and related data, and library processes, in our systems will know it gets complicated extremely quickly. With library data you can easily get bogged down in philosophical questions (what is a book? how do you represent an ‘idea’?).

This is not a problem unique to Linked Data – modelling is hard however you approach it – but my suspicion is that a Linked Data approach brings these questions to the fore. I’m not entirely sure about this, but my guess is that if you store your data in a relational database, the model lives much more in the software you build on top of the database than in the database structure itself. With Linked Data I think there is a tendency to try to express more of the model in the data structure itself (because you can?), leaving fewer of the modelling decisions to the software implementation.

If I’m right about this, it means Linked Data forces you to think more carefully about the data model at a much earlier point in the process of designing and developing systems. It also means that anyone interacting with your Linked Data (consumers) needs to understand not just your data, but also your model – which can be challenging.
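
To make that contrast concrete, here is a rough sketch in Python (using rdflib; all the names and URIs are made up for illustration) of the same facts held both ways:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DCTERMS, FOAF

# Relational style: the 'model' is implicit - only the application code
# knows what the columns mean and how the rows relate to each other.
book_row = {"id": 42, "title": "Middlemarch", "author_id": 7}
author_row = {"id": 7, "name": "George Eliot"}

# Linked Data style: the relationships are named in the data itself.
EX = Namespace("http://example.org/")  # hypothetical namespace

g = Graph()
book, author = EX["book/42"], EX["person/7"]
g.add((book, DCTERMS.title, Literal("Middlemarch")))
g.add((book, DCTERMS.creator, author))  # the link carries its own meaning
g.add((author, FOAF.name, Literal("George Eliot")))
```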

I’d recommend having a look at the various presentations/articles/posts by those involved in implementing Linked Data for parts of the BBC website, e.g. this presentation on How the BBC make Websites from IWMW2009.

Also to see (or contribute to) the thought processes behind building a Linked Data model, have a look at this work in progress on modelling Science Museum data/collections by Mia Ridge.

Reuse

One of the key concepts of Linked Data is that you don’t invent new identifiers, models and vocabularies if someone else has already done it. This sounds great, and is one of the promises that open Linked Data brings – if the BBC have already established an ‘identifier’ for the common Kingfisher species, then I shouldn’t need to do this again. Similarly, if someone else has already established a Linked Data vocabulary for describing people, and I want to describe a person, I can simply use this existing vocabulary. More than this, I can mix and match existing elements in new models – so if I want to describe books about wildlife, and their authors, I can use the BBC wildlife identifiers when I want to show a book is about a particular species, and I can use the FOAF vocabulary (linked above) to describe the authors.
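
As a rough sketch of what that mixing and matching might look like in practice – here in Python with rdflib, with the caveat that the BBC URI and all the example.org names are illustrative rather than real identifiers:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, FOAF, RDF

EX = Namespace("http://example.org/")  # my own (hypothetical) namespace

# Reuse someone else's identifier rather than minting a new one.
# NB: this BBC URI is illustrative - check the real identifier before use.
kingfisher = URIRef("http://www.bbc.co.uk/nature/life/Common_Kingfisher")

g = Graph()
book = EX["book/wildlife-1"]
g.add((book, DCTERMS.subject, kingfisher))  # the book is 'about' the species
author = EX["person/1"]
g.add((author, RDF.type, FOAF.Person))      # reuse FOAF, not a new vocabulary
g.add((author, FOAF.name, Literal("A. N. Author")))
g.add((book, DCTERMS.creator, author))
```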

Given that I’ve said modelling data is difficult, the idea that someone else may have done the hard work for you, and that you can simply pick up their model, is very appealing. Unfortunately I think that reuse is actually much more difficult than it sounds.

First you’ve got to find the existing identifier/vocabulary, then you’ve got to decide whether it does what you need it to do, and you may have to make some judgements about the provenance and long-term prospects of the things you are going to reuse. If you use the BBC URI for Kingfishers, are you sure they are talking about the same thing you are? If so, how much do you trust that URI to be there in a year? In 5 years? In 10 years? (My books are highly likely to be around for 10 years.)
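
One partial sanity check is simply to dereference the URI and look at what the publisher actually asserts about it. A minimal sketch, assuming the publisher serves RDF at that URI (the URI itself is illustrative):

```python
from rdflib import Graph, URIRef

# Illustrative URI - substitute the identifier you are actually considering
uri = URIRef("http://www.bbc.co.uk/nature/life/Common_Kingfisher")

g = Graph()
g.parse(uri)  # dereference it; assumes RDF is served for this URI

# What does the publisher actually say this URI denotes?
for predicate, obj in g.predicate_objects(subject=uri):
    print(predicate, obj)
```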

I would guess reuse will get easier as Linked Data becomes more established (assuming it does). The recently established Schemapedia looks like a good starting point for discovering existing vocabularies you may be able to reuse, while Sameas.org is a good place to find existing Linked Data identifiers. This is also an area where communities of practice are going to be very important. For libraries it isn’t too hard to imagine a collaborative effort to establish common Linked Data identifiers for authors (VIAF as Linked Data looks like it could well be a viable starting point for this).
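
Once such a community identifier exists, aligning your own URI with it is a single owl:sameAs statement – a minimal sketch, with a placeholder VIAF ID rather than a real lookup:

```python
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import OWL

EX = Namespace("http://example.org/")  # hypothetical local namespace

g = Graph()
# The VIAF ID below is a placeholder - look the real one up on viaf.org
g.add((EX["person/1"], OWL.sameAs, URIRef("http://viaf.org/viaf/00000000")))
print(g.serialize(format="turtle"))
```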

RDF and SPARQL

In my previous post I questioned the mention of RDF and SPARQL as part of the Linked Data principles. I don’t particularly have an issue with RDF and SPARQL as such – but my perception is that others do. Recently Mike Ellis laid down a challenge to the Linked Data community in which he says “How I should do this [publish linked data], and easily. If you need to use the word “ontology” or “triple” or make me understand the deepest horrors of RDF, consider your approach a failed approach”, which suggests that RDF is difficult, or at the least, complicated.

I’m not going to defend RDF as uncomplicated, but I do think it is an area of Linked Data that attracts some bad press, which is probably unwarranted. My argument is that RDF isn’t the difficult bit – it’s the data modelling that gets represented in RDF that is difficult. This is echoed by a comment in the Nodalities article from Tom Scott and Michael Smethurst of the BBC:

The trick here isn’t the RDF mapping – it’s having a well thought through and well expressed domain model. And if you’re serious about building web sites that’s something you need anyway. Using this ontology we began to add RDF views to /programmes (e.g. www.bbc.co.uk/programmes/b00f91wz.rdf). Again the work needed was minimal.

So for those considering the Linked Data approach we’d say that 95% of the work is work you should be doing just to build for the (non-semantic) web. Get the fundamentals right and the leap to the Semantic Web is really more of a hop.
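
It is worth noting how little code a consumer needs to pick those views up – a minimal sketch with Python’s rdflib, assuming the /programmes URL quoted above still resolves and serves RDF/XML:

```python
from rdflib import Graph

# URL taken from the quote above; assumes it still resolves as RDF/XML
url = "http://www.bbc.co.uk/programmes/b00f91wz.rdf"

g = Graph()
g.parse(url, format="xml")

print(len(g), "triples")
for s, p, o in list(g)[:10]:  # peek at the first few statements
    print(s, p, o)
```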

I do think that we are still lacking anything close to decent consumer-facing tools for interacting with RDF (although I’d be really happy to be shown some good examples). When I played around with an RDF representation of characters from Middlemarch, I authored the RDF by hand, having failed to find an authoring tool I could use. I found a few more tools that were OK for visualising and exploring the data I created – but to be honest, most of these seemed buggy or flaky in some way.
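
For what it’s worth, authoring by hand doesn’t have to mean raw RDF/XML – a library will handle the serialisation for you. A tiny sketch of one alternative, using rdflib (the namespace is hypothetical and the character ‘modelling’ purely illustrative):

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import FOAF, RDF

MM = Namespace("http://example.org/middlemarch/")  # hypothetical namespace

g = Graph()
g.bind("foaf", FOAF)
for name in ("Dorothea Brooke", "Tertius Lydgate"):
    character = MM[name.replace(" ", "_")]
    g.add((character, RDF.type, FOAF.Person))
    g.add((character, FOAF.name, Literal(name)))

# An illustrative relationship, not serious literary modelling
g.add((MM.Dorothea_Brooke, FOAF.knows, MM.Tertius_Lydgate))

print(g.serialize(format="turtle"))
```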

I have to admit that I haven’t got my head around SPARQL in any meaningful way yet, and I’m not convinced it deserves the prominence it seems to be getting in the Linked Data world. SPARQL is a language for querying RDF, and as such is clearly going to be an essential tool for those using and manipulating RDF. However, you could say the same about SQL (a language for querying data stored as tables with rows and columns) in relation to traditional databases – but most people neither know, nor care, about SQL.
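
To make the comparison concrete, here is what a ‘SELECT name FROM people’ sort of query looks like in SPARQL – a minimal sketch run against an in-memory rdflib graph, with made-up data:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import FOAF, RDF

EX = Namespace("http://example.org/")  # hypothetical
g = Graph()
g.add((EX.author1, RDF.type, FOAF.Person))
g.add((EX.author1, FOAF.name, Literal("George Eliot")))

# Roughly the SPARQL analogue of: SELECT name FROM people
query = """
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?name WHERE {
        ?person a foaf:Person ;
                foaf:name ?name .
    }
"""
for row in g.query(query):
    print(row.name)
```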

Tony Hirst often mentions how widespread the use of spreadsheets to store tabular data is, and that this enables basic data manipulation to happen on the desktop. Many people are comfortable representing sets of data as tables, and I suspect this is embedded strongly in our culture. It may be that we will see tools that start to bridge this divide – I was very impressed by the demonstration videos of the Gridworks tool posted on the Freebase blog recently, and I’m really looking forward to playing with it when it is made publicly available.

Conclusion

I’m not sure I have a strong conclusion – sorry! What I am aware of is a shift in my thinking. I used to think the technical aspects of Linked Data were the hard bits – RDF, SPARQL, and a whole load of stuff I haven’t mentioned. While there is no doubt that these things are complicated, I now believe the really difficult bits are the modelling and reuse aspects. I also think there is an overlap here with the areas where domain experts need to understand ‘computing’ concepts and computing experts need to understand the domain – and this kind of crossover is always difficult.