Linked Data for Libraries: Publishing and Using Linked Data

Today I’m speaking at the “Linked Data for Libraries” event organised and hosted by the Library Association of Ireland, Cataloguing & Metadata Group; the Digital Repository of Ireland; and Trinity College Library, Dublin. In the presentation I cover some of the key issues around publishing and consuming linked data, using library based examples.

My slides plus speaker notes are available for download – Publishing and Using Linked Data (pdf).

Experimenting with British Museum data

[UPDATE 2014-11-20: The British Museum data model has changed quite a bit since I wrote this. While there is still useful stuff in this post, the detail of any particular query or comment may well now be outdated. I’ve added some updates in square brackets in some cases]

In September 2011 the British Museum started publishing descriptions of items in its collections as RDF (the data structure that underlies Linked Data). The data is available from http://collection.britishmuseum.org/ where the Museum have made a ‘SPARQL endpoint’ available. SPARQL is a query language for extracting data from RDF stores – it can be seen as a parallel to SQL, which is a query language for extracting data from traditional relational databases.

Although I knew what SPARQL was, and what it looked like, I really hadn’t got to grips with it, and since I’d just recently purchased “Learning SPARQL” it seemed like a good opportunity to get familiar with the British Museum data and SPARQL syntax. So I had a play (more below). Skip forward a few months, and I noticed some tweets from a JISC meeting about the Pelagios project (which is interested in the creation of linked (geo)data to describe ‘ancient places’), and in particular from Mia Ridge and Alex Dutton, which indicated they were experimenting with the British Museum data. Their experience seemed to gel with mine, and prompted me to finally get on with a blog post documenting it, so that hopefully others can benefit.

Perhaps one reason I’ve been a bit reluctant to blog this is that I struggled with the data, and I don’t want this post to come across as overly critical of the British Museum. The fact they have their data out there at all is amazing – and I hope other museums (and archives and libraries) follow the lead of the British Museum in releasing data onto the web. So I hope that all comments/criticisms below come across as offering suggestions for improving the Museum data on the web (and offering pointers to others doing similar projects), and of course the opportunity for some dialogue about the issues. There is also no doubt that some of the issues I encountered were down to my own ignorance/stupidity – so feel free to point out obvious errors.

When you arrive at the British Museum SPARQL endpoint the nice thing is there is a pre-populated query that you can run immediately. It just retrieves 10 results, of any type, from the data – but it means you aren’t staring at a blank form, and those ten results give a starting point for exploring the data set. Most URIs in the resulting data are clickable, and give you a nice way of finding out what data is in the store and starting to get a feel for how it is structured.

For example, running the default search now brings back the triple:

Subject http://collection.britishmuseum.org/id/object/EAF119772
Predicate http://collection.britishmuseum.org/id/crm/P3F.has_note
Object Object type :: marriage equipment ::

 

Which is intriguing enough to make you want to know more (I am married, and have to admit I don’t remember any special equipment). Clicking on the URI http://collection.britishmuseum.org/id/object/EAF119772 in a browser takes you to an HTML representation of the resource – a list of all the triples that make statements about the item in the British Museum identified by that URI.
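You can pull back the same set of triples directly from the SPARQL endpoint, rather than via the HTML page, with a DESCRIBE query – assuming the endpoint supports the DESCRIBE form of query (I haven’t verified that it does):

DESCRIBE <http://collection.britishmuseum.org/id/object/EAF119772>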

While I think it would be an exaggeration to say this is ‘easily readable’, sometimes, as with the triple above, there is enough information to guess the basics of what is being said – for example:

Subject http://collection.britishmuseum.org/id/object/EAF119772
Predicate http://collection.britishmuseum.org/id/crm/P3F.has_note
Object Acquisition date :: 1994 ::

 

From this it is perhaps easy enough to see that there is some item (identified by the URI http://collection.britishmuseum.org/id/object/EAF119772) which has a note related to it stating that it was acquired (presumably by the museum) in 1994.
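If you want to pull back all of the notes attached to that object in one go, rather than browsing the HTML view, a query along these lines should do it (a sketch reusing the URIs shown above – though, as the 2014 update warns, predicate names may since have changed):

SELECT ?note WHERE
{
	<http://collection.britishmuseum.org/id/object/EAF119772> <http://collection.britishmuseum.org/id/crm/P3F.has_note> ?note
}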

So far, so good. I’d got an idea of the kind of information that might be in the database. So the next question I had was “what kind of queries could I throw at the data that might produce some interesting/useful results?” Since I’d recently been playing around with data about composers I thought it might be interesting to see if the British Museum had any objects that were related to a well-known composer – say Mozart.

This is where I started to hit problems…. In my initial explorations, while some information was obvious, I’d also realised that the data was modelled using something called CIDOC CRM, which is intended to model ‘cultural heritage’ data. With some help from Twitter (including staff at the British Museum) I started to read up on CIDOC CRM – and struggled! Even now I’m not sure I’d say I feel completely on top of it, but I now have a bit of a better understanding. Much of the CIDOC model is based around ‘events’ – things that happened at a certain time/in a certain place. This means that what might seem like a simple piece of information – such as where an item in the museum originates from – often becomes complex.

To give a simple example, the ‘discovery’ of an item is a kind of event. So to find all the items in the British Museum ‘discovered’ in Greenwich you have to first find all the ‘discovery’ events that ‘took place at’ Greenwich, then link these discovery events back to the items they are related to:

An item -> was discovered by a discovery event -> which took place at Greenwich

This adds extra complexity to what might initially seem (naively?) like a simple query. This example was inspired by discussion at the Pelagios event mentioned earlier – the full query is:

SELECT ?greenwichitem WHERE
{
	?s <http://collection.britishmuseum.org/id/crm/P7F.took_place_at> <http://collection.britishmuseum.org/id/thesauri/x34215> .
	?subitem <http://collection.britishmuseum.org/id/crm/bm-extensions/PX.was_discovered_by> ?s .
	?greenwichitem <http://collection.britishmuseum.org/id/crm/P46F.is_composed_of> ?subitem
}

and the results can be seen at http://bit.ly/vojTWq.

[UPDATE 2014-11-20: This query no longer works. The query is now simpler:

PREFIX ecrm: <http://erlangen-crm.org/current/>
SELECT ?greenwichitem WHERE 
{ 
 ?find ecrm:P7_took_place_at <http://collection.britishmuseum.org/id/place/x34215> .
 ?greenwichitem ecrm:P12i_was_present_at ?find
}

END UPDATE]

To make things even more complex, the British Museum data seems to describe all items as being made up of (what I’m calling) ‘sub-items’. In some cases this makes sense: if a single item is actually made up of several pieces, each with its own properties and provenance, it is clearly sensible to describe each part separately.

However, the British Museum data describes even single items as made up of ‘pieces’ – it is just that the single item consists of a single piece – and it is then that piece which has many of the properties of the item attached to it. To illustrate: a multi-piece item is modelled as an object which ‘is composed of’ several pieces, each piece carrying its own properties – which makes sense to me. But a single-piece item is modelled as an object which ‘is composed of’ exactly one piece, and it is that piece that carries most of the properties.

I found (and continue to find) this confusing. It isn’t helped, in my view, by the fact that some properties are attached to the ‘parent’ object and some to the ‘child’ object, and I can’t really work out the logic behind the split. For example, it is the ‘parent’ object that belongs to a department in the British Museum, while it is the ‘child’ object that is made of a specific material. Both the parent and the child are classified as physical objects, and this feels wrong to me.
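To make this concrete, here is the shape of query you end up writing just to get at an object’s material – you have to go via the ‘child’ piece. This is only a sketch: P46F.is_composed_of is the predicate used above, but P45F.consists_of is my guess at the material predicate (based on the CIDOC CRM property P45 ‘consists of’), so check it against the endpoint before relying on it:

SELECT ?object ?material WHERE
{
	?object <http://collection.britishmuseum.org/id/crm/P46F.is_composed_of> ?piece .
	?piece <http://collection.britishmuseum.org/id/crm/P45F.consists_of> ?material
}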

Thankfully a link from the Pelagios meeting alerted me to some more detailed documentation around the British Museum data (http://www.researchspace.org/Stage-2-Outputs), and this suggests that the British Museum are going to move away from this model:

Firstly, after much debate we have concluded that preserving the existing modelling relationship as described earlier whereby each object always consists of at least one part is largely nonsense and should not be preserved.

While arguments were put forward earlier for retaining this minimum one part per object scheme, it has now been decided that only objects which are genuinely composed of multiple parts will be shown as having parts.

The same document notes that the current modelling “may be slightly counter-intuitive” – I can back up this view!

So – back to finding stuff related to Mozart… apart from struggling with the data model, the other issue I encountered was that it is difficult to approach the dataset through anything except a URI for an entity. That is to say, if you know the URI for ‘Wolfgang Amadeus Mozart’ in the museum data set, the query is easy, but if you only know a name, it is much more difficult. How could I find the URI for Mozart, in order to then find all related objects?

Just using SPARQL, there are two approaches that might work. If you know the exact (and I mean exact) form of the name in the data, you can query for a ‘literal’ – i.e. do a SPARQL query for a textual string such as “Mozart, Wolfgang Amadeus”. If this is the exact form used in the data, the query will be successful, but if you get it even slightly wrong you’ll get no results at all. A working example for the British Museum data is:

SELECT * WHERE 
{ 
	?s ?p "Mozart, Wolfgang Amadeus"
}

The second approach is to do a more general query and ‘filter’ the results using a regular expression. Regular expressions are ways of looking for patterns in text strings, and are incredibly powerful (supporting things like wildcards, ignoring case, etc.). So you can be a lot less precise than searching for an exact string: for example, you might retrieve all the statements about ‘people’ and filter for those containing the (case-insensitive) word ‘mozart’. While this would get you Leopold Mozart as well as Wolfgang Amadeus if both are present in the data, there are probably few enough Mozarts that you could pick out W. A. Mozart by eye and get the relevant URI which identifies him.

A possible query of this type is:

SELECT * WHERE 
{ 
	?s <http://xmlns.com/foaf/0.1/Name> ?o 
	FILTER regex(?o, "mozart", "i") 
}

Unfortunately this latter type of ‘filter’ query is pretty inefficient, and the British Museum SPARQL endpoint has some restrictions which mean that if you try to retrieve more than a relatively small amount of data at one time you just get an error. Since retrieving a largish amount of data first and then filtering out the stuff you don’t want is essentially how ‘filter’ queries work, I couldn’t get this working. The issue of only being able to retrieve small sets of data was a frustration with the SPARQL endpoint generally, not helped by the fact that the ‘size’ of result set that caused an error seemed relatively arbitrary – I assume it is something about the overall amount of data retrieved, as it seemed unrelated to the actual number of results. For example, using:

SELECT * WHERE
{
	?s ?p ?o
}

you can retrieve only 123 results before you get an error, while using

SELECT ?s WHERE
{
	?s ?p ?o
}

you can retrieve over 300 results without getting an error.

This limitation is an issue in itself (and the British Museum are by no means alone in having performance issues with an RDF triple store), but it is doubly frustrating that the limit is unclear.
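One partial workaround, at least for simple queries, is to page through results in small chunks using LIMIT and OFFSET, so that each request stays under whatever the limit is – for example:

SELECT ?s WHERE
{
	?s ?p ?o
}
LIMIT 100 OFFSET 100

(Strictly you should add an ORDER BY clause to make the paging deterministic, and this doesn’t help with FILTER queries, which still need to pull back the larger result set before filtering.)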

The difficulty of exploring the British Museum data from a simple textual string became a real frustration as I explored the data. It made me realise that while the Linked Data/RDF approach of using URIs rather than literals is something I understand and agree with, as people all we really know are the textual strings that describe things. So, to make the data more immediately usable, supporting textual searches (e.g. via a Solr index over the literals in the data) might be a good idea.

I got so frustrated that I went looking for ways of compensating. The British Museum data makes extensive use of ‘thesauri’ – lists of terms for describing people, places, times, object types, etc. In theory these thesauri would provide text-string entry points into the data, and I found that one of the relevant thesauri (object types) was available on the Collections Link website (http://www.collectionslink.org.uk/assets/thesaurus/Objintro.htm). Each term in this thesaurus corresponds to a URI in the British Museum data, so I wrote a ScraperWiki script which searches the British Museum data for each term and records the term alongside the URI it identifies. At the same time a conversation with @portableant on twitter alerted me to the fact that the ‘Portable Antiquities‘ site uses a (possibly modified) version of the same thesaurus for classifying objects, so I added in a lookup of the term on that site to start to form connections between the Portable Antiquities data and the British Museum data. This script is available at https://scraperwiki.com/scrapers/british_museum_object_thesaurus/, but comes with some caveats about how up to date the thesaurus on the Collections Link website is, and the possible imperfections of the matching between the thesaurus and the British Museum data.
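The lookup needed for each thesaurus term is essentially the same trick as the exact-string Mozart query above. A sketch of the kind of query involved (with ‘hand axe’ purely as an illustrative term – the string has to match the exact form used in the data):

SELECT ?uri ?p WHERE
{
	?uri ?p "hand axe"
}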

Unfortunately it seems that this ‘object type’ thesaurus is the only one made publicly available (or at least the only one I could find), whereas the people and place thesauri would clearly also be really interesting, and would provide valuable access points into the data. Ideally, though, these lists would be built from the British Museum data directly, rather than being separate documents.

So, finally back to Mozart. I discovered another way into the data – via the really excellent British Museum website, which offers the ability to search the collections through a nice web interface. This is a good search interface, and gives access to the collections – to be honest it already solves problems such as the one I set myself here (finding all objects related to Mozart) – but never mind that now! If you search this interface and find an object, when you view the record for that object you’ll probably be at a URL something like:

http://www.britishmuseum.org/research/search_the_collection_database/search_object_details.aspx?objectid=3378094&partid=1&searchText=mozart&numpages=10&orig=%2fresearch%2fsearch_the_collection_database.aspx&currentPage=1

If you extract the “objectid” (in this case ‘3378094’) from this, you can use it to look up the RDF representation of the same object with a query like:

SELECT * WHERE
{
	?s <http://www.w3.org/2002/07/owl#sameAs> <http://collection.britishmuseum.org/id/codex/3378094>
}

This gives you the URI for the object, which you can then use to find other relevant URIs. So in this case I was able to extract the URI for Wolfgang Amadeus Mozart (http://collection.britishmuseum.org/id/person-institution/39629) and so create a query like:

SELECT ?item WHERE
{
	?s ?p <http://collection.britishmuseum.org/id/person-institution/39629> .
	?item <http://collection.britishmuseum.org/id/crm/P46F.is_composed_of> ?s
}

to find the 9 (as of today) items that are in some way related to Mozart (mostly pictures/engravings of him).

The discussion at the Pelagios meeting identified several ‘anti-patterns’ related to the usability of Linked Data – and some of these jumped out at me as issues when using the British Museum data:

Anti-patterns

  • homepages that don’t say where data can be found
  • not providing info on licences
  • not providing info on RDF syntaxes
  • not providing examples of query construction
  • not providing easy way to get at term lists
  • no html browsing
  • complex data models

The Pelagios wiki has some more information on ‘stumbling blocks’ at http://pelagios.pbworks.com/w/page/48544935/Stumbling%20Blocks, and the group exploring (amongst other things) the British Museum data made notes at http://pelagios.pbworks.com/w/page/48535503/UK%20Cultural%20Heritage. I also know that Dominic Oldman from the British Museum was at the meeting, and was keen to get feedback on how they could improve the data or the way it is made available.

One thing I felt strongly while looking at the British Museum data is that it would have been great to be able to ‘go’ somewhere that others looking at/using the data would also be, to discuss the issues. The British Museum provide an email address for feedback (which I’ve used), but what I wanted to do was ask things like “am I being stupid?” and “has anyone else found this?”. As a result of discussion at the Pelagios meeting, and on twitter, Mia Ridge has set up a wiki page for just such a discussion.

A final thought. The potential of ‘linked data’ is to bring together data from multiple sources, and combine them into something that is more than the sum of its parts. At the moment the British Museum data sits in isolation. How amazing would it be to join up the British Museum ‘people’ records such as http://collection.britishmuseum.org/id/person-institution/39629 with the VIAF (http://viaf.org/viaf/32197206/) or Library of Congress (http://id.loc.gov/authorities/names/n80022788) identifiers for the same person, and start to produce searches and results that build on the best of all this data?
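None of this linking exists in the data yet, but as a sketch of what it could enable: if owl:sameAs links between British Museum person URIs and VIAF were published (the sameAs triple below is hypothetical – it is not in the current data), you could go from a VIAF identifier straight to the related objects, reusing the pattern from the Mozart query above:

SELECT ?item WHERE
{
	?bmPerson <http://www.w3.org/2002/07/owl#sameAs> <http://viaf.org/viaf/32197206/> .
	?s ?p ?bmPerson .
	?item <http://collection.britishmuseum.org/id/crm/P46F.is_composed_of> ?s
}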

Linked Data and Libraries: The record is dead

Rob Styles from Talis…
First of all he says – the record is not dead – it should be – it should have been dead 10 years ago, but it isn’t [comment from the floor – ‘it never will be’]

Rob says that what is missing from the record is relationships – this is the power you want to exploit.

If you think this stuff is complicated try reading the MARC Manuals and AACR2… 🙂 Linked Data is different but not complicated

MARC is really really good – but really really outdated – why ‘245$$a’ and not ‘Title’? Is it because ‘245’ is universally understandable? No! It’s because when computing power and space were expensive they could only afford 3 characters – that’s why you end up with codes like 245 rather than ‘Title’.

If you look at the use of labels and language in linked data you’ll find that natural language is often used [note that this isn’t the case for lots of the library-created ontologies 🙁 ]

What is an ‘identifier’? How are they used in real life? They are links between things and information about things – think of a number in a catalogue – we don’t talk about the identifiers, we just use them when we need to get a thing, or information about the thing (e.g. Argos catalogue)

Call numbers are like URLs ….

Rob’s Law: “If you wish to link to things, use a link”

Rob points up the issues around the way libraries work with subject headings and ‘aboutness’. Libraries talk about ‘concepts’ rather than ‘things’ – a book is about ‘Paris’, not about ‘the concept of Paris’. Says we need to get away from the abstractness of concepts. He sees the use of SKOS as a reasonable thing to do to get stuff out there – but hopes it is a temporary workaround which will get fixed.

MARC fails us by limiting the amount of data that can be recorded – only 999 fields… we need approaches that allow more flexibility, and the ability to extend where necessary.

FRBR – introduces another set of artificial vocabulary – this isn’t how people speak, it isn’t the language they use.

We need to model data and use vocabulary so it can connect to things that people actually talk about – as directly as we realistically can. Rob praises the BL modelling on this front.

How do you get from having a library catalogue, to having a linked data graph… Rob says 3 steps:

  • training – get everyone on the team on the same page, speaking the same language
  • workshop – spend a couple of days away from the day job – looking at what others are doing, identifying what you want to do, what you can do – until you have a scope of something feasible and worthwhile
  • mentoring and oversight – keep team going, make sure they can ask questions, discuss etc.

Q & A:
Mike Taylor from IndexData asks – how many ‘MARC to RDF’ pipelines have been built by people in this room? Four or five in the room.
Rob says – lots of experimentation at this stage… this is good… but not sure if we will see this come together – but production level stuff is different to experimental stuff.

?? argues we shouldn’t drop ‘conceptual’ entities just because we start representing ‘real world’ things
Rob says ‘subject’ is a relationship – this is how it should be represented.
Seems to be agreement between them that conceptual entities are sometimes useful – but that the more specific stuff is generally more useful… [I think – not sure I understood the argument completely]

Linked Data and Libraries: Creating a Linked Data version of the BNB

Neil Wilson from the BL doing this talk.

Government has been pushing to open up data for a while. This has started to change some expectations around publishing of ‘publicly owned’ data.

BL wanted to start to develop an Open Metadata Strategy. They wanted to:

  • Try and break away from library-specific formats and use more cross-domain XML-based standards – but keep serving libraries not engaged in cutting-edge stuff
  • Develop the new formats with communities using the metadata
  • Get some form of attribution while also adopting a licensing model appropriate to the widest re-use of the metadata
  • Adopt a multi-track approach to all this

So first steps were:
Develop the capability to supply metadata using RDF/XML
Worked with a variety of communities and organisations etc…

Current status:
Created a new enquiry point for BL metadata issues
Signed up c400 orgs to the free MARC21 z39.50 service
Worked with JISC, Talis and other linked data implementers on technical challenges, standards and licensing issues
Begun to offer sets of RDF/XML to various projects etc.

Some of the differences between traditional library metadata and Linked Data:
Traditional library metadata uses a self-contained, proprietary, document-based model
Linked Data uses a more dynamic, data-based model to establish relationships between data

By migrating from traditional models libraries could begin to:

  • Integrate their resources in the web
  • Increase visibility, reach new users
  • Offer users a richer resource discovery experience
  • Move from niche, costly specialist technologies and suppliers to more ‘standard’ and widely adopted approaches

BL wanted to offer data allowing useful experimentation and advancing discussion from theory to practice. BNB (British National Bibliography) has lots of advantages – general database of published output – not just ‘BL stuff’; reasonably consistent; good identifiers.

Wanted to undertake the work as an extension of existing activities – wanted to develop local expertise, using standard hardware for conversion. Starting point was Library MARC21 data. Wanted to focus on data issues, not building infrastructure, and also on linking to stuff.

First challenge – how to migrate the metadata:
Staff training in linked data – modelling concepts and increased familiarisation with RDF and XML concepts
Experience working with JISC Open Bibliography Project and others
Feedback on MARC to XML conversion

Incremental approach adopted – with several iterations around the data and the data model.

Wanted to place library data in a wider context and supplement or replace literal values in records. Linked both to library sites:
Dewey Info
LCSH SKOS
VIAF

but also non library sites:
GeoNames

Three main approaches:
Automatic Generation of URIs from elements in records (e.g. DDC)
Matching of text in records with linked data dumps – e.g. personal names to VIAF
Two stage crosswalking [? missed this]

Lots of preprocessing of MARC records before tackling the transform to RDF/XML using XSLT

Can see the data model at http://www.bl.uk/bibliographic/pdfs/british_library_data_model_v1-00.pdf and more information – http://www.bl.uk/bibliographic/datafree.html

Next steps:
Staged release over coming months for books, serials, multi-parts
Monthly updates [I think?]
New data sets being thought about

Lessons learnt…
It is a new way of thinking – legacy data wasn’t designed for this purpose
There are many opinions out there, but few real certainties – your opinion may well be as valid as anyone else’s – especially when it’s your data
Don’t reinvent the wheel – there are tools and experience you can use – start simple and develop in line with evolving staff expertise
Reality check by offering samples for feedback to wider groups
Be prepared for some technical criticism in addition to positive feedback and improve in response
Conversion inevitably identifies hidden data issues – and creates new ones
But better to release an imperfect something than a perfect nothing

There is a steep learning curve – but look for training opportunities for staff and develop skills; Cultivate a culture of enquiry and innovation among staff to widen perspectives on new possibilities

It’s never going to be perfect first time – we expect to make mistakes – have to make sure we learn from there and ensure that everyone benefits from the experience. So if anyone is thinking of undertaking a similar journey – Just do it!

Q: How much of the pipeline will you ‘open source’?
A: Quite a few of the tools are ‘off the shelf’ (not clear if open source or not?). The BL-written utilities could be released in theory – but would need work (they aren’t compiled with open source compilers at the moment) – so it will be looked at…

Linked Data and Libraries: W3C Library Linked Data Group

This talk by Antoine Isaac…
Also see http://www.w3.org/2005/Incubator/lld/wiki
W3C setup a working group on library linked data – coming to an end now. Mission of group = help increase global interoperability of library data on web…

Wanted to see more library data in the Linked Data cloud and also to start to put together the ‘technological bits and pieces’ – things like:

  • Vocabularies/schemas
  • Web services
  • Semantic Web search engines
  • Ontology editors

But – need for a map of the landscape – specifically for library sector…
And need to answer some of the questions librarians or library decision makers have, like:

  • What does it have to do with bibliography?
  • Does it make life better for patrons?
  • Is it practical?
  • etc.

About to report – have got:
Use cases – grouped into 8 topical clusters – including bib data; vocab alignment; citations; digital objects; social; new users

Available data:

  • Datasets
  • Value vocabularies (lists of stuff – like LCSH)
  • Element sets (Ontologies)

See http://ckan.net/group/lld

Finally and most important deliverable:
High level report – intended for a general library audience: decision makers, developers, metadata librarians, etc. Tries to expand on general benefits, issues and recommendations. Includes:

Relevant technologies
Implementation challenges and barriers to adoption

Still got a chance to comment:
http://blogs.ukoln.ac.uk/w3clld/

For the future Antoine says discussions and collaboration should continue – existing groups within libraries or with wider scope – IFLA Semantic Web special interest group; LOD-LAM

Possibility of creating a new W3C Community group…

We need a long-term effort – many of the issues are not technical

Comment from floor – also see http://reddit.com/r/librarylinkeddata

Linked Data and Libraries: OpenAIRE

OpenAIRE is a European funded project … 38 partners

Want to link project data (in institution, CRIS) to publications resulting from those projects. Data sources – Institutional Repositories – using OpenDOAR as authority on this.

Various areas needed vocabularies – some very specific like ‘FP7 Area’ some quite general like ‘Author’ (of paper)

Various issues capturing research output from different domains:

  • Different responsibilities and tasks
  • Different metadata formats
  • Different exchange interfaces and protocols
  • Different levels of granularity

In the CRIS domain:
Covers research process
run by admin depts
broader view on research info
diverse data models – e.g. CERIF(-like) models; DDF-MXD, METIS, PURE – and some internal formats

In OAR domain:
Covers research publications
Uses Dublin Core

Interoperability between CRIS and OAR

Working group within the ‘quadrilateral Knowledge Exchange Initiative’ (involving SURF, JISC, DFG, DEFF) – aiming to increase interoperability between the CRIS and OAR domains – want to define a lightweight metadata exchange format.

…. sorry – suffering my usual post lunch dip and distraction – didn’t get half of what I could have here 🙁

Linked Data and Libraries: LODUM

LODUM is Linked Open Data University of Münster – presented by Carsten Kessler
Started LODUM – it is about providing scientific and educational data as Linked Open Data
Have started linking Library and CRIS… next want to start linking Courses, and then Buildings (and Bus Stops…)

Brasil – from research project
Maps – starting to annotate maps and have descriptions as LD
Biotope data – species etc. from the local region
Interviews – want to annotate recordings and link to transcripts

Library has a central role – it is the hub that provides the publications, and they want to link all the datasets to the relevant publications in the library. Hope in the future there will be pointers from publications (within the text) to the data.

Concrete use case:
Have an Institute of Planetology, which has information about the Moon. To save money they look for areas of the Earth with similar characteristics as ‘reference data’ – hope that this will be something they can provide.

Development work is advancing with 4 student assistants. Can see some data at http://data.uni-muenster.de; Establishing contacts with other universities via http://linkeduniversities.org
Need more funding – only have startup funding at the moment

Linked Data and Libraries: Linked Data OPAC

This session by Phil John – Technical Lead for Prism (was Talis, now Capita). Prism is a ‘next generation’ discovery interface – but built on Linked Data through and through.

Slides available from http://www.slideshare.net/philjohn/linked-library-data-in-the-wild-8593328

Now moving to next phase of development – not going to be just about library catalogue data – but also journal metadata; archives/records (e.g. from the CALM archive system); thesis repositories; rare items and special collections (often not done well in traditional OPACs) … and more – e.g. community information systems.

When populating Prism from MARC21 they do an initial ‘bulk’ conversion, then periodic ‘delta’ files – to keep in sync with the LMS. Borrower and availability data is pulled from the LMS “live” – via a suite of RESTful web services.

Prism is also a Linked Data API… just add .rss to collections, or .rdf/.nt/.ttl/.json to items. This means it is simple to publish RSS feeds of preconfigured searches – e.g. new stock, or new stock in specific subjects etc.

Every HTML page in Prism has data behind it you can get as RDF.

One of the biggest challenges – Extracting data from MARC21 – MARC very rich, but not very linked… Phil fills the screen with #marcmustdie tweets 🙂

But we have to be realistic – tens of millions of MARC21 records exist – so we need to be able to deal with this.
Decided to tackle the problem in small chunks. Created a solution that allows you to build a model iteratively. It also compartmentalises code for different sections – these can communicate but work separately and can be developed separately. Makes it easy to tweak parts of the model.

Feel they have a robust solution that performs well – performance matters because even if it takes only 10 seconds to convert a MARC record, converting several million records would take months.

No matter what MARC21 and AACR2 say – you will see variations in real data.

Have a conversion pipeline:
Parser – reads in MARC21 – fires events as it encounters different parts of the record – it’s very strict with syntax, so insists on valid MARC21
Observer – listens for MARC21 data structures and hands control over to the…
Handler – knows how to convert MARC21 structures and fields into Linked Data

First area they tackled was Format (and duration) – good starting point as it allows you to reason more fully about the record – once you know Format you know what kind of data to expect.

In theory should be quite easy – MARC21 has lots of structured info about format – but in practice there are lots of issues:

  • no code for CD (it’s a 12 cm sound disk that travels at 1.4m/s!)
  • DVD and LaserDisc shared a code for a while
  • Libraries slow to support new formats
  • limited use of 007 in the real world

E.g. places to look for format information:
007
245$$h
300$$a (mixed in with other info)
538$$a

Decided to do the duration at the same time:
306$$a
300$$a (but lots of variation in this field)

Now Phil talking about ‘Title’ – v important, but of course quite tricky…
245 field in MARC may duplicate information from elsewhere
Got lots of help from http://journal.code4lib.org/articles/3832 (with additional work and modification)

Retained a ‘statement of responsibility’ – but mostly for search and display…

Identifiers…
Lots of non-identifier information mixed in with other stuff – e.g. ISBN followed by ‘pbk.’
Many variations in the abbreviations used – have to parse all this stuff, then validate the identifier
Once you have an identifier, you start being able to link to other stuff – which is great.

Author – pseudonyms, variations in names, generally no ‘relator terms’ in 100/700 $$e or $$4 – which would show the nature of the relationship between the person and the work (e.g. ‘author’, ‘illustrator’) – because these are missing they have to parse information out of the 245$$c

… and not just dealing with English records – especially in academic libraries.

Have licensed Library of Congress authority files – which helps… – authority matching requirements were:
Has to be fast – able to parse 2M records in hours not days/months
Has to be accurate

So – they store authorities as RDF but index them in Solr – this gives speed, and for bulk conversions you don’t get HTTP overhead…

Language/Alternate representation – this is a nice ‘high impact’ feature – allows switching between representations – both forms can be searched for – use RDF content language feature – so useful for people using machine readable RDF

Using and Linking to external data sets…
part of the reason for using linked data – but some challenges….

  • what if datasource suffers downtime
  • worse – what if datasource removed permanently?
  • trust
  • can we display it? is it susceptible to vandalism?

Potential solutions (not there yet):

  • Harvest datasets and keep them close to the app
  • if that’s not practical proxy requests using caching proxy – e.g. Squid
  • if using wikipedia and worried about vandalism – put in checks for likely vandalism activity – e.g. many edits in a short time

Want to see
More library data as LOD – especially on the peripheries – authority data, author information, etc.
LMS vendors adopting LOD
LOD replacing MARC21 as standard representation of bibliographic records!

Questions?
Is the process (MARC → RDF) documented?
A: Would like to open source at least some of it… but there are discussions to have internally in Capita – so something to keep an eye on…

Is there a running instance of Prism to play with?
A: Yes – e.g. http://prism.talis.com/bradford/

[UPDATE: See in the comments – Phil suggests http://catalogue.library.manchester.ac.uk/ as one that uses a more up-to-date version of the transform]

Linked Data and Libraries: Report on the LOD LAM Summit

This report from Adrian Stevenson from UKOLN. The summit was held 2-3 June 2011. It brought together 100 people from around the world, with generous funding from the Internet Archive, the National Endowment for the Humanities, and the Alfred P. Sloan Foundation.

Adrian’s slides at http://www.slideshare.net/adrianstevenson/report-on-the-international-linked-open-data-for-libraries-archives-and-museums-summit

Find out more at http://lod-lam.net/summit

85 organisations represented from across libraries, archives, museums.

Summit aimed to be practical and about actionable approaches to publishing Linked Open Data. Looking at Tools, licensing policy/precedents, definitions and use cases.

The meeting ran on an ‘Open Space Technology’ format – somewhere between a formal conference and an unconference – the agenda was created via audience proposals – but there were huge numbers of sessions being proposed/run.

First day was more discursive, second day about action.

Some sessions that ran:
Business Case for LOD; Provenance/Scalability; Crowdsourcing LOD; Preservation of RDF/Vocabulary Maintenance

What next?
Connections made; activities kick-started
Many more LOD LAM events planned – follow #lodlam on Twitter
#lodlam #london had a meeting yesterday; but a bigger event is planned for November – see http://bit.ly/oJ6qsl for more details as they become available

Linked Data and Libraries: Richard Wallis

A year on from the first Talis Linked Data and Libraries meeting – a lot has happened. The W3C has a group on ‘linked data and libraries’; the BL has released a load of records as RDF/XML – a brave decision; Richard went to a meeting in Denmark where there was discussion about trying to persuade publishers to release metadata; Europeana Linked Data is now available; parts of the French National Catalogue; etc. etc.

While it may feel that progress is slow – people are getting out there and ‘just doing it’ as Lynne Brindley just suggested.

Talis – started pioneering in Semantic Web/Linked Data areas. Recently the Library Systems Division has been sold to Capita – allowing it to focus on libraries, while ‘Talis’ goes forward with a focus on linked data/semantic web.

Now Talis is made up of:

Talis Group – core technologies/strategy – run the Talis Platform
Talis Education – applications in academia – offer the Talis Aspire product (for ‘reading lists’ in Universities)
Talis Systems Ltd – Consulting/Training/Evangelism around Linked Data etc.
Kasabi.com – [Linked] Data Marketplace – free hosting at the moment – Community – APIs. Eventually looking to monetise

Enough about Talis …

UK Government still pressing forward with open linked data
BBC have done more with Linked Data – e.g. the World Cup 2010 site was based on Linked Data – delivered more pages and more traffic with fewer staff. BBC already working with the same technology to look at the Olympics 2012 site…

Richard now mentioning Good Relations ontology – adoption by very large commercial organisations.

The Linked Data ‘cloud’ has got larger – more links – but what are these links for?
Links (i.e. http URIs) identify things – and so you can deliver information about the things you link to… Richard says lots of the ‘talk’ is about things like SPARQL endpoints etc., but it should be about identifying things and delivering information about them.

Richard breaking down Linked Data – how RDF describes stuff etc. It allows you to find relationships between things – relationships that machines can parse… [Richard actually said ‘understand’ but I don’t think he is necessarily talking AI here]

Richard stressing that Linked Data is about technology and Open Data about licensing – separate issues which talking about ‘Linked Open Data’ conflates – quotes Paul Walk on this from http://blog.paulwalk.net/2009/11/11/linked-open-semantic/ – but says he (Richard) would talk about the Linked Data web not the Semantic Web (Paul uses latter term)

Richard thinks that Libraries have an advantage in entering the Linked Data world – lots of experience, lots of standards, lots of records. We have described things, whereas generally people just have the things they need to describe.

Already have lots of lists of things – subject headings (lcsh), classifications (dewey), people (authority files)

Are libraries good at describing things… or just books?

Are Libraries applicable for Linked Data? How are we doing? Richard gives a school report – “Could do better”; “Showing promise”

When we look at expressing stuff as linked data we need to ask “Why, and who for?”