Jul 14

This session by Phil John – Technical Lead for Prism (was Talis, now Capita). Prism is a ‘next generation’ discovery interface – but built on Linked Data through and through.

Slides available from http://www.slideshare.net/philjohn/linked-library-data-in-the-wild-8593328

Now moving to next phase of development – not going to be just about library catalogue data – but also journal metadata; archives/records (e.g. from the CALM archive system); thesis repositories; rare items and special collections (often not done well in traditional OPACs) … and more – e.g. community information systems.

When populating Prism from MARC21 – do an initial ‘bulk’ conversion, then periodic ‘delta’ files – to keep in sync with the LMS. Borrower and availability data is pulled from the LMS “live” – via a suite of RESTful web services.
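The bulk-then-delta sync pattern described here could be sketched roughly as follows. The record ids and the shape of a “delta” are my assumptions for illustration, not Prism’s actual file format:

```python
# Rough sketch of applying a periodic 'delta' file on top of a bulk load.
# The (op, id, record) tuple shape is an assumption, not Prism's format.

def apply_delta(store, delta):
    """store: dict of record id -> record; delta: list of (op, id, record)."""
    for op, rec_id, record in delta:
        if op == "upsert":
            store[rec_id] = record      # new or changed record replaces old
        elif op == "delete":
            store.pop(rec_id, None)     # record withdrawn from the LMS
    return store

catalogue = {"b100": {"title": "Old title"}}
apply_delta(catalogue, [("upsert", "b100", {"title": "New title"}),
                        ("delete", "b999", None)])
print(catalogue)  # {'b100': {'title': 'New title'}}
```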

Prism is also a Linked Data API… just add .rss to collections, or .rdf/.nt/.ttl/.json to items. This makes it simple to publish RSS feeds of preconfigured searches – e.g. new stock, or new stock in specific subjects, etc.

Every HTML page in Prism has data behind it you can get as RDF.
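Extension-based content negotiation like this might route requests along these lines – a minimal sketch, with the mapping and function names invented for illustration, not Prism’s code:

```python
# Hypothetical sketch of extension-based content negotiation as described
# above: the same resource is served in several RDF serialisations.

# Map URL extensions to the MIME type of the representation returned.
EXTENSION_TYPES = {
    ".rdf":  "application/rdf+xml",
    ".nt":   "application/n-triples",
    ".ttl":  "text/turtle",
    ".json": "application/json",
    ".rss":  "application/rss+xml",   # collections / saved searches only
}

def negotiate(path):
    """Return (resource_path, mime_type) for a requested URL path."""
    for ext, mime in EXTENSION_TYPES.items():
        if path.endswith(ext):
            return path[: -len(ext)], mime
    return path, "text/html"  # default: the human-readable page

print(negotiate("/items/123.ttl"))  # ('/items/123', 'text/turtle')
```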

One of the biggest challenges – Extracting data from MARC21 – MARC very rich, but not very linked… Phil fills the screen with #marcmustdie tweets :)

But have to be realistic – 10s of millions of MARC21 records exist – so need to be able to deal with this.
Decided to tackle the problem in small chunks. Created a solution that allows you to build a model iteratively. Also compartmentalises code for different sections – these can communicate but work separately and can be developed separately. Makes it easy to tweak parts of the model.

Feel they have a robust solution that performs well – speed matters because even if converting a single MARC record takes only 10 seconds, several million records would take months.

No matter what MARC21 and AACR2 say – you will see variations in real data.

Have a conversion pipeline:

  • Parser – reads in MARC21 and fires events as it encounters different parts of the record; very strict with syntax, so insists on valid MARC21
  • Observer – listens for MARC21 data structures and hands control over to…
  • Handler – knows how to convert MARC21 structures and fields into Linked Data
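The parser/observer/handler split could look something like this – an illustrative sketch where the class names, the event method, and the placeholder triples are mine, not Prism’s:

```python
# Illustrative sketch of the event-driven pipeline described above.
# Class and method names are invented for this example.

class FormatHandler:
    """Knows how to turn specific MARC21 fields into triples."""
    fields = {"007", "245", "300", "538"}

    def handle(self, tag, value):
        # Placeholder triple; a real handler would emit proper predicates.
        return [("?record", "ex:marcField" + tag, value)]

class Observer:
    """Listens for field events and routes them to interested handlers."""
    def __init__(self, handlers):
        self.handlers = handlers
        self.triples = []

    def on_field(self, tag, value):
        for h in self.handlers:
            if tag in h.fields:
                self.triples.extend(h.handle(tag, value))

class Parser:
    """Strict reader: walks a record and fires one event per field."""
    def parse(self, record, observer):
        for tag, value in record:   # record = iterable of (tag, value)
            observer.on_field(tag, value)

obs = Observer([FormatHandler()])
Parser().parse([("007", "sd fsngnnmmned"), ("100", "Smith, J.")], obs)
print(len(obs.triples))  # only the 007 field matched a handler -> 1
```

Because handlers only declare which fields they care about, each part of the model can be developed and tweaked independently – the compartmentalisation described above.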

First area they tackled was Format (and duration) – good starting point as it allows you to reason more fully about the record – once you know Format you know what kind of data to expect.

In theory should be quite easy – MARC21 has lots of structured info about format – but in practice there are lots of issues:

  • no code for CD (it’s a 12 cm sound disc that travels at 1.4 m/s!)
  • DVD and LaserDisc shared a code for a while
  • Libraries slow to support new formats
  • limited use of 007 in the real world

E.g. places to look for format information:

  • 007
  • 245$$h
  • 300$$a (mixed in with other info)
  • 538$$a

Decided to do the duration at the same time:

  • 306$$a
  • 300$$a (but lots of variation in this field)
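The multi-field fallback described above – try 007 first, then the fuzzier fields – might look like this. The simplified record structure, the GMD strings, and the default values are assumptions for illustration, not Prism’s real rules:

```python
import re

# Hedged sketch of multi-field format and duration detection.
# record: dict of MARC tag -> subfield dict; control fields use key "".

def detect_format(record):
    f007 = record.get("007", {}).get("", "")
    if f007.startswith("sd"):          # sound disc; 12 cm + 1.4 m/s => CD
        return "CD"
    gmd = record.get("245", {}).get("h", "").lower()
    if "videorecording" in gmd:
        return "Video"
    extent = record.get("300", {}).get("a", "").lower()
    if "sound disc" in extent:
        return "CD"
    return "Book"                      # fallback assumption

def detect_duration(record):
    """Prefer structured 306$a (hhmmss); fall back to free text in 300$a."""
    d = record.get("306", {}).get("a", "")
    if re.fullmatch(r"\d{6}", d):
        h, m, s = int(d[:2]), int(d[2:4]), int(d[4:])
        return h * 3600 + m * 60 + s
    match = re.search(r"(\d+)\s*min", record.get("300", {}).get("a", ""))
    return int(match.group(1)) * 60 if match else None

print(detect_format({"007": {"": "sd fsngnnmmned"}}))        # CD
print(detect_duration({"300": {"a": "1 sound disc (74 min.)"}}))  # 4440
```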

Now Phil talking about ‘Title’ – v important, but of course quite tricky…
245 field in MARC may duplicate information from elsewhere
Got lots of help from http://journal.code4lib.org/articles/3832 (with additional work and modification)

Retained a ‘statement of responsibility’ – but mostly for search and display…

Identifiers…
Lots of non-identifier information mixed in with the identifiers – e.g. an ISBN followed by ‘pbk.’
Many variations in the abbreviations used – have to parse all this out, then validate the identifier
Once you have an identifier, you start being able to link to other stuff – which is great.
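The clean-then-validate step for ISBNs could be sketched like this – stripping a trailing qualifier such as ‘pbk.’ and checking the ISBN-10 check digit (ISBN-13 handling omitted; the function is illustrative):

```python
import re

# Sketch of identifier clean-up: strip trailing qualifiers like '(pbk.)'
# and validate the ISBN-10 check digit before trusting the identifier.

def clean_isbn10(raw):
    """'0471958697 (pbk.)' -> '0471958697' if the check digit is valid."""
    m = re.match(r"([\dXx-]{10,13})", raw.strip())
    if not m:
        return None
    digits = m.group(1).replace("-", "").upper()
    if len(digits) != 10:
        return None
    total = 0
    for i, ch in enumerate(digits):
        if ch == "X" and i != 9:
            return None            # 'X' is only valid as the check digit
        total += (10 - i) * (10 if ch == "X" else int(ch))
    # Valid ISBN-10: weighted sum is divisible by 11.
    return digits if total % 11 == 0 else None

print(clean_isbn10("0471958697 (pbk.)"))  # 0471958697
```

Only once the identifier validates is it safe to use for linking out to other datasets.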

Author – Pseudonyms, variations in names, generally no ‘relator terms’ in 100/700 $$e or $$4 – which would show the nature of the relationship between the person and the work (e.g. ‘author’ ‘illustrator’) – because these are missing have to parse information out of the 245$$c

… and not just dealing with English records – especially in academic libraries.
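Parsing relationship roles out of 245$$c, as described above, might start with patterns like these – a deliberately naive sketch; real statements of responsibility vary far more than this:

```python
import re

# Hypothetical sketch of recovering relator roles from 245$c when the
# 100/700 $e/$4 relator terms are missing. Patterns are illustrative.

ROLE_PATTERNS = [
    (r"illustrated by (.+)", "illustrator"),
    (r"edited by (.+)", "editor"),
    (r"translated by (.+)", "translator"),
    (r"(?:written )?by (.+)", "author"),   # checked last: most generic
]

def roles_from_245c(statement):
    """'by Roald Dahl ; illustrated by Quentin Blake' -> [(name, role)]"""
    roles = []
    for part in statement.split(";"):
        part = part.strip().rstrip(".")
        for pattern, role in ROLE_PATTERNS:
            m = re.match(pattern, part, re.IGNORECASE)
            if m:
                roles.append((m.group(1).strip(), role))
                break
    return roles

print(roles_from_245c("by Roald Dahl ; illustrated by Quentin Blake"))
# [('Roald Dahl', 'author'), ('Quentin Blake', 'illustrator')]
```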

Have licensed Library of Congress authority files – which helps… Authority matching requirements were:

  • Has to be fast – able to parse 2M records in hours, not days/months
  • Has to be accurate

So – store authorities as RDF but index them in Solr – gives speed, and for bulk conversions you don’t get HTTP overhead…

Language/alternate representation – a nice ‘high impact’ feature – allows switching between representations, and both forms can be searched for. Uses the RDF content-language feature, so it’s also useful for people consuming machine-readable RDF.

Using and Linking to external data sets…
part of the reason for using linked data – but some challenges….

  • what if datasource suffers downtime
  • worse – what if datasource removed permanently?
  • trust
  • can we display it? is it susceptible to vandalism?

Potential solutions (not there yet):

  • Harvest datasets and keep them close to the app
  • if that’s not practical, proxy requests through a caching proxy – e.g. Squid
  • if using Wikipedia and worried about vandalism, put in checks for likely vandalism activity – e.g. many edits in a short time
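One possible version of the “many edits in a short time” heuristic mentioned above – flag an article whose recent edit timestamps are suspiciously dense. The window and threshold values here are invented:

```python
# Sliding-window check for suspiciously dense edit activity, as one
# heuristic for likely vandalism. Threshold values are assumptions.

def looks_vandalised(edit_times, window=3600, max_edits=5):
    """edit_times: unix timestamps of recent edits, in any order.
    True if more than max_edits fall inside any `window` seconds."""
    times = sorted(edit_times)
    start = 0
    for end in range(len(times)):
        while times[end] - times[start] > window:
            start += 1          # shrink window from the left
        if end - start + 1 > max_edits:
            return True
    return False

# Six edits inside ten minutes trips the default threshold:
print(looks_vandalised([0, 100, 200, 300, 400, 500]))  # True
```

An article flagged this way could fall back to the last cached “known good” description rather than displaying the live text.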

Want to see:

  • More library data as LOD – especially on the peripheries – authority data, author information, etc.
  • LMS vendors adopting LOD
  • LOD replacing MARC21 as the standard representation of bibliographic records!

Questions?
Is the process (MARC→RDF) documented?
A: Would like to open source at least some of it… but discussions to be had internally in Capita – so something to keep an eye on…

Is there a running instance of Prism to play with?
A: Yes – e.g. http://prism.talis.com/bradford/

[UPDATE: See in comments – Phil suggests http://catalogue.library.manchester.ac.uk/ as one that uses a more up-to-date version of the transform]

written by ostephens


One Response to “Linked Data and Libraries: Linked Data OPAC”

  1. Phil John Says:

    Worth mentioning that different instances of Prism are using different data models, from the old legacy “triplified model” to the newer “semantic data model”.

    Bradford are using legacy model currently, University of Manchester using most recent (http://catalogue.library.manchester.ac.uk)

    Manchester also have some records using a slightly more recent version of new data model and some using older versions – they can coexist together in the same store.
