Dutch Culture Link

This session by Lukas Koster.

Lukas works for the Library of the University of Amsterdam. He was a ‘system librarian’, then Head of the Library Systems Department, and is now Library Systems Coordinator – which means he is responsible for MetaLib, SFX … and new innovative stuff, including the mobile web and the Dutch Culture Link project – which is what he is going to talk about today.

Lukas is a ‘shambrarian’ – someone who pretends they know stuff!

Blogs at http://commonplace.net

Lukas described the situation in the Netherlands regarding libraries, technology and innovation. How much leeway there is to get involved in innovation and mashups depends very much on the individual institution and the local situation. There is a large Library 2.0 community – but much more so in public libraries, especially around user-facing widgets and UI stuff – organised via a Ning network. They hold ‘Happe.Nings’ – but these look more at social media etc. than at data mashups. Lukas blogged the last one at http://www.lukaskoster.net/2010/06/happe-ning-in-haarlem. The next Happe.Ning will be about streaming music services.

Lukas is talking about a Linked Open Data project – the partners are:

  • DEN – Digital Heritage Foundation of the Netherlands – sets digital standards for heritage institutions (museums etc.) and promotes linked open data, with simple guidelines on how to publish it
  • UBA – library of the University of Amsterdam
  • TIN – Theater Institute of the Netherlands

Objectives of project:

  • Set example
  • Proof of concept
  • Pilot
  • Convince heritage institutions
  • Convince TIN, UBA management

Project called “Dutch Culture Link” – aim to link cultural data and institutions through semantic web

Linked data projects have two viewpoints – publishing and use – and there is no point publishing without use. Lukas is keen that the project includes examples of how the data can be used.

So the initial idea is that the UBA (Aleph) OPAC will retrieve data published from the TIN collection and use it to enhance the OPAC.

TIN uses the AdLib library system (AdLib is also used for museums, archives etc.) – the TIN collection contains objects and audio-visual material as well as bibliographic items.

They started by modelling the TIN collection data – the entities are:

  • Person
  • Part (person plays part in play)
  • Appearance
  • Production
  • Performance
  • Location
  • Play

Images, text files and audio-visual material are related to these entities – e.g. images from a performance (see the sketch below).
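
To make the shape of the model a bit more concrete, here is a very rough sketch of the entities and the relations implied above (a person plays a part in a play, a production is of a play, a performance is a one-time event of a production at a location). The field names and the way the links are expressed are entirely my own guesses, not the actual AdLib/TIN model:

```json
{
  "person":      { "name": "…" },
  "play":        { "title": "…", "writtenBy": "<person>" },
  "part":        { "name": "…", "in": "<play>" },
  "appearance":  { "person": "<person>", "part": "<part>", "in": "<production>" },
  "production":  { "of": "<play>", "opening": "…" },
  "performance": { "of": "<production>", "date": "…", "at": "<location>", "images": ["…"] },
  "location":    { "name": "…" }
}
```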

Lukas talked about FRBR – “a library inward vision” – which deals with bibliographic materials, but which can perhaps be mapped to plays…

  • Work = Play
  • Expression = Production?
  • Manifestation = Production?
  • Item = Performance (one time event)

FRBR is an interesting model, but it needs to be extended to the real world (not just inward-looking library materials)!

Questions that arise:

  • Which vocabulary/ontology to use?
  • How to implement RDF?
  • How to format URIs?
  • Which tools, techniques and languages?
  • How to find/get published linked data?
  • How to process retrieved linked data?

They needed training – but had no money! Luckily they were able to attend a free DANS Linked Open Data workshop.

Decided to start with a quick and dirty approach:

  • Produced URIs for data entities in the TIN data – expressed the data as JSON (not RDF)
  • At the OPAC end (JavaScript): construct the TIN URI, process the JSON, and present the data in the record

URI patterns:

  • <base-url>/person/<personname>
  • <base-url>/play/<personname>/<title>
  • <base-url>/production/<personname>/<title>/<opening>

e.g. the URI <base-url>/person/Beckett, Samuel returns a JSON record.
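
I didn’t note the exact response format, but a person record presumably looks something along these lines (the structure and field names are my guesses, not the project’s actual output):

```json
{
  "person": "Beckett, Samuel",
  "plays": [
    { "title": "…", "uri": "<base-url>/play/Beckett, Samuel/…" }
  ],
  "productions": [
    { "play": "…", "opening": "…", "uri": "<base-url>/production/Beckett, Samuel/…/…" }
  ]
}
```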

So, in the OPAC: find the author name and form the URI from information in the MARC record, stripping out any extraneous information; then get the JSON, parse it with JavaScript, and display the result in the OPAC.
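
A rough sketch of that client-side flow – the selectors, base URL and JSON shape below are all assumptions of mine, not the actual UBA code:

```javascript
// Hypothetical OPAC-side JavaScript: build a TIN URI from the author name in
// the displayed record, fetch the JSON and inject a short summary.
const TIN_BASE_URL = 'http://example.org/tin'; // placeholder base URL

function authorFromRecord() {
  // Assume the author (MARC 100) is shown somewhere in the page; strip
  // extraneous information such as dates ("Beckett, Samuel, 1906-1989.").
  const el = document.querySelector('.author');
  return el ? el.textContent.replace(/,?\s*\d{4}-(\d{4})?\.?\s*$/, '').trim() : null;
}

async function enhanceRecord() {
  const name = authorFromRecord();
  if (!name) return;

  // Form the TIN person URI and fetch the JSON record.
  const response = await fetch(`${TIN_BASE_URL}/person/${encodeURIComponent(name)}`);
  if (!response.ok) return; // no TIN data for this author
  const person = await response.json();

  // Present the TIN data inside the OPAC record display.
  const note = document.createElement('div');
  const productions = person.productions || [];
  note.textContent = `TIN: ${productions.length} production(s) of plays by ${name}`;
  document.querySelector('.record')?.appendChild(note);
}

enhanceRecord();
```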

But this is not Linked Data yet – it does not use a formal ontology or RDF. Still, this is the approach: quick and dirty, with tangible results.

Next steps, at the ‘publishing’ end:

  • Vocabulary for Production/Performance subject area
  • Vocabulary for Person (FOAF?), Subject (SKOS?)
  • RDF in JSON (internal relationships)
  • Publish RDF/XML
  • More URIs – for performances etc.
  • External links
  • Content negotiation (see the sketch after this list)
  • Links to a-v objects etc.
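
On the content negotiation point: the idea is that the same URI serves RDF to Linked Data clients and HTML to people. A minimal Node.js sketch of that idea, with placeholder serialisations (nothing here is the project’s actual code):

```javascript
// Minimal content negotiation sketch: inspect the Accept header and return
// RDF/XML or HTML for the same resource URI.
const http = require('http');

// Placeholder serialisations; a real service would generate these from the
// TIN data for the requested URI.
const rdfXmlFor = (uri) =>
  `<?xml version="1.0"?>\n<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"><!-- description of ${uri} --></rdf:RDF>`;
const htmlFor = (uri) => `<html><body>Human-readable page for ${uri}</body></html>`;

http.createServer((req, res) => {
  const accept = req.headers.accept || '';
  if (accept.includes('application/rdf+xml')) {
    res.writeHead(200, { 'Content-Type': 'application/rdf+xml' });
    res.end(rdfXmlFor(req.url));
  } else {
    res.writeHead(200, { 'Content-Type': 'text/html' });
    res.end(htmlFor(req.url));
  }
}).listen(8080);
```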

At ‘use’ end:

  • More ‘search’ fields (e.g. Title)
  • Extend presentation
  • Include relations
  • Clickable
  • More information – e.g. could list multiple productions of the same play from a script in the library catalogue (see the sketch below)
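
As a small illustration of the ‘clickable relations’ idea – listing the productions of a play as links that dereference the TIN URIs – here is a sketch that assumes the JSON shape guessed at above:

```javascript
// Hypothetical sketch: render the productions of a play as clickable links
// inside the OPAC record display.
function renderProductions(container, play) {
  const heading = document.createElement('p');
  heading.textContent = 'Productions of this play:';
  container.appendChild(heading);

  const list = document.createElement('ul');
  for (const production of play.productions || []) {
    const item = document.createElement('li');
    const link = document.createElement('a');
    link.href = production.uri; // dereferenceable TIN URI
    link.textContent = `Production opened ${production.opening}`;
    item.appendChild(link);
    list.appendChild(item);
  }
  container.appendChild(list);
}
```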

Issues:

  • Need to use generic, really unique URIs
  • Person: ids (VIAF?)
  • Plays: ids

Open Bibliography (and why it shouldn’t have to exist)

Today I’m at Mashspa – another Mashed Library event.

Ben O’Steen is talking about a JISC project he is currently involved with, about getting bibliographic information into the open. For Ben, ‘open’ means “publishing bibliographic information under a permissive license to encourage indexing, re-use and re-purposing”. Ben believes that some aspects – such as attribution – should be part of a ‘community norm’, not written into a license.

In essence an open bibliography is all about Advertising! Telling other people what you have.

Bibliographic information allows you to:

  • Identify and find an item you know you want
  • Discover related items or items you believe you want
  • Serendipitously discover items you would like without knowing they might exist
  • …other stuff

This list (from top to bottom) requires increasing investment. Advertising isn’t about spending money – it’s about investment.

To maximise returns you maximise the audience

Ben asks “Should the advertising target ‘b2b’ or ‘consumers’?”

Ben acknowledges that it may not be necessary to completely open up the data set – but believes that in the long term open is the way forward.

Some people ask “Can’t I just scrape sites and use the data – it’s just facts, isn’t it?”. However, Directive 96/9/EC of the European Parliament codifies a new protection based on “sui generis” rights – rights earned by the “sweat of the brow”. So far this law seems only to have solidified existing monopolies, not generated new economic growth (which was apparently the intention of the law).

When the project asked UK PubMedCentral whether it could reproduce the bibliographic data UK PubMedCentral shares through its OAI-PMH service, the answer was ‘Generally, no’ – paraphrasing, UK PubMedCentral said they didn’t have the rights to give away the data (except the material from Open Access journals). NOTE – it is the metadata, not the full-text articles, that we are talking about – they said they could not grant the right to reuse the metadata. [Would this, for example, mean that you could not use this metadata in a reference management package to then produce a bibliography?]

Principles:

  • Assign a license when you publish data
  • Use a recognised license
  • If you want your data to be effectively used and added to by others, it should be open – in particular, non-commercial and other restrictive licenses should be avoided
  • Strongly recommend using CC0 or PDDL (the latter in the EU only)
  • Strongly encourage release of bibliographic data into the ‘Open’

Sliding scale:

  • Identify – e.g. for an author a simple identifier could just be the name (cheap); more expensive identifiers include URIs or ORCIDs
  • Discover –
  • Serendipity –

If you increase investment you get more use – difficult to reuse data without identifiers for example.

1. Where there is human input, there is interpretation – people may interpret standards in different ways, use fields in different ways

Ben found a lot of variation across the PubMed data set – different journals or publishers interpret where information should go in different ways – “Standards don’t bring interoperability, people do”

2. Data has been entered and curated without large-scale sharing as a focus – lots of implicit, contextual information is left out – e.g. if you are working in a specialist Social Science library, perhaps you don’t mention that an item is about Social Sciences, as that is implicit from the (original) context

3. Data quality is generally poor – an example from the BL: ISBN = £2.50!
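
As an aside of mine (not from the talk): that sort of field error is exactly what a cheap mechanical check catches once the data is open and other people run their own validation over it. For example, an ISBN-13 check-digit test:

```javascript
// Illustration: an ISBN-13 check-digit test rejects values like "£2.50"
// that have ended up in the ISBN field.
function isValidIsbn13(value) {
  const digits = value.replace(/[-\s]/g, '');
  if (!/^\d{13}$/.test(digits)) return false;
  let sum = 0;
  for (let i = 0; i < 13; i++) {
    sum += Number(digits[i]) * (i % 2 === 0 ? 1 : 3); // alternating weights 1 and 3
  }
  return sum % 10 === 0;
}

console.log(isValidIsbn13('9780141184869')); // true  – checksum is valid
console.log(isValidIsbn13('£2.50'));         // false – not an ISBN at all
```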

In a closed data set you may not discover errors – when you have lots of people looking at data (with different uses in mind) you pick up different types of error.

The data clean-up process is going to be PROBABILISTIC – we cannot be sure – by definition – that we are accurate when we deduplicate or disambiguate. Typical methods:

  • Natural Language Processing
  • Machine learning techniques
  • String metrics and old-school record deduplication – the easiest of the 3 (for Ben)

It is not just about matching uniquely – it is about looking at the level of similarity and making decisions.
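
To make the ‘level of similarity’ point concrete, here is a small sketch using a plain normalised edit distance – purely illustrative, and far simpler than the string metrics listed at the link below:

```javascript
// Sketch: a normalised Levenshtein similarity plus a threshold decision.
// Real record linkage would weight several fields (title, author, year ...)
// rather than relying on a single string and an arbitrary cut-off.
function levenshtein(a, b) {
  const d = Array.from({ length: a.length + 1 }, (_, i) => {
    const row = new Array(b.length + 1).fill(0);
    row[0] = i;
    return row;
  });
  for (let j = 0; j <= b.length; j++) d[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      d[i][j] = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost);
    }
  }
  return d[a.length][b.length];
}

function similarity(a, b) {
  const maxLen = Math.max(a.length, b.length) || 1;
  return 1 - levenshtein(a.toLowerCase(), b.toLowerCase()) / maxLen;
}

console.log(similarity('Beckett, Samuel', 'Becket, Samuel'));  // ~0.93 – probably the same name
console.log(similarity('Beckett, Samuel', 'Brecht, Bertolt')); // much lower – probably not
```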

List of string metrics at http://staffwww.dcs.shef.ac.uk/people/s.chapman/stringmetrics.html

The Fellegi-Sunter method for old-school deduplication – not great, but works OK.

You can now take a map-reduce approach (distributing the processing across servers).
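
I take that to mean something like the following shape: the map step emits a cheap blocking key per record, and the reduce step only compares records that share a key. A toy single-process sketch of that shape (my illustration, not the project’s pipeline):

```javascript
// Toy map/reduce shape for deduplication. In a real setup the blocks would be
// distributed across servers and the comparison would be probabilistic.
function mapPhase(records) {
  const blocks = new Map();
  for (const record of records) {
    // Cheap blocking key: first four letters of the normalised title.
    const key = record.title.toLowerCase().replace(/\W/g, '').slice(0, 4);
    if (!blocks.has(key)) blocks.set(key, []);
    blocks.get(key).push(record);
  }
  return blocks;
}

function reducePhase(blocks, isProbableDuplicate) {
  const candidatePairs = [];
  for (const records of blocks.values()) {
    for (let i = 0; i < records.length; i++) {
      for (let j = i + 1; j < records.length; j++) {
        if (isProbableDuplicate(records[i], records[j])) {
          candidatePairs.push([records[i], records[j]]);
        }
      }
    }
  }
  return candidatePairs; // pairs to review or merge
}
```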

Do it yourself:

When de-duping you need to be able to unmerge, so you can correct mistakes if necessary – keep the canonical data that you hold separate from the data that you publish to the public.
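
One way to keep merges reversible is to never overwrite the canonical records at all: publish a merged view, but store each merge decision separately so it can be undone. A hypothetical sketch of that idea:

```javascript
// Keep canonical records untouched; a merge is just a recorded decision that
// can be deleted again (unmerge) if it turns out to be wrong.
const canonical = new Map(); // id -> record, exactly as received
const merges = new Map();    // duplicateId -> survivingId

function merge(duplicateId, survivingId) {
  merges.set(duplicateId, survivingId);
}

function unmerge(duplicateId) {
  merges.delete(duplicateId);
}

// The published view resolves merges on the way out; canonical data is intact.
function publishedRecord(id) {
  return canonical.get(merges.get(id) || id);
}
```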

Directions with bibliographic data: so far much effort has been directed at ‘Works’ – we need to put much more effort into their ‘Networks’, which starts to help (for example) with disambiguating people (see the sketch after the list below).

Network examples:

  • A cites B
  • Works by a given author
  • Works cited by a given author
  • Works citing articles that have since been disproved, retracted or withdrawn
  • Co-authors
  • …other connections that we’ve not even thought of yet
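
A sketch of how little structure is needed before those network questions become answerable – the records and fields here are invented purely for illustration:

```javascript
// Toy citation/authorship network: enough to ask "works by an author",
// "works citing X" or "co-authors of Y".
const works = [
  { id: 'w1', authors: ['a1', 'a2'], cites: ['w2'] },
  { id: 'w2', authors: ['a2'],       cites: []     },
];

const worksBy = (author) => works.filter((w) => w.authors.includes(author));
const citing = (workId) => works.filter((w) => w.cites.includes(workId));
const coAuthors = (author) =>
  new Set(worksBy(author).flatMap((w) => w.authors).filter((a) => a !== author));

console.log(worksBy('a2').map((w) => w.id)); // ['w1', 'w2']
console.log(citing('w2').map((w) => w.id));  // ['w1']
console.log([...coAuthors('a2')]);           // ['a1']
```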

Ben says – Don’t get hung up on standards …