Opening Data – Opening Doors: Technical Standards

Some slightly sketchy notes of Paul Walk’s talk

Paul says the real challenges are around:

  • Business case
  • IPR
  • etc.

Technical issues are not trivial, but they are insignificant compared to the other challenges

We aren’t building a system here, but thinking about an environment … although we will probably need to build systems on top of this at some point

‘The purple triangle of interoperability’!:

  • Shared Principles
  • Technical Standards
  • Community/Domain Conventions

Standards are not the whole story

  • Use (open) technical standards
  • Require standards only where necessary
  • Avoid pushing standards to create adoption
  • Establish and understand high-level principles and ‘explain the working out’ – support deeper understanding

Paul suggests some ‘safe bets’ in terms of approaches/principles:

  • Use Resource Oriented Architecture
  • Identify persistently – assign global, public identifiers to your high-order entities (e.g. metadata records, actual resources)
    • URLs (or http URIs) are a sensible default for us (although not the only game in town)
  • use HTTP and REST
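
A minimal sketch of what ‘identify persistently’ can look like in practice – the institution (example.org) and the path scheme here are hypothetical illustrations, not a prescription from the talk:

```python
# Mint stable, public http URIs for high-order entities. The URI
# deliberately encodes no implementation detail (no .php, no server
# names, no query strings), so it can survive system migrations.
# 'example.org' and the '/id/<type>/<id>' pattern are assumptions.

def record_uri(entity_type: str, local_id: str) -> str:
    """Return a persistent http URI for an entity."""
    allowed = {"record", "resource", "agent"}
    if entity_type not in allowed:
        raise ValueError(f"unknown entity type: {entity_type}")
    return f"http://example.org/id/{entity_type}/{local_id}"

print(record_uri("record", "b1234567"))
# http://example.org/id/record/b1234567
```

The point is less the code than the discipline: once published, these URIs should keep resolving even when the underlying system changes.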

Aggregation is a cornerstone of the RDTF vision – so make your resources a target for aggregation:

  • use persistent identifiers for everything
  • adopt appropriate licensing
  • ‘Share alike’ may be easier than ‘attribution’

Paul is still a little sceptical of ‘Linked Data’ – it’s been ‘the future’ for a long time, and the tools are still not good enough, which can be a real challenge for developers. Even so, quoting Tom Coates: “Build for normal users, developers and machines” – and, if possible, build the same interface for all three [hint: a simple dump of RDF data isn’t going to achieve this!]
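
One way to ‘build the same interface for all three’ is HTTP content negotiation: the persistent URI stays the same and only the representation varies with the request’s Accept header. A simplified sketch (real Accept parsing honours q-values; this deliberately does not):

```python
# One URI, three audiences: choose a representation from the HTTP
# Accept header so humans get HTML while developers and machines get
# JSON or RDF from the same identifier. Illustrative only.

def pick_representation(accept_header: str) -> str:
    """Return the best media type we can serve for this request."""
    supported = ["text/html", "application/json", "application/rdf+xml"]
    for offered in accept_header.split(","):
        media_type = offered.split(";")[0].strip()
        if media_type in supported:
            return media_type
    return "text/html"  # sensible default for 'normal users'

print(pick_representation("application/json"))  # application/json
print(pick_representation("*/*"))               # text/html
```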

Expect and enable users to filter – give them ‘feeds’ (e.g. RSS/Atom) – concentrate on making your resources available
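
A sketch of the kind of feed this suggests – a minimal Atom document listing newly added records, built with Python’s standard library (the record fields and URIs are invented for illustration):

```python
# Build a minimal Atom feed so users and aggregators can subscribe
# to / filter newly published records.
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"

def build_feed(records):
    """Serialise a list of record dicts as a bare-bones Atom feed."""
    ET.register_namespace("", ATOM)
    feed = ET.Element(f"{{{ATOM}}}feed")
    ET.SubElement(feed, f"{{{ATOM}}}title").text = "New records"
    for rec in records:
        entry = ET.SubElement(feed, f"{{{ATOM}}}entry")
        ET.SubElement(entry, f"{{{ATOM}}}id").text = rec["uri"]
        ET.SubElement(entry, f"{{{ATOM}}}title").text = rec["title"]
        ET.SubElement(entry, f"{{{ATOM}}}updated").text = rec["updated"]
    return ET.tostring(feed, encoding="unicode")

feed_xml = build_feed([{"uri": "http://example.org/id/record/1",
                        "title": "A record",
                        "updated": "2011-06-01T00:00:00Z"}])
```

(A production feed needs more mandatory Atom elements – `id`, `updated` and an author on the feed itself – but the shape is the point here.)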

Paul sees a slight risk that we embrace ‘open’ at the expense of ‘usability’ – being open is an important first step, but we need to invest in making things useful and usable

Developer friendly formats:

  • XML has a lot going for it, but also quite a few issues
    • well understood
    • lots of tools available
    • validation is a pain
    • very verbose
    • not everything is a tree
  • JSON has gained rapid adoption
    • less verbose – simple
    • enables client side manipulation
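
To make the verbosity point concrete, here is the same (invented) flat record serialised both ways:

```python
# Compare JSON and XML serialisations of one flat record. The record
# fields are made up for illustration.
import json
import xml.etree.ElementTree as ET

record = {"id": "b1234567", "title": "Moby Dick",
          "creator": "Melville, Herman"}

as_json = json.dumps(record)

root = ET.Element("record")
for key, value in record.items():
    ET.SubElement(root, key).text = value
as_xml = ET.tostring(root, encoding="unicode")

print(len(as_json) < len(as_xml))  # True – JSON is terser for flat records
```

XML earns its keep when the data really is a tree with mixed content and attributes; for flat records like this, JSON is lighter for both the wire and the client-side developer.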

Character encodings – a huge number of XML records from UK institutional repositories (IRs) are invalid due to character encoding issues
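
A sketch of guarding against exactly this before publishing: decode strictly and fall back to a lossy repair, rather than serving byte sequences that will break downstream XML parsers. (Whether to repair or reject outright is a policy choice, not something the talk prescribed.)

```python
# Validate (or lossily repair) text that claims to be UTF-8 before it
# goes into published XML records.

def ensure_utf8(raw: bytes) -> str:
    """Decode as UTF-8; replace invalid byte sequences so the output
    is at least well-formed text (lossy, but better than serving
    records that downstream aggregators reject)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("utf-8", errors="replace")

# A Latin-1 'é' (0xE9) is not valid UTF-8 on its own:
print(ensure_utf8(b"caf\xe9"))  # caf� (replacement character)
```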

Technical Foundations:

  • Work going on now – there will be a website, ETA June 2011
  • JISC Observatory will gather evidence of ‘good use’ of technical standards etc
  • Need to understand federated aggregation better

Question for data providers: do you want to provide a data service, or just provide data?

Opening Data – Opening Doors: Cambridge University Library

Finally in this set of three ‘perspectives’ sessions, Ed Chamberlain from Cambridge University Library.

Why expose bibliographic data?

  • Natural follow on from philosophy of ‘meeting reader in their (online) place’
  • Already exposing data to others (OCLC, COPAC, SUNCAT, etc.) – lots of work to set up each agreement and export – an open data approach might give an easier way of doing this
  • Offer value for money (for taxpayer)
  • Internal academic pressure – ‘we are being asked for data’

e.g. Rufus Pollock wanted to do an ‘analysis of the size and growth of the public domain’ using CUL bibliographic data (http://rufuspollock.org/tags/eupd)

The COMET (Cambridge Open METadata) project will be releasing large amounts of bibliographic data under an Open Data Commons License. Formats will include MARC21 and RDF – partnering with OCLC so linking into related services such as FAST and VIAF.

Ed thinks the library sector should have following ambitions around resource discovery:

  • Hope to see ‘long tail’ effect – exposing data to large audience
  • ‘Out of domain’ discovery
  • Multiple points of discovery at multiple levels for multiple audiences
  • Services for Undergraduates, for academics AND for developers

Practicalities/Challenges:

  • Licensing
    • While individual records may not be protected by copyright, collections of records may be – and they are often obtained by the library from shared catalogue resources/commercial suppliers under contract
    • The ideal is full unrestricted access
    • Better to publish data (as much as you can), even if it is necessary to attach more restrictive licensing
  • RDF vocab and mappings – no standard
  • Triplestores – for managing RDF – but new technology, seems complex

Opportunities:

  • Strong platform for future development
  • Linked formats and open licenses are virtuous pairing
  • Huge scope for back office benefits

Need to also think beyond bibliography – what about holdings? Libraries (physical locations)? Librarians as linked data(!) (finding people with specialisms, etc.)

Opening Data – Opening Doors: The National Archives

Next up, Nick Kingsley from the National Archives.

For ‘non-archivists’, a whistlestop tour:

  • Archival holdings consist of collections (or ‘fonds’) comprising any number of archival objects – these are the primary units of management
  • Collections often have ‘natural’ or imposed structure
  • Ideally catalogues are linked to authority records for names and places, plus taxonomies for subjects
  • Archive users typically use both search and browse to aid resource discovery
  • Archive catalogues are compiled over long periods (a century or more in some cases) – so there are inconsistencies/changes in language etc.

The ‘National Register of Archives’ – the start of ‘aggregation’ for archives – computerised and then made available online through the 80s and 90s

Funding silos meant outcomes of the ‘Archives Online’ report published by the National Council on Archives (1998) were taken forward through a series of different projects – but all with a commitment to interoperability to allow for integration or cross-searching. Projects include:

About 5 years ago, a view started to emerge that the future was not about aggregations ‘doing it for’ archives, but about individual archives publishing their own catalogues online – but this “usually proved disappointing” (personal view of Nick), because:

  • constrained by lack of technical support
  • 2 widely adopted commercial platforms – developments limited to those supported by the ‘majority’ of customers
  • Rarely offer robust search/browse
  • ….

The National Archives is committed to supporting and promoting open data. It has also been a pioneer in exploiting the potential of Linked Data – through http://legislation.gov.uk, and it is also looking at a Linked Data version of PRONOM (impartial, definitive information about file formats, software products and other technical components) – see the National Archives ‘labs’ page http://labs.nationalarchives.gov.uk/wordpress/index.php/2011/01/linked-data-and-pronom

Lots of work going on at the National Archives – see http://labs.nationalarchives.gov.uk/wordpress/index.php. Also looking at a review of the National Register of Archives and considering a linked data approach.

Opening Data – Opening Doors: National Maritime Museum

Today I’m at the ‘Opening Data – Opening Doors’ event, which is both the first public event around the ‘Resource Discovery Task Force’ work and an opportunity to launch the JISC ‘Guide to Open Bibliographic Data’. The event is being live streamed at http://www.livestream.com/johnpopham

Following an introduction from David Baker on the background of the RDTF and the ‘vision’ that came out of that work, Laurence Chiles from the National Maritime Museum is now talking about how they’ve approached publishing their collections and data on the web. The results go live in the next few weeks.

Amongst a wide range of aims, they wanted to:

  • Connect objects & records across varied collection – use linked data to enable connectivity between objects; help develop the story and relationships across the collections
  • Give objects a growing online identity – permanent/stable home based on Object IDs
  • To be conversational – let people use the data but then start/react to the conversation – if no one knows it’s there …

Actions they took:

Changed the criteria:

  • From ‘web ready’ to ‘not for web use’ – i.e. moved from a ‘not on web’ assumption to an ‘on web unless a specific decision not to’ assumption
  • Decided publishing data without images was OK
  • 4 basic mandatory fields – (Title, ID, ?, ?)

Offer new ways to the data:

  • OAI-PMH (for aggregation into Culture Grid and onwards to Europeana)
  • OpenSearch
  • XML
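
For illustration, a request of the kind an aggregator such as Culture Grid would issue against an OAI-PMH endpoint can be built like this (the endpoint URL is hypothetical; the verb and parameters are standard OAI-PMH):

```python
# Build an OAI-PMH ListRecords request URL. Per the protocol, a
# resumptionToken must be sent on its own, without metadataPrefix.
from typing import Optional
from urllib.parse import urlencode

def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     resumption_token: Optional[str] = None) -> str:
    params = {"verb": "ListRecords"}
    if resumption_token:
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
    return f"{base_url}?{urlencode(params)}"

print(list_records_url("http://example.org/oai"))
# http://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc
```

The aggregator then pages through the result set by re-requesting with each `resumptionToken` the endpoint returns.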

Used principles of linked data to link out of collections online:

  • AEON (Archival retrieval service through their Archive catalogue)
  • Cabinet (for print ordering service)
  • WorldCat (links to publications)
  • Plans to work with Wikimedia Commons to enhance authority records
  • Exposed both the Solr API for ‘traditional’ search and the SPARQL endpoint for linked data
    • Promoted at Culture and History Hack days
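
A sketch of the sort of query a SPARQL endpoint enables – the endpoint URL and the use of Dublin Core titles here are assumptions for illustration, not the NMM’s actual schema:

```python
# Build a GET request against a (hypothetical) SPARQL endpoint,
# asking for JSON results that a hack-day developer can consume
# directly in client-side code.
from urllib.parse import urlencode

QUERY = """
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?object ?title WHERE {
  ?object dc:title ?title .
} LIMIT 10
"""

def sparql_request_url(endpoint: str) -> str:
    """URL-encode the query string onto the endpoint."""
    return f"{endpoint}?{urlencode({'query': QUERY, 'format': 'json'})}"

url = sparql_request_url("http://example.org/sparql")  # hypothetical endpoint
```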

Going forward want to:

  • Promote a collaborative, conversational approach – e.g. ‘Help the NMM’ feature on all records
  • Improve ‘on-gallery’ experiences
  • Continue to release more data and monitor – e.g. 1915 Crew Lists