Opening Data – Opening Doors: Technical Standards

Some slightly sketchy notes of Paul Walk’s talk

Paul says the real challenges are around:

  • Business case
  • IPR
  • etc.

Technical issues are not trivial, but they are insignificant compared to the other challenges

We aren’t building a system here, but thinking about an environment … although we will probably need to build systems on top of this at some point

‘The purple triangle of interoperability’!:

  • Shared Principles
  • Technical Standards
  • Community/Domain Conventions

Standards are not the whole story

  • Use (open) technical standards
  • Require standards only where necessary
  • Avoid pushing standards to create adoption
  • Establish and understand high-level principles and ‘explain the working out’ – support deeper understanding

Paul suggests some ‘safe bets’ in terms of approaches/principles:

  • Use Resource Oriented Architecture
  • Identify persistently – assign global, public identifiers to your high-order entities (e.g. metadata records, actual resources)
    • URLs (or http URIs) are a sensible default for us (although not the only game in town)
  • use HTTP and REST
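
A minimal sketch of what ‘identify persistently’ can look like in practice – the institution (example.org) and the path scheme here are hypothetical illustrations, not a prescription from the talk:

```python
# Mint stable, public http URIs for high-order entities. The URI
# deliberately encodes no implementation detail (no .php, no server
# names, no query strings), so it can survive system migrations.
# 'example.org' and the '/id/<type>/<id>' pattern are assumptions.

def record_uri(entity_type: str, local_id: str) -> str:
    """Return a persistent http URI for an entity."""
    allowed = {"record", "resource", "agent"}
    if entity_type not in allowed:
        raise ValueError(f"unknown entity type: {entity_type}")
    return f"http://example.org/id/{entity_type}/{local_id}"

print(record_uri("record", "b1234567"))
# http://example.org/id/record/b1234567
```

The point is less the code than the discipline: once published, these URIs should keep resolving even when the underlying system changes.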

Aggregation is a cornerstone of the RDTF vision – so make your resources a target for aggregation:

  • use persistent identifiers for everything
  • adopt appropriate licensing
  • ‘Share alike’ may be easier than ‘attribution’

Paul is still a little sceptical of ‘Linked Data’ – it’s been ‘the future’ for a long time, and the tools are still not good enough, which can be a real challenge for developers. Even so, quoting Tom Coates: “Build for normal users, developers and machines” – and, if possible, build the same interface for all three [hint: a simple dump of RDF data isn’t going to achieve this!]
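
One way to ‘build the same interface for all three’ is HTTP content negotiation: the persistent URI stays the same and only the representation varies with the request’s Accept header. A simplified sketch (real Accept parsing honours q-values; this deliberately does not):

```python
# One URI, three audiences: choose a representation from the HTTP
# Accept header so humans get HTML while developers and machines get
# JSON or RDF from the same identifier. Illustrative only.

def pick_representation(accept_header: str) -> str:
    """Return the best media type we can serve for this request."""
    supported = ["text/html", "application/json", "application/rdf+xml"]
    for offered in accept_header.split(","):
        media_type = offered.split(";")[0].strip()
        if media_type in supported:
            return media_type
    return "text/html"  # sensible default for 'normal users'

print(pick_representation("application/json"))  # application/json
print(pick_representation("*/*"))               # text/html
```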

Expect and enable users to filter – give them ‘feeds’ (e.g. RSS/Atom) – concentrate on making your resources available
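
A sketch of the kind of feed this suggests – a minimal Atom document listing newly added records, built with Python’s standard library (the record fields and URIs are invented for illustration):

```python
# Build a minimal Atom feed so users and aggregators can subscribe
# to / filter newly published records.
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"

def build_feed(records):
    """Serialise a list of record dicts as a bare-bones Atom feed."""
    ET.register_namespace("", ATOM)
    feed = ET.Element(f"{{{ATOM}}}feed")
    ET.SubElement(feed, f"{{{ATOM}}}title").text = "New records"
    for rec in records:
        entry = ET.SubElement(feed, f"{{{ATOM}}}entry")
        ET.SubElement(entry, f"{{{ATOM}}}id").text = rec["uri"]
        ET.SubElement(entry, f"{{{ATOM}}}title").text = rec["title"]
        ET.SubElement(entry, f"{{{ATOM}}}updated").text = rec["updated"]
    return ET.tostring(feed, encoding="unicode")

feed_xml = build_feed([{"uri": "http://example.org/id/record/1",
                        "title": "A record",
                        "updated": "2011-06-01T00:00:00Z"}])
```

(A production feed needs more mandatory Atom elements – `id`, `updated` and an author on the feed itself – but the shape is the point here.)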

Paul sees a slight risk that we embrace ‘open’ at the expense of ‘usability’ – being open is an important first step, but we need to invest in making things useful and usable

Developer friendly formats:

  • XML has a lot going for it, but also quite a few issues
    • well understood
    • lots of tools available
    • validation is a pain
    • very verbose
    • not everything is a tree
  • JSON has gained rapid adoption
    • less verbose – simple
    • enables client side manipulation
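
To make the verbosity point concrete, here is the same (invented) flat record serialised both ways:

```python
# Compare JSON and XML serialisations of one flat record. The record
# fields are made up for illustration.
import json
import xml.etree.ElementTree as ET

record = {"id": "b1234567", "title": "Moby Dick",
          "creator": "Melville, Herman"}

as_json = json.dumps(record)

root = ET.Element("record")
for key, value in record.items():
    ET.SubElement(root, key).text = value
as_xml = ET.tostring(root, encoding="unicode")

print(len(as_json) < len(as_xml))  # True – JSON is terser for flat records
```

XML earns its keep when the data really is a tree with mixed content and attributes; for flat records like this, JSON is lighter for both the wire and the client-side developer.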

Character encodings – a huge number of XML records from UK institutional repositories (IRs) are invalid due to character encoding issues
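
A sketch of guarding against exactly this before publishing: decode strictly and fall back to a lossy repair, rather than serving byte sequences that will break downstream XML parsers. (Whether to repair or reject outright is a policy choice, not something the talk prescribed.)

```python
# Validate (or lossily repair) text that claims to be UTF-8 before it
# goes into published XML records.

def ensure_utf8(raw: bytes) -> str:
    """Decode as UTF-8; replace invalid byte sequences so the output
    is at least well-formed text (lossy, but better than serving
    records that downstream aggregators reject)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("utf-8", errors="replace")

# A Latin-1 'é' (0xE9) is not valid UTF-8 on its own:
print(ensure_utf8(b"caf\xe9"))  # caf� (replacement character)
```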

Technical Foundations:

  • Work going on now – there will be a website, ETA June 2011
  • JISC Observatory will gather evidence of ‘good use’ of technical standards etc
  • Need to understand federated aggregation better

Question for data providers: do you want to provide a data service, or just provide data?

Opening Data – Opening Doors: Cambridge University Library

Finally in this set of three ‘perspectives’ sessions, Ed Chamberlain from Cambridge University Library.

Why expose bibliographic data?

  • Natural follow on from philosophy of ‘meeting reader in their (online) place’
  • Already exposing data to others (OCLC, COPAC, SUNCAT, etc.) – lots of work to set up each agreement and export – an open data approach might give an easier way of doing this
  • Offer value for money (for taxpayer)
  • Internal academic pressure – ‘we are being asked for data’

e.g. Rufus Pollock wanted to do an ‘analysis of the size and growth of the public domain’ using CUL bibliographic data (http://rufuspollock.org/tags/eupd)

The COMET (Cambridge Open METadata) project will be releasing large amounts of bibliographic data under an Open Data Commons License. Formats will include MARC21 and RDF – partnering with OCLC so linking into related services such as FAST and VIAF.

Ed thinks the library sector should have following ambitions around resource discovery:

  • Hope to see ‘long tail’ effect – exposing data to large audience
  • ‘Out of domain’ discovery
  • Multiple points of discovery at multiple levels for multiple audiences
  • Services for Undergraduates, for academics AND for developers

Practicalities/Challenges:

  • Licensing
    • While individual records may not be protected by copyright, collections of records may be – and they are often obtained by the library from shared catalogue resources/commercial suppliers under contract
    • The ideal is full unrestricted access
    • Better to publish data (as much as you can), even if it is necessary to attach more restrictive licensing
  • RDF vocab and mappings – no standard
  • Triplestores – for managing RDF – but new technology, seems complex

Opportunities:

  • Strong platform for future development
  • Linked formats and open licenses are virtuous pairing
  • Huge scope for back office benefits

Need to also think beyond bibliography – what about holdings? Libraries (physical locations)? Librarians as linked data(!) (finding people with specialisms, etc.)

Opening Data – Opening Doors: The National Archives

Next up, Nick Kingsley from the National Archives.

For ‘non-archivists’, a whistlestop tour:

  • Archival holdings consist of collections (or ‘fonds’) comprising any number of archival objects – these are the primary units of management
  • Collections often have ‘natural’ or imposed structure
  • Ideally catalogues are linked to authority records for names and places, plus taxonomies for subjects
  • Archive users typically use both search and browse to aid resource discovery
  • Archive catalogues are compiled over long periods (a century or more in some cases) – so there are inconsistencies/changes in language etc.

The ‘National Register of Archives’ – the start of ‘aggregation’ for archives – computerised and then made available online through the 80s and 90s

Funding silos meant outcomes of the ‘Archives Online’ report published by the National Council on Archives (1998) were taken forward through a series of different projects – but all with a commitment to interoperability to allow for integration or cross-searching. Projects include:

About 5 years ago, a view started to emerge that the future was not about aggregations ‘doing it for’ archives, but about individual archives publishing their own catalogues online – but this “usually proved disappointing” (personal view of Nick), because:

  • constrained by lack of technical support
  • 2 widely adopted commercial platforms – developments limited to those supported by the ‘majority’ of customers
  • Rarely offer robust search/browse
  • ….

The National Archives is committed to supporting and promoting open data. It has also been a pioneer in exploiting the potential of Linked Data – through http://legislation.gov.uk, and it is also looking at a Linked Data version of PRONOM (impartial, definitive information about file formats, software products and other technical components) – see the National Archives ‘labs’ page http://labs.nationalarchives.gov.uk/wordpress/index.php/2011/01/linked-data-and-pronom

Lots of work going on at the National Archives – see http://labs.nationalarchives.gov.uk/wordpress/index.php. Also looking at a review of the National Register of Archives and considering a linked data approach.

Opening Data – Opening Doors: National Maritime Museum

Today I’m at the ‘Opening Data – Opening Doors’ event, which is both the first public event around the ‘Resource Discovery Task Force’ work and an opportunity to launch the JISC ‘Guide to Open Bibliographic Data’. The event is being live streamed at http://www.livestream.com/johnpopham

Following an introduction from David Baker on the background of the RDTF and the ‘vision’ that came out of that work, Laurence Chiles from the National Maritime Museum is now talking about how they’ve approached publishing their collections and data on the web. The results go live in the next few weeks.

Amongst a wide range of aims, they wanted to:

  • Connect objects & records across varied collection – use linked data to enable connectivity between objects; help develop the story and relationships across the collections
  • Give objects a growing online identity – permanent/stable home based on Object IDs
  • To be conversational – let people use the data but then start/react to the conversation – if no one knows it’s there …

Actions they took:

Changed the criteria:

  • From ‘web ready’ to ‘not for web use’ – i.e. moved from a ‘not on web’ assumption to an ‘on web unless a specific decision not to’ assumption
  • Decided publishing data without images was OK
  • 4 basic mandatory fields – (Title, ID, ?, ?)

Offer new ways to the data:

  • OAI-PMH (for aggregation into Culture Grid and onwards to Europeana)
  • OpenSearch
  • XML
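
For illustration, a request of the kind an aggregator such as Culture Grid would issue against an OAI-PMH endpoint can be built like this (the endpoint URL is hypothetical; the verb and parameters are standard OAI-PMH):

```python
# Build an OAI-PMH ListRecords request URL. Per the protocol, a
# resumptionToken must be sent on its own, without metadataPrefix.
from typing import Optional
from urllib.parse import urlencode

def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     resumption_token: Optional[str] = None) -> str:
    params = {"verb": "ListRecords"}
    if resumption_token:
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
    return f"{base_url}?{urlencode(params)}"

print(list_records_url("http://example.org/oai"))
# http://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc
```

The aggregator then pages through the result set by re-requesting with each `resumptionToken` the endpoint returns.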

Used principles of linked data to link out of collections online:

  • AEON (Archival retrieval service through their Archive catalogue)
  • Cabinet (for print ordering service)
  • WorldCat (links to publications)
  • Plans to work with Wikimedia Commons to enhance authority records
  • Exposed both the Solr API for ‘traditional’ search and the SPARQL endpoint for linked data
    • Promoted at Culture and History Hack days
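
A sketch of the sort of query a SPARQL endpoint enables – the endpoint URL and the use of Dublin Core titles here are assumptions for illustration, not the NMM’s actual schema:

```python
# Build a GET request against a (hypothetical) SPARQL endpoint,
# asking for JSON results that a hack-day developer can consume
# directly in client-side code.
from urllib.parse import urlencode

QUERY = """
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?object ?title WHERE {
  ?object dc:title ?title .
} LIMIT 10
"""

def sparql_request_url(endpoint: str) -> str:
    """URL-encode the query string onto the endpoint."""
    return f"{endpoint}?{urlencode({'query': QUERY, 'format': 'json'})}"

url = sparql_request_url("http://example.org/sparql")  # hypothetical endpoint
```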

Going forward want to:

  • Promote a collaborative, conversational approach – e.g. ‘Help the NMM’ feature on all records
  • Improve ‘on-gallery’ experiences
  • Continue to release more data and monitor – e.g. 1915 Crew Lists