Provenance use cases for legislation.gov.uk

Stephen Cresswell from The Stationery Office outlining:
Background
  • legislation.gov.uk – 60k pieces of UK legislation, managed by The National Archives
  • Publication in various formats – paper docs, PDF, XML, XHTML+RDFa, RDF
  • TSO (The Stationery Office) currently redesigning workflows for legislation

Use cases/Requirements:

  • Drafters of legislation (e.g. government departments) – “how is my job progressing through the publishing workflow?”
  • Management information (aggregated) – “where are the bottlenecks in the publishing workflow?”
  • Maintainers may want to trace problems – “Which documents were derived from this XSLT?”
  • Anyone might ask “Where did this document come from?”
  • Acceptance test – re-run workflow from provenance graph (to prove that the provenance recorded is the true provenance – that re-running the same workflow results in the same outcome)

 

Provenance and Linked Data Issues

This list was formed from discussions and voting yesterday (and will be posted to the wiki at some point)

Top Issues:

  • Things that change – if the thing that a URI points at changes (e.g. the boundaries of a geographical area), what happens to the existing provenance statements?
  • Provenance of links between data ‘islands’
  • Summarisation of provenance to aid usability and scalability
  • Reasoning about provenance: cross-referencing to fill gaps in the provenance

Further issues

  • 80/20 principle for provenance
  • What general provenance model to use to enable interoperability
  • Provenance for validation of facts
  • Is reasoning even possible over provenance?
  • Interaction of triple stores and data integration/transformation
  • Semantics vs data capture (does the rich semantic nature of Linked Data offset the need to capture provenance data ‘at source’)
  • Access level of provenance (public vs private provenance statements)

The Open Provenance Model

Introduction to the Open Provenance Model (OPM) from Luc Moreau, University of Southampton.

Came out of a series of ‘provenance challenges’ – driven by a desire to understand provenance systems. The first three provenance challenges (around interoperability of provenance information from different systems) led to the OPM, which became the basis of a further challenge event.

OPM is an ‘annotated causality/dependency graph – directed, acyclic’ – it is an abstract model with serializations to XML and RDF. It encourages specialisation for specific domains through profiles.
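To make that concrete, here is a rough sketch (in Ruby, just because that’s what I tend to reach for elsewhere on this blog) of what an OPM-style graph for a publishing step might look like. The node and edge names (artifact, process, agent; used, wasGeneratedBy, wasControlledBy, wasDerivedFrom) are OPM terms, but the data structure and the legislation-flavoured file names are purely illustrative – this is not one of the official serializations.

```ruby
# Illustrative only – a hand-rolled OPM-style graph, not the official XML/RDF serialization.
opm_graph = {
  artifacts: { 'a1' => 'bill-draft.xml', 'a2' => 'act.xhtml' },   # hypothetical documents
  processes: { 'p1' => 'XSLT transformation' },
  agents:    { 'ag1' => 'publishing workflow' },
  edges: [
    { type: :used,            from: 'p1', to: 'a1' },  # the process used the draft
    { type: :wasGeneratedBy,  from: 'a2', to: 'p1' },  # the output was generated by the process
    { type: :wasControlledBy, from: 'p1', to: 'ag1' }, # the process was controlled by an agent
    { type: :wasDerivedFrom,  from: 'a2', to: 'a1' }   # hence the output was derived from the draft
  ]
}

# e.g. answer “which documents were derived from this input?”
derived = opm_graph[:edges]
  .select { |e| e[:type] == :wasDerivedFrom && e[:to] == 'a1' }
  .map    { |e| opm_graph[:artifacts][e[:from]] }
puts derived.inspect   # => ["act.xhtml"]
```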

A simple use case inspired by a Jeni Tennison blog post http://www.jenitennison.com/blog/node/133 [worth reading for a practical application of provenance in RDF]

Lots of detail here that I didn’t get – need to go and read http://openprovenance.org/

OPM itself doesn’t say anything about aggregations – but using OPM it is possible to build an ontology that expresses this. There is a draft proposed collection profile, but it needs more work.

Some work around being able to add a digital signature to provenance information – to validate it as from a specific source/person.

… at this point I’m afraid the 5am start got to me and I didn’t manage to capture the rest of this presentation 🙁

Some pointers

Becoming Dataware

This presentation from James Goulding …

Lots of ‘personal’ data being collected (e.g. by Tesco) – but not open. Who owns it? Very unclear – no precedents. Perhaps a parallel with photography – if someone takes a photo of you, it is data about you that you don’t own.

Do ‘businesses’ own it? Lots of data in their data silos (Facebook, Tesco etc.)

Do we own the data? What if ‘I’ (as an individual) want ‘my’ data to be open?

Data ‘tug-of-war’ – data can be duplicated, instantly transferrable

Marx: split between ‘those who own the means of production and those who work on them’ – but in data creation of this type, we’re not the worker? We’re not the customer?

We are the product!

Policy is slow to catch up with practice.

Big data generates new data…

So the question is not who owns an individual’s data, but: who controls the means of analysis?

Vision of a ‘personal datasphere’:

  • My (SPARQL?) endpoint – under my control
  • Logically a single entity – a catalogue on hosts under my control or on my cloud – maintains privacy
  • User controlled – may decide to expose to trusted partners or for a price… (right to access data may not be right to process it!)

Dataware concept – the ‘catalogue’ updates as other data sources (e.g. energy consumption data, social media etc.) update. Third-party applications can then request permission to access data from the Dataware catalogue; the catalogue issues a token, which then allows the third-party app to access the data source (Facebook, bank data, etc.)
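As I understood it the flow is roughly: app asks the catalogue for permission, the catalogue issues a token, and the data source honours that token. A minimal sketch of that flow (my sketch, all class and method names hypothetical – this is not the Dataware API):

```ruby
require 'securerandom'

# Hypothetical catalogue that grants scoped tokens which third-party apps
# later present to a data source.
class Catalogue
  def initialize
    @grants = {}   # token => { app:, source:, scope: }
  end

  # A third-party app asks the user's catalogue for permission to read a data source
  def request_access(app:, source:, scope:)
    token = SecureRandom.hex(16)
    @grants[token] = { app: app, source: source, scope: scope }
    token
  end

  # The data source checks the token with the catalogue before releasing data
  def authorised?(token, source)
    grant = @grants[token]
    !grant.nil? && grant[:source] == source
  end
end

catalogue = Catalogue.new
token = catalogue.request_access(app: 'spending-dashboard', source: 'bank', scope: :read)
catalogue.authorised?(token, 'bank')   # => true
```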


James interested in how provenance might apply to the Dataware catalogue.

Provenance in the Dynamic, Collaborative New Science

This presentation was by Jun Zhao, University of Oxford – but I missed the start as I had just presented and was getting myself sorted…

Want to use some of the systems/expertise of libraries (esp. digital preservation) to preserve workflows – to make experiments repeatable at any time in the future. Project is ‘Workflow4Ever‘ – http://www.wf4ever-project.org/

Need to be able to re-run experiments over data sets as the data sets grow – does a finding remain true as the data grows?

Biology Use case – ‘reuse’:

  • Search for existing experiments from myExperiment (http://myexperiment.org)
    • Challenge – understand the workflow
    • Perform test runs with test data and his/her own data
    • Read others’ logs
    • Read annotations to workflows
  • Reuse scripts from colleagues and perform tests that his/her colleagues are familiar with

Provenance Challenges:

  • Identity
  • Context
  • Storage
  • Retrieval

Provenance and Linked Open Data

Today and tomorrow I’m at a workshop on Provenance and Linked Open Data in Edinburgh. The workshop is linked to the work of the W3C Provenance Incubator Group.

First up Paul Groth (@pgroth) from VU University Amsterdam is going to summarise the work of the incubator group and outline remaining open questions.

Paul says this audience (at this workshop) takes as a given the need for provenance. Provenance is fundamental to the web – it is a pressing issue in many areas for W3C:

  • Linked data/Semantic web
  • Open government (data.gov, data.gov.uk)
  • HCLS (Health Care and Life Sciences)

Most people do not know how to approach provenance – people are looking for a standard and a methodology that they can use immediately. Existing research/work on provenance is scattered across computer and library science research – hard to get an overview. Also, within enterprise/business systems there is often a concept of provenance, but without the same terminology.

The provenance group was tasked to ‘provide a state-of-the-art understanding and develop a roadmap’. About 20 active members, who worked over about a year and came to:

  • Common (working) definition of provenance

“Provenance of a resource is a record that describes entities and processes involved in producing and delivering or otherwise influencing that resource”

Provenance is metadata (but not all metadata is provenance). Provenance provides a ‘substrate for deriving different trust metrics’ (but it isn’t trust)

Provenance records can be used to verify and authenticate among other uses – but you can have provenance without cryptography/security

Provenance assertions can have their own provenance!

Inference is useful if provenance records are incomplete. There may be different accounts of provenance for the same data.

  • Developed set of key dimensions for provenance

3 top level dimensions:

Content – the ability to identify things; describe processes; describe who made a statement; know how a database resolved a specific query

Management – how should provenance be ‘exposed’? How do we deal with the scale of provenance?

Use – How do we go about using provenance information – showing trust, ownership, uncovering errors, interoperability…

Each of these dimensions broken down into further sub-categories.

  • Collected use cases

Over 30 use cases – from many domains (at least two from Cultural Heritage ‘Collection vs Objects in Cultural Heritage‘, ‘Different_Levels_Cultural_Heritage‘).

  • Designed 3 flagship scenarios from the use cases

The 30+ use cases were boiled down into three ‘super use-cases’ – trying to cover everything:

  1. News aggregator
  2. Disease Outbreak
  3. Business Contracts
  • Created mappings of existing provenance vocabularies
  • … more

Group came up with recommendations:

  • Proposed a Provenance Interchange Working Group – to define a provenance exchange language – to enable systems to exchange provenance information, and to make it possible to publish this on the web

Timeline:

W3C is in the process of deciding whether the Provenance Interchange Working Group should be approved. If this goes ahead it will start soon. Two-year working group – aggressive deliverable targets. “Standards work is hard” says Paul. Will rely on the next version of RDF (no time to cover this now).

Open Questions:

  • How to deal with Complex Objects
    • dealing with multiple levels of granularity
    • how provenance interacts with Named Graphs
    • Unification of database provenance and process ‘style’ provenance
    • objects, their versions and their provenance
    • visualisation and summarization
  • Imperfections
    • What is adequate provenance for proof/quality?
    • How do we deal with gaps in provenance?
    • Repeatability vs. reproduction and how much provenance is enough?
    • Can provenance help us get around the problem of reasoning over integrated data?
    • Using provenance as a platform for trust, does it work?
  • Distribution
    • How do we encourage provenance capture?
    • Multiple disagreeing claims about the origins of data – which one is right?
    • SameAs detection through provenance
    • Distribution often gives us privacy – once we integrate, how do we preserve privacy?
    • Scale (way more provenance than data! Has to scale – to very large)
    • Hypothesis: distribution is a fundamental property of provenance

Counting the cost

The context of this post and related work is that our local public library service (Warwickshire Library and Information Service) is facing a cut of approximately 27% of its budget over the next three years (£2 million from an overall budget of £7.4 million). For more information see Facing the Challenge on libraries on the Warwickshire County Council website.

One canard that seems to come up repeatedly in discussions around public libraries and their role is that ‘books are cheap’ and that this means there isn’t the need for public libraries that there once was. I thought an interesting experiment might be to work out how much our family’s use of the library service might equate to, based on the cost of buying the books that we have borrowed from the library. So I set about writing a script to achieve this.

The first stage was to retrieve details of our loan history – luckily the system used by Warwickshire libraries (Vubis) does have a loan history, which seems to go back 18 months (although with an absolute minimum of information – more on this later). Unluckily there is no way of querying this via an API, only via a web browser. This is not so unusual, and there are good ways of scraping data out of HTML which I’ve used before. However, in this case the page I needed to get the data from was protected by a login – as it was a loan history linked to an individual library account – and required a barcode and PIN to be input into a form on a webpage before accessing the loan history page.

To help with this I used a Ruby library called ‘Mechanize‘ (inspired by the Perl Mechanize library, which does the same sort of thing). This essentially allows you to simulate web browser behaviour within a script – it can fill in forms for you and deal with things like cookies which the web page might try to write to the browser. (Incidentally, if you are interested in doing things like automating checks of authenticated library resources online, Mechanize is worth a look.)
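For anyone wanting to try something similar, the login step looks roughly like this – note the URL and form field names are placeholders I’ve made up, as the real Vubis ones will differ:

```ruby
require 'mechanize'

agent = Mechanize.new
# Fetch the login page (URL is a placeholder, not the real Vubis address)
login_page = agent.get('https://library.example.gov.uk/vubis/login')

# Fill in the login form – the field names ('barcode', 'pin') are assumptions
form = login_page.forms.first
form['barcode'] = ENV['LIBRARY_BARCODE']
form['pin']     = ENV['LIBRARY_PIN']
agent.submit(form)

# Mechanize keeps the session cookie, so this now returns the authenticated page
history_page = agent.get('https://library.example.gov.uk/vubis/loan_history')
```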

Once Mechanize has navigated through the login form and then to the loan history page, the content of the page can be parsed as usual – luckily Mechanize supports parsing page content with another Ruby library called Nokogiri, which is my preferred HTML/XML parser in Ruby anyway. So grabbing the details from the loan history page was fine once Mechanize had retrieved it.
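The scraping itself is then just CSS selectors – again, the selector and column positions here are guesses at the page structure rather than the real markup:

```ruby
# history_page is the Mechanize::Page from the previous sketch;
# #search delegates to Nokogiri, so CSS selectors work directly.
loans = history_page.search('table.loan-history tr').map do |row|
  cells = row.search('td').map { |td| td.text.strip }
  { title: cells[0], date: cells[1] }   # the loan history only gives title + date
end
```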

Unfortunately the amount of information available in the loan history is minimal – just a title and a date. Since at a minimum I need a price, I need more information. My first thought was that I’d get prices from Amazon, but with only the title available I quickly realised I was going to get a lot of duplicate items – especially since my son (3 and a half) borrows books with titles like ‘Goldilocks and the three bears’ and ‘Shapes’ – not very unique.

So I needed to get more details – preferably an ISBN – before I start looking up prices. Any method of looking up books given only the title is going to suffer from the same problems of many duplicate titles. The way to minimise the number of duplicates is to search in the most limited pool of items for the titles from the loan history – which in this case is to search the Warwickshire library catalogue and limit the results to books in our local branch (Leamington Spa) – as this is the only branch we use.

The software that Warwickshire use to provide their library catalogue interface on the web is called ‘Aquabrowser‘ (as opposed to the one that stores administrative information like the loan history – which is called Vubis). Happily unlike Vubis, Aquabrowser does have some APIs that can be used to get information out of the system in a machine readable format. I have to admit that this functionality isn’t well advertised (neither by Warwickshire Library service nor by the vendor of the software), but luckily Chris Keene and Ed Chamberlain have both done some work in documenting the Aquabrowser API (here and here) which is in use at their place of work (University of Sussex and University of Cambridge respectively).

Aquabrowser supports what is meant to be a sector-wide standard interface called SRU. The location of the SRU interface for each Aquabrowser installation seems to be in a standard place – http://aquabrowser-url/aquabrowser-path/sru.ashx? (in this case librarycatalogue.warwickshire.gov.uk/abwarwick/sru.ashx?) – then followed by the relevant SRU parameters (e.g. operation=searchRetrieve&version=1.1&query=dogs&maximumRecords=100&recordSchema=dc – see http://www.loc.gov/standards/sru/ for full documentation on SRU). However, there are several issues with the Aquabrowser implementation of SRU. I’m not clear which issues are Aquabrowser issues and which are down to local configuration decisions, but the result is a rather limited interface. In this case the main problems for me were that it didn’t allow me to limit a search by title, and it didn’t include the ISBN in the results.

As well as the ‘standard’ SRU interface, Aquabrowser also supports a proprietary interface which will provide search results as XML. This is pretty much undocumented as far as I can tell, but it seems you can take some of the URLs from the HTML interface and add the parameter ‘output=xml’ to the URL (I’m not 100% sure, but my guess is that the HTML interface is actually built over the XML API). A basic request looks like http://aquabrowser-url/aquabrowser-path/result.ashx?q=emily%20gravett&output=xml (in this case http://librarycatalogue.warwickshire.gov.uk/abwarwick/result.ashx?q=emily%20gravett&output=xml).

The default HTML interface uses frames, which obscures some of what is happening – however, if you use the ‘Accessible’ version, the syntax in the URLs can be seen easily. Using this method I was able to copy the syntax for limiting the results to stock available in a specific branch – in this case the parameter required is ‘branch=LEA’.
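Putting those pieces together, a search against the XML interface limited to the Leamington branch looks something like this (written for a current Ruby – at the time I would have used open-uri’s open rather than URI.open):

```ruby
require 'open-uri'
require 'nokogiri'
require 'cgi'

BASE = 'http://librarycatalogue.warwickshire.gov.uk/abwarwick/result.ashx'

# Run a query against the proprietary XML interface, limited to one branch
def aquabrowser_search(query, branch: 'LEA')
  url = "#{BASE}?q=#{CGI.escape(query)}&branch=#{branch}&output=xml"
  Nokogiri::XML(URI.open(url))
end

results = aquabrowser_search('goldilocks and the three bears')
```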

The results returned by the XML interface in the Warwickshire instance include information lacking from the SRU interface – and ISBN is one of the elements included. However, searching by title is still not possible as far as I can tell – although I came up with a workaround for this.

So, once I have the list of titles from the loan history, I use each title to run a search against the Aquabrowser XML interface, limiting the results to stock in the Leamington Spa branch. Since I can’t limit the search to titles, I instead check the contents of the <title> tag in each record contained in the XML response (note that in the XML response Aquabrowser wraps matches to your search terms in <exact> tags, so you have to ignore these when comparing search phrases). This technique also means I can do more exact matching (e.g. including case sensitivity) than a search would allow.
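The filtering step is essentially the following – the record and title element names are what I saw in the Warwickshire responses and may differ elsewhere; conveniently, Nokogiri’s .text flattens the <exact> wrappers away:

```ruby
# xml_doc is the Nokogiri document returned by aquabrowser_search above
def matching_records(xml_doc, wanted_title)
  xml_doc.search('record').select do |record|
    # .text concatenates the text of child nodes, so the <exact> highlight
    # wrappers around matched terms disappear and we compare plain strings
    found_title = record.at('title')&.text.to_s.strip
    found_title == wanted_title.strip   # exact, case-sensitive match
  end
end
```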

I decided that if I found duplicates having done this filtering I’d take each duplicate record and include them in my results for later manual tweaking.

So now I have a set of bibliographic records, generally with a title, an author (or author statement) and an ISBN. An added bonus is that a URL for a cover image for the book is in the response, so I grab that as well (it’s an Amazon cover image). What I don’t have is the price (the price may well be in the library record somewhere, but it isn’t included in the XML response). So I then use the ISBN to look up the price on Amazon – I tried a couple of Ruby libraries designed to work with the Amazon ‘Product Advertising’ API (which is the one you need for things like prices), but the only one I really liked was ‘Sucker‘ – it is very lightweight, doesn’t hide the Amazon API that much, and again has Nokogiri somewhere underneath (and some other stuff) – which means dealing with the results is relatively straightforward if you’ve used Nokogiri.
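For what it’s worth, the lookup request looked something like the sketch below. This is from (imperfect) memory of the Sucker README of the time, so treat the option names, parameter names and method calls as assumptions rather than a verified recipe:

```ruby
require 'sucker'

# Assumed Sucker usage – option and method names may not match the gem version used
request = Sucker.new(
  :locale => 'uk',
  :key    => ENV['AWS_ACCESS_KEY'],
  :secret => ENV['AWS_SECRET_KEY']
)

request << {
  'Operation'     => 'ItemLookup',
  'IdType'        => 'ISBN',
  'SearchIndex'   => 'Books',          # Amazon requires a SearchIndex when IdType isn't ASIN
  'ItemId'        => isbn,             # isbn comes from the catalogue lookup above
  'ResponseGroup' => 'ItemAttributes'  # ItemAttributes includes the list price
}

response = request.get
# The list price (an amount in pence plus a currency code) sits under
# Item/ItemAttributes/ListPrice in the parsed response.
```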

So, from the Amazon response I grab the price (I check it is in GBP first) and also a URL for the Amazon product page (so I can provide a link back to Amazon). (This isn’t perfect – despite the library catalogue searching above, I don’t get ISBNs for every item and so can’t search them all on Amazon).

Now I have all the information I need, but I don’t have a way to present it (and I still have some unwanted duplicate entries in the books list). For this I went back to a tactic I used previously for the (currently defunct) ‘What to Watch‘ application – importing the results of a script into WordPress as a series of blog posts. In this case I created a CSV file, which allowed me to easily add in a few tags – such as names to indicate which account (mine, Damyanti’s or our son’s) the item had been borrowed on. The ‘text’ of each blog post was some simple HTML to display the cover image and a few links – including an option to add the item to a LibraryThing account in a single click. I also included some WordPress ‘custom fields’ to record the title, ISBN and price – to enable additional functions in the blog.
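The CSV generation itself is trivial with Ruby’s standard library – the column names below are placeholders, as the real ones need to match whatever the csv importer plugin and my custom fields expect:

```ruby
require 'csv'

# books: the array of hashes assembled in the earlier steps (hypothetical keys)
CSV.open('overdue_books.csv', 'w') do |csv|
  # Header row – column names are placeholders for the plugin's expected columns
  csv << %w[post_title post_content post_tags field_title field_isbn field_price]
  books.each do |b|
    body = %(<img src="#{b[:cover_url]}"/> <a href="#{b[:amazon_url]}">View on Amazon</a>)
    csv << [b[:title], body, b[:borrower], b[:title], b[:isbn], b[:price]]
  end
end
```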

Using the ‘csv importer’ WordPress plugin (a great plugin) to import the CSV file, the content was now in the blog. This is the point at which we removed those duplicate book entries (this did require us to remember what we’d borrowed!). The final step was to actually do the calculation I originally wanted to do – the cost of the books we’d borrowed. This was achieved by writing a WordPress plugin to display a widget which uses SQL to get values from the ‘custom field’ storing price information. Since I haven’t got prices for all books, I use those I have got, work out an average per book, and multiply this up by the total number of loans.
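The calculation in the widget boils down to this (shown here in Ruby for readability, rather than the SQL/PHP the plugin actually uses, and with made-up numbers):

```ruby
# prices: the prices we did manage to find; total_loans: every loan, priced or not.
# Loans without a price are estimated at the average of the known prices.
def estimated_cost(prices, total_loans)
  return 0 if prices.empty?
  average = prices.sum / prices.size.to_f
  (average * total_loans).round(2)
end

estimated_cost([7.99, 5.24, 12.50], 18)   # => 154.38
```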

The final result is the ‘Overdue Books‘ blog, showing every book we have borrowed as a family since September 2009, at http://www.meanboyfriend.com/overdue_books/. The headline figure is that if we’d bought the items we’ve borrowed in this time the cost would have been approximately £3,428.26 – or around £190 per month – or around £44 a week. I think it is fair to say even a relatively well-off family would find this steep!

Finally, the code is available at https://github.com/ostephens/Savelibs – there are all kinds of things I’m unhappy about and want to improve, including modifying it so it can be used to regularly update the Overdue Books blog without any manual intervention from me. If anyone wants to take it and make something that any library user could use to do the same type of calculation, feel free (I’d like to do this, but I’m not sure where I’m going to get the time).

Addendum 08/03/2011

In an oversight last night I forgot to add my apologies and thanks to the IT people at Warwickshire council and library. Firstly, they chose a system for their library catalogue with an API (despite some shortcomings, this is a big step!), and secondly they didn’t shout at me when I repeatedly ran hundreds of searches against their catalogue in quick succession, but rather got in touch politely so we could work together to ensure the scripts weren’t going to cause a problem on the system. If you are thinking of hacking stuff around like this, it’s probably as well to get in touch with the system owners first!

What I do hope is that there might be some opportunity for developers like me to work with the library system and data – it would be really good if we could see applications built on top of the library system’s APIs (such as this one, which won the first ‘Hack Warwickshire’ competition) – the attitude Warwickshire have to open data and wanting to get developers involved is really refreshing 🙂

CORE

CORE (COnnecting REpositories) (Presented by Petr Knoth from the Open University)

Working with content and metadata from Open Access institutional repositories – approx 167 repositories in the UK. Mainly interested in full-text items (approx 10 percent of metadata records in repositories have full-text items attached).

Will use OAI-PMH to harvest metadata, and then use this to grab the PDF (or other full-text) representations of resources. Will then analyse the content, find ‘similarities’ between items, and express these as RDF. Will then make this available via a triple store.
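For reference, a bare-bones OAI-PMH harvest looks something like the sketch below (my sketch, not CORE’s code) – the verb and metadataPrefix are standard OAI-PMH, but the repository URL is a placeholder and I’ve ignored resumptionToken paging:

```ruby
require 'open-uri'
require 'nokogiri'

# ListRecords with the mandatory Dublin Core metadata format;
# the base URL is a placeholder, not a specific repository.
base = 'http://repository.example.ac.uk/cgi/oai2'
doc  = Nokogiri::XML(URI.open("#{base}?verb=ListRecords&metadataPrefix=oai_dc"))
doc.remove_namespaces!

# Pull out record identifiers and any dc:identifier values (often the full-text URL)
doc.search('record').each do |rec|
  id    = rec.at('header identifier')&.text
  links = rec.search('metadata identifier').map(&:text)
  puts "#{id}: #{links.join(', ')}"
end
```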

Have started working with the Open University repository (ORO) – finding about 30% have full-text. Will focus on extracting relationships – specifically ‘semantic similarity’ based on content… (rather than on metadata)

Use cases – demonstrator client that can be integrated into any repository – which will provide links to papers in other repositories based on similarity relationships – will be open to any institution to use.

COMET

(Ed Chamberlain presenting) – COMET (Cambridge Open Metadata) – a follow-on from the Open Bibliography project – collaborating with CARET at Cambridge, with support from OCLC.

Want to publish lots of Open Bibliographic data – engaging with RDF, enriching records with FAST and VIAF linking – and will document experiences

Will be taking MARC21-encoded records from the library catalogue – data taken from the main University Library catalogue – a historical mix of quality and origin built up over time, with some mixture of standards as cataloguing practice has changed.

Data will be made available as MARC21 (bulk download) and also RDF/XML (bulk download and a triple store with SPARQL endpoint etc.). Will publish where possible under the PDDL, but will need discussions with OCLC where it includes data from OCLC-derived records…

Contextual Wrappers

(Presented by David Scruton) Contextual Wrappers is a project from the Fitzwilliam Museum at the University of Cambridge – working with Knowledge Integration Ltd and Culture Grid.

Looking at how ‘Collection Level Descriptions’ interact with/add value to ‘item level descriptions’

Will be providing metadata for 160k items and 50 collection level descriptions (think I got that right)

Will be producing draft update for collection level descriptions (and related APIs) in the Culture Grid