Provenance use cases for legislation

Stephen Cresswell from The Stationery Office outlining:
  • 60k pieces of legislation for the UK, managed by the National Archives
  • Publication in various formats – paper docs, PDF, XML, XHTML+RDFa, RDF
  • TSO (The Stationery Office) currently redesigning workflows for legislation

Use cases/Requirements:

  • Drafters of legislation (e.g. government departments) – “how is my job progressing through the publishing workflow?”
  • Management information (aggregated) – “where are the bottlenecks in the publishing workflow”
  • Maintainers may want to trace problems – “Which documents were derived from this XSLT?”
  • Anyone might ask “Where did this document come from?”
  • Acceptance test – re-run workflow from provenance graph (to prove that the provenance recorded is the true provenance – that re-running the same workflow results in the same outcome)
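The maintainer's question above (“Which documents were derived from this XSLT?”) can be sketched as a simple lookup over a toy derivation graph. All file and stylesheet names here are invented for illustration:

```python
# A toy provenance graph as adjacency sets: each derived document maps to
# the set of sources it was produced from. Names are illustrative only.
derived_from = {
    "act-1999-12.html": {"act-1999-12.xml", "legislation-to-xhtml.xslt"},
    "act-1999-12.pdf": {"act-1999-12.xml", "legislation-to-pdf.xslt"},
    "act-2003-07.html": {"act-2003-07.xml", "legislation-to-xhtml.xslt"},
}

def documents_derived_from(source, graph):
    """The maintainer's question: which documents were derived from this source?"""
    return sorted(doc for doc, sources in graph.items() if source in sources)

print(documents_derived_from("legislation-to-xhtml.xslt", derived_from))
# ['act-1999-12.html', 'act-2003-07.html']
```

In a real system this would be a query over recorded provenance (e.g. SPARQL over an RDF store) rather than an in-memory dict, but the shape of the question is the same.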


Provenance and Linked Data Issues

This list formed from discussions and voting yesterday (and will be posted to the wiki at some point)

Top Issues:

  • Things that change – if the thing that a URI points at changes (e.g. the boundaries of a geographical area), what happens to the existing provenance statements?
  • Provenance of links between data ‘islands’
  • Summarisation of provenance to aid usability and scalability
  • Reasoning about provenance: cross-referencing to fill gaps in the provenance

Further issues

  • 80/20 principle for provenance
  • What general provenance model to use to enable interoperability
  • Provenance for validation of facts
  • Is reasoning even possible over provenance?
  • Interaction of triple stores and data integration/transformation
  • Semantics vs data capture (does the rich semantic nature of Linked Data offset the need to capture provenance data ‘at source’)
  • Access level of provenance (public vs private provenance statements)

The Open Provenance Model

Introduction to the Open Provenance Model (OPM) from Luc Moreau, University of Southampton.

Came out of a series of ‘provenance challenges’ – driven by a desire to understand provenance systems. The first three provenance challenges (around interoperability of provenance information from different systems) led to the OPM, which became the basis of a further challenge event.

OPM is an ‘annotated causality/dependency graph – directed, acyclic’ – it is an abstract model with serialisation formats in XML and RDF. It encourages specialisation for specific domains through profiles.
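A minimal sketch of what ‘directed, acyclic’ means for an OPM-style graph, using invented node names with OPM's dependency labels (used, wasGeneratedBy, wasDerivedFrom):

```python
# Minimal OPM-style graph: nodes are artifacts or processes, edges carry
# OPM dependency types. Node names are illustrative only.
edges = [
    ("used", "render", "bill.xml"),             # process used an artifact
    ("wasGeneratedBy", "bill.html", "render"),  # artifact generated by a process
    ("wasDerivedFrom", "bill.html", "bill.xml"),
]

def is_acyclic(edges):
    """OPM graphs must be directed and acyclic; verify with a DFS cycle check."""
    graph = {}
    for _, src, dst in edges:
        graph.setdefault(src, []).append(dst)
    visiting, done = set(), set()

    def dfs(node):
        if node in done:
            return True
        if node in visiting:
            return False  # back edge found: cycle
        visiting.add(node)
        ok = all(dfs(n) for n in graph.get(node, []))
        visiting.discard(node)
        done.add(node)
        return ok

    return all(dfs(n) for n in graph)

print(is_acyclic(edges))  # True
```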

A simple use case inspired by a Jeni Tennison blog post [worth reading to look at the practical application of provenance in RDF]

Lots of detail here that I didn’t get – need to go and read

OPM itself doesn’t say anything about aggregations – but using OPM it is possible to build an ontology that expresses this. There is a draft proposed collection profile, but it needs more work.

Some work around being able to add a digital signature to provenance information – to validate it as from a specific source/person.
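The idea of validating provenance as coming from a specific source can be sketched with an integrity tag over a record. This is only illustrative: a real deployment would use asymmetric digital signatures rather than the shared-secret HMAC used here, and all field names are invented:

```python
# Sketch of attaching an integrity tag to a provenance record so a consumer
# can check it came from a known source. Illustration only: real systems
# would use public-key signatures, not a shared-secret HMAC.
import hashlib
import hmac
import json

SECRET = b"shared-secret-for-illustration"

def sign_record(record):
    # Canonicalise the record so signer and verifier hash the same bytes.
    payload = json.dumps(record, sort_keys=True).encode()
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()

def verify_record(record, tag):
    # Constant-time comparison avoids timing side channels.
    return hmac.compare_digest(sign_record(record), tag)

record = {"artifact": "bill.html", "wasGeneratedBy": "render", "agent": "TSO"}
tag = sign_record(record)
print(verify_record(record, tag))   # True
record["agent"] = "someone-else"
print(verify_record(record, tag))   # False: record was tampered with
```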

… at this point I’m afraid the 5am start got to me and I didn’t manage to capture the rest of this presentation 🙁

Some pointers

Becoming Dataware

This presentation from James Goulding …

Lots of ‘personal’ data being collected (e.g. by Tesco) – but not open. Who owns it? Very unclear – no precedents. Perhaps a parallel with photography – if someone takes a photo of you, it is data about you that you don’t own.

Do ‘businesses’ own it? Lots of data in their data silos (Facebook, Tesco etc.)

Do we own the data? What if ‘I’ (as an individual) want ‘my’ data to be open?

Data ‘tug-of-war’ – data can be duplicated, instantly transferrable

Marx: split between ‘those who own the means of production and those who work on them’ – but in data creation of this type, we’re not the worker? We’re not the customer?

We are the product!

Policy is slow to catch up with practice.

Big data generates new data…

So the question is not who owns an individual’s data, but: who controls the means of analysis?

Vision of a ‘personal datasphere’:

  • My (SPARQL?) endpoint – under my control
  • Logically a single entity – a catalogue on hosts under my control or on my cloud – maintains privacy
  • User controlled – may decide to expose to trusted partners or for a price… (right to access data may not be right to process it!)

Dataware concept – it updates the ‘catalogue’ as other data sources (e.g. energy consumption data, social media etc.) update. Third party applications then could request permission to access data from the Dataware catalogue, the catalogue issues a token, which then allows the 3rd party app to access the data source (Facebook, Bank data, etc. etc.)
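The token flow just described can be sketched as follows. This is a minimal sketch of the described flow, not the actual Dataware implementation, and all class and source names are hypothetical:

```python
# Sketch of the Dataware token flow: a third-party app asks the user's
# catalogue for access; if the user approves, the catalogue issues a token;
# the data source honours only valid tokens. All names are hypothetical.
import secrets

class Catalogue:
    def __init__(self):
        self.tokens = {}  # token -> (app, source) grant

    def request_access(self, app, source, user_approves):
        if not user_approves:
            return None  # the user stays in control of their datasphere
        token = secrets.token_hex(16)
        self.tokens[token] = (app, source)
        return token

class DataSource:
    def __init__(self, name, catalogue, data):
        self.name, self.catalogue, self.data = name, catalogue, data

    def fetch(self, token):
        grant = self.catalogue.tokens.get(token)
        if grant and grant[1] == self.name:
            return self.data
        raise PermissionError("no valid token for this source")

catalogue = Catalogue()
energy = DataSource("energy", catalogue, [12.4, 11.9])
token = catalogue.request_access("budget-app", "energy", user_approves=True)
print(energy.fetch(token))  # [12.4, 11.9]
```

Note that the right to access the data via a token is separate from any right to process it – the catalogue could attach usage conditions to each grant.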

Illustrated here:


James interested in how provenance might apply to the Dataware catalogue.

Provenance in the Dynamic, Collaborative New Science

This presentation by Jun Zhao, University of Oxford – but I missed the start as I had just presented and was getting myself sorted…

Want to use some of the systems/expertise of libraries (esp. digital preservation) to preserve workflows – to make experiments repeatable at any time in the future. Project is ‘Workflow4Ever‘ –

Need to be able to re-run experiments over data sets as they grow – does the finding remain true as the data grows?

Biology Use case – ‘reuse’:

  • Search for existing experiments from myExperiment
    • Challenge – understand the workflow
    • Perform test runs with test data and his/her own data
    • Read others’ logs
    • Read annotations to workflows
  • Reuse scripts from colleagues and perform tests that his/her colleagues are familiar with

Provenance Challenges:

  • Identity
  • Context
  • Storage
  • Retrieval

Provenance and Linked Open Data

Today and tomorrow I’m at a workshop on Provenance and Linked Open Data in Edinburgh. The workshop is linked to the work of the W3C Provenance Incubator Group.

First up Paul Groth (@pgroth) from VU University Amsterdam is going to summarise the work of the incubator group and outline remaining open questions.

Paul says this audience (at this workshop) takes as a given the need for provenance. Provenance is fundamental to the web – it is a pressing issue in many areas for W3C:

  • Linked data/Semantic web
  • Open government
  • HCLS (Health Care and Life Sciences?)

Most people do not know how to approach provenance – people are looking for a standard and methodology that they can use immediately. Existing research/work on provenance is scattered across computer and library science research – hard to get an overview. Also within enterprise/business systems there is often a concept of provenance, but without using the same terminology.

The provenance group was tasked to ‘provide state-of-the-art understanding and develop a roadmap’. About 20 active members, worked over about a year and came to:

  • Common (working) definition of provenance

“Provenance of a resource is a record that describes entities and processes involved in producing and delivering or otherwise influencing that resource”

Provenance is metadata (but not all metadata is provenance). Provenance provides a ‘substrate for deriving different trust metrics’ (but it isn’t trust)

Provenance records can be used to verify and authenticate among other uses – but you can have provenance without cryptography/security

Provenance assertions can have their own provenance!

Inference is useful if provenance records are incomplete. There may be different accounts of provenance for the same data.
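The point above that provenance assertions can have their own provenance can be shown with a trivially nested record (all field names invented for illustration):

```python
# A provenance statement is itself a resource, so it can carry its own
# provenance: who asserted the claim, and when. Field names are illustrative.
statement = {
    "subject": "bill.html",
    "assertion": "wasDerivedFrom bill.xml",
    "provenance": {                 # provenance of the provenance claim itself
        "assertedBy": "publishing-workflow",
        "assertedAt": "2011-03-21",
    },
}
print(statement["provenance"]["assertedBy"])  # publishing-workflow
```

This nesting can recurse, which is one reason provenance data tends to outgrow the data it describes.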

  • Developed set of key dimensions for provenance

3 top level dimensions:

Content – the ability to identify things; describe processes; describe who made a statement; know how a database resolved a specific query

Management – How should provenance be ‘exposed’? How do we deal with the scale of provenance?

Use – How do we go about using provenance information – showing trust, ownership, uncovering errors, interoperability…

Each of these dimensions broken down into further sub-categories.

  • Collected use cases

Over 30 use cases – from many domains (at least two from Cultural Heritage ‘Collection vs Objects in Cultural Heritage‘, ‘Different_Levels_Cultural_Heritage‘).

  • Designed 3 flagship scenarios from the use cases

The 30+ use cases were boiled down into three ‘super use-cases’ – trying to cover everything:

  1. News aggregator
  2. Disease Outbreak
  3. Business Contracts
  • Created mappings between existing vocabularies for provenance
  • … more

Group came up with recommendations:

  • Proposed a Provenance Interchange Working Group – to define a provenance exchange language – to enable systems to exchange provenance information, and to make it possible to publish this on the web


W3C is in the process of deciding whether the Provenance Interchange Working Group should be approved. If this goes ahead it will start soon. A two-year working group with aggressive deliverable targets. “Standards work is hard” says Paul. It will rely on the next version of RDF (no time to cover this now).

Open Questions:

  • How to deal with Complex Objects
    • dealing with multiple levels of granularity
    • how provenance interacts with Named Graphs
    • Unification of database provenance and process ‘style’ provenance
    • objects, their versions and their provenance
    • visualisation and summarization
  • Imperfections
    • What is adequate provenance for proof/quality?
    • How do we deal with gaps in provenance?
    • Repeatability vs. reproduction and how much provenance is enough?
    • Can provenance help us get around the problem of reasoning over integrated data?
    • Using provenance as a platform for trust, does it work?
  • Distribution
    • How do we encourage provenance capture?
    • Multiple disagreeing claims about the origins of data – which one is right?
    • SameAs detection through provenance
    • Distribution often gives us privacy – once we integrate, how do we preserve privacy?
    • Scale (way more provenance than data! Has to scale – to very large)
    • Hypothesis: distribution is a fundamental property of provenance