Provenance and Linked Open Data

Today and tomorrow I’m at a workshop on Provenance and Linked Open Data in Edinburgh. The workshop is linked to the work of the W3C Provenance Incubator Group.

First up Paul Groth (@pgroth) from VU University Amsterdam is going to summarise the work of the incubator group and outline remaining open questions.

Paul says this audience (at this workshop) takes as a given the need for provenance. Provenance is fundamental to the web – it is a pressing issue in many areas for W3C:

  • Linked data/Semantic web
  • Open government
  • HCLS (Health Care and Life Sciences)

Most people do not know how to approach provenance – they are looking for a standard and a methodology that they can use immediately. Existing research and work on provenance is scattered across computer science and library science, so it is hard to get an overview. Enterprise/business systems also often have a concept of provenance, but without using the same terminology.

The provenance group was tasked to ‘provide state-of-the-art understanding and develop a roadmap’. About 20 active members worked for about a year and produced:

  • Common (working) definition of provenance

“Provenance of a resource is a record that describes entities and processes involved in producing and delivering or otherwise influencing that resource”

Provenance is metadata (but not all metadata is provenance). Provenance provides a ‘substrate for deriving different trust metrics’ (but it isn’t trust)

Provenance records can be used to verify and authenticate among other uses – but you can have provenance without cryptography/security

Provenance assertions can have their own provenance!

Inference is useful if provenance records are incomplete. There may be different accounts of provenance for the same data.
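
To make the working definition concrete, here is a minimal sketch of what such a record might look like in code. The class name, field names and identifiers are all hypothetical (nothing here comes from the incubator group's output). It only illustrates three points from the talk: a record names entities and processes, a record can carry its own provenance, and simple inference over several accounts can fill in indirect sources.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProvenanceRecord:
    """Hypothetical record shape, loosely following the working definition."""
    resource: str                       # identifier of the resource described
    entities: list[str]                 # entities involved in producing it
    processes: list[str]                # processes that produced or influenced it
    derived_from: list[str] = field(default_factory=list)
    asserted_by: Optional[ProvenanceRecord] = None  # provenance of this record


def infer_sources(resource: str, records: list[ProvenanceRecord]) -> set[str]:
    """Follow derived_from links transitively across all accounts."""
    # Several records may describe the same resource (different accounts),
    # so merge their direct sources instead of overwriting them.
    direct: dict[str, list[str]] = {}
    for rec in records:
        direct.setdefault(rec.resource, []).extend(rec.derived_from)

    seen: set[str] = set()
    stack = list(direct.get(resource, []))
    while stack:
        src = stack.pop()
        if src not in seen:
            seen.add(src)
            stack.extend(direct.get(src, []))
    return seen


# Made-up example data: two accounts of the same article, one of which
# has its own provenance (who asserted it).
curator_note = ProvenanceRecord("record:article-1", ["curator"], ["annotation"])
article = ProvenanceRecord(
    "article-1", ["news-agency"], ["editing"],
    derived_from=["wire-report-7"], asserted_by=curator_note,
)
wire = ProvenanceRecord(
    "wire-report-7", ["reporter"], ["reporting"],
    derived_from=["press-release-3"],
)
rival_account = ProvenanceRecord(
    "article-1", ["blogger"], ["copying"], derived_from=["blog-post-9"],
)

sources = infer_sources("article-1", [article, wire, rival_account])
```

In this sketch no record links `article-1` directly to `press-release-3`; that link is inferred by following the chain. The two accounts of `article-1` are simply merged, which sidesteps the harder question of which account to believe.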

  • Developed set of key dimensions for provenance

3 top level dimensions:

Content – ability to identify things; describe processes; describe who made a statement; know how a database solved a specific query

Management – How should provenance be ‘exposed’? How do we deal with the scale of provenance?

Use – How do we go about using provenance information – showing trust, ownership, uncovering errors, interoperability…

Each of these dimensions broken down into further sub-categories.

  • Collected use cases

Over 30 use cases – from many domains (at least two from Cultural Heritage: ‘Collection vs Objects in Cultural Heritage’ and ‘Different_Levels_Cultural_Heritage’).

  • Designed 3 flagship scenarios from the use cases

The 30+ use cases were boiled down into three ‘super use-cases’ – trying to cover everything:

  1. News aggregator
  2. Disease Outbreak
  3. Business Contracts

  • Created mappings for existing vocabularies for provenance
  • … more

Group came up with recommendations:

  • Proposed a Provenance Interchange Working Group – to define a provenance exchange language – to enable systems to exchange provenance information, and to make it possible to publish this on the web


W3C is in the process of deciding whether to approve the Provenance Interchange Working Group. If it goes ahead, it will start soon. It would be a two-year working group with an aggressive deliverable target. “Standards work is hard” says Paul. It will rely on the next version of RDF (no time to cover this now).

Open Questions:

  • How to deal with Complex Objects
    • dealing with multiple levels of granularity
    • how provenance interacts with Named Graphs
    • Unification of database provenance and process ‘style’ provenance
    • objects, their versions and their provenance
    • visualisation and summarization
  • Imperfections
    • What is adequate provenance for proof/quality?
    • How do we deal with gaps in provenance?
    • Repeatability vs. reproduction and how much provenance is enough?
    • Can provenance help us get around the problem of reasoning over integrated data?
    • Using provenance as a platform for trust, does it work?
  • Distribution
    • How do we encourage provenance capture?
    • Multiple disagreeing claims about the origins of data – which one is right?
    • SameAs detection through provenance
    • Distribution often gives us privacy – once we integrate how do we preserve privacy
    • Scale (way more provenance than data! Has to scale – to very large)
    • Hypothesis: distribution is a fundamental property of provenance
