Linked Data and Libraries: Richard Wallis

A year on from the first Talis Linked Data and Libraries meeting – a lot has happened. The W3C has an incubator group on library linked data; the BL has released a load of records as RDF/XML – a brave decision; Richard went to a meeting in Denmark where there was discussion about trying to persuade publishers to release metadata; Europeana Linked Data is now available; parts of the French National Catalogue too; etc. etc.

While it may feel that progress is slow – people are getting out there and ‘just doing it’ as Lynne Brindley just suggested.

Talis started out pioneering in the Semantic Web/Linked Data area. Recently the Library Systems Division has been sold to Capita – allowing it to focus on libraries, while ‘Talis’ goes forward with a focus on linked data/the semantic web.

Now Talis is made up of:

Talis Group – core technologies/strategy – run the Talis Platform
Talis Education – applications in academia – offer the Talis Aspire product (for ‘reading lists’ in Universities)
Talis Systems Ltd – Consulting/Training/Evangelism around Linked Data etc.
Kasabi.com – [Linked] Data Marketplace – free hosting at the moment – Community – APIs. Eventually looking to monetise

Enough about Talis …

UK Government still pressing forward with open linked data
BBC have done more with Linked Data – e.g. the World Cup 2010 site was based on Linked Data – delivered more pages and more traffic with fewer staff. The BBC is already working with the same technology on the Olympics 2012 site…

Richard now mentioning the GoodRelations ontology – and its adoption by very large commercial organisations.

The Linked Data ‘cloud’ has got larger – more links – but what are these links for?
Links (i.e. HTTP URIs) identify things – and so you can deliver information about things you link to… Richard says lots of the ‘talk’ is about things like SPARQL endpoints etc., but it should be about identifying things and delivering information about them.
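
To make that concrete, here is a minimal sketch of my own (not from the talk), assuming the Python requests and rdflib libraries: dereference an HTTP URI asking for RDF, and you get back machine-readable information about the thing it identifies. The URI used is just an illustrative DBpedia one.

```python
# A minimal sketch (mine, not from the talk): an HTTP URI both identifies
# a thing and, when dereferenced asking for RDF, delivers information
# about it. Assumes the 'requests' and 'rdflib' libraries are installed.
import requests
from rdflib import Graph

uri = "http://dbpedia.org/resource/British_Library"  # illustrative URI
resp = requests.get(uri, headers={"Accept": "application/rdf+xml"})

g = Graph()
g.parse(data=resp.text, format="xml")
print("%d triples returned about %s" % (len(g), uri))
```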

Richard breaking down Linked Data – how RDF describes stuff etc. It allows you to express relationships between things – relationships that machines can parse… [Richard actually said ‘understand’, but I don’t think he is necessarily talking AI here]
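
As a tiny illustration of that triple model (my example, not Richard's), using rdflib – all the URIs here are hypothetical:

```python
# A tiny illustration of the RDF triple model (my example, not
# Richard's). Every statement is a subject-predicate-object triple;
# the example.org URIs are hypothetical.
from rdflib import Graph, Literal, Namespace, URIRef

DC = Namespace("http://purl.org/dc/terms/")

g = Graph()
book = URIRef("http://example.org/id/book/1")      # a thing, identified by a URI
person = URIRef("http://example.org/id/person/7")  # another thing (its author)

g.add((book, DC.title, Literal("Open Bibliographic Data")))  # thing -> literal value
g.add((book, DC.creator, person))  # thing -> thing: a machine-parseable relationship

for s, p, o in g:
    print(s, p, o)
```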

Richard stressing that Linked Data is about technology and Open Data is about licensing – separate issues which talking about ‘Linked Open Data’ conflates. He quotes Paul Walk on this, from http://blog.paulwalk.net/2009/11/11/linked-open-semantic/ – but says he (Richard) would talk about the Linked Data web rather than the Semantic Web (Paul uses the latter term)

Richard thinks that Libraries have an advantage in entering the Linked Data world – lots of experience, lots of standards, lots of records. We have described things, whereas generally people just have the things they need to describe.

Already have lots of lists of things – subject headings (LCSH), classifications (Dewey), people (authority files)

Are libraries good at describing things… or just books?

Are Libraries applicable for Linked Data? How are we doing? Richard gives a school report – “Could do better”; “Showing promise”

When we look at expressing stuff as linked data we need to ask “Why? And who for?”

Linked Data and Libraries: Keynote

Today at the second Linked Data and Libraries meeting organised by Talis.

Lynne Brindley is kicking off the day…
She notes that the broad agenda is about getting access to information, linking across domains etc. The BL sees the potential of a Linked Data approach to increase the use of its catalogue, and so of its collections – bringing better discovery, and therefore utility, to researchers, and exploiting the legacy of data that has been created over long periods of time.

The British Library is ‘up for it’ – but needs to look at the costs and benefits, and may need to convince sceptics. But the BL has a history of taking innovative steps – it introduced public online information retrieval systems to the UK around 40 years ago (MEDLARS in the 1970s). Ten years later the UK was one of the first countries to publish its National Bibliography on CD-ROM (now ‘museum pieces’, says Lynne!).

And now the BL is exposing the national bibliography as linked open data… – but first, some history:

The BL is involved in UK PubMed Central (UKPMC) – a repository of papers, patents, reports etc. It contains many data types from many organisations, and provides better access to hard-to-find reports, theses etc. For Lynne this is also about ‘linking’, even if it is not built on the “Linked Data” technology stack – she sees it as part and parcel of the same thing, and as movement in the direction of linking materials/collections.

Also ‘sounds’ – the UK Sound Map http://sounds.bl.uk/uksoundmap/index.aspx – linked across domains, and also involved the public in capturing the ‘sounds of Britain’ via AudioBoo – metadata was added and the result mashed up with Google Maps…

The ‘Evolving English’ exhibition had a ‘map your voice’ element – many people recorded the same piece of material, and the recordings have been incorporated into a research database of linguistic recordings – global collaboration and massive participation.

Lynne says it is pretty difficult to do multi-national, multi-organisational stuff – and we should learn from these examples.

The BL catalogue is the primary tool to access, order and manage the BL collections. The BL has long operated a priced service under which catalogue records are sold to others, in various forms. Despite pressure from Government to earn money, the BL decided to take the step of offering BNB records as RDF/XML under a CC0 licence. Today it will be announcing a linked data version of the BNB – more later today from Neil Wilson.

The hope is that the data will get used in a wide variety of ways. The key lesson for the BL, says Lynne, is to ‘relinquish control, let go’ – however you think people are going to use what you put out there, they will use it in a different way.

The promise of linked data offers many benefits across sectors, across ‘memory institutions’. But the institutions involved will need to face cultural change to achieve this. ‘Curators’ in any context (libraries, archives, museums) are used to their ‘vested authority’ – we need to recognise this at the same time as ‘letting go’. From the library point of view, no-one can afford to stand on the sidelines – we need to get in there and experiment.

We need to get out of our institutional and metadata silos – and take a journey to the ‘mainstream future’. Partnerships are important – and everyone wants to ‘partner’ with the British Library – but often the proposed partnerships are one-sided. We need to look for win-win partnerships with institutions like the BL.

Final message – we are good at talking – but we need to ‘just do it’ – do it and show benefits and convince people.

JISCExpo: Past, Present and Future of Linked Data in Higher Education

Opening the day is Paul Miller, who authored the Linked Data Horizon Scan.

The horizon scan was written in Q3/Q4 2009 to see what JISC needed to focus on in terms of Linked Data. It included 9 recommendations: 3 on web identifiers, 4 on data publishing, and 2 on support measures.

Paul going to revisit these recommendations this morning…

“Learn from Cabinet Office Guidance on the creation of URIs”
Paul thinks this ‘is done’ – “almost unnecessary to say in 2011”
[I’m not as optimistic as Paul on this one – still feel huge battle to fight in terms of convincing people that URIs are identifiers not just web addresses]

“Identify a core set of widely used identifiers (JACS, institution codes, etc.) and assign HTTP URIs”
Paul says “some progress, I suppose – really should have been more though – this is the step that will make all of this stuff work across data sets”

“Identify the ways that researchers identify themselves and link to institutional, professional, social identities as appropriate”
Paul says – on the whole this remains ad hoc, with individuals using self-defined URLs (whether self-owned or a twitter/linkedin/blog url) as surrogates. Paul asks – is this OK?

David Shotton (Oxford) mentions ORCID http://www.orcid.org/ – the feeling from the room is that it shows promise for solving the problem of identifying people in authoring contexts – but there is a larger problem, and the question of how multiple identifiers for people are matched up is also going to be problematic

“Look at OPSI Unlocking Service (http://www.opsi.gov.uk/unlocking-service/OPSIpage.aspx?page=UnlockIndex) and consider whether a similar approach might be used in helping the community identify data sets to prioritise”
Paul says – this has not really been looked at – do we still need to?

“Evaluate the effectiveness of Data Incubator etc as a way of marrying data to developers”
Paul says – some DevCSI activity on this but nothing systematic?
Also http://getthedata.org gets a mention, as does the ‘data without borders’ initiative http://jakeporway.com/2011/06/data-without-borders-an-exciting-first-day/

“Validate existing data licenses, and engage with government”
Not perfect but pretty good across the Strategic Content Alliance (SCA http://sca.jiscinvolve.org/wp/), the Discovery initiative (http://discovery.ac.uk), etc.

Paul feels we are moving to a point where the question is ‘why can this not be open?’ rather than ‘why should this be open?’

“Demonstrate the utility of embedding RDFa on institutional web pages”
Paul is really surprised by the apparent lack of any serious progress in this area… Debate from the floor – some question the value of RDFa – why do it? If your CMS doesn’t support it, then it is difficult to achieve. I raise the issue we saw in Lucero of how Google et al. actually present or use data published in RDFa
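
For anyone wondering what “embedding RDFa” actually involves, a hedged sketch of my own (not from the session): the same human-readable HTML, with machine-readable attributes layered on top. The URIs and vocabulary choices are illustrative, and I've wrapped it in a Python string purely so it sits alongside the other code sketches in these notes.

```python
# A hedged sketch (mine) of embedding RDFa in an institutional web page:
# ordinary HTML with machine-readable attributes layered on top. The
# about/property URIs and vocabulary are illustrative; shown as a Python
# string only for consistency with the other sketches here.
fragment = """
<div xmlns:dc="http://purl.org/dc/terms/"
     about="http://example.org/id/book/1">
  <span property="dc:title">Open Bibliographic Data</span>,
  by <span property="dc:creator">A. N. Author</span>.
</div>
"""
print(fragment)  # this fragment would be served inline in a web page
```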

“Identify ways in which the community can consume and contribute to existing data services.”
? Missed Paul’s summary…

“Identify a focus for Linked Data activities”
This programme – #jiscexpo – so the challenge now is how to share and get the issues out.

Paul thinks on balance – good progress on 4, failed on 5

Paul says – we need to focus less on raw numbers and more on real utility – and on more links between resources. How do we achieve this? Very little interlinking is going on, except around a small number of key resources such as DBpedia. We need real linking beyond a single data set, beyond a single institution…

Debate from the floor – is ‘link more’ really relevant? Paul agrees – again, just ‘lots of links’ is not the point – it is about valuable linking.

I make a point about the re-use of URIs – it is more difficult than coining new URIs!

Paul says we need to share wisdom around – where does Linked Data add real value, where is it merely possible, and where is it ‘really stupid’…

JISCExpo: Notes from the Linked Data Workshop at Stanford

This session is being presented by Jerry Persons.

This workshop spent a week looking at Linked Data – ‘be part of the web, not just on it’. It was sponsored by a variety of research and national libraries, research groups, companies, etc.

The workshop focused on “crafting fund-able plans for creating tools, processes and vehicles to expedite a disruptive paradigm shift in the workflows, data stores and interfaces used for managing, discovery and navigating…”

The workshop was deliberately ‘library focussed’ but recognised much wider issues – especially the synergy for GLAM (galleries, libraries, archives, museums)

“I’ve liked to characterize the current moment as a circle of libraries, museums, archives, universities, journalists, publishers, broadcasters and a number of others in the culture industries standing around, eyeing one another and at the space between them while wondering how they need to reconfigure for a world of digitally networked knowledge” – Josh Greenberg

“The biggest problem we face right now is a way to ‘link’ information that comes from different sources that can scale to hundreds of millions of statements” – Stefano Mazzocchi

22 issues were identified by the mid-point of the workshop – just a few here:
  • co-referencing and reconciliation
  • use of extant metadata
  • killer apps
  • user seduction and training
  • workflow
  • scalability
  • licensing

Jerry says … “The elevator pitch for linked data does not yet exist”

Thinking about ‘novice’ (apprentice), ‘journeyman’, ‘master’ stages of engaging with Linked Data:

  • Value statement use cases
  • Publishing data
  • etc.

At each stage we should be looking for model implementations that people can look at/follow

Elephants in the room:
URIs not strings – don’t underestimate the amount of effort required to transform large subsets of GLAM metadata into linked data with URIs as identifiers
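
A hedged sketch of my own (not Jerry's) of what the “URIs not strings” shift looks like in practice, using rdflib – the record URI is hypothetical and the id.loc.gov identifier is illustrative:

```python
# A hedged sketch (mine, not Jerry's) of the "URIs not strings" shift,
# using rdflib. The example.org record URI is hypothetical and the
# id.loc.gov identifier is illustrative.
from rdflib import Graph, Literal, Namespace, URIRef

DC = Namespace("http://purl.org/dc/terms/")
g = Graph()
record = URIRef("http://example.org/id/record/42")  # hypothetical record URI

# Legacy metadata: the subject heading as an opaque string.
g.add((record, DC.subject, Literal("Libraries")))

# Linked data: the same concept as a dereferenceable URI that other
# data sets can link to.
g.add((record, DC.subject,
       URIRef("http://id.loc.gov/authorities/subjects/sh85076502")))
```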

Caveats…
  • Management of co-references needs to be a bottom-up process
  • Build systems that accept the way the world is, not the way you would like it to be
  • Focus on changing current practices (in the long run), not only on reconciling data (in the short run) – preventing problems is better than solving them!

Some docs will be coming out from the workshop very soon, as well as proposals for work over the next few months

JISCExpo: Community and Linked Data

I’m at the #jiscexpo programme meeting today and tomorrow…

Ben O’Steen is the first formal talk of the day … talking about ‘community’…

Ben notes that SPARQL has a very bad reputation – people don’t like it and don’t want to use it. Taking a step back – SQL is the standard way of interacting with databases, but in general you don’t write SQL queries against someone else’s database – it is very unusual to do this without permission, documentation etc. (I guess unless you are really hacking into it!)

In general SQL databases are hidden from ‘remote’ users via APIs or other interfaces which present you with views on the data, not the raw data structure.

So what does this tell us about what we need to do with Linked Data?

The interaction feedback loop is fundamental – if you can get this right you get engagement. Example: ‘mouse presses button, mouse gets cheese’ – this encourages a behaviour in the mouse. Ben uses World of Warcraft as an example of an interaction feedback loop that works incredibly well – people write their own programs and interfaces for WoW.

Ben notes this is not about gamification… this is about getting a pay-off for interaction.

Ben sets some homework – go and read http://jboriss.wordpress.com/2011/07/06/user-testing-in-the-wild-joes-first-computer-encounter/ – a blog post about user testing on web browsers and the experience of ‘Joe’, a 60-year-old who had never used a computer before, and what happened when he tried to find a local restaurant to eat in via three major web browsers: “There is little modern applications do to guide people who have never used a computer”.

Sliding scale of interaction

  • googling and finding a website;
  • hunting and clicking round the website for information;
  • using a well-documented or cookie-cutter API (such as an Atom feed or a search interface);
  • Using boolean searching or other simple API ‘tricks’ –
    • WITHOUT requiring understanding of the true data model

Ben now goes back to SPARQL – it is common to become frustrated when interacting with an unknown SPARQL endpoint…

What do you need to understand to craft successful SPARQL? You need to understand all of the following (see the sketch after this list):

  • RDF and triple/quad model
  • RDF types and namespaces
  • structures in an endpoint
  • SPARQL syntaxes
  • SPARQL return formats
  • libraries for RDF responses
  • libraries for XML responses
  • … and more
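
To illustrate how much of that list a developer meets before getting a single answer back, here is a minimal sketch of my own (not Ben's), assuming the Python SPARQLWrapper library and using the public DBpedia endpoint:

```python
# A minimal sketch (mine, not Ben's) of everything a newcomer has to
# juggle for one SPARQL query: the endpoint URL, the namespaces, the
# triple pattern, the return format, and a library to unpack the
# response. Assumes the SPARQLWrapper library; the endpoint is DBpedia's.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?label WHERE {
      <http://dbpedia.org/resource/British_Library> rdfs:label ?label .
      FILTER (lang(?label) = "en")
    }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for binding in results["results"]["bindings"]:
    print(binding["label"]["value"])
```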

Developers are clamouring for APIs

  • Every new social/web service is seen to be lacking if it is missing an API, given the desire to build mobile applications
  • If SPARQL can be seen as the ultimate API, then by the same logic the ultimate Twitter API would be direct access via its Scala/Java libraries
  • Many people need to see the benefits of something simple in order to hook them into learning something more complex

Taking an ‘opinionated view’ of information helps adopters – offering a constrained view of the model. You could offer csv/json/html views of the data behind a SPARQL endpoint. Ben notes ‘access to the full model is a wonderful thing’ – but don’t forget (paraphrasing) ‘most average developers want a constrained view’.
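
A hedged sketch of my own (not Ben's) of what such an ‘opinionated’ view might look like: hide the SPARQL endpoint behind a function that returns plain rows, so the consumer never has to learn the triple model. Assumes SPARQLWrapper; the endpoint and query would be chosen by the service owner.

```python
# A hedged sketch (mine, not Ben's) of an "opinionated" view: wrap a
# SPARQL endpoint in a function that returns plain rows, so consumers
# never see the triple model. Assumes the SPARQLWrapper library.
from SPARQLWrapper import SPARQLWrapper, JSON

def simple_rows(endpoint, query):
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    # One flat dict per result row, e.g. {'label': 'British Library'}.
    return [{var: b[var]["value"] for var in b}
            for b in results["results"]["bindings"]]
```
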
Ben is now talking about schema.org – a new initiative from Google, Bing and Yahoo! Ben notes – schema.org delivers ‘cheese’ immediately – it is clear that the reason you want to do this is to improve search engine results.

Ben notes – schema.org contains very ‘opinionated’ views of the things it can describe – but this gives simplicity and lowers barriers to adoption.

Schema.org is going to increase the amount of structured data on the web.

In summary:

  • Be empathetic to those who don’t understand what you are doing
  • Need to provide a gamut of views on your data
  • You don’t have to use a triplestore to use RDF
  • Raw dumps of data are often far better than dumps of structured data such as RDF, if that structure is not documented
  • “Semantic Web” has garnered such bad PR that ‘we’ (?) are on the back foot – things and attitudes need to change, or it will be forgotten in favour of schema.org