What to do with Linked Data?

I think Linked Data offers some exciting opportunities to libraries, archives and museums (LAMs), and I’m pleased that others feel the same. However there has been, in my view – and on my part – a bit of ‘build it and they will come’ rhetoric around the publication of linked data by LAMs. This is perhaps inevitable when you are trying to change how such a large and varied set of organisations and people think about and publish data, and particularly with ‘linked data’, where the whole point is to make links between data sets that would once have been silos. You need significant amounts of data out there in linked data form before you can start seeing substantial linking.

However, over the last year or so we have seen the publication of significant data sets in the LAM space as linked data – so it is clear we need to go beyond the call to arms to publish linked data and really look at how to use the data once it is published. A couple of recent discussions have highlighted this for me.

Firstly, Karen Coyle posted a question to the code4lib mailing list asking how to access and use the ‘schema.org’ microdata that OCLC have recently added to WorldCat. The use case Karen described was as follows:

PersonA wants to create a comprehensive bibliography of works by AuthorB. The goal is to do a search on AuthorB in WorldCat and extract the RDFa data from those pages in order to populate the bibliography. Apart from all of the issues of getting a perfect match on authors and of manifestation duplicates (there would need to be editing of the results after retrieval at the user’s end), how feasible is this? Assume that the author is prolific enough that one wouldn’t want to look up all of the records by hand.

Among a number of responses, Roy Tennant from OCLC posted:

For the given use case you would be much better off simply using the WorldCat Search API response. Using it only to retrieve an identifier and then going and scraping the Linked Data out of a WorldCat.org page is, at best, redundant. As Richard pointed out, some use cases — like the one Karen provided — are not really a good use case for linked data. It’s a better use case for an API, which has been available for years.

[‘Richard’ in this case refers to Richard Wallis also from OCLC]

The discussion at this point gets a bit diverted into what APIs are available from OCLC, to whom, and under what terms and conditions. However, for me the unspoken issue here is: it’s great that OCLC have published some linked data under an open licence, but what good is it, and how can I use it?

The second prompt to write this post was a blog post from Zoë Rose arguing that “for making learning content searchable – Strings win [over URIs]”, and a subsequent discussion on Twitter with Zoë (@z_rose) and others. Zoë is talking about LRMI – a scheme for encoding metadata about learning resources. She notes that “LRMI currently has a strong preference to URIs to curriculum standards for describing learning content” – that is, LRMI takes a Linked Data-style approach (there is a proposal for how LRMI should be represented in schema.org).

Zoë argues that LRMI should put more emphasis on using strings, rather than URIs, for describing resources. She cites a number of reasons, including the lack of relevant URIs, her expectation that URIs will prove unstable in the medium to long term, and the fact that the people creating learning resources aren’t going to use URIs to describe the things they have created. In response to a comment Zoë says:

Consider – which has a clearer association:

Resource marked up with USA’s common core uri for biology (this does NOT exist)
Mediating layer
Resource marked up with Uganda’s uri for biology (this doesn’t exist either)

Or (please ignore lack of HTML)

Resource marked up with string ‘subject > Biology’
Resource marked up with string ‘subject > Biology’

And on top of that, guess which one matches predictable user search strings?

Guess which one requires the least maintenance?

Guess which one is more likely to appear ‘on the page’ – that being the stated aim of schema.org and, consequently, LRMI?

Guess which one is going to last longer?

More than anything else, I think I’d say this: this is a schema. It doesn’t need either/ors, it can stretch. And I can’t think of a single viable reason for not including semantically stable strings – but a shed load of reasons not to rely on a bunch of non-page-visible, non-aligning, non-existent URIs.

I think Zoë is making arguments I’ve heard before in libraries: “cataloguers are never going to enter URIs into records – it’s much easier for them to enter text strings for authors/subjects/publishers/etc.”; “there aren’t any URIs for authors/subjects/publishers/etc.”; “the users will never understand URIs”; “we can’t rely on other people’s URIs – what if they break?”.

These are all fair points, and basically I’d agree with all of them – although in libraries at least we now have pretty good URIs for authors (e.g. through http://viaf.org) and subjects (e.g. through http://id.loc.gov), as well as a few other data types (places, languages, …). However, these might break, and they aren’t in any way intuitive for those creating the data or those consuming it.

While these points are valid, I don’t agree with the conclusion that strings are therefore better than URIs for making learning resources discoverable. Firstly, I don’t think this is an either/or decision – at the end of the day you clearly need to use language to describe things; it’s the only tool we’ve got. However, the use of URIs as pointers towards concepts, and ultimately strings, brings some advantages. In a comment on Zoë’s blog post Luke Blaney argues:

In this example, my preferred solution would be to use http://dbpedia.org/resource/Biology There’s no need for end users to see this URI, but it makes the data so much more useful. Given that URI, as a developer, I can write a simple script which will output a user-friendly name. Not only that, but I can easily get the name in 20 different languages – not just English. I can also start pulling in data from other fields which are using this URI, not just education.

I think Luke nails it here – this is the advantage of using the URI rather than the bare textual string: you get so much more than the string alone.
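As a quick illustration of Luke’s point, here is a rough sketch of such a script. It assumes the Python SPARQLWrapper library and the public dbpedia SPARQL endpoint – neither of which Luke specifies, this is just one way of doing it:

from SPARQLWrapper import SPARQLWrapper, JSON

# Ask the public dbpedia endpoint for every rdfs:label attached to the
# 'Biology' resource - one label per language
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?label WHERE {
        <http://dbpedia.org/resource/Biology> rdfs:label ?label .
    }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for binding in results["results"]["bindings"]:
    label = binding["label"]
    # Each label carries a language tag ('en', 'fr', 'ko', ...)
    print(label.get("xml:lang", "?"), label["value"])

The same pattern, with a different predicate in the query, pulls in whatever other data hangs off that URI – which is Luke’s wider point.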

BUT – how does this work in practice? How does the use of the URI in the data translate into a searchable string for the user? Going back to Karen’s example above, how can we exploit the inclusion of structured, linked data in web pages?

I should preface my attempt to describe some of this stuff by saying that I’m thinking aloud here – what I describe below makes sense to me, but I’ve not built an application like this (although I have built a linked data application that takes a different approach).

Let’s consider a single use case: I want to build a search application based on linked data embedded in a set of web pages using markup like schema.org. How could you go about building such an application? The excellent “Linked Data book” by Tom Heath and Christian Bizer outlines a number of approaches to building linked data applications in the section “6.3 Architecture of Linked Data applications”. In this case the only approach that makes sense (to me anyway) is the “Crawling Pattern”, which is described as follows:

Applications that implement this pattern crawl the Web of Data in advance by traversing RDF links. Afterwards, they integrate and cleanse the discovered data and provide the higher layers of the application with an integrated view on the original data. The crawling pattern mimics the architecture of classical Web search engines like Google and Yahoo. The crawling pattern is suitable for implementing applications on top of an open, growing set of sources, as new sources are discovered by the crawler at run-time. Separating the tasks of building up the cache and using this cache later in the application context enables applications to execute complex queries with reasonable performance over large amounts of data. The disadvantage of the crawling pattern is that data is replicated and that applications may work with stale data, as the crawler might only manage to re-crawl data sources at certain intervals. The crawling pattern is implemented by the Linked Data search engines discussed in Section 6.1.1.2.

In this case I see this as the only viable approach, because the data is embedded in various pages across the web – there is no central source to query. Just like building a web search engine for traditional HTML content, the only real option is to have software that is given a starting point (a page or sitemap) and systematically retrieves the pages linked from there, then any pages linked from those pages, and so on – until you reach whatever limits you have put on the crawling software (or you build an index of the entire web!).

While the HTML content might be of interest, let’s assume for the moment we are only interested in the structured data embedded in the pages. Our crawler retrieves a page, and we scrape out the structured data. We can start to create a locally indexed and searchable set of data – we know the ‘type’ of each data element (e.g. ‘author’, ‘subject’), and if we get text strings we can index them straight away. Alternatively, if we get URIs in our data – say, rather than an author name string, the URI http://viaf.org/viaf/64205559 – our crawling software can retrieve that URI. In this case we get the VIAF record for Danny Boyle – available as RDF – so we get more structure and more detail, which we can index locally. This includes his name as a simple string, <foaf:name>Boyle, Danny</foaf:name>, so we have a string for sensible human searching. It also includes his date of birth: <rdaGr2:dateOfBirth>1956-10-20</rdaGr2:dateOfBirth>.
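To make that extraction step a bit more concrete, here is a minimal sketch. It assumes the Python requests and extruct libraries, and the page URL is a placeholder rather than a real record:

import requests
import extruct

# Placeholder URL - any page carrying schema.org markup would do
url = "http://www.example.org/some-record-page"
html = requests.get(url).text

# Pull out whatever embedded structured data the page carries
data = extruct.extract(html, base_url=url, syntaxes=["microdata", "rdfa", "json-ld"])

# Plain strings go straight into the local index; URIs get queued for the
# crawler to dereference in a later step
for item in data["microdata"]:
    for prop, value in item.get("properties", {}).items():
        values = value if isinstance(value, list) else [value]
        for v in values:
            if isinstance(v, str) and v.startswith("http"):
                print("URI to dereference:", prop, v)
            elif isinstance(v, str):
                print("string to index:", prop, v)
            # nested items (dicts) would be walked recursively in a real crawler

The URIs collected this way – like the VIAF URI above – are what the crawler goes on to retrieve next.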

Because the VIAF record is itself linked data, we also get links to some further data sources:

<owl:sameAs rdf:resource="http://dbpedia.org/resource/Danny_Boyle"/>

<owl:sameAs rdf:resource="http://www.idref.fr/059627611/id"/>

<owl:sameAs rdf:resource="http://d-nb.info/gnd/121891976"/>

This gives us the opportunity to crawl a step further and find even more information which could be helpful in our search engine. The link to dbpedia in particular gives us a wealth of information in a variety of languages – including his name in a variety of scripts (Korean, Chinese, Cyrillic, etc.), a list of films he has directed, and his place of birth. All of these are potentially useful when building a search application.
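Here is a similarly rough sketch of that dereferencing step, using the VIAF URI above. It assumes the Python requests and rdflib libraries, and that VIAF serves RDF/XML for the URI via content negotiation:

import requests
from rdflib import Graph, URIRef
from rdflib.namespace import FOAF, OWL

uri = "http://viaf.org/viaf/64205559"

# Ask VIAF for an RDF/XML representation of the record
resp = requests.get(uri, headers={"Accept": "application/rdf+xml"})

g = Graph()
g.parse(data=resp.text, format="xml")

subject = URIRef(uri)

# The human-searchable name string to put in the local index
for name in g.objects(subject, FOAF.name):
    print("name:", name)

# Further URIs (dbpedia, IdRef, GND, ...) the crawler could choose to follow
for same in g.objects(subject, OWL.sameAs):
    print("sameAs:", same)

Whether the crawler actually follows those sameAs links is exactly the kind of decision discussed next.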

However, we probably don’t want to crawl on and on – we don’t necessarily want all the data we can potentially retrieve about a resource on the web, and unlimited crawling would quickly get out of hand. We might decide that our search application is for English language material only – so we can ignore all the other language and script information. On the other hand, we might decide that ‘search by place of birth of author’ is useful in our context. All these decisions need to be encoded in software – both in controlling how far the crawler follows links, and in deciding what to do with the data retrieved. You might also encode rules about domains you trust, or don’t trust – e.g. ‘don’t crawl stuff from dbpedia’ if you decide it isn’t to be trusted. Alternatively you might decide to capture some of the data from less trusted sources, but weight it low in the search application and never display it in the public interface.

If this all sounds quite complicated, I think it both is and isn’t. In some ways the concepts are simple – you need the following (a rough sketch of how the pieces fit together comes after the list):

  • list of pages to crawl
  • rules controlling how ‘deep’ to crawl (i.e. how many links to follow from original page)
  • rules on what type of data to retrieve and how to index in the local application
  • rules on which domains to trust/ignore
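As a very rough sketch of how those pieces might fit together – the extract_triples and index_triples functions here are hypothetical placeholders for the extraction and indexing steps sketched above:

from urllib.parse import urlparse

SEED_PAGES = ["http://www.example.org/records/page1.html"]   # list of pages to crawl (placeholder)
MAX_DEPTH = 2                                                 # how 'deep' to follow links
TRUSTED_DOMAINS = {"viaf.org", "id.loc.gov", "dbpedia.org"}   # domains to trust

def extract_triples(url):
    # Hypothetical placeholder: fetch the page, pull out the embedded
    # structured data, and return (data to index, outgoing links found)
    return [], []

def index_triples(data):
    # Hypothetical placeholder: feed the extracted data into a local search index
    pass

def crawl(seeds, max_depth):
    seen = set()
    queue = [(url, 0) for url in seeds]
    while queue:
        url, depth = queue.pop(0)
        if url in seen or depth > max_depth:
            continue
        seen.add(url)
        data, links = extract_triples(url)
        index_triples(data)
        # Only follow links into domains we have decided to trust
        for link in links:
            if urlparse(link).netloc in TRUSTED_DOMAINS:
                queue.append((link, depth + 1))

crawl(SEED_PAGES, MAX_DEPTH)

The real work, of course, sits inside the two placeholder functions and in the indexing and search layers behind them.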

At the moment I’m not aware of any software you could easily install and configure to do this – as far as I can see you’d currently have to install crawler software, write data extraction routines, implement an indexing engine, and build a search interface. While this is not trivial stuff, it also isn’t that complicated – I think this kind of functionality could easily be wrapped up in a configurable application, if there were a market for it. Existing components like the Apache stack of Nutch/Solr/Lucene could also provide much of this (e.g. see the description at http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/). It is also clearly within the capability of existing web crawling technology – big players like Google and Bing already do this on unstructured data, and schema.org comes out of the idea that they can enhance search by using more structured, and linked, data.

Where does this leave us with regard to the questions/issues that triggered this post in the first place? Potentially it leaves Karen having to crawl the whole of WorldCat before she can start tackling her specific use case – with 271,429,346 bibliographic records in WorldCat this is no small feat. Ed Summers’ post about crawling WorldCat also points at some issues. Although things have moved on since 2009, and the sitemap files for WorldCat now include pointers to specific records, it isn’t clear to me whether the sitemap files cover every single item or just a subset of WorldCat. A quick count of the URLs listed in one of the sitemap files suggests the latter.
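That quick count is easy to reproduce – the sketch below fetches a single sitemap file and tallies the <loc> entries. It assumes the Python requests library, and the sitemap URL is a placeholder rather than the real WorldCat one:

import requests
import xml.etree.ElementTree as ET

SITEMAP_URL = "http://www.example.org/sitemap1.xml"   # placeholder, not the real WorldCat sitemap

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(requests.get(SITEMAP_URL).content)

# Each <url><loc>...</loc></url> entry is one crawlable record page
locs = root.findall("sm:url/sm:loc", ns)
print(len(locs), "URLs listed in this sitemap file")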

Tackling the issues raised by Zoë: I hope this shows that the question isn’t really strings vs URIs – URIs that can be crawled and eventually resolve to a set of strings or data can increase discoverability, if the crawl and search applications are well designed. It doesn’t resolve all the other issues Zoë raises, such as establishing URIs in the first place (back to ‘build it and they will come’) or how to capture the URIs (although I’d point to the work of the “Step Change” project, which looks at how URIs could be added to metadata records in archives, for some directions on this).

The world of linked data is the world of the web, the graph, the network – and as with the web, when you build applications on top of linked data you need to crawl the data and use the connections within the data to make the application work. It isn’t necessarily the only way of using linked data (anyone for federated SPARQL queries?) but I think that ‘crawl, index, analyse’ is an approach to building applications we need to start to understand and embrace if we are to actually put linked data to use.