What to do with Linked Data?

I think Linked Data offers some exciting opportunities to libraries, archives and museums (LAMS), and I’m pleased and excited that others feel the same. However there has been, in my view – and on my part – a bit of ‘build it and they will come’ rhetoric around the publication of linked data by LAMS. This is perhaps inevitable as you try to change the way such a large and diverse set of organisations and people think about and publish data, particularly with ‘linked data’, where the whole point is to see links between data sets that would once have been silos. To achieve those links you need significant amounts of data out there in linked data form before you can start seeing substantial linking.

However, over the last year or so we have seen the publication of significant data sets in the LAM space as linked data – so it is clear we need to go beyond the call to arms to publish linked data and really look at how you use the data once it is published. A couple of recent discussions have highlighted this for me.

Firstly, Karen Coyle posted a question to the code4lib mailing list asking how to access and use the ‘schema.org’ microdata that OCLC have recently added to WorldCat. A use case Karen described was as follows:

PersonA wants to create a comprehensive bibliography of works by AuthorB. The goal is to do a search on AuthorB in WorldCat and extract the RDFa data from those pages in order to populate the bibliography. Apart from all of the issues of getting a perfect match on authors and of manifestation duplicates (there would need to be editing of the results after retrieval at the user’s end), how feasible is this? Assume that the author is prolific enough that one wouldn’t want to look up all of the records by hand.

Among a number of responses, Roy Tennant from OCLC posted:

For the given use case you would be much better off simply using the WorldCat Search API response. Using it only to retrieve an identifier and then going and scraping the Linked Data out of a WorldCat.org page is, at best, redundant. As Richard pointed out, some use cases — like the one Karen provided — are not really a good use case for linked data. It’s a better use case for an API, which has been available for years.

[‘Richard’ in this case refers to Richard Wallis also from OCLC]

The discussion at this point gets a bit diverted into what APIs are available from OCLC, to whom, and under what terms and conditions. However, for me the unspoken issue here is – it’s great that OCLC have published some linked data under an open licence – but what good is it, and how can I use it?

The second prompt for this post was a blog post from Zoë Rose, “for making learning content searchable – Strings win [over URIs]”, and a subsequent discussion on Twitter with Zoë (@z_rose) and others. Zoë is talking about LRMI – a scheme for encoding metadata about learning resources. Zoë notes that “LRMI currently has a strong preference to URIs to curriculum standards for describing learning content” – that is, LRMI takes a Linked Data-type approach (there is a proposal for how LRMI should be represented in schema.org).

Zoë argues that LRMI should put more emphasis on using strings, rather than URIs, for describing resources. She cites a number of reasons including the lack of relevant URIs, the fact that URIs will prove unstable in the medium to long term and the fact that the people creating the learning resources aren’t going to use URIs to describe the things they have created. In response to a comment Zoë says:

Consider – which has a clearer association:

Resource marked up with USA’s common core uri for biology (this does NOT exist)
Mediating layer
Resource marked up with Uganda’s uri for biology (this doesn’t exist either)

Or (please ignore lack of HTML)

Resource marked up with string ‘subject > Biology’
Resource marked up with string ‘subject > Biology’

And on top of that, guess which one matches predictable user search strings?

Guess which one requires the least maintenance?

Guess which one is more likely to appear ‘on the page’ – that being the stated aim of schema.org and, consequently, LRMI?

Guess which one is going to last longer?

More than anything else, I think I’d say this: this is a schema. It doesn’t need either/ors, it can stretch. And I can’t think of a single viable reason for not including semantically stable strings – but a shed load of reasons not to rely on a bunch of non-page-visible, non-aligning, non-existent URIs.

I think Zoë is making arguments I’ve heard before in libraries: “cataloguers are never going to enter URIs into records – it’s much easier for them to enter text strings for authors/subjects/publishers/etc.”; “there aren’t any URIs for authors/subjects/publishers/etc.”; “the users will never understand URIs”; “we can’t rely on other people’s URIs – what if they break?”.

These are all fair points, and basically I’d agree with all of them – although in libraries at least we now have pretty good URIs for authors (e.g. through http://viaf.org) and subjects (e.g. through http://id.loc.gov) as well as a few other data types (places, languages, …). However, these might break, and they aren’t in any way intuitive for those creating the data or those consuming it.

While these points are valid, I don’t agree with the conclusion that strings are therefore better than URIs for making learning resources discoverable. Firstly, I don’t think this is an either/or decision – at the end of the day you clearly need to use language to describe things – it’s the only thing we’ve got. However, the use of URIs as pointers towards concepts, and ultimately strings, brings some advantages. In a comment on Zoë’s blog post Luke Blaney argues:

In this example, my preferred solution would be to use http://dbpedia.org/resource/Biology There’s no need for end users to see this URI, but it makes the data so much more useful. Given that URI, as a developer, I can write a simple script which will output a user-friendly name. Not only that, but I can easily get the name in 20 different languages – not just English. I can also start pulling in data from other fields which are using this URI, not just education.

I think Luke nails it here – this is the advantage of using the URI rather than simply the textual string – you get so much more than just a simple string.

BUT – how does this work in practice? How does the use of the URI in the data translate into a searchable string for the user? Going back to Karen’s example above, how can we exploit the inclusion of structured, linked, data in web pages?

I should preface my attempt to describe some of this stuff by saying that I’m thinking aloud here – what I describe below makes sense to me, but I’ve not built an application like this (although I have built a linked data application that takes a different approach).

Let’s consider a single use case – I want to build a search, based on linked data embedded in a set of web pages using markup like schema.org. How could you go about building such an application? The excellent “Linked Data book” by Tom Heath and Christian Bizer outlines a number of approaches to building linked data applications in the section “6.3 Architecture of Linked Data applications”. In this case the only approach that makes sense (to me anyway) is the “Crawling Pattern”. This is described as follows:

Applications that implement this pattern crawl the Web of Data in advance by traversing RDF links. Afterwards, they integrate and cleanse the discovered data and provide the higher layers of the application with an integrated view on the original data. The crawling pattern mimics the architecture of classical Web search engines like Google and Yahoo. The crawling pattern is suitable for implementing applications on top of an open, growing set of sources, as new sources are discovered by the crawler at run-time. Separating the tasks of building up the cache and using this cache later in the application context enables applications to execute complex queries with reasonable performance over large amounts of data. The disadvantage of the crawling pattern is that data is replicated and that applications may work with stale data, as the crawler might only manage to re-crawl data sources at certain intervals. The crawling pattern is implemented by the Linked Data search engines discussed in Section 6.1.1.2.

In this case I see this as the only viable approach, because the data is embedded in various pages across the web – there is no central source to query. Just like building a web search engine for traditional HTML content the only real option available is to have software that is given a starting point (a page or sitemap) and systematically retrieves the pages linked from there, and then any pages linked from those pages, and so on – until you reach some defined limits you’ve put on the web crawling software (or you build an index to the entire web!).
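
To make the crawling pattern a little more concrete, here is a minimal sketch (in Python) of the kind of bounded, breadth-first crawl I have in mind. This is me thinking aloud rather than describing an existing tool: the seed URLs, the depth limit and the extract_links routine are all placeholders for whatever sources, policy and extraction logic you settle on.

from collections import deque
from urllib.request import Request, urlopen

MAX_DEPTH = 2  # how many links to follow out from the seed pages

def fetch(url):
    # Retrieve a URL, asking for RDF where the server supports content negotiation
    req = Request(url, headers={"Accept": "application/rdf+xml, text/html"})
    with urlopen(req, timeout=10) as response:
        return response.read()

def crawl(seed_urls, extract_links):
    # Breadth-first crawl from a set of seed pages, stopping at MAX_DEPTH.
    # extract_links is a placeholder for whatever routine pulls links/URIs
    # out of each retrieved document (HTML, RDF/XML, etc.).
    seen = set(seed_urls)
    queue = deque((url, 0) for url in seed_urls)
    harvested = {}  # url -> raw document, ready for extraction and indexing

    while queue:
        url, depth = queue.popleft()
        try:
            document = fetch(url)
        except OSError:
            continue  # skip pages that time out or return errors
        harvested[url] = document
        if depth < MAX_DEPTH:
            for link in extract_links(document, url):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return harvested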

While the HTML content might be of interest, let’s assume for the moment we are only interested in the structured data that is embedded in the pages. Our crawler retrieves a page, and we scrape out the structured data. We can start to create a locally indexed and searchable set of data – we know for each data element the ‘type’ (e.g. ‘author’, ‘subject’) and if we get text strings, we can index this straight away. Alternatively if we get URIs in our data – say rather than finding an author name string we get the URI “http://viaf.org/viaf/64205559” – our crawling software can retrieve that URI. In this case we get the VIAF record for Danny Boyle – available as RDF – so we get more structure and more detail, which we can locally index. This includes his name represented as a simple string <foaf:name>Boyle, Danny</foaf:name> – so we’ve got the string for sensible human searching. It also includes his specific date of birth: <rdaGr2:dateOfBirth>1956-10-20</rdaGr2:dateOfBirth>.
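
Sticking with the Danny Boyle example, here is roughly what that dereferencing step might look like in Python using rdflib (my choice of library – any RDF library would do). It assumes, as described above, that VIAF will return an RDF serialisation of the record when the URI is requested, and that the rdaGr2 prefix expands to the RDA Group 2 namespace shown below – treat both as assumptions to check rather than guarantees.

from rdflib import Graph, Namespace, URIRef

FOAF = Namespace("http://xmlns.com/foaf/0.1/")
RDAGR2 = Namespace("http://rdvocab.info/ElementsGr2/")  # assumed expansion of the rdaGr2 prefix

author_uri = URIRef("http://viaf.org/viaf/64205559")

g = Graph()
g.parse(str(author_uri))  # dereference the URI; VIAF serves the record as RDF

# The plain string we want for sensible human searching...
for name in g.objects(author_uri, FOAF.name):
    print("name:", name)          # e.g. "Boyle, Danny"

# ...plus extra structure we would not have got from a bare text string
for dob in g.objects(author_uri, RDAGR2.dateOfBirth):
    print("date of birth:", dob)  # e.g. "1956-10-20"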

Because the VIAF record is itself linked data, we also get links to some further data sources:

<owl:sameAs rdf:resource="http://dbpedia.org/resource/Danny_Boyle"/>

<owl:sameAs rdf:resource="http://www.idref.fr/059627611/id"/>

<owl:sameAs rdf:resource="http://d-nb.info/gnd/121891976"/>

This gives us the opportunity to crawl a step further and find even more information – which could be helpful in our search engine. The link to dbpedia in particular gives us a wealth of information in a variety of languages – including his name in a variety of scripts (Korean, Chinese, Cyrillic etc.), a list of films he has directed, and his place of birth. All of these are potentially useful when building a search application.
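
A sketch of that one extra hop, again using rdflib: follow the owl:sameAs links out of the VIAF graph, keep only the dbpedia one (a crude stand-in for the kind of trust and scope rules discussed below), and pull back labels in whatever languages dbpedia holds. It assumes dbpedia content-negotiates RDF for its resource URIs.

from rdflib import Graph, URIRef
from rdflib.namespace import OWL, RDFS

author_uri = URIRef("http://viaf.org/viaf/64205559")

viaf_graph = Graph()
viaf_graph.parse(str(author_uri))

for same_as in viaf_graph.objects(author_uri, OWL.sameAs):
    if "dbpedia.org" not in str(same_as):
        continue  # only follow the link we have decided to trust for this purpose
    dbpedia_graph = Graph()
    dbpedia_graph.parse(str(same_as))  # assumes dbpedia serves RDF for this URI
    for label in dbpedia_graph.objects(same_as, RDFS.label):
        # rdfs:label values carry language tags, so we get his name in many scripts
        print(label.language, str(label))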

However, we probably don’t want to crawl on and on – we don’t necessarily want all the data we can potentially retrieve about a resource on the web, and unlimited crawling isn’t practical. We might decide that our search application is for English language only – so we can ignore all the other language and script information. On the other hand we might decide that ‘search by place of birth of author’ is useful in our context. All these decisions need to be encoded in software – both controlling how far the crawler follows links, and what you do with the data retrieved. You might also encode rules in the software about domains you trust, or don’t trust – e.g. ‘don’t crawl stuff from dbpedia’ if you decide it isn’t to be trusted. Alternatively you might decide you’ll capture some of the data from less trusted resources, but in the search application weight that data low, and never display it in the public interface.

If this all sounds quite complicated, I think it both is and isn’t. In some ways the concepts are simple; you need (see the configuration sketch after this list):

  • list of pages to crawl
  • rules controlling how ‘deep’ to crawl (i.e. how many links to follow from original page)
  • rules on what type of data to retrieve and how to index in the local application
  • rules on which domains to trust/ignore
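
As promised above, here is a sketch of how those rules might be expressed as a simple configuration. This is purely illustrative – the field names and values are made up for the purpose of this post, not taken from any existing crawler.

# Hypothetical configuration for the kind of crawler described above
CRAWL_CONFIG = {
    "seed_pages": [
        "http://example.org/sitemap.xml",  # placeholder starting point
    ],
    "max_depth": 2,  # how many links to follow from the original page
    "index_properties": [  # which types of data to pull into the local index
        "http://xmlns.com/foaf/0.1/name",
        "http://www.w3.org/2000/01/rdf-schema#label",
    ],
    "trusted_domains": ["viaf.org", "id.loc.gov"],  # crawl, index and display
    "low_trust_domains": ["dbpedia.org"],  # index with a low weight, never display
    "ignored_domains": [],  # never crawl at all
}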

At the moment I’m not aware of any software you could easily install and configure to do this – as far as I can see you’d currently have to install crawler software, write data extraction routines, implement an indexing engine and build a search interface. While this is not trivial stuff, it also isn’t that complicated – this kind of functionality could easily be wrapped up in a configurable application, I think, if there was a market for it. Existing components such as the Apache stack of Nutch/Solr/Lucene would cover much of this (e.g. see the description at http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/). It is also clearly within the capability of existing web crawling technology – big players like Google and Bing already do this on unstructured data, and schema.org comes out of the idea that they can enhance search by using more structured, and linked, data.
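
For the indexing and search side, the principle is just an inverted index from search terms to the pages they were harvested from. A toy, pure-Python stand-in (in practice you would use something like Solr/Lucene as mentioned above) might look like this:

import re
from collections import defaultdict

index = defaultdict(set)  # (field, term) -> set of page URLs

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def add_to_index(page_url, field, value):
    # Index a harvested string (e.g. a foaf:name) under the page it came from
    for term in tokenize(value):
        index[(field, term)].add(page_url)

def search(field, query):
    # Return pages matching every term of the query in the given field
    results = None
    for term in tokenize(query):
        matches = index.get((field, term), set())
        results = matches if results is None else results & matches
    return results or set()

# For example, after crawling (the OCLC number here is made up):
add_to_index("http://www.worldcat.org/oclc/12345", "author", "Boyle, Danny")
print(search("author", "danny boyle"))  # -> {'http://www.worldcat.org/oclc/12345'}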

Where does this leave us with regard to the questions and issues that triggered this post in the first place? Potentially it leaves Karen having to crawl the whole of WorldCat before she starts tackling her specific use case – with 271,429,346 bibliographic records in WorldCat this is no small feat. Ed Summers’ post about crawling WorldCat also points at some issues. Although things have moved on since 2009, and the sitemap files for WorldCat now include pointers to specific records, it isn’t clear to me whether the sitemap files cover every single item, or just a subset of WorldCat. A quick count of the URLs listed in one of the sitemap files suggests the latter.
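
For what it’s worth, the ‘quick count’ is easy to reproduce: fetch one of the sitemap files and count the <loc> entries. The sitemap URL below is a placeholder – you would need to substitute the location of an actual WorldCat sitemap file.

import gzip
import xml.etree.ElementTree as ET
from urllib.request import urlopen

SITEMAP_URL = "http://example.org/sitemap-1.xml.gz"  # placeholder, not the real location
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

with urlopen(SITEMAP_URL) as response:
    raw = response.read()
if SITEMAP_URL.endswith(".gz"):
    raw = gzip.decompress(raw)  # sitemap files are often gzipped

root = ET.fromstring(raw)
urls = root.findall("sm:url/sm:loc", NS)
print(len(urls), "URLs listed in this sitemap file")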

Tackling the issues raised by Zoë, I hope this shows that it isn’t about strings vs URIs – URIs that can be crawled, and that eventually resolve to a set of strings or data, could increase discoverability – if the crawl and search applications are well designed. It doesn’t resolve all the other issues Zoë raised, like establishing URIs in the first place (back to ‘build it and they will come’) or how to capture the URIs (although I’d point to the work of the “Step Change” project, which looks at how URIs could be added to metadata records in archives, for some directions on this).

The world of linked data is the world of the web, the graph, the network – and as with the web, when you build applications on top of linked data you need to crawl the data and use the connections within the data to make the application work. It isn’t necessarily the only way of using linked data (anyone for federated SPARQL queries?) but I think that ‘crawl, index, analyse’ is an approach to building applications we need to start to understand and embrace if we are to actually put linked data to use.

14 thoughts on “What to do with Linked Data?”

  1. Thank you for this; it’ll likely end up in one or more of my library-school syllabi.

    I’ve run into a misconception from some catalogers about this — that instead of the crawl-index-analyse(-display) cycle you mention, every single OPAC-like page display will involve an on-the-fly bunch of SPARQL queries. This is, of course, bonkers — a network blip from VIAF or similar could wreak havoc.

    So a discussion of local triple caching, provenance, and updating mightn’t come amiss?

  2. Good article, which raises lots of thoughts, e.g.:

    Follow-your-nose crawling will only fly if you have something like the WorldCat sitemap to take you directly to each of a whole set of resources to crawl, or if the data itself is so richly interlinked that the crawling can continue from a limited number of start-points.

    Unless your crawling takes you to resources which are outside the original dataset (e.g. from WorldCat to dbpedia), it is not telling you anything which the trad. ol’ database behind that dataset couldn’t have told you directly.

    In the “strings vs. URIs” discussion, one important point is that a LD URI affirms an identity, and so multiple parties using the same LD URI implies a shared agreement on the identity of that entity. Matching strings don’t have that connotation.

  3. You hit the nail. Crawling is not the way to go. We do not want to re-invent crawling. It just does not scale. That’s why we don’t need Linked Data but Linked Open Data. The recipe is: give all your data away as Open Data in packages for bulk downloads, and let others index the data into local search engines. People want to search, they do not care to view triples. Catalog model designers and software engineers need triple stores.

  4. Thank you for writing this up Owen. In an attempt to actually do something with the emerging linked data sets, last fall I implemented a real-time follow-your-nose crawl for author pages in our local VuFind instance. We use the authorized form of the author “Orwell, George, 1903-1950” to look up the LCCN (n79058639) in a local Solr index of name authorities and then scan/search various APIs to return an author’s bio, picture, and links to other sites. We do cache the results for a period of time but this is not an ideal solution since there is no local searchable index of what is retrieved and there is a fair amount of latency if the results aren’t available in the cache. It’s also vulnerable to the network blips that Dorothea mentions. But it’s a usable feature at the moment and your post has given me ideas for improvement.

    http://bit.ly/NVT4NS

  5. Thanks for that link Ted – very interesting. I had been thinking about taking this back a step and integrating a crawl (maybe one or two steps) in the SolrMARC indexing stage (I’ve been experimenting with Blacklight, but should apply to VuFind as well). In this case the use would be to enhance search rather than necessarily improve display – although both might be possible.

    My concern would be that it would impact the speed of indexing considerably. I’d be interested to know if you’ve considered this and/or if you are aware of any work in this area?

  6. Hi Jörg

    There are definitely issues with the ‘crawl’ approach – but some level of ‘follow your nose’ activity will be inevitable with linked data, won’t it? Because the data will link to external datasets, you are never going to get all the data in a bulk download format?

    Whether it ‘scales’ or not is a good question – but obviously Google and Bing have so far been able to innovate around the scale problems they’ve encountered when crawling the web – although they obviously still hit problems with timely updates etc. But not all applications are big applications, and I can see small scale use of this kind of technique.

    In terms of releasing open data packages for bulk downloads there are some interesting questions about how we can best share information about available data and also how to effectively keep data files in sync across resources – see my post on ResourceSync for some relevant work in this area http://www.meanboyfriend.com/overdue_ideas/2012/07/resourcesync-web-based-resource-synchronization/

  7. Hi Richard

    Some good points. I can see some other ways of approaching the question of ‘what to crawl’ in certain circumstances (e.g. you may know that a resource has URIs of the form http://data.com/{isbn} or some other predictable pattern) which might allow you to seed crawls from lots of different points. However, you can’t rely on this and it does present a challenge.

    Also worth noting that all (linked) roads seem to lead to dbPedia – but not so many lead away from it… However – by sharing dbPedia URIs we reach that agreement on entities which you describe, which may be helpful.

  8. Hi Dorothea

    Glad it will be helpful! Yes – caching, provenance and updating all need addressing and I should definitely post on these … I’ll try to find time 🙂 Also other strategies that can help avoid the ‘blips’ – for example in several library linked data examples (British Library, University of Cambridge) they double up with strings and identifiers for some entities – like LCSH. This way they have at least one string value available to them without an additional step – but if the remote resource (id.loc.gov) is available they can do further enhancement on the fly.

  9. Hi – I’ll admit that I’m not that bright, but what about an approach where we actively stuff copies of our linked data into dbpedia/freebase? It seems a lot of things feed there, so why not add to the pool? Maybe it’s too restrictive? I’ve been dabbling into Google Refine and can just imagine how interesting it would be if it held all of our data. Well, interesting to me maybe.

    Declan

  10. Hi Declan,

    Contributing to existing pools of data is definitely something I think the LAMS community should do – and this does already happen. There has been some key work in this area recently in terms of adding VIAF identifiers to Wikipedia entries (and so will flow through to dbpedia) – see http://en.wikipedia.org/wiki/Wikipedia:Authority_control_integration_proposal/RFC, and also http://www.oclc.org/research/news/2012-06-29.htm

    Freebase has certainly used some library contributed data, although I don’t have the details to hand.

    These are definitely ways of getting linked data ‘out there’ directly or indirectly, but I don’t think it answers the question as to what you do with the data once it is out in the wild, which is more what I was trying to address in this post. What is the result of having VIAF identifiers in dbPedia? What is enabled by having library derived data in Freebase?

    I agree the Google Refine + Freebase (or other) reconciliation is a possible use case – although I don’t think reconciliation in Refine requires linked data as such?

  11. Hiya!

    Starting from the top…

    Luke’s comment that dbpedia’s ‘biology’ page is accurate… if the subject is ‘biology’. Which it hardly ever is – ‘biology’ is an unusual term. Instead we find ‘myself and other living things’, ‘interdependence of organisms’, things like that (real examples). Note that neither of these are synonymous with ‘biology’ – ‘biology’ is just a closest-match, and so lacking in stability.

    Note also that there are subjects that definitely don’t have DBPedia entries – Northern Ireland teaches ‘the world around us’ as a subject, for example.

    The thing that I keep coming back to is, URL based data works really well when you’re talking about concrete entities. Your example of ‘Danny Boyle’ is a good one here – Danny Boyle is definitely a person, who has agreed on attributes (like the first name ‘Danny’ and the last name ‘Boyle’.) And the entity ‘Danny Boyle’ can be tied specifically to other concrete entities, like ‘Slumdog millionaire’ – we all agree that’s a movie, and that it only has one name, which is ‘slumdog millionaire’. Easy, really – they’re concrete entities, so, no problem.

    In education, however, we really don’t get many concrete entities – not many at all. (It’s always interesting to see that people looking for concrete, single-named, uncontested entities in education only ever use examples from science and maths – there’s a good reason for that, it’s that there are basically no concrete entities at all outside these two fields, and scarce enough within them! It’s just as important to test against Art and Design, where lots of learning happens but the concepts are far more nebulous.)

    In fact, I would go so far as to say that any system for representing educational concepts as concrete entities fails out of the blocks, specifically because it’s representing the base concepts incorrectly. It can *look* like it works, but I don’t think it can ever actually work in the way that education (wild and wooly as it is) works. It’s a conceptual and practical mis-match.

  12. Hi Owen,

    To yours and Dorothea’s thoughts let me add that I personally see local caching to be not only inevitable, but also unavoidable. The potential pitfalls – network availability, latency – are totally applicable and the benefits of a local cache are overwhelming – response time, query customization, and, most importantly, indexing. It is for this last reason that I firmly believe some form of local cache will be essential, though *all* of the data included in traditional authority records may not be needed (the most important aspects being the authoritative label itself, such as the subject heading or name, and variants, of course).

    As for the inefficiencies associated with crawling, it is one of the reasons we publish bulk downloads [1] of the data at LC’s Linked Data Service (http://id.loc.gov). Grab the data, load it locally, use it however you need to. No need to crawl. That said, limited crawling may still be necessary to keep data up-to-date. For example, we publish an Atom feed for each of the datasets (for LC Names, see [2]). It sorts resources by most recently modified. In this way, you can access the Atom feed, learn what has been updated since you last checked or since the most recent bulk download, and then request the structured data associated with the URI.

    — Kevin

    [1] http://id.loc.gov/download/
    [2] http://id.loc.gov/authorities/names/feed/1
