Google Cultural Institute

James Davies talking about Google Cultural Institute.

As Google grew in size, it increased in scope, and encouraged employees to follow their passions. If you get a dozen people in a room you’ll find at least one is passionate about art – in Google, a set of people interested in art, galleries and museums got together to find a way of making this content available – this became the Google Art Project.

At the time James was at the Tate – he got involved in the Google Art Project and was impressed by how the Google team listened to the expertise in the gallery. Now he has moved to Google – talking about various projects, including the Nelson Mandela Centre of Memory – http://archives.nelsonmandela.org.

Second project from Google in this area – the Cultural Institute – the aim is to work with a variety of organisations, including archives. Finding a way of creating an ‘online exhibit’ – the Nelson Mandela site is an example of this – combines archival material with text from curators/archivists to tell a story. From the exhibit you can then jump into an item in the archive – the example here is a letter from Nelson Mandela to his daughters – very high resolution by the look of it.

Forming a digestible narrative is key to exhibit format.

Romanian archive – includes footage from the revolution – TV broadcast at the time when revolutionaries took over TV studios.

James says archives are about people – a plea to use stories to assert the value of archives and so protect them.

Into Q&A:

Q: Why not use the ‘Archives’ as a metaphor – ‘unboxing’ is the most exciting part of the archives experience and this is lost in the ‘exhibit’ format

A: But that is because you know what you are looking for – the

Q: (from me) Risks of doing this in one place – why build a platform rather than distributing tools so archives can do this work themselves.

A: First step – part of that will be about distributing tools and approaches. Syndicating use of platform (as in Nelson Mandela site) is first step in this direction. Future steps could include distributing tools. Emphasised they didn’t want to be ‘hoarders’ of content.

Q: How to make self-navigation of archives easier for novice users

A: Hope that this will come

Several questions around the approach of ‘exhibits’ and ‘narratives’ – a feeling that this ignores the depth of the archive. The general answer is that this is a way of presenting content – it enables discovery of some content, and gives people a place to ‘enter’ the archive – and from there explore more deeply.

Lots of concern from the floor and on Twitter that this selective approach comes at the expense, or to the detriment, of the larger collection.

Discovery Summit: Paul Walk keynote

Paul talking about Open and Closed – not licensing or access, but about ‘open world assumption’ vs ‘closed world assumption’

Paul describes characteristics of ‘open world’:

  • Incomplete information
  • Schema-less data
  • Web technologies – http; html5; rdf
  • Platform independence; scales well; cross-context discovery potential

Closed world characteristics:

  • Complete information
  • Schema-based data; Records
  • Web tech – http delivering to native apps
  • Performance; contextualised discovery; quality; curation

Need to decide when to apply each of these approaches – strengths and weaknesses

Web still best available foundation of what we are doing, but still need to manage resources; quality etc.

Quote from Leslie Lamport “a distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable”

As a developer why should I trust your API – that it will work, that it will continue to work – if you don’t use it yourself as the service owner? See Paul’s blog post on this.

APIs are not best thought of as machine-to-machine interfaces. APIs are interfaces for developers! Talk to developers who are likely to use your API. The developer is to the API as the ‘user’ is to the UI.

Yesterday Paul hosted a meeting for developers to get their point of view [which I was fortunate enough to attend]. Some things that came out of this:

  • please don’t build elaborate APIs which do not allow us to see all of the data or its extent 
  • Offering an API which delivers incomplete data is usually self-defeating – that is, don’t hold data back because you are worried about its quality

Introducing this afternoon’s sessions:

Emerging technologies – Graph based data (see work by Facebook, Google, BBC etc.)

Reasons for aggregation – to avoid systems/network latency; showcase; ‘web scale’ concentration; …

Data quality issues – concern about data quality can prevent release of data (which consumers don’t like); but poor data quality erodes trust and can affect reputation; reconciling these things is a major challenge


Discovery Summit Keynote: Maura Marx

Maura Marx is from the Digital Public Library of America.

Interesting comment from Maura in this session on ‘Are we failing users?’ – she said the DPLA “think about Developers as users a lot” – something that came up at the Summit pre-event focussed on developers yesterday – I think this is a really important thing for libraries/archives/museums to take on board.

Maura giving some background to the creation of the DPLA – looking to create “an open, distributed network of comprehensive online resources”. DPLA have been working on how to make something useful and sustainable. Workstreams for technical, legal, economic, governance, users. Spent time recruiting people from cultural heritage as well as education, law, government, technology etc.

DPLA turned down money to digitise a load of content – this is not what it is about – it is about changing the way libraries work, not about creating a pile of digital stuff.

Lots of meetings and workshops – very open process – which is challenging. DPLA will be ‘launched’ April 18-19  2013. DPLA is about a platform not a ‘portal’ or destination. Encouraging people to build on DPLA platform – getting beyond perception of DPLA as ‘portal’ is a struggle.

DPLA work on metadata built on Europeana work. DPLA statements on metadata:

  • The DPLA asserts that metadata are not copyrightable, and that applying a license to them is not necessary.
  • To the extent that the law determines a copyright interest exists, a CC0 license applies.
  • The DPLA asserts no new rights over metadata at the DPLA level

Digital Hubs pilot project – “taking first steps to bring together existing US digital library infrastructure into a sustainable national digital library systems” – focusing on provision in 7 regions

DPLA has strong emphasis on building Community.

Discovery Summit 2013 – a foreword

I’m at the British Library for the next couple of days for the JISC/BL Discovery Summit. This is an event that brings together work from the last 4 years, which started with the snappily named “Resource Discovery Task Force” – a group asked to consider how ‘resource discovery’ (finding resources in libraries, archives, museums etc.) could be improved for UK HE researchers and others.

I’m facilitating various sessions at the event, but I’ll try to blog some sessions as well. I thought I would start by publishing a presentation I did for the first meeting of the Resource Discovery Task Force back in November 2008. I was asked to consider “What if we were starting from scratch”. This was written 5 years ago, and I’ve left the text as it was – so changes in direction and thought over the last 5 years are not reflected in this presentation – but I think I still believe in the fundamentals of what I said here, and I think the last 5 years of work on ‘Discovery’ have borne this out.

What if we were starting from scratch

What if we got rid of the services, standards, methods and mechanisms that currently make up our resource discovery infrastructure? In 10 minutes!

How far back do we go? To the very first ‘library catalogues’ – written in cuneiform? Or possibly just as far as the Library at Alexandria (3rd century BC). In a comment on my blog Joy Palmer (MIMAS) said “how far do we get to go back for a clean slate here? couple of hundred years? before the emergence of bureacratic cataloguing and classification practices? probably not;-)”

Well, I’m not going to go back that far. In fact, I’m going to start by going back just as far as 1931, when Shiyali Ramamrita Ranganathan published his ‘5 Laws of Library Science’.

In a recent blog post Lorcan Dempsey reflects that although there have been numerous attempts to ‘update’ the 5 laws, they are not particularly convincing, and that this is perhaps because the laws as they are continue to capture the fundamental challenges of what we do.

All 5 laws are worth exploring, and several of them touch on the intimate relation between resource discovery and access to those resources – an aspect of the infrastructure I believe is vital, but which I’m not going to consider in this presentation. Focusing solely on the ‘Resource Discovery Infrastructure’, I think the last two are particularly relevant.

The 4th Law – Save the Time of the User.

Ranganathan was concerned with shelf arrangements, recent acquisitions shelves, signposting and shelf labelling. In one example he notes that an activity which “may look like a great expense when considered from the isolated library-point-of-view, … can be seen to be really economical from the larger community-point-of-view”. He also related saving the time of the user to saving the time of the staff – what is efficient for the user is efficient for the library.

One of the major time sinks for a user in our current Resource Discovery environment is the number of places you have to look for stuff.

The 5th Law – A Library is a growing organism – or perhaps, because in this context we can consider that there is more than one library, libraries are growing organisms. This is the law that Lorcan Dempsey specifically commented on in his blog post, considering how we need to think about effective service in a networked environment. Ranganathan commented that “an organisation which may be suitable for a small library may completely fail when the library grows big”. He asked of the library catalogue “let us examine what form the Fifth Law would recommend for it”. He suggests that libraries pioneered the use of loose-leaf binding to update catalogues, and then the use of cards – with one entry per card – describing the card index as “another epoch making contribution of the library profession to the business world in general”.

I believe that these principles outlined by Ranganathan can inform how we should design a Resource Discovery Infrastructure that serves resource explorers. Before I come back to the question of ‘starting from scratch’, I think it may be useful to reflect briefly on how we have got to where we are today.

I want to briefly go back further, to the end of the 19th Century, when Melvil Dewey was involved in the introduction of a standardised form of catalogue card and catalogue card cabinet (the fact that Dewey also set up the company that sold the cards …

… and the cabinets …

… and even special machines to type the cards, may have had an influence in this!)

The Library of Congress started to publish its catalogue records on these standard sized cards, and by this method could distribute them to other libraries.

This was so successful that Charles Cutter, who produced a seminal work on building a printed dictionary catalogue, quickly had to revise it to take account of the card catalogue. By the time the 4th edition of Cutter’s work was published, he prefaced it by saying “any new library would be very foolish not to make its catalog mainly of them [LoC cards]”.

This is, I believe, the start of our current Resource Discovery infrastructure – and we have been stuck in this mould ever since. Although we now have computerised catalogues, we have never got away from the idea that these records are physical objects – discrete things that are copied for local use. It is for this reason that, when we try to follow Ranganathan’s 4th law to ‘save the time of the reader’, we try to put these copies back together – and encounter problems of inconsistency and duplication. So what do we need to do differently?

One tempting scenario is to think that we should stop ‘copying the cards’ and have a single ‘catalogue’ – it wouldn’t be trying to bring together disparate records, it would be the only copy. When I posted the question “What if we started from scratch” on my blog, one respondent said: “would there be a ‘master’ UK catalogue of books … The trad library catalogue would just be a ‘view’ of this”.

However, I do not believe this is realistic – we don’t have this kind of control over resources and resource discovery systems – we would always have things springing up outside the ‘big store’. Essentially this approach ignores Ranganathan’s 5th law – the library is a growing organism.

To move beyond the card catalogue, we need to look to a more recent figure – Tim Berners-Lee. I believe that if Ranganathan were to look at the problem now, he would recognise that HTML and the web represent the next step that the Resource Discovery Infrastructure needed to take to go beyond the card catalogue – but for some reason librarians and archivists have not been at the forefront of adopting this as an approach to Resource Discovery, and have instead treated it merely as a new medium with which they need to integrate their existing infrastructure.

When Tim Berners-Lee first described the requirement for a ‘global hypertext system’, he said “Information systems start small and grow. They also start isolated and then merge. A new system must allow existing systems to be linked together without requiring any central control or coordination.” – I think the first part of this statement is a restating of Ranganathan’s 5th law, for a networked age.

I would contend that the Web is the most effective system for disseminating information yet conceived (we might consider the brain an even more effective networked information system?).

What does all this tell us about starting from scratch?

We need to build a linked environment.

The way I think of it is each catalogue record should be a hypertext document. This can both link, and be linked to. When a library adds an item to its catalogue, it can do this by creating a new hypertext catalogue record, OR by linking to an existing one.

As many libraries do this, some records receive more inward links. Not all inward links will necessarily go to the same record – perhaps two or more key ‘nodes’ will appear in the network for a single bibliographic item – but this doesn’t matter; it is ultimately self correcting, and self correctable – after all, you only need one connection between the two parts of the network for it to be clear that all the connected items are the same.

It also opens up the possibility of explicitly recording that ‘these two things are the same’

I’d stress this doesn’t have to be the WWW – just a linked environment.

You would need to start thinking not just in terms of ‘catalogues’ – what your library has – but also in terms of ‘indexes’ – what do you want to index? To index your library, you would crawl all the records in your library, AND all the ones you link to. You could take this further, and grab extra information from others if you wanted – but the records that are most linked to would be the more ‘authoritative’, allowing records from small, but specialist, suppliers to be more authoritative than those from the traditional ‘authorities’.

A Union catalogue could be built by doing a wider crawl – and the ultimate Union by crawling everything. Yes, there would be duplication, but this would be on a smaller scale than currently – and the more links you add, the less duplication there is (more data)

Although I’ve approached this in what is perhaps a library-centric way, a linked environment allows links to anywhere – so you can link to (and index) resources outside your domain, building an index to serve your users. Standards might be needed – but in some ways links would provide aspects of this, and crosswalks could be built from practice rather than theory. Diversity would be embraced and used.

The idea of a linked resource discovery infrastructure requires us to change the way we think about our ‘catalogues’ (whether that be library, archives, journal indexes or other). We’ve treated these as isolated instances – if you have one sheep, you have a pet – you can focus all your attention on it and care for it, but if you put it out to fend for itself, it is going to struggle.

We need to think more like shepherds – treating our resource descriptions as a flock – putting it out to graze, and gathering it back in (via indexing) when we need to.

I’m convinced that Ranganathan would see a linked environment as the next step – the next ‘epoch making contribution’ – he may not have been able to anticipate the information revolution that computers would bring, but he said “What further stages of evolution are in store for this Growing Organism  – the library – we can only wait and see… who knows that a day may not come … when the dissemination of knowledge, which is the vital function of libraries, will be realised by libraries even by means other than those of the printed book”. There is no doubt in my mind that he would have embraced the opportunity offered by a linked information environment  – and this is what we need to do now.

Introduction to APIs

[UPDATE October 2014: Following changes to the BNB platform and supported APIs, this tutorial no longer works as described below. An updated version of this exercise is now available at http://www.meanboyfriend.com/overdue_ideas/2014/10/using-an-api-hands-on-exercise/]

On Wednesday this week (6th Feb 2013) I spent a day at the British Library in London talking to curators about data and the web. The workshop was a full day and we covered a lot of ground  – from HTML to simple mashups to Linked Data. One of the things I wanted to do during the day was to get people to use an API – to understand what challenges this presented, what sort of questions you might have as a developer using that API, what sort of things you should think about when creating an API, and hopefully to start to get a feel for what opportunities are created by providing an API to resources.

Since we had a busy day, I only had an hour to get people working with an API for the first time, so I wanted to do something:

  • Simple
  • Relevant to the audience (Curators at the British Library)
  • Requiring no local installation of software
  • Requiring no existing knowledge of programming etc.
  • That produced a tangible outcome in an hour

The result was the two exercises below. We got through exercise 1 in the hour (some people may have gone further but as far as I know everyone completed exercise 1) and so I don’t know how well exercise 2 works – but I’d be very interested in feedback if anyone gives it a go. The exercises use the British National Bibliography as the data source:

Exercise 1: Using an API for the first time

Introduction

In this exercise you are going to use a Google Spreadsheet to retrieve records from an API to BNB, and display the results.

The API you are going to use simply allows you to submit some search terms and get a list of results in a format called RSS. You are going to use a Spreadsheet to submit a search to the API, and display the results.

Understanding the API

The API you are going to use is an interface to the BNB. Before you can start working with the API, you need to understand how it works. To do this, first go to:

http://bnb.data.bl.uk/search

You will see a search form:

BNB search form

Enter a search term in the first box (‘Search store:’) and press ‘Search’. What you see next will depend on your browser. If you are using Google Chrome you will probably see something like this:

BNB RSS

If you are using Internet Explorer or Firefox you will see something more like:

BNB RSS in Firefox

This doesn’t matter for the moment – right now we are interested in the URL, not the display.

Look carefully at the URL – see how the search terms you typed in are included in the URL. The example I used is:

http://bnb.data.bl.uk/items?query=the+social+life+of+information&max=10&offset=0&sort=&xsl=&content-type=

The first part of the URL is the address of the API. Everything after the ‘?’ is a set of ‘parameters’ which form the input to the API. There are six parameters listed, and each one consists of the parameter name, followed by an ‘=’ sign, then a value.

The URL and parameters break down like this:

  • http://bnb.data.bl.uk/items – the address of the API
  • query=the+social+life+of+information – the ‘query’ parameter, which contains the search terms submitted
  • max=10 – the ‘max’ parameter is set to ‘10’. This means the API will return a maximum of 10 records. You can experiment with changing this to get more or fewer results at a time
  • offset=0 – the ‘offset’ parameter tells the API which record should be the first one included in the results. It is set to ‘0’, meaning that the API will start with the very first record
  • sort=&xsl=&content-type= – the other parameters, which reflect the other parts of the form at http://bnb.data.bl.uk/search. These parameters are:

  • sort
  • xsl
  • content-type

None of these are set, and they are not going to be used in this exercise.
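
If you are curious how the same call would look outside a spreadsheet, here is a minimal Python sketch that assembles the URL from the parameters described above. The endpoint and parameter names are taken from the text; as the update note at the top of this post explains, the service no longer works this way, so treat it purely as an illustration.

import urllib.parse

api_address = "http://bnb.data.bl.uk/items"      # the address of the API
params = {
    "query": "the social life of information",   # the search terms
    "max": 10,                                    # maximum number of records to return
    "offset": 0,                                  # start from the first record
}

# urlencode produces "query=the+social+life+of+information&max=10&offset=0"
api_call = api_address + "?" + urllib.parse.urlencode(params)
print(api_call)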


Going Further

If you want to find out more about the API being used here, including documentation of all the search parameters, see:

http://docs.api.talis.com/platform-api/full-text-searching

The output of the API is displayed in the browser – this is an RSS feed, and it would plug into any standard RSS reader like Google Reader (http://reader.google.com). The BBC have a brief explanation of what an RSS feed is (follow the link). It is also valid XML. The reason browsers display it differently (as noted above) is that some browsers recognise it as an RSS feed and try to display it nicely, while others don’t.

If you are using a browser that displays the ‘nice’ version, you can right-click on the page and use the ‘View Source’ option to see the XML that is underneath this.

While the XML is not the nicest thing to look at, it should be possible to find lines that look something like:

<item rdf:about="http://bnb.data.bl.uk/id/resource/011380365">
<title>The social life of information / J. S. Brown</title>
<link>http://bnb.data.bl.uk/id/resource/011380365</link>

Each result the API returns is called an ‘item’. Each ‘item’ at minimum will have a ‘title’ and a ‘link’. In this case the link is to more information about the item.
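
To show how a program (rather than a spreadsheet) might read those ‘item’ elements, here is a short Python sketch using the standard library parser. The sample XML is a cut-down, hypothetical fragment in the shape of the feed shown above; the real feed is RSS 1.0 and uses namespaces, so the sketch matches elements by their local name rather than assuming a particular prefix.

import xml.etree.ElementTree as ET

# A cut-down, hypothetical fragment shaped like the feed described above
sample_rss = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                         xmlns="http://purl.org/rss/1.0/">
  <item rdf:about="http://bnb.data.bl.uk/id/resource/011380365">
    <title>The social life of information / J. S. Brown</title>
    <link>http://bnb.data.bl.uk/id/resource/011380365</link>
  </item>
</rdf:RDF>"""

root = ET.fromstring(sample_rss)

def local_name(tag):
    # strip the "{namespace}" prefix that ElementTree adds to tag names
    return tag.split("}")[-1]

for element in root.iter():
    if local_name(element.tag) == "item":
        fields = {local_name(child.tag): child.text for child in element}
        print(fields.get("title"), "->", fields.get("link"))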

The key things you need to know to work with this API are:

  • The address of the API
  • The parameters that the API accepts as input
  • The format the API provides as output

Now you’ve got this information, you are ready to start using the API.

Using the API

To use the API, you are going to use a Google Spreadsheet. Go to http://drive.google.com, log in to your Google account, and create a new Google Spreadsheet.

The first thing to do is build the API call (the query you are going to submit to the API).

First some labels:

  • In cell A1 enter the text ‘API Address’
  • In cell A2 enter the text ‘Search terms’
  • In cell A3 enter the text ‘Maximum results’
  • In cell A4 enter the text ‘Offset’
  • In cell A5 enter ‘Search URL’
  • In cell A6 enter ‘Results’

Now, based on the information we were able to obtain by understanding the API, we can fill values into column B as follows:

  • In cell B1 enter the address of the API (see the table above if you’ve forgotten what this is)
  • In cell B2 enter a simple, one word search
  • In cell B3 enter the maximum number of results you want to get (10 is a good starting point)
  • In cell B4 enter ‘0’ (zero) to display from the first results onwards

The first four rows of the spreadsheet should look something like (with your own keyword in B2):

Spreadsheet 1

You now have all the parameters we need to build the API call. To do this you want to create a URL very similar to the one you saw when you explored the API above. You can do this using a handy spreadsheet function/formula called ‘Concatenate’ which allows you to combine the contents of a number of spreadsheet cells with other text.

In Cell B5 type the following formula:

=concatenate(B1,"?query=",B2,"&max=",B3,"&offset=",B4)

This joins the contents of cells B1, B2, B3 and B4 with the text included in inverted commas in the formula. N.B. Depending on the locale settings in Google Docs, it is sometimes necessary to use semicolons in place of the commas in the formula above.

Once you have entered this formula and pressed enter your spreadsheet should look something like:

Spreadsheet 2

The final step is to send this query, and retrieve and display the results. This is where the fact that the API returns results as an RSS feed comes in extremely useful. Google Spreadsheets has a special function for retrieving and displaying RSS feeds.

To use this, in Cell B6 type the following formula:

=importFeed(B5)

Because Google Spreadsheets knows what an RSS feed is, and understands it will contain one or more ‘items’ with a ‘title’ and a ‘link’ it will do the rest for us. Hit enter, and see the results.

Congratulations! You have built an API query, and displayed the results.

You have:

  • Explored an API for the BNB
  • Seen how you can ‘call’ the API by adding some parameters to a URL
  • Understood how the API returns results in RSS format
  • Used this knowledge to build a Google Spreadsheet which searches BNB and displays the results

Going Further

  • Try varying the values in Cells B3 and B4. Can you see how you could use these together to make a ‘page’ of results?
  • Try changing the search term in Cell B2. What happens if you use multiple words? Do you know why?

HINT: Look at the URL created in Cell B5 – can you see what’s wrong? Try doing a multi-word search using the search form at http://bnb.data.bl.uk/search and look at the URL produced – what’s the difference?

Can you work out how to avoid the multi-word search problem? Have a look at the ‘substitute’ function documented on this page https://support.google.com/drive/bin/static.py?hl=en&topic=25273&page=table.cs
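
If you want to check your thinking against a non-spreadsheet tool, Python’s standard library applies the same encoding to spaces that the BNB search form does – a small sketch of the idea, not part of the exercise itself:

from urllib.parse import quote_plus

# spaces in the query are encoded, here as '+' signs
print(quote_plus("the social life of information"))
# prints: the+social+life+of+information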

If you want to know more about the ‘importFeed’ function, have a look at the documentation at http://support.google.com/drive/bin/answer.py?hl=en&answer=155181

Exercise 2: More API – the full record

Introduction

In Exercise 1, you explored a search API for the BNB, and displayed the results. However, this minimal information (a result title and a URL) may not tell you a lot about the resource. In this exercise you will see how to retrieve a ‘full record’ and display some of that information.

Exploring the full record data

The ‘full record’ display is at the end of the URLs retrieved from the BNB in Exercise 1 above. Click on one of these URLs (or copy/paste into your browser). If possible pick a URL that looks like it is a bibliographic record describing a book, rather than a subject heading or name authority.

An example URL is http://bnb.data.bl.uk/id/resource/010712074

Following this URL will show a page similar to this:

BNB full record html

This screen displays the information about this item which is available via the BNB API as an HTML page. Note that the URL of the page in the browser address bar is different to the one you clicked on. In the example given here the original URL was:

http://bnb.data.bl.uk/id/resource/010712074

while the address in the browser bar is:

http://bnb.data.bl.uk/doc/resource/010712074

You will be able to take advantage of the equivalence of these two URLs later in this exercise.

While the HTML display works well for humans, it is not always easy to automatically extract data from HTML. In this case the same information is available in a number of different formats, listed at the top right-hand side of the display. The options are:

  • rdf
  • ttl
  • json
  • xml
  • html

The default view in a browser is the ‘html’ version. Offering access to the data in a variety of formats gives choice to anyone working with the API. Both ‘json’ and ‘xml’ are widely used by developers, with ‘json’ often being praised for its simplicity. However, the choice of format can depend on experience, the required outcome, and external constraints such as the programming language or tool being used.

Google Spreadsheets has some built-in functions for reading XML, so for this exercise the XML format is the easiest one to use.

XML for BNB items

To see what the XML version of the data looks like, click on the ‘xml’ link at the top right. Note the URL looks like:

http://bnb.data.bl.uk/doc/resource/010712074.xml

This is the same as the URL we saw for the HTML version above, but with the addition of ‘.xml’

XML is a way of structuring data in a hierarchical way – one way of thinking about it is as a series of folders, each of which can contain further folders. In XML terminology, these are ‘elements’ and each element can contain a value, or further elements (not both). If you look at an XML file, the elements are denoted by tags – that is the element name in angle brackets – just as in HTML. Every XML document must have a single root element that contains the rest of the XML.

Going Further

To learn more about XML, how it is structured and how it can be used see this tutorial from IBM: http://www.ibm.com/developerworks/xml/tutorials/xmlintro/

  • Can you guess another URL which would also get you the XML version of the BNB record?
  • Look at the URL in the spreadsheet and compare it to the URL you actually arrive at if you follow the link.

The structure of the XML returned by the BNB API has a ‘result’ element as the root element. The diagram below partially illustrates the structure of the XML.

BNB XML Structure

To extract data from the XML we have to ‘parse’ it – that is, tell a computer how to extract data from this structure. One way of doing this is using ‘XPath’. XPath is a way of writing down a route to data in an XML document.

The simplest type of XPath expression is to list all the elements that are in the ‘path’ to the data you want to extract using a ‘/’ to separate the list of elements. This is similar to how ‘paths’ to documents are listed in a file system.

In the document structure above, the XPath to the title is:

/result/primaryTopic/title

You can use a shorthand of ‘//’ at the start of an XPath expression to mean ‘any path’ and so in this case you could simply write ‘//title’ without needing to express all the container elements.
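
To make these paths concrete, here is a small Python sketch using the standard library. The XML below is a simplified, hypothetical fragment based on the structure implied by the XPath examples in the text (the real BNB record contains many more elements, and the ISBN shown is a placeholder):

import xml.etree.ElementTree as ET

# Simplified, hypothetical sketch of the record structure described above
sample_record = """<result>
  <primaryTopic>
    <title>The social life of information</title>
    <isbn10>0123456789</isbn10>
  </primaryTopic>
</result>"""

root = ET.fromstring(sample_record)   # 'root' is the result element

# Full path: /result/primaryTopic/title (written relative to the root element)
print(root.find("primaryTopic/title").text)

# 'Any path' shorthand: //isbn10 becomes .//isbn10 in ElementTree
print(root.find(".//isbn10").text)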

Going Further

  • What would the XPath be for the ISBN-10 in this example?
  • Why might you sometimes not want to use the shorthand ‘//’ for ‘any path’, and instead write the path out in full? Can you think of any possible undesired side effects?

Find out more about XPath in this tutorial: http://zvon.org/comp/r/tut-XPath_1.html

Using the API

Now you know how to get structured data for a BNB item, and the structure of the XML used, you can extend the Google Spreadsheet you created in Exercise 1 to display more detailed data for the item.

Google Spreadsheets has a function called ‘importXML’ which can be used to import XML, and then use XPath to extract the relevant data. In order to use this you need to know the location of the XML to import, and the XPath expression you want to use.

In Exercise 1 you should have finished with a list of URLs in column C. These URLs can be used to get an HTML version of the record. To get an XML version of the same item, you simply need to add ‘.xml’ to the end of the URL.

The XPath expression you can use is ‘//isbn10’. This will find all the isbn10 elements in the XML.

With these two bits of information you are ready to use the ‘importXML’ function. In Cell D6, type the formula:

=importXml(concatenate(C6,".xml"),"//isbn10")

This creates the correct URL with the ‘concatenate’ function, retrieves the XML document, and uses the XPath ‘//isbn10’ to get the content of that element – the 10-digit ISBN. N.B. Depending on the locale settings in Google Docs, it is sometimes necessary to use semicolons in place of the commas in the formula above.

Congratulations! You have used the BNB API to retrieve XML and extract and display information from it.

You have:

  • Understood the URLs you can use to retrieve a full record from the BNB
  • Understood the XML used to represent the BNB record
  • Written a basic XPath expression to extract information from the BNB record

Going Further

  • How would you amend the formula to display the publication information?
  • Now you have an ISBN for a BNB item, can you think of other online resources you could link to or use to further enhance the display?
  • How would you go about bringing in an additional source of data?

To see one example of how this spreadsheet could be developed further see https://docs.google.com/spreadsheet/ccc?key=0ArKBuBr9wWc3dEE1OXVHX2U2YTkyaHJxWjI2WTFWLUE&usp=sharing

  • What new source has been added to the spreadsheet?
  • What functions have been used to retrieve and display the new data?
  • Why is the formula used more complex than the examples in the two exercises above?

Shakespeare as you like it

This is a slightly delayed final post on my Will Hack entry – which I’m really happy to say won the “Best Open Hack” prize in the competition.

I should start by acknowledging the other excellent work done in the competition, with special mention of the overall winner – a ‘second screen’ app by Kate Ho and Tom Salyers to use when watching a Shakespeare play. Also thanks to the team behind the whole Will Hack event at EDINA – Muriel, Nicola, Neil and Richard – the idea of an online hackathon was great, and I hope they’ll be writing up the experience of running it.

I presented my hack as part of the final Google+ Hangout and you can watch the video on YouTube. Here I’ll describe the hack, but also reflect on the nature and potential of the Will’s World Registry which I used in the hack.

The Will’s World Registry and #willhack are part of the Jisc Discovery Programme – which is a programme I’ve been quite heavily involved in. The idea of an ‘aggregation’ which brings together data from multiple sources (which is what the Will’s World Registry does) was part of the original Vision document which informed the setting up of the Discovery programme. When I sat down to start my #willhack, I really wanted to see if the registry fulfilled any of the promise that the Vision outlined.

The Hack

When I started looking at the registry, it was difficult to know what I should search for – what data was in there? what searches would give interesting results? So I decided that rather than trying to construct a search service over the top of the registry (which uses Solr and so supports the Solr API – see the Querying Data section in this tutorial), I’d see if I could extract relevant data from the plays (e.g. names of characters, places mentioned etc.) and use those to create queries for the registry and return relevant results.
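
For anyone who hasn’t used Solr before, a query is just an HTTP GET with a ‘q’ parameter against the index’s ‘select’ handler. The sketch below uses standard Solr parameters, but the base URL and the ‘title’ field are placeholders of mine rather than details of the Will’s World registry, so check the project’s documentation for the real endpoint and field names.

import json
import urllib.parse
import urllib.request

# Placeholder endpoint - not the real Will's World registry URL
SOLR_SELECT = "http://example.org/willsworld/select"

params = {
    "q": '"Much Ado About Nothing" AND Dogberry',  # combine play title and character name
    "rows": 10,                                     # number of results to return
    "wt": "json",                                   # ask Solr for a JSON response
}

url = SOLR_SELECT + "?" + urllib.parse.urlencode(params)
with urllib.request.urlopen(url) as response:
    results = json.load(response)

# A Solr JSON response wraps matching documents in response -> docs
for doc in results["response"]["docs"]:
    print(doc.get("title"))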

It seemed to me this approach could provide a set of resources alongside the plays – a starting point for someone reading the play and wanting more information. As I’ve done with previous hacks I decided that I’d use WordPress as a platform to deliver the results, and so would build the hack as a WordPress plugin. I did consider using Moodle, the learning management system, instead of WordPress, as I wanted the final product to be something that could be used easily in an educational context – however in the end, I went with WordPress as having a larger audience.

The first thing I wanted to do was import the text of the plays into a WordPress install, and then start extracting relevant keywords to throw at the registry. This ended up being a lot more time consuming than I expected. I got the basics of downloading a play in XML (I used the simple XML provided by the Will’s World team at http://wwsrv.edina.ac.uk/wworld/plays/index) and creating posts working quite quickly. However, it turned out my decision to create one WP post per line of dialogue was an expensive one – creating posts in WP seems to be quite a slow process, and so it would take minutes to load a play. This in turn led to timeout issues – the web server would time out while waiting for the PHP script to run. It took me some considerable time to amend the import process to import only one act at a time, based on the user pressing a ‘Next’ button. The final result isn’t ideal but it does the job – to improve this would take a re-write of the code, inserting posts directly using SQL rather than the pre-packaged ‘insert post’ function provided by WordPress. I also realised very late on in the hack that my own laptop was a hell of a lot faster than my online host – and I could have avoided these issues altogether if I’d done development locally – but then I guess that would have detracted from the ‘open’ aspect I won a prize for!

Willhack Much Ado Tag Cloud

I also spent some time working out how to customise WordPress to display the play text in a useful way. I drew on some experience of developing ‘multi-faceted documents’ in WordPress – although I didn’t go quite as far as I would have liked. I also benefited from using WordPress and having access to existing WordPress plugins. For example, putting up a tag cloud immediately gives you information on which characters in the play have the most lines (as I automatically tagged each post with the speaking character’s name, as well as the Act and Scene it was in).

As I developed I posted all the code to Github, and kept an online demonstration site up to date with the latest version of the plugin.

I now had a working plugin that imported the play and displayed it in a useful way. So I was ready to go back to the purpose of the hack – to draw data out of the registry. Earlier when I’d been looking for ideas of what to do I’d also created a data store on ScraperWiki of cast lists from various productions of Shakespeare plays and so an obvious starting point was a page per character in the play that would display this information, plus results from the registry.

I started to put this together. Unfortunately, as you can see in the tag cloud, the character names I got from the XML are actually ‘CharIDs’ and have had white space removed – meaning that while this approach works well for ‘Beatrice’, it fails for ‘Don Pedro’, which is converted to ‘DONPEDRO’. I could solve this either by using a different source for the plays, or by linking with the work Richard Light did for #willhack, where he established URIs for each character with their real name and this CharID from the XML. (I also threw in some schema.org markup into each post – this was a bit of an afterthought and I’m not sure if it is useful, or indeed is marked up correctly!)

I found immediately that just throwing a character name at the registry didn’t return great results (perhaps not surprisingly), but combining the character name with the name of the play was generally not bad. Where it tended to provide less interesting results was where the character name was also in the name of the play – for example searching “Romeo AND Romeo and Juliet” doesn’t improve over just searching for “Romeo and Juliet” – and the top hits are all versions of the play from the Open Library data – which is OK, but to be honest not very interesting.

However, at its best, quite a basic approach here creates some interesting results, such as this one for the character of Dogberry from Much Ado About Nothing:

Dogberry page created by ShakespearePress WP plugin

and this interesting snippet from the Culture Grid:

CultureGrid resource on Dogberry

The British Museum data proved particularly rich, with many images of Shakespearean characters.

I finished off the hack by creating a ‘summary’ page for the play, which tries to get a summary of the play via the DuckDuckGo API (in turn this tends to get the data from Wikipedia – but the MediaWiki API seemed less well documented and harder to use). It also tries to get a relevant podcast and epub version of the play from the Oxford University “Approach Shakespeare” series.

Once the posts and pages have been created they are static – they are generated on the fly during import, but don’t update after that. This means that the blog owner can then edit them as they see fit – maybe adding in descriptions of the characters, or annotating parts of the play. All the usual WordPress functionality is available, so you could add more plugins etc. (although the play layout depends on you using the Theme that is delivered as part of the plugin).

I think this could be a great starting point for creating a resource aimed at schools – a teacher gets a great starting point, a website with the play text and pointers to more resources. I hope that illustrations of characters, and information about people who have played them (especially where that’s a recognisable name like Sean Bean playing Puck) bring the play to life a bit and give some context. It also occurred to me that I could create some ‘plan a trip’ pages which would present resources from a particular collection – like the British Museum – in a single page, pointing at objects you could look at when you visited the museum.

You can try the plugin right now – just set up a clean WordPress install, download the ShakespearePress plugin from GitHub, drop it into the ‘plugins’ directory, install it via the WP admin interface, then go to the settings page (under the settings menu) – and it will walk you through the process. All feedback is very welcome. You can also browse a site created by the plugin at http://demonstrators.ostephens.com/willhack.

I was going to comment on my experience of using the Registry here, but I’ve already gone on too much – that will have to be a separate post!

The time is out of joint

This is an update on my progress with my #willhack project. As I wrote in my previous post:

 my aim is to build a WordPress plugin that starts from the basis of plays expressed as a series of WordPress posts, and adds, via widgets, links to other sources based on the text you are viewing – this will be hopefully a mixture of  at least the cast lists I’ve extracted, and searches of the Will’s World registry. Other stuff if I have time!

I’ve made some progress with this, and there is now code for a WordPress plugin up on GitHub although it’s still a work in progress.

I decided to use WordPress as in the past I’ve found this a shortcut for getting a hack up on the web in a usable form quickly, and the widespread use of WordPress means that by packaging the hack as a WordPress plugin I can make it accessible to a wide audience easily. I did consider using Moodle as an alternative to WordPress to target an audience in education more directly, but thought I’d probably find WordPress quicker.

I was able to write a plugin which loaded the text of the first Act of “Much Ado about Nothing” using the XML version of the text provided by the Will’s World service at http://wwsrv.edina.ac.uk/wworld/plays/Much_Ado_about_Nothing.xml. Each ‘paragraph’ in the play – which equate to either a stage direction, or a piece of dialogue by a character – is made into a WordPress post. The plugin also includes a WordPress Theme which is automatically applied, which subverts the usual ‘blog’ style used by WordPress to group all these posts by Act and Scene. The default view is the whole of the text available, but using built in WordPress functionality you can view a single Act, a single Scene or a single line of dialogue. Using the built in ‘commenting’ facility you can annotate or comment on any line of dialogue.

Each ‘post’ is also tagged with the name of the character speaking the line. This means it is possible to view all the lines for a particular role, and also, via a tag cloud, easily see who gets the most lines.

So far, so good. Unfortunately at this point I hit a problem – which was when I tried to load an entire play (rather than just the first Act), I consistently got a http status 500 error. I’m pretty sure this is because either PHP or Apache is hitting a timeout limit when it tries to create so many posts in one go (each line of dialogue is a post so a play is hundreds of posts). I spent far too long trying to tweak the timeout limits to resolve this problem with no luck.

The time I spent on the timeout issue I really should have spent developing the planned widget to display contextual information from other sources such as cast lists and the Will’s World Registry. I’d originally hoped I might be able to achieve something approaching “Every story has a beginning” – an example of a linked data + javascript ‘active’ document from Tim Sherratt (@wragge). I’ve now had to concede to myself that I’m not going to have time to do this (although I have put some rudimentary linked data markup into the play text for each speaking character, using schema.org markup).

So, instead of a widget I’ve decided to create a page for each character in the play, which uses at least the data from the cast lists I’ve scraped and the Will’s World registry. As I’ve already got timeout issues when getting the play text into WordPress, I’m going to have to work out a way of adding the creation of these pages without hitting timeout issues again. I think I’ve got an approach which, although a bit clunky, should break both tasks (creation of the play text as posts, and creation of ‘character’ pages) into smaller chunks that the person installing the plugin can activate in turn – so I just need to make sure no single step takes too long to carry out. I’m sure there must be a better way of doing this, but I haven’t found any decent examples so far. I suspect one approach would be to break the work down into steps as I’m proposing, but trigger each step with some javascript rather than requiring the user to click a ‘next’ button – however, I don’t want to spend any more time on this than is absolutely necessary at this stage.

If you want to try out the plugin as it currently stands you can get the code from Shakespearepress on GitHub (N.B. don’t install it on a WordPress blog you care about at the moment – it creates lots of posts, changes the theme, and it might break). Once the plugin is activated you need to go to the plugin settings page to import the text of a play (at the moment it supports the first Act of Much Ado about Nothing or the first Act of King Lear).

If you want to see an example of a WordPress blog that has been Shakespeared then take a look at http://demonstrators.ostephens.com/willhack, which uses the first act of Much Ado about Nothing:


Hopefully before the submission deadline I can extend the Character pages and sort the installation process so you can load a whole play and create a page per (main?) character.

To scrape or not to scrape?

I’m currently participating in the #willhack online hackathon. This is an event being run by EDINA at the University of Edinburgh, as part of their Will’s World project, which in turn is part of the Jisc Discovery Programme.

The Discovery Programme came out of a Jisc task force looking at how ‘resource discovery’ might be improved for researchers in UK HE. The taskforce (catchily known as RDTF) outlined a vision, based on the idea of the publication of ‘open’ metadata (open in terms of licensing and access) by data owners, and the building of ‘aggregations’ of data with APIs etc., which would provide platforms for the development of user facing services by service providers.

The Will’s World project is building an aggregation of data relating to Shakespeare, and the idea of the hackathon is to test the theory that this type of aggregation can be a platform for building services.

As usual when getting involved in this kind of hackathon, I spent quite a lot of time unsure exactly what to do. The Will’s World registry has data from a number of sources, and the team have also posted other data sources that might be used, including xml markup of the plays.

I played around with some sample queries on the main registry (it supports the SOLR API for queries), but didn’t get that far – it was hard to know what you’d get back from any particular query, and I struggled to know what queries to throw at it beyond the title of each play – which inevitably brought back large numbers of hits.

I also had a couple of other data sources I was interested in – one was Theatricalia, a database of performances of plays in the UK including details of venue, play, casts etc. This is crowdsourced data, and the site was created and is maintained by Matthew Somerville (@dracos).

The other was a database called ‘Designing Shakespeare‘. Designing Shakespeare was originally an AHDS (Arts and Humanities Data Service) project, which is now hosted by RHUL (Royal Holloway, University of London). The site contains information about London-based Shakespeare productions, including cast lists, pictures from productions, interviews with designers and even VRML models of the key theatre spaces. Designing Shakespeare is one of those publicly funded resources that I think never gets the exposure or love it deserves – a really interesting resource that is (probably) underused (I don’t have any stats on usage, so that’s just me being pessimistic!)

Both these sites about performance made me think there was potential to link plays with performance information, and then maybe some other information from the Will’s World registry. I liked the idea of using the cast lists to add some interest – many of the London-based performances at least have that “oh, I didn’t realise X played Y” feel to them (Sean Bean as Puck, anyone?). Unfortunately neither Theatricalia nor Designing Shakespeare has an API to get at their data programmatically, so I decided I’d write a scraper to extract the cast lists. Having done some quick checking, I found the Designing Shakespeare cast lists tended to be more complete, so I decided to scrape those first. While there is lots of information about the copyright status of many materials on the Designing Shakespeare site (pictures/audio/video etc.), there is no mention of copyright on the cast lists. Since this is very factual data, and my only reason for extracting it was to point back at the website, I felt reasonably OK scraping this data out.
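
For anyone curious what the scraping step looks like, the sketch below shows the general shape of it in Python using requests and BeautifulSoup. The URL and the HTML structure here are entirely hypothetical – the real Designing Shakespeare pages would need their own selectors – so this is a pattern rather than the actual scraper:

import requests
from bs4 import BeautifulSoup

# Hypothetical production page - not the real Designing Shakespeare URL structure
url = "http://example.org/designing-shakespeare/production/123"

html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")

cast = []
# Hypothetical markup: one table row per cast member, role in the first cell, actor in the second
for row in soup.select("table.cast tr"):
    cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
    if len(cells) >= 2:
        cast.append({"role": cells[0], "actor": cells[1]})

for entry in cast:
    print(entry["role"], "-", entry["actor"])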

As always with these things, it wasn’t as straightforward as I hoped, and it’s taken me much longer to get the data out than I expected. Time I probably should have spent actually developing the main product I want to produce for the hack, but now it’s done (using the wonderful ScraperWiki) –  you can access all the data at https://scraperwiki.com/scrapers/designing_shakespeare_cast_lists/ – around 24,000 role/actor combinations (that seems very high – I’m hoping that the data is all good at the moment!)

You can access the data via API or using SQL – I hope others will find it useful as well.

Now I need to find some time to move onto the next part of my hack – my aim is to build a WordPress plugin that starts from the basis of plays expressed as a series of WordPress posts, and adds, via widgets, links to other sources based on the text you are viewing – this will be hopefully a mixture of  at least the cast lists I’ve extracted, and searches of the Will’s World registry. Other stuff if I have time!

If anyone wants to collaborate on this, I’d be very happy to do so. I’ll be posting code (once I have any) on GitHub at https://github.com/ostephens/shakespearepress.



Using Open Refine for e-journal data

Open Refine (previously Google Refine) is a tool for manipulating and ‘cleaning’ data (more information is available on the new Open Refine site). If you use Excel to do general data jobs, then it’s worth looking at Refine to see if it can help.

I’ve used Refine in the past to explore and clean up data, and been impressed by the tools it provides. I’m currently working on two projects – KnowledgeBase+ and GoKB – which are concerned with collecting and organising data about electronic resources. As anyone who has had to deal with data about electronic journal collections knows (especially e-resource librarians), a lot of data is sent around in spreadsheets and as comma/tab separated values – for example the KBART Guidelines, which are used by some publishers/content providers to publish lists of e-journals, recommend a TSV format.

Today I wanted to understand how a particular e-journals package was changing in 2013. I had a list of the journals included in the package in 2012, and a new list of the journals that will be included in 2013. Each file contained the titles of the journals, ISSNs, eISSNs, URLs, and the year, volume and issue of the first journal issue included in the online package. However, the column order and names were not the same, as can be seen here:

2012 file
2013 file

I wanted to know

  • which titles were in the 2012 version that were not in the 2013 version
  • which titles were in the 2013 package that were not in the 2012 package
  • for titles that were in both, if there had been any changes in key information such as the first issue included

I’d seen a blog post by Tony Hirst (@psychemedia on Twitter) on merging datasets using Refine, which shows how you can match values across two Refine projects. Following this tutorial I was able to start matching across the data sets. Since both data sets contained the ISSN and eISSN, it seemed likely that one of these would be a good starting point for matching – alas, this was not the case. Not all entries had an ISSN or eISSN, and in more than one case what was recorded as the eISSN in one file was listed as the ISSN in the other. I decided to look for another approach to matching across the data sets.

In the end the most successful approach was using a three-letter code that appeared in the URLs in both files to identify the journal – this allowed me to get good matching between the two files and be pretty confident that I wasn’t missing matches. Interestingly, the next most reliable matching mechanism I found was using a ‘fingerprint’ version of the journal title (‘fingerprint’ is a mechanism that tries to standardise text strings, which is also described by Tony in his blog post).
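
The ‘fingerprint’ idea is simple enough to sketch in a few lines of Python: lower-case the string, strip punctuation, then sort and de-duplicate the words, so that differences in word order, case and punctuation disappear. This is a sketch of the idea rather than Refine’s exact implementation:

import string

def fingerprint(title):
    # lower-case, strip punctuation, split into words, de-duplicate and sort
    cleaned = title.lower().translate(str.maketrans("", "", string.punctuation))
    words = sorted(set(cleaned.split()))
    return " ".join(words)

# Two variant forms of a (made-up) journal title produce the same key
print(fingerprint("The Journal of Important Studies"))
print(fingerprint("Journal of important studies, The"))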

Having found a decent way of matching between the files, I started to try to answer the questions above. Firstly from the 2013 file, I added in a new column which pulled the matching title from the 2012 file. Any blank cells in this column represented a 2013 title not in the 2012 file (easy to find using the Refine ‘faceting’ function).

To discover the titles in the 2012 file that were not in the 2013 one, I did the same match process, but starting from the 2012 file. I couldn’t think of any other way of doing this, which was a shame – it would be nice to get all the data into a single project and then do all the analysis from there. At the moment the only way I can think of doing this would be to somehow merge the files before importing into Refine – which would seem to defeat the point a bit.

Finally, I used the same matching mechanism to pull in the information relating to the first issue (first issue number, volume number, year) from the 2012 file into the 2013 file. I could then compare the 2012 version of each of these to the 2013 version

The expression used is:

if(value==cells["Frontfile 1st Issue Online Vol"].value,"same","changed")

This compares the value in the "Frontfile 1st Issue Online Vol" column with the value in the column I started from, and allocates a value of ‘same’ or ‘changed’. I found you needed to be careful when doing these comparisons that the data is of the same ‘type’. I had one example where I ended up comparing a ‘number’ type with a ‘string’ type and getting a ‘changed’ outcome even though both contained the character ‘4’.

By using facets to find all those that had ‘changed’, and using another facet on the 2012 title match to eliminate those titles that were not in the 2012 file, I was quickly able to find the 5 titles where the year had changed, and a further 16 titles where the volume of the first issue available was different, even though the year had stayed the same.
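
As an aside, the same cross-file comparison can be sketched outside Refine with a data library such as pandas. The file and column names here (‘code’ for the three-letter identifier, ‘first_vol’ for the first volume available) are hypothetical stand-ins for the real spreadsheet headings:

import pandas as pd

# Hypothetical file and column names standing in for the real spreadsheets
j2012 = pd.read_csv("package_2012.csv")   # includes 'code' and 'first_vol' columns
j2013 = pd.read_csv("package_2013.csv")

# An outer merge on the shared code keeps titles that appear in only one of the two years
merged = pd.merge(j2013, j2012, on="code", how="outer",
                  suffixes=("_2013", "_2012"), indicator=True)

only_2013 = merged[merged["_merge"] == "left_only"]    # new in 2013
only_2012 = merged[merged["_merge"] == "right_only"]   # dropped after 2012
in_both = merged[merged["_merge"] == "both"]

# For titles present in both years, flag any change in the first volume available
changed = in_both[in_both["first_vol_2013"] != in_both["first_vol_2012"]]
print(len(only_2013), len(only_2012), len(changed))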

Could I have achieved the same outcome in Excel? Probably. However, Refine has several nice features (like faceting) that are very straightforward compared to doing the same thing in Excel. Perhaps, though, its greatest strength is the ability to easily view a history of all the changes that have been made to the file, undo changes, copy changes and re-apply them by amending field names, and the ability to export all the changes applied in such a way that you can apply exactly the same changes to future files where the same actions are required.

What to do with Linked Data?

I think Linked Data offers some exciting opportunities to libraries, archives and museums (LAMs), and I’m pleased and excited that others feel the same. However, there has been, in my view (and on my part), a bit of ‘build it and they will come’ rhetoric around the publication of linked data by LAMs. This is perhaps inevitable when you try to change the way such a large and varied set of organisations and people think about and publish data, particularly with ‘linked data’, where the whole point is to see links between data sets that would once have been silos. To achieve those links you need a significant amount of data out there in linked data form before you can start seeing substantial linking.

However, over the last year or so we have seen the publication of significant data sets in the LAM space as linked data – so it is clear we need to go beyond the call to arms to publish linked data and really look at how you use the data once it is published. A couple of recent discussions have highlighted this for me.

Firstly Karen Coyle posted a question to the code4lib mailing list asking how to access and use the ‘schema.org’ microdata that OCLC have recently added to Worldcat. A use case Karen described was as follows:

PersonA wants to create a comprehensive bibliography of works by AuthorB. The goal is to do a search on AuthorB in WorldCat and extract the RDFa data from those pages in order to populate the bibliography. Apart from all of the issues of getting a perfect match on authors and of manifestation duplicates (there would need to be editing of the results after retrieval at the user’s end), how feasible is this? Assume that the author is prolific enough that one wouldn’t want to look up all of the records by hand.

Among a number of responses, Roy Tennant from OCLC posted:

For the given use case you would be much better off simply using the WorldCat Search API response. Using it only to retrieve an identifier and then going and scraping the Linked Data out of a WorldCat.org page is, at best, redundant. As Richard pointed out, some use cases — like the one Karen provided — are not really a good use case for linked data. It’s a better use case for an API, which has been available for years.

[‘Richard’ in this case refers to Richard Wallis also from OCLC]

The discussion at this point gets a bit diverted into what APIs are available from OCLC, to whom, and under what terms and conditions. However, for me the unspoken issue here is – it’s great that OCLC have published some linked data under an open licence – but what good is it, and how can I use it?

The second prompt to write this post was a blog post from Zoë Rose, and a subsequent discussion on Twitter with Zoë (@z_rose) and others, arguing that “for making learning content searchable – Strings win [over URIs]”. Zoë is talking about LRMI – a scheme for encoding metadata about learning resources. Zoë notes that “LRMI currently has a strong preference to URIs to curriculum standards for describing learning content” – that is, LRMI takes a Linked Data-type approach (there is a proposal for how LRMI should be represented in schema.org).

Zoë argues that LRMI should put more emphasis on using strings, rather than URIs, for describing resources. She cites a number of reasons including the lack of relevant URIs, the fact that URIs will prove unstable in the medium to long term and the fact that the people creating the learning resources aren’t going to use URIs to describe the things they have created. In response to a comment Zoë says:

Consider – which has a clearer association:

Resource marked up with USA’s common core uri for biology (this does NOT exist) Mediating layer Resource marked up with Uganda’s uri for biology (this doesn’t exist either)

Or (please ignore lack of HTML)

Resource marked up with string ‘subject > Biology’ Resource marked up with string ‘subject > Biology’

And on top of that, guess which one matches predictable user search strings?

Guess which one requires the least maintenance?

Guess which one is more likely to appear ‘on the page’ – that being the stated aim of schema.org and, consequently, LRMI?

Guess which one is going to last longer?

More than anything else, I think I’d say this: this is a schema. It doesn’t need either/ors, it can stretch. And I can’t think of a single viable reason for not including semantically stable strings – but a shed load of reasons not to rely on a bunch of non-page-visible, non-aligning, non-existent URIs.

I think Zoë is making arguments I’ve heard before in libraries: “cataloguers are never going to enter URIs into records – it’s much easier for them to enter text strings for authors/subjects/publishers/etc.”; “there aren’t any URIs for authors/subjects/publishers/etc.”; “the users will never understand URIs”; “we can’t rely on other people’s URIs – what if they break?”.

These are all fair points, and basically I’d agree with all of them – although in libraries at least we now have pretty good URIs for authors (e.g. through http://viaf.org) and subjects (e.g. through http://id.loc.gov) as well as a few other data types (places, languages, …). However, these might break, and they aren’t in any way intuitive for those creating the data or those consuming it.

While these points are valid, I don’t agree with the conclusion that strings are therefore better than URIs for making learning resources discoverable. Firstly, I don’t think this is an either/or decision – at the end of the day you clearly need to use language to describe things; it’s the only thing we’ve got. However, the use of URIs as pointers towards concepts, and ultimately strings, brings some advantages. In a comment on Zoë’s blog post Luke Blaney argues:

In this example, my preferred solution would be to use http://dbpedia.org/resource/Biology There’s no need for end users to see this URI, but it makes the data so much more useful. Given that URI, as a developer, I can write a simple script which will output a user-friendly name. Not only that, but I can easily get the name in 20 different languages – not just English. I can also start pulling in data from other fields which are using this URI, not just education.

I think Luke nails it here – this is the advantage of using the URI rather than simply the textual string – you get so much more than just a simple string.
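
To make Luke’s point concrete, here’s a small sketch of my own (not his script) using Python and rdflib – it assumes DBpedia continues to serve RDF descriptions of its resource URIs via content negotiation:

```python
from rdflib import Graph, URIRef
from rdflib.namespace import RDFS

BIOLOGY = URIRef("http://dbpedia.org/resource/Biology")

# DBpedia serves RDF descriptions of its resource URIs via content
# negotiation (an assumption this sketch relies on), so rdflib can
# load the description of the concept directly
g = Graph()
g.parse(str(BIOLOGY))

# Collect the rdfs:label values, keyed by language tag
labels = {label.language: str(label) for label in g.objects(BIOLOGY, RDFS.label)}

print(labels.get("en"))  # e.g. "Biology"
print(labels.get("fr"))  # e.g. "Biologie"
```

The same URI also links out to related data – categories, other vocabularies and so on – which is exactly the ‘so much more’ Luke is talking about.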

BUT – how does this work in practice? How does the use of the URI in the data translate into a searchable string for the user? Going back to Karen’s example above, how can we exploit the inclusion of structured, linked data in web pages?

I should preface my attempt to describe some of this stuff by saying that I’m thinking aloud here – what I describe below makes sense to me, but I’ve not built an application like this (although I have built a linked data application that uses a different approach).

Let’s consider a single use case – I want to build a search application based on linked data embedded in a set of web pages using markup like schema.org. How could you go about building such an application? The excellent “Linked Data book” by Tom Heath and Christian Bizer outlines a number of approaches to building linked data applications in the section “6.3 Architecture of Linked Data applications”. In this case the only approach that makes sense (to me anyway) is the “Crawling Pattern”, which is described as follows:

Applications that implement this pattern crawl the Web of Data in advance by traversing RDF links. Afterwards, they integrate and cleanse the discovered data and provide the higher layers of the application with an integrated view on the original data. The crawling pattern mimics the architecture of classical Web search engines like Google and Yahoo. The crawling pattern is suitable for implementing applications on top of an open, growing set of sources, as new sources are discovered by the crawler at run-time. Separating the tasks of building up the cache and using this cache later in the application context enables applications to execute complex queries with reasonable performance over large amounts of data. The disadvantage of the crawling pattern is that data is replicated and that applications may work with stale data, as the crawler might only manage to re-crawl data sources at certain intervals. The crawling pattern is implemented by the Linked Data search engines discussed in Section 6.1.1.2.

In this case I see this as the only viable approach, because the data is embedded in various pages across the web – there is no central source to query. Just like building a web search engine for traditional HTML content, the only real option is to have software that is given a starting point (a page or sitemap) and systematically retrieves the pages linked from there, and then any pages linked from those pages, and so on – until you reach whatever limits you’ve put on the web crawling software (or you build an index of the entire web!).

While the HTML content might be of interest, let’s assume for the moment we are only interested in the structured data that is embedded in the pages. Our crawler retrieves a page, and we scrape out the structured data. We can start to create a locally indexed and searchable set of data – we know for each data element the ‘type’ (e.g. ‘author’, ‘subject’) and if we get text strings, we can index this straight away. Alternatively if we get URIs in our data – say rather than finding an author name string we get the URI “http://viaf.org/viaf/64205559” – our crawling software can retrieve that URI. In this case we get the VIAF record for Danny Boyle – available as RDF – so we get more structure and more detail, which we can locally index. This includes his name represented as a simple string <foaf:name>Boyle, Danny</foaf:name> – so we’ve got the string for sensible human searching. It also includes his specific date of birth: <rdaGr2:dateOfBirth>1956-10-20</rdaGr2:dateOfBirth>.
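
As a rough illustration of the scraping step, the sketch below pulls schema.org microdata properties out of a fetched page with BeautifulSoup, separating plain strings (which we can index straight away) from URIs (which the crawler might follow). It is deliberately minimal – real pages may use RDFa or JSON-LD instead, which would need different extraction code – and the example URL is a placeholder.

```python
import requests
from bs4 import BeautifulSoup

def extract_microdata(url):
    """Very minimal schema.org microdata scrape: collect itemprop values,
    separating plain strings from URIs the crawler might follow next."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    strings, uris = [], []
    for element in soup.find_all(attrs={"itemprop": True}):
        prop = element["itemprop"]
        # Links carry their value in href, meta tags in content,
        # everything else in the element text
        value = element.get("href") or element.get("content") or element.get_text(strip=True)
        if value.startswith("http"):
            uris.append((prop, value))      # candidate URIs to crawl further
        else:
            strings.append((prop, value))   # strings we can index straight away
    return strings, uris

# strings, uris = extract_microdata("https://example.org/some-record-page")  # placeholder URL
```

In this example the interesting result is a URI like the VIAF one above, which the crawler can go on to resolve.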

Because the VIAF record is also linked data, we also get links to some further data sources:

<owl:sameAs rdf:resource="http://dbpedia.org/resource/Danny_Boyle"/>

<owl:sameAs rdf:resource="http://www.idref.fr/059627611/id"/>

<owl:sameAs rdf:resource="http://d-nb.info/gnd/121891976"/>

This gives us the opportunity to crawl a step further and find even more information which could be helpful in our search engine. The link to dbpedia especially gives us a wealth of information in a variety of languages – including his name in a variety of scripts (Korean, Chinese, Cyrillic etc.), a list of films he has directed, and his place of birth. All of these are potentially useful when building a search application.
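
To make that step concrete, here is my own rough sketch (assuming VIAF still serves RDF for these URIs via content negotiation, as it did when I looked): it resolves the VIAF URI, pulls out the name string(s) to index, and collects the owl:sameAs links as candidates for the next hop of the crawl.

```python
from rdflib import Graph
from rdflib.namespace import FOAF, OWL

# Assumes VIAF serves an RDF description of the entity at this URI
# via content negotiation
g = Graph()
g.parse("http://viaf.org/viaf/64205559")

# The human-readable name string(s) we can index for searching --
# matched by predicate rather than subject, since the exact subject
# URI VIAF hangs these statements off may vary
names = {str(o) for o in g.objects(None, FOAF.name)}

# owl:sameAs links point at further descriptions (DBpedia, GND, IdRef, ...)
# which the crawler could choose to follow for extra labels and facts
same_as = {str(o) for o in g.objects(None, OWL.sameAs)}

print(names)    # e.g. {"Boyle, Danny", ...}
print(same_as)  # e.g. {"http://dbpedia.org/resource/Danny_Boyle", ...}
```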

However, we probably don’t want to crawl on and on – we don’t necessarily want all the data we can potentially retrieve about a resource on the web, and unlimited crawling quickly becomes impractical. We might decide that our search application is for English language only – so we can ignore all the other language and script information. On the other hand, we might decide that ‘search by place of birth of author’ is useful in our context. All these decisions need to be encoded in software – both controlling how far the crawler follows links, and what you do with the data retrieved. You might also encode rules in the software about domains you trust, or don’t trust – e.g. ‘don’t crawl stuff from dbpedia’ if you decide it isn’t to be trusted. Alternatively you might decide you’ll capture some of the data from less trusted sources, but weight that data low in the search application and never display it in the public interface.

If this all sounds quite complicated, I think it both is and isn’t. In some ways the concepts are simple – you need the following (a rough crawler skeleton follows the list):

  • list of pages to crawl
  • rules controlling how ‘deep’ to crawl (i.e. how many links to follow from original page)
  • rules on what type of data to retrieve and how to index in the local application
  • rules on which domains to trust/ignore
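
To show how little is really involved, here is a rough breadth-first crawler skeleton of my own – a sketch, not a production design. It assumes each URI serves RDF that rdflib can parse, and the trusted domain names are just example values.

```python
from collections import deque
from urllib.parse import urlparse

from rdflib import Graph, Literal, URIRef

def crawl(seed_uris, max_depth=2, trusted_domains=("viaf.org", "id.loc.gov")):
    """Breadth-first crawl over linked data: keep a local index of
    (subject, predicate, string value) and follow URI-valued objects
    up to max_depth, but only within the trusted domains."""
    index = []                              # our local, searchable store of strings
    seen = set()
    queue = deque((uri, 0) for uri in seed_uris)

    while queue:
        uri, depth = queue.popleft()
        if uri in seen or depth > max_depth:
            continue
        seen.add(uri)
        if urlparse(uri).netloc not in trusted_domains:
            continue                        # rule: ignore domains we don't trust

        g = Graph()
        try:
            g.parse(uri)                    # assumes the URI serves parseable RDF
        except Exception:
            continue                        # real code would log failures

        for s, p, o in g:
            if isinstance(o, Literal):
                index.append((str(s), str(p), str(o)))   # index strings locally
            elif isinstance(o, URIRef):
                queue.append((str(o), depth + 1))        # follow links one level deeper
    return index

# index = crawl(["http://viaf.org/viaf/64205559"], max_depth=1)
```

This skips the third rule from the list (choosing which predicates or types to index) to keep things short, and for pages that only carry embedded schema.org markup you would swap g.parse() for an extraction step like the one sketched earlier.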

At the moment I’m not aware of any software you could easily install and configure to do this – as far as I can see, currently you’d have to install crawler software, write data extraction routines, implement an indexing engine and build a search interface. While this is not trivial stuff, it also isn’t that complicated – this kind of functionality could, I think, easily be wrapped up in a configurable application if there was a market for it. Existing components like the Apache stack of Nutch/Solr/Lucene could also take you a good part of the way (e.g. see the description at http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/). It is also clearly within the capability of existing web crawling technology – big players like Google and Bing already do this on unstructured data, and schema.org comes out of the idea that they can enhance search by using more structured, and linked, data.

Where does this leave us in regard to the questions and issues that triggered this post in the first place? Potentially it leaves Karen having to crawl the whole of WorldCat before she starts tackling her specific use case – with 271,429,346 bibliographic records in WorldCat, this is no small feat. Ed Summers’ post about crawling WorldCat also points at some issues. Although things have moved on since 2009, and the sitemap files for WorldCat now include pointers to specific records, it isn’t clear to me whether the sitemap files cover every single item or just a subset of WorldCat. A quick count-up of the URLs listed in one of the sitemap files suggests the latter.
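
For what it’s worth, the quick count-up is easy to reproduce with a few lines of Python – the sitemap URL below is a placeholder, not the real WorldCat one:

```python
import requests
import xml.etree.ElementTree as ET

# Placeholder URL -- substitute one of the sitemap files listed in the
# sitemap index; real sitemap files are often gzipped (.xml.gz), which
# would need decompressing first
SITEMAP_URL = "https://example.org/sitemap-1.xml"

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(requests.get(SITEMAP_URL, timeout=30).content)

# Each <url><loc>...</loc></url> entry is one page listed in the sitemap
urls = [loc.text for loc in root.findall(".//sm:loc", NS)]
print(len(urls), "URLs listed in this sitemap file")
```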

Tackling the issues raised by Zoë: I hope this shows that it isn’t about strings vs URIs – URIs that can be crawled, and that eventually resolve to a set of strings or data, could increase discoverability, if the crawl and search applications are well designed. It doesn’t resolve all the other issues Zoë raises, like establishing URIs in the first place (back to ‘build it and they will come’) or how to capture the URIs (although I’d point to the work of the “Step Change” project, which looks at how URIs could be added to metadata records in archives, for some directions on this).

The world of linked data is the world of the web, the graph, the network – and as with the web, when you build applications on top of linked data you need to crawl the data and use the connections within the data to make the application work. It isn’t necessarily the only way of using linked data (anyone for federated SPARQL queries?) but I think that ‘crawl, index, analyse’ is an approach to building applications we need to start to understand and embrace if we are to actually put linked data to use.