Using Google Sheets with ESTC

I’ve done a lot of work with early English texts (and especially the related medata) over the past few years as part of my work on the Jisc ‘Historical Texts‘ platform which brings together texts from Early English Books Online (EEBO), Eighteenth Century Collections Online (ECCO) and a collection of digitised 19th Century books from the British Library.

I’m hoping that over the next few months I’ll be able to do more work looking at the opportunities to build on APIs over the Historical Texts platform. I’m interested in what sort of things people might want to do with an API which gives access to the metadata or texts in the collections, so I was very interested when, via @heatherfro on Twitter I saw a blog post from @BArmintor describing how he wanted to be able to find a Wing or STC number for a given ESTC number, to make it possible to find data on the Database of Early English Playbooks (DEEP)

Slightly tangentially I also do some training for British Library staff on using APIs, and in that training I use Google Sheets to get people with no experience of writing code interacting with an API at the British Library. Inevitably(?) the exercises we work through in this course are slightly artificial, and so I’m always interested in finding examples (especially ones working with British Library resources, which includes the ESTC) from the real world.

Given @BArmintor had already worked out the method to extract a Wing or STC number from the ESTC (the online catalogue is hosted at the British Library), I thought I’d amuse myself  by seeing if I could turn it into a Google Sheets exercise – so here it is…

Introduction

In this exercise you are going to use a Google Spreadsheet to search for an ESTC number in the ESTC, find the full record, and then extract the Wing or STC number from the record.

Since (as far as I know) there isn’t a public API to the ESTC, then you will be extracting information from the web (HTML) interface to the ESTC (this method of extracting information from web pages is sometimes referred to as ‘scraping’ or ‘web scraping’).

Understanding the web interface to ESTC

The first thing you need to understand is how the web interface to ESTC is structured, how you can search using an ESTC number, and where you can get the Wing and STC numbers from. Luckily @BArmintorhas already outlined this in his blog post. The basics are:

  • You can search for an ESTC Number using a URL of the form http://estc.bl.uk/{ESTC_Number}. E.g. http://estc.bl.uk/S109167.
  • Following this URL will take you to a list of results which (hopefully) will contain a single item – the record you are looking for
  • This single record in the results list contains a link to the full record for the item
  • There are several different ‘format’ options for viewing the full record. The view you see is controlled by a ‘format’ parameter in the URL for the full record. By default this is ‘format=999’, but the ‘MARC’ view is probably the easiest to work with for our purposes, and to get this you need to use ‘format=001’
  • In the MARC view, the Wing or STC numbers are stored in a table. The rows in the table which have Wing and STC numbers will have the text ‘5104’ in their first cell

For more detail read @BArmintor’s post.

Setup a Google Sheet

You’ll need to register for Google Drive if you haven’t already, then you can use the ‘New’ option to create a new Google Sheet.

You are going to create a sheet where you can enter ESTC numbers in the first column, and then retrieve and display various information in subsequent columns. Put a heading at the top of column A to show it will contain ESTC numbers, and add your first ESTC number in the second cell in the column (I’m going to use the ESTC number S109167).

ESTC Google Sheets setup illustartion 1

 

 

Search ESTC and get the full record URL

You know that you can use the URL http://estc.bl.uk/S109167 to search for this item on the ESTC, and that you’ll get the results page back, which will include a link to the full record – so in column B you are going to retrieve the page from http://estc.bl.uk/S109167 and extract the link to the full record. You can do this by using a couple of Google Sheets functions: ‘concat‘ (or ‘concatenate’) and ‘importXml‘.

‘Concat’ joins together multiple strings – you’ll use this to create the first URL you need:

concat(“http://estc.bl.uk/”,A2)

‘importXml’ allows you to import data formatted as XML and extract information from it using a language called ‘xpath’. Happily @BArmintor has already worked out the ‘xpath’ statement you need as “//td[@class=td1]//a[contains(@href,’&set_entry’)]/@href”. This means the importXml we’ll use will look like:

=importXml(concat(“http://estc.bl.uk/”,A2),”//td[@class=td1]//a[contains(@href,’&set_entry’)]/@href”)

The function imports the URL that you are building with the ‘concat’ statement, and then uses the XPath statement that @BArmintor has provided to get the URL for the full record.

You are going to put this formula in Cell B2, and while you are at it add a heading for column B:

ESTC Google Sheets Illustration 2

When you’ve added this formula and hit ‘enter’ you should find that the sheet finds the URL for the full record and displays it in the cell:

ESTC Google Sheets illustration 3

Retrieve full record and get the STC Number

Now you have the Full Record URL, the next step is to retrieve the full record in MARC format and extract an STC number (if it exists). To do this we’ll need to:

  • change the ‘format’ parameter (which is ‘format=999’ in this URL) to be ‘format=001’ (to get the MARC record view)
  • retrieve the HTML for that MARC record view and search for the table rows beginning ‘5104’, then see if the contents of the next cell in the row contains ‘STC’

We’ll do this step in column C – so you are going to be adding a formula to Cell C2. Here we’ll be using the ‘SUBSTITUTE‘ function to change the ‘format’ parameter:

SUBSTITUTE(B2,“format=999”,“format=001”)

You then combine this with another importXml function. In  this case @BArmintor hasn’t given the relevant XPath so you have to work it out. The first part of the XPath statement needs to find a table cell – a ‘td’ in HTML – containing the text ‘5104’. The XPath is:

//td[contains(text(),’5104′)]

However what you need is the contents of the following <td> element (i.e. the next cell in the table). Happily XPath has a specific function for this called ‘following-sibling’, so you can use it like this:

//td[contains(text(),’5104′)]/following-sibling::td

Finally, you only want to get the contents of this cell if it contains the string ‘STC’ indicating it is an STC number – you can use ‘contains’ again to do this:

//td[contains(text(),’5104′)]/following-sibling::td[contains(text(),’STC’)]

Putting this all together you get the following formula to enter into cell C2:

=importxml(SUBSTITUTE(B2,“format=999”,“format=001”),“//td[contains(text(),’5104′)]/following-sibling::td[contains(text(),’STC’)]”)

Once you hit ‘enter’ on this you should see the correct STC information being retrieved:

ESTC Google Sheet Illustration 4

 

As @BArmintor notes, the STC statement is split into two parts – the first part is preceded by “|a” and the second part – which is the number you actually want is preceded by “|c”. If you want to break this down further there are probably a few options you can use, but you can use the Google Sheet function ‘SPLIT’. You can add this to the formula you’ve worked out for C2, so edit cell C2 to be:

=split(importxml(SUBSTITUTE(B2,“format=999”,“format=001”),“//td[contains(text(),’5104′)]/following-sibling::td[contains(text(),’STC’)]”),”|c”)

This will result in the statement you’ve retrieved being split into two parts – the STC reference in cell C2, and the STC number in D2 – add some column headings and you’ve got a working sheet that will retrieve the STC number for a given ESTC number:

ESTC Google Sheet illustration 5

Retrieve full record and get the Wing Number

To extend this to retrieve Wing numbers as well is now simply – just use column E to add a modified version of the formula in column C – you need to look for the text ‘Wing’ rather than ‘STC’:

=split(importxml(SUBSTITUTE(B2,“format=999”,“format=001”),“//td[contains(text(),’5104′)]/following-sibling::td[contains(text(),’Wing’)]”),”|c”)

Unfortunately the example ESTC Number I’ve chosen to use in the first row doesn’t have a Wing reference, so I can add another row to test this – just enter the new test number (R504962) in cell A3 and copy down the formula from B2,C2,E2 into B3,C3,E3 respectively:

ESTC Google Sheets Illustration 6

And you can add new rows and copy down the formula as much as you want of course.

Possible issues

There are some things we might want to look out for that might cause us problems:

  • If a record has multiple STC and/or Wing references in the record, this will end up taking up multiple rows in the spreadsheet
  • We do an exact match on ‘STC’ or ‘Wing’ in the ESTC record – which would mean we wouldn’t find a record that used ‘stc’ or ‘wing’ etc.
  • If the STC or Wing reference more than just two parts (indicated by ‘|a’ and ‘|c’) we might find we don’t separate out the STC/Wing number by itself

In his blog post, the next thing @BArmintor wants to do is “teach a computer to look up a DEEP entry”. Unfortunately this initially looks like it isn’t going to be possible in Google Sheets (although I haven’t looked in detail yet). The reason we hit problems here is the DEEP search page (http://deep.sas.upenn.edu/search.php) uses the HTTP ‘POST’ command – this passes the values to be searched to the DEEP database without showing them in the URL. While this is no problem for a browser, the Google Sheets ‘importXml’ command doesn’t support HTTP ‘POST’ options, but only HTTP ‘GET’ – which is what we’ve relied on in this exercise. I haven’t investigated any work arounds for this so far, and that’s enough for one blog post 🙂

Leave a Reply

Your email address will not be published. Required fields are marked *