Mar 11

I think more people in libraries should learn scripting skills – that is, how to write short computer programs. The reason is simple: it can help you do quickly and easily things that would otherwise be time-consuming and dull. This is probably the main reason I started to use code and scripts in my work, and if you ever find yourself regularly doing a job that is time-consuming and/or dull and thinking ‘there must be a better way to do this’, it may well be a good project for learning to code.

To give an example: I work on ‘Knowledgebase+’ (KB+) – a shared service for electronic resource management run by Jisc in the UK. KB+ holds details of a whole range of electronic journals and related information, including the organisations providing or using the resources.

I’ve just been passed the details of 79 new organisations to be added to the system. To create these normally you would have to enter a couple of pieces of information (including the name of the organisation) into a web form and click ‘submit’ – once for each organisation.

While neither the worst nor the most time-consuming job in the world, it seemed like something that could be made quicker and easier with a short piece of code. If I do this in a sensible way, then the next time there is a list of organisations to add to the system I can just re-use the same code to do the job again.

Luckily I’d already been experimenting with automating some processes in KB+ so I had a head start, leaving me with just three things to do:

  1. Write code to extract the organisation name from the list I’d been given
  2. Find out how the ‘create organisation’ form in KB+ worked
  3. Write code to replicate this process that could take the organisation name and other data as input, and create an organisation on KB+

I’d been given the original list as a spreadsheet, so I just exported the list of organisation names as a CSV to make it easy to read programmatically. After that, writing code to open the file, read it a line at a time and pull out the name was trivial:

require 'csv'

# Read the exported CSV a row at a time, using the header row for field names
CSV.foreach(orgfile, :headers => true, :header_converters => :symbol) do |row|
  org_name = row[:name]
end

The code to trigger the creation of the organisation in KB+ was a simple HTTP ‘POST’ request (i.e. it is just a simple web form submission). The code I’d written previously essentially ‘faked’ a browser session and logged into KB+ (I did this using a code library called ‘mechanize’, which is specially designed for this type of thing), so it was simply a matter of finding the relevant URL and parameters for the ‘POST’. To see this information I used the handy Firefox extension ‘Tamper Data’, which lets you see (and adjust) the ‘POST’ and ‘GET’ requests sent from your browser.

Screenshot of Tamper Data

The relevant details here are the URL at the top right of the form, and the list of ‘parameters’ on the right. Since I’d already got the code that dealt with authentication, the code to carry out this ‘POST’ request looks like this:

# Submit the 'create organisation' form with the relevant parameters
page = @magent.post(url, {
  "name" => org_name,
  "sector" => org_sector
})

So – I’ve written less than 10 new lines of code and I’ve got everything I need to automate the creation of organisations in KB+ given a list in a CSV file.
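
Putting it all together, a minimal sketch of the whole script might look something like this – note that the URLs, the login form field names and the ‘sector’ value below are placeholders/assumptions, not the real KB+ details:

require 'csv'
require 'mechanize'

# Placeholder URLs - the real KB+ login and 'create organisation' URLs
# (found via Tamper Data) would go here
login_url  = "https://kbplus.example.org/login"
create_url = "https://kbplus.example.org/org/create"

# Fake a browser session and log in (the form field names here are assumptions)
@magent = Mechanize.new
@magent.post(login_url, { "username" => "myuser", "password" => "secret" })

# Read each organisation from the exported CSV and submit the form once per row
CSV.foreach("organisations.csv", :headers => true, :header_converters => :symbol) do |row|
  @magent.post(create_url, {
    "name"   => row[:name],
    "sector" => "Higher Education"   # assumed value for the sector parameter
  })
end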

Do you have any jobs that involve dull, repetitive tasks? Ever find yourself re-keying bits of data? Why not learn to code?

P.S. If you work on Windows, try looking at tools like Macro Express or AutoHotkey, especially if ‘learning to code’ sounds too daunting or time-consuming.

P.P.S. Perfection is the enemy of the good – try to avoid getting yourself into an XKCD ‘pass the salt’ situation

written by ostephens

Jun 10

Keynote from Ted Nelson

Talking about electronic literature for over 20 years. Felt alienated from the web because of ‘what it is not’.

Starting with the question – “what is literature”? For TN it is a system of interconnected documents. But the web supports only ‘one-way links’ – jumps into the unknown. Existing software does nothing for the writer to interact with this concept of ‘literature’.

The constructs of books we have recreate the limitations of print – separate documents. Standard document formats – individual characters, scrambled with markup, encoded into a file. This thinking goes deep in the community – and TN contends this is why other ideas of how literature could exist are seen as impossible.

For the last 8-10 years, TN and colleagues working on a system that presents an interconnected literature (Xanadu Space). Two kinds of connection:

  • Links (connects things that are different, and are two way)
  • Transclusion (connects things that are the same)

TN illustrating this with the example of a programming working environment – where code, comments and bugs are transcluded into a single Integrated Work Environment.

  • We shouldn’t have ‘footnotes’ and ‘endnotes’ – they should be ‘on the side’.
  • Outlines should become tables of contents that go sideways into the document
  • Email quotation should be parallel – not ‘in line’

The vision is a parallel set of documents that can be seen side-by-side.

History is parallel and connected – why do we not represent history as we write it, as parallel, coupled timelines and documents?

Challenge – how do you create this parallel set of connected documents? Each document needs to be addressable – so you can direct systems to ‘bring in text A from document B’. But there are challenges.

TN as a child was immersed in media. His dad was a director for live TV – so TN got to see the making of television firsthand – his first experience was not just consumption but creation of TV. At college he produced a musical, a publication and a film. He then started designing interactive software.

How did we get here?

TN describing current realisation of the ‘translit’ approach – Xanadu. Several components:

  • Xanadoc – an ‘edit decision list format’ – a generalisation of every quotation connected to its source
  • Xanalink – a type, plus a list of endsets (the things pointed at) – what to connect – exists independently of the doc?

What to do about changing documents? You copy & cache.

TN and colleagues almost ready to publish Xanadu specs for ‘xanadoc’ and ‘xanalink’ at http://xanadu.com/public/. Believes such an approach to literature can be published on the web, even though he dislikes the web for what it isn’t…

WYSIWYG – TN says only really applies to stuff you print out! TN aiming for ‘What you see is what you never could’ (do in print) – we need to throw off the chains of the printed document.

written by ostephens

Jun 10

Another winner of a DM2E Open Humanities award is being presented today by Robyn Adams (http://www.livesandletters.ac.uk/people/robynadams) from the Centre for Editing Lives and Letters. The project looked at repurposing data from the letters of Thomas Bodley (responsible for the refurbishment of the library at the University of Oxford – creating the Bodleian Library).

Bodley’s letters are held in archives around the world. The letters are full of references to places, people etc. The letters had been digitised and transcribed – using software called ‘Transcribers Workbench’, developed specifically to help with early modern English writing. In order to make the transcribed data more valuable and usable, the team decided to encode the people and places mentioned in the letters – unfunded work on limited resources. This was complicated by obscure references and also by occasional errors in the letters (e.g. Bodley states ‘the eldest son is to be married’ when it turns out it was the youngest son – which makes researching the person to which Bodley is referring difficult).

This work was done in the absence of any specific use case. Now Robyn is re-approaching the encoded data as a consumer – to see how they can look for connections in the data to gain new insights.

The data is available on Github at https://github.com/livesandletters/bodley1

written by ostephens

Jun 10

Dr Bernhard Haslhofer from the University of Vienna giving details of their winning entry in the DM2E Open Humanities competition (http://dm2e.eu/open-humanities-award-winners-announced/)

MapHub – a tool that allows you to map a historic map onto the world as it is today. It also supports commenting and semantic tagging.

All user-contributed annotation is published via the Open Annotation API – so MapHub both consumes and produces open data.

Focussing today on semantic tagging in MapHub. As a user enters a full-text comment, MapHub analyses the text, tries to identify matching concepts in Wikipedia (DBpedia), and suggests them. The user can click on a suggested tag to accept it (or click again to reject it). They carried out some research and found no difference in user behaviour, in terms of the number of tags added, between a ‘semantic tagging’ approach (linking each tag to a web resource) and ‘label tagging’ (tags operating as plain text strings).
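
[As a very rough illustration of the suggestion step – this is not MapHub’s actual code, and here Wikipedia’s public ‘opensearch’ API stands in for the DBpedia concept lookup MapHub performs:]

require 'net/http'
require 'json'
require 'uri'

# Suggest candidate tags for a piece of comment text by finding matching
# Wikipedia article titles (a stand-in for MapHub's DBpedia matching)
def suggest_tags(text)
  params = URI.encode_www_form("action" => "opensearch", "search" => text,
                               "limit" => 5, "format" => "json")
  body = Net::HTTP.get(URI("https://en.wikipedia.org/w/api.php?" + params))
  JSON.parse(body)[1]   # the second element of the response is the list of titles
end

puts suggest_tags("Vienna")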

Having found this successful, Bernhard would like to see the same concept applied to other resources – so he is planning to extract the semantic tagging part of MapHub and develop a plugin for Annotorious. He is also going to extend beyond the use of GeoNames and Wikipedia – e.g. to vocabularies expressed in SKOS. The aim is to do this by September 2013.

written by ostephens

Apr 02

Following on from my previous post about BNB and SPARQL, in this post I’m going to briefly describe building a Chrome browser extension that uses the SPARQL query described in that post – which, given a VIAF URI for an author, tries to find authors with the same birth year (i.e. contemporaries of the given author).

Why this particular query? I like it because it exposes data created and stored by libraries that wouldn’t normally be easy to query – the ‘birth year’ for people is usually treated as a field for display, but not for querying. The author dates are also interesting in that they give a range for when a book was actually written, rather than when it was published – the publication date being what is used in most library catalogue searching.

The other reason for choosing this was that it nicely demonstrates how using ‘authoritative’ URIs for things such as people makes the process of bringing together data across multiple sources much easier. Of course whether a URI is ‘authoritative’ is a pretty subjective judgement – based on things like how much trust you have in the issuing body, how widely it is used across multiple sources, how useful it is. In this case I’m treating VIAF URIs as ‘authoritative’ in that I trust them to be around for a while, and they are already integrated into some external web resources – notably Wikipedia.

The plan was to create something that would work in a browser – from a page with a VIAF URI in it (with the main focus being Wikipedia pages), allow the user to find a list of ‘contemporaries’ for the person based on BNB data. I could have done this with a bookmarklet (similar to other projects I’ve done), but a recent conversation with @orangeaurochs on Twitter had put me in mind of writing a browser extension/plugin instead – and especially in this case where a bookmarklet would require the user to already know there was a VIAF URI in the page – it seemed to make sense.

I decided to write a Chrome extension – on a vague notion that Chrome probably had a larger installed base than any browser except Internet Explorer – although later checking Wikipedia stats on browser use showed that Chrome was the most used browser on Wikipedia at the moment anyway – and Wikipedia is my main use case.

I started to look at the Chrome extension documentation. The ‘Getting Started’ tutorial got me up and running pretty quickly, and soon I had an extension running that worked pretty much like a bookmarklet and displayed a list of names from BNB based on a hardcoded VIAF URI. Extensions are basically a set of JavaScript files (with some HTML/CSS for display), so if you are familiar with JavaScript then, once you’ve understood the specific Chrome APIs, you should find building an extension quite straightforward.

I then started to look at how I could grab a VIAF URI from the current page in the browser, and only show the extension action when one was found. The documentation suggested this is best handled using the ‘pageAction’ call. A couple of examples (Mappy (zip file with source code) and Page Action by content (zip file with source code)) and the documentation got me started on this.

Initially I struggled to understand the way different parts of the extension communicate with each other – partly because the code examples above don’t use the simplest (or most up to date) approaches (in general there seems to be an issue with the sample extensions sometimes using deprecated approaches). However the ‘Messaging’ documentation is much clearer and up to date.

The other challenge is parsing the XML returned from the SPARQL query – this would be much easier if I used some additional JavaScript libraries – but I didn’t really want to add a lot of baggage/dependencies to the extension – although I guess many extensions must include libraries like jQuery to simplify specific tasks. While writing this I’ve realised that the BNB SPARQL endpoint supports content negotiation, so it is possible to request JSON as a response format (using Accept: application/sparql-results+json, as per the SPARQL 1.1 specification) – which would probably be simpler and faster – I suspect I’ll re-write shortly to do exactly this.
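
For example, a quick Ruby sketch (rather than the extension’s own JavaScript) of fetching JSON results via content negotiation – assuming the BNB SPARQL endpoint is at http://bnb.data.bl.uk/sparql – might look like this:

require 'net/http'
require 'json'
require 'uri'

# Send a SPARQL query and ask for JSON results via the Accept header
# (the endpoint URL is an assumption - check the BNB documentation)
endpoint = URI("http://bnb.data.bl.uk/sparql")
query = "SELECT ?p ?o WHERE { ?person ?p ?o } LIMIT 10"

request = Net::HTTP::Get.new(endpoint.path + "?" + URI.encode_www_form("query" => query))
request["Accept"] = "application/sparql-results+json"

response = Net::HTTP.start(endpoint.host, endpoint.port) { |http| http.request(request) }
JSON.parse(response.body)["results"]["bindings"].each do |binding|
  puts "#{binding['p']['value']}  #{binding['o']['value']}"
end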

The result so far is a Chrome extension that displays an icon in the address bar when it detects a VIAF URI in the current page. The extension then tries to retrieve results from the BNB. At the moment failure (which can occur for a variety of reasons) just results in a blank display. The speed of the extension also leaves something to be desired, so sometimes you have to wait quite a while for results to appear – which can look like failure. I need to add something to show a ‘working’ status and a definite message on failure, whatever the reason.

A working example looks like this:

Demonstration of Contemporaneous browser extension

Each name in the list links to the BNB URI for the person (which results in a readable HTML display in a browser, but often not a huge amount of data). It might be better to link to something else, but I’m not sure what. I could also display more information in the popup – I don’t think the overhead of retrieving additional basic information from the BNB would be that high. I could also do with generally prettying up the display and putting some information at the top about what is actually being displayed, including the common ‘year of birth’ (this latter would be nice as it would allow easy comparison of the BNB data to any date of birth in Wikipedia).

As mentioned, the extension looks for VIAF URIs in the page – so it works with other sources which do this – like WorldCat:

Demonstration of Contemporaneous extension working with WorldCat.org

While not doing anything incredibly complicated, I think that it gives one example which starts to answer the question “What to do with Linked Data?” which I proposed and discussed in a previous post, with particular reference to the inclusion of schema.org markup in WorldCat.

You can download the extension ready for installation, or look at/copy the source code from https://github.com/ostephens/contemporaneous

written by ostephens

Apr 01

I recently did a couple of workshops for the British Library about data on the web. As part of these workshops I did some work with the BNB data using both the API and the SPARQL endpoint. Having a look and a play with the data got me thinking about possible uses. One of the interesting things about using the SPARQL endpoint directly in place of the API is that you have a huge amount of flexibility about the data you can extract, and the way SPARQL works lets you do in a single query something that might take repeated calls to an API.

So starting with a query like:

SELECT *
WHERE {
<http://bnb.data.bl.uk/id/person/Bront%C3%ABCharlotte1816-1855> ?p ?o
}

This query finds triples about “Charlotte Brontë”. The next query does the same thing, but uses the fact that the BNB makes (where possible) ‘sameAs’ statements about BNB URIs to the equivalent VIAF URIs:

PREFIX owl:  <http://www.w3.org/2002/07/owl#>
SELECT ?p ?o
WHERE {
?person owl:sameAs <http://viaf.org/viaf/71388025> .
?person ?p ?o
}

This query first finds the BNB resource which is ‘sameAs’ the VIAF URI for Charlotte Brontë (which is http://bnb.data.bl.uk/id/person/Bront%C3%ABCharlotte1816-1855) – this is done by:

?person owl:sameAs <http://viaf.org/viaf/71388025>

The result of this query is one URI (or potentially more than one, although not in this particular case), which is then used in the next part of the query:

?person ?p ?o

This form of the query is slightly wider, in that it allows for the possibility of more than one BNB resource being identified as ‘sameAs’ the VIAF URI for Charlotte Brontë (although in actual fact there is only one).

Taking the query a bit further, we can find the date of birth for Charlotte Brontë:

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX bio:  <http://purl.org/vocab/bio/0.1/>
SELECT ?dob
WHERE {
?person owl:sameAs <http://viaf.org/viaf/71388025> .
?person bio:event ?event .
?event rdf:type bio:Birth .
?event bio:date ?dob
}

The ‘PREFIX’ statements just set up a shorthand for the query – rather than having to type out the whole URI each time, I can use the specified prefix as an equivalent to the full URI. That is:

PREFIX bio:  <http://purl.org/vocab/bio/0.1/>
?person bio:event ?event

is equivalent to

?person <http://purl.org/vocab/bio/0.1/event> ?event

Having got to this stage – the year of birth based on a VIAF URI – we can use this to extend the query to find other people in BNB with the same birth year – the eventual query being:

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX bio:  <http://purl.org/vocab/bio/0.1/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?dob ?name
WHERE {
?persona owl:sameAs <http://viaf.org/viaf/71388025> .
?persona bio:event ?eventa .
?eventa rdf:type bio:Birth .
?eventa bio:date ?dob .
?eventb bio:date ?dob . 
?eventb rdf:type bio:Birth .
?personb bio:event ?eventb .
?personb foaf:name ?name
}

I have to admit I’m not sure if this is the most efficient way of getting the result I want, but it does work – as you can see from the results. What is great about this query is that the only input is the VIAF URI for an author. We can substitute the one used here for any VIAF URI to find people born in the same year as the identified person (as long as they are in the BNB).
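
To make that substitution concrete, a small Ruby sketch that builds the query for any given VIAF URI might look like this (the function name is just illustrative):

# Build the 'contemporaries' query for any VIAF URI by dropping it into the
# query template - viaf_uri is the only input
def contemporaries_query(viaf_uri)
  <<-SPARQL
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX bio:  <http://purl.org/vocab/bio/0.1/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?dob ?name
WHERE {
  ?persona owl:sameAs <#{viaf_uri}> .
  ?persona bio:event ?eventa .
  ?eventa rdf:type bio:Birth .
  ?eventa bio:date ?dob .
  ?eventb bio:date ?dob .
  ?eventb rdf:type bio:Birth .
  ?personb bio:event ?eventb .
  ?personb foaf:name ?name
}
  SPARQL
end

puts contemporaries_query("http://viaf.org/viaf/71388025")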

Since VIAF URIs are now included in many relevant Wikipedia articles, I thought it might be fun to build a browser extension that would display a list of ‘contemporaries’ for an author using the BNB data – partly to show how the use of common identifiers can make these things just fit together, partly to try building a browser extension, and partly because I think it is a nice demonstration of the potential uses of data which we have in libraries but often don’t exploit (like the years of birth/death for people).

But since this post has gone on long enough I’ll do a follow up post on building the extension – but if you are interested the code is available at https://github.com/ostephens/contemporaneous

written by ostephens

Mar 07

Lianne Smith from King’s College London Archives

Archives have records/papers from Senior Military personnel (I think I got that right?) [update 10th March 2013: thanks to David Underdown in the comments below for clarifying that Lianne was referring to the Liddell Hart Centre for Military Archives]

The Archives at KCL have been innovative – and the latest project is in the Linked Data space.

Lianne is an archivist, not a technical specialist. 18 months ago she hadn’t heard of linked data – so this is a beginner’s view.

Context: Preparation for centenary of the start of the First World War – other institutions also doing work:

KCL had already contributed to the latter of these.

A report had highlighted the problem of locating resources and searching across information sources – even within institutions. It particularly noted that if content wasn’t surfaced in Google searches it was a lot less ‘discoverable’. There was also a lack of a clear vocabulary for WWI materials – so they wanted to establish a standard approach building on existing thesauri etc.

Trenches to Triples also built on previous projects funded by JISC – particularly the “Open Metadata Pathway” and “Step Change” projects – creating easy-to-use tools which would enable archivists to publish linked open data within normal workflows and without specialist knowledge.

Aims of Trenches to Triples included:

  • Creation of an API to share data created using the Alicat tool (an output of the Step Change project)
  • Adaptation of the catalogue front end for visualisation of linked data entities
  • Creation of a Linked Data WWI vocabulary of personal names, corporate names, place and subject terms – available for reuse by the archives sector
  • Also act as a case study of Linked Data in archives

Within Alicat Places are based on Google Maps data – which ‘makes it simpler’ [although the problem with using contemporary maps when the world changes...] [also note that looking at a Place 'record' on data.aim25.ac.uk there is a GeoNames link - wonder if this is where the data actually comes from? E.g. http://data.aim25.ac.uk/id/place/scapafloworkneyscotland]

Outcomes of project:

  • Creation of WWI dataset and integration into AIM25-UKAT
    • Lessons learned in the creation of the dataset concerning identification of the level of granularity required and the amount of staff time which needs to be invested in preparation
    • Different users have very different requirements in terms of granularity of data
    • Team included WWI specialist academic – identified a good resource for battles – could reuse existing data
    • Different users also use variation of terms in their research
  • More work on the front-end (User facing UI) presentation of additional data
    • Being able to integrate things like maps into UI is great
    • Need to work more on what sort of information you want to communicate about entities – especially things like names – unlike location, where a map is an obvious addition
  • Need to increase the availability of resources as linked data
  • Need to increase understanding and training in the archives sector
    • This approach is hugely reliant on understanding of the data – need archivists involved
  • Need ongoing collaboration from the LODLAM community in agreeing standards for Linked Data adoption

written by ostephens

Mar 07

Aleks Drozdov – enterprise architect for the Discovery system at The National Archives (TNA). Going to speak about APIs and data and how they are implemented in the Discovery system at TNA.

My Introduction to APIs post is relevant to this talk.

API and Data

An API is an Application Programming Interface. Web API – in a web context the API is typically defined as a set of messages over HTTP, with response messages usually in XML or JSON format.

Data – explosion in amount of data available. Common to ‘mashup’ (combine) data from a number of sources. Also User contributed data.

Discovery Architecture

At the base is an ‘Object Data Store’ – a NoSQL object-oriented database (MongoDB)

Getting data into Discovery

Vast number of different formats feeding into Discovery:

XML, RDBMS, text, spreadsheets etc. These go through a complex/sophisticated data normalisation process and are then fed into MongoDB – the Object Data Store

Discovery data structure

Discovery treats all things as ‘informational assets’ – you can build hierarchies via links between assets

http://discovery.nationalarchives.gov.uk/SearchUI/details?Uri=C10127419

The last number here is a unique and persistent identifier for an information asset [not clear what level this is]

Discovery API examples

Documentation at http://discovery.nationalarchives.gov.uk/SearchUI/api.htm

API endpoint at: http://discovery.nationalarchives.gov.uk/DiscoveryAPI

Just 6 calls supported (see http://discovery.nationalarchives.gov.uk/SearchUI/api.htm)

Can specify xml or json as format for response: http://discovery.nationalarchives.gov.uk/DiscoveryAPI/xml/ or http://discovery.nationalarchives.gov.uk/DiscoveryAPI/json

Search: http://discovery.nationalarchives.gov.uk/DiscoveryAPI/xml/search/{page}/query= or http://discovery.nationalarchives.gov.uk/DiscoveryAPI/json/search/{page}/query=

30 results per page

e.g. http://discovery.nationalarchives.gov.uk/DiscoveryAPI/json/search/1/query=C%20203

See documentation at http://discovery.nationalarchives.gov.uk/SearchUI/api.htm for details of other calls.
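
[A minimal sketch of calling the search endpoint from Ruby – the URL pattern is taken from the documentation above, but I haven’t checked the shape of the response, so this just prints the raw JSON:]

require 'net/http'
require 'uri'

# Search Discovery for 'C 203', asking for page 1 of results in JSON
url = URI("http://discovery.nationalarchives.gov.uk/DiscoveryAPI/json/search/1/query=C%20203")
puts Net::HTTP.get(url)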

Next steps

They now have the Discovery platform and are getting people to use the API – next they plan to build a Data Import API, so that external data can be brought into the Discovery platform. They also want to build a User Participation API.

written by ostephens

Mar 07

Jenny Bunn from UCL starting with a summary of history of archival description standards – from USMARC AMC (1977) to ISAD(G) (1st edition formally published 1994).

Meanwhile WGSAD in the US published ‘Standards for Archival Description: A Handbook’ – also in 1994. It contains a wide variety of standards relevant to archives – from technical standards to the Chicago Manual of Style.

EAD has its origin in encoding the Finding Aid – not in modelling archive data per se. EAD v1.0 was released in 1998.

Also a mention of ISO23081 – metadata for Records (records management)

Bunn suggests that ISAD(G) is designed for information exchange – not for archival description. Specifically, ISAD(G) doesn’t discuss the authenticity of records. At this point (says Bunn) ISAD(G) is more a straitjacket than an enabler.

Call to action – move to ‘meaning’ vs information exchange in standards.

Point from Jane Stevenson that ISAD(G) is not that great for information exchange! But Jenny makes the point that as a schema it could serve the purpose – the lack of a content standard is a barrier to information exchange even within ISAD(G).

written by ostephens

Mar 07

James Davies talking about Google Cultural Institute.

As Google grew in size, it increased in scope. It encouraged employees to follow their passions. If you get a dozen people in a room you’ll find at least one is passionate about art – in Google a set of people interested in art, galleries and museums got together to find a way of making this content available – this became the Google Art Project.

At the time James was at the Tate – and got involved in the Google Art Project – and was impressed by how the Google team listened to the expertise in the gallery. Now he has moved to Google – talking about various projects, including the Nelson Mandela Centre of Memory – http://archives.nelsonmandela.org.

Second project from Google in this area – the Cultural Institute – aims to work with a variety of organisations including archives. Finding a way of creating an ‘online exhibit’ – the Nelson Mandela site is an example of this – combining archival material with text from curators/archivists to tell a story. You can then jump into an item in the archive. From the exhibit you can access items – the example here is a letter from Nelson Mandela to his daughters – very high resolution by the look of it.

Forming a digestible narrative is key to exhibit format.

Romanian archive – includes footage from the revolution – TV broadcast at the time when revolutionaries took over TV studios.

James says archives are about people – a plea to use stories to assert the value of archives in order to protect them.

Into Q&A:

Q: Why not use the ‘archive’ as a metaphor – ‘unboxing’ is the most exciting part of the archives experience and this is lost in the ‘exhibit’ format

A: But that is because you know what you are looking for – the

Q: (from me) Risks of doing this in one place – why build a platform rather than distributing tools so archives can do this work themselves.

A: First step – part of that will be about distributing tools and approaches. Syndicating use of platform (as in Nelson Mandela site) is first step in this direction. Future steps could include distributing tools. Emphasised they didn’t want to be ‘hoarders’ of content.

Q: How to make self-navigation of archives easier for novice users

A: Hope that this will come

Several questions around the approach of ‘exhibits’ and ‘narratives’ – a feeling that this ignores the depth of the archive. Generally the answer is that this is a way of presenting content – it enables discovery of some content, and gives a place for people to ‘enter’ the archive – and from there explore more deeply.

Lots of concern from the floor and on Twitter that this selective approach is at the expense, or to the detriment, of the larger collection.

written by ostephens