TILT: Text to Image Linking Tool

This blog post was written during a presentation at the British Library Labs Symposium in November 2014. It is likely full of errors and omissions, having been written in real time.

Why this tool?

Libraries contain thousands of literary & documentary artefacts up to 4000 years old. How can these be brought effectively to a modern audience?

“Images on their own are dull” – browse interfaces tend not to give the user much information. Even at the page image level, it can be difficult to make sense of what you are seeing.

One approach is to put the text on top of an image:
* Correlates words in image/text
* can be searched but…
* only works with OCR
* if text has errors, hard to fix
* text can’t be formatted or annotated

Another approach is to put the text next to the image:
* format text for different devices
* can annotate text for study
* easy to verify and edit
* BUT
* must keep image and text in sync
* increases mental effort to find corresponding words in the text/image

If you are going to link text to the image of the text, what level should you do this at?
* link at page level – useful but too coarse; doesn’t reduce mental effort much
* link at line level
* link at word level

Word level is probably most desirable, but how to achieve it?
Manual approach:
* Manually draw shapes around words
* link them to the text by adding markup to the transcription
* BUT
* tedious & expensive
* markup gets complex
* end up needing multiple transcriptions

TILT approach:
* find words in an image without recognising their content
* use an existing transcript of the page content
* link these two components mostly automatically

Design:

The TILT web-based GUI sends the page image and the text (or HTML) up to the TILT Service, and the service sends GeoJSON back down to the GUI.

First you have to prepare the image in the GUI – identifying the different parts of the text.
Stages:
* Colour to greyscale
* Greyscale to Black and White
* Find lines
* Find word shapes
* link word-shapes to text

Recognising words is a challenge:
* (in most languages) Words are blocks of connected pixels with small gaps between them
* But if there are 300 words on a page, are the 299 largest gaps always between words? (see the sketch below)
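
To make the heuristic concrete, here is a minimal sketch (my own illustration, not TILT's actual code) of splitting a line into an expected number of words by treating the widest gaps between connected pixel blocks as word boundaries:

```python
# Minimal sketch of the gap heuristic described above (not TILT's actual code).
# 'blocks' are (start_x, end_x) ranges of connected pixel blocks on one line,
# left to right; the (word_count - 1) widest gaps are taken as word boundaries.

def split_into_words(blocks, word_count):
    if word_count <= 1 or len(blocks) <= 1:
        return [blocks]
    # gap i sits between block i and block i + 1
    gaps = [(blocks[i + 1][0] - blocks[i][1], i) for i in range(len(blocks) - 1)]
    # indices of the widest (word_count - 1) gaps, restored to left-to-right order
    boundaries = sorted(i for _, i in sorted(gaps, reverse=True)[:word_count - 1])
    words, start = [], 0
    for b in boundaries:
        words.append(blocks[start:b + 1])
        start = b + 1
    words.append(blocks[start:])
    return words

# Four blocks, two expected words: the split happens at the widest gap.
print(split_into_words([(0, 10), (12, 20), (30, 42), (44, 50)], 2))
```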

How to represent word shapes? Simple polygons do the trick
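
The design above names GeoJSON as the output format. Purely as an illustration (the exact schema TILT uses isn't given here), a word shape might be carried as a polygon in pixel coordinates with properties linking it to a span of the transcription:

```python
# Illustrative only: a word shape as a GeoJSON-style Polygon feature in pixel
# coordinates, with hypothetical properties linking it to the transcription.
word_shape = {
    "type": "Feature",
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[120, 310], [184, 310], [184, 338], [120, 338], [120, 310]]],
    },
    "properties": {"offset": 42, "length": 7},  # hypothetical text offset/length
}
```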

TILT measures the width of the word shapes in the image and tries to match these against the lengths of words in the transcription – so if the word shapes have not been recognised correctly, the matching algorithm simply selects more or less text in the transcription.
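
A rough illustration of that idea (a simplification of my own, not TILT's algorithm): give each word shape a slice of the transcription proportional to its share of the total measured width, so an over- or under-sized shape simply takes more or less text.

```python
# Simplified width-based alignment (not TILT's algorithm): each word shape gets
# a slice of the transcription proportional to its share of the total width.

def align_by_width(shape_widths, transcription):
    total = sum(shape_widths)
    spans, pos = [], 0
    for i, width in enumerate(shape_widths):
        if i == len(shape_widths) - 1:
            end = len(transcription)  # the last shape takes whatever remains
        else:
            end = pos + round(width / total * len(transcription))
        spans.append(transcription[pos:end].strip())
        pos = end
    return spans

# Three word shapes of different widths matched against a short transcription.
print(align_by_width([52, 90, 61], "the quicker brown"))  # ['the', 'quicker', 'brown']
```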

Now looking at the use of ‘anchor points’ in the text, which allow the user to identify the start and end of ‘clean’ text in a larger manuscript that might have messy sections that can’t be handled automatically. This allows you to do what you can automatically, and only deal with the messy bits manually.

Still working on a GUI to work with

Code on GitHub

Big Data, Small Data & meaning

This blog post was written during a presentation at the British Library Labs Symposium in November 2014. It is likely full of errors and omissions, having been written in real time.

Tim Hitchcock (University of Sussex) opening the British Library Labs Symposium

BL Labs is a unique project run on a shoestring budget that emphasises the role of the library in the digital age. “The library creates a space in which the intellectual capital of the library is made available”. Libraries and museums are not just valued because they are ancient, but because they preserve memories. BL Labs provides a space for experimentation and explores the role of the library in digital research.

Interest in Big Data can obscure the use of Small Data. If we just focus on the very large, and ignore the very small, we miss stuff. A ‘macroscope’ (coined by Piers Anthony!). The idea of a ‘macroscope’ in Digital Humanities comes from Katy Borner in her paper on “Plug-and-Play Macroscopes” – a visualisation tool that allows data to be viewed at scale, but also supports drilling down into individual data points in context. Tim sees this paper as the trigger for interest in the concept in Digital Humanities.

Paper Machines (http://metalab.harvard.edu/2012/07/paper-machines/) – builds on Zotero to allow the user to build their own ‘Google Books’ + ‘Google Earth’. Not problem free, but the concept is of allowing you to look at the large and small scale. Jo Guldi and David Armitage “The History Manifesto” – they say that ‘micro-history’ is irrelevant. Tim says they only want to look at the large scale – disregarding the other side of the ‘macroscope’.

Scott Weingart – The Historian’s Macroscope : Big Digital History. Blog post “The moral role of DH in a data-driven world”. Weingart advocates ‘network analysis’ and Tim finds him convincing. Weingart makes a powerful case (in Tim’s view) for the use of network analysis and topic mapping as a means through which ‘history can speak to power’ – similar aim to Guldi and Armitage.

Is this an attempt to present humanities as somehow ‘equal’ to STEM? Moving humanities towards social science.

Jerome Dobson “Through the Macroscope: Geography’s View of the World” – using GIS. Again, pulls humanistic enquiry towards social science.

Ben Schmidt ‘prochronisms’ – looking at language use in artistic works purporting to represent historic reality. Looks at the individual words used, then maps against corpuses such as Google Books etc. Although not described as a ‘macroscope’ approach, in Tim’s view it perhaps comes the closest to using the full extent of what a macroscope is meant to achieve. Tim highlights how analysis of the scripts for Mad Men shows an arc across the series from the (overstated) externalised masculinity of the 1950s, to the (overstated) internalised masculinity of the 1970s.

Tim sees the prochronisms project as interesting because it seems to be one of the few projects in this space that reveals new insights, which many of the big data projects do not [I feel this is neglecting the point that sometimes the ‘obvious’ needs supporting with evidence].

What are the humanities? Tim sees the excitement and power of the humanities in placing detail into a fabric of understanding. A close and narrow reading of history is needed to get to the detail of the excluded parts of society – these are not reflected at the massive scale.

What Tim sees as missing from modern macroscope projects is looking at detail, at the individual, at the particular. We risk losing the ability to use fine detail.

Tim was recently involved in a project to look at trial transcripts from the Old Bailey, looking at the language used to represent violence and how that related to trial outcomes and sentences. In STEM, ‘cleaning data’ is a chore; they are interested in the ‘signal’ that makes its way through the noise. There is an assumption that ‘big data’ lets you get away with ‘dirty data’.

Humanists read dirty data and are interested in its peculiarities. Tim sees the most urgent need for tools to do close reading of small data. Not a call to ignore the digital, but a call to remember the tools that allow us to focus on the small.

When we read historical texts, how do we know what the contemporary reader would have known – when words have a new or novel use.

A single message from this talk – we need ‘radical contextualisation’. Every gesture contextualised in knowledge of all gestures ever made. This is not just ‘doing the same thing’. Tim’s favourite fragment of meaning is from linguistics – ‘voice onset timing’ – encompasses the gap between when you open your mouth to speak and the first sound. This changes depending on the type of interaction – by milliseconds. The smallest pause has something to say about the type of interaction that is happening.

Tim would like to see this level of view for the humanities – so we can see in every chisel stroke information about the wood in which it is imprinted, the tool that made the stroke and the person who wielded the tool.

Comments from floor:
* Even in science ‘big data’ not relevant in many cases
* Linguistic scholars have been working on this for years – e.g. the Oxford Historical Thesaurus – we need to be wary of re-inventing the wheel
* Textual scholars can learn a huge amount from the kind of close reading that is applied to museum objects
* It is relatively hard to get large-scale information about collections
* Online auction catalogues –

Information Integration: Mash-ups, APIs and the Semantic Web

Over the last couple of years, the British Library have been running a set of internal courses on digital skills for librarians. As part of this programme I’ve delivered a course called “Information Integration: Mash-ups, APIs and the Semantic Web”, and thought it would be good to share the course structure and materials in case they were helpful to others. The course was designed to run in a 6 hour day, including two 15 minute coffee breaks and a one hour lunch break.

The materials here were developed by me (Owen Stephens, owen@ostephens.com) on
behalf of the British Library. Unless otherwise stated, all images, audio or video content included in these materials are separate works with their own licence, and should not be assumed to be CC-BY in their own right.

The slidedecks are made available as a PDF containing slides and speaker notes. The speaker notes are just prompts and not full notes. Where I can speak to the slide without the prompts, there are no speaker notes.

The materials are licensed under a Creative Commons Attribution 4.0 International License http://creativecommons.org/licenses/by/4.0/. It is suggested when crediting this work, you include the phrase “Developed by Owen Stephens on behalf of the British Library”.

The sessions were as follows (session length in brackets):

* Data on the web (30m) – A presentation describing the web, including the concept of “a web of documents” and how “data” is different. Materials: 1 Data and the web CC-BY slide deck (pdf)
* What is a mashup (30m) – A presentation describing ‘mashups’ drawing on examples from music, video and data. Materials: 2 What is a mashup CC-BY slide deck (pdf)
* Mashups in libraries and the scholarly domain (15m) – A short session giving examples of mashups of specific relevance to libraries. Materials: 3 Mashups in libraries CC-BY slidedeck (pdf)
* APIs: what they are and how they are used in digital scholarship (30m) – An introduction to the idea of an API – what one is, and what you might do with one, including examples of the use of APIs by digital scholars. Materials: 4 APIs CC-BY slidedeck (pdf)
* Using an API – hands on exercise (1h) – A hands-on exercise to retrieve data over an API, extract a specific part of the data retrieved and display the result. Google Spreadsheets will be used to do this. There are further exercises and challenges in the documentation which can be used if time allows. Materials: http://www.meanboyfriend.com/overdue_ideas/2014/10/using-an-api-hands-on-exercise/
* Mashing up your collections (45m) – Discussion session in which participants will be asked to consider what opportunities there might be to build on collections under their responsibility, potentially using data from collections in combination with data and services from elsewhere. No materials
* A brief introduction to Linked Data (45m) – A presentation covering the fundamentals of Linked Data. The focus is on the use of URIs to identify things. Examples are given of how Linked Data is being used in the library sector, and more broadly in digital scholarship. Materials: 6 Linked Data CC-BY slidedeck (pdf)
* Review of the day (15m) – A recap of the material covered during the day. No materials

Using an API: a hands on exercise

This Introduction to APIs was developed by Owen Stephens (owen@ostephens.com) on behalf of the British Library.

This work is licensed under a Creative Commons Attribution 4.0 International License
http://creativecommons.org/licenses/by/4.0/.

It is suggested when crediting this work, you include the phrase “Developed by Owen Stephens on behalf of the British Library”

Exercise 1: Using an API for the first time

Introduction

In this exercise you are going to use a Google Spreadsheet to retrieve records from an API to Flickr, and display the results.

The API you are going to use simply allows you to submit some search terms and get a list of results in a format called RSS. You are going to use a Spreadsheet to submit a search to the API, and display the results.

Understanding the API

The API you are going to use is an interface to Flickr. Flickr has a very powerful API with lots of functions, but for simplicity in this exercise you are just going to use the Flickr RSS feeds, rather than the full API which is more complex and requires you to register.

Before you can start working with the API, you need to understand how it works. To do this, we are going to look at an example URL:

https://api.flickr.com/services/feeds/photos_public.gne?tags=food&format=rss

The first part of the URL is the address of the API. Everything after the ‘?’ makes up the ‘parameters’ which form the input to the API. There are two parameters listed and each consists of the parameter name, followed by an ‘=’ sign, then a value.

The URL and parameters break down like this:

  • https://api.flickr.com/services/feeds/photos_public.gne – the address of the API
  • tags=food – the ‘tags’ parameter contains a list of tags (separated by commas) used to filter the list of images returned by the API. In this case just the single tag ‘food’ is listed.
  • format=rss – the ‘format’ parameter specifies the format in which the results are returned. In this case RSS.
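
For comparison only (the exercise itself uses a spreadsheet), the same call can be built and sent with a few lines of Python using just the standard library; the ‘tags’ and ‘format’ parameter names are the ones described above:

```python
# Sketch: build the Flickr feed URL from its parameters and fetch the RSS.
from urllib.parse import urlencode
from urllib.request import urlopen

base = "https://api.flickr.com/services/feeds/photos_public.gne"
params = {"tags": "food", "format": "rss"}
url = base + "?" + urlencode(params)  # -> ...photos_public.gne?tags=food&format=rss
print(url)

with urlopen(url) as response:
    rss = response.read().decode("utf-8")  # the raw RSS/XML returned by the API
print(rss[:200])  # first few hundred characters, just to show something came back
```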

Going Further

If you want to find out more about the API being used here, documentation is available at:

https://www.flickr.com/services/feeds/

There are 8 different types of feed available which are documented here. Click on each one to see what parameters it takes.
The output of the API is displayed in the browser – this is an RSS feed – it would plug into any standard RSS reader. It is also valid XML.

While the XML is not the nicest thing to look at, it should be possible to find lines that look something like:


<item>
  <title>Italian Crostata with cherry jam VI</title>
  <link>https://www.flickr.com/photos/91554579@N02/13058384463/</link>
  <description>
    <p><a href="https://www.flickr.com/people/91554579@N02/">Lili B. Capaccetti</a> posted a photo:</p>
    <p><a href="https://www.flickr.com/photos/91554579@N02/13058384463/" title="Italian Crostata with cherry jam VI"><img src="https://farm3.staticflickr.com/2158/13058384463_b3a0416677_m.jpg" width="160" height="240" alt="Italian Crostata with cherry jam VI" /></a></p>
  </description>
</item>

Each result the API returns is called an ‘item’. Each ‘item’ at minimum will have a ‘title’ and a ‘link’. In this case the link is to the image in the Flickr interface.

The key things you need to know to work with this API are:

  • The address of the API
  • The parameters that the API accepts as input
  • The format(s) the API provides as output

Now you’ve got this information, you are ready to start using the API.

Using the API

To use the API, you are going to use a Google Spreadsheet. Go to http://drive.google.com, log in to your Google account and create a Google Spreadsheet.

The first thing to do is build the API call (the query you are going to submit to the API).

First some labels:

In cell A1 enter the text ‘API Address’
In cell A2 enter the text ‘Tags’
In cell A3 enter the text ‘Format’
In cell A4 enter ‘API Call’
In cell A5 enter ‘Results’

Now, based on the information we were able to obtain by understanding the API we can fill values into column B as follows:

In cell B1 enter the address of the API
In cell B2 enter a simple, one word tag
In cell B3 enter the text ‘rss’ (omitting the inverted commas)

The first three rows of the spreadsheet should look something like this (with whatever tag you’ve chosen to search for in B2):

handson-fig1


You now have all the parameters you need to build the API call. To do this you want to create a URL very similar to the one you looked at above. You can do this using a handy spreadsheet function/formula called ‘concatenate’ which allows you to combine the contents of a number of spreadsheet cells with other text.

In Cell B4 type the following formula:

=concatenate(B1,"?","tags=",B2,"&format=",B3)

This joins the contents of cells B1, B2 and B3 with the text included in inverted commas in the formula. Once you have entered this formula and pressed enter, your spreadsheet should look like:

handson-fig2

The final step is to send this query, and retrieve and display the results. This is where the fact that the API returns results as an RSS feed comes in extremely useful. Google Spreadsheets has a special function for retrieving and displaying RSS feeds.

To use this, in Cell B5 type the following formula:

=importFeed(B4)

Because Google Spreadsheets knows what an RSS feed is, and understands that it will contain one or more ‘items’ with a ‘title’ and a ‘link’, it will do the rest for us. Hit enter, and see the results.
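
If you ever want to do the same thing outside a spreadsheet, a rough Python equivalent of importFeed is sketched below using the third-party feedparser library (an aside, not part of the exercise):

```python
# Rough Python equivalent of the spreadsheet's importFeed (requires
# 'pip install feedparser'); each RSS item has at least a title and a link.
import feedparser

url = "https://api.flickr.com/services/feeds/photos_public.gne?tags=food&format=rss"
feed = feedparser.parse(url)
for item in feed.entries:
    print(item.title, item.link)
```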

Congratulations! You have built an API query, and displayed the results.

You have:
* Explored an API for Flickr
* Seen how you can ‘call’ the API by adding some parameters to a URL
* Understood how the API returns results in RSS format
* Used this knowledge to build a Google Spreadsheet which searches for a tag on Flickr and displays the results

Going Further

Further parameters that this API accepts are:

  • id
  • ids
  • tagmode
  • format
  • lang

These are documented at https://www.flickr.com/services/feeds/docs/photos_public/. When adding parameters to a URL, you use the ‘&’ sign between each parameter e.g.

https://api.flickr.com/services/feeds/photos_public.gne?tags=food&id=23577728@N07

This searches for all photos tagged with ‘food’ from a specific user (user id = 23577728@N07)

By adding a row to the spreadsheet for this parameter, and modifying the ‘concatenate’ statement that builds the API Call, can you make the spreadsheet return only images with a specific tag from the British Library Flickr collection? (The Flickr ID for the British Library is ‘12403504@N02’)

If you want to know more about the ‘importFeed’ function, have a look at the documentation at http://support.google.com/drive/bin/answer.py?hl=en&answer=155181

Exercise 2: Working with the BNB

Introduction

In Exercise 1, you explored a simple API and displayed the results from an RSS feed. However, an RSS feed contains only minimal information (a result title, a link and a description) and may not tell you a lot about the resource. In this exercise you will see how to deal with more complex data structures by retrieving items from the BNB and extracting information from the XML data.

Exploring the full record data
For this exercise you are going to work with a ‘full record display’ for books from the BNB.

An example URL is http://bnb.data.bl.uk/id/resource/010712074

Following this URL will show a page similar to this:

handson-fig3

This screen displays the information about this item which is available via the BNB API as an HTML page. Note that the URL of the page in the browser address bar is different to the one you clicked on. In the example given here the original URL was:

http://bnb.data.bl.uk/id/resource/010712074

while the address in the browser bar is:

http://bnb.data.bl.uk/doc/resource/010712074

You will be able to take advantage of the equivalence of these two URLs later in this exercise.

While the HTML display works well for humans, it is not always easy to automatically extract data from HTML. In this case the same information is available in a number of different formats, listed at the top right-hand side of the display. The options are:

  • rdf
  • ttl
  • json
  • xml
  • html

The default view in a browser is the ‘html’ version. Offering access to the data in a variety of formats gives choice to anyone working with the API. Both ‘json’ and ‘xml’ are widely used by developers, with ‘json’ often being praised for its simplicity. However, the choice of format can depend on experience, the required outcome, and external constraints such as the programming language or tool being used.

Google Spreadsheet has some built in functions for reading XML, so for this exercise the XML format is the easiest one to use.

XML for BNB items

To see what the XML version of the data looks like, click on the ‘xml’ link at the top right. Note the URL looks like:

http://bnb.data.bl.uk/doc/resource/010712074.xml

This is the same as the URL we saw for the HTML version above, but with the addition of ‘.xml’

XML is a way of structuring data in a hierarchical way – one way of thinking about it is as a series of folders, each of which can contain further folders. In XML terminology, these are ‘elements’ and each element can contain a value, or further elements (not both). If you look at an XML file, the elements are denoted by tags – that is the element name in angle brackets – just as in HTML. Every XML document must have a single root element that contains the rest of the XML.

Going Further

To learn more about XML, how it is structured and how it can be used see this tutorial from IBM:

http://www.ibm.com/developerworks/xml/tutorials/xmlintro/

Can you guess another URL which would also get you the XML version of the BNB record? Look at the URL in the spreadsheet and compare it to the URL you actually arrive at if you follow the link.

The structure of the XML returned by the BNB API has a <result> element as the root element. The diagram below partially illustrates the structure of the XML.

handson-fig4

To extract data from the XML we have to ‘parse’ it – that is, tell a computer how to extract data from this structure. One way of doing this is using ‘XPath’. XPath is a way of writing down a route to data in an XML document.

The simplest type of XPath expression is to list all the elements that are in the ‘path’ to the data you want to extract using a ‘/’ to separate the list of elements. This is similar to how ‘paths’ to documents are listed in a file system.

In the document structure above, the XPath to the title is:

/result/primaryTopic/title

You can use a shorthand of ‘//’ at the start of an XPath expression to mean ‘any path’ and so in this case you could simply write ‘//title’ without needing to express all the container elements.
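
As a tiny self-contained illustration of these two styles of XPath (the XML below is a hand-made fragment shaped like the structure described above, not a real BNB record):

```python
# Illustration of XPath-style paths using Python's ElementTree.
# The XML is a made-up fragment shaped like result > primaryTopic > title.
import xml.etree.ElementTree as ET

xml = """
<result>
  <primaryTopic>
    <title>An example book</title>
    <isbn10>0123456789</isbn10>
  </primaryTopic>
</result>
"""
root = ET.fromstring(xml)                      # root is the <result> element
print(root.find("./primaryTopic/title").text)  # full path from the root
print(root.find(".//title").text)              # '//' shorthand: match at any depth
```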

Going Further

What would the XPath be for the ISBN-10 in this example?
Why might you sometimes prefer to write the path out in full rather than using the ‘//’ shorthand for ‘any path’? Can you think of any possible undesired side effects?

Find out more about XPath in this tutorial:
http://zvon.org/comp/r/tut-XPath_1.html

Using the API

Now that you know how to get structured data for a BNB item, and the structure of the XML used, you can take a list of URLs for books in the BNB and create a spreadsheet to retrieve and display information about each of the books.

Google Spreadsheets has a function called ‘importXML’ which can be used to import XML, and then use XPath to extract the relevant data. In order to use this you need to know the location of the XML to import, and the XPath expression you want to use.

Create a new Google Spreadsheet, and in Column A paste the following list of URLs from the BNB (with the first URL in cell A1):

http://bnb.data.bl.uk/id/resource/009406660
http://bnb.data.bl.uk/id/resource/010055357
http://bnb.data.bl.uk/id/resource/009406743
http://bnb.data.bl.uk/id/resource/010053535
http://bnb.data.bl.uk/id/resource/008418912
http://bnb.data.bl.uk/id/resource/012702152
http://bnb.data.bl.uk/id/resource/009406658
http://bnb.data.bl.uk/id/resource/009097698
http://bnb.data.bl.uk/id/resource/010975194

In order to use this list of URLs to retrieve the XML versions of the records, you’ll need to add ‘.xml’ onto the end of each of them.

The XPath expression you can use is ‘//isbn10’. This will find all the isbn10 elements in the XML.

With these two bits of information you are ready to use the ‘importXML’ function. In cell B1, type the formula:

=importXML(concatenate(A1,".xml"),"//isbn10")

This creates the correct URL with the ‘concatenate’ function, retrieves the XML document, and uses the XPath ‘//isbn10’ to get the content of the element – the 10-digit ISBN.
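
The same retrieval can be sketched outside the spreadsheet. Here is a rough Python version using only the standard library; it assumes, as the ‘//isbn10’ XPath in the exercise suggests, that the BNB XML elements are not namespaced, and that the service is still answering at these URLs:

```python
# Sketch: fetch the XML view of each BNB record and pull out the 10-digit ISBNs.
from urllib.request import urlopen
import xml.etree.ElementTree as ET

urls = [
    "http://bnb.data.bl.uk/id/resource/009406660",
    "http://bnb.data.bl.uk/id/resource/010055357",
]
for url in urls:
    with urlopen(url + ".xml") as response:  # add '.xml' to get the XML version
        root = ET.fromstring(response.read())
    for isbn in root.findall(".//isbn10"):   # equivalent of the '//isbn10' XPath
        print(url, isbn.text)
```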

Congratulations! You have used the BNB API to retrieve XML and extract and display information from it.

You have:
* Understood the URLs you can use to retrieve a full record from the BNB
* Understood the XML used to represent the BNB record
* Written a basic XPath expression to extract information from the BNB record

ArchivesSpace at Edinburgh

Presentation by Scott Renton:

Current Management:

  • Standard ISAD(G), Schema EAD (XML)
  • Laid down by GASHE and NAHSTE projects
  • managed by “CMSyst”, an in-house MySQL/PHP application
  • Data feeds to ArchivesHub, ARCHON, CW Site (EDINA)

Limitations of CMSyst meant they wanted a new system to extend what they could do. They looked at a range of options including:

  • DSpace
  • Vernon (Museums system)
  • Calm/Adlib (commercial system)
  • Archivists’ Toolkit (lacking a UI)
  • ICA ATOM

However – all had shortcomings. Decided to go with ArchivesSpace.

  • ArchivesSpace is actually a successor to Archivists’ Toolkit
  • Support for appropriate standards
  • All archivist functionality in one place
  • Web delivered
  • Open Source
  • Lyrasis network behind development
  • MySQL database
  • Runs under a servlet container such as Jetty or Tomcat
  • Code in JRuby, available on GitHub
  • Four web apps:
    • Public front end
    • Admin area
    • Backend (Communicator)
    • Solr for indexing/search

Migration issues:

  • All data available from CMSyst
  • Exported in EAD
  • Some obstacles getting EAD to map correctly to loaded authorities
  • Some obstacles getting authorities loaded

ArchivesSpace has functionality to link out to digital objects – using the LUNA system – with CSV import of data [not clear which direction data flows here]

ArchivesSpace is popular in the US, but Edinburgh is the first European institution to adopt it.

University of Edinburgh Collections has a new web interface http://collections.ed.ac.uk. ArchivesSpace will be the expression of archives within this collections portal. Archives are surfaced here as Collection Level Descriptions – thousands of collections are covered.

Implementation has been a good collaborative project. Curators, archivists, projects and innovations staff, and digital developers were all involved. Also good support on the mailing list and good collaboration with other institutions.

Now going to look at “CollectionSpace” – which is a sister application for museums.

Archives collections will be available soon (within a couple of weeks) at http://collections.ed.ac.uk/archives

Publications Router

http://broker.edina.ac.uk

Currently takes data from UKPMC, looks for an affiliation statement in text of publication, pushes to appropriate institutional repository based on that affiliation (if the IR has given permission for this to happen).

Uses SWORD to push papers to the IR

Can also browse all the content, search by target repository and author

Also API to the data in the Router.

Currently trying to understand how the HEFCE REF OA mandate impacts on what the Router should do and how it can help institutions deliver to the mandate.

Major change is the requirement for the AAM (author accepted manuscript) – which wasn’t previously a requirement

One of the most important questions was how institutions would like to get AAMs delivered to them – 3 options:

  • i. 3rd party service to push content to your system – 37% of respondents didn’t know if they would want this approach
  • ii. Pull content into your system using an API – 31% don’t know if this would work for them
  • iii. Receive content via email as a file attachment – 43% said ‘OK at a push’ and 30% saw as satisfactory

If there was another solution what would it be:

  • Deposit at publication
  • Academics upload
  • SWORD would be good if institution/author matching is ‘really good’
  • Anything involving minimal reliance on academics updating information
  • Publishers to provide metadata to institution on acceptance
  • Being notified of accepted manuscripts by publishers so then can coordinate with authors

Broke into discussion groups for the following questions:

  • How would you ideally like to receive AAMs (or metadata describing them)?
  • If the Router starts to provide AAMs (or metadata describing them) at acceptance and then later provides metadata for the published version of record (VoR), what are the duplication issues?
  • If you receive multiple copies of the same version, what are the de-duplication issues?
  • What are the main issues holding you back from participating? What help would you require?
  • What is the one most important feature that is essential to you?

Feedback from groups:

  • Pure doesn’t currently have a field for ‘date of acceptance’
  • Better to get the manuscript but the metadata is better than nothing
    • In some cases (especially with AAMs) you may have the manuscript and minimal metadata
  • De-duplication a huge issue – to the extent that there are examples of institutions having to turn off automated feeds from sources such as Scopus (see also Bournemouth yesterday)
  • Getting any kind of notification of acceptance would be a huge step for those working in institutions
    • Getting notification of publication as well would be big step
  • A pull method preferred – may need local processing before you publish
  • ‘extra’ metadata – e.g. corresponding author – would be highly useful – not available from WoS
    • If local system doesn’t have ability to store that metadata then this is a problem
  • Boundary between push/pull is not very clear. E.g. notification is ‘push’
    • Got to be clear about what is being pushed and where to
    • Reluctant to have repository populated without human intervention
    • EPrints and DSpace have a ‘review queue’ function
  • Having more publishers on board with the Router is key – if you don’t have good coverage it’s just one more source
  • Identifiers! If you have identifiers for AAMs (DOIs) it might help with de-duplication
  • If you are confident as an institution that Authors will deposit AAMs then the real issue becomes very much being notified at publication [found this point very interesting – points to a real divide in institutional attitudes about what authors will do]
  • People still trying to work out workflows
  • Maybe a mixed message to researchers/authors if some is automated and some is not

Lessons in Open Access Compliance in HE (LOCH)

LOCH Pathfinder project. Presented by Dominic Tate

Partners – University of Edinburgh, Heriot-Watt, St Andrews

In (very) brief:

The approach is:

  • Managing Open Access payments – including a review of current reporting methods and creation of shareable spreadsheet templates for reporting to funders
  • Using PURE as a tool to manage Open Access compliance, verification and reporting
  • Adapting institutional workflows to pre-empt Open Access requirements and make compliance as seamless as possible for academics


E2E – end to end open access

Valerie McCutcheon talking about the ‘end to end open access’ (or E2E) project.

Project is working with a wide range of institutions types – big/small, geographically dispersed, from ‘vanilla’ systems to more customised.

Manifesting a standard new open access metadata profile, with some implementations, although overall system agnostic:

  • EPrints case study
  • Hydra case study
  • EPrints OA reporting functionality

Generic workshops

  • Early stage – issue identification and solution sharing
  • Embedding future REF requirements
  • Advocacy
  • Late stage – report on findings and identify unsolved issues

Several existing standards bodies/activities

E2E is collecting information on the different metadata that is being used or is needed – Metadata Community Collaboration – a spreadsheet for people to add to

http://e2eoa.org/2014/07/01/working-on-metadata-requirements/

Workshop on 4th September – covering:

  • Current initiatives in Open Access
  • Metadata requirements

Working with other Pathfinder projects

Open Access at UCL and Bournemouth University

Quite different institutions but similarities in publications management at systems level:

  • Both use Symplectic Elements to manage publications and EPrints for IR
  • BU Research and Knowledge Exchange Office manages OA funding and ‘Bournemouth Research Information and Networking’ while the IR (BURO) is managed by the library
  • UCL Library manages both OA funding and publications through the Research Publications Service (RPS – think this is the Symplectic Elements) and Discovery (EPrints)

Publications Management

  • Researchers manage their data via Symplectic, which can also get data from Scopus and Web of Science; the data is then pushed out to profile pages and/or the repository

Institutional Repositories

  • UCL IR (Discovery) has both metadata-only and full-text outputs – 317794 outputs in total – includes 5111 theses
  • Bournemouth only has full-text  – much smaller numbers – 2831 outputs in total – not all public access

Staff support

  • UCL – a Virtual Open Access Team
  • Bournemouth
    • OA Funding – 1 manager
    • No fulltime repository staff
    • Rota of 3 editorial staff, working one week in three on outputs received
    • 0.2 repository administrator
    • 0.2 Repository manager

OA Funding

  • UCL OA funding managed by OA Team in the library
    • Combination of RCUK, UCL and Wellcome funding
    • at least 9000 research pubs per annum
    • RCUK 2013-14 target: 693 papers – successfully processed 796
    • Current level of APC payments >2000 per annum

UCL has many pre-payment agreements in place for APCs

  • BioMed Central
  • Elsevier
  • BMJ Journals
  • RSC
  • IEEE
  • PeerJ
  • Sage
  • PLOS
  • Springer
  • T&F
  • Wiley
  • ubiquity press
  • and more – and hoping to extend further

Pre-payment agreements have been very successful and saved money

Both Bournemouth and UCL have found it challenging to spend all the money available for APCs

Challenges for engagement

  • UCL Discovery
    • Metadata-only outputs – poor quality, not checked, can be entered multiple times
    • Feeds into Symplectic Elements from Scopus and WoS can lead to duplicates: Scopus sometimes has records for pre and post publication and WoS can have a record also – and academics select all three rather than just choosing one of them
    • Academic engagement
    • Difficulty sending large files from RPS (Symplectic) to IR
    • Furious about how h index is calculated in RPS (manual entries aren’t counted, only items from Scopus / WoS)
    • Incorrect search settings in RPS
    • Don’t understand the data harvesting process – user managed to crash the system by entering single word search with common author name
  • Bournemouth BURO
    • 2013 – converted with full-text only
    • Mapping data issues
    • Incorrect publications display on original staff pages
    • Academic staff left thinking BURO no longer existed [think the implication is that it looked like it had been replaced by RPS?]

UCL have very clear requirement for outputs to be deposited in IR – http://www.ucl.ac.uk/library/open-access/ref/

Sheer volume of outputs at UCL is overwhelming

At Bournemouth – advocacy a big issue still (especially since many thought BURO had been discontinued) – but now outputs in BURO and BRIAN must be considered in pay and progression.

Shared challenges

  • Deposit on acceptance
  • Open Access options – making sure academics know what routes of publication are open to them
  • Establishing new workflows
  • Publishers move goalposts, change conditions etc.
  • Flexible support
  • Encouraging champions in Faculties
  • Use the REF2020 as a stick and a carrot for their research

UCL as a whole supports Green OA, but assists academics to meet their requirements through the Gold OA route. UCL feels Gold will still be important for science disciplines

BU – funding will be available and has institutional support – but issues may arise depending on volume in the future

EPrints developments

Sketchy notes on this session: Les Carr updating us on developments in EPrints software / development

What is EPrints for?

  • Supporting researchers
  • Supporting research data
  • Supporting research outputs
  • Supporting research outcomes
  • Supporting research managers

However, want repositories to take on a bigger agenda – publication, data, information, …

To achieve this they are stripping EPrints back to its core data management engine – tuning it for speed, efficiency, scale and flexibility.

EPrints4 is not a rewrite of the software, but making the core as generic as possible – so it can handle all kinds of content

Improved integration with Xapian (for search)

Improving efficiency of information architecture – db transactions, memcached, fine-grained ACLs – can support much bigger repositories.

MVC approach

Can be run as headless service, but comes with a UI

Towards Repository ’16 – OA Compliancy, capture projects/funders data (working with Soton and Reading)

Integrating with other services/systems

  • IRUS-UK (on Bazaar)
  • Publications router
  • ORCID
  • RIOXX
  • WoK/Scopus imports
  • OA end-to-end project (EPrints Services are partners)

Les says “EPrints is moving towards being a de facto CRIS-light system”

Repositories are/need to be collaboration between librarians and developers

No release date for EPrints4 yet – but probably around a year away.