Oct 28

Over the last couple of years, the British Library have been running a set of internal courses on digital skills for librarians. As part of this programme I’ve delivered a course called “Information Integration: Mash-ups, APIs and the Semantic Web”, and thought it would be good to share the course structure and materials in case they were helpful to others. The course was designed to run in a 6 hour day, including two 15 minute coffee breaks and a one hour lunch break.

The materials here were developed by me (Owen Stephens, owen@ostephens.com) on behalf of the British Library. Unless otherwise stated, all images, audio or video content included in these materials are separate works with their own licence, and should not be assumed to be CC-BY in their own right.

The slidedecks are made available as a PDF containing slides and speaker notes. The speaker notes are just prompts and not full notes. Where I can speak to the slide without the prompts, there are no speaker notes.

The materials are licensed under a Creative Commons Attribution 4.0 International License http://creativecommons.org/licenses/by/4.0/. 

It is suggested when crediting this work, you include the phrase “Developed by Owen Stephens on behalf of the British Library”.

 

Session title | Session description | Materials | Length of session
Data on the web | A presentation describing the web, including the concept of “a web of documents” and how “data” is different | 1 Data and the web CC-BY slide deck (pdf) | 30m
What is a mashup | A presentation describing ‘mashups’ drawing on examples from music, video and data | 2 What is a mashup CC-BY slide deck (pdf) | 30m
Mashups in libraries and the scholarly domain | A short session giving examples of mashups of specific relevance to libraries | 3 Mashups in libraries CC-BY slidedeck (pdf) | 15m
APIs: what they are and how they are used in digital scholarship | An introduction to the idea of an API – what one is, and what you might do with one, including examples of the use of APIs by digital scholars | 4 APIs CC-BY slidedeck (pdf) | 30m
Using an API – hands on exercise | A hands-on exercise to retrieve data over an API, extract a specific part of the data retrieved and display the result. Google Spreadsheets will be used to do this. There are further exercises and challenges in the documentation which can be used if time allows. | http://www.meanboyfriend.com/overdue_ideas/2014/10/using-an-api-hands-on-exercise/ | 1h
Mashing up your collections | Discussion session in which participants will be asked to consider what opportunities there might be to build on collections under their responsibility, potentially using data from collections in combination with data and services from elsewhere. | No materials | 45m
A brief introduction to Linked Data | A presentation covering the fundamentals of Linked Data. The focus is on the use of URIs to identify things. Examples are given of how Linked Data is being used in the library sector, and more broadly in digital scholarship | 6 Linked Data CC-BY slidedeck (pdf) | 45m
Review of the day | A recap of the material covered during the day | No materials | 15m

written by ostephens

Oct 27

This Introduction to APIs was developed by Owen Stephens (owen@ostephens.com) on behalf of the British Library.

This work is licensed under a Creative Commons Attribution 4.0 International License http://creativecommons.org/licenses/by/4.0/.

It is suggested when crediting this work, you include the phrase “Developed by Owen Stephens on behalf of the British Library”.

Exercise 1: Using an API for the first time

Introduction

In this exercise you are going to use a Google Spreadsheet to retrieve records from an API to Flickr, and display the results.

The API you are going to use simply allows you to submit some search terms and get a list of results in a format called RSS. You are going to use a Spreadsheet to submit a search to the API, and display the results.

Understanding the API

The API you are going to use is an interface to Flickr. Flickr has a very powerful API with lots of functions, but for simplicity in this exercise you are just going to use the Flickr RSS feeds, rather than the full API which is more complex and requires you to register.

Before you can start working with the API, you need to understand how it works. To do this, we are going to look at an example URL:

https://api.flickr.com/services/feeds/photos_public.gne?tags=food&format=rss

The first part of the URL is the address of the API. Everything after the ‘?’ is a list of ‘parameters’ which form the input to the API. There are two parameters listed and they each consist of the parameter name, followed by an ‘=’ sign, then a value.

The URL and parameters break down like this:

URL Part | Explanation
https://api.flickr.com/services/feeds/photos_public.gne | The address of the API
tags=food | The ‘tags’ parameter – contains a list of tags (separated by commas) to be used to filter the list of images returned by the API. In this case just the single tag ‘food’ is listed.
format=rss | The ‘format’ parameter – specifies the format in which the API returns the results. In this case the format is RSS.
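
If you would like to see the same breakdown done by a program, here is a minimal sketch in Python (Python is not part of the original exercise, and the function name `split_api_call` is made up for this example). It uses the standard library to split the example URL into the address of the API and its parameters.

```python
from urllib.parse import urlsplit, parse_qsl


def split_api_call(url):
    """Return (address, parameters) for an API call URL."""
    parts = urlsplit(url)
    # Everything before the '?' is the address of the API.
    address = f"{parts.scheme}://{parts.netloc}{parts.path}"
    # Everything after the '?' is a list of name=value pairs joined by '&'.
    parameters = dict(parse_qsl(parts.query))
    return address, parameters


address, parameters = split_api_call(
    "https://api.flickr.com/services/feeds/photos_public.gne?tags=food&format=rss"
)
print(address)     # the address of the API
print(parameters)  # the parameter names and their values
```

Running this prints the address on one line and the two parameters (‘tags’ and ‘format’) with their values on the next.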

Going Further

If you want to find out more about the API being used here, documentation is available at:

https://www.flickr.com/services/feeds/

There are 8 different types of feed available which are documented here. Click on each one to see what parameters it takes.

The output of the API is displayed in the browser – this is an RSS feed – it would plug into any standard RSS reader. It is also valid XML.

While the XML is not the nicest thing to look at, it should be possible to find lines that look something like:


<item>
<title>Italian Crostata with cherry jam VI</title>
<link>https://www.flickr.com/photos/91554579@N02/13058384463/</link>
<description><p><a href="https://www.flickr.com/people/91554579@N02/">Lili B. Capaccetti</a> posted a photo:</p>

<p><a href="https://www.flickr.com/photos/91554579@N02/13058384463/" title="Italian Crostata with cherry jam VI"><img src="https://farm3.staticflickr.com/2158/13058384463_b3a0416677_m.jpg" width="160" height="240" alt="Italian Crostata with cherry jam VI" /></a></p></description>
</item>

Each result the API returns is called an ‘item’. Each ‘item’ at minimum will have a ‘title’ and a ‘link’. In this case the link is to the image in the Flickr interface.
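
To make the idea of ‘items’ with a ‘title’ and a ‘link’ concrete, here is a small Python sketch that reads them out of an RSS feed. It is only an illustration: the sample feed is cut down to the bare minimum (the channel title is a placeholder and a real Flickr feed carries more elements per item), and the function name `list_items` is made up for this example.

```python
import xml.etree.ElementTree as ET

# A trimmed-down sample in the shape of an RSS feed. A real feed
# contains more elements (for example a description) for each item.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>Italian Crostata with cherry jam VI</title>
      <link>https://www.flickr.com/photos/91554579@N02/13058384463/</link>
    </item>
  </channel>
</rss>"""


def list_items(feed_xml):
    """Return a (title, link) pair for every item in an RSS feed."""
    root = ET.fromstring(feed_xml)
    return [
        (item.findtext("title"), item.findtext("link"))
        for item in root.iter("item")
    ]


for title, link in list_items(SAMPLE_FEED):
    print(title, "->", link)
```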

The key things you need to know to work with this API are:

  • The address of the API
  • The parameters that the API accepts as input
  • The format(s) the API provides as output

Now you’ve got this information, you are ready to start using the API.

Using the API

To use the API, you are going to use a Google Spreadsheet. Go to http://drive.google.com and log in to your Google account. Create a Google Spreadsheet.

The first thing to do is build the API call (the query you are going to submit to the API).

First some labels:

In cell A1 enter the text ‘API Address’
In cell A2 enter the text ‘Tags’
In cell A3 enter the text ‘Format’
In cell A4 enter ‘API Call’
In cell A5 enter ‘Results’

Now, based on the information we were able to obtain by understanding the API we can fill values into column B as follows:

In cell B1 enter the address of the API
In cell B2 enter a simple, one word tag
In cell B3 enter the text ‘rss’ (omitting the inverted commas)

The first three rows of the spreadsheet should look something like this (with whatever tag you’ve chosen to search in B2):

[Figure: handson-fig1]

You now have all the parameters you need to build the API call. To do this you want to create a URL very similar to the one you looked at above. You can do this using a handy spreadsheet function/formula called ‘Concatenate’ which allows you to combine the contents of a number of spreadsheet cells with other text.

In Cell B4 type the following formula (use plain, straight double quotes, as curly quotes will cause an error):

=concatenate(B1,"?","tags=",B2,"&format=",B3)

This joins the contents of cells B1, B2 and B3 with the text included in inverted commas in the formula. Once you have entered this formula and pressed enter your spreadsheet should look like:

[Figure: handson-fig2]

The final step is to send this query, and retrieve and display the results. This is where the fact that the API returns results as an RSS feed comes in extremely useful. Google Spreadsheets has a special function for retrieving and displaying RSS feeds.

To use this, in Cell B5 type the following formula:

=importFeed(B4)

Because Google Spreadsheets knows what an RSS feed is, and understands it will contain one or more ‘items’ with a ‘title’ and a ‘link’ it will do the rest for us. Hit enter, and see the results.

Congratulations! You have built an API query, and displayed the results.

You have:

  • Explored an API for Flickr
  • Seen how you can ‘call’ the API by adding some parameters to a URL
  • Understood how the API returns results in RSS format
  • Used this knowledge to build a Google Spreadsheet which searches for a tag on Flickr and displays the results

Going Further

Further parameters that this API accepts are:

  • id
  • ids
  • tagmode
  • format
  • lang

These are documented at https://www.flickr.com/services/feeds/docs/photos_public/. When adding parameters to a URL, you use the ‘&’ sign between each parameter e.g.

https://api.flickr.com/services/feeds/photos_public.gne?tags=food&id=23577728@N07

This searches for all photos tagged with ‘food’ from a specific user (user id = 23577728@N07).
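
Outside a spreadsheet, the same joining of parameters with ‘=’ and ‘&’ can be done in a few lines of code. The following is a minimal Python sketch, not part of the original exercise; the function name `build_api_call` is made up for this example.

```python
from urllib.parse import urlencode

API_ADDRESS = "https://api.flickr.com/services/feeds/photos_public.gne"


def build_api_call(address, **parameters):
    """Join an API address and its parameters into a single URL."""
    # urlencode puts '=' between each name and value and '&' between
    # parameters. safe="@" leaves the '@' in a Flickr user id unescaped.
    return address + "?" + urlencode(parameters, safe="@")


print(build_api_call(API_ADDRESS, tags="food", id="23577728@N07"))
```

This prints the example URL shown above. Adding another parameter is just a matter of passing another name and value to the function.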

Can you add a row to the spreadsheet for this parameter, and modify the ‘concatenate’ formula that builds the API Call, so that the spreadsheet only returns images with a specific tag from the British Library Flickr collection? (The Flickr ID for the British Library is ‘12403504@N02’.)

If you want to know more about the ‘importFeed’ function, have a look at the documentation at http://support.google.com/drive/bin/answer.py?hl=en&answer=155181

Exercise 2: Working with the BNB

Introduction

In Exercise 1, you explored a simple API and displayed the results from an RSS feed. However, an RSS feed contains only minimal information (a result title, a link and a description) and may not tell you a lot about the resource. In this exercise you will see how to deal with more complex data structures by retrieving items from the BNB and extracting information from the XML data.

Exploring the full record data

For this exercise you are going to work with a ‘full record display’ for books from the BNB.

An example URL is http://bnb.data.bl.uk/id/resource/010712074

Following this URL will show a page similar to this:

handson-fig3

This screen displays the information about this item which is available via the BNB API as an HTML page. Note that the URL of the page in the browser address bar is different to the one you clicked on. In the example given here the original URL was:

http://bnb.data.bl.uk/id/resource/010712074

while the address in the browser bar is:

http://bnb.data.bl.uk/doc/resource/010712074

You will be able to take advantage of the equivalence of these two URLs later in this exercise.

While the HTML display works well for humans, it is not always easy to automatically extract data from HTML. In this case the same information is available in a number of different formats, listed at the top righthand side of the display. The options are:

  • rdf
  • ttl
  • json
  • xml
  • html

The default view in a browser is the ‘html’ version. Offering access to the data in a variety of formats gives choice to anyone working with the API. Both ‘json’ and ‘xml’ are widely used by developers, with ‘json’ often being praised for its simplicity. However, the choice of format can depend on experience, the required outcome, and external constraints such as the programming language or tool being used.

Google Spreadsheets has some built-in functions for reading XML, so for this exercise the XML format is the easiest one to use.

XML for BNB items

To see what the XML version of the data looks like, click on the ‘xml’ link at the top right. Note the URL looks like:

http://bnb.data.bl.uk/doc/resource/010712074.xml

This is the same as the URL we saw for the HTML version above, but with the addition of ‘.xml’.

XML is a way of structuring data in a hierarchical way – one way of thinking about it is as a series of folders, each of which can contain further folders. In XML terminology, these are ‘elements’ and each element can contain a value, or further elements (XML does allow an element to mix the two, but data-oriented XML generally does not). If you look at an XML file, the elements are denoted by tags – that is the element name in angle brackets – just as in HTML. Every XML document must have a single root element that contains the rest of the XML.

Going Further

To learn more about XML, how it is structured and how it can be used see this tutorial from IBM:

http://www.ibm.com/developerworks/xml/tutorials/xmlintro/

Can you guess another URL which would also get you the XML version of the BNB record? Look at the URL in the spreadsheet and compare it to the URL you actually arrive at if you follow the link.

The structure of the XML returned by the BNB API has a <result> element as the root element. The diagram below partially illustrates the structure of the XML.

[Figure: handson-fig4]

To extract data from the XML we have to ‘parse’ it – that is, tell a computer how to extract data from this structure. One way of doing this is using ‘XPath’. XPath is a way of writing down a route to data in an XML document.

The simplest type of XPath expression is to list all the elements that are in the ‘path’ to the data you want to extract using a ‘/’ to separate the list of elements. This is similar to how ‘paths’ to documents are listed in a file system.

In the document structure above, the XPath to the title is:

/result/primaryTopic/title

You can use a shorthand of ‘//’ at the start of an XPath expression to mean ‘any path’ and so in this case you could simply write ‘//title’ without needing to express all the container elements.
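
As an illustration of the two ways of writing the path, here is a short Python sketch. It is not part of the original exercise and makes some assumptions: the sample record is a hypothetical, heavily simplified stand-in for a BNB record with a placeholder title, and Python's built-in ElementTree library only supports a subset of XPath, with paths written relative to an element rather than starting with ‘/’.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for a BNB record.
# The title is a placeholder, not real data.
SAMPLE_RECORD = """<result>
  <primaryTopic>
    <title>Example title</title>
  </primaryTopic>
</result>"""

root = ET.fromstring(SAMPLE_RECORD)  # root is the <result> element

# Full path /result/primaryTopic/title, written relative to <result>.
full_path = root.find("primaryTopic/title")

# Shorthand //title, written as './/title' in ElementTree.
any_path = root.findall(".//title")

print(full_path.text)
print([element.text for element in any_path])
```

In a document this small both expressions find the same single element.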

Going Further

What would the XPath be for the ISBN–10 in this example?
Why might you sometimes not want to use the shorthand ‘//’ for ‘any path’ instead of writing the path out in full? Can you think of any possible undesired side effects?

Find out more about XPath in this tutorial:
http://zvon.org/comp/r/tut-XPath_1.html

Using the API

You now know how to get structured data for a BNB item, and the structure of the XML used. Given a list of URLs for books in the BNB, you can create a spreadsheet to retrieve and display information about each of the books.

Google Spreadsheets has a function called ‘importXML’ which can be used to import XML, and then use XPath to extract the relevant data. In order to use this you need to know the location of the XML to import, and the XPath expression you want to use.

Create a new Google Spreadsheet, and in Column A paste the following list of URLs from the BNB (with the first URL in cell A1):

http://bnb.data.bl.uk/id/resource/009406660

http://bnb.data.bl.uk/id/resource/010055357

http://bnb.data.bl.uk/id/resource/009406743

http://bnb.data.bl.uk/id/resource/010053535

http://bnb.data.bl.uk/id/resource/008418912

http://bnb.data.bl.uk/id/resource/012702152

http://bnb.data.bl.uk/id/resource/009406658

http://bnb.data.bl.uk/id/resource/009097698

http://bnb.data.bl.uk/id/resource/010975194

In order to use this list of URLs to retrieve the XML versions of the records, you’ll need to add ‘.xml’ onto the end of each of them.

The XPath expression you can use is ‘//isbn10’. This will find all the isbn10 elements in the XML.

With these two bits of information you are ready to use the ‘importXML’ function. Into cell B1, type the formula (use plain, straight double quotes, as curly quotes will cause an error):

=importXml(concatenate(A1,".xml"),"//isbn10")

This creates the correct URL with the ‘concatenate’ function, retrieves the XML document, and uses the XPath ‘//isbn10’ to get the content of the element – the 10 digit ISBN.
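
For comparison, the same three steps (build the URL, retrieve the XML, extract the element) might look like this in Python. This is a rough sketch, not part of the original exercise: the function names are made up, it assumes the BNB service still responds at these addresses and that the element names carry no XML namespace, and the sample record at the end is a placeholder so that the extraction step can be tried without a network connection.

```python
import urllib.request
import xml.etree.ElementTree as ET


def xml_url(record_url):
    """Build the address of the XML version of a BNB record."""
    return record_url + ".xml"


def extract_all(xml_text, element_name):
    """Return the text of every element with this name (like '//name')."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter(element_name)]


def import_xml(record_url, element_name):
    """Fetch the XML for a record and extract the named elements."""
    with urllib.request.urlopen(xml_url(record_url)) as response:
        return extract_all(response.read(), element_name)


# Offline demonstration on a made-up record (structure and ISBN are placeholders).
sample = "<result><primaryTopic><isbn10>0000000000</isbn10></primaryTopic></result>"
print(xml_url("http://bnb.data.bl.uk/id/resource/009406660"))
print(extract_all(sample, "isbn10"))
```

Calling `import_xml("http://bnb.data.bl.uk/id/resource/009406660", "isbn10")` would attempt the live request, in the same way as the spreadsheet formula does.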

Congratulations! You have used the BNB API to retrieve XML and extract and display information from it.

You have:

  • Understood the URLs you can use to retrieve a full record from the BNB
  • Understood the XML used to represent the BNB record
  • Written a basic XPath expression to extract information from the BNB record

written by ostephens

Jul 31

Presentation by Scott Renton:

Current Management:

  • Standard ISAD(G), Schema EAD (XML)
  • Laid down by GASHE and NAHSTE projects
  • managed by “CMSyst” an in-house built MySQL/PHP application
  • Data feeds to ArchivesHub, ARCHON, CW Site (EDINA)

Limitations of CMSyst, wanted a new system to extend what they could do. Looked at a range of options including:

  • DSpace
  • Vernon (Museums system)
  • Calm/Adlib (commercial system)
  • Archivists’ Toolkit (lacking a UI)
  • ICA ATOM

However – all had shortcomings. Decided to go with ArchivesSpace.

  • Actually ArchivesSpace is a successor to Archivists’ Toolkit
  • Support for appropriate standards
  • All archivist functionality in one place
  • Web delivered
  • Open Source
  • Lyrasis network behind development
  • MySQL database
  • Runs under a container system such as Jetty or Tomcat
  • Code in JRuby, available on github
  • Four web apps:
    • Public front end
    • Admin area
    • Backend (Communicator)
    • Solr for indexing/search

Migration issues:

  • All data available from CMSyst
  • Exported in EAD
  • Some obstacles getting EAD to map correctly to loaded authorities
  • Some obstacles getting authorities loaded

ArchivesSpace has functionality to link out to digital objects – using LUNA system – with CSV import of data [not clear which direction data flows here]

ArchivesSpace is popular in US, but Edinburgh the first European institution to take it.

University of Edinburgh Collections has a new web interface http://collections.ed.ac.uk. ArchivesSpace will be the expression of archives within this collections portal. Archives are surfaced here as Collection Level Descriptions – 1000s of collections covered.

Implementation has been a good collaborative project. Got Curators, archivists, projects and innovations staff and digital developers all involved. Also good support on mailing list and good collaboration with other institutions.

Now going to look at “CollectionSpace” – which is a sister application for museums.

Archives collections will be available soon (within a couple of weeks) at http://collections.ed.ac.uk/archives

written by ostephens

Jul 31

http://broker.edina.ac.uk

Currently takes data from UKPMC, looks for an affiliation statement in text of publication, pushes to appropriate institutional repository based on that affiliation (if the IR has given permission for this to happen).

Uses SWORD to push paper to the IR

Can also browse all the content, search by target repository and author

Also API to the data in the Router.

Currently trying to understand how the HEFCE REF OA mandate impacts on what the Router should do and how it can help institutions deliver to the mandate.

Major change is the requirement for AAM – which wasn’t the previous requirement

One of the most important questions was how institutions would like to get AAMs delivered to them – 3 options:

  • i. 3rd party service to push content to your system – 37% of respondents didn’t know if they would want this approach
  • ii. Pull content into your system using an API – 31% don’t know if this would work for them
  • iii. Receive content via email as a file attachment – 43% said ‘OK at a push’ and 30% saw as satisfactory

If there was another solution what would it be:

  • Deposit at publication
  • Academics upload
  • SWORD would be good if institution/author matching is ‘really good’
  • Anything involving minimal reliance on academics updating information
  • Publishers to provide metadata to institution on acceptance
  • Being notified of accepted manuscripts by publishers so they can coordinate with authors

Broke into discussion groups for the following questions:

  • How would you ideally like to receive AAMs (or metadata describing them)
  • If Router starts to provide AAMs (or metadata describing them) at acceptance and then later provide metadata for the published version of record (VoR) – what are the de-duplication issues?
  • If you receive multiple copies of the same version what are the de-duplication issues
  • What are the main issues holding you back from participating? What help would you require?
  • What is the one most important feature that is essential to you?

Feedback from groups:

  • Pure doesn’t currently have a field for ‘date of acceptance’
  • Better to get the manuscript but the metadata is better than nothing
    • In some cases (especially with AAMs) you may have the manuscript and minimal metadata
  • De-duplication a huge issue – to the extent that there are examples of institutions having to turn off automated feeds from sources such as Scopus (see also Bournemouth yesterday)
  • Getting any kind of notification of acceptance would be a huge step for those working in institutions
    • Getting notification of publication as well would be a big step
  • A pull method preferred – may need local processing before you publish
  • ‘extra’ metadata – e.g. corresponding author – would be highly useful – not available from WoS
    • If local system doesn’t have ability to store that metadata then this is a problem
  • Boundary between push/pull is not very clear. E.g. notification is ‘push’
    • Got to be clear about what is being pushed and where to
    • Reluctant to have repository populated without human intervention
    • EPrints and DSpace have a ‘review queue’ function
  • Having more publishers on board with the Router is key – if you don’t have good coverage it’s just one more source
  • Identifiers! If you have identifiers for AAMs (DOIs) it might help with de-duplication
  • If you are confident as an institution that Authors will deposit AAMs then the real issue becomes very much being notified at publication [found this point very interesting - points to a real divide in institutional attitudes about what authors will do]
  • People still trying to work out workflows
  • Maybe a mixed message to researchers/authors if some is automated and some is not

written by ostephens

Jul 30

LOCH Pathfinder project. Presented by Dominic Tate

Partners – University of Edinburgh, Heriot-Watt, St Andrews

In (very) brief:

The approach is:

  • Managing Open Access payments – including a review of current reporting methods and creation of shareable spreadsheet templates for reporting to funders
  • Using PURE as a tool to manage Open Access compliance, verification and reporting
  • Adapting institutional workflows to pre-empt Open Access requirements and make compliance as seamless as possible for academics

 

written by ostephens

Jul 30

Valerie McCutcheon talking about the ‘end to end open access’ (or E2E) project.

Project is working with a wide range of institution types – big/small, geographically dispersed, from ‘vanilla’ systems to more customised.

Manifesting standard new open access metadata profile with some implementation although overall system agnostic:

  • EPrints case study
  • Hydra case study
  • EPrints OA reporting functionality

Generic workshops

  • Early stage – issue identification and solution sharing
  • Embedding future REF requirements
  • Advocacy
  • Late stage – report on findings and identify unsolved issues

Several existing standards bodies/activity

E2E is collecting information on different metadata that is being used or is needed –  Metadata Community Collaboration – spreadsheet for people to add to

http://e2eoa.org/2014/07/01/working-on-metadata-requirements/

Workshop on 4th September – covering:

  • Current initiatives in Open Access
  • Metadata requirements

Working with other Pathfinder projects

written by ostephens

Jul 30

Quite different institutions but similarities in publications management at systems level:

  • Both use Symplectic Elements to manage publications and EPrints for IR
  • BU Research and Knowledge Exchange Office manages OA funding and ‘Bournemouth Research Information and Networking’ while the IR (BURO) is managed by the library
  • UCL Library manages both OA funding and publications through the Research Publications Service (RPS – think this is the Symplectic Elements) and Discovery (EPrints)

Publications Management

  • Researchers manage their data via Symplectic, which can also get data from Scopus and Web of Science, the data is then pushed out to profile pages and/or Repository

Institutional Repositories

  • UCL IR (Discovery) is both metadata only and full-text outputs – 317794 outputs in total – includes 5111 theses
  • Bournemouth only has full-text  – much smaller numbers – 2831 outputs in total – not all public access

Staff support

  • UCL – a Virtual Open Access Team
  • Bournemouth
    • OA Funding – 1 manager
    • No fulltime repository staff
    • Rota of 3 editorial staff, working one week in three on outputs received
    • 0.2 repository administrator
    • 0.2 Repository manager

OA Funding

  • UCL OA funding managed by OA Team in the library
    • Combination of RCUK, UCL and Wellcome funding
    • at least 9000 research pubs per annum
    • RCUK 2013-14 target: 693 papers – successfully processed 796
    • Current level of APC payments >2000 per annum

UCL has many pre-payment agreements in place for APCs

  • BioMed Central
  • Elsevier
  • BMJ Journals
  • RSC
  • IEEE
  • PeerJ
  • Sage
  • PLOS
  • Springer
  • T&F
  • Wiley
  • ubiquity press
  • and more – and hoping to extend further

Pre-payment agreements have been very successful and saved money

Both Bournemouth and UCL have found it challenging to spend all the money available for APCs

Challenges for engagement

  • UCL Discovery
    • Metadata only outputs – poor quality, not checked, can be entered multiple times
    • Feeds into Symplectic Elements from Scopus and WoS can lead to duplicates: Scopus sometimes has records for pre and post publication and WoS can have a record also – and academics select all three rather than just choosing one of them
    • Academic engagement
    • Difficulty sending large files from RPS (Symplectic) to IR
    • Furious about how h index is calculated in RPS (manual entries aren’t counted, only items from Scopus / WoS)
    • Incorrect search settings in RPS
    • Don’t understand the data harvesting process – user managed to crash the system by entering single word search with common author name
  • Bournemouth BURO
    • 2013 – converted with full-text only
    • Mapping data issues
    • Incorrect publications display on original staff pages
    • Academic staff left thinking BURO no longer existed [think implication is that it looked like it had been replaced by RPS?]

UCL have very clear requirement for outputs to be deposited in IR – http://www.ucl.ac.uk/library/open-access/ref/

Sheer volume of outputs at UCL is overwhelming

At Bournemouth – advocacy a big issue still (especially since many thought BURO had been discontinued) – but now outputs in BURO and BRIAN must be considered in pay and progression.

Shared challenges

  • Deposit on acceptance
  • Open Access options – making sure academics know what routes of publication are open to them
  • Establishing new workflows
  • Publishers move goalposts, change conditions etc.
  • Flexible support
  • Encouraging champions in Faculties
  • Use the REF2020 as a stick and a carrot for their research

UCL as a whole supports Green OA, but assists academics to meet their requirements through Gold OA route. UCL feels Gold will still be important to science disciplines

BU – funding will be available and has institutional support – but issues may arise depending on volume in the future

written by ostephens

Jul 30

Sketchy notes on this session: Les Carr updating us on developments in EPrints software / development

What is EPrints for?

  • Supporting researchers
  • Supporting research data
  • Supporting research outputs
  • Supporting research outcomes
  • Supporting research managers

However, want repositories to take on a bigger agenda – publication, data, information, …

To achieve this stripping EPrints back to its core data management engine – tuning it for speed, efficiency, scale and flexibility.

EPrints4 is not a rewrite of the software, but making the core as generic as possible – so it can handle all kinds of content

Improved integration with Xapian (for search)

Improving efficiency of information architecture – db transactions, memcached, fine-grained ACLs – can support much bigger repositories.

MVC approach

Can be run as headless service, but comes with a UI

Towards Repository ’16 – OA Compliancy, capture projects/funders data (working with Soton and Reading)

Integrating with other services/systems

  • IRUS-UK (on Bazaar)
  • Publications router
  • ORCID
  • RIOXX
  • WoK/Scopus imports
  • OA end-to-end project (EPrints Services are partners)

Les says “EPrints moving towards being a de facto CRIS-light system”

Repositories are/need to be collaboration between librarians and developers

No release date for EPrints4 yet – but probably around a year away.

written by ostephens

Jul 30

Original concerns for RIOXX:

  • Primary:
    • How to represent the funder
    • How to represent the project/grant
  • Secondary:
    • How to represent the persistent identifier of the item described
    • Provisions of identifiers pointing to related data sets
    • How to represent the terms of use for an item

Original principles:

  • Purpose driven – Focussed on satisfying RCUK reporting requirements
  • Simple (re-use DC, not CERIF)
  • Generic in scope (don’t tie down to specific types of output)
  • Interoperable – specifically with OpenAIRE
  • Developed openly – public consultation

Has anything changed with RIOXX 2 (mid-2014)?

  • Still purpose driven, but now encompassing HEFCE requirements as well
  • Slightly more sophisticated / complex but still quite simple
  • No longer ‘generic’ – explicit focus on publications
  • No longer seen as a temporary measure – positioned to support REF2020
  • Interoperability still key – and currently working on an OpenAIRE crosswalk

Other changes:

Current status:

  • version 2.0 beta was released for public consultation in June 2014
  • version 2.0 RC 1 has been compiled
  • accompanying guidelines are being written
  • XSD schema has been developed
  • expect full release in late August/early September

Some specific elements:

  • dc:identifier
    • identifies the open access item being described by the RIOXX metadata record, regardless of where it is
    • recommended to identify the resource itself, not a splash page
    • dc:identifier MUST be an HTTP URI
  • dc:relation and rioxxterms:version_of_record
    • rioxxterms:version_of_record is an HTTP URI which is a persistent identifier for the published version of a resource
    • will often be the HTTP URI form of a DOI
  • dc:relation
    • optional property pointing to other material (e.g. dataset)
  • dcterms:dateAccepted
    • MUST be provided
    • more precise than other dated events (‘published’ date very grey area)
  • rioxxterms:author & rioxxterms:contributor
    • MUST be HTTP URIs – ORCID strongly recommended
    • one rioxxterms:author per author
    • rioxxterms:contributor is for parties that are not authors but credited with some contribution to publication
  • rioxxterms:project
    • joins funder and project in one, slightly more complex, property
    • The use of funder IDs (DOIs in their HTTP URI form) from FundRef is recommended, but other ID schemes can be used and name can be used
  • license_ref
    • adopted from NISO’s Open Access Metadata and Indicators
    • takes an HTTP URI and a start date
    • URI should identify a license
      • there is work under way to create a ‘white list’ of acceptable licenses
    • embargoes can be expressed by using the ‘start date’ to show the date on which the license takes effect
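
To make a few of the rules in these notes concrete, here is a rough Python sketch that checks a record against them. It is only an illustration of the notes above, not the official RIOXX schema or tooling: real RIOXX records are XML, whereas this uses a plain dictionary whose layout (including the ‘start_date’ key) is made up for the example, and the URIs are example.org placeholders.

```python
from datetime import date


def is_http_uri(value):
    """True if the value looks like an HTTP(S) URI."""
    return isinstance(value, str) and value.startswith(("http://", "https://"))


def check_record(record):
    """Return a list of problems found in a RIOXX-style record (a plain dict)."""
    problems = []
    if not is_http_uri(record.get("dc:identifier")):
        problems.append("dc:identifier must be an HTTP URI")
    if not record.get("dcterms:dateAccepted"):
        problems.append("dcterms:dateAccepted must be provided")
    for author in record.get("rioxxterms:author", []):
        if not is_http_uri(author):
            problems.append(f"author is not an HTTP URI: {author}")
    return problems


def under_embargo(record, today):
    """True if the licence start date is still in the future."""
    start = record.get("license_ref", {}).get("start_date")
    return start is not None and today < start


# Placeholder record: every value here is invented for illustration.
record = {
    "dc:identifier": "https://example.org/repository/item/1.pdf",
    "dcterms:dateAccepted": date(2014, 7, 1),
    "rioxxterms:author": ["https://example.org/people/author-1"],
    "license_ref": {
        "uri": "https://example.org/licences/open",
        "start_date": date(2015, 1, 1),
    },
}

print(check_record(record))
print(under_embargo(record, date(2014, 7, 30)))
```

A record with no date of acceptance, or with an author given as a plain name rather than a URI, would come back with one problem listed for each.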

Funding of development of RIOXX as application profile now at an end, but funding for further developments (e.g. s/w development for repositories etc.)

RIOXX is endorsed by both RCUK and HEFCE

Q: What about implementing RIOXX in CRIS systems?

A: Some work on mapping between CERIF and RIOXX, although not ongoing work. Technical description available for anyone to implement.

A (Balviar): In terms of developing plugins for commercial products – not talked to commercial suppliers yet, but planning to look at what can be developed and conversations now starting.

A (James): Already got RIOXX terms in Pure feed at Edinburgh

Q: Can you clarify ‘first named author’ vs ‘corresponding author’ – do you intend the first named author to be the corresponding author?

A: Understand that the ‘common case’ is that the first named author is the corresponding author [lots of disagreement from the floor on this point].

‘first named author’ seen as a synonym for ‘lead author’

Q: Why is vocabulary for (?) not in line with REF vocabulary

A: HEFCE accepts a wider range of outputs than ‘publications’, but RIOXX specifically focusses on publications – where most OA issues lie

Q: ‘Date accepted’ what about historic publications? Won’t have date of acceptance

A: RIOXX not designed for retrospective publication – going forward only. Not a general purpose bibliographic record

Comment from Peter Burnhill: RIOXX is not a cataloguing schema – it is a set of labels

Paul W emphasises – RIOXX is not a ‘record’ format – the systems outputting RIOXX will have much richer metadata already. There is no point in ‘subverting’ RIOXX for historical purposes – this isn’t its intended purpose.

Q: Can an author have multiple IDs in RIOXX

A: Not at the moment. Mapping between IDs for authors is a different problem space, not one that RIOXX tries to address

Comment from RCUK: Biggest problem is monitoring compliance with our policies – which RIOXX will help with a lot

Comments from floor: starting to see institutions issuing ORCIDs for their researchers – could see multiple ORCIDs for a single person. Similarly with DOIs – upload a publisher’s PDF to Figshare and you get a new DOI

Q: If you are producing RIOXX what about OpenAIRE

A: There is nothing in RIOXX that would stop it being OpenAIRE compliant – so RIOXX records can be transformed into OpenAIRE records (but not vice versa)

written by ostephens

Mar 11

I think more people in libraries should learn scripting skills – that is, how to write short computer programs. The reason is simple: it can help you do things quickly and easily that would otherwise be time consuming and dull. This is probably the main reason I started to use code and scripts in my work. If you ever find yourself regularly doing a job that is time consuming and/or dull and thinking ‘there must be a better way to do this’, it may well be a good project for learning to code.

To give an example. I work on ‘Knowledgebase+’ (KB+) – a shared service for electronic resource management run by Jisc in the UK. KB+ holds details on a whole range of electronic journals and related information, including details of organisations providing or using the resources.

I’ve just been passed the details of 79 new organisations to be added to the system. Creating these would normally mean entering a couple of pieces of information (including the name of the organisation) into a web form and clicking ‘submit’, once for each organisation.

While not the worst nor the most time consuming job in the world, it seemed like something that could be made quicker and easier through a short piece of code. If I do this in a sensible way, next time there is a list of organisations to add to the system, I can just re-use the same code to do the job again.

Luckily I’d already been experimenting with automating some processes in KB+ so I had a head start, leaving me with just three things to do:

  1. Write code to extract the organisation name from the list I’d been given
  2. Find out how the ‘create organisation’ form in KB+ worked
  3. Write code to replicate this process that could take the organisation name and other data as input, and create an organisation on KB+

I’d been given the original list as a spreadsheet, so I just exported the list of organisation names as a CSV file to make it easy to read programmatically. After that, writing code that opened the file, read it a line at a time and found the name was trivial:

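require 'csv'

# Read the exported file one row at a time; the header row becomes symbol keys
CSV.foreach(orgfile, :headers => true, :header_converters => :symbol) do |row|
    org_name = row[:name]  # value from the 'name' column
end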
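
To see what the `:symbol` header converter is doing without needing a file on disk, here is a small self-contained sketch that uses `CSV.parse` on an in-memory string. The column headings and organisation names are made up for illustration. The point is that a heading such as `Name` in the spreadsheet becomes the key `:name`.

```ruby
require 'csv'

# Sample data standing in for the exported spreadsheet (hypothetical values).
sample = <<~DATA
  Name,Sector
  Example University,Higher Education
  Sample College,Further Education
DATA

org_names = []
# :symbol lowercases each heading and turns it into a symbol (Name -> :name)
CSV.parse(sample, headers: true, header_converters: :symbol) do |row|
  org_names << row[:name]
end

puts org_names
```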

The code to trigger the creation of the organisation in KB+ was a simple HTTP ‘POST’ command (i.e. it is just a simple web form submission). The code I’d written previously essentially ‘faked’ a browser session and logged into KB+ (I did this using a code library called ‘mechanize’, which is specially designed for this type of thing), so it was simply a matter of finding the relevant URL and parameters for the ‘POST’. I used the handy Firefox extension ‘Tamper Data’, which lets you see (and adjust) ‘POST’ and ‘GET’ requests sent from your browser, to find the relevant information.

Screenshot of Tamper Data

The relevant details here are the URL at the top right of the form, and the list of ‘parameters’ on the right. Since I’d already got the code that dealt with authentication, the code to carry out this ‘POST’ request looks like this:

# @magent is the mechanize agent that has already logged in to KB+
page = @magent.post(url, {
  "name" => org_name,
  "sector" => org_sector
})
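
Putting the two snippets together, the whole job might look something like the sketch below. It assumes `agent` is an already logged-in mechanize agent (anything that responds to `post(url, params)` will do). The `create_organisations` method, the `DryRunAgent` class, the `sector` column and the URL are hypothetical placeholders rather than real KB+ details.

```ruby
require 'csv'

# Posts one 'create organisation' form per CSV row and returns the names sent.
# agent: an object with a mechanize-style post(url, params) method.
def create_organisations(agent, url, csv_text)
  created = []
  CSV.parse(csv_text, headers: true, header_converters: :symbol) do |row|
    next if row[:name].to_s.strip.empty? # skip rows with no organisation name
    agent.post(url, { 'name' => row[:name], 'sector' => row[:sector] })
    created << row[:name]
  end
  created
end

# Stand-in agent that records requests instead of sending them, so the
# sketch can be tried without touching a live system.
class DryRunAgent
  attr_reader :requests

  def initialize
    @requests = []
  end

  def post(url, params)
    @requests << [url, params]
  end
end

agent = DryRunAgent.new
csv_text = "name,sector\nExample University,Higher Education\n,\nSample College,Further Education\n"
puts create_organisations(agent, 'https://kbplus.example/org/create', csv_text).inspect
```

Doing a dry run first is a cheap way to check that the CSV is being read as expected before anything is submitted for real. Swapping in the logged-in agent then sends one form submission per organisation.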

So – I’ve written less than 10 new lines of code and I’ve got everything I need to automate the creation of organisations in KB+ given a list in a CSV file.

Do you have any jobs that involve dull, repetitive tasks? Ever find yourself re-keying bits of data? Why not learn to code?

P.S. If you work on Windows, try looking at tools like MacroExpress or AutoHotKey, especially if ‘learning to code’ sounds too daunting/time-consuming

P.P.S. Perfection is the enemy of the good – try to avoid getting yourself into an XKCD ‘pass the salt’ situation

written by ostephens