Jul 31

Presentation by Scott Renton:

Current Management:

  • Standard ISAD(G), Schema EAD (XML)
  • Laid down by GASHE and NAHSTE projects
  • managed by “CMSyst”, an in-house built MySQL/PHP application
  • Data feeds to ArchivesHub, ARCHON, CW Site (EDINA)

Limitations of CMSyst, wanted a new system to extend what they could do. Looked at a range of options including:

  • DSpace
  • Vernon (Museums system)
  • Calm/Adlib (commercial system)
  • Archivists’ Toolkit (lacking a UI)

However – all had shortcomings. Decided to go with ArchivesSpace.

  • Actually ArchivesSpace is a successor to Archivists’ Toolkit
  • Support for appropriate standards
  • All archivist functionality in one place
  • Web delivered
  • Open Source
  • Lyrasis network behind development
  • MySQL database
  • Runs under a container system such as Jetty or Tomcat
  • Code in JRuby, available on github
  • Four web apps:
    • Public front end
    • Admin area
    • Backend (Communicator)
    • Solr for indexing/search

Migration issues:

  • All data available from CMSyst
  • Exported in EAD
  • Some obstacles getting EAD to map correctly to loaded authorities
  • Some obstacles getting authorities loaded
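
One way to see where an EAD export and the loaded authorities disagree is to list the access points in the export and compare them with the authority names already in the system. This is only a sketch: the notes do not say what the obstacles actually were, the exact-match comparison is an assumption, and it assumes EAD without an XML namespace. It uses REXML (bundled with Ruby) and the EAD elements controlaccess, persname, corpname and geogname.

```ruby
require "rexml/document"

ACCESS_POINT_ELEMENTS = %w[persname corpname geogname].freeze

# Collect [element name, text] pairs for access points under <controlaccess>.
def access_points(ead_xml)
  doc = REXML::Document.new(ead_xml)
  points = []
  doc.elements.each("//controlaccess/*") do |el|
    next unless ACCESS_POINT_ELEMENTS.include?(el.name)
    points << [el.name, el.text.to_s.strip]
  end
  points
end

# Access points in the export with no exactly matching loaded authority name.
def unmatched(ead_xml, loaded_authorities)
  access_points(ead_xml).reject { |_, name| loaded_authorities.include?(name) }
end
```

Anything returned by unmatched would need checking by hand, since a real comparison would probably have to allow for variant forms of a name.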

ArchivesSpace has functionality to link out to digital objects – using LUNA system – with CSV import of data [not clear which direction data flows here]

ArchivesSpace is popular in the US, but Edinburgh is the first European institution to take it.

University of Edinburgh Collections has a new web interface http://collections.ed.ac.uk. ArchivesSpace will be the expression of archives within this collections portal. Archives are surfaced here as Collection Level Descriptions – 1000s of collections covered.

Implementation has been a good collaborative project. Got Curators, archivists, projects and innovations staff and digital developers all involved. Also good support on mailing list and good collaboration with other institutions.

Now going to look at “CollectionSpace” – which is a sister application for museums.

Archives collections will be available soon (within a couple of weeks) at http://collections.ed.ac.uk/archives

written by ostephens \\ tags:

Jul 31


Currently takes data from UKPMC, looks for an affiliation statement in text of publication, pushes to appropriate institutional repository based on that affiliation (if the IR has given permission for this to happen).
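
The routing step described above can be sketched roughly as follows. This is not the Router's own code: the patterns, the repository names and the opted-in list are all made up for illustration, and real affiliation matching would need to be much more forgiving than a simple pattern match.

```ruby
# Hypothetical routing table: affiliation pattern => repository name.
ROUTES = {
  /university of edinburgh/i => "edinburgh-ir",
  /heriot-watt/i             => "heriot-watt-ir"
}.freeze

# Repositories assumed to have given permission to receive pushed content.
OPTED_IN = ["edinburgh-ir"].freeze

# Return the opted-in repositories whose pattern appears in the affiliation text.
def targets_for(affiliation_text)
  ROUTES.select { |pattern, _| affiliation_text.match?(pattern) }
        .values
        .select { |repo| OPTED_IN.include?(repo) }
        .uniq
end
```

A paper whose affiliation statement matches no pattern, or only matches repositories that have not opted in, gets an empty list and is not pushed anywhere.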

Uses SWORD to push paper to the IR

Can also browse all the content, search by target repository and author

Also API to the data in the Router.

Currently trying to understand how the HEFCE REF OA mandate impacts on what the Router should do and how it can help institutions deliver to the mandate.

Major change is the requirement for AAM – which wasn’t the previous requirement

One of the most important questions was how institutions would like to get AAMs delivered to them – 3 options:

  • i. 3rd party service to push content to your system – 37% of respondents didn’t know if they would want this approach
  • ii. Pull content into your system using an API – 31% don’t know if this would work for them
  • iii. Receive content via email as a file attachment – 43% said ‘OK at a push’ and 30% saw as satisfactory

If there was another solution what would it be:

  • Deposit at publication
  • Academics upload
  • SWORD would be good if institution/author matching is ‘really good’
  • Anything involving minimal reliance on academics updating information
  • Publishers to provide metadata to institution on acceptance
  • Being notified of accepted manuscripts by publishers so then can coordinate with authors

Broke into discussion groups for the following questions:

  • How would you ideally like to receive AAMs (or metadata describing them)
  • If Router starts to provide AAMs (or metadata describing them) at acceptance and then later provides metadata for the published version of record (VoR) – what are the de-duplication issues?
  • If you receive multiple copies of the same version what are the de-duplication issues?
  • What are the main issues holding you back from participating? What help would you require?
  • What is the one most important feature that is essential to you?

Feedback from groups:

  • Pure doesn’t currently have a field for ‘date of acceptance’
  • Better to get the manuscript but the metadata is better than nothing
    • In some cases (especially with AAMs) you may have the manuscript and minimal metadata
  • De-duplication a huge issue – to the extent that there are examples of institutions having to turn off automated feeds from sources such as Scopus (see also Bournemouth yesterday)
  • Getting any kind of notification of acceptance would be a huge step for those working in institutions
    • Getting notification of publication as well would be big step
  • A pull method preferred – may need local processing before you publish
  • ‘extra’ metadata – e.g. corresponding author – would be highly useful – not available from WoS
    • If local system doesn’t have ability to store that metadata then this is a problem
  • Boundary between push/pull is not very clear. E.g. notification is ‘push’
    • Got to be clear about what is being pushed and where to
    • Reluctant to have repository populated without human intervention
    • EPrints and DSpace have a ‘review queue’ function
  • Having more publishers on board with the Router is key – if you don’t have good coverage it’s just one more source
  • Identifiers! If you have identifiers for AAMs (DOIs) it might help with de-duplication
  • If you are confident as an institution that Authors will deposit AAMs then the real issue becomes very much being notified at publication [found this point very interesting - points to a real divide in institutional attitudes about what authors will do]
  • People still trying to work out workflows
  • Maybe a mixed message to researchers/authors if some is automated and some is not
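
On the point about identifiers helping with de-duplication, here is a minimal sketch of what matching on DOI might look like. It assumes each incoming record is a Hash with an optional :doi key (a made-up shape, not any particular system's format) and it simply keeps the first record seen for each DOI.

```ruby
# Normalise a DOI so that "doi:10.1/ABC", "https://doi.org/10.1/abc" and
# "10.1/abc" compare equal. DOIs are case-insensitive, so we downcase.
def normalise_doi(doi)
  doi.to_s.strip.downcase
     .sub(%r{\Ahttps?://(dx\.)?doi\.org/}, "")
     .sub(/\Adoi:\s*/, "")
end

# Keep the first record seen for each DOI. Records without a DOI are all kept,
# because there is nothing reliable to match them on.
def dedupe_by_doi(records)
  seen = {}
  records.select do |record|
    key = normalise_doi(record[:doi])
    next true if key.empty?
    next false if seen[key]
    seen[key] = true
  end
end
```

This only helps where a DOI is present, which is the catch raised in the discussion: AAMs often arrive before any identifier exists, and those records fall through to manual checking.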


Jul 30

LOCH Pathfinder project. Presented by Dominic Tate

Partners – University of Edinburgh, Heriot-Watt, St Andrews

In (very) brief:

The approach is:

  • Managing Open Access payments – including a review of current reporting methods and creation of shareable spreadsheet templates for reporting to funders
  • Using PURE as a tool to manage Open Access compliance, verification and reporting
  • Adapting institutional workflows to pre-empt Open Access requirements and make compliance as seamless as possible for academics



Jul 30

Valerie McCutcheon talking about the ‘end to end open access’ (or E2E) project.

Project is working with a wide range of institution types – big/small, geographically dispersed, from ‘vanilla’ systems to more customised.

Producing a standard new open access metadata profile, with some implementations, although system agnostic overall:

  • EPrints case study
  • Hydra case study
  • EPrints OA reporting functionality

Generic workshops

  • Early stage – issue identification and solution sharing
  • Embedding future REF requirements
  • Advocacy
  • Late stage – report on findings and identify unsolved issues

Several existing standards bodies/activity

E2E is collecting information on different metadata that is being used or is needed -  Metadata Community Collaboration – spreadsheet for people to add to


Workshop on 4th September – covering:

  • Current initiatives in Open Access
  • Metadata requirements

Working with other Pathfinder projects


Jul 30

Quite different institutions but similarities in publications management at systems level:

  • Both use Symplectic Elements to manage publications and EPrints for IR
  • BU Research and Knowledge Exchange Office manages OA funding and ‘Bournemouth Research Information and Networking’ while the IR (BURO) is managed by the library
  • UCL Library manages both OA funding and publications through the Research Publications Service (RPS – think this is the Symplectic Elements) and Discovery (EPrints)

Publications Management

  • Researchers manage their data via Symplectic, which can also get data from Scopus and Web of Science, the data is then pushed out to profile pages and/or Repository

Institutional Repositories

  • UCL IR (Discovery) is both metadata only and full-text outputs – 317794 outputs in total – includes 5111 theses
  • Bournemouth only has full-text – much smaller numbers – 2831 outputs in total – not all public access

Staff support

  • UCL – a Virtual Open Access Team
  • Bournemouth
    • OA Funding – 1 manager
    • No fulltime repository staff
    • Rota of 3 editorial staff, working one week in three on outputs received
    • 0.2 repository administrator
    • 0.2 Repository manager

OA Funding

  • UCL OA funding managed by OA Team in the library
    • Combination of RCUK, UCL and Wellcome funding
    • at least 9000 research pubs per annum
    • RCUK 2013-14 target: 693 papers – successfully processed 796
    • Current level of APC payments >2000 per annum

UCL has many pre-payment agreements in place for APCs

  • BioMed Central
  • Elsevier
  • BMJ Journals
  • RSC
  • IEEE
  • PeerJ
  • Sage
  • PLOS
  • Springer
  • T&F
  • Wiley
  • ubiquity press
  • and more – and hoping to extend further

Pre-payment agreements have been very successful and saved money

Both Bournemouth and UCL have found it challenging to spend all the money available for APCs

Challenges for engagement

  • UCL Discovery
    • Metadata only outputs – poor quality, not checked, can be entered multiple times
    • Feeds into Symplectic Elements from Scopus and WoS can lead to duplicates: Scopus sometimes has records for pre and post publication and WoS can have a record also – and academics select all three rather than just choosing one of them
    • Academic engagement
    • Difficulty sending large files from RPS (Symplectic) to IR
    • Furious about how h index is calculated in RPS (manual entries aren’t counted, only items from Scopus / WoS)
    • Incorrect search settings in RPS
    • Don’t understand the data harvesting process – user managed to crash the system by entering single word search with common author name
  • Bournemouth BURO
    • 2013 – converted to full-text only
    • Mapping data issues
    • Incorrect publications display on original staff pages
    • Academic staff left thinking BURO no longer existed [think implication is that it looked like it had been replaced by RPS?]

UCL have very clear requirement for outputs to be deposited in IR – http://www.ucl.ac.uk/library/open-access/ref/

Sheer volume of outputs at UCL is overwhelming

At Bournemouth – advocacy a big issue still (especially since many thought BURO had been discontinued) – but now outputs in BURO and BRIAN must be considered in pay and progression.

Shared challenges

  • Deposit on acceptance
  • Open Access options – making sure academics know what routes of publication are open to them
  • Establishing new workflows
  • Publishers move goalposts, change conditions etc.
  • Flexible support
  • Encouraging champions in Faculties
  • Use the REF2020 as a stick and a carrot for their research

UCL as a whole supports Green OA, but assists academics to meet their requirements through Gold OA route. UCL feels Gold will still be important to science disciplines

BU – funding will be available and has institutional support – but issues may arise depending on volume in the future


Jul 30

Sketchy notes on this session: Les Carr updating us on developments in EPrints software / development

What is EPrints for?

  • Supporting researchers
  • Supporting research data
  • Supporting research outputs
  • Supporting research outcomes
  • Supporting research managers

However, want repositories to take on a bigger agenda – publication, data, information, …

To achieve this stripping EPrints back to its core data management engine – tuning it for speed, efficiency, scale and flexibility.

EPrints4 is not a rewrite of the software, but making the core as generic as possible – so it can handle all kinds of content

Improved integration with Xapian (for search)

Improving efficiency of information architecture – db transactions, memcached, fine-grained ACLs – can support much bigger repositories.

MVC approach

Can be run as headless service, but comes with a UI

Towards Repository ’16 – OA Compliancy, capture projects/funders data (working with Soton and Reading)

Integrating with other services/systems

  • IRUS-UK (on Bazaar)
  • Publications router
  • WoK/Scopus imports
  • OA end-to-end project (EPrints Services are partners)

Les says “EPrints moving towards being a de facto CRIS-light system”

Repositories are/need to be collaboration between librarians and developers

No release date for EPrints4 yet – but probably around a year away.


Jul 30

Original concerns for RIOXX:

  • Primary:
    • How to represent the funder
    • How to represent the project/grant
  • Secondary:
    • How to represent the persistent identifier of the item described
    • Provisions of identifiers pointing to related data sets
    • How to represent the terms of use for an item

Original principles:

  • Purpose driven – Focussed on satisfying RCUK reporting requirements
  • Simple (re-use DC, not CERIF)
  • Generic in scope (don’t tie down to specific types of output)
  • Interoperable – specifically with OpenAIRE
  • Developed openly – public consultation

Has anything changed with RIOXX 2 (mid-2014)?

  • Still purpose driven, but now encompassing HEFCE requirements as well
  • Slightly more sophisticated / complex but still quite simple
  • No longer ‘generic’ – explicit focus on publications
  • No longer seen as a temporary measure – positioned to support REF2020
  • Interoperability still key – and currently working on an OpenAIRE crosswalk

Other changes:

Current status:

  • version 2.0 beta was released for public consultation in June 2014
  • version 2.0 RC 1 has been compiled
  • accompanying guidelines are being written
  • XSD schema has been developed
  • expect full release in late August/early September

Some specific elements:

  • dc:identifier
    • identifies the open access item being described by the RIOXX metadata record, regardless of where it is
    • recommended to identify the resource itself, not a splash page
    • dc:identifier MUST be an HTTP URI
  • dc:relation and rioxxterms:version_of_record
    • rioxxterms:version_of_record is an HTTP URI which is a persistent identifier for the published version of a resource
    • will often be the HTTP URI form of a DOI
  • dc:relation
    • optional property pointing to other material (e.g. dataset)
  • dcterms:dateAccepted
    • MUST be provided
    • more precise than other dated events (‘published’ date very grey area)
  • rioxxterms:author & rioxxterms:contributor
    • MUST be HTTP URIs – ORCID strongly recommended
    • one rioxxterms:author per author
    • rioxxterms:contributor is for parties that are not authors but credited with some contribution to publication
  • rioxxterms:project
    • joins funder and project in one, slightly more complex, property
    • The use of funder IDs (DOIs in their HTTP URI form) from FundRef is recommended, but other ID schemes can be used and name can be used
  • license_ref
    • adopted from NISO’s Open Access Metadata and Indicators
    • takes an HTTP URI and a start date
    • URI should identify a license
      • there is work under way to create a ‘white list’ of acceptable licenses
    • embargoes can be expressed by using the ‘start date’ to show the date on which the license takes effect
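
To make a few of these rules concrete, here is a rough sketch of checking a record against them. It follows the rules as recorded in these notes (not the released RIOXX guidelines), and it works on a plain Hash keyed by element name, which is a made-up in-memory shape rather than RIOXX XML. The method names are hypothetical.

```ruby
require "uri"
require "date"

# True when the value parses as an http or https URI with a host.
def http_uri?(value)
  uri = URI.parse(value.to_s)
  uri.is_a?(URI::HTTP) && !uri.host.nil?
rescue URI::InvalidURIError
  false
end

# Return a list of problems found in the record; an empty list means none found.
def rioxx_problems(record)
  problems = []
  problems << "dc:identifier must be an HTTP URI" unless http_uri?(record["dc:identifier"])
  problems << "dcterms:dateAccepted must be provided" if record["dcterms:dateAccepted"].to_s.empty?
  Array(record["rioxxterms:author"]).each do |author|
    problems << "author id is not an HTTP URI: #{author}" unless http_uri?(author)
  end
  problems
end

# An embargo is expressed through the licence start date:
# the licence applies from that date onwards.
def licence_in_effect?(start_date, on)
  Date.iso8601(start_date) <= on
end
```

So a record whose licence start date is in the future is still under embargo, and licence_in_effect? returns false until that date arrives.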

Funding of development of RIOXX as application profile now at an end, but there is funding for further developments (e.g. s/w development for repositories etc.)

RIOXX is endorsed by both RCUK and HEFCE

Q: What about implementing RIOXX in CRIS systems?

A: Some work on mapping between CERIF and RIOXX, although not ongoing work. Technical description available for anyone to implement.

A (Balviar): In terms of developing plugins for commercial products – not talked to commercial suppliers yet, but planning to look at what can be developed and conversations now starting.

A (James): Already got RIOXX terms in Pure feed at Edinburgh

Q: Can you clarify ‘first named author’ vs ‘corresponding author’ – do you intend the first named author to be the corresponding author?

A: Understand that the ‘common case’ is that the first named author is the corresponding author [at this point lots of disagreement from the floor].

‘first named author’ seen as synonym for ‘lead author’

Q: Why is vocabulary for (?) not in line with REF vocabulary

A: HEFCE accepts wider range of outputs than ‘publications’, but RIOXX specifically focusses on publications – where most OA issues lie

Q: ‘Date accepted’ what about historic publications? Won’t have date of acceptance

A: RIOXX not designed for retrospective publication – going forward only. Not a general purpose bibliographic record

Comment from Peter Burnhill: RIOXX is not a cataloguing schema – it is a set of labels

Paul W emphasises – RIOXX is not a ‘record’ format – the systems outputting RIOXX will have much richer metadata already. There is no point in ‘subverting’ RIOXX for historical purposes – this isn’t its intended purpose.

Q: Can an author have multiple IDs in RIOXX

A: Not at the moment. Mapping between IDs for authors is a different problem space, not one that RIOXX tries to address

Comment from RCUK: Biggest problem is monitoring compliance with our policies – which RIOXX will help with a lot

Comments from floor: starting to see institutions issuing ORCIDs for their researchers – could see multiple ORCIDs for a single person. Similarly with DOIs – upload a publisher’s PDF to Figshare and you get a new DOI

Q: If you are producing RIOXX what about OpenAIRE

A: There is nothing in RIOXX that would stop it being OpenAIRE compliant – so RIOXX records can be transformed into OpenAIRE records (but not vice versa)


Mar 11

I think more people in libraries should learn scripting skills – that is how to write short computer programmes. The reason is simple – because it can help you do things quickly and easily that would otherwise be time consuming and dull. This is probably the main reason I started to use code and scripts in my work, and if you ever find yourself doing a job regularly that is time consuming and/or dull and thinking ‘there must be a better way to do this’ it may well be a good project for learning to code.

To give an example. I work on ‘Knowledgebase+’ (KB+) – a shared service for electronic resource management run by Jisc in the UK. KB+ holds details on a whole range of electronic journals and related information including details of organisations providing or using the resources.

I’ve just been passed the details of 79 new organisations to be added to the system. Creating these would normally mean entering a couple of pieces of information (including the name of the organisation) into a web form and clicking ‘submit’ for each one.

While not the worst nor the most time consuming job in the world, it seemed like something that could be made quicker and easier through a short piece of code. If I do this in a sensible way, next time there is a list of organisations to add to the system, I can just re-use the same code to do the job again.

Luckily I’d already been experimenting with automating some processes in KB+ so I had a head start, leaving me with just three things to do:

  1. Write code to extract the organisation name from the list I’d been given
  2. Find out how the ‘create organisation’ form in KB+ worked
  3. Write code to replicate this process that could take the organisation name and other data as input, and create an organisation on KB+

I’d been given the original list as a spreadsheet, so I just exported the list of organisation names as a csv to make it easy to read programmatically. After that, writing code that opened the file, read a line at a time and found the name was trivial:

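require 'csv'
CSV.foreach(orgfile, :headers => true, :header_converters => :symbol) do |row|
    org_name = row[:name] # organisation name from the 'name' column; the create request for this row goes here
end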

The code to trigger the creation of the organisation in KB+ was a simple http ‘POST’ command (i.e. it is just a simple web form submission). The code I’d written previously essentially ‘faked’ a browser session and logged into KB+ (I did this using a code library called ‘mechanize’ which is specially designed for this type of thing), so it was simply a matter of finding the relevant URL and parameters for the ‘post’. I used the handy Firefox extension ‘Tamper Data’ which allows you to see (and adjust) ‘POST’ and ‘GET’ requests sent from your browser – which allowed me to see the relevant information.

Screenshot of Tamper Data

The relevant details here are the URL at the top right of the form, and the list of ‘parameters’ on the right. Since I’d already got the code that dealt with authentication, the code to carry out this ‘post’ request looks like this

page = @magent.post(url, {
  "name" => org_name,
  "sector" => org_sector
})

So – I’ve written less than 10 new lines of code and I’ve got everything I need to automate the creation of organisations in KB+ given a list in a CSV file.
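
Putting the two fragments together, the whole job might look something like the sketch below. To keep it runnable without a KB+ login, a small RecordingAgent class stands in for the logged-in mechanize agent: it just records each post instead of sending it. The class, the method name create_organisations, the URL and the ‘sector’ column are all placeholders for illustration, not the real KB+ details.

```ruby
require "csv"

# Stand-in for a logged-in Mechanize agent: it records each POST instead of
# sending it, so the sketch runs without network access or a KB+ account.
class RecordingAgent
  attr_reader :posts

  def initialize
    @posts = []
  end

  def post(url, params)
    @posts << [url, params]
  end
end

# Post one 'create organisation' form per CSV row.
# Assumes the CSV has 'name' and 'sector' columns (an assumption about the export).
def create_organisations(agent, csv_text, url)
  CSV.parse(csv_text, headers: true, header_converters: :symbol) do |row|
    agent.post(url, { "name" => row[:name], "sector" => row[:sector] })
  end
end
```

Swapping the stand-in for a real agent that has already logged in should be the only change needed, since mechanize agents also respond to post with a URL and a hash of parameters.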

Do you have any jobs that involve dull, repetitive tasks? Ever find yourself re-keying bits of data? Why not learn to code?

P.S. If you work on Windows, try looking at tools like MacroExpress or AutoHotKey, especially if ‘learning to code’ sounds too daunting/time-consuming

P.P.S. Perfection is the enemy of the good – try to avoid getting yourself into an XKCD ‘pass the salt’ situation


Jun 10

Keynote from Ted Nelson

Talking about electronic literature for over 20 years. Felt alienated from the web because of ‘what it is not’.

Starting with the question – “what is literature”? For TN – a system of interconnected documents. But the web supports only ‘one way links’ – jumps into the unknown. Existing software does nothing for the writer to interact with this concept of ‘literature’.

Constructs of books we have recreate the limitations of print – separate documents. Standard document formats – individual characters, scrambled with markup, encoded into a file. This thinking goes deep in the community – and TN contends this is why other ideas of how literature could exist are seen as impossible.

For the last 8-10 years, TN and colleagues working on a system that presents an interconnected literature (Xanadu Space). Two kinds of connection:

  • Links (connects things that are different, and are two way)
  • Transclusion (connects things that are the same)

TN illustrating using example of a programming working environment – where code, comments, bugs are transcluded into a single Integrated Work Environment.

  • We shouldn’t have ‘footnotes’ and ‘endnotes’ – they should be ‘on the side’.
  • Outlines should become tables of contents that go sideways into the document
  • Email quotation should be parallel – not ‘in line’

Vision is a parallel set of documents that can be seen side-by-side.

History is parallel and connected – why do we not represent history as we write it – parallel coupled timelines and documents.

Challenge – how do you create this parallel set of connected documents? Each document needs to be addressable – so you can direct systems to ‘bring in text A from document B’. But challenges.

TN as a child was immersed in media. Dad was director for live TV – so TN got to see making television firsthand – his first experience was not just of consumption but of creation of TV. At college he produced musical, publication, film. Started designing interactive software.

How did we get here?

TN describing current realisation of the ‘translit’ approach – Xanadu. Several components:

  • Xanadoc – an ‘edit decision list format’ – generalisation of every quotation connected to its source
  • Xanalink – type, list of endsets (the things pointed at) – what to connect – exists independently of the doc?
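
The ‘edit decision list’ idea can be illustrated with a toy example: a document is not stored as text, but as a list of pointers to stretches of text held elsewhere, and is assembled on demand. This is only a sketch of the general idea, not the xanadoc format (the specs had not been published at the time of the talk). The Span structure and the addresses are invented for illustration.

```ruby
# A span points at a stretch of text inside a source document.
Span = Struct.new(:source, :offset, :length)

# Resolve an edit decision list: fetch each span from its source and join them.
# `sources` maps a source address to its full text.
def assemble(edl, sources)
  edl.map { |span| sources.fetch(span.source)[span.offset, span.length] }.join
end
```

Because every piece of the assembled text is still a pointer, each quotation stays connected to its source, which is the property TN is after.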

What to do about changing documents? You copy & cache.

TN and colleagues almost ready to publish Xanadu specs for ‘xanadoc’ and ‘xanalink’ at http://xanadu.com/public/. Believes such an approach to literature can be published on the web, even though he dislikes the web for what it isn’t…

WYSIWYG – TN says only really applies to stuff you print out! TN aiming for ‘What you see is what you never could’ (do in print) – we need to throw off the chains of the printed document.



Jun 10

Another winner of a DM2E Open Humanities award being presented today by Robyn Adams (http://www.livesandletters.ac.uk/people/robynadams) from the Centre for Editing Lives and Letters. Project looked at repurposing data from the letters of Thomas Bodley (responsible for the refurbishment of the library at the University of Oxford – creating the Bodleian Library).

Bodley’s letters are held in archives around the world. The letters are full of references to places, people etc. The letters had been digitised and transcribed – using software called ‘Transcribers Workbench’ developed specifically to help with early modern English writing. In order to make the transcribed data more valuable and usable decided to encode people and places from the letters – unfunded work on limited resources. Complicated by obscure references and also sometimes errors in the letters (e.g. Bodley states ‘the eldest son is to be married’ when it turns out it was the youngest son – makes researching the person to which Bodley is referring difficult).

This work was done in the absence of any specific use case. Now Robyn is re-approaching the data encoded as a consumer – to see how they can look for connections in the data to gain new insights

The data is available on Github at https://github.com/livesandletters/bodley1
