ResourceSync: Web-based Resource Synchronization

Final paper in the ‘Repository Services’ session at OR2012 is presented by Simeon Warner. This is the paper I really wanted to see this morning as I’ve seen various snippets on twitter about it (via @azaroth42 and @hvdsomp). Simeon says so far it’s been lots of talking rather than doing 🙂

A lot of the stuff in this post is also available on the ResourceSync specification page http://resync.github.com/spec/

Synchronize what?

  • Web resources – things with a URI that can be dereferenced and are cacheable – not dependent on underlying OS or tech
  • Small websites to large repositories – needs to work at all scales
  • Need to deal with things that change slowly (weeks/months) or quickly (seconds) and where latency needs may vary
  • Focus on needs of research communication and cultural heritage orgs – but aim for generality

Why?

Because lots of projects and services are doing synchronization but have to roll their own on a case by case basis. Lots of examples where local copies of objects are needed to carry out work (I think CORE gives an example of this kind of application).

OAI-PMH is over 10 years old as a protocol, and was designed to do XML metadata – not files/objects. (exactly the issue we’ve seen in CORE)

Rob Sanderson has done work on a range of use cases – including things like aggregation (multiple sources to one centre). Also ruled out some use cases – e.g. not going to deal with bi-directional synchronization at the moment. Some real-life use cases they’ve looked at in detail:

DBpedia Live duplication – 20 million entries, updated at around 1 per second, though sporadic. Low latency needed. This suggests it has to be a ‘push’ mechanism – you can’t have lots of services polling every second for updates

arXiv mirroring – 1 million article versions, with about 800 created per day. Need metadata and full text for each article. Accuracy important. Want a low barrier for others to use.

Some terminology they have determined:

  • Resource – an object to be synchronized – a web resource
  • Source – system with the original or master resource
  • Destination – system to which resource from the source will be copied
  • Pull – process to get information from source to destination, initiated by the destination
  • Push – process to get information from source to destination, initiated by the source
  • Metadata – information about Resources such as URI, modification time, checksum, etc. (Not to be confused with Resources that may themselves be metadata about another resource, e.g. a DC record)

Three basic needs:

  • Baseline synchronization – perform initial load or catchup between source and destination
  • Incremental synchronization – deal with updates/creates/deletes
  • Audit – is my current copy in sync with the source?

Need to use an ‘inventory’ approach to know what is there and what needs updating. So audit uses the inventory to check what has changed between source and destination, and then incremental synchronization is done. You don’t necessarily need to use a full inventory – you could use an inventory change set to know what has changed.

Once you’ve got agreement on the change set, you need to get source and destination back in sync – whether by exchanging objects, doing diffs and updates, etc.
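
As a rough illustration of the audit step, here’s a minimal Python sketch of my own (not from the spec) – it assumes an inventory is simply a mapping from URI to (last-modified, checksum), which is my assumption rather than anything ResourceSync defines:

# Hypothetical sketch: compare two inventories to work out what needs syncing.
# An inventory here is a dict mapping URI -> (lastmod, md5); this structure is
# my own assumption, not something defined by the ResourceSync spec.

def audit(source_inventory, destination_inventory):
    to_copy, to_delete = [], []
    for uri, (lastmod, md5) in source_inventory.items():
        dest = destination_inventory.get(uri)
        if dest is None or dest[1] != md5:
            to_copy.append(uri)          # missing or changed at the destination
    for uri in destination_inventory:
        if uri not in source_inventory:
            to_delete.append(uri)        # deleted at the source
    return to_copy, to_delete

Incremental synchronization is then just a matter of acting on the result – dereferencing each URI in to_copy and removing each URI in to_delete.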

Decided that for simplicity they needed to focus on ‘pull’, but with some ‘push’ mechanism available so sources can push changes when necessary.

What they’ve come up with is a framework based on Sitemaps – the format Google uses to know what to crawl on a website. It’s a modular framework to allow selective deployment of different parts. For example – basic baseline sync looks like:

Level zero -> Publish a Sitemap

Periodic publication of a Sitemap is the basic implementation. The Sitemap contains at least a list of URLs – one for each resource. But you could add in more information – e.g. a checksum for the resource – which would enable better comparison.
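
To make that concrete, here’s a rough Python sketch of a source serialising its inventory as a plain Sitemap – the <urlset>/<url>/<loc>/<lastmod> elements are standard Sitemap markup, but exactly how extra information like a checksum gets attached is for the draft spec to define, so I’ve left that out:

# Sketch only: serialise an inventory as a basic Sitemap.
# The (uri, lastmod) inventory shape is my own assumption.

def build_sitemap(resources):
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for uri, lastmod in resources:
        lines.append('  <url><loc>%s</loc><lastmod>%s</lastmod></url>' % (uri, lastmod))
    lines.append('</urlset>')
    return '\n'.join(lines)

print(build_sitemap([('http://example.org/resource/1', '2012-07-10')]))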

Another use case – incremental sync. In this case the Sitemap format is still used, but it includes information only for change events – one <url> element per change event.
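
On the destination side, consuming such a change document might look something like the sketch below – note that the changetype attribute is purely my placeholder; the actual markup for change events is for the draft spec to define:

# Sketch: read a change document and work out what to do for each event.
# The 'changetype' attribute name is a placeholder, not real ResourceSync markup.
from xml.etree import ElementTree as ET

NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

def read_changes(change_sitemap_xml):
    actions = []
    for url in ET.fromstring(change_sitemap_xml).findall(NS + 'url'):
        loc = url.find(NS + 'loc').text
        event = url.get('changetype', 'updated')   # created/updated/deleted
        actions.append((event, loc))               # e.g. ('deleted', uri)
    return actions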

What about ‘push’ notification? They believe XMPP is the best bet – this is used for messaging services (like the Google/Facebook chat systems). This allows rapid notification of change events. XMPP is a bit ‘heavyweight’ – but lots of libraries are already available for it, so there’s no need to implement it from scratch.
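
As a very rough idea of what leaning on one of those libraries might look like, here’s a minimal sketch using SleekXMPP – the JIDs, password and message body are all made up, and this just sends a single direct message (a real deployment would probably use something more structured, e.g. XMPP pubsub):

# Sketch: push a change notification over XMPP using the SleekXMPP library.
# Account details and message body are placeholders.
import sleekxmpp

class ChangeNotifier(sleekxmpp.ClientXMPP):
    def __init__(self, jid, password, recipient, body):
        sleekxmpp.ClientXMPP.__init__(self, jid, password)
        self.recipient, self.body = recipient, body
        self.add_event_handler('session_start', self.start)

    def start(self, event):
        self.send_presence()
        self.send_message(mto=self.recipient, mbody=self.body, mtype='chat')
        self.disconnect(wait=True)

notifier = ChangeNotifier('source@example.org', 'secret',
                          'destination@example.org',
                          'updated http://example.org/resource/1')
if notifier.connect():
    notifier.process(block=True)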

LANL Research Library ran a significant scale experiment in synchronization of the DBpedia Live database to two remote sites, using XMPP to push changes. A couple of issues, but overall very successful.

Sitemaps have some limitations on size (I think Simeon said 2.5 billion URLs?) – but it’s not hard to see how this could be extended if required.

Dumps: a dump format is necessary – to avoid repeated HTTP GET requests for multiple resources. Used for both baseline and changeset sync. Options are:

  • Zip + sitemap – Zip is very common, but would require a custom mechanism to link the sitemap to the content (see the sketch below)
  • WARC – designed for this purpose but not widely implemented

Simeon guesses they will end up with hooks for both of these.
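
A minimal sketch of the Zip option – the way I’ve linked the sitemap ‘manifest’ to the archived content here is exactly the kind of custom convention referred to above, so treat the filenames and layout as my own invention:

# Sketch: write a dump as a Zip file containing the resources plus a sitemap
# acting as a manifest. The manifest filename and layout are my own convention.
import zipfile

def write_dump(dump_path, sitemap_xml, resources):
    """resources: iterable of (archive_name, content_bytes) pairs."""
    with zipfile.ZipFile(dump_path, 'w', zipfile.ZIP_DEFLATED) as dump:
        dump.writestr('manifest.xml', sitemap_xml)   # the inventory/changeset
        for name, content in resources:
            dump.writestr('resources/' + name, content)

write_dump('dump.zip', '<urlset/>', [('1.txt', b'hello')])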

Expecting a draft spec very very soon (July 2012). Code and other stuff already on GitHub https://github.com/resync/

Repository Services

I’m briefly at the Open Repositories 2012 conference in Edinburgh, and this morning in a session about ‘repository services’ – which sounds like a nice easy session to ease into the morning, but is actually diving into some pretty hard technical detail pretty quickly!

There are three papers in this session.

Built to scale?

Edwin Shin is describing using Hydra (a repository stack built on Fedora, Solr, Blacklight). I missed the start, but the presentation is about dealing with very large numbers of digital objects – from millions to hundreds of millions. It’s a pretty technical talk – optimisation of Solr through sharding, taking a ‘sharded’ approach to Fedora (in the ActiveFedora layer).

Perhaps the high-level lesson to pull out is that you ought to look at how people use a system when planning quite technical aspects of the repository. For example – they reworked their disaster recovery strategy based on the knowledge that the vast majority of requests were for the current year – since a full system recovery takes days (or weeks?), they now deposit objects from the current year so that they can be restored first, and quickly.

Similarly with Solr optimisation – having done a lot of generic optimisation they were still finding performance (query response times) far too slow on very large sets of documents. By analysing how the system was used they were able to perform some very specific optimisations (I think this was around increasing the filterCache settings) to achieve a significant reduction in query response times.

Inter-repository Linking of Research Objects with Webtracks

This paper is being presented by Shirley Ying Crompton. Shirley describes how the research process leads to research data and outputs being stored in different places with no links between them. So they decided to use RDF/linked data to add structured citation links between research objects (and people – e.g. creators).

However, different objects created in different systems – so how to make sure objects are linked as they are created? Looked at existing protocols for enabling links to be created:

  • Trackbacks – used for blogs/comments
  • Semantic pingback – an RPC protocol to form semantic links between objects
  • Salmon – an RSS/Atom feed-related protocol

Decided to take the ‘Webtracks’ approach – this is an inter-repository communication protocol. The Webtracks InteRCom protocol allows the formation of links between objects in two different repositories. InteRCom is a two-stage protocol – the first stage is a ‘harvest’ to get links, then the second stage ‘requests’ a link between two objects.

The InteRCom implementation has been done in Java and is available as open source – it can be downloaded from http://sourceforge.net/projects/webtracks/.

Shirley says: Webtracks facilitates the propagation of citation links to provide a linked web of data – it uses the emerging linked data environment and supports linking between diverse types of digital research objects. There are no constraints on link semantics or metadata. Importantly (for the project), it does not rely on a centralised service – it is peer-to-peer.

Webtracks has been funded by JISC and is a collaboration between the University of Southampton and the STFC – more information at http://www.jisc.ac.uk/whatwedo/programmes/mrd/clip/webtracks.aspx

ResourceSync: Web-based Resource Synchronization

This session is of particular interest to me, and I took more extensive notes – so I’ve put these into a separate post http://www.meanboyfriend.com/overdue_ideas/2012/07/resourcesync-web-based-resource-synchronization/

Boutique Catalogues

In my previous post on MaRC and SolrMaRC I described how SolrMaRC could be used, as part of Blacklight, VuFind or other discovery layers, to create indexes from various parts of the MaRC record. My question to those at Mashcat was “what would you do, or like to do, with these capabilities” – I was especially interested in how you might be able to offer better search experiences for specific types of materials – what we talked about in the pub afterwards as ‘boutique catalogues’.

During the Mashcat session I asked the audience what types of materials/collections it might be interesting to look at with this in mind. In the session itself we came up with three suggestions – so I thought I’d capture these here, and maybe try to start working out what the SolrMARC index configuration and (if necessary) related scripts might look like. I’d hoped to take advantage of having a lot of cataloguing knowledge in the room to drill into some of these ideas in detail, but in the end – entirely down to me – this didn’t happen. Looking back on it I should have suggested breaking into groups to do this so that people could have discussed the detail and brought it back together – next time …

Please add suggestions through the comments for other ‘boutique catalogue’ ideas, additional ideas on what indexes might be useful, and what the search/display experience might be like:

(Born) Digital formats

Main suggestion – provide a ‘file format’ facet
If 007/00 = c, then it’s an electronic resource and there’s information to dig out of 007/01-13
Might be worth looking at 347$$a$$b (as well as other subfields)

Would it be possible to look up file formats on Pronom/Droid and add extra information either when indexing or in the display?
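
Setting the Pronom/Droid question aside, here’s a rough sketch of the extraction side of this using pymarc (a Python library for reading MARC records) – the decision to only use 007/00 = c and 347 $$a/$$b follows the suggestion above, and mapping the raw values to friendly facet labels is left out:

# Sketch: pull candidate 'file format' facet values out of 007 and 347.
# Field/byte choices follow the suggestion above; mapping codes to labels is left out.
from pymarc import MARCReader

def file_format_values(record):
    values = []
    for f007 in record.get_fields('007'):
        if f007.data and f007.data[0] == 'c':          # electronic resource
            values.append('007:' + f007.data[1:14])    # bytes 01-13, for later mapping
    for f347 in record.get_fields('347'):
        values.extend(f347.get_subfields('a', 'b'))
    return values

with open('records.mrc', 'rb') as fh:
    for record in MARCReader(fh):
        print(file_format_values(record))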

Rare books

Main suggestion – use ‘relators’ from the added entry fields – specifically 700$$e. For rare books these can often contain information about the previous owners of the item, which can be of real interest to researchers.
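
As with the sketch above, assuming a pymarc Record, harvesting those relators might look something like this – the set of $$e values that count as ‘previous owner’ information is a guess (‘former owner’ is a standard relator term, but local practice will vary):

# Sketch: collect added-entry (700) names whose relator ($e) suggests provenance.
# The set of relator terms is a guess; adjust for local cataloguing practice.
PROVENANCE_TERMS = {'former owner', 'donor', 'inscriber'}

def previous_owners(record):
    owners = []
    for field in record.get_fields('700'):
        relators = [e.strip(' .,').lower() for e in field.get_subfields('e')]
        if any(r in PROVENANCE_TERMS for r in relators):
            owners.append(' '.join(field.get_subfields('a', 'b', 'c', 'd')))
    return owners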

I’d asked a question about indexing rare books previously on LibCatCode – see http://www.libcatcode.org/questions/42/indexing-rare-book-collections. I suspect it might be worth re-asking on the new Stack Exchange site for Libraries and Information Science

Music

Main suggestions – create faceted indexes on the following:

  • Key
  • Beats per minute
  • Time signature

I’m keen on the music idea – MARC isn’t great for cataloguing music in the first place, and much useful information isn’t exposed in generic library catalogue interfaces, so I had a quick look at where ‘key’ might be stored – it turns out in quite a few places – I started putting together an expression of this that could be dropped into a SolrMaRC index.properties file:
musicalkey_facet = custom, getLinkedFieldCombined(600r:240r:700r:710r:800r:810r:830r)
I’m not sure if you actually need to use the ‘getLinkedFieldCombined’ (probably not). I’m also not sure I’ve got all the places that the key can be recorded explicitly – $$r for musical key appears in lots of places.

What I definitely do know is that although 600$$r etc. can be used to record the musical key, it might appear as a text string in the 245, or possibly a notes (500, 505, etc.) field. Whether it is put explicitly in the 600$$r (or another $$r) will depend on local cataloguing practice. I’m guessing that it might be worth writing a custom indexing script that uses regular expressions to search for expressions of musical key in the 245 and 5XX fields – although it would need to cover multiple languages (I’d say English, German and French at least).
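
Something along these lines, perhaps – the patterns below are a first stab at English, German and French key statements, and would certainly need refining against real records:

# Sketch: look for statements of musical key in free-text fields (245, 5XX).
# The patterns are rough first attempts and will miss plenty of real-world forms.
import re

KEY_PATTERNS = [
    re.compile(r'\bin [A-G](?: (?:flat|sharp))? (?:major|minor)\b', re.I),      # English: "in B flat major"
    re.compile(r'\b[A-Ha-h](?:is|es)?-(?:Dur|Moll)\b'),                         # German: "B-Dur", "fis-Moll"
    re.compile(r'\ben (?:do|ré|mi|fa|sol|la|si)(?: (?:bémol|dièse))? (?:majeur|mineur)\b', re.I),  # French
]

def find_keys(text):
    hits = []
    for pattern in KEY_PATTERNS:
        hits.extend(m.group(0) for m in pattern.finditer(text))
    return hits

print(find_keys('Sonata in B flat major; Sonate B-Dur'))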

I haven’t looked at Beats per minute or Time signature and where you might get that from. It seems obvious that also getting information on what instruments are involved etc. would be of interest.

In fact there has already been some work on a customised Blacklight interface for a Music library – mentioned at https://www.library.ns.ca/node/2851 – although I can’t find any further details right now (and I don’t have access to the Library Hi-Tech journal). If the details of this are published online anywhere I’d be very interested. Also the example of building an index of instruments is one of the examples in the SolrMaRC wiki page on the index.properties file.

Perhaps a final word of caution on all this – you can only build indexes on this rich data if it exists in the MaRC record to start with. The MaRC record can hold much more information than is typically entered, and some of the fields I mention in the examples above may not be commonly used – so either the information isn’t recorded at all, or you would have to write scripts to extract it from notes fields etc. The latter, though painful, might be possible; but in the former case, there is nothing you can do…

MaRC and SolrMaRC

At the recent Mashcat event I volunteered to do a session called ‘making the most of MARC’. What I wanted to do was demonstrate how some of the current ‘resource discovery’ products are based on technology that can really extract value from bibliographic data held in MARC format, and how this creates opportunities for building tools both for users and for library staff.

One of the triggers for the session was seeing, over a period of time, a number of complaints about the limitations of ‘resource discovery’ solutions – I wanted to show that many of the perceived limitations were not about the software, but about the implementation. I also wanted to show that while some technical knowledge is needed, some of these solutions can be run on standard PCs and this puts the tools, and the ability to experiment and play with MARC records, in the grasp of any tech-savvy librarian or user.

Many of the current ‘resource discovery’ solutions available are based on a search technology called Solr – part of a project at the Apache software foundation. Solr provides a powerful set of indexing and search facilities, but what makes it especially interesting for libraries is that there has been some significant work already carried out to use Solr to index MARC data – by the SolrMARC project. SolrMARC delivers a set of pre-configured indexes, and the ability to extract data from MARC records (gracefully handling ‘bad’ MARC data – such as badly encoded characters etc. – as well). While Solr is powerful, it is SolrMARC that makes it easy to implement and exploit in a library context.

SolrMARC is used by two open source resource discovery products – VuFind and Blacklight. Although VuFind and Blacklight have differences, and are written in different languages (VuFind is PHP while Blacklight is Ruby), since they both use Solr and specifically SolrMARC to index MARC records the indexing and search capabilities underneath are essentially the same. What makes the difference between implementations is not the underlying technology but the configuration. The configuration allows you to define what data, from which part of the MARC records, goes into which index in Solr.

The key SolrMARC configuration file is index.properties. Simple configuration can be carried out in one line for example (and see the SolrMARC wiki page on index.properties for more examples and details):

title_t = 245a

This creates a searchable ‘title’ index from the contents of the 245 $$a field. If you want to draw information in from multiple parts of the MARC record, this can be done easily – for example:

title_t = 245ab:246a

Similarly you can extract characters from the MARC ‘fixed fields’:

language_facet = 008[35-37]:041a:041d

This creates a ‘faceted’ index (for browsing and filtering) for the language of the material based on the contents of 008 chars 35-37, as well as the 041 $$a and $$d.

As well as making it easy to take data from specific parts of the MARC record, SolrMARC also comes pre-packaged with some common tasks you might want to carry out on a field before adding to the index. The three most common are:

Removing trailing punctuation – e.g.
publisher_facet = custom, removeTrailingPunct(260b)

This does exactly what it says – removes any punctuation at the end of the field before adding to the index

Use data from ‘linked’ fields – e.g.
title_t = custom, getLinkedFieldCombined(245a)

This takes advantage of the ability in MARC to link MARC fields to alternative representations of the same text – e.g. for the same text in a different language.

Map codes/abbreviations to proper language – e.g.
format = 000[6-7], (map.format)

Because the ‘format’ in the MARC leader (represented here by ‘000’) is recorded as a code, when creating a search index it makes sense to translate this into more meaningful terms. The actual mapping of terms can either be done in the index.properties file, or in separate mapping files. The mapping for the example above looks like:

map.format.aa = Book
map.format.ab = Serial
map.format.am = Book
map.format.as = Serial
map.format.ta = Book
map.format.tm = Book
map.format = Unknown

These (and a few other) built-in functions make it easy to index the MARC record, but you may still find that they don’t cover exactly what you want to achieve. For example, they don’t allow for ‘conditional’ indexing (such as ‘only index the text in field XXX when the record is for a Serial’), or cases where you want to extract only specific text from a MARC subfield.

Happily, you can extend the indexing by writing your own scripts which add new functions. There are a couple of ways of doing this, but the easiest is to write ‘BeanShell’ scripts (basically Java) which you can then call from the index.properties file. Obviously we are going beyond simple configuration and into programming at this point, but with a little knowledge you can start to work the data from the MARC record even harder.

Once you’ve written a script, you can use it from index.properties as follows:

format = script(format-full.bsh), getFormat

This uses the getFormat function from the format-full.bsh script. In this case I was experimenting with extracting not just basic ‘format’ information, but also more granular information on the type of content as described in the 008 field – but the meaning of the 008 field varies based on the type of material being catalogued, so you get code like:

f_000 = f_000.toUpperCase();
if (f_000.startsWith("C"))
{
result.add("MusicalScore");
String formatCode = indexer.getFirstFieldVal(record, null, "008[18-19]").toUpperCase();
if (formatCode.equals("BT")) result.add("Ballet");
if (formatCode.equals("CC")) result.add("ChristianChants");
if (formatCode.equals("CN")) result.add("CanonsOrRounds");
if (formatCode.equals("DF")) result.add("Dances");
if (formatCode.equals("FM")) result.add("FolkMusic");
if (formatCode.equals("HY")) result.add("Hymns");
if (formatCode.equals("MD")) result.add("Madrigals");
if (formatCode.equals("MO")) result.add("Motets");
if (formatCode.equals("MS")) result.add("Masses");
if (formatCode.equals("OP")) result.add("Opera");
if (formatCode.equals("PT")) result.add("PartSongs");
if (formatCode.equals("SG")) result.add("Songs");
if (formatCode.equals("SN")) result.add("Sonatas");
if (formatCode.equals("ST")) result.add("StudiesAndExercises");
if (formatCode.equals("SU")) result.add("Suites");
}
else if (f_000.startsWith("D"))

(I’ve done an example file for parsing out detailed format/genre details which you can get from https://github.com/ostephens/solrmarc-indexproperties/blob/master/index_scripts/format-full.bsh – but although more granular it still doesn’t exploit all possible granularity from the MARC fixed fields)

Once you’ve configured the indexing, you run this over a file of MARC records. The screenshot here shows a Blacklight instance with a faceted ‘format’ index which I created using a custom indexing script.

 

These tools excite me for a couple of reasons:

  1. A shared platform for MARC indexing, with a standard way of programming extensions, gives the opportunity to share techniques and scripts across platforms – if I write a clever set of BeanShell scripts to calculate page counts from the 300 field (along the lines demonstrated by Tom Meehan in another Mashcat session), you can use the same scripts with no effort in your SolrMARC installation
  2. The ability to run powerful, but easy to configure, search tools on standard computers. I can get Blacklight or VuFind running on a laptop (Windows, Mac or Linux) with very little effort, and I can have a few hundred thousand MARC records indexed using my own custom routines and searchable via an interface I have complete control over

While the second of these points may seem like a pretty niche market – and of course it is – we are increasingly seeing librarians and scholars making use of this kind of solution, especially in the digital humanities space. These solutions are relatively cheap and easy to run. Indexing a few hundred thousand MARC records takes a little time, but we are talking tens of minutes, not hours – you can try stuff, delete the index and try something else. You can focus on drawing out very specific values from the MARC record and even design specialist indexes and interfaces for specific kinds of material – this is not just within the grasp of library services, but of the individual researcher.

In the pub after the main Mashcat event had finished, we were chatting about the possibilities offered by Blacklight/VuFind and SolrMARC. I used a phrase I know I borrowed from someone else, but I don’t know who – ‘boutique search’ – highly customised search interfaces that serve a specific audience or collection.

A final note – we have the software; what we need is data. Unless more libraries follow the lead of Harvard, Cambridge and others and make MARC records available to use, any software which consumes MARC records is of limited use …