Boutique Catalogues

In my previous post on MARC and SolrMARC I described how SolrMARC can be used, as part of Blacklight, VuFind or other discovery layers, to create indexes from various parts of the MARC record. My question to those at Mashcat was “what would you do, or like to do, with these capabilities?” – I was especially interested in how you might offer better search experiences for specific types of materials – what we talked about in the pub afterwards as ‘boutique catalogues’.

During the Mashcat session I asked the audience what types of materials/collections it might be interesting to look at with this in mind. In the session itself we came up with three suggestions – so I thought I’d capture these here, and maybe try to start working out what the SolrMARC index configuration and (if necessary) related scripts might look like. I’d hoped to take advantage of having a lot of cataloguing knowledge in the room to drill into some of these ideas in detail, but in the end – entirely down to me – this didn’t happen. Looking back on it I should have suggested breaking into groups to do this so that people could have discussed the detail and brought it back together – next time …

Please use the comments to suggest other ‘boutique catalogue’ ideas, along with thoughts on what indexes might be useful and what the search/display experience might look like:

(Born) Digital formats

Main suggestion – provide a ‘file format’ facet:

  • If 007/00 = c, then it’s an electronic resource and there’s information to dig out of 007/01-13
  • It might also be worth looking at 347 $$a and $$b (as well as other subfields)

Would it be possible to look up file formats on Pronom/Droid and add extra information either when indexing or in the display?
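
As a very rough sketch (untested, and the index name is my own invention), the 347 part of this could be a single line in index.properties, along the same lines as the examples in my previous post:

fileformat_facet = 347a:347b

Digging the detail out of the 007 is probably a job for a custom indexing script, since you only want to look at 007/01-13 when 007/00 = c, and the values there are codes that would need mapping to something human-readable.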

Rare books

Main suggestion – use ‘relators’ from the added entry fields – specifically 700$$e. For rare books these can often contain information about the previous owners of the item, which can be of real interest to researchers.
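
As a starting point (again untested – the index name is invented, and what is worth including will depend on local practice) the index.properties entry might look something like:

relator_facet = 100e:110e:700e:710e

Depending on how the records have been catalogued, it might also be worth picking up the relator codes in $$4 alongside the relator terms in $$e.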

I’d asked a question about indexing rare books previously on LibCatCode – see http://www.libcatcode.org/questions/42/indexing-rare-book-collections. I suspect it might be worth re-asking on the new Stack Exchange site for Libraries and Information Science

Music

Main suggestions – create faceted indexes on the following:

  • Key
  • Beats per minute
  • Time signature

I’m keen on the music idea – MARC isn’t great for cataloguing music in the first place, and much useful information isn’t exposed in generic library catalogue interfaces – so I had a quick look at where ‘key’ might be stored. It turns out it can be recorded in quite a few places, and I started putting together an expression of this that could be dropped into a SolrMARC index.properties file:
musicalkey_facet = custom, getLinkedFieldCombined(600r:240r:700r:710r:800r:810r:830r)
I’m not sure if you actually need to use the ‘getLinkedFieldCombined’ (probably not). I’m also not sure I’ve got all the places that the key can be recorded explicitly – $$r for musical key appears in lots of places.

What I definitely do know is that although 600$$r etc. can be used to record the musical key, it might instead appear as a text string in the 245, or possibly in a notes (500, 505, etc.) field. Whether it is put explicitly in the 600$$r (or another $$r) will depend on local cataloguing practice. I’m guessing it might be worth writing a custom indexing script that uses regular expressions to search for expressions of musical key in the 245 and 5XX fields – although it would need to cover multiple languages (I’d say English, German and French at least).
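
To give a flavour of what such a script might look like, here is a very rough BeanShell sketch – untested, with the function name, the fields checked and the regular expression all just my guesses, and only covering simple English statements like ‘in C major’:

import org.marc4j.marc.*;
import java.util.*;
import java.util.regex.*;

Set getMusicalKeyFromText(Record record) {
    Set result = new LinkedHashSet();
    // matches e.g. "in C major", "in F sharp minor" - German and French
    // equivalents (e.g. "C-Dur", "ut majeur") would need patterns of their own
    Pattern keyPattern = Pattern.compile("in ([A-G])[ -]?(sharp|flat)? ?(major|minor)", Pattern.CASE_INSENSITIVE);
    String[] tags = new String[] { "245", "500", "505" };
    for (int i = 0; i < tags.length; i++) {
        List fields = record.getVariableFields(tags[i]);
        for (Iterator fi = fields.iterator(); fi.hasNext();) {
            DataField field = (DataField) fi.next();
            for (Iterator si = field.getSubfields().iterator(); si.hasNext();) {
                Subfield sf = (Subfield) si.next();
                Matcher m = keyPattern.matcher(sf.getData());
                while (m.find()) {
                    String key = m.group(1).toUpperCase();
                    if (m.group(2) != null) key = key + " " + m.group(2).toLowerCase();
                    result.add(key + " " + m.group(3).toLowerCase());
                }
            }
        }
    }
    return result;
}

In a real implementation you would probably check the explicit $$r subfields first and only fall back to this kind of free-text matching when nothing is found there.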

I haven’t looked at beats per minute or time signature, or where you might get those from. It also seems obvious that information on which instruments are involved etc. would be of interest.

In fact there has already been some work on a customised Blacklight interface for a music library – mentioned at https://www.library.ns.ca/node/2851 – although I can’t find any further details right now (and I don’t have access to the Library Hi-Tech journal). If the details of this are published online anywhere I’d be very interested. Building an index of instruments is also one of the examples on the SolrMARC wiki page about the index.properties file.

Perhaps a final word of caution on all this – you can only build indexes on this rich data if it exists in the MARC record to start with. The MARC record can hold much more information than is typically entered, and some of the fields I mention in the examples above may not be commonly used – so either the information isn’t recorded at all, or you would have to write scripts to extract it from notes fields etc. The latter, though painful, might be possible; but in the former case, there is nothing you can do…

MARC and SolrMARC

At the recent Mashcat event I volunteered to do a session called ‘making the most of MARC’. What I wanted to do was demonstrate how some of the current ‘resource discovery’ products are based on technology that can really extract value from bibliographic data held in MARC format, and how this creates opportunities to build tools for both users and library staff.

One of the triggers for the session was seeing, over a period of time, a number of complaints about the limitations of ‘resource discovery’ solutions – I wanted to show that many of the perceived limitations were not about the software, but about the implementation. I also wanted to show that while some technical knowledge is needed, some of these solutions can be run on standard PCs and this puts the tools, and the ability to experiment and play with MARC records, in the grasp of any tech-savvy librarian or user.

Many of the current ‘resource discovery’ solutions available are based on a search technology called Solr – part of a project at the Apache Software Foundation. Solr provides a powerful set of indexing and search facilities, but what makes it especially interesting for libraries is that some significant work has already been carried out to use Solr to index MARC data – by the SolrMARC project. SolrMARC delivers a set of pre-configured indexes, and the ability to extract data from MARC records (gracefully handling ‘bad’ MARC data – such as badly encoded characters etc. – as well). While Solr is powerful, it is SolrMARC that makes it easy to implement and exploit in a library context.

SolrMARC is used by two open source resource discovery products – VuFind and Blacklight. Although VuFind and Blacklight have differences, and are written in different languages (VuFind is PHP while Blacklight is Ruby), since they both use Solr and specifically SolrMARC to index MARC records the indexing and search capabilities underneath are essentially the same. What makes the difference between implementations is not the underlying technology but the configuration. The configuration allows you to define what data, from which part of the MARC records, goes into which index in Solr.

The key SolrMARC configuration file is index.properties. Simple configuration can be carried out in a single line – for example (see the SolrMARC wiki page on index.properties for more examples and details):

title_t = 245a

This creates a searchable ‘title’ index from the contents of the 245 $$a field. If you want to draw information in from multiple parts of the MARC record, this can be done easily – for example:

title_t = 245ab:246a

Similarly you can extract characters from the MARC ‘fixed fields’:

language_facet = 008[35-37]:041a:041d

This creates a ‘faceted’ index (for browsing and filtering) for the language of the material based on the contents of 008 chars 35-37, as well as the 041 $$a and $$d.

As well as making it easy to take data from specific parts of the MARC record, SolrMARC also comes pre-packaged with some common tasks you might want to carry out on a field before adding to the index. The three most common are:

Removing trailing punctuation – e.g.
publisher_facet = custom, removeTrailingPunct(260b)

This does exactly what it says – removes any punctuation at the end of the field before adding to the index

Use data from ‘linked’ fields – e.g.
title_t = custom, getLinkedFieldCombined(245a)

This takes advantage of the ability in MARC to link fields to alternative representations of the same text (held in 880 fields) – e.g. the same text in a different script or language.

Map codes/abbreviations to meaningful terms – e.g.
format = 000[6-7], (map.format)

Because the ‘format’ in the MARC leader (represented here by ‘000’) is recorded as a code, it makes sense to translate it into more meaningful terms when creating a search index. The actual mapping of terms can either be done in the index.properties file, or in separate mapping files. The mapping for the example above looks like:

map.format.aa = Book
map.format.ab = Serial
map.format.am = Book
map.format.as = Serial
map.format.ta = Book
map.format.tm = Book
map.format = Unknown

These (and a few other) built-in functions make it easy to index the MARC record, but you may still find that they don’t cover exactly what you want to achieve. For example, they don’t allow for ‘conditional’ indexing (such as ‘only index the text in field XXX when the record is for a serial’), or for extracting only specific text from a MARC subfield.

Happily, you can extend the indexing by writing your own scripts which add new functions. There are a couple of ways of doing this, but the easiest is to write ‘bean shell’ scripts (basically Java) which you can then call from the index.properties file. Obviously we are going beyond simple configuration and into programming at this point, but with a little knowledge you can start to work the data from the MARC record even harder.

Once you’ve written a script, you can use it from index.properties as follows:

format = script(format-full.bsh), getFormat

This uses the getFormat function from the format-full.bsh script. In this case I was experimenting with extracting not just basic ‘format’ information, but also more granular information on the type of content as described in the 008 field – but the meaning of the 008 field varies based on the type of material being catalogued, so you get code like this (f_000 here holds the characters taken from the leader – ‘000’ – so the first test is for leader/06 = ‘c’, notated music, before looking at 008/18-19 for the form of composition):

f_000 = f_000.toUpperCase();
if (f_000.startsWith("C"))
{
result.add("MusicalScore");
String formatCode = indexer.getFirstFieldVal(record, null, "008[18-19]").toUpperCase();
if (formatCode.equals("BT")) result.add("Ballet");
if (formatCode.equals("CC")) result.add("ChristianChants");
if (formatCode.equals("CN")) result.add("CanonsOrRounds");
if (formatCode.equals("DF")) result.add("Dances");
if (formatCode.equals("FM")) result.add("FolkMusic");
if (formatCode.equals("HY")) result.add("Hymns");
if (formatCode.equals("MD")) result.add("Madrigals");
if (formatCode.equals("MO")) result.add("Motets");
if (formatCode.equals("MS")) result.add("Masses");
if (formatCode.equals("OP")) result.add("Opera");
if (formatCode.equals("PT")) result.add("PartSongs");
if (formatCode.equals("SG")) result.add("Songs");
if (formatCode.equals("SN")) result.add("Sonatas");
if (formatCode.equals("ST")) result.add("StudiesAndExercises");
if (formatCode.equals("SU")) result.add("Suites");
}
else if (f_000.startsWith("D"))
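// ... and so on for the other leader/06 values - see the full script linked below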

(I’ve done an example file for parsing out detailed format/genre details which you can get from https://github.com/ostephens/solrmarc-indexproperties/blob/master/index_scripts/format-full.bsh – although even this doesn’t exploit all of the granularity available in the MARC fixed fields)
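
As another illustration of the sort of thing a custom script opens up – the ‘conditional’ indexing I mentioned above – here is a rough, untested sketch that indexes the publication frequency from 310 $$a, but only when the leader says the record describes a serial. The function name and the script file name are my own invention, and I’m assuming the leader can be addressed as ‘000’ in getFirstFieldVal in the same way it is in index.properties:

import org.marc4j.marc.*;
import java.util.*;

Set getSerialFrequency(Record record) {
    Set result = new LinkedHashSet();
    // leader/07 (bibliographic level) - 's' indicates a serial
    String biblevel = indexer.getFirstFieldVal(record, null, "000[7]");
    if (biblevel == null || !biblevel.equalsIgnoreCase("s")) return result;
    // only reached for serials: pull out the current publication frequency (310 $a)
    List fields = record.getVariableFields("310");
    for (Iterator fi = fields.iterator(); fi.hasNext();) {
        DataField field = (DataField) fi.next();
        Subfield sf = field.getSubfield('a');
        if (sf != null) result.add(sf.getData());
    }
    return result;
}

This would then be called from index.properties in the same way as the format example:

frequency_facet = script(serial-frequency.bsh), getSerialFrequency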

Once you’ve configured the indexing, you run it over a file of MARC records. The screenshot here shows a Blacklight installation with a faceted ‘format’ index which I created using a custom indexing script.


These tools excite me for a couple of reasons:

  1. A shared platform for MARC indexing, with a standard way of programming extensions, gives the opportunity to share techniques and scripts across platforms – if I write a clever set of bean shell scripts to calculate page counts from the 300 field (along the lines demonstrated by Tom Meehan in another Mashcat session), you can use the same scripts with no effort in your SolrMARC installation (there’s a rough sketch of the kind of thing I mean just after this list)
  2. The ability to run powerful, but easy to configure, search tools on standard computers. I can get Blacklight or VuFind running on a laptop (Windows, Mac or Linux) with very little effort, and I can have a few hundred thousand MARC records indexed using my own custom routines and searchable via an interface I have complete control over
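
For the first of these, a rough (untested) sketch of the sort of script I mean – my own illustration, not Tom’s actual approach – might pull a page count out of 300 $$a with a regular expression:

import org.marc4j.marc.*;
import java.util.*;
import java.util.regex.*;

Set getPageCount(Record record) {
    Set result = new LinkedHashSet();
    // looks for a number followed by 'p', so "xv, 323 p. :" gives 323
    Pattern pages = Pattern.compile("(\\d+) ?p");
    List fields = record.getVariableFields("300");
    for (Iterator fi = fields.iterator(); fi.hasNext();) {
        DataField field = (DataField) fi.next();
        Subfield sf = field.getSubfield('a');
        if (sf == null) continue;
        Matcher m = pages.matcher(sf.getData());
        if (m.find()) result.add(m.group(1));
    }
    return result;
}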

While the second of these points may seem like a pretty niche market – and of course it is – we are increasingly seeing librarians and scholars making use of this kind of solution, especially in the digital humanities space. These solutions are relatively cheap and easy to run. Indexing a few hundred thousand MARC records takes a little time, but we are talking tens of minutes, not hours – you can try stuff, delete the index and try something else. You can focus on drawing out very specific values from the MARC record and even design specialist indexes and interfaces for specific kinds of material – this is within the grasp not just of library services, but of the individual researcher.

In the pub after the main Mashcat event had finished, we were chatting about the possibilities offered by Blacklight/VuFind and SolrMARC. I used a phrase I know I borrowed from someone else, but I don’t know who – ‘boutique search’ – highly customised search interfaces that serve a specific audience or collection.

A final note – we have the software; what we need is data. Unless more libraries follow the lead of Harvard, Cambridge and others and make MARC records available to use, any software which consumes MARC records is of limited use …

Dutch Culture Link

This session by Lukas Koster.

Lukas works for the Library of the University of Amsterdam – he was ‘system librarian’, then Head of the Library Systems Department, and is now Library Systems Coordinator – which means he is responsible for MetaLib, SFX … and new innovative stuff – including mobile web and the Dutch Culture Link project – which is what he is going to talk about today.

Lukas is a ‘shambrarian’ – someone who pretends they know stuff!

Blogs at http://commonplace.net

Lukas described the situation in the Netherlands regarding libraries, technology and innovation. How much leeway there is to get involved in innovation and mashups depends very much on individual institutions and the local situation. There is a large Library 2.0 community – but much more in public libraries – especially user-facing widgets and UI stuff – via a Ning network. They have ‘Happe.Nings’ – but these look more at social media etc. rather than data mashups. Lukas blogged the last one at http://www.lukaskoster.net/2010/06/happe-ning-in-haarlem. The next Happe.Ning is about streaming music services

Lukas talking about Linked Open Data project – partners are:

  • DEN – Digital Heritage Foundation of the Netherlands – digital standards for heritage institutions, promoting linked open data – museums etc. – simple guidelines on how to publish linked open data
  • UBA – library of the University of Amsterdam
  • TIN – Theater Institute of the Netherlands

Objectives of project:

  • Set example
  • Proof of concept
  • Pilot
  • Convince heritage institutions
  • Convince TIN, UBA management

Project called “Dutch Culture Link” – aim to link cultural data and institutions through semantic web

Linked data projects – 2 viewpoints – publishing and use – no point publishing without use. Lukas keen that project includes examples of how the data can be used.

So – the initial idea is that the UBA (Aleph) OPAC will consume data published from the TIN collection and use it to enhance the OPAC display

TIN uses the AdLib library system (AdLib is also used for museums, archives etc.) – TIN contains objects and audio-visual material as well as bibliographic items

Started by modelling the TIN collection data – the entities are:

  • Person
  • Part (person plays part in play)
  • Appearance
  • Production
  • Performance
  • Location
  • Play

Images, text files, a-v material related to these entities – e.g. Images from a performance

Lukas talking about FRBR – “a library inward vision” – deals with bibliographic materials – but can perhaps be mapped to plays…

  • Work = Play
  • Expression = Production?
  • Manifestation = Production?
  • Item = Performance (one time event)

FRBR interesting model, but needs to be extended to the real world! (not just inward looking for library materials)

Questions that arise:

  • Which vocabulary/ontology to use?
  • How to implement RDF?
  • How to format URIs?
  • Which tool, techniques, languages?
  • How to find/get published linked data?
  • How to process retrieved linked data?

Needed training – but no money! Luckily they were able to attend a free DANS Linked Open Data workshop

Decided to start with a quick and dirty approach:

  • Produced URIs for data entities in TIN data – expressed data as JSON (not RDF)
  • At the OPAC end:
  • Javascript: construct TIN URI
  • Process JSON
  • Present data in record

URI:

  • <base-url>/person/<personname>
  • <base-url>/play/<personname>/<title>
  • <base-url>/production/<personname>/<title>/<opening>

e.g. URI <base-url>/person/Beckett, Samuel returns JSON record

So, in the OPAC, find the author name, form the URI from information in the MARC record (stripping out any extraneous information), get the JSON and parse it with javascript, and display the result in the OPAC.

But – this is not Linked Data yet – it isn’t using a formal ontology, and isn’t using RDF. But this is the approach – quick and dirty, tangible results

Next steps: at ‘Publishing’ end

  • Vocabulary for Production/Performance subject area
  • Vocabulary for Person (FOAF?), Subject (SKOS?)
  • RDF in JSON (internal relationships)
  • Publish RDF/XML
  • More URIs – for performances etc.
  • External links
  • Content negotiation
  • Links to a-v objects etc.

At ‘use’ end:

  • More ‘search’ fields (e.g. Title)
  • Extend presentation
  • Include relations
  • Clickable
  • More information – e.g. could list multiple productions of same play from script in library catalogue

Issues:

  • Need to use generic, really unique URIs
  • Person: ids (VIAF?)
  • Plays: ids

Open Bibliography (and why it shouldn’t have to exist)

Today I’m at Mashspa – another Mashed Library event.

Ben O’Steen is talking about a JISC project he is currently involved with. The project is about getting bibliographic information into the open. For Ben, ‘open’ means “publishing bibliographic information under a permissive license to encourage indexing, re-use and re-purposing”. Ben believes that some aspects – such as attribution – should be part of the ‘community norm’, not written into a license.

In essence an open bibliography is all about Advertising! Telling other people what you have.

Bibliographic information allows you to:

  • Identify and find an item you know you want
  • Discover related items or items you believe you want
  • Serendipitously discover items you would like without knowing they might exist
  • …other stuff

This list (from top to bottom) requires increasing investment. Advertising isn’t about spending money – it’s about investment.

To maximise returns you maximise the audience

Ben asks “Should the advertising target ‘b2b’ or ‘consumers’?”

Ben acknowledges that it may not be necessary to completely open up the data set – but believes that in the long term open is the way forward.

Some people ask “Can’t I just scrape sites and use the data – it’s just facts isn’t it?”. However, Directive 96/9/EC of the European Parliament codifies a new protection based on “sui generis” rights – rights earned by the “sweat of the brow”. So far this law seems to have only solidified existing monopolies – not generated new economic growth (which was apparently the intention of the law)

When the project asked UK PubMedCentral if they could reproduce the bibliographic data shared through its OAI-PMH service, the answer was ‘Generally, no’ – paraphrasing, UK PubMedCentral basically said they didn’t have the rights to give away the data (except the material from Open Access journals). NOTE – this is the metadata, not the full-text articles, we are talking about – they said they could not grant the right to reuse the metadata [would this, for example, mean that you could not use this metadata in a reference management package to produce a bibliography?]

Principles:

  • Assign a license when you publish data
  • Use a recognised license
  • If you want your data to be effectively used and added to by others it should be open – in particular non-commercial and other restrictive licenses should be avoided
  • Strongly recommend using CC0 or PDDL (latter in the EU only)
  • Strongly encourage release of bibliographic data into the ‘Open’

Sliding scale:

  • Identify – e.g. for an author a simple identifier could just be a name (cheap); more expensive identifiers could be URIs or ORCIDs
  • Discover –
  • Serendipity –

If you increase investment you get more use – difficult to reuse data without identifiers for example.

1. Where there is human input, there is interpretation – people may interpret standards in different ways, use fields in different ways

Ben found a lot of variation across data in PubMed data set – different journals or publishers interpret where information should go in different ways – “Standards don’t bring interoperability, people do”

2. Data has been entered and curated without large-scale sharing as a focus – lots of implicit, contextual information is left out – e.g. if you are working in a specialist Social Science library, perhaps you don’t mention that an item is about Social Sciences as that is implicit from the (original) context

3. Data quality is generally poor – an example from the BL: ISBN = £2.50!

In a closed data set you may not discover errors – when you have lots of people looking at data (with different uses in mind) you pick up different types of error.

The data clean-up process is going to be PROBABILISTIC – we cannot be sure – by definition – that we are accurate when we deduplicate or disambiguate. Typical methods:

  • Natural Language Processing
  • Machine learning techniques
  • String metrics and old-school record deduplication – the easiest of the 3 (for Ben)

Not just about matching uniquely – looking at level of similarity and making decisions
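
As a toy illustration of the string-metric end of this (my example, not Ben’s) – a minimal sketch that normalises two title strings and scores their similarity using Levenshtein edit distance; real matching would combine several such scores (title, author, date, identifiers) before making a decision:

public class TitleSimilarity {

    // crude normalisation: lower-case, strip everything but letters, digits and spaces
    static String normalise(String s) {
        return s.toLowerCase().replaceAll("[^a-z0-9 ]", " ").replaceAll("\\s+", " ").trim();
    }

    // standard dynamic-programming Levenshtein edit distance
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // similarity score between 0.0 (completely different) and 1.0 (identical)
    static double similarity(String a, String b) {
        String na = normalise(a), nb = normalise(b);
        int maxLen = Math.max(na.length(), nb.length());
        return maxLen == 0 ? 1.0 : 1.0 - ((double) levenshtein(na, nb) / maxLen);
    }

    public static void main(String[] args) {
        // a likely duplicate pair scores close to 1.0 ...
        System.out.println(similarity("Waiting for Godot.", "Waiting For Godot"));
        // ... while an unrelated pair scores much lower
        System.out.println(similarity("Waiting for Godot", "Endgame"));
    }
}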

List of string metrics at http://staffwww.dcs.shef.ac.uk/people/s.chapman/stringmetrics.html

Fellegi-Sunter method for old-school deduplication – not great, but works OK.

Can now take a map-reduce approach (distribute processing across servers)

Do it yourself:

When de-duping you need to be able to unmerge so you can correct things if necessary – keep the canonical data that you hold distinct from the data that you publish to the public

Directions with Bibliographic data: So far much effort has been directed at ‘Works’ – we need to put much more effort into their ‘Networks’ – starts to help (for example) disambiguate people

Network examples:

  • A cites B
  • Works by a given author
  • Works cited by a given author
  • Works citing articles that have since been disproved, retracted or withdrawn
  • Co-authors
  • …other connections that we’ve not even thought of yet

Ben says – Don’t get hung up on standards …

VuFind Virtual Bootcamp

As part of the Lucero project I’m currently working on at the Open University, I’m looking at lots of library catalogue records. While exploring the first set of data I was playing with (around 25,000 records in MARC format) it struck me that one of the more recent library ‘search’ products might be helpful. These new products (sometimes known as ‘next gen’ (NG) discovery platforms) are being taken up by libraries to replace their (often aging, rarely pretty) ‘OPACs’ (online public access catalogues) which tend to be a web interface onto what is, at heart, a ‘business’ system – one that administers books, users, serials, and other library stuff.

These discovery platforms tend to work by taking a regular import of data from the library catalogue, and specialise in indexing that data rather than handling the many other administrative tasks that sit behind the library catalogue. Because they use dedicated software that isn’t worrying about any other functionality, these new platforms tend to be much faster at returning search results, and give a lot of flexibility in how indexes are built on the data.

While many of the available products are commercial pieces of software (or, increasingly, services), there are a couple of relatively high-profile open source solutions – VuFind and Blacklight. If you are interested in a comparison of these two systems, keep an eye on the CREDAUL project at the University of Sussex (http://credaul.wordpress.com), which is looking at both.

So I decided I’d try installing VuFind and use that to explore the data. VuFind is PHP based, but also makes use of the Solr search platform, which runs on Java. It took me a couple of hours or so of fiddling to get the whole thing working – but I thought that was pretty good going – and by the end of it I had my 25k records fully indexed and was ready to use the system to explore the data.

All of this gave me an idea. This is something you can run on a laptop, and it’s a great way of looking at your library catalogue data – often exposing issues that you can then correct in the catalogue if you want to. So I thought that at the next Mashed Library event (Mashspa in Bath) we could run a VuFind ‘bootcamp’, helping delegates get VuFind installations up and running.

Being an impatient sort, I decided 29th October was far too long to wait to get started, so I thought that maybe I could do a ‘virtual’ version of the bootcamp beforehand (and that would also make sure I was prepared on the day!). So the idea is that I’m going to post weekly blog posts dealing with the installation of VuFind step by step. I’ll focus on Windows, but I already have some people who are interested in doing an install on Linux and Mac OS X. Alongside these, I’ll run weekly ‘support sessions’ where I’ll be online to try to help work through problems/issues that people are having – the idea is that these will be live sessions, although I don’t know yet whether that will be via chat, voice or something else.

Anyway, the starting point is this blog post, and this forum on the Mashed Library site. If you are interested in joining in, sign up to http://www.mashedlibrary.com/groups/vufind-virtual-bootcamp/ and follow along – I’m intending to post the first set of instructions within the week, with a support session to follow shortly after.

Finally if you are interested in the various ‘next gen’ discovery interfaces for libraries, I’d recommend having a look at this list of JISC projects http://code.google.com/p/jisclms/w/list that all deal with improving/experimenting with the library discovery interface and experience.