It's time to change library systems

Recently Chris Keene (University of Sussex) sent an email to the LIS-E-RESOURCES email list about the fact that in academic libraries we are now doing a lot more ‘import’ and ‘export’ of records in our library management systems – bringing in bibliographic records from a variety of sources like book vendors/suppliers, e-resource systems and institutional repositories. He was looking for some shared experience of how other sites coped.

One of the responses mentioned the new ‘next generation’ search systems that some libraries have invested in, and Chris said:

“Next gen catalogues are – I think – certainly part of the solution, but only when you just want to make the records available via your local web interface.”

One of the points he made was that the University of Sussex provides records from their library management system to others to allow Union catalogues to be built – e.g. InforM25, COPAC, Suncat.

I sympathise with Chris, but I can’t help but think this is the point at which we have to start doing things a bit differently – so I wrote a response to the list, but thought that I’d blog a version of it as well:

I agree that library systems could usefully support much better bulk processing tools (although there are some good external tools like MarcEdit of course – and scripting/programming tools, e.g. the MARC Perl module, if you have people who can program them). However, I'd suggest that we need to change the way we think about recording and distributing information about our resources, especially in the light of investment in separate 'search' products such as Aquabrowser, Primo, Encore, Endeca, &c. &c.

If we consider the whole workflow here, it seems to me that as soon as you have a separate search interface the role of the 'library system' needs to be questioned – what are you using it for, and why? I'm not sure funnelling resources into it so they can then be exported to another system is really very sensible (although I absolutely understand why you end up doing it).

I think that once you are pushing stuff into Aquabrowser (taking Sussex as an example) there is little point in also pushing it into the catalogue – what extra value does this add? For books (print or electronic) you may continue to order them via the library system – but you only need an order record in there, not anything more substantial – you can put the 'substantial' record into Aquabrowser. The library system web interface will still handle item level information and actions (reservations/holds etc.) – but again, you don't need a substantial bib record for these to work – the user has done the 'searching' in the search system.

For the ejournals you could push directly from SFX into Aquabrowser – why push via the library system? Similarly for repositories – it really is just creating work to convert these into MARC (probably from DC) to get them into your library system, only to export them again for Aquabrowser (which seems to speak OAI anyway).

One of your issues is that you still need to put stuff into your library system, as this feeds other places – for example at Imperial we send our records to CURL/COPAC as well as other places – but this is a poor argument going forward – how long before we see COPAC change the way it works to take advantage of different search technology (MIMAS have just licensed the Autonomy search product …). Anyway – we need to work with those consuming our records to work out more sensible solutions in the current environment.

I'd suggest what we really need to think about is a common 'publication' platform – a way of all of our systems outputting records in a way that can then be easily accessed by a variety of search products – whether our own local ones, remote union ones, or even ones run by individual users. I'd go further and argue that platform already exists – it is the web! If each of your systems published each record as a 'web page' (either containing structured data, or even serving an alternative version of the record depending on whether a human or machine is asking for the resource – as described in Cool URIs), then other systems could consume this to build search indexes – and you've always got Google of course… I note that Aquabrowser supports web crawling – could it cope with some extra structured data in the web pages (e.g. RDFa)?
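To make the 'Cool URIs' idea a little more concrete, here is a minimal sketch of the kind of content negotiation described there: the same record URL serves a human-readable page to browsers and structured data to machines, depending on the Accept header. The media types and the decision logic here are illustrative assumptions, not any real catalogue's API.

```python
def choose_representation(accept_header: str) -> str:
    """Pick a representation of a catalogue record for a request.

    Returns 'rdf' when the client explicitly asks for RDF/XML,
    otherwise falls back to 'html' (the human-readable page).
    """
    # Split "text/html, application/rdf+xml;q=0.9" into bare media types.
    accepted = [part.split(";")[0].strip() for part in accept_header.split(",")]
    if "application/rdf+xml" in accepted:
        return "rdf"
    return "html"

# A harvester asking for structured data gets the machine-readable record...
print(choose_representation("application/rdf+xml"))              # rdf
# ...while an ordinary browser gets the web page.
print(choose_representation("text/html,application/xhtml+xml"))  # html
```

A real implementation would sit behind the record's URL and respond with an HTTP redirect or the appropriate document, but the decision itself is as small as this.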

I have to admit that I may be over estimating how simple this would be – but it definitely seems to me this is the way to go – we need to adapt our systems to work with the web, and we need to start now.

The Future is Analogue

"Far out in the uncharted backwaters of the unfashionable end of the western spiral arm of the Galaxy lies a small unregarded yellow sun.

Orbiting this at a distance of roughly ninety-two million miles is an utterly insignificant little blue green planet whose ape-descended life forms are so amazingly primitive that they still think digital watches are a pretty neat idea."

Douglas Adams, The Hitchhiker's Guide to the Galaxy

Digital Libraries, Digital Repositories, Born Digital, Digital Objects – the idea of digital information has become an intrinsic part of the library landscape in the 21st century. However, I believe that as we manage more information in digital formats, we need to think about managing it in analogue, rather than digital, ways.

What do I mean by 'digital' and 'analogue' in this context? Well – to be clear, I'm in favour of using computers to help manage our data – in fact, I think this is key to our ability to take an 'analogue' approach!

Digital values are absolute – something is either on or off, 1 or 0, black or white. Analogue values live along a continuous scale – from black to white and all the shades of grey in between. Computers store information as a series of bits – which can either be on or off – there is no grey here, a bit is either on (1) or off (0) – they are literally digital.

When dealing with physical items on a shelf, and entries in a printed or card catalogue, it is difficult to do anything but take a digital approach to managing your library – something is either on this shelf, or that shelf; on this card or that card; about this subject or about that subject.

Even now that we don't rely on printed/card catalogues, and many items are available in electronic rather than physical format, we are still managing our collections in this 'digital' way. We treat all information in our catalogues as 'absolute' – from titles to subject headings.

I've heard Tim Spalding of LibraryThing talk about this in terms of subject headings – he said 'somebody wins' when you assign subject headings in a traditional library catalogue.

Even questions of fact, which you'd generally expect to have a single answer may not be entirely 'digital' (right or wrong). The classic example used in library school for reference questions is 'how high is Mount Everest?' – if you check several reference works you may come up with several answers – Wikipedia covers some of the various answers and why they are different.

At this point you may be wondering what the alternative is – you've still got to allocate a subject heading at some point (assign a title, author etc.) – right? Well, I think the answer lies in one of the most effective mechanisms for storing and retrieving information we've got – the web.

What makes the web 'analogue' rather than 'digital' in the way I'm using the terms is the link. We can see this clearly in the way Google was originally designed to work. In "The Anatomy of a Large-Scale Hypertextual Web Search Engine" Sergey Brin and Larry Page describe how Google was designed to make use "of both link structure and anchor text".

As is well known, Google uses the concept of 'PageRank', which is calculated from the links between pages – but, as illustrated by this diagram, it isn't a straightforward count of the number of links to a specific page: different weights are assigned to different links.

[Image: PageRank example diagram (PageRanks-Example, from Wikipedia)]

You can see that E has many more links than C, but does not get as high a page rank, because it is not, in turn, linked to by any high-ranking pages.
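For the curious, that calculation can be sketched in a few lines of Python – a toy power-iteration PageRank over an invented graph, using the damping factor of 0.85 that Brin and Page describe. The graph below is made up purely to reproduce the E-versus-C effect: E receives more links than C, but C's single link comes from a well-linked page.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            share = damping * rank[page]
            if not outgoing:            # dangling page: share rank with everyone
                for p in pages:
                    new[p] += share / n
            else:                       # split this page's rank across its links
                for target in outgoing:
                    new[target] += share / len(outgoing)
        rank = new
    return rank

# 'c' gets one link, but from the well-linked 'b'; 'e' gets three links,
# all from pages nothing links to – so 'c' ends up ranked above 'e'.
graph = {
    "f1": ["b"], "f2": ["b"], "f3": ["b"], "f4": ["b"], "f5": ["b"],
    "b": ["c"],
    "l1": ["e"], "l2": ["e"], "l3": ["e"],
    "c": [], "e": [],
}
ranks = pagerank(graph)
```

The point survives the simplification: rank flows along links, so a link is not just counted, it is weighted by the standing of the page it comes from.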

The Page Rank gives some kind of 'authority' to a page, but then there is the question of what the page is actually about. This latter question is not simple, but one factor that Brin and Page were explicit about is that "The text of links is treated in a special way in our search engine … we associate it with the page the link points to"

This means that not only is each link a 'vote' for a page in terms of page rank, but that it is also a piece of metadata about the page it is linked to. If you look at all the text of each link used, you are bound to get a wide range of text – as different people will link to a page from different perspectives – using different terminology and even different languages.
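A tiny sketch of that idea: treat each inbound link's text as a small assertion about the target page, and aggregate the assertions into a weighted description. The links and URL below are invented for illustration.

```python
from collections import Counter

def aboutness(inbound_links):
    """inbound_links: list of (anchor_text, target_url) pairs.

    Returns {target_url: Counter of anchor phrases}, i.e. a weighted
    description of each page built from the text of links pointing at it.
    """
    descriptions = {}
    for text, target in inbound_links:
        descriptions.setdefault(target, Counter())[text.lower()] += 1
    return descriptions

links = [
    ("Mount Everest", "http://example.org/everest"),
    ("mount everest", "http://example.org/everest"),
    ("highest mountain", "http://example.org/everest"),
    ("Sagarmatha", "http://example.org/everest"),  # a different language/perspective
]
desc = aboutness(links)
# 'mount everest' carries more weight (2 links) than the other phrases (1 each),
# yet the rarer phrases are still part of the description.
```

Nobody 'wins' here: every linking perspective, in every terminology, contributes to what the page is about, just with different weights.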

Suddenly we are thinking about a way of classifying a document (web page) that allows many, many people to participate – in fact, as many people as want to – the architecture of the web puts no limit on the number of links that can be supported, of course.

Alongside this, each assertion of a description also has a weight associated with it – so some pieces of metadata can be seen as carrying 'more weight' than others.

This allows for a much more analogue measurement of what a document is 'about'. A document can be 'about' many things, but to different extents. This brings us back to the way tags work in LibraryThing – many people can allocate different tags to the same book, and this allows a much more complex representation of 'aboutness'.

I don't think that this just applies to 'aboutness'. I believe other pieces of metadata could also benefit from an analogue approach – but I think I'm going to have to save this argument for another post.

The key thing here (for me) is that exploiting this linking and the network built using them is something that already exists – it is the web – and with it this brings a way of breaking out of our 'digital' approach to library data, that card or printed catalogues had to adopt by their very nature.

If every book in your catalogue had its own URL – essentially its own address on the web – you would have, in a single step, enabled anyone in the world to add metadata to the book, without making any changes to the record in your catalogue. I'd go further than this – but again that's going to need a post of its own – I hope I manage to get these written!
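To show how little machinery that single step needs, here is a sketch: a stable URL per record, and an external, LibraryThing-style tag store keyed on that URL. The URL pattern and the store are hypothetical – the point is only that the annotation lives entirely outside the catalogue.

```python
from collections import defaultdict

def record_url(record_id: str) -> str:
    # One stable, linkable address per catalogue record (invented pattern).
    return f"https://catalogue.example.ac.uk/record/{record_id}"

# An external service's tag store, keyed on the record's URL – not on
# anything inside the library system.
external_tags = defaultdict(list)

def tag(url: str, user: str, label: str) -> None:
    external_tags[url].append((user, label))

url = record_url("b1234567")
tag(url, "alice", "himalaya")
tag(url, "bob", "mountaineering")
# The catalogue record itself is untouched; the metadata accumulates
# around its address instead.
```

Any number of such services can grow up around the same addresses, which is exactly the analogue, many-voices model described above.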

So, we have the means of enabling a much more sophisticated ('analogue') approach to metadata, and what is frustrating is that we have not yet realised this, and we still think 'digital data' is a 'pretty neat idea'.