Iman Moradi is talking about how we organise library stock and spaces – he’s going through at quite a pace, so very brief notes again.
Finding things is complex
It’s a cliché that library users often remember the colour of the book more than the title – but why don’t we respond to this? Organise books by colour – example from Huddersfield town library.
Iman did a demonstrator – building a ‘quotes’ base for a book – using a pen scanner to scan a chunk of text from the book and associate it with the book via its ISBN – this starts to build a set of quotes from the book that people found ‘of interest’
Think about libraries in terms of games – users are ‘players’, the library is the ‘game environment’. Using libraries is like a game:
- Activities = Finding, discovery, collection
- Points/levels = acquiring knowledge
Today I’m at Mash Oop North aka #mashlib09 – and kicking off with a presentation from Dave Pattern – some very brief notes:
Making Library Data Work Harder
Dave Pattern – www.slideshare.net/daveyp/
Keyword suggestions – about 25% of keyword searches on Huddersfield OPAC give zero results.
Look at what people are typing in the keyword search – Huddersfield found ‘renew’ was a common search term – so can pop up an information box with information about renewing your books.
Looking at common keyword combinations can help people refine their searches
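The ‘renew’ idea above can be sketched as a simple lookup layered over the normal OPAC search. This is purely illustrative – the help texts, term list and function names are invented, not Huddersfield’s actual implementation:

```python
# Illustrative sketch: intercept common 'non-catalogue' search terms
# and attach a help box to the search results.
INFO_BOXES = {
    "renew": "To renew your books, log in to 'My Account' and choose 'Renew'.",
    "opening hours": "See the library home page for this week's opening hours.",
}

def search_with_help(query, run_search):
    """Run the normal OPAC keyword search, but also return any
    help box matching the raw query (e.g. someone typing 'renew')."""
    results = run_search(query)
    help_text = INFO_BOXES.get(query.strip().lower())
    return results, help_text

# With a stub search that finds nothing, 'renew' still gets help text:
results, help_text = search_with_help("renew", lambda q: [])
```

The same dictionary could be grown by mining the zero-result search logs for frequent terms.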
Borrowing suggestions – people who borrowed this item, also borrowed …
Tesco collects and exploits this data. Libraries sometimes assume we know what is best for our users – but perhaps we need to look at the data to prove or disprove our assumptions
Because borrowing is driven by reading lists, this perhaps helps suggestions stay on-topic
Course specific ‘new books’ list – based on what people on specific courses borrow
Able to do Amazon-style personalised suggestions
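A ‘people who borrowed this item, also borrowed…’ service can be sketched as a simple co-occurrence count over circulation transactions. This is a minimal illustration of the idea, with invented identifiers – not Huddersfield’s actual code:

```python
from collections import Counter, defaultdict

def build_suggestions(loans):
    """loans: iterable of (borrower_id, item_id) circulation records.
    Returns a mapping: item_id -> [(co-borrowed item, count), ...]
    ordered by how often the two items were borrowed by the same person."""
    items_by_borrower = defaultdict(set)
    for borrower, item in loans:
        items_by_borrower[borrower].add(item)

    co_counts = defaultdict(Counter)
    for items in items_by_borrower.values():
        for item in items:
            for other in items:
                if other != item:
                    co_counts[item][other] += 1

    return {item: c.most_common() for item, c in co_counts.items()}

# Toy data: three borrowers, three titles (identifiers are invented)
loans = [
    ("s1", "cataloguing-101"), ("s1", "marc-handbook"),
    ("s2", "cataloguing-101"), ("s2", "marc-handbook"), ("s2", "web-design"),
    ("s3", "cataloguing-101"), ("s3", "web-design"),
]
suggestions = build_suggestions(loans)
```

In practice you would filter out very common items and weight by course membership, but the core is just counting co-occurrences.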
Borrowing profile for Huddersfield – the average number of books borrowed shows a very high peak in October and a lull during the summer – and use of the suggestions can now be seen following this, with a peak in November.
Seems to be a correlation between introduction of suggestions/recommendations with increase in borrowing – how could this be investigated further?
Started collecting e-journal data via SFX – starting to do journal recommendations based on usage.
Suggested scenario – can start seeding new students experience – 1st time student accesses website can use ‘average’ behaviour of students on same course – so highly personalised. Also, if information delivered via widgets could drag and drop to other environments.
JISC Mosaic project, looking at usage data (at a national level, I think?)
So – some ideas of stuff that you might do with usage data:
#1 Basic library account info:
Just your bog-standard library options:
- view items on loan. hold requests etc
- renew items
- Configure alerting options
- SMS, Facebook, Google Telepathy
- rewards for sharing information/contributing to pool of data – perhaps swap karma points for free services/waiving fines etc.
#2 Discovery service
Single box for search
#3 Book recommendations
Students like book covers
Primarily a ‘we think you might be interested in’ service
Uses database of circulation transactions, augmented with Mosaic data
Time-relevant to the modules the student is taking
Adapts to the choices the student makes over time
#4 New books
Data-mining of books borrowed by student on a course
Provide new books lists based on this information (already doing this at Huddersfield I think)
#5 Relevant Journals
#6 Relevant articles
- Whenever student interacts with library services e.g. keywords etc. – refines their profile
Recently Chris Keene (University of Sussex) sent an email to the LIS-E-RESOURCES email list about the fact that in academic libraries we are now doing a lot more ‘import’ and ‘export’ of records in our library management systems – bringing in bibliographic records from a variety of sources like book vendors/suppliers, e-resource systems, institutional repositories. He was looking for some shared experience and how other sites coped.
One of the responses mentioned the new ‘next generation’ search systems that some libraries have invested in, and Chris said:
“Next gen catalogues are – I think – certainly part of the solution, but only when you just want to make the records available via your local web interface.”
One of the points he made was that the University of Sussex provides records from their library management system to others to allow Union catalogues to be built – e.g. InforM25, COPAC, Suncat.
I sympathise with Chris, but I can’t help but think this is the point at which we have to start doing things a bit differently – so I wrote a response to the list, but thought that I’d blog a version of it as well:
I agree that library systems could usefully support much better bulk processing tools (although there are some good external tools like MarcEdit of course, and scripting/programming tools such as the MARC Perl module if you have people who can program them). However, I'd suggest that we need to change the way we think about recording and distributing information about our resources, especially in the light of investment in separate 'search' products such as Aquabrowser, Primo, Encore, Endeca, &c. &c.
If we consider the whole workflow here, it seems to me that as soon as you have a separate search interface the role of the 'library system' needs to be questioned – what are you using it for, and why? I'm not sure funnelling resources into it so they can then be exported to another system is really very sensible (although I absolutely understand why you end up doing it).
I think that once you are pushing stuff into Aquabrowser (taking Sussex as an example) there is little point in also pushing them into the catalogue – what extra value does this add? For books (print or electronic) you may continue to order them via the library system – but you only need an order record in there, not anything more substantial – you can put the 'substantial' record into Aquabrowser. The library system web interface will still handle item level information and actions (reservations/holds etc.) – but again, you don't need a substantial bib record for these to work – the user has done the 'searching' in the search system.
For the ejournals you could push directly from SFX into Aquabrowser – why push via the library system? Similarly for repositories – it really is just creating work to convert these into MARC (probably from DC) to get them into your library system, to then export for Aquabrowser (which seems to speak OAI anyway).
One of your issues is that you still need to put stuff into your library system, as this feeds other places – for example at Imperial we send our records to CURL/COPAC as well as other places – but this is a poor argument going forward – how long before we see COPAC change the way it works to take advantage of different search technology (MIMAS have just licensed the Autonomy search product …). Anyway – we need to work with those consuming our records to work out more sensible solutions in the current environment.
I'd suggest what we really need to think about is a common 'publication' platform – a way of all of our systems outputting records in a way that can then be easily accessed by a variety of search products – whether our own local ones, remote union ones, or even ones run by individual users. I'd go further and argue that platform already exists – it is the web! If each of your systems published each record as a 'web page' (either containing structured data, or even serving an alternative version of the record depending on whether a human or machine is asking for the resource – as described in Cool URIs), then other systems could consume this to build search indexes – and you've always got Google of course… I note that Aquabrowser supports web crawling – could it cope with some extra structured data in the web pages (e.g. RDFa)?
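The ‘same record, different representation’ idea from Cool URIs can be sketched as a small content-negotiation routine. The media types and function name here are my own illustration of the pattern, not any particular product’s API:

```python
def pick_representation(accept_header):
    """Choose which version of a record page to serve, based on the
    HTTP Accept header. Machine clients asking for structured data
    get it; everything else gets the human-readable HTML page.
    (A real implementation would parse q-values properly.)"""
    machine_types = ("application/rdf+xml", "text/turtle", "application/marc")
    for media_type in machine_types:
        if media_type in accept_header:
            return media_type
    return "text/html"

# A browser gets HTML; a harvester asking for RDF gets data:
pick_representation("text/html,application/xhtml+xml")  # "text/html"
pick_representation("application/rdf+xml")              # "application/rdf+xml"
```

Crucially, both representations live at the same URL for the record, so any search system (local, union, or Google) can consume whichever form it prefers.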
I have to admit that I may be over estimating how simple this would be – but it definitely seems to me this is the way to go – we need to adapt our systems to work with the web, and we need to start now.
"Far out in the uncharted backwaters of the unfashionable end of the western spiral arm of the Galaxy lies a small unregarded yellow sun.
Orbiting this at a distance of roughly ninety-two million miles is an utterly insignificant little blue green planet whose ape-descended life forms are so amazingly primitive that they still think digital watches are a pretty neat idea."
Douglas Adams, The Hitchhiker's Guide to the Galaxy
Digital Libraries, Digital Repositories, Born Digital, Digital Objects – the idea of digital information has become an intrinsic part of the library landscape in the 21st century. However, I believe that as we manage more information in digital formats, we need to think about managing it in analogue, rather than digital, ways.
What do I mean by 'digital' and 'analogue' in this context? Well – to be clear, I'm in favour of using computers to help manage our data – in fact, I think this is key to our ability to take an 'analogue' approach!
Digital values are absolute – something is either on or off, 1 or 0, black or white. Analogue values live along a continuous scale – from black to white and all the shades of grey in between. Computers store information as a series of bits – which can either be on or off – there is no grey here, a bit is either on (1) or off (0) – they are literally digital.
When dealing with physical items on a shelf, and entries in a printed or card catalogue, it is difficult to do anything but take a digital approach to managing your library – something is either on this shelf, or that shelf; on this card or that card; about this subject or about that subject.
Even now that we don't rely on printed/card catalogues, and many items are available in electronic rather than physical format, we are still managing our collections in this 'digital' way. We treat all information in our catalogues as 'absolute' – from titles to subject headings.
I've heard Tim Spalding of LibraryThing talk about this in terms of subject headings – he said 'somebody wins' when you assign subject headings in a traditional library catalogue.
Even questions of fact, which you'd generally expect to have a single answer may not be entirely 'digital' (right or wrong). The classic example used in library school for reference questions is 'how high is Mount Everest?' – if you check several reference works you may come up with several answers – Wikipedia covers some of the various answers and why they are different.
At this point you may be wondering what the alternative is – you've still got to allocate a subject heading at some point (assign a title, author etc.) – right? Well, I think the answer in one of the most effective mechanisms for storing and retrieving information we've got – the web.
What makes the web 'analogue' rather than 'digital' in the way I'm using the terms is the link. We can see this clearly in the way Google was originally designed to work. In "The Anatomy of a Large-Scale Hypertextual Web Search Engine" Sergey Brin and Larry Page describe how Google was designed to make use "of both link structure and anchor text".
As is well known, Google uses the concept of 'PageRank', which is calculated from the links between pages, but as illustrated by this diagram, it isn't a straightforward count of the number of links to a specific page – it allows for different weights to be assigned to the links.
You can see that E has many more links than C, but does not get such a high page rank as it is, in turn, not linked to by any high ranking pages.
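This weighting can be sketched with a plain power-iteration PageRank. This is a simplified version of the algorithm from the paper (real Google adds many refinements), and the graph below is an invented toy example rather than the diagram's actual data:

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: page -> list of pages it links to (every page here has
    at least one outgoing link, so no dangling-node handling needed)."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Each iteration: a little 'teleport' rank for every page,
        # plus each page sharing its rank equally among its outlinks.
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, targets in links.items():
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Toy graph: E has three incoming links, but all from obscure pages;
# C has only one, but from B, which itself accumulates rank.
links = {
    "A": ["B"], "B": ["C"], "C": ["A"],
    "D": ["E"], "F": ["E"], "G": ["E"],
    "E": ["B"],
}
ranks = pagerank(links)
```

Running this, C ends up with a higher rank than E: one link from a well-ranked page outweighs three links from pages nothing links to, which is exactly the effect described above.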
The PageRank gives some kind of 'authority' to a page, but then there is the question of what the page is actually about. This latter question is not simple, but one factor that Brin and Page were explicit about is that "The text of links is treated in a special way in our search engine … we associate it with the page the link points to".
This means that not only is each link a 'vote' for a page in terms of page rank, but that it is also a piece of metadata about the page it is linked to. If you look at all the text of each link used, you are bound to get a wide range of text – as different people will link to a page from different perspectives – using different terminology and even different languages.
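The ‘link text as metadata about the target’ idea amounts to a simple inverted index of anchor texts, accumulated per target page. A minimal sketch (page names and link data invented for illustration):

```python
from collections import Counter, defaultdict

def anchor_text_index(web_links):
    """web_links: iterable of (source_page, target_page, anchor_text).
    Associates each link's text with the page it points *to*, so a
    target accumulates descriptions written by many different linkers,
    in different terminologies and even different languages."""
    index = defaultdict(Counter)
    for _source, target, text in web_links:
        index[target][text.lower()] += 1
    return index

web_links = [
    ("blog-a", "everest-page", "Mount Everest"),
    ("blog-b", "everest-page", "mount everest"),
    ("wiki-np", "everest-page", "Sagarmatha"),  # the Nepali name
]
index = anchor_text_index(web_links)
```

Each entry in the counter is both a ‘vote’ and a weighted piece of metadata – nobody had to agree on a single authorised description.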
Suddenly here we are thinking about a way of classifying a document (web page) that allows many, many people to participate – in fact, as many people as want to – the architecture of the web puts no limit on the number of links that can be supported, of course.
Alongside this, each assertion of a description also has a weight associated with it – so some pieces of metadata can be seen as having 'more weight' than others.
This allows for a much more analogue measurement of what a document is 'about'. A document can be 'about' many things, but to different extents. This brings us back to the way tags work in LibraryThing – many people can allocate different tags to the same book, and this allows a much more complex representation of 'aboutness'.
I don't think that this just applies to 'aboutness'. I believe other pieces of metadata could also benefit from an analogue approach – but I think I'm going to have to save this argument for another post.
The key thing here (for me) is that exploiting this linking and the network built using them is something that already exists – it is the web – and with it this brings a way of breaking out of our 'digital' approach to library data, that card or printed catalogues had to adopt by their very nature.
If every book in your catalogue had its own URL – essentially its own address on the web – you would have, in a single step, enabled anyone in the world to add metadata to the book, without making any changes to the record in your catalogue. I'd go further than this – but again that's going to need a post of its own – I hope I manage to get these written!
So, we have the means of enabling a much more sophisticated ('analogue') approach to metadata, and what is frustrating is that we have not yet realised this, and we still think 'digital data' is a 'pretty neat idea'.
I’m the project director for EThOSNet – which is establishing a service, run by the British Library, to provide access to all UK PhD and Research Theses. The service itself is called EThOS (Electronic Theses Online Service).
Today, EThOS has gone into public beta – without fanfare, the service is now available, and can be found at http://ethos.bl.uk. The key parts of the service are:
- A catalogue of the vast majority of UK Research Theses
- The ability to download electronic versions where they exist
- The ability to request an electronic version be created where it doesn’t already exist
I’m incredibly excited about this – of all the projects I’ve been involved in, although not the biggest in terms of budget (I don’t think), it has the most potential to have an incredible impact on the availability of research. Until now, if you wanted to read a thesis you either had to request it via ILL, or take a trip to the holding university. Now you will be able to obtain it online. To give some indication of the difference this can make, the most popular thesis from the British Library over the entire lifetime of the previous ‘Microfilm’ service was requested 58 times. The most popular electronic thesis at West Virginia University (a single US university) in the same period was downloaded over 37,000 times. If we can achieve even a relatively modest increase in downloads I’ll be happy – if we can hit tens of thousands then I’ll be delighted.
The project to setup EThOS has been jointly funded by JISC and RLUK, with contributions from the British Library, and a number of UK Universities and other partners, including my own, Imperial College London, which leads the project. The launch of the service is the culmination of several projects, including ‘Theses Alive!’, ‘Electronic Theses’, ‘DAEDALUS’, ‘EThOS’, and the current ‘EThOSNet’.
With so much work done before and during the EThOSNet project, my own involvement (which started some way into the EThOSNet project, when I took over as Project Director from Clare Jenkins in autumn 2007) looks pretty modest, so thanks to all who have worked so hard to make EThOS possible, and get it live.
One of the biggest issues that has surfaced several times during the course of these projects is the question of IPR (Intellectual Property Rights). EThOS is taking the bold, and necessary, step of working as an ‘opt-out’ service. This is based on a careful consideration of all the issues, which has concluded that:
- The majority of authors wish to demonstrate the quality of their work.
- Institutions wish to demonstrate the quality of their primary research
In order that authors can opt-out if they do not want their thesis to be made available via EThOS there is a robust take-down policy – available at EThOS Toolkit
As an author, you can also contact your University to let them know that you do not wish your thesis to be included in the EThOS service.
By making this opt-out and take-down approach as transparent as possible (including doing things like advertising it on this blog), we believe that authors have clear options they can exercise if they have any concerns about the service.
Finally, the derivation of the word Ethos (according to wikipedia) is quite interesting ™. There are many aspects of the word that felt relevant to the service – the idea of a ‘starting point’, and the idea that ‘ethos’ belongs to the audience both resonate with what EThOS is trying to do. However, for the title of the post I decided to draw on Michael Halloran’s assertion that "the most concrete meaning given for the term in the Greek lexicon is 'a habitual gathering place'." – which I believe is what EThOS will become to those looking for UK research dissertations.
Technorati Tags: ethos
Yesterday was Mashed Libraries UK 2008 – the first in what I hope might be a series of Library Mashup/Tech events in the UK.
The idea of holding the event germinated while I was at ALA earlier this year, inspired also by the Mashed Museums event. I blogged the idea, and before I knew it, had offers of rooms, sponsorship and quite a few people saying they'd like to come along.
Some questions, and answers about the event:
Did it go well? I hope so. It was a bit like hosting a party – you spend the whole time worrying whether people are enjoying themselves, and then when it’s all over, you’re knackered!
What did we do? I’d worried quite a bit about the structure of the day. I knew I had a real mix of people coming along and I wanted to ensure that everyone felt there was something for them. I’m not sure I completely succeeded, but I think the mix was OK.
The day started with some presentations from Rob Styles (on the Talis Platform) and Tony Hirst (on mashup tools like Google Spreadsheets and Yahoo Pipes). A third speaker was planned but unfortunately was unable to make it, so a few people agreed to help fill in (Timm from Ex Libris, Mark from OCLC and Ashley from MIMAS – thank you)
What surprised me (but perhaps was not really surprising) was the extent to which these presentations set the agenda for the day – people tended to look at the stuff covered in these presentations. This isn’t a bad thing at all, but worth noting when planning this kind of event.
After the presentations (done by about midday) we broke for lunch, and people chatted etc. After this, I really left it up to people as to what they wanted to get on with. I’d collected some ideas on the Mashed Library Ning beforehand – but on the day the content of the presentations influenced what people did much more than this.
I wasn’t sure how to ‘manage’ the more free-form part of the day – how to make sure that people weren’t left thinking ‘what on earth do I do now’, while still ensuring that people could get on with what they wanted. I think it worked – but perhaps others are better placed to say.
Some of the things that people did on the day were:
Mash along with Tony Hirst – Tony conducted a Gordon Ramsay style mash along using Yahoo Pipes to pull data from different bits of the Talis Platform, trying to output locations of books plotted on a map. There were some problems, and the limitations of Pipes became apparent (as well as issues with aspects of the data)
Matthew Phillips (Dundee) and Ed Chamberlain (Cambridge) played around with output from the Aleph API and Pipes (unfortunately with some real challenges on this one), and Matthew also showed off his use of graphical bars to illustrate overlapping journal holdings – in print and electronically from various suppliers.
A few groups messed with Pipes and locations – finding the nearest Travel Agents, Museums and Pubs to a specific location.
David Pattern (Huddersfield) messed around with usage data, trying to see if heavy use of one book today meant that another title would be in high demand next week, to allow planning of moves of titles into Short Loan etc.
Nick Day (Cambridge) used WorldCat to look up identifiers for citations he had coded up in RDF from some PhD Theses.
There was also the chance simply to soak up the atmosphere, wander round to see what others were doing, and generally chat and network.
Towards the end of the day Paul Bevan from the National Library of Wales talked about how they were engaging with Web 2.0, and the different challenges that they faced compared to ‘academic libraries’. (and sadly we were lacking attendees from the public or commercial library sectors)
A collection of pictures from the event is forming on Flickr, and I took some video footage on the day, which I hope to turn into something – I’ll post it here when it’s done.
What would you do differently? I’d remember that a room full of people with laptops needs lots of power points. The room we had unfortunately only had six – but with a bit of help from the local support staff, and a local Maplin, this was sorted out before people ran out of juice.
I’d also remember that you need to order vegan options for vegans (I’m really sorry Ashley – I hope you did get something to eat).
Thanks to? Thanks to Imperial College for letting me spend time organising this; UKOLN for sponsoring it (without whom, there would not have been cake); Paul Walk (of UKOLN) for encouragement and arranging the sponsorship; David Flanders (Birkbeck) for sorting the room, local details, and generally being helpful; all the presenters; and everyone who came along and contributed to the day.
In Summary? I didn’t really have any huge expectations of the day – I wanted it to bring together a group of interested people, and do some interesting things. I hope that the people who came along felt that in the main we managed this.
I can definitely see potential for different kinds of events that perhaps focus on more development, or on training – if you weren’t able to make it, then you might consider going to the JISC Developer Happiness days in February next year, which include an introductory day to develop skills, as well as a 2 day coding session (with prizes).
I think overall I can say Mashed Library ‘08 was a success. I’d like to do another library tech event in the new year – so if you are interested, keep watching, or drop me a line.
Photos courtesy of Dave Pattern, via Flickr
Technorati Tags: mashlib08
Yesterday I attended the first meeting of the new JISC Resource Discovery Infrastructure Taskforce.
I enjoyed the day, and the Task Group is bringing together a great set of people who all had an incredible amount to contribute to the discussion.
The day was largely about establishing some basics – like agreeing the Terms of Reference for the group – and also about getting some of the issues and assumptions out in the open. I'd been asked to prepare a 10-minute presentation titled 'What if we were starting from scratch'. Paul Miller from Talis also presented. Originally there had been a suggestion that Tim Spalding of LibraryThing would also present, but that didn't happen in the end (which was a disappointment).
My talk is available via Slideshare but, at last look, the speaker's notes were not displaying properly, so I'd recommend using the 'download' option to get the PowerPoint file, as the speaker's notes are essentially the script of the talk (although without my witty improvisations).
One of the things I struggled with as I wrote the talk, is that I knew what I wanted to say in terms of what a 'starting from scratch' approach might look like, but I had no idea of how this linked to user need. This may seem a bit backwards – perhaps arrogant? – in a world where we recognise that serving the user need is paramount, but even during the day we seemed to come up against this problem more than once – how does the infrastructure relate to the user? Are they aware of it? Do they care what it looks like? How do they inform it?
After researching and thinking, I eventually hit upon Ranganathan's 5 Laws of Library Science as a way of thinking about the user need and still relating it to the infrastructure. If you have seen (or remember) the 5 laws:
- Books are for use.
- Every reader his [or her] book.
- Every book its reader.
- Save the time of the User.
- The library is a growing organism.
Then I really recommend you read the full text of Ranganathan’s original book – as the thinking behind these laws is so much more important than this plain statement of them.
One final thing on the presentation – in it I describe a linked environment that I say is ‘not necessarily the web’ – I think this is true in terms of what I’m describing and for the purposes of the presentation. I want to state though that in reality, if we are implementing something along these lines the linked environment would absolutely have to be the web – there is no point in coming up with something separate.
Overall the discussions on the day were very interesting, and really just emphasised how much there was to discuss:
- How does discovery relate to delivery
- Are we talking about discovery via metadata or other routes (e.g. full-text searching)
- What is good/bad about what we’ve got
- Are we talking about any ‘resource’ or just ‘bibliographic’
- What does ‘world class’ mean in the context of resource discovery
Some of this may seem trivial, and some fundamental, but I guess this is what happens when you try and tackle this kind of big issue.
However, the one thing that I came away wondering overall was ‘what do we mean by infrastructure’? (luckily I think I’m clearer on Resource Discovery, otherwise we’d be in real trouble!)
Dictionary.com has the following definition of infrastructure:
- the basic, underlying framework or features of a system or organization.
- the fundamental facilities and systems serving a country, city, or area, as transportation and communication systems, power plants, and schools.
- the military installations of a country.
Ruling out the last one (I hope) as not relevant, I think the first two definitions sum up the problem. On the one hand, infrastructure can be seen as the very basic framework. If you talk about Infrastructure in the context of Skyscrapers then you are talking about the metal frame, the foundations, the concrete etc. This seems to me like meaning (1) above.
On the other hand, in terms of urban planning infrastructure might refer not just to underlying frameworks (e.g. roads, sewers) but also basic services (e.g. refuse collection, metro system)
I think that when we talked about ‘resource discovery infrastructure’ some people think ‘plumbing’ or ‘foundations’ (this includes me), and some think ‘metro’ or ‘refuse collection’.
To take a specific example, is a geographical Union Catalogue like the InforM25 Union List of Serials part of a resource discovery ‘infrastructure’ or is the ‘infrastructure’ in this case the MARC record and ftp which allows the records from many catalogues to be dumped together, merged and displayed?
Going back to the question of how the user relates to the infrastructure – you can see how I (as a user) relate very much to the mass transit system that is provided where I live – but I don’t care about the gauge of rail on which it runs (perhaps I should, but I don’t)
The group is planning another meeting in the New Year, and definitions are one of the things we need to talk about – I think the question of what qualifies as Infrastructure needs to be close to the top of that list.
Registration for Mashed Libraries UK 2008 has now closed, and I’m really looking forward to the event next Thursday (27th November).
I’m really pleased that we’ve got about 30 people coming, from all over the UK (and a couple from further afield).
Although registration is closed, you can still contribute to the day, by joining the Ning at http://mashedlibrary.ning.com and posting ideas to the forums – you never know, someone attending might pick up on your idea and do something with it – and if not, then there is always the next Mashed Libraries event (I hope)
For those coming, I’m looking forward to meeting you all, for those not able to attend look for updates on this blog, on the Ning, the CILIP Update, and perhaps other places.
Technorati Tags: mashlib08
I’ve just spent the day at LIS08 at the Birmingham NEC. Unfortunately there didn’t seem to be any publicly available wifi network – but that does give me a chance to try the new TypePad iPhone interface to write this blog entry!
The main attractions of the show (and certainly the biggest stands) were the RFID vendors – Intellident, D-Tech, 2CQR, 3M, SB etc. were all there.
The two most interesting things I saw were the ‘Smartblades’ from Intellident, which were RFID antennas that you placed between your book stock on the shelves (about three per metre of shelving) and they would do a stock check of your library at the touch of a button. They also added the ability to go from searching for a book to an interactive map showing where the item was in the library at that instant. Although there is still clearly some work to do on the practical implementation (how to get power and networking to the blades, for example) the demo I saw was very convincing. I was also pleased to hear them talk about the exploitation of RFID in the library supply chain – a much underestimated area by RFID vendors in my opinion – however this will need engagement from library suppliers and system vendors before it can become a reality.
I was also very taken by D-Tech’s people counter, which used thermal imaging to detect people passing a point, and some clever software which allowed you to draw lines across the sensor area to dictate when a ‘count’ occurred. This would allow not only in-and-out counting: if you put a sensor at a crossroads or junction, you could draw counter lines in such a way as to count how many people went each way from the crossroads – very neat, and the price seemed reasonable too.
I also got to see a walkthrough of the new search interface from DS Ltd (Arena), chatted to Lorensbergs about room booking systems and Talking Tech about SMS overdue and hold notification. So all in all a useful day and I caught up with some current and past colleagues as well.
Balderdash and Piffle
The BBC are currently showing a series called Balderdash and Piffle which encourages viewers to help track the origins of words or phrases, and to identify the earliest usage in print or recorded media – this is done in collaboration with the OED.
I was watching this on Friday evening, and was surprised that the earliest recorded printed occurrence of the phrase "the dog’s bollocks" to describe something really great (cf. bee’s knees, cat’s whiskers) was 1989. So, I thought (in my usual slightly headstrong way) that I might find something earlier if I did some searching online.
I quickly found myself at Google books, and for the first time used it in anger. As usual Google allows me to use inverted commas to indicate a phrase, but I almost immediately found that the basic search didn’t allow me to limit by publication date, so I moved onto the advanced search options. This did let me limit by publication date which was great – I could now only look for items that were published before 1989.
This turned up two hits, one from a "Dictionary of Jargon" apparently published in 1987, and one from "Vision of Thady Quinlan" from 1974. I’ll deal with these one at a time below.
Vision of Thady Quinlan
In the brief results this gives the context for the use of the phrase as follows:
"I don’t give a dog’s bollocks who he is, or who you might be, or what you think you can do. You stay. He goes." Finn dropped the cases. …
This is clearly not the usage of the phrase I was looking for.
Oddly, when I look at the detailed record, this extract is not present, and the ‘snippet’ which should show the context is missing, replaced by a rather distorted "Image not available". This is irritating, but because the context is so clear in the brief view it doesn’t hamper my research – more on this later.
Dictionary of Jargon
This is dated from 1987. However, in this case there is no extract in the brief view. Going through to the full record, there is no snippet. There is some basic metadata – author, publication date, publisher, subject areas, number of pages, where the scan has come from, digitization date.
One issue with the metadata is that the author name is listed as "Jonathon. Green" – note the full stop in the middle of this – I don’t think this changes the meaning, but it points to the quality of the metadata, and this type of issue could lead to ambiguity in other contexts.
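Stray punctuation like this is exactly the kind of thing that could be caught by a cheap normalisation pass over the metadata. A crude sketch of one possible heuristic (my own invention – strip a full stop that follows a run of three or more word characters, so genuine initials like "A. J." survive):

```python
# Sketch: cleaning stray full stops out of author names such as
# "Jonathon. Green". The heuristic: a full stop directly after three
# word characters and before a space is unlikely to be an initial.
import re

def clean_author(name):
    return re.sub(r"(?<=\w{3})\.(?=\s)", "", name)
```

A real metadata-cleaning pass would of course need a much richer set of rules, but even this would flag the record above for review.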
I can’t take this any further without seeing the book – without getting into the rights and wrongs of digitising, this is where I regret the lack of the full text. There is a link to ‘Find this book in a library’, which takes me through to Open WorldCat – and I find that the nearest library (that WorldCat knows about) is 6 miles away – that’s not bad going. I’d need to go and check the actual book and usage – but if it bears out its promise, that’s about 10 minutes’ work to out-research the OED and the BBC!
I moved on to other phrases in the BBC’s/OED’s list and found what seemed to be an earlier-than-recorded usage of "mucky pup", meaning a habitually messy or dirty child or adult. In this case it is in "From a Pitman’s Notebook".
In the brief display this is listed as by Arthur Archibald Eaglestone from 1925 – pre-dating the evidence found by the BBC programme, which had dated it to a popular song in 1934. The brief display also puts the phrase into context – "Tha mucky pup! Ah’ll bet tha’s ‘ad ter coom doon’t chimbley this mornin’ ‘ is accepted with a sheepish grin" – which confirms that the usage is correct.
When I go through to the full record, finally I get a ‘snippet’ of the book displayed – but the actual usage is clearly in the line before the snippet starts – so I still don’t get a view of the phrase I’m looking for in context.
In the full record I also get a thumbnail of the digitised book cover – and immediately notice that in the thumbnail it says "BY ROGER DATALLER" – which contradicts the metadata (as noted above, this says the author is Arthur Archibald Eaglestone). Intrigued by this, I search for the book in the British Library, which seems to confirm Roger Dataller as the author. I then check the University of Michigan record, as this is where Google says the book was from. Failing to find the item on a title search, I search for both Dataller and Eaglestone as authors – and eventually find the record listed under Eaglestone – so it looks like Google’s metadata simply reflects that from the University of Michigan.
Now all this is fine – and again, my best bet is clearly to go and get the book from the BL, or perhaps even contact the University of Michigan to see if they can confirm the item details. But along with some of the other things I’ve found, it leads me to start distrusting the quality of the metadata I’m seeing.
I moved on to searching for an occurrence of "codswallop" from before 1959, ideally something that linked it to its origin. I found 16 records – and the second one was dated 1869 – I was very excited by this – almost 100 years earlier than the OED has recorded. However, as soon as I started to look at the entries in detail I noticed immediately that many seemed to be journals rather than books. The problem here is that the date Google records as the ‘publication’ date seems to be the date the journal originally started publication. Journals are not listed by issue, but as a single record covering all the articles from the journal, and it seems to be impossible to tell which issue or date a specific piece of text is from. As an example, the search for "codswallop" finds a reference in (appropriately) "Library Review" – this has a use of "Load of Codswallop" dated as 1927. Looking at the full record, the snippet reveals that the usage is followed by the reference "Evening News, 4 Aug. 1970)" – clearly indicating that this particular article is much later than 1927 – but there is nothing further to date the actual usage. The other results for codswallop have a similar problem, but without the helpful glossing to give any indication of date.
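Where a snippet does carry an embedded citation date like that, you could at least extract it as a better lower bound than the journal-level date. A crude sketch (the regex is just my guess at spotting four-digit years in running text):

```python
# Sketch: pulling an embedded year out of snippet text as a better
# date estimate than the journal's original publication date. Takes
# the latest plausible year mentioned, since the article can be no
# earlier than anything it cites.
import re

def latest_year_mentioned(snippet):
    """Return the latest year 1500-2099 found in the text, or None."""
    years = [int(y) for y in re.findall(r"\b(1[5-9]\d\d|20\d\d)\b", snippet)]
    return max(years) if years else None

snippet = 'a "load of codswallop" (Evening News, 4 Aug. 1970)'
year = latest_year_mentioned(snippet)
```

It would only help where the glossing happens to be in the snippet, of course – the real fix is issue-level metadata.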
In summary, I found Google Books brilliant but ultimately frustrating. The ability to search full text was invaluable, and turned up references that (it seems) have not been found before. On the other hand, the lack of full-text display meant that it wasn’t possible to check the context, and even when a snippet was displayed it far too often didn’t actually show the relevant passage (often a line or two out).
The fact that I found errors in the metadata in a few cases made me suspicious of the quality in general. To be fair, these errors may have come from the original library metadata – and I wouldn’t have realised the error if I had simply seen the bibliographic record in the original library catalogue.
Finally, the inability to narrow searching of journal/serial content any further than the original publication date of the journal – and the inability to restrict searching to just monographs, or just serials – meant that it was often impossible to tell whether what I’d found was useful or not.
Google Books and other digitisation projects have the potential to unlock information that might not otherwise ever be found. However, the implementation isn’t quite there yet, and is limited by the inability to display full text for many items.
We are a little way off understanding how full-text searching can be successfully combined with the more traditional structured searching that library catalogues offer (and systems offering faceted searching, such as Endeca, are in the process of exploiting). However, what is clear to me is that searching for information from digitised printed material is different to searching ‘the web’ (although this may simply be a function of the youth and lack of sophistication of the web I guess) and it would be great to see Google and Libraries collaborating on improving this service by combining the best of both.
Just a few more observations:
- You can’t limit by the language of the material
- The OCR doesn’t seem to work so well with foreign-language materials
- There are quite a lot of OCR problems – for example, the digit ‘1’ mistaken for a lowercase ‘l’ and vice versa
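One pragmatic workaround for that last problem is to search for the likely misreadings as well as the correct spelling. A minimal sketch (the confusion table here is just the single 1/l pair – a real one would cover 0/O, rn/m and so on):

```python
# Sketch: generating search variants for OCR confusions such as the
# digit '1' read as lowercase 'l' (and vice versa), so a phrase search
# can also try the likely misrecognitions.
from itertools import product

CONFUSABLE = {"1": "l", "l": "1"}

def ocr_variants(text):
    """All spellings reachable by swapping confusable characters."""
    options = [(ch,) if ch not in CONFUSABLE else (ch, CONFUSABLE[ch])
               for ch in text]
    return {"".join(combo) for combo in product(*options)}
```

The variant count grows exponentially with the number of confusable characters, so in practice you’d cap it – but for a short phrase it’s a cheap way to squeeze more recall out of dirty OCR.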