The morning after the mash before

Yesterday was Mashed Libraries UK 2008 – the first in what I hope might be a series of Library Mashup/Tech events in the UK.

The idea of holding the event germinated while I was at ALA earlier this year, inspired also by the Mashed Museums event. I blogged the idea, and before I knew it, had offers of rooms, sponsorship and quite a few people saying they'd like to come along.

Some questions and answers about the event:

Did it go well? I hope so. It was a bit like hosting a party – you spend the whole time worrying if people are enjoying themselves, and then when it’s all over, you’re knackered!

What did we do? I’d worried quite a bit about the structure of the day. I knew I had a real mix of people coming along and I wanted to ensure that everyone felt there was something for them. I’m not sure I completely succeeded, but I think the mix was OK.

The day started with some presentations from Rob Styles (on the Talis Platform) and Tony Hirst (on mashup tools like Google Spreadsheets and Yahoo Pipes). A third speaker was planned, but was unfortunately unable to make it, so a few people agreed to help fill in (Timm from Ex Libris, Mark from OCLC and Ashley from MIMAS – thank you).

What surprised me (but perhaps was not really surprising) was the extent to which these presentations set the agenda for the day – people tended to look at the stuff covered in these presentations. This isn’t a bad thing at all, but worth noting when planning this kind of event.

After the presentations (done by about midday) we broke for lunch, and people chatted etc. After this, I really left it up to people as to what they wanted to get on with. I’d collected some ideas on the Mashed Library Ning beforehand – but on the day the content of the presentations influenced what people did much more than this.

I wasn’t sure how to ‘manage’ the more free-form part of the day – how to make sure that people weren’t left thinking ‘what on earth do I do now’, while still ensuring that people could get on with what they wanted. I think it worked – but perhaps others are better placed to say.

Some of the things that people did on the day were:

Mash along with Tony Hirst – Tony conducted a Gordon Ramsay style mash along using Yahoo Pipes to pull data from different bits of the Talis Platform, and trying to output locations of books plotted on a map. There were some problems, and the limitations of Pipes became apparent (as well as issues with aspects of the data).

Rob Styles and Chris Keene (Sussex) were aiming at something similar, but with PHP and Javascript, and additionally pulling in data from another Talis source, ‘Silkworm’ – a directory of library information (including more detailed location information).

Matthew Phillips (Dundee) and Ed Chamberlain (Cambridge) played around with output from the Aleph API and Pipes (unfortunately some real challenges with this one), and Matthew also showed off his use of graphical bars to illustrate overlapping journal holdings – in print and electronically from various suppliers.

A few groups messed with Pipes and locations – finding the nearest Travel Agents, Museums and Pubs to a specific location.
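For anyone wanting to try the ‘nearest place’ idea outside Pipes, here is a minimal sketch in Python. It is not what any of the groups built on the day; it simply assumes you already have a list of places with latitude and longitude, and picks the closest one using the haversine great-circle distance. The place names and coordinates in PLACES are made up for illustration.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest(places, lat, lon):
    """Return the (name, lat, lon) tuple closest to the given point."""
    return min(places, key=lambda p: haversine_km(lat, lon, p[1], p[2]))


# Hypothetical sample data: names and coordinates are invented.
PLACES = [
    ("Pub A", 51.5230, -0.1300),
    ("Museum B", 51.5194, -0.1270),
    ("Travel Agent C", 51.5300, -0.1200),
]

if __name__ == "__main__":
    print(nearest(PLACES, 51.5219, -0.1303))
```

In a real mashup the list of places would come from a feed or API rather than a hard-coded list, and for a handful of points a simple scan like this is quite enough.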

David Pattern (Huddersfield) messed around with usage data, trying to see if heavy use of one book today meant that another title would be in high demand next week, to allow planning of moves of titles into Short Loan etc.

Nick Day (Cambridge) used WorldCat to look up identifiers for citations he had coded up in RDF from some PhD Theses.

There was also the chance simply to soak up the atmosphere, wander round to see what others were doing, and generally chat and network.

Towards the end of the day Paul Bevan from the National Library of Wales talked about how they were engaging with Web 2.0, and the different challenges that they faced compared to ‘academic libraries’ (sadly, we were lacking attendees from the public or commercial library sectors).

A collection of pictures from the event is forming on Flickr, and I took some video footage on the day, which I hope to turn into something – I’ll post it here when it’s done.

What would you do differently? I’d remember that a room full of people with laptops needs lots of power points. The room we had unfortunately only had 6 – but with a bit of help from the local support staff, and a local Maplins, this was sorted out before people ran out of juice.

I’d also remember that you need to order Vegan options for Vegans (I’m really sorry Ashley – hope you did get something to eat)

Thanks to? Thanks to Imperial College for letting me spend time organising this; UKOLN for sponsoring it (without whom, there would not have been cake); Paul Walk (of UKOLN) for encouragement and arranging the sponsorship; David Flanders (Birkbeck) for sorting the room, local details, and generally being helpful; all the presenters; and everyone who came along and contributed to the day.

In Summary? I didn’t really have any huge expectations of the day – I wanted it to bring together a group of interested people, and do some interesting things. I hope that the people who came along felt that in the main we managed this.

I can definitely see potential for different kinds of events that perhaps focus on more development, or on training – if you weren’t able to make it, then you might consider going to the JISC Developer Happiness days in February next year, which include an introductory day to develop skills, as well as a 2 day coding session (with prizes).

I think overall I can say Mashed Library ’08 was a success. I’d like to do another library tech event in the new year – so if you are interested, keep watching, or drop me a line.

 

Photos courtesy of Dave Pattern, via Flickr


Infrastructure, Infra dig

Yesterday I attended the first meeting of the new JISC Resource Discovery Infrastructure Taskforce.

I enjoyed the day, and the Task Group is bringing together a great set of people who all had an incredible amount to contribute to the discussion.

The day was largely about establishing some basics – like agreeing the Terms of Reference for the group – and also about getting some of the issues and assumptions out in the open. I'd been asked to prepare a 10 minute presentation titled 'What if we were starting from scratch'. Paul Miller from Talis also presented. Originally there had been a suggestion that Tim Spalding of LibraryThing would also present, but that didn't happen in the end (which was a disappointment).

My talk is available via Slideshare, but at last look, the speaker's notes were not displaying properly, so I'd recommend using the 'download' option to get the PowerPoint file, as the speaker's notes are essentially the script of the talk (although without my witty improvisations).

One of the things I struggled with as I wrote the talk, is that I knew what I wanted to say in terms of what a 'starting from scratch' approach might look like, but I had no idea of how this linked to user need. This may seem a bit backwards – perhaps arrogant? – in a world where we recognise that serving the user need is paramount, but even during the day we seemed to come up against this problem more than once – how does the infrastructure relate to the user? Are they aware of it? Do they care what it looks like? How do they inform it?

After researching and thinking, I eventually hit upon Ranganathan's 5 Laws of Library Science as a way of thinking about the user need and still relating it to the infrastructure. If you have seen (or remember) the 5 laws:

  1. Books are for use.
  2. Every reader his [or her] book.
  3. Every book its reader.
  4. Save the time of the User.
  5. The library is a growing organism.

Then I really recommend you read the full text of Ranganathan’s original book – as the thinking behind these laws is so much more important than this plain statement of them.

One final thing on the presentation – in it I describe a linked environment that I say is ‘not necessarily the web’ – I think this is true in terms of what I’m describing and for the purposes of the presentation. I want to state though that in reality, if we are implementing something along these lines the linked environment would absolutely have to be the web – there is no point in coming up with something separate.

Overall the discussions on the day were very interesting, and really just emphasised how much there was to discuss:

  • How does discovery relate to delivery
  • Are we talking about discovery via metadata or other routes (e.g. full-text searching)
  • What is good/bad about what we’ve got
  • Are we talking about any ‘resource’ or just ‘bibliographic’
  • What does ‘world class’ mean in the context of resource discovery

Some of this may seem trivial, and some fundamental, but I guess this is what happens when you try and tackle this kind of big issue.

However, the one thing that I came away wondering overall was ‘what do we mean by infrastructure’? (Luckily I think I’m clearer on Resource Discovery, otherwise we’d be in real trouble!)

Dictionary.com has the following definition of infrastructure:

  1. the basic, underlying framework or features of a system or organization.
  2. the fundamental facilities and systems serving a country, city, or area, as transportation and communication systems, power plants, and schools.
  3. the military installations of a country.

Ruling out the last one (I hope) as not relevant, I think the first two definitions sum up the problem. On the one hand, infrastructure can be seen as the very basic framework. If you talk about Infrastructure in the context of Skyscrapers then you are talking about the metal frame, the foundations, the concrete etc. This seems to me like meaning (1) above.

On the other hand, in terms of urban planning infrastructure might refer not just to underlying frameworks (e.g. roads, sewers) but also basic services (e.g. refuse collection, metro system)

I think that when we talk about ‘resource discovery infrastructure’ some people think ‘plumbing’ or ‘foundations’ (this includes me), and some think ‘metro’ or ‘refuse collection’.

To take a specific example, is a geographical Union Catalogue like the InforM25 Union List of Serials part of a resource discovery ‘infrastructure’ or is the ‘infrastructure’ in this case the MARC record and ftp which allows the records from many catalogues to be dumped together, merged and displayed?

Going back to the question of how the user relates to the infrastructure – you can see how I (as a user) relate very much to the mass transit system that is provided where I live – but I don’t care about the gauge of rail on which it runs (perhaps I should, but I don’t)

The group is planning another meeting in the New Year, and definitions are one of the things we need to talk about – I think the question of what qualifies as Infrastructure needs to be close to the top of that list.

Doing the Library Mash

Registration for Mashed Libraries UK 2008 has now closed, and I’m really looking forward to the event next Thursday (27th November).

I’m really pleased that we’ve got about 30 people coming, from all over the UK (and a couple from further afield).

Although registration is closed, you can still contribute to the day, by joining the Ning at http://mashedlibrary.ning.com and posting ideas to the forums – you never know, someone attending might pick up on your idea and do something with it – and if not, then there is always the next Mashed Libraries event (I hope)

For those coming, I’m looking forward to meeting you all; for those not able to attend, look for updates on this blog, on the Ning, in CILIP Update, and perhaps other places.


Send in the clouds

Cloud computing seems to be the buzzword of the moment, and there is currently a lot of media coverage – especially in the light of the recent Microsoft announcement of their own take on the cloud with Azure. I’ve also been following some less high profile, but nonetheless thought provoking, discussions about other aspects of the cloud such as the ‘data cloud’ on blogs and twitter.

Why cloud? At some point it became usual to represent the network as a ‘cloud’ in network and computing architecture diagrams like this:

(Image courtesy of stephthegeek; Creative Commons Attribution-Noncommercial-Share Alike, some rights reserved.)

The cloud represents the complexity of the internet here, but also says it is quite simple from the network point of view – stuff goes in one end, and comes out the other.

The concept of ‘cloud computing’ is that you can have services that sit on the Internet (i.e. ‘in the cloud’) and use them in the same way – push stuff in, get stuff out, don’t really care how it happens.

Examples of ‘cloud computing’ that are often cited are:

  • Google Docs
  • Amazon S3
  • Amazon EC2

The first is a suite of office tools that exist only online – you don’t download anything to your PC, and you interact with them via your browser.

The other two are services from Amazon – the first is a storage service, where you can use hard disk space on Amazon servers to store stuff, and the second is a service which allows you to run virtual servers, on demand, on Amazon hardware.

This week Microsoft announced its own take on the cloud in the form of Azure – which will provide a way of synchronising between your desktop, your mobile and ‘the cloud’.

In a recent post to the ZDNet Semantic Web blog, Paul Miller goes on to talk about the possibility of a ‘data cloud’ – linking it to the idea behind the ‘semantic web’ – that is creating a web of data, all inter-linked.

Paul Walk argued (and I’m inclined to agree), that you couldn’t talk about data in the same way as computing power – as you cared about data in a different way to your computing power – you would never accept just ‘any old data’. In a comment on this post Chris Rusbridge suggests that what Paul is referring to is the provenance of the data – again I’m inclined to agree.

However, I’d say that actually this is as true of obtaining computing power as it is data. Although, as Paul Walk notes, I may not care about the particular hardware that I’m utilising, or where it lives, I do care that I’m being offered a robust service – and so I don’t want just any old computing power – I want the good stuff! I personally use the Amazon S3 service to back up data – because I trust that Amazon is going to be pretty reliable – I wouldn’t trust the same data to some bloke running a ‘cloud computing’ service from his garage.

The difference for me between the Internet as a ‘cloud’ and the idea of ‘cloud computing’ is that when I transmit data over the Internet as a network I’m trusting not in a single provider, but essentially in a technical protocol and infrastructure to get stuff from one place to another – and although one part of that journey is governed by someone I’ve chosen (my ISP) most of it isn’t. When I choose a ‘cloud computing’ service I trust a ‘brand’ that provides the service – admittedly I don’t ask questions about how they provide the service (do they subcontract? how would I know?), but I’m not just throwing a task at a generalised technical solution and saying ‘store this’ or ‘process that’.

I would argue that peer-to-peer networks are much closer to the idea of ‘cloud’ computing than Amazon’s or Google’s services. If I upload something to a peer-to-peer network, then it is potentially going to be stored in lots of places, and I won’t know where it is. For some data this might work (stuff that I really want to share), but for others (stuff I want to keep but perhaps not share) it isn’t.

Skype also uses peer-to-peer technology to route Skype calls – and again, I would argue that this is much closer to a situation where you really “don’t care” where the processing takes place – as long as your call holds up.

So, I think that what is being called ‘cloud computing’ is actually SaaS – Software (or Storage I guess) as a Service. SaaS is a model where you obtain access to software that is hosted elsewhere – so typically via the Internet. When I use Google Docs or Amazon S3 this is really what I’m doing.

Several Library system vendors offer SaaS – although without incredibly enthusiastic uptake in the academic library sector (see the post from Dave Pattern and various comments at http://www.daveyp.com/blog/archives/303). I have in the past been a bit sceptical about the idea of SaaS, but as I note in my comment on the post above, I’m now much more convinced.

I think that Paul Miller’s arguments make more sense in this context – DaaS (Data as a Service) – that is, getting your data that is hosted somewhere else makes sense. However, Paul is arguing for something a bit more than this – data that is hosted in a way that makes it accessible and linkable – and this is something that I think libraries need to get to grips with. There is a lot of talk of ‘data silos’ and how libraries are guilty of perpetuating this – we need to break out of this paradigm. I was very depressed to see a comment on an email list this week that said "There is something to be said for the library's catalogue being self contained and inhouse". I think this is an attitude we have to change: although I understand the arguments about reliability (e.g. in the face of network failure), we can overcome these problems without having systems that are ‘self contained’, and if we are to have library data as part of the cloud, we need to.

Euan Semple – keynote

The opening keynote is from Euan Semple (http://www.euansemple.com/). Euan is at the BBC as head of Knowledge Management, and has had to help the BBC adapt to ‘Web 2.0’. When faced with a manager who said ‘if I gave my staff access to that kind of tool, they would just end up wasting their time’ – Euan’s reply was ‘have you thought that your recruitment policy might not be working?’.

So Euan’s opening question is what will ‘Businesslike’ look like when business isn’t like business any more?

Euan’s talking about tools he has used or seen used in the process of implementing technology in the area of KM. Firstly ‘Talk.Gateway’ – a discussion/chat board. He draws a distinction between this approach and ‘document management’ systems "where information goes to die gracefully". He suggests that using something like a discussion board allows you to access all the collective knowledge of the organisation (in the case of the BBC accessing the collective knowledge of 23k employees). An example: someone asked about a policy and got 6 different answers, as well as a link to the official policy document. Euan’s point is that the discussion board didn’t create this situation, but surfaced it – don’t blame the system for surfacing inconsistencies or problems.

Second tool, Connect.Gateway – a place where you can post details about yourself – expertise, interests, contact details etc. plus ability to join ‘interest groups’ to bring together people with common interests – especially the ‘new’ stuff that wasn’t captured in the corporate structure.

Euan is pretty sceptical about structure in these systems – taxonomies etc. He says that with the discussion board, originally there were just two sections. Eventually he came under pressure to provide more structure to the boards – however, as soon as he did this, usage dropped. He draws a parallel to a ‘Cotswold village’ that grows up gradually over time with no particular plan, compared to organised ‘new towns’ like Milton Keynes which are ‘planned’ to be systematic, but end up being very easy to get lost in. I’m not sure this completely holds up – there are definite advantages and disadvantages to both approaches, but the point with Milton Keynes is that once you understand the layout, it becomes quite easy. Also, with a systematic approach you can apply the same system to different places – once you understand the system for numbering/naming streets in one US city, you can apply it in others. However, each Cotswold village is different. To make this more concrete, the point is that once you understand LCSH you can apply that to each library catalogue you use, but if we all used local terminology then this would not be possible. On the other hand, this perhaps means you don’t get the advantage of localisation which leads to ease of use for regular users. I think a dual approach can work, and there is no doubt that libraries have traditionally taken a very structured approach, and haven’t yet exploited the ‘organic growth’ approach to any extent.

Euan has just covered blogs as a communication and discussion tool, and is now mentioning wikis – these are all tools that have been used at the BBC.

Just as an aside – one of my reasons for blogging (especially conferences) is to share information with colleagues. However, I also want to engage in a discussion with a wider community online. At Imperial they have recently introduced the ‘Confluence’ system for blogging and wikis, which I think is great, and which some of my teams are already using or investigating. However, at the moment the blogs we can set up on Confluence are only available internally – which wouldn’t support me in engaging with the wider community – hence I’m blogging on my own site instead. I hope that this might change…

So, wikis – Euan making the point that they are highly auditable, and to some extent self-correcting.

The BBC now have guidelines on blogging etc – again, something I asked about at Imperial before I started blogging as an Imperial employee – but at the moment there doesn’t seem to be anything in Imperial policies or guidelines relating to this.

Euan now coming onto tagging etc. name checking David Weinberger and his book ‘Everything is Miscellaneous’ (http://www.amazon.com/Everything-Miscellaneous-Power-Digital-Disorder/dp/0805080430). Euan is covering use of del.icio.us – with use of tags and his ‘network’ of trusted people who use del.icio.us. Also use of RSS to track this – and telling a story of how he did a talk, and when he came off stage his RSS aggregator picked up a new item tagged with his name, and found that it was someone blogging the talk he had just given (wonder if he will pick up this post?)

Euan mentioning the use of the Google blog search – different type of content to what you would get in response to a normal Google search – he argues more useful.

Now mentioning last.fm – I still haven’t got into this (probably don’t listen to enough music!) – but the point is the power of the ‘network’ – harnessing the knowledge of a network of people. Suggesting that something like this for TV is on its way – why watch a programmed ‘channel’ when you can choose to watch something that your ‘trusted’ network is recommending.

Now mentioning ‘Plazes’ which I haven’t come across – once you connect to a wireless network, it works out where you are and shows it on a map – so people can see where you are, and you can see if you are near to people you want to meet etc.

Twitter – the ‘intensity of the mundane’ – what about ‘on my way to meeting with CEO’
Facebook – making contact
Dracos.co.uk – tracks changes to BBC News homepage – allows you to see stuff that has been changed – so can’t hide stuff that you’ve said…
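I don’t know how that site does its tracking, but the basic idea of comparing two saved snapshots of a page and listing what changed can be sketched with Python’s standard difflib module. The function name changed_lines and the sample headlines are my own inventions for illustration.

```python
import difflib


def changed_lines(old_text, new_text):
    """List lines removed ('-') or added ('+') between two snapshots."""
    diff = list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile="earlier",
            tofile="later",
            lineterm="",
        )
    )
    # The first two lines of a unified diff are the file headers; skip them,
    # then keep only the added/removed lines (not context or hunk markers).
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


if __name__ == "__main__":
    # Hypothetical snapshots of the same page at two different times.
    earlier = "Headline: first version\nBody text"
    later = "Headline: revised version\nBody text"
    for line in changed_lines(earlier, later):
        print(line)
```

A real tracker would also need to fetch and store the snapshots on a schedule; this only shows the comparison step.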

Some final examples – InnoCentive from a pharmaceutical company where questions can be posted, and people can bid for answers – story of a member of staff at an Indian university who set the questions for students, and posted answers – one student got £75k for an answer.
A final lighthearted example of an online application – Meeting Miser – works out how much a meeting has cost the organisation based on time and salaries of those involved – the point being, don’t value physical meetings over virtual collaboration.
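The arithmetic behind a tool like this is simple enough to sketch. This is only my guess at the calculation, not Meeting Miser’s actual method: convert each attendee’s annual salary to an hourly rate, add the rates up, and multiply by the length of the meeting. The figure of 1,650 working hours per year is an assumption chosen for illustration, as are the salaries.

```python
def meeting_cost(annual_salaries, duration_minutes, working_hours_per_year=1650):
    """Estimate the salary cost of a meeting.

    annual_salaries: one annual salary per attendee.
    duration_minutes: length of the meeting in minutes.
    working_hours_per_year: assumed figure used to derive hourly rates.
    """
    hourly_rates = [salary / working_hours_per_year for salary in annual_salaries]
    return sum(hourly_rates) * duration_minutes / 60


if __name__ == "__main__":
    # Hypothetical example: three attendees in a 90 minute meeting.
    cost = meeting_cost([30000, 45000, 60000], 90)
    print(f"Estimated cost: {cost:.2f}")
```

This ignores overheads such as travel and room hire, so if anything it understates the real cost of getting people into the same room.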

Coffee time!


Google Books – Balderdash and Piffle?

Balderdash and Piffle

The BBC are currently showing a series called Balderdash and Piffle which encourages viewers to help track the origins of words or phrases, and to identify the earliest usage in print or recorded media – this is done in collaboration with the OED.

I was watching this on Friday evening, and was surprised that the earliest recorded printed occurrence of the phrase "the dog’s bollocks" to describe something really great (cf. bee’s knees, cat’s whiskers) was 1989. So, I thought (in my usual slightly headstrong way) that I might find something earlier if I did some searching online.

Google Books

I quickly found myself at Google Books, and for the first time used it in anger. As usual Google allows me to use inverted commas to indicate a phrase, but I almost immediately found that the basic search didn’t allow me to limit by publication date, so I moved onto the advanced search options. This did let me limit by publication date, which was great – I could now look only for items that were published before 1989.

This turned up two hits, one from a "Dictionary of Jargon" apparently published in 1987, and one from "Vision of Thady Quinlan" from 1974. I’ll deal with these one at a time below

Vision of Thady Quinlan

In the brief results this gives the context for the use of the phrase as follows:

"I don’t give a dog’s bollocks who he is, or who you might be, or what you think you can do. You stay. He goes." Finn dropped the cases. …

This is clearly not the usage of the phrase I was looking for.

Oddly, when I look at the detailed record, this extract is not present, and the ‘snippet’ which should show the context is missing, replaced by a rather distorted "Image not available". This is irritating, but because the context is so clear in the brief view it doesn’t hamper my research – more on this later.

Dictionary of Jargon

This is dated from 1987. However, in this case there is no extract in the brief view. Going through to the full record, there is no snippet. There is some basic metadata – author, publication date, publisher, subject areas, number of pages, where the scan has come from, digitization date.

One issue with the metadata is that the author name is listed as "Jonathon. Green" – note the full stop in the middle of this – I don’t think this changes the meaning, but it points to the quality of the metadata, and this type of issue could lead to ambiguity in other contexts.

I can’t take this any further without seeing the book – without getting into the rights and wrongs of digitising, this is where I regret the lack of the full text available. There is a link to ‘Find this book in a library’, which links me through to Open Worldcat – and I find that the nearest library (that Worldcat knows about) is 6 miles away – that’s not bad going. I’d need to go to check the actual book and usage – but if it bears out its promise, that’s about 10 minutes’ work to out-research the OED and BBC!

Dodgy Metadata?

I moved onto other phrases in the BBC’s/OED’s list and found what seemed to be an earlier than recorded usage of "mucky pup" meaning a habitually messy or dirty child or adult. In this case it is in "From a Pitman’s Notebook".

In the brief display this is listed as by Arthur Archibald Eaglestone from 1925 – pre-dating the evidence found by the BBC programme, which had dated it to a popular song in 1934. The brief display also puts the phrase into context: "Tha mucky pup! Ah’ll bet tha’s ‘ad ter coom doon’t chimbley this mornin’ ‘ is accepted with a sheepish grin", which confirms that the usage is correct.

When I go through to the full record, finally I get a ‘snippet’ of the book displayed – but the actual usage is clearly in the line before the snippet starts – so I still don’t get a view of the phrase I’m looking for in context.

In the full record I also get a thumbnail of the digitised book cover – and immediately notice that in the thumbnail it says "BY ROGER DATALLER" – which contradicts the metadata (as noted above, this says the author is Arthur Archibald Eaglestone). Intrigued by this, I search for the book in the British Library, which seems to confirm Roger Dataller as the author. I then check the University of Michigan record, as this is where Google says the book was from. Failing to find the item on a title search, I search for both Dataller and Eaglestone as authors – and eventually find the record listed under Eaglestone – so it looks like Google’s metadata simply reflects that from the University of Michigan.

Now all this is fine – and again, my best bet is clearly to go and get the book from the BL, or perhaps even contact the University of Michigan to see if they can confirm the item details. But along with some of the other things I’ve found, it leads me to start distrusting the quality of the metadata I’m seeing.

Journals

I moved onto searching for an occurrence of "codswallop" from before 1959, and ideally something that linked it to its origin. I find 16 records – and the second one is dated 1869 – I’m very excited by this – almost 100 years earlier than the OED has recorded. However, as soon as I start to look at the entries in detail I notice that many seem to be journals rather than books. The problem here is that the date Google records as the ‘publication’ date seems to be the original publication date of the journal. So journals are not listed by issue, but as a single record covering all the articles from the journal. Unfortunately it seems to be impossible to tell which issue or date a specific piece of text is from. As an example, the search for "codswallop" finds a reference to this in (appropriately) "Library Review" – this has a use of "Load of Codswallop" dated as 1927. Looking at the full record, the snippet reveals that the usage is followed by the reference "Evening News, 4 Aug. 1970)" – clearly indicating that this particular article is much later than 1927 – but there is nothing further to date the actual usage. The other results for codswallop have a similar problem – but without the helpful glossing to give any indication of date.

Summary

In summary I found Google Books brilliant but ultimately frustrating. The ability to search full-text was invaluable and discovered references that (it seems) have not been found before. On the other hand, the lack of full-text display meant that it wasn’t possible to check the context, and even when a snippet displayed it far too often didn’t actually display the relevant snippet (often a line or two out).

The fact that I found errors in the metadata in a few cases made me suspicious of the quality in general. To be fair, these errors may have come from the original library metadata – and I wouldn’t have realised the error if I had simply seen the bibliographic record in the original library catalogue.

Finally the inability to narrow searching of journal/serial content down to more than the original publication date of the journal – and the inability to restrict searching to just monographs, or just series – meant that it was often impossible to tell whether what I’d found was useful or not.

Google Books and other digitisation projects have the potential to unlock information that might not otherwise ever be found. However, the implementation isn’t quite there yet, and is limited by the inability to display full text for many items.

We are a little way off understanding how full-text searching can be successfully combined with the more traditional structured searching that library catalogues offer (and systems offering faceted searching, such as Endeca, are in the process of exploiting). However, what is clear to me is that searching for information from digitised printed material is different to searching ‘the web’ (although this may simply be a function of the youth and lack of sophistication of the web I guess) and it would be great to see Google and Libraries collaborating on improving this service by combining the best of both.

UPDATED:
Just a few more observations:
You can’t limit by language of material
The OCR used doesn’t seem to work so well with foreign language materials
Quite a lot of OCR problems – ones mistaken for lowercase ‘L’ and vice versa
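
One possible workaround for the one/lowercase ‘L’ confusion is to search for every spelling of a term that the OCR might have produced. The sketch below is purely illustrative: `ocr_variants` and the `CONFUSABLE` table are names invented for this example, and nothing here reflects how Google Books itself handles OCR errors.

```python
from itertools import product

# Characters the OCR appears to confuse with each other (assumed table,
# based only on the one / lowercase 'L' mix-up noted above).
CONFUSABLE = {"1": "1l", "l": "l1"}


def ocr_variants(term):
    """Return every spelling of `term` with '1' and 'l' swapped in any position."""
    choices = [CONFUSABLE.get(ch, ch) for ch in term]
    return sorted({"".join(combo) for combo in product(*choices)})


# A date such as 1869 might have been scanned as "l869".
for variant in ocr_variants("1869"):
    print(variant)
```

The number of variants doubles with each confusable character, so this is only practical for short terms such as dates.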

Clouding Over

Just playing around with ZoomClouds. Assuming it works, you should see below a ‘cloud’ (that is, a collection of keywords) from the RHUL library recent acquisitions list. The larger the word, the more times it occurred in the list. Seems pretty cool.
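
The sizing idea behind a cloud can be sketched in a few lines. This is a minimal illustration, not a description of how ZoomClouds works internally: the function name, the stopword list, the point sizes and the sample titles are all assumptions made for the example.

```python
import re
from collections import Counter

# A tiny, assumed stopword list; a real one would be much longer.
STOPWORDS = frozenset({"a", "and", "in", "of", "the", "to"})


def cloud_sizes(titles, min_pt=10, max_pt=36):
    """Map each keyword to a font size that grows with its frequency."""
    words = []
    for title in titles:
        words.extend(
            w for w in re.findall(r"[a-z']+", title.lower()) if w not in STOPWORDS
        )
    counts = Counter(words)
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1  # avoid dividing by zero when all counts are equal
    return {
        word: round(min_pt + (count - lo) * (max_pt - min_pt) / span)
        for word, count in counts.items()
    }


# Hypothetical titles standing in for an acquisitions list.
sizes = cloud_sizes(["History of Music", "Music and the Mind", "A History of Art"])
for word, size in sorted(sizes.items()):
    print(f"{word}: {size}pt")
```

Linear scaling is the simplest choice; where a few words dominate the list, scaling on the logarithm of the count tends to give a more readable cloud.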

iTunes U (but not K)

Talking to the Apple representatives at UCISA, it seems like iTunes U isn’t going to be available in the UK in the immediate future due to ‘import/export restrictions’. This seems a real shame. Can’t believe Apple managed this for the music industry, but education has defeated them (for the moment at least).