Google Books – Balderdash and Piffle?

Balderdash and Piffle

The BBC are currently showing a series called Balderdash and Piffle, which encourages viewers to help track the origins of words or phrases and to identify the earliest usage in print or recorded media – this is done in collaboration with the OED.

I was watching this on Friday evening, and was surprised that the earliest recorded printed occurrence of the phrase "the dog’s bollocks" to describe something really great (cf. bee’s knees, cat’s whiskers) was 1989. So, I thought (in my usual slightly headstrong way) that I might find something earlier if I did some searching online.

Google Books

I quickly found myself at Google Books, and for the first time used it in anger. As usual Google allows me to use inverted commas to indicate a phrase, but I almost immediately found that the basic search didn’t allow me to limit by publication date, so I moved onto the advanced search options. This did let me limit by publication date, which was great – I could now look only for items that were published before 1989.

This turned up two hits, one from a "Dictionary of Jargon" apparently published in 1987, and one from "Vision of Thady Quinlan" from 1974. I’ll deal with these one at a time below.

Vision of Thady Quinlan

In the brief results this gives the context for the use of the phrase as follows:

"I don’t give a dog’s bollocks who he is, or who you might be, or what you think you can do. You stay. He goes." Finn dropped the cases. …

This is clearly not the usage of the phrase I was looking for.

Oddly, when I look at the detailed record, this extract is not present, and the ‘snippet’ which should show the context is missing, replaced by a rather distorted "Image not available". This is irritating, but because the context is so clear in the brief view it doesn’t hamper my research – more on this later.

Dictionary of Jargon

This is dated from 1987. However, in this case there is no extract in the brief view. Going through to the full record, there is no snippet. There is some basic metadata – author, publication date, publisher, subject areas, number of pages, where the scan has come from, digitization date.

One issue with the metadata is that the author name is listed as "Jonathon. Green" – note the full stop in the middle of this. I don’t think this changes the meaning, but it points to the quality of the metadata, and this type of issue could lead to ambiguity in other contexts.
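A minimal sketch, in Python, of how a stray full stop like this could be ignored when comparing author names from different records. The function name `normalise_author` is hypothetical and not part of any Google Books or library catalogue tool; it only folds case, punctuation and spacing.

```python
import re


def normalise_author(name: str) -> str:
    """Return a rough comparison key for an author name (hypothetical helper)."""
    # Fold case so "GREEN" and "Green" compare equal.
    lowered = name.casefold()
    # Replace punctuation with spaces, which drops the stray full stop
    # in "Jonathon. Green".
    cleaned = re.sub(r"[^\w\s]", " ", lowered)
    # Collapse the runs of whitespace left behind by the substitution.
    return " ".join(cleaned.split())


print(normalise_author("Jonathon. Green") == normalise_author("Jonathon Green"))
```

This only papers over punctuation noise. It would not reconcile an inverted form such as "Green, Jonathon", nor two different names attached to the same book.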

I can’t take this any further without seeing the book – without getting into the rights and wrongs of digitising, this is where I regret the lack of the full text. There is a link to ‘Find this book in a library’, which links me through to Open WorldCat – and I find that the nearest library (that WorldCat knows about) is 6 miles away – that’s not bad going. I’d need to go and check the actual book and usage – but if it bears out its promise, that’s about 10 minutes’ work to out-research the OED and BBC!

Dodgy Metadata?

I moved onto other phrases in the BBC’s/OED’s list and found what seemed to be an earlier-than-recorded usage of "mucky pup", meaning a habitually messy or dirty child or adult. In this case it is in "From a Pitman’s Notebook".

In the brief display this is listed as by Arthur Archibald Eaglestone from 1925 – pre-dating the evidence found by the BBC programme, which had dated it to a popular song in 1934. The brief display also puts the phrase into context: "Tha mucky pup! Ah’ll bet tha’s ‘ad ter coom doon’t chimbley this mornin’ ‘ is accepted with a sheepish grin", which confirms that the usage is correct.

When I go through to the full record, finally I get a ‘snippet’ of the book displayed – but the actual usage is clearly in the line before the snippet starts – so I still don’t get a view of the phrase I’m looking for in context.

In the full record I also get a thumbnail of the digitised book cover – and immediately notice that in the thumbnail it says "BY ROGER DATALLER" – which contradicts the metadata (as noted above, this says the author is Arthur Archibald Eaglestone). Intrigued by this, I search for the book in the British Library, which seems to confirm Roger Dataller as the author. I then check the University of Michigan record, as this is where Google says the book was from. Failing to find the item on a title search, I search for both Dataller and Eaglestone as authors – and eventually find the record listed under Eaglestone – so it looks like Google’s metadata simply reflects that from the University of Michigan.

Now all this is fine – and again, my best bet is clearly to go and get the book from the BL, or perhaps even contact the University of Michigan to see if they can confirm the item details. But along with some of the other things I’ve found, it leads me to start distrusting the quality of the metadata I’m seeing.

Journals

I moved onto searching for an occurrence of "codswallop" from before 1959, and ideally something that linked it to its origin. I find 16 records – and the second one is dated 1869 – I’m very excited by this – almost 100 years earlier than the OED has recorded. However, as soon as I start to look at the entries in detail I notice that many seem to be journals rather than books. The problem here is that the date Google records as the ‘publication’ date seems to be the date the journal was first published. So journals are not listed by issue; there is just a single record covering all the articles from the journal. Unfortunately it seems to be impossible to tell which issue or date a specific piece of text is from. As an example, the search for "codswallop" finds a reference to this in (appropriately) "Library Review" – this has a use of "Load of Codswallop" dated as 1927. Looking at the full record, the snippet reveals that the usage is followed by the reference "Evening News, 4 Aug. 1970)" – clearly indicating that this particular article is much later than 1927 – but there is nothing further to date the actual usage. The other results for codswallop have a similar problem – but without the helpful glossing to give any indication of date.
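The check I did by eye here, spotting a year in the snippet that is later than the record’s date, could be sketched roughly as follows. This is only an illustration: the function name `years_after` and the year pattern are my own assumptions, and it presumes the snippet text is available as a plain string.

```python
import re

# Four-digit years from 1500 to 2099; the range is an arbitrary assumption.
YEAR_PATTERN = re.compile(r"\b(1[5-9]\d\d|20\d\d)\b")


def years_after(snippet: str, record_year: int) -> list[int]:
    """List years mentioned in the snippet that are later than record_year."""
    found = {int(year) for year in YEAR_PATTERN.findall(snippet)}
    return sorted(year for year in found if year > record_year)


# The reference quoted in the "Library Review" snippet, against the
# record's stated date of 1927.
print(years_after("Evening News, 4 Aug. 1970)", 1927))
```

Any year returned means the text cannot be as old as the record claims. An empty list proves nothing, since most snippets carry no date at all.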

Summary

In summary, I found Google Books brilliant but ultimately frustrating. The ability to search full text was invaluable and turned up references that (it seems) have not been found before. On the other hand, the lack of full-text display meant that it wasn’t possible to check the context, and even when a snippet was displayed, far too often it didn’t show the relevant passage (it was often a line or two out).

The fact that I found errors in the metadata in a few cases made me suspicious of the quality in general. To be fair, these errors may have come from the original library metadata – and I wouldn’t have realised the error if I had simply seen the bibliographic record in the original library catalogue.

Finally, the inability to narrow searching of journal/serial content down to anything more precise than the original publication date of the journal – and the inability to restrict searching to just monographs, or just serials – meant that it was often impossible to tell whether what I’d found was useful or not.

Google Books and other digitisation projects have the potential to unlock information that might not otherwise ever be found. However, the implementation isn’t quite there yet, and is limited by the inability to display full text for many items.

We are a little way off understanding how full-text searching can be successfully combined with the more traditional structured searching that library catalogues offer (and that systems offering faceted searching, such as Endeca, are in the process of exploiting). However, what is clear to me is that searching for information from digitised printed material is different to searching ‘the web’ (although this may simply be a function of the youth and lack of sophistication of the web, I guess), and it would be great to see Google and libraries collaborating on improving this service by combining the best of both.

UPDATED:
Just a few more observations:
You can’t limit by language of material
The OCR used doesn’t seem to work so well with foreign language materials
Quite a lot of OCR problems – the digit one mistaken for a lowercase ‘L’ and vice versa
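One way to work around the one/‘L’ confusion when searching would be to try every variant of the search term with those characters swapped. A minimal sketch, assuming only that pair is confusable; `ocr_variants` is a hypothetical helper, not a Google Books feature.

```python
from itertools import product

# The only confusion noted above: the digit one and lowercase "l".
# Adding further pairs to this table would be an assumption.
CONFUSABLE = {"l": "l1", "1": "1l"}


def ocr_variants(term: str) -> list[str]:
    """Return the term plus every spelling with '1' and 'l' swapped."""
    # Each character offers either itself or both confusable forms.
    choices = [CONFUSABLE.get(ch, ch) for ch in term]
    return ["".join(combo) for combo in product(*choices)]


print(ocr_variants("codswallop"))
```

The number of variants doubles with each confusable character, so this is only practical for short terms.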
