This blog post was written during a presentation at the British Library Labs Symposium in November 2014. It is likely full of errors and omissions, having been written in real time.
Why this tool?
Libraries contain thousands of literary & documentary artefacts up to 4,000 years old. How can we bring these effectively to a modern audience?
“Images on their own are dull” – browse interfaces tend not to give the user much information. Even at the page-image level, it can be difficult to make sense of what you are seeing.
One approach is to put the text on top of an image:
* correlates words in image/text
* can be searched, but…
* only works with OCR
* if the text has errors, it is hard to fix
* text can’t be formatted or annotated
Another approach is to put the text next to the image:
* format text for different devices
* can annotate text for study
* easy to verify and edit
* must keep image and text in sync
* increases mental effort to find corresponding words in the text/image
If you are going to link text to the image of the text, what level should you do this at?
* link at page level – useful but too coarse; doesn’t reduce the mental effort much
* link at line level
* link at word level
Word level is probably the most desirable, but how to achieve it?
* Manually draw shapes around words
* link them to the text by adding markup to the transcription
* tedious & expensive
* markup gets complex
* end up needing multiple transcriptions
* find words in an image without recognising their content
* Use an existing transcript of the page content
* Link these two components mostly automatically
    Image          Text or HTML
        \              /
         TILT (web-based GUI)
                |
                v
             GeoJSON
First you have to prepare the image in the GUI – identify the different parts of the text:
* Colour to greyscale
* Greyscale to Black and White
* Find lines
* Find word shapes
* link word-shapes to text
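The first few preparation steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions – the helper names, the fixed luminance weights, and the simple fixed threshold are mine, not TILT’s actual code:

```python
# Sketch of the preparation pipeline: colour -> greyscale -> black & white,
# then line detection via a horizontal ink projection.
# Images are represented as plain nested lists for illustration.

def to_greyscale(rgb_image):
    # Standard luminance weighting for each (r, g, b) pixel.
    return [[int(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb_image]

def to_black_and_white(grey, threshold=128):
    # 1 = ink (dark pixel), 0 = background; a real tool would pick
    # the threshold adaptively rather than hard-coding it.
    return [[1 if v < threshold else 0 for v in row] for row in grey]

def find_lines(bw):
    # A text line is a maximal run of rows containing at least one ink pixel.
    lines, start = [], None
    for y, row in enumerate(bw):
        if any(row):
            if start is None:
                start = y
        elif start is not None:
            lines.append((start, y - 1))
            start = None
    if start is not None:
        lines.append((start, len(bw) - 1))
    return lines
```

On a tiny synthetic image, `find_lines` returns `(first_row, last_row)` pairs, one per detected text line.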
Recognising words is a challenge:
* (in most languages) Words are blocks of connected pixels with small gaps between them
* But if there are 300 words on a page, are the 299 largest gaps always between words?
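The “largest gaps” heuristic described above can be sketched as follows. The function name and input representation are my own for illustration – it takes the widths of the gaps between connected-pixel blobs along a line and picks the `n_words - 1` widest as word boundaries:

```python
# Heuristic from the notes: with n_words words on a line, assume the
# n_words - 1 widest inter-blob gaps are the word boundaries.
# As the notes warn, this assumption does not always hold.

def word_boundary_gaps(gap_widths, n_words):
    # Rank gap indices by width, keep the widest n_words - 1,
    # and return them in reading order.
    ranked = sorted(range(len(gap_widths)),
                    key=lambda i: gap_widths[i], reverse=True)
    return sorted(ranked[:n_words - 1])
```

For example, with gap widths `[2, 9, 1, 8, 2]` and three words on the line, the boundaries fall at gaps 1 and 3.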
How to represent word shapes? Simple polygons do the trick
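Since the diagram above shows GeoJSON as the output, a word shape presumably ends up as a GeoJSON Polygon. Here is a sketch of what one such feature might look like – the coordinates and the `offset` property linking the shape to a position in the transcription are hypothetical, not TILT’s actual schema:

```python
import json

# One word shape as a GeoJSON Feature with a Polygon geometry.
# Coordinates are pixel positions; the outer ring is closed
# (first point == last point), as GeoJSON requires.
word_shape = {
    "type": "Feature",
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[10, 40], [85, 40], [85, 62], [10, 62], [10, 40]]],
    },
    # Hypothetical link into the transcription: a character offset.
    "properties": {"offset": 1204},
}
print(json.dumps(word_shape))
```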
The tool measures the width of each word shape in the image and then tries to match it against word lengths in the transcription – so if the word shapes have not been recognised correctly, the matching algorithm just selects more or less text in the transcription.
Now looking at using ‘anchor points’ in the text that allow the user to identify the start and end of ‘clean’ text in a larger manuscript, which might have messy sections that can’t be handled automatically. This lets you do what you can automatically and only deal with the messy bits manually.
Still working on a GUI to work with the tool.
Code on GitHub