{"id":1678,"date":"2014-11-03T17:25:51","date_gmt":"2014-11-03T16:25:51","guid":{"rendered":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/?p=1678"},"modified":"2014-11-03T17:48:06","modified_gmt":"2014-11-03T16:48:06","slug":"tilt-text-to-image-linking-tool","status":"publish","type":"post","link":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/2014\/11\/tilt-text-to-image-linking-tool\/","title":{"rendered":"TILT: Text to Image Linking Tool"},"content":{"rendered":"<p>This blog post was written during a presentation at the <a href=\"http:\/\/britishlibrary.typepad.co.uk\/digital-scholarship\/2014\/10\/british-library-labs-symposium-2014.html\">British Library Labs Symposium<\/a> in November 2014. It is likely full of errors and omissions, having been written in real time.<\/p>\n<p>Why this tool?<\/p>\n<p>Libraries contain thousands of literary &amp; documentary artefacts up to 4000 years old. How to bring these effectively to a modern audience?<\/p>\n<p>&#8220;Images on their own are dull&#8221; &#8211; browse interfaces tend not to give the user much information. Even at the page image level, it can be difficult to make sense of what you are seeing.<\/p>\n<p>One approach is to put the text on top of an image:<br \/>\n* correlates words in image\/text<br \/>\n* can be searched but&#8230;<br \/>\n* only works with OCR<br \/>\n* if text has errors, hard to fix<br \/>\n* text can&#8217;t be formatted or annotated<\/p>\n<p>Another approach is to put the text next to the image:<br \/>\n* format text for different devices<br \/>\n* can annotate text for study<br \/>\n* easy to verify and edit<br \/>\n* BUT<br \/>\n* must keep image and text in sync<br \/>\n* increases mental effort to find corresponding words in the text\/image<\/p>\n<p>If you are going to link text to the image of the text, what level should you do this at?<br \/>\n* link at page-level &#8211; useful but too coarse. 
Doesn&#8217;t reduce mental effort much<br \/>\n* link at line level<br \/>\n* link at word level<\/p>\n<p>Word level is probably most desirable, but how to achieve it?<br \/>\nManual approach:<br \/>\n* manually draw shapes around words<br \/>\n* link them to the text by adding markup to the transcription<br \/>\n* BUT<br \/>\n* tedious &amp; expensive<br \/>\n* markup gets complex<br \/>\n* end up needing multiple transcriptions<\/p>\n<p>TILT approach:<br \/>\n* find words in an image without recognising their content<br \/>\n* use an existing transcript of the page content<br \/>\n* link these two components mostly automatically<\/p>\n<p>Design:<\/p>\n<p>TILT Service<br \/>\n^ ^ ^<br \/>\nImage Text or HTML GeoJSON<br \/>\n^ ^ v<br \/>\nTILT web-based GUI<\/p>\n<p>First you have to prepare the image in the GUI &#8211; identify different parts of the text<br \/>\nStages:<br \/>\n* Colour to greyscale<br \/>\n* Greyscale to black and white<br \/>\n* Find lines<br \/>\n* Find word shapes<br \/>\n* Link word shapes to text<\/p>\n<p>Recognising words is a challenge:<br \/>\n* (in most languages) words are blocks of connected pixels with small gaps between them<br \/>\n* But if there are 300 words on a page, are the 299 largest gaps always between words?<\/p>\n<p>How to represent word shapes? Simple polygons do the trick.<\/p>\n<p>TILT measures the width of words in the text, and then tries to match these against lengths in the transcription &#8211; so if the word shapes have not been recognised correctly, the matching algorithm just selects more or less text in the transcription.<\/p>\n<p>Now looking at the use of &#8216;anchor points&#8217; in the text that allow the user to identify the start and end of &#8216;clean&#8217; text in a larger manuscript which might have messy sections that can&#8217;t be done automatically. 
This allows you to do what you can automatically, and only deal with the messy bits in a manual way.<\/p>\n<p>Still working on a GUI to work with<\/p>\n<p>Code on GitHub<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This blog post was written during a presentation at the British Library Labs Symposium in November 2014. It is likely full of errors and omissions, having been written in real time. Why this tool? Libraries contain thousands of literary &amp; documentary artefacts up to 4000 years old. How to bring these effectively to a modern audience? &#8220;Images [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[99],"class_list":["post-1678","post","type-post","status-publish","format-standard","hentry","category-uncategorized","tag-bl_labs"],"_links":{"self":[{"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/posts\/1678","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/comments?post=1678"}],"version-history":[{"count":3,"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/posts\/1678\/revisions"}],"predecessor-version":[{"id":1698,"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/posts\/1678\/revisions\/1698"}],"wp:attachment":[{"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/media?parent=1678"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/categories?post=1
678"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/tags?post=1678"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}