This blog post was written during a presentation at the British Library Labs Symposium in November 2014. It is likely full of errors and omissions, having been written in real time.
Dr Beatrice Alex, University of Edinburgh
Looking for mentions of places in Edinburgh using data sources including:
* British Library Nineteenth Century Books Collection (main source)
* Project Gutenberg
* Oxford Text Archive data
Interested in using EEBO/ECCO
* Digitised documents from collections above
* Document retrieval and filtering -> to get ranked lists of Edinburgh-specific candidates
* Manual curation – curation of Edinburgh-specific literature – need a human in the loop to get the level of detail they desired
* Text mining – fine-grained location extraction and geo-referencing using the Edinburgh Geoparser
* All data stored in database that then powers the visualisations etc.
Big data IN -> Small data OUT
All input documents must first be:
* Converted to a common format
* Identified as written English text
* Post-corrected automatically if necessary
* Linguistically pre-processed
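As a rough illustration of the "identified as written English text" check, the sketch below scores a text by the share of its tokens that are common English function words. The word list, the function names and the 0.2 threshold are all assumptions made for this example; the talk did not say how the project actually identifies English text.

```python
import re

# Hypothetical, deliberately tiny list of common English function words.
ENGLISH_STOPWORDS = {
    "the", "of", "and", "to", "a", "that", "is", "was", "it",
    "for", "with", "as", "his", "on", "he", "at", "by",
}

def english_score(text):
    """Return the fraction of tokens that are common English function words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in ENGLISH_STOPWORDS)
    return hits / len(tokens)

def looks_english(text, threshold=0.2):
    """Crude check: True if enough tokens are English function words.

    The threshold is an assumed value, not one taken from the project.
    """
    return english_score(text) >= threshold
```

A heuristic like this would mislabel very short or OCR-damaged texts, which is one reason a real pipeline would probably rely on a dedicated language identification tool.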
- Document retrieval. The goal is to find all Edinburgh loco-specific items which fit our remit (fiction, autobio, travel)
- Get ranked documents
- Assisted curation is done with the Palimpsest Annotation Tool (developed at St Andrews). A human makes decisions about whether items are ‘in or out’ (e.g. poetry is marked as such and then excluded for the moment – may come back to this later)
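To make the retrieval step concrete, here is a minimal sketch of one way to produce a ranked list of candidates for a human curator: score each document by how often it mentions Edinburgh place names, normalised by length. The seed terms, the scoring formula and the function names are assumptions for illustration only, not the project's actual retrieval method.

```python
import re
from collections import Counter

# Hypothetical seed list of single-word Edinburgh place names.
EDINBURGH_TERMS = ["edinburgh", "canongate", "leith", "holyrood", "grassmarket"]

def edinburgh_score(text):
    """Return mentions of seed terms per 1,000 tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    mentions = sum(counts[term] for term in EDINBURGH_TERMS)
    return 1000.0 * mentions / len(tokens)

def rank_documents(docs):
    """Rank a {doc_id: text} mapping, highest score first."""
    scored = [(doc_id, edinburgh_score(text)) for doc_id, text in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

docs = {
    "a": "A walk down the Canongate to Holyrood in Edinburgh",
    "b": "A voyage to London and Paris",
    "c": "Edinburgh is mentioned once in this rather longer piece of text about other things",
}
ranked = rank_documents(docs)
```

The curator would then work down `ranked` from the top, marking each item in or out.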
* Text mining tools use the Edinburgh Geoparser to mark up place names and resolve them to coordinates with a choice of gazetteer as the reference source – e.g. Geonames
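The idea of marking up place names and resolving them against a gazetteer can be sketched with a toy lookup. This is not the Edinburgh Geoparser's interface: the in-memory gazetteer, its approximate coordinates and the `georeference` function are all assumptions made for this example, and a real geoparser also has to disambiguate between places that share a name.

```python
import re

# Toy gazetteer: place name -> (latitude, longitude).
# Coordinates are approximate and for illustration only.
TOY_GAZETTEER = {
    "Edinburgh Castle": (55.9486, -3.1999),
    "Arthur's Seat": (55.9441, -3.1618),
    "Leith": (55.975, -3.17),
}

def georeference(text, gazetteer=TOY_GAZETTEER):
    """Return (name, start_offset, (lat, lon)) for each gazetteer name in text.

    Matching is exact and case-sensitive on whole words; results are
    ordered by their position in the text.
    """
    matches = []
    for name, coords in gazetteer.items():
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text):
            matches.append((name, match.start(), coords))
    return sorted(matches, key=lambda item: item[1])

sentence = "From Edinburgh Castle we walked to Leith."
found = georeference(sentence)
```

Each entry in `found` carries the character offset of the mention, so the coordinates can be tied back to the passage they came from.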
Not all place matches in the gazetteer are interesting to the project (e.g. ‘Spring’), so these are cleaned out. The gazetteer has been built and the team is now building on it – e.g. they want to do further linguistic analysis, and are building a mobile app so you can explore the literature based on your location
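Cleaning out uninteresting matches could be as simple as a stoplist applied to the extracted mentions. The sketch below assumes each match is a dictionary with a `name` key; that data shape, the stoplist and the function name are hypothetical, with ‘Spring’ being the only example given in the talk.

```python
# Hypothetical stoplist of gazetteer names that are usually ordinary words.
UNINTERESTING_NAMES = {"Spring"}

def drop_uninteresting(matches, stoplist=UNINTERESTING_NAMES):
    """Keep only the matches whose name is not on the stoplist."""
    return [match for match in matches if match["name"] not in stoplist]

matches = [{"name": "Spring"}, {"name": "Leith"}, {"name": "Canongate"}]
kept = drop_uninteresting(matches)
```

A stoplist removes every mention of a name, including any genuine ones, so a fuller approach would probably look at the surrounding context before discarding a match.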
Final outputs will be web-based visualisations and a mobile app – the aim is to create interfaces for both literary scholars and the general public.