Workshop Reports

This morning kicks off with reports from the four breakout sessions held yesterday afternoon

Workflow

What is a ‘workflow tool’? Everything from Excel to Goobi – the group decided to be inclusive
Things people were using:
Czech national database system ‘Digital Registry’ – locally developed s/w with the possibility it might go open source
Goobi
Zend framework
In-house systems

Those not using workflow tools often saw them as overly complex for small projects
Existing tools are related to the scale of the project
But projects don’t have to be that large to generate significant workflow – 500 manuscripts, or fewer if lots of people are involved

What would ideal workflow tool look like?
Reusable in multiple projects at all sizes
Monitors performance
Gives statistical evidence about every step
Track books & activities over multiple sites

Needs to manage notes, tags and details for every item (fragility, lack of metadata, broken parts etc)
Tool should interoperate with ILS/Repository s/w
Workflow systems should work with each other – e.g. being able to push information to centralised workflow tools that could aggregate a view of what has been done (see the sketch after this list)
Should index new digital item in the ILS
Automatically create records in the ILS when new digital item is available
Scalable … and free of charge!
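
As a rough illustration of the ‘push information to centralised workflow tools’ idea above, here is a minimal Python sketch of a site reporting a completed step for one item to an aggregator. The endpoint URL and payload fields are invented for illustration – no such standard service or API exists.

```python
import json
import urllib.request

# Hypothetical endpoint for a central workflow aggregator -- the URL and
# payload shape are illustrative only, not part of any existing tool.
AGGREGATOR_URL = "https://example.org/workflow-aggregator/api/items"

def report_step_done(item_id: str, step: str, site: str) -> None:
    """Push a 'step completed' notification for one digitised item."""
    payload = json.dumps({
        "item": item_id,   # e.g. a shelfmark or local identifier
        "step": step,      # e.g. "scanning", "QA", "ingest"
        "site": site,      # which location did the work
    }).encode("utf-8")
    req = urllib.request.Request(
        AGGREGATOR_URL, data=payload,
        headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(req) as resp:
        resp.read()  # a 2xx status means the update was recorded

report_step_done("MS-0042", "scanning", "site-A")
```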

Business Models
Looked at 6 different business models

  1. Publisher model
    • ProQuest’s Early European Books. The publisher funds digitisation and offers a subscription for x years to subscribers outside the country of origin; free in the country of origin; in year x+1 the resource becomes fully open access
    • brightsolid and the BL – making 40 million pages of newspapers available; brightsolid makes the material available via a paid-for website
    • lots of potential for more activity in this model
  2. National Government funding
    • e.g. France gave 750 million euros to libraries to digitise materials. However, the government then decided it wanted a financial return, so the National Library has now launched an appeal for private partners
    • Seems unlikely to be viable model in near future
    • Government Research Councils/Research funders have mandates for data management plans and data curation – but these are perhaps not always observed by those doing the work. Perhaps if compliance were better enforced it would give better outcomes
  3. International funding – specifically EU funding
    • LIBER in discussion with EU bodies to have libraries considered as part of the European research infrastructure – which would open new funding streams through a new framework programme
  4. Philanthropic funding
    • National Endowment for the Humanities/Library of Congress fund a National Digital Newspaper programme
    • Santander, which funds digitisation – e.g. of Cervantes. The motivation for the company is good PR

Two further models that are possibilities going forward:

  1. Public funding/crowdsource model
    • e.g. Wikipedia raised ‘crowdsourced’ funding
    • Can the concept of citizen science be applied to digitisation – e.g. FamilySearch has 400,000 volunteers doing scanning and transcription of genealogical records
  2. Social Economy Enterprise Models
    • Funders might fund digitisation for ‘public good’ reasons – people employed will have more digital skills as a result; it progresses an employment agenda – for the funder the digitisation is not the point, it is the other outcomes
    • Such a model might draw investors from a range of sectors – e.g. the KB in The Netherlands, which uses such an approach for the preparation of material for digitisation

User Experience
First discussed ‘what do we know about user experience’ – have to consider what users want from digitisation
Crowdsourcing – experience and expectations – experience so far seems to suggest there is lots of potential. However, the group noted the need to engage with communities via social media etc. There is a question of how sustainable these approaches are – you need a plan for how to preserve the materials being added by volunteers. You also have to have clear goals – volunteers need to feel they have reached an ‘endpoint’ or achieved something concrete

Challenge to get input from ‘the crowd’ outside scholarly community

Metadata and Reuse
Group got a good view of the state of ‘open (meta)data’ across Europe and slightly beyond. Lots of open data activity across the board – although better developed in some areas. In some countries clear governmental/political agenda for ‘open’, even if library data not always top of the list

Some big plans to publish open data – e.g. 22 million records from a library network in Bavaria planned for later this year.

A specific point of interest was a ruling in France that publicly funded archives could not restrict use of the data they made available – that is they could not disallow commercial exploitation of content e.g. by genealogical companies

Another area of legal interest: in Finland a new data management law emphasises interoperability, open data, open metadata etc. The National Library is building a ‘metadata reserve’ (what would previously have been called a ‘union catalogue’) – bibliographic data, identifiers, authorities.

There was some interesting discussion around the future of metadata – especially the future of MARC in light of the current Library of Congress initiative to look at a new bibliographic framework – but not very clear what is going to happen here. However discussion suggested that whatever comes there will be an increased use of identifiers throughout the data – e.g. for people, places, subjects etc.

It was noted that libraries, archives and museums have very different traditions and attitudes in terms of their metadata – which leads to different views on offering ‘open’ (meta)data. The understanding of ‘metadata’ is very different across libraries, museums and archives. The point was made that ‘not all metadata is equal’ – for example an abstract may need to be treated differently when ‘opening up’ data than the author/title. A further example was where table of contents information had been purchased separately from catalogue records, and so had different rights/responsibilities attached in terms of sharing with others

There was some discussion of the impact of projects which require a specific licence. For example, some concern that the Europeana exchange agreements which will require data to be licensed as CC0 will lead to some data being withdrawn from the aggregation.

Final part of discussion turned to search engines – they are looking at different metadata formats – i.e. http://schema.org. Also our attitudes to data sharing change when there is clear benefit – while some organisations enter into specific agreements with search engines – e.g. OCLC – in the main most libraries seemed happy for Google to index their data without the need for agreements or licensing. Those with experience noted the immediate increase in web traffic once their collections were indexed in Google.
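
For anyone curious what the schema.org route looks like in practice, here is a minimal sketch (my own, not from the discussion) that builds a JSON-LD description of a digitised book in Python; the field choices and URL are illustrative only.

```python
import json

# A minimal schema.org description of a digitised book, serialised as JSON-LD.
# See https://schema.org/Book for the full vocabulary.
record = {
    "@context": "https://schema.org",
    "@type": "Book",
    "name": "Example Title",
    "author": {"@type": "Person", "name": "Example Author"},
    "datePublished": "1850",
    "url": "https://library.example.org/items/12345",
}

# Embed the output in the item's landing page inside a
# <script type="application/ld+json"> element so search engines can pick it up.
print(json.dumps(record, indent=2))
```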

Upscaling digitisation at the Wellcome Library

Wellcome Library – part of the Wellcome Trust, a charitable foundation which funds research, including research into and contextualisation of medical history

Wellcome library has a lot of unique content – which is the focus of their digitisation efforts. Story so far:

Image library created from transparencies/prints – and on demand photography – 300,000 images
Journal backfiles digitisations
Wellcome Film – 500+ titles
AIDS poster projects
Arabic manuscripts – 500 manuscripts (probably biggest single project)
17th Century recipe books

Contribute to Europeana

Digitisation is part of the long-term strategy for the library – but while the aim is eventually to digitise everything, they need to target content.

Digitising archival material and around 2,000 books from 1850–1990 (a pilot project – which will of course test the waters on copyright). Also contributing to the Early European Books project – a commercial partnership with ProQuest.

Their approach to digitisation projects has changed. Previously they did smaller (<10,000 pages) projects, relatively ad hoc, entirely open access, library centric, with no major IT investment – but they are now doing large projects (>100,000 pages) with involvement from a wider range of stakeholders – within and outside the organisation – which need major IT development. Also, increasing commercial partnerships mean not all outputs will be ‘open access’ – although they feel that this is about additional material that would not have been done otherwise…

Need to move

  • Manual processes -> Automated processes (where possible)
  • Centralised conservation -> distributed conservation
  • Low QA -> increased QA, error minimization
  • Using TIFF -> JPEG 2000 (now 100% JPEG 2000 once the digital copy is created – see the sketch after this list)
  • From detailed and painstaking to streamlined and pragmatic
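
On the TIFF -> JPEG 2000 point, a minimal sketch of a batch conversion using the Pillow library (which needs OpenJPEG support installed); the settings are illustrative and not the Wellcome Library’s actual profile.

```python
from pathlib import Path
from PIL import Image  # Pillow; JPEG 2000 output requires the OpenJPEG library

def tiff_to_jp2(src_dir: str, dest_dir: str) -> None:
    """Convert every TIFF master in src_dir to a JPEG 2000 copy in dest_dir."""
    out = Path(dest_dir)
    out.mkdir(parents=True, exist_ok=True)
    for tiff in Path(src_dir).glob("*.tif*"):
        with Image.open(tiff) as img:
            # Pillow's quality_mode/quality_layers options control lossy vs
            # lossless output; the defaults used here are purely illustrative.
            img.save(out / (tiff.stem + ".jp2"), format="JPEG2000")

tiff_to_jp2("masters_tiff", "derivatives_jp2")
```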

Streamlining:

  • Staff dedicated to specific projects or streams of work
  • Carry out sample workflow tests for new types of material
  • Right equipment for right job – eliminate the ‘fiddly bits’ – led to:
  • Live-view monitors
  • Easy-clean surfaces
  • Foot-pedals
  • Photographers do the photography
  • Prepare materials separately
  • Leave loose pages and bindings as they are – easier to digitise that way
  • Use existing staff as support
  • Minimise movement
  • Plenty of shelving and working space
  • Find preferred supplier for ad hoc support

Upscaling and streamlining digitisation requires a higher level of project management

Goobi http://www.goobi.org/:
Web-based workflow system
Open source (core system)
Used by many libraries in Germany
Wellcome use the Intranda version (Intranda is a company that develops Goobi)

Goobi is task-focused, with customisable workflows – developed specifically by Intranda
User-specific dashboard
Import/export and store metadata
Encode data as METS (see the sketch after this list)
Display progress of tasks, stats on activities
Tracks projects, batches and units
Can call other systems – e.g. ingest or OCR
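
To give a flavour of the METS encoding mentioned above, here is a minimal sketch (my own) that wraps a sequence of page images in a bare-bones METS file using Python’s standard library. Real Goobi exports carry far more – descriptive and administrative metadata sections in particular.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)

def minimal_mets(object_id: str, image_files: list[str]) -> ET.Element:
    """Build a bare-bones METS wrapper: a file section plus a physical structMap."""
    mets = ET.Element(f"{{{METS_NS}}}mets", {"OBJID": object_id})
    file_sec = ET.SubElement(mets, f"{{{METS_NS}}}fileSec")
    grp = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp", {"USE": "MASTER"})
    struct = ET.SubElement(mets, f"{{{METS_NS}}}structMap", {"TYPE": "PHYSICAL"})
    book = ET.SubElement(struct, f"{{{METS_NS}}}div", {"TYPE": "book"})
    for i, name in enumerate(image_files, start=1):
        file_el = ET.SubElement(grp, f"{{{METS_NS}}}file", {"ID": f"FILE{i:04d}"})
        ET.SubElement(file_el, f"{{{METS_NS}}}FLocat",
                      {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": name})
        page = ET.SubElement(book, f"{{{METS_NS}}}div", {"TYPE": "page", "ORDER": str(i)})
        ET.SubElement(page, f"{{{METS_NS}}}fptr", {"FILEID": f"FILE{i:04d}"})
    return mets

# Write a tiny example document for a two-page item.
tree = ET.ElementTree(minimal_mets("b12345678", ["0001.jp2", "0002.jp2"]))
tree.write("item_mets.xml", xml_declaration=True, encoding="utf-8")
```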

Q: Is Goobi scalable? Can it be used for very big projects?
A: Goobi works well for small institutions – you don’t need programmers to implement it and it is relatively cheap. But scalability is probably going to be limited by hardware rather than anything else

Q: How does the Intranda version differ from other versions of Goobi?
A: At least at Wellcome… e.g. Goobi doesn’t handle ‘batches’ of material – Intranda added this feature. Goobi uses Z39.50 to get metadata; Wellcome wanted to get metadata from elsewhere, so Intranda adjusted it to do that

The well-behaved document

Presentation from John W Miescher of Bizgraphic (Geneva). He says a ‘well behaved document’ is an electronic document that is both user friendly and library friendly – easy to read and navigate – it should have bookmarks and an interactive table of contents. So many long electronic documents lack these basic functions – and long reports are rarely designed to be read cover-to-cover.

Embedded metadata:
The average information consumer is interested in descriptive metadata and less in structural and administrative metadata. They don’t care about semantics, namespaces and refinements. Dublin Core terms are probably the best option.

John says it isn’t that he is particularly a fan of DC – but it is there and it is convenient. However there are challenges – authors are not very aware of it, it is not always completed, libraries use MARC21 and crosswalking to DC has limitations. But DC tags can be embedded into PDFs – though there is a lack of decent tools for editing document metadata.
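
As one possible way of embedding DC terms in a PDF (not the tool discussed in the talk), here is a minimal Python sketch using the open-source pikepdf library to write Dublin Core fields into the PDF’s XMP metadata packet; the field values are placeholders.

```python
import pikepdf  # one open-source option; not the tool discussed in the talk

def embed_dc(path_in: str, path_out: str,
             title: str, creators: list[str], description: str) -> None:
    """Write basic Dublin Core fields into a PDF's XMP metadata packet."""
    with pikepdf.open(path_in) as pdf:
        with pdf.open_metadata() as meta:
            meta["dc:title"] = title
            meta["dc:creator"] = creators      # array-valued DC property
            meta["dc:description"] = description
        pdf.save(path_out)

embed_dc("report.pdf", "report_tagged.pdf",
         "Annual Report 2011", ["J. Example"],
         "Long report with bookmarks and an interactive table of contents")
```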

digi-libris is a tool intended to help organize documents and collections – automatically scanning documents for metadata; it allows editing of metadata and can then re-embed the metadata into the files – so anyone you pass the file on to benefits…

In summary – well-behaved documents
cater to the needs of (and empower) the information consumer
have a better chance of being found (in search engines)

[…. at this point had temporary outage when my battery died – plugged in now]

Some interesting points from the floor in the Q&A about how changing the metadata in a PDF changes the checksum, creating version/preservation problems – suggesting that integrating metadata into the document isn’t a good approach. I sympathise but tend to disagree – why not integrate it into the document? The description and the thing together make sense as we deal with more digital docs…

But… I think there are real issues around the nature of ‘documents’ – it’s a print/physical paradigm, and I’m not sure how far it applies as we move to more digital content. I also felt the emphasis on PDFs in the presentation was worrying – I asked about this, and the speaker emphasised that the work he does covers EPUBs and HTML docs as well – but HTML is more difficult…

I would have liked to ask him about tools like Mendeley and Zotero that extract metadata from PDFs – and Mendeley, which provides reading functionality as well.

Suspect the issue is tying up content with other aspects of the ‘document’ – why should ‘table of contents’ or bookmarking be something ‘baked in’ to the document? Need to think about how content separate from metadata separate from functionality etc. Got me thinking anyway 🙂

Why we need to adopt APIs for Digitised Content

Alastair Dunning from JISC (http://twitter.com/alastairdunning). Slides at http://www.slideshare.net/xcia0069/creating-a-hive-of-activity-why-we-need-to-adopt-apis-for-digitised-content

Commercial content services – Flickr, Google Books, Twitter – use APIs to allow multiple services to access, manipulate and display their content.

ImageKind – example of commercial service built on Flickr API
Bulkr – enables functionality (data management) that Flickr doesn’t supply
Flickr Stackr – iPad app which uses the Flickr API
Picnik – online photo editor, which can use the Flickr API – so successful that Flickr have now embedded it into their site
Oskope – visual search of Flickr content – one of many examples (I’ve used http://compfight.com to search Flickr)

The Flickr API has led to the creation of innovative products and services.
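
For a sense of how simple consuming such an API can be, a minimal sketch (mine, not from the talk) that calls the public Flickr REST API to search for photos. You would need your own API key, and the response handling is deliberately naive.

```python
import json
import urllib.parse
import urllib.request

API_KEY = "YOUR_FLICKR_API_KEY"  # obtain from Flickr; placeholder here

def flickr_search(text: str, per_page: int = 5) -> list[str]:
    """Return the titles of the first few public photos matching a search term."""
    params = urllib.parse.urlencode({
        "method": "flickr.photos.search",
        "api_key": API_KEY,
        "text": text,
        "per_page": per_page,
        "format": "json",
        "nojsoncallback": 1,
    })
    with urllib.request.urlopen(f"https://api.flickr.com/services/rest/?{params}") as resp:
        data = json.load(resp)
    return [photo["title"] for photo in data["photos"]["photo"]]

print(flickr_search("medieval manuscript"))
```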

But cultural and educational resources tend to lock data and interface together. Alastair showed three examples – all use the ‘SEARCH then LIST’ approach to the interface. This is really useful, but it’s only one way of interacting with the content

The trouble with current resources is that they demand certain ways of analysing and representing the resource – and they constitute the creators’ way of seeing the world, not the users’. Different audiences may benefit from different ways of interacting with the resources.

More importantly, an API can help break down the notion of a ‘collection’ and the related silos.

Alastair now going to give examples of projects that have used APIs to give access to content.

The NEWTON project – transcribing Isaac Newton’s manuscripts at the University of Sussex. A completely separate project at the University of Cambridge is looking at putting scanned images of the manuscripts online (I think). JISC funded some work to use APIs to bring content from these two projects together – so you can view both images and transcripts together.

CERL Thesaurus – bringing together discovery mechanisms from CERL and Europeana

Old Bailey – transcriptions of almost 200,000 trials – accessible via API. This means e.g. you can download transcriptions of groups of cases and manipulate them locally – e.g. do computerised text analysis – an example was using the Voyeur tool to analyse all trials matching a certain keyword

Using the API, researchers could test and revise the historical narrative of the evolution of courtroom practice at the Old Bailey.
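
To illustrate the kind of local analysis this enables, a toy sketch: download a handful of trial transcripts and count keyword occurrences. The URLs are placeholders – the real Old Bailey API has its own query syntax, which I’m not reproducing here.

```python
import re
import urllib.request
from collections import Counter

# Illustrative only: assume we already have a list of URLs, each returning
# the plain text of one trial transcript.
trial_urls = [
    "https://example.org/trials/t18000115-1.txt",
    "https://example.org/trials/t18000115-2.txt",
]

def keyword_counts(urls: list[str], keywords: list[str]) -> Counter:
    """Download each trial transcript and tally how often each keyword appears."""
    counts: Counter = Counter()
    for url in urls:
        with urllib.request.urlopen(url) as resp:
            text = resp.read().decode("utf-8", errors="replace").lower()
        for kw in keywords:
            counts[kw] += len(re.findall(rf"\b{re.escape(kw)}\b", text))
    return counts

print(keyword_counts(trial_urls, ["guilty", "not guilty", "transportation"]))
```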

Another e.g. “Locating London’s Past” (coming soon) – API enables data to be brought together from multiple sources and visualised in new ways

Some short-term wins of adopting APIs – bringing dispersed resources together and interacting with them in new ways.

But there are long-term challenges – getting people to build, document and sustain APIs, and explaining why they are important. Some publishers are suspicious; technical knowledge is required to provide and exploit APIs

Ultimately APIs open the possibility of moving beyond just presenting the information and get to exploiting the information.

Public Library online

Steph Duncan from Bloomsbury (publishers). The Public Library Online project/initiative was motivated when Wirral’s libraries were threatened with closure.

Believe that libraries are essential for a literate society. Also enlightened self-interest – libraries help create book readers, book readers are book buyers.

Libraries cater to all and provide a means for people to discover new things – books or formats – and making books available online increases their discoverability.

Stephanie runs down some Reading Agency Library usage stats 09-10 – shows how well used UK libraries are, and how many books are both borrowed and bought – and situation reflected in other countries – Australia and US.

Publishers may be wary of getting into ‘ebook lending’ – worried that will stop people buying books. But historical stats suggest borrowing books doesn’t decrease purchasing books.

Library lending models:
Download (e.g. OverDrive) – can lend 1 copy at a time to one person – kind of a physical equivalent
Streaming model – which Public Library online promotes – online access to content, no download, no offline access. Concurrent user and/or site licence, DRM to suppress or allow copy/paste/print/share
Patron Driven Acquisition – full catalogue available – licensor paid on usage

All models rely on robust, secure geographical membership – the revenue model doesn’t work if you don’t have some controls

OverDrive great example of how libraries help discoverability – lots of stats showing success
Public Library online – online access model – immediate access to a loan experience; shared reading experiences – reading groups; community reading; aim to drive sales through local bookshops.
The model is to sell ‘shelves’ (collections) of content – at £100 (or 100 euros) per 100,000 people with access to the shelf.

Access controlled by IP range (in library) and membership number (for remote access).
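
A toy sketch of how those two access routes might combine – the IP ranges and membership numbers are placeholders, not Bloomsbury’s actual implementation.

```python
import ipaddress

# Placeholder in-building IP ranges and remote-access membership numbers.
LIBRARY_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]
VALID_MEMBERSHIPS = {"AB123456", "CD789012"}

def has_access(client_ip: str, membership_number: str | None = None) -> bool:
    """Grant access from a library IP range, or remotely with a valid membership number."""
    addr = ipaddress.ip_address(client_ip)
    if any(addr in net for net in LIBRARY_NETWORKS):
        return True
    return membership_number in VALID_MEMBERSHIPS

print(has_access("192.0.2.10"))                # True: inside the library
print(has_access("203.0.113.5", "AB123456"))   # True: valid member at home
```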

Looking for library/publisher partnership opportunities – author events programme (and Skype based author events); use of social media etc.

Public library online – built on goal shared by publishers and libraries to support reading and literacy through public libraries

Aim to be affordable to libraries, but revenue generating. Subscription model means there is annual income. Some authors making more through this library programme than from deep list paperback.

Promoting Access to In-Copyright Newspapers in Finland

This presentation from Maria Sorjonen and Majlis Bremer-Laamanen from the National Library of Finland.

Newspapers are well known for being fragile, so digitising historical material is always challenging. But why digitise from paper prints when e-born files exist?

The COMELLUS project. They believe that for long-term preservation microfilm is still the way to go. However, for access, electronic is clearly what is needed. The vision is access to all newspaper content from a single service

To achieve this they had to do deals with publishers and a software house. They are looking to digitise the historical content from partner newspapers, and to develop software for automatic delivery of the print files and for automatic metadata collection. Then they are looking to enrich the digital material (DocWorks?)

Getting electronic originals means the logical structure and metadata are preserved along with the physical pages; no need for OCR correction; no need for manual article structure construction; no delay from deposit to online access.

Lots of benefits both to libraries and to the newspapers – but a big barrier = Copyright. Solution based on Finnish law of ‘extended collective licensing’ – can cover material even if rights holders cannot be reached – administered by an organisation. This agreement covers journals/newspapers/photographs (although not books).

However, still plenty of issues to consider. EU law in this area continues to develop. The extended collective licensing approach in Finland allows for collective negotiation and application of rights which avoids the need to engage with each rights holder individually – makes it manageable.

IMPACT: Centre of Competence in Text Digitisation

For the next two days I’m at the 3rd LIBER-EBLIDA Workshop on Digitization of Library Material in Europe. I’m here because I’m speaking later today about the JISC Guide to Open Bibliographic Data which I co-authored, but around all that there is a very interesting programme.

First up this morning is Hildelies Balk on the IMPACT project – http://www.impact-project.eu/news/coc/. This project is trying to tackle the issues related to OCR of digitised historical texts. The main achievements of IMPACT so far:

Improved commercial OCR (ABBYY ‘IMPACT’ FineReader 10 on the market)
Effective tool for OCR correction with volunteer involvement (IBM CONCERT) ready for implementation
Novel approaches to preprocessing, OCR and post-correction available
Computer lexica for 9 languages close to delivery
Digitisation Framework with evaluation tools available
Facility to plug in other tools (if you have tools you can integrate)
Large dataset with sophisticated ‘ground truth’ close to final delivery
Unique network of expertise
….

Challenges in the digitisation of historic material are still there – there is no lack of novel approaches to improve access, both within IMPACT and in many other projects
The challenge is translating these novel approaches into real-life implementation – many of the developments do not integrate well into library workflows
Where next? Direction needed for work – e.g. should we really be investing in mass re-keying of content?

To sustain IMPACT, they need a business model which would keep the centre running after the end of the current EU funding. IMPACT have run workshops throughout the project – covering all levels of staff. They used the approach described at http://www.businessmodelgeneration.com.

First questions they tackled – what is the value proposition and what are the customer segments?

Major customer segment – the ‘service providers’ (presumably companies like ProQuest? – not clear). IMPACT has all the major content holders in the consortium – so there is a clear value proposition: access to the content holders through a single route

Another major customer segment – the content holders. Ideas proposed included mediating consultancy between content holders and others with expertise.

So these ideas were discussed, and of course the group moved on to other parts of the business model. They often find people move to the ‘rational’ side of the model quickly – e.g. people often focus on costs before other issues are sorted out.

Outcomes:

Centre of Competence – benefits for content holders:
Exchange of best practice in a community of content holders
KnowledgeBank with comprehensive and up to date information and tech watch reports
Training on demand and online tutorial
Online support through a helpdesk
Support in the implementation of the innovative IMPACT solutions for improving access to text
Access to the IMPACT dataset with ‘ground truth’ and tools for evaluation
Digitisation framework – guidelines for using the open source workflow management system Taverna
Language resources
and more!

Three levels of membership:
Open – access to forum – part of content
Basic membership (fee) – access to all facilities, reduced fee for conferences
Premium membership (fee) – member of the board, privileges such as free entry to conferences

Follow IMPACT on Twitter: http://twitter.com/impactocr