The well-behaved document

Presentation from John W Miescher of Bizgraphic (Geneva). He says a ‘well-behaved document’ is an electronic document that is both user friendly and library friendly: easy to read and navigate, with bookmarks and an interactive table of contents. So many long electronic documents lack these basic functions, even though long reports are rarely designed to be read cover-to-cover.

Embedded metadata:
The average information consumer is interested in descriptive metadata and less in structural and administrative metadata. They don’t care about semantics, namespaces and refinements. Dublin Core terms are probably the best option.

John says it isn’t that he is particularly a fan of DC, but it is there and it is convenient. However, there are challenges: authors are not very aware of it, it is not always completed, and libraries use MARC21, where crosswalking to DC has limitations. DC tags can be embedded into PDFs, but there is a lack of decent tools for editing document metadata.
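To make the "DC tags embedded into PDFs" point concrete: embedded Dublin Core usually travels as an XMP packet, which is RDF/XML. Below is a minimal sketch of reading the DC fields out of such a packet with Python's standard library. The sample packet, its values and the function name `read_dublin_core` are made up for illustration; this is not how digi-libris or any other tool mentioned here works, and it deliberately skips the step of locating the packet inside a real PDF.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by XMP for RDF and for Dublin Core elements.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical XMP packet, invented for this example.
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title><rdf:Alt><rdf:li xml:lang="x-default">An example report</rdf:li></rdf:Alt></dc:title>
      <dc:creator><rdf:Seq><rdf:li>A. Author</rdf:li></rdf:Seq></dc:creator>
      <dc:subject><rdf:Bag><rdf:li>metadata</rdf:li><rdf:li>PDF</rdf:li></rdf:Bag></dc:subject>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""


def read_dublin_core(xmp_packet: str) -> dict:
    """Return a dict mapping DC element names to lists of string values."""
    root = ET.fromstring(xmp_packet)
    result = {}
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        for child in desc:
            # Keep only elements in the Dublin Core namespace.
            if not child.tag.startswith(f"{{{DC_NS}}}"):
                continue
            name = child.tag.split("}", 1)[1]
            # XMP wraps values in rdf:Alt / rdf:Seq / rdf:Bag containers of rdf:li.
            values = [li.text for li in child.iter(f"{{{RDF_NS}}}li") if li.text]
            # Fall back to plain text content if there is no container.
            if not values and child.text and child.text.strip():
                values = [child.text.strip()]
            result.setdefault(name, []).extend(values)
    return result


dc = read_dublin_core(SAMPLE_XMP)
print(dc)
```

Reading is the easy half. The lack of decent tools John describes is on the editing side, where the packet has to be written back without damaging the file.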

digi-libris is a tool intended to help organize documents and collections. It automatically scans documents for metadata, allows the metadata to be edited, and can then re-embed it into the files, so anyone you pass the file on to benefits…

In summary – well-behaved documents
cater to the needs of (and empower) the information consumer
have a better chance of being found (in search engines)

[…. at this point had temporary outage when my battery died – plugged in now]

Some interesting points from the floor in the Q&A: changing the metadata in a PDF changes the checksum and creates version/preservation problems, which suggests that integrating metadata into the document isn’t a good approach. I sympathise but tend to disagree. Why not integrate it into the document? Keeping the description and the thing together makes sense as we deal with more digital docs…
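The checksum objection is easy to show with a toy example. The sketch below is not a real PDF: `build_file` is a hypothetical stand-in that just glues a metadata block onto some content bytes. It shows that a whole-file checksum changes as soon as the embedded metadata is edited, while a checksum taken over the content alone does not. Whether a content-only checksum is practical for real PDFs is an open question that I am assuming away here.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def build_file(metadata: bytes, content: bytes) -> bytes:
    """Toy 'document': a metadata block followed by the content (not a real PDF layout)."""
    return b"<meta>" + metadata + b"</meta>\n" + content


content = b"The body of the report, unchanged between versions."

before = build_file(b"<dc:title>Draft</dc:title>", content)
after = build_file(b"<dc:title>Final report</dc:title>", content)

# Whole-file checksums differ because the embedded metadata changed.
file_hash_before = sha256_hex(before)
file_hash_after = sha256_hex(after)
print(file_hash_before == file_hash_after)

# A checksum over the content alone is unaffected by the metadata edit.
content_hash_before = sha256_hex(content)
content_hash_after = sha256_hex(content)
print(content_hash_before == content_hash_after)
```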

But… I think there are real issues around the nature of ‘documents’. It’s a print/physical paradigm, and I’m not sure how far it applies as we move to more digital content. I also felt the emphasis on PDFs in the presentation was worrying. I asked about this, and the speaker emphasised that the work he does covers EPUBs and HTML docs as well, but HTML is more difficult….

Would have liked to ask him about tools like Mendeley and Zotero that extract metadata from PDFs (Mendeley provides reading functionality as well).

Suspect the issue is tying up content with other aspects of the ‘document’. Why should a ‘table of contents’ or bookmarking be something ‘baked in’ to the document? Need to think about how to keep content separate from metadata, and both separate from functionality, etc. Got me thinking anyway 🙂

One thought on “The well-behaved document”

  1. Zotero looks for catalogue data in online libraries or Google Scholar, and Mendeley tries to extract metadata from the content itself, with very mixed results which are, for the most part, not very useful*. Neither of them seems to look for metadata within the file itself. Too bad.

    They are OK for research work in conjunction with published papers and articles, but useless (in terms of metadata) with non-scientific or self-published documents (which make up about 75% of all PDF documents you’d find on Google et al.). Not every web user is savvy enough to download a document and then look elsewhere for metadata.

    In an ideal world all electronic documents are born well-behaved. The author sets the metadata during writing (perhaps saving it as an XMP file to ensure it survives publishing), and the publisher uses styles to automatically generate bookmarks and tables of contents (where applicable), then re-imports the XMP.

    Librarians and faculty should use their influence, relations and know-how with all parties involved in the generation of documents to make sure everybody plays the game.
    john m.
    *I fed it a PDF document which started out with “ORGANISATION EUROPÉENNE POUR LA RECHERCHE NUCLÉAIRE” (CERN) and below had “EuCARD-AccNet-EuroLumiWorkshop – The High-Energy Large Hadron Collider” as title.

    Mendeley returned the title correctly but returned “Europ, Organisation”, “Pour, Enne” and “Recherche, L A” as authors and had nothing to offer for Abstract, Keywords and Identity, although all of this was embedded as metadata in Dublin Core format.
