LiFE Model

This talk is by Paul Wheatley, the Digital Preservation Manager at the British Library.

Paul starts by describing the LiFE model, and the shortcomings of the LiFE Model v1.0. Some of these were addressed in v1.1, and v2.0 of the model is due out in August 2008.

Version 1.1 of the model makes some changes – especially differentiating between bitstream preservation and content preservation, and also separating out creation/acquisition costs slightly, as they don’t always apply.

For Version 2.0, they are looking at bringing in elements for ‘Disposal’. How metadata is handled has divided the LiFE team, and there are some changes in v2.0.

Quite a lot of detail being covered by this report, but unfortunately it isn’t terribly gripping – I would guess reading the reports out of the LiFE projects would cover all this.

At the end some questions about the model. One interesting point about rising cost of electricity.

LiFE^2 – Some Economics of Digital Preservation

The keynote is by Paul Courant.

Since libraries are concerned with ‘the past’ (with an eye on the future), and the past grows in scope literally by the second, we’ve got a real challenge on our hands.

Paul starts by asking ‘What is Preservation?’ – saying that he will leave talk of digital until the end of his talk, as he believes that if we understand preservation, we generally understand digital preservation (with some caveats).

You have to have ‘something’ to preserve – information or artifacts or both – an “object”. Preservation activity affects the flow of current and future services available from the “object”. The potential usefulness of the object in the future is dependent on the preservation activity that we have undertaken.

According to LiFE, the lifecycle cost over time equals the cost of acquisition plus the time-dependent costs associated with: Ingest, Metadata, Access, Storage and Preservation.
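To make that sum concrete, here is a minimal Python sketch of the lifecycle cost as described in the talk. The function name, the way each element is modelled as a function of time, and all the figures are my own assumptions for illustration, not values or notation from the LiFE reports.

```python
from typing import Callable, Dict

# The five time-dependent elements named in the talk.
ELEMENTS = ("ingest", "metadata", "access", "storage", "preservation")


def lifecycle_cost(
    acquisition: float,
    time_costs: Dict[str, Callable[[float], float]],
    years: float,
) -> float:
    """Total cost over `years`: acquisition plus each time-dependent element."""
    missing = [name for name in ELEMENTS if name not in time_costs]
    if missing:
        raise ValueError(f"missing cost functions for: {', '.join(missing)}")
    return acquisition + sum(time_costs[name](years) for name in ELEMENTS)


# Hypothetical figures: a one-off ingest cost and flat annual rates elsewhere.
example_costs = {
    "ingest": lambda t: 50.0,
    "metadata": lambda t: 2.0 * t,
    "access": lambda t: 1.0 * t,
    "storage": lambda t: 3.0 * t,
    "preservation": lambda t: 4.0 * t,
}
total = lifecycle_cost(acquisition=100.0, time_costs=example_costs, years=10)
```

With these made-up numbers the ten-year total is 250. Modelling each element as a function of time leaves room for costs that are not flat, such as the rising electricity cost raised in the questions.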

Paul saying that the benefits are:

  • Findability (we need to be able to find it)
  • Usefulness (we need to be able to use it)
  • Reliability (we need to do both of the above reliably)

Paul says: Finding a needle in a haystack is relatively straightforward if you know it is there – much better than trying to find a needle in any haystack when you aren’t even sure if the needle is there in the first place.

Paul now quoting from an economist Robert Solow:

“The duty imposed by sustainability is to bequeath to posterity not any particular thing – with rare exceptions such as Yosemite, for example – but rather to endow them with whatever it takes to achieve a standard of living at least as good as our own and to look after the next generation similarly”

This draws an interesting distinction between the general level of preservation – that we just need a ‘body’ of resource that is sustained – and the need to preserve specific things because of their particular impact. I think this is a good concept – and that the thing that is difficult is to define the specific things that are the ‘rare exceptions’ – because most stuff isn’t important in itself, but as it represents a body of resource.

Paul now arguing that ‘markets’ in general won’t do preservation. Quote from Anand and Sen, 2000:

“sustainability cannot be left entirely to the market. The future is not adequately represented by the market – at least not the distant future”

Paul relating the problem of trying to study iPod adverts – the ‘market’ isn’t interested in preserving these.

Paul saying that the cost of adding extra ‘users’ to resources approaches zero (perhaps especially in the context of digital information). I’m not entirely convinced by this addendum: although the cost is clearly low, dealing with a million regular users needs a different level of resource from dealing with 1000 regular users.

Paul arguing that there are a number of values related to Natural Resources:

  • Public Good
  • Use Value (you can do something with the resource)
  • Existence Value (knowing something is there is important in a general sense, even if you don’t use it)
  • Option Value (it is important to have the option to use a resource)

Paul now dividing two types of sustainability:

  • Specific sustainability – preserving a specific object (e.g. Magna Carta original manuscript)
  • Value sustainability – preserving the value encoded in an object (e.g. the text of the Magna Carta)

Paul now showing some points from the NSF BRP on Economically Sustainable Digital Preservation and Access:

  • Recognition of benefits of preservation by people who can move resources (Demand)
  • Incentives to people who have the stuff
  • Mechanisms to move resources to the stuff as routine or default, including handoffs
  • Efficient use (don’t save everything perfectly, make choices)
  • Organization and governance of the many relevant players (Paul saying that for this, UK is relatively well positioned, having clear national government, a national library and JISC funding national work – compared to the US)

Paul saying you can’t expect library materials to come with full costs of preservation – we would never have bought any books if we had started like this.

Now Paul saying, all the above is true about preservation in general, so what is different about digital?

  • Fragile – in a different way to paper based stuff
  • Too much stuff
  • Rights Environment
  • Use doesn’t wear it out (and may even make it more usable in the future)
  • Functionality and Links (very fragile)
  • Public Goods Implications – once something is available digitally on a server, there are very low distribution costs – this changes the business model – having unique aspects to a physical collection concentrates people around the resource – not true with digital collections

Some points about Digital Scholarship:

  • Easy (sort of) cases
    • Digitized print (Google and the SDR)
    • Journals (Portico, LOCKSS, Some National Libraries)
    • Astronomical Data (because the astronomy community wants to and likes to share data, not because the data is particularly easy)
  • Harder cases
    • Multimedia projects
    • Things with links and embedded functionality (from excel spreadsheets on up)
    • Data from Chemistry experiments (chemists are the opposite of astronomers!)
  • Hardest
    • The cultural record itself
    • Business records, etc.

Paul finishing by saying that only collecting what you know you can sustainably (indefinitely) keep is a “Really Bad Idea”.

Q: Michigan was one of the early adopters of Google digitization – what economic factors did you look at?

A: Did some calculations about holding 7 million books on servers. University committed to finding the money when the time came. University stood by this commitment – and academic value was clear. They did not make an argument about savings to be made by digitization.

Q: Can you comment on how preserving websites differs to what you have outlined in your talk?

A: Need a strategy to do a small sample to very high quality, and then do a very large sample at low quality, and recognise that you cannot preserve everything (and we have never done this, or strived to do it). “It is as much museum like as library like – but a lot of things are becoming more museum like, than library like”

Q: One of the things you said is different about digital is loss of local control – can you comment on the impact on the economics and business models?

A: The economics and business models change. The BL exists not just for love, but for profit – it is a differential asset for the UK. Once you look at digital, this is harder – will require high level agreements between governments, Universities etc. That the payoff for having a great local collection might no longer exist is a problem – but what if you can say you have a high level of local skill (in the library) to exploit and integrate digital and physical resources you might get local investment there – but who will pay for making the material available? Not clear.

LiFE^2

http://www.life.ac.uk

http://www.life.ac.uk/blog

Today I’m at the LiFE2 conference at the British Library. LiFE2 is a follow up to the original LiFE project, which looked at the lifecycle of digital resources and applied the findings to three real collections.

LiFE2 has looked at validating the economics of the LiFE Model, and we’ll have a presentation on this, followed by several case studies.

The introduction is by Helen Shenton (Head of Collection Care at the British Library) – she is both covering the background of the LiFE and LiFE2 projects, and stressing that we live in a hybrid world – perhaps over egging this a bit for me – anyway, we know that we have a huge print legacy as well as needing to engage with the digital world – and Helen is stressing the credentials of the BL in both areas.

Helen now introducing the keynote speaker – Paul Courant, Dean of Libraries at the University of Michigan.

TILE – Enabling Contribution

This is the second area that TILE is focusing on. Mark van Harmelen presented on this, but I didn’t manage to capture it all. Essentially he considered User Activities

  • Discovering
    • searching via terms and tags (personalisation or not)
    • browsing, as a result of recommending (via human and computer choices)
  • Collecting – bookmarking
  • Consuming – read/use
  • Enhancing – adding comments, dialogues and tags
  • Creating content, repurposing, remixing
  • Publishing – explicitly making visible
  • Curating – possibly in quite different ways, by users and by library/information professionals
  • Collaborating – for learning, teaching and research

And some of the associated problems

  • Control and cultural imperatives
  • User base FE, HE, post formal ed, LL
  • Trust and data quality (are the reviews ‘worthwhile’, could you allow updates to catalogue records? etc.)
  • Data longevity
  • Task support and workflow
  • Technical implementation problems
  • Cost (particularly search engine cost)
  • Hand-off in the context of national data security

Some really interesting discussion around data and control – how we should approach this.

Definite agreement we need to open up data and ‘give up’ control, but not complete agreement on a lot of other things (should we be aiming for aggregation or distribution of data? should there be a ‘UK HE’ search engine?). Lots of debate, that I hope someone else captured better than I have here (I was too busy actually debating to type!)

TILE – Deriving Context

TILE is brainstorming ways in which context can be derived:

  • My Studies – Modules from VLE or VRE
  • My ID
  • My Activity – LMS/VLE/etc. Click streams
  • My Feedback – bookmarks, reviews, ratings
  • My Parameters – e.g. Location, status (and also the idea that you could want to ‘override’ by changing some of the parameters – to get access to 2nd year reading lists when you are in the 1st year etc.)
  • My Interests

They also identified a couple that they are considering outside the initial scope

  • My Networks – e.g. Facebook
  • My Publications – Citation indexes etc.

Difficult to capture all the discussion but some points:

  • Context varies – your ‘facebook’ context could be very different to your ‘academic’ context
  • Context can feedback not just to the individual, but be used to drive information back to tutors (for example information about what students on their course are reading etc.)

Clearly there are privacy issues, but we can accept that context in the main can be derived from data aggregation – without dealing with individuals’ specific activity (except perhaps someone’s own personal usage data?)

We looked at a diagram suggesting a SUM for the aggregation of data that could be used to drive context across an institution – but I have to admit I didn’t quite get it – possibly I want to work at a level of detail that the SUM doesn’t go to?

It occurred to me as we discussed the issues that we talked mainly about institutions, not users, in terms of providing context. Shouldn’t we be thinking about data portability, and how users carry their context with them? I need to think about this more.

TILE – an introduction to the e-framework

The next presentation is on the e-Framework – a brief introduction to what it is etc.

What is in the e-Framework?

At its core, it is documentation:

  • Service Oriented Knowledge Base
    • It’s documentation to help others
    • Describes services (based on open standards) and how to use them
    • Describes use of multiple services together
    • Describes best practices in use of services

It’s supported by an International Community covering the UK, Australia, New Zealand and the Netherlands – with a mixture of communities in each country, although there seems to be an ‘education’ focus (worth noting that this is clearly not a library specific thing)

Within the framework services are split into:

  • Service Genre
  • Service Expression

The Service Genre describes what type of service you are talking about (e.g. ‘search’), based on ‘behaviours’, but doesn’t say anything about how it is achieved (i.e. it is intended to be technology neutral).

The Service Expression is about how the service is achieved – e.g. SRU/SRW, Z39.50 etc.

Standards and Service Implementations may be linked to from the e-framework, but aren’t part of the framework themselves.

The idea is that the ‘genre’ would tell you ‘what can be done’

Building on this, you can start to build a ‘Service Usage Model’, or SUM.

  • An ‘abstract SUM’ can be created from business processes supported by Genres and Data Sources.
  • An ‘implementation SUM’ can be created from Business Process supported by Expressions and Data Sources.

The abstract SUM would describe the situation in general terms – e.g. ‘you would need a search service’, the implementation SUM would say ‘using SRU/SRW’.

SUMs are where several genres or expressions are used together.
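To check my own understanding of how Genres, Expressions and the two kinds of SUM relate, here is a rough Python sketch. The class names, the `implement` helper and the example business process and data source are all my own inventions for illustration; the e-Framework itself is documentation, not code.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Genre:
    """What a service does, stated without reference to technology."""
    name: str


@dataclass(frozen=True)
class Expression:
    """How a genre is achieved using a particular standard."""
    genre: Genre
    standard: str


@dataclass(frozen=True)
class AbstractSUM:
    business_process: str
    genres: Tuple[Genre, ...]
    data_sources: Tuple[str, ...]


@dataclass(frozen=True)
class ImplementationSUM:
    business_process: str
    expressions: Tuple[Expression, ...]
    data_sources: Tuple[str, ...]


def implement(abstract: AbstractSUM, choices: Dict[str, str]) -> ImplementationSUM:
    """Turn an abstract SUM into an implementation SUM by picking a standard per genre."""
    expressions = tuple(Expression(g, choices[g.name]) for g in abstract.genres)
    return ImplementationSUM(abstract.business_process, expressions, abstract.data_sources)


# Hypothetical example: "you would need a search service", then "using SRU/SRW".
search = Genre("search")
abstract_sum = AbstractSUM("find a book", (search,), ("library catalogue",))
implementation_sum = implement(abstract_sum, {"search": "SRU/SRW"})
```

The point of the split is that the abstract SUM stays valid if the standard changes: only the `choices` mapping would differ.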

There were some discussions about the e-framework model, and how it worked, and how useful it would be.

I have to admit that I see the reason for it in terms of development – but I’m not completely convinced that it will work in practice, because I’m sceptical about it actually being used by developers in institutions.

Richard Wallis from Talis raised the issue that this highly structured approach seemed at odds with the ‘constant beta’ and agile development.

Towards Implementation of Library 2.0 and the E-Framework

This afternoon I’m at a meeting of ‘TILE’ (‘Towards Implementation of Library 2.0 and the E-framework’). This is a JISC and SCONUL funded project, that follows on from the JISC/SCONUL sponsored report on Library Systems published earlier this year.

We are starting with an introduction from Ken Chad, one of the consultants on the project. Ken is showing some examples of how ‘web 2.0’ type technologies are coming into the library world – e.g. LibraryThing, the work by Dave Pattern at the University of Huddersfield, California State University. Ken noting that soon after Amazon came out an article was published saying ‘should we do this in libraries’ – i.e. functions like recommendations etc., and we are only starting to see this happen 10 years later – why does it take so long?

Ken mentions the MESUR project – collecting usage data (spans 100,000 serials etc.)

So, the two ‘pain points’ that TILE are focusing on are

  • ‘Deriving Context’ – in HE we have good contextual information such as the course of study for a student
  • ‘Enabling Contribution’ – the value of recommendations etc.

Following this there was some discussion on these two areas, and if there were other areas that the project ought to look at. The overall feeling was these were the correct areas to look at, but a couple of other areas were raised that the project ought to consider:

  • The ‘back office’ side of library systems and their integration (with each other – e.g. metadata; and with institutional systems – e.g. Finance)
  • The relationship between library systems and repositories, perhaps especially library workflows (e.g. acquisition)

It was also acknowledged that the project couldn’t hope to cover everything, and it was appropriate to focus on a few key areas.

The TILE project would like examples of ‘prototypes’ or ‘exemplars’ of ‘library 2.0’ type services – things like Amazon, LibraryThing etc. – especially examples around the two main points identified above. If you have examples, post a comment here and I’ll pass them on…

Open World Thinking

This final presentation session is by Nadeem Shabir from Talis.

Drifted off on another line of thought (sorry Nadeem), so missed the detail at the start.

Nadeem saying that the issue of linked open data is not technological change – it’s a paradigm shift. Tim Berners-Lee said “Linked Data is the Semantic Web done right and the Web done right.”

Nadeem talking about ‘design appropriation’ – like using a book to prop up a television or monitor – not something the creator of the book intended.

Nadeem going to look at two things:

  • Openness of Description
  • Openness of Access (not Open Access)

The Openness of Description is about agreeing on shared ways to describe things – this allows you to share, integrate, relate information.

Nadeem relating to the 8 Learning Events described by Alan in the earlier session as an ontology – would be able to use a common vocabulary to describe learning activities.

Talis has been working on ontologies and has published:

  • Academic Institutions Internal Structures
  • Generic Lifecycle (workflow)
  • Resource List (using SIOC, BIBO, FOAF)

All these are available at www.vocab.org

I wonder if I can find time to describe Imperial using the first of these? I imagine this could be quite a bit of work…

Now Nadeem going onto ‘Openness of Access’ – this is

  • anywhere
  • anytime
  • anyhow

Nadeem argues that this is the key to personalised learning – descriptions and access

In the ‘web of data’ you may publish or consume data – but you don’t ‘own’ it in a traditional sense.

Nadeem has written a piece for the recent copy of ‘Nodalities’ which describes this.

Project Zephyr: Letting Students Weave their own Path

This presentation is from Ian Corns from Talis.

Lecturers want to teach differently, students are learning differently. Students are:

  • used to a multimedia environment
  • always connected
  • work in groups
  • … insert your ‘millennial’ attributes here etc.

Talis has/had a ‘reading list’ application called ‘Talis List’. However, they realised it was falling short in some areas:

  • Didn’t embrace richness of ‘resources’ rather than more traditional ‘book’ list
  • Didn’t offer value to lecturers – they saw it as a ‘library’ app, not offering them benefit

Decided to develop new approach which addressed value to lecturer:

  • Needed to be easier to author list in the system, than in Word or usual authoring tool
  • Offer ability to embed the list into any (web) environment.
  • Improve quality of lists – feedback based on usage (what resources used, in what assignments etc.)
  • Feedback directly from students
  • Start to see connections between lists within and between institutions

From a data point of view, by compiling a set of resources into a list for a specific module, the ‘expert’ is adding implicit information to the resources.

Now Ian showing some screenshots. Talking about the power of a system built on a flexible platform – e.g. can say ‘show all key resources from all my courses for the next 2 weeks’
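As a rough idea of what a query like that involves, here is a small Python sketch. The data model (a `ListItem` with a course, a ‘key’ flag and a date) is entirely my own assumption; I don’t know how Talis actually structures list data.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterable, List, Set


@dataclass(frozen=True)
class ListItem:
    course: str
    title: str
    key: bool        # marked as a key resource by the lecturer
    needed_on: date  # when the module reaches this resource


def key_resources_due(
    items: Iterable[ListItem], my_courses: Set[str], today: date, days: int = 14
) -> List[ListItem]:
    """Key resources on my courses needed between today and today + days, soonest first."""
    horizon = today + timedelta(days=days)
    wanted = (
        item for item in items
        if item.course in my_courses and item.key and today <= item.needed_on <= horizon
    )
    return sorted(wanted, key=lambda item: item.needed_on)


# Hypothetical courses, titles and dates.
items = [
    ListItem("HIST101", "Primary sources reader", True, date(2008, 6, 5)),
    ListItem("HIST101", "Background survey", False, date(2008, 6, 6)),
    ListItem("HIST102", "Archive handbook", True, date(2008, 6, 30)),
    ListItem("CHEM200", "Lab manual", True, date(2008, 6, 4)),
    ListItem("HIST102", "Seminar paper", True, date(2008, 6, 3)),
]
due = key_resources_due(items, {"HIST101", "HIST102"}, today=date(2008, 6, 2))
```

Here `due` holds ‘Seminar paper’ followed by ‘Primary sources reader’: the non-key item, the item outside the two-week window and the item from another course are all filtered out.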

Hard to capture screenshots here – there are some demos and screens etc. at http://blogs.talis.com/list/

Project Xiphos – Creating a Social Network from a Web of Scholarly Data

This session is by Chris Clarke from Talis.

The web as it exists at the moment is a ‘web of documents’ – human readable, but not so good for machines.

To exemplify the problem – a novice user might type into Google: “how many people were evacuated from new orleans following Hurricane Katrina”

Google finds a relevant result – the Wikipedia article on the “Effect of Hurricane Katrina on New Orleans” – but the data is hidden in the document, and Google fails to pick up other relevant facts buried in the article.

Chris says what we need is a ‘machine-readable’ web – and this is the promise of the ‘semantic web’.

To go back to the Wikipedia example – DBpedia is an attempt to extract ‘facts’ from Wikipedia text and make them available in a machine-readable format. Powerset is a search engine built on top of this. Given the same query as above, Powerset ‘understands’ the query and serves up appropriate answers.

The problem I have with this example is that it isn’t clear to me that Google doesn’t ‘understand’ (at least to the same extent) the question. Doing semantic analysis of the question is different to using the ‘semantic’ nature of the data.

Talis provide a ‘platform’ (an RDF Triple store, plus relevant services) – and they wanted to see how this could support a ‘web of scholarly data’. They took the metadata for 500 articles in XML format and loaded it into the platform. They started to see what links etc. they could get out. Within the data set there was reference to 19808 articles (mainly in citations) and 21209 people.
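For anyone unfamiliar with triple stores, here is a toy Python sketch of the idea: every statement is a (subject, predicate, object) triple, and you query by pattern. The identifiers and predicate names are invented for illustration and are not the Talis Platform’s API or vocabulary.

```python
from typing import Iterable, Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical identifiers and predicates; only articles 1 and 2 are 'loaded',
# but article 3 still enters the data because article 1 cites it.
store: Set[Triple] = {
    ("article:1", "hasAuthor", "person:flower"),
    ("article:1", "hasAuthor", "person:smith"),
    ("article:1", "cites", "article:2"),
    ("article:1", "cites", "article:3"),
    ("article:2", "hasAuthor", "person:jones"),
}


def match(
    triples: Iterable[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield triples matching the pattern; None acts as a wildcard."""
    pattern = (s, p, o)
    for triple in triples:
        if all(want is None or want == got for want, got in zip(pattern, triple)):
            yield triple


people = {o for _, _, o in match(store, p="hasAuthor")}
articles = {s for s, _, _ in match(store)} | {o for _, _, o in match(store, p="cites")}
```

In this toy data two loaded articles yield three known articles and three people, which is the same effect (on a much smaller scale) as 500 loaded articles referring to 19808 articles.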

This is all very interesting, but Talis wanted to see how this could actually be ‘useful’. This is what Xiphos is meant to demonstrate – a scholarly social network based on the data. The Xiphos Network was built in 4 weeks.

So Chris is showing screenshots (although assures us they do have a real system, and we can look at it after – there is a problem with the network connection at the moment). Taking a possible example of a real researcher, they enter a keyword that is actually a person’s name ‘flower’.

Results are grouped by type:

  • Things
  • People

By exploiting the data available in the platform, extracted from the 500 articles, they can show the person ‘T P Flower’ and that person’s connections (via co-authoring) to others.

If the user creates an account on the system, then it can see if it already has information about them, and connect up the existing information to the newly created account. This means the user has a pre-populated social network and ‘stuff’ which they can refine.

By exploiting the information in the store, they can break down relationships in 4 ways:

  • People you know
  • People you cite
  • People who cite you
  • People you are watching

The middle two are automatically generated from the metadata from the articles. The last one is a ‘one way’ relationship – you don’t know who is watching you (perhaps have an overall number). Had good feedback from academics on this last function.
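Here is a small Python sketch of how the two automatically generated relationships, plus the co-author links shown earlier, might fall out of article metadata. The data structure and names are my own assumptions, not how Xiphos does it.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Set, Tuple

# Hypothetical article metadata: authors and the articles each one cites.
articles: Dict[str, Dict[str, List[str]]] = {
    "a1": {"authors": ["flower", "smith"], "cites": ["a2"]},
    "a2": {"authors": ["jones"], "cites": []},
    "a3": {"authors": ["patel"], "cites": ["a1"]},
}

Network = DefaultDict[str, Set[str]]


def build_networks(
    articles: Dict[str, Dict[str, List[str]]]
) -> Tuple[Network, Network, Network]:
    """Return (coauthors, cites, cited_by) maps from person to set of people."""
    coauthors: Network = defaultdict(set)
    cites: Network = defaultdict(set)
    cited_by: Network = defaultdict(set)
    for meta in articles.values():
        authors = meta["authors"]
        for author in authors:
            coauthors[author].update(a for a in authors if a != author)
        for target in meta["cites"]:
            # Cited articles we hold no metadata for contribute no people.
            for cited in articles.get(target, {}).get("authors", []):
                for author in authors:
                    if cited != author:
                        cites[author].add(cited)
                        cited_by[cited].add(author)
    return coauthors, cites, cited_by


coauthors, cites, cited_by = build_networks(articles)
```

For ‘flower’ this gives a co-author (‘smith’), someone they cite (‘jones’) and someone who cites them (‘patel’), all without the user entering anything. ‘People you are watching’ would need explicit user input, so it is left out.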

As well as each person having a network page, each paper also has a ‘network’ – showing citations and cited by, as well as presenting a thumbnail preview of the paper.

You can add papers to ‘collections’ (which can be public or private) – and people can ‘belong’ to collections as well as ‘watch’ collections (sounds a bit like the idea of a ‘Twine’)

Alongside this they looked at adding ‘subjects’ using schemes (specifically MeSH), and events. Also had an idea around a ‘Vault’ which allows document stores – but weren’t sure if that should be part of the remit of a system like this, or whether this should be held elsewhere.

This was all from a set of data on 500 articles, and is just one approach to building applications on top of this data. They chose this because it could be used to ‘clean’ data – users would improve the linkages etc. You could build many other types of application.

Chris stressing that although they see a place for Talis in this environment, they believe it is necessary to see many players – suggesting PubMed, Blackwells, Ingenta etc. could be part of the picture.

Question: Do you have a business model in mind?

Answer: This is just a prototype – different approaches depending on your point of view (publishers different to Open Access)

Question: In an environment where you want multiple players how do you get them all working together. Are there tools that do this?

Answer: Open Data looking at some of these issues. Google and Yahoo starting to exploit this type of semantic data.

Question: When users make modifications to ‘the graph’, how do you deal with ‘versioning’

Answer: Change stored in such a way you can roll back to previous versions.
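Chris didn’t go into how the changes are stored, but one common way to get this kind of roll back is an append-only change log. The Python sketch below is my own assumption about the general approach, not a description of the Talis Platform.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]
Change = Tuple[str, Triple]  # ("add" or "remove", triple)


class VersionedGraph:
    """A graph stored as a log of changes; version n is the first n changes replayed."""

    def __init__(self) -> None:
        self._log: List[Change] = []

    @property
    def version(self) -> int:
        return len(self._log)

    def add(self, triple: Triple) -> None:
        self._log.append(("add", triple))

    def remove(self, triple: Triple) -> None:
        self._log.append(("remove", triple))

    def at(self, version: int) -> Set[Triple]:
        """Rebuild the graph as it stood after `version` changes."""
        state: Set[Triple] = set()
        for op, triple in self._log[:version]:
            if op == "add":
                state.add(triple)
            else:
                state.discard(triple)
        return state

    def current(self) -> Set[Triple]:
        return self.at(self.version)

    def roll_back(self, version: int) -> None:
        """Discard every change made after `version`."""
        if not 0 <= version <= self.version:
            raise ValueError("no such version")
        del self._log[version:]


# Hypothetical edits by a user tidying up their own record.
graph = VersionedGraph()
t1 = ("person:flower", "coauthorOf", "person:smith")
t2 = ("person:flower", "cites", "person:jones")
graph.add(t1)
graph.add(t2)
graph.remove(t1)
```

After these three changes the current graph holds only `t2`, `graph.at(2)` still shows both triples, and `graph.roll_back(1)` would leave just `t1`.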
