Entries categorized "Digital Preservation"

LiFE^2 - Panel Session

Final session of today's conference. Chris Rusbridge from DCC is introducing it, saying quite a lot of what we thought we knew about Digital Preservation is wrong - and implies that quite a lot of what we think we know now is also wrong.

Some discussion about how case studies might inform real costings, or estimates of costs, in the future. Suggestion that LiFE will look at this in the write-up. Desire for a tool to assess this.

Always difficult to write up these discussion sessions - not least because they are more interactive from my point of view (i.e. I take part in the discussion).

Some stuff coming up:

  • Need to have better links between value and economic costs - if we can put a figure on 'value' we will stand a better chance of getting funding
  • Need tools to help us make decisions regarding digital preservation
  • Why is metadata handled as a separate part in the LiFE model?

In closing Paul Ayris summing up:

  • Key to sustainable preservation is demand - which is driven by 'perception of value', and we should not be driven by cost of preservation
  • New LiFE model was used in Case Studies described today, and there have been comments from an economist on this, suggesting some ways of handling inflation and depreciation
  • If we are looking at developing a generic model, we need to look at the Danish examples, and see how it might apply in different scenarios
  • We are still in the process of learning what 'digital preservation' means, and what the costs truly are

Paul summarised the following from the panel discussion:

  • LiFE (if it can continue into a new phase) would like to develop a predictive tool to determine costs to help decision making
  • Interest in more case studies
  • Roles and Responsibilities are crucial in digital preservation, and certainly in the UK we still need to debate this

Paul says he can't understand why the UK is so far behind some of the best European examples.

LiFE^2 Case Studies - Q and A

Q: To what extent did the Newspaper case study consider the difference between the very well established workflows/processes with analogue compared to new concepts in digital

A: Definitely something that is focused on in the write-up

Q: What are the ideal and realistic timeframes in which the costings for activities in the LiFE model should be reassessed in an institution (to reassess the overall costs)

A: Neil Beagrie says it is important to revisit very regularly. Neil stressing importance of regular audits of institutional digital data. Stephen Grace suggesting this should be an annual thing to revisit costings.

Q: Where do you draw the line between the 'creation costs' and digital preservation costs to be costed by LiFE?

A: No clear answer - but clarification that Royal Holloway costs related to advocacy around acquisitions only included that of staff directly attached to the repository

Q: Note that all the case studies essentially took as a given that they would preserve the material in the format as delivered. Should the model be used to predict costs to inform decisions about what to preserve? (Think I got this right - I missed some of the question)

A: A qualified yes basically

Q: Neil mentioned issue of logical format migration. Does anyone have a view on the cost of this?

A: Neil says there is very little in terms of long-term studies of data to give information on this. However, also notes that the more you dig the more you find examples. So far much of the costings around this are based on assumptions of how often we will need to do this, and how much it would cost. In reality there are likely to be large variations between 'trivial' transformations - e.g. from one version of s/w to another, and more major ones.

LiFE^2 - Research Data Costs

This session not quite a case study, but is a description of the application of the LiFE model to research data preservation, by Neil Beagrie - which was used to produce the "Keeping Research Data Safe" report recently published.

They found that a number of factors had an impact on the costings from the model:

  • Costs of 'published research' repositories vs 'research data' repositories
  • Timing - it costs c.333 euros to create a batch of 1000 records, but 10 years after creation it may cost 10,000 euros to 'repair' a batch of 1000 records with badly created metadata
  • Efficiency curve effects - we should see drop in costs as we move from start-up to operational activity
  • Economy of scale effects - a 600% increase in acquisitions only gives a 325% increase in costs
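
As a rough check on what the timing and economy-of-scale figures above imply, here is a small arithmetic sketch. It assumes the percentages are increases over a baseline (so a 600% increase means 7 times the acquisitions, and a 325% increase means 4.25 times the cost); that reading is an assumption, not something stated in the talk, and the function name is made up for illustration.

```python
def relative_unit_cost(volume_increase_pct: float, cost_increase_pct: float) -> float:
    """Per-item cost after scaling up, as a fraction of the baseline per-item cost."""
    volume_factor = 1 + volume_increase_pct / 100  # 600% increase -> 7x the volume
    cost_factor = 1 + cost_increase_pct / 100      # 325% increase -> 4.25x the cost
    return cost_factor / volume_factor

# Economy of scale: how the per-item cost changes under the assumed reading.
ratio = relative_unit_cost(600, 325)
print(f"Per-item cost is about {ratio:.0%} of the baseline")

# Timing: how much more a late 'repair' costs than getting the batch right at creation.
repair_multiplier = 10_000 / 333
print(f"Repairing later costs about {repair_multiplier:.0f} times as much")
```

On that reading, each item costs roughly three fifths of what it did before the scale-up, and a late repair costs about thirty times the original creation cost.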

Noting that a key finding is that Acquisition and Ingest costs are high compared to archival storage and preservation costs. This seems to be because existing data services have decided to 'take a hit' upfront in making sure ingest and preservation issues are dealt with at the start of the process. I think this is a key outcome from the report, but based on the discussion today I don't know what this tells us. I guess it is a capital vs ongoing cost question. If you'd asked me at the start of the day I'd have said that the model described was a reasonable one. However, after Paul Courant's talk I wonder if this could result in dangerous inaction - if we can't afford preservation, we won't start collecting. The issue is that we can spread ongoing costs over a long period of time, so does dealing with a heavy upfront cost make sense?

Neil making a number of observations, but stressing that he does not regard the study as the final word on costs.

LiFE^2 Case Studies - SHERPA-DP

SHERPA-DP - presented by Stephen Grace (Preservation Manager, CeRch)

SHERPA-DP is a project to set up a shared preservation environment for the SHERPA project (http://www.sherpa.ac.uk).

Stephen running through different aspects of costs. Stephen is one of several presenters to say that Metadata creation isn't really a separate step - I'm left wondering who actually argued in favour of treating it separately?

They found some aspects hard to predict - e.g. preservation action, where they assumed major action (10 days effort) would be needed every 3 years. This may need to be refined as we learn more about digital preservation.

Costing exercises are difficult - they take time, evidence not readily to hand. However, LiFE offers a consistent methodology. They also felt that it showed the value of 3rd party preservation - 'tho' he admits to being biased!

They found that the storage costs had a large impact - so reducing storage costs would have a significant effect.

LiFE^2 Case Studies - SHERPA-LEAP

This being presented by Jacqueline Cook from Goldsmiths. SHERPA-LEAP was a project to set up institutional repositories to hold published research output at a number of University of London colleges. The case study covers Royal Holloway, Goldsmiths and UCL.

Because of the relative 'youth' of the repositories, the major costs were staffing, and the main processes were Acquisition, Ingest and Metadata Creation.

The costs were calculated based on the amount of time spent on each item. Interestingly there are some institution-specific variations - Goldsmiths have high ingest costs because of the variety of material submitted. Royal Holloway have high acquisitions costs because they included the costs of holding outreach events (not clear that they costed in the time of the academics attending these - sounds like just the cost of the repository staff to run them).

The overall costs for each institution varied considerably:

  • Goldsmiths
    • Year 1 - 31.48
    • Year 5 - 31.95
    • Year 10 - 32.22

And UCL coming in at approximately half these figures, with Royal Holloway in the middle. Clearly these are estimates. Jacqueline is suggesting that the more complex nature of the objects accepted at Goldsmiths had a large impact on the variation in costs across the institutions. Alongside this there were also:

  • Different use cases
  • Phases in development of repositories
  • What was considered as part or outside the lifecycle
  • Method of deposit
  • Staffing levels

Overall the case study observed that:

  • We are working in a fast-changing environment
  • There are limitations to a simple, per-object average
  • Metadata Quality Assurance might be needed as an element (it was noted that Metadata creation is actually part of Ingest, although the model treats it as a separate element)
  • Object-related advocacy - there may need to be an advisory role for repository administration
  • We are at an early stage for preservation planning

LiFE^2 Case Studies - British Library Newspapers

This afternoon we are starting with three case studies. The first (presented by Richard Davies from the BL) is for material that is not 'born digital' - in this case Newspapers at the British Library.

The BL wanted to use the "Burney Collection" - 1,100 volumes of the earliest known newspapers, with about 1 million pages, digitised from the microfilm.

They originally wanted to compare the digital collection with the analogue collection. However, because of access restrictions to the printed Burney Collection, they decided it wasn't a particularly good comparison. So instead they compared a snapshot of their analogue legal deposit newspaper collection with the digital Burney Collection.

The point was not to say which was more expensive or better value, but to see if the LiFE model was workable for both types of collection.

The BL tried to cost each part of the process that they went through, including staff costs. They used linked spreadsheets to allow manipulation of the underlying cost assumptions without having to change all the formulae - so they could change (e.g.) salary costs and this would filter through.

The BL found that the overall LiFE model worked very well for the analogue collection. Even though not all terminology applied (for example the model talks about 'bitstream' - which is a digital only idea), the concepts underlying the terminology would apply to both - so perhaps the terminology needs refining to show this.

However, they found that at the element and subelement level, the detail for the digital was different to that required for analogue - which you might reasonably expect.

There were some significant differences:

There were large 'creation' costs for the digital collection, but not for the analogue collection (although it occurs to me, that actually there are large costs for the creation of the analogue collection - just not borne by the BL - does the model need to take into account commercial input?)

The overall conclusions were:

  • Comparison (of the analogue to digital) is complex but workable
  • Retrospective costing adds complications
  • Similar costs across a number of LiFE Stages
  • Analogue lifecycles are well established compared to digital

LiFE^2 - Implementation of the LiFE work

This session is describing some practical implementations of the LiFE costing model (we have more detailed case studies coming this afternoon).

The first is from Denmark (Anders Bo Nielsen and Ulla Bogvad Kejser):

The aim was to estimate and compare lifecycle costs of preservation of digital material held by Danish cultural heritage institutions, covering the National Archives, the Royal Library and the State and University Library.

The Danish project chose the LiFE model as it was already developed, seemed to have reasonable traction in the sector, and had been tested on real data sets - albeit small data sets. However, they have some improvements they would like to see to the model, including:

  • Use of OAIS terminology to ease understanding etc.
  • Breakdown in more generic function entities to avoid bias towards library material (since they are interested in other cultural heritage areas like Museums etc.)
  • Needs to cover all costs - e.g. general admin, facilities, cost of systems to manage lifecycle etc.

They also removed the 'metadata' stage, and spread the metadata elements across the other stages (this was referred to in Paul Wheatley's talk, in terms of disagreement over the best way to handle the metadata aspects of the model) - this latter approach makes more sense to me, rather than regarding 'metadata' as a specific activity, making it a function of other parts of the model. In fact, the more I think about it, the more it strikes me that regarding 'metadata' as an activity in itself is a serious problem, and suggestive of a 'cataloguing' centric view of the world - we should always see the use of metadata as a means to an end, not an end in itself.

Ulla Kejser is now talking about a specific instance: preserving pictures from celluloid in digital format. They used the LiFE model to estimate the costs of digital preservation vs film preservation, and found that the ongoing costs for film preservation are much lower than for digital preservation. However, over 5 years the digital preservation turns out to be cheaper, so they have decided to use TIFF digital copies as their 'safety' copy. She also noted that they were dealing with very high resolution images, which increased the cost of digital preservation.

Finally in this session (and the last before lunch), Paul Ayris (Director of Library Services, UCL) is speaking about the JISC-LC Blue Ribbon Task Force on the economic sustainability of digital preservation (of which both he and Paul Courant are members).

Paul is going to cover:

  • Why is digital preservation important
  • Implications - focusing on UK Exemplars
  • The work of the Blue Ribbon Task Force

UCL has a 5 year library strategy going up to 2010, with 10 over-arching goals, with e-strategy a priority in many of them (Teaching and Learning, Research, Student Experience, Partnership working)

UCL has a model of the user experience - focus on 'value' and user demand rather than on the cost of providing the service. They have defined a 'generic' user called 'Charlie' (the phrase 'charlie says' springs irresistibly to mind).

They have a number of scenarios for Charlie (although I feel that they have missed the point of the idea of having a 'user' scenario here, as essentially they say Charlie might be a student, or a researcher or something else etc. - surely there should be different exemplars for each type?)

Anyway, this is a hook on which to hang an analysis of what users want from the library, and what other resources they use. UCL is aiming to bring together a number of different things through the 'UCL Portal' (the dreaded words 'one-stop shop' have been uttered - feel like I've stepped back in time by 5 years - does anyone believe in the one-stop shop anymore?). Oddly Paul goes on to describe how the library is only one content provider in a networked environment - this seems a recognition that the one-stop shop is not possible?

Interestingly UCL assume that STM researchers do not come to the physical library (unless they absolutely have to) - from an Imperial point of view, this is ALL (well almost) our researchers!

Anyway, in this new information landscape, long-term digital preservation of assets is essential. Paul says it is irresponsible to steer users towards these digital resources and not to think about their long-term viability.

Paul now going to talk about two aspects of digital preservation close to my interest - 'Big Science' and 'Small Science'.

Firstly 'Big Science'. Looking at the UK Research Data Service (UKRDS) project - RLUG and RUGIT have issued an invitation to tender, with £200k from HEFCE for a feasibility study into the development of a shared digital research data service for UK HE.

There are other options to the UKRDS:

  • National services which work for the academic community - e.g. E-Depot in The Hague is a national Dutch exemplar
  • Commercial services such as Portico
  • Local digital curation services - based at the institution (and institutional repositories are perhaps examples of this - but so far have concentrated on published output rather than primary datasets)

What is the 'Blue Ribbon Task Force'?

It has been set up by the NSF in the US, with funding from the Mellon Foundation, and partners include the Library of Congress and JISC.

The key questions being addressed are:

  • How will we ensure the long-term preservation and access to our digital information?
  • How will we successfully migrate data from one preservation format to another?
  • Should we preserve everything, or be selective?
  • If we are selective, what criteria do we use?

Also considering economic sustainability:

What is the cost to preserve valuable data and who will pay?

Economically sustainable digital preservation will require:

  • new models for channeling resources to preservation activities
  • efficient organization that will make these efforts affordable
  • recognition by key decision makers of the need to preserve with appropriate incentives to spur action

The Blue Ribbon Task Force is not just about HE - looking at wider environment.

The task force says that we need a recognition of the benefits of preservation - and this needs to happen at the level of key decision makers. I wonder if we have ever taken this approach to preservation before? It comes back to something that Paul Courant said - if we cost in preservation before doing anything, the startup costs will be too high. This seems to be the crux of the issue for me - which approach we take here is key.

LiFE^2 - LiFE Model Economic Validation

This talk from Bo-Christer Bjork - Professor at the Swedish School of Economics and Business Administration.

Bo-Christer was asked to validate the economic modelling and methodology of the models developed in LiFE.

Bo-Christer is introducing the idea of life cycle costing - which is theoretically attractive, but not applied much in practice - probably because it takes a long view in terms of timescale, which many investors/owners of capital goods are not so interested in, having much shorter term horizons.

However National Libraries and Universities have longer term time horizons, so lifecycle costing method is more attractive to them.

Bo-Christer now talking about Facility management as an example where lifecycle costing can give valuable information - because buildings etc. are owned and operated for decades. A comment that he guesses the cost of the BL building was higher than estimated, but that over the lifecycle you can see this is worth it (some sounds of wry amusement from the audience at this!)

Now covering 'Total Cost of Ownership' - lifecycle costing as applied to IT hardware and software.

Bo-Christer applied IDEF0 modelling to validate the LiFE model - a graphical process modelling technique originally developed for the US Air Force. It models processes with inputs and outputs.

Now some diagrams - unfortunately unreadable from where I'm sitting - but demonstrating the graphical model for inputs, outputs and processes associated with digital object management in libraries.

Bo-Christer was specifically asked to look at how the model should handle inflation. It is standard practice in lifecycle costings to do costings in real monetary terms, which is OK for future costs, but historic costs should be adjusted to take into account inflation. However, in the case of extremely long periods other methods should be used. In terms of LiFE, when they looked at the Newspapers case study (something that will be covered later today), this was an issue.
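
To make the inflation adjustment concrete, here is a minimal sketch of restating a historic cost in today's money using a price index. The index values and the cost figure are invented purely for illustration, and the function name is hypothetical; this is not taken from the LiFE model or from Bo-Christer's slides.

```python
def to_real_terms(nominal_cost: float, index_then: float, index_now: float) -> float:
    """Restate a historic (nominal) cost in the prices of the reference year."""
    return nominal_cost * index_now / index_then

# Hypothetical example: a cost of 1,000 incurred when the price index stood at 80,
# restated in the prices of a year in which the index stands at 100.
historic_cost_today = to_real_terms(1000.0, index_then=80.0, index_now=100.0)
print(f"{historic_cost_today:.2f}")
```

Once all historic costs are restated against the same reference year, they can be added to future costs that are already expressed in real terms.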

Bo-Christer now covering the idea of 'discounting' - a technique used where costs and incomes occur in different years. For example, with a discount rate of 5%, a £100 cost or income in 10 years' time is worth £32 today.
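
A minimal sketch of the standard present-value calculation, PV = amount / (1 + rate)^years, to make the discounting idea concrete. The function names are made up for illustration. Plugging in the figures as noted above, 5% over 10 years gives roughly £61 rather than £32; a value of £32 would need a rate nearer 12%, so one of the figures in these notes may have been taken down incorrectly.

```python
def present_value(amount: float, rate: float, years: int) -> float:
    """Value today of an amount paid or received a given number of years from now."""
    return amount / (1 + rate) ** years

def implied_rate(amount: float, value_today: float, years: int) -> float:
    """Annual discount rate that turns the future amount into the given value today."""
    return (amount / value_today) ** (1 / years) - 1

pv = present_value(100, 0.05, 10)
rate_for_32 = implied_rate(100, 32, 10)
print(f"£100 in 10 years at 5% is worth about £{pv:.2f} today")
print(f"A present value of £32 implies a rate of about {rate_for_32:.1%}")
```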

Although discounting applies well to large investments (e.g. building a factory), it isn't well suited if there is a steady stream of costs over years, and there is no income to compare it with, so Bo-Christer recommended that it shouldn't be used for LiFE.

Overall, I'm not sure I'm much the wiser at the end of this talk - I'm sure Bo-Christer knows what he is talking about, and I think it is great that they have been working at validating the economics.

A question from Chris Rusbridge - how does the lifecycle model apply to an open-ended 'lifecycle' - Bo-Christer acknowledges that it is an important issue, but is not sure what the answer is.

A question/comment from Paul Courant suggesting not using discounting is a problem, because even very small costs become large (or even 'infinite') if you have an open ended lifecycle (i.e. if you commit to preserving something forever)
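
Paul Courant's point can be illustrated with a short sketch. A constant annual cost summed without discounting grows without bound as the time horizon lengthens, whereas with a positive discount rate the total converges towards annual_cost / rate (the standard perpetuity formula, with the first payment one year out). The £100 annual cost and the 5% rate are made-up illustration values, and the function names are hypothetical.

```python
def undiscounted_total(annual_cost: float, years: int) -> float:
    """Sum of a constant annual cost with no discounting."""
    return annual_cost * years

def discounted_total(annual_cost: float, rate: float, years: int) -> float:
    """Sum of a constant annual cost, each year discounted back to today."""
    return sum(annual_cost / (1 + rate) ** t for t in range(1, years + 1))

def perpetuity_value(annual_cost: float, rate: float) -> float:
    """Limit of the discounted total as the horizon goes to infinity."""
    return annual_cost / rate

for years in (10, 100, 1000):
    plain = undiscounted_total(100, years)
    discounted = discounted_total(100, 0.05, years)
    print(f"{years:>5} years: undiscounted {plain:>9.0f}, discounted {discounted:>8.2f}")

print(f"Limit for an open-ended commitment: {perpetuity_value(100, 0.05):.2f}")
```

The undiscounted column keeps growing in proportion to the horizon, while the discounted column levels off, which is why a commitment to preserve something 'forever' only has a finite cost if some discounting is applied.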

LiFE Model

This talk by Paul Wheatley, the Digital Preservation Manager at the British Library.

Paul starting by describing the LiFE model, and the shortcomings of the LiFE Model v1.0. Some of these were addressed in v1.1, and v2.0 of the model is due out in August 2008.

Version 1.1 of the model makes some changes - especially differentiating between bitstream preservation and content preservation, and also separating out creation/acquisition costs slightly, as they don't always apply.

For Version 2.0, they are looking at bringing in elements for 'Disposal'. How Metadata is handled has divided the LiFE team, and there are some changes in v2.0.

Quite a lot of detail being covered by this report, but unfortunately it isn't terribly gripping - I would guess reading the reports out of the LiFE projects would cover all this.

At the end some questions about the model. One interesting point about rising cost of electricity.

LiFE^2 - Some Economics of Digital Preservation

The keynote by Paul Courant.

Since libraries are concerned with 'the past' (with an eye on the future), and the past grows in scope literally by the second, we've got a real challenge on our hands.

Paul starting by asking 'What is Preservation?' - saying that he will leave talk of digital until the end of his talk, as he believes that if we understand preservation, we generally understand digital preservation (with some caveats).

You have to have 'something' to preserve - information or artifacts or both - an "object". Preservation activity affects the flow of current and future services available from the "object". The potential usefulness of the object in the future is dependent on the preservation activity that we have undertaken.

According to LiFE, the lifecycle cost over time equals the cost of acquisition plus the time-dependent costs associated with Ingest, Metadata, Access, Storage and Preservation.
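
Written out as a formula, the sentence above looks something like the following. This is a sketch: the symbols are assumed for illustration (L for the lifecycle cost, Aq for acquisition, and so on), with the subscript T marking the elements whose cost depends on the time period considered.

```latex
L_T = Aq + I_T + M_T + Ac_T + S_T + P_T
```

Here I_T, M_T, Ac_T, S_T and P_T stand for the Ingest, Metadata, Access, Storage and Preservation costs over time T, and Aq is the one-off acquisition cost.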

Paul saying that the benefits are:

  • Findability (we need to be able to find it)
  • Usefulness (we need to be able to use it)
  • Reliability (we need to do both of the above reliably)

Paul says: Finding a needle in a haystack is relatively straightforward if you know it is there - much better than trying to find a needle in any haystack when you aren't even sure if the needle is there in the first place.

Paul now quoting from an economist Robert Solow:

"The duty imposed by sustainability is to bequeath to posterity not any particular thing - with rare exceptions such as Yosemite, for example - but rather to endow them with whatever it takes to achieve a standard of living at least as good as our own and to look after the next generation similarly"

This draws an interesting distinction between the general level of preservation - that we just need a 'body' of resource that is sustained - and the need to preserve specific things because of their particular impact. I think this is a good concept - and that the thing that is difficult is to define the specific things that are the 'rare exceptions' - because most stuff isn't important in itself, but as it represents a body of resource.

Paul now arguing that 'markets' in general won't do preservation. Quote from Anand and Sen, 2000:

"sustainability cannot be left entirely to the market. The future is not adequately represented by the market - at least not the distant future"

Paul relating the problem of trying to study iPod adverts - the 'market' isn't interested in preserving these.

Paul saying that the cost of adding extra 'users' to resources approaches zero (perhaps especially in the context of digital information). I'm not entirely convinced by this addendum, although clearly the cost is low, dealing with a million regular users is a different level of resource to dealing with 1000 regular users.

Paul arguing that there are a number of values related to Natural Resources:

  • Public Good
  • Use Value (you can do something with the resource)
  • Existence Value (knowing something is there is important in a general sense, even if you don't use it)
  • Option Value (it is important to have the option to use a resource)

Paul now dividing two types of sustainability:

  • Specific sustainability - preserving a specific object (e.g. Magna Carta original manuscript)
  • Value sustainability - preserving the value encoded in an object (e.g. the text of the Magna Carta)

Paul now showing some points from the NSF BRP on Economically Sustainable Digital Preservation and Access:

  • Recognition of benefits of preservation by people who can move resources (Demand)
  • Incentives to people who have the stuff
  • Mechanisms to move resources to the stuff as routine or default, including handoffs
  • Efficient use (don't save everything perfectly, make choices)
  • Organization and governance of the many relevant players (Paul saying that for this, UK is relatively well positioned, having clear national government, a national library and JISC funding national work - compared to the US)

Paul saying you can't expect library materials to come with full costs of preservation - we would never have bought any books if we had started like this.

Now Paul saying, all the above is true about preservation in general, so what is different about digital?

  • Fragile - in a different way to paper based stuff
  • Too much stuff
  • Rights Environment
  • Use doesn't wear it out (and may even make it more usable in the future)
  • Functionality and Links (very fragile)
  • Public Goods Implications - once something is available digitally on a server, there are very low distribution costs - this changes the business model - having unique aspects to a physical collection concentrates people around the resource - not true with digital collections

Some points about Digital Scholarship:

  • Easy (sort of) cases
    • Digitized print (Google and the SDR)
    • Journals (Portico, LOCKSS, Some National Libraries)
    • Astronomical Data (because the astronomy community wants to and likes to share data, not because the data is particularly easy)
  • Harder cases
    • Multimedia projects
    • Things with links and embedded functionality (from excel spreadsheets on up)
    • Data from Chemistry experiments (chemists are the opposite of astronomers!)
  • Hardest
    • The cultural record itself
    • Business records, etc.

Paul finishing by saying that only collecting what you know you can sustainably (indefinitely) keep is a "Really Bad Idea".

Q: Michigan one of the early adopters in regards Google digitization - what economic factors did you look at?

A: Did some calculations about holding 7 million books on servers. University committed to finding the money when the time came. University stood by this commitment - and academic value was clear. They did not make an argument about savings to be made by digitization.

Q: Can you comment on how preserving websites differs to what you have outlined in your talk?

A: Need a strategy to do a small sample to very high quality, and then do a very large sample at low quality, and recognise that you cannot preserve everything (and we have never done this, or strived to do it). "It is as much museum like as library like - but a lot of things are becoming more museum like, than library like"

Q: One of the things you said is different about digital is loss of local control - can you comment on the impact on the economics and business models?

A: The economics and business models change. The BL exists not just for love, but for profit - it is a differential asset for the UK. Once you look at digital, this is harder - will require high level agreements between governments, Universities etc. That the payoff for having a great local collection might no longer exist is a problem - but what if you can say you have a high level of local skill (in the library) to exploit and integrate digital and physical resources you might get local investment there - but who will pay for making the material available? Not clear.
