Preserving bits

Just as I posted that last post, including some stuff on preservation of the digital, this piece from Robert Scoble dropped into my Twitter stream. I thought a quick sharing of my approach to digital preservation (such as it is) might be interesting:

Photos
When we copy photos from our digital camera, they go straight onto our NAS (network attached storage), into date-labelled folders (named YYYYMMDD) – one for each day we do a download. I then copy them into iPhoto on our MacBook Pro, which is our primary tool for organising the photos. We might delete some of the pictures we import, but I don’t go back and remove these from the NAS. In iPhoto I take advantage of the various organisation tools to split the photos into ‘events’, and have recently started adding ‘place’ and ‘face’ information (where a photo was taken, and who is in it) using the built-in tools.
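The download step is simple enough to script. Here is a minimal sketch of the idea in Python – the camera and NAS paths are made up for illustration, not what we actually use:

```python
import shutil
from datetime import date
from pathlib import Path

CAMERA = Path("/Volumes/CAMERA/DCIM")   # hypothetical camera mount point
NAS = Path("/Volumes/nas/photos")       # hypothetical NAS share

def download_photos() -> Path:
    """Copy everything from the camera into a YYYYMMDD folder on the NAS."""
    dest = NAS / date.today().strftime("%Y%m%d")
    dest.mkdir(parents=True, exist_ok=True)
    for photo in CAMERA.rglob("*"):
        if photo.is_file():
            shutil.copy2(photo, dest / photo.name)  # copy2 keeps file timestamps
    return dest

if __name__ == "__main__":
    print(f"Copied photos to {download_photos()}")
```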

We may then select some of these to be published on our website. We used to do this with custom software built into our then blogging platform, but now we use Flickr.

The photos on the NAS are backed up to online storage (using JungleDisk, which layers over Amazon S3) on a weekly basis. So that is essentially two local copies, and one remote.
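JungleDisk handles the upload for us, but for illustration, here is a rough sketch of the same weekly-sync idea written directly against S3 using boto3 – the bucket name and paths are hypothetical, and this is not how JungleDisk itself works:

```python
import boto3
from pathlib import Path

NAS_PHOTOS = Path("/Volumes/nas/photos")  # hypothetical NAS share
BUCKET = "my-photo-backup"                # hypothetical S3 bucket

def backup_photos() -> None:
    """Upload every file under the NAS photo tree to S3, keyed by relative path."""
    s3 = boto3.client("s3")
    for path in NAS_PHOTOS.rglob("*"):
        if path.is_file():
            key = str(path.relative_to(NAS_PHOTOS))
            s3.upload_file(str(path), BUCKET, key)

if __name__ == "__main__":
    backup_photos()  # run weekly, e.g. from cron or launchd
```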

Pictures are taken as JPEGs and stored in that format. I haven’t got a plan for what happens when the standard image format moves away from JPEG – I guess we’ll have to wait and see what happens.

Music
Also on our NAS, and backed up online once a week. Organised by iTunes, but this time on our Mac Mini rather than the MacBook Pro. Files are a mix of AAC and MP3.

Video
Also on the NAS and backed up online once a week. Organised by iMovie, on the MacBook Pro again. I think this is an area I’m going to have to revisit, as neither the MBP nor the Mac Mini really has enough disk space to comfortably store large amounts of video.

Sometimes I get round to producing some actual films from the video footage, and these are generally published to our website (just as video files) – I think I’ve only put one on YouTube. I have to admit I’m a bit fuzzy about the formats: the camera records MPEG-2, but I’m not sure what iMovie does to it on import. I tend to export any finished films as MPEG-4.

Documents
Simply stored on the NAS with weekly online backups. Stuff obviously gets put on the MacBook Pro at various times, but I’m pretty good at making sure anything key goes back on the NAS.

I guess that this blog is the other key ‘document’ store, and at the moment I only have a very vague backup policy for it: a snapshot from a couple of months ago, stored as web pages on our NAS (and therefore backed up online).

Conclusions

In some ways the video and photos are our biggest problem. However, the fact that we are already doing some selection should make preservation easier. We are already ‘curating’ our collections when we decide what goes online or into a film, so it would make sense – and be much cheaper – to focus preservation activities on these parts of the collection.

Probably the least ‘curated’ part of our collection is Documents – this contains just about everything I’ve done over the last 10 years, including huge email archives and documents on every project I’ve been involved in since about 1998. I haven’t deleted much, and every time I think about pruning it, I realise I don’t know where to start – and besides, compared to the video it hardly takes up any space.

The areas I feel I need to look at are:

  • File formats – are we using the most sensible file formats? In particular, I need to check what we use for video
  • Migration strategies – how would I move between file formats in the future? (See the sketch after this list)
  • Curation strategies – should we focus only on the parts of the collection we really care about?
  • What to do about blogs?
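On the migration point, I suspect the answer will look like a batch conversion script. A purely hypothetical sketch using Pillow, migrating a folder of JPEGs to PNG (I’m not suggesting PNG is the right target format – it’s just a stand-in, and the paths are made up):

```python
from pathlib import Path
from PIL import Image  # Pillow

SOURCE = Path("/Volumes/nas/photos")       # hypothetical NAS photo tree
TARGET = Path("/Volumes/nas/photos-png")   # hypothetical migration output

def migrate_jpegs_to_png() -> None:
    """Write a PNG copy of every JPEG, preserving the folder structure."""
    for jpeg in SOURCE.rglob("*.jpg"):  # lowercase extensions only, for brevity
        out = TARGET / jpeg.relative_to(SOURCE).with_suffix(".png")
        out.parent.mkdir(parents=True, exist_ok=True)
        Image.open(jpeg).save(out, "PNG")

if __name__ == "__main__":
    migrate_jpegs_to_png()
```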

What I really don’t believe to be the answer is (as Robert Scoble suggests, and as came up in Giles Turnbull’s Bathcamp talk) ‘print it out’.

A gathering place for UK Research

I’m the project director for EThOSNet – which is establishing a service, run by the British Library, to provide access to all UK PhD and Research Theses. The service itself is called EThOS (Electronic Theses Online Service).

Today, EThOS has gone into public beta – without fanfare, the service is now available, and can be found at http://ethos.bl.uk. The key parts of the service are:

  • A catalogue of the vast majority of UK Research Theses
  • The ability to download electronic versions where they exist
  • The ability to request an electronic version be created where it doesn’t already exist

I’m incredibly excited about this – of all the projects I’ve been involved in, although not the biggest in terms of budget (I don’t think), it has the most potential to have an incredible impact on the availability of research. Until now, if you wanted to read a thesis you either had to request it via ILL or take a trip to the holding university. Now you will be able to obtain it online. To give some indication of the difference this can make, the most popular thesis from the British Library over the entire lifetime of the previous ‘microfilm’ service was requested 58 times. The most popular electronic thesis at West Virginia University (a single US university) in the same period was downloaded over 37,000 times. If we can achieve even a relatively modest increase in downloads I’ll be happy – if we can hit tens of thousands then I’ll be delighted.

The project to set up EThOS has been jointly funded by JISC and RLUK, with contributions from the British Library and a number of UK universities and other partners, including my own, Imperial College London, which leads the project. The launch of the service is the culmination of several projects, including ‘Theses Alive!’, ‘Electronic Theses’, ‘DAEDALUS’, ‘EThOS’, and the current ‘EThOSNet’.

With so much work done before and during the EThOSNet project, my own involvement (which started some way into the project, when I took over as Project Director from Clare Jenkins in autumn 2007) looks pretty modest – so thanks to all who have worked so hard to make EThOS possible and get it live.

One of the biggest issues that has surfaced several times during the course of these projects is the question of IPR (Intellectual Property Rights). EThOS is taking the bold, and necessary, step of working as an ‘opt-out’ service. This is based on a careful consideration of all the issues, which concluded that:

  • The majority of authors wish to demonstrate the quality of their work
  • Institutions wish to demonstrate the quality of their primary research

So that authors can opt out if they do not want their thesis to be made available via EThOS, there is a robust take-down policy – available in the EThOS Toolkit.

As an author, you can also contact your University to let them know that you do not wish your thesis to be included in the EThOS service.

By making this opt-out and take-down approach as transparent as possible (including doing things like advertising it on this blog), we believe that authors have clear options they can exercise if they have any concerns about the service.

Finally, the derivation of the word Ethos (according to Wikipedia) is quite interesting™. There are many aspects of the word that felt relevant to the service – the idea of a ‘starting point’, and the idea that ‘ethos’ belongs to the audience, both resonate with what EThOS is trying to do. However, for the title of this post I decided to draw on Michael Halloran’s assertion that "the most concrete meaning given for the term in the Greek lexicon is ‘a habitual gathering place’" – which I believe is what EThOS will become for those looking for UK research dissertations.

LiFE^2 – Panel Session

Final session of today’s conference. Chris Rusbridge from the DCC is introducing it, saying that quite a lot of what we thought we knew about digital preservation is wrong – and implying that quite a lot of what we think we know now is also wrong.

Some discussion about how the case studies might inform real costings, or estimates of costings, in the future. Suggestion that LiFE will look at this in the write-up. There is a clear desire for a tool to help with this kind of assessment.

Always difficult to write up these discussion sessions – not least because they are more interactive from my point of view (i.e. I take part in the discussion).

Some stuff coming up:

  • Need to have better links between value and economic costs – if we can put a figure on ‘value’ we will stand a better chance of getting funding
  • Need tools to help us make decisions regarding digital preservation
  • Why is metadata handled as separate part in the LiFE model?

In closing, Paul Ayris is summing up:

  • The key to sustainable preservation is demand, which is driven by ‘perception of value’ – we should not be driven by the cost of preservation
  • The new LiFE model was used in the case studies described today, and an economist has commented on it, suggesting some ways of handling inflation and depreciation
  • If we are looking at developing a generic model, we need to look at the Danish examples, and see how it might apply in different scenarios
  • We are still in the process of learning what ‘digital preservation’ means, and what the costs truly are

Paul summarised the following from the panel discussion:

  • LiFE (if it can continue into a new phase) would like to develop a predictive tool to determine costs to help decision making
  • Interest in more case studies
  • Roles and responsibilities are crucial in digital preservation, and in the UK we certainly still need to debate this

Paul says he can’t understand why the UK is so far behind some of the best European examples.

LiFE^2 Case Studies – Q and A

Q: To what extent did the Newspaper case study consider the difference between the very well-established analogue workflows/processes and the newer concepts in digital?

A: Definitely something that is focused on in the write-up

Q: What are the ideal and realistic timeframes in which the costings for activities in the LiFE model should be reassessed within an institution (to reassess the overall costs)?

A: Neil Beagrie says it is important to revisit this very regularly, stressing the importance of regular audits of institutional digital data. Stephen Grace suggests costings should be revisited annually.

Q: Where do you draw the line between the ‘creation costs’ and digital preservation costs to be costed by LiFE?

A: No clear answer – but clarification that Royal Holloway costs related to advocacy around acquisitions only included that of staff directly attached to the repository

Q: Note that all the case studies essentially took it as a given that they would preserve the material in the format in which it was delivered. Should the model be used to predict costs to inform decisions about what to preserve? (I think I got this right – I missed some of the question)

A: A qualified yes basically

Q: Neil mentioned issue of logical format migration. Does anyone have a view on the cost of this?

A: Neil says there is very little in the way of long-term studies of data to give information on this. However, he also notes that the more you dig, the more examples you find. So far much of the costing around this is based on assumptions about how often we will need to do it and how much it will cost. In reality there are likely to be large variations between ‘trivial’ transformations – e.g. from one version of software to another – and more major ones.

LiFE^2 – Research Data Costs

This session is not quite a case study, but a description of the application of the LiFE model to research data preservation, by Neil Beagrie – work that was used to produce the recently published “Keeping Research Data Safe” report.

They found that a number of factors had an impact on the costings from the model:

  • Costs of ‘published research’ repositories vs ‘research data’ repositories
  • Timing – it costs c.333 euros to create a batch of 1000 records, but 10 years after creation it may cost 10,000 euros to ‘repair’ a batch of 1000 records with badly created metadata
  • Efficiency curve effects – we should see drop in costs as we move from start-up to operational activity
  • Economy of scale effects – a 600% increase in acquisitions gives only a 325% increase in costs

Noting that a key finding is that Acquisition and Ingest costs are high compared to archival storage and preservation costs. This seems to be because existing data services have decided to ‘take a hit’ upfront, making sure ingest and preservation issues are dealt with at the start of the process. I think this is a key outcome from the report, but based on the discussion today I don’t know what it tells us. I guess it is a capital vs ongoing cost question. If you’d asked me at the start of the day, I’d have said that the model described was a reasonable one. However, after Paul Courant’s talk I wonder if this could result in dangerous inaction – if we can’t afford preservation, we won’t start collecting. The issue is that we can spread ongoing costs over a long period of time, so does dealing with a heavy upfront cost make sense?
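One way to think about that question is to compare the heavy upfront cost against the discounted (present-value) sum of the ongoing costs. A rough sketch – the figures and discount rate are entirely hypothetical, purely to illustrate the trade-off:

```python
def present_value(annual_cost: float, discount_rate: float, years: int) -> float:
    """Discounted sum of an ongoing annual cost over a number of years."""
    return sum(annual_cost / (1 + discount_rate) ** year for year in range(1, years + 1))

# Hypothetical figures, purely for illustration:
upfront_ingest = 10_000   # heavy one-off cost at ingest
ongoing_per_year = 1_500  # lighter annual preservation cost, spread over time
pv_ongoing = present_value(ongoing_per_year, discount_rate=0.05, years=10)

print(f"Present value of 10 years of ongoing costs: {pv_ongoing:,.0f}")
print("Upfront approach cheaper" if upfront_ingest < pv_ongoing else "Spreading costs cheaper")
```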

Neil is making a number of observations, but stressing that he does not regard the study as the final word on costs.

LiFE^2 Case Studies – SHERPA-DP

SHERPA-DP – presented by Stephen Grace (Preservation Manager, CeRch)

SHERPA-DP was a project to set up a shared preservation environment for the SHERPA project (http://www.sherpa.ac.uk).

Stephen running through different aspects of costs. Stephen is one of several presenters to say that Metadata creation isn’t really a separate step – I’m left wondering who actually argued in favour of treating it separately?

They found some aspects hard to predict – e.g. preservation actions, where they assumed a major action (10 days’ effort) would be needed every 3 years. This may need to be refined as we learn more about digital preservation.

Costing exercises are difficult – they take time, and the evidence is not readily to hand. However, LiFE offers a consistent methodology. They also felt the exercise showed the value of third-party preservation – though Stephen admits to being biased!

They found that the storage costs had a large impact – so reducing storage costs would have a significant effect.

LiFE^2 Case Studies – SHERPA-LEAP

This is being presented by Jacqueline Cook from Goldsmiths. SHERPA-LEAP was a project to set up institutional repositories to hold published research output at a number of University of London colleges. The case study covers Royal Holloway, Goldsmiths and UCL.

Because of the relative ‘youth’ of the repositories, the major costs were staffing, and the main processes were Acquisition, Ingest and Metadata Creation.

The costs were calculated based on the amount of time spent on each item. Interestingly, there are some institution-specific variations – Goldsmiths has high ingest costs because of the variety of material submitted, while Royal Holloway has high acquisition costs because it included the cost of holding outreach events (it isn’t clear that the time of the academics attending these was costed in – it sounds like just the cost of the repository staff who ran them).

The overall costs for each institution varied considerably:

  • Goldsmiths
    • Year 1 – 31.48
    • Year 5 – 31.95
    • Year 10 – 32.22

UCL came in at approximately half these figures, with Royal Holloway in the middle. Clearly these are estimates. Jacqueline suggests that the more complex nature of the objects accepted at Goldsmiths had a large impact on the variation in costs across the institutions. Alongside this there were also:

  • Different use cases
  • Phases in development of repositories
  • What was considered as part or outside the lifecycle
  • Method of deposit
  • Staffing levels

Overall the case study observed that:

  • We are working in a fast-changing environment
  • There are limitations to a simple, per-object average
  • Metadata quality assurance might be needed as an element (it was noted that metadata creation is actually part of Ingest, although the model treats it as a separate element)
  • Object-related advocacy – there may need to be an advisory role for repository administration
  • We are at an early stage for preservation planning
