JISC Innovation Forum: Research data – legal and policy issues

From the title of this session, I feel like it’s some kind of penance – hopefully not!

http://jif08.jiscinvolve.org/theme-2-the-challenges-of-research-data/legal-and-policy-issues/

This session takes the form of a debate with the motion:
“Curating and sharing research data is best done where the researcher’s institution asserts IPR claims over the data”.

It is also being live blogged at http://jif08.jiscinvolve.org/2008/07/15/session-1-legal-and-policy-issues/

At the start there are 4 for, 10 against, and the remainder (the majority?) abstaining.

Speaking for the motion is Charles Oppenheim…

Note – Charles is taking this position for the purposes of the debate – it does not necessarily represent his views (or those of his employer…)

Starting along the lines of the comment he left at http://jif08.jiscinvolve.org/theme-2-the-challenges-of-research-data/legal-and-policy-issues/ saying that in general the employers of researchers (e.g. Universities) would legally own the copyright. However, they tend to waive these rights (implicitly rather than explicitly) and allow researchers to do things like assign copyright to publishers – by not acting when this happens, the employers can be seen as abdicating their rights.

Charles now saying that curation (backup, refreshing etc.) involves copying – so copyright is important. You need a copyright owner to give permission for this to occur.

The OA movement is threatening publishers’ business – it seeks to persuade authors to keep copyright. Publishers are looking at other areas, especially the research data that supports publication. If we get into a situation where copyright over this data is assigned to the publisher, we will have problems with curation.

Therefore the institution should assert its copyright over the research data, while granting a royalty-free licence to the employee (the researcher) to do whatever they want with the data.

Institution can then exercise whatever curation they wish.

Mags McGeever now speaking against the motion:

Mags saying that she believes curation is best served when no one asserts IPR over the data.

Firstly – there may be limited IPR available on data in the first place, since facts cannot be copyrighted. However, a database can attract copyright – even if the data in it does not – as long as the database fulfils certain criteria around ‘creativity’. Although the law is nuanced here, a simple database of facts which does nothing more than record those facts is generally unlikely to attract any IPR.

There is much uncertainty around this – many people believe they can assert IPR where it doesn’t actually exist. However, there are some circumstances where data or databases have IPR – but the collaborative aspect of work in the current academic research environment leads to complexities which are difficult or impossible to unravel – ownership may not belong to a single institution, or within a single legal jurisdiction.

Unless there are clear agreements in place at the start, ownership will be unclear, and getting these agreements in place is a long and complicated process.

So – data unencumbered by IPR is easier to share, and to reuse – e.g. US public data.

Science Commons advocates putting data in the public domain.

One issue is that IPR provides an economic incentive to curate. However, this is not generally the driving force in the academic sector – attribution and credit is much more of an incentive. There are some issues here – as data is manipulated, there is a question of maintaining attribution correctly – and we probably need a technical solution here that does something to show the provenance of data.

So overall, better to have data unencumbered – remove barriers to curation.

Mags closes here.

Comment from Sam Pepler from the British Atmospheric Data Centre (BADC), saying how they require researchers to sign a contract giving the BADC a licence to do whatever they want with the data.

Chris Rusbridge – If I want to use data from the BADC in a new venture, does this not mean I need to negotiate with each person asserting control over the data sources? Given the incredible complexity of possible rights, the only way to promote interoperability is to put the data into the public domain.

I lose the ability to keep up here as I dive into the debate – see the live blog at http://jif08.jiscinvolve.org/2008/07/15/session-1-legal-and-policy-issues/ for more detail.

JISC Innovation Forum 2008

For the next two days I’m at the JISC Innovation Forum at Keele University. The event brings together many JISC projects and services and hopefully will be a stimulating event. There is an official blog at http://jif08.jiscinvolve.org

The first day is starting with an introduction from Sarah Porter (Head of Innovation Group, JISC), who is starting by outlining what JISC means by ‘innovation’ – introducing new and useful things/services/ideas.

The participants for this forum come from across 100+ organisations, including colleges, universities, funding bodies, JISC services, other support organisations, government, representing library, IT and many other aspects of the community JISC is engaged with.

Sarah is describing how they hope the event will be about conversations, exchange of ideas, building links etc. That is, not just us being talked at. Is it just me, or do JISC seem to be trying to find ways to enable this type of discussion via meetings in perhaps a more explicit way than they have in the past?

Sarah now asking – why do we need ‘innovation’?

    * improve practices and quality
    * respond to changing needs of users
    * respond to new opportunities
    * respond to changing external environment

These needs chime with me particularly at the moment, as this is something I’m very much engaged in at Imperial – and how we best engender and support innovation within a library service, and what structures and practices best support this.

Sarah now describing some of the ways that JISC Services and programmes deliver JISC aims – highlighting SuperJanet and Digital Libraries (since early 1990s and 1996 respectively). Noting how some of this takes time to deliver and filter through to the community.

Highlighting the current digitization programme – 19th Century newspapers; parliamentary papers; medical journals.

E-learning programme – XCRI (software to exchange course-related information); learner experience work – led to print and video publications; e-portfolios – e.g. the ‘Simple’ project

Research – MyExperiment Virtual Research Environment; Virtual Environments for Research in Archaeology – increased publication speed after dig season (average dropped from 1 year to just a few months)

Repositories and Information Environment – SWORD (technical development to enable deposit in repositories via a standard technical interface); JORUM; Start Up and Enhancement repository projects (around 45 projects funded here).

Sarah is dotting through a lot of JISC funded work, and many of the things she is highlighting are good pieces of work, but I think it needs mentioning that there is a view that the approach JISC takes to funding projects etc. is not as productive as it should be – concerns are expressed about funding a large number of projects with small amounts of funding. I’ve also had conversations with people who feel that the projects don’t really go anywhere.

I personally probably sit somewhere in the middle. I sometimes feel frustrated by the JISC approach (as I see it), but I also buy in to the ‘thousand flowers bloom’ type approach, and where we see the outputs pushing into practice then we see the value of this.

On the back of this, there is now a question from the audience about ‘evaluation’ – how do we evaluate the work, and adjust our approach to funding future work?

Sarah saying that evaluation is very important, as is assessing impact – and we need to do this at both the micro and, perhaps more importantly, the macro level. This is something that JISC is looking at very carefully.

Another question (from Brian Kelly) – in Sarah’s talk she said ‘revolutionary change’ – but many in the sector are conservative – how do we reconcile this? Sarah feels it is about language and communication – don’t need to phrase it as ‘revolution’ but sell the benefits.

RIOJA: Panel Session

The last session of the day – Q and A with panel:

Q: How quickly is change coming, and how can we keep pace with it – specifically in relation to ‘metrics’ such as the REF wants to apply?

A: Metrics are by nature backwards looking – so bound to have some issues with this. Need to engage people like WoS/SCOPUS to look at what they can offer.

By its nature measuring something changes it – we have to constantly review (and change) our measures

Q: How long will libraries continue to subscribe where material is also available via OA?

A: Balancing act – everyone aware of fragility of system – need to keep peer-review, but how it is paid for is a question – while ‘publication’ is the model, need to maintain it.

Q: Libraries can’t afford ‘author pays’ model

A: Not just a library issue – institutions need to understand the issues and have a policy towards OA and how they fund it

Not the biggest issue though – the ‘elephant in the room’ is the increase we expect to see in quality research coming out of China and India – it is unlikely the current system will be able to cope with this.

Q: How do physicists use arXiv – do they search? Browse?

A: Get email alerts of new papers each day (or view on website).

Anecdote – physicist saying never used physical copies, but saw them as an ‘archive’ copy (unlike the online version, which they read, but didn’t necessarily see as ‘permanent’)

Comment: OUP has developed a number of preservation approaches – dark archive, relationships with Portico, LOCKSS etc. This comes at high cost, and has been led by demands from libraries to publishers – we could even see a reversal of this, where publishers make demands on libraries for preservation etc. (which is perhaps the model we have with paper!)

Some more discussion which I totally failed to capture – sorry 🙁

Overlay Journal infrastructure for Meteorological Science (OJIMS)

This presentation (the last of the day) by Sam Pepler from the British Atmospheric Data Centre.

The OJIMS project is:

  • Overlay Journal Infrastructure for Meteorological Science
  • JISC and NERC funded
  • Looking specifically at a ‘data journal’ rather than traditional publication
  • Looking to evaluate business models for overlay journals
  • Creating an open access subject based repository for meteorology etc.

The University of Leeds will lead development of a dataset review policy with the Royal Meteorological Society (RMetS).

A ‘data journal’ is a journal that links published documents with the data that the publication uses – cf. CLADDIER, another JISC-funded project.

What are the benefits of a data journal?

  • Extend the value of peer-review from papers to data, to provide assurance that data documentation meets the necessary scientific standards
    • Metadata standards
    • Independently understandable
    • Re-useable
    • N.B. about quality of data documentation – not about quality of data set (i.e. “can you use it”, not “is it useful”)
  • Provide an overview of the quality and applications of data, enabling it to be used more easily and appropriately in research and applications
    • adding independent quality statements about usefulness
  • Provide recognition of the work of collecting and describing data
    • High quality, reusable data is not presently a citable resource
    • The writers of papers do not necessarily acknowledge those who collected the data

Why make an overlay journal?

  • Data already in ‘a repository’ – just needs some independent review
  • Because data is bulky, compound and complex – not easy to copy (possibly not as ‘self contained’ as traditional published paper?)

MetRep is a subject based repository for meteorological sciences. This was seen as filling a gap in the market – there is no store for some of the items they want to store. Examples of MetRep items are:

  • Paper from ‘Weather’
  • Set of pictures illustrating cloud forms (e.g. teaching aid)
  • Report documenting a file format for climate models
  • Weather balloon data
  • Recording of an interview with ministers about climate change
  • IPCC reports
  • Logo for a research programme

Although some of these could sit in existing repositories – Institutional Repository, JORUM, websites, etc.

Perhaps MetRep should be an overlay repository? What does it mean to say an item is ‘in the repository’?

So – what is proposed is:

  • Establishing an ‘Overlay document’
    • Metadata about the overlay document
    • Review process information
    • Discovery metadata for the reference document
    • Reference to document (referenceable via a resolvable id in a trusted repository)
  • The ‘review process information’ consists of
    • Version of document in review cycle
      • Submitted
      • In review
      • Published
    • Public comments
    • Description of review process
    • Digital signature?
  • Metadata about the overlay document would contain
    • Author (of the overlay, not the referenced document)
    • Other DC (Dublin Core) fields
  • Discovery metadata for the referenced document and Reference to document
    • DC metadata harvested from document (not sure if he means from the document, or from the metadata associated with the document?)
    • Resolvable reference to document
    • Other identifiers for document

The overlay repository would have overlay documents pointing to both documents or data
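
To make the proposed structure a bit more concrete, here is a minimal sketch (my own illustration in Python – the class and field names are guesses based on the bullet list above, not anything from the OJIMS project itself):

    from dataclasses import dataclass, field
    from enum import Enum
    from typing import Dict, List, Optional

    class ReviewStatus(Enum):
        SUBMITTED = "submitted"
        IN_REVIEW = "in review"
        PUBLISHED = "published"

    @dataclass
    class ReviewProcess:
        status: ReviewStatus                      # version of the document in the review cycle
        process_description: str                  # description of the review process
        public_comments: List[str] = field(default_factory=list)
        digital_signature: Optional[str] = None   # left as an open question in the talk

    @dataclass
    class OverlayDocument:
        overlay_author: str            # author of the overlay, not of the referenced item
        title: str                     # plus other Dublin Core fields as needed
        referenced_item: str           # resolvable id in a trusted repository
        other_identifiers: List[str]
        harvested_dc: Dict[str, str]   # discovery metadata harvested for the referenced item
        review: ReviewProcess

    # Hypothetical example: an overlay reviewing a weather balloon dataset held elsewhere
    record = OverlayDocument(
        overlay_author="A. Reviewer",
        title="Review of a weather balloon dataset",
        referenced_item="http://repository.example.org/item/1234",   # invented id
        other_identifiers=["doi:10.9999/example"],                   # invented DOI
        harvested_dc={"creator": "B. Collector", "date": "2007"},
        review=ReviewProcess(
            status=ReviewStatus.IN_REVIEW,
            process_description="Two independent reviewers assess the data documentation",
        ),
    )
    print(record.review.status.value)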

The advantages he sees in this approach:

  • Clear the ‘overlay’ is a document about another document – the two items are distinct and self-contained
  • Authorship for the referenced and referencing document are allowed to be different – others can submit a document for review
  • The overlay document has the same meaning as a stand alone item – you can take it out of the repository context, and is still meaningful
  • Review mechanisms and repositories do not need adapting to deal with these items
  • You can review a private document/data set – answers the ‘is this thing worth buying?’ question

Disadvantages:

  • Authentication issues – might be able to ‘fake’ items?
  • What if the author does not wish for document to be reviewed?

Implementation:

  • Atom XML representation (mention of OAI-ORE here – see the sketch after this list)
  • Already a popular format with many tools
  • Need a tool to create the records
  • Need a web rendering method
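
For the Atom XML representation mentioned above, a single overlay document might be rendered as an Atom entry roughly like this – again just a sketch of my own, with invented ids and conventions, and no claim to match whatever OJIMS actually produces:

    import xml.etree.ElementTree as ET

    ATOM_NS = "http://www.w3.org/2005/Atom"
    ET.register_namespace("", ATOM_NS)

    def atom(name: str) -> str:
        """Qualify an element name with the Atom namespace."""
        return f"{{{ATOM_NS}}}{name}"

    entry = ET.Element(atom("entry"))
    ET.SubElement(entry, atom("title")).text = "Review of a weather balloon dataset"
    ET.SubElement(entry, atom("id")).text = "http://overlay.example.org/records/42"   # hypothetical

    author = ET.SubElement(entry, atom("author"))
    ET.SubElement(author, atom("name")).text = "A. Reviewer"   # author of the overlay

    # Resolvable reference to the reviewed item in a trusted repository
    ET.SubElement(entry, atom("link"), rel="related",
                  href="http://repository.example.org/item/1234")

    # Review process information carried as a category (an invented convention)
    ET.SubElement(entry, atom("category"), term="in-review",
                  label="Version of document in review cycle")

    ET.SubElement(entry, atom("summary")).text = (
        "Public comments and a description of the review process would go here."
    )

    print(ET.tostring(entry, encoding="unicode"))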

Trusting repositories:

  • More than resolvable identifiers – need to believe the object is preserved
  • Need to know what preservation means for complex objects
  • Repositories need to have sound footing – but there are no absolute guarantees

Somewhere along the line I’ve lost the point of what we are trying to achieve with this approach – Sam is now summarising, so hopefully this will help:

  • OJIMS is about widening review processes beyond papers
  • This means storing a wider range of objects – hence MetRep
  • Data is a good example of valued stuff which is not recognised in a formal manner – hence ‘data journal’
  • Lots of repositories are already storing the things – hence ‘overlay repository’
  • … didn’t get the last couple of points

Overall seems to be about a way of recording the ‘review process’ alongside the actual object being reviewed.

New models of Peer Review

This session by Ken Carslaw from the School of Earth and Environment at the University of Leeds. He is using the Atmospheric Chemistry and Physics journal as an example.

Atmospheric Chemistry and Physics (ACP) was founded in 2001, run by the European Geosciences Union (EGU). It now has the highest Impact Factor of the 4 Atmospheric Science journals listed by ISI.

Success (they believe) due to:

  • Open access
  • Collaborative peer review and commenting (the most innovative feature in the journal)
  • Speed of publication (two-stage publication, with submitted papers available immediately on submission)
  • Flexibility (special issues, new article categories, etc.)

A lot of what is being covered in the talk today is available in Pöschl, Learned Publishing, 17, 105–113, 2004

The process started with some basic issues/realisations:

Traditional peer review not an efficient means of quality assurance

  • Limited capacity/competence of editors and referees
    • few editors for large subject areas – limited knowledge of scientific details and specialist referees
    • work overload, conflicts of interest and little reward for referees
  • Retardation and loss of information in Closed Peer Review
    • The right person doing peer review on a paper can make a real contribution – but within closed peer review it may not go to that person (in fact it is unlikely)
  • Sparse and late commentaries in Traditional Discussion
    • cf Faraday Discussions – you circulate paper well in advance, then discussion ‘in the round’ at a meeting where paper is presented – get dialogue – they wanted to get that in the published paper environment
    • comment/article ratio has dropped significantly in the last 30 years

Large proportion of scientific publications carelessly prepared and faulty

  • Fraud (rare)
    • selective omission, tuning and fabrication of results
  • Carelessness (frequent)
    • superficial and irreproducible description of experiments and models
    • non-traceable arguments and conclusions, duplicate and split papers

By exposing papers on the web for open peer review, researchers are more careful, as aware that the work will be ‘public’ and reflect on them.

Conflicting needs of scientific publishing: rapid publication vs. thorough review and discussion

  • Rapid publication – widely pursued
    • brief papers, rapid reviews, curtailed review and revision process
  • Thorough review and open discussion – still the exception
    • required to identify scientific flaws and duplications
    • traditionally limited by availability of referees, review time and access to information

Came up with two stage publication with collaborative peer review:

Stage 1 – rapid publication of ‘discussion paper’ (D-paper) – passed by editors, fully citeable, typeset and permanently archived

The paper is typeset at this stage, paginated etc.

Followed by public peer review and interactive discussion – open to anyone registered – these discussions are also fully citable, and often are cited, as they contain important information

Stage 2 – review complete, final publication.

Questions about ‘Discussion Papers’

  • Should D paper be paginated?
    • Yes – so it can be cited
  • Should the archive D paper be a ‘journal’?
    • Yes, so it can be cited (not grey literature)
    • Not ISI listed – so they lose citations, as many people cite the ‘D’ paper and don’t bother to cite the final version
  • Should it be reviewed or accepted ‘as is’?
    • A minimum of quality assurance/filtering
  • If the paper is eventually not accepted, should the D-paper be removed?
    • No – impractical, and leaving it up acts as a deterrent (authors don’t want a non-approved paper hanging around)

Now Ken showing the workflow as a diagram… Noting that there has only ever been one instance where they withdrew a comment because it was simply an unsubstantiated attack (I suspect that specific papers could attract particular types of comments – e.g. the original paper on MMR and autism etc.)

The rules for ACP are:

  • Peer reviewers >=2 – can be anonymous or attributed
  • Public commentators – must be registered and are attributed
  • Comments are not reviewed or solicited – they should be substantial in nature (although they aren’t always)
  • couple more rules I didn’t get…

To see an example go to ACPD website (use Google, I’m not online as I write this!) and navigate to “Most Commented Paper”

The advantages are:

  • All win situation for authors, referees and readers
  • Discussion paper
    • Free speech and rapid publication
  • Public peer review and interactive discussion (collaborative peer review)
    • Direct feedback and public recognition for high quality papers
    • Prevention of hidden obstruction
    • Documentation of critical comments, referee disagreement, controversial arguments, scientific flaws and complementary information
    • Deterrence of careless papers
    • Special issues become a more collaborative process

Some interesting stats:

  • Get about 5 submissions per month, with rejection rates at this stage of around 10%
  • Final paper rejections run at about 10% (making roughly 20% in total)
  • Submission-to-publication time is 3-6 months in total

Impact factor has increased steadily since ACP was established.

EGU (the publisher), has now established 8 interactive OA journals, as well as 2 OA journals with traditional peer-review and 1 subscription journal.

A very interesting model. The journal works on an author pays model – which is a pay on submission (unlike BMC which is pay on publication)

A question – how many of the papers are available in a repository – as far as Ken knows, they aren’t generally (although of course could be happening, and how would they know?)

Research Excellence Framework

Graeme Rosenberg (REF Pilot Manager) from HEFCE presenting on this.

I’ve sat at the back with the only other blogger (afaik) in the room as we both run low on battery, and the only power sockets are next to the projector – which means I can’t hear so well 🙁

Some of this may be a repetition of stuff I blogged at the earlier REF event at King’s College London. Following consultation the REF is going ahead with 2 key changes – assessment for all subjects will include some metrics and some peer-review process, and the timescale has been lengthened.

The key features of the REF are:

  • Unified framework for research assessment and funding
  • Robust research quality profiles for all disciplines
  • Emphasis on identifying and encouraging excellent research
  • Greater use of metrics than at present – including bibliometrics “for all disciplines where these are meaningful”
  • Reduced burden on HEIs

Timetable:

  • Up to spring 2009 – bibliometrics pilot and other development work
  • Spring/Summer 2009 – consult on all main features of the REF
  • By Sept 2009 – decide on main operational features of the framework

Use of bibliometrics:

  • Bibliometrics to be used in those disciplines in which they are meaningful – alongside other data and information
  • Interpretation by expert panels
  • To be based on citation rates per paper – not journal impact factors – and taking account of worldwide norms for the field, year of publication, and document type
  • Results to be aggregated for substantial bodies of work; presented as a citation profile

About to run a pilot – starting imminently, with 22 institutions involved. This will look at a number of issues:

  • Which disciplines?
  • Which staff and papers should be included? Universal or selective coverage? Are papers credited to the researcher or the institution?
  • How to collect data – and the implications for institutions (looking at Web of Science and Scopus for bibliometric data, but need institutions to at least identify the papers that ‘belong’ to them)
  • Which citation database(s)? (as mentioned looking at WoS and Scopus – they have different coverage, and continue to develop – what is best for the pilot, may not end up being the best for the REF, or may change over time – need to pick the best one at the time)
  • Refining the methods of analysis – including normalisation by field and handling of self-citation
  • Thresholds for the citation profile
  • Interpretation by expert panels

Last point is key – allows flexibility in terms of what numbers are presented, as long as the expert panel know what is included and what is not (e.g. this could be a way of dealing with the self-citation issue)
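
As an aside, my rough understanding of what a ‘citation rate per paper, normalised for field, year and document type, aggregated into a citation profile’ might look like in practice is sketched below – the field averages, thresholds and band labels are entirely made up for illustration, not anything HEFCE has published:

    from collections import Counter
    from typing import Dict, List, Tuple

    # Hypothetical worldwide citation averages keyed by (field, year, document type).
    # Real benchmarks would come from Web of Science or Scopus.
    WORLD_AVERAGE: Dict[Tuple[str, int, str], float] = {
        ("atmospheric science", 2005, "article"): 8.0,
        ("atmospheric science", 2007, "article"): 3.5,
    }

    # Illustrative thresholds (multiples of the world average) defining profile bands.
    BANDS = [(0.0, "below average"), (1.0, "average to 2x"),
             (2.0, "2x to 4x"), (4.0, "4x and above")]

    def normalised_impact(citations: int, field: str, year: int, doc_type: str) -> float:
        """Citations per paper relative to the worldwide norm for field/year/type."""
        return citations / WORLD_AVERAGE[(field, year, doc_type)]

    def citation_profile(papers: List[dict]) -> Counter:
        """Aggregate a body of work into a banded citation profile."""
        profile = Counter()
        for p in papers:
            score = normalised_impact(p["citations"], p["field"], p["year"], p["type"])
            label = next(lab for lo, lab in reversed(BANDS) if score >= lo)
            profile[label] += 1
        return profile

    papers = [
        {"citations": 20, "field": "atmospheric science", "year": 2005, "type": "article"},
        {"citations": 2, "field": "atmospheric science", "year": 2007, "type": "article"},
    ]
    print(citation_profile(papers))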

It will be possible to compare the results to the 2008 RAE and investigate discrepancies and why they arise

The pilot institutions are:

  • Bangor
  • Bath
  • Birmingham
  • Bournemouth
  • Cambridge
  • Durham
  • UEA
  • Glasgow
  • Imperial
  • Institute of Cancer Research
  • Leeds
  • LSHTM
  • Sussex

The timetable is:

  • May-Jun 08 – Select HEIs/contractors
  • Aug-Oct 08 – Data collection
  • Nov 08 – early 09 – Data analysis
  • Spring 09 – Pilot results

Participating institutions will be asked to:

  • Provide as much data as available on all researchers and publications eligible for the 2008 RAE (in relevant disciplines)
  • To be matched to Web of Science initially, and supplemented by additional records
  • We will evaluate issues of completeness and accuracy and seek feedback from the institutions
  • JISC project to document the data systems requirements

Some issues for institutions:

  • As REF developed need to assess the potential impact on the sector (accountability burden, equal opportunities and perceived behavioural incentives)
  • Information management
    • Populating database for the initial bibliometrics exercise (Looking at Australia who are moving in a similar direction, and have had a requirement for collecting information on their published research outputs for some time)
    • Ongoing management of bibliographic data for multiple purposes
  • Management information and internal resource allocation
  • Relationships between citation data coverage and publication outlets – where are the gaps in WoS and Scopus?

For more information see http://www.hefce.ac.uk/research/ref and/or join the REF-NEWS mailing list (details at that URL).

Digital Preservation Challenges: planning and implementing solutions for scientific publishing

This talk by Dr Andreas Rauber (as an aside, it is great to see some academics here, as opposed to librarians – although there are quite a few of those, and publishers, here as well) from Vienna University of Technology (in the Dept of Software Technology and Interactive Systems)

Andreas starting with ‘what is digital preservation?’, then going on to cover preservation planning and ‘Plato’, a preservation planning tool.

So – why do we need digital preservation?

Basic issue of ‘keeping the bits alive’ – but this is not really digital preservation. We know a lot about this kind of work, and it can be a lot of work, but, bottom line, it can be done.

However, maintaining the bits is just a small part of the problem. Digital objects require a specific environment to be accessible – files need specific programs, programs need specific operating systems, and operating systems need specific hardware components.

Software and Hardware environment is not stable – you encounter issues where:

  • Files cannot be opened any more
  • Embedded objects are no longer accessible/linked
  • Programs won’t run
  • Information in digital form is lost – usually complete failure rather than gradual degradation

Strategies for Digital Preservation (using http://unesdoc.unesco.org/images/0013/001300.130071e.pdf for the categories):

  • Short term
  • Medium term
  • etc.

Andreas going to look at two approaches:

Migration

  • Transformation into different format

Usually get some changes in transformation – if you do this several times, you will have ‘damage’ to the digital object

Emulation

Emulation of h/w or s/w

Both an advantage and a disadvantage that the object is rendered identically – you can access the object, but you may not know how to use the interface.

Looking specifically at Scientific Publishing – what are you trying to preserve?

  • The publication
  • Context of the publication
  • Adjunct material (slides, notes, videos)
  • Demos, exercises, interactive elements
  • Data sets and simulations
  • Community aspects – discussion etc.

So – Digital Preservation is complex

You need to understand both the object, and its use and context.

So – ‘Preservation Planning’…

There are many different strategies – how do you know which one is most suitable – and how do you know if you’ve been successful 10/20/50 etc. years later?

As part of the DELOS DP Cluster a workflow was developed, which has now been refined and integrated within PLANETS. It is based on the ‘utility analysis’ approach developed in Vienna.

Plato is a tool which helps with preservation planning – you need to:

  • Define requirements (requires detailed analysis of what you want and what is important – e.g. for a web page, is the appearance of the hyperlinks important, or just the target information; if there is a web counter, is it preserved at a specific date, does it count hits on the archived copy, does it continue to count hits on the ‘live’ copy? etc.)
  • Evaluate alternatives (including not to draw up preservation plan if you want)
  • Consider results
  • Build preservation plan

All this looks interesting, but suggests that this is going to be an incredibly expensive process (even just doing the preservation planning, never mind the actual preservation). This drives it home – we need to be good at deciding what is worth preserving in the medium/long term, and only embark on this kind of exercise where we know we want to do the preservation.

Plato is a ‘concretization’ (is that a word?) of the OAIS model, which follows recommendations of TRAC and nestor – it is a pretty generic workflow, so should be easy to integrate it into different settings.

In a case study of electronic theses, they found that plain text doesn’t satisfy several minimum requirements, RTF is weak on appearance and structure, and that deactivation of scripting and security settings are knock-out criteria (for PDF)

Andreas stressing the key role of the ‘defining requirements’ stage – this is the point at which people start identifying what is important, and you can start to see cost vs. benefit
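
My rough reading of the ‘utility analysis’ idea is a weighted scoring of preservation alternatives against the requirements, with knock-out criteria zeroing an alternative out. A toy sketch – the weights and scores are invented, loosely echoing the e-theses case study above, and this is not Plato’s actual algorithm:

    # Each requirement gets a weight; each alternative gets a score per requirement.
    requirements = {
        "appearance preserved": 0.3,
        "structure preserved": 0.3,
        "scripting/security handled": 0.2,
        "tool support / cost": 0.2,
    }

    # Illustrative scores on a 0-5 scale (0 acts as a knock-out criterion).
    alternatives = {
        "keep as plain text": {"appearance preserved": 0, "structure preserved": 1,
                               "scripting/security handled": 5, "tool support / cost": 5},
        "migrate to RTF":     {"appearance preserved": 2, "structure preserved": 2,
                               "scripting/security handled": 4, "tool support / cost": 4},
        "migrate to PDF/A":   {"appearance preserved": 5, "structure preserved": 4,
                               "scripting/security handled": 3, "tool support / cost": 3},
    }

    def utility(scores: dict, weights: dict) -> float:
        """Weighted sum, with any score of 0 treated as a knock-out (overall utility 0)."""
        if any(scores[r] == 0 for r in weights):
            return 0.0
        return sum(weights[r] * scores[r] for r in weights)

    for name, scores in alternatives.items():
        print(f"{name}: {utility(scores, requirements):.2f}")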

http://www.ifs.tuwien.ac.at/dp

http://www.ifs.tuwien.ac.at/dp/plato

Some conferences coming up on Digital Preservation including one at the British Library on 29th July.

Q: Who should take responsibility?

A: Need people from the ‘user’ side who at least know what they want, also need skills in IT, and input from Management on cost etc.

Once there are a number of examples of needs analysis for a ‘type’ of material (e.g. e-theses), these can be consolidated into a shareable template. However, a number of studies are needed first to capture a wide range of requirements – otherwise publishing the requirements from the first study risks others narrowing their view down to whatever the first institution identified.

RIOJA – overview, findings and toolkit

This session starting with Dr Sarah Bridle, a physicist from UCL. Sarah saying that because of arXiv, from her point of view the library could happily cancel all the relevant journal titles in her subject area, since they only serve to ‘badge’ the papers, and she doesn’t need the subscription for that to exist.

However, clearly some issues with the functions journal server (peer-review) and the need for a business model, or process, which enables these functions.

A number of things led to the idea of an ‘overlay journal’ – very little copy editing seems to happen in the published version, and there is often confusion (e.g. with page numbers) between the arXiv version and the published version. So, they started talking to the library about the idea of running an ‘overlay journal’

Now Dr Panayiota Polydoratou from UCL library relating work that they undertook in partnership with other institutions (funded by JISC) to look at the issues – i.e. the RIOJA project.

The aims of RIOJA were:

  • Build the RIOJA toolkit
    • APIs etc.
  • Sustainability
    • estimate running costs for arXiv overlay journal

[Panayiota going very very fast – can’t get all of this – I guess it will be on the RIOJA website somewhere…]

Started by surveying 4000+ researchers (got 683 responses), and interviewing editorial boards and publishers. In general the latter two categories are interested in looking at new models (which isn’t so surprising I guess – the current model, like music etc., is clearly not going to work well in the internet age)

Interestingly when researchers were asked about what they thought ought to be prioritised in terms of payment, paying referees was very low on the agenda.

More work needs to be done on exploring sustainability issues, business model and potential implementations.

Not very clear exactly what RIOJA has done except the survey – but now Antony Lewis from the Institute of Astronomy, Cambridge is going to talk about the APIs and show demos so perhaps that will make it clearer…

OK – so technical objectives were:

  • Develop open API for communication between repositories and journals
  • Develop software for hosting overlaid journals using the API
  • Demonstrate journal s/w using API implemented on arXiv.org repository
  • Develop version of ePrints repository s/w to make complete open source package for any subject area

The RIOJA APIs support:

  • Paper metadata communication
  • Author authentication
  • Integrated submission
  • Publication status

I’m worried there has been no mention of SWORD or OAI-PMH yet – which would seem to cover some of these areas?
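
To make the list above a bit more concrete, here is the kind of client-side interaction an API like this implies. This is purely illustrative – the endpoint paths, parameter names and response shapes are invented by me, not taken from the RIOJA specification (which is linked further down):

    # Hypothetical sketch only: none of these URLs or fields come from the RIOJA spec.
    import json
    import urllib.request

    REPOSITORY = "http://repository.example.org"   # hypothetical base URL
    JOURNAL = "http://journal.example.org"         # hypothetical base URL

    def get_paper_metadata(paper_id: str, version: int) -> dict:
        """Fetch metadata for one specific version of a repository paper."""
        url = f"{REPOSITORY}/api/papers/{paper_id}/v{version}/metadata"
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)

    def submit_to_journal(paper_id: str, version: int, author_token: str) -> dict:
        """Integrated submission: tell the journal which repository version to review."""
        payload = json.dumps({"paper_id": paper_id, "version": version}).encode()
        req = urllib.request.Request(
            f"{JOURNAL}/api/submissions",
            data=payload,
            headers={"Authorization": f"Bearer {author_token}",   # author authentication
                     "Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    def publication_status(paper_id: str) -> str:
        """Ask the journal whether a repository paper is submitted / in review / published."""
        with urllib.request.urlopen(f"{JOURNAL}/api/status/{paper_id}") as resp:
            return json.load(resp)["status"]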

There were specific issues:

  • Paper version tracking – only a specific version of the paper on the repository should be ‘published’
  • Science specific issues e.g. handling equations
  • Simplifying workflows, dealing with ‘continuous publication’

Journal software used was ‘Open Journal Systems’. The submission was done ‘via repository ID’ – presumably meaning this is what was used as the identifier (surely it would have been better to use a DOI or other independent identifier?)

Antony now showing some screenshots etc. Submitter asked to assign keywords as part of ‘submission’ process.

Conclusions:

  • Software and API infrastructure now mostly in place
    • academics who want to run journals, covering the costs themselves, can do so using this
  • Make any number of journals based on any number of repositories in any subject areas
  • Aim to have suite of Open Source s/w for easily setting up repositories
  • Some work still to do
    • Support for metadata with equations (LaTeX display), referee report publication, options for ways of handling copyediting
  • Then just need some good editors and a small amount of money

Test site: http://arxivjournal.org

Source code and information: http://arxivjournal.org/rioja

API specification: http://cosmologist.info/xml/APIs.html

A number of questions:

  • Are we not rushing ahead here when we haven’t sorted some basic problems around citation (e.g. things cited in arXiv before publication)?
  • Mention of ‘Storelink’ as a project looking at some of these areas
  • I asked about the overlap between SWORD and OAI-PMH and the RIOJA APIs – not convinced by the answer to be honest – but my main concern is the lack of overlap here.
  • Some defence of ‘publishers’ and the roles they play from the floor: legal issues, communication, marketing, specialist expertise etc. Bottom line – they don’t think you can do high quality publication without publishers, and if you got rid of publishers from the system, you would have to reinvent them in some form.

Journals and Repositories: an evolving relationship

This first session is the keynote, by Stephen Pinfield.

Stephen is going to give some background and definitions, then look at three different models of interaction between journals and repositories, finally looking at issues around implementation of these models.

Stephen takes us back to the Budapest initiative which identified two routes to ‘Open Access’:

  • OA Repositories (a.k.a. ‘green’)
  • OA Journals (a.k.a. ‘gold’)

Stephen is saying that these have sometimes been seen as ‘competitive’ – perhaps mutually exclusive? – but we are now seeing them as complementary, or even overlapping, approaches.

Stephen breaking down some terminology:

  • Repository: a set of systems and services which facilitates the ingest, storage, management, retrieval, display and reuse of digital objects
  • Journal: ‘a collection of quality-assured articles normally within a defined subject area, made available at regular intervals under a single ongoing title’ (n.b. quality-assured in an academic context is usually peer review) – I’m surprised there is no mention of ‘edited’ here – Stephen saying brand is key, journal titles are a ‘brand’
  • Open Access: where the full content is freely, immediately and permanently available and can be accessed and reused in an unrestricted way. Stephen stresses the ‘timeliness’ of OA as a key point

Stephen now moving onto 3 possible models for repositories and journals interacting:

  1. Repository -> Journal
  2. Journal -> Repository
  3. Repository -> Overlay Journal

1. Repository -> Journal

Stephen describing possible workflow associated with this:

  • Author writes paper, and submits to journal for publication and puts pre-print in repository (which is immediately available)
  • Paper is peer-reviewed by journal, revised, and the author submits final version to the journal, and to the repository (post-print)
  • Journal publisher edits and formats paper, and publishes

In general the ‘repository copy’ is made available before the journal copy. Stephen notes that the pre-print doesn’t have to be deposited, but in typical scenario for this model (arXiv) this is what happens.

There is an assumption that once the paper has been formally published in a journal, usage switches from the post-print (post peer-review that is) copy in the repository, to the journal copy. Study by Henneken et al (http://arxiv.org/abs/cs/0609126) shows that this is what happens (in astronomy at any rate).

All this suggests that in this model repositories and journals can happily coexist

2. Journal -> Repository

  • Author goes through publication process with OA/hybrid journal – with peer-review, revisions, editing and formatting etc. The article is published formally in the journal
  • After formal publication, the author/publisher deposits paper in repository, the repository processes the paper (e.g. restructuring, re-formatting), manages preservation, and makes the paper available

Some key points are in this model the copy in the repository is the published version (unlike above), and it includes management of preservation (which isn’t handled within model 1 at all – although it may be handled outside the model)

This is the model taken by Wellcome Trust/UKPMC particularly. Described by R Terry (2005). In this model the repository sets up the article for re-use and analysis, including ‘mining’ (i.e. machine parsing of the text to extract meaning or data)

UKPMC currently 600,000 hits a month, with 60,000 article downloads a month. Current content stands at 1.4 million full text articles, increasing at about 40,000 articles a month. Specifically this model is being driven by funder mandates.

David Prosser (SPARC) has suggested a development on this model, which is very similar, but rather than there being a ‘published’ copy and a ‘repository’ copy, the publisher publishes to the repository – so one copy, held in the repository, with the publisher linking to the paper in the repository. In this situation (as opposed to the one Stephen is about to come to), the process is still driven through a traditional ‘publisher’ workflow – it is just the final location of the article is different.

Before coming onto the final model Stephen is noting the functions of Scholarly Communication:

  • Registration (e.g. register first discovery against specific researchers)
  • Certification (quality)
  • Dissemination
  • Archiving

David Prosser notes a 5th function:

  • Reward

This is the idea that by publishing in a known journal, this is recognised in ways that reward the author (promotion, reputation etc.)

3. Repository -> Overlay Journal

In the final model that Stephen is going to describe, the author interacts with both the publisher and the repository – unlike above, where the author generally works with the publisher only.

  • Author writes paper, and submits as pre-print to the repository, which makes it available
  • an ‘Overlay journal’ selects the paper (made available by the repository), and subjects it to peer review
  • The author revises the paper on the basis of peer-review
  • Publisher edits and formats the paper (possibly?), and publisher/author deposits paper (post peer-review and post editing) in the repository which deals with management, preservation, access etc. as in model 2 described above

This model has been described by both JWT Smith (1997) and AP Smith (2000). The model involves the repository as the primary means of management and dissemination of content, where the publisher provides quality assurance etc.

Stephen now covering some of the ‘issues’ around these models:

  • Changing shape of the ‘journal’
    • ‘Deconstructed journal’
    • Journal as quality stamp
    • Journal as brand
  • Changing shape of the ‘article’
    • Single article in multiple journals (I can see this, but wonder how interested academics are in this – possibly generating multiple different versions as each journal applies different quality measures, peer-review for each journal could end up with contradictory revisions?)
    • Version identification and management becomes key in this type of scenario – integrity assurances; standards; custom and practice for citations; version of record
  • Changing shape of ‘publication’
    • Formal publication and dissemination
    • Publication process

Overlay journal very ‘new’ – not very many examples of it yet.

Stephen says that in all of these models the ‘repository’ is key. This seems a bit self-fulfilling based on Stephen’s approach – he hasn’t considered any other model here – or possibly his definition of a repository is so encompassing that any system which makes the article available becomes a repository. I’d argue that although each model relies on a ‘repository’, it wouldn’t have to bear much resemblance to what we have at the moment (especially if you accept, as model 1 does, that preservation may take place outside the model)

Stephen noting that all the models (to some extent) separate dissemination from quality assurance.

Stephen throwing out some questions:

What are business and funding models – for repositories as well as for publishers, and for research funders and institutions

We need to develop models that allow/enable/encourage(?) institutions to provide funds to authors for OA publication in an author-pays mode.

There are still many issues relating to ‘content management’ – technical etc.

There are policy issues – funder requirements, institutional practices, and the REF (Research Excellence Framework) which is coming, and will contain a citation analysis component using figures from the ‘traditional publishing process’.

Stephen strongly believes the REF risks stifling innovation by measuring in a specific way, and pushing us back to reliance on traditional models (e.g. academics will want to publish in ISI-indexed journals to get figures into ISI citation measures) – I suspect he is right…

Finally Stephen stressing the cultural issues around the way scholarly communication works. However, all the challenges that are there, challenge the traditional publishing model as well as being issues when trying to develop new models.

We need to move away from ‘paper-based’ models to harnessing the power the internet offers.

Question at the end from someone from ALPSP asking how it can be efficient/economic for each institution to run a repository. Stephen says that repositories deliver more functions to institutions than just dissemination etc. (I think this is a bit of a weak answer – the point made is a good one, and it is far from clear that we need each institution to run a repository to enable any of the models Stephen has described)

RIOJA – Repository Interface for Overlaid Journal Archives

I’m at a meeting today at Gonville and Caius College, Cambridge, about the RIOJA project. This is a JISC funded project looking at a new way of approaching publishing, with the concept of an ‘overlay journal’ – which the project defines as ‘a quality-assured journal whose content is deposited to and resides in one or more open access repositories’

However, the day will also have sessions on digital preservation, the REF and other related issues.