eScience, Scholarly Communication and the Transformation of Research Libraries

This talk is by Tony Hey – Corporate VP for External Research, Microsoft Research.

So, Tony is saying that we are seeing an ‘emergence of a new Data-Centric paradigm for research’, and that Web 2.0 students won’t use the library in the traditional way – so there is a need to redefine the role of the research library.

We have seen (and continue to see) an explosion in the amount of data being produced in scientific research – huge amounts of data being produced by instruments, simulations, sensor networks – we are able to ‘measure’ stuff to an overwhelming degree. Tony sees management and ‘curation’ of this data as a huge challenge for the research community – he says the scale of the challenge is one of the reasons he joined MS.

The ‘Scientific Data Deluge’ – data collection, data processing, digital preservation.

An example – ‘Fighting HIV with Computer Science’:
Research from a ‘Spam Blocking’ machine learning project, which then moved to use of machine learning in tools that scientists can use. The original project aimed to analyse huge amounts of data to decide whether each item was spam or not – this led to drawing out correlations in huge data sets on HIV.

Cyberinfrastructure – this is the real problem, the ‘calculation’ bit is easy, it is the infrastructure needed (both technical and organisational) that is the problem. Tony references the NSF report on this (http://www.nsf.gov/pubs/2007/nsf0728/index.jsp).

Tony makes the point that it isn’t just about e-Science, but e-Research – the same issue applies to arts and humanities.

Tony says research today is:

  • Data intensive
  • Compute intensive
  • Collaborative
  • Multi-disciplinary

Today – web users are using tools that could really help here, but typically Researchers are using custom standalone tools, the ‘sharing’ process is still via long publication process, physical meetings etc.

In eResearch, data is easily accessible and shareable (e.g. http://cas.sdss.org/dr5/en), services expose functionality (e.g. BLAST from the NLM, http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi?CMD=Web&PAGE_TYPE=BlastHome), and services are in the cloud rather than installed locally (e.g. Amazon Web Services – S3, EC2 – this is also used for home storage solutions – JungleDisk).

Researchers can be seen as ‘extreme information workers’ – looking for subtle signs in the information available.

Publications as live documents – starting to see examples of figures in electronic publications that are based on ‘live’ data – so the reader can change aspects of a graph, plot different scales, overlay other data etc.

Just discovered that quite a few of the slides that Tony is using are available at http://research.microsoft.com/workshops/CEfS2007/presentations/TonyHey.pdf (although this is from a different talk, many of the slides seem to be the same).

Microsoft are building a Virtual Research Environment (VRE) with the British Library – looks like a web portal with stuff like RSS feeds, funding opportunity alerts, saved searches, integration with MS tools (e.g. OneNote) for bibliography, Word and Excel 2007 – could add external tools to the ‘ribbon’ (e.g. library research tools).

Tony is going through his slides quite quickly so it is hard to capture everything. Now onto Scholarly publishing – the rules are changing – comparing to the Music Industry and music downloads – the scholarly publishing industry (publishers and libraries/universities/academics) needs to adjust.

Funding bodies now starting to make deposit of research results (publications, data and primary materials) mandatory as part of funding agreement (e.g. ERC)

Referencing an article by Paul Ginsparg, ‘As we may read’, published in the Journal of Neuroscience, Sept 20, 2006. Ginsparg was the driving force behind arXiv – he sees this model being adopted across all research areas. He also sees a role for libraries and societies – perhaps reclaiming roles they fulfilled in the 19th century. Tony suggests that libraries are not necessarily fulfilling this function – I would argue that universities are not clear they want this…

If you look at ranking of universities on Google Scholar – University of Southampton is the top ranking UK University in this measure – which isn’t a ‘quality’ judge, but think about how available this information is – this means that papers from UoS get more visibility, more citations, more influence.

All the tools to support this need to be completely straightforward for the researcher – no extra effort.

The EU PLANETS Project – Digital Preservation – use of XML – specifically Office Open XML (OOXML) – now an ECMA Standard – but also an open source ODF to OOXML converter – ODF is the ‘Open Document Format’.

Tony Hey leaves us with a challenge – once eResearch is ‘in the Cloud’, where is the Research Library?

Question: Will commercial publishers be destroyed by OA?
Answer: No – MS working with publishers. Tony thinks the ‘big’ ones will be fine – Science, Nature etc. But smaller publications may be more challenged – however Tony is keen to work with smaller publications to see how this can work – he doesn’t want them to go out of business but he believes the business model has to change.

Question: Where does payment come in?
Answer: Tony seems not particularly in favour of Author pays – sees problems with the model

Question: Who curates data in ‘mashups’?
Answer: It’s a problem – if data is coming from different sources, are they all conforming to the same curation standards? It seems unlikely – perhaps this is where there is more commercial opportunity.

Question (from me): Do researchers want to share their data – data is valuable?
Answer: Tony’s personal opinion is that they should have to share their data, but perhaps after a certain amount of time – keen to stress this is his personal view.

Euan Semple – keynote

The opening keynote is from Euan Semple (http://www.euansemple.com/). Euan is at the BBC as head of Knowledge Management, and has had to help the BBC adapt to ‘Web 2.0’. When faced with a manager who said ‘if I gave my staff access to that kind of tool, they would just end up wasting their time’ – Euan’s reply was ‘have you thought that your recruitment policy might not be working?’.

So Euan’s opening question is what will ‘Businesslike’ look like when business isn’t like business any more?

Euan’s talking about tools he has used or seen used in the process of implementing technology in the area of KM. Firstly ‘Talk.Gateway’ – a discussion/chat board. He draws a distinction between this approach and ‘document management’ systems "where information goes to die gracefully". He suggests that using something like a discussion board allows you to access all the collective knowledge of the organisation (in the case of the BBC, the collective knowledge of 23k employees). An example was where someone asked about a policy and got 6 different answers, as well as a link to the official policy document. Euan’s point is that the discussion board didn’t create this situation, but surfaced it – don’t blame the system for surfacing inconsistencies or problems.

Second tool, Connect.Gateway – a place where you can post details about yourself – expertise, interests, contact details etc. plus the ability to join ‘interest groups’ to bring together people with common interests – especially the ‘new’ stuff that wasn’t captured in the corporate structure.

Euan is pretty sceptical about structure in these systems – taxonomies etc. He says that with the discussion board, originally there were just two sections. Eventually he came under pressure to provide more structure to the boards – however, as soon as he did this, usage dropped. He draws a parallel to a ‘Cotswold village’ that grows up gradually over time with no particular plan, compared to organised ‘new towns’ like Milton Keynes which are ‘planned’ to be systematic, but end up being very easy to get lost in. I’m not sure this completely holds up – there are definite advantages and disadvantages to both approaches, but the point with Milton Keynes is that once you understand the layout, it becomes quite easy. Also, with a systematic approach you can apply the same system to different places – once you understand the system for numbering/naming streets in one US city, you can apply it in others. However, each Cotswold village is different. To make this more concrete, the point is that once you understand LCSH you can apply that to each library catalogue you use, but if we all used local terminology then this would not be possible. On the other hand, this perhaps means you don’t get the advantage of localisation, which leads to ease of use for regular users. I think a dual approach can work, and there is no doubt that libraries have traditionally taken a very structured approach, and haven’t yet exploited the ‘organic growth’ approach to any extent.

Euan has just covered blogs as a communication and discussion tool, and is now mentioning wikis – these are all tools that have been used at the BBC.

Just as an aside – one of my reasons for blogging (especially conferences) is to share information with colleagues. However, I also want to engage in a discussion with a wider community online. At Imperial they have recently introduced the ‘Confluence’ system for blogging and wikis, which I think is great, and which some of my teams are already using, or investigating. However, at the moment the blogs we can set up on Confluence are only available internally – which wouldn’t support me in engaging with the wider community – hence I’m blogging on my own site instead. I hope that this might change…

So, wikis – Euan making the point that they are highly auditable, and to some extent self-correcting.

The BBC now have guidelines on blogging etc – again, something I asked about at Imperial before I started blogging as an Imperial employee – but at the moment there doesn’t seem to be anything in Imperial policies or guidelines relating to this.

Euan now coming onto tagging etc. name checking David Weinberger and his book ‘Everything is Miscellaneous’ (http://www.amazon.com/Everything-Miscellaneous-Power-Digital-Disorder/dp/0805080430). Euan is covering use of del.icio.us – with use of tags and his ‘network’ of trusted people who use del.icio.us. Also use of RSS to track this – and telling a story of how he did a talk, and when he came off stage his RSS aggregator picked up a new item tagged with his name, and found that it was someone blogging the talk he had just given (wonder if he will pick up this post?)

Euan mentioning the use of the Google blog search – different type of content to what you would get in response to a normal Google search – he argues more useful.

Now mentioning last.fm – I still haven’t got into this (probably don’t listen to enough music!) – but the point is the power of the ‘network’ – harnessing the knowledge of a network of people. Suggesting that something like this for TV is on its way – why watch a programmed ‘channel’ when you can choose to watch something that your ‘trusted’ network is recommending?

Now mentioning ‘Plazes’ which I haven’t come across – once you connect to a wireless network, it works out where you are and shows it on a map – so people can see where you are, and you can see if you are near to people you want to meet etc.

Twitter – the ‘intensity of the mundane’ – what about ‘on my way to meeting with CEO’
Facebook – making contact
Dracos.co.uk – tracks changes to BBC News homepage – allows you to see stuff that has been changed – so can’t hide stuff that you’ve said…

Some final examples – InnoCentive, from a pharmaceutical company, where questions can be posted, and people can bid for answers – story of a member of staff at an Indian university who set the questions for students, and posted answers – one student got £75k for an answer.
A final lighthearted example of an online application – Meeting Miser – works out how much a meeting has cost the organisation based on time and salaries of those involved – the point being, don’t value physical meetings over virtual collaboration.
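
To make the Meeting Miser idea concrete, here is a rough sketch of the arithmetic in Python. I don’t know how the real application does its sums, so everything here is my own assumption: the function name `meeting_cost`, the use of annual salaries, and the figure of 1,800 working hours per year used to turn a salary into an hourly rate are all made up for illustration.

```python
def meeting_cost(annual_salaries, minutes, working_hours_per_year=1800):
    """Estimate the salary cost of a meeting.

    annual_salaries: one annual salary per attendee (hypothetical input).
    minutes: length of the meeting in minutes.
    working_hours_per_year: assumed figure for converting an annual
        salary into an hourly rate (not taken from Meeting Miser).
    """
    hours = minutes / 60
    # Each attendee costs their hourly rate multiplied by the meeting length.
    return sum(salary / working_hours_per_year * hours
               for salary in annual_salaries)


if __name__ == "__main__":
    # Two attendees on £36,000 and £54,000 in a one hour meeting.
    cost = meeting_cost([36000, 54000], minutes=60)
    print(f"£{cost:.2f}")  # prints £50.00
```

Even with these crude assumptions the point comes across: a room full of people for an hour adds up quickly.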

Coffee time!

Talis Insight 2007

http://www.talis.com/applications/news_and_events/talis_insight.shtml

Over the next 2 days I’m at the Talis Insight conference in Birmingham (UK). Although Imperial don’t use any Talis products (and don’t have any specific plans to either), I’m hoping that the conference is still very relevant – the programme is varied, and although, as you might expect, it covers a number of Talis products, it also picks up on a number of trends in the Library technology sector.

I’m particularly looking forward to talks by David Patten, and Marshall Breeding about ‘Next Gen’ library catalogues, as well as looking at the Talis approaches to ‘Next Gen’, systems integration, ERM, and Resource/Reading list management.

I’m also hoping to use the event to restart my sporadic blogging career – usually this is limited to conferences, but I’m hoping that I might manage something a bit more frequent from now on.

If anyone else is blogging or tracking this, I’m going to use Insight07 to tag these posts.