Open World Thinking

This final presentation session is by Nadeem Shabir from Talis.

Drifted off on another line of thought (sorry Nadeem), so missed the detail at the start.

Nadeem saying that the issue with linked open data is not technological change – it’s a paradigm shift. Tim Berners-Lee said “Linked Data is the Semantic Web done right and the Web done right.”

Nadeem talking about ‘design appropriation’ – like using a book to prop up a television or monitor – not something the creator of the book intended.

Nadeem going to look at two things:

  • Openness of Description
  • Openness of Access (not Open Access)

The Openness of Description is about agreeing on shared ways to describe things – this allows you to share, integrate, relate information.

Nadeem relating this to the 8 Learning Events described by Alan in the earlier session – expressed as an ontology, these would allow a common vocabulary for describing learning activities.

Talis has been working on ontologies and has published:

  • Academic Institutions Internal Structures
  • Generic Lifecycle (workflow)
  • Resource List (using SIOC, BIBO, FOAF)

All these are available at www.vocab.org
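
A quick aside from me: as a purely hypothetical illustration of what this ‘openness of description’ looks like in practice, here is a minimal Python/rdflib sketch describing a single reading-list item using the shared BIBO, FOAF and Dublin Core vocabularies – the URIs and names are invented for the example, not taken from the talk.

```python
# A minimal sketch (my example, not from the talk) of describing one
# reading-list item with shared vocabularies (BIBO, FOAF, Dublin Core).
# All URIs and names below are hypothetical.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC, FOAF, RDF

BIBO = Namespace("http://purl.org/ontology/bibo/")

g = Graph()
book = URIRef("http://example.org/lists/module-101/item/1")  # hypothetical
author = URIRef("http://example.org/people/1")               # hypothetical

g.add((book, RDF.type, BIBO.Book))
g.add((book, DC.title, Literal("An Introduction to Linked Data")))
g.add((book, DC.creator, author))
g.add((author, RDF.type, FOAF.Person))
g.add((author, FOAF.name, Literal("A. N. Author")))

# Because the description uses shared terms, other people's tools can
# consume, integrate and relate it without prior agreement.
print(g.serialize(format="turtle"))
```

A real resource list would layer list-specific terms from the Resource List ontology on top of these shared vocabularies.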

I wonder if I can find time to describe Imperial using the first of these? I imagine this could be quite a bit of work…

Now Nadeem going on to ‘Openness of Access’ – this is

  • anywhere
  • anytime
  • anyhow

Nadeem argues that this is the key to personalised learning – descriptions and access

In the ‘web of data’ you may publish or consume data – but you don’t ‘own’ it in a traditional sense.

Nadeem has written a piece for the recent issue of ‘Nodalities’ which describes this.


Project Zephyr: Letting Students Weave their own Path

This presentation is from Ian Corns of Talis.

Lecturers want to teach differently, and students are learning differently. Students are:

  • used to a multimedia environment
  • always connected
  • work in groups
  • … insert your ‘millennial’ attributes here etc.

Talis has/had a ‘reading list’ application called ‘Talis List’. However, they realised it was falling short in some areas:

  • Didn’t embrace the richness of ‘resources’, as opposed to a more traditional ‘book’ list
  • Didn’t offer value to lecturers – they saw it as a ‘library’ app that didn’t offer them any benefit

Decided to develop a new approach which addressed value to the lecturer:

  • Needed to be easier to author a list in the system than in Word or the usual authoring tools
  • Offer ability to embed the list into any (web) environment.
  • Improve quality of lists – feedback based on usage (what resources used, in what assignments etc.)
  • Feedback directly from students
  • Start to see connections between lists within and between institutions

From a data point of view, by compiling a set of resources into a list for a specific module, the ‘expert’ is adding implicit information to the resources.

Now Ian showing some screenshots. Talking about the power of the system built on a flexible platform – e.g. you can say ‘show all key resources from all my courses for the next 2 weeks’

Hard to capture screenshots here – there are some demos and screens etc. at http://blogs.talis.com/list/


Project Xulu – Creating a Social Network from a Web of Scholarly Data

This session is by Chris Clarke from Talis.

The web as it exists at the moment is a ‘web of documents’ – human readable, but not so good for machines.

To exemplify the problem, a novice user might type into Google: “how many people were evacuated from New Orleans following Hurricane Katrina”

Google finds a relevant result – the Wikipedia article on the “Effect of Hurricane Katrina on New Orleans” – but the data is hidden in the document, and Google fails to pick up other relevant facts buried in the article.

Chris says what we need is a ‘machine-readable’ web – and this is the promise of the ‘semantic web’.

To go back to the Wikipedia example – DBpedia is an attempt to extract ‘facts’ from Wikipedia text and make them available in machine-readable format. Powerset is a search engine built on top of this; searching Powerset with the same query as above, Powerset ‘understands’ the query and serves up appropriate answers.
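
To make the DBpedia half of this concrete, here is a small sketch of my own (not from Chris’s talk) that asks DBpedia’s public SPARQL endpoint for some of the facts it has extracted from the Hurricane Katrina article – it assumes the SPARQLWrapper Python package is installed.

```python
# Ask DBpedia's public SPARQL endpoint for facts extracted from the
# Wikipedia article on Hurricane Katrina. A sketch, not Powerset's approach.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://dbpedia.org/sparql")
endpoint.setQuery("""
    SELECT ?property ?value WHERE {
        <http://dbpedia.org/resource/Hurricane_Katrina> ?property ?value .
    } LIMIT 10
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for row in results["results"]["bindings"]:
    print(row["property"]["value"], "->", row["value"]["value"])
```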

The problem I have with this example is that it isn’t clear to me that Google doesn’t ‘understand’ (at least to the same extent) the question. Doing semantic analysis of the question is different to using the ‘semantic’ nature of the data.

Talis provide a ‘platform’ (an RDF triple store, plus relevant services) – and they wanted to see how this could support a ‘web of scholarly data’. They took the metadata for 500 articles in XML format, loaded it into the platform, and started to see what links etc. they could get out. Within the data set there were references to 19,808 articles (mainly in citations) and 21,209 people.

This is all very interesting, but Talis wanted to see how this could actually be ‘useful’. This is what Xiphos is meant to demonstrate – a scholarly social network based on the data. The Xiphos network was built in 4 weeks.

So Chris is showing screenshots (although he assures us they do have a real system, and we can look at it afterwards – there is a problem with the network connection at the moment). Taking a possible example of a real researcher, they enter a keyword that is actually a person’s name: ‘flower’.

Results are grouped by type:

  • Things
  • People

By exploiting the data available in the platform, extracted from the 500 articles, they can show the person ‘T P Flower’ and their connections (via co-authoring) to others.

If the user creates an account on the system, then it can see if it already has information about them, and connect up the existing information to the newly created account. This means the user has a pre-populated social network and ‘stuff’ which they can refine.

By exploiting the information in the store, they can break down relationships in 4 ways:

  • People you know
  • People you cite
  • People who cite you
  • People you are watching

The middle two are automatically generated from the metadata from the articles. The last one is a ‘one way’ relationship – you don’t know who is watching you (though perhaps you see an overall number). They had good feedback from academics on this last function.
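
A sketch of how the ‘people you cite’ relationship could be derived from the article metadata – the schema terms here (ex:author, ex:cites) are invented for illustration and are not the actual Xiphos data model:

```python
# Derive "people you cite" from authorship and citation triples.
# The ex: schema and the person URI are hypothetical.
from rdflib import Graph, URIRef

g = Graph()
# ... assume g has been loaded with article, author and citation triples ...

me = URIRef("http://example.org/people/t-p-flower")  # hypothetical identifier

PEOPLE_YOU_CITE = """
    PREFIX ex: <http://example.org/schema/>
    SELECT DISTINCT ?citedAuthor WHERE {
        ?paper      ex:author ?me ;
                    ex:cites  ?citedPaper .
        ?citedPaper ex:author ?citedAuthor .
        FILTER (?citedAuthor != ?me)
    }
"""

for row in g.query(PEOPLE_YOU_CITE, initBindings={"me": me}):
    print(row.citedAuthor)
```

‘People who cite you’ is the same pattern with the citation direction reversed.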

As well as each person having a network page, each paper also has a ‘network’ – showing citations and cited by, as well as presenting a thumbnail preview of the paper.

You can add papers to ‘collections’ (which can be public or private) – and people can ‘belong’ to collections as well as ‘watch’ collections (sounds a bit like the idea of a ‘Twine’)

Alongside this they looked at adding ‘subjects’ using schemes (specifically MeSH), and events. They also had an idea around a ‘Vault’ which allows document storage – but weren’t sure if that should be part of the remit of a system like this, or whether it should be held elsewhere.

This was all from a set of data on 500 articles, and is just one approach to building applications on top of this data. They chose this because it could be used to ‘clean’ data – users would improve the linkages etc. You could build many other types of application.

Chris stressing that although they see a place for Talis in this environment, they believe it is necessary to see many players – suggesting PubMed, Blackwells, Ingenta etc. could be part of the picture.

Question: Do you have a business model in mind?

Answer: This is just a prototype – different approaches depending on your point of view (publishers different to Open Access)

Question: In an environment where you want multiple players how do you get them all working together. Are there tools that do this?

Answer: Open Data looking at some of these issues. Google and Yahoo starting to exploit this type of semantic data.

Question: When users make modifications to ‘the graph’, how do you deal with ‘versioning’?

Answer: Changes are stored in such a way that you can roll back to previous versions.


Formalising the information – using Hybrid Learning Model to Describe Learning Practices

This session by Dr Alan Masson, University of Ulster.

Alan is the director of the Centre for Institutional E-Learning Services to Enhance the Learning Experience – the emphasis is on the learning experience – see http://cetl.ulster.ac.uk/elearning

The group looked at using a ‘rubric’ approach, but decided this came across as ‘judgemental’ – do this, do that – which teachers would not take to. Instead they decided to go for a ‘modelling’ approach.

Quote from JISC (not sure where from) along the lines of – teachers don’t possess a vocabulary to describe their teaching practice/pedagogy.

Learning Design Issues:

  • Benefits:
    • Provides a structure within which content and assessment can be placed
    • Formal schemas and vocabularies
  • Drawbacks
    • Looks to include resources and assessment
    • LD tools not reflective in nature
    • UI of tools not yet mature

The Hybrid Learning Model brings together “8 Learning Events Model” (8LEM), from the University of Liege, and the “Closed set of learning verbs” by Sue Bennett from the University of Wollongong.

The 8 learning events are:

  • Experiments
  • Debates
  • Creates
  • Imitates
  • Receives
  • Explores
  • Practices
  • Meta-Learns

They found academics could relate to these, and divide their interactions with students into these categories.

The learning verbs force reflection on precise nature of interactions and approaches.

They run sessions with teaching staff (45 mins – 1 hour), using these tools to get them to reflect on their practice. This gets filled into a ‘grid’, capturing different activities and categorising them. The grid contains the columns:

  • Activity/Task/Objective
  • Learning Event (from 8LEM)
  • Teacher’s Role (from verbs)
  • Learner’s Role (from verbs)
  • Learner prompts
  • Tools, Resources and Comments (depending on activity – not always used)
  • Portfolio Preparation issues (depending on activity – not always used)

The outputs from the modelling can be presented as:

  • Text based grid
  • Animated activity plan presented as a process walkthrough (for students) (swf format)
  • Mindmap (for teachers)

Alan now demonstrating the animated activity plan. This is done as a shockwave/flash animation – it would be interesting if this could be done via a tool like the SIMILE Timeline – perhaps by defining an XML and/or JSON format for the outcomes of the modelling exercise?
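
Thinking aloud: one row of the grid might serialise to JSON along these lines – the field names and values below are my guesses based on the columns listed above, not an agreed format.

```python
# Purely hypothetical: a JSON serialisation of one grid row, so that other
# tools (a timeline viewer, a mind-map generator) could render the model
# without the Flash animation. Field names are guesses, not a standard.
import json

activity = {
    "activity": "Prepare draft portfolio entry",
    "learning_event": "Creates",            # one of the 8LEM events
    "teacher_role": ["guides", "reviews"],  # from the closed set of verbs
    "learner_role": ["drafts", "reflects"],
    "learner_prompts": "Use the template provided in week 3",
    "tools_resources": ["VLE discussion forum"],
}

print(json.dumps({"activities": [activity]}, indent=2))
```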

What is happening as a result of the modelling process:

People are formalising processes that they have not articulated before. But, more importantly, these models both formalise and challenge those processes.

So – what has been the experience of using HLM in practice? Feedback gathered from 51 staff was generally positive, and reflected that it encouraged a focus on the learner’s perspective.

HLM provided a conversation point to discuss and share expectations.

Also gathered feedback from >100 learners. In general it helped learners to complete the portfolio, and they wanted to see other activities modelled in this way. It notably reduced the need to contact lecturers about how to build the portfolio based on the activity.

Some learner comments:

  • “it makes you structure your learning and expectations”
  • “I shall check my work against this model and tick off each section as I complete it” (‘checklist’ came up in several comments)
  • “Taking all points into consideration and using the advice to achieve the best marks”

I find this a little worrying – considering the question at the end of Carsten’s talk about students’ motivation to ‘just pass’ – this approach seems to encourage the attitude “if I do all of these things, I will pass” – a checklist approach to learning that doesn’t encourage broader engagement?

Alan says that the ‘high level’ conversational model that this approach provides can assist the selection of the right ‘tools’ to meet the learning context (including accessibility adjustments)

To date HLM has provided a way of:

  • Raising awareness of teaching and learning processes and in particular the learner perspective
  • Reflecting on, evaluating and reviewing current practice
  • Planning and designing course material / learning activities
  • Providing a reference framework to assist in course administration functions e.g. course validation and peer observation
  • Assisting students to adapt to new learning situations

Key benefits are cultural not technical.

Question: Missed this – sorry

Answer: Just got the end of this – students (and their expectations) are changing (a new set every year), while staff are static (generally) – this provides tools to help staff cope with this.

Question: (Me) Is there a standard format for information from the model, so that services can automatically be built on top of the data?

Answer: This is the next stage of the work. Also looking at an online tool for tutors – but they don’t want to lose the facilitation of the session.

Question: (Me) Does the model encourage a ‘process’ driven approach to learning? Students think because they’ve ticked off all the points they have ‘learned’

Answer: Only one year in – got usability, need to see use, and assess impact


Web 2.0 for Learning and Research

Carsten Ullrich from Shanghai Jiao Tong University. This is a slightly adapted version of a talk he gave at the recent WWW2008 in China.

Carsten’s slides are available at http://www.slideshare.net/ullrich

Carsten is based in the e-learning lab – looking at how learning can be made easier and more interactive using technology. Recently they have been looking at Web 2.0 technologies/approaches in learning. What they found is that these approaches were transformational – you have to change the way you teach to use these approaches.

Carsten is going to cover:

  • Motivation
  • Web 2.0 from a learning perspective
  • Web 2.0 as a research tool
  • Examples
    • Social bookmarking for learning object annotation
    • Microblogging for language learning
    • Totuba Toolkit
  • Lessons Learned

Starting with an outline of ‘Learning Management Systems’ – these support teacher-centred ‘administered learning’ – if a lesson is ‘mastered’ then the student is allowed to continue – a very typical ‘knowledge transfer’ paradigm, where the teacher imparts wisdom to the student.

An alternative approach is ‘Cognitive Tutors’ – built to support cognitive learning theories. These are very expensive to build – getting and encoding the principles of the theories into software is difficult.

What about Web 2.0? Often associated with constructivism – learner centred, emphasises collaboration and in-context learning, teachers provide assistance, advice etc.

When the e-learning lab looked at this they could not find an analysis of the technological foundations of Web 2.0 from an educational perspective. So the first question was ‘what is Web 2.0?’ – looking at comments from Tim Berners-Lee and Tim O’Reilly, they adapted the characterisation of Tim O’Reilly.

Web 2.0 stimulates individual creativity – enabling and facilitating active participation. The value of a web 2.0 service increases the more people are using it.

From an education perspective there is a large potential peer network.

The web provides diverse data on an ‘epic scale’ – huge variety of data, via browser and APIs, often annotated, and increasingly semantic and/or linked. From education – lots of information sources, access to data from real contexts which can be integrated into learning.

Web 2.0 supports the ‘architecture of assembly’. You can get students to combine data sources, but more importantly researchers/tutors can build rapid prototypes to try out.

Carsten illustrating what an iGoogle-based ‘PLE’ (Personal Learning Environment) for language learning could look like – emphasising the ease of the ‘drag and drop’ approach iGoogle supports.

Web 2.0 takes a ‘perpetual beta’ approach – continual improvements/refinements to software. For education this can be confusing and distracting.

Some additional principles of web 2.0:

  • Independent access to data
  • Leveraging the Long Tail
  • Lightweight models

Moving on, what can Web 2.0 do for research?

Lots of services available (often free), which can easily be combined to deliver new functionality. Again, prototypes can be built very quickly.

Two examples:

Social Bookmarking for Learning Object Annotation

Authoring learning resources is a time-consuming and difficult task. Need to:

  • Support lecturers with little or no knowledge of learning resources and metadata standards
  • Integrate into the existing workflow and LMS

So, they designed a method for lecturers to use delicious to bookmark resources, using predefined tags which represented:

  • Concepts and relationships of subject domain
  • instructional type
  • Difficulty level

Used prefixes to allow effective filtering: “sjtu:”

Within the LMS, for each page about a concept c, look up resources in del.icio.us and add them to the page. Very simple – almost no effort to develop the prototype.
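
A sketch of the tag-prefix filtering idea – the tag layout below (‘sjtu:concept:…’, ‘sjtu:type:…’) is my guess at how such a convention might be structured, not necessarily the exact scheme they used:

```python
# Filter bookmarks by a tag-prefix convention. The bookmark dictionaries
# stand in for whatever the del.icio.us feed returned; the tag layout
# ("sjtu:concept:...", "sjtu:type:...") is illustrative only.
def resources_for_concept(bookmarks, concept):
    wanted = f"sjtu:concept:{concept}"
    return [b["url"] for b in bookmarks if wanted in b["tags"]]

bookmarks = [
    {"url": "http://example.org/intro-to-recursion",
     "tags": ["sjtu:concept:recursion", "sjtu:type:tutorial", "sjtu:level:easy"]},
    {"url": "http://example.org/sorting-visualised",
     "tags": ["sjtu:concept:sorting", "sjtu:type:animation"]},
]

print(resources_for_concept(bookmarks, "recursion"))
```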

The lecturer feedback was good. However, they found that students don’t look up external resources if the textbook is good enough – so need to concentrate on courses where the textbook doesn’t give enough coverage.

Micro-blogging for Language Learning

  • Context: distant campus of SJTU – vocational learners: limited time, seldom active, shy
  • Goal: provide practice possibilities
  • Hypothesis: Micro-blogging
    • increases sense of community; reduces transactional distance to teacher (i.e. teacher is just another ‘peer’); can be done in very small amounts of time

The e-learning lab implemented a Twitter update downloader to store all Twitter updates in a database, plus automatic grading based on the number of updates. This was based on the Twitter API – but there were limitations (it could only extract the last 20 messages), so they also had to do some screen scraping of the web interface.
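
The grading side of this is simple enough to sketch – the banding below is illustrative only (not the scheme Carsten described), and it assumes the updates have already been downloaded into memory:

```python
# Count updates per student and map the counts to grades.
# The grade bands are made up for illustration.
from collections import Counter

def grade(update_count):
    if update_count >= 100:
        return "A"
    if update_count >= 20:
        return "B"
    if update_count >= 1:
        return "C"
    return "no participation"

def grades_from_updates(updates):
    counts = Counter(student for student, _text in updates)
    return {student: grade(n) for student, n in counts.items()}

updates = [("student-01", "Practised ordering coffee today"),
           ("student-01", "New word: 'serendipity'"),
           ("student-02", "Hello from the lab")]
print(grades_from_updates(updates))
```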

They got 98 students out of 110 participating, with 5574 updates during 7 weeks. About 50% of students sent 1-19 updates and 50% sent 20-99 updates, but there were a few ‘power users’ who sent 100+.

Only 5% of students felt it didn’t achieve the aims (sense of community etc.). 50% stated they communicated with native speakers – although Carsten thinks that only 5% were actually having real ‘dialogue’ in this context (based on log analysis).

The main criticism was that there wasn’t enough ‘correction’ of mistakes (interesting, because this seems to suggest students would have liked some more ‘knowledge transfer’ elements?).

Lessons learned

Web 2.0 services can stimulate active participation – Twitter usage continued after the lectures. But they saw users drifting ‘off topic’ – e.g. posts in other languages, pejorative messages.

They found that students didn’t make use of the ‘architecture of assembly’ – e.g. they didn’t reuse their updates to show on a blog.

Totuba Toolkit

Totuba is a startup providing a toolkit for schools, universities and researchers.

What it does is provide a ‘Research Assistant’ to enable capturing, categorising and referencing of information; it provides social network tools, sharing and storage of information, export to Word etc.

Carsten is showing how this might work with a Wikipedia article.

He is comparing to Zotero – but the idea of Totuba is that it is very simple – it can be used by school children, you don’t need to install a plugin etc.

Looks like this is in alpha at the moment

The goal is to facilitate the process of learning and research, removing unnecessary steps, automating manual integration work, and make it easier to find additional materials and peers.

The concern I have with this is whether it is reinventing existing services, or whether it adds some value not yet available. So, what would make me use/recommend Totuba over Google Notebook + Facebook + …? I guess simplicity and a packaged product is the answer, but this seems to conflict with the rest of the message from Carsten. Need to reflect on this more.

Conclusions

Web 2.0 tools and learning:

  • Less suited to designed instruction
  • have potential to stimulate active participation
  • learners will think of unanticipated ways of usage
  • requires active teacher input/monitoring
  • you can become dependent on 3rd party applications

Web 2.0 and research

  • functionality at high level of abstraction
  • quick way to assemble prototypes
  • one becomes dependent on third party tools
  • difficult to find data about scientific publications
    • Google scholar: no API
    • Citeseer, DBLP: restricted to Computer Science
  • Open Linked Data: still for experts

Question: Do these complement or replace traditional methods?

Answer: These are transformative – so eventually supplant traditional methods

Question: How do you do assessment? (Cynically, you can say that students are motivated to ‘get the qualification’, not ‘to learn’)

Answer: Still had conventional exams in these examples. Twitter usage was measured, and contributed to mark – to get participation

Question: Are you saying this approach is more in line with human nature?

Answer: Yes! That’s a good summary.


Andy Powell – Web 2.0 and Repositories

Andy has been live blogging at efoundations but clearly won’t be doing this for his own session.

Andy is describing repositories.

What we do with repositories:

  • manage
  • deposit
  • disclose
  • make openly available
  • curate
  • preserve

What is in repositories:

  • scholarly publications
  • learning objects
  • research data

Andy is going to focus on scholarly publications today, although much of what he says will be applicable across the board.

In terms of how HE is building an infrastructure, it is mainly an ‘institutional’ focus – although not exclusively – exceptions include arXiv, RePEc, JORUM. But Andy believes that the ‘political’ agenda is around institutional approaches.

Interoperability is done via centralised aggregators – usually national, sometimes global – Intute (national), OAIster (global). Interoperability is essentially at the level of harvesting metadata (usually simple Dublin Core) using OAI-PMH – not usually harvesting content.
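
For anyone who hasn’t seen OAI-PMH, the harvesting Andy describes boils down to something like this sketch – the endpoint URL is a placeholder, while the verb and metadataPrefix are standard parts of the protocol:

```python
# Harvest simple Dublin Core records from a repository's OAI-PMH endpoint
# and print the titles. The endpoint URL is a placeholder.
import xml.etree.ElementTree as ET
import requests

OAI_ENDPOINT = "http://repository.example.ac.uk/oai"  # placeholder

resp = requests.get(OAI_ENDPOINT, params={
    "verb": "ListRecords",
    "metadataPrefix": "oai_dc",
})
resp.raise_for_status()

root = ET.fromstring(resp.content)
DC_NS = "{http://purl.org/dc/elements/1.1/}"
for title in root.iter(DC_NS + "title"):
    print(title.text)
```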

Content in repositories is very often PDF.

Andy just noting SWORD, recently developed as a deposit API.

So – having given the background, Andy wants to look at the issues (now 5 issues – having grown from 3 when Andy first did a similar presentation – the issues grow as he looks at it more – oh dear):

#1 We talk about ‘repositories’

There is a real issue with terminology. The term ‘repository’ is pretty woolly. Whereas a focus on ‘making content available on the Web’ would be more intuitive to researchers.

I agree this is an issue – although judging from Peter Murray-Rust’s talk we might be better off saying ‘backup systems’ rather than either of these 🙂

Andy noting we don’t talk about ‘Content Management Systems’ – which may be a good thing, but we need to acknowledge that in general terms ‘repositories’ are ‘content management systems’. If we started thinking about this, then we might start talking about ‘surfacing content’ on the web, rather than focussing on specific protocols (i.e. OAI-PMH)

#2 We don’t emphasise

  • Google indexing – where is the discussion about ‘Search Engine Optimisation’
  • RSS Feeds
  • ‘Widgets’

#3 Our focus is on sharing metadata

Even though we have full-text to share – and what we do share is PDF rather than a ‘native web’ format. Also the metadata we do share tends to be simple Dublin Core – inconsistently applied. Andy arguing that simple DC is too simple to build compelling discovery services, but too complicated for the user – they are put off adding metadata

#4 We ignore the Web Architecture

We have tended to adopt service oriented approaches – in line with tradition of digital library approaches.

The focus is on building services that give access to data, rather than a ‘resource oriented’ approach, which is being adopted in the more general web world. We don’t tend to adopt REST (an architectural style with a focus on resources with a simple set of operations)
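
To illustrate the contrast (my sketch, with placeholder URIs): in a resource oriented approach each item has its own URI and the standard HTTP verbs operate on it directly, rather than going through a service endpoint with custom operations.

```python
# A resource-oriented interaction: one URI per item, standard HTTP verbs,
# content negotiation. The URI and formats are placeholders.
import requests

item = "http://repository.example.ac.uk/items/1234"  # one resource, one URI

# Read a machine-readable representation of the item
record = requests.get(item, headers={"Accept": "application/rdf+xml"})
record.raise_for_status()

# Replace the item's representation in place, rather than calling a
# custom 'updateRecord' operation on a service endpoint
requests.put(item, data=record.content,
             headers={"Content-Type": "application/rdf+xml"})
```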

#5 We are antisocial

‘we’ (presumably the HE environment?) tend to treat content in isolation from the social networks that need to grow around that content

Successful repositories in a more generic sense (Flickr, YouTube, Slideshare, etc.) tend to promote the social activity that takes place around content as well as the content management and disclosure activity.

One thing that occurs to me here is a question about what services Flickr etc. are providing – what are their terms of service? (clearly I need to go read these!) – they may not have the same values as HE institutions and their repository services.

Andy addressing my last point there, which is that the institutional approach has a fundamental mismatch with the real-life social networks adopted by researchers – which tend to be subject-based, cross-institutional and global. So while institutional approaches have some strengths – preservation etc. – they don’t get any ‘network’ effect.

We are ending up with ’empty’ repositories, having to ‘mandate’ deposit to get content, rather than making a compelling offering that researchers want to use.

So, Andy is suggesting we need to look at moving back to subject based, global repositories that concentrate content so that we can take advantage of the ‘network’ effect etc. This is where we started with arXiv.

There is a question of why in other examples – e.g. blogs – we can work with a distributed network of content, and services that aggregate this content (e.g. Technorati). Perhaps the reason this hasn’t worked with repositories is that blogs are under individual control, and the ‘glue’ (RSS) is lightweight and easy to apply. This doesn’t seem to be the case with repositories – although Andy admits he isn’t altogether sure why repositories haven’t been successful in this way.

Andy noting that having this kind of challenge to repository ‘received wisdom’ is very difficult – it is very political, and those involved are reluctant to engage in discussion (possibly as it distracts from the Open Access agenda).

So – where do we go from here?

Andy suggests that we need to look at examples like Slideshare (a service that shares presentations). This might be what a ‘Web 2.0’ repository looks like:

  • a high quality browser-based document viewer (not a ‘helper’ application like Acrobat)
  • tagging, commentary, more-like-this, favourites
  • persistent (cool) URIs to content
  • ability to form simpler social groups
  • ability to embed documents in other web sites
  • high visibility to Google
  • use of ‘the cloud’ (Amazon S3?) to provide scalability

Andy suggesting we need to ‘go simple’. Develop simple(ish) repositories with complex services (search/aggregation) overlaying them.

Alternatively, we could go more ‘complex’ – the ‘Semantic Web’ approach – creating richer metadata about scholarly publications than we currently do, explicitly adopting a complex data model etc.

Examples of this ‘complex’ approach are SWAP (Scholarly Works Application Profile), which captures relationships between works, expressions, manifestations, items and agents, and ORE (OAI Object Reuse and Exchange), which captures relationships between objects.

Andy throwing up diagrams of SWAP and ORE – these work well together, but represent a much more complex approach.

Andy’s main points:

  • Look to Web 2.0
  • Need global concentration to get network effect
  • Simple DC too simple and too complex
  • SWAP and ORE may point to new approaches, but come with extreme complexity

In conclusion:

Flickr was a response to digital photography – it wasn’t an attempt to create an ‘online photo album’

We need an approach to digital research that is not an attempt to recreate paper based scholarly communication – we need to re-think (‘re-envision’ in Andy’s words) scholarly communication in the digital age.

Question: Someone in the audience saying “when you said repositories were about ‘making content available on the web’ suddenly I understood why we had repositories – something that I had previously not understood”. I’m afraid I missed the next point, but it garnered a response from Andy along the lines of:

Answer: We are still working in a print/electronic hybrid environment for scholarly communication. We have a situation where you can have lots of copies of the same thing in different places and in different formats. Users want to get access to the most available copy. Is this the difference? (I’m not entirely sure this is the whole story – there is an issue about business models and charging for access – this is why it’s important which copy you are looking at)

Comment: From Phil Casey from BMJ – saying it is not about managing the delivery mechanism, but it is about the data.


Talis Research Day – Codename Xiphos

I’m at a Talis Research Day today looking at a number of issues that are ‘hot topics’ in education and research at the moment. The program for the day looks great, with presentations by Peter Murray-Rust from the University of Cambridge, who is a proponent of opening up research and research data – I’d recommend his blog to catch up with the latest work in this area. Peter is talking about ‘Data-driven research’. Following this Andy Powell from Eduserv is talking about Web 2.0 and repositories.

First up – Peter M-R:

Peter presents using HTML – although it’s hard work, he believes that the common alternatives (Powerpoint, PDF) destroy data. I think the question of ‘authoring tools’ – not just for presentations, but in a more general sense of tools that help us capture data/information – is going to come to the fore in the next few years.

Peter has a go at publishers – claiming that publishers are in the business of preventing access to data, rather than facilitating it (at this point he asks if there are any publishers in the audience – two sheepish hands are raised). Peter also mentions that Chemistry is particularly bad as a discipline in terms of making data accessible – with the American Chemical Society being a real offender.

Peter’s talks tend to be pretty impromptu – so he is just listing some topics he may (or may not) touch on today:

  • Why data matters
  • What is Open Data
  • Differences between Open Access and Open data
  • Demos
  • Repositories
  • eTheses
  • OpenNoteBook Science
  • Semantic data and the evils of PDF
  • Science Commons, Talis and the OKF
  • Possible Collaborations

Peter demonstrating how a graph without metadata is meaningless – showing a graph on the levels of Atmospheric Carbon Dioxide. If this was in paper form and we wanted to do some further analysis – it would take a lot of effort to take measurements off the graph – but if we have the data from behind the graph, we can immediately leap to doing further work.

Peter now noting that a scholarly publication looks very much the same now as it would have done 200 years ago. Showing a PDF of an article from Nature – and making the point that it all looks great (illustrations of molecules, proteins and reactions etc.) but is completely inaccessible to machines.

Peter noting that most of the important bio-data that is published is publicly accessible and reusable – but this is not true in chemistry. This means that in the article, the data about the proteins is publicly accessible, but the information on the chemical molecules is not – although covered in the same article.

Peter illustrating how there is a huge industry based on moving and repurposing data (e.g. taking publicly available patent data, and re-distributing in other formats etc.)

Peter now showing how a data rich graph is reduced to a couple of data points to ‘save space’ in journals – a real paper-based paradigm – we need to get away from this. Similarly experimental protocols are reduced to condensed text strings.

Peter now showing ‘JoVE’ – the Journal of Visualised Experiments. In this online publication, scientific protocols are published in both textual and audio-visual formats – much richer in detail than the type of summarisation that journals currently support. Peter notes this is really important stuff – if you fail to provide enough detail to recreate an experiment, it can have a huge impact on your reputation and career.

Peter now moving on to ‘big science’ – relating his visit to CERN – how the enormous amounts of data generated by the Large Hadron Collider are captured, as well as relevant metadata. However, most science is not like this – not on this scale. Peter is relating the idea of ‘long tail’ science (coined by Jim Downing) – this is small-scale science that is still generating (over all activity) large amounts of data – but each from small activities. This is really relevant to me, as this is exactly the discussion I was having at Imperial yesterday – looking at the approach taken by ‘big science’ and wondering if it is applicable to most of the research at Imperial.

So in long-tail science, you may have a ‘lab’ that has a reasonably ‘loose’ affiliation to the ‘department’ and ‘institution’. Peter noting that most researchers have experienced data loss – and this can be a real selling point for data and publication repositories.

Peter showing a thesis with many diagrams of molecules, graphs etc. Noting there is no way to effectively extract the information about molecules from the paper, as it is a PDF. He is demonstrating a piece of software which extracts data from a chemical thesis – demonstrating this on a thesis authored in Word, and using OSCAR (a text-mining tool tuned to work in Chemistry) – showing how it can extract relevant chemical data, display it in a table and reconstruct spectra (from the available data in the text – although these are not complete).

Peter asking (rhetorically) what are the major barriers – e.g. Wiley threatened legal action against a student who put graphs on their website.

Peter now demonstrating ‘CrystalEye’ – a system which spiders the web for crystals – reads the raw data, draws a ‘jmol’ view (3d visualisation) of the structure, links to the journal article etc. This brings together many independent publications in a single place showing crystal structures. Peter saying this could be done across chemistry – but data is not open, and there are big interests that lobby to keep things this way (specifically mentioning Chemical Abstracts lobbying the US Government)

Peter now talking about the development of authoring tools – pointing out that this is much more important than a deposition tool – if the document/information is authored appropriately, it is trivial to deposit (it occurs to me that as long as it is on the open web, then deposit is not the point – although there is some question of preservation etc. – but you could start to take a ‘wayback machine’ type approach). Peter is demonstrating how an animated illustration of chemical synthesis can be created from the raw data.

Peter now coming on to repositories. Using ‘Sourceforge’ (a computer code repository) as an example. Stressing the importance of ‘versioning’ within Sourceforge – it is trivial to go back to previous versions of code. We need to look at introducing these tools for science. He is involved in a project called ‘Bioclipse’ – a free, open-source workbench for chemo- and bioinformatics using a versioning approach (built on the Eclipse platform) – Bioclipse stores things like spectra, proteins, sequences, molecules etc.

Peter mentioning issues of researchers not wanting to share data straightaway – we need ‘ESCROW’ systems that can store information which is only published more openly at a later date. The selling point is keeping the data safe.

Peter dotting around during the last few minutes of the talk, mentioning:

  • Science Commons (about customising Creative Commons philosophy for Science)
    • how to license data to make it ‘open’ under appropriate conditions – this is something that Talis has been working on with Creative Commons.
    • Peter saying that, for example, there should be a trivial way of watermarking images so that researchers can say ‘this is open’ – and then if it is published, it will be clear that the publisher does not ‘own’ or have copyright over the image.

Questions:

Me: Economic costs of capturing data outside ‘big science’

PMR: If we try to retro-fit, the costs are substantial. However, data capture can be a marginal cost if done as part of the research. Analogy of building motorways and cyclepaths – very expensive to add cyclepaths to motorways, but trivial to build them at the same time.

Some interesting discussion of economics…
