Project Xiphos – Creating a Social Network from a Web of Scholarly Data

This session is by Chris Clarke from Talis.

The web as it exists at the moment is a ‘web of documents’ – human readable, but not so good for machines.

To exemplify the problem – a novice user might type into Google “how many people were evacuated from new orleans following Hurricane Katrina”.

Google finds a relevant result – the Wikipedia article on the “Effect of Hurricane Katrina on New Orleans” – but the data is hidden in the document, and Google fails to pick up other relevant facts buried in the article.

Chris says what we need is a ‘machine-readable’ web – and this is the promise of the ‘semantic web’.

To go back to the Wikipedia example – DBpedia is an attempt to extract ‘facts’ from Wikipedia text and make them available in a machine-readable format. Powerset is a search engine built on top of this. Given the same query as above, Powerset ‘understands’ the query and serves up appropriate answers.

The problem I have with this example is that it isn’t clear to me that Google doesn’t ‘understand’ (at least to the same extent) the question. Doing semantic analysis of the question is different to using the ‘semantic’ nature of the data.

Talis provide a ‘platform’ (an RDF triple store, plus relevant services) – and they wanted to see how this could support a ‘web of scholarly data’. They took the metadata for 500 articles in XML format and loaded it into the platform, then started to see what links etc. they could get out. Within the data set there were references to 19808 articles (mainly in citations) and 21209 people.
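To make the triple store idea a bit more concrete, here is a minimal sketch in plain Python. It is not the Talis platform or its API; the identifiers and predicate names (`hasAuthor`, `cites`) are made up for illustration. It shows how a handful of loaded articles can end up referring to more articles and people than were loaded, because citations pull in extra ones.

```python
# Hypothetical triples: (subject, predicate, object). The identifiers and
# predicate names are illustrative, not the vocabulary Talis used.
triples = {
    ("article:1", "hasAuthor", "person:flower"),
    ("article:1", "hasAuthor", "person:smith"),
    ("article:1", "cites", "article:2"),
    ("article:1", "cites", "article:3"),
    ("article:2", "hasAuthor", "person:jones"),
    ("article:3", "hasAuthor", "person:smith"),
}

def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Every article mentioned anywhere, including those seen only in citations.
articles = {x for t in triples for x in (t[0], t[2]) if x.startswith("article:")}

# Every person who appears as an author.
people = {o for (_, _, o) in match(triples, p="hasAuthor")}

print(len(articles), "articles,", len(people), "people")
```

Only one article here has citation data, yet three articles and three people are known to the store, which is the same effect (in miniature) as 500 articles yielding references to 19808.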

This is all very interesting, but Talis wanted to see how this could actually be ‘useful’. This is what Xiphos is meant to demonstrate – a scholarly social network based on the data. The Xiphos Network was built in 4 weeks.

So Chris is showing screenshots (although he assures us they do have a real system, and we can look at it afterwards – there is a problem with the network connection at the moment). Taking a possible example of a real researcher, they enter a keyword that is actually a person’s name, ‘flower’.

Results are grouped by type:

  • Things
  • People

By exploiting the data available in the platform, extracted from the 500 articles, they can show the person ‘T P Flower’ and that person’s connections (via co-authoring) to others.
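As a rough illustration of how co-authoring links might be derived from article metadata, here is a small Python sketch. The data and the `coauthor_graph` helper are hypothetical (only the name ‘T P Flower’ comes from the talk), and this is not how Xiphos is necessarily implemented.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical article -> authors metadata. Apart from 'T P Flower',
# the names are invented for the example.
authors_by_article = {
    "article:1": ["T P Flower", "A Smith"],
    "article:2": ["T P Flower", "B Jones", "A Smith"],
    "article:3": ["C Brown"],
}

def coauthor_graph(authors_by_article):
    """Link every pair of people who share at least one article."""
    graph = defaultdict(set)
    for authors in authors_by_article.values():
        for a, b in combinations(authors, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

graph = coauthor_graph(authors_by_article)
flower_links = sorted(graph["T P Flower"])
print(flower_links)
```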

If the user creates an account on the system, the system can check whether it already holds information about them, and connect that existing information to the newly created account. This means the user has a pre-populated social network and ‘stuff’ which they can refine.
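The talk didn’t say how the matching between a new account and existing data is done. Purely as an assumption, one simple approach would be to compare a normalised form of the name against people already extracted from the metadata, and then let the user confirm or reject the candidates. The `normalise` and `candidates_for` helpers and the records below are hypothetical.

```python
def normalise(name):
    """Crude name key: lowercase, drop punctuation, collapse whitespace."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())

# Hypothetical records already extracted from article metadata.
known_people = {
    "person:17": {"name": "T. P. Flower", "articles": ["article:1", "article:2"]},
    "person:42": {"name": "A. Smith", "articles": ["article:2"]},
}

def candidates_for(account_name, known_people):
    """Return ids of existing people whose normalised name matches."""
    key = normalise(account_name)
    return [pid for pid, rec in known_people.items()
            if normalise(rec["name"]) == key]

matches = candidates_for("T P Flower", known_people)
print(matches)
```

Name matching alone will produce false positives (two different ‘A Smith’s), which fits with the point that users are expected to refine what they are given.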

By exploiting the information in the store, they can break down relationships in 4 ways:

  • People you know
  • People you cite
  • People who cite you
  • People you are watching

The middle two are automatically generated from the metadata from the articles. The last one is a ‘one way’ relationship – you don’t know who is watching you (though perhaps you would see an overall number). They have had good feedback from academics on this last function.
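Here is a sketch of how ‘people you cite’ and ‘people who cite you’ could fall out of authorship and citation metadata, and how ‘watching’ can stay one way while still exposing a count. All names and data are invented, and the structure is my assumption rather than anything shown in the talk.

```python
# Hypothetical metadata: who wrote each article, and which articles it cites.
authors = {
    "a1": {"you", "ann"},
    "a2": {"bob"},
    "a3": {"cat"},
}
cites = {"a1": {"a2"}, "a3": {"a1"}}

def articles_by(person):
    return {art for art, who in authors.items() if person in who}

def people_you_cite(person):
    """Authors of any article cited by one of this person's articles."""
    cited = set()
    for art in articles_by(person):
        for target in cites.get(art, set()):
            cited |= authors.get(target, set())
    return cited - {person}

def people_who_cite(person):
    """Authors of any article that cites one of this person's articles."""
    mine = articles_by(person)
    citing = set()
    for art, targets in cites.items():
        if targets & mine:
            citing |= authors.get(art, set())
    return citing - {person}

# 'Watching' is stored per watcher, so it is naturally one way.
watching = {"you": {"bob"}, "cat": {"you"}, "ann": {"you"}}

def watcher_count(person):
    """Expose only how many people watch someone, not who they are."""
    return sum(person in watched for watched in watching.values())

print(people_you_cite("you"), people_who_cite("you"), watcher_count("you"))
```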

As well as each person having a network page, each paper also has a ‘network’ – showing citations and cited by, as well as presenting a thumbnail preview of the paper.

You can add papers to ‘collections’ (which can be public or private) – and people can ‘belong’ to collections as well as ‘watch’ collections (sounds a bit like the idea of a ‘Twine’).

Alongside this they looked at adding ‘subjects’ using schemes (specifically MeSH), and events. They also had an idea around a ‘Vault’ which would provide document storage – but weren’t sure if that should be part of the remit of a system like this, or whether this should be held elsewhere.

This was all from a set of data on 500 articles, and is just one approach to building applications on top of this data. They chose this because it could be used to ‘clean’ data – users would improve the linkages etc. You could build many other types of application.

Chris is stressing that although they see a place for Talis in this environment, they believe it is necessary to see many players – suggesting PubMed, Blackwells, Ingenta etc. could be part of the picture.

Question: Do you have a business model in mind?

Answer: This is just a prototype – there would be different approaches depending on your point of view (publishers’ would differ from Open Access).

Question: In an environment where you want multiple players, how do you get them all working together? Are there tools that do this?

Answer: Open Data is looking at some of these issues. Google and Yahoo are starting to exploit this type of semantic data.

Question: When users make modifications to ‘the graph’, how do you deal with ‘versioning’?

Answer: Changes are stored in such a way that you can roll back to previous versions.
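The answer didn’t go into detail, but one way to get this kind of rollback (a sketch under my own assumptions, not a description of the Talis platform) is to record each edit as a set of triples added and a set removed, so that undoing an edit is just applying the inverse. The `VersionedGraph` class below is hypothetical.

```python
class VersionedGraph:
    """A set of triples plus a history of (added, removed) changes."""

    def __init__(self, triples=()):
        self.triples = set(triples)
        self.history = []

    def apply(self, add=(), remove=()):
        # Record only what actually changes, so rollback is exact.
        added = set(add) - self.triples
        removed = set(remove) & self.triples
        self.triples = (self.triples - removed) | added
        self.history.append((added, removed))

    def rollback(self):
        # Undo the most recent change by applying its inverse.
        added, removed = self.history.pop()
        self.triples = (self.triples - added) | removed

old = ("person:1", "name", "T Flower")
new = ("person:1", "name", "T P Flower")

g = VersionedGraph({old})
g.apply(add={new}, remove={old})   # a user corrects the name
after_edit = set(g.triples)
g.rollback()                       # the correction is undone
after_rollback = set(g.triples)
```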
