SORT – Closing Keynote

From Bill Thompson, Partnership Manager at the BBC Archive

The BBC has an enormous archive – 1 million hours of TV/radio, photographs, over 7 miles of shelves of printed material, artefacts (TARDISes, Daleks) – the job of the archive is to serve the needs of the BBC. But there is growing awareness that there is public value in this material – this is the area where Bill works, in a small team.

First idea was to put digital material online. However, thinking about how to expose content and add value started to infect thinking across corporation. Started to think about online aspects of public space.

  • About the creation of material
  • Exploiting new technology capabilities (e.g. cameras have embedded GPS)
  • Use of content – not just about making programmes, and putting stuff on TV – lots of stuff that never gets broadcast
  • Preservation – complex – wide variety of physical formats

All these things need to happen – even just for internal purposes.

Want to make it as open as possible – but there are issues – may contain personal data, need to think about regulatory regime, commercial issues. These are constraints, although not unreasonable constraints.

Want to think beyond the BBC – think at webscale. Would be foolish to do some of these things just at the scale of the BBC – e.g.

  • location based stuff – need to make sure what the BBC does fits into wider frameworks.
  • Have naming conventions at the BBC, need to look at how this fits with other data outside the BBC.
  • Time – knowing when stuff was done – BBC have said they want to publish a time axis of all programmes ever broadcast – so need to decide how to do this – again not something just the BBC interested in.
  • Want to bring in users – e.g. university academics

BBC is very creative environment. BBC has huge engineering expertise. Engaging seriously with standards efforts etc. For Bill semantic web is a way of building sustainable approaches.

While we’ve been waiting for the semantic web for a long time, it does seem to be getting closer.

We have more processing power than we know what to do with. We have connectivity. We can store data.

What we don’t have are the tools and intelligent agents that allow us to apply reasoning across data. We are talking about Artificial Intelligence (AI) – and this comes with very big problems. We haven’t made much progress with AI.


OK – I admit, at this point I kind of lost track – Bill delved into some of the challenges for creating Artificial Intelligence, and while I felt I was getting the general points, much of the detail washed over me.

I think that one of the key aspects Bill was highlighting was that some AI researchers believe that you can’t have intelligence without a surrounding environment – and it is in the ability of an entity to interact with its environment – especially in the sense of taking in, and pushing out, information.

I think Bill’s argument was that when you simply develop software-based AI, you don’t have these kinds of interactions, and so you aren’t going to get intelligence. Bill quoted a book around this topic which I didn’t manage to get the details of, but it reminded me very much of the arguments put forward by Steven Grand in his book “Growing up with Lucy: How to Build an Android in Twenty Easy Steps”. You can read more about Steven Grand’s approach to AI on the Lucy project page. I think a quote from this page that relates to what Bill was saying is:

“Central to the Lucy project is researching how an organism gains and uses knowledge, or more appropriately how the process of acquiring and using data are interconnected attributes.”

However, whereas ‘Lucy’ (the Steven Grand project) takes the approach of building a physical entity that can interact with the physical world, I got the impression that Bill was arguing that the semantic web could create an environment in which a more purely software based AI could interact.

Hope that makes sense. It was an interesting talk, and a mind stretching way of closing a very interesting and challenging conference.

SORT – Panel discussion

Q: What are the business models – how do we make this sustainable?

A: (Mike Ellis) Some of this activity can reduce costs – so not a revenue stream, but cheaper to do stuff. Requires creative thinking – need to talk to marketeers and communications specialists. E.g. National Gallery – partnered with a commercial company to produce an iPhone app – which was sold.

A: (Dan Greenstein) We are going to have to take money away from existing activities – e.g. University of California is now boycotting Nature due to price increases. Need to make sure those things that don’t work go away.

A: (Jo Pugh) Some of this stuff just ‘has to be done’ – freeing our data might be like preservation – doesn’t make us money, but we do it.

A: (Andy Neale) Some of this can be done as part of ‘business as usual’ – tacks on to existing activity

A: (Mike Ellis) Income from protecting some of this stuff (e.g. picture libraries selling use of pictures) is not that great – and there are costs with things like chasing copyright etc.

A: (Jo Pugh) V&A changed rules over what they could do – became more permissive and revenue went up, and they were able to reduce staff

A: (Dan Greenstein) Some publishers protecting backlists in anticipation of a revenue stream that isn’t available, and yet they could realise revenue streams

A: (Stuart Dempster) Look at the Ithaka case studies – real world financials about how much it costs to operate types of digital services – and they will be updated this year so we will be able to see the impact of the downturn. Also recommend looking at government technology policy – will be a source for innovative practice. Now seeing funders requiring exit strategies for projects from day 1.

Q: (Sally Rumsden) Interested in metadata. What is ‘good enough’ metadata? What should we be doing to make sure people can find stuff reliably?

A: (Andy Neale) DigitalNZ took any metadata – some items only have a title, and even some of those titles are ‘title unknown’ – but even this is a hook. When you start pushing this into resource discovery systems with faceting etc. you start to expose the quality of the metadata – can highlight to contributors problems they didn’t appreciate, and result in improvements over time. Even if you only have a title – this can still be useful…
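As an aside, the point about faceting exposing metadata quality is easy to see in miniature: push records into a faceted index and the gaps surface as their own facet values. A minimal sketch (field names and records invented for illustration):

```python
from collections import Counter

def facet_counts(records, field):
    """Count facet values for one metadata field, flagging missing values."""
    counts = Counter()
    for rec in records:
        counts[rec.get(field) or "(missing)"] += 1
    return counts

# Hypothetical contributed records of very mixed quality
records = [
    {"title": "Armistice Day parade", "creator": "J. Smith"},
    {"title": "Title unknown"},
    {"title": None},
]

# A facet panel built from these counts immediately shows contributors
# how many of their records lack a creator, or carry placeholder titles.
print(facet_counts(records, "creator"))
print(facet_counts(records, "title"))
```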

A: (Tom Heath) Good enough is in the eye of the beholder. We can’t anticipate. However, you could flag stuff you aren’t happy with so it is clear to users

A: (Dan Greenstein) Metadata enhancement adds to cost. And specialist materials have a particularly high cost, and delivers value to a small number of people. We aren’t good at saying ‘no’ to stuff. We have to be clear what we can afford – have to model costs of project more effectively.

A: (Liz Lyon) Trove project (in Australia) using crowdsourcing to improve metadata

A: (Balviar Notay) Some projects already looking at how text mining tools can enhance metadata for digital repositories – although perhaps unlikely to solve all problems so likely to see mixed manual and automatic approaches

A: (David Kay) If you put stuff out, evidence from Internet Archive and others, authors become motivated to improve and add more. But have to get it out there first

A: (Mike Ellis) Look at Powerhouse Museum collection – using OpenCalais to generate tags. Picasa starting to add object recognition – some automated tools improving, but still some issues. Also look at Google tagging game, V&A also doing user engagement to generate content from humans

A: (Peter Burnhill) Think about what metadata already existed and reuse or leverage that data

SORT – Getting your attention

David Kay is going to talk about ‘attention data’ – what users are looking at or showing interest in – and also how it relates to user generated content, as he is starting to believe that attention data is key to getting user engagement.

The TILE project  – looked at library attention data – could this inform recommendations for students. David mentioning well known ‘recommendation’ services – e.g. Amazon, also in the physical world – Clubcard, Nectar card informs marketing etc.

David Pattern at University of Huddersfield – “Libraries could gain valuable insights into user behaviour by data mining borrowing data and uncovering usage trends.”

Types of attention data:

  • Attention – behaviour indicating interest/connections – such as queries, navigation, details display, save for later
  • Activity – formal transactions such as requesting, borrowing, downloading
  • Appeal – formal and informal lists – types of recommendations – such as reading lists – can be a proxy for activity
  • And …

We could concentrate and contextualise the intelligence (patterns of user activity) existing in HE systems at institutional level whilst protecting anonymity – we know which institution a user is in, what course they are on, what modules they are doing. This contextual data is a mix of  HE ‘controlled’ (e.g. studies, book borrowing),  user controlled (e.g. social networks) and some automatically generated data.

The possibility of critical mass of activity data from ‘day 1’ brings to life the opportunity and motivation to embrace and curate user contribution – such as ratings, reviews, bookmarks, lists. To achieve this need to make barriers to contribution as low as possible.

What types of questions might be asked – did anyone highly rate this textbook, what did last year’s students download most.

What level is this type of information useful – institutional, consortial, national, international?

MOSAIC ran a developer competition based on usage data from the University of Huddersfield (see Read to Learn). 6 entries fell into three areas:

  • Improving Resource Discovery
  • Supporting learning choices
  • Supporting decision making (in terms of collection management and development)

However – some dangers – does web scale aggregation of data to provide personalised services threaten the privacy of the individual? David says they believe not, as long as good practice is followed. We need to be careful but not scared off. Already examples:

  • California State University show how you can responsibly use recommendation data
  • MESUR project – contains 1 billion usage events (2002-2007) to drive recommendations

In MOSAIC project, CERLIM did some research with MMU (Manchester Metropolitan University) students – 90% of students said they would like to be able to find out what other people are using:

  • To provide a bigger picture of what is available
  • To aid retrieval of relevant resources
  • To help with course work

CERLIM found students were very familiar and happy with the idea of recommendations – from Amazon, Ebay etc.

University of Huddersfield have done this:

  • suggestions based on circ data – people who borrowed this also borrowed…
  • suggestions for what to borrow next – most people who read book x, go on to read book y next
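The ‘also borrowed’ suggestions described above can be computed from raw circulation data by simple co-occurrence counting – no ratings needed. A minimal sketch, using invented loan records rather than real Huddersfield data:

```python
from collections import Counter, defaultdict

def also_borrowed(loans):
    """loans: iterable of (user, book) pairs.
    Returns book -> Counter of books co-borrowed by the same users."""
    by_user = defaultdict(set)
    for user, book in loans:
        by_user[user].add(book)
    co = defaultdict(Counter)
    for books in by_user.values():
        for b in books:
            for other in books:
                if other != b:
                    co[b][other] += 1
    return co

loans = [("u1", "A"), ("u1", "B"),
         ("u2", "A"), ("u2", "B"), ("u2", "C")]
co = also_borrowed(loans)
print(co["A"].most_common(1))  # B is co-borrowed with A most often
```

In practice you would threshold low counts and respect anonymisation, but the core of a ‘people who borrowed this also borrowed…’ feature really is this small.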

Impact on borrowing – when recommendations were introduced into the catalogue there was an increase in the range of books borrowed by students, and the average number of books borrowed went up – really striking correlations here.

Also done analysis of usage data by faculty – so can see which faculties have well used collections. Also identify low usage material.

Not only done this for themselves – released data publicly.

Conclusion/thoughts from a recent presentation by Dave Pattern:

  • serendipity is helping change borrowing habits
  • analysis of usage data allows greater insights in how our services are used (or not)
  • would national aggregation of usage data be even more powerful?

Now David (Kay) moving onto some thoughts from Paul Walk – should we be looking at aggregating usage data, or engaging with people more directly? Paul asks the question “will people demand more control over their own attention data?”

Paul suggests that automated recommendation systems might work for undergraduate level, but in academia need to look beyond automatic recommendations – because it is ‘long tail all the way’. Recommendations from peers/colleagues going to work much much better.

David relating how user recommendations appear on bittorrent sites and drive decisions about which torrents to download. Often very small numbers – but sometimes one recommendation (from the right person) can be enough. Don’t need to necessarily worry about huge numbers – quality is powerful.

Q & A

Comment: (Dan Greenstein) At Los Alamos use usage data for researchers moving between disciplines (interdisciplinary studies) – fast way of getting up to speed.

Comment: (Liz Lyon) Flips peer-review on its head – post review recommendation – and if you know who is making that recommendation allows you to make better judgements about the review…

Comment: (Chris Lintott) Not all ‘long tail’ – if you aggregate globally – there are more astronomer academics than there are undergraduates at the University of Huddersfield.

Comment: (Andy Ramsden) Motivation of undergraduate changes over time – assessment changes across years of study – and groups of common study become smaller. Need to consider this when thinking about how we make recommendations

Q: (Peter Burnhill) Attention data about ‘the now’ – what about historical perspective on this.

A: Examples on bittorrent sites of preserving older reviews – in some cases on material 30 years old – so yes, important.

SORT – Working the crowd: Galaxy Zoo & the rise of the citizen scientist

I’ve been looking forward to this session by Chris Lintott on Galaxy Zoo

As our ability to get information about the universe has increased we are challenged to deal with larger and larger amounts of data. In astronomy driven by availability of hi-resolution digital imaging etc – whereas 20-30 years ago you could get collections of hundreds of galaxies – now can get collections of millions.

Analysis of galaxy images is about looking at the shape of galaxy. While machine approaches have been developed – they typically have only an 80% accuracy. However humans are very good at this type of task. This used to be a task students would do – but the amount of data far outstripped ability of students to keep up.

In astronomy there is a long tradition of ‘amateurs’ taking part and spotting things that may not be spotted by professionals. However contributions have generally been around data collection – and then passed to experts for analysis. Galaxy Zoo reverses this – data collection has been done and the public is asked to analyse the data.

GalaxyZoo was meant to be a side project – but picked up by media – specifically BBC News website – and sudden burst of publicity got huge boost. However, first thing that happened was server went down – 30,000 emails telling them that the server had gone down. Luckily able to get that back up and running quickly.

After 48 hours were classifying as many galaxies in 1 hour as a student previously doing in a month.

Found that getting many people to do the classification improves accuracy – over professional astronomers. Took away all barriers to participating to get as many people involved as possible. Originally had a ‘test’ for users – but took this away.
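The accuracy gain from many volunteers is essentially the jury theorem: if each independent classifier is right with probability p > 0.5, a majority vote of many of them is right far more often. A quick simulation of that effect (illustrative only – not Galaxy Zoo’s actual aggregation method):

```python
import random

def majority_accuracy(p, n, trials=10000, seed=42):
    """Estimate the accuracy of a majority vote of n independent
    classifiers, each correct with probability p."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        votes = sum(rng.random() < p for _ in range(n))
        if votes > n / 2:
            correct += 1
    return correct / trials

print(majority_accuracy(0.8, 1))   # a single 80%-accurate classifier: ~0.8
print(majority_accuracy(0.8, 21))  # majority of 21 such classifiers: near 1.0
```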

The huge side effect is that humans can spot unexpected stuff without being told – much better than machines.

Also built community around people participating – this community now starting to solve problems – e.g. discovery of small green galaxies – started to analyse, recruited programmer to interrogate data and this has eventually resulted in published paper – these objects have been known since 1960s but never analysed. None of the people in the group were scientists.

When they’ve talked to users of the site the overwhelming reason for taking part is that they want to do something useful – want to contribute.

We have a responsibility not to waste people’s time – the collective manpower on Galaxy Zoo 2 was equivalent to employing a single person for 200 years – we cannot take this lightly.

Don’t make promises you can’t keep – e.g. don’t offer ‘free response’ that you then can’t actually read – Galaxy Zoo handles this via the online community forums.

Chris describes three strands of engagement with users

  • Known knowns
  • Unknown unknowns
  • Known unknowns

Now JISC funded project to convert information from old ship logs – because has climate data.

Show pages of ships logs –

  • key data you should extract (known knowns – that stuff the researchers know they want from the logs like weather reports)
  • unexpected things you might spot (unknown unknowns – stuff you might spot in the logs – pictures, unexpected information)
  • expected things, but not known how much (known unknowns – events you know will be in there but not how often e.g. encounters with other ships)

These strands are generalisable to many projects

Zooniverse – takes the generalisable stuff from the researchers and provides it – platform for citizen science.

Can no longer rely on media to get message out and drive engagement – “it’s on the internet isn’t it amazing” no longer a story – need to work out how we get the next 300,000 people involved [my first thought – Games – look at Farmville…]

SORT – Open Science at Genome Scale

Second session of the second day – Liz Lyon from UKOLN

Open Science at Web-scale report – a consultative document. Liz says it is now available on WriteToReply (but I can’t find it) (thanks to Kevin Ashley, I’ve now got a link to this).

OK – Liz talking about the amount of data being generated by Genome sequencing machines – now into second generation of Genome sequencing, and the next generation is being worked on which will work at orders of magnitude larger volumes of data.

This type of huge data production brings challenges. Need large-scale data storage that is:

  • Cost effective
  • Secure
  • Robust and resilient
  • Low entry barrier
  • Has data-handling/transfer/analysis capability

Looking at ‘cloud services’ that could offer this – e.g. Nature Biotechnology 10.1038/nbt0110-13 details use of cloud services in biotechnology.

Starting to see data sets as new instruments for science.

Cost of genome sequencing dropping, while number of sequenced genomes rises.

Leroy Hood says “medicine is going to become an information science”. P4 medicine:

  • Predictive
  • Personalised
  • Preventive
  • Participatory

Stephen Friend – chief exec of Sage Bionetworks – wants to develop an open data repository (Sage Commons) to start to develop predictive models of disease – liver/breast/colon cancer, diabetes, obesity.

Paraphrasing a quote Liz read out: cultural forces encourage sharing – the way people handle personal data will impact on how researchers deal with data, and will mean they have no choice but to share.

Need to think about ways to incentivise researchers to share data – through mechanisms that allow credit and attribution which will then mean researchers benefit from sharing data.

Need to think about:

  • Scaleable data infrastructure
  • Personal genomic – share your data?
  • Transform 21st Century medicine/bioscience
  • Credit and attribution for data and models

SORT – Digital New Zealand: Implementing a national vision

The second day of ‘Survive or Thrive’ starts with Andy Neale, Programme Manager for DigitalNZ

Going to cover:

  1. Getting started and getting stuff done
  2. Issues and opportunities
  3. Ongoing development and iteration
  4. Strategic drivers and reflection
  5. Things that worked

Andy stresses New Zealand is a different environment to the UK. Thinks small size may be an advantage despite smaller budgets (wonder if there is a lesson here – perhaps trying to do things in the UK at a ‘New Zealand’ scale?)

First pitch to collaborators – didn’t push a ‘national vision’ very hard. Was an invitation to contribute to a ‘Coming Home’ programme which was focussed on content relevant to Armistice Day (World War I). Collaborators asked to sign up to a series of 4 projects:

  • Search widget
  • Mashup
  • Remix experience
  • Tagging demonstration

Search widget – pull together simple search across relevant material in New Zealand archives etc. and make it possible to embed into any web page. As well as the search widget, built a fuller ‘search’ experience (sounds like based on Solr) – simply a different presentation layer on same service/content as the widget.

As demonstrator of what opening up data enabled they added an API to the material, and used to build a timeline display using timeline tool from MIT (Simile).
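The Simile Timeline widget consumes a simple JSON list of events, so a mashup like this mostly amounts to reshaping API results. A sketch of that transformation, with hypothetical record fields (the real DigitalNZ API fields may differ):

```python
def to_timeline_events(records):
    """Map metadata records into the Simile Timeline JSON event format."""
    events = []
    for rec in records:
        if not rec.get("date"):
            continue  # a timeline event needs a date; skip undated records
        events.append({
            "start": rec["date"],
            "title": rec.get("title", "Untitled"),
            "description": rec.get("description", ""),
        })
    return {"dateTimeFormat": "iso8601", "events": events}

# Hypothetical API search results
records = [
    {"title": "Troopship departs Wellington", "date": "1914-10-16"},
    {"title": "Undated photograph"},
]
print(to_timeline_events(records))
```

Serve the resulting JSON to the timeline widget and the mashup is essentially done – which is the point of opening an API: the hard aggregation work is reused, not repeated.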

Memory Maker – making it possible for users to remix content as a video and submit back to the site – although people were excited by technology, it was making content available that made it possible.

Finally joined Flickr Commons to try out user tagging of content.

These four small projects were laying foundations for something bigger. When building search widget, although focussed on World War I material, asked ‘would you mind us taking everything?’ – and in most cases there was no objection, and there was no extra cost.

Infrastructure built to be scalable and extensible. So move from ‘Coming Home’ to DigitalNZ was a small step – really mainly about presentation because already got content (metadata).

Made a form for ‘collection creation’ – allowed building of a search of a subset of the whole collection – based on various criteria – e.g. keyword search. The application then set up a widget that you could paste into your web page, and also a pointer to the fuller ‘search page’ for the subcollection. This was used to create the Coming Home subcollection – but also the tool was opened up to anyone – so any member of the public could build their own subcollection, complete with search widget and web page (think this is genius!)
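A self-service widget builder of this kind essentially serialises the user’s filter criteria into an embed snippet plus a search-page URL. A sketch of the idea – all URLs and parameter names here are invented, not DigitalNZ’s actual ones:

```python
from urllib.parse import urlencode

def build_widget(base_url, filters):
    """Turn a user's subcollection filters into a search-page URL
    and an embeddable widget snippet (hypothetical endpoints)."""
    query = urlencode(filters)
    search_page = f"{base_url}/search?{query}"
    embed = f'<script src="{base_url}/widget.js?{query}"></script>'
    return search_page, embed

page, embed = build_widget("https://example.org",
                           {"q": "armistice", "year": "1918"})
print(page)   # link to the full search page for the subcollection
print(embed)  # snippet the user pastes into their own web page
```

Because the filters live in the URL, every public subcollection is just another query against the same shared index – no new infrastructure per collection.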

The API that was used to do the timeline mashup described above was documented and opened up to anyone – and people started to build stuff, although Andy feels it is early days for this, and more could be done (and hopes it will be)

Next stage was to start talking about this as a National initiative – but all the pieces were in place.

Timeline for project was:

  • May 08 – governance approval for concepts
  • July 08 – Began s/w development activity
  • Nov 11th 08 – Launch of Coming Home projects (Armistice Day provided an absolute deadline for the project!)
  • Dec 1st 08 – Launch of full aggregation, custom search and API service

Took a long time to negotiate and discuss before May 2008 – but the Armistice Day deadline focussed minds – had to get agreement and move forward and do something.

There is the ‘vision’ – but Andy says even as programme manager the vision feels like something to secure funding – so the vision used to inform team mission:

Helping people find, share and use New Zealand digital content

This mission was how the team described it – so they had ownership over the concept.

DigitalNZ wasn’t the first initiative to do something similar in New Zealand – so you need to say how you are different to these – there is always a lot of history. Andy comparing a previous aggregation in NZ to Europeana – with agreed standards etc. that contributors need to sign up to. DigitalNZ decided not to have standards – they would take what they could get! Important to bring other initiatives and those involved along with you and get their support as far as possible.

DigitalNZ had limited development capacity – so teamed up with vendors – but teamed up with 3 vendors which had expertise overlap to cover:

  • User experience design
  • Front-end development
  • Search infrastructure

Because of overlap between vendors, each could lead in an area, but could backfill in other areas where necessary.

DigitalNZ completely depended on agile development methodology – specifically Scrum – this approach means you deliver real working software every two weeks – which makes it clear to collaborators that you are getting stuff done – they can see real progress.

Knew from the beginning that branding would be an issue – but perhaps underestimated how much of an issue. Andy says all organisations have egos – and also others see them in specific ways. So although the initiative was led by National Library of New Zealand – this is not the up front branding. This means it is more seen as a true collaboration of equals.

Had to make low barriers to entry, as there was no money for collaborators to take part. One of the things they did was to accept metadata in any format and quality. So that could mean scraping websites etc. Then deal with the issues arising from this later – not make it the collaborators’ problem – otherwise many simply wouldn’t be able to participate.

In some cases got enthusiastic initial response from partners, but then could get bogged down in local internal discussions. Again the initial Armistice Day deadline meant decisions were made more quickly.

After launch of services, have kept in project mode – Andy says the phrase “business as usual” is unacceptable! So this means still doing 2 week development cycles using same Scrum methodology etc. Need to think about this as you plan projects on a national scale.

The next step from getting the metadata and search was to look at how to create digital content – digitise content (unavailability of content in digital format is usually the biggest barrier to getting access). Set up ‘Make it Digital’ toolkit – advice for those wanting to digitise, but also includes a voting tool for the public to suggest material for digitisation.

Took the search widget/search page creation tool and starting to apply to richer content – can launch a new collection instance in 2 hours (although not including graphic design) – wow!

Now also running a Fedora instance to allow hosting of content that hasn’t got anywhere else to live – e.g. for organisations who can’t run their own repository.

Now grappling with:

  • The focus of content to be included in DigitalNZ Search – should it start to pull in relevant content from bodies outside NZ?
  • The balance of effort between central and distributed tools – focus is more on distributed approach – then local organisations can do marketing etc.
  • The balance of effort split between maintenance of existing solutions and development of new solutions – challenged to grow services without any more money
  • The availability of resource to fund digitisation consultancy, workshops and events – often what is needed is money but this is not available

What has worked for DigitalNZ?

  • Have the team articulate vision
  • Start with small exciting projects
  • Be clear about your points of difference (to other projects in same space)
  • Lower barriers to participation
  • Use branding and design to inspire commitment – a lot of effort goes into making what they do look good
  • Invest time building strong relationships with collaborators
  • Have deadlines that can’t be moved
  • Once you are in the door you can up-sell the initiative
  • Build for reuse and extensibility
  • Iterate in small fast cycles – Andy says he can’t recommend this enough – better to do 2 days of requirements analysis and then deliver something, then iterate again
  • Have lightweight governance and a team of experts
  • Get on with it and refactor in response to change

Q & A

Q: (Jill Griffiths, CERLIM) How many people have responded and engaged with ability to suggest content for digitisation

A: About 100 items have been nominated – and the most popular item has about 600 votes. Also have a weighted scorecard for organisations to help them decide on priorities – one aspect on the scorecard is user demand, which is where this tool is used to inform. Also doing work on microfunding digitisation – e.g. $10k (NZ $)

Q: (?) How did you cope with jealousy of big collections (e.g. Turnbull)

A: Not a problem. Building on previous initiatives so many of the concepts not new and already had been agreed – e.g. exposing metadata through other platforms. Education is part of the process.

SORT – Launch of JISC and RLUK Resource Discovery Vision

This final session of the day is to launch this ‘Resource Discovery Vision’ which was developed by a Taskforce (of which I was a part – so I may be biased). David Baker (Deputy Chair, JISC) is introducing the work of the taskforce and the resulting vision.

Terms of Reference were:

  • Define the requirements for the provision of a shared UK resource discovery infrastructure for libraries, archives, museums and related resources to support education and research
  • Focus on metadata that can assist in access to resources, with special reference to serials, books, archives/special collections, museum collections

The reasons for developing this vision:

  • to enable UK HE to implement a fit for purpose infrastructure to underpin the consumption of resources … for the purpose of research and learning
  • To address the key challenge of providing end users with flexible and tailored resource discovery and delivery tools

The vision:

UK students and researchers will have easy, flexible access to content and services through a collaborative, aggregated and integrated resource discovery and delivery framework which is comprehensive, open and sustainable.

The aim is to realise the vision sooner rather than later – by 2012:

  • Integrated and seamless access to collections in libraries, museums and archives in UK HEIs
  • Creation of a thorough and open aggregated layer – designed to work with all major search engines
  • Provision of a diverse range of innovative and personalised resource discovery services
  • Avoidance of duplication of effort
  • Existing resource discovery services encouraged to develop and innovate
  • Data will be available to commercial organisations to develop services
  • Data and functionality will need to be diffused to other software

David stresses that focus initially on UK HE – but this is focus for first phase and doesn’t rule out wider consideration later.

What now?

  • Programme of Work
  • Ongoing dialogue
  • Buy-in
  • Partnerships
  • Quick Wins

Now Rachel Bruce to give more detail:

What does the vision address? Looking at The (Digital) Library Environment: Ten Years After by Lorcan Dempsey – two points specifically:

  • “Aggregation of supply and demand”
  • “Making our data work hard, engagement and co-creation”

What is an aggregation? Need to look at how we work with national and international aggregations – e.g.:

  • Archives Hub
  • Repositories UK
  • Digital New Zealand
  • Europeana
  • Culture24

Rachel showing linked data diagram – not to say ‘linked data’ is the way forward, but that the vision is linked to this idea of linked datasets.

What are we doing? Implementation plan in draft at

Rachel mentions some work I’m going to be involved with around a resource relating to Open Bibliographic Data (working with Sero Consulting and Paul Miller from Cloud of Data)

Now Mike Mertens from RLUK talking about their involvement and view. Resource Discovery highlighted in RLUK Strategic Plan (2008-2011)

RLUK is able to deploy an aggregation of some 16 million items, and has a strong belief in open data and public good. RLUK currently sells metadata – so making this available openly and freely has an impact on their income – means change in meaning, relevance, business operations, scope and purpose. [really glad RLUK is tackling this head on – great to hear this recognition of the real impact and the willingness to deal with it]

Comments and Q & A

Q: (Linda someone?) How does this relate to other initiatives – e.g. we are already putting data into Europeana

A: (Rachel) Need to build partnerships – still work to be done

Q: (Gurdish Sandhu) Many libraries implement ‘new generation’ discovery tools – is this still a valid thing to do in the context of this vision

A: Obviously something that needs to be addressed. Have programme of work about how these tools work and how to surface this type of information on the web.

Q: (Gurdish Sandhu) If this service is only about metadata – won’t it put off users who are used to retrieving full-text via Google?

A: (Mike) There is a risk – but finding out about something is still important. Better to try to bolster content to be produced which could link to things later. This is a vision – a firm push in the right direction

Q: (Peter Burnhill) [apologies if I’ve mangled this q] There is a difference between the digital and the physical – and discovery to delivery works differently. Should the focus be on one or the other?

A: (Rachel) Yes, there is a difference – some work on this already underway

Comment: (Paul Ayris) Putting vision in European context – especially Europeana – build a Pan European aggregator to feed metadata into Europeana. Would prefer to work with National level aggregators – which isn’t currently possible in the UK – but this vision might enable it.

SORT – Advanced text mining tools and resources for knowledge discovery

Penultimate session of the day – Sophia Ananiadou from NaCTeM (National Centre for Text Mining)

What is text mining? – takes us from text to knowledge.

  • Yields precise knowledge nuggets from a sea of information -> Knowledge Extraction
  • Extraction of ‘named entities’ – e.g. names of people, institution names, diseases, genes, etc. etc.
  • Discovery of concepts allows semantic annotation and enrichment of documents – improves information access (goes beyond index terms) and allows clustering and classification of documents
  • Extracts relationships, events and even opinions, attitudes etc. – for further semantic enrichment

Need a toolkit:

  • Resources – lexica, grammars, ontologies, databases
  • Tools – parsers, taggers, named entity recognisers
  • Annotated corpora
  • Domain adaptation

Sophia talking in a bit more detail about how you go about doing text mining:

  • Start with syntactic analysis
  • Use Named Entity Recognition to extract terms/semantic entities
  • Use parsers to extract other aspects – events, sentiments etc.

All this allows the creation of annotations – semantic metadata.
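The pipeline Sophia describes – recognise entities, then attach semantic annotations on top of the raw text – can be sketched in a toy form. This is purely illustrative: real systems such as NaCTeM’s use trained statistical models, not regexes, and the patterns and labels below are made-up assumptions.

```python
import re

# Toy rule-based sketch of named entity recognition producing
# stand-off annotations ("semantic metadata") over a text.
# Patterns and labels are illustrative assumptions only.
ENTITY_PATTERNS = {
    "GENE":    re.compile(r"\b(BRCA1|TP53)\b"),
    "DISEASE": re.compile(r"\b(cancer|diabetes)\b", re.IGNORECASE),
}

def annotate(text):
    """Return (start, end, label, surface) tuples – annotations
    layered on top of the text rather than altering it."""
    spans = []
    for label, pattern in ENTITY_PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), label, m.group()))
    return sorted(spans)

text = "Mutations in BRCA1 are associated with breast cancer."
print(annotate(text))
```

Annotations like these are what then feed the clustering, classification and relationship-extraction steps mentioned above.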

Some examples of text mining applications:

Sophia suggests we should be integrating ‘Language Technology’ into open and common e-research infrastructure to enable the use of text mining tools on the content. See U-Compare tool from NaCTeM –

Q & A

Q: (David Flanders) If I was a repository manager which tool would you recommend I play with first?

A: All of them! Need to work out what you want to do and pick appropriate tool

SORT – Linked Data: avoiding a ‘Break of Gauge’ in your web content

Tom Heath from Talis… slides at

Tom using his journey to work (Bristol -> Birmingham) as analogy …

  • 25 minute walk to the station
  • Train Bristol Temple Meads -> Birmingham New Street
  • Train to Birmingham New Street -> Birmingham International
  • Bike from Birmingham International to Talis offices

The rail network allows this to happen. However, looking back to the 1800s, the same journey would have taken around 4 days before the rail link was built (the Birmingham to Gloucester railway – see ). However, even when the rail link was built, you had to change at Gloucester to get a train to Bristol, because the Bristol->Gloucester track was a different gauge to the Gloucester->Birmingham track.

The situation for data in HE is currently like the picture before the national rail network was developed – lots of isolated nodes of data. While possible to do custom links between various datasets – it is difficult to answer questions that might require links across many datasets.

At the moment we might be able to mashup data from several sources – but it costs us each time we do it (as with changing trains at Gloucester). Linked Data makes it possible to combine the data without this ‘each time’ cost.
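Tom’s point about avoiding the ‘each time’ cost can be sketched with a toy example: when two datasets use shared URIs as identifiers, combining them is a simple union rather than a bespoke mapping job per pair. All the URIs and values below are invented for illustration.

```python
# Two independently maintained datasets, expressed as simple
# (subject, predicate, object) triples using shared URIs.
library_data = {
    ("http://example.org/book/42", "dc:title", "A History of Railways"),
    ("http://example.org/book/42", "dc:creator", "http://example.org/person/brunel"),
}

archive_data = {
    ("http://example.org/person/brunel", "foaf:name", "Isambard Kingdom Brunel"),
    ("http://example.org/person/brunel", "ex:builtLine", "Bristol-Gloucester"),
}

# Merging is just set union – the shared URI does the linking,
# with no per-pair conversion code.
merged = library_data | archive_data

def describe(subject, triples):
    """Return all property/value pairs recorded for one URI."""
    return sorted((p, o) for s, p, o in triples if s == subject)

# Follow the link from the book's creator to the person's details.
creator = next(o for s, p, o in merged
               if s == "http://example.org/book/42" and p == "dc:creator")
print(describe(creator, merged))
```

Each new dataset that reuses the same URIs joins the network at no extra integration cost – the ‘standard gauge’ in Tom’s analogy.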

Tom’s ‘take home’ messages:

  • Building physical networks adds value to the places which are connected – the Birmingham<->Bristol railway was built for a reason not arbitrarily – allowed transport of goods from a port to inland city
  • Building virtual networks adds value to the things which are connected
  • Linked Data enables us to build a network or ‘Web’ of data sets


  • No need for a ‘Big Bang’ – exploit existing infrastructure; build a backbone
  • Costs? – As for any infrastructure investment; Bootstrapping cost vs cost savings and value of things that wouldn’t otherwise get done

Q & A

Q: (David Flanders) Are there some examples that people can look at for guidance

A: Biggest example – – example of infrastructure that allows devolved ownership of URIs – which separates out the URI namespace from Department names etc. – lots of really good practice. See also Jeni Tennison’s blog – If the UK Government can do it – any University can do the same.

Q: (Mike Ellis) Conceptually great, but in reality it is too hard – better to do what you can?

A: Don’t agree 🙂 Anyone can get the idea of a network of things connected – just draw a spider diagram and you’ve got the idea. The technical challenge is new – but there are always technical challenges – we all need to learn new things to deal with this – but whatever happens next this will be true

Q: (Peter Burnhill) Machine readable is key. In the past we got hung up on ‘channels’ as opposed to data models. Need to move to the place where publication of schemas is a great thing to do.

A: Agree. Any institution publishing ontologies or vocabularies that is then reused – gets ‘credit’ by reuse of their URIs …

Q: (David Kay) I’m with Mike – this is closer to solving the ‘authority file’ problem rather than data model problem. If we’ve continually failed to solve this problem aren’t we bound to fail with this attempt as well?

A: Need to stop thinking of an ‘authority’ answer – there may be lots of answers – and they may be contradictory. But this is what will allow you to scale – you will use the one that is most useful to you.

Q: (Liz Lyon) Just to mention ‘Concept Web Alliance’ in Bioinformatics is looking at describing concepts using RDF

SORT – Geo-spatial as an organising principle

James Reid from EDINA (presentation available at

Digimap has just had its 10th birthday – a practical exemplar of the efficacy of shared services.

Geo-spatial information has had huge and rapid takeup over recent years. What are the drivers behind this?

  • Grass roots – hacker driven, web 2.0 type approach – informal
  • Top down, governance heavy, standards driven – formal

Technology drivers – increasingly we all have GPS devices (even if we don’t know it)

What is Geospatial? Can be direct – aerial photography, maps, etc. Also indirect – e.g. location information on Flickr

80% of all organisational information is geographic.

James talking about ‘Unlock‘ ( – a range of tools. Indirect georeferencing can happen through different routes – place names, parish names, coordinates, postcodes – Unlock makes these transparent by translating from one to another. Building an infrastructure for geospatial services (think I got that right)
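The translation step described above – turning an indirect reference like a place name or postcode into coordinates – amounts to a gazetteer lookup. The sketch below is a minimal toy version; the lookup table and coordinate values are invented, and it does not call any real Unlock API.

```python
# Toy gazetteer mapping different kinds of indirect reference to
# coordinates. Entries are made-up illustrative values.
GAZETTEER = {
    "place_name": {"Edinburgh": (55.953, -3.189)},
    "postcode":   {"EH8 9LW":   (55.947, -3.187)},
}

def georeference(ref_type, value):
    """Translate an indirect reference (place name, postcode, ...)
    into (lat, lon) coordinates, if the gazetteer knows it."""
    return GAZETTEER.get(ref_type, {}).get(value)

print(georeference("place_name", "Edinburgh"))  # (55.953, -3.189)
print(georeference("postcode", "EH8 9LW"))
```

A service like this makes the different referencing routes interchangeable, since each resolves to the same coordinate space.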

Inspire – European Directive to improve the sharing of geospatial information, and make more accessible to the public – now part of UK law –

Inspire applies to universities, and covers aspects such as discovery and licensing as well as just making data available. Inspire covers a wide range of data – especially if you look at Annex III – likely to impact on Universities (looks from the roadmap at that this means December 2013 is a date to look at for Annex III)

Things we need to be looking at (from industry ‘Foresight’ study):

  • Augmented reality – forecast to become mainstream in next 5 years
  • Cartography and visualisation – to make sense of the vast amounts of geodata
  • Global
  • Satellite imagery
  • Semantic web
  • Software: Rise of Open Source, realtime 3D, browser as primary UI

Also political and environmental drivers …

Many drivers not specific to Geospatial

Inspire is a ‘stick’ although also a ‘carrot’.