Do you Read to Learn?

I’ve been promising a blog post about my entry into the JISC MOSAIC competition for a while now, so here goes.

The JISC MOSAIC competition was basically about demonstrating different ways in which library usage data could be exploited. The data made available for the competition is from the University of Huddersfield, where Dave Pattern has led the way in putting this type of data to work. I was also keen to dust off my rather rusty coding skills. I have to admit that when I first saw the large XML files that the project was offering, I was slightly worried – doing any kind of analysis on the files looked like it was going to be a bit of work. Luckily very soon after the competition was announced, Dave offered a simple API to the data which definitely looked more my kind of thing – a relatively simple XML format, with nice summary information available.

I had originally thought that working on the competition might give me the push I needed to learn a new programming language – trying to get up to speed with Python or Ruby has been on my todo list for a while. However I ended up falling back on the language I’ve used most in the past – Perl. Several years ago I wrote some Perl scripts to parse various XML files so I was confident I could pick this up again. I was also slightly surprised that Perl still seemed to have some of the most extensive XML parsing options (although this may be simply due to my pre-existing knowledge – I’d be interested to hear what other languages I should be looking at).

I wanted to come at the data from a slightly different angle. I had two ideas:

  • Generate purchase recommendations for libraries by finding the items they already owned in the usage data, and finding those linked items (in the usage data) that are not already owned
  • Get people to upload lists of books they owned/liked, find which courses they were linked to by the usage data, and suggest courses the person might be interested in

I’d have liked to do both (and at one point thought I might pull this off with some help), but in the end I went with the second of these.

The idea was that if we know which books students on a specific course use, then someone who really likes those books may well find the course interesting. I’m still unsure of whether this assumption would be borne out in practice, and I’d be interested in comments on this. My program basically needed to:

  • Allow you to upload a list of books (I went for a list of ISBNs for simplicity)
  • Check which course codes those books were related to
  • Find where courses matching those course codes were available
  • Display this information back to you

The first thing I realised was how much Perl I’d forgotten – it took me quite a while to get back into it, and even now looking at the script I can see things that I would do quite differently if I were to start over.

I was able to pinch quite a few bits from existing tutorials and examples on the web (this is one of the great things about using Perl – lots of existing code to use). Things like uploading a file of ISBNs were relatively trivial. I’m not going to run through the whole thing here, but the bits I want to highlight are:

Dealing with UCAS
UCAS really don’t make it easy to get information out of their website on a machine-to-machine basis. I’ve done an entire post on scraping information from UCAS, which I’m not going to rehash here, but honestly if we are going to see people developing applications which help individuals build personalised learning pathways through Higher Education courses this has got to improve.

How much overlap is significant?
The first set of test data I used was the ISBNs from my own LibraryThing account. This is a free account, so limited to 200 items – so this was approximately 200 ISBNs. I realise that most people are not going to have a list of 200 ISBNs to hand (a major issue with what I’m proposing here), but it seemed like a good place to start. However, I found that only 2 of these 200 items matched items in the usage data from Huddersfield. Initially these two items resulted in several course recommendations – because I’d assumed that any overlap was a ‘recommendation’. However it was immediately apparent that the fact I owned ‘The Amber Spyglass’ by Philip Pullman didn’t really imply I’d be interested in studying History with English Language Teaching, or that owning Jane Eyre meant I’d be interested in Community Development and Social Work – these were just single data points, and amounted to ‘coincidence’.

Given this, I introduced the idea of ‘close matches’ which meant that you owned/read at least 1% of all the items associated with a course code. However, this led to my own data generating zero matches – not a good start. For the purposes of demonstration I basically faked some sets of ISBNs which would give results. I have no idea whether 1% is a realistic level to set for ‘close matches’ – it could well be this is too low, but it seemed like a good place to start, and it can easily be adjusted within the script.
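To make the ‘close match’ rule concrete, here is a minimal sketch in Python (the original script is Perl, and this is not taken from it). The function name, the shape of the course-to-ISBN mapping and the sample data are all assumptions made up for illustration; only the 1% threshold comes from the description above.

```python
def close_matches(owned_isbns, course_items, threshold=0.01):
    """Return {course_code: fraction owned} for courses at or above the threshold.

    course_items is a hypothetical mapping of course code -> set of ISBNs
    that the usage data associates with that course.
    """
    owned = set(owned_isbns)
    matches = {}
    for course_code, isbns in course_items.items():
        if not isbns:
            continue  # skip courses with no items to avoid dividing by zero
        fraction = len(owned & set(isbns)) / len(isbns)
        if fraction >= threshold:
            matches[course_code] = fraction
    return matches


# Made-up data: course "X100" has 200 items, course "Y200" has 4.
course_items = {
    "X100": {f"isbn-{n}" for n in range(200)},
    "Y200": {"isbn-a", "isbn-b", "isbn-c", "isbn-d"},
}
owned = ["isbn-0", "isbn-a", "isbn-b"]

# One owned item out of 200 (0.5%) falls below the threshold,
# while two out of four (50%) counts as a close match.
result = close_matches(owned, course_items)
```

Because the threshold is just a parameter, trying 0.5% or 5% instead of 1% is a one-argument change.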

I think it is really important to stress that the only usage data the competition worked against was that from the University of Huddersfield. This was bound to give limited results – any single institution’s data would suffer from the same problem. However, if we were to see usage data brought together from universities across the UK I still think there are some possibilities here (and who knows what might turn up if you added public library information into the mix somehow?).

So – the result is at ReadToLearn and you are welcome to give it a go – and I’m very interested in comments and feedback. I’m hoping to at least partially rewrite the application to use the UCAS screenscraping utility I’ve since developed. Although I’m rather embarrassed by the code, as it definitely leaves a lot to be desired, if you want to you can download the ReadtoLearn code here.

Accessing Sconul Access

This is a very quick lunchtime post to document a script I’ve been working on over the last week or so. SCONUL Access is a scheme that offers reciprocal access to various university libraries across the UK.

The SCONUL Access website allows you to enter details of a UK university affiliation, and then will list details of those libraries which you can use via the reciprocal agreement scheme (you have to apply for a SCONUL access card at your ‘home’ institution before you can use the other libraries).

I’ve occasionally thought it would be nice to do something like map the results of a SCONUL Access enquiry on a Google map, or integrate the question of ‘which libraries can I use’ with ‘where can I get a book’ – so that users could potentially do a search of all the libraries they can access (perhaps limited by a geographical radius?). Aside from these ideas, the SCONUL Access directory actually contains quite a bit of useful information on each library it lists – including the institution website, the library website and the library catalogue URL.

Further, I was recently inspired by Philip Adams from Leicester (@Fulup) on Twitter who pointed me at http://www.library.dmu.ac.uk/Resources/OPAC/index.php?page=366 which combines information from SCONUL access with the Talis Silkworm directory to show SCONUL Access libraries (relevant to those at the University of Leicester I guess) on Google Maps.

Unfortunately the SCONUL Access website doesn’t provide an API to query the data it has on the libraries, so I thought I’d start writing something. I haven’t (yet anyway) tried to replicate the function that SCONUL Access provides of taking user details and giving a list of available libraries – to get this function you still have to go to the SCONUL Access website and fill in their forms. What my script does is simply provide SCONUL Access member library details in an XML format. The script lives at:

http://www.meanboyfriend.com/sconulaccess

It supports three modes of use:

1. Summary of all SCONUL Access libraries
URL: http://www.meanboyfriend.com/sconulaccess
Function: returns a summary of all institutions participating in SCONUL Access from their A-Z Listing. This XML (see below for format) only includes the SCONUL Access (internal) code for the library, the name of the institution and the URL for the full SCONUL Access record

2. Full records for specified SCONUL Access libraries
URL: http://www.meanboyfriend.com/sconulaccess/? e.g. http://www.meanboyfriend.com/sconulaccess/?institution=2,3,4
Function: returns full records for each institution specified by its SCONUL Access ID in the URL (see full XML structure below)

3. Full records for all SCONUL Access libraries
URL: http://www.meanboyfriend.com/sconulaccess/?institution=all
Function: similar to 2 but returns full records for all institutions that are obtained via 1. This takes some time to return results as it retrieves over 180 records from the SCONUL Access website – so it isn’t recommended for general use.

XML Structure

<sconul_access_results>
 <institution code="4" name="Aston University">
  <inst_sconul_url>
    http://www.access.sconul.ac.uk/members/institution_html?ins_id=4
  </inst_sconul_url>
  <website>http://www.aston.ac.uk/</website>
  <library_website>http://www1.aston.ac.uk/lis/</library_website>
  <library_catalogue>http://library.aston.ac.uk/</library_catalogue>
  <contact_name>Anne Perkins</contact_name>
  <contact_title>Public Services Coordinator</contact_title>
  <contact_email>a.v.perkins@aston.ac.uk</contact_email>
  <contact_telephone>01212044492</contact_telephone>
  <contact_postcode>B4 7ET</contact_postcode>
 </institution>
 <source>
  <source_url>http://www.access.sconul.ac.uk/</source_url>
  <rights>Copyright SCONUL. SCONUL, 102 Euston Street, London, NW1 2HS. </rights>
 </source>
</sconul_access_results>

The <institution> element is repeatable.
For (1) above the only elements returned are:
<institution>
<inst_sconul_url>
<source> (and subelements)
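
As a rough illustration of consuming this format, here is a minimal Python sketch using the standard library’s ElementTree. It parses an inline sample based on the Aston record above rather than calling the live service, and the function name and the choice of fields to keep are assumptions for the example. Elements that are missing (as in the summary mode) simply come back as None.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample record, used in place of a live request.
SAMPLE = """<sconul_access_results>
 <institution code="4" name="Aston University">
  <inst_sconul_url>
    http://www.access.sconul.ac.uk/members/institution_html?ins_id=4
  </inst_sconul_url>
  <website>http://www.aston.ac.uk/</website>
  <library_catalogue>http://library.aston.ac.uk/</library_catalogue>
 </institution>
 <source>
  <source_url>http://www.access.sconul.ac.uk/</source_url>
 </source>
</sconul_access_results>"""


def parse_institutions(xml_text):
    """Return one dict per <institution>, with None for absent elements."""
    root = ET.fromstring(xml_text)
    records = []
    for inst in root.findall("institution"):
        def text_of(tag):
            # findtext returns None when the element is missing
            value = inst.findtext(tag)
            return value.strip() if value else None

        records.append({
            "code": inst.get("code"),
            "name": inst.get("name"),
            "sconul_url": text_of("inst_sconul_url"),
            "catalogue": text_of("library_catalogue"),
            "library_website": text_of("library_website"),
        })
    return records


institutions = parse_institutions(SAMPLE)
```

The strip() call matters because the URL inside inst_sconul_url is surrounded by whitespace in the output.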

Anyway, I’d be interested in comments, and would be happy to look at alternative functions and formats – let me know if there is anything you’d like to see.

UCAS Course code lookup

While I was writing my entry for the JISC MOSAIC competition (which I will write up more thoroughly in a later post I promise – honest), one of the problems I encountered was retrieving details of courses and institutions from the UCAS website. Unfortunately UCAS don’t seem to provide a nice API to their catalogue of course/institution data. To extract the data I was going to have to scrape it out of their HTML pages. Even more unfortunately they require a session ID before you can successfully get back search results – this means you essentially have to start a session on the website and retrieve the session ID before you can start to do a search.

I hacked together something to enable me to do what I needed for the MOSAIC competition. However, I wasn’t the only person who had this problem – in a blog entry on his MOSAIC entry Tony Hirst notes the same problem. At the time Tony asked if I would be making what I’d done available, and I was very happy to – unfortunately the way I’d done it I couldn’t expose just the UCAS course code search. I started to re-write the code, but writing something that I could share with other people, with appropriate error checking and feedback, proved more challenging than my original dirty hack.

I’ve finally got round to it – it works as follows:

The service is at http://www.meanboyfriend.com/readtolearn/ucas_code_search?
The service currently accepts two parameters:

  • course_code
  • catalogue_year

The course_code parameter simply accepts a UCAS course code. I haven’t been able to find out what the course code format is restricted to – but it looks like it is a maximum of 4 alphanumeric characters, so this is what the script accepts. Assuming the code meets these criteria, the script passes this directly to the UCAS catalogue search. The UCAS catalogue doesn’t seem to care whether alpha characters are upper or lower case and treats them as equivalent. For some examples of UCAS codes, you can see this list provided by Dave Pattern. (see Addendum 2 for more information on UCAS course codes and JACS)

The catalogue_year parameter takes the year in the format yyyy. If no value is given then the UCAS catalogue seems to default to the current year (2010 at the moment). If an invalid year is given the UCAS catalogue also seems to default to the current year. It seems that at most only two years are valid at a single time. However the script doesn’t check any of this – as long as it gets a valid four digit year, it passes it on to the UCAS catalogue search.

An example is http://www.meanboyfriend.com/readtolearn/ucas_code_search/?course_code=R901&catalogue_year=2010
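
For anyone calling the service from their own code, a minimal Python sketch of building the request URL might look like the following. The function name is hypothetical, and the client-side checks only mirror the rules described above (up to 4 alphanumeric characters, a four digit year); the service performs its own validation regardless.

```python
import re
from urllib.parse import urlencode

BASE_URL = "http://www.meanboyfriend.com/readtolearn/ucas_code_search/"


def build_search_url(course_code, catalogue_year=None):
    """Build a query URL, rejecting values the service would not accept."""
    if not re.fullmatch(r"[A-Za-z0-9]{1,4}", course_code):
        raise ValueError("course_code must be 1 to 4 alphanumeric characters")
    params = {"course_code": course_code}
    if catalogue_year is not None:
        year = str(catalogue_year)
        if not re.fullmatch(r"[0-9]{4}", year):
            raise ValueError("catalogue_year must be in the format yyyy")
        params["catalogue_year"] = year
    return BASE_URL + "?" + urlencode(params)


url = build_search_url("R901", 2010)
```

Leaving out catalogue_year omits the parameter altogether, in which case the UCAS catalogue appears to fall back to the current year as described above.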

The script’s output is xml of the form:

<xml>
<ucas_course_results course_code="" catalogue_year="" ucas_stateid="">
<institution code="" name="">
<course_name>xxxx</course_name> (repeatable)
</institution>
</ucas_course_results>

(I’ve made a slight change to the output structure since the original publication of this post)
(Finally I’ve added a couple of extra elements inst_ucas_url and course_ucas_url which provide links to the institution and course records on the UCAS website respectively)

<xml>
<ucas_course_results course_code="" catalogue_year="" ucas_stateid="">
<institution code="" name="">
<inst_ucas_url>[URL for Institution record on UCAS website]</inst_ucas_url>
<course ucas_catalogue_id=""> (repeatable) (the ucas_catalogue_id is not currently populated – see Addendum 1)
<course_ucas_url>[URL for course record on UCAS website]</course_ucas_url>
<name>xxxx</name>
</course>
</institution>
</ucas_course_results>

For example:

<ucas_course_results course_code="R901" catalogue_year="2010" ucas_stateid="DtDdAozqXysV4GeQbRbhP3DxTGR2m-3eyl">
<institution code="P80" name="University of Portsmouth">
<course_name>Combined Modern Languages</course_name>
</institution>
</ucas_course_results>

(I’ve made a slight change to the output structure since the original publication of this post)
(Finally I’ve added a couple of extra elements inst_ucas_url and course_ucas_url which provide links to the institution and course records on the UCAS website respectively)

<ucas_course_results course_code="R901" catalogue_year="2010" ucas_stateid="DtDdAozqXysV4GeQbRbhP3DxTGR2m-3eyl">
<institution code="P80" name="University of Portsmouth">
<inst_ucas_url>
http://search.ucas.com/cgi-bin/hsrun/search/search/StateId/DtGJmwzptIwV4rADbR8xUfafCk6nG-Ur61/HAHTpage/search.HsInstDetails.run?i=P80
</inst_ucas_url>
<course ucas_catalogue_id=""> (the ucas_catalogue_id is not currently populated – see Addendum 1)
<course_ucas_url>
http://search.ucas.com/cgi-bin/hsrun/search/search/StateId/DtGJmwzptIwV4rADbR8xUfafCk6nG-Ur61/HAHTpage/search.HsDetails.run?n=989628
</course_ucas_url>
<name>Combined Modern Languages</name>
</course>
</institution>
</ucas_course_results>

The values fed to the script and the StateID for the UCAS website are returned in the response.

If there is an error at some point in the process, an error message will be included in the response in an <error> tag.
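
A minimal Python sketch of reading this response, assuming the newer structure with <course> and <name> elements. Where exactly the <error> tag appears in the document isn’t specified above, so the sketch looks for it anywhere in the tree; that, and the function name, are assumptions.

```python
import xml.etree.ElementTree as ET

# Inline sample based on the example above, used instead of a live request.
SAMPLE = """<ucas_course_results course_code="R901" catalogue_year="2010" ucas_stateid="DtDdAozqXysV4GeQbRbhP3DxTGR2m-3eyl">
<institution code="P80" name="University of Portsmouth">
<course ucas_catalogue_id="">
<name>Combined Modern Languages</name>
</course>
</institution>
</ucas_course_results>"""


def parse_courses(xml_text):
    """Return (state_id, [(institution code, institution name, course name)])."""
    root = ET.fromstring(xml_text)
    # Assumption: an <error> element may appear at any depth.
    errors = [(e.text or "").strip() for e in root.iter("error")]
    if errors:
        raise RuntimeError("service reported: " + "; ".join(errors))
    courses = []
    for inst in root.findall("institution"):
        for course in inst.findall("course"):
            courses.append(
                (inst.get("code"), inst.get("name"), course.findtext("name"))
            )
    return root.get("ucas_stateid"), courses


state_id, courses = parse_courses(SAMPLE)
```

Raising on <error> up front means callers never mistake an empty result list for a failed lookup.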

Addendum 1
The script relies on the HTML returned by UCAS remaining consistent. If this changes, my script will probably break.

Having done the hard work I’d be happy to offer alternative formats for the data returned by the script – just let me know in the comments. I’d also be happy to look at different XML structures for the data so again just leave a comment.

Something I should have mentioned in the original post. Given the data returned by the script you should be able to form a URL which links to an institution on the UCAS website using a URL of the form:
http://search.ucas.com/cgi-bin/hsrun/search/search/StateId/<insert state ID from xml here>/HAHTpage/search.HsInstDetails.run?i=<insert institution code here>
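
In code, filling in that URL pattern is a one-liner. Here is a small Python sketch; the function and constant names are made up for the example, and the state ID in the usage line is the one from the sample output earlier in this post (state IDs are tied to a UCAS session, so a stale one may well not work).

```python
from urllib.parse import quote

UCAS_SEARCH_BASE = "http://search.ucas.com/cgi-bin/hsrun/search/search"


def institution_url(state_id, institution_code):
    """Fill the state ID and institution code into the URL pattern."""
    return (
        f"{UCAS_SEARCH_BASE}/StateId/{quote(state_id, safe='')}"
        f"/HAHTpage/search.HsInstDetails.run?i={quote(institution_code, safe='')}"
    )


link = institution_url("DtGJmwzptIwV4rADbR8xUfafCk6nG-Ur61", "P80")
```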

Since finishing this work last night I’ve realised that I’ve left out one important piece of data which is an identifier that would let you form a link to a specific course from a specific institution. I have slightly restructured the XML to leave a space for the ucas_catalogue_id in the XML. I’ll add this in as soon as I can.
This has now been added.

Addendum 2
I’ve just found quite a bit more detail on the format and structure of the UCAS ‘course codes’. UCAS now uses JACS (Joint Academic Coding System) for course codes (see JACS documentation from HESA). JACS codes consist of 4 characters, the first being an uppercase letter and the remaining three characters being digits. JACS codes are essentially hierarchical, with the first character representing a general subject area and the digits representing subdivisions (with increasing granularity). The codes in the UCAS catalogue are a mixture of JACS 1.7 and JACS 2.0 codes. A full listing of JACS v2.0 codes is available from HESA, and a listing of JACS v1.7 codes is available from UCAS as a pdf.

UCAS have an explanation of why and where they use both JACS v2.0 and JACS v1.7.

However because UCAS need to code courses which cover more than one subject area, they have rules for representing these courses while sticking to codes with a total length of 4 characters. These rules are summarised on the UCAS website, but a fuller description is available in pdf format. This last document is most interesting because it indicates how you might create the UCAS code from a HESA Student Record which could be of interest for future mashups.

The implications of all this for my script are relatively small as I currently assume that there is a 4 character alpha-numeric code. On the basis of this documentation I could refine this to check for 3 alpha-numeric characters followed by a single digit I guess – perhaps I will at some point.
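
To compare the checks discussed here, a small Python sketch with three patterns: the loose check the script currently applies, the possible refinement mentioned above, and the plain single-subject JACS shape (an uppercase letter followed by three digits). The pattern names are made up, and apart from R901 the codes in the test are invented strings rather than real UCAS codes.

```python
import re

PATTERNS = {
    # what the script currently accepts: up to 4 alphanumeric characters
    "loose": re.compile(r"[A-Za-z0-9]{1,4}"),
    # possible refinement: 3 alphanumeric characters followed by a digit
    "refined": re.compile(r"[A-Za-z0-9]{3}[0-9]"),
    # single-subject JACS code: uppercase letter followed by three digits
    "plain_jacs": re.compile(r"[A-Z][0-9]{3}"),
}


def classify(code):
    """Report which of the patterns a candidate code satisfies."""
    return {name: bool(p.fullmatch(code)) for name, p in PATTERNS.items()}


r901 = classify("R901")
```

Since the UCAS catalogue treats upper and lower case as equivalent, a stricter check like plain_jacs would want the input upper-cased first.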

Finally it looks like UCAS and HESA are currently looking at JACS v3.0 which could introduce further changes I guess, although it looks unlikely that this will affect the code format, but rather the possible values, and maybe the meaning of some values. While this isn’t a problem for my script, it would mean that historical course codes from datasets such as MOSAIC could not be assumed to represent the same subject areas in the current UCAS course catalogue as they did when the data was recorded – which is, to say the least, a pain.

Addendum 3
A final set of changes (I hope):

  • The ucas_catalogue_id is now populated
  • Added inst_ucas_url element which contains the URL linking to the Institution record in the UCAS catalogue
  • Added course_ucas_url element which contains the URL linking to the Course record in the UCAS catalogue

Everyone’s a winner?

The results of the JISC MOSAIC competition were announced this week. The winning entries were great, and I think their prizes were well deserved. The only downside in this was that my entry didn’t make the cut. I will admit to having a moment of disappointment over this, but this passed in about 5 seconds – after all, I’d really enjoyed the challenge of writing my entry and was relatively pleased with the result.

Later in the week I fell into conversation with a couple of people on Twitter about how there hadn’t been much collaboration in the competition. With one notable exception none of the contestants had published early thoughts online, and all the entries had been from individuals rather than teams.

During the course of this conversation I managed to both insult and upset someone I greatly like, admire and respect. For this I am truly sorry. This post is in the way of an apology as well as an attempt to express my own thoughts around the nature of ‘developer competitions’ such as JISC MOSAIC.

The idea of a developer competition is that you set a challenge, aimed at computer programmers and interested others, and offer prizes to the best entries – the criteria can vary wildly. Perhaps the biggest prize of this type we’ve seen is the $1million NetFlix prize, but in the UK HE community where I work there have been a few smaller prizes on offer, and more widely in the UK community there have been prizes for ideas about using government data, and we are about to see one launched on the use of Museum data. The JISC MOSAIC competition offered a 1st prize of £1000 for work on library usage data.

One of the amazing things about the web, and perhaps particularly about the communities I’m engaged in, is the incredible personal commitment made in terms of time and resource by individuals to what many would regard as ‘work’. Both of the people I was talking to put in a great deal of effort into contributing to and developing ideas that many might think of as ‘the day job’ – and they do so with no thought of reward.

So – given this tendency to be self-motivated to solve problems, contribute, take part etc. Why do we need developer competitions?

My starting point is to look at my own motivation for entering the JISC MOSAIC competition. Would I have done this work without the competition? Trying to be completely honest here – probably not. However, I would almost certainly have done other things instead – perhaps blogged more, perhaps done some other development (like this). So the competition focussed my energy on a particular area of work. Was I motivated by the cash prize? I’m not sure – at the end of the day it isn’t that relevant to me (although no doubt I could have found something to treat myself to). I think it was just the idea of the ‘competition’ that gave me the focus. I’m the kind of person who works relatively well with clear deadlines – so having a date by which a set of work was to be done definitely gave me something to aim at.

So – the competition was one element. However, I was also looking for ways to dust off my scripting skills. I used to script in Perl as part of my job, but I haven’t done this for several years – I had been looking for ways of picking this up again as it was something I always enjoyed doing. I am also extremely interested in the ideas behind the competition – I believe libraries should be exploiting their usage data more, and I was keen to show the community how valuable that usage data might be.

I don’t assume that others are motivated in the same way as me. When the usage data that was part of the JISC MOSAIC competition was first put online somebody immediately took it and transformed it into RDF – they weren’t motivated by a competition, they just did it.

My conclusion is that such competitions harness existing energy in the community and focus it on a particular problem for a particular time period. It won’t generally work where people aren’t inclined to do the work anyway. You need an interesting problem or proposition to engage people.

So far, so good? I’m not sure. The problem with a competition is that it is, well, competitive. Again trying to be honest about my own situation (and I’m not particularly proud of this, so don’t take it as an endorsement of my own approach), I immediately became more protective of my ideas. The competition had put a ‘value’ on them that they hadn’t previously had. I should say I actually started work on two entries to the competition – one was in collaboration with someone else, which unfortunately we weren’t able to pull together in time – so it wasn’t all about ‘me’. However, I didn’t announce my own entry until I was ready to submit. This isn’t how I usually work – I’m usually happy to share half baked ideas (as readers of this blog will know only too well!).

Again I think the factors around this are complex. It wasn’t just that I didn’t want to give away my idea. The truth is that I’m not a very good programmer. I wanted to take this chance to develop my programming skills (or at least get myself back to my previous level of incompetence). I am under no illusions – any developer worth their salt could take my idea and do a better job with it. In general this would be great – if my idea is good enough to inspire other people to do it much much better than I can I’d be very happy. But for the period of the competition this suddenly seemed like a bad idea.

Reflecting on this now, this shows a pretty rubbish (on my part) attitude to others – the ‘fear’ that my idea would be ‘stolen’ (and of course the egoism that says my idea was worth stealing). I’m pretty confident in retrospect that the only possible outcome of publishing early would have been a better entry (possibly in collaboration with others). However, I would say that my guess is it would have resulted in me not doing the coding – which I would have been sorry about.

I am going to blog my entry in detail, and release all the work I’ve done – which others are more than welcome to use and abuse.

So although I think developer competitions work in terms of focussing people on a problem, I think there are some possible downsides, perhaps chief of which is that competitions may discourage collaboration. I don’t think this is a given though, and so in closing here are some thoughts that future developer competitions might want to consider:

  • Is there an element in your competition that encourages team entries above or as well as individual entries?
  • Can you reward collaboration either within or outside the competition structure?
  • How are you going to ensure that the whole community can share and benefit from the competition outcomes? Plan this from day 1!

Perhaps consider splitting the prizes in different ways to achieve this – not one ‘big winner’, but rather judging and rewarding contributions as you go along. Perhaps consider having a ‘collaboration’ environment where ideas can be submitted (and judged separately) and where teams can form and work together.

A final thought – I really enjoyed entering the JISC MOSAIC competition – it stretched my skills and scratched an itch for me. I am in no way disappointed I didn’t win – the winning entries were very deserving. I fully intend to do more scripting/programming going forward. And sharing.

Would you recommend your recommender?

We are starting to see software and projects emerging that utilise library usage data to make recommendations to library users about things they might find useful. Perhaps the most famous example of this type of service is the Amazon ‘people who bought this also bought’ recommendations.

In libraries we have just had the results of the JISC MOSAIC project announced, which challenged developers to show what they could do with library usage data. This used usage data from Huddersfield, where Dave Pattern has led the way both in exploiting the usage data within the Huddersfield OPAC, and also in making the data available to the wider community.

On the commercial side we now have the bX software from Ex Libris, which takes usage data from SFX installations across the world (SFX is an OpenURL resolver which essentially helps make links between descriptions of bibliographic items and the full text of the items online). By tracking what fulltext resources a user accesses in a session and looking at behaviour over millions of transactions, this can start to make associations between different fulltext resources (usually journal articles).

I was involved in trialling bX, and when I talked to some of the subject librarians about the service, the first question they wanted answered was “how does it come up with the recommendations?”. There is a paper on some of the work that led to the bX product, although a cursory reading doesn’t tell me exactly how the recommendations are made. Honestly I actually hope that there is some reasonably clever mathematical/statistical analysis going on behind the recommendation service that I’m not going to understand. For me the question shouldn’t be “how does it work?” but “does it work?” – that is, are the recommendations any good?

So we have a new problem – how do we measure the quality of the recommendations we get from these services?

Perhaps the most obvious approach is to get users to assess the quality of the recommendations. This is the approach that perhaps most libraries would take if assessing a new resource. It’s also an approach that Google take. However, when looking at a recommender service that goes across all subject areas, getting a representative sample of people from across an institution to test the service thoroughly might be difficult.

Another approach is to use a recommendation service and then do a longitudinal study of user behaviour and try to draw conclusions about the success of the service. This is how I’d see Dave Pattern’s work at Huddersfield, which he recently presented on at ILI09. Dave’s analysis is extremely interesting and shows some correlations between the introduction of the recommender service and user behaviour. However, it may not be economic to do this where there is a cost to the recommender service.

The final approach, and one that appeals to me, is that taken by the NetFlix Prize competition. The NetFlix Prize was an attempt by the DVD/Movie lending company NetFlix to improve their recommendation algorithm. They offered a prize of $1million to anyone who could improve on their existing algorithms by a factor of 10% or more. The NetFlix prize actually looked at how people rated (1-5) movies they had watched – based on previous ratings the goal was to predict how individuals might rate other movies. The way the competition was structured was that a data set with ratings was given to contestants, along with a set of ratings where the actual values of the ratings had been removed. The challenge was to find an algorithm that would fill in these missing ratings accurately (or more accurately than the existing algorithm). This is a typical approach when looking at machine based predictions – you have a ‘training set’ of data – which you feed into the algorithms, and the ‘testing set’ which is the real life data against which you compare the machine ‘predictions’.

The datasets are available at the UCI Machine Learning Repository. The Netflix prize was finally won in September 2009 after almost 3 years.

What I find interesting about this approach is that it tests the recommendation algorithm against real data. Perhaps this is an approach we could look at with recommendation services for libraries – to feed in a partial set of data from our own systems and see whether the recommendations we get back match the rest of our data. As we start to see competition in this marketplace, we are going to want to know which services best suit our institutions.
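
To show what such a test might look like for library data, here is a deliberately simple Python sketch. It is not how NetFlix, bX or the Huddersfield OPAC work: the ‘recommender’ is a bare co-occurrence count, the loan histories are invented, and all the names are hypothetical. The point is only the shape of the evaluation: build recommendations from one part of the data, hide an item from each record in the other part, and count how often the hidden item is recommended.

```python
from collections import Counter
from itertools import permutations


def build_cooccurrence(histories):
    """Count how often each pair of items appears in the same loan history."""
    counts = {}
    for history in histories:
        for a, b in permutations(set(history), 2):
            counts.setdefault(a, Counter())[b] += 1
    return counts


def recommend(counts, known_items, k=3):
    """Suggest the k items that most often co-occur with the known items."""
    scores = Counter()
    for item in known_items:
        scores.update(counts.get(item, {}))
    for item in known_items:
        scores.pop(item, None)  # never recommend what the user already has
    return [item for item, _ in scores.most_common(k)]


def hit_rate(train, test, k=3):
    """Fraction of test histories whose hidden last item is recommended."""
    counts = build_cooccurrence(train)
    hits = 0
    for history in test:
        *known, hidden = history  # hold back the last item as the 'answer'
        if hidden in recommend(counts, known, k):
            hits += 1
    return hits / len(test)


# Invented loan histories: the training set builds the recommender,
# the testing set is the held-back data it is checked against.
train = [["A", "B", "C"], ["A", "B"], ["B", "C"], ["D", "E"]]
test = [["A", "B"], ["D", "A"]]
score = hit_rate(train, test, k=1)
```

With this toy data the recommender recovers the hidden item for the first test history but not the second. The same harness could be pointed at any service that takes known items and returns suggestions, which is what would make a side-by-side comparison of competing services possible.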

Moving Type

I’m in the process of moving this blog. It is now powered by WordPress (rather than Typepad/Movable Type previously). Although I have migrated all the content, links to posts may currently be broken – I’m in the process of fixing these, but it may take me a couple of days – please be patient!

In the meantime I think all the RSS/Atom feeds are working OK and you shouldn’t need to do anything if you subscribe via one of the feeds.

Super! Mashing! Great!

If you like playing around with bibliographic and other library data (and let’s face it, who doesn’t?) then you are in for a good summer.

Two events to get into your diary now are the WorldCat Mashathon in Amsterdam on 13/14 May, and Mash Oop North (tag is mashlib09) in Huddersfield, UK, on 7th July.

The WorldCat Mashathon is an event organised by OCLC which promises access to data derived from the 1.2 billion records in WorldCat via a variety of web services. This event follows on from the previous WorldCat Hackathon held in New York City last year – to get a flavour of the event you can see a video summary of the Hackathon on YouTube. The OCLC Developer Network wiki has further details and registration.

Mash Oop North is the 2009 incarnation of Mashed Libraries UK. As the organiser of the previous mashlib08 event I can’t really comment on exactly how excellent it was, but Mash Oop North is being organised by Dave Pattern and others, so I can objectively say it is bound to be a brilliant event. To see the kind of thing that happened at mashlib08, and to keep up to date with news of Mash Oop North, keep an eye on http://mashedlibrary.ning.com. Mash Oop North is being sponsored by Talis, although if you are interested in supporting the event, you may want to consider donating a prize (if you aren’t sure what prize to offer, may I suggest a speedboat?)

[For a guide to the cultural references in this post, see http://en.wikipedia.org/wiki/Bullseye_(UK_game_show)]

JISC09 Closing Keynote – Ewan McIntosh

The closing keynote is from Ewan McIntosh, who is Digital Commissioner for 4iP – Channel 4’s Innovation for the Public Fund.

Ewan mentioning The Guardian’s Datastore (and reflecting that he wished ‘they’ (presumably Channel 4) had done it first!) – this is a collection of data which the Guardian compiles, and is now making available in ways that encourage reuse (although you have to understand the data to make sensible mashups) – you can see some examples from Tony Hirst on OUseful.info.

Now mentioning ‘MySociety‘ and ‘Theyworkforyou‘ – noting how making data reusable opens up ways of interacting with the data and combining it to uncover new information. However, opening up data is difficult – example of European newspapers accusing Google of ‘stealing’ their information because Google uses headlines from their websites – but Ewan noting that Google is driving traffic to the newspapers via this route.

“Free is a hard price to beat”

Mentioning the John Houghton and Charles Oppenheim report on the economic impact of Open Access – if you rethink the model then there are savings to be made.

“Destination anywhere”

4iP funding lots of projects. But lots of proposals start “X is a site which…” – they are thinking in terms of ‘destinations’ – and Universities are the ‘ultimate destination’. But most people visit only about 6 websites in a day – if you see your website as a destination, then you are saying you are going to compete with those 6 top websites – you are really going to struggle with this.

The VLE is a destination. The only reason students go there is because Universities ‘compel’ them to – it is the only place they can get the information they need. However, this results in students visiting and leaving as soon as they can.

“Participation culture”

Higher Education is not a participatory culture. On the web the current ‘top’ participatory environment is probably Facebook. Example of ‘Who has the biggest brain?’ – like brain training – but you play against others. 50 million players (in 6 months)

iMob – an iPhone game; a text-based strategy game.

Battlefront – a Channel 4 education project – via MySpace and Bebo – encourages young people to get involved in campaigning on issues they care about.

Ewan just said “Hands up if you are not currently twittering” (most of the room) – “you are doing nothing!”. Those twittering are participating – being much more cognitively active.

Ewan describing different ‘spaces’:

  • Watching spaces (tv, theatre, gigs)
  • Participation spaces (marches, meetings, markets)
  • Performing spaces (Second Life, WoW, Home)
  • Publishing spaces (Blogging, Flickr)
  • Group spaces (Bebo, Facebook)
  • Secret Spaces (Mobile, SMS, IM) – sounds like the ‘backchannel’?

The mobile phone is one of the most exciting developments in learning – Google Android and iPhone incredible platforms. But Universities not realising that students are leapfrogging tethered screens to go for mobile. Ewan suggests that the vast majority of students have mobile devices that access the internet – but does your university provide mobile services?

Ewan showing how if you represent his Facebook contacts graphically you can see how the contacts in Academia tend only to be connected to each other – it is a closed world.

“People don’t just do stuff because it’s in your business plan”

Ewan says “I don’t buy the Gen-Y stuff – the Google Generation, the Digital Natives”. It has nothing to do with being ‘young’ – but being ‘youthful’.

Parents think that young people spend about 18.8 hours per week online – but actually they spend an average 43.5 hours per week online – where is this missing time?

Don’t romanticise creativity – it isn’t easy. 90% of the a-v output that people consume comes from LA based corporations – this is not ‘building on the shoulders of giants’.

Access to creative technology comes far too late for most children. Higher Education and JISC can apply pressure to the school sector to give access to, and make use of creative technology.

Ewan says anonymity is not a bad thing. Some examples where anonymity does not work – School of Everything, Landshare (both with money from Channel 4). However, some services only work with anonymity – e.g. Embarrassing Teenage Illnesses (also C4).

Ewan showing a grid that helps you think about startups – but he suggests it could also be used for University web services, or even other activities:

 

                                          | Visitor (just looks at stuff) | Fan (will sign up but not create content) | Contributor (uploads content, comments etc.)
Grab the attention                        |                               |                                           |
Timescale                                 |                               |                                           |
Keep the attention again and again        |                               |                                           |
Timescale                                 |                               |                                           |
Turn the value into a tangible asset      |                               |                                           |
Timescale                                 |                               |                                           |

Ewan encourages us to think about applying the grid to our own online offerings (wonder what this would look like for an OPAC?)

JISC09 – Moving from print to digital: e-theses highlight the issues

I’m chairing this session, so it may be a bit difficult to blog (since I can’t see the screen from the front). The session goes from the international (DART), to the national (EThOS/EThOSNet), to the institutional (the From Entry to EThOS project at King’s College London).

First up, Chris Pressler (from the University of Nottingham) talking about DART:

DART-Europe – started as an 18 month project between a small group of academic institutions and Proquest. The first phase focussed on the creation of a simple search service to e-theses.

In the first phase the technology wasn’t too difficult, but there were some questions about the business model. Proquest have a commercial service in the USA – but it didn’t seem suitable in Europe.

DART-Europe is now in the second phase, administered by Nottingham and UCL – it is no longer a project, but an ongoing service. All partners have a seat on the DART board (really, there is a DART board). Although a UK-led project, it has partners (and potential partners) from across Europe.

  • DART now providing access to over 100,000 full-text e-theses. The thesis records come from:
    • 34 data sources (national, consortial or institutional)
    • 13 countries
    • 150 institutions
  • Daily updates
  • Data collection using simple OAI Dublin Core – but MODS and MARC also supported. Took an extremely simple approach to metadata – just 5 pieces of information per thesis.
  • Takes a pragmatic outlook
    • aims to keep things simple – minimise barriers
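To make the 'simple OAI Dublin Core' approach a little more concrete, here is a minimal sketch of pulling a handful of fields out of a single `oai_dc` record using only the Python standard library. The talk did not say which five pieces of information DART collects, so the five fields chosen here (title, creator, date, publisher, identifier) are purely an assumption for illustration, and the sample record is invented.

```python
import xml.etree.ElementTree as ET

# Standard namespaces for unqualified Dublin Core as carried over OAI-PMH
NS = {
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# ASSUMPTION: these five fields are illustrative only; the talk did not
# specify which five pieces of information DART actually uses.
FIELDS = ("title", "creator", "date", "publisher", "identifier")

# Invented sample record for demonstration
SAMPLE = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An Example Thesis</dc:title>
  <dc:creator>Researcher, A.</dc:creator>
  <dc:date>2008</dc:date>
  <dc:identifier>http://repository.example.org/thesis/1</dc:identifier>
</oai_dc:dc>"""


def extract_minimal_record(xml_text):
    """Return a dict of the chosen fields; missing fields come back as None."""
    root = ET.fromstring(xml_text)
    record = {}
    for field in FIELDS:
        element = root.find(f"dc:{field}", NS)
        has_text = element is not None and element.text
        record[field] = element.text.strip() if has_text else None
    return record
```

Calling `extract_minimal_record(SAMPLE)` gives a small dictionary per thesis, with `None` for anything the source repository did not supply (the publisher, in the sample). Keeping the target record this small is one way of keeping the barriers to participation low, since almost any repository can supply that much.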

DART exposes theses to Google (wasn’t very clear how though?)

Although DART takes a simple approach, metadata still needs work.

DART now supports RSS, alerts, export results, multilingual interfaces, and provides usage statistics

How much does it cost to run DART? Not clear – need to look at this, and also benefits. Need to answer the question of whether this can run as an institutionally supported service.

DART-Europe has other technical interests – digital preservation, retrodigitisation…

Conclusions:

  • No dedicated funding means progress incremental – but has produced tangible results
  • Time to start marketing portal to academic community
  • DART-Europe provides a networking organisation for partners – not just about thesis issues

Next up EThOS/EThOSNet (declaration of interest, I’m the Project Director for EThOSNet):

EThOS aims

  • single point of access for UK HE Doctoral theses
  • Support HEIs in transition from print to electronic theses (via a toolkit)
  • digitise existing paper theses

Different participation options supported by EThOS

  • Open Access Sponsor – institution makes ‘up front’ payment to cover digitisation of a set number of theses
  • Associate Member Level 1 – institution pays as it goes – each time a thesis is digitised, billed monthly
  • Associate Member Level 2 – the first researcher pays, then the digitised version available free
  • Associate Member Level 3 – EThOS simply routes the requester to the awarding institution (where the institution does not want EThOS to digitise theses)

EThOS takes an ‘opt-out’ approach – it will put up theses without seeking author permission, but has a strong rapid takedown policy so that if an author does not wish their thesis to be made available via EThOS it can be removed immediately.

98 UK HE institutions have signed up for EThOS.

Now Tracy Kent from University of Birmingham talking about the impact of EThOS on Birmingham.

  • University of Birmingham – is an Open Access Sponsor
  • From old ‘microfilm’ service, Birmingham used to supply 5-6 theses per week. In the first few weeks of EThOS going into public beta, providing 5-10 per day
  • University of Birmingham already had some theses in its institutional repository UBIRA – these are harvested by EThOS in order that they can be supplied via EThOS
  • Costs shifted from handling document supply requests to converting and loading e-theses into the repository to facilitate ‘front loading’ of e-thesis content
  • University of Birmingham took the decision that if one of their users wanted a thesis from EThOS from a ‘Level 2’ member (i.e. equivalent of ILL) then this would have to be covered from researchers’ budgets, not from the library ILL budget

Birmingham contacted about 500 authors – only 5 got in touch to say that they would not want to be part of EThOS. A further 10 said they’d like to be included but couldn’t because of publisher restrictions (i.e. they had published, or were going to publish)

Birmingham have a number of procedures in place to check theses before they go to be digitised and believe that this due diligence approach combined with the EThOS rapid takedown policy means that they are acting in a responsible way – and so far have had no requests for takedown from authors.

Birmingham have seen that once a thesis is on EThOS it is usually downloaded many times.

The service means that

  • Birmingham University thesis content is being seen and accessed
  • There is a changing role for document supply staff
  • There is a need to train authors to seek out necessary permissions and to ensure that submitted theses have the necessary permissions

Finally in the EThOS section Anthony Troman from the British Library. British Library run the EThOS service – they use a digitisation suite to digitise the paper theses, and make available to the end user by download, or (for additional payment) in other formats such as CD-ROM or paper.

Some questions that have come up:

  • Why not continue with microfilm service?
    • Requests for this service have been declining over the last few years – and was costing the BL large amounts of money
    • The system was not economically viable or sustainable
    • In 2 months’ usage 8517 individual theses requested for digitisation – well over a year’s worth under the microfilm service
    • In 2 months 17000 downloads
  • Popularity causing some problems with demand
    • New scanner installed
    • Double shifts – digitisation running 8am-midnight every day
  • Increase in quality between microfilm and digitised

Unfortunately this all costs money! However, a fundamental principle was that ideally theses should be free at point of use. Unfortunately the popularity means that some institutions who have made an upfront contribution are already running short of funds – but there are several options for institutions in this situation and they should contact the BL to discuss them.

Once a thesis is digitised – no one has to pay again – not the institution or the researcher.

Finally (running late which as chair is my fault!) Patricia Methven and Vikas Deora from King’s talking about Entry to EThOS:

Patricia reflecting on how many different parts of the institution needed to be involved in the move to e-theses. Now Vikas saying that Entry to EThOS is about the ‘born digital’ theses rather than digitisation.

At King’s e-thesis submission is not mandatory. The Exam Office was keen to test student takeup and to streamline administration. The library saw storage issues and EThOS participation as important drivers for born-digital deposit. Vikas says with feeling (as a PhD) “The last thing you want to do once you have finished your thesis is to go to a website and fill out hundreds of pieces of information”!

The project looked at creating an e-thesis submission workflow – how to capture the metadata, integrate with existing workflows, integrate with the repository (Fedora in this case) etc.

Found the student record system to be a key source of data – this captures a lot of information about the title of thesis, names of tutors, status of student (e.g. writing up) – and the status of the student was seen as the driver for the workflow. Because the data is coming from within the institution, the Exam Office don’t need to do further checking – so there were real benefits to the Exam Office which came out of the project – you need to convince them that this is going to save them work!

Bibliographic services had concerns about the metadata – assigning subject headings and keywords etc. So the project tried to integrate this into the workflow, so that the library could still classify the theses. They harvest back information from the library system (e.g. subject headings) – they weren’t allowed to write into the library system (sounds like there is double entry going on here?)

Student doesn’t have to enter any information when they upload the thesis – just upload the pdf, check the information and it is submitted to the repository.

King’s recommend that the file the student submits is the ‘source’ file – e.g. Word doc or LaTeX etc. They can also submit PDF, or the conversion will be done for them – this allows for more flexibility in terms of long term preservation.

Literally takes 25-30secs for a student to submit an ethesis. Vikas sees this as absolutely key.

What’s next?

  • Move from e-thesis to Virtual Research Environment
  • Policy decision with exam board – does e-submission become mandatory? (Vikas sees this as key to adoption)
  • Embargos

Q: Has EThOS considered changing approaches to Intellectual Property rights after 2 months?

A: No – there are lots of issues around IP, but these must be managed. Some institutions are taking a ‘trial’ approach where they agree with legal advisors to try it out for a short period, subject to review, as a way of starting out, and hopefully getting agreement for long term commitment if no legal problems come up. Also mentioned that institutions may well be insured against legal action.

JISC09 Open Access

This session is on the recently published report on the economic impact of Open Access. John Houghton, who authored the report with Charles Oppenheim, is going to talk about the report to start with.

The project tried to quantify the costs and benefits – creating a series of spreadsheets containing elements identified in the process model of Scholarly Publishing, adding in cost data. There are about 2300 activity items that are costed in these sheets. Some example figures for activities in UK Scholarly publishing in 2007 – Reading cost £2.77billion, writing £1.6billion, Peer review £203million etc.

The overall estimate (for UK scholarly publishing in 2007) was £5.4billion

Then looked at cases and scenarios exploring cost savings resulting from the alternative publishing models throughout the system. Finally modelled the impact of changes in accessibility and efficiency on returns to R&D.

In summary – OA publishing models (whether Author pays, Overlay etc.) should save money.

John says ‘of course there would have to be a move of money from subscriptions to e.g. author pays funds’ – let’s not underestimate the impact of this – this move of funds is likely to be politically charged, and challenging to organisations. I would also be interested in seeing some analysis of how (for example) ‘author pays’ might change the profile of expenditure across institutions – would this result in expenditure being more or less concentrated in research heavy institutions, or is it neutral in this respect?

Other side of the coin – benefits of Open Access models also more than benefits of traditional publishing.

See http://www.cfses.com/EI-ASPM/ for the opportunity to see, and play with, a simplified model.

Now Hector MacQueen (from University of Edinburgh School of Law) talking about Legal Perspectives on OA Publishing – going to make remarks from personal experience. Hector started by thinking about doing research by electronic means – and only gradually came to think of it in terms of Open Access.

In 1978 research was based around physical access – Hector’s material was in libraries and archives. You had to get yourself to the physical location, or sometimes get material via ILL (although often material Hector wanted was not available via ILL).

The first electronic resource in Hector’s area was Lexis (now Lexis-Nexis) – but it was made very clear to the academics that this was not free. Not only was Lexis restricted to a single terminal but there was a ‘gatekeeper’ (person) who you had to go to to get searches done.

Courts (and others) started to make material available on the web – for free. So for formal sources, this was the start of ‘open access’ – or at least free access. The library became less of the place to get resources – moved to the desktop.

Then Hector started exploring the websites of individuals who were publishing – and became aware of ‘self-archiving’ activity – especially in the USA. Hector followed suit. There was already a lot of informal sharing (and what Hector describes as ‘informal peer-review’ – essentially pre-publication comments from peers) happening – via email etc.

Hector notes that the Open Access world needs to recognise this informal activity. I agree – I’d go further and say that one of the problems we (Libraries/OA Movement) have is that we tried to formalise this type of activity, rather than working to support the existing informal sharing. Hindsight is a wonderful thing, but I’m not sure that we are past this stage yet (certainly not in all disciplines) – a consideration of how we can support informal sharing would still be a valuable exercise I think.

Hector now commenting on Copyright – the impact of the Google Book Search agreement (how will this impact in the UK?), Project Gutenberg, European Digital Library, Amazon + Kindle – also noting the impact of iTunes on availability of music and the dropping of DRM.

Finally before group discussion, an academic (who?) in Theatre studies.

Those studying early theatre groups (e.g. The King’s Men – Shakespeare’s troupe) have problems tracking records, as these can be extremely distributed (around county record offices, various sources in their home location etc.) There is work (REED) bringing together records from all over the UK – which are being published in an expensive series of books – one per town, a series that has been updated over the last 30 years (I think that’s right). However, the leader of the project negotiated from the off (30 years ago) that he had personal rights to the digital distribution – and so he can now make all the PDFs for the publications available via the Internet Archive, and build a database of information listing actors, troupes, locations, plays, writers.

EEBO is another source – a resource created from this called DEEP (Database of Early English Playbooks) – which uses book title pages from EEBO to build a database of Early English Play Books.

EEBO is available via a JISC Collections deal in the UK. This, combined with seamless authentication via IP address, leads people to believe it is ‘free’ – this is an issue when trying to get people to understand the importance of Open Access.

The academic talking about a journal he is involved in which is not Open Access and reflecting why – an editorial stipend from the publisher allows the academics involved to promote the work. However he suspects that if these costs were met effectively by the tax payer, then the overall cost would be lower, as the

Q: I think the question was: What about Open Access outside Higher Education – that is should OA material created by HE be available to all others or just HE community?

A: In general the debate around OA has been focused on STEM – but OA would have a big impact on e.g. Law firms as well

A: The pharmaceutical industry could be a key beneficiary of OA – this is not necessarily a problem but needs to be recognised

A: Need to consider carefully how institutions add value – perhaps what is published is not the value add, but the interpretation of that – knowledge transfer etc.

Q: What would we need to do to make researchers fill the repositories that are out there and empty?

A: (not sure who said this, but a researcher) Mandating deposit essential – but not popular with the academics. Realising that some academics don’t want to be read!

A: Mandating good thing (John Houghton). Something that persuades colleagues is finding things that are available via OA. This means that your own institutional repository is very rarely of interest – it is what is being published elsewhere.

A: Charles Oppenheim – Citation advantage compelling – OA material gets cited more, but the reasons for this not well understood. University of Southampton now more highly cited than Oxford/Cambridge (according to Southampton)

Interesting that the view from the researchers is ‘mandate’. Really good to get researchers’ views, but suspect that the reason they are speaking at this event is exactly because they are atypical.
