UCAS Course code lookup

While I was writing my entry for the JISC MOSAIC competition (which I will write up more thoroughly in a later post I promise – honest), one of the problems I encountered was retrieving details of courses and institutions from the UCAS website. Unfortunately UCAS don’t seem to provide a nice API to their catalogue of course/institution data. To extract the data I was going to have to scrape it out of their HTML pages. Even more unfortunately they require a session ID before you can successfully get back search results – this means you essentially have to start a session on the website and retrieve the session ID before you can start to do a search.

I hacked together something to enable me to get what I needed for the MOSAIC competition. However, I wasn’t the only person who had this problem – in a blog entry on his MOSAIC entry Tony Hirst notes the same problem. At the time Tony asked if I would be making what I’d done available, and I was very happy to – unfortunately the way I’d done it I couldn’t expose just the UCAS course code search. I started to re-write the code, but writing something that I could share with other people, with appropriate error checking and feedback, proved more challenging than my original dirty hack.

I’ve finally got round to it – it works as follows:

The service is at http://www.meanboyfriend.com/readtolearn/ucas_code_search?
The service currently accepts two parameters:

  • course_code
  • catalogue_year

The course_code parameter simply accepts a UCAS course code. I haven’t been able to find out what the course code format is restricted to – but it looks like it is a maximum of 4 alphanumeric characters, so this is what the script accepts. Assuming the code meets these criteria, the script passes it directly to the UCAS catalogue search. The UCAS catalogue doesn’t seem to care whether alpha characters are upper or lower case and treats them as equivalent. For some examples of UCAS codes, you can see this list provided by Dave Pattern. (see Addendum 2 for more information on UCAS course codes and JACS)

The catalogue_year parameter takes the year in the format yyyy. If no value is given then the UCAS catalogue seems to default to the current year (2010 at the moment). If an invalid year is given the UCAS catalogue also seems to default to the current year. It seems that at most two years are valid at any one time. However the script doesn’t check any of this – as long as it gets a valid four digit year, it passes it on to the UCAS catalogue search.

An example is http://www.meanboyfriend.com/readtolearn/ucas_code_search/?course_code=R901&catalogue_year=2010
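The parameter checks described above might look something like this in Python (a sketch of the validation rules, not the script itself, which is in Perl; `build_lookup_url` is my own name for the helper):

```python
import re
import urllib.parse

BASE_URL = "http://www.meanboyfriend.com/readtolearn/ucas_code_search/"

def build_lookup_url(course_code, catalogue_year=None):
    """Validate the two parameters as described above and build the request URL."""
    # Course codes look like a maximum of 4 alphanumeric characters;
    # the UCAS catalogue treats upper and lower case as equivalent.
    if not re.fullmatch(r"[A-Za-z0-9]{1,4}", course_code):
        raise ValueError("course_code must be 1-4 alphanumeric characters")
    params = {"course_code": course_code}
    # catalogue_year is optional; only a 4-digit yyyy check is applied here -
    # defaulting and further validation are left to the UCAS catalogue itself.
    if catalogue_year is not None:
        if not re.fullmatch(r"\d{4}", str(catalogue_year)):
            raise ValueError("catalogue_year must be a 4-digit year (yyyy)")
        params["catalogue_year"] = str(catalogue_year)
    return BASE_URL + "?" + urllib.parse.urlencode(params)
```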

The script’s output is XML of the form:

<?xml version="1.0"?>
<ucas_course_results course_code="" catalogue_year="" ucas_stateid="">
<institution code="" name="">
<course_name>xxxx</course_name> (repeatable)
</institution>
</ucas_course_results>

(I’ve made a slight change to the output structure since the original publication of this post)
(Finally I’ve added a couple of extra elements inst_ucas_url and course_ucas_url which provide links to the institution and course records on the UCAS website respectively)

<?xml version="1.0"?>
<ucas_course_results course_code="" catalogue_year="" ucas_stateid="">
<institution code="" name="">
<inst_ucas_url>[URL for Institution record on UCAS website]</inst_ucas_url>
<course ucas_catalogue_id=""> (repeatable) (the ucas_catalogue_id is not currently populated – see Addendum 1)
<course_ucas_url>[URL for course record on UCAS website]</course_ucas_url>
<name>xxxx</name>
</course>
</institution>
</ucas_course_results>

For example:

<ucas_course_results course_code="R901" catalogue_year="2010" ucas_stateid="DtDdAozqXysV4GeQbRbhP3DxTGR2m-3eyl">
<institution code="P80" name="University of Portsmouth">
<course_name>Combined Modern Languages</course_name>
</institution>
</ucas_course_results>

The same example in the updated output structure:

<ucas_course_results course_code="R901" catalogue_year="2010" ucas_stateid="DtDdAozqXysV4GeQbRbhP3DxTGR2m-3eyl">
<institution code="P80" name="University of Portsmouth">
<inst_ucas_url>
http://search.ucas.com/cgi-bin/hsrun/search/search/StateId/DtGJmwzptIwV4rADbR8xUfafCk6nG-Ur61/HAHTpage/search.HsInstDetails.run?i=P80
</inst_ucas_url>
<course ucas_catalogue_id=""> (the ucas_catalogue_id is not currently populated – see Addendum 1)
<course_ucas_url>
http://search.ucas.com/cgi-bin/hsrun/search/search/StateId/DtGJmwzptIwV4rADbR8xUfafCk6nG-Ur61/HAHTpage/search.HsDetails.run?n=989628
</course_ucas_url>
<name>Combined Modern Languages</name>
</course>
</institution>
</ucas_course_results>
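For anyone consuming the service, the updated structure can be parsed with Python’s standard library along these lines (a sketch; `parse_results` is my own helper name, and element names are taken from the examples above):

```python
import xml.etree.ElementTree as ET

def parse_results(xml_text):
    """Pull institution and course details out of a ucas_course_results response."""
    root = ET.fromstring(xml_text)
    results = []
    for inst in root.iter("institution"):
        for course in inst.iter("course"):
            results.append({
                "institution_code": inst.get("code"),
                "institution_name": inst.get("name"),
                "ucas_catalogue_id": course.get("ucas_catalogue_id"),
                "course_name": course.findtext("name"),
                # URLs in the examples are wrapped in whitespace, so strip them
                "course_ucas_url": (course.findtext("course_ucas_url") or "").strip(),
            })
    return results
```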

The values fed to the script, and the StateId for the UCAS website, are fed back in the response.

If there is an error at some point in the process an error message will be included in the response in an <error> tag.

Addendum 1
The script relies on the HTML returned by UCAS remaining consistent. If this changes, my script will probably break.

Having done the hard work I’d be happy to offer alternative formats for the data returned by the script – just let me know in the comments. I’d also be happy to look at different XML structures for the data so again just leave a comment.

Something I should have mentioned in the original post. Given the data returned by the script you should be able to form a URL which links to an institution on the UCAS website using a URL of the form:
http://search.ucas.com/cgi-bin/hsrun/search/search/StateId/<insert state ID from xml here>/HAHTpage/search.HsInstDetails.run?i=<insert institution code here>
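Given the template above, forming the link is a simple substitution; a sketch (`institution_url` is my own function name):

```python
# The URL template is taken from the post above; the two placeholders are
# filled from the ucas_stateid attribute and the institution code attribute
# in the script's XML response.
INST_URL_TEMPLATE = (
    "http://search.ucas.com/cgi-bin/hsrun/search/search/StateId/"
    "{state_id}/HAHTpage/search.HsInstDetails.run?i={inst_code}"
)

def institution_url(state_id, inst_code):
    return INST_URL_TEMPLATE.format(state_id=state_id, inst_code=inst_code)
```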

Since finishing this work last night I’ve realised that I’ve left out one important piece of data which is an identifier that would let you form a link to a specific course from a specific institution. I have slightly restructured the XML to leave a space for the ucas_catalogue_id in the XML. I’ll add this in as soon as I can.
This has now been added.

Addendum 2
I’ve just found quite a bit more detail on the format and structure of the UCAS ‘course codes’. UCAS now uses JACS (Joint Academic Coding System) for course codes (see JACS documentation from HESA). JACS codes consist of 4 characters, the first being an uppercase letter and the remaining three characters being digits. JACS codes are essentially hierarchical, with the first character representing a general subject area and the digits representing subdivisions (with increasing granularity). The codes in the UCAS catalogue are a mixture of JACS 1.7 and JACS 2.0 codes. A full listing of JACS v2.0 codes is available from HESA, and a listing of JACS v1.7 codes is available from UCAS as a pdf.

UCAS have an explanation of why and where they use both JACS v2.0 and JACS v1.7.

However because UCAS need to code courses which cover more than one subject area, they have rules for representing these courses while sticking to codes with a total length of 4 characters. These rules are summarised on the UCAS website, but a fuller description is available in pdf format. This last document is most interesting because it indicates how you might create the UCAS code from a HESA Student Record which could be of interest for future mashups.

The implications of all this for my script are relatively small as I currently assume that there is a 4 character alpha-numeric code. On the basis of this documentation I could refine this to check for 3 alpha-numeric characters followed by a single digit I guess – perhaps I will at some point.
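That refinement would be a small change to the validation pattern (a sketch; both patterns shown for exactly 4-character codes for simplicity):

```python
import re

# What the script effectively accepts today: 4 alphanumeric characters.
current = re.compile(r"^[A-Za-z0-9]{4}$")
# The possible refinement suggested by the JACS documentation:
# 3 alphanumeric characters followed by a single digit.
refined = re.compile(r"^[A-Za-z0-9]{3}[0-9]$")
```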

Finally it looks like UCAS and HESA are currently looking at JACS v3.0 which could introduce further changes I guess, although it looks unlikely that this will affect the code format, but rather the possible values, and maybe the meaning of some values. While this isn’t a problem for my script, it would mean that historical course codes from datasets such as MOSAIC could not be assumed to represent the same subject areas in the current UCAS course catalogue as they did when the data was recorded – which is, to say the least, a pain.

Addendum 3
A final set of changes (I hope):

  • The ucas_catalogue_id is now populated
  • Added inst_ucas_url element which contains the URL linking to the Institution record in the UCAS catalogue
  • Added course_ucas_url element which contains the URL linking to the Course record in the UCAS catalogue

Everyone’s a winner?

The results of the JISC MOSAIC competition were announced this week. The winning entries were great, and I think their prizes were well deserved. The only downside in this was that my entry didn’t make the cut. I will admit to having a moment of disappointment over this, but this passed in about 5 seconds – after all, I’d really enjoyed the challenge of writing my entry and was relatively pleased with the result.

Later in the week I fell into conversation with a couple of people on Twitter about how there hadn’t been much collaboration in the competition. With one notable exception none of the contestants had published early thoughts online, and all the entries had been from individuals rather than teams.

During the course of this conversation I managed to both insult and upset someone I greatly like, admire and respect. For this I am truly sorry. This post is by way of an apology as well as an attempt to express my own thoughts around the nature of ‘developer competitions’ such as JISC MOSAIC.

The idea of a developer competition is that you set a challenge, aimed at computer programmers and interested others, and offer prizes to the best entries – the criteria can vary wildly. Perhaps the biggest prize of this type we’ve seen is the $1 million NetFlix Prize, but in the UK HE community where I work there have been a few smaller prizes on offer, and more widely in the UK community there have been prizes for ideas about using government data, and we are about to see one launched on the use of Museum data. The JISC MOSAIC competition offered a 1st prize of £1000 for work on library usage data.

One of the amazing things about the web, and perhaps particularly about the communities I’m engaged in, is the incredible personal commitment made in terms of time and resource by individuals to what many would regard as ‘work’. Both of the people I was talking to put a great deal of effort into contributing to and developing ideas that many might think of as ‘the day job’ – and they do so with no thought of reward.

So – given this tendency to be self-motivated to solve problems, contribute, take part and so on, why do we need developer competitions?

My starting point is to look at my own motivation for entering the JISC MOSAIC competition. Would I have done this work without the competition? Trying to be completely honest here – probably not. However, I would almost certainly have done other things instead – perhaps blogged more, perhaps done some other development (like this). So the competition focussed my energy on a particular area of work. Was I motivated by the cash prize? I’m not sure – at the end of the day it isn’t that relevant to me (although no doubt I could have found something to treat myself to). I think it was just the idea of the ‘competition’ that gave me the focus. I’m the kind of person who works relatively well with clear deadlines – so having a date by which a set of work was to be done definitely gave me something to aim at.

So – the competition was one element. However, I was also looking for ways to dust off my scripting skills. I used to script in Perl as part of my job, but I haven’t done this for several years – I had been looking for ways of picking this up again as it was something I always enjoyed doing. I am also extremely interested in the ideas behind the competition – I believe libraries should be exploiting their usage data more, and I was keen to show the community how valuable that usage data might be.

I don’t assume that others are motivated in the same way as me. When the usage data that was part of the JISC MOSAIC competition was first put online somebody immediately took it and transformed it into RDF – they weren’t motivated by a competition, they just did it.

My conclusion is that such competitions harness existing energy in the community and focus it on a particular problem for a particular time period. It won’t generally work where people aren’t inclined to do the work anyway. You need an interesting problem or proposition to engage people.

So far, so good? I’m not sure. The problem with a competition is that it is, well, competitive. Again, trying to be honest about my own situation (and I’m not particularly proud of this, so don’t take it as an endorsement of my own approach), I immediately became more protective of my ideas. The competition had put a ‘value’ on them that they hadn’t previously had. I should say I actually started work on two entries to the competition – one was in collaboration with someone else, which unfortunately we weren’t able to pull together in time – so it wasn’t all about ‘me’. However, I didn’t announce my own entry until I was ready to submit. This isn’t how I usually work – I’m usually happy to share half baked ideas (as readers of this blog will know only too well!).

Again I think the factors around this are complex. It wasn’t just that I didn’t want to give away my idea. The truth is that I’m not a very good programmer. I wanted to take this chance to develop my programming skills (or at least get myself back to my previous level of incompetence). I am under no illusions – any developer worth their salt could take my idea and do a better job with it. In general this would be great – if my idea is good enough to inspire other people to do it much much better than I can I’d be very happy. But for the period of the competition this suddenly seemed like a bad idea.

Reflecting on this now, this shows a pretty rubbish (on my part) attitude to others – the ‘fear’ that my idea would be ‘stolen’ (and of course the egoism that says my idea was worth stealing). I’m pretty confident in retrospect that the only possible outcome of publishing early would have been a better entry (possibly in collaboration with others). However, I would say that my guess is it would have resulted in me not doing the coding – which I would have been sorry about.

I am going to blog my entry in detail, and release all the work I’ve done – which others are more than welcome to use and abuse.

So although I think developer competitions work in terms of focussing people on a problem, I think there are some possible downsides, perhaps chief of which is that competitions may discourage collaboration. I don’t think this is a given though, and so in closing here are some thoughts that future developer competitions might want to consider:

  • Is there an element in your competition that encourages team entries above or as well as individual entries?
  • Can you reward collaboration either within or outside the competition structure?
  • How are you going to ensure that the whole community can share and benefit from the competition outcomes? Plan this from day 1!

Perhaps consider splitting the prizes in different ways to achieve this – not one ‘big winner’, but rather judging and rewarding contributions as you go along. Perhaps consider having a ‘collaboration’ environment where ideas can be submitted (and judged separately) and where teams can form and work together.

A final thought – I really enjoyed entering the JISC MOSAIC competition – it stretched my skills and scratched an itch for me. I am in no way disappointed I didn’t win – the winning entries were very deserving. I fully intend to do more scripting/programming going forward. And sharing.

Would you recommend your recommender?

We are starting to see software and projects emerging that utilise library usage data to make recommendations to library users about things they might find useful. Perhaps the most famous example of this type of service is the Amazon ‘people who bought this also bought’ recommendations.

In libraries we have just had the results of the JISC MOSAIC project announced, which challenged developers to show what they could do with library usage data. This used usage data from Huddersfield, where Dave Pattern has led the way both in exploiting the usage data within the Huddersfield OPAC, and also in making the data available to the wider community.

On the commercial side we now have the bX software from Ex Libris, which takes usage data from SFX installations across the world (SFX is an OpenURL resolver which essentially helps make links between descriptions of bibliographic items and the full text of the items online). By tracking what fulltext resources a user accesses in a session, and looking at behaviour over millions of transactions, this can start to make associations between different fulltext resources (usually journal articles).
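The general idea of session-based association can be illustrated with a naive pair-counting sketch (this is emphatically not how bX itself works – it is just an illustration of the principle, and the function names are my own):

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sessions):
    """Count how often each pair of items is accessed within the same session.

    `sessions` is a list of lists of item identifiers. A real service will
    use something much more sophisticated than raw pair counts.
    """
    pairs = Counter()
    for items in sessions:
        # sorted() gives each unordered pair a single canonical key
        for a, b in combinations(sorted(set(items)), 2):
            pairs[(a, b)] += 1
    return pairs

def recommend(pairs, item, n=5):
    """Items most often seen alongside `item`, most frequent first."""
    scores = Counter()
    for (a, b), count in pairs.items():
        if a == item:
            scores[b] += count
        elif b == item:
            scores[a] += count
    return [other for other, _ in scores.most_common(n)]
```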

I was involved in trialling bX, and I talked to some of the subject librarians about the service; the first question they wanted answered was “how does it come up with the recommendations?”. There is a paper on some of the work that led to the bX product, although a cursory reading doesn’t tell me exactly how the recommendations are made. Honestly, I hope that there is some reasonably clever mathematical/statistical analysis going on behind the recommendation service that I’m not going to understand. For me the question shouldn’t be “how does it work?” but “does it work?” – that is, are the recommendations any good?

So we have a new problem – how do we measure the quality of the recommendations we get from these services?

Perhaps the most obvious approach is to get users to assess the quality of the recommendations. This is the approach that perhaps most libraries would take if assessing a new resource. It’s also an approach that Google take. However, when looking at a recommender service that goes across all subject areas, getting a representative sample of people from across an institution to test the service thoroughly might be difficult.

Another approach is to use a recommendation service and then do a longitudinal study of user behaviour and try to draw conclusions about the success of the service. This is how I’d see Dave Pattern’s work at Huddersfield, which he recently presented on at ILI09. Dave’s analysis is extremely interesting and shows some correlations between the introduction of the recommender service and user behaviour. However, it may not be economic to do this where there is a cost to the recommender service.

The final approach, and one that appeals to me, is that taken by the NetFlix Prize competition. The NetFlix Prize was an attempt by the DVD/Movie lending company NetFlix to improve their recommendation algorithm. They offered a prize of $1 million to anyone who could improve on their existing algorithm’s accuracy by 10% or more. The NetFlix Prize actually looked at how people rated (1-5) movies they had watched – based on previous ratings the goal was to predict how individuals might rate other movies. The way the competition was structured was that a data set with ratings was given to contestants, along with a set of ratings where the actual values of the ratings had been removed. The challenge was to find an algorithm that would fill in these missing ratings accurately (or more accurately than the existing algorithm). This is a typical approach when looking at machine based predictions – you have a ‘training set’ of data which you feed into the algorithms, and a ‘testing set’ of real life data against which you compare the machine ‘predictions’.
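The scoring itself is simple even if the algorithms are not: the NetFlix Prize compared predicted ratings with the true hidden values using root mean squared error. As a sketch:

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings -
    the accuracy measure used by the NetFlix Prize."""
    assert len(predicted) == len(actual) and actual
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    )
```

A lower RMSE means the algorithm's filled-in ratings sit closer to the ones that were withheld.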

The datasets are available at the UCI Machine Learning Repository. The Netflix prize was finally won in September 2009 after almost 3 years.

What I find interesting about this approach is that it tests the recommendation algorithm against real data. Perhaps this is an approach we could look at with recommendation services for libraries – to feed in a partial set of data from our own systems and see whether the recommendations we get back match the rest of our data. As we start to see competition in this marketplace, we are going to want to know which services best suit our institutions.
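A crude version of that test could be sketched like this (`overlap_at_n` is my own hypothetical measure, not an established metric – it simply asks what fraction of the top recommendations appear in the data we withheld):

```python
def overlap_at_n(recommended, held_out, n=10):
    """Fraction of the top-n recommendations that appear in the held-out data.

    `recommended` is an ordered list of item identifiers from the recommender
    (fed only the partial data); `held_out` is the withheld portion of our
    own usage data.
    """
    top = recommended[:n]
    if not top:
        return 0.0
    return len(set(top) & set(held_out)) / len(top)
```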

Middlemash, Middlemarch, Middlemap

The next Mashed Library event was announced a few months ago, but now more details are available. Middlemash is happening at Birmingham City University on 30th November 2009. I hope to see you there.

In discussion with Damyanti Patel, who is organising Middlemash, we thought it would be nice to do a little project in advance of Middlemash. When we brainstormed what we could do I originally suggested that maybe someone had drawn a map of the fictional geography of Middlemarch, and if we could find one, we could make it interactive in some way. Unfortunately a quick search turned up no such map. However, what it did turn up was something equally interesting – this map of relationships between characters in Middlemarch on LibraryThing.

This inspired a new idea – whether this could be represented in RDF somehow. My first thought was FOAF, but initially this seemed limited as it doesn’t allow for the expression of different types of relationship. However, I then came across this post from Ian Davis (this is the first in a series of 3), which used the Relationship vocabulary in addition to FOAF to express more of the kind of thing I was looking for.
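As an illustration of the combination, a fragment along these lines (the identifiers `#dorothea`, `#celia` and `#casaubon` are my own invention, not taken from my actual file) describes a character with FOAF and relates her to others with the Relationship vocabulary:

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/"
         xmlns:rel="http://purl.org/vocab/relationship/">
  <foaf:Person rdf:about="#dorothea">
    <foaf:name>Dorothea Brooke</foaf:name>
    <rel:siblingOf rdf:resource="#celia"/>
    <rel:spouseOf rdf:resource="#casaubon"/>
  </foaf:Person>
</rdf:RDF>
```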

The resulting RDF is at http://www.meanboyfriend.com/overdue_ideas/middlemash.rdf. However, if you want to explore this in a more user-friendly manner, you probably want to use an RDF viewer. Although there are several you could use, the one I found easiest as a starting point was the Zitgist dataviewer. You should be able to browse the file directly with Zitgist via this link. There are however a couple of issues:

  • Zitgist doesn’t seem to display the whole file, although if you browse through relationships you can view all records eventually
  • At time of posting I’m having some problems with Zitgist response times, but hopefully these are temporary

This is the first time I’d written any RDF, and I did it by hand, and I was learning as I went along. So I’d be very glad to know what I’ve done wrong, and how to improve it – leave comments on this post please.

I did find some problems with the Relationship vocabulary. It still only expresses a specific range of relationships. It also seems to rely on inferred relationships in some cases. The relationships uncle/aunt/nephew/niece aren’t expressed directly in the Relationship vocabulary – presumably on the basis that they could be inferred through other relationships of ‘parentOf’, ‘childOf’ and ‘siblingOf’ (i.e. your uncle is your father’s brother etc.). However, in Middlemarch there are a few characters who are described as related in this manner, but to my knowledge no mention of the intermediary relationships is made. So we know that Edward Casaubon has an Aunt Julia, but it is not stated whether she is his father’s or mother’s sister, and further his parents are not mentioned (this is as far as I know, I haven’t read Middlemarch for many years, and I went from SparkNotes and the relationship map on LibraryThing).

Something that seemed odd is that the Relationship vocabulary does allow you to relate grandparents to grandchildren explicitly, without relying on the inference from two parentOf relationships.

Another problem, which is one that Ian Davis explores at length in his posts on representing Einstein’s biography in RDF, is the time element. The relationships I express here aren’t linked to time – so where someone has remarried it is impossible to say from the work I have done here whether they are polygamous or not! I suspect that at least some of this could have been dealt with by adding details like dates of marriages via the Bio vocabulary Ian uses, but I think this would be a problem in terms of the details available from Middlemarch itself (I’m not confident that dates would necessarily be given). It also looked like hard work 🙂

So – there you have it, my first foray into RDF – a nice experiment, and potentially an interesting way of developing representations of literary works in the future?