While I was writing my entry for the JISC MOSAIC competition (which I will write up more thoroughly in a later post I promise – honest), one of the problems I encountered was retrieving details of courses and institutions from the UCAS website. Unfortunately UCAS don’t seem to provide a nice API to their catalogue of course/institution data. To extract the data I was going to have to scrape it out of their HTML pages. Even more unfortunately they require a session ID before you can successfully get back search results – this means you essentially have to start a session on the website and retrieve the session ID before you can start to do a search.
I hacked together something that got me what I needed for the MOSAIC competition. However, I wasn’t the only person who had this problem – in a blog entry on his MOSAIC entry Tony Hirst notes the same problem. At the time Tony asked if I would be making what I’d done available, and I was very happy to – unfortunately, the way I’d done it meant I couldn’t expose just the UCAS course code search. I started to rewrite the code, but writing something that I could share with other people, with appropriate error checking and feedback, proved more challenging than my original dirty hack.
I’ve finally got round to it – it works as follows:
The service is at http://www.meanboyfriend.com/readtolearn/ucas_code_search?
The service currently accepts two parameters:
- course_code
- catalogue_year
The course_code parameter simply accepts a UCAS course code. I haven’t been able to find out what the course code format is restricted to – but it looks like it is a maximum of 4 alphanumeric characters, so this is what the script accepts. Assuming the code meets these criteria, the script passes it directly to the UCAS catalogue search. The UCAS catalogue doesn’t seem to care whether alpha characters are upper or lower case and treats them as equivalent. For some examples of UCAS codes, you can see this list provided by Dave Pattern. (See Addendum 2 for more information on UCAS course codes and JACS.)
The catalogue_year parameter takes the year in the format yyyy. If no value is given then the UCAS catalogue seems to default to the current year (2010 at the moment). If an invalid year is given the UCAS catalogue also seems to default to the current year. It seems that at most only two years are valid at a single time. However the script doesn’t check any of this – as long as it gets a valid four digit year, it passes it on to the UCAS catalogue search.
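The two checks described above can be sketched roughly as follows. This is not the script's actual code, just an illustration in Python; the function names are made up, and I am assuming "a maximum of 4 alphanumeric characters" means between 1 and 4.

```python
import re

# Assumption: "a maximum of 4 alphanumeric characters" means 1 to 4 characters.
COURSE_CODE_RE = re.compile(r"[A-Za-z0-9]{1,4}")
# Any four-digit year; no check that UCAS really has a catalogue for it.
YEAR_RE = re.compile(r"[0-9]{4}")


def is_valid_course_code(code: str) -> bool:
    """True if the whole string is 1 to 4 letters or digits (either case)."""
    return COURSE_CODE_RE.fullmatch(code) is not None


def is_valid_catalogue_year(year: str) -> bool:
    """True if the whole string is exactly four digits."""
    return YEAR_RE.fullmatch(year) is not None
```

So `is_valid_course_code("R901")` passes while `"R9012"` is rejected, and `"2010"` is accepted as a year whether or not UCAS recognises it.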
An example is http://www.meanboyfriend.com/readtolearn/ucas_code_search/?course_code=R901&catalogue_year=2010
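If you want to build request URLs like this from code, something along these lines should do it. It is a minimal Python sketch: `build_search_url` is a hypothetical helper name, and I am assuming the trailing-slash form of the service URL used in the example above.

```python
from urllib.parse import urlencode

# Assumption: the trailing-slash form of the service URL, as in the example.
SERVICE_URL = "http://www.meanboyfriend.com/readtolearn/ucas_code_search/"


def build_search_url(course_code, catalogue_year=None):
    """Build a request URL; catalogue_year is left out when not supplied."""
    params = {"course_code": course_code}
    if catalogue_year is not None:
        params["catalogue_year"] = str(catalogue_year)
    # urlencode escapes the values and joins the parameters with "&".
    return SERVICE_URL + "?" + urlencode(params)


print(build_search_url("R901", 2010))
```

Leaving out `catalogue_year` means the UCAS catalogue falls back to its default year, as described above.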
The script’s output is xml of the form:
<xml>
<ucas_course_results course_code="" catalogue_year="" ucas_stateid="">
<institution code="" name="">
<course_name>xxxx</course_name> (repeatable)
</institution>
</ucas_course_results>
(I’ve made a slight change to the output structure since the original publication of this post)
(Finally I’ve added a couple of extra elements inst_ucas_url and course_ucas_url which provide links to the institution and course records on the UCAS website respectively)
<xml>
<ucas_course_results course_code="" catalogue_year="" ucas_stateid="">
<institution code="" name="">
<inst_ucas_url>[URL for Institution record on UCAS website]</inst_ucas_url>
<course ucas_catalogue_id=""> (repeatable) (the ucas_catalogue_id is not currently populated – see Addendum 1)
<course_ucas_url>[URL for course record on UCAS website]</course_ucas_url>
<name>xxxx</name>
</course>
</institution>
</ucas_course_results>
For example:
<ucas_course_results course_code="R901" catalogue_year="2010" ucas_stateid="DtDdAozqXysV4GeQbRbhP3DxTGR2m-3eyl">
<institution code="P80" name="University of Portsmouth">
<course_name>Combined Modern Languages</course_name>
</institution>
</ucas_course_results>
(The same example in the revised structure, with the inst_ucas_url and course_ucas_url elements added:)
<ucas_course_results course_code="R901" catalogue_year="2010" ucas_stateid="DtDdAozqXysV4GeQbRbhP3DxTGR2m-3eyl">
<institution code="P80" name="University of Portsmouth">
<inst_ucas_url>
http://search.ucas.com/cgi-bin/hsrun/search/search/StateId/DtGJmwzptIwV4rADbR8xUfafCk6nG-Ur61/HAHTpage/search.HsInstDetails.run?i=P80
</inst_ucas_url>
<course ucas_catalogue_id=""> (the ucas_catalogue_id is not currently populated – see Addendum 1)
<course_ucas_url>
http://search.ucas.com/cgi-bin/hsrun/search/search/StateId/DtGJmwzptIwV4rADbR8xUfafCk6nG-Ur61/HAHTpage/search.HsDetails.run?n=989628
</course_ucas_url>
<name>Combined Modern Languages</name>
</course>
</institution>
</ucas_course_results>
The values fed to the script and the StateID for the UCAS website are fed back in the response.
If there is an error at some point in the process, an error message will be included in the response in an <error> tag.
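For anyone wanting to consume the output, here is a rough Python sketch of reading a response in the revised structure using the standard library. The function name `parse_results` and the returned dictionary layout are my own invention for illustration. Since I haven't said exactly where the <error> tag sits in the response, the sketch simply looks for it anywhere in the tree.

```python
import xml.etree.ElementTree as ET


def parse_results(xml_text):
    """Return (errors, institutions) from a response in the revised structure."""
    root = ET.fromstring(xml_text)
    # Assumption: <error> may appear anywhere, so search the whole tree.
    errors = [(e.text or "").strip() for e in root.iter("error")]
    institutions = []
    for inst in root.iter("institution"):
        courses = []
        for course in inst.findall("course"):
            courses.append({
                "name": (course.findtext("name") or "").strip(),
                "url": (course.findtext("course_ucas_url") or "").strip(),
            })
        institutions.append({
            "code": inst.get("code"),
            "name": inst.get("name"),
            "url": (inst.findtext("inst_ucas_url") or "").strip(),
            "courses": courses,
        })
    return errors, institutions


# Sample taken from the example response above.
SAMPLE = """<ucas_course_results course_code="R901" catalogue_year="2010" ucas_stateid="DtDdAozqXysV4GeQbRbhP3DxTGR2m-3eyl">
<institution code="P80" name="University of Portsmouth">
<inst_ucas_url>
http://search.ucas.com/cgi-bin/hsrun/search/search/StateId/DtGJmwzptIwV4rADbR8xUfafCk6nG-Ur61/HAHTpage/search.HsInstDetails.run?i=P80
</inst_ucas_url>
<course ucas_catalogue_id="">
<course_ucas_url>
http://search.ucas.com/cgi-bin/hsrun/search/search/StateId/DtGJmwzptIwV4rADbR8xUfafCk6nG-Ur61/HAHTpage/search.HsDetails.run?n=989628
</course_ucas_url>
<name>Combined Modern Languages</name>
</course>
</institution>
</ucas_course_results>"""

errors, institutions = parse_results(SAMPLE)
for inst in institutions:
    print(inst["code"], inst["name"], [c["name"] for c in inst["courses"]])
```

Checking `errors` before using `institutions` is probably the safest order, since a failed lookup may well come back with no institutions at all.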
Addendum 1
The script relies on the HTML returned by UCAS remaining consistent. If this changes, my script will probably break.
Having done the hard work I’d be happy to offer alternative formats for the data returned by the script – just let me know in the comments. I’d also be happy to look at different XML structures for the data so again just leave a comment.
Something I should have mentioned in the original post. Given the data returned by the script you should be able to form a URL which links to an institution on the UCAS website using a URL of the form:
http://search.ucas.com/cgi-bin/hsrun/search/search/StateId/<insert state ID from xml here>/HAHTpage/search.HsInstDetails.run?i=<insert institution code here>
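In code, filling in that URL pattern might look like the sketch below (Python again; `institution_url` is a hypothetical name). The percent-encoding is a precaution on my part and not something I know UCAS requires.

```python
from urllib.parse import quote

# The URL pattern given above, with placeholders for the two values.
UCAS_INST_TEMPLATE = (
    "http://search.ucas.com/cgi-bin/hsrun/search/search/StateId/"
    "{state_id}/HAHTpage/search.HsInstDetails.run?i={inst_code}"
)


def institution_url(state_id, inst_code):
    """Fill the state ID and institution code into the UCAS URL pattern."""
    # Assumption: percent-encode both values so stray characters cannot break the URL.
    return UCAS_INST_TEMPLATE.format(
        state_id=quote(state_id, safe=""),
        inst_code=quote(inst_code, safe=""),
    )


# State ID as it appears in the example inst_ucas_url.
print(institution_url("DtGJmwzptIwV4rADbR8xUfafCk6nG-Ur61", "P80"))
```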
Since finishing this work last night I’ve realised that I’ve left out one important piece of data which is an identifier that would let you form a link to a specific course from a specific institution. I have slightly restructured the XML to leave a space for the ucas_catalogue_id in the XML. I’ll add this in as soon as I can.
This has now been added.
Addendum 2
I’ve just found quite a bit more detail on the format and structure of the UCAS ‘course codes’. UCAS now uses JACS (Joint Academic Coding System) for course codes (see JACS documentation from HESA). JACS codes consist of 4 characters, the first being an uppercase letter and the remaining three characters being digits. JACS codes are essentially hierarchical, with the first character representing a general subject area and the digits representing subdivisions (with increasing granularity). The codes in the UCAS catalogue are a mixture of JACS 1.7 and JACS 2.0 codes. A full listing of JACS v2.0 codes is available from HESA, and a listing of JACS v1.7 codes is available from UCAS as a pdf.
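To make the hierarchy a bit more concrete, here is a small Python sketch that checks the basic JACS shape (one uppercase letter, three digits) and lists the successive prefixes of a code. Treating each prefix as one level of the hierarchy is my reading of the description above, not something taken from the JACS documentation, and `jacs_levels` is a made-up name.

```python
import re

# One uppercase letter followed by exactly three digits.
JACS_RE = re.compile(r"[A-Z][0-9]{3}")


def jacs_levels(code):
    """Return the prefixes of a JACS code, from broadest to most specific."""
    if JACS_RE.fullmatch(code) is None:
        raise ValueError("not a plain JACS code: %r" % code)
    # Assumption: each extra character narrows the subject area by one level.
    return [code[:n] for n in range(1, 5)]


print(jacs_levels("R901"))  # ['R', 'R9', 'R90', 'R901']
```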
UCAS have an explanation of why and where they use both JACS v2.0 and JACS v1.7.
However because UCAS need to code courses which cover more than one subject area, they have rules for representing these courses while sticking to codes with a total length of 4 characters. These rules are summarised on the UCAS website, but a fuller description is available in pdf format. This last document is most interesting because it indicates how you might create the UCAS code from a HESA Student Record which could be of interest for future mashups.
The implications of all this for my script are relatively small as I currently assume that there is a 4 character alpha-numeric code. On the basis of this documentation I could refine this to check for 3 alpha-numeric characters followed by a single digit I guess – perhaps I will at some point.
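For what it's worth, the refined check would only be a small change. A Python sketch of what I have in mind (this is not what the script does at present, and the function name is made up):

```python
import re

# Three letters or digits followed by one digit, e.g. R901.
REFINED_CODE_RE = re.compile(r"[A-Za-z0-9]{3}[0-9]")


def is_refined_course_code(code):
    """True if the code is exactly 3 alphanumeric characters plus a final digit."""
    return REFINED_CODE_RE.fullmatch(code) is not None
```

Unlike the current check, this rejects anything shorter than 4 characters, so it would be stricter in two ways at once.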
Finally it looks like UCAS and HESA are currently looking at JACS v3.0 which could introduce further changes I guess, although it looks unlikely that this will affect the code format, but rather the possible values, and maybe the meaning of some values. While this isn’t a problem for my script, it would mean that historical course codes from datasets such as MOSAIC could not be assumed to represent the same subject areas in the current UCAS course catalogue as they did when the data was recorded – which is, to say the least, a pain.
Addendum 3
A final set of changes (I hope):
- The ucas_catalogue_id is now populated
- Added inst_ucas_url element which contains the URL linking to the Institution record in the UCAS catalogue
- Added course_ucas_url element which contains the URL linking to the Course record in the UCAS catalogue