What to Watch

TV Reviews have always (in my mind at least) been a bit of an oddity – if you watched the programme, what is there to tell you? And if you missed it, why are you interested? (although despite this I’ve always enjoyed reading TV reviews – especially Nancy Banks-Smith)

However, with the availability of TV ‘catchup’ services online – particularly the iPlayer – the TV review can be not just an amusing piece of writing, but a genuine help in deciding whether the programme is worth watching on catchup. With this in mind, it struck me that it would be nice to link from reviews to the relevant catchup service.

The Guardian TV reviews were an obvious starting place for me for two reasons. Firstly I’m a big fan of the Guardian, and secondly their ‘Open Platform’ experiment enables people like me to grab and republish their content (with appropriate attribution). The BBC iPlayer catchup service was also an obvious starting point, partly because it is the most popular catchup service in the UK, and again because the BBC already provide some level of structured access to their data, providing much of their programme information as ‘linked data’. So I decided to try to mashup the Guardian TV reviews with the BBC iPlayer service.

Unfortunately, although the Guardian provide a lot of structured metadata with their articles via the Open Platform, the TV programmes mentioned are not part of that metadata – so I was left having to scrape the programme names from the title or body of the article somehow. Here I came across a discrepancy: in the RSS feed for “Last night’s TV”, each programme name within a review was surrounded by a <strong> tag, but in the version of the reviews on the Open Platform this tag was missing.

On the support forum for the Guardian Open Platform I got some really helpful (and prompt) responses from Matt Mcallister including:

The style formatting that you’re looking for isn’t available in the current version of the API.  But we’ve had similar feedback from others and have included that feature request in our to-do list.

Because of this, I ended up using the RSS feeds to grab the programme names. The channel the programme was on always followed the programme name in brackets, so I was able to grab this reasonably easily at the same time using a regular expression:

m/<strong>(.*?)<\/strong>\s?\((.*?)\)/g

(I’m not a reg exp expert, so any better version of this is welcome, but it does the job)
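For the curious, here’s roughly the same pattern rendered in Python’s re module (the sample description string is made up purely for illustration):

import re
# Programme name in <strong> tags, followed by the channel in brackets
pattern = re.compile(r"<strong>(.*?)</strong>\s?\((.*?)\)")
# A made-up snippet in the style of the "Last night's TV" RSS descriptions
description = "<strong>Richard Hammond's Invisible Worlds</strong> (BBC1) was well worth catching..."
for programme, channel in pattern.findall(description):
    print(programme + " on " + channel)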

Because the content in the RSS feeds is intended for personal use only, I can’t republish content from there – but luckily the RSS feed includes an ‘item id’ which can be used to look up the content on the Open Platform. So I combine the programme names with the text of the article and other information from the Open Platform, and I’ve got my list of programmes with the full text of the reviews attached.
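In rough terms, that Open Platform lookup by item id looks something like the sketch below (in Python). The endpoint and parameter names are taken from the Content API documentation and may differ from the version you’re using, and the item id and API key are placeholders:

import json
import urllib.request
item_id = "tv-and-radio/2010/mar/01/example-tv-review"   # placeholder item id taken from the RSS feed
api_key = "YOUR-API-KEY"                                  # you need to register for an Open Platform key
url = ("http://content.guardianapis.com/" + item_id +
       "?api-key=" + api_key + "&show-fields=body&format=json")
with urllib.request.urlopen(url) as response:
    data = json.loads(response.read().decode("utf-8"))
# The article body (as HTML) should be under response -> content -> fields -> body
print(data["response"]["content"]["fields"]["body"][:200])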

Now to mash up with the BBC content. The biggest problem is going from the programme name to the information the BBC can provide, which is identified by a unique ID. For example the URI for the series “Richard Hammond’s Invisible Worlds” is http://www.bbc.co.uk/programmes/b00rmrmm, but from the review all I get is the name of the programme as a text string. I started to play around with simply throwing the programme name at the BBC search engine, but then @moustaki (Yves Raimond) came to my rescue by letting me know you could simply construct a URL:

http://www.bbc.co.uk/programmes/title of your programme [with spaces removed]

and it would automatically try to match (generally exactly, but with some ‘added heuristics’). So I was able to construct a URL like this, and then grab the final destination page from the response, so that:

http://www.bbc.co.uk/programmes/richardhammond'sinvisibleworlds

redirects to

http://www.bbc.co.uk/programmes/b00rmrmm

This doesn’t work in 100% of cases – the main problem I’ve come across is when the Guardian reviews a programme from a documentary strand (e.g. Time Shift or Storyville), they often just use the title of the episode, and omit the name of the strand. Unfortunately the BBC linking doesn’t pick this up – so for example:

http://www.bbc.co.uk/programmes/bread:aloafaffair

doesn’t pick up this Time Shift episode:

http://www.bbc.co.uk/programmes/b00rm508

Overall this approach gives a relatively good hit rate. At the moment, if the lookup is unsuccessful I just offer a link to a BBC search for the programme title – I could probably improve the match rate further with a bit of extra code.
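For reference, here’s a rough Python sketch of the lookup-by-redirect step described above (the programme title is just the example used earlier, and I’m assuming the BBC keeps handling the redirect in the same way):

import urllib.error
import urllib.request
def bbc_programme_url(title):
    # Lowercase the title and strip the spaces, then let the BBC site redirect
    # us to the canonical /programmes/<id> page (this won't work for every title)
    slug = title.lower().replace(" ", "")
    try:
        with urllib.request.urlopen("http://www.bbc.co.uk/programmes/" + slug) as response:
            return response.geturl()   # the final URL after any redirects
    except urllib.error.HTTPError:
        return None                    # no match - fall back to a link to a BBC search
print(bbc_programme_url("Richard Hammond's Invisible Worlds"))
# should print http://www.bbc.co.uk/programmes/b00rmrmm, as in the example above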

The next problem was how to get the iPlayer details for the programme. Luckily the BBC expose a lot of their programme data in a structured way. I had expected the iPlayer data to be available as RDF, as this is how the BBC has been exposing much of their data (there is a lot written about this – see this Nodalities article for example), but it looks like the iPlayer information is still on the edges of this. However, there is a standard way of retrieving iPlayer data which is documented on the BBC Backstage site. This allows you to construct a URI using a ‘groupID’ (an ID which represents the group which owns the programme – usually the ‘series’ ID) – so for the Richard Hammond series we can use the following URI:

http://www.bbc.co.uk/programmes/b00rmrmm/episodes/player

This returns some XML including the availability of the episodes on the iPlayer. I then integrate this with the Guardian data, and I have my final set of data that I’m ready to publish – a TV review, and links to episodes of programmes mentioned in that review on the BBC iPlayer service.
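In code, fetching that availability feed is a one-liner; I’m deliberately not assuming anything about the element names in the returned XML, so this sketch just dumps the tree for inspection:

import urllib.request
import xml.etree.ElementTree as ET
group_id = "b00rmrmm"   # the group (series) ID from the /programmes URL
url = "http://www.bbc.co.uk/programmes/" + group_id + "/episodes/player"
with urllib.request.urlopen(url) as response:
    root = ET.fromstring(response.read())
# Dump every element so you can see what's available (episode titles, availability etc.)
for element in root.iter():
    print(element.tag, (element.text or "").strip())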

The next step was to publish this on the web somewhere. Now, this is where my skills fell down badly. Basically I’m not a designer, and making stuff look good is really not my forte. To quote my art teacher in a school report about my skills at art and handwriting:

Art: Tries hard with some success

Handwriting: Tries hard

So, my first attempt was as simple HTML as you could get, and (to put it bluntly) as ugly as sin (anyone who has looked at my ReadtoLearn app will know the score). I was left wondering how I could deliver something that looked nice, but didn’t require me to magically gain design skills, and I had a sudden inspiration: WordPress. The reason this site looks good (I hope) despite my lack of design skills is that I use WordPress and one of the many, many ‘themes’ available – so I thought if I could squeeze the data I had into a WordPress installation I’d have a nice looking web interface for free.

Some further thought and investigation later, I realised that the easiest way (that I could see) of achieving this was to publish my application as an RSS feed (no need to worry about the formatting – just the content), and then use one of the WordPress ‘syndication’ plugins to scoop up the RSS items and republish them as WordPress blog posts. The use of WordPress syndication was something I originally picked up from Tony Hirst, who pointed at a blog post by Jim Groom describing exactly how to do it.

So, some tweaking of my code to output RSS (with most of the useful information in the description tag in basic HTML), a ‘one click’ install of WordPress on my website (I use Dreamhost as my web host, who offer a number of ‘one-click’ installs including WordPress), and a little while experimenting with different themes to see which one I liked best for this particular app (I went with Boumatic by Allan Cole), and I had transformed my ugly HTML into something altogether more elegant and polished.
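In case it’s useful to anyone doing something similar, the RSS side really is as basic as it sounds – each review becomes an item, with the review text and iPlayer links carried as escaped HTML in the description. A rough sketch (the titles and links here are made up):

from xml.sax.saxutils import escape
def rss_item(title, link, description_html):
    # One RSS <item>, with the HTML for the post body escaped into the description
    return ("<item><title>" + escape(title) + "</title>" +
            "<link>" + escape(link) + "</link>" +
            "<description>" + escape(description_html) + "</description></item>")
item = rss_item("Last night's TV: Richard Hammond's Invisible Worlds",
                "http://example.org/whattowatch/1",
                "<p>Review text here...</p><p><a href='http://www.bbc.co.uk/programmes/b00rmrmm'>Watch on iPlayer</a></p>")
feed = ("<?xml version='1.0' encoding='UTF-8'?><rss version='2.0'><channel>" +
        "<title>What to Watch</title><link>http://example.org/whattowatch</link>" +
        "<description>TV reviews with iPlayer links</description>" +
        item + "</channel></rss>")
print(feed)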

The Guardian Open Platform terms and conditions mean that you have to refresh your data at least every 24 hours (I’m guessing this is in case there are any corrections or take-down notices on any of their content), so I added another WordPress plugin which deletes posts automatically after 1 day, and then I had my “What to Watch” application ready to go. Not only that, but adding the WPTouch plugin means the site also works nicely on a range of handheld devices – no extra effort on my part.

There’s still some work to do, and I’ve got some ideas for improvements, but for now I’m pretty happy both with the mashup, and the way I’ve managed to publish it via WordPress – but as always suggestions for different approaches, or improvements to the app are very welcome. Have a look at What to Watch and tell me what you think 🙂

Linked Data

Linked Data is getting a lot of press at the moment – perhaps most notably last week Gordon Brown (the UK Prime Minister) said:

Underpinning the digital transformation that we are likely to see over the coming decade is the creation of the next generation of the web – what is called the semantic web, or the web of linked data.

This statement was part of a speech at “Building Britain’s Digital Future” (#bbdf) (for more on the context of this statement, see David Flanders’ ‘eye witness’ account of the speech, and his thoughts).

Last week I attended a ‘Platform Open Day’ at Talis, which was about Linked Data and related technologies, so I thought I’d try to get my thoughts in order. I may well have misunderstood bits and pieces here and there, but I’m pretty sure that the gist of what I’m saying here is right (and feel free to post comments or clarifications if I’ve got anything wrong).

I’m going to start with considering what Linked Data is…

The principles of Linked Data are stated by Tim Berners-Lee as:

  1. Use URIs as names for things
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)
  4. Include links to other URIs. so that they can discover more things.

What does this mean?

While most people are familiar with URLs, the concept of a URI is less well known. A URL is a resource locator – if you know the URL, you can locate the resource. A URI is a resource identifier – it simply identifies the resource. In fact, URLs are a special kind of URI – that is any URL is also a URI in that a URL both identifies and locates a resource. So – all URLs are also URIs, but not vice versa. You can read more about URIs on Wikipedia.

Further to this, an ‘HTTP URI’ is simply a URI using the http scheme – in other words, something that looks like the web addresses we are used to using on the web.

This means that the first two principles together basically say you should identify things using web addresses. This sounds reasonably straightforward. Unfortunately there is some quite tricky stuff hidden behind these straightforward principles, which basically comes down to the fact that you have to be very careful and clear about what any particular HTTP URI identifies.

For example this URI:

http://www.amazon.co.uk/Pride-Prejudice-Penguin-Classics-Austen/dp/0141439513/ref=sr_1_9?ie=UTF8&s=books&qid=1269423132&sr=8-9

doesn’t identify (as you might expect) Pride and Prejudice, but rather identifies the Amazon web page that describes the Penguin Classics edition of Pride and Prejudice. This may seem like splitting hairs, but if you want to start making statements about things using their identifiers it is very important. I might want to state that the author of Pride and Prejudice is Jane Austen. If I say:

http://www.amazon.co.uk/Pride-Prejudice-Penguin-Classics-Austen/dp/0141439513/ref=sr_1_9?ie=UTF8&s=books&qid=1269423132&sr=8-9 is authored by Jane Austen, then strictly I’m saying Jane Austen wrote the web page, rather than the book described by the web page.
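To make the distinction concrete, here are the two different statements written out as RDF triples – sketched with Python’s rdflib and the Dublin Core ‘creator’ property, and with a URI for the book itself that I’ve simply made up (which is rather the point):

from rdflib import Graph, Literal, Namespace, URIRef
DC = Namespace("http://purl.org/dc/elements/1.1/")
amazon_page = URIRef("http://www.amazon.co.uk/Pride-Prejudice-Penguin-Classics-Austen/dp/0141439513/")
the_book = URIRef("http://example.org/id/book/pride-and-prejudice")   # made-up URI for the book itself
g = Graph()
g.add((amazon_page, DC.creator, Literal("Jane Austen")))   # says Jane Austen wrote the *web page*
g.add((the_book, DC.creator, Literal("Jane Austen")))      # says Jane Austen wrote the *book*
print(g.serialize(format="turtle"))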

Moving on to principle 3, things get a little more controversial. I’m going to break this down into two parts. Firstly, “When someone looks up a URI, provide useful information”. Probably the key thing to note here is that when you identify things with an HTTP URI (as per principles 1 and 2), you are often going to be identifying things that can’t be delivered online. If I identify a physical copy of a book (for example, my copy of Pride and Prejudice, sitting on my bookshelf), I can give it an HTTP URI, but if you type that URI into a web browser, or in some other way try to ‘retrieve’ it, the physical item isn’t going to appear before you. So if you look up that URI, the third principle says that you should get some ‘useful information’ – for example, a description of my copy of Pride and Prejudice. There are some technical implications of this, as I have to make sure that you get some useful information about the item (e.g. a description) while still being clear that the URI identifies the physical item, rather than the description of it – but I’m not going to worry too much about that now.

The second part of principle 3 is where we move into territory which tends to set off heated debate. This says “using the standards (RDF, SPARQL)”. Firstly it invokes ‘standards’, and secondly it lists two specific standards. I feel that the wording isn’t very helpful. It does make it clear that Linked Data is about doing things in a standardised way – this is clearly important, and yet also very difficult. As anyone who has worked with bibliographic metadata will appreciate, achieving standards even across a relatively small and tight-knit community such as librarians is difficult enough – getting standardisation across larger, disparate communities is very challenging indeed.

What I don’t think the principle makes very clear is which standards are being used – it lists two (RDF and SPARQL), but as far as I can tell most people would agree that RDF is actually the key thing here, making this list of two misleading however you read it. I’m not going to describe RDF or SPARQL here, but may come back to them in future posts. In short, RDF provides a structured way of making assertions about resources – there is a simple introduction in my Slideshare presentation on the Semantic Web. SPARQL is a language for querying RDF.

There is quite a bit of discussion about whether RDF is essential to ‘Linked Data’ including Andy Powell on eFoundations, Ian Davis on Internet Alchemy, and Paul Miller on Cloud of Data.

So finally, on to principle 4: “Include links to other URIs. so that they can discover more things.”. The first three principles are concerned with making your data linkable – i.e. making it possible for people to link to your data in meaningful ways. The fourth principle says you should link from your data to other things. For my example of representing my own copy of Pride and Prejudice, that could include linking to information about the book in a more general sense – rather than record the full details myself, I could (for example) link to an OpenLibrary record for the book. Supporting both inbound and outbound links is key to making a rich, interconnected set of data, enabling the ‘browsing’ of data in the same way we currently ‘browse’ webpages.

I was originally intending to explore some of the arguments I’ve come across recently about ‘Linked Data’ – I especially wanted to tackle some of the issues raised by Mike Ellis in his ‘challenge’, but I think that this post is quite long enough already, so I’ll leave that for another time.

Read to Learn: Updated

Last year I blogged about my entry into the JISC MOSAIC competition which I called ‘ReadtoLearn’. The basic idea of the application was that you could upload a list of ISBNs, and by using the JISC MOSAIC usage data the application would generate a list of course codes, search for those codes on the UCAS web catalogue, and return a list of institutions and courses that might be of interest to you, based on the ISBNs you had uploaded.

While I managed to get enough done to enter the competition, I had quite a long ‘to do’ list at the point I submitted the entry.

The key issues I had were:

  • You could only submit ISBNs by uploading a file (using http post)
  • The results were only available as an ugly html page
  • It was slow

Recently I’ve managed to find some time to go back to the application, and have now added some extra functionality, and also managed to speed up the application slightly (although it still takes a while to process larger sets of ISBNs).

Another issue I noted at the time was that because “the MOSAIC data set only has information from the University of Huddersfield, the likelihood of matching any particular ISBN is relatively low”. I’m happy to say that the usage data that the application uses (via an API provided by Dave Pattern) has been expanded by a contribution from the University of Lincoln.

One of the biggest questions for the application is where a potential user would get a relevant list of ISBNs from in the first place (if they even know what an ISBN is). I’m still looking at this, but I’ve updated the application so there are now three ways of getting ISBNs into it. The previous file upload still works, but a comma separated list of ISBNs can now be submitted to the application (using http get), and the URL of a webpage (or RSS feed etc.) containing ISBNs can also be submitted, with the ISBNs extracted using regular expressions (slower, but a very generic way of getting ISBNs into the application). I would like to look at further mechanisms, such as harvesting ISBNs from an Amazon wishlist or order history, or a LibraryThing account, but for the moment you can submit a URL and the regular expressions should do the rest.
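The extraction itself is nothing clever – essentially a regular expression looking for 10 or 13 digit strings. This is a simplified sketch rather than the exact pattern the application uses (and it will cheerfully match numbers that aren’t really ISBNs):

import re
# 13 digits, or 9 digits followed by a final digit or X (the ISBN-10 check character)
isbn_pattern = re.compile(r"\b(?:\d{13}|\d{9}[\dXx])\b")
page_text = "Reading list: 9780141439518 (paperback) or 0141439513; call us on 01234 567890"
print(isbn_pattern.findall(page_text))
# ['9780141439518', '0141439513']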

Rather than the old HTML output, I’ve now made the results available as XML instead. Although this is not pretty (obviously), it does mean that others can use the application to generate lists of institutions/courses if they want. On my to do list now is to use my own XML to generate a nice HTML page (eating your own dog food I think they call it!).

I also restructured the application a little, and split it into two scripts (which allowed me to provide a UCAS code lookup script separately).

Finally, one issue with the general idea of the application was the question of how much overlap with the books borrowed by users on a specific course should lead to a recommendation. For example, if 4 ISBNs from your uploaded list turned out to all have been borrowed by users on courses with the code ‘W300’, should this constitute a recommendation to take a W300 course? My solution was to offer two ‘match’ options – one was to find ‘all’ matches, meaning that even a single ISBN related to a course code would result in a recommendation for that course code. The second option was to ‘find close matches only’ – this only recommended a course code if the number of ISBNs you matched was at least 1% of the total ISBNs related to that course code in the usage data. I decided to generalise this a bit, so you can now specify the percentage of overlap you are looking for (although experience suggests that this is going to be low with the current data – perhaps less than 1%).
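In other words, the decision for each course code boils down to a simple percentage calculation – roughly the following (the variable names are mine rather than from the actual code, and the figures in the example come from the sample response further down):

def recommend(your_related, total_related, match):
    # your_related  - ISBNs from your list that are linked to this course code
    # total_related - all ISBNs linked to this course code in the usage data
    # match         - 'All', or the minimum percentage overlap required
    if your_related == 0:
        return False
    if match == "All":
        return True
    return (your_related / total_related) * 100 >= float(match)
print(recommend(3, 385, "0.5"))   # True - 3 of 385 is roughly 0.78%
print(recommend(3, 385, "2"))     # False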

So, the details are:

Application URL:

http://www.meanboyfriend.com/readtolearn/studysuggest

GET Parameters:

match

Values: ‘All’ or a number between 0 and 100 (must be >0)

Definition: the percentage overlap between the ISBNs in your submitted list that relate to a course code and the total ISBNs related to that course code which will constitute a ‘recommendation’. ‘All’ will retrieve all courses where at least one ISBN has been matched.

isbns

Values: a comma separated list of 10 or 13 digit ISBNs

url

Values: a URL-encoded URL (including the ‘http://’ part) of a page/feed which includes ISBNs. ISBNs will be extracted using a regular expression. (See http://www.blooberry.com/indexdot/html/topics/urlencoding.htm for information on URL encoding)

If both the isbns and url parameters are submitted, all ISBNs from the list and the specified webpage will be used.

Example:

An example request to the script could be:

http://www.meanboyfriend.com/readtolearn/studysuggest?match=0.5&isbns=0722177755,0722177763,0552770884,043999358,0070185662,0003271323,0003271331,0003272788

Response:

The response is xml with the following structure (this is an example with a single course code):

<study_recommendations>
<course type="ucas" code="V1X1" ignore="No" total_related="385" your_related="3">
<items>
<item isbn="0003271331"></item>
<item isbn="0003271323"></item>
<item isbn="0003272788"></item>
</items>
<catalog>
<provider>
<identifier>S84</identifier>
<title>University of Sunderland</title>
<url>http://www.ucas.com/students/choosingcourses/choosinguni/instguide/s/s84</url>
<course>
<identifier>997677</identifier>
<title>History with TESOL</title>
<url>http://search.ucas.com/cgi-bin/hsrun/search/search/StateId/Dhh-QG8Bhe33Egpbb227I8OPTGQUw-VTyY/HAHTpage/search.HsDetails.run?n=997677</url>
</course>
</provider>
<provider>
<identifier>H36</identifier>
<title>University of Hertfordshire</title>
<url>http://www.ucas.com/students/choosingcourses/choosinguni/instguide/h/h36</url>
<course>
<identifier>971629</identifier>
<title>History with English Language Teaching (ELT)</title>
<url>http://search.ucas.com/cgi-bin/hsrun/search/search/StateId/Dhh-QG8Bhe33Egpbb227I8OPTGQUw-VTyY/HAHTpage/search.HsDetails.run?n=971629</url>
</course>
</provider>
</catalog>
</course>
</study_recommendations>

The ‘catalog’ element essentially copies the data structure from XCRI-CAP, which I’ve documented in my previous post – I’m not using the XCRI-CAP namespace at the moment, but I may come back to this when I have time. The ‘course’ and ‘provider’ elements can both be repeated.
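If you want to consume the XML from your own code, it’s straightforward to walk – for example, in Python (element and attribute names as in the example response above; the ISBNs in the request are just illustrative):

import urllib.request
import xml.etree.ElementTree as ET
url = ("http://www.meanboyfriend.com/readtolearn/studysuggest"
       "?match=0.5&isbns=0141439513,9780141439518")
with urllib.request.urlopen(url) as response:
    root = ET.fromstring(response.read())
for course in root.findall("course"):
    print(course.get("code"), "-", course.get("your_related"), "of",
          course.get("total_related"), "related ISBNs matched")
    for provider in course.findall("catalog/provider"):
        for offered in provider.findall("course"):
            print("  ", provider.findtext("title"), ":", offered.findtext("title"))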

If you are interested in using it please do, and drop me a comment here if you have examples, or suggestions for further improvements.

UCAS Course code lookup: Take two

Last year as part of the JISC MOSAIC competition, I put together a script which allowed you to search the online UCAS catalogue using a course code, and get an XML response. The XML it returned was just a basic format which suited my purposes at the time, and in the comments I gave the following response to Alan Paull who mentioned XCRI:

I’m aware of the XCRi model and the XCRi-CAP work, and did wonder if I could output my scraped results in this format, but in the end decided for something quicker and dirtier for my purposes.

XCRI (eXchanging Course Related Information) is a JISC funded initiative to “establish a specification to support the exchange of course-related information”. This has established an XML specification intended to enable courses to be advertised or listed in a consistent manner – this is called ‘XCRI-CAP’ (Course Advertising Profile). A number of projects and institutions have implemented XCRI-CAP (a list of projects is available from the CETIS website).

The key thing for me about this approach is the idea that if all institutions (let’s say UK HE institutions, but XCRI-CAP is not sector specific) published their course catalogue following this specification, it would be a relatively simple matter to use, aggregate, disaggregate and reuse this data.

I’ve wanted to get back to this for a while, and finally got round to it, so you can now get results from the script in XCRI-CAP. I have to admit to some confusion as to what makes valid XCRI-CAP – I’ve run the results through the validator blogged by David Sherlock, and get a small number of warnings regarding the lack of ‘descriptions’ for each provider I list. However, the XCRI wiki entry for the provider element suggests that the description is ‘optional’ (although it then says it ‘should’ be provided).

The script is at:

http://www.meanboyfriend.com/readtolearn/ucas_search

The script accepts four parameters described here:

format

  • If left blank, results will be returned in the default XML format (not xcri-cap) – documented below
  • If set to the value ‘xcri-cap’ the results will be returned in xcri-cap XML – see notes below. If there is an error, this will use the default XML format documented below

course_code

  • Accepts a UCAS course code, which is used to search the online UCAS catalogue (4 alphanumeric characters)

catalogue_year

  • Accepts a year in the format YYYY
  • If no year is given, this is left blank
  • UCAS supports searches against more than one catalogue at a time, to enable searching against the current and coming year. If left blank, as far as I can tell, this defaults to the catalogue for the current year (at time of writing, 2010)

stateID

  • The UCAS website uses a session identifier in all URLs called the ‘stateID’
  • If a stateID is supplied to the script, it will use it (unless it turns out to be invalid)
  • If no stateID is supplied, or the stateID supplied is invalid, the script will obtain a new stateID
  • If you are doing repeated requests against the script, it would be ‘polite’ to get a stateID from the first request and reuse it in subsequent requests, so the script isn’t constantly starting new sessions on the UCAS website (see the sketch after the example request below)

So a valid request to the script could be:

http://www.meanboyfriend.com/readtolearn/ucas_search?course_code=W300&format=xcri-cap&catalogue_year=2010
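And here’s a quick sketch (in Python) of the ‘polite’ stateID reuse mentioned above – the ucas_stateid comes back as an attribute on the root element of the default XML response, as documented below:

import urllib.request
import xml.etree.ElementTree as ET
def ucas_search(course_code, state_id=""):
    url = ("http://www.meanboyfriend.com/readtolearn/ucas_search"
           "?course_code=" + course_code)
    if state_id:
        url += "&stateID=" + state_id
    with urllib.request.urlopen(url) as response:
        return ET.fromstring(response.read())
first = ucas_search("W300")              # the first request starts a new UCAS session
state_id = first.get("ucas_stateid")
second = ucas_search("V1X1", state_id)   # later requests reuse the same session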

In terms of output, there are two formats, the default XML, and the XCRI-CAP XML.

XCRI-CAP XML

I’m outputting a minimal amount of data, as I’ve limited myself to scraping only information from the UCAS catalogue search results page. This means I’m currently including only the following elements:

<catalog>
<provider>
<identifier />(I suspect I’ve got a problem here. I’m using the UCAS identifier, which I can’t really find any information about. From the XCRI wiki it looks like I need to be using a URI here)
<title />
<url />(I’m using the URL for the UCAS page for the institution. This includes the stateID, as to link to the UCAS page requires a valid session. It isn’t ideal, as this is only valid for a limited period of time [now amended to use a different URL to the UCAS web page which does not include stateID])
<course>
<identifier />(I’m using the UCAS identifier for the course, again it looks like I should be using a URI from the wiki?)
<title />
<url />(I’m using the URL for the UCAS page for the course. This includes the stateID, as to link to the UCAS page requires a valid session. It isn’t ideal, as this is only valid for a limited period of time)
</course>
</provider>

I am looking at whether I can get more information, but to add to the information I’m currently returning would mean doing some further requests to the UCAS website to scrape information from other pages to supplement the basic information available on the search results page.

Default XML

The default XML format is documented in my previous blog post, but just to recap:

<ucas_course_results course_code="R901" catalogue_year="2010" ucas_stateid="DtDdAozqXysV4GeQbRbhP3DxTGR2m-3eyl">
<institution code="P80" name="University of Portsmouth">
<inst_ucas_url>
http://search.ucas.com/cgi-bin/hsrun/search/search/StateId/DtGJmwzptIwV4rADbR8xUfafCk6nG-Ur61/HAHTpage/search.HsInstDetails.run?i=P80
</inst_ucas_url>
<course ucas_catalogue_id="">
<course_ucas_url>
http://search.ucas.com/cgi-bin/hsrun/search/search/StateId/DtGJmwzptIwV4rADbR8xUfafCk6nG-Ur61/HAHTpage/search.HsDetails.run?n=989628
</course_ucas_url>
<name>Combined Modern Languages</name>
</course>
</institution>
</ucas_course_results>

Note that you get the ucas_stateid returned, so it can be reused in future requests. Finally, if there are any errors, these will always be returned in the default XML format (even if you request xcri-cap format):

<ucas_course_results course_code="" catalogue_year="" ucas_stateid="">
<error />
</ucas_course_results>