Scraping, scripting and hacking your way to API-less data

Mike Ellis from Eduserv, talking about getting data out of web pages.

Scraping – basically allows you to extract data from web pages – and then you can do stuff with it! Some helpful tools for scraping:

  • Yahoo!Pipes
  • Google Docs – use of the importHTML() function to bring in data, and then manipulate it
  • dapper.net (also mentioned by Brendan Dawes)
  • YQL
  • httrack – copy an entire website so you can do local processing
  • hacked search – use Yahoo! search to search within a domain – essentially allows you to crawl a single domain and then extract data via search
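Most of these tools boil down to the same idea: pull a page, find the structured bits (often tables), extract them. A minimal sketch of that idea using only Python's standard library — the HTML snippet here is invented for illustration, and a real scrape would fetch the page first (e.g. with `urllib.request`):

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collect the text of each <td> cell, grouped into rows by <tr>."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and data.strip():
            self._row.append(data.strip())

# Stand-in for a fetched web page
page = """<table>
  <tr><td>Title</td><td>ISBN</td></tr>
  <tr><td>Example Book</td><td>9780140328721</td></tr>
</table>"""

scraper = TableScraper()
scraper.feed(page)
print(scraper.rows)  # → [['Title', 'ISBN'], ['Example Book', '9780140328721']]
```

This is roughly what Google Docs' importHTML() does for you in one formula; the hand-rolled version wins when the page needs cleaning first.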

So, once you’ve scraped your data, you need some tools to ‘mung’ it (i.e. manipulate it)

  • regex – regular expressions are hugely powerful, although can be complex – see some examples at http://mashedlibrary.ning.com/forum/topics/extracting-isbns-from-rss
  • find/replace – can use any scripting language, but you can even use Word (I like to use Textpad)
  • mail merge (!) – if you have data in excel, or access, or csv etc. you can use mail merge to output with other information – e.g. html
  • html removal – various functions available
  • html tidy – http://tidy.sourceforge.net – can chuck in ‘dirty’ html – e.g. cut and pasted from Word – and tidy it up
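As an example of the regex approach, here is a rough sketch of the ISBN-extraction idea from the mashedlibrary link above — the sample text is invented, and the pattern is deliberately crude (a 13-digit number, or 9 digits plus a digit/X check character), not a validating ISBN parser:

```python
import re

text = """New in stock: 9780140328721 (Matilda) and
an older record with ISBN 014032872X listed twice: 9780140328721."""

# Very rough: a 13-digit run, or 9 digits followed by a digit or X
isbn_re = re.compile(r"\b(?:\d{13}|\d{9}[\dX])\b")

isbns = sorted(set(isbn_re.findall(text)))
print(isbns)  # → ['014032872X', '9780140328721']
```

Dedupe with set(), then feed the result into whatever comes next — exactly the kind of quick mung the session was about.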

Processing data:

  • Open Calais – service from Reuters that analyses a block of text for ‘meaning’ – e.g. if it recognises the name of a city it can give information about the city such as latitude/longitude etc.
  • Yahoo!Term Extraction – similar to Open Calais – submit text/data and get back various terms – also allows tuning so that you can get back more relevant results
  • Yahoo!geo – a set of Yahoo tools for processing geographic data – http://developer.yahoo.com/geo
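Open Calais and Yahoo! Term Extraction are hosted web APIs, so there's nothing to show locally without an API key — but a toy, purely frequency-based term extractor gives a feel for what "submit text, get back terms" means (the stopword list and sample text are invented for illustration):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "that", "it"}

def extract_terms(text, n=3):
    """Naive term extraction: the most frequent non-stopword tokens."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [term for term, _ in counts.most_common(n)]

sample = ("The library catalogue holds usage data; usage data from the "
          "catalogue can drive recommendations in the catalogue interface.")
print(extract_terms(sample))
```

The real services are far smarter — entity recognition, tuning for relevance — but the shape of the interaction (text in, ranked terms out) is the same.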

The ugly sisters:

  • Access and Excel – don’t dismiss these! They are actually pretty powerful

Last resorts:

Somewhere I have never travelled

This presentation was by Brendan Dawes – http://www.brendandawes.com/ (powered by WordPress).

Brendan is quite into data – “data porn” – visualising data. He says that much of the web is still designed as if it’s in print.

Making ‘weird creatures’ out of keywords http://www.brendandawes.com/?s=redux – the creatures’ size indicates popularity, the speed they move depends on age – but this stuff doesn’t come with an instruction manual – nowhere are these links between data and behaviour documented for the ‘end user’. He’s just putting it out there and trying it out.

‘Interfaces’ are important – Brendan likes to collect ideas in ‘Field Notes’ books – http://fieldnotesbrand.com/. He also has a firewire drive full of ‘doodles’ as his ‘digital notebook’ – just bits and pieces of stuff that may do one thing – e.g. a drawing app that allows you to draw things in black ink – that sat there for ages, and he did nothing with it. Then he had an idea that he wanted to be able to put stuff on lines that he had drawn – he found something that someone else had done online, and put that in his digital notebook too.

Brendan wanted to do something with http://www.daylife.com/

(Aside – when you design stuff for people, avoid colours, as people can dump a perfectly good idea if you’ve done it in the wrong colour! Use black and white, because it doesn’t upset anyone 🙂)

What would happen if we removed interfaces completely? Allowed people to build their own interface?

So – all of these bits and pieces came together as http://doodlebuzz.com/ – allows you to do a search – then you draw a line to see the results displayed.

Memoryshare – a BBC project to share memories. Original version had a rather dull interface – didn’t engage people, so not very good usage – although the content is very compelling when you start reading. Brendan and team did a range of prototypes – very open brief – basically do anything you want.

Took the ideas from the Daylife example – displaying time-based events on a spiral line – great ‘wow’ moment when you see the spiral on the screen, and then as you zoom in it becomes obvious that it is a 3D environment – very, very pretty! The original demo was in Flash, which couldn’t cope with the amount of data in Memoryshare – but the BBC really liked the design, so the team figured out how to do it – see the results at http://www.bbc.co.uk/dna/memoryshare/ – compare this to the old design at the Internet Archive Wayback Machine.

Brendan is now moving on to using data to produce physical objects – he mentioned a site I didn’t get (Update: thanks to @nicoleharris, got this now http://www.ponoko.com/make-and-sell/how-to-make) that allows you to upload a design and get it made – for example, Brendan has had some wooden luggage tags made with data displayed on them. Moo.com has an API – you can pump data in and get physical objects out. Brendan has written something that takes data from wefeelfine.org and pushes it to moo.com to make cards – transferring transient digital data into less transient physical objects.

Visualisation

Iman Moradi is talking about how we organise library stock and spaces – he’s going through at quite a pace, so very brief notes again.

Finding things is complex

It’s a cliché that library users often remember the colour of the book more than the title – but why don’t we respond to this? Organise books by colour – example from Huddersfield town library.

Iman did a demonstrator – building a ‘quotes’ base for a book – use a pen scanner to scan a chunk of text from a book, and associate it with the book via ISBN – this starts to build a set of quotes from the book that people found ‘of interest’.

Think about libraries in terms of games – users are ‘players’, the library is the ‘game environment’. Using libraries is like a game:

  • Activities = Finding, discovery, collection
  • Points/levels = acquiring knowledge

Mash Oop North

Today I’m at Mash Oop North aka #mashlib09 – and kicking off with a presentation from Dave Pattern – some very brief notes:

Making Library Data Work Harder

Dave Pattern – www.slideshare.net/daveyp/

Keyword suggestions – about 25% of keyword searches on Huddersfield OPAC give zero results.
Look at what people are typing in the keyword search – Huddersfield found ‘renew’ was a common search term – so they pop up an information box about renewing your books.

Looking at common keyword combinations can help people refine their searches
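The refinement idea can be sketched from a search log alone: for a given term, count which other terms most often appear alongside it. A minimal version — the log entries here are invented, not Huddersfield's real data:

```python
from collections import Counter

# Hypothetical keyword-search log, one query per entry
searches = [
    "nursing ethics", "nursing research", "nursing ethics",
    "renew", "java programming", "nursing placement",
]

def suggest_refinements(term, log, n=2):
    """Terms that most often co-occur with `term` in past searches."""
    co = Counter()
    for query in log:
        words = query.split()
        if term in words:
            co.update(w for w in words if w != term)
    return [w for w, _ in co.most_common(n)]

print(suggest_refinements("nursing", searches))  # → ['ethics', 'research']
```

The same log also surfaces the zero-result and ‘renew’-style queries mentioned above, just by counting query frequency instead of co-occurrence.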

Borrowing suggestions – people who borrowed this item, also borrowed …
Tesco collects and exploits this data. Libraries sometimes assume we know what is best for our users – perhaps we need to look at the data to prove or disprove our assumptions.

Because borrowing is driven by reading lists, this perhaps helps suggestions stay on-topic
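The "also borrowed" suggestion is just co-occurrence counting over circulation transactions. A toy sketch, with invented loan histories standing in for real borrower records:

```python
from collections import Counter

# Hypothetical loan histories: one set of item IDs per borrower
loans = [
    {"a", "b", "c"},
    {"a", "b"},
    {"b", "d"},
    {"a", "b"},
]

def also_borrowed(item, histories, n=2):
    """'People who borrowed this item also borrowed ...'"""
    co = Counter()
    for history in histories:
        if item in history:
            co.update(history - {item})
    return [i for i, _ in co.most_common(n)]

print(also_borrowed("a", loans))  # → ['b', 'c']
```

A production version would anonymise the histories and weight by recency, but the core of the Tesco-style trick is no more than this.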

Course specific ‘new books’ list – based on what people on specific courses borrow
Able to do Amazon-style personalised suggestions

Borrowing profile for Huddersfield – the average number of books borrowed shows a very high peak in October and a lull during the summer – use of the suggestions can now be seen following this, with a peak in November.

Seems to be a correlation between introduction of suggestions/recommendations with increase in borrowing – how could this be investigated further?

Started collecting e-journal data via SFX – starting to do journal recommendations based on usage.

Suggested scenario – can start seeding new students’ experience – the first time a student accesses the website, use the ‘average’ behaviour of students on the same course – so highly personalised. Also, if information is delivered via widgets, it could be dragged and dropped into other environments.

JISC Mosaic project, looking at usage data (at a national level, I think?)

So – some ideas of stuff that you might do with usage data:

#1 Basic library account info:
Just your bog-standard library options
– view items on loan, hold requests etc.
– renew items
Configure alerting options
– SMS, Facebook, Google Telepathy
Convert Karma
– rewards for sharing information/contributing to pool of data – perhaps swap karma points for free services/waiving fines etc.

#2 Discovery service
Single box for search

#3 Book recommendations
Students like book covers
Primarily a ‘we think you might be interested in’ service
Uses database of circulation transactions, augmented with Mosaic data
Time-relevant to the modules the student is taking
Adapts to choices the student makes over time

#4 New books
Data-mining of books borrowed by student on a course
Provide new books lists based on this information (already doing this at Huddersfield I think)

#5 Relevant Journals

#6 Relevant articles
– Whenever the student interacts with library services (e.g. keyword searches etc.), their profile is refined