Scraping, scripting and hacking your way to API-less data

Mike Ellis from Eduserv, talking about getting data out of web pages.

Scraping lets you extract data from web pages that don’t offer an API, so that you can then do something with it. Some helpful tools for scraping:

  • Yahoo! Pipes
  • Google Docs – use the importHTML() function to bring data into a spreadsheet, then manipulate it there
  • dapper.net (also mentioned by Brendan Dawes)
  • YQL
  • httrack – copy an entire website so you can do local processing
  • hacked search – use Yahoo! search restricted to a single domain – this essentially lets you crawl that domain and then extract data via the search results
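
To make the idea concrete, here is a minimal Python sketch of the kind of job importHTML() does: pulling the rows of a table out of a page. It uses only the standard library. The sample HTML and the names TableExtractor and scrape_table are made up for this illustration; a real page would need fetching first and is likely to be much messier.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of every table row in an HTML document."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being built, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            if self._cell is not None and self._row is not None:
                self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr":
            if self._row is not None:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    """Return all table rows found in the given HTML as lists of strings."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample page standing in for a real scraped document
SAMPLE_PAGE = """
<html><body>
<table>
  <tr><th>Title</th><th>Year</th></tr>
  <tr><td>Book A</td><td>2001</td></tr>
  <tr><td>Book B</td><td>2005</td></tr>
</table>
</body></html>
"""

rows = scrape_table(SAMPLE_PAGE)
for row in rows:
    print(row)
```

Once the rows are plain lists, they can be written out as CSV or passed on to the munging tools.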

So, once you’ve scraped your data, you need some tools to ‘mung’ it (i.e. manipulate it):

  • regex – regular expressions are hugely powerful, although they can be complex – see some examples at http://mashedlibrary.ning.com/forum/topics/extracting-isbns-from-rss
  • find/replace – any scripting language will do, but you can even use Word (I like to use TextPad)
  • mail merge (!) – if you have data in Excel, Access, CSV etc. you can use mail merge to output it together with other information, e.g. as HTML
  • HTML removal – various functions are available for stripping tags out of text
  • HTML Tidy – http://tidy.sourceforge.net – you can chuck in ‘dirty’ HTML (e.g. cut and pasted from Word) and it will tidy it up
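
As a rough sketch of the regex and HTML removal steps, the Python below strips tags from a scraped fragment and then picks out anything that looks like an ISBN. The pattern is my own assumption, not taken from the forum examples linked above: it matches 13 digits starting 978 or 979, or 9 digits followed by a digit or X, and it does not validate check digits. The sample text and the function names are made up for illustration.

```python
import re
from html import unescape

# Assumed pattern: ISBN-13 (978/979 prefix) or ISBN-10, written without hyphens
ISBN_RE = re.compile(r"\b(?:97[89]\d{10}|\d{9}[\dXx])\b")

# Crude tag matcher: fine for simple fragments, not a full HTML parser
TAG_RE = re.compile(r"<[^>]+>")


def strip_html(text):
    """Replace tags with spaces, decode entities, and collapse whitespace."""
    without_tags = TAG_RE.sub(" ", text)
    return " ".join(unescape(without_tags).split())


def find_isbns(text):
    """Return every ISBN-like string in the text, in order of appearance."""
    return ISBN_RE.findall(text)


# Made-up fragment standing in for an item description from a feed
SAMPLE = (
    '<p>New: <a href="/item/1">Mashups &amp; Scraping</a>, '
    "ISBN 9780596514983 (pbk); earlier edition 059610067X.</p>"
)

plain = strip_html(SAMPLE)
isbns = find_isbns(plain)
print(plain)
print(isbns)
```

Stripping the tags first means that numbers inside attributes such as href are not picked up by mistake.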

Processing data:

  • Open Calais – service from Reuters that analyses a block of text for ‘meaning’ – e.g. if it recognises the name of a city it can give information about the city, such as its latitude/longitude
  • Yahoo! Term Extraction – similar to Open Calais – submit text/data and get back various terms – also allows tuning so that you can get back more relevant results
  • Yahoo! geo – a set of Yahoo! tools for processing geographic data – http://developer.yahoo.com/geo

The ugly sisters:

  • Access and Excel – don’t dismiss these! They are actually pretty powerful.

Last resorts:
