{"id":507,"date":"2009-07-07T11:49:47","date_gmt":"2009-07-07T10:49:47","guid":{"rendered":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/?p=507"},"modified":"2009-07-07T11:49:47","modified_gmt":"2009-07-07T10:49:47","slug":"scraping-scripting-and-hacking-your-way-to-api-less-data","status":"publish","type":"post","link":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/2009\/07\/scraping-scripting-and-hacking-your-way-to-api-less-data\/","title":{"rendered":"Scraping, scripting and hacking your way to API-less data"},"content":{"rendered":"<p><a href=\"http:\/\/electronicmuseum.org.uk\/\">Mike Ellis<\/a> from eduserv talking about getting data out of web pages.<\/p>\n<p>Scraping &#8211; basically allows you to extract data from web pages &#8211; and then you can do stuff with it! Some helpful tools for scraping:<\/p>\n<ul>\n<li><a href=\"http:\/\/pipes.yahoo.com\">Yahoo!Pipes<\/a><\/li>\n<li><a href=\"http:\/\/docs.google.com\">Google Docs<\/a> &#8211; use of the importHTML() function to bring in data, and then manipulate it<\/li>\n<li><a href=\"http:\/\/dapper.net\">dapper.net<\/a> (also mentioned by Brendan Dawes)<\/li>\n<li><a href=\"http:\/\/developer.yahoo.com\/yql\/\">YQL<\/a><\/li>\n<li><a href=\"http:\/\/www.httrack.com\">httrack<\/a> &#8211; copy an entire website so you can do local processing<\/li>\n<li>hacked search &#8211; use Yahoo! search to search within a domain &#8211; essentially allows you to crawl a single domain and then extract data via search<\/li>\n<\/ul>\n<p>So, once you&#8217;ve scraped your data, you need some tools to &#8216;mung&#8217; it (i.e. 
manipulate it)<\/p>\n<ul>\n<li>regex &#8211; regular expressions are hugely powerful, although they can be complex &#8211; see some examples at <a href=\"http:\/\/mashedlibrary.ning.com\/forum\/topics\/extracting-isbns-from-rss\">http:\/\/mashedlibrary.ning.com\/forum\/topics\/extracting-isbns-from-rss<\/a><\/li>\n<li>find\/replace &#8211; you can use any scripting language, or even Word (I like to use <a href=\"http:\/\/www.textpad.com\/\">Textpad<\/a>)<\/li>\n<li>mail merge (!) &#8211; if you have data in Excel, Access, CSV etc. you can use mail merge to output it alongside other information &#8211; e.g. as HTML<\/li>\n<li>HTML removal &#8211; various functions available<\/li>\n<li>HTML Tidy &#8211; <a href=\"http:\/\/tidy.sourceforge.net\">http:\/\/tidy.sourceforge.net<\/a> &#8211; can chuck in &#8216;dirty&#8217; HTML &#8211; e.g. cut and pasted from Word &#8211; and tidy it up<\/li>\n<\/ul>\n<p>Processing data:<\/p>\n<ul>\n<li>Open Calais &#8211; a service from Reuters that analyses a block of text for &#8216;meaning&#8217; &#8211; e.g. if it recognises the name of a city it can give information about the city such as latitude\/longitude etc.<\/li>\n<li>Yahoo!Term Extraction &#8211; similar to Open Calais &#8211; submit text\/data and get back various terms &#8211; also allows tuning so that you can get back more relevant results<\/li>\n<li>Yahoo!geo &#8211; a set of Yahoo tools for processing geographic data &#8211; <a href=\"http:\/\/developer.yahoo.com\/geo\">http:\/\/developer.yahoo.com\/geo<\/a><\/li>\n<\/ul>\n<p>The ugly sisters:<\/p>\n<ul>\n<li>Access and Excel &#8211; don&#8217;t dismiss these! 
They are actually pretty powerful<\/li>\n<\/ul>\n<p>Last resorts:<\/p>\n<ul>\n<li>Use Freedom of Information &#8211; for data you can&#8217;t get any other way, submit FoI requests via <a href=\"http:\/\/www.whatdotheyknow.com\/\">What do they know<\/a><\/li>\n<li>OCR stuff (Mike has used <a href=\"http:\/\/www.softi.co.uk\/freeocr.htm\">http:\/\/www.softi.co.uk\/freeocr.htm<\/a>)<\/li>\n<li>Re-key data &#8211; or use <a href=\"https:\/\/www.mturk.com\/mturk\/welcome\">Mechanical Turk<\/a> to get people to do it for you?<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Mike Ellis from eduserv talking about getting data out of web pages. Scraping &#8211; basically allows you to extract data from web pages &#8211; and then you can do stuff with it! Some helpful tools for scraping: Yahoo!Pipes Google Docs &#8211; use of the importHTML() function to bring in data, and then manipulate it dapper.net [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[],"tags":[30,5],"class_list":["post-507","post","type-post","status-publish","format-standard","hentry","tag-mashlib09","tag-webtech"],"_links":{"self":[{"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/posts\/507","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/comments?post=507"}],"version-history":[{"count":2,"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/posts\/507\/revisions"}],"predecessor-version":[{"id":509,"href":"http:\/\/ww
w.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/posts\/507\/revisions\/509"}],"wp:attachment":[{"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/media?parent=507"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/categories?post=507"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/tags?post=507"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}