{"id":1082,"date":"2010-10-29T11:20:03","date_gmt":"2010-10-29T10:20:03","guid":{"rendered":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/?p=1082"},"modified":"2010-10-29T11:20:03","modified_gmt":"2010-10-29T10:20:03","slug":"open-bibliography-and-why-it-shouldnt-have-to-exist","status":"publish","type":"post","link":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/2010\/10\/open-bibliography-and-why-it-shouldnt-have-to-exist\/","title":{"rendered":"Open Bibliography (and why it shouldn&#8217;t have to exist)"},"content":{"rendered":"<p>Today I&#8217;m at <a href=\"http:\/\/blogs.ukoln.ac.uk\/mashspa\/\">Mashspa<\/a> &#8211; another <a href=\"http:\/\/www.mashedlibrary.com\/\">Mashed Library<\/a> event.<\/p>\n<p>Ben O&#8217;Steen is talking about a JISC project he is currently involved with. The project is about getting bibliographic information into the open. For Ben, &#8216;open&#8217; means &#8220;publishing bibliographic information under a permissive license to encourage indexing, re-use and re-purposing&#8221;. Ben believes that some aspects &#8211; such as attribution &#8211; should be part of a &#8216;community norm&#8217;, not written into a license.<\/p>\n<p>In essence, an open bibliography is all about Advertising! Telling other people what you have.<\/p>\n<p>Bibliographic information allows you to:<\/p>\n<ul>\n<li>Identify and find an item you know you want<\/li>\n<li>Discover related items or items you believe you want<\/li>\n<li>Serendipitously discover items you would like without knowing they might exist<\/li>\n<li>&#8230;other stuff<\/li>\n<\/ul>\n<p>This list (from top to bottom) requires increasing investment. 
Advertising isn&#8217;t about spending money &#8211; it&#8217;s about investment.<\/p>\n<p>To maximise returns you maximise the audience.<\/p>\n<p>Ben asks &#8220;Should the advertising target &#8216;b2b&#8217; or &#8216;consumers&#8217;?&#8221;<\/p>\n<p>Ben acknowledges that it may not be necessary to completely open up the data set &#8211; but believes that in the long term open is the way forward.<\/p>\n<p>Some people ask &#8220;Can&#8217;t I just scrape sites and use the data &#8211; it&#8217;s just facts isn&#8217;t it?&#8221;. However, Directive 96\/9\/EC of the European Parliament codifies a new protection based on &#8220;sui generis&#8221; rights &#8211; rights earned by the &#8220;sweat of the brow&#8221;. So far this law seems only to have solidified existing monopolies rather than generating new economic growth (which was apparently the intention of the law).<\/p>\n<p>When the project asked UK PubMedCentral whether it could reproduce the bibliographic data they share through their OAI-PMH service, 
they said &#8216;Generally, No&#8217;. Paraphrasing: UK PubMedCentral said they didn&#8217;t have the rights to give away the data (except the material from Open Access journals). NOTE &#8211; this is the metadata, not the full-text articles, we are talking about &#8211; they said they could not grant the right to reuse the metadata [would this, for example, mean that you could not use this metadata in a reference management package to then produce a bibliography?]<\/p>\n<p>Principles:<\/p>\n<ul>\n<li>Assign a license when you publish data<\/li>\n<li>Use a recognised license<\/li>\n<li>If you want your data to be effectively used and added to by others it should be open &#8211; in particular non-commercial and other restrictive licenses should be avoided<\/li>\n<li>Strongly recommend using CC0 or PDDL (latter in the EU only)<\/li>\n<li>Strongly encourage release of bibliographic data into the &#8216;Open&#8217;<\/li>\n<\/ul>\n<p>Sliding scale:<\/p>\n<ul>\n<li>Identify &#8211; e.g. for an author, a simple identifier could just be the name (cheap); more expensive identifiers include URIs or ORIDs<\/li>\n<li>Discover &#8211;<\/li>\n<li>Serendipity &#8211;<\/li>\n<\/ul>\n<p>If you increase investment you get more use &#8211; it is difficult to reuse data without identifiers, for example.<\/p>\n<p>1. Where there is human input, there is interpretation &#8211; people may interpret standards in different ways, use fields in different ways<\/p>\n<p>Ben found a lot of variation across the data in the PubMed data set &#8211; different journals or publishers interpret where information should go in different ways &#8211; &#8220;Standards don&#8217;t bring interoperability, people do&#8221;<\/p>\n<p>2. Data has been entered and curated without large-scale sharing as a focus &#8211; lots of implicit, contextual information is left out &#8211; e.g. 
if you are working in a specialist Social Science library, perhaps you don&#8217;t mention that the item is about Social Sciences as that is implicit in the (original) context<\/p>\n<p>3. Data quality is generally poor &#8211; an example from the BL: ISBN = \u00a32.50!<\/p>\n<p>In a closed data set you may not discover errors &#8211; when you have lots of people looking at data (with different uses in mind) you pick up different types of error.<\/p>\n<p>The data clean-up process is going to be PROBABILISTIC &#8211; we cannot be sure &#8211; by definition &#8211; that we are accurate when we deduplicate or disambiguate. Typical methods:<\/p>\n<ul>\n<li>Natural Language Processing<\/li>\n<li>Machine learning techniques<\/li>\n<li>String metrics and old school record deduplication &#8211; easiest of the 3 (for Ben)<\/li>\n<\/ul>\n<p>Not just about matching uniquely &#8211; looking at level of similarity and making decisions<\/p>\n<p>List of string metrics at <a href=\"http:\/\/staffwww.dcs.shef.ac.uk\/people\/s.chapman\/stringmetrics.html\">http:\/\/staffwww.dcs.shef.ac.uk\/people\/s.chapman\/stringmetrics.html<\/a><\/p>\n<p>Fellegi-Sunter method for old school deduplication &#8211; not great, but works OK.<\/p>\n<p>Can now take a map-reduce approach (distribute processing across servers)<\/p>\n<p>Do it yourself:<\/p>\n<ul>\n<li>ANU&#8217;s Febrl Python code<\/li>\n<li><a href=\"http:\/\/datamining.anu.edu.au\/projects\/linkage.html\">http:\/\/datamining.anu.edu.au\/projects\/linkage.html<\/a><\/li>\n<li>Just need a CSV file with one unique ID for each record<\/li>\n<\/ul>\n<p>When de-duping, you need to be able to unmerge so you can correct mistakes if necessary &#8211; distinguish the canonical data that you hold from the data that you publish to the public<\/p>\n<p>Directions with Bibliographic data: So far much effort has been directed at &#8216;Works&#8217; &#8211; we need to put much more effort into their &#8216;Networks&#8217; &#8211; this starts to help (for example) disambiguate 
people<\/p>\n<p>Network examples:<\/p>\n<ul>\n<li>A cites B<\/li>\n<li>Works by a given author<\/li>\n<li>Works cited by a given author<\/li>\n<li>Works citing articles that have since been disproved, redacted or withdrawn<\/li>\n<li>Co-authors<\/li>\n<li>&#8230;other connections that we&#8217;ve not even thought of yet<\/li>\n<\/ul>\n<p>Ben says &#8211; Don&#8217;t get hung up on standards &#8230;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Today I&#8217;m at Mashspa &#8211; another Mashed Library event. Ben O&#8217;Steen is talking about a JISC project he is currently involved with. Project about getting bibliographic information into the open. For Ben Open means &#8220;publishing bibliographic information under a permissive license to encourage indexing, re-use and re-purposing&#8221;. Ben believes that some aspects &#8211; such as [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[49,58],"class_list":["post-1082","post","type-post","status-publish","format-standard","hentry","category-uncategorized","tag-mashlib","tag-mashspa"],"_links":{"self":[{"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/posts\/1082","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/comments?post=1082"}],"version-history":[{"count":2,"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/posts\/1082\/revisions"}],"predecessor-version":[{"id":1084,"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas
\/wp-json\/wp\/v2\/posts\/1082\/revisions\/1084"}],"wp:attachment":[{"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/media?parent=1082"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/categories?post=1082"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/tags?post=1082"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}