{"id":1231,"date":"2011-07-14T11:50:14","date_gmt":"2011-07-14T10:50:14","guid":{"rendered":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/2011\/07\/linked-data-and-libraries-linked-data-opac\/"},"modified":"2011-07-15T12:54:44","modified_gmt":"2011-07-15T11:54:44","slug":"linked-data-and-libraries-linked-data-opac","status":"publish","type":"post","link":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/2011\/07\/linked-data-and-libraries-linked-data-opac\/","title":{"rendered":"Linked Data and Libraries: Linked Data OPAC"},"content":{"rendered":"<p>This session by Phil John &#8211; Technical Lead for Prism (was Talis, now Capita). Prism is a &#8216;next generation&#8217; discovery interface &#8211; but built on Linked Data through and through.<\/p>\n<p>Slides available from <a href=\"http:\/\/www.slideshare.net\/philjohn\/linked-library-data-in-the-wild-8593328\">http:\/\/www.slideshare.net\/philjohn\/linked-library-data-in-the-wild-8593328<\/a><\/p>\n<p>Now moving to next phase of development &#8211; not going to be just about library catalogue data &#8211; but also journal metadata; archives\/records (e.g. from the CALM archive system); thesis repositories; rare items and special collections (often not done well in traditional OPACs) &#8230; and more &#8211; e.g. community information systems.<\/p>\n<p>When populating Prism from MARC21 &#8211; do initial &#8216;bulk&#8217; conversion, then periodic &#8216;delat&#8217; files &#8211; to keep in sync with LMS. Borrower and availability data is pulled from LMS &#8220;live&#8221; &#8211; via a suite of RESTful web services.<\/p>\n<p>Prism is also a Linked Data API&#8230; just add .rss to collection of .rdf\/.nt\/.ttl\/.json to items. This means simple to publish RSS feeds of preconfigured searches &#8211; e.g. 
new stock, or new stock in specific subjects etc.<\/p>\n<p>Every HTML page in Prism has data behind it you can get as RDF.<\/p>\n<p>One of the biggest challenges &#8211; extracting data from MARC21 &#8211; MARC is very rich, but not very linked&#8230; Phil fills the screen with #marcmustdie tweets \ud83d\ude42<\/p>\n<p>But have to be realistic &#8211; 10s of millions of MARC21 records exist &#8211; so need to be able to deal with this.<br \/>\nDecided to tackle the problem in small chunks. Created a solution that allows you to build a model iteratively. Also compartmentalises code for different sections &#8211; these can communicate but work separately and can be developed separately. Makes it easy to tweak parts of the model.<\/p>\n<p>Feel they have a robust solution that performs well &#8211; important, because even at just 10 seconds per MARC record, converting several million records would take months.<\/p>\n<p>No matter what MARC21 and AACR2 say &#8211; you will see variations in real data.<\/p>\n<p>Have a conversion pipeline:<br \/>\nParser &#8211; reads in MARC21 &#8211; fires events as it encounters different parts of the record &#8211; it&#8217;s very strict with syntax &#8211; so insists on valid MARC21<br \/>\nObserver &#8211; listens for MARC21 data structures and hands control over to &#8230;<br \/>\nHandler &#8211; knows how to convert MARC21 structures and fields into Linked Data<\/p>\n<p>First area they tackled was Format (and duration) &#8211; a good starting point as it allows you to reason more fully about the record &#8211; once you know the Format you know what kind of data to expect.<\/p>\n<p>In theory this should be quite easy &#8211; MARC21 has lots of structured info about format &#8211; but in practice there are lots of issues:<\/p>\n<ul style=\"list-style-type: disc;\">\n<li>no code for CD (it&#8217;s a 12 cm sound disc that travels at 1.4 m\/s!)<\/li>\n<li>DVD and LaserDisc shared a code for a while<\/li>\n<li>Libraries slow to support new 
formats<\/li>\n<li>limited use of 007 in the real world<\/li>\n<\/ul>\n<p>E.g. places to look for format information:<br \/>\n007<br \/>\n245$$h<br \/>\n300$$a (mixed in with other info)<br \/>\n538$$a<\/p>\n<p>Decided to do the duration at the same time:<br \/>\n306$$a<br \/>\n300$$a (but lots of variation in this field)<\/p>\n<p>Now Phil talking about &#8216;Title&#8217; &#8211; very important, but of course quite tricky&#8230;<br \/>\nThe 245 field in MARC may duplicate information from elsewhere<br \/>\nGot lots of help from <a href=\"http:\/\/journal.code4lib.org\/articles\/3832\">http:\/\/journal.code4lib.org\/articles\/3832<\/a> (with additional work and modification)<\/p>\n<p>Retained a &#8216;statement of responsibility&#8217; &#8211; but mostly for search and display&#8230;<\/p>\n<p>Identifiers&#8230;<br \/>\nLots of non-identifier information mixed in with other stuff &#8211; e.g. an ISBN followed by &#8216;pbk.&#8217;<br \/>\nMany variations in the abbreviations used &#8211; have to parse all this stuff, then validate the identifier<br \/>\nOnce you have an identifier, you can start linking to other stuff &#8211; which is great.<\/p>\n<p>Author &#8211; Pseudonyms, variations in names, generally no &#8216;relator terms&#8217; in 100\/700 $$e or $$4 &#8211; which would show the nature of the relationship between the person and the work (e.g. 
&#8216;author&#8217;, &#8216;illustrator&#8217;) &#8211; because these are missing, they have to parse the information out of the 245$$c<\/p>\n<p>&#8230; and not just dealing with English records &#8211; especially in academic libraries.<\/p>\n<p>Have licensed Library of Congress authority files &#8211; which helps&#8230; &#8211; authority matching requirements were:<br \/>\nHas to be fast &#8211; able to parse 2M records in hours, not days\/months<br \/>\nHas to be accurate<\/p>\n<p>So &#8211; store authorities as RDF but index them in Solr &#8211; gives speed, and for bulk conversions avoids HTTP overhead&#8230;<\/p>\n<p>Language\/Alternate representation &#8211; this is a nice &#8216;high impact&#8217; feature &#8211; allows switching between representations &#8211; both forms can be searched for &#8211; uses RDF&#8217;s language-tagging feature &#8211; so also useful for people consuming machine-readable RDF<\/p>\n<p>Using and Linking to external data sets&#8230;<br \/>\npart of the reason for using linked data &#8211; but some challenges&#8230;<\/p>\n<ul style=\"list-style-type: disc;\">\n<li>what if a data source suffers downtime?<\/li>\n<li>worse &#8211; what if a data source is removed permanently?<\/li>\n<li>trust<\/li>\n<li>can we display it? is it susceptible to vandalism?<\/li>\n<\/ul>\n<p>Potential solutions (not there yet):<\/p>\n<ul style=\"list-style-type: disc;\">\n<li>Harvest datasets and keep them close to the app<\/li>\n<li>if that&#8217;s not practical, proxy requests through a caching proxy &#8211; e.g. Squid<\/li>\n<li>if using Wikipedia and worried about vandalism &#8211; put in checks for likely vandalism activity &#8211; e.g. 
many edits in a short time<\/li>\n<\/ul>\n<p><strong>Want to see<br \/>\n<\/strong>More library data as LOD &#8211; especially on the peripheries &#8211; authority data, author information, etc.<br \/>\nLMS vendors adopting LOD<br \/>\nLOD replacing MARC21 as the standard representation of bibliographic records!<\/p>\n<p><strong>Questions?<\/strong><br \/>\nIs the process (MARC-&gt;RDF) documented?<br \/>\nA: Would like to open source at least some of it&#8230; but there are discussions to be had internally at Capita &#8211; so something to keep an eye on&#8230;<\/p>\n<p>Is there a running instance of Prism to play with?<br \/>\nA: Yes &#8211; e.g. <a href=\"http:\/\/prism.talis.com\/bradford\/\">http:\/\/prism.talis.com\/bradford\/<\/a><\/p>\n<p>[UPDATE: See in the comments &#8211; Phil suggests\u00a0<a href=\"http:\/\/catalogue.library.manchester.ac.uk\/\">http:\/\/catalogue.library.manchester.ac.uk\/<\/a> as one that has used a more up-to-date version of the transform.]<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This session by Phil John &#8211; Technical Lead for Prism (was Talis, now Capita). Prism is a &#8216;next generation&#8217; discovery interface &#8211; but built on Linked Data through and through. 
Slides available from http:\/\/www.slideshare.net\/philjohn\/linked-library-data-in-the-wild-8593328 Now moving to next phase of development &#8211; not going to be just about library catalogue data &#8211; but also journal [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[72,38,71],"class_list":["post-1231","post","type-post","status-publish","format-standard","hentry","category-uncategorized","tag-ldal","tag-linked-data","tag-lodlam"],"_links":{"self":[{"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/posts\/1231","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/comments?post=1231"}],"version-history":[{"count":4,"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/posts\/1231\/revisions"}],"predecessor-version":[{"id":1245,"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/posts\/1231\/revisions\/1245"}],"wp:attachment":[{"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/media?parent=1231"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/categories?post=1231"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/tags?post=1231"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}