{"id":1435,"date":"2012-07-11T10:21:54","date_gmt":"2012-07-11T09:21:54","guid":{"rendered":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/?p=1435"},"modified":"2012-07-11T10:27:49","modified_gmt":"2012-07-11T09:27:49","slug":"resourcesync-web-based-resource-synchronization","status":"publish","type":"post","link":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/2012\/07\/resourcesync-web-based-resource-synchronization\/","title":{"rendered":"ResourceSync: Web-based Resource Synchronization"},"content":{"rendered":"<p>Final paper in the &#8216;Repository Services&#8217; session at OR2012 is presented by Simeon Warner. This is the paper I really wanted to see this morning as I&#8217;ve seen various snippets on Twitter about it (via\u00a0<a href=\"http:\/\/twitter.com\/azaroth42\">@azaroth42<\/a>\u00a0and\u00a0<a href=\"http:\/\/twitter.com\/hvdsomp\">@hvdsomp<\/a>). Simeon says so far it&#8217;s been lots of talking rather than doing \ud83d\ude42<\/p>\n<p>A lot of the stuff in this post is also available on the ResourceSync specification page\u00a0<a href=\"http:\/\/resync.github.com\/spec\/\">http:\/\/resync.github.com\/spec\/<\/a><\/p>\n<p><strong>Synchronize what?<\/strong><\/p>\n<ul>\n<li>Web resources &#8211; things with a URI that can be dereferenced and are cache-able &#8211; not dependent on underlying OS or tech<\/li>\n<li>Small websites to large repositories &#8211; needs to work at all scales<\/li>\n<li>Need to deal with things that change slowly (weeks\/months) or quickly (seconds) and where latency needs may vary<\/li>\n<li>Focus on needs of research communication and cultural heritage orgs &#8211; but aim for generality<\/li>\n<\/ul>\n<p><strong>Why?<\/strong><\/p>\n<p>Because lots of projects and services are doing synchronization but have to roll their own on a case by case basis. 
Lots of examples where local copies of objects are needed to carry out work (I think\u00a0<a href=\"http:\/\/core.kmi.open.ac.uk\/search\">CORE<\/a>\u00a0gives an example of this kind of application).<\/p>\n<p>OAI-PMH is over 10 years old as a protocol, and was designed to do XML metadata &#8211; not files\/objects. (exactly\u00a0<a href=\"http:\/\/core-project.kmi.open.ac.uk\/node\/31\">the issue we&#8217;ve seen in CORE<\/a>)<\/p>\n<p>Rob Sanderson has done work on a range of use cases &#8211; including things like aggregation (multiple sources to one centre). Also ruled out some use cases &#8211; e.g. not going to deal with bi-directional synchronization at the moment. Some real life use cases they&#8217;ve looked at in detail:<\/p>\n<p>DBpedia Live duplication &#8211; 20 million entries updated at 1 per second &#8211; though sporadic. Low latency needed. This suggests it has to be a &#8216;push&#8217; mechanism &#8211; can&#8217;t have lots of services polling every second for updates<\/p>\n<p>arXiv mirroring &#8211; 1 million article versions &#8211; about 800 per day created. Need metadata and full-text for each article. Accuracy important. Want low barrier for others to use.<\/p>\n<p>Some terminology they have determined:<\/p>\n<ul>\n<li>Resource &#8211; an object to be synchronized &#8211; a web resource<\/li>\n<li>Source &#8211; system with the original or master resource<\/li>\n<li>Destination &#8211; system to which resource from the source will be copied<\/li>\n<li>Pull &#8211; process to get information from source to destination, initiated by destination<\/li>\n<li>Push &#8211; process to get information from source to destination, initiated by source<\/li>\n<li>Metadata &#8211;\u00a0information about Resources such as URI, modification time, checksum, etc. (Not to be confused with Resources that may themselves be metadata about another resource, e.g. 
a DC record)<\/li>\n<\/ul>\n<p>Three basic needs:<\/p>\n<ul>\n<li>Baseline synchronization &#8211; perform initial load or catchup between source and destination<\/li>\n<li>Incremental synchronization &#8211; deal with updates\/creates\/deletes<\/li>\n<li>Audit &#8211; is my current copy in sync with source<\/li>\n<\/ul>\n<p>Need to use &#8216;inventory&#8217; approach to know what is there and what needs updating. So Audit uses inventory to check what has changed between source and destination, then do incremental synchronization. Don&#8217;t necessarily need to use full inventory &#8211; could use inventory change set to know what has changed.<\/p>\n<p>Once you&#8217;ve got agreement on the change set, need to get source and destination back in sync &#8211; whether by exchanging objects, or doing diffs and updates etc.<\/p>\n<p>Decided that for simplicity needed to focus on &#8216;pull&#8217; but with some &#8216;push&#8217; mechanism available so sources can push changes when necessary.<\/p>\n<p>What they&#8217;ve come up with is a framework based on Sitemaps &#8211; the format that Google uses to know what to crawl on a website. It&#8217;s a modular framework to allow selective deployment of different parts. For example &#8211; basic baseline sync looks like:<\/p>\n<p>Level zero -&gt; Publish a Sitemap<\/p>\n<p>Periodic publication of a sitemap is the basic implementation. Sitemap contains at least a list of URLs &#8211; one for each resource. But you could add in more information &#8211; e.g. a hash for the resource &#8211; which would enable better comparison.<\/p>\n<p>Another use case &#8211; incremental sync. In this case use sitemap format but include information only for change events. One &lt;url&gt; element per change event.<\/p>\n<p>What about &#8216;push&#8217; notification? They believe XMPP is the best bet &#8211; this is used for messaging services (like Google\/Facebook chat systems). This allows rapid notification of change events. 
XMPP is a bit &#8216;heavyweight&#8217; &#8211; but lots of libraries already available for this, so not going to have to implement from scratch.<\/p>\n<p>LANL Research Library ran a significant scale experiment in synchronization of the DBpedia Live database to two remote sites using XMPP to push changes. A couple of issues, but overall very successful.<\/p>\n<p>Sitemaps have some limitation on size (I think Simeon said 2.5 billion URLs?) &#8211; but not hard to see how it could be extended beyond this if required.<\/p>\n<p>Dumps: a dump format is necessary &#8211; to avoid repeated HTTP GET requests for multiple resources. Use for baseline and changeset. Options are:<\/p>\n<ul>\n<li>Zip + sitemap &#8211; Zip very common, but would require custom mechanism to link<\/li>\n<li>WARC &#8211; designed for this purpose but not widely implemented<\/li>\n<\/ul>\n<p>Simeon guesses they will end up with hooks for both of these.<\/p>\n<p>Expecting a draft spec very very soon (July 2012). Code and other stuff already on GitHub\u00a0<a href=\"https:\/\/github.com\/resync\/\">https:\/\/github.com\/resync\/<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Final paper in the &#8216;Repository Services&#8217; session at OR2012 is presented by Simeon Warner. This is the paper I really wanted to see this morning as I&#8217;ve seen various snippets on Twitter about it (via\u00a0@azaroth42\u00a0and\u00a0@hvdsomp). 
Simeon says so far it&#8217;s been lots of talking rather than doing \ud83d\ude42 A lot of the stuff in this [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-1435","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/posts\/1435","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/comments?post=1435"}],"version-history":[{"count":6,"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/posts\/1435\/revisions"}],"predecessor-version":[{"id":1442,"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/posts\/1435\/revisions\/1442"}],"wp:attachment":[{"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/media?parent=1435"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/categories?post=1435"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.meanboyfriend.com\/overdue_ideas\/wp-json\/wp\/v2\/tags?post=1435"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}