The final paper in the ‘Repository Services’ session at OR2012 is presented by Simeon Warner. This is the paper I really wanted to see this morning, as I’ve seen various snippets on Twitter about it (via @azaroth42 and @hvdsomp). Simeon says that so far it’s been lots of talking rather than doing 🙂
A lot of the stuff in this post is also available on the ResourceSync specification page http://resync.github.com/spec/
- Web resources – things with a URI that can be dereferenced and are cacheable – not dependent on the underlying OS or tech
- Small websites to large repositories – needs to work at all scales
- Need to deal with things that change slowly (weeks/months) or quickly (seconds) and where latency needs may vary
- Focus on the needs of research communication and cultural heritage organisations – but aim for generality
Because lots of projects and services are doing synchronization but have to roll their own on a case by case basis. Lots of examples where local copies of objects are needed to carry out work (I think CORE gives an example of this kind of application).
OAI-PMH is over 10 years old as protocol, and was designed to do XML metadata – not files/objects. (exactly the issue we’ve seen in CORE)
Rob Sanderson has done work on a range of use cases – including things like aggregation (multiple sources to one centre). They have also ruled out some use cases – e.g. not going to deal with bi-directional synchronization at the moment. Some real-life use cases they’ve looked at in detail:
- DBpedia Live duplication – 20 million entries updated at around 1 per second – though sporadic. Low latency needed. This suggests it has to be a ‘push’ mechanism – can’t have lots of services polling every second for updates
- arXiv mirroring – 1 million article versions – about 800 created per day. Need metadata and full-text for each article. Accuracy is important. Want a low barrier for others to adopt.
Some terminology they have determined:
- Resource – an object to be synchronized – a web resource
- Source – system with the original or master resource
- Destination – system to which resource from the source will be copied
- Pull – process to get information from source to destination, initiated by the destination
- Push – process to get information from source to destination, initiated by the source
- Metadata – information about Resources such as URI, modification time, checksum, etc. (Not to be confused with Resources that may themselves be metadata about another resource, e.g. a DC record)
Three basic needs:
- Baseline synchronization – perform initial load or catchup between source and destination
- Incremental synchronization – deal with updated/creates/deletes
- Audit – is my current copy in sync with the source?
Need an ‘inventory’ approach to know what is there and what needs updating. So Audit uses the inventory to check what differs between source and destination, then incremental synchronization brings them back in line. You don’t necessarily need the full inventory – an inventory change set listing only what has changed can suffice.
Once you’ve got agreement on the change set, need to get source and destination back in sync – whether by exchanging objects, or doing diffs and updates etc.
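The audit step described above can be sketched as a comparison of two inventories. This is my own minimal illustration, not code from the draft spec: each inventory maps a resource URI to a checksum, and the URIs and checksum values are invented.

```python
# Hypothetical sketch of the audit step: compare a source inventory
# against a destination inventory to find resources that need syncing.
# URIs and checksums below are invented for illustration.

def audit(source_inventory, dest_inventory):
    """Each inventory maps resource URI -> checksum (e.g. an MD5 hash)."""
    to_create = set(source_inventory) - set(dest_inventory)
    to_delete = set(dest_inventory) - set(source_inventory)
    to_update = {uri for uri in source_inventory.keys() & dest_inventory.keys()
                 if source_inventory[uri] != dest_inventory[uri]}
    return to_create, to_update, to_delete

source = {"http://example.org/res/1": "aaa", "http://example.org/res/2": "bbb"}
dest   = {"http://example.org/res/1": "xxx", "http://example.org/res/3": "ccc"}
create, update, delete = audit(source, dest)
# res/2 must be created, res/1 updated (checksum differs), res/3 deleted
```

The three returned sets are exactly the agreed change set that the subsequent exchange of objects (or diffs) has to act on.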
Decided that for simplicity needed to focus on ‘pull’ but with some ‘push’ mechanism available so sources can push changes when necessary.
What they’ve come up with is a framework based on Sitemaps – the format Google uses to decide what to crawl on a website. It’s a modular framework allowing selective deployment of different parts. For example – basic baseline sync looks like:
Level zero -> Publish a Sitemap
Periodic publication of a Sitemap is the basic implementation. A Sitemap contains at least a list of URLs – one for each resource. But you could add in more information – e.g. a hash for each resource – which would enable better comparison.
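For concreteness, here is roughly what a Level-zero source might publish – a plain Sitemap built with Python’s standard library. The example.com URLs and dates are invented; any richer per-resource information (hashes etc.) would come from spec extensions not shown here.

```python
# Build a minimal baseline Sitemap: one <url> entry per resource.
# URLs and lastmod dates are invented for illustration.
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
ET.register_namespace("", NS)  # serialize with the default sitemap namespace

urlset = ET.Element("{%s}urlset" % NS)
for loc, lastmod in [
    ("http://example.com/res1", "2012-07-10"),
    ("http://example.com/res2", "2012-07-11"),
]:
    url = ET.SubElement(urlset, "{%s}url" % NS)
    ET.SubElement(url, "{%s}loc" % NS).text = loc
    ET.SubElement(url, "{%s}lastmod" % NS).text = lastmod

sitemap_xml = ET.tostring(urlset, encoding="unicode")
```

A destination then only needs an HTTP GET of this document to obtain the source’s inventory for baseline sync or audit.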
Another use case – incremental sync. Here the Sitemap format is reused but includes information only for change events: one <url> element per change event.
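A destination consuming such a change list might parse it like this. Note the hedge: the draft spec wasn’t out yet, so this uses only plain Sitemap elements – how the change type (create/update/delete) is encoded is left out, and the URLs and timestamps are invented.

```python
# Parse a (hypothetical) change-event Sitemap: one <url> per event,
# here carrying just the resource URI and the event timestamp.
import xml.etree.ElementTree as ET

changeset = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/res1</loc><lastmod>2012-07-11T09:00:00Z</lastmod></url>
  <url><loc>http://example.com/res2</loc><lastmod>2012-07-11T09:00:05Z</lastmod></url>
</urlset>"""

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
events = [(u.find(NS + "loc").text, u.find(NS + "lastmod").text)
          for u in ET.fromstring(changeset).iter(NS + "url")]
```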
What about ‘push’ notification? They believe XMPP is the best bet – it’s used for messaging services (like the Google/Facebook chat systems) and allows rapid notification of change events. XMPP is a bit heavyweight – but lots of libraries are already available for it, so it won’t have to be implemented from scratch.
LANL Research Library ran a significant-scale experiment synchronizing the DBpedia Live database to two remote sites, using XMPP to push changes. A couple of issues, but overall very successful.
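To make the pull-vs-push distinction concrete: below is not XMPP, just an in-process Python sketch of the push pattern – the source notifies every subscribed destination of a change event as it happens, instead of being polled every second. The class and event shapes are my own invention.

```python
# Minimal illustration of push notification (NOT real XMPP):
# destinations subscribe a queue; the source pushes change events to it.
from queue import Queue

class Source:
    def __init__(self):
        self.subscribers = []

    def subscribe(self, queue):
        self.subscribers.append(queue)

    def publish(self, event):
        # push the change event to every subscribed destination
        for q in self.subscribers:
            q.put(event)

inbox = Queue()          # a destination's incoming event stream
src = Source()
src.subscribe(inbox)
src.publish({"uri": "http://example.com/res1", "change": "updated"})
event = inbox.get_nowait()
```

In the real deployment an XMPP pubsub node plays the role of `Source`, which is what makes second-level latency feasible for something like DBpedia Live.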
Sitemaps have some limitations on size (I think Simeon said 2.5 billion URLs – which matches the protocol’s limits of 50,000 URLs per Sitemap and 50,000 Sitemaps per Sitemap index) – but it’s not hard to see how it could be extended beyond this if required.
Dumps: a dump format is necessary to avoid repeated HTTP GET requests for multiple resources. Useful for both baseline sync and changesets. Options are:
- Zip + sitemap – Zip is very common, but would require a custom mechanism to link the sitemap entries to the bundled resources
- WARC – designed for this purpose but not widely implemented
Simeon guesses they will end up with hooks for both of these.
Expecting a draft spec very very soon (July 2012). Code and other stuff already on GitHub https://github.com/resync/