A worked example of fixing problem MARC data: Part 2 – Text editor

This is the second post in a series of 5.

As described in Part 1 – I had 50K MARC records in a file, but due to the use of incorrect ‘delimiter’ and ‘record terminator’ characters the file wasn’t recognised as MARC by standard tools such as MarcEdit. My first job was to correct the file to the point that I could use MarcEdit to validate and manipulate the records.

If you haven’t looked at a MARC record before, they have a slightly odd structure which consists of the ‘leader’ and then the ‘directory’ which between them contain overall information about the structure of the record (how long the record is, what fields are present, where each field starts in the record, etc.), and then the content of the fields – essentially the actual content of the MARC record. TheMARC record structure documentation describes this in some detail.

Within the record a special character called the ASCII ‘Record Separator‘ which is used to indicate the end of the directory and variable fields within a record. The ASCII ‘Unit Separator‘ character is used to separate out subfields. Finally, because a single file can contain more than one MARC record, the records are separated from each other by another special character called the ASCII ‘Group Separator‘.

What I’d got was MARC records that looked liked this:

00667nam _22002414_245__002000800000005001900008008004300027020001500070035001000085039001800095040002300113082001600136245005700152250001100209260004300220300002400263500002000287500003400307852002600341100002100367650002000388650001700408#$a13568#$_00000000000000.0#$_030129s1992____--------------000-y-eng-d#__$a0314922180#__$a13568#80$a2$b1$c_$d_$e1#__$aZA$aF101$c20030129#04$220$a621.395#10$aFundamentals of logic design /$cCharles H. Roth, Jr.#__$a4th ed#__$aSt. Paul, MN :$bWest Publishing,$c1992#__$axviii, 770 p :$bill#__$aIncludes index.#__$aIncludes answers to problems.#__$aF101$b_N$c621.395 ROT#$aRoth, Charles H._1#_0$aLogic circuits.#_0$aLogic design#*

Here instead of the special Record separator character, a simple ‘#’ has been used, instead of the Unit separator a ‘$’ has been used, and instead of a Group Separator, a ‘*’ has been used.

I could open the file is a text editor – I ended up (for no particular reason) using a combination of Sublime Text and Notepad++, but either could have probably done the job by itself. I wanted to replace the #, $ and * characters with their appropriate special ASCII characters. However, it seemed likely that at least the ‘$’ sign might also be used as an actual dollar sign in the file, as well as the subfield separator.

I used the fact that both Sublime Text and Notepad++ support find and replace using ‘regular expressions’. Regular expressions are a powerful way of matching text strings. As well as simply making a precise match to a word or piece of text, you can use Regular Expressions to match types of characters (e.g. ‘match a digit’, ‘match a lowercase letter’), or sequences of characters.

For example, you can see in the example above that there is a repeated pattern of a ‘#’ followed by two characters, followed by ‘$’ – this is the record separator (start of a new MARC field) followed by two indicators, followed by the unit separator – the first subfield in the MARC field. The regular expression that will match this is:

#..\$

The ‘#’ matches the # character – just a simple match. The ‘.’ (period/fullstop) character is a wildcard that will match any other character – so the two dots mean ‘match any two characters’. Finally the ‘\$’ matches the dollar sign – the ‘\’ in front of it is needed because a ‘$’ on its own in a regular expression has a special meaning of ‘end of the line’ – the ‘\’ forces it to match the dollar sign directly (using a ‘\’ in this way is called ‘escaping’ the special character).

I was pretty confident that this pattern wouldn’t appear randomly elsewhere in the file – so I could use this as a first find/replace. I wanted to replace the # and $ with the record separator and unit separator characters, but keep whatever two characters were between them. You can do this using regular expressions with something called a ‘capture group’ – where you put the characters you want to ‘capture’ (to be able to use in the ‘replace’ expression) using brackets ‘(‘ and ‘)’. So the Find expression becomes:

#(..)\$

You can capture multiple groups in a single Find expression, and these are just numbered from 1. The syntax for using the ‘captured’ group in the Replace command varies – in Sublime Text it is ‘\1’, while in Notepad++ it is ‘$1’ – but they do the same thing – take whatever value was captured, and use it in the replace expression.

Using this approach I was able to replace the ‘#’ and ‘$’ with the appropriate ASCII characters (which I copied and pasted from a valid MARC file I already had – these special characters didn’t actually show visibly in the Notepad++ Find/Replace dialogue – but they were there and do work in a replace statement).

Once I’d done this I was also able to replace the ‘#*’ that appeared at the end of every line with a Unit Separator followed by a Group Separator. I checked that this replace action replaced the expected number of characters to reassure myself that this had worked OK.

Working with 50,000 lines in a text editor like Sublime Text or Notepad++ can cause some problems. This kind of text editor typically holds the whole of the file in active memory while you are editing the file – and large files can take up large amounts of memory. Fifty thousand lines did make Sublime Text and Notepad++ slightly sluggish in their response times, but happily it didn’t prove too much for either of them.

If I’d hit problems with file size, I’d probably have tried using a tool called ‘sed’ instead. ‘sed’ is a ‘streaming editor’ (hence ‘sed’) which means it doesn’t try to hold the whole file in memory, and instead will run through the file bit by bit when you carry out a command. I have to admit that I don’t find sed very intuitive and often struggle to get it to do what I want. There is a good basic introduction to sed on Mac at http://www.maclife.com/article/columns/terminal_101_find_and_replace_using_sed, and sed is also available for Unix/Linux (natch) and Windows (a simple intro available at http://www.thoughtasylum.com/blog/2011/9/30/using-sed-on-windows.html)

To do the best job I could on the ‘replace’ I had to play around with various matching patterns. For the subfield markers, one of the approaches I used was to use a regular expression to see occurrences that didn’t fit the usual subfields. So for example the regular expression:
\$[^a-z]
Finds all occurrences of the dollar sign that is followed by anything except a lowercase letter. Using this I found immediately some numeric subfields, so I could add these in:
\$[^a-z2468]
I used this kind of approach to investigate oddities and make sure I wasn’t replacing a load of currency ‘dollar’ signs with the Unit separator used for subfields.
I didn’t expect (or achieve) perfection here – I’d probably end up at least accidentally replacing some genuine ‘dollar’ signs from price information with a Unit Separator instead. I also wasn’t trying to fix anything but the most basic problems here – for example there were many examples of “$_” subfields – clearly these aren’t valid MARC subfields, but for the moment I didn’t care about this I just wanted to do the minimum I needed to get to the point where I could open the file with MarcEdit. After some fiddling around with various search/replace strategies, I eventually managed this, and was able to go to the next step in the process which I’ll describe in Part 3 of this series.

3 thoughts on “A worked example of fixing problem MARC data: Part 2 – Text editor

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.