This is the third post in a series of 5.
In Part 2 I described how I used a text editor to get a malformed file to the point where it could be read as a MARC file by MarcEdit. I knew there would still be many issues in the file at this point, because I’d spotted them both in my initial investigation and when editing the file in a text editor. I wanted a more structured list of the issues, though, and happily the MarcEdit software has an option to validate files.
Like several other functions in MarcEdit, the ‘Validate MARC Records’ option can be accessed both from the MarcEdit opening screen and from within the MarcEdit editor. To access the validation option without going through the editor, look in the ‘Add-ins’ menu:
However, first I wanted to make sure that the file would open OK in the MarcEdit Editor, and see how it looked, so I used the ‘MarcEditor’ option and opened my file:
The layout of the MARC record in the editor is much easier to read than the native MARC format – for comparison:
The MarcEditor layout is called ‘mnemonic format’ (for slightly obscure reasons which, as far as I can tell, relate back to the origins of this format in the Library of Congress MarcMaker and MarcBreaker software).
The layout of this mnemonic format is reasonably easy to read if you are used to MARC records – each line contains:
- an equals sign as the first character on the line
- a three-character MARC field code (three digits, or the letters ‘LDR’ for the leader)
- two spaces
- two MARC field indicators
- the content of the field – with subfields included where appropriate using the syntax ‘$’ followed by the subfield code
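To make the layout above concrete, here is a minimal sketch in Python that splits a single mnemonic line into its parts. The function names are my own invention for illustration, not part of MarcEdit. It assumes that control fields (LDR and 001-009) carry plain data with no indicators, and that no subfield value contains a literal ‘$’.

```python
def is_control_tag(tag):
    """LDR and 001-009 carry plain data: no indicators, no subfields."""
    return tag == "LDR" or (tag.isdigit() and int(tag) < 10)


def parse_mnemonic_line(line):
    """Split one mnemonic-format line into its parts (hypothetical helper)."""
    if not line.startswith("=") or len(line) < 6:
        raise ValueError("not a mnemonic field line: %r" % line)
    tag = line[1:4]
    rest = line[6:]  # skip '=', the 3-character tag and the two spaces
    if is_control_tag(tag):
        return {"tag": tag, "indicators": None, "data": rest, "subfields": []}
    indicators, body = rest[:2], rest[2:]
    # Each subfield is '$' + a one-character code + its value.
    # Assumption: no literal '$' appears inside the values.
    subfields = [(chunk[0], chunk[1:]) for chunk in body.split("$")[1:] if chunk]
    return {"tag": tag, "indicators": indicators, "data": None, "subfields": subfields}


field = parse_mnemonic_line("=245  10$aA title :$ba subtitle")
print(field["tag"], field["indicators"], field["subfields"])
```

The example line is made up. In real files MarcEdit usually shows a blank indicator as a backslash, which this sketch simply passes through unchanged.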
Even from a very brief examination of the MARC record in the editor I can immediately see there is a problem with the fixed fields (LDR, 001-009 fields). These fields should contain plain data with no subfields, but here they all start with a subfield:
I can also see that I’ve got a ‘002’ field containing what seems to be a system number – which I’d expect to be in ‘001’.
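The two problems I spotted by eye could also be checked for mechanically once the records are in the text-based layout described above. This is only a rough sketch: the function name and the sample lines are invented for illustration, and it assumes the stray subfield shows up as a ‘$’ immediately after the two spaces that follow the field code.

```python
def control_field_problems(lines):
    """Report control fields that start with a subfield, and any 002 fields."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        if not line.startswith("="):
            continue  # not a field line
        tag, content = line[1:4], line[6:]
        is_control = tag == "LDR" or (tag.isdigit() and int(tag) < 10)
        if is_control and content.startswith("$"):
            problems.append((line_number, tag, "control field starts with a subfield"))
        if tag == "002":
            problems.append((line_number, tag, "002 field found (system number expected in 001)"))
    return problems


# Invented sample lines, not taken from the real file
sample = [
    "=LDR  $a00000nam a2200000 a 4500",
    "=002  $a123456",
    "=245  10$aAn example title",
]
for problem in control_field_problems(sample):
    print(problem)
```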
I then ran the ‘Validate MARC Records’ function, which can be accessed from inside the MarcEditor through the ‘Tools’ menu:
When you choose this option, you are prompted for the ‘rules’ file you want to use:
MarcEdit comes with a ready made rules file for validating MARC – but you can modify this, or design your own validation file if you have specific things you want to validate (or ignore) in different types of file.
(n.b. the illustration above shows the option to select the ‘source’ file, that is, the file you want to validate. This isn’t an option when using the MARC Validator from the MarcEditor, as it will always validate the file you are viewing in the editor. However, when you access the MARC Validator directly without going via the editor, you will be asked which file you want to validate.)
There are different options for the record validation process, but in this case I want to use the default ‘Validate Record’ option. When I click ‘OK’, the validator runs (this will take a while on a large file) and then displays the results which in my case looked something like:
I can see now that my problems extend beyond the fixed field problems I identified by eye – I’ve got all kinds of problems with incorrect indicators (and many other problems not shown in this screenshot).
I used the clipboard icon to copy these results into the clipboard and pasted them into a text file so I could refer back to them.
At this point I have a file of MARC records that will at least open in MarcEdit. I also have a list of issues with the MARC records in the file. I now want to start fixing these errors. Of course I could start fixing these issues directly in MarcEdit, and there are some tools and approaches in MarcEdit that might help me – but with this volume of issues over a file of 50,000 records I’m not sure MarcEdit is the right tool.
Instead I’m going to use another tool to start fixing the records – a tool called OpenRefine which is designed specifically to help ‘fix messy data’. I’m a big fan of OpenRefine and use it a lot, so for me it is the obvious tool to use for this task.
However, OpenRefine doesn’t understand MARC records. It can use XML, and so converting to MARCXML might be one approach I could use – but to be honest I don’t think it is the right approach in this case, and I suspect trying to fix MARCXML in OpenRefine would be a very painful process.
Instead I’m going to use the ‘mnemonic’ format that is used by the MarcEdit editor. There are two ways of converting a MARC file into the mnemonic format in MarcEdit. You can use the ‘MARC Breaker’ function, which can be accessed from the MarcEdit opening screen, or (and this is the approach I took) once you have a file open in the MarcEditor you can save it in the mnemonic format using the Save option in the File menu. The mnemonic format is designated in MarcEdit by the ‘mrk’ file extension (as opposed to ‘mrc’, which designates a proper, aka ‘compiled’, aka ‘binary’, MARC file). ‘mrk’ files are simple text files, and can be opened in any text editor and, happily, also in OpenRefine.
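Because ‘mrk’ files are plain text, they can also be read by a few lines of script. The sketch below groups the lines of a mnemonic file into records. It assumes that records are separated by blank lines, which is how the files I have seen are laid out, but check your own file before relying on it. The function name is hypothetical.

```python
def split_records(text):
    """Group the lines of a mnemonic (.mrk) file into one list per record.

    Assumption: a blank line marks the end of each record.
    """
    records, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            records.append(current)
            current = []
    if current:  # last record may not be followed by a blank line
        records.append(current)
    return records


# Invented two-record sample
sample = "=LDR  00000nam a2200000 a 4500\n=001  1\n\n=LDR  00000nam a2200000 a 4500\n=001  2\n"
records = split_records(sample)
print(len(records), "records")
```

To use this on a real file, read the file into a string first (for example with open() and an explicit encoding, which is a guess you would need to confirm for your own data) and pass the result to split_records.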
I now have a file of errors (from the MARC Validator) and my MARC records in mnemonic format – the next step is to open the files in OpenRefine so I can see all of the different types of error that I need to fix and start to fix them – which I’ll describe in Part 4 of this series.