Writing an extension to add new GREL functions to OpenRefine

I’ve been an enthusiastic user of OpenRefine for a long time and think it is a great tool. However, I sometimes come across things that it doesn’t do, or doesn’t do easily, and end up wishing someone would add some function or other to make my life easier. In theory I’ve always been able to do this myself by writing an OpenRefine extension that adds to the functionality of OpenRefine, but in practice I’ve always struggled to understand how to write an extension.

I had previously read the “How to write an extension”, and “Extension Points” documentation on the wiki but not really understood how to actually write an extension. Since I tend to learn best from examples I also had a look at the Sample Extension documentation and the “OpenRefine’s Technical Documentation for Extension’s Writing” document by Giuliano Tortoreto (alongside two extensions Giuliano wrote).

Despite all this I still hadn’t quite got my head around writing an extension – until I started looking at an extension that added a simple GREL function called ‘extractHost’ which could be used to extract the hostname from a URL. This was part of a much bigger extension that was developed by Steve Osguthorpe and Ian Ibbotson of Knowledge Integration as part of a project called ‘GOKb’.

When I saw how this was written I realised that while OpenRefine extensions can be complex, writing one that simply adds a new GREL function is quite straightforward: a little boilerplate code, plus the actual GREL function written in Java. Even with very little (pretty much zero) experience of writing Java I managed to write a new function and bundle it in an extension with very little effort. This is my attempt to document that process. Any corrections or additions to what I’ve written here are very welcome via the comments.

Create the basic files

The basic structure for an extension is:


build.xml
src/
    com/foo/bar/.../*.java source files
module/
    *.html, *.vt files
    scripts/... *.js files
    styles/... *.css and *.less files
    images/... image files
    MOD-INF/
        lib/*.jar files
        classes/... java class files
        module.properties
        controller.js

(where /foo/bar…/ is the path representing the namespace you want to use for your extension).

However, the only files you need to create an extension which adds new GREL functions are:


build.xml
src/
    com/foo/bar/... *.java (this is where the Java for your new GREL function will go)
module/
    MOD-INF/
        module.properties
        controller.js

The first step is to create the basic directory structure, and the only real decision to make here is the name and namespace for your extension. The name is entirely up to you, but the namespace would usually be based on a domain you own. In my case I’m going to call my extension “Overdue”, and the namespace I’m going to use is “com.ostephens.overdue.refine”. Additionally, I’m going to put the Java source files which contain my new GREL functions into a directory called ‘functions’, which means my directory structure looks like this:


build.xml
src/
    com/ostephens/overdue/refine/functions
module/
    MOD-INF/
        module.properties
        controller.js
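If you would rather script this step than create the directories by hand, here is a minimal sketch in Java that creates the same skeleton with empty placeholder files. The class name ‘ExtensionSkeleton’ and the default root directory name ‘overdue’ are my own choices for illustration, not anything OpenRefine requires, and the namespace path should be changed to match your own.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: creates the minimal directory layout described above.
public class ExtensionSkeleton {

    // Files a GREL-only extension needs, relative to the extension root.
    public static final List<String> FILES = List.of(
            "build.xml",
            "module/MOD-INF/module.properties",
            "module/MOD-INF/controller.js");

    // Directory for the Java source files; adjust to your own namespace.
    public static final String SOURCE_DIR = "src/com/ostephens/overdue/refine/functions";

    // Creates the directories and empty placeholder files under the given root.
    // Existing files are left untouched.
    public static void create(Path root) {
        try {
            Files.createDirectories(root.resolve(SOURCE_DIR));
            for (String file : FILES) {
                Path target = root.resolve(file);
                Files.createDirectories(target.getParent());
                if (Files.notExists(target)) {
                    Files.createFile(target);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The default root name is an assumption; pass another as the first argument.
        Path root = Path.of(args.length > 0 ? args[0] : "overdue");
        create(root);
        System.out.println("Created extension skeleton in " + root.toAbsolutePath());
    }
}
```

The placeholder files are empty, so they still need to be filled in as described in the rest of this post.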

The next step is to populate the basic information for your extension into the module.properties file and the build.xml file.

module.properties

The module.properties file is a simple text file which contains the name, description and any dependencies for the extension. The only dependency in this case is the OpenRefine core module, so my module.properties looks like:


name = overdue
description = Collection of small additions to OpenRefine functionality
requires = core

Just change the name and description to whatever you want for your extension.
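As a quick sanity check on the file format, the sketch below parses the same three lines with the standard java.util.Properties class and confirms that each key has a value. This is only an illustration of the key/value syntax: I am assuming that ordinary Java properties parsing is close enough for this purpose, and it is not necessarily how OpenRefine itself reads the file. The class name ‘ModulePropertiesCheck’ is made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper: checks that module.properties text has the expected keys.
public class ModulePropertiesCheck {

    // Parses properties-style text ("key = value" per line).
    public static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Returns true if the key is present with a non-empty value.
    public static boolean hasValue(Properties props, String key) {
        String value = props.getProperty(key);
        return value != null && !value.trim().isEmpty();
    }

    public static void main(String[] args) {
        String text = String.join("\n",
                "name = overdue",
                "description = Collection of small additions to OpenRefine functionality",
                "requires = core");
        Properties props = parse(text);
        for (String key : new String[] {"name", "description", "requires"}) {
            System.out.println(key + " present: " + hasValue(props, key));
        }
    }
}
```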

build.xml

The ‘build.xml’ file is more complex than module.properties, and to be honest I don’t fully understand it. It isn’t part of the extension proper, but is used as part of the ‘build’ process that compiles the Java code in your extension into class files that OpenRefine can load.

However, starting from the example build.xml file given by Giuliano I only had to edit a few lines to get this working for my extension. The content of my build.xml is:


<?xml version="1.0" encoding="UTF-8"?>

<project name="overdue" default="build" basedir=".">
    <property name="name" value="overdue"/>
    <property environment="env"/>

    <condition property="version" value="trunk">
        <not><isset property="version"/></not>
    </condition>

    <condition property="revision" value="r1">
        <not><isset property="revision"/></not>
    </condition>

    <condition property="full_version" value="0.0.0.1">
        <not><isset property="full_version"/></not>
    </condition>

    <condition property="dist.dir" value="dist">
        <not><isset property="dist.dir"/></not>
    </condition>

    <property name="fullname" value="${name}-${version}-${revision}" />
    <property name="refine.dir" value="${basedir}/../../main" />
    <property name="refine.webinf.dir" value="${refine.dir}/webapp/WEB-INF" />
    <property name="refine.modinf.dir" value="${refine.dir}/webapp/modules/core/MOD-INF" />
    <property name="refine.classes.dir" value="${refine.webinf.dir}/classes" />
    <property name="refine.lib.dir" value="${refine.webinf.dir}/lib" />
    <property name="server.dir" value="${basedir}/../../server" />
    <property name="server.lib.dir" value="${server.dir}/lib" />

    <property name="src.dir" value="${basedir}/src" />
    <property name="module.dir" value="${basedir}/module" />
    <property name="modinf.dir" value="${module.dir}/MOD-INF" />
    <property name="lib.dir" value="${modinf.dir}/lib" />
    <property name="classes.dir" value="${modinf.dir}/classes" />

    <path id="class.path">
        <fileset dir="${lib.dir}" erroronmissingdir="false">
            <include name="**/*.jar" />
        </fileset>
        <fileset dir="${refine.lib.dir}">
            <include name="**/*.jar" />
        </fileset>
        <fileset dir="${server.lib.dir}">
            <include name="**/*.jar" />
        </fileset>
        <pathelement path="${refine.classes.dir}"/>
    </path>

    <target name="build_java">
        <mkdir dir="${classes.dir}" />
        <javac encoding="utf-8" destdir="${classes.dir}" debug="true" includeAntRuntime="no">
            <src path="${src.dir}"/>
            <classpath refid="class.path" />
        </javac>
    </target>

    <target name="build" depends="build_java"/>

    <target name="clean">
        <delete dir="${classes.dir}" />
    </target>
</project>

The only parts I edited were:

The project name (both the ‘name’ attribute on the project element and the ‘name’ property; mine are ‘overdue’)
The revision value (mine is ‘r1’ but may change as I further develop the extension)

That’s it – everything else here is exactly the same as in Giuliano’s example build.xml (which in turn is pretty much identical to the sample extension build.xml).

The only other property you might want to edit is the full_version value – which you can keep updating as you develop the extension.

Writing a new GREL function

This is the meat of the extension (everything else is just scaffolding). Here you get to define a new GREL function. To do this you need to be able to write some Java, but the complexity of the Java you need will depend on what you want your function to do. One really useful thing to do at this stage is look at the structure of the GREL functions in the core OpenRefine product, as your GREL function will follow the same pattern, and you may well be able to learn (and copy) from existing code. My function generates a random integer within a given range, and the code lives in a file called RandomNumber.java in the ‘functions’ directory:


package com.ostephens.overdue.refine.functions;

import java.util.Properties;
import java.util.concurrent.ThreadLocalRandom;
import org.json.JSONException;
import org.json.JSONWriter;
import com.google.refine.expr.EvalError;
import com.google.refine.grel.ControlFunctionRegistry;
import com.google.refine.grel.Function;

public class RandomNumber implements Function {
    
    @Override
    public Object call(Properties bindings, Object[] args) {
        if (args.length == 2 && args[0] != null && args[0] instanceof Number
                && args[1] != null && args[1] instanceof Number && ((Number) args[0]).intValue()<((Number) args[1]).intValue()) {
            int randomNum = ThreadLocalRandom.current().nextInt(((Number) args[0]).intValue(), ((Number) args[1]).intValue()+1);
            return randomNum;
        }
        return new EvalError(ControlFunctionRegistry.getFunctionName(this) + " expects two numbers, the first must be less than the second");
    }

    
    @Override
    public void write(JSONWriter writer, Properties options)
        throws JSONException {
    
        writer.object();
        writer.key("description"); writer.value("Returns a pseudo-random integer between the lower and upper bound (inclusive)");
        writer.key("params"); writer.value("lower bound,upper bound");
        writer.key("returns"); writer.value("number");
        writer.endObject();
    }
}

I don’t really write Java, and I relied heavily on looking at existing GREL functions to understand how to get this working. This file defines a ‘RandomNumber’ class which is a GREL function. The first bit of the code:


    @Override
    public Object call(Properties bindings, Object[] args) {
        if (args.length == 2 && args[0] != null && args[0] instanceof Number
                && args[1] != null && args[1] instanceof Number && ((Number) args[0]).intValue()<((Number) args[1]).intValue()) {
            int randomNum = ThreadLocalRandom.current().nextInt(((Number) args[0]).intValue(), ((Number) args[1]).intValue()+1);
            return randomNum;
        }
        return new EvalError(ControlFunctionRegistry.getFunctionName(this) + " expects two numbers, the first must be less than the second");
    }

This is the actual function. It takes two arguments (a lower and an upper bound for the range in which to generate a random integer), checks that both are numbers and that the first is less than the second, and returns a random integer within that range, including both bounds. If the check fails it returns an error message instead, which is displayed in the GREL/Transform dialogue in OpenRefine.
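To experiment with this logic without building the whole extension, here is a standalone sketch of the same checks that runs with nothing but the JDK. The class name ‘RandomNumberSketch’ is hypothetical, and it returns a plain String where the real function returns an EvalError, so treat it as a way to try out the argument handling rather than as a replacement for the class above.

```java
import java.util.concurrent.ThreadLocalRandom;

// Standalone stand-in for the call() method above, with no OpenRefine dependencies.
public class RandomNumberSketch {

    public static final String ERROR =
            "randomNumber expects two numbers, the first must be less than the second";

    // Returns an Integer between the bounds (inclusive) when the arguments are valid,
    // or an error message String otherwise.
    public static Object call(Object[] args) {
        // instanceof is false for null, so no separate null check is needed.
        if (args.length == 2 && args[0] instanceof Number && args[1] instanceof Number) {
            int lower = ((Number) args[0]).intValue();
            int upper = ((Number) args[1]).intValue();
            if (lower < upper) {
                // nextInt treats its second argument as exclusive, so add 1 to include the upper bound.
                return ThreadLocalRandom.current().nextInt(lower, upper + 1);
            }
        }
        return ERROR;
    }

    public static void main(String[] args) {
        System.out.println(call(new Object[] {1, 10}));    // an integer from 1 to 10
        System.out.println(call(new Object[] {10, 1}));    // the error message
        System.out.println(call(new Object[] {"a", 10}));  // the error message
    }
}
```

One thing this makes easier to see is that decimal arguments are truncated by intValue(), so 1.9 is treated as 1. Like the original, the sketch does not guard against an upper bound equal to the largest possible int, where adding 1 would overflow.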

The second bit of the code:


    @Override
    public void write(JSONWriter writer, Properties options)
        throws JSONException {
    
        writer.object();
        writer.key("description"); writer.value("Returns a pseudo-random integer between the lower and upper bound (inclusive)");
        writer.key("params"); writer.value("lower bound,upper bound");
        writer.key("returns"); writer.value("number");
        writer.endObject();
    }
}

This is the documentation for the function – this is what will appear in the ‘Help’ tab on the GREL/Transform dialogue screen.

Once the function has been written, what remains is to make sure that this new function is registered with OpenRefine when the extension is installed. This is done in ‘controller.js’.

Registering and initialising new GREL functions in controller.js

The controller.js is the file in which you tell OpenRefine about the new functions, commands and operations your extension adds to OpenRefine. However, in this case we are only creating new GREL functions (and not, for example, adding any new menu items or export formats). This means you only need to register functions. The content of my controller.js file looks like:


function registerFunctions() {
    Packages.java.lang.System.out.println("Registering Overdue utilities functions...");
    // Shorthand for OpenRefine's registry of GREL functions
    var FR = com.google.refine.grel.ControlFunctionRegistry;
    FR.registerFunction("randomNumber", new com.ostephens.overdue.refine.functions.RandomNumber());
}

// Called by OpenRefine when the extension is loaded
function init() {
    Packages.java.lang.System.out.println("Initializing Overdue utilities...");
    Packages.java.lang.System.out.println(module.getMountPoint());
    registerFunctions();
}

The first part of this file (the ‘registerFunctions’ function) registers my new GREL function under the name ‘randomNumber’, backed by the ‘RandomNumber’ class. The second part of the file (the ‘init’ function) ensures that these functions are added when OpenRefine starts up.

The key line is:


FR.registerFunction("randomNumber", new com.ostephens.overdue.refine.functions.RandomNumber());

This tells OpenRefine that the GREL command will be ‘randomNumber’ (case sensitive) – this is what the user will type in the GREL/Transform dialogue to use the function. It then points to the class that I created in the Java file above (RandomNumber – again case sensitive).

If I wanted to add another new function, I’d just need to edit this file to include any other functions I’ve written. For example:


FR.registerFunction("newGrelFunction", new com.ostephens.overdue.refine.functions.NewGrelFunction());
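To make the idea of registering a function under a name a bit more concrete, here is a toy registry in plain Java. It is a hypothetical stand-in that I am using purely for illustration, not OpenRefine’s ControlFunctionRegistry, and I am assuming only that the real registry likewise looks functions up by their exact name. It shows why the name you register is the name users must type, including its case.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Toy registry (hypothetical): maps a function name to the code that implements it.
public class MiniRegistry {

    private final Map<String, Function<Object[], Object>> functions = new HashMap<>();

    // Stores the implementation under the exact name given.
    public void registerFunction(String name, Function<Object[], Object> implementation) {
        functions.put(name, implementation);
    }

    // Returns the implementation registered under this exact name, or null if there is none.
    public Function<Object[], Object> getFunction(String name) {
        return functions.get(name);
    }

    public static void main(String[] args) {
        MiniRegistry registry = new MiniRegistry();
        // A trivial implementation that reports how many arguments it was given.
        registry.registerFunction("newGrelFunction", arguments -> arguments.length);

        System.out.println(registry.getFunction("newGrelFunction") != null); // true
        System.out.println(registry.getFunction("newgrelfunction") != null); // false: lookup is case sensitive
    }
}
```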

Now that all the pieces are in place, all that remains is to ‘build’ the extension, and you are ready to use it.

Building the extension

OpenRefine is built using Apache Ant, a piece of software which can be used to compile, assemble, test and run Java applications. The same software is used to build extensions, and this must be done while the extension code is sitting within the larger OpenRefine source tree, because the build.xml above refers to OpenRefine’s own classes and libraries using relative paths. So to get this working you need to:

Install Ant – there are general installation instructions on the Ant website, but searching for a guide on installing it on your specific platform may also be helpful
Download the development version of OpenRefine – use the source code links from https://github.com/OpenRefine/OpenRefine/releases
Put the directory containing your extension under the OpenRefine ‘extensions’ directory
Go to the directory containing your extension and type ‘ant build’ (note that this is not done from the OpenRefine root folder, but at the root of your extension)
If everything is OK, this will build your extension with no errors

If there are errors in your Java class or your build.xml file, this is the point where you will find out, as the extension won’t build successfully. However, other issues may not be caught at this point: the extension can build even if there are (for example) errors in controller.js, and these may not become apparent until you start using the extension.

Using the extension

Once you’ve built the extension, you can copy the directory containing your extension into your working OpenRefine installation (again, in the appropriate ‘extensions’ directory), and run OpenRefine. You should see (in the command window) any output your extension is programmed to print on initialisation (in my case: “Initializing Overdue utilities” and “Registering Overdue utilities functions”).

Open up a project in OpenRefine and open a GREL/Transform dialogue – and you should be able to use your new GREL command – in my case by typing something like:


randomNumber(1,10)

This generates a random integer between 1 and 10 (inclusive).

If you look in the ‘Help’ tab you should be able to see the ‘description’, ‘params’ and ‘returns’ information from the Java file.

If you hit problems with your extension working at this point it may be worth checking the javascript console to see if controller.js has generated any errors. However, tracking this back is slightly complicated by the way OpenRefine builds all the javascript together when it runs – so you’ll have to check for errors and then work out if they relate to your controller.js or something else.

Conclusion

Once I’d understood how the various bits of an OpenRefine extension interacted through the documentation and (really importantly for me) looking at examples of clearly written code, I could see how adding a new GREL function was essentially as simple as writing a new Java class and then wiring it all together.

I wouldn’t have got this far without the documentation and examples written by Giuliano Tortoreto and the GOKb extension written by Steve Osguthorpe and Ian Ibbotson of Knowledge Integration – but of course any mistakes or misunderstanding in the above remain my own!

Overdue Ideas

Ideas linking Libraries, Computing, E-learning, and anything else that springs to mind.