OpenCL in Java

My first experiment with OpenCL was in Xcode on the Mac, and I wondered whether Java is supported as well. My C/C++ skills are not what they used to be and I feel more comfortable with Java. Luckily there is a Java binding called JavaCL, and I have tried to make the shortest working example to learn how OpenCL works and how the library must be used. I'm sharing this example because I found few simple, short examples.

At the end of the post you will find the complete source code I used. I will describe the code in a few blocks to make clear what it does and why. You should be familiar with some of the OpenCL terminology, as this helps in understanding the code.

What you need

  • An OpenCL compatible device
  • NetBeans or Eclipse (I use NetBeans)
  • Maven

Maven configuration
Group ID: com.nativelibs4java
Artifact ID: javacl

It should also be possible to include the jar files in another way. You can find more help on the JavaCL website.

Source code step by step

The first two lines initialize the context and the queue for the best device available on your machine. According to the Javadoc this is the device with the most compute units. On my machine it is my Nvidia GPU. The device selection is somewhat implicit here: createBestContext() picks the device and createDefaultQueue() creates a command queue for that device.

CLContext context = JavaCL.createBestContext();
final CLQueue queue = context.createDefaultQueue();

After that we can proceed with creating buffers to contain the input and output. A GPU has its own memory, separate from that of the CPU, and we send a command to the device to allocate the memory we are going to need. First we define dataSize so we use the same size everywhere in the code. Then we allocate a Pointer<Float> buffer. The Pointer type comes from BridJ, the native binding library that JavaCL builds on, and gives us a data structure we can fill on the host and use on the device. We also need to set the byte order because it can vary between devices (little endian, big endian).

The inputData buffer is filled with some data in the first 3 positions. With input.write() we copy the local inputData buffer to the remote input buffer on the device. Remember that inputData lives in the memory of the CPU and input lives in the memory of the GPU.

The output buffer is only created on the GPU: we tell the GPU to reserve a block of 128 floats for use by our program.

int dataSize = 128;

final Pointer<Float> inputData = Pointer.allocateFloats(dataSize).order(context.getByteOrder());
inputData.set(0, 1.0f);
inputData.set(1, 1.1f);
inputData.set(2, 1.2f);

CLBuffer<Float> input = context.createFloatBuffer(CLMem.Usage.Input, inputData, true);
input.write(queue, inputData, true);

final CLBuffer<Float> output = context.createFloatBuffer(CLMem.Usage.Output, dataSize);
float multFactor = 2f;
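
To make the byte order remark above more concrete, here is a minimal sketch in plain Java. It uses java.nio.ByteBuffer instead of JavaCL, and the class name ByteOrderDemo is only an illustration of mine. It shows that the same float is laid out differently in memory depending on the byte order, and that reading it back with the wrong order gives a wrong value.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Illustrative helper (not part of JavaCL): shows why host and device
// must agree on the byte order of the data they exchange.
class ByteOrderDemo {

    // Encodes a float into 4 bytes using the given byte order.
    static byte[] encode(float value, ByteOrder order) {
        return ByteBuffer.allocate(Float.BYTES).order(order).putFloat(value).array();
    }

    // Decodes 4 bytes back into a float, interpreting them in the given byte order.
    static float decode(byte[] bytes, ByteOrder order) {
        return ByteBuffer.wrap(bytes).order(order).getFloat();
    }

    public static void main(String[] args) {
        byte[] little = encode(1.0f, ByteOrder.LITTLE_ENDIAN);
        byte[] big = encode(1.0f, ByteOrder.BIG_ENDIAN);

        System.out.println("little endian bytes: " + Arrays.toString(little));
        System.out.println("big endian bytes:    " + Arrays.toString(big));

        // Same bytes, wrong interpretation: the result is no longer 1.0
        System.out.println("read with wrong order: " + decode(little, ByteOrder.BIG_ENDIAN));
        System.out.println("byte order of this JVM's platform: " + ByteOrder.nativeOrder());
    }
}
```

Calling order(context.getByteOrder()) on the Pointer in the snippet above serves the same purpose: the host writes the floats in the layout the device expects.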

Now that the data and the input and output buffers have been created, we can build the kernel. The kernel source is defined in a string for brevity, but it would make more sense to define the kernel in a separate file. We create a CLProgram with the call createProgram().build(), which compiles our code and makes it available for running.
With the program ready we create a CLKernel with createKernel() and pass our arguments to the same method. The arguments are given in the same order as the parameters of the kernel. In fact the OpenCL float* input parameter corresponds to our Java input buffer.

String sources = ""
   + "__kernel void myKernel(__global const float* input, __global float* output, float multFactor) {\n"
   + "   int i = get_global_id(0);\n"
   + "   output[i] = input[i] * multFactor;\n"
   + "}";

CLProgram program = context.createProgram(sources).build();

// Create kernel and set arguments
CLKernel kernel = program.createKernel("myKernel", input, output, multFactor);
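
To see what the kernel computes without a GPU, here is a small plain-Java sketch of the same calculation. It is not JavaCL code and the class name KernelReference is my own. Each loop iteration plays the role of one work-item, and the loop index stands in for get_global_id(0).

```java
// CPU reference for myKernel, written in plain Java for comparison.
// On the device every work-item runs the kernel body once, in parallel;
// here a sequential loop visits the same indices one after another.
class KernelReference {

    static float[] multiply(float[] input, float multFactor) {
        float[] output = new float[input.length];
        for (int i = 0; i < input.length; i++) { // i corresponds to get_global_id(0)
            output[i] = input[i] * multFactor;
        }
        return output;
    }

    public static void main(String[] args) {
        int dataSize = 128;
        float[] input = new float[dataSize];
        input[0] = 1.0f;
        input[1] = 1.1f;
        input[2] = 1.2f;

        float[] expected = multiply(input, 2f);
        System.out.println("Data " + expected[0]); // prints "Data 2.0"
    }
}
```

A reference like this is handy for checking the values that come back from the device.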

We are ready to run the kernel and we do that with enqueueNDRange(). This method accepts arguments that determine the global work size, that is, how many work-items there are in each dimension. It queues the kernel for execution on the GPU and the program proceeds without waiting for the result. To receive a signal when the work is done I register a callback.

To read the data we first have to copy the results from the GPU memory back to the local memory. This is done with output.read(queue), which copies the data into newResults. I print the first result to show that it works.
With a call to waitFor() we block until the kernel has been executed. For this example this is needed.

CLEvent kernelCompletion = kernel.enqueueNDRange(queue, new int[]{dataSize});

kernelCompletion.setCompletionCallback(new CLEvent.EventCallback() {

  public void callback(int executionStatus) {
    // Read the whole buffer
    Pointer<Float> newResults = output.read(queue);

    System.out.println("Data " + newResults.get(0));
  }
});

kernelCompletion.waitFor();
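
The pattern above (start the work, register a callback, then wait) can be tried out without a GPU. The sketch below is only an analogy in plain Java using CompletableFuture; it is not how JavaCL is implemented and the class name AsyncAnalogy is made up. runAsync() plays the role of enqueueNDRange(), thenRun() the role of the completion callback and join() the role of waitFor().

```java
import java.util.concurrent.CompletableFuture;

// Plain-Java analogy of "enqueue, register callback, wait".
// Nothing here talks to OpenCL; a background thread stands in for the GPU.
class AsyncAnalogy {

    static float[] run(float[] input, float multFactor) {
        float[] output = new float[input.length];

        // "Enqueue": the work starts on another thread and this call returns immediately
        CompletableFuture<Void> completion = CompletableFuture.runAsync(() -> {
            for (int i = 0; i < input.length; i++) {
                output[i] = input[i] * multFactor;
            }
        });

        // "Completion callback": runs once the work above has finished
        CompletableFuture<Void> callbackDone =
                completion.thenRun(() -> System.out.println("Data " + output[0]));

        // "Wait": block so the program does not move on (or exit) too early.
        // Unlike waitFor() in the JavaCL snippet, this also waits for the callback itself.
        callbackDone.join();
        return output;
    }

    public static void main(String[] args) {
        float[] input = new float[128];
        input[0] = 1.0f;
        run(input, 2f); // prints "Data 2.0"
    }
}
```

Without the final join() the main thread could finish before the background work has run, which is the same reason the JavaCL example waits on the event.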

Full source

Below is the full source code of the working example. It should be put in a class and a method that is executed. Please be aware that the code works on my local machine running Ubuntu, but I have not tried it anywhere else. Exception handling is not implemented yet either.

CLContext context = JavaCL.createBestContext();
final CLQueue queue = context.createDefaultQueue();

int dataSize = 128;

Pointer<Float> results = Pointer.allocateFloats(dataSize).order(context.getByteOrder());
final Pointer<Float> inputData = Pointer.allocateFloats(dataSize).order(context.getByteOrder());
inputData.set(0, 1.0f);
inputData.set(1, 1.1f);
inputData.set(2, 1.2f);

CLBuffer<Float> input = context.createFloatBuffer(CLMem.Usage.Input, inputData, true);
input.write(queue, inputData, true);

final CLBuffer<Float> output = context.createFloatBuffer(CLMem.Usage.Output, results, true);
float multFactor = 2f;

String sources = ""
   + "__kernel void myKernel(__global const float* input, __global float* output, float multFactor) {\n"
   + "   int i = get_global_id(0);\n"
   + "   output[i] = input[i] * multFactor;\n"
   + "}";

CLProgram program = context.createProgram(sources).build();

// Create kernel and set arguments
CLKernel kernel = program.createKernel("myKernel", input, output, multFactor);

CLEvent kernelCompletion = kernel.enqueueNDRange(queue, new int[]{dataSize});

kernelCompletion.setCompletionCallback(new CLEvent.EventCallback() {

  public void callback(int executionStatus) {
    // Read the whole buffer
    Pointer<Float> newResults = output.read(queue);

    System.out.println("Data " + newResults.get(0));
  }
});

kernelCompletion.waitFor();
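
As mentioned earlier, the kernel would be better off in a separate file than in a Java string. Here is a minimal sketch of reading such a file with the standard library only. The class name KernelSourceLoader and the file name are assumptions of mine, and the demo writes a temporary file first so that it runs on its own. The resulting string could then be handed to context.createProgram() in place of the sources variable.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative helper (not part of JavaCL): reads OpenCL kernel source from a file.
class KernelSourceLoader {

    // Returns the whole file as one string, ready to pass to createProgram().
    static String load(Path file) {
        try {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException("Could not read kernel source: " + file, e);
        }
    }

    public static void main(String[] args) throws IOException {
        // For the demo we create the kernel file ourselves; normally it would
        // already exist next to the Java sources (hypothetical name: myKernel.cl).
        String kernel =
                "__kernel void myKernel(__global const float* input, __global float* output, float multFactor) {\n"
              + "   int i = get_global_id(0);\n"
              + "   output[i] = input[i] * multFactor;\n"
              + "}";
        Path file = Files.createTempFile("myKernel", ".cl");
        try {
            Files.write(file, kernel.getBytes(StandardCharsets.UTF_8));
            String sources = load(file);
            System.out.println(sources);
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```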

Links