C++ OpenCL matrix library with temporary elimination


The Armadillo matrix library's documentation states:

Armadillo employs a delayed evaluation approach to combine several operations into one and reduce (or eliminate) the need for temporaries. Where applicable, the order of operations is optimised. Delayed evaluation and optimisation are achieved through recursive templates and template meta-programming.

This means that you can write operations like

arma::mat A, B;
arma::vec c, d;
...
d = (A % B) * c;

and no temporary variables are created. (Note that % is the element-wise product operator in Armadillo.)

I would like to be able to code in a similar style for an OpenCL application.

The libraries I've looked at are VexCL, ViennaCL, Boost.Compute, and clBLAS. VexCL and Boost.Compute don't even provide basic matrix functionality such as multiplication. clBLAS doesn't work as a template library, so you need to manually invoke the operations. ViennaCL provides all the operations I need, but it doesn't seem to be capable of chaining them together.

For example

    d= linalg::prod(linalg::element_prod(A,B), c);

fails to compile.

It might be possible to use VexCL to generate kernels automatically from the operation trees that Armadillo builds, but I can't see a straightforward way to make that work.

Any suggestions?


Accepted answer:

You might want to check out ArrayFire.

ArrayFire is a matrix-based library with a JIT compilation engine that combines multiple operations into a single kernel. This dramatically reduces the number of kernel calls for element-wise operations like the ones you posted. For example, your code can be written as:

#include <arrayfire.h>
using namespace af;

array A = randu(5, 5);       // 5x5 matrix of uniform random values
array B = randu(5, 5);       // 5x5 matrix
array c = constant(1, 1, 5); // 1x5 matrix of ones

array d = (A % B) + tile(c, 5); // fused by the JIT into one kernel

In this example, the modulus and the addition are performed in a single OpenCL kernel, and no temporaries are created. ArrayFire also has backends for single-threaded CPU, CUDA, and OpenCL.

Disclosure: I am one of the developers of the ArrayFire library.

Another answer:

The operation from your example can be written in VexCL using the recently added tensordot() operation:

// ctx is an initialized vex::Context; n and m are the matrix dimensions.
vex::vector<double> A(ctx, n * m), B(ctx, n * m);
vex::vector<double> c(ctx, n), d(ctx, n);

vex::slicer<1> s1(vex::extents[n]);    // shape of the vectors
vex::slicer<2> s2(vex::extents[n][m]); // shape of the matrices

using vex::_;
d = vex::tensordot(s2[_](A * B), s1[_](c), vex::axes_pairs(1, 0));

This results in a very straightforward implementation of the matrix-vector product (so probably not as efficient as a vendor-tuned BLAS kernel), but no temporaries are involved and only a single kernel is launched.

One caveat: tensordot() may only be used with single-device contexts.