Is there a For-loop in Boost.Compute?


I am trying to do the following:

  1. Declare and initialise a compute::vector of size n with each of the element being 1^2, 2^2, 3^2 ... n^2
  2. While size of vector > 1:
    a. For i = 0 to (size of vector - 2): n[i] += n[i+1]
    b. Remove the last element from the vector
  3. Square root the remaining element in the vector.
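The steps above can be sketched as a plain single-core C++ reference (a minimal version of my own, useful for checking results):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Steps 1-3 of the algorithm, single-threaded, for reference.
double pairwise_reduce(std::size_t n) {
    std::vector<double> v(n);
    for (std::size_t i = 0; i < n; ++i)            // step 1: 1^2, 2^2, ..., n^2
        v[i] = double(i + 1) * double(i + 1);
    while (v.size() > 1) {                         // step 2
        for (std::size_t i = 0; i + 1 < v.size(); ++i)
            v[i] += v[i + 1];                      // 2a: add the right neighbour
        v.pop_back();                              // 2b: drop the last element
    }
    return std::sqrt(v[0]);                        // step 3
}
```

For example, n = 3 gives {1, 4, 9} → {5, 13} → {18} → sqrt(18).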

I implemented this with Boost.Compute in this way:

BOOST_COMPUTE_FUNCTION(double, square, (double x),
{
    return x * x;
});

void cl_benchmark(const compute::device& gpu, uint32_t size) {
    compute::context ctx(gpu);
    compute::command_queue queue(ctx, gpu);
    // Starts measuring time from this line
    compute::vector<double> v(size, ctx);

    compute::iota(v.begin(), v.end(), 1, queue);
    compute::transform(  // b^2 and c^2 in one function
        v.begin(), v.end(), v.begin(), square, queue
    );

    for (uint32_t temp_size = size; temp_size > 1; temp_size--) {
        compute::adjacent_difference( // b^2 + c^2
            v.begin(), v.end(), v.begin(), compute::plus<double>(), queue
        );
        v.erase(v.begin(), queue);
    }
    compute::transform( // sqrt(a)
        v.begin(), v.end(), v.begin(), compute::sqrt<double>(), queue
    );
    queue.finish(); // wait for queued work so the timing below is accurate
    // Stops measuring time
}

This works, but testing shows it is very slow:

size 6000: 551 ms
size 7024: 1870 ms
size 8048: 8262 ms

I know that kernel compilation and copying data to the GPU take time, but these results still seem off to me.
A similar single-core CPU implementation handles this algorithm with a size of ~150k in about 2 s.

I suspect the for loop is the bottleneck: it appears to keep compiling a new adjacent_difference kernel on every iteration. Is there a way to parallelise the loop, or at least reuse the compiled kernel across iterations?
My understanding is that if a new adjacent_difference kernel no longer has to be compiled for each iteration, the CPU-side bottleneck should be drastically reduced.
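Before optimising, it would help to confirm which phase actually dominates. Since enqueued Boost.Compute kernels run asynchronously, each phase has to be followed by queue.finish() before reading the clock. A minimal, generic timing helper (a sketch; with Boost.Compute the flush step would be a lambda calling queue.finish()):

```cpp
#include <chrono>

// Runs 'work', then 'flush' to force completion of any queued asynchronous
// work, and returns the elapsed time in milliseconds.
// With Boost.Compute, 'flush' would be [&]{ queue.finish(); }.
template <class Work, class Flush>
double time_phase_ms(Work&& work, Flush&& flush) {
    auto t0 = std::chrono::steady_clock::now();
    work();
    flush();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Timing the iota/transform setup, the erase loop, and the final sqrt pass as three separate phases this way would show whether the loop (and its kernel compilation) really accounts for most of the runtime.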
