I am trying to write a CUDA kernel that is used to find min and max index of values in 1D array greater than a particular threshold Below is the pseudo code in CPU for doing the same
int min_index = 0, max_index = length;
for(int i = 0; i < length; i++)
{
if(arr[i] >= threshold)
{
min_index = i;
break;
}
}
for(int i = length - 1 ; i >= min_index; i--)
{
if(arr[i] >= threshold)
{
max_index = i;
break;
}
}
One way of doing this is GPU is using atomic operations, but i would not prefer to use them. I was thinking of using reduction, but unable to figure out the logic with which i need to do reduction. For min reduction a logic like below can be followed
for(int i = 8; i >= 1; i /= 2)
{
if(threadIdx.x < i)
sh_mem[threadIdx.x] = min(sh_mem[threadIdx.x], sh_mem[threadIdx.x + i]);
__syncthreads();
}
But what logic can i use to do for my case, where i need to compare the values in shared memory with a threshold and compute the min index in shared memory satisfying this criteria?
Would greatly appreciate any thoughts on how to solve the above problem
EDIT: It is not possible for me to use thrust library due to some constraints
The following is a sample of how to do this in Thrust with a single algorithm instead of two calls to
thrust::find_if(one utilizingreverse_iterators). This implementation is using a singletransform_reducewhere thetransformfilters out the values above the threshold by replacing them with neutral elements and prepares the followingminmaxreduction working on tuples.While it might be possible to get better performance using a custom kernel as discussed in the comments below the question, this should serve as a good reference solution (i.e. this isn't necessarily supposed to be the final/accepted answer).
Note that in case the minimum and maximum indices are far apart, the two
find_ifs should actually perform better, assuming they do implement early exit and avoid reading the whole range from global memory. In the (edge-) case where these indices are very close however a single reduction might be favorable as it avoids overheads.Because it is probably (did not benchmark) the better solution in general, I will include the (pretty simple)
find_ifsolution here as well: