The problem:
I want to implement a segmented scan in cupy. There doesn't appear to be any segmented scan algorithm in cupy, so I implemented it using a RawKernel, using a simple Hillis and Steele approach for now (See here at section 39.2.1 A Naive Parallel Scan).
This is my code
import cupy as cp
def raw_segmented_scan_cumprod_kernel():
BLOCK_DIMS = (10,) # HOW TO PICK THESE DIMS?
GRID_DIMS = (5,) # HOW TO PICK THESE DIMS?
segmented_scan_kernel = cp.RawKernel(r'''
extern "C" __global__
void scan(float* values, int* segment_heads) {
int tid = threadIdx.x;
for(int delta = 1; delta < 8; delta *=2) {
__syncthreads();
int previous_id = tid - delta;
if(previous_id < 0) continue;
if (1 - segment_heads[tid])
values[tid] = values[tid] * values[previous_id] ;
segment_heads[tid] = segment_heads[previous_id] || segment_heads[tid];
}
}
''', 'scan')
values = cp.arange(10, dtype=cp.float32) + 1
segment_heads = (((values - 1) % 3) == 0).astype(cp.int32)
segmented_scan_kernel(GRID_DIMS, BLOCK_DIMS, (values, segment_heads))
However in order to execute a raw kernel call it is necessary to select the grid and block dimensions. How do you pick these? I am torn between two ideas: 1. Max out both dims, and if that is not enough spread the operation over multiple chunks. 2. Be a bit more conservative in grid and block dimensions to let other processes still use the gpu.
How do you tackle this problem when creating a raw kernel in cupy? What criteria would you suggest and which do you think apply in my situation?
Update: I found some relevant information!Apparently there is a cuda function which calculates the maximum number if active blocks per multiprocessor: cudaOccupancyMaxActiveBlocksPerMultiprocessor which cupy bound to cp.cuda.driver.occupancyMaxActiveBlocksPerMultiprocessor. It is called here: https://github.com/cupy/cupy/blob/main/cupy/cuda/function.pyx#L189 I still do not know for sure how to pass my RawKernel function to it though. I will keep you all posted.
Update 2: I think this is what I need, but it is late and I am too tired to check it well, I will look at it this weekend: (GRID_DIMS, BLOCK_DIMS) = np.cuda.driver.occupancyMaxPotentialBlockSize(cuda_function_ptr, 0, 0)
segmented_scan_kernel(GRID_DIMS, BLOCK_DIMS, (values, segment_heads))