How to select the number of threads per block, blocks when writing a RawKernel in cupy

42 Views Asked by At

The problem:

I want to implement a segmented scan in cupy. There doesn't appear to be any segmented scan algorithm in cupy, so I implemented it using a RawKernel, using a simple Hillis and Steele approach for now (See here at section 39.2.1 A Naive Parallel Scan).

This is my code

import cupy as cp

def raw_segmented_scan_cumprod_kernel():
    BLOCK_DIMS = (10,)  # HOW TO PICK THESE DIMS?
    GRID_DIMS = (5,)  # HOW TO PICK THESE DIMS?

    segmented_scan_kernel = cp.RawKernel(r'''

    extern "C" __global__
    void scan(float* values, int* segment_heads) {

        int tid = threadIdx.x;
        
        for(int delta = 1; delta < 8; delta *=2) {
            __syncthreads();
    
            int previous_id = tid - delta;
            
            if(previous_id < 0) continue;
            
            if (1 - segment_heads[tid])
                values[tid] = values[tid] * values[previous_id] ;
            segment_heads[tid] = segment_heads[previous_id] || segment_heads[tid];

        }

    }

    ''', 'scan')

    values = cp.arange(10, dtype=cp.float32) + 1
    segment_heads = (((values - 1) % 3) == 0).astype(cp.int32)

    segmented_scan_kernel(GRID_DIMS, BLOCK_DIMS, (values, segment_heads))

However in order to execute a raw kernel call it is necessary to select the grid and block dimensions. How do you pick these? I am torn between two ideas: 1. Max out both dims, and if that is not enough spread the operation over multiple chunks. 2. Be a bit more conservative in grid and block dimensions to let other processes still use the gpu.

How do you tackle this problem when creating a raw kernel in cupy? What criteria would you suggest and which do you think apply in my situation?

Update: I found some relevant information!

Apparently there is a cuda function which calculates the maximum number if active blocks per multiprocessor: cudaOccupancyMaxActiveBlocksPerMultiprocessor which cupy bound to cp.cuda.driver.occupancyMaxActiveBlocksPerMultiprocessor. It is called here: https://github.com/cupy/cupy/blob/main/cupy/cuda/function.pyx#L189 I still do not know for sure how to pass my RawKernel function to it though. I will keep you all posted.

Update 2: I think this is what I need, but it is late and I am too tired to check it well, I will look at it this weekend:
    (GRID_DIMS, BLOCK_DIMS) = np.cuda.driver.occupancyMaxPotentialBlockSize(cuda_function_ptr, 0, 0)

segmented_scan_kernel(GRID_DIMS, BLOCK_DIMS, (values, segment_heads))
0

There are 0 best solutions below