I have a CUDA kernel in Python (Numba) which operates on several arrays, based on applied masks.
The arrays have a small number of rows and a very large number of columns.
The masks select indices from the arrays that are often not adjacent to one another. The conditions for the masks are based on strings, so I can't evaluate them within the kernel.
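To make the access pattern concrete, here is a minimal sketch of how such a mask turns into index pairs. The `labels` array and the `"a"` condition are hypothetical stand-ins for my real string data; only `np.where` and `np.column_stack` match the actual code.

```python
import numpy as np

# Hypothetical data: 3 rows, 8 columns, one string label per element.
labels = np.array([
    ["a", "b", "a", "c", "b", "a", "c", "c"],
    ["c", "c", "a", "b", "a", "b", "b", "a"],
    ["b", "a", "c", "c", "c", "a", "b", "a"],
])

# String-based condition, evaluated on the host.
mask = labels == "a"

# (row, col) pairs of the selected elements, in row-major order.
mask_indices = np.column_stack(np.where(mask))

# Column distance between consecutive selected elements within the same row.
same_row = mask_indices[1:, 0] == mask_indices[:-1, 0]
col_gaps = np.diff(mask_indices[:, 1])[same_row]
```

In this toy case every entry of `col_gaps` is larger than 1, so neighbouring threads in a grid-stride loop over `mask_indices` never touch neighbouring columns.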
Is there a sensible approach to accessing these arrays in a coalesced way?
The function and the kernel it calls look something like this:
    import numpy as np
    from numba import cuda

    threads_per_block = 256

    @cuda.jit
    def s_af0_af_func_kernel(S, af, af0, data, i, mask_indices):
        # Grid-stride loop over the selected (row, col) pairs
        start = cuda.grid(1)
        stride = cuda.gridsize(1)
        for r in range(start, mask_indices.shape[0], stride):
            id_r = mask_indices[r, 0]
            id_c = mask_indices[r, 1]
            # Writes into `data` (assumed to be the output array)
            data[id_r, id_c] = S[id_r, i] * (af0[id_r] * af[id_r, i])

    def s_af0_af_func(mask, S, af, af0, data, i):
        # Build the (row, col) index pairs on the host
        mask_indices = np.column_stack(np.where(mask))
        if mask_indices.size == 0:
            return
        num_columns_mask = mask_indices.shape[0]
        mask_indices = cuda.to_device(mask_indices)
        func_threads = min(threads_per_block, num_columns_mask)
        blockspergrid = (num_columns_mask + threads_per_block - 1) // threads_per_block
        s_af0_af_func_kernel[blockspergrid, func_threads](S, af, af0, data, i, mask_indices)
        mask_indices = None
So the elements addressed by id_r and id_c in consecutive threads are not necessarily adjacent in memory. As implemented, the kernels don't offer any speed-up over a NumPy implementation, and I guess this is probably the reason.
Alternatively, is there a way to apply the string-based mask inside the kernel?
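One direction I can imagine, shown here only as a sketch under assumptions, is to replace the strings with integer codes on the host so that the comparison itself becomes integer-only. The `labels` array and the `"foo"` target are hypothetical, and the check is done with NumPy rather than in a kernel.

```python
import numpy as np

# Hypothetical string labels, one per column.
labels = np.array(["foo", "bar", "foo", "baz", "bar", "foo"])

# np.unique returns the sorted unique strings and, with return_inverse=True,
# an integer code per element that indexes into that sorted list.
categories, codes = np.unique(labels, return_inverse=True)
codes = codes.astype(np.int32)

# Look up the code of the target string once, on the host.
target_code = np.int32(np.searchsorted(categories, "foo"))

# The comparison is now integer-only; inside a kernel it could be written as
# `if codes[c] == target_code:`. Here it is only checked with NumPy.
mask_from_codes = codes == target_code
```

Whether this would actually help depends on how the real string conditions are built, which this toy example does not capture.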
Any advice welcomed.