I’m trying to implement a runs-based kernel using Numba CUDA where I need to traverse the elements of a 3D matrix on a row per thread basis i.e. each thread is assigned a row that iterates over all elements of that row.
For example, if, for simplicity, I were to use a 2D matrix with 50 rows and 100 columns, I would need to create 50 threads that would go through the 100 elements of their respective row.
Can someone tell me how to do this?
Turns out it’s actually quite simple. You only need to launch as many threads as rows and have the kernel “point” its direction. Here’s a simple kernel that demonstrates how to do such an iteration over a 3D matrix (binary_image). The kernel itself is part of the CCL algorithm I’m implementing but that can safely be ignored: