I'm having trouble understanding how a stride loop actually works when iterating through an array.
This is the example I found, a stride loop for a single block:
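// Launched as add<<<1, 256>>>(n, x, y): one block of 256 threads.
__global__
void add(int n, float *x, float *y)
{
    int index = threadIdx.x;   // this thread's position within the block
    int stride = blockDim.x;   // number of threads in the block
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}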
My guess is that the i += stride step runs only once per block, and the loop body then runs once per thread. But nothing in the code actually specifies that: by normal C++ logic, the stride increment would run on every iteration of the loop.
Or does every thread run the whole loop logic itself, increment and bounds check included? That seems like it would hurt performance.
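To make the second reading concrete, here is a minimal CPU-only sketch of it, assuming each thread executes the entire kernel body on its own, with its own private index and i. The names add_one_thread, emulate_add and count_mismatches are hypothetical helpers made up for this illustration, not part of the CUDA API, and the "threads" run one after another instead of in parallel.

```cpp
#include <vector>

// Hypothetical CPU stand-in for the kernel body. threadIdx.x and blockDim.x
// are passed in as plain arguments because there is no GPU here.
void add_one_thread(int thread_idx, int block_dim, int n,
                    const float *x, float *y)
{
    int index = thread_idx;
    int stride = block_dim;
    // Each emulated thread runs its own copy of the loop,
    // including the i < n check and the i += stride step.
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

// Hypothetical emulation of add<<<1, block_dim>>>(n, x, y):
// call the kernel body once per thread, sequentially.
void emulate_add(int block_dim, int n, const float *x, float *y)
{
    for (int t = 0; t < block_dim; ++t)
        add_one_thread(t, block_dim, n, x, y);
}

// Fill x with 1.0 and y with 2.0, run the emulation, and count the elements
// of y that are not exactly 3.0 (skipped elements stay 2.0, elements visited
// twice become 4.0).
int count_mismatches(int n, int block_dim)
{
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    emulate_add(block_dim, n, x.data(), y.data());
    int wrong = 0;
    for (int i = 0; i < n; ++i)
        if (y[i] != 3.0f)
            ++wrong;
    return wrong;
}
```

Under this assumption, with n = 1000 and 256 threads, thread 0 would handle i = 0, 256, 512, 768 and thread 255 would handle i = 255, 511, 767, so every element is touched exactly once and no two threads share an index.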