The following code sums each group of 32 elements in an array into the first element of that group:
int i = threadIdx.x;
int warpid = i & 31;   // note: this is really the lane index within the warp, not the warp ID
if (warpid < 16) {
    s_buf[i] += s_buf[i + 16]; __syncthreads();
    s_buf[i] += s_buf[i + 8];  __syncthreads();
    s_buf[i] += s_buf[i + 4];  __syncthreads();
    s_buf[i] += s_buf[i + 2];  __syncthreads();
    s_buf[i] += s_buf[i + 1];  __syncthreads();
}
I thought I could eliminate all the __syncthreads() calls in this code, since all the operations are done within the same warp. But if I eliminate them, I get garbage results back. Keeping them should not affect performance much, but I want to know why I need __syncthreads() here.
Maybe have a look at these slides from Mark Harris. Why reinvent the wheel?
www.uni-graz.at/~haasegu/Lectures/GPU_CUDA/Lit/reduction.pdf?page=35
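If I remember those slides right, the final warp of the reduction is unrolled into a helper that takes the shared-memory pointer as volatile, so the compiler cannot cache the intermediate sums in registers between the sync-free steps. A sketch from memory (treat the exact shape as approximate):

__device__ void warpReduce(volatile int* sdata, int tid) {
    // volatile forces every read and write to go through shared memory,
    // which is what makes these steps safe without __syncthreads()
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}

The kernel calls this as if (tid < 32) warpReduce(sdata, tid); after a __syncthreads() that makes the earlier, multi-warp reduction steps visible.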
Each reduction step depends on the one before it, so you can only leave out the synchronization in the last executed stage of the reduction, where only 32 threads (a single warp) are still active. One step before that, 64 threads are active, i.e. two warps, and since lockstep parallel execution is not guaranteed across two warps, you need a synchronization there.
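Applied to the code in the question, each 32-element group is handled entirely by lanes 0..15 of a single warp, so on hardware that executes a warp in lockstep the __syncthreads() calls can indeed go; but then the buffer must be accessed through a volatile pointer, otherwise the compiler may keep s_buf[i] in a register between steps, which would explain the garbage results. A minimal sketch, assuming s_buf is the block's __shared__ int array:

__shared__ int s_buf[256];          // assumed declaration and size
volatile int* v_buf = s_buf;        // volatile view of the same buffer

int i = threadIdx.x;
if ((i & 31) < 16) {                // lanes 0..15 of each warp
    // every thread touching one 32-element group belongs to one warp,
    // so no __syncthreads() is needed; volatile keeps each partial sum
    // in shared memory instead of a register
    v_buf[i] += v_buf[i + 16];
    v_buf[i] += v_buf[i + 8];
    v_buf[i] += v_buf[i + 4];
    v_buf[i] += v_buf[i + 2];
    v_buf[i] += v_buf[i + 1];
}

(On newer GPUs with independent thread scheduling even this is not enough and you would put __syncwarp() between the steps, but on the lockstep hardware this idiom was written for, volatile is the missing piece.)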