How do I do a vector reduction (sum) of an array of size 64 in CUDA?
My code gives me half of the expected answer.
__global__ void Reduce(double* in3, double* r, int size)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    extern __shared__ double shareddata3[];
    int tid = threadIdx.x;
    if (id < size) {
        shareddata3[tid] = in3[id];
    }
    __syncthreads();
    for (unsigned int s3 = (blockDim.x / 2); s3 > 0; s3 = s3 >> 1) {
        if (tid < s3) {
            shareddata3[tid] = shareddata3[tid] + shareddata3[tid + s3];
        }
        __syncthreads();
    }
    if (tid == 0) {
        r[0] = shareddata3[0];
    }
}
My kernel launch is:
Reduce<<<1,64,sharedmem3>>>(d_array,g,64);
The error was in a part of your code that you didn't show us. Here's a complete compilable example for your code.
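The answer's original listing is not reproduced here, so the following is only a minimal sketch of what a complete program around this kernel could look like. The host code is an assumption throughout: the input is filled with 1.0, `check` is a hypothetical error-handling helper, and `sharedmem3` is assumed to be `64 * sizeof(double)`. The kernel matches the one in the question, except that out-of-range threads load 0.0 so that shared memory is never read uninitialized. Note that every allocation and copy is sized with `sizeof(double)`; sizing any of them with `sizeof(float)` would transfer only half of the data.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Kernel from the question; threads past `size` contribute 0.0.
__global__ void Reduce(const double* in3, double* r, int size)
{
    extern __shared__ double shareddata3[];
    int tid = threadIdx.x;
    int id  = blockIdx.x * blockDim.x + threadIdx.x;

    shareddata3[tid] = (id < size) ? in3[id] : 0.0;
    __syncthreads();

    // Tree reduction in shared memory; blockDim.x must be a power of two.
    for (unsigned int s3 = blockDim.x / 2; s3 > 0; s3 >>= 1) {
        if (tid < s3) {
            shareddata3[tid] += shareddata3[tid + s3];
        }
        __syncthreads();
    }

    if (tid == 0) {
        r[0] = shareddata3[0];
    }
}

// Hypothetical helper (not from the question): abort on any CUDA error.
static void check(cudaError_t err, const char* what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main()
{
    const int N = 64;
    const size_t bytes = N * sizeof(double);   // assumed: sized for double, not float

    double h_array[N];
    for (int i = 0; i < N; ++i) h_array[i] = 1.0;   // assumed test input

    double *d_array = nullptr, *g = nullptr;
    check(cudaMalloc(&d_array, bytes), "cudaMalloc(d_array)");
    check(cudaMalloc(&g, sizeof(double)), "cudaMalloc(g)");
    check(cudaMemcpy(d_array, h_array, bytes, cudaMemcpyHostToDevice),
          "cudaMemcpy host to device");

    // Dynamic shared memory: one double per thread in the block.
    size_t sharedmem3 = N * sizeof(double);
    Reduce<<<1, N, sharedmem3>>>(d_array, g, N);
    check(cudaGetLastError(), "kernel launch");

    double result = 0.0;
    // This blocking copy also waits for the kernel to finish.
    check(cudaMemcpy(&result, g, sizeof(double), cudaMemcpyDeviceToHost),
          "cudaMemcpy device to host");
    printf("sum = %f\n", result);

    cudaFree(d_array);
    cudaFree(g);
    return 0;
}
```

With this assumed input of 64 ones, the printed sum should equal 64. If your own host code yields half of that, the byte counts passed to `cudaMalloc`, `cudaMemcpy`, and the shared-memory launch argument are a reasonable first place to look.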
Output: