I've been struggling for a while now to use shared memory correctly on my V100 GPU. I work with medical genome data to perform statistical analyses. The data itself can be visualised as a matrix where the x-axis describes the gene data and the y-axis the patients. So for each gene data set there are N genes and M patients. In addition, there is a metadata vector for each patient.
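To make this concrete, here is the memory layout I am assuming (the names geneData and metadata are illustrative, not my real code):

// Assumed layout in global memory (illustrative names):
//   geneData: N x M matrix, one row per gene, one column per patient
//             -> geneData[gene * M + patient]
//   metadata: one vector of GPUMAXCOVARIATECOUNT doubles per patient,
//             stored contiguously
//             -> metadata[patient * GPUMAXCOVARIATECOUNT + covariate]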
The analyses of the individual genes are independent, so I start one CUDA thread per gene. The kernel is launched with 256 threads per block and N/256 blocks. I expect thread 1 to process gene 1, thread 2 to process gene 2, and so on. In the first iteration step all threads in the block use patient 1, and in the second iteration step patient 2.
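For reference, the launch looks roughly like this (the kernel name and parameters are placeholders, not my real code):

// Illustrative launch configuration; assumes N is a multiple of 256.
const unsigned threadsPerBlock = 256;
const unsigned blocks = N / threadsPerBlock;
analyseGenes<<<blocks, threadsPerBlock>>>(geneData, metadata, N, M);

// Inside the kernel, each thread derives its gene index as
//   unsigned gene = blockIdx.x * blockDim.x + threadIdx.x;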
My idea was that the metadata vector of patient 1 is loaded into shared memory cooperatively by all threads in the block. This makes sense because the data has to be preprocessed first and cannot be used directly as it sits in global memory. Afterwards it is available to all threads, which can access it in parallel, so they do not all have to fetch the same values from global memory one after another and recompute them. In the second iteration step, the metadata vector of patient 2 is loaded, and so on.
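In simplified form, this is the pattern I am trying to implement (a sketch that assumes a block-uniform base pointer meta; sm and p are placeholder names):

__shared__ double sm[GPUMAXCOVARIATECOUNT];
for (unsigned p = 0; p < samplecnt; p++) {
    // Cooperative load: thread t copies covariate t of patient p.
    if (threadIdx.x < GPUMAXCOVARIATECOUNT) {
        sm[threadIdx.x] = meta[p * GPUMAXCOVARIATECOUNT + threadIdx.x];
    }
    __syncthreads(); // the vector must be complete before anyone reads it
    // ... all threads read sm[...] for their gene ...
    __syncthreads(); // all reads must finish before the next patient overwrites sm
}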
Obviously the memory does not behave as I expect, because somehow the values are being overwritten.
__device__ void kernel(...) {
    // Iterate over all patients; metavecptr advances to the next
    // patient's metadata vector on each iteration.
    for (unsigned i = 0; i < samplecnt; i++, metavecptr += GPUMAXCOVARIATECOUNT) {
        // Cooperative load: each thread copies one covariate into shared memory.
        __shared__ double metavec[GPUMAXCOVARIATECOUNT];
        if (threadIdx.x < GPUMAXCOVARIATECOUNT) {
            metavec[threadIdx.x] = metavecptr[threadIdx.x];
        }
        __syncthreads();

        // Check for correctness: load the same vector into a per-thread
        // local array and compare it against the shared copy.
        double metavec2[GPUMAXCOVARIATECOUNT];
        for (int k = 0; k < numCovars; k++) {
            metavec2[k] = metavecptr[k];
        }
        if (metavec[1] != metavec2[1]) { // [1]: I varied this index in my tests
            printf("Error\n");
        }

        // Use metavec
    }
}
However, I have noticed that thread 0 finds no errors for index 0 but does for all others, thread 1 finds no errors for index 1 but for all others, and so on.
I don't understand what I'm doing wrong. Do I somehow have to take the number of blocks I launch into account when filling the shared memory? Am I missing a synchronisation, or are the threads' program counters not on the same patient at the same time?
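One thing I could still check (a hypothetical diagnostic, not yet in my code) is whether metavecptr is actually identical for all threads in the block, since the cooperative load only works if every thread loads from the same base pointer:

// Hypothetical sanity check: compare each thread's metavecptr
// against thread 0's copy of it.
__shared__ const double *base;
if (threadIdx.x == 0) {
    base = metavecptr;
}
__syncthreads();
if (metavecptr != base) {
    printf("thread %u: metavecptr differs from thread 0\n", threadIdx.x);
}
__syncthreads();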
I would appreciate any tips.