It is a mystery to me how shared memory on CUDA devices works. I was curious to count the threads that have access to the same shared memory, so I wrote a simple program:
#include <cuda_runtime.h>
#include <stdio.h>

#define nblc 13
#define nthr 1024

//------------------------@device--------------------
__device__ int inwarpD[nblc];

__global__ void kernel(){
    __shared__ int mywarp;
    mywarp = 0;
    for (int i = 0; i < 5; i++) mywarp += (10000*threadIdx.x + 1);
    __syncthreads();
    inwarpD[blockIdx.x] = mywarp;
}

//------------------------@host-----------------------
int main(int argc, char **argv){
    int inwarpH[nblc];
    cudaSetDevice(2);
    kernel<<<nblc, nthr>>>();
    cudaMemcpyFromSymbol(inwarpH, inwarpD, nblc*sizeof(int), 0, cudaMemcpyDeviceToHost);
    for (int i = 0; i < nblc; i++) printf("%i : %i\n", i, inwarpH[i]);
    return 0;
}
and ran it on a K80 GPU. Since several threads have access to the same shared memory variable, I was expecting that this variable would be updated 5*nthr times, albeit not in the same cycle, because of bank conflicts. However, the output indicates that the mywarp shared variable was updated only 5 times. In each block a different thread accomplished this task:
0 : 35150005
1 : 38350005
2 : 44750005
3 : 38350005
4 : 51150005
5 : 38350005
6 : 38350005
7 : 38350005
8 : 51150005
9 : 44750005
10 : 51150005
11 : 38350005
12 : 38350005
Instead, I was expecting
5*523776*10000 + 5*1024 = 26188805120
(523776 being the sum of threadIdx.x over 0..1023, and each thread adding its term 5 times)
for each block. Can someone kindly explain to me where my understanding of shared memory fails? I would also like to know how it would be possible for all threads in one block to access (update) the same shared variable. I know it is not possible in the same MP cycle; serialization is fine for me, because it is going to be a rare event.
Let's walk through the PTX that the compiler generates for this kernel. Because the loop body contains neither synchronization nor an atomic operation, the compiler keeps mywarp in a register for the whole loop: the five additions of (10000*threadIdx.x + 1) are folded into a single value, and each thread emits exactly one shared-memory store of 5*(10000*threadIdx.x + 1) at the end. So basically the read-modify-write loop is being optimized down to one plain store per thread: 1024 threads each write their own final value to the same address, and whichever store lands last wins. You can see this in the output: every printed value equals 5*(10000*t + 1) for some thread t, e.g. 38350005 = 5*(10000*767 + 1) and 51150005 = 5*(10000*1023 + 1). So you're not experiencing a bank conflict. You are experiencing a race condition.
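To get the accumulate-across-all-threads behaviour the question asks for, the update has to be an atomic read-modify-write. Here is a sketch of the kernel rewritten with atomicAdd (serialized, which the question says is acceptable); note two further assumptions: the initialization must be done by a single thread and followed by __syncthreads(), since the original `mywarp = 0` executed by every thread is itself a race, and the accumulator must be 64-bit because the true total (26188805120) overflows a 32-bit int:

```cuda
#define nblc 13

__device__ unsigned long long inwarpD[nblc];

__global__ void kernel(){
    __shared__ unsigned long long mywarp;
    // Single-thread initialization; zeroing from every thread would race
    if (threadIdx.x == 0) mywarp = 0;
    __syncthreads();
    for (int i = 0; i < 5; i++)
        atomicAdd(&mywarp, 10000ULL * threadIdx.x + 1);  // serialized read-modify-write
    __syncthreads();
    // One writer per block copies the final sum out to global memory
    if (threadIdx.x == 0) inwarpD[blockIdx.x] = mywarp;
}
```

With this version each block should report 26188805120 (the 64-bit atomicAdd overload requires compute capability 1.2+, which the K80, sm_37, satisfies).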