I have 5 large arrays A(N*5), B(N*5), C(N*5), D(N*5), E(N*2); the numbers 5 and 2 represent the components of these variables in different planes/axes. I have structured the arrays this way so that I can visualize the data while writing my code. N ~ 200^3 ~ 8e6 nodes.
For example, this is what my kernel looks like in its simplest form, where I do all the calculations in global memory.
#define N (200*200*200)
__global__ void kernel(double *A, double *B, double *C,
                       double *D, double *E, double *res1, double *res2,
                       double *res3, double *res4)
{
    int a, idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= N) { return; }

    res1[idx] = 0.; res2[idx] = 0.;
    res3[idx] = 0.; res4[idx] = 0.;

    for (a = 0; a < 5; a++)
    {
        res1[idx] += A[idx*5+a]*B[idx*5+a]+C[idx*5+a];
        res2[idx] += D[idx*5+a]*C[idx*5+a]+E[idx*2+0];
        res3[idx] += E[idx*2+0]*D[idx*5+a]-C[idx*5+a];
        res4[idx] += C[idx*5+a]*E[idx*2+1]-D[idx*5+a];
    }
}
I know the "for" loop can be eliminated, but I have left it here because it makes the code easier to read. This works, but it is obviously extremely inefficient and slow on a Tesla K40 card, even after removing the "for" loop. The arithmetic shown inside the "for" loop is just to give an idea; the actual calculations are much longer and more convoluted, with res1, res2, ... also getting into the mix.
I have implemented the following with limited improvement, but I would like to improve it further by loading more of the data into shared memory.
#define THREADS_PER_BLOCK 256
__global__ void kernel_shared(double *A, double *B, double *C,
                              double *D, double *E, double *res1, double *res2,
                              double *res3, double *res4)
{
    int a, idx = threadIdx.x + blockIdx.x * blockDim.x;
    int ix = threadIdx.x;

    __shared__ double A_sh[5*THREADS_PER_BLOCK];
    __shared__ double B_sh[5*THREADS_PER_BLOCK];
    __shared__ double C_sh[5*THREADS_PER_BLOCK];
    __shared__ double D_sh[5*THREADS_PER_BLOCK];
    __shared__ double E_sh[2*THREADS_PER_BLOCK];

    // Of course this will not work for all arrays in shared memory,
    // so I am allowed to put any 2 or 3 variables (A's and B's) of
    // my choice in shared memory and leave the rest in global memory.

    for (int a = 0; a < 5; a++)
    {
        A_sh[ix*5 + a] = A[idx*5 + a];
        B_sh[ix*5 + a] = B[idx*5 + a];
    }
    __syncthreads();

    if (idx >= N) { return; }

    res1[idx] = 0.; res2[idx] = 0.;
    res3[idx] = 0.; res4[idx] = 0.;

    for (a = 0; a < 5; a++)
    {
        res1[idx] += A_sh[ix*5+a]*B_sh[ix*5+a]+C[idx*5+a];
        res2[idx] += B_sh[ix*5+a]*C[idx*5+a]+E[idx*2+0];
        res3[idx] += E[idx*2+0]*D[idx*5+a]-C[idx*5+a];
        res4[idx] += B_sh[ix*5+a]*E[idx*2+1]-D[idx*5+a];
    }
}
This helps a little, but I would like to implement one of those reduction approaches (without bank conflicts) to improve performance, where I can put all my variables in shared memory (maybe a tiling approach) and then do the calculation part. I saw the reduction example in the CUDA samples folder, but that example sums over just one vector in shared memory, without any complex arithmetic involving multiple arrays. I would appreciate any help or suggestions to improve my existing kernel_shared approach to include a reduction approach.
1. What you need is not shared memory
Examining your initial kernel, we notice that for every value of a, you're using at most 12 values in your computation of the four deltas to add up (probably fewer than 12, I didn't count exactly). That all fits perfectly well in your register file, even for double values: 12 * sizeof(double), plus 4 * sizeof(double) for intermediate results, makes 32 4-byte registers per thread. Well within the limit even if you have 1024 threads per block. Now, the reasons your kernel runs slowly are mainly:
2. Suboptimal memory access patterns
This is something you can read about in any presentation of CUDA programming; I'll just say briefly that instead of each thread handling several consecutive array elements on its own, you should interleave the work among the lanes of a warp, or better yet the threads of a block. Thus, instead of the thread with global index idx handling elements

idx*5 + 0, idx*5 + 1, ... , idx*5 + 4

have it handle elements

idx, idx + T, idx + 2*T, ... (where T is the total number of threads, blockDim.x * gridDim.x)

so that whenever threads read or write, their reads and writes coalesce. In your case this might be a little trickier, because some of your accesses have a slightly different pattern, but you get the idea.
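If it helps, here is a toy pair of kernels contrasting the two access patterns. This is my own example, not your computation, and every name in it (scale_chunked, scale_interleaved, in, out, num_nodes, num_elements) is made up purely for illustration:

// Uncoalesced: each thread walks its own chunk of 5 consecutive doubles,
// so at each loop step neighbouring lanes touch addresses 40 bytes apart.
__global__ void scale_chunked(const double *in, double *out, int num_nodes)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= num_nodes) return;
    for (int a = 0; a < 5; a++)
        out[idx*5 + a] = 2.0 * in[idx*5 + a];   // stride of 5 doubles between lanes
}

// Coalesced: the work is interleaved so that, at each loop step,
// neighbouring lanes touch neighbouring doubles.
__global__ void scale_interleaved(const double *in, double *out, int num_elements)
{
    int stride = blockDim.x * gridDim.x;
    for (int e = threadIdx.x + blockIdx.x * blockDim.x; e < num_elements; e += stride)
        out[e] = 2.0 * in[e];                   // lanes read/write adjacent doubles
}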
3. Excessive addition to locations in global memory
This issue is more specific to your case. You see, you really don't need to change the resN[idx] value in global memory after every one of the additions, and you certainly don't care about reading the value that's in there whenever you're about to write. As your kernel stands, a single thread computes the new value for resN[idx] - so it can just add things up in a register, and write to resN[idx] once it's done (without ever reading that address).

If you change your memory access pattern as I've suggested in point 2., implementing the suggestion in point 3. becomes trickier, since you'll need to add up values from multiple lanes in the same warp, and perhaps make sure you don't cross warp boundaries with the reads relevant to a single computation. To learn how to do that I suggest you have a look at this presentation about shuffle-based reductions.