CUDA: overloading of shared memory to implement reduction approach with multiple arrays


I have 5 large arrays A(N*5), B(N*5), C(N*5), D(N*5), E(N*2), where 5 and 2 are the number of components of these variables in different planes/axes. I have structured the arrays in this fashion so I can visualize the data while writing my code. N ~ 200^3 ~ 8e6 nodes.

For example, this is what my kernel looks like in its simplest form, where I am doing all the calculations in global memory.

#define N (200*200*200)
__global__ void kernel(double *A, double *B, double *C,
                       double *D, double *E, double *res1, double *res2,
                       double *res3, double *res4)
{
    int a, idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= N) { return; }
    res1[idx] = 0.; res2[idx] = 0.;
    res3[idx] = 0.; res4[idx] = 0.;

    for (a = 0; a < 5; a++)
    {
        res1[idx] += A[idx*5+a]*B[idx*5+a]+C[idx*5+a];
        res2[idx] += D[idx*5+a]*C[idx*5+a]+E[idx*2+0];
        res3[idx] += E[idx*2+0]*D[idx*5+a]-C[idx*5+a];
        res4[idx] += C[idx*5+a]*E[idx*2+1]-D[idx*5+a];
    }
}

I know that "for" loop can be eliminated but I have left it here as it is convenient to look at the code. This works but obviously it is extremely inefficient and slow for a Tesla K40 card even after removing the "for" loop. The arithmetics shown inside the "for" loop is just to give an idea, the actual calculations are much longer and convoluted with res1,res2... also getting in the mix.

I have implemented the following, with limited improvement, but I would like to improve it further with overloading of shared memory.

    #define THREADS_PER_BLOCK 256
    __global__ void kernel_shared(double *A, double *B, double *C,
                                  double *D, double *E, double *res1, double *res2,
                                  double *res3, double *res4)
    {
        int a, idx = threadIdx.x + blockIdx.x * blockDim.x;
        int ix = threadIdx.x;
        __shared__ double A_sh[5*THREADS_PER_BLOCK];
        __shared__ double B_sh[5*THREADS_PER_BLOCK];
        __shared__ double C_sh[5*THREADS_PER_BLOCK];
        __shared__ double D_sh[5*THREADS_PER_BLOCK];
        __shared__ double E_sh[2*THREADS_PER_BLOCK];

        // Of course this will not work for all arrays in shared memory;
        // so I am allowed to put any 2 or 3 variables (As & Bs) of
        // my choice in shared and leave the rest in global memory.

        for (int a = 0; a < 5; a++)
        {
            A_sh[ix*5 + a] = A[idx*5 + a];
            B_sh[ix*5 + a] = B[idx*5 + a];
        }
        __syncthreads();

        if (idx >= N) { return; }
        res1[idx] = 0.; res2[idx] = 0.;
        res3[idx] = 0.; res4[idx] = 0.;
        for (a = 0; a < 5; a++)
        {
            res1[idx] += A_sh[ix*5+a]*B_sh[ix*5+a]+C[idx*5+a];
            res2[idx] += B_sh[ix*5+a]*C[idx*5+a]+E[idx*2+0];
            res3[idx] += E[idx*2+0]*D[idx*5+a]-C[idx*5+a];
            res4[idx] += B_sh[ix*5+a]*E[idx*2+1]-D[idx*5+a];
        }
    }

This helps a little, but I would like to implement one of those reduction approaches (without bank conflicts) to improve performance, where I can put all my variables in shared memory (maybe a tiling approach) and then do the calculation part. I saw the reduction example in the CUDA samples folder, but that example sums over only one vector in shared memory, without any complex arithmetic involving multiple arrays from shared memory. I would appreciate any help or suggestions to improve my existing kernel_shared approach to include the reduction approach.

1 Answer

1. What you need is not shared memory

Examining your initial kernel, we notice that for every value of a, you're using at most 12 values in your computation of the four deltas to add up (probably fewer than 12; I didn't count exactly). That all fits perfectly well in your register file, even for double values: 12 * sizeof(double) plus 4 * sizeof(double) for the intermediate results makes 32 4-byte registers per thread, well within the limit even with 1024 threads per block.

Now, the main reasons your kernel runs slowly are the following:

2. Suboptimal memory access patterns

This is something you can read about in any introduction to CUDA programming, so I'll just say it briefly: instead of each thread handling several consecutive array elements on its own, you should interleave the elements among the lanes of a warp, or better yet the threads of a block. Thus, instead of the thread with global index idx handling

5 * idx
5 * idx + 1
...
5 * idx + 4

have it handle

5 * blockDim.x * blockIdx.x + threadIdx.x
5 * blockDim.x * blockIdx.x + threadIdx.x + blockDim.x
...
5 * blockDim.x * blockIdx.x + threadIdx.x + 4 * blockDim.x

so that whenever threads read or write, their reads and writes coalesce. In your case this might be a little trickier because some of your accesses have a slightly different pattern, but you get the idea.
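To make that concrete, here is a rough sketch of such a coalesced load for A alone (the register array and variable names are my own illustration; keeping track of which node each loaded value belongs to is exactly the bookkeeping the closing paragraph about shuffle-based reductions deals with):

    // Illustration only: each thread loads 5 elements of A that are blockDim.x
    // apart, so that in every iteration of k, consecutive threads in a warp read
    // consecutive addresses and the global loads coalesce.
    int base = 5 * blockDim.x * blockIdx.x + threadIdx.x;
    double a_reg[5];
    for (int k = 0; k < 5; k++)
    {
        a_reg[k] = A[base + k * blockDim.x];
    }
    // Note: a_reg[k] holds component (base + k*blockDim.x) % 5 of node
    // (base + k*blockDim.x) / 5, so a node's per-component sum now spans several
    // lanes -- that is where the cross-lane combination below comes in.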

3. Excessive addition to locations in global memory

This issue is more specific to your case. You really don't need to update the resN[idx] value in global memory after every one of the additions, and you certainly don't care about reading the value that's there whenever you're about to write. As your kernel stands, a single thread computes the new value of resN[idx], so it can just add things up in a register and write to resN[idx] once it's done (without ever reading its previous value).
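In code, that change to the question's original loop could look roughly like this (only a sketch; r1..r4 are my names for the per-thread accumulators):

    // Sketch: accumulate in registers, then store each result to global memory once.
    double r1 = 0., r2 = 0., r3 = 0., r4 = 0.;
    for (int a = 0; a < 5; a++)
    {
        r1 += A[idx*5+a]*B[idx*5+a] + C[idx*5+a];
        r2 += D[idx*5+a]*C[idx*5+a] + E[idx*2+0];
        r3 += E[idx*2+0]*D[idx*5+a] - C[idx*5+a];
        r4 += C[idx*5+a]*E[idx*2+1] - D[idx*5+a];
    }
    res1[idx] = r1; res2[idx] = r2;  // one global store per result, not one per iteration
    res3[idx] = r3; res4[idx] = r4;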


If you change your memory access pattern as I've suggested in point 1., implementing the suggestion in point 2. becomes trickier, since you'll need to add up values from multiple lanes in the same warp, and perhaps make sure you don't cross warp boundaries with the reads relevant to a single computation. To learn how to do that I suggest you have a look at this presentation about shuffle-based reductions.
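For a flavor of what such a shuffle-based combination looks like, here is the standard warp-level sum built on __shfl_down_sync (CUDA 9 or later; this is a generic helper, not tailored to the 5-components-per-node layout above, which still needs its own index bookkeeping):

    // Standard warp-sum pattern: combine 'val' across the 32 lanes of a warp
    // using register shuffles -- no shared memory, hence no bank conflicts.
    // After the loop, lane 0 of the warp holds the full sum.
    __inline__ __device__ double warp_reduce_sum(double val)
    {
        for (int offset = warpSize / 2; offset > 0; offset /= 2)
        {
            val += __shfl_down_sync(0xffffffff, val, offset);
        }
        return val;
    }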