The next kernel multiplies the matrices matA and matB and stores the result in the matrix matC (all matrices are N x N), using a shared memory region of dimensions tiledim x tiledim. Since this dimension is specified by the user at run time, dynamic shared memory allocation is used on the GPU, so after the tiledim variable is initialized the kernel is called in the form shown below, with a dynamic shared memory size of 2*tiledim*tiledim (one tiledim x tiledim tile for each of the matrices A and B):
matMultWithBlockShared <<<blocks, threads, 2*tiledim*tiledim>>>
                       (gpuA, gpuB, gpuC, N, tiledim);
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
    printf("CUDA Error: %s\n", cudaGetErrorString(err));
    exit(-1);
}
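Note that cudaGetLastError called right after the launch only reports errors detected at launch time (for example, an invalid launch configuration); an error raised while the kernel is executing surfaces later. A check that also synchronizes catches both cases; here is a minimal sketch (the wrapper name checkCuda is my own, not part of the application):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

void checkCuda (const char *msg) {
    /* errors detected at launch time (invalid configuration, etc.) */
    cudaError_t err = cudaGetLastError();
    /* wait for the kernel to finish and pick up errors raised
       during its execution */
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA Error (%s): %s\n", msg, cudaGetErrorString(err));
        exit(-1);
    }
}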
The other kernel parameters are initialized as
threads.x=threads.y=tiledim;
blocks.x=blocks.y=32/tiledim;
(for tiledim = 8, for example, this gives an 8 x 8 thread block and a 4 x 4 grid), while the source code of the kernel is shown below.
__global__ void matMultWithBlockShared (double *matA, double *matB,
                                        double *matC, int N, int tiledim) {
    int row = blockIdx.y, column = blockIdx.x, tiles = N/tiledim;
    int tileRow = threadIdx.y, tileColumn = threadIdx.x;
    double sum = 0.0;

    /* carve the dynamic shared memory region into two tiles */
    extern __shared__ double shGPUMem[];
    double *sharedA = shGPUMem;
    double *sharedB = (double *)&sharedA[tiledim*tiledim];
    double *cellA, *cellB, *cellC = &matC[N*tiledim*row + tiledim*column];

    for (int i = 0; i < tiles; ++i) {
        /* top-left corners of the current tiles of A and B */
        cellA = &matA[N*tiledim*row + tiledim*i];
        cellB = &matB[N*tiledim*i + tiledim*column];
        /* each thread loads one element of each tile */
        sharedA[tileRow*tiledim + tileColumn] = cellA[tileRow*N + tileColumn];
        sharedB[tileRow*tiledim + tileColumn] = cellB[tileRow*N + tileColumn];
        __syncthreads();
        /* partial dot product over the current pair of tiles */
        for (int j = 0; j < tiledim; j++)
            sum += sharedA[tileRow*tiledim + j] * sharedB[j*tiledim + tileColumn];
        __syncthreads();
    }
    cellC[tileRow*N + tileColumn] = sum;
}
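For context, the surrounding host code follows the usual allocate / copy / launch / copy-back pattern. The sketch below reproduces the setup described above (the names hostA, hostB, hostC and the test values N = 32, tiledim = 8 are my own assumptions for illustration, not the actual application code):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void matMultWithBlockShared (double *, double *, double *,
                                        int, int);   /* defined above */

int main (void) {
    int N = 32, tiledim = 8;                 /* assumed test values */
    size_t bytes = (size_t)N * N * sizeof(double);
    double *hostA = (double *)malloc(bytes);
    double *hostB = (double *)malloc(bytes);
    double *hostC = (double *)malloc(bytes);
    for (int i = 0; i < N*N; i++) {          /* arbitrary test data */
        hostA[i] = 1.0;
        hostB[i] = 2.0;
    }
    double *gpuA, *gpuB, *gpuC;
    cudaMalloc(&gpuA, bytes);
    cudaMalloc(&gpuB, bytes);
    cudaMalloc(&gpuC, bytes);
    cudaMemcpy(gpuA, hostA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(gpuB, hostB, bytes, cudaMemcpyHostToDevice);
    cudaMemset(gpuC, 0, bytes);              /* matC starts at 0 */
    dim3 threads, blocks;
    threads.x = threads.y = tiledim;
    blocks.x = blocks.y = 32/tiledim;
    /* launch exactly as shown above */
    matMultWithBlockShared <<<blocks, threads, 2*tiledim*tiledim>>>
                           (gpuA, gpuB, gpuC, N, tiledim);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("CUDA Error: %s\n", cudaGetErrorString(err));
        exit(-1);
    }
    cudaMemcpy(hostC, gpuC, bytes, cudaMemcpyDeviceToHost);
    /* with A all 1.0 and B all 2.0, every element of C should be 2*N */
    printf("C[0][0] = %f (expected %f)\n", hostC[0], 2.0*N);
    cudaFree(gpuA); cudaFree(gpuB); cudaFree(gpuC);
    free(hostA); free(hostB); free(hostC);
    return 0;
}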
The code works correctly and behaves as expected (the larger the tiledim value, the shorter the execution time) for tiledim values of 1, 2, 4 and 8 (for simplicity, square matrices are used, with dimensions that are powers of 2 and at least 32). However, for the tiledim values 16 and 32 the program does not work: the values of matC remain at their initial value of 0, and the measured elapsed time is also zero. Careful debugging shows that the problem is associated with the lines
sharedA[tileRow*tiledim+tileColumn] = cellA[tileRow*N+tileColumn];
sharedB[tileRow*tiledim+tileColumn] = cellB[tileRow*N+tileColumn];
which appear not to execute at all; the kernel seems to abort at that point. For example, a printf placed before the first of these lines runs without problems, while a printf placed between the two lines never executes. I therefore suspect that something is wrong with these lines, but I can't figure out what, since the program works perfectly for tiledim values of 1, 2, 4 and 8. I don't think shared memory availability is the issue: as far as I have read, my NVIDIA RTX 3060 offers 48 KB of shared memory per block, and the space these tiles require is far smaller (for tiledim = 32, two tiles of 32 x 32 doubles occupy 2*32*32*8 = 16 KB). Moreover, the cudaGetLastError call after the kernel launch reports no error. Can anyone figure out what is happening? I can copy and paste the complete application code if necessary. Thanks a lot!
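For reference, the actual per-block shared memory limit can be queried directly from the runtime API; a minimal sketch (assuming device 0):

#include <stdio.h>
#include <cuda_runtime.h>

int main (void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   /* properties of device 0 */
    /* default static limit per block; larger dynamic allocations
       require an explicit opt-in */
    printf("Shared memory per block: %zu bytes\n",
           prop.sharedMemPerBlock);
    return 0;
}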
In short, the program works correctly for small tiledim values (1, 2, 4, 8) but not for the values 16 and 32, and I would appreciate some help in solving the issue.