Reading garbage values from shared memory in CUDA


I am trying to implement a 2D Navier-Stokes solver in CUDA, using Jacobi's method to solve the system of difference equations. I divide the 64x64 matrix into a 4x4 grid of blocks, each consisting of 16x16 threads. Since every inner point needs its top, bottom, left and right neighbours to compute its new value, each block loads an 18x18 shared-memory tile. The values are read in as follows: the thread with indices (0, 0) writes its own value into element (1, 1) of the tile, and also reads the element above it and the one to its left if those accesses do not cross the domain boundary. Once the reads are done, I update all the internal points and write them back to global memory.
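
Concretely, the Jacobi update each interior point performs (this is just the five-point stencil the kernel below implements, with grid spacing h and vorticity omega) is:

    psi_new[i][j] = 0.25 * (psi[i+1][j] + psi[i-1][j] + psi[i][j+1] + psi[i][j-1] + h * h * omega[i][j]);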

I end up getting garbage values in the matrix pn, even though all the values are initialized correctly. I honestly cannot see where I'm going wrong. Can someone help me with this?

My kernel:

__global__ void red_psi (float *psi_o, float *psi_n, float *e, float *omega, float l1)
{
    // m = n = 64
    int i1 = blockIdx.x;
    int j1 = blockIdx.y;
    int i2 = threadIdx.x;
    int j2 = threadIdx.y;
    int i = (i1 * blockDim.x) + i2; // Actual row of the element
    int j = (j1 * blockDim.y) + j2; // Actual column of the element
    int l = i * n + j;

    // e_XX --> these variables refer to locations in the expanded shared-memory tile, which accommodates the halo elements
    // Current local ID with radius (halo) offset.
    int e_li = i2 + 1;
    int e_lj = j2 + 1;

    // Variables pointing at the top and bottom neighbouring locations
    int e_li_prev = e_li - 1;
    int e_li_next = e_li + 1;

    // Variables pointing at the left and right neighbouring locations
    int e_lj_prev = e_lj - 1;
    int e_lj_next = e_lj + 1;

    __shared__ float po[BLOCK_SIZE + 2][BLOCK_SIZE + 2];
    __shared__ float pn[BLOCK_SIZE + 2][BLOCK_SIZE + 2];
    __shared__ float oo[BLOCK_SIZE + 2][BLOCK_SIZE + 2];
    //__shared__ float ee[BLOCK_SIZE + 2][BLOCK_SIZE + 2];

    if (i2 < 1) // copy top and bottom halo
    {
        //Copy Top Halo Element
        if (blockIdx.y > 0) // Boundary check
        {
            po[i2][e_lj] = psi_o[l - n];
            //pn[i2][e_lj] = psi_n[l - n];
            oo[i2][e_lj] = omega[l - n];
            //printf ("i_pn[%d][%d] = %f\n", i2, e_lj, oo[i2][e_lj]);
        }

        //Copy Bottom Halo Element
        if (blockIdx.y < (gridDim.y - 1)) // Boundary check
        {
            po[1 + BLOCK_SIZE][e_lj] = psi_o[l + n];
            //pn[1 + BLOCK_SIZE][e_lj] = psi_n[l + n];
            oo[1 + BLOCK_SIZE][e_lj] = omega[l + n];
            //printf ("j_pn[%d][%d] = %f\n", 1 + BLOCK_SIZE, e_lj, oo[1 + BLOCK_SIZE][e_lj]);
        }

    }

    if (j2 < 1) // copy left and right halo
    {
        if (blockIdx.x > 0) // Boundary check
        {
            po[e_li][j2] = psi_o[l - 1];
            //pn[e_li][j2] = psi_n[l - 1];
            oo[e_li][j2] = omega[l - 1];
            //printf ("k_pn[%d][%d] = %f\n", e_li, j2, oo[e_li][j2]);
        }

        if (blockIdx.x < (gridDim.x - 1)) // Boundary check
        {
            po[e_li][1 + BLOCK_SIZE] = psi_o[l + 1];
            //pn[e_li][1 + BLOCK_SIZE] = psi_n[l + 1];
            oo[e_li][1 + BLOCK_SIZE] = omega[l + 1];
            //printf ("l_pn[%d][%d] = %f\n", e_li, 1 + BLOCK_SIZE, oo[e_li][BLOCK_SIZE + 1]);
        }
    }

    // copy current location
    po[e_li][e_lj] = psi_o[l];
    //pn[e_li][e_lj] = psi_n[l];
    oo[e_li][e_lj] = omega[l];
    //printf ("o_pn[%d][%d] = %f\n", e_li, e_lj, oo[e_li][e_lj]);

    __syncthreads ();

    // Checking whether we have an internal point.
    if ((i >= 1 && i < (m - 1)) && (j >= 1 && j < (n - 1)))
    {
        //printf ("Calculating for - (%d, %d)\n", i, j);
        pn[e_li][e_lj] = 0.25 * (po[e_li_next][e_lj] + po[e_li_prev][e_lj] + po[e_li][e_lj_next] + po[e_li][e_lj_prev] + h*h*oo[e_li][e_lj]);
        //printf ("n_pn[%d][%d] (%d, %d), a(%d, %d) = %f\n", e_li_prev, e_lj, i1, j1, i, j, po[e_li_prev][e_lj]);
        pn[e_li][e_lj] = po[e_li][e_lj] + 1.0 * (pn[e_li][e_lj] - po[e_li][e_lj]);

        __syncthreads ();

        psi_n[l] = pn[e_li][e_lj];
        e[l] = po[e_li][e_lj] - pn[e_li][e_lj];
    }
}

This is how I invoke the kernel:

dim3 threadsPerBlock (4, 4);
dim3 numBlocks (4, 4);
red_psi<<<numBlocks, threadsPerBlock>>> (d_xn, d_xx, d_e, d_w, l1);

d_xx, d_xn, d_e and d_w are all float arrays of size 4096.
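
For comparison, a launch that matches the 4x4 grid of 16x16-thread blocks described above would look like this (a minimal sketch, assuming BLOCK_SIZE is 16 and m = n = 64 as in the kernel):

    dim3 threadsPerBlock (BLOCK_SIZE, BLOCK_SIZE);   // 16x16 threads per block
    dim3 numBlocks (m / BLOCK_SIZE, n / BLOCK_SIZE); // 4x4 blocks covering the 64x64 grid
    red_psi<<<numBlocks, threadsPerBlock>>> (d_xn, d_xx, d_e, d_w, l1);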

1 Answer

It turned out I had swapped the x and y block indices when copying the top / bottom and the left / right halo elements: the top/bottom copies were guarded by blockIdx.y / gridDim.y and the left/right copies by blockIdx.x / gridDim.x, even though in this kernel the row index i is built from blockIdx.x and the column index j from blockIdx.y.
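
For anyone who runs into the same thing, here is a sketch of the halo copies with the guards matched to the row/column mapping used in the kernel (row i comes from blockIdx.x, column j from blockIdx.y), assuming blockDim.x == blockDim.y == BLOCK_SIZE. Besides swapping the block indices, it also reads the bottom and right halos from just outside the tile (offsets BLOCK_SIZE * n and BLOCK_SIZE) rather than from the adjacent row/column; only the po copies are shown, the oo copies are analogous:

    if (i2 < 1) // threads in the block's first row copy the top and bottom halo rows
    {
        if (blockIdx.x > 0)                 // a block above exists
            po[0][e_lj] = psi_o[l - n];                           // row just above the tile
        if (blockIdx.x < (gridDim.x - 1))   // a block below exists
            po[BLOCK_SIZE + 1][e_lj] = psi_o[l + BLOCK_SIZE * n]; // row just below the tile
    }

    if (j2 < 1) // threads in the block's first column copy the left and right halo columns
    {
        if (blockIdx.y > 0)                 // a block to the left exists
            po[e_li][0] = psi_o[l - 1];                           // column just left of the tile
        if (blockIdx.y < (gridDim.y - 1))   // a block to the right exists
            po[e_li][BLOCK_SIZE + 1] = psi_o[l + BLOCK_SIZE];     // column just right of the tile
    }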