CUDALink: Launching multiple blocks in z-dimension

177 Views Asked by At

Im currently running a simple 3D stencil transformation on my GPU (GTX560Ti) using the CUDALink wrapper provided by Mathematica. Block dimensions don't really matter to me right now, because I'm not using any shared memory or looking for optimization (right now).

Thus I can set any reasonable number for blockDim.x and blockDim.y. The wrapper will launch the appropriate amount of blocks no matter what dimension I set, no problem. However in the z-dimension only a single block is launched. Thus blockDim.z limits the total amount of points I can calculate in that direction.

Why is there only one block in z-direction? How can I work around that?

For reference, here is the kernel I'm using:

__global__ void conv(Real_t in[48][48][48], Real_t out[48][48][48], mint stencil[13][13][13], mint length, mint rad) {
    int x = threadIdx.x + blockIdx.x*blockDim.x;
    int y = threadIdx.y + blockIdx.y*blockDim.y;
    int z = threadIdx.z + blockIdx.z*blockDim.z;
    while (x<length||y<length||z<length) {
        out[x][y][z] = 0;
        for (int ix = -rad; ix <= rad; ix++) {
        for (int iy = -rad; iy <= rad; iy++) {
        for (int iz = -rad; iz <= rad; iz++) {
            if ( (fminf(x,fminf(y,z))-rad >= 0)
                && (fmaxf(x,fmaxf(y,z))+rad < length) )
                {out[x][y][z] += stencil[ix+rad][iy+rad][iz+rad]*in[ix+x][iy+y][iz+z];}
        }   }   }
        if (x<length) {
            x+= blockDim.x * gridDim.x;
        } else if (y<length) {
            y+= blockDim.y * gridDim.y;
        } else if (z<length) {
            z+= blockDim.z * gridDim.z;
        }
    }
}

Please note: The variable length corresponds to the dimensions of the arrays (eg 48). rad is related to the stencil and smaller than length. stencil is just an array of 0s and 1s to select the stuff from in I want to sum up into out.

I'm running the kernel in Mathematica using the following code:

Needs["CUDALink`"];
conv = CUDAFunctionLoad[code (*the kernel above, stored as a string*), "conv", {{_Real, _, "Input"}, {_Real, _, "Output"}, {_Integer , _, "Input"}, _Integer, _Integer}, {4, 4, 10}, "TargetPrecision" -> "Single", "XCompilerInstallation" -> "/usr/local/gcc44/bin/", "CleanIntermediate" -> False];
output = ConstantArray[1, {length, length, length}];
result =  conv[input, output, stencil, length, rad];

To illustrate my problem, here is a slice of my output (apparently I cant post images):

0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.
0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.
0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.
0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.
0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.
0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.
0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.
0.  0.  0.  0.  0.  0.  0.  0.  0.000578704 0.00173611  1.  1.  1.
0.  0.  0.  0.  0.  0.  0.  0.000289352 0.000868056 0.00173611  1.  1.  1.
0.  0.  0.  0.  0.  0.  0.  0.000578704 0.00144676  0.00260417  1.  1.  1.
0.  0.  0.  0.  0.  0.  0.  0.00115741  0.00202546  0.00347222  1.  1.  1.
0.  0.  0.  0.  0.  0.  0.  0.00115741  0.00202546  0.00347222  1.  1.  1.
0.  0.  0.  0.  0.  0.  0.  0.000578704 0.00144676  0.00289352  1.  1.  1.
0.  0.  0.  0.  0.  0.  0.  0.000578704 0.00144676  0.00289352  1.  1.  1.

This was produced with blockDim.z = 10. The zeroes and fractions are useful values, but the ones are just the values I initialized the out array with. Only the first 10 columns are computed, corresponding to a single block in z-direction. ( This behaviour is reproducible for any value of blockDim.z between 1 and 64 (the upper limit for Fermi GPUs).

1

There are 1 best solutions below

0
On

OK, I guess this behaviour was just a bug in CUDAResources and not an actual programming issue. (Still, there is only one block though. What I have now is a workaround.)

I removed CUDAResources with CUDAResourcesUninstall[] , restarted Mathematica, reinstalled using CUDAResourcesInstall["/path/to/paclet/file",Update->True] and restarted Mathematica again.

Then I changed my kernel to the following code:

__global__ void conv(Real_t in[48][48][48], Real_t out[48][48][48], \
mint stencil[13][13][13], mint length, mint rad) {
    int x = threadIdx.x + blockIdx.x*blockDim.x;
    int y = threadIdx.y + blockIdx.y*blockDim.y;
    int z = threadIdx.z + blockIdx.z*blockDim.z;
    while (z<length) {
        out[x][y][z] = 0;
        for (int ix = -rad; ix <= rad; ix++) {
        for (int iy = -rad; iy <= rad; iy++) {
        for (int iz = -rad; iz <= rad; iz++) {
            if ( (fminf(x,fminf(y,z))-rad >= 0)
                && (fmaxf(x,fmaxf(y,z))+rad < length) )
                {out[x][y][z] += stencil[ix+rad][iy+rad][iz+rad]*in[ix+x][iy+y][iz+z];}
        }   }   }
        if (z<length) {
            z+= blockDim.z * gridDim.z;
        }
    }
}

And now it works. Hopefully it stays that way. This of course means that there is less parallelisation going on in z-direction, because there is basically one block of threads going sequentially over the grid instead of multiple blocks doing work in parallel. But that's fine, the code is fast enough for my purposes.

Big thanks to everyone who helped.