CUDA (JCUDA) shared memory (?) problems / undefined behaviour


I'm working on my game project (tower defense) and I'm trying to compute the distance between all critters and a tower with JCuda using shared memory. For each tower I run 1 block with N threads, where N equals the number of critters on the map. Each block computes the distance between all critters and its tower, and I store the smallest distance found so far in the block's shared memory. My current code looks like this:

extern "C"
__global__ void calcDistance(int** globalInputData, int size, int critters,
                             int** globalQueryData, int* globalOutputData) {

  //shared memory
  __shared__ float minimum[2];

  int x = threadIdx.x  + blockIdx.x * blockDim.x;
  int y = blockIdx.y;

  if (x < critters) {

    int distance = 0;
    //Calculate the distance between tower and critter
    for (int i = 0; i < size; i++) {
      int d = globalInputData[x][i] - globalQueryData[y][i];
      distance += d * d;
    }

    if (x == 0) {
      minimum[0] = distance;
      minimum[1] = x;
    }

    if (distance < minimum[0]) {
      minimum[0] = distance;
      minimum[1] = x;
    }

    __syncthreads();

    globalOutputData[y * 2]     = minimum[0];
    globalOutputData[y] = minimum[1];
  }
}

The problem is that if I rerun the code with the same input multiple times (I free all the memory on both host and device after each run), I get different output on each run for block (tower) numbers > 27. I'm fairly sure it has something to do with shared memory and the way I'm dealing with it, as rewriting the code to use global memory gives the same result on every run. Any ideas?


There is 1 best solution below


There is a memory race (a read-after-write correctness problem) in the kernel here:

   if (distance < minimum[0]) {
     minimum[0] = distance;
     minimum[1] = x;

When executed, every thread in the block will try to read and write the value of minimum simultaneously. There are no guarantees about what happens when multiple threads in a warp write to the same shared memory location, and there are no guarantees about what values other warps in the same block will read when loading from a location that is being written. Memory access is not atomic, and there is no locking or serialization that would ensure the code performs the type of reduction operation you seem to be trying to do.

A milder version of the same problem applies to the write back to global memory at the end of the kernel:

   globalOutputData[y * 2]     = minimum[0];
   globalOutputData[y] = minimum[1];

The barrier before the writes ensures that the writes to minimum have completed, so that a "final" (although inconsistent) value is stored in minimum, but then every thread in the block will execute the write.

If your intention is to have each thread compute a distance, and then for the minimum of the distance values over the block to get written out to global memory, you will have to either use atomic memory operations (for shared memory these require a device of compute capability 1.2 or later), or write an explicit shared memory reduction. After that, only one thread should execute the write back to global memory.
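As a rough illustration of the explicit shared memory reduction route, here is a minimal sketch. It is not the asker's kernel: the flat `const int*` arrays, the name `nearestCritter`, the `BLOCK_THREADS` constant, and the `{minDistance, critterIndex}` output layout are all assumptions made for the example. Each thread keeps a private minimum over a strided subset of critters, the block then reduces those minima in shared memory, and only thread 0 writes the result.

```cuda
#include <climits>
#include <cuda_runtime.h>

// Hypothetical block size; must be a power of two for the halving loop below,
// and the kernel must be launched with exactly this many threads per block.
#define BLOCK_THREADS 256

// Assumed layout (not from the question): critterPos and towerPos are flat
// arrays of size-dimensional points, one block per tower, and out holds one
// {minDistance, critterIndex} pair per tower.
extern "C" __global__ void nearestCritter(const int* critterPos, int size,
                                          int critters, const int* towerPos,
                                          int* out)
{
    __shared__ int sDist[BLOCK_THREADS];
    __shared__ int sIdx[BLOCK_THREADS];

    int tid = threadIdx.x;
    int tower = blockIdx.x;

    // Each thread scans a strided subset of critters and keeps a private
    // minimum, so any number of critters works with a fixed block size.
    int best = INT_MAX;
    int bestIdx = -1;
    for (int c = tid; c < critters; c += blockDim.x) {
        int dist = 0;
        for (int i = 0; i < size; i++) {
            int d = critterPos[c * size + i] - towerPos[tower * size + i];
            dist += d * d;
        }
        if (dist < best) { best = dist; bestIdx = c; }
    }
    sDist[tid] = best;
    sIdx[tid] = bestIdx;
    __syncthreads();  // every thread in the block reaches this barrier

    // Tree reduction: halve the number of active threads at each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride && sDist[tid + stride] < sDist[tid]) {
            sDist[tid] = sDist[tid + stride];
            sIdx[tid] = sIdx[tid + stride];
        }
        __syncthreads();  // outside the if, so no thread can skip it
    }

    // Only one thread performs the write back to global memory.
    if (tid == 0) {
        out[tower * 2] = sDist[0];
        out[tower * 2 + 1] = sIdx[0];
    }
}
```

In this sketch the launch uses one block per tower and `BLOCK_THREADS` threads per block, for example `nearestCritter<<<towers, BLOCK_THREADS>>>(...)` from CUDA C; from JCuda the same grid and block dimensions would be passed to the kernel launch. Because no thread writes a shared slot that another thread reads without a barrier in between, the result should be the same on every run.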

Finally, you also have a potential synchronization correctness problem that could cause the kernel to hang. __syncthreads() (which maps to the PTX bar instruction) demands that every thread in the block arrive and execute the instruction prior to the kernel continuing. Having this sort of control flow:

 if (x < critters) {

will cause the kernel to hang if some threads in the block can branch around the barrier and exit while others wait at the barrier. There should never be any branch divergence around a __syncthreads() call to ensure execution correctness of a kernel in CUDA.
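One common way to avoid this is to guard the work rather than the barrier. The sketch below is a hypothetical kernel (the name `blockMinGuarded`, the precomputed `dist` array, and the 256-thread limit are assumptions for illustration, not part of the original code): out-of-range threads contribute a neutral sentinel value instead of branching around `__syncthreads()`.

```cuda
#include <climits>
#include <cuda_runtime.h>

// Hypothetical kernel: each block finds the minimum of its slice of dist[].
// Assumes blockDim.x <= 256.
__global__ void blockMinGuarded(const int* dist, int n, int* out)
{
    __shared__ int vals[256];
    int tid = threadIdx.x;
    int x = tid + blockIdx.x * blockDim.x;

    // Guard the memory access, not the barrier: threads past the end of the
    // input store a sentinel that can never win the minimum.
    vals[tid] = (x < n) ? dist[x] : INT_MAX;

    __syncthreads();  // reached unconditionally by every thread in the block

    // A single thread scans the block's values and writes the result.
    // (A tree reduction would be faster; this keeps the sketch short.)
    if (tid == 0) {
        int m = INT_MAX;
        for (int i = 0; i < blockDim.x; i++) {
            if (vals[i] < m) m = vals[i];
        }
        out[blockIdx.x] = m;
    }
}
```

The only conditional code here comes after the barrier, so threads with `x >= n` never leave the others waiting.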

So, in summary, back to the drawing board on at least three issues in the current code.