The following host code, test.c, and device code, test0.cu, are intended to give the same result.
test.c
$ cat test.c
#include <stdio.h>
#include <string.h>
int main()
{
    int data[32];
    int dummy[32];
    for (int i = 0; i < 32; i++)
        data[i] = i;
    memcpy(dummy, data, sizeof(data));
    for (int i = 1; i < 32; i++)
        data[i] += dummy[i - 1];
    memcpy(dummy, data, sizeof(data));
    for (int i = 2; i < 32; i++)
        data[i] += dummy[i - 2];
    memcpy(dummy, data, sizeof(data));
    for (int i = 4; i < 32; i++)
        data[i] += dummy[i - 4];
    memcpy(dummy, data, sizeof(data));
    for (int i = 8; i < 32; i++)
        data[i] += dummy[i - 8];
    memcpy(dummy, data, sizeof(data));
    for (int i = 16; i < 32; i++)
        data[i] += dummy[i - 16];
    printf("kernel : ");
    for (int i = 0; i < 32; i++)
        printf("%4i ", data[i]);
    printf("\n");
}
$
test0.cu
$ cat test0.cu
#include <stdio.h>
__global__ void kernel0(int *data)
{
    size_t t_id = threadIdx.x;
    if (1 <= t_id)
        data[t_id] += data[t_id - 1];
    if (2 <= t_id)
        data[t_id] += data[t_id - 2];
    if (4 <= t_id)
        data[t_id] += data[t_id - 4];
    if (8 <= t_id)
        data[t_id] += data[t_id - 8];
    if (16 <= t_id)
        data[t_id] += data[t_id - 16];
}
int main()
{
    int data[32];
    int result[32];
    int *data_d;
    cudaMalloc(&data_d, sizeof(data));
    for (int i = 0; i < 32; i++)
        data[i] = i;
    dim3 gridDim(1);
    dim3 blockDim(32);
    cudaMemcpy(data_d, data, sizeof(data), cudaMemcpyHostToDevice);
    kernel0<<<gridDim, blockDim>>>(data_d);
    cudaMemcpy(result, data_d, sizeof(data), cudaMemcpyDeviceToHost);
    printf("kernel0 : ");
    for (int i = 0; i < 32; i++)
        printf("%4i ", result[i]);
    printf("\n");
}
$
If I compile and run them, they indeed give the same result, as I expected.
$ gcc -o test test.c
$ ./test
kernel : 0 1 3 6 10 15 21 28 36 45 55 66 78 91 105 120 136 153 171 190 210 231 253 276 300 325 351 378 406 435 465 496
$ nvcc -o test_dev0 test0.cu
$ ./test_dev0
kernel0 : 0 1 3 6 10 15 21 28 36 45 55 66 78 91 105 120 136 153 171 190 210 231 253 276 300 325 351 378 406 435 465 496
$
However, if I use shared memory instead of global memory in the device code, as in test1.cu, it gives a different result.
test1.cu
$ cat test1.cu
#include <stdio.h>
__global__ void kernel1(int *data)
{
    __shared__ int data_s[32];
    size_t t_id = threadIdx.x;
    data_s[t_id] = data[t_id];
    if (1 <= t_id)
        data_s[t_id] += data_s[t_id - 1];
    if (2 <= t_id)
        data_s[t_id] += data_s[t_id - 2];
    if (4 <= t_id)
        data_s[t_id] += data_s[t_id - 4];
    if (8 <= t_id)
        data_s[t_id] += data_s[t_id - 8];
    if (16 <= t_id)
        data_s[t_id] += data_s[t_id - 16];
    data[t_id] = data_s[t_id];
}
int main()
{
    int data[32];
    int result[32];
    int *data_d;
    cudaMalloc(&data_d, sizeof(data));
    for (int i = 0; i < 32; i++)
        data[i] = i;
    dim3 gridDim(1);
    dim3 blockDim(32);
    cudaMemcpy(data_d, data, sizeof(data), cudaMemcpyHostToDevice);
    kernel1<<<gridDim, blockDim>>>(data_d);
    cudaMemcpy(result, data_d, sizeof(data), cudaMemcpyDeviceToHost);
    printf("kernel1 : ");
    for (int i = 0; i < 32; i++)
        printf("%4i ", result[i]);
    printf("\n");
}
$
If I compile test1.cu and run it, it gives a different result from test0.cu and test.c.
$ nvcc -o test_dev1 test1.cu
$ ./test_dev1
kernel1 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
$
Is warp synchronization not supposed to work with shared memory?
Some investigation into this issue:
When using CUDA 8.0, if I compile test1.cu with the -arch=sm_61 option (I'm testing with a GTX 1080), it gives the same result as test0.cu and test.c.
$ nvcc -o test_dev1_arch -arch=sm_61 test1.cu
$ ./test_dev1_arch
kernel1 : 0 1 3 6 10 15 21 28 36 45 55 66 78 91 105 120 136 153 171 190 210 231 253 276 300 325 351 378 406 435 465 496
$
But this does not apply to newer versions of CUDA. With any version newer than 8.0, the result is different even if I pass the -arch=sm_61 option.
Your device code has undefined behavior due to race conditions in both cases, whether using shared memory or global memory: you have multiple threads that concurrently read and modify the same int object. I don't see any warp synchronization in your code.
The fact that the hardware executes warps in lock step (which is not necessarily true to begin with) is completely irrelevant, because it is not the hardware that reads your C++ code. It is whatever toolchain you use to translate your C++ code into the machine code that will actually run on the hardware. And C++ compilers are allowed to optimize based on the abstract rules of the C++ language.
Let's look at the machine code that's actually generated for your example (using CUDA 10 here on my machine).
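One way to inspect that machine code yourself (assuming the test_dev1 binary built as above) is to dump the SASS embedded in the executable:
$ cuobjdump -sass test_dev1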
What you will find is that the compiler (in this particular case, the "culprit" was actually the PTX assembler) has translated your sequence of ifs into a bunch of instructions that set up predicates based on the if conditions. It first fetches all the values it's ever going to need from shared memory into registers using conditional loads. Only after that does it perform all the additions and conditional stores, using the already-loaded values. This is a perfectly legal interpretation of your C++ code: since you did not specify any synchronization or memory-ordering constraints, the compiler can operate under the assumption that there are no potentially concurrent conflicts, and all these loads and stores can be reordered in whatever way it sees fit.
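Roughly speaking, the effect is as if the middle of kernel1 had been rewritten like this (a hypothetical C-level illustration of the reordering, not actual compiler output):
// Hypothetical illustration only: all conditional loads are issued first, into registers ...
int v1  = (1  <= t_id) ? data_s[t_id - 1]  : 0;
int v2  = (2  <= t_id) ? data_s[t_id - 2]  : 0;
int v4  = (4  <= t_id) ? data_s[t_id - 4]  : 0;
int v8  = (8  <= t_id) ? data_s[t_id - 8]  : 0;
int v16 = (16 <= t_id) ? data_s[t_id - 16] : 0;
// ... and only afterwards the additions and conditional stores,
// which therefore use stale neighbor values.
if (1  <= t_id) data_s[t_id] += v1;
if (2  <= t_id) data_s[t_id] += v2;
if (4  <= t_id) data_s[t_id] += v4;
if (8  <= t_id) data_s[t_id] += v8;
if (16 <= t_id) data_s[t_id] += v16;
Because every neighbor value is read before any thread has written its update, the inter-thread dependencies of the scan are lost.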
To fix your code, use explicit warp synchronization:
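For example, one possible fix (a sketch only; the loop form and the names offset and v are mine, the original unrolled five steps would work the same way) is to let every thread read its neighbor's value first, and only write its own element after a __syncwarp(), with another __syncwarp() before the next step. This can be used as a drop-in replacement for kernel1 in test1.cu:
__global__ void kernel1(int *data)
{
    __shared__ int data_s[32];
    size_t t_id = threadIdx.x;
    data_s[t_id] = data[t_id];
    __syncwarp();
    for (size_t offset = 1; offset < 32; offset *= 2)
    {
        // Read the neighbor's value before anyone overwrites it ...
        int v = 0;
        if (offset <= t_id)
            v = data_s[t_id - offset];
        __syncwarp();
        // ... then update this thread's element and make the new value
        // visible to the rest of the warp before the next step.
        data_s[t_id] += v;
        __syncwarp();
    }
    data[t_id] = data_s[t_id];
}
Since the block consists of a single warp of 32 threads and every thread reaches every __syncwarp(), the default full mask is fine; with this change, the shared-memory version should produce the same prefix sums as test.c and test0.cu.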
The reason this problem only manifests starting with CUDA 9.0 is that warp-level synchronization was only really introduced in CUDA 9.0, when Volta and its "independent thread scheduling" made it a necessity. Before CUDA 9.0, warp-synchronous programming was not officially supported, but compilers used to be rather conservative about actually breaking code like in your example above. The reason is probably that such "warp-synchronous" programming (note the quotes) was often the only way to even get close to peak performance; there was no real alternative, so people were doing it all the time. It was still undefined behavior, though, and NVIDIA kept warning us. It just happened to work in many cases…