I want to achieve the effect of the code below, i.e. using flags to control kernel behavior from the host. So far the flags allocated with unified memory have worked as I expected, but when I update data on the host and copy it to the device, it does not work.
So my question is: can CUDA achieve this effect, that is, update data on the host, copy it to the device while a kernel is executing, and then inform the kernel to process the data by updating a data-ready flag?
More details
cudaMemcpy: When I use cudaMemcpy, the data_ready flag is never changed and the kernel keeps printing x.
cudaMemcpyAsync: When I use cudaMemcpyAsync to copy the updated data, the program can finish, since data_ready does get changed, but the value of data remains the same.
Unified memory for data: I also thought about using unified memory for data, but in a more complex scenario the data could be really large (more than 1 GB), and I don't think unified memory could handle that.
#include <iostream>
#include <cstdio>
#include <cuda_runtime.h>
using namespace std;
__global__ void test(int *flag, int *data_ready, int *data) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    while (true) {
        if (*flag == 0) {
            // wait for data transfer
            while (true) {
                if (*data_ready == 0) {
                    printf("x");
                }
                else {
                    break;
                }
            }
            printf("data %d\n", *data);
            __syncthreads();
        }
        else {
            break;
        }
    }
    printf("gpu finish %d\n", tid);
}
int main() {
    // flags
    int *flag;
    cudaMallocManaged(&flag, sizeof(int));
    *flag = 0;
    int *data_ready;
    cudaMallocManaged(&data_ready, sizeof(int));
    *data_ready = 0;
    // data
    int *data = (int *)malloc(sizeof(int));
    int *data_device;
    *data = 777;
    cudaMalloc(&data_device, sizeof(int));
    cudaMemcpy(data_device, data, sizeof(int), cudaMemcpyHostToDevice);
    // launch kernel
    int block = 8, grid = 1;
    test<<<grid, block>>>(flag, data_ready, data_device);
    // random host code
    for (int i = 0; i < 1e5; i++);
    printf("host do something\n");
    // update data
    *data = 987;
    cudaMemcpy(data_device, data, sizeof(int), cudaMemcpyHostToDevice);
    printf("host copied\n");
    *data_ready = 1;
    // update flag
    *flag = 1;
    cudaDeviceSynchronize();
    // free memory
    cudaFree(flag);
    printf("host finish\n");
}
The general topic of "how to communicate data to a running kernel" is already covered in various posts such as here and here. There are many other examples; see the items linked from that first example for a list of relevant material.
Several concepts are needed to make it work.
Possibly the most important concept is understanding what CUDA streams are. Even if you don't explicitly use CUDA streams, you are launching work into a particular stream, the so-called "null" stream. Stream semantics dictate that work issued into the same stream serializes: item B, issued into stream s (s may be the null stream), will not begin until item A, previously issued into stream s, completes. So the two items you have issued will never run concurrently; the cudaMemcpy operation will wait (forever) for the kernel to complete, because both are issued into the same (null) stream.
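As a minimal sketch of the way out (the kernel name spin and the flag stop are made up for illustration): a stream created with the cudaStreamNonBlocking flag is exempt from the legacy null-stream synchronization, so a copy later issued into the null stream can proceed while the kernel is still running:
#include <cstdio>
#include <cuda_runtime.h>

__global__ void spin(volatile int *stop) {
    while (*stop == 0);                     // spin until the host flips the flag
}

int main() {
    int *stop;
    cudaMallocManaged(&stop, sizeof(int));  // host access during kernel execution
    *stop = 0;                              // requires concurrentManagedAccess
    int *d, h = 42;
    cudaMalloc(&d, sizeof(int));
    // a non-blocking stream for the kernel, so later null-stream work
    // does not serialize behind it
    cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
    spin<<<1, 1, 0, s>>>(stop);
    // issued into the null stream, yet it need not wait for the kernel
    cudaMemcpy(d, &h, sizeof(int), cudaMemcpyHostToDevice);
    *stop = 1;                              // release the kernel
    cudaDeviceSynchronize();
    printf("done\n");
    cudaFree(stop);
    cudaFree(d);
}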
Furthermore, in your case, you are using managed memory to facilitate some of the communication. For this to work, you must be on a system for which the concurrentManagedAccess attribute is true; managed memory is not a suitable vehicle for this otherwise. I don't intend to give a tutorial on UM usage here, but there are many resources online.
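You can verify the attribute at runtime with a quick query, for example (a sketch querying device 0):
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int attr = 0;
    // ask whether the host may access managed memory while a kernel
    // is running on device 0
    cudaDeviceGetAttribute(&attr, cudaDevAttrConcurrentManagedAccess, 0);
    printf("concurrentManagedAccess: %d\n", attr); // must be 1 for this scheme
}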
Finally, in some cases it is necessary to mark items that will be used for communication with a running kernel with the volatile qualifier, so as to prevent the compiler from doing any optimizations that would affect the "visibility" of that item, since it is being updated by a separate entity (the host, in this case).
The following code has some of these items addressed and seems to finish in a sensible way, for me:
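(A sketch along those lines, assuming a system where concurrentManagedAccess is true; the __syncthreads() is dropped here because threads could otherwise diverge on the flag read and deadlock at the barrier.)
#include <cstdio>
#include <cuda_runtime.h>

// volatile prevents the compiler from caching the flag reads in registers,
// so every iteration re-reads memory and can observe the host's updates
__global__ void test(volatile int *flag, volatile int *data_ready, int *data) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    while (*flag == 0) {
        while (*data_ready == 0);   // wait for data transfer
        printf("data %d\n", *data);
    }
    printf("gpu finish %d\n", tid);
}

int main() {
    // flags in managed memory; requires concurrentManagedAccess == 1
    int *flag, *data_ready;
    cudaMallocManaged(&flag, sizeof(int));
    cudaMallocManaged(&data_ready, sizeof(int));
    *flag = 0;
    *data_ready = 0;
    // data
    int *data = (int *)malloc(sizeof(int));
    int *data_device;
    *data = 777;
    cudaMalloc(&data_device, sizeof(int));
    cudaMemcpy(data_device, data, sizeof(int), cudaMemcpyHostToDevice);
    // launch the kernel into a non-blocking stream, so the copy below,
    // issued into the null stream, does not serialize behind it
    cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
    int block = 8, grid = 1;
    test<<<grid, block, 0, s>>>(flag, data_ready, data_device);
    printf("host do something\n");
    // update data; cudaMemcpy is synchronous with respect to the host,
    // so the new value is in place on the device when it returns
    *data = 987;
    cudaMemcpy(data_device, data, sizeof(int), cudaMemcpyHostToDevice);
    printf("host copied\n");
    *data_ready = 1;
    // update flag
    *flag = 1;
    cudaDeviceSynchronize();
    // free memory
    cudaFree(flag);
    cudaFree(data_ready);
    cudaFree(data_device);
    free(data);
    cudaStreamDestroy(s);
    printf("host finish\n");
}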