Nvidia OpenCL hangs on blocking buffer access

I have an OpenCL program that copies a bunch of values to an input buffer, processes these values, and copies the results back.

// map input data buffer, has CL_MEM_ALLOC_HOST_PTR
cl_float* data = clEnqueueMapBuffer(queue, data_buffer, CL_TRUE, CL_MAP_WRITE, 0, data_size, 0, NULL, NULL, NULL);

// set input values
for(size_t i = 0; i < n; ++i)
    data[i] = values[i];

// unmap input buffer
clEnqueueUnmapMemObject(queue, data_buffer, data, 0, NULL, NULL);

// run kernels
...

// map results buffer, has CL_MEM_ALLOC_HOST_PTR
cl_float* results = clEnqueueMapBuffer(queue, results_buffer, CL_TRUE, CL_MAP_READ, 0, results_size, 0, NULL, NULL, NULL);

// processing
...

// unmap results buffer
clEnqueueUnmapMemObject(queue, results_buffer, results, 0, NULL, NULL);

(In the real code, I check for errors etc.)

This works great on AMD and Intel architectures (both CPU and GPU). On Nvidia GPUs, the code is incredibly slow. A program that normally takes 10 seconds to run (5 seconds host, 5 seconds device) will run for more than two and a half minutes on Nvidia cards.

However, I have found that this is not a straightforward optimisation problem or zero-copy speed difference. Using a profiler, I see that the host time of the program is 5 seconds, as in the normal case. And using OpenCL profiling events, I see that the device time is also 5 seconds, as in the normal case!

So I used the poor man's profiler trick to figure out where the program spends its time on Nvidia GPUs. It shows that the program just waits idly on both of the clEnqueueMapBuffer calls. I find this especially incomprehensible for the first call, as the queue is empty at that point.

I repeat, I have profiled every map/unmap and kernel invocation, and the extra time does not show up there, so it is spent neither on the device nor on the host. I can see from the stack profile that it is waiting on a semaphore instead. Does anyone know what's causing this hang?
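One way to quantify the gap described above is to compare the wall-clock time around a blocking call with the duration reported by that call's profiling event. The following is a minimal sketch, assuming a POSIX system with clock_gettime; the helper names (wall_seconds, event_seconds, unexplained_wait) are hypothetical and not part of the real code or of OpenCL.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Wall-clock time in seconds, read from a monotonic POSIX clock. */
static double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Convert a pair of nanosecond timestamps to a duration in seconds.
 * OpenCL reports CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END
 * as 64-bit nanosecond counters, so uint64_t stands in for cl_ulong here
 * (an assumption made to keep the sketch free of OpenCL headers). */
static double event_seconds(uint64_t start_ns, uint64_t end_ns)
{
    assert(end_ns >= start_ns);
    return (double)(end_ns - start_ns) * 1e-9;
}

/* Time the host spent blocked in a call that the command's own
 * start/end timestamps do not account for. */
static double unexplained_wait(double wall_before, double wall_after,
                               uint64_t start_ns, uint64_t end_ns)
{
    return (wall_after - wall_before) - event_seconds(start_ns, end_ns);
}
```

In use, wall_seconds() would be called immediately before and after the blocking clEnqueueMapBuffer, and start_ns and end_ns would come from clGetEventProfilingInfo with CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END on the event returned by that call. This requires a queue created with CL_QUEUE_PROFILING_ENABLE. A large positive result from unexplained_wait would match the symptom described: time lost while blocked, but not attributed to the command itself.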
