OpenCL multiple GPU buffer read fails

280 Views Asked by At

I am trying to make 2 Nvidia GPUs work side by side to do n-body simulation (source). I do proper detection, and store all of the OpenCL stuff in one struct:

struct ocl_wrap {
  cl_event event;
  cl_program program;
  cl_kernel kernel;
  cl_command_queue command_queue;
  cl_device_id device_id;
  cl_context context;

  cl_mem masses;
  cl_mem bodies;
  cl_mem speeds;
  cl_mem newBodies;

  cl_int ret;
};

So now, every device has assigned its own struct (its own context, queue etc.) and per each step I run this 2 functions:

void writeGPU() {
      clCreateBuffer() //4* 
      clSetKernelArg() //5*
      clEnqueueNDRangeKernel()
}
void readGPU() {
      clEnqueueReadBuffer() //2*
      clFlush()
      clReleaseMemObject() //4*
}

And one step looks like this:

void step() {
  for each gpu
    writeGPU();
  runCPU();
  for each gpu
    readGPU();
}

Where every device is given a subset of problems to solve.

I have a problem, that the first 64 (sometimes 128) floats from one or the other GPU, that I try to copy back to CPU will not actually copy. Otherwise, everything is working correctly, the first GPU works flawlessly. Sometimes it just works, but just at random the bug appears and it doesn't go away. Any suggestions?

2

There are 2 best solutions below

0
On

My guess at this point is that you are probably not utilizing the OpenCL event system and perhaps even OpenCL Memory Barriers/Fences to get notified on whether the I/O reads-writes have reached their destination and to coordinate your program by setting up breakpoints and wait-lists. If the OpenCL distribution kit you have on your system works as it should and you're utilizing the event system, then the program sequence presented above should resemble

// setup global event objects
// setup global markers/barriers
void writeGPU() {
      // hook event listeners to APIs
      clCreateBuffer() //4* 
      clSetKernelArg() //5*
      clEnqueueNDRangeKernel()
      // place appropriate markers/barriers
}
void readGPU() {
      // Many OpenCL APIs listen to events and proceed only 
      // when the `wait` condition is satisfied or 
      // barrier conditions are met.
      clEnqueueReadBuffer() //2*
      clFlush()
      clReleaseMemObject() //4*
}
0
On

You are probably looking at the data before the read finishes. clFlush only ensures the command left the host, not that the command finished on the device. Solution: Use a blocking read, or use clFinish instead of clFlush, or use OpenCL events.