CUDA P2P memory access and __constant__ memory


I cannot find the answer anywhere (I may have overlooked it), but it seems that one cannot combine __constant__ memory (set via cudaMemcpyToSymbol) with peer-to-peer access under UVA.

I've tried NVIDIA's simpleP2P sample code, which works fine on the four V100 GPUs with NVLink that I have, but as soon as I declare the factor 2 in the kernel as:

__constant__ float M_; // in global space

float M = 2.0;
cudaMemcpyToSymbol(M_, &M, sizeof(float), 0, cudaMemcpyDefault);

the results are basically zero. If I define it with the C preprocessor instead (e.g. #define M_ 2.0), it works fine.

So I'm wondering: is that true, or am I doing something wrong? And are there any other kinds of memory that also cannot be accessed this way (texture memory, for example)?

The relation between your question of why "the results are basically zero" and P2P access with UVA is not immediately clear to me.

is that true or am I doing something wrong?

It's hard to say as your question is a bit vague and no complete example is shown.

__constant__ float M_ allocates a variable M_ in the constant memory of every CUDA-visible device. To set the value on multiple devices, you should do something like:

__constant__ float M_; // <= This declares M_ in the constant memory of all CUDA-visible devices

__global__ void showMKernel() {
    printf("****** M_ = %f\n", M_);
}

int main()
{
    float M = 2.0f;

    // Make sure that the return values are properly checked for cudaSuccess ...

    int deviceCount = -1;
    cudaGetDeviceCount(&deviceCount);

    // Set M_ in the constant memory of each device:
    for (int i = 0; i < deviceCount; i++) {
        cudaSetDevice(i);
        cudaMemcpyToSymbol(M_, &M, sizeof(float), 0, cudaMemcpyDefault);
    }

    // Now, run a kernel to show M_:
    for (int i = 0; i < deviceCount; i++) {
        cudaSetDevice(i);
        printf("Device %d :\n", i); // note: %d, not %g -- i is an int
        showMKernel<<<1,1>>>();
        cudaDeviceSynchronize();
    }

    return 0;
}

which returns:

Device 0 :
****** M_ = 2.000000
Device 1 :
****** M_ = 2.000000
// and so on for the other devices

Now, if I replace

// Set M_ on the constant memory of each device:
for (int i = 0; i < deviceCount; i++) {
  cudaSetDevice(i);
  cudaMemcpyToSymbol(M_, &M, sizeof(float), 0, cudaMemcpyDefault);
}

with

cudaMemcpyToSymbol(M_, &M, sizeof(float), 0, cudaMemcpyDefault);

this only sets the value of M_ on the currently active device, and therefore returns

Device 0 :
****** M_ = 2.000000
Device 1 :
****** M_ = 0.000000 // <= I assume this is what you meant by 'the results are basically zero'
// M_ = 0 for the other devices too

are there any other kinds of memory that also cannot be accessed this way (texture memory, for example)?

Again, I'm not entirely sure what this way means here. In general you cannot access the constant memory or the texture memory of one device from another device, though I am not 100% certain.

UVA assigns a single address space to CPU and GPU memories, so that copies between host memory and the global memory of multiple devices can be performed with a plain cudaMemcpy using the kind cudaMemcpyDefault: the runtime infers the direction from the pointers themselves.
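As a sketch of what that looks like in practice (buffer names are hypothetical and error checking is omitted for brevity):

```cuda
#include <cuda_runtime.h>

int main()
{
    const size_t N = 1024;
    float *h_buf;            // host memory
    float *d_buf0, *d_buf1;  // global memory on devices 0 and 1

    cudaMallocHost(&h_buf, N * sizeof(float));

    cudaSetDevice(0);
    cudaMalloc(&d_buf0, N * sizeof(float));
    cudaSetDevice(1);
    cudaMalloc(&d_buf1, N * sizeof(float));

    // Under UVA, the same call works for every direction;
    // the kind is inferred from the pointer values:
    cudaMemcpy(d_buf0, h_buf,  N * sizeof(float), cudaMemcpyDefault); // host  -> dev 0
    cudaMemcpy(d_buf1, d_buf0, N * sizeof(float), cudaMemcpyDefault); // dev 0 -> dev 1
    cudaMemcpy(h_buf,  d_buf1, N * sizeof(float), cudaMemcpyDefault); // dev 1 -> host

    return 0;
}
```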

Also, P2P communication between devices allows direct access to, and transfers of data between, the global memories of multiple devices.
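Enabling that direct access is an explicit step; a minimal sketch, assuming devices 0 and 1 can reach each other (which cudaDeviceCanAccessPeer reports):

```cuda
#include <cuda_runtime.h>

int main()
{
    int canAccess01 = 0, canAccess10 = 0;
    cudaDeviceCanAccessPeer(&canAccess01, 0, 1); // can device 0 reach device 1?
    cudaDeviceCanAccessPeer(&canAccess10, 1, 0); // and the other way round?

    if (canAccess01 && canAccess10) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0); // device 0 may now dereference device 1's global memory
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0); // and vice versa
    }

    // From here on, kernels on device 0 can directly read/write pointers
    // returned by cudaMalloc on device 1 (global memory only), and
    // cudaMemcpyPeer or cudaMemcpy with cudaMemcpyDefault can copy between them.
    return 0;
}
```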

Similar to the __constant__ example above, when you declare a texture like texture<float> some_texture, some_texture is instantiated on each visible device; however, when working with multiple devices you need to explicitly bind some_texture to your texture data on each device.
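A sketch of that per-device binding, using the legacy texture-reference API the question implies (deprecated in recent CUDA versions; buffer names and sizes here are hypothetical, and error checking is omitted):

```cuda
#include <cuda_runtime.h>

texture<float> some_texture; // declared once; instantiated per device, like __constant__

int main()
{
    const size_t N = 1024;

    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    // Bind the reference to a device-local buffer on each device:
    for (int i = 0; i < deviceCount; i++) {
        cudaSetDevice(i);
        float *d_data;
        cudaMalloc(&d_data, N * sizeof(float));
        // ... fill d_data on this device ...
        cudaBindTexture(NULL, some_texture, d_data, N * sizeof(float));
    }

    // Kernels launched on each device now fetch from that device's own
    // binding via tex1Dfetch(some_texture, idx).
    return 0;
}
```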