I am new to CUDA/C++ and I am studying Unified Memory. I have found this introduction to the topic, but I have a question regarding one of the examples.
To mitigate migration overhead, there is an example in which the data is initialized in a kernel:
#include <iostream>
#include <math.h>

// initialize arrays on device
__global__ void init(int n, float *x, float *y) {
  int index = threadIdx.x + blockIdx.x * blockDim.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }
}

// CUDA kernel to add elements of two arrays
__global__ void add(int n, float *x, float *y) {
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride) {
    y[i] = x[i] + y[i];
  }
}

int main(void)
{
  int N = 1<<20;
  float *x, *y;

  // Allocate Unified Memory -- accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // Launch kernels on 1M elements on the GPU
  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;
  init<<<numBlocks, blockSize>>>(N, x, y);
  add<<<numBlocks, blockSize>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Free memory
  cudaFree(x);
  cudaFree(y);

  return 0;
}
In the link I included above, it is said that for this case "There are still device-to-host page faults, but this is due to the loop at the end of the program that checks the results on the CPU." However, I have deleted the loop at the end, and the profiling result is:
==4242== NVPROF is profiling process 4242, command: /content/src/add_unifmem_initonkernel
==4242== Profiling application: /content/src/add_unifmem_initonkernel
==4242== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 96.00% 1.4178ms 1 1.4178ms 1.4178ms 1.4178ms init(int, float*, float*)
4.00% 59.070us 1 59.070us 59.070us 59.070us add(int, float*, float*)
API calls: 99.21% 263.47ms 2 131.74ms 54.879us 263.42ms cudaMallocManaged
0.54% 1.4273ms 1 1.4273ms 1.4273ms 1.4273ms cudaDeviceSynchronize
0.15% 401.83us 2 200.91us 197.33us 204.49us cudaFree
0.05% 120.55us 101 1.1930us 139ns 50.860us cuDeviceGetAttribute
0.04% 96.692us 2 48.346us 40.043us 56.649us cudaLaunchKernel
0.01% 28.565us 1 28.565us 28.565us 28.565us cuDeviceGetName
0.00% 6.9460us 1 6.9460us 6.9460us 6.9460us cuDeviceGetPCIBusId
0.00% 2.0890us 3 696ns 225ns 1.5490us cuDeviceGetCount
0.00% 1.0370us 2 518ns 314ns 723ns cuDeviceGet
0.00% 502ns 1 502ns 502ns 502ns cuDeviceTotalMem
0.00% 500ns 1 500ns 500ns 500ns cuModuleGetLoadingMode
0.00% 230ns 1 230ns 230ns 230ns cuDeviceGetUuid
==4242== Unified Memory profiling result:
Device "Tesla T4 (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
13 - - - - 1.695805ms Gpu page fault groups
There are still some GPU page faults happening, but if I have understood it correctly, they should not happen in this case.
What am I missing here?
Your init kernel is still experiencing page faults. You can get an additional clue of this by noting the huge time disparity between the duration of the init kernel (~1400 microseconds) and the add kernel (~60 microseconds).

The reason for this is that page faults may occur in at least two cases, related to the same core issue: the page touched by the code is not present in device memory. Perhaps the typical case is when the data is physically present on some other processor and needs to be migrated. In that case the page fault serves to trigger the migration, and when that happens nvprof will usually report additional data associated with the faults, such as the amount of data migrated, the size of the blocks, the number of migrations, etc. But all of that is missing in your report. This is a second kind of clue that these page faults have a slightly different origin and purpose.

The basic idea is that some allocators are so-called "lazy allocators". The allocator creates the possibility for the data to exist, including an address range, but does not actually assign or "map" physical memory to store it. cudaMallocManaged is a lazy allocator in this respect. The assignment of memory happens on "first touch", and in your case first touch takes place in the init kernel. Since the data is not actually present or fully allocated in device memory at that point, page faults occur, and these page faults have the purpose of "bringing pages into existence" as opposed to migrating data.

If you want to make this effect disappear altogether, you will need to actually instantiate the data somewhere. In typical programmatic usage, you would do this by initializing the data "somewhere": if you do it in the init kernel you will get these kinds of GPU page faults, and if you do it in host code and then allow the data to be migrated to the GPU, you will get these kinds of page faults in CPU code instead.
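For illustration, the host-code variant would look roughly like the sketch below (an assumption about what such code might look like, not code from the answer; it would replace the init kernel launch in your program, so the pages are instantiated in host memory and later migrate to the GPU when the add kernel touches them):

// host-side first touch: instantiates the pages in CPU memory
for (int i = 0; i < N; i++) {
  x[i] = 1.0f;
  y[i] = 2.0f;
}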
For your particular program here, one approach you could take to remove this effect is to insert the following before your first (init) kernel call:
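The exact snippet is not reproduced above, so the following is a sketch of one plausible version, assuming cudaMemPrefetchAsync is used to populate the managed allocations on the GPU before init runs:

// Assumed approach: prefetch the managed allocations to the GPU so the
// pages are instantiated there before the init kernel touches them.
int device = -1;
cudaGetDevice(&device);
cudaMemPrefetchAsync(x, N*sizeof(float), device);
cudaMemPrefetchAsync(y, N*sizeof(float), device);

A full example would then be the program from the question with those calls added right after the cudaMallocManaged calls (again, a sketch rather than the answer's verbatim code):

#include <iostream>
#include <math.h>

// initialize arrays on device
__global__ void init(int n, float *x, float *y) {
  int index = threadIdx.x + blockIdx.x * blockDim.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }
}

// CUDA kernel to add elements of two arrays
__global__ void add(int n, float *x, float *y) {
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride) {
    y[i] = x[i] + y[i];
  }
}

int main(void)
{
  int N = 1<<20;
  float *x, *y;

  // Allocate Unified Memory -- accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // Assumed fix: instantiate the pages on the GPU before first touch
  int device = -1;
  cudaGetDevice(&device);
  cudaMemPrefetchAsync(x, N*sizeof(float), device);
  cudaMemPrefetchAsync(y, N*sizeof(float), device);

  // Launch kernels on 1M elements on the GPU
  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;
  init<<<numBlocks, blockSize>>>(N, x, y);
  add<<<numBlocks, blockSize>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Free memory
  cudaFree(x);
  cudaFree(y);

  return 0;
}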
With that change, the duration of the init kernel drops to something that is roughly comparable to the duration of the add kernel, and the page fault reporting disappears. You can get additional discussion of managed memory behavior in unit 6 of this online training series.