In the following test code, I initialize the data on the GPU and then access it from the CPU. I have two questions about the profiling result from nvprof.
1. Why is there a data migration from Host To Device? In my understanding it should be Device to Host.
2. Why is the H->D count 2? I think it should be 1, because the data fits in one page.
Thanks in advance!
My environment:
- Driver Version: 418.87.00
- CUDA Version: 10.1
- ubuntu 18.04
#include <cuda.h>
#include <iostream>

using namespace std;

__global__ void setVal(char* data, int idx)
{
    data[idx] = 'd';
}

int main()
{
    const int count = 8;
    char* data;
    cudaMallocManaged((void **)&data, count);
    setVal<<<1,1>>>(data, 0); // GPU page fault
    cout << " cpu read " << data[0] << endl;
    cudaFree(data);
    return 0;
}
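(For reference, the output below comes from a plain nvprof run; the file name test_um.cu is just an assumption, the commands are otherwise standard.)

nvcc test_um.cu -o test_um
nvprof ./test_um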
==28762== Unified Memory profiling result:
Device "GeForce GTX 1070 (0)"
Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
    2  32.000KB  4.0000KB  60.000KB  64.00000KB  11.74400us  Host To Device
    1         -         -         -           -  362.9440us  Gpu page fault groups
Total CPU Page faults: 1
You are thrashing data between host and device. Because the GPU kernel launch is asynchronous, your host code, issued after the kernel launch, is actually accessing the data before the GPU kernel has run.

Put a cudaDeviceSynchronize() after your kernel call, so that the CPU code does not attempt to read the data until after the kernel is complete.

I don't have an answer for your other question. The profiler is often not able to resolve very small amounts of activity perfectly. It does not necessarily instrument all SMs during a profiling run, and some of its results may be scaled for the size of a GPC, a TPC, and/or the entire GPU. That would be my guess, although it is just speculation. I generally don't expect perfectly accurate results from the profiler when profiling tiny bits of code that do almost nothing.
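For completeness, here is a minimal sketch of the corrected program, assuming the only change is the added synchronization (everything else is your original code):

#include <cuda.h>
#include <iostream>

using namespace std;

__global__ void setVal(char* data, int idx)
{
    data[idx] = 'd';
}

int main()
{
    const int count = 8;
    char* data;
    cudaMallocManaged((void **)&data, count);
    setVal<<<1,1>>>(data, 0);   // GPU touches the managed page first
    cudaDeviceSynchronize();    // wait for the kernel before the CPU reads
    cout << " cpu read " << data[0] << endl; // CPU access after the kernel completes
    cudaFree(data);
    return 0;
}

With this change, the CPU read happens only after the kernel has written the page on the device, so the Unified Memory section of the profile should show a Device To Host migration for that access rather than the back-and-forth you are seeing now.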