In the following test code, I initialize the data on the GPU and then access it from the CPU. I have two questions about the profiling result from nvprof.
1. Why is there a data migration from Host To Device? In my understanding it should be Device to Host.
2. Why is the H->D count 2? I think it should be 1, because the data fits in one page.
Thanks in advance!
My environment:
- Driver Version: 418.87.00
- CUDA Version: 10.1
- ubuntu 18.04
#include <cuda.h>
#include <iostream>

using namespace std;

__global__ void setVal(char* data, int idx)
{
    data[idx] = 'd';
}

int main()
{
    const int count = 8;
    char* data;
    cudaMallocManaged((void **)&data, count);
    setVal<<<1,1>>>(data, 0); // GPU page fault
    cout << " cpu read " << data[0] << endl;
    cudaFree(data);
    return 0;
}
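(For reference, the output below comes from a plain nvprof run; the file name test_um.cu is just an assumption, the commands are otherwise standard.)

nvcc test_um.cu -o test_um
nvprof ./test_um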
==28762== Unified Memory profiling result:
Device "GeForce GTX 1070 (0)"
Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
    2  32.000KB  4.0000KB  60.000KB  64.00000KB  11.74400us  Host To Device
    1         -         -         -           -  362.9440us  Gpu page fault groups
Total CPU Page faults: 1
You are thrashing data between host and device. Because the GPU kernel launch is asynchronous, your host code, issued after the kernel launch, is actually accessing the data before the GPU kernel has run.

Put a cudaDeviceSynchronize() after your kernel call, so that the CPU code does not attempt to read the data until after the kernel is complete.

I don't have an answer for your other question. The profiler is often not able to resolve very small amounts of activity perfectly. It does not necessarily instrument all SMs during a profiling run, and some of its results may be scaled for the size of a GPC, a TPC, and/or the entire GPU. That would be my guess, although it is just speculation. I generally don't expect perfectly accurate results from the profiler when profiling tiny bits of code that do almost nothing.
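For completeness, here is a minimal sketch of the corrected program, assuming the only change is the added synchronization (everything else is your original code):

#include <cuda.h>
#include <iostream>

using namespace std;

__global__ void setVal(char* data, int idx)
{
    data[idx] = 'd';
}

int main()
{
    const int count = 8;
    char* data;
    cudaMallocManaged((void **)&data, count);
    setVal<<<1,1>>>(data, 0);   // GPU touches the managed page first
    cudaDeviceSynchronize();    // wait for the kernel before the CPU reads
    cout << " cpu read " << data[0] << endl; // CPU access after the kernel completes
    cudaFree(data);
    return 0;
}

With this change, the CPU read happens only after the kernel has written the page on the device, so the Unified Memory section of the profile should show a Device To Host migration for that access rather than the back-and-forth you are seeing now.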