Do I need to externally call flush if using cuda api to copy from GPU Memory to Persistent Memory?

315 Views Asked by At

I am using Cuda API: cudaMemcpyAsync ( void* dst, const void* src, size_t count, cudaMemcpyKind kind, cudaStream_t stream = 0 )

to copy data from GPU memory from CPU memory. In case copying the data from CPU memory to Persistent Memory using memcpy(), we need to explicitly call the flush operation(eg. clflush()) to make sure data is flushed from CPU caches. Do I need to call the flush operation when copying from GPU Memory to Persistent Memory using cudaMemcpyAsync();

2

There are 2 best solutions below

2
On BEST ANSWER

Do I need to call the flush operation when copying from GPU Memory to Persistent Memory using cudaMemcpyAsync();

No.

However, you are calling a potentially asynchronous API, so you may need to use one of the synchronization APIs (stream or device scope) in order to ensure data consistency between operations that can potentially overlap and need to access the same memory area.

0
On

Intel processors with the server uncore design starting with Sandy Bridge support Data Direct I/O (DDIO), which is enabled by default. With DDIO, an inbound PCIe write targeting system memory location of type WB is an allocating write transaction.

For a full write (that writes to an entire cache line), the IIO first obtains ownership of the target cache line by invalidating all copies in the coherence domain except in the L3 that exists in the same NUMA node to which the originating device is attached. If the line doesn't already exist in the target L3, an L3 entry is allocated, which may require evicting another line to make space. The write is performed in the L3 and the coherence state of the line becomes M. This means that the data is not sent to the memory controller to which its address is mapped. Partial writes are buffered in the IIO (which is in the coherence domain) until they are eventually evicted to be written into the LLC (allocate or update). In DDIO, reads are never allocating.

Even if DDIO is disabled, PCIe writes can be buffered in the DDIO. When cudaMemcpyAsync or even cudaMemcpy returns, there is no guarantee that all writes have reached the persistence domain on Intel processors (unless you have Whole System Persistence). In addition, the memory copy is not guaranteed to be persistently atomic and there is no guarantee in what order the bytes will move from the IIO to the target memory controllers. You need a flag to tell you whether the entire data was persisted or not.

You can use a barrier (cudaStreamSynchronize() or cudaDeviceSynchronize()) to wait on the host until the data copy operation is complete, and then flush each cache line, followed by writing a flag, in that order.