Do I need to externally call flush if using cuda api to copy from GPU Memory to Persistent Memory?

304 Views Asked by Ashikee AbHi At 27 July 2025 at 21:15

I am using Cuda API: cudaMemcpyAsync ( void* dst, const void* src, size_t count, cudaMemcpyKind kind, cudaStream_t stream = 0 )

to copy data from GPU memory from CPU memory. In case copying the data from CPU memory to Persistent Memory using memcpy(), we need to explicitly call the flush operation(eg. clflush()) to make sure data is flushed from CPU caches. Do I need to call the flush operation when copying from GPU Memory to Persistent Memory using cudaMemcpyAsync();

Original Q&A

There are 2 best solutions below

talonmies On 08 January 2022 at 09:05 BEST ANSWER

Do I need to call the flush operation when copying from GPU Memory to Persistent Memory using cudaMemcpyAsync();

No.

However, you are calling a potentially asynchronous API, so you may need to use one of the synchronization APIs (stream or device scope) in order to ensure data consistency between operations that can potentially overlap and need to access the same memory area.

Hadi Brais On 11 January 2022 at 22:21

Intel processors with the server uncore design starting with Sandy Bridge support Data Direct I/O (DDIO), which is enabled by default. With DDIO, an inbound PCIe write targeting system memory location of type WB is an allocating write transaction.

For a full write (that writes to an entire cache line), the IIO first obtains ownership of the target cache line by invalidating all copies in the coherence domain except in the L3 that exists in the same NUMA node to which the originating device is attached. If the line doesn't already exist in the target L3, an L3 entry is allocated, which may require evicting another line to make space. The write is performed in the L3 and the coherence state of the line becomes M. This means that the data is not sent to the memory controller to which its address is mapped. Partial writes are buffered in the IIO (which is in the coherence domain) until they are eventually evicted to be written into the LLC (allocate or update). In DDIO, reads are never allocating.

Even if DDIO is disabled, PCIe writes can be buffered in the DDIO. When cudaMemcpyAsync or even cudaMemcpy returns, there is no guarantee that all writes have reached the persistence domain on Intel processors (unless you have Whole System Persistence). In addition, the memory copy is not guaranteed to be persistently atomic and there is no guarantee in what order the bytes will move from the IIO to the target memory controllers. You need a flag to tell you whether the entire data was persisted or not.

You can use a barrier (cudaStreamSynchronize() or cudaDeviceSynchronize()) to wait on the host until the data copy operation is complete, and then flush each cache line, followed by writing a flag, in that order.

Do I need to externally call flush if using cuda api to copy from GPU Memory to Persistent Memory?

There are 2 best solutions below

Related Questions in CUDA

Related Questions in CPU-ARCHITECTURE

Related Questions in FLUSH

Related Questions in PERSISTENT-MEMORY

Related Questions in CLFLUSH

Trending Questions

Popular # Hahtags

Popular Questions