NPP: Overlapping computation and data transfer

CUDA allows computation and data transfer to overlap by using the asynchronous cuMemcpy functions and streams. Is this also possible with NPP (NVIDIA Performance Primitives)?

A little background: I am trying to use the GPU through the NPP image resize functions (in our case, nppiResize_8u_C3R). I am using pinned memory and successfully transfer data to the GPU using cuMemcpy2DAsync_v2 and a per-thread stream. The problem is that nppiResize_8u_C3R and all other computation functions do not accept streams.

When I run the NVIDIA Visual Profiler, I see the following:

  1. Pinned memory allows me to transfer data faster, at about 6.524 GB/s.
  2. The percentage of time when memcpy is being performed in parallel with compute is 0%.
There is 1 answer below

The problem is that nppiResize_8u_C3R and all other computation functions do not accept streams.

NPP is fundamentally a stateless API. To use streams with NPP, however, you call nppSetStream to set the default stream for subsequent NPP operations. Page 2 of the documentation notes several caveats about using NPP with streams, along with recommended synchronization practices when switching streams.