I have a device vector that needs to be transformed in multiple ways (e.g. producing 20 new arrays from it) and then reduced (sum/accumulate), returning those sums in a host vector. The code works with thrust::transform_reduce, but looking at nvvp I see a lot of cudaMalloc/cudaFree/cudaMemcpy calls that slow down the algorithm. Since this runs in a loop, my idea is to pre-allocate some cache memory and cudaMemcpy all the results back once at the end of each iteration. What I need to make that work is a reduce that operates in place, in that pre-allocated memory. Below is roughly what the current version looks like (the `square` functor is just a stand-in for one of the 20 transforms, and the sizes are made up):
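```cpp
#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
#include <vector>

// Hypothetical unary functor standing in for one of the 20 transforms.
struct square {
    __host__ __device__ float operator()(float x) const { return x * x; }
};

int main() {
    thrust::device_vector<float> d_in(1 << 20, 1.0f);
    std::vector<float> h_sums(20);

    // Current approach: one transform_reduce per derived array.
    // Each call returns its result to the host and internally
    // allocates/frees temporary storage -- the overhead I see in nvvp.
    for (int i = 0; i < 20; ++i) {
        h_sums[i] = thrust::transform_reduce(d_in.begin(), d_in.end(),
                                             square{}, 0.0f,
                                             thrust::plus<float>{});
    }
    return 0;
}
```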
cub::DeviceReduce::Sum almost does it, but it seems to take its input from host memory, copy it into device temp storage, and copy the result back to the host at the end; I want to avoid all of those copies. My next choice would be cub::DeviceScan::InclusiveSum; I don't need all the partial sums, only the final one, but even so it might be faster since it won't do any malloc/memcpy. If I've read CUB's two-phase API correctly, something like the sketch below is what I'm after (temp storage allocated once up front, all inputs and outputs staying on the device, one copy at the end), but I'm not sure this is the intended usage:
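```cpp
#include <cub/cub.cuh>
#include <cuda_runtime.h>

int main() {
    const int n = 1 << 20;        // assumed input length
    const int num_results = 20;   // one slot per derived array

    float *d_in, *d_out;          // d_out holds all 20 sums, entirely on the device
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, num_results * sizeof(float));

    // Phase 1: query the temp-storage size once (d_temp == nullptr means
    // "size query only") and allocate it once, before the main loop.
    void *d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n);
    cudaMalloc(&d_temp, temp_bytes);

    // Phase 2, inside the loop: reuse the same temp storage for every
    // reduction. Input, output, and temp storage are all device pointers,
    // so no host<->device traffic should happen per reduction.
    for (int i = 0; i < num_results; ++i) {
        // ... transform d_in into the i-th derived array here ...
        cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out + i, n);
    }

    // Single copy of all 20 sums back to the host at the end of an iteration.
    float h_sums[num_results];
    cudaMemcpy(h_sums, d_out, num_results * sizeof(float),
               cudaMemcpyDeviceToHost);

    cudaFree(d_temp); cudaFree(d_out); cudaFree(d_in);
    return 0;
}
```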
Is there any way to do this reduction in place with those libraries (CUB/Thrust), saving the malloc and memcpy time, or is the way to go to write a custom kernel for it?