In a Fortran program accelerated with OpenACC, I need to duplicate an array on the GPU. The duplicate will only be used on the GPU and will never be copied back to the host. The only way I know to create it is to declare and allocate it on the host, then create it on the device with an acc data create clause:
    program test
      implicit none
      integer, parameter :: n = 1000
      real :: total
      real, allocatable :: array(:)
      real, allocatable :: array_d(:)

      allocate(array(n))
      allocate(array_d(n))
      array(:) = 1e0

      !$acc data copy(array) create(array_d) copyout(total)
      !$acc kernels
      array_d(:) = array(:)
      !$acc end kernels
      !$acc kernels
      total = sum(array_d)
      !$acc end kernels
      !$acc end data

      print *, sum(array)
      print *, total
      deallocate(array)
      deallocate(array_d)
    end program
This is only an illustrative example; the actual program is much more complex.
The problem with this solution is that I have to allocate the duplicated array on the host even though it is never used there. This wastes host memory, especially for large arrays (although I know I would run out of device memory before running out of host memory). In CUDA Fortran, I know I can declare a device-only array, but I do not know whether this is possible with OpenACC.
Is there a better way to perform this?
The OpenACC spec has the "acc declare device_resident" directive, which allocates a device-only array; you would use it instead of a "data create".
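As a rough sketch of how device_resident could be applied to the example from the question: the module name device_data and the program name are made up for illustration, and the array is placed in a module only because that is a common place to put a declare directive. This is untested here and assumes a compiler that honors device_resident as the spec describes.

```fortran
! Hypothetical module holding the device-only array.
module device_data
  implicit none
  real, allocatable :: array_d(:)
  ! Request that array_d live in device memory only. Per the OpenACC spec,
  ! the allocate/deallocate statements below then act on device memory.
  !$acc declare device_resident(array_d)
end module device_data

program test_device_resident
  use device_data
  implicit none
  integer, parameter :: n = 1000
  real :: total
  real, allocatable :: array(:)

  allocate(array(n))
  allocate(array_d(n))   ! picked up by the declare directive
  array(:) = 1e0

  ! array_d needs no data clause: the declare directive already covers it.
  !$acc data copyin(array) copyout(total)
  !$acc kernels
  array_d(:) = array(:)
  !$acc end kernels
  !$acc kernels
  total = sum(array_d)
  !$acc end kernels
  !$acc end data

  print *, total
  deallocate(array_d)
  deallocate(array)
end program test_device_resident
```

Since the array is meant to exist only on the device, host code should not read or write array_d outside of compute regions.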
In practice, though, due to the complexity of implementing it and the lack of a compelling use case, our compiler (NVHPC aka PGI) treats device_resident as a create, i.e., the host array is still allocated. So if you're using NVHPC and truly need a device-only array, you'll want to use the CUDA Fortran "device" attribute on the array. CUDA Fortran and OpenACC are interoperable, so it's fine to mix them.
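A minimal sketch of the mixed CUDA Fortran and OpenACC variant, assuming nvfortran with both models enabled (for example the -acc and -cuda flags). The program name is illustrative, and the explicit loops replace the array syntax of the original only to keep the device work unambiguous; treat this as an untested outline rather than a verified implementation.

```fortran
program test_device_attr
  implicit none
  integer, parameter :: n = 1000
  integer :: i
  real :: total
  real, allocatable :: array(:)
  ! CUDA Fortran "device" attribute: array_d has no host copy at all.
  real, allocatable, device :: array_d(:)

  allocate(array(n))
  allocate(array_d(n))   ! allocates device memory only
  array(:) = 1e0

  ! Only the host array needs a data clause; the compiler is assumed to
  ! recognize array_d as device data because of its attribute.
  !$acc data copyin(array)
  !$acc parallel loop
  do i = 1, n
    array_d(i) = array(i)
  end do

  total = 0.0
  !$acc parallel loop reduction(+:total)
  do i = 1, n
    total = total + array_d(i)
  end do
  !$acc end data

  print *, total
  deallocate(array_d)
  deallocate(array)
end program test_device_attr
```

The trade-off is portability: the device attribute is a CUDA Fortran extension, so this version will only build with a compiler that supports it.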
However, wasting a bit of host memory isn't an issue for the vast majority of codes, and since no data is copied, there's no performance impact. Hence if you kept the code as is, it shouldn't be a problem.