OpenACC duplicate array on device

293 Views Asked by At

On a Fortran program accelerated with OpenACC, I need to duplicate an array on GPU. The duplicated array will only be used on GPU and will never be copied on host. The only way I know to create it would be to declare and allocate it on host, then acc data create it:

program test
    implicit none
    integer, parameter :: n = 1000
    real :: total
    real, allocatable :: array(:)
    real, allocatable :: array_d(:)

    allocate(array(n))
    allocate(array_d(n))

    array(:) = 1e0

    !$acc data copy(array) create(array_d) copyout(total)

    !$acc kernels
    array_d(:) = array(:)
    !$acc end kernels

    !$acc kernels
    total = sum(array_d)
    !$acc end kernels

    !$acc end data

    print *, sum(array)
    print *, total

    deallocate(array)
    deallocate(array_d)
end program

This is an illustration code, as the program in question is much more complex.

The problem with this solution is that I have to allocate the duplicated array on host, even if I do not use it here. Some host memory would be wasted, especially for large arrays (even if I know I would run out of device memory before running out of host memory). On CUDA Fortran, I know I can declare a device only array, but I do not know if this is possible with OpenACC.

Is there a better way to perform this?

1

There are 1 best solutions below

4
On

The OpenACC spec has the "acc declare device_resident" which allocates a device only array which you'd use instead of a "data create". Something like:

    implicit none
    integer, parameter :: n = 1000
    real :: total
    real, allocatable :: array(:)
    real, allocatable :: array_d(:)
    !$acc declare device_resident(array_d)
    allocate(array(n))
    allocate(array_d(n))

    array(:) = 1e0

    !$acc data copy(array) copyout(total)

    !$acc kernels
    array_d(:) = array(:)
    !$acc end kernels

    !$acc kernels
    total = sum(array_d)
    !$acc end kernels

    !$acc end data

    print *, sum(array)
    print *, total

    deallocate(array)
    deallocate(array_d)
end program

Though due to complexity in implementation and lack of compelling use case, our compiler (NVHPC aka PGI) treats device_resident as a create, i.e the host array is still allocated. So if you're using NVHPC and truly need a device only array, then you'll want to use a CUDA Fortran "device" attribute on the array. CUDA Fortran and OpenACC are interoperable, so it's fine to mix them.

However, wasting a bit of host memory isn't an issue for the vast majority of codes, and since no data is copied, there's no performance impact. Hence if you kept the code as is, it shouldn't be a problem.