Which OpenACC directive tells the compiler to execute a statement on the device only?

I am learning OpenACC with Fortran (with a suite of tools from Nvidia) and am doing it by porting my implementation of the Conjugate Gradient (CG) solver to GPUs.

Naturally, I am trying to keep as much data as possible on the device (GPU memory), using the following directives:

27   ! Copy matrix (a_sparse), vectors (ax - b) and scalars (alpha - pap) to GPU                                     
28   !$acc enter data copyin(a_sparse)                                             
29   !$acc enter data copyin(a_sparse % row(:))                                    
30   !$acc enter data copyin(a_sparse % col(:))                                    
31   !$acc enter data copyin(a_sparse % val(:))                                    
32   !$acc enter data copyin(ax(:), ap(:), x(:), p(:), r(:), b(:))                                   
33   !$acc enter data copyin(alpha, beta, rho, rho_old, pap)                       

From that point on, all operations that make up the CG solution algorithm are done with the present clause. For a vector operation, an excerpt looks like:

49   !$acc  parallel loop      &                                                   
50   !$acc& present(r, b, ax)                                                      
51   do i = 1, n                                                                   
52     r(i) = b(i) - ax(i)                                                         
53   end do                                                                        

I do the same with scalars, for example:

87     !$acc kernels present(alpha, rho, pap)                                      
88     alpha = rho / pap                                                           
89     !$acc end kernels                                                           

All scalar variables are on the device. With lines 87-89 I am trying to execute the statement alpha = rho / pap on the device only, avoiding any data transfer from or to the host, but the nsight-sys profiler shows me the following:

[Image: nsight-sys profiler timeline]

To my astonishment, there seems to be data transfer at line 87, both before (red "Enter Data" square) and after (red "Exit Data" square) the compute construct (blue "Cg.f90: 87" square).

Could anyone tell me what is going on? Are lines 87-89 executed on the device? If so, why does there seem to be data transfer between the host and the device? If not, is there an OpenACC directive that would direct the compiler to execute a statement, which is not necessarily a loop, on the device only? Moreover, why are there no corresponding CUDA calls for these "Enter Data" and "Exit Data" fields?

I noticed the same for array operations, such as the one in lines 49-53 above. There is some data transfer there too, but I could attribute it to the variable n, which has to be passed to the device.

There is 1 best solution below

It could be a few things. The Fortran standard specifies that the right-hand side of an array-syntax operation must be fully evaluated before assignment to the left-hand side, so the compiler may be allocating a temporary array to hold the result of the evaluation. The compiler can often optimize away the need for the temporary, though, so this may or may not be the issue. Try writing this as an explicit loop rather than with array syntax to see if that solves it.
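As a minimal sketch of that comparison, assuming the residual update were written with array syntax (the excerpt in the question already uses an explicit loop), the two forms could look like the following. The program name, the kind parameter dp, the array size and the initial values are placeholders chosen for illustration, not taken from the question.

```fortran
program array_syntax_vs_loop
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n  = 1000        ! hypothetical problem size
  real(dp) :: r(n), b(n), ax(n)
  integer  :: i

  b  = 2.0_dp
  ax = 0.5_dp

  ! Place the vectors on the device once, as in the question
  !$acc enter data copyin(b, ax) create(r)

  ! Array syntax: the right-hand side must be evaluated in full before
  ! the assignment, so the compiler may (or may not) create a temporary
  !$acc kernels present(r, b, ax)
  r(:) = b(:) - ax(:)
  !$acc end kernels

  ! Explicit loop: each element is computed and stored directly
  !$acc parallel loop present(r, b, ax)
  do i = 1, n
    r(i) = b(i) - ax(i)
  end do

  ! Bring the result back only to print it, then release device memory
  !$acc update self(r)
  !$acc exit data delete(r, b, ax)

  print *, 'r(1) =', r(1), ' r(n) =', r(n)
end program array_syntax_vs_loop
```

Profiling both constructs in the same run shows whether the array-syntax version does extra allocation or data movement that the explicit loop does not.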

A second possibility is that the compiler needs to copy the array descriptors since it can't tell whether they've changed. In that case, though, I'd expect to see some data movement rather than just the enter/exit regions.

The third possibility is that this is just the present check itself, which still makes the enter/exit runtime calls. Instead of copying data, the call looks up the device pointer, which is later passed to the kernel launch, and increments (on entry) or decrements (on exit) the reference counter.
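To make the scalar case concrete, here is a minimal sketch of a single statement executed on the device with all scalars already present. It uses the serial construct (available since OpenACC 2.6) in place of the kernels construct from the question; that substitution, the program name and the values are assumptions made for this sketch. With a present clause the runtime still has to look up the device copies on entry and exit, which a profiler may display as enter/exit data regions even when nothing is copied.

```fortran
program scalar_on_device
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp) :: alpha, rho, pap

  rho   = 6.0_dp      ! placeholder values
  pap   = 3.0_dp
  alpha = 0.0_dp

  ! Place the scalars on the device once
  !$acc enter data copyin(alpha, rho, pap)

  ! One statement, run by a single device thread; present() tells the
  ! runtime to use the existing device copies instead of transferring data
  !$acc serial present(alpha, rho, pap)
  alpha = rho / pap
  !$acc end serial

  ! Copy back only to print the result, then release device memory
  !$acc update self(alpha)
  !$acc exit data delete(alpha, rho, pap)

  print *, 'alpha =', alpha
end program scalar_on_device
```

With the NVIDIA compiler this can be built with nvfortran -acc -Minfo=accel, where the -Minfo=accel output reports how the compiler handled each construct and variable.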