I'm porting a big Fortran code to OpenACC and I'm struggling with some (rather simple) operations. I tried to summarize the problem in the test example that can be found below. I do know that the code is not optimal (it is splitting the array along the 2nd dimension, making the data non-contiguous), but this is not the point. I'd like to use this rather simple example to better understand OpenACC procedures. Here is the code.
First a module, where I define a structure with some allocatable data (in the form of pointers) and the dimensions, plus the subroutines I use to copy the data and to get the indices for the chunking of the data
module test
implicit none
!
type test_type
integer :: n1,n2,n3
integer, pointer, dimension(:,:,:) :: dat
end type test_type
!
contains
subroutine copydat(datain,dataout,indi,inde)
!$ACC ROUTINE VECTOR
implicit none
integer, intent(in) :: indi, inde
type(test_type), intent(in) :: datain
type(test_type), intent(out) :: dataout
!
dataout%dat(:,indi:inde,:) = datain%dat(:,indi:inde,:)
end subroutine copydat
!
subroutine get_indices(n2,i_chunk,n_chunks,indi,inde)
!$ACC ROUTINE SEQ
implicit none
integer, intent(in) :: n2, i_chunk, n_chunks
integer, intent(out) :: indi, inde
!
indi = 1 + (i_chunk - 1)*n2/n_chunks
inde = i_chunk*n2/n_chunks
end subroutine get_indices
end module test
Second, the main program, where we read the dimensions of data from a file ('num.txt'), we initialize data1 with some data, we copy the data from data1 to data2 on the device, we update the data in data2 on the host, and finally we check the results.
program main
use openacc
use test
implicit none
integer :: n_chunks, i_chunk, n1, n2, n3, i1, i2, i3, indi, inde
type(test_type) :: data1, data2
open(10,file='num.txt')
read(10,*) n1
read(10,*) n2
read(10,*) n3
read(10,*) n_chunks
close(10)
write(*,'(4(A,I3),A)') 'Data of size ', n1 , ', ', n2, ', ', n3, ', splitted in ', n_chunks, ' chunks along dim2'
data1%n1 = n1
data1%n2 = n2
data1%n3 = n3
data2%n1 = n1
data2%n2 = n2
data2%n3 = n3
allocate(data1%dat(n1,n2,n3),data2%dat(n1,n2,n3))
do i3 = 1,n3
do i2 = 1,n2
do i1 = 1,n1
data1%dat(i1,i2,i3) = i1 + (i2-1)*n1 + (i3-1)*n1*n2
end do
end do
end do
!$ACC ENTER DATA COPYIN(data1,data2)
!$ACC ENTER DATA CREATE(data1%dat,data2%dat)
!$ACC UPDATE DEVICE(data1%dat)
!$ACC PARALLEL DEFAULT(present) PRIVATE(indi,inde)
!$ACC LOOP GANG INDEPENDENT PRIVATE(indi,inde)
do i_chunk=1,n_chunks
! --- Current chunk indices ---
call get_indices(n2,i_chunk,n_chunks,indi,inde)
print *, "#", __pgi_gangidx(), ": ", indi, inde
data2%dat(:,indi:inde,:) = data1%dat(:,indi:inde,:)
!call copydat(data1,data2,indi,inde)
end do
!$ACC END LOOP
!$ACC END PARALLEL
!$ACC UPDATE SELF(data2%dat)
!$ACC EXIT DATA DELETE(data1%dat,data2%dat)
!$ACC EXIT DATA DELETE(data1,data2)
print *, maxval(abs(data1%dat-data2%dat))
deallocate(data1%dat,data2%dat)
end program main
The idea is that we want to split the data in the array along the 2nd dimension in n_chunks parts. Each gang should copy the data from data1 to data2. For testing I'm using in num.txt: 128 256 32 64
This means that each gang should copy a block of 128 x 4 x 32 from data1 to data2. The code is compiled on a Tesla P100 with nvfortran -acc -ta=tesla:cc60 -fast -Minfo=accel -o main.exe test.F90 Here are my questions:
- First, why if I remove the condition
PRIVATE(indi,inde)from the directive!$ACC PARALLEL DEFAULT(present)I get the following runtime error?
Failing in Thread:1 call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
I thought it was enough to declare the private variables on the loop directive, but in this example it seems like either I declare the two scalars as private on the parallel region, or I don't declare them as private at all (is the latter a problem, or since they are declared as intent(in) in get_indices, they are kept private?).
- If I use
data2%dat(:,indi:inde,:) = data1%dat(:,indi:inde,:)in the loop, I get the correct result and everything works fine. The loop is scheduled with 64 gangs and vectorized with a vector_length of 128. On the other hand, if I comment out the linedata2%dat(:,indi:inde,:) = data1%dat(:,indi:inde,:)and activate the commandcall copydat(data1,data2,indi,inde), the code tries to vectorize the operation with a vector_length of 32 and crashes with the error
Failing in Thread:1 call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
If I decrease n2 from 256 to 128, the code is still vectorized with a length of 32, but at least it doesn't crash. I don't understand this behaviour. Can someone explain?
EDIT:
Concerning question 1., I tried 4 different combinations of the private clause:
Try A:
!$ACC PARALLEL DEFAULT(present)
!$ACC LOOP GANG INDEPENDENT
Try B:
!$ACC PARALLEL DEFAULT(present) PRIVATE(indi,inde)
!$ACC LOOP GANG INDEPENDENT PRIVATE(indi,inde)
Try C:
!$ACC PARALLEL DEFAULT(present) PRIVATE(indi,inde)
!$ACC LOOP GANG INDEPENDENT
Try D:
!$ACC PARALLEL DEFAULT(present)
!$ACC LOOP GANG INDEPENDENT PRIVATE(indi,inde)
Try A, B, and C all work correctly, while Try D crashes with the error
Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
I'm a bit confused, since I was expecting "Try D" to force indi and inde private among gangs, but shared among workers and threads, whereas I was expecting the other three cases to have indi and inde private for gans and also for threads. What am I missing here? What is the behaviour of indi and inde (private/shared) for the 4 different cases outlined above?
Concerning question 2., I'd like to report also an additional strange behaviour: if instead of having deallocate(data1%dat,data2%dat) in program, I introduce the subroutine
subroutine deldat(data0)
implicit none
type(test_type), intent(inout) :: data0
!
deallocate(data0%dat)
end subroutine deldat
in the module test, and I call it in program with
call deldat(data1)
call deldat(data2)
(i.e., I replace the explicit deallocate with the subroutine calls), the code also crashes with the error
Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
One thing that I noticed is that when I compile the code with deallocate(data1%dat,data2%dat), the compiler gives
87, Loop is parallelizable
Loop carried reuse of data2%dat$p prevents parallelization
One the other hand, if I use the routine deldat to deallocate the arrays, the compiler writes
87, Loop is parallelizable
Loop carried reuse of dat$f prevents parallelization
Loop is parallelizable
Loop carried reuse of data2%dat$p prevents parallelization
that is, it adds the two lines
Loop is parallelizable
Loop carried reuse of dat$f prevents parallelization
to the message. I don't know if this is a bug of the compiler (the fact that the code crashes when I deallocate within the subroutine), or if it is something to be expected.
In the code I'm porting to GPU, the derived datatypes have a lot of different fields (not just one as in the example above). Therefore, we created three subroutines to allocate, copy, and deallocate the members of the derived datatypes. Unfortunately, at present I am unable to use these routines with OpenACC directives because of these problems.
By default, scalars are thread (vector) private, but when calling a seq routine from a gang region, only one thread executes the routine. Hence the other thread's "indi" and "inde" variables are uninitialized.
By putting the "private" on a gang loop, the scalars are gang private but shared by the vectors, hence all threads have initialized values.
This looks like a compiler issue to me so I went ahead and filed a problem report, TPR #35448, and sent it to engineering for investigation.
I know we've had other issues with derived types in the past (they're tricky to get correct).
The work arounds would be to inline the routine via the "-Minline" flag, or pass in the "dat" arrays directly rather than through the derived type.
I'll need to wait for engineering to understand the root cause, though for the vector length 32, that's expected. We lower the vector length to 32 when there's vector routines in order to support reductions in vector routines as well as to remove the need to synchronize threads. Given the kernel will use 150 registers for calls, the occupancy already gets lowered, so the additional threads don't really help or at least offset by the additional thread syncs.
If possible, you'll want to try inlining routines to avoid this overhead.
EDIT:
Adding more info since Fabio is still having issues while the code works for me. One possible difference is that I put the test module in the same source as the main program. In this case, the compiler is able to inline "copydat". If split into separate sources, it doesn't appear to get inlined.
To fix this, I changed the array syntax to use explicit loops. We may want to do this anyway so we can apply the "loop" directive to all dimensions. As seen in the compiler feedback, only the middle dimension is getting implicitly parallelized with the array syntax.
Also with array syntax, per the Fortran standard, the right-hand side must be fully evaluated before assignment, which means the compiler must create temp arrays to hold the intermediary results. While the compiler can often optimize away the temp arrays, it may not be in this case. Hence it's possible the issues he's seeing are with a heap overflow due to the allocation of the temp array. The device heap is quite small by default plus allocation can hurt performance.
Here's my updated version.
test_mod.F90:
test.F90:
Compilation:
Note that I'm using NVHPC 24.3, but tested with earlier versions of the compiler as well. Also, this is on a Pascal device to match what Fabio's using, but I tested on Ampere and Volta as well.
If using separate compilation, then we need to perform a two pass compilation. The first pass to collect the inline information into an inline library, then compile using this info in the second pass.