Accelerating a Fortran code with OpenACC using the PGI compiler, I got problems with a matmul
call in an accelerated loop.
In the simplified example, I apply the identity matrix on two vectors, so the input and the output values should be the same:
program test
implicit none
integer :: a(3, 3)
integer :: v1(3, 2), v2(3, 2)
integer :: i
a = reshape([1, 0, 0, 0, 1, 0, 0, 0, 1], [3, 3])
v1 = reshape([1, 2, 3, 4, 5, 6], [3, 2])
print *, v1
!$acc kernels copyin(a, v1) copyout(v2)
!$acc loop independent
do i = 1, 2
v2(:, i) = matmul(a, v1(:, i))
enddo
!$acc end kernels
print *, v2
endprogram
When compiling with the PGI compiler version 20.9, I got these information:
test:
12, Generating copyin(a(:,:),v1(:,:)) [if not already present]
Generating implicit copyout(z_a_0(:)) [if not already present]
Generating copyout(v2(:,:)) [if not already present]
14, Loop is parallelizable
Generating Tesla code
14, !$acc loop gang ! blockidx%x
15, !$acc loop vector(32) ! threadidx%x
15, Loop is parallelizable
Running the code gives the following values:
1 2 3 4 5 6
4 5 6 4 5 6
the second line should be like the first one, which is the case on sequential execution. What is wrong in the code?
Looks to be a compiler issue. The problem line being:
"z_a_0" is compiler temp array being created to hold the intermediary result from the call to matmul. It's declaration is being hoisted out of the loop and then copied back in as shared array. Since it's shared, it then causes a race condition.
I've submitted a problem report (TPR #29482) and sent it to our engineers for further investigation.