Discrepancy in results between OpenMP/OpenACC implementation and gcc/PGI compilers


I have a larger Fortran program that I am trying to convert so that the computationally intensive part runs on an NVIDIA GPU using OpenMP and/or OpenACC. During development I had trouble understanding how variables declared in a module can be used within subroutines that are executed on the GPU (some of which are also executed on the CPU). I therefore created a small example and experimented with it, adding the corresponding OpenMP and OpenACC directives. The three files that comprise my example are included at the end of this message.

Just as I thought that I had understood things and that my example program works, I noticed the following:

  • I compile the program with gcc 10.2 using the OpenMP directives:
gfortran -O3 -fopenmp -Wall -Wextra test_link.f90 parameters.f90 common_vars.f90 -o test_link

The results are as expected, i.e. all elements of array XMO are 1, of DCP are 2, of IS1 are 3 and of IS2 are 24 (that is, 4 * NR with NR = 6).

  • I compile the program with PGI compiler 19.10 community edition using the OpenACC directives:
pgfortran -O4 -acc -ta=tesla,cc35 -Minfo=all,mp,accel -Mcuda=cuda10.0 test_link.f90 common_vars.f90 parameters.f90 -o test_link

The results are the same as above.

  • I compile the program with gcc 10.2 using the OpenACC directives:
gfortran -O3 -fopenacc -Wall -Wextra test_link.f90 parameters.f90 common_vars.f90 -o test_link

The results for arrays XMO, DCP and IS1 are correct, but all elements of IS2 are 0. It is easy to verify that this happens because variable NR has a value of 0 inside the offloaded loop.
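One way such a check could be written is sketched below. This is only an illustration under stated assumptions: the program name CHECK_NR and the variable NR_DEV are hypothetical and not part of the three example files, and the sketch relies on the same `!$ACC DECLARE COPYIN(NR)` in module COMMON_VARS. It copies the value of NR as seen by device code back to the host so that the two can be printed side by side.

```fortran
! Hypothetical diagnostic program, not one of the three example files.
! It copies the value of NR seen by device code back into NR_DEV.
PROGRAM CHECK_NR
  USE COMMON_VARS
  IMPLICIT NONE

  INTEGER :: NR_DEV      ! hypothetical helper variable

  NR     = 6
  NR_DEV = -1            ! sentinel value, overwritten by the copy-out

!$ACC UPDATE DEVICE(NR)
!$ACC PARALLEL NUM_GANGS(1) COPYOUT(NR_DEV)
  NR_DEV = NR            ! reads the device copy of the module variable
!$ACC END PARALLEL

  WRITE(*, *) 'NR on host = ', NR, ' NR seen in device code = ', NR_DEV
END PROGRAM CHECK_NR
```

Assuming the sketch is saved as check_nr.f90 (a hypothetical file name), it could be built with something like `gfortran -fopenacc parameters.f90 common_vars.f90 check_nr.f90 -o check_nr`, listing the module files first so that their .mod files exist when the main program is compiled.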

My understanding is that the OpenMP and OpenACC versions of my example are equivalent, but I cannot figure out why the OpenACC version works only with the PGI compiler and not with gcc.

If possible, please provide solutions that do not require changes in the code but only in the directives. As I mentioned, my original code is much larger, contains many more module variables and calls many more subroutines in the code to be executed on the GPU. Changes in that code will be much more difficult to do and obviously I would prefer to do that only if really necessary.

Thank you in advance!

The files of my example follow.

File parameters.f90
MODULE PARAMETERS
  IMPLICIT NONE
  INTEGER, PARAMETER :: MAX_SOURCE_POSITIONS = 100
END MODULE PARAMETERS

File common_vars.f90
MODULE COMMON_VARS
  USE PARAMETERS
  IMPLICIT NONE

!$OMP DECLARE TARGET TO(NR)
  INTEGER :: NR
!$ACC DECLARE COPYIN(NR)

END MODULE COMMON_VARS

File test_link.f90
      SUBROUTINE TEST()
       USE COMMON_VARS
        IMPLICIT NONE
!$OMP DECLARE TARGET
!$ACC ROUTINE SEQ
        INTEGER I
        I = NR
      END SUBROUTINE TEST


      PROGRAM TEST_LINK

      USE COMMON_VARS
      USE PARAMETERS

      IMPLICIT NONE

      INTERFACE
        SUBROUTINE TEST()
!$OMP DECLARE TARGET
!$ACC ROUTINE SEQ
        END SUBROUTINE TEST
      END INTERFACE

      REAL    :: XMO(MAX_SOURCE_POSITIONS), DCP(MAX_SOURCE_POSITIONS)
      INTEGER :: IS1(MAX_SOURCE_POSITIONS), IS2(MAX_SOURCE_POSITIONS)

      INTEGER :: X, Y, Z, MAX_X, MAX_Y, MAX_Z, ISOUR

      MAX_X = 3
      MAX_Y = 4
      MAX_Z = 5
      NR    = 6

!$OMP TARGET UPDATE TO(NR)
!$OMP TARGET MAP(TOFROM:IS1,IS2,DCP,XMO)
!$OMP TEAMS DISTRIBUTE PARALLEL DO COLLAPSE(3)
!$ACC UPDATE DEVICE(NR)
!$ACC PARALLEL LOOP GANG WORKER COLLAPSE(3) INDEPENDENT &
!$ACC COPY(IS1,IS2,DCP,XMO)
      DO X = 1, MAX_X
         DO Y = 1, MAX_Y
            DO Z = 1, MAX_Z

               ISOUR = (X - 1)*MAX_Y*MAX_Z + (Y - 1)*MAX_Z + Z

               XMO(ISOUR) = 1.0
               DCP(ISOUR) = 2.0
               IS1(ISOUR) = 3
               IS2(ISOUR) = 4   * NR

               CALL TEST()

            ENDDO  ! End of z loop
         ENDDO     ! End of y loop
      ENDDO        ! End of x loop
!$ACC END PARALLEL LOOP
!$OMP END TEAMS DISTRIBUTE PARALLEL DO
!$OMP END TARGET

      DO X = 1, MAX_X
         DO Y = 1, MAX_Y
            DO Z = 1, MAX_Z

               ISOUR = (X - 1)*MAX_Y*MAX_Z + (Y - 1)*MAX_Z + Z

               WRITE(*, *) 'ISOUR = ', ISOUR, 'XMO = ', XMO(ISOUR), 'DCP = ', DCP(ISOUR), 'IS1 = ', IS1(ISOUR), 'IS2 = ', IS2(ISOUR)

            ENDDO  ! End of z loop
         ENDDO     ! End of y loop
      ENDDO        ! End of x loop

      END PROGRAM TEST_LINK