How to translate an OpenMP parallel region into MPI in Fortran


I coded a Discrete Element Model that computes the interaction between particles. As a first step, I used OpenMP and did some scaling analysis on a supercomputer cluster, and I am now ready to upgrade to MPI, as I determined that I could use many nodes efficiently. However, I don't know anything about MPI, and the examples I looked up online are either too easy or too complicated. I just want to dispatch each loop calculation to a new processor on a different node, and when that loop calculation is over, the next one in line is given to the newly idle processor. (For reference, I plan on running on 4 nodes with 64 CPUs each. Currently, the OpenMP version runs well on 1 node with 64 CPUs.)

My code only has one parallel region in OpenMP: what I want to do is split the parallel nested loop (over i and j) across multiple nodes (using all the CPUs on each), but I am not sure how to do this. Should I precalculate the exact number of loop iterations to send to each processor? Is there a way to dispatch each loop calculation as it comes, mimicking the schedule(guided) behavior in OpenMP? Should I use MPI_Bcast or MPI_Scatter? Should they be used in both loops, or is once enough? These are the kinds of questions I am asking myself.
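Roughly, the pattern I have in mind is something like the sketch below (pseudo-Fortran, not working code): each rank takes a subset of the i iterations, and the partial force arrays are summed across ranks afterwards; rank and nprocs would come from MPI_Comm_rank and MPI_Comm_size. Whether this should be a block split, a cyclic split, or a true dynamic dispatch is exactly what I am unsure about.

    ! sketch only: cyclic distribution of the i loop over the MPI ranks
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)    ! rank = 0 .. nprocs-1
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

    do i = n - rank, 1, -nprocs    ! each rank handles every nprocs-th value of i
        ! ... same loop body as in the OpenMP version ...
    end do

    ! then sum the partial fcx, fcy, fbx, fby, mc, mb arrays across the ranks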

The relevant (OpenMP) part of the code is:

subroutine stepper (tstep)

    use omp_lib
    use m_allocate, only: allocate
    use m_deallocate, only: deallocate
    use m_KdTree, only: KdTree, KdTreeSearch
    use dArgDynamicArray_Class, only: dArgDynamicArray
    use m_strings, only: str

    implicit none

    include "parameter.h"
    include "CB_variables.h"
    include "CB_const.h"
    include "CB_bond.h"
    include "CB_forcings.h"
    include "CB_options.h"

    integer :: i, j, k
    integer, intent(in) :: tstep
    type(KdTree) :: tree
    type(KdTreeSearch) :: search
    type(dArgDynamicArray) :: da

    ! Build the tree
    tree = KdTree(x, y)
    
    ! reinitialize force arrays for contact and bonds
    do i = 1, n
        mc(i)    = 0d0
        mb(i)    = 0d0
        fcx(i)   = 0d0
        fcy(i)   = 0d0
        fbx(i)   = 0d0
        fby(i)   = 0d0
        ! and total force arrays
        m(i)   = 0d0
        tfx(i) = 0d0
        tfy(i) = 0d0
    end do
    
    ! put yourself in the reference frame of the ith particle
    ! loop through all j particles and compute interactions

    !$omp parallel do schedule(guided) &
    !$omp private(i,j,da) &
    !$omp reduction(+:fcx,fcy,fbx,fby,mc,mb)
    do i = n, 1, -1
        ! Find all the particles j near i
        da = search%kNearest(tree, x, y, xQuery = x(i), yQuery = y(i), &
                            radius = r(i) + rtree)
        ! loop over the nearest neighbors except the first because this is the particle i
        do k = 1, size(da%i%values) - 1
            j = da%i%values(k + 1)

            ! only compute lower triangular matrices
            if (i .ge. j) then
                cycle
            end if

            ! compute relative position and velocity
            call rel_pos_vel (j, i)

            ! bond initialization
            if ( cohesion .eqv. .true. ) then
                if ( tstep .eq. 1 ) then
                    if ( deltan(j, i) .ge. -5d-1 ) then ! can be fancier
                        bond (j, i) = 1
                    end if
                    call bond_properties (j, i)
                end if
            end if

            ! verify if two particles are colliding
            if ( deltan(j,i) .gt. 0 ) then
                
                call contact_forces (j, i)
                !call bond_creation (j, i) ! to implement

                ! update contact force on particle i by particle j
                fcx(i) = fcx(i) - fcn(j,i) * cosa(j,i)
                fcy(i) = fcy(i) - fcn(j,i) * sina(j,i)

                ! update moment on particle i by particle j due to tangent contact
                mc(i) = mc(i) - r(i) * fct(j,i) - mcc(j,i)

                ! Newton's third law
                ! update contact force on particle j by particle i
                fcx(j) = fcx(j) + fcn(j,i) * cosa(j,i)
                fcy(j) = fcy(j) + fcn(j,i) * sina(j,i)

                ! update moment on particle j by particle i due to tangent contact
                mc(j) = mc(j) - r(j) * fct(j,i) + mcc(j,i)

            else
            
                call reset_contact (j, i)

            end if

            ! compute forces from bonds between particles i and j
            if ( bond (j, i) .eq. 1 ) then

                call bond_forces (j, i)
                call bond_breaking (j, i)

                if ( bond (j, i) .eq. 1 ) then
                    ! update force on particle i by j due to bond
                    fbx(i) = fbx(i) - fbn(j,i) * cosa(j,i) +    &
                                        fbt(j,i) * sina(j,i)
                    fby(i) = fby(i) - fbn(j,i) * sina(j,i) -    &
                                        fbt(j,i) * cosa(j,i)

                    ! update moment on particle i by j due to bond
                    mb(i) = mb(i) - r(i) * fbt(j,i) - mbb(j, i)

                    ! Newton's third law
                    ! update force on particle j by i due to bond
                    fbx(j) = fbx(j) + fbn(j,i) * cosa(j,i) -    &
                                        fbt(j,i) * sina(j,i)
                    fby(j) = fby(j) + fbn(j,i) * sina(j,i) +    &
                                        fbt(j,i) * cosa(j,i)

                    ! update moment on particle j by i due to bond
                    mb(j) = mb(j) - r(j) * fbt(j,i) + mbb(j, i)
                end if

            else

                call reset_bond (j, i)

            end if

            ! compute the sheltering height of particle j on particle i for air and water drag
            call sheltering(j, i)

        end do

        ! compute the total forcing from winds, currents and Coriolis on particle i
        call forcing (i)
        call coriolis(i)

    end do
    !$omp end parallel do

    ! deallocate tree memory
    call tree%deallocate()

    ! sum all forces together on particle i
    do i = 1, n
        tfx(i) = fcx(i) + fbx(i) + fax(i) + fwx(i) + fcorx(i)
        tfy(i) = fcy(i) + fby(i) + fay(i) + fwy(i) + fcory(i)

        ! sum all moments on particle i together
        m(i) =  mc(i) + mb(i) + ma(i) + mw(i)
    end do

    ! forces on side particles for experiments
    !call experiment_forces

    ! integration in time
    call velocity
    call euler

end subroutine stepper

Note that I know how to compile and run the code; my issue is with the actual logical translation from OpenMP to MPI. Obviously, my attempt is not compiling or running, because I don't understand how to use MPI properly. Here is what I tried so far:

subroutine stepper (tstep)

    use mpi
    
    ... ! same as above

    integer :: p, np , ierr
    integer, parameter :: master = 0
    double precision :: fcx_temp(n), fcy_temp(n), fbx_temp(n), fby_temp(n), mc_temp(n), mb_temp(n)

    ! Build the tree
    tree = KdTree(x, y)
    
    ! reinitialize force arrays for contact and bonds
    do i = 1, n
        mc(i)    = 0d0
        mb(i)    = 0d0
        fcx(i)   = 0d0
        fcy(i)   = 0d0
        fbx(i)   = 0d0
        fby(i)   = 0d0
        ! and total force arrays
        m(i)   = 0d0
        tfx(i) = 0d0
        tfy(i) = 0d0
    end do
    
    ! put yourself in the reference frame of the ith particle
    ! loop through all j particles and compute interactions

    call MPI_INIT(ierr) 
    call MPI_COMM_RANK(MPI_COMM_WORLD, p, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, np, ierr)

    call MPI_BCAST(n, 1, MPI_INTEGER, master, MPI_COMM_WORLD, ierr)

    do i = n, 1, -1

        ! Find all the particles j near i
        da = search%kNearest(tree, x, y, xQuery = x(i), yQuery = y(i), &
                            radius = r(i) + rtree)
        ! loop over the nearest neighbors except the first because this is the particle i
        do k = 1, size(da%i%values) - 1
            j = da%i%values(k + 1)

            ! only compute lower triangular matrices
            if (i .ge. j) then
                cycle
            end if

        ... ! same as above with: variables -> variables_temp

        end do

        ... ! same as above

        call MPI_REDUCE(fcx_temp, fcx, 1, MPI_DOUBLE_PRECISION, MPI_SUM, master, MPI_COMM_WORLD, ierr)
        call MPI_REDUCE(fcy_temp, fcy, 1, MPI_DOUBLE_PRECISION, MPI_SUM, master, MPI_COMM_WORLD, ierr)
        call MPI_REDUCE(fbx_temp, fbx, 1, MPI_DOUBLE_PRECISION, MPI_SUM, master, MPI_COMM_WORLD, ierr)
        call MPI_REDUCE(fby_temp, fby, 1, MPI_DOUBLE_PRECISION, MPI_SUM, master, MPI_COMM_WORLD, ierr)
        call MPI_REDUCE(mc_temp, mc, 1, MPI_DOUBLE_PRECISION, MPI_SUM, master, MPI_COMM_WORLD, ierr)
        call MPI_REDUCE(mb_temp, mb, 1, MPI_DOUBLE_PRECISION, MPI_SUM, master, MPI_COMM_WORLD, ierr)

    end do
    
    call MPI_FINALIZE(ierr)

    ... ! same as above

end subroutine stepper

Thanks!

There is 1 answer below.

The big difference between OpenMP and MPI is that with OpenMP, to first order, you only have to think about the division of work, because all threads have access to the same memory. Later, for performance fine-tuning, you will want to think a little harder, but for now: in OpenMP all threads share the same memory.

In MPI, where each process has its own data space, the division of work is often induced by a division of data: you have to decide which process gets which subset of the data. So your loop over n elements becomes a loop over roughly n/nprocs elements, and each process decides which elements it gets.
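For instance, here is a minimal (untested) sketch of such a split for your i loop, keeping every particle array replicated on every rank as a first pass; istart, iend and the _loc work arrays are names I am making up, not variables from your code:

    ! each rank works on its own block of the i loop
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

    istart = rank * n / nprocs + 1        ! standard block-distribution bounds
    iend   = (rank + 1) * n / nprocs

    fcx_loc(:) = 0d0                      ! local partial sums, dimension n
    fcy_loc(:) = 0d0
    do i = istart, iend
        ! ... same body as the OpenMP loop, accumulating into the _loc arrays ...
    end do

    ! combine the partial sums so every rank ends up with the full force arrays;
    ! note the count is n (the array length) and the call sits after the loop
    call MPI_Allreduce(fcx_loc, fcx, n, MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD, ierr)
    call MPI_Allreduce(fcy_loc, fcy, n, MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD, ierr)
    ! ... likewise for fbx, fby, mc, mb ...

Two remarks on your attempt: MPI_Init and MPI_Finalize should be called exactly once, in the main program, not inside stepper (which presumably runs every time step); and since the amount of work per i iteration varies (neighbor counts differ, and only j > i pairs are computed), a cyclic split (do i = rank+1, n, nprocs) or a manager/worker scheme may balance the load better than equal blocks, much like schedule(guided) did in OpenMP.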

The big problem is that, now that you've divided the data, processes may need data from a neighbor: the first and last points on a process have nearest neighbors that are owned by a neighboring process. This is known as a "boundary exchange" or "halo region" or something along those lines, and it is a standard example in basically every MPI course.
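Schematically, for a 1-D decomposition where each process owns nlocal points plus nhalo ghost points on either side (nlocal, nhalo and the array u are placeholder names, not variables from your code), one such exchange can be written with MPI_Sendrecv:

    ! u is declared as u(1-nhalo : nlocal+nhalo); indices 1..nlocal are the owned points
    left  = rank - 1
    right = rank + 1
    if (left  <  0)      left  = MPI_PROC_NULL   ! no neighbor at the domain edge
    if (right == nprocs) right = MPI_PROC_NULL

    ! send my last nhalo owned points to the right, receive my left halo from the left
    call MPI_Sendrecv(u(nlocal-nhalo+1), nhalo, MPI_DOUBLE_PRECISION, right, 0, &
                      u(1-nhalo),        nhalo, MPI_DOUBLE_PRECISION, left,  0, &
                      MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)

    ! send my first nhalo owned points to the left, receive my right halo from the right
    call MPI_Sendrecv(u(1),        nhalo, MPI_DOUBLE_PRECISION, left,  1, &
                      u(nlocal+1), nhalo, MPI_DOUBLE_PRECISION, right, 1, &
                      MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)

For a particle code like yours, the "halo" would be the set of particles within the cutoff radius of your subdomain boundary, which has to be re-sent whenever particles move.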