I coded a Discrete Element Model that computes the interactions between particles. As a first step, I used OpenMP and did some scaling analysis on a supercomputer cluster, and I am now ready to upgrade to MPI, as I determined that I could use many nodes efficiently. However, I don't know anything about MPI, and the examples I looked up online are either too easy or too complicated. I just want to dispatch each loop iteration to a processor across the different nodes, and when that iteration is over, the next one in line is given to the newly idle processor. (For reference, I plan on running on 4 nodes with 64 CPUs each. Currently, the OpenMP version runs well on 1 node with 64 CPUs.)
My code only has one parallel region in OpenMP: what I want to do is split the nested parallel loop (on i and j) across multiple nodes, using all the CPUs on each. But I am not sure how to do this. Should I precalculate the exact number of loop elements to send to each processor? Is there a way to dispatch each loop iteration as it comes, mimicking the schedule(guided) behavior in OpenMP? Should I use MPI_Bcast or MPI_Scatter? Should they be used in both loops, or will just once work? These are the kinds of questions I am asking myself.
The relevant (OpenMP) part of the code is:
subroutine stepper (tstep)

    use omp_lib
    use m_allocate, only: allocate
    use m_deallocate, only: deallocate
    use m_KdTree, only: KdTree, KdTreeSearch
    use dArgDynamicArray_Class, only: dArgDynamicArray
    use m_strings, only: str

    implicit none

    include "parameter.h"
    include "CB_variables.h"
    include "CB_const.h"
    include "CB_bond.h"
    include "CB_forcings.h"
    include "CB_options.h"

    integer :: i, j, k
    integer, intent(in) :: tstep

    type(KdTree) :: tree
    type(KdTreeSearch) :: search
    type(dArgDynamicArray) :: da

    ! Build the tree
    tree = KdTree(x, y)

    ! reinitialize force arrays for contacts and bonds
    do i = 1, n
        mc(i)  = 0d0
        mb(i)  = 0d0
        fcx(i) = 0d0
        fcy(i) = 0d0
        fbx(i) = 0d0
        fby(i) = 0d0
        ! and total force arrays
        m(i)   = 0d0
        tfx(i) = 0d0
        tfy(i) = 0d0
    end do

    ! put yourself in the referential of the ith particle
    ! loop through all j particles and compute interactions
    !$omp parallel do schedule(guided) &
    !$omp private(i,j,da) &
    !$omp reduction(+:fcx,fcy,fbx,fby,mc,mb)
    do i = n, 1, -1

        ! find all the particles j near i
        da = search%kNearest(tree, x, y, xQuery = x(i), yQuery = y(i), &
                             radius = r(i) + rtree)

        ! loop over the nearest neighbors, except the first because that is particle i itself
        do k = 1, size(da%i%values) - 1
            j = da%i%values(k + 1)

            ! only compute lower triangular matrices
            if (i .ge. j) then
                cycle
            end if

            ! compute relative position and velocity
            call rel_pos_vel (j, i)

            ! bond initialization
            if ( cohesion .eqv. .true. ) then
                if ( tstep .eq. 1 ) then
                    if ( deltan(j, i) .ge. -5d-1 ) then ! can be fancier
                        bond (j, i) = 1
                    end if
                    call bond_properties (j, i)
                end if
            end if

            ! verify if two particles are colliding
            if ( deltan(j,i) .gt. 0 ) then
                call contact_forces (j, i)
                !call bond_creation (j, i) ! to implement

                ! update contact force on particle i by particle j
                fcx(i) = fcx(i) - fcn(j,i) * cosa(j,i)
                fcy(i) = fcy(i) - fcn(j,i) * sina(j,i)

                ! update moment on particle i by particle j due to tangent contact
                mc(i) = mc(i) - r(i) * fct(j,i) - mcc(j,i)

                ! Newton's third law
                ! update contact force on particle j by particle i
                fcx(j) = fcx(j) + fcn(j,i) * cosa(j,i)
                fcy(j) = fcy(j) + fcn(j,i) * sina(j,i)

                ! update moment on particle j by particle i due to tangent contact
                mc(j) = mc(j) - r(j) * fct(j,i) + mcc(j,i)
            else
                call reset_contact (j, i)
            end if

            ! compute forces from bonds between particles i and j
            if ( bond (j, i) .eq. 1 ) then
                call bond_forces (j, i)
                call bond_breaking (j, i)

                if ( bond (j, i) .eq. 1 ) then
                    ! update force on particle i by j due to bond
                    fbx(i) = fbx(i) - fbn(j,i) * cosa(j,i) + &
                                      fbt(j,i) * sina(j,i)
                    fby(i) = fby(i) - fbn(j,i) * sina(j,i) - &
                                      fbt(j,i) * cosa(j,i)

                    ! update moment on particle i by j due to bond
                    mb(i) = mb(i) - r(i) * fbt(j,i) - mbb(j, i)

                    ! Newton's third law
                    ! update force on particle j by i due to bond
                    fbx(j) = fbx(j) + fbn(j,i) * cosa(j,i) - &
                                      fbt(j,i) * sina(j,i)
                    fby(j) = fby(j) + fbn(j,i) * sina(j,i) + &
                                      fbt(j,i) * cosa(j,i)

                    ! update moment on particle j by i due to bond
                    mb(j) = mb(j) - r(j) * fbt(j,i) + mbb(j, i)
                end if
            else
                call reset_bond (j, i)
            end if

            ! compute sheltering height of particle j on particle i for air and water drag
            call sheltering(j, i)
        end do

        ! compute the total forcing from winds, currents and Coriolis on particle i
        call forcing (i)
        call coriolis(i)
    end do
    !$omp end parallel do

    ! deallocate tree memory
    call tree%deallocate()

    ! sum all forces on particle i together
    do i = 1, n
        tfx(i) = fcx(i) + fbx(i) + fax(i) + fwx(i) + fcorx(i)
        tfy(i) = fcy(i) + fby(i) + fay(i) + fwy(i) + fcory(i)

        ! sum all moments on particle i together
        m(i) = mc(i) + mb(i) + ma(i) + mw(i)
    end do

    ! forces on side particles for experiments
    !call experiment_forces

    ! integration in time
    call velocity
    call euler

end subroutine stepper
Note that I know how to compile and run the code; my issue is with the actual logical translation from OpenMP to MPI. Obviously, it does not compile or run, because I don't understand how to use MPI properly. Here is what I tried so far:
subroutine stepper (tstep)

    use mpi
    ... ! same as above

    integer :: p, np, ierr
    integer, parameter :: master = 0
    double precision :: fcx_temp(n), fcy_temp(n), fbx_temp(n), fby_temp(n), mc_temp(n), mb_temp(n)

    ! Build the tree
    tree = KdTree(x, y)

    ! reinitialize force arrays for contacts and bonds
    do i = 1, n
        mc(i)  = 0d0
        mb(i)  = 0d0
        fcx(i) = 0d0
        fcy(i) = 0d0
        fbx(i) = 0d0
        fby(i) = 0d0
        ! and total force arrays
        m(i)   = 0d0
        tfx(i) = 0d0
        tfy(i) = 0d0
    end do

    ! put yourself in the referential of the ith particle
    ! loop through all j particles and compute interactions
    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, p, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, np, ierr)
    call MPI_BCAST(n, 1, MPI_INTEGER, master, MPI_COMM_WORLD, ierr)

    do i = n, 1, -1

        ! find all the particles j near i
        da = search%kNearest(tree, x, y, xQuery = x(i), yQuery = y(i), &
                             radius = r(i) + rtree)

        ! loop over the nearest neighbors, except the first because that is particle i itself
        do k = 1, size(da%i%values) - 1
            j = da%i%values(k + 1)

            ! only compute lower triangular matrices
            if (i .ge. j) then
                cycle
            end if

            ... ! same as above with: variables -> variables_temp
        end do

        ... ! same as above

        call MPI_REDUCE(fcx_temp, fcx, 1, MPI_DOUBLE_PRECISION, MPI_SUM, master, MPI_COMM_WORLD, ierr)
        call MPI_REDUCE(fcy_temp, fcy, 1, MPI_DOUBLE_PRECISION, MPI_SUM, master, MPI_COMM_WORLD, ierr)
        call MPI_REDUCE(fbx_temp, fbx, 1, MPI_DOUBLE_PRECISION, MPI_SUM, master, MPI_COMM_WORLD, ierr)
        call MPI_REDUCE(fby_temp, fby, 1, MPI_DOUBLE_PRECISION, MPI_SUM, master, MPI_COMM_WORLD, ierr)
        call MPI_REDUCE(mc_temp, mc, 1, MPI_DOUBLE_PRECISION, MPI_SUM, master, MPI_COMM_WORLD, ierr)
        call MPI_REDUCE(mb_temp, mb, 1, MPI_DOUBLE_PRECISION, MPI_SUM, master, MPI_COMM_WORLD, ierr)

    end do

    call MPI_FINALIZE(ierr)

    ... ! same as above

end subroutine stepper
Thanks!
The big difference between OpenMP and MPI is that with OpenMP, to first order, you only have to think about the division of work, because all threads have access to the same memory. Later, for performance fine-tuning, you want to think a little harder, but for now: in OpenMP all threads share the same memory.
In MPI, where each process has its own data space, the division of work is often induced by a division of data: you have to decide which process gets which subset of the data. So your loop over n elements becomes a loop over roughly n/nprocs elements on each process, and each process decides which elements it gets.
The big problem is that now you have divided the data, but a process may need data from a neighbor: the first and last points on a process have nearest neighbors that are owned by a neighboring process. This is known as "boundary exchange" or "halo regions" or things like that, and it is a standard example in basically every MPI course.
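To make that concrete for your loop, here is a minimal sketch (not a drop-in replacement) of the "divide the i loop, then sum the partial results" pattern. It assumes every rank keeps a full copy of the particle arrays, as in your attempt where each rank builds the whole KdTree, and that your include files provide n, fcx, fcy, fbx, fby, mc and mb as before. The subroutine name and the variables rank, np, chunk, istart, iend are purely illustrative; the loop body is your existing OpenMP loop body, unchanged.
subroutine stepper_sketch (tstep)

    use mpi
    implicit none

    include "parameter.h"     ! same includes as in your stepper: provide n, ...
    include "CB_variables.h"  ! ... fcx, fcy, fbx, fby, mc, mb, x, y, r, etc.

    integer, intent(in) :: tstep
    integer :: i, rank, np, ierr, chunk, istart, iend

    ! MPI_Init / MPI_Finalize should be called once, in the main program,
    ! not inside stepper, which runs every time step.
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, np, ierr)

    ! static block division of the i loop: each rank owns roughly n/np values of i
    chunk  = (n + np - 1) / np
    istart = rank * chunk + 1
    iend   = min(n, (rank + 1) * chunk)

    ! every rank zeroes its own copy of the accumulation arrays
    fcx = 0d0; fcy = 0d0; fbx = 0d0; fby = 0d0; mc = 0d0; mb = 0d0

    do i = istart, iend
        ! ... the body of your OpenMP loop, unchanged: kNearest search,
        !     contact_forces, bond_forces, etc., accumulating into
        !     fcx, fcy, fbx, fby, mc, mb on this rank only
    end do

    ! sum the partial arrays over all ranks so that every rank ends up with
    ! the complete result; this is the MPI analogue of your OpenMP reduction
    ! clause. Note that the count argument is n (the array length), not 1.
    call MPI_Allreduce(MPI_IN_PLACE, fcx, n, MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD, ierr)
    call MPI_Allreduce(MPI_IN_PLACE, fcy, n, MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD, ierr)
    call MPI_Allreduce(MPI_IN_PLACE, fbx, n, MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD, ierr)
    call MPI_Allreduce(MPI_IN_PLACE, fby, n, MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD, ierr)
    call MPI_Allreduce(MPI_IN_PLACE, mc,  n, MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD, ierr)
    call MPI_Allreduce(MPI_IN_PLACE, mb,  n, MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD, ierr)

    ! after this, the serial tail (summing tfx/tfy, velocity, euler) can run
    ! identically on every rank, since all ranks now hold the same arrays

end subroutine stepper_sketch
With fully replicated arrays like this there is no halo exchange yet; that only becomes necessary once you also distribute x and y themselves, at which point neighboring ranks exchange the particles near their boundaries (e.g. with MPI_Sendrecv). A static split like this also does not reproduce the dynamic, schedule(guided)-style dispatch you asked about; that would be a manager-worker setup, which you can add later if the static split turns out to be badly load-balanced.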