I was going through the accumulate and atomic RMA calls in MPI-3. After reading, I found that there is an MPI_REPLACE operator that can be used in MPI_ACCUMULATE to get functionality similar to MPI_PUT. From what I understood, concurrent MPI_ACCUMULATE calls are not erroneous, unlike concurrent MPI_PUT calls. In my application, whenever I want to update data I currently use an EXCLUSIVE_LOCK for MPI_PUT. But this causes severe performance degradation, as even updates to different memory locations on the target process happen sequentially. Since SHARED_LOCK is valid with MPI_ACCUMULATE, is using MPI_ACCUMULATE with MPI_REPLACE inside a SHARED_LOCK always a better alternative than using MPI_PUT with an EXCLUSIVE_LOCK? Or am I misunderstanding something? Similarly, on a minor note, is MPI_GET_ACCUMULATE with MPI_NO_OP always better than MPI_GET?
So basically my question is: is removing all MPI_PUT calls that are currently synchronized by an EXCLUSIVE_LOCK, and replacing them with MPI_ACCUMULATE with MPI_REPLACE synchronized by a SHARED_LOCK, a valid and better alternative, given that it removes the need to acquire an EXCLUSIVE_LOCK on the whole target process window?
MPI_ACCUMULATE with MPI_REPLACE is an atomic put. It is neither better nor worse than MPI_PUT in general, but it is almost certainly better than MPI_PUT under exclusive locks when one requires element-wise atomicity.
The recommended model for MPI-3 RMA is to use MPI_WIN_LOCK_ALL for the lifetime of the window, and use element-wise RMA operations or some form of mutual exclusion (mentioned in https://stackoverflow.com/a/75927929/2189128) for anything else.
Use MPI_WIN_FLUSH(_LOCAL)(_ALL) to achieve the appropriate synchronization without terminating the epoch. Use the _LOCAL versions if you only care about reusing the buffer, or if the RMA operation has round-trip semantics (e.g. get, get_accumulate, fetch_and_op, compare_and_swap). Use the _ALL versions to complete at all targets in the window, as opposed to just one.