Order of summation in MPI-reduce operations

We know that different summation orders of floating-point numbers can lead to different results.
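
For example, a short C program shows that the result of a float sum depends on how the values are grouped (the values here are just chosen to make the rounding visible):

#include <stdio.h>

int main(void)
{
    /* Floating-point addition is not associative: the rounded result
       depends on the order in which the values are combined. */
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;

    float left  = (a + b) + c;   /* (1e8 - 1e8) + 1  ->  1.0 */
    float right = a + (b + c);   /* -1e8 + 1 rounds back to -1e8  ->  0.0 */

    printf("(a + b) + c = %g\n", left);
    printf("a + (b + c) = %g\n", right);
    return 0;
}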

Consider the MPI function MPI_Reduce called with the MPI_SUM operation.

#include <mpi.h>
int MPI_Reduce(const void *sendbuf, void *recvbuf, int count,
               MPI_Datatype datatype, MPI_Op op, int root,
               MPI_Comm comm)
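
For concreteness, a minimal call that sums one double per rank onto rank 0 might look like this (the surrounding setup is only an illustrative sketch):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local  = 1.0 / (rank + 1);  /* each rank contributes one value */
    double global = 0.0;

    /* Sum the contributions of all ranks into 'global' on rank 0. */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %.17g\n", global);

    MPI_Finalize();
    return 0;
}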

Is it guaranteed by the MPI standard or MPI implementations that every time we run the function with the same input and output data, the results will be the same?

This is what I found in the documentation:

The ‘‘canonical’’ evaluation order of a reduction is determined by the ranks of the processes in the group. However, the implementation can take advantage of associativity, or associativity and commutativity, in order to change the order of evaluation.

But this does not give any insight into repeatability.

2 Answers

BEST ANSWER

The actual standard gives some further insight:

Advice to implementors. It is strongly recommended that MPI_REDUCE be implemented so that the same result be obtained whenever the function is applied on the same arguments, appearing in the same order. Note that this may prevent optimizations that take advantage of the physical location of ranks. (End of advice to implementors.)

So, while there is no guarantee, I would expect that implementations follow this recommendation and do produce reproducible results.
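
If bitwise reproducibility matters more than performance, one workaround (not something the standard mandates, just a sketch) is to bypass MPI_SUM and fix the order yourself, e.g. gather the contributions to the root and add them in rank order:

#include <mpi.h>
#include <stdlib.h>

/* Deterministic alternative to MPI_Reduce(..., MPI_SUM, ...):
   gather all contributions to the root and add them in rank order,
   so the summation order is fixed by construction. */
double reduce_sum_in_rank_order(double local, int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    double *all = NULL;
    if (rank == root)
        all = malloc((size_t)size * sizeof(double));

    MPI_Gather(&local, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE, root, comm);

    double sum = 0.0;
    if (rank == root) {
        for (int i = 0; i < size; ++i)   /* fixed order: rank 0, 1, 2, ... */
            sum += all[i];
        free(all);
    }
    return sum;   /* meaningful only on the root, like MPI_Reduce */
}

This trades the tree-based reduction's scalability for a fixed, rank-ordered summation.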

SECOND ANSWER

If you have the same number of ranks with identical physical placement across nodes and cores each time you run, then you would probably expect the same result each time (though, as noted above, the standard does not guarantee this).

In practice, on shared-use HPC systems, you rarely get exactly the same placement, so the reduction order usually differs and you see small differences caused by the different order of the reduction operations.

I should also say: even if you consistently replicate the physical layout, the operation order can still differ due to varying conditions on shared infrastructure (interconnect or storage, or even the nodes themselves if they are not used exclusively). If other users are loading the system in different ways, this can change the order in which data reaches each rank and thus the order of the operations (depending on the parallel reduction algorithm).
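
If you want to check this on your own system, one simple approach (a sketch, assuming one double per rank) is to print the reduced value's exact bit pattern and compare it across runs:

#include <mpi.h>
#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank contributes a value whose sum is sensitive to rounding. */
    double local  = 1.0 / (3.0 * (rank + 1));
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        uint64_t bits;
        memcpy(&bits, &global, sizeof bits);
        /* Print the exact bit pattern so runs can be compared byte for byte. */
        printf("sum = %.17g (bits 0x%016llx)\n", global,
               (unsigned long long)bits);
    }

    MPI_Finalize();
    return 0;
}

If the bit pattern differs between otherwise identical runs, the reduction order (and hence the rounding) changed between them.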