I want to balance a set of matrix blocks across processes. The blocks have different sizes; typically a single block dominates, being of similar size to, or even larger than, all the other blocks combined. The number of processes may be much larger or much smaller than the number of blocks. Each block can be stored either on a single process or distributed as a ScaLAPACK array. The balancing should qualitatively fulfill the following conditions:
- No process should receive many more matrix elements than target_load = sum(size(blocks[:])) / n_procs
- No block should be distributed over many more processes than size(block) / target_load. MPI communicators may be split off from MPI_COMM_WORLD
- MPI communicators must not partially overlap: the process groups of two distributed blocks must be either identical or disjoint (blocks 1 and 2 both being distributed over processes 0:4 is fine, but block 1 over processes 0:3 and block 2 over processes 2:5 is not; undistributed blocks may be stacked arbitrarily on top of distributed blocks)
I am aware that the resulting distribution will depend on how strongly, and in which order of priority, the first two conditions are applied (the third condition must hold strictly); the two thresholds are restated as code below. Nonetheless, is there an algorithm that implements some reasonable interpretation of these conditions?
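To make the first two conditions concrete, this is how I would compute the two thresholds (Python-style pseudocode; `block_sizes` and `n_procs` are placeholders for my actual data, not part of any library):

```python
import math

def thresholds(block_sizes, n_procs):
    """Per-process load target (condition 1) and, per block, a soft cap
    on how many processes that block may be distributed over (condition 2)."""
    target_load = sum(block_sizes) / n_procs
    max_procs = [max(1, math.ceil(s / target_load)) for s in block_sizes]
    return target_load, max_procs
```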
This is a pretty common problem in computer science, so an off-the-shelf library should be able to help. Check out METIS and/or SCOTCH to see whether either of them suits your needs.
Your first condition is 'load balance' and your second is something like 'communication cost' (i.e., the cost of the MPI communication incurred by distributed blocks).
The proper balance between these two conditions depends entirely on the nature of your problem, but with SCOTCH or METIS you should be able to tweak the corresponding parameters until you find the best combination.
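If a full graph partitioner turns out to be overkill (your blocks do not really form a graph unless you add coupling information), a simple greedy heuristic already enforces the strict third condition: hand out disjoint, contiguous process ranges to the large blocks and stack the small undistributed blocks on the least-loaded processes. A minimal sketch under those assumptions (plain Python with hypothetical names, not the METIS or SCOTCH API):

```python
import math

def greedy_layout(block_sizes, n_procs):
    """Greedy sketch: each large block gets a fresh contiguous process
    range (distributed groups stay disjoint, so condition 3 holds
    strictly); every other block lands on the currently least-loaded
    process, which may sit inside a distributed group (allowed, since
    undistributed blocks may be stacked on distributed ones)."""
    target = sum(block_sizes) / n_procs
    order = sorted(range(len(block_sizes)),
                   key=lambda b: block_sizes[b], reverse=True)
    load = [0.0] * n_procs
    layout = {}        # block index -> list of owning process ranks
    next_free = 0      # start of the still-unassigned contiguous range
    for b in order:
        n_owners = max(1, math.floor(block_sizes[b] / target))
        if n_owners > 1 and next_free + n_owners <= n_procs:
            ranks = list(range(next_free, next_free + n_owners))
            next_free += n_owners
        else:
            # single-process fallback; may exceed the target if the
            # block is large but no contiguous range is left
            ranks = [min(range(n_procs), key=load.__getitem__)]
        for r in ranks:
            load[r] += block_sizes[b] / len(ranks)
        layout[b] = ranks
    return layout, load
```

For example, `greedy_layout([100, 10, 10, 5], 4)` gives the dominant block the contiguous group {0, 1, 2} and stacks the three small blocks on process 3, with loads of roughly 33 elements on each process in the group and 25 on process 3.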