H2O cluster uneven distribution of performance usage


I set up a cluster with a 4-core (2 GHz) and a 16-core (1.8 GHz) virtual machine. Creating the cluster and connecting to it works without problems. But when I run deep learning on the cluster, I see very uneven CPU utilization across the two machines: the 4-core machine is always at 100% CPU usage, while the 16-core machine is idle most of the time.

Do I have to apply additional configuration during cluster creation? It seems odd that the stronger of the two machines sits idle while the weaker one does all the work.

Best regards, Markus

1 Answer (accepted)

Two things to keep in mind here.

  1. Your data needs to be large enough to take advantage of data parallelism. In particular, the number of chunks per column needs to be large enough for all the cores to have work to do. See this answer for more details: H2O not working on parallel

  2. H2O-3 assumes your nodes are symmetric. It does not try to load-balance work across the cluster based on the capability of each node. Faster nodes finish their share of the work first and then sit idle until the slower nodes catch up. (You can see the same effect with two symmetric nodes when one of them is busy running another process.)

Asymmetry is a bigger problem for memory (where smaller nodes can run out of memory and fail entirely) than it is for CPU (where some nodes are just waiting around). So always make sure to start each H2O node with the same value of -Xmx.

You can limit the number of cores H2O uses with the -nthreads option. So you can try giving each of your two nodes -nthreads 4 and see if they behave more symmetrically with each using roughly four cores. In the case you describe, that would mean the smaller machine is roughly 100% utilized and the larger machine is roughly 25% utilized. (But since the two machines probably have different chips, the cores are probably not identical and won't balance perfectly, which is OK.)
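As a reference point, here is a launch sketch for the two nodes. The heap size, cluster name, file paths, and IP addresses are placeholders for your environment; `-nthreads`, `-flatfile`, `-name`, and `-port` are standard H2O-3 launch options, and `-Xmx` is the usual JVM heap flag:

```shell
# flatfile.txt lists every node's IP:port, one per line, e.g.:
#   192.168.1.10:54321
#   192.168.1.11:54321

# Run this on EACH virtual machine. Use the same -Xmx on both nodes
# (H2O-3 assumes symmetric memory) and -nthreads 4 to cap both nodes
# at roughly four worker threads.
java -Xmx8g -jar h2o.jar \
     -name my-cluster \
     -flatfile flatfile.txt \
     -port 54321 \
     -nthreads 4
```

With both nodes capped at four threads, the work per node should be roughly equal, which is what makes the "100% vs. ~25% utilization" comparison in the text meaningful.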

[I'm ignoring the virtualization aspect completely, but CPU shares could also come into the picture depending on the configuration of your hypervisor.]