I'm trying to set up distributed training across two NVIDIA Docker containers. It works on a single host, but it fails when I use two hosts. How can I fix this?
I tried this command:
horovodrun -np 3 -H localhost:1 -p 12345 python keras_mnist_advanced.py
It worked, but when I tried:
horovodrun -np 3 -H localhost:1,192.168.0.20:2 -p 12345 python keras_mnist_advanced.py
I got this error:
Launching horovodrun task function was not successful: horovod.run.common.util.network.NoValidAddressesFound: Unable to connect to the horovodrun task service #1 on any of the addresses:{'lo': [('127.0.0.1', 30871)], 'docker0': [('172.17.0.1', 30871)], 'enp0s31f6': [('192.168.0.20', 30871)]}
Please take a look at these issues raised on the Horovod repository:
1) https://github.com/horovod/horovod/issues/975
2) https://github.com/horovod/horovod/issues/971
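
While reading through those, it may help to confirm whether the launching host can open a TCP connection to the addresses listed in the error at all. The following is only a minimal diagnostic sketch, not something taken from the linked issues: `can_connect` and `probe` are hypothetical helper names, and the addresses and port are copied from the error message in the question purely for illustration. The task service port appears to differ per run, so substitute the values from your own error output and run the check while `horovodrun` is still waiting.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def probe(addresses, timeout=2.0):
    """Map each (host, port) pair to whether a TCP connection could be opened."""
    return {(host, port): can_connect(host, port, timeout) for host, port in addresses}


if __name__ == "__main__":
    # Illustrative values taken from the error message; replace with your own.
    candidates = [
        ("127.0.0.1", 30871),
        ("172.17.0.1", 30871),
        ("192.168.0.20", 30871),
    ]
    for (host, port), ok in probe(candidates, timeout=1.0).items():
        print(f"{host}:{port} -> {'reachable' if ok else 'unreachable'}")
```

If every address shows as unreachable from inside the container, the cause is more likely the container's network or port configuration than the training script itself.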