How to fix: horovod.run.common.util.network.NoValidAddressesFound

I'm trying to run distributed training across two NVIDIA Docker containers. It works on a single host, but fails as soon as I add a second host. How do I fix this problem?

I tried this command:

horovodrun -np 3 -H localhost:1 -p 12345  python keras_mnist_advanced.py

It worked, but when I tried:

horovodrun -np 3 -H localhost:1,192.168.0.20:2 -p 12345  python keras_mnist_advanced.py

I got this error:

Launching horovodrun task function was not successful: horovod.run.common.util.network.NoValidAddressesFound: Unable to connect to the horovodrun task service #1 on any of the addresses:{'lo': [('127.0.0.1', 30871)], 'docker0': [('172.17.0.1', 30871)], 'enp0s31f6': [('192.168.0.20', 30871)]}

There is 1 best solution below

Please look at these issues raised on the Horovod repository:

1) https://github.com/horovod/horovod/issues/975

2) https://github.com/horovod/horovod/issues/971
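
While reading through those issues, it may help to confirm which of the addresses in the error message can actually be reached over TCP from the other container. The following is a minimal diagnostic sketch, not part of Horovod: the function name `probe_addresses` and the `timeout` parameter are made up for illustration, and it assumes the dictionary layout shown in the error message (interface name mapped to a list of `(ip, port)` pairs).

```python
import socket


def probe_addresses(addresses, timeout=2.0):
    """Try a TCP connection to every (ip, port) pair, grouped by interface.

    `addresses` is assumed to have the same shape as the dictionary printed
    in the NoValidAddressesFound message. This helper is hypothetical and
    does not use any Horovod API.
    """
    results = {}
    for iface, endpoints in addresses.items():
        results[iface] = []
        for ip, port in endpoints:
            try:
                # create_connection raises OSError on refusal, timeout,
                # or an unreachable network.
                with socket.create_connection((ip, port), timeout=timeout):
                    reachable = True
            except OSError:
                reachable = False
            results[iface].append((ip, port, reachable))
    return results
```

For example, `probe_addresses({'enp0s31f6': [('192.168.0.20', 30871)]})` returns a list of `(ip, port, reachable)` tuples for that interface. Note that the port in the error message appears to be chosen per launch, so the probe is only meaningful while `horovodrun` is still trying to start, or when pointed at another port known to be listening on that host. If no address is reachable, the problem is more likely in the container networking or a firewall than in the training script.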