Interference between Pyro4 and DDP process communication

I'm running a distributed machine-learning training loop inside a Pyro4 remote function. The training code uses pytorch-lightning, and the remote function communicates with the calling process through a Pyro4 callback. The problem only appears when training on multiple GPUs.
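For context, the caller side looks roughly like this. This is a minimal sketch, not my actual code: `ProgressListener`, `run_training`, and the `training.service` name are placeholders.

```python
import threading

import Pyro4


@Pyro4.expose
class ProgressListener:
    # Lives in the calling process; the remote side invokes this
    # over Pyro4 to report training progress.
    def on_progress(self, message):
        print("callback received:", message)


# Serve the callback object in a background thread so the main thread
# can block on the remote training call.
daemon = Pyro4.Daemon()
listener_uri = daemon.register(ProgressListener())
threading.Thread(target=daemon.requestLoop, daemon=True).start()

# The remote side holds a Pyro4.Proxy(listener_uri) and calls
# listener.on_progress(...) from inside the pytorch-lightning
# training loop.
service = Pyro4.Proxy("PYRONAME:training.service")  # placeholder name
service.run_training(str(listener_uri))
```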

When the multi-GPU training completes successfully, the callback is never received; when the training fails, the callback fires as intended. Is there a way to force the Pyro4 callback to go through?

I have tried varying the code in the Lightning trainer, but in every variation, once the trainer.log call happens the callback doesn't fire. One of those variations is sketched below.
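Again a sketch rather than my exact code: `pyro_listener` is a hypothetical attribute name for the Pyro4 proxy handed to the module at construction, and `_compute_loss` stands in for the real loss computation.

```python
import pytorch_lightning as pl


class MyLightningModule(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        loss = self._compute_loss(batch)  # placeholder for the real loss
        self.log("train_loss", loss)
        # Pyro4 callback right after self.log; under DDP this never
        # arrives on the caller's side, even though the identical call
        # works fine on a single GPU.
        if self.trainer.is_global_zero:
            self.pyro_listener.on_progress(f"step {batch_idx}")
        return loss
```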
