I'm running a distributed machine-learning training loop inside a remote function. The training code uses PyTorch Lightning,
and the calling function and the remote function communicate via a Pyro4 callback. The problem only appears when training on multiple GPUs:
when multi-GPU training completes successfully, the callback is never received, but when the training fails, the callback fires as intended. Is there a way to force the Pyro4 callback?
I have tried varying the code in the Lightning trainer, but each time the trainer.log
call happens, the callback still doesn't fire.