It seems like I have to configure a cluster_resolver before running training to enable distributed training on multiple workers.
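Roughly like this, as far as I can tell (a minimal sketch; the host names, ports, and the tiny model are just placeholders, and each worker would get its own TF_CONFIG with its own task index):

```python
import json
import os

import tensorflow as tf

# Hypothetical two-worker cluster; every worker shares the same "cluster"
# block but has a different "task" index.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host1:12345", "host2:12345"]},  # placeholder hosts
    "task": {"type": "worker", "index": 0},
})

# With no explicit cluster_resolver argument, the strategy falls back to
# TFConfigClusterResolver, which reads TF_CONFIG from the environment.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="sgd", loss="mse")
```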

But how does that work with autoscaling and node failures?

https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/MultiWorkerMirroredStrategy

I am using Databricks, for reference.


There is 1 best solution below.

On Databricks, it is a best practice to disable autoscaling during any kind of distributed training, whether you are using MultiWorkerMirroredStrategy in TensorFlow, data-parallel training in PyTorch, or scaling training with Horovod. The same applies to hyperparameter tuning with Hyperopt.
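In practice that means creating the training cluster with a fixed worker count rather than an autoscale range. A sketch of what that could look like as a Databricks Clusters API payload is below; the runtime version, node type, and worker count are only illustrative values:

```python
# Illustrative payload for the Databricks Clusters API (clusters/create).
# A fixed "num_workers" replaces the "autoscale" {min_workers, max_workers}
# block, so the worker count cannot change underneath a training job.
fixed_size_cluster = {
    "cluster_name": "distributed-training",      # placeholder name
    "spark_version": "13.3.x-gpu-ml-scala2.12",  # example ML runtime
    "node_type_id": "g4dn.xlarge",               # example GPU node type
    "num_workers": 4,                            # fixed size, no "autoscale" key
}
```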

For these sorts of tasks (distributed training and hyperparameter optimization) on Databricks, it also helps to avoid spot instances, or at least to choose instance types that are unlikely to be preempted.
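On AWS, for example, the same cluster spec sketched above can pin workers to on-demand capacity so they are not reclaimed mid-training (field names follow the Databricks Clusters API; Azure and GCP use azure_attributes / gcp_attributes instead):

```python
# Prefer on-demand capacity over spot so workers are not preempted
# mid-training. Extends the illustrative fixed_size_cluster spec above.
fixed_size_cluster["aws_attributes"] = {
    "availability": "ON_DEMAND",  # instead of "SPOT" or "SPOT_WITH_FALLBACK"
    "first_on_demand": 1,
}
```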