It seems like I have to configure a cluster_resolver before running training to enable distributed training on multiple workers.
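Roughly like this, as far as I can tell (a minimal sketch; the host names, ports, and the tiny model are just placeholders, and each worker would get its own TF_CONFIG with its own task index):

```python
import json
import os

import tensorflow as tf

# Hypothetical two-worker cluster; every worker shares the same "cluster"
# block but has a different "task" index.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host1:12345", "host2:12345"]},  # placeholder hosts
    "task": {"type": "worker", "index": 0},
})

# With no explicit cluster_resolver argument, the strategy falls back to
# TFConfigClusterResolver, which reads TF_CONFIG from the environment.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="sgd", loss="mse")
```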

But how does that work with autoscaling and node failures?

https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/MultiWorkerMirroredStrategy

I am using Databricks, for reference.


There is 1 best solution below.

On Databricks, it is a best practice to disable autoscaling during any kind of distributed training, whether you are using MultiWorkerMirroredStrategy in TensorFlow, data-parallel training in PyTorch, or scaling training with Horovod. The same applies to hyperparameter tuning with Hyperopt.
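In practice that means creating the training cluster with a fixed worker count rather than an autoscale range. A sketch of what that could look like as a Databricks Clusters API payload is below; the runtime version, node type, and worker count are only illustrative values:

```python
# Illustrative payload for the Databricks Clusters API (clusters/create).
# A fixed "num_workers" replaces the "autoscale" {min_workers, max_workers}
# block, so the worker count cannot change underneath a training job.
fixed_size_cluster = {
    "cluster_name": "distributed-training",      # placeholder name
    "spark_version": "13.3.x-gpu-ml-scala2.12",  # example ML runtime
    "node_type_id": "g4dn.xlarge",               # example GPU node type
    "num_workers": 4,                            # fixed size, no "autoscale" key
}
```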

For these sorts of tasks (distributed training and hyperparameter optimization) on Databricks, it also helps to avoid spot instances, or at least to choose instance types that are unlikely to be preempted.
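On AWS, for example, the same cluster spec sketched above can pin workers to on-demand capacity so they are not reclaimed mid-training (field names follow the Databricks Clusters API; Azure and GCP use azure_attributes / gcp_attributes instead):

```python
# Prefer on-demand capacity over spot so workers are not preempted
# mid-training. Extends the illustrative fixed_size_cluster spec above.
fixed_size_cluster["aws_attributes"] = {
    "availability": "ON_DEMAND",  # instead of "SPOT" or "SPOT_WITH_FALLBACK"
    "first_on_demand": 1,
}
```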