Cloud ML Engine distributed training default type for custom tf.estimator


This article suggests there are three options for distributed training:

  1. Data-parallel training with synchronous updates.
  2. Data-parallel training with asynchronous updates.
  3. Model-parallel training.

The tutorial then goes on to suggest that the code that follows performs data-parallel training with asynchronous updates on Cloud ML Engine, which behaves as follows: "If you distribute 10,000 batches among 10 worker nodes, each node works on roughly 1,000 batches."

However, it's not clear which portion of the code actually specifies that this is data-parallel training with asynchronous updates. Is this simply the default for Cloud ML Engine if you run it in distributed training mode with a custom tf.estimator?
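For reference, Cloud ML Engine describes the cluster to each replica through the TF_CONFIG environment variable rather than through anything in the model code itself. A minimal sketch of inspecting it (the host addresses in the comment are illustrative):

```python
import json
import os

# Cloud ML Engine sets TF_CONFIG on every replica; it names the cluster
# members and this replica's own role, e.g.:
# {"cluster": {"master": ["host0:2222"],
#              "worker": ["host1:2222", "host2:2222"],
#              "ps":     ["host3:2222"]},
#  "task": {"type": "worker", "index": 0}}
tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
cluster_spec = tf_config.get('cluster', {})
task = tf_config.get('task', {})
print('Running as %s %s' % (task.get('type'), task.get('index')))
```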


2 Answers

BEST ANSWER

The short answer is that tf.estimator is currently mostly built around data-parallel training with asynchronous updates (option 2).
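As a minimal sketch (the model and names here are illustrative, not from the linked tutorial), a custom Estimator's model_fn just builds a train_op; when the job runs distributed with between-graph replication, each worker computes gradients on its own batches and applies them to the shared parameters without waiting for the others, which is the asynchronous behavior the question describes:

```python
import tensorflow as tf

def model_fn(features, labels, mode):
    # A toy model; the layer sizes here are illustrative.
    hidden = tf.layers.dense(features['x'], 128, activation=tf.nn.relu)
    logits = tf.layers.dense(hidden, 10)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    # Nothing below requests asynchrony explicitly: in a distributed run,
    # each worker's minimize() applies its gradients independently, so
    # updates are asynchronous by default.
    train_op = tf.train.AdamOptimizer(learning_rate=0.001).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

estimator = tf.estimator.Estimator(model_fn=model_fn)
```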

You get model-parallel training (option 3) simply by using with tf.device() statements in your code.
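For example, a sketch of placing different parts of the graph on different devices (the device names and the split are illustrative):

```python
import tensorflow as tf

def model_parallel_layers(features):
    # First half of the model on one device...
    with tf.device('/device:GPU:0'):
        hidden = tf.layers.dense(features, 512, activation=tf.nn.relu)
    # ...second half on another; activations flow between the devices.
    with tf.device('/device:GPU:1'):
        logits = tf.layers.dense(hidden, 10)
    return logits
```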

You could try using tf.train.SyncReplicasOptimizer and probably accomplish synchronous training (option 1).
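A sketch of what that might look like inside a model_fn, assuming the cluster's worker count and chief flag are passed in via params (those names and the hyperparameters are illustrative); note that SyncReplicasOptimizer wraps an existing optimizer and needs its session hook attached so the workers coordinate:

```python
import tensorflow as tf

def model_fn(features, labels, mode, params):
    logits = tf.layers.dense(features['x'], 10)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    # Wrap the base optimizer; gradients from all workers are
    # aggregated before a single synchronous update is applied.
    optimizer = tf.train.SyncReplicasOptimizer(
        tf.train.AdamOptimizer(learning_rate=0.001),
        replicas_to_aggregate=params['num_workers'],
        total_num_replicas=params['num_workers'])
    train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())

    # The optimizer's hook synchronizes the workers' sessions.
    sync_hook = optimizer.make_session_run_hook(is_chief=params['is_chief'])
    return tf.estimator.EstimatorSpec(
        mode, loss=loss, train_op=train_op, training_hooks=[sync_hook])
```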

All of the above applies generally to tf.estimator; nothing is different for CloudML Engine.

ANSWER

Cloud ML Engine doesn't determine the mode of distributed training; that depends on how the user sets up training using the TensorFlow libraries. In the MNIST example linked from the article, the code uses the TF Learn classes; specifically, an Estimator is constructed in model.py.

That code selects the optimizer, which in this case is the AdamOptimizer; it applies updates asynchronously. If you wanted to do synchronous updates, you'd have to wrap the optimizer in SyncReplicasOptimizer.
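In other words, the change in model.py would come down to wrapping the optimizer it already constructs, along these lines (the learning rate and num_workers are illustrative, not taken from the example):

```python
# Asynchronous (what the example effectively does today):
optimizer = tf.train.AdamOptimizer(learning_rate=0.001)

# Synchronous: wrap the same optimizer so gradients are
# aggregated across workers before each update.
optimizer = tf.train.SyncReplicasOptimizer(
    tf.train.AdamOptimizer(learning_rate=0.001),
    replicas_to_aggregate=num_workers)
```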

For more information on how to set up synchronous training you can refer to this doc.